🔬 Tidyverse Series – Post 5: Data Joins & Merging with dplyr
Link to heading
🛠 Why Do We Need Joins? Link to heading
In real-world data analysis, information is often spread across multiple datasets. {dplyr} provides intuitive functions to combine datasets efficiently, whether you’re linking experimental results to metadata or merging multi-omics data.
🔹 Why Use dplyr Joins?
Link to heading
✔️ Easier than base R’s merge() function
✔️ Consistent, readable syntax for different join types
✔️ Optimized for performance with large datasets
📚 Essential dplyr Joins
Link to heading
dplyr provides six types of joins to help merge data effectively. Each serves a specific purpose:
| Join Type | Keeps Rows From | Matches Needed? | Missing Data Handling |
|---|---|---|---|
inner_join() |
Both datasets | Yes | Only keeps matching rows |
left_join() |
First dataset | No | Keeps all rows from the first dataset, fills NA for unmatched rows |
right_join() |
Second dataset | No | Keeps all rows from the second dataset, fills NA for unmatched rows |
full_join() |
Both datasets | No | Combines all rows from both datasets, filling NA where needed |
semi_join() |
First dataset | Yes | Keeps only rows in the first dataset that have a match in the second dataset |
anti_join() |
First dataset | No | Keeps only rows in the first dataset that don’t have a match in the second dataset |
Let’s break these down with real-world examples using gene expression data and metadata tables.
📊 Example: Merging Experimental Data with Metadata Link to heading
Imagine we have gene expression data and a metadata table linking samples to conditions.
➡️ Gene Expression Data (df_expression)
Link to heading
| Sample | Gene | Expression |
|---|---|---|
| S1 | TP53 | 12.3 |
| S2 | BRCA1 | 8.9 |
| S3 | EGFR | 15.2 |
➡️ Metadata Table (df_metadata)
Link to heading
| Sample | Condition |
|---|---|
| S1 | Control |
| S2 | Treatment |
| S4 | Treatment |
🔄 Performing Different Joins in dplyr
Link to heading
1️⃣ Left Join: Keep all expression data, add metadata Link to heading
library(dplyr)
df_expression %>%
left_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
| Sample | Gene | Expression | Condition |
|---|---|---|---|
| S1 | TP53 | 12.3 | Control |
| S2 | BRCA1 | 8.9 | Treatment |
| S3 | EGFR | 15.2 | NA |
✔️ S3 is missing in the metadata, so the Condition column is NA.
2️⃣ Inner Join: Keep only matching samples Link to heading
df_expression %>%
inner_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
| Sample | Gene | Expression | Condition |
|---|---|---|---|
| S1 | TP53 | 12.3 | Control |
| S2 | BRCA1 | 8.9 | Treatment |
✔️ Only S1 and S2 are kept because S3 was not found in df_metadata.
3️⃣ Full Join: Retain all records from both tables Link to heading
df_expression %>%
full_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
| Sample | Gene | Expression | Condition |
|---|---|---|---|
| S1 | TP53 | 12.3 | Control |
| S2 | BRCA1 | 8.9 | Treatment |
| S3 | EGFR | 15.2 | NA |
| S4 | NA | NA | Treatment |
✔️ S3 (missing metadata) and S4 (missing expression data) are both included, filling NA where necessary.
4️⃣ Semi Join: Keep only expression records with matching metadata Link to heading
df_expression %>%
semi_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
| Sample | Gene | Expression |
|---|---|---|
| S1 | TP53 | 12.3 |
| S2 | BRCA1 | 8.9 |
✔️ Keeps only samples that have matching metadata, discarding S3.
5️⃣ Anti Join: Find missing metadata records Link to heading
df_expression %>%
anti_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
| Sample | Gene | Expression |
|---|---|---|
| S3 | EGFR | 15.2 |
✔️ Only S3 is kept, as it was missing in the metadata table.
📉 Handling Missing Data After Joins Link to heading
After performing joins, you may encounter NA values. Here’s how to handle them effectively:
1️⃣ Replace Missing Values with Defaults Link to heading
df_joined %>%
mutate(Condition = replace_na(Condition, "Unknown"))
✅ Fills NA values in Condition with "Unknown".
2️⃣ Remove Rows with Missing Data Link to heading
df_joined %>%
drop_na()
✅ Removes all rows where any column contains NA.
3️⃣ Filter Only Complete Cases Link to heading
df_joined %>%
filter(!is.na(Condition))
✅ Keeps only rows where Condition is NOT NA.
📈 Key Takeaways Link to heading
✅ {dplyr} joins make dataset merging intuitive and efficient.
✅ Different joins serve different purposes—choose wisely!
✅ Handling missing data after joins is crucial for accurate analysis.
✅ Functions like replace_na(), drop_na(), and filter() help clean merged datasets.
📌 Next up: Handling Categorical Data in R with forcats! Stay tuned! 🚀
👇 How do you handle merging datasets in your workflow? Let’s discuss!
#Tidyverse #dplyr #Joins #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology