🔬 Tidyverse Series – Post 5: Data Joins & Merging with dplyr
Link to heading
🛠 Why Do We Need Joins? Link to heading
In real-world data analysis, information is often spread across multiple datasets. {dplyr}
provides intuitive functions to combine datasets efficiently, whether you’re linking experimental results to metadata or merging multi-omics data.
🔹 Why Use dplyr
Joins?
Link to heading
✔️ Easier than base R’s merge()
function
✔️ Consistent, readable syntax for different join types
✔️ Optimized for performance with large datasets
📚 Essential dplyr
Joins
Link to heading
dplyr
provides six types of joins to help merge data effectively. Each serves a specific purpose:
Join Type | Keeps Rows From | Matches Needed? | Missing Data Handling |
---|---|---|---|
inner_join() |
Both datasets | Yes | Only keeps matching rows |
left_join() |
First dataset | No | Keeps all rows from the first dataset, fills NA for unmatched rows |
right_join() |
Second dataset | No | Keeps all rows from the second dataset, fills NA for unmatched rows |
full_join() |
Both datasets | No | Combines all rows from both datasets, filling NA where needed |
semi_join() |
First dataset | Yes | Keeps only rows in the first dataset that have a match in the second dataset |
anti_join() |
First dataset | No | Keeps only rows in the first dataset that don’t have a match in the second dataset |
Let’s break these down with real-world examples using gene expression data and metadata tables.
📊 Example: Merging Experimental Data with Metadata Link to heading
Imagine we have gene expression data and a metadata table linking samples to conditions.
➡️ Gene Expression Data (df_expression
)
Link to heading
Sample | Gene | Expression |
---|---|---|
S1 | TP53 | 12.3 |
S2 | BRCA1 | 8.9 |
S3 | EGFR | 15.2 |
➡️ Metadata Table (df_metadata
)
Link to heading
Sample | Condition |
---|---|
S1 | Control |
S2 | Treatment |
S4 | Treatment |
🔄 Performing Different Joins in dplyr
Link to heading
1️⃣ Left Join: Keep all expression data, add metadata Link to heading
library(dplyr)
df_expression %>%
left_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
Sample | Gene | Expression | Condition |
---|---|---|---|
S1 | TP53 | 12.3 | Control |
S2 | BRCA1 | 8.9 | Treatment |
S3 | EGFR | 15.2 | NA |
✔️ S3
is missing in the metadata, so the Condition column is NA
.
2️⃣ Inner Join: Keep only matching samples Link to heading
df_expression %>%
inner_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
Sample | Gene | Expression | Condition |
---|---|---|---|
S1 | TP53 | 12.3 | Control |
S2 | BRCA1 | 8.9 | Treatment |
✔️ Only S1
and S2
are kept because S3
was not found in df_metadata
.
3️⃣ Full Join: Retain all records from both tables Link to heading
df_expression %>%
full_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
Sample | Gene | Expression | Condition |
---|---|---|---|
S1 | TP53 | 12.3 | Control |
S2 | BRCA1 | 8.9 | Treatment |
S3 | EGFR | 15.2 | NA |
S4 | NA | NA | Treatment |
✔️ S3
(missing metadata) and S4
(missing expression data) are both included, filling NA
where necessary.
4️⃣ Semi Join: Keep only expression records with matching metadata Link to heading
df_expression %>%
semi_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
Sample | Gene | Expression |
---|---|---|
S1 | TP53 | 12.3 |
S2 | BRCA1 | 8.9 |
✔️ Keeps only samples that have matching metadata, discarding S3
.
5️⃣ Anti Join: Find missing metadata records Link to heading
df_expression %>%
anti_join(df_metadata, by = "Sample")
🔹 Result: Link to heading
Sample | Gene | Expression |
---|---|---|
S3 | EGFR | 15.2 |
✔️ Only S3
is kept, as it was missing in the metadata table.
📉 Handling Missing Data After Joins Link to heading
After performing joins, you may encounter NA
values. Here’s how to handle them effectively:
1️⃣ Replace Missing Values with Defaults Link to heading
df_joined %>%
mutate(Condition = replace_na(Condition, "Unknown"))
✅ Fills NA
values in Condition
with "Unknown"
.
2️⃣ Remove Rows with Missing Data Link to heading
df_joined %>%
drop_na()
✅ Removes all rows where any column contains NA
.
3️⃣ Filter Only Complete Cases Link to heading
df_joined %>%
filter(!is.na(Condition))
✅ Keeps only rows where Condition
is NOT NA
.
📈 Key Takeaways Link to heading
✅ {dplyr}
joins make dataset merging intuitive and efficient.
✅ Different joins serve different purposes—choose wisely!
✅ Handling missing data after joins is crucial for accurate analysis.
✅ Functions like replace_na()
, drop_na()
, and filter()
help clean merged datasets.
📌 Next up: Handling Categorical Data in R with forcats
! Stay tuned! 🚀
👇 How do you handle merging datasets in your workflow? Let’s discuss!
#Tidyverse #dplyr #Joins #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology