🔬 Tidyverse Series – Post 5: Data Joins & Merging with dplyr Link to heading

🛠 Why Do We Need Joins? Link to heading

In real-world data analysis, information is often spread across multiple datasets. {dplyr} provides intuitive functions to combine datasets efficiently, whether you’re linking experimental results to metadata or merging multi-omics data.

🔹 Why Use dplyr Joins? Link to heading

✔️ Easier than base R’s merge() function
✔️ Consistent, readable syntax for different join types
✔️ Optimized for performance with large datasets


📚 Essential dplyr Joins Link to heading

dplyr provides six types of joins to help merge data effectively. Each serves a specific purpose:

Join Type Keeps Rows From Matches Needed? Missing Data Handling
inner_join() Both datasets Yes Only keeps matching rows
left_join() First dataset No Keeps all rows from the first dataset, fills NA for unmatched rows
right_join() Second dataset No Keeps all rows from the second dataset, fills NA for unmatched rows
full_join() Both datasets No Combines all rows from both datasets, filling NA where needed
semi_join() First dataset Yes Keeps only rows in the first dataset that have a match in the second dataset
anti_join() First dataset No Keeps only rows in the first dataset that don’t have a match in the second dataset

Let’s break these down with real-world examples using gene expression data and metadata tables.


📊 Example: Merging Experimental Data with Metadata Link to heading

Imagine we have gene expression data and a metadata table linking samples to conditions.

➡️ Gene Expression Data (df_expression) Link to heading

Sample Gene Expression
S1 TP53 12.3
S2 BRCA1 8.9
S3 EGFR 15.2

➡️ Metadata Table (df_metadata) Link to heading

Sample Condition
S1 Control
S2 Treatment
S4 Treatment

🔄 Performing Different Joins in dplyr Link to heading

1️⃣ Left Join: Keep all expression data, add metadata Link to heading

library(dplyr)
df_expression %>%
  left_join(df_metadata, by = "Sample")

🔹 Result: Link to heading

Sample Gene Expression Condition
S1 TP53 12.3 Control
S2 BRCA1 8.9 Treatment
S3 EGFR 15.2 NA

✔️ S3 is missing in the metadata, so the Condition column is NA.

2️⃣ Inner Join: Keep only matching samples Link to heading

df_expression %>%
  inner_join(df_metadata, by = "Sample")

🔹 Result: Link to heading

Sample Gene Expression Condition
S1 TP53 12.3 Control
S2 BRCA1 8.9 Treatment

✔️ Only S1 and S2 are kept because S3 was not found in df_metadata.

3️⃣ Full Join: Retain all records from both tables Link to heading

df_expression %>%
  full_join(df_metadata, by = "Sample")

🔹 Result: Link to heading

Sample Gene Expression Condition
S1 TP53 12.3 Control
S2 BRCA1 8.9 Treatment
S3 EGFR 15.2 NA
S4 NA NA Treatment

✔️ S3 (missing metadata) and S4 (missing expression data) are both included, filling NA where necessary.

4️⃣ Semi Join: Keep only expression records with matching metadata Link to heading

df_expression %>%
  semi_join(df_metadata, by = "Sample")

🔹 Result: Link to heading

Sample Gene Expression
S1 TP53 12.3
S2 BRCA1 8.9

✔️ Keeps only samples that have matching metadata, discarding S3.

5️⃣ Anti Join: Find missing metadata records Link to heading

df_expression %>%
  anti_join(df_metadata, by = "Sample")

🔹 Result: Link to heading

Sample Gene Expression
S3 EGFR 15.2

✔️ Only S3 is kept, as it was missing in the metadata table.


📉 Handling Missing Data After Joins Link to heading

After performing joins, you may encounter NA values. Here’s how to handle them effectively:

1️⃣ Replace Missing Values with Defaults Link to heading

df_joined %>%
  mutate(Condition = replace_na(Condition, "Unknown"))

Fills NA values in Condition with "Unknown".

2️⃣ Remove Rows with Missing Data Link to heading

df_joined %>%
  drop_na()

Removes all rows where any column contains NA.

3️⃣ Filter Only Complete Cases Link to heading

df_joined %>%
  filter(!is.na(Condition))

Keeps only rows where Condition is NOT NA.


📈 Key Takeaways Link to heading

{dplyr} joins make dataset merging intuitive and efficient.
Different joins serve different purposes—choose wisely!
Handling missing data after joins is crucial for accurate analysis.
Functions like replace_na(), drop_na(), and filter() help clean merged datasets.

📌 Next up: Handling Categorical Data in R with forcats! Stay tuned! 🚀

👇 How do you handle merging datasets in your workflow? Let’s discuss!

#Tidyverse #dplyr #Joins #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology