🔬 Tidyverse Series – Post 3: Reshaping & Cleaning Data with tidyr
🛠 Why `tidyr`?
Data often arrives in messy, inconsistent, or poorly structured formats. `tidyr` is designed to reshape, clean, and structure data into a tidy format that’s easy to analyze and visualize. Whether you need to pivot, separate, unite, or handle missing values, `tidyr` provides a dedicated function for each task.
✔️ Why Use `tidyr`?
- Transforms messy data into structured formats
- Works perfectly with `dplyr` for smooth data wrangling
- Simplifies complex reshaping tasks
Let’s explore the key functions in `tidyr`, with detailed explanations, code examples, and expected outputs!
📚 Essential `tidyr` Functions
➡️ `pivot_longer()`: Convert Wide Data to Long Format
In many datasets, values are stored in wide format, with one column per sample, which makes them difficult to analyze. `pivot_longer()` reshapes wide data into long format, with one row per observation, making it easier to filter, summarize, and visualize.
🔹 Example: Reshaping Gene Expression Data
Before (wide format)
Gene | Sample_1 | Sample_2 | Sample_3 |
---|---|---|---|
TP53 | 12.3 | 10.5 | 14.2 |
BRCA1 | 8.9 | 9.2 | 10.1 |
Using `pivot_longer()`
library(tidyr)
library(dplyr)

# df is the wide table shown above (Gene, Sample_1, Sample_2, Sample_3)
df_long <- df %>%
  pivot_longer(cols = starts_with("Sample"),
               names_to = "Sample",
               values_to = "Expression")
After (long format)
Gene | Sample | Expression |
---|---|---|
TP53 | Sample_1 | 12.3 |
TP53 | Sample_2 | 10.5 |
TP53 | Sample_3 | 14.2 |
BRCA1 | Sample_1 | 8.9 |
BRCA1 | Sample_2 | 9.2 |
BRCA1 | Sample_3 | 10.1 |
✅ Now, this structure allows for easy filtering and statistical analysis!
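To try this end to end, here is a minimal, self-contained sketch. It rebuilds the wide table from the example by hand (the `data.frame()` call is an assumption about how `df` was created, since the post does not show it) and then filters the long result with `dplyr`.

```r
library(tidyr)
library(dplyr)

# Assumed construction of the wide table from the "Before" example
df <- data.frame(
  Gene     = c("TP53", "BRCA1"),
  Sample_1 = c(12.3, 8.9),
  Sample_2 = c(10.5, 9.2),
  Sample_3 = c(14.2, 10.1),
  stringsAsFactors = FALSE
)

# One row per Gene-Sample pair: 2 genes x 3 samples = 6 rows
df_long <- df %>%
  pivot_longer(cols = starts_with("Sample"),
               names_to = "Sample",
               values_to = "Expression")

# Long format makes row-wise filtering a one-liner
df_high <- df_long %>%
  filter(Expression > 10)

print(df_long)
print(df_high)
```

With the data in long format, a single `filter()` on `Expression` covers every sample at once, rather than one condition per sample column.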
➡️ `pivot_wider()`: Convert Long Data to Wide Format
Sometimes, data stored in long format needs to be expanded back into wide format.
Example: Converting Long Format Back to Wide
df_wide <- df_long %>%
  pivot_wider(names_from = "Sample",
              values_from = "Expression")
📌 This will recreate the original wide format, reversing the `pivot_longer()` operation.
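A quick way to convince yourself that the two functions are inverses is a round trip. The sketch below assumes the same hand-built `df` as the wide example (its construction is not shown in the post) and compares the result with the input.

```r
library(tidyr)

# Assumed construction of the wide example table
df <- data.frame(
  Gene     = c("TP53", "BRCA1"),
  Sample_1 = c(12.3, 8.9),
  Sample_2 = c(10.5, 9.2),
  Sample_3 = c(14.2, 10.1),
  stringsAsFactors = FALSE
)

# Wide -> long
df_long <- pivot_longer(df,
                        cols = starts_with("Sample"),
                        names_to = "Sample",
                        values_to = "Expression")

# Long -> wide again
df_wide <- pivot_wider(df_long,
                       names_from = Sample,
                       values_from = Expression)

# pivot_wider() returns a tibble, so convert before comparing with the data.frame
round_trip_ok <- isTRUE(all.equal(as.data.frame(df_wide), df))
print(round_trip_ok)
```

The round trip only works cleanly when each Gene-Sample pair appears once. If a pair is duplicated in the long data, `pivot_wider()` warns and produces list-columns instead of plain values.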
➡️ `separate()`: Splitting One Column into Multiple Columns
Often, a single column contains multiple pieces of information that should be split into separate columns.
Example: Splitting Sample Names into Condition & Replicate
# Assumes the Sample column holds values such as "Control_1" (see the Before table)
df_separated <- df_long %>%
  separate(Sample, into = c("Condition", "Replicate"), sep = "_")
Before
Gene | Sample | Expression |
---|---|---|
TP53 | Control_1 | 12.3 |
TP53 | Control_2 | 10.5 |
After
Gene | Condition | Replicate | Expression |
---|---|---|---|
TP53 | Control | 1 | 12.3 |
TP53 | Control | 2 | 10.5 |
✅ Now, Condition and Replicate are separate columns, making analysis easier.
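One detail worth knowing: by default `separate()` leaves the new columns as character, so `Replicate` would hold `"1"` and `"2"` as strings. The sketch below, built on a hand-typed copy of the Before table (an assumption, since the post does not construct it), uses `convert = TRUE` to get an integer column instead.

```r
library(tidyr)

# Hand-typed copy of the "Before" table
df_long <- data.frame(
  Gene       = c("TP53", "TP53"),
  Sample     = c("Control_1", "Control_2"),
  Expression = c(12.3, 10.5),
  stringsAsFactors = FALSE
)

# convert = TRUE runs type conversion on the new columns,
# so Replicate becomes integer while Condition stays character
df_separated <- separate(df_long, Sample,
                         into = c("Condition", "Replicate"),
                         sep = "_",
                         convert = TRUE)

str(df_separated)
```

In tidyr 1.3.0 and later, `separate()` is marked as superseded in favour of `separate_wider_delim()` and its siblings. It still works, but the newer functions may be worth a look for new code.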
➡️ `unite()`: Combining Multiple Columns into One
`unite()` is the opposite of `separate()`. It merges multiple columns into a single column, with a specified separator.
Example: Creating a Unique Identifier from Multiple Columns
df_united <- df_separated %>%
  unite("Sample_ID", Condition, Replicate, sep = "_")
Before
Gene | Condition | Replicate | Expression |
---|---|---|---|
TP53 | Control | 1 | 12.3 |
TP53 | Control | 2 | 10.5 |
After
Gene | Sample_ID | Expression |
---|---|---|
TP53 | Control_1 | 12.3 |
TP53 | Control_2 | 10.5 |
✅ Now, the Condition and Replicate columns are combined into a single Sample_ID column.
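By default `unite()` drops the columns it combines. If you want the identifier and the original columns side by side, `remove = FALSE` keeps them. A small sketch, using a hand-typed version of the Before table as an assumed input:

```r
library(tidyr)

# Hand-typed copy of the "Before" table
df_separated <- data.frame(
  Gene       = c("TP53", "TP53"),
  Condition  = c("Control", "Control"),
  Replicate  = c(1L, 2L),
  Expression = c(12.3, 10.5),
  stringsAsFactors = FALSE
)

# Default: Condition and Replicate are replaced by Sample_ID
df_united <- unite(df_separated, "Sample_ID", Condition, Replicate, sep = "_")

# remove = FALSE: Sample_ID is added and the source columns are kept
df_both <- unite(df_separated, "Sample_ID", Condition, Replicate,
                 sep = "_", remove = FALSE)

print(df_united)
print(df_both)
```

Keeping the source columns is handy when you need `Sample_ID` as a join key but still want to group by `Condition` later.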
➡️ `drop_na()`: Removing Missing Values
Handling missing values is essential to ensure clean data.
Example: Removing Rows with Missing Values
df_clean <- df %>%
  drop_na()
✅ This removes every row that contains a missing (`NA`) value in any column. To check only specific columns, pass their names, e.g. `drop_na(Expression)`.
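The difference between checking all columns and checking only some is easiest to see on a tiny table. The data below is made up for illustration, and the `Batch` column is hypothetical; it is there only so that one row has a missing value outside `Expression`.

```r
library(tidyr)

# Made-up example: Batch is a hypothetical annotation column
df <- data.frame(
  Gene       = c("TP53", "BRCA1", "EGFR"),
  Expression = c(12.3, NA, 7.8),
  Batch      = c("A", "B", NA),
  stringsAsFactors = FALSE
)

# Any NA in any column drops the row: only TP53 survives
df_strict <- drop_na(df)

# Only look at Expression: TP53 and EGFR survive, despite EGFR's missing Batch
df_expr <- drop_na(df, Expression)

print(df_strict)
print(df_expr)
```

Restricting the check to the columns you actually analyze avoids throwing away rows because of a gap in an unrelated annotation.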
➡️ `replace_na()`: Replacing Missing Values
Instead of removing missing values, you might want to replace them with a default value.
Example: Replacing Missing Values with Zero
df_filled <- df %>%
  replace_na(list(Expression = 0))
✅ This replaces all `NA` values in the `Expression` column with `0`.
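Here is the same call on a small made-up table so you can see the result. It also shows that `replace_na()` accepts a plain vector, in which case the replacement is a single value rather than a named list.

```r
library(tidyr)

# Made-up example table with one missing expression value
df <- data.frame(
  Gene       = c("TP53", "BRCA1", "EGFR"),
  Expression = c(12.3, NA, 7.8),
  stringsAsFactors = FALSE
)

# Data frame form: a named list maps each column to its replacement
df_filled <- replace_na(df, list(Expression = 0))

# Vector form: a single replacement value
filled_vec <- replace_na(c(1, NA, 3), 0)

print(df_filled)
print(filled_vec)
```

Whether zero is a sensible default depends on what a missing value means in your data, so treat the `0` here as a placeholder for whatever default fits your analysis.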
📊 Complete Workflow: Cleaning & Reshaping Data
Let’s go through a complete example, from messy data to clean, structured data.
library(tidyr)
library(dplyr)

# Sample messy dataset
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Control_1 = c(12.3, NA, 7.8),
  Control_2 = c(10.5, 9.2, 8.9)
)

# Reshape & clean
df_cleaned <- df %>%
  pivot_longer(cols = starts_with("Control"), names_to = "Sample", values_to = "Expression") %>%
  separate(Sample, into = c("Condition", "Replicate"), sep = "_") %>%
  drop_na()
✅ This pipeline reshapes, cleans, and structures the dataset, making it easier to analyze.
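To show what "easier to analyze" can look like in practice, the sketch below reruns the workflow and adds one `dplyr` step on top: a per-gene mean. The summary step and the names `mean_expression` and `n_replicates` are illustrative additions, not part of the workflow itself.

```r
library(tidyr)
library(dplyr)

# Same messy dataset as in the workflow
df <- data.frame(
  Gene      = c("TP53", "BRCA1", "EGFR"),
  Control_1 = c(12.3, NA, 7.8),
  Control_2 = c(10.5, 9.2, 8.9),
  stringsAsFactors = FALSE
)

# Same reshape & clean pipeline: 6 long rows, minus 1 row with NA = 5 rows
df_cleaned <- df %>%
  pivot_longer(cols = starts_with("Control"),
               names_to = "Sample",
               values_to = "Expression") %>%
  separate(Sample, into = c("Condition", "Replicate"), sep = "_") %>%
  drop_na()

# Illustrative follow-up: mean expression and replicate count per gene
df_summary <- df_cleaned %>%
  group_by(Gene) %>%
  summarise(mean_expression = mean(Expression),
            n_replicates = n())

print(df_cleaned)
print(df_summary)
```

Note that BRCA1 ends up with a single replicate because its `Control_1` value was missing and the row was dropped, which is worth keeping in mind when comparing means across genes.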
📈 Key Takeaways
✅ `tidyr` is essential for reshaping and cleaning data.
✅ `pivot_longer()` and `pivot_wider()` convert between wide and long formats in a single call.
✅ `separate()` and `unite()` allow flexible column manipulation.
✅ Handling missing values is easy with `drop_na()` and `replace_na()`.
✅ `tidyr` works well alongside `dplyr` for efficient data workflows.
📌 Next up: Combining Data Efficiently – Joins & Merging with `dplyr`! Stay tuned! 🚀
👇 How often do you reshape data in your analysis? Let’s discuss!
#Tidyverse #tidyr #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology