🔬 Tidyverse Series – Post 3: Reshaping & Cleaning Data with tidyr

🛠 Why tidyr?

Data often arrives messy, inconsistent, or awkwardly structured. tidyr reshapes and cleans it into a tidy format (each variable in its own column, each observation in its own row) that is easy to analyze and visualize. It covers pivoting between wide and long layouts, separating and uniting columns, and handling missing values.

✔️ Why Use tidyr?

  • Transforms messy data into structured formats
  • Pairs naturally with dplyr for smooth data wrangling
  • Simplifies complex reshaping tasks

Let’s explore the key functions in tidyr, with detailed explanations, code examples, and expected outputs!


📚 Essential tidyr Functions

➡️ pivot_longer(): Convert Wide Data to Long Format

In many datasets, values are stored in wide format, making them difficult to analyze. pivot_longer() reshapes wide data into long format, making it easier to filter, summarize, and visualize.

🔹 Example: Reshaping Gene Expression Data

Before (wide format)

Gene Sample_1 Sample_2 Sample_3
TP53 12.3 10.5 14.2
BRCA1 8.9 9.2 10.1

Using pivot_longer()

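library(tidyr)
library(dplyr)

# The wide table shown above: one column per sample
df <- tribble(
  ~Gene,   ~Sample_1, ~Sample_2, ~Sample_3,
  "TP53",  12.3,      10.5,      14.2,
  "BRCA1",  8.9,       9.2,      10.1
)

df_long <- df %>%
  pivot_longer(cols = starts_with("Sample"),
               names_to = "Sample",
               values_to = "Expression")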

After (long format)

Gene Sample Expression
TP53 Sample_1 12.3
TP53 Sample_2 10.5
TP53 Sample_3 14.2
BRCA1 Sample_1 8.9
BRCA1 Sample_2 9.2
BRCA1 Sample_3 10.1

✅ Now, this structure allows for easy filtering and statistical analysis!
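
To make "easy filtering and statistical analysis" concrete, here is a minimal sketch that rebuilds the example table and then filters and summarises it. The object names (expr_wide, expr_long, high_expr, gene_means) and the threshold of 10 are illustrative assumptions, not part of the original example.

```r
library(tidyr)
library(dplyr)

# Illustrative copy of the wide table from the example above
expr_wide <- tribble(
  ~Gene,   ~Sample_1, ~Sample_2, ~Sample_3,
  "TP53",  12.3,      10.5,      14.2,
  "BRCA1",  8.9,       9.2,      10.1
)

expr_long <- expr_wide %>%
  pivot_longer(cols = starts_with("Sample"),
               names_to = "Sample",
               values_to = "Expression")

# Filtering is a single condition on one column
# (the cutoff of 10 is an arbitrary, hypothetical threshold)
high_expr <- expr_long %>%
  filter(Expression > 10)

# Per-gene summaries work no matter how many sample columns there were
gene_means <- expr_long %>%
  group_by(Gene) %>%
  summarise(mean_expression = mean(Expression),
            n_samples = n())
```

In the wide layout, the same summary would need the sample columns to be named explicitly; in the long layout, adding a fourth sample changes nothing in the code.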


➡️ pivot_wider(): Convert Long Data to Wide Format

Sometimes, data stored in long format needs to be expanded back into wide format.

Example: Converting Long Format Back to Wide

df_wide <- df_long %>%
  pivot_wider(names_from = "Sample",
              values_from = "Expression")

📌 This recreates the original wide table: each distinct value in Sample becomes a column again, reversing the pivot_longer() operation.
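
A quick way to convince yourself of the round trip is to compare the result with the starting table. The sketch below does that, and also shows what happens when a Gene/Sample combination is missing from the long data. All object names here (wide_original, long, wide_again, incomplete, wide_gap) are illustrative assumptions.

```r
library(tidyr)
library(dplyr)

wide_original <- tribble(
  ~Gene,   ~Sample_1, ~Sample_2, ~Sample_3,
  "TP53",  12.3,      10.5,      14.2,
  "BRCA1",  8.9,       9.2,      10.1
)

long <- wide_original %>%
  pivot_longer(cols = starts_with("Sample"),
               names_to = "Sample",
               values_to = "Expression")

wide_again <- long %>%
  pivot_wider(names_from = Sample,
              values_from = Expression)

# Compare as plain data frames to sidestep class differences
round_trip_ok <- isTRUE(all.equal(as.data.frame(wide_original),
                                  as.data.frame(wide_again)))

# If a Gene/Sample combination is absent from the long data,
# pivot_wider() leaves that cell as NA
incomplete <- long %>%
  filter(!(Gene == "BRCA1" & Sample == "Sample_3"))

wide_gap <- incomplete %>%
  pivot_wider(names_from = Sample,
              values_from = Expression)
```

If NA is not what you want for those gaps, pivot_wider() also accepts a values_fill argument to supply a default.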


➡️ separate(): Splitting One Column into Multiple Columns

Often, a single column contains multiple pieces of information that should be split into separate columns.

Example: Splitting Sample Names into Condition & Replicate

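# Assumes Sample holds values like "Control_1" (see the Before table below)
# Replicate comes back as character; add convert = TRUE to get an integer
df_separated <- df_long %>%
  separate(Sample, into = c("Condition", "Replicate"), sep = "_")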

Before

Gene Sample Expression
TP53 Control_1 12.3
TP53 Control_2 10.5

After

Gene Condition Replicate Expression
TP53 Control 1 12.3
TP53 Control 2 10.5

✅ Now, Condition and Replicate are separate columns, making analysis easier.
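
Two variations are worth knowing, sketched here under the assumption that the data looks like the Before table. First, convert = TRUE lets separate() guess column types, so Replicate becomes an integer. Second, tidyr 1.3.0 and later mark separate() as superseded in favour of separate_wider_delim(); the second call assumes that version is installed. The object names (samples_long, sep_converted, sep_wider) are illustrative.

```r
library(tidyr)
library(dplyr)

# Illustrative data matching the Before table
samples_long <- tribble(
  ~Gene,  ~Sample,     ~Expression,
  "TP53", "Control_1", 12.3,
  "TP53", "Control_2", 10.5
)

# convert = TRUE: Replicate is converted from "1", "2" to integer 1, 2
sep_converted <- samples_long %>%
  separate(Sample,
           into = c("Condition", "Replicate"),
           sep = "_",
           convert = TRUE)

# Newer alternative (assumes tidyr >= 1.3.0); Replicate stays character here
sep_wider <- samples_long %>%
  separate_wider_delim(Sample,
                       delim = "_",
                       names = c("Condition", "Replicate"))
```

Both calls give the same Condition and Replicate values; they differ only in the type of Replicate and in how strictly they treat rows that do not split into exactly two pieces.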


➡️ unite(): Combining Multiple Columns into One

unite() is the opposite of separate(). It merges multiple columns into a single column, with a specified separator.

Example: Creating a Unique Identifier from Multiple Columns

df_united <- df_separated %>%
  unite("Sample_ID", Condition, Replicate, sep = "_")

Before

Gene Condition Replicate Expression
TP53 Control 1 12.3
TP53 Control 2 10.5

After

Gene Sample_ID Expression
TP53 Control_1 12.3
TP53 Control_2 10.5

✅ Now, the Condition and Replicate columns are combined into a single Sample_ID column.
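
By default unite() drops the columns it combines. If you want the identifier and the original columns side by side, remove = FALSE keeps them. A minimal sketch, with illustrative object names (separated, united_keep):

```r
library(tidyr)
library(dplyr)

# Illustrative data matching the Before table
separated <- tribble(
  ~Gene,  ~Condition, ~Replicate, ~Expression,
  "TP53", "Control",  "1",        12.3,
  "TP53", "Control",  "2",        10.5
)

# remove = FALSE keeps Condition and Replicate next to the new Sample_ID
united_keep <- separated %>%
  unite("Sample_ID", Condition, Replicate, sep = "_", remove = FALSE)
```

This is handy when Sample_ID is only needed as a join key or plot label and you still want to group by Condition afterwards.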


➡️ drop_na(): Removing Missing Values

Handling missing values is essential to ensure clean data.

Example: Removing Rows with Missing Values

df_clean <- df %>%
  drop_na()

✅ This removes every row that contains a missing (NA) value in any column. To check only specific columns, name them, e.g. drop_na(Expression).
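
The difference between checking every column and checking only some matters as soon as a table has an optional column. The sketch below uses a made-up table (the Note column and its values are hypothetical, purely for illustration) to contrast the two calls.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: Note is an optional annotation column
measurements <- tribble(
  ~Gene,   ~Expression, ~Note,
  "TP53",  12.3,        NA,
  "BRCA1", NA,          "failed QC",
  "EGFR",   7.8,        "ok"
)

# No columns named: a row is dropped if ANY column is NA
any_na_dropped <- measurements %>%
  drop_na()

# Column named: only NAs in Expression cause a row to be dropped
expr_na_dropped <- measurements %>%
  drop_na(Expression)
```

Here the first call keeps only EGFR, while the second keeps TP53 and EGFR because a missing Note is tolerated.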


➡️ replace_na(): Replacing Missing Values

Instead of removing missing values, you might want to replace them with a default value.

Example: Replacing Missing Values with Zero

df_filled <- df %>%
  replace_na(list(Expression = 0))

✅ This replaces all NA values in the Expression column with 0.
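
replace_na() can be called in two ways: on a data frame with a named list of replacements (as above), or on a single vector, which fits nicely inside mutate(). A small sketch with an illustrative table (measurements, filled_df, and filled_vec are assumed names):

```r
library(tidyr)
library(dplyr)

# Illustrative data with one missing measurement
measurements <- tribble(
  ~Gene,   ~Expression,
  "TP53",  12.3,
  "BRCA1", NA,
  "EGFR",   7.8
)

# Data-frame form: a named list, one replacement per column
filled_df <- measurements %>%
  replace_na(list(Expression = 0))

# Vector form: replace NAs in a single column inside mutate()
filled_vec <- measurements %>%
  mutate(Expression = replace_na(Expression, 0))
```

Both forms give the same column here. The list form is convenient when several columns need different defaults; the vector form is convenient when the replacement is part of a longer mutate() step.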


📊 Complete Workflow: Cleaning & Reshaping Data

Let’s go through a complete example, from messy data to clean, structured data.

library(tidyr)
library(dplyr)

# Sample messy dataset
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Control_1 = c(12.3, NA, 7.8),
  Control_2 = c(10.5, 9.2, 8.9)
)

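# Reshape & clean
df_cleaned <- df %>%
  # One row per gene and sample
  pivot_longer(cols = starts_with("Control"),
               names_to = "Sample",
               values_to = "Expression") %>%
  # "Control_1" -> Condition = "Control", Replicate = "1"
  separate(Sample, into = c("Condition", "Replicate"), sep = "_") %>%
  # Drop the BRCA1 / Control_1 row, whose Expression is NA
  drop_na()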

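✅ This pipeline reshapes, cleans, and structures the dataset: the result has one row per gene and replicate (5 rows, since the missing BRCA1 Control_1 measurement is dropped), with Condition and Replicate in separate columns.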
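
As a possible shortcut, pivot_longer() can do the splitting and the NA removal itself through its names_sep and values_drop_na arguments. The sketch below runs both versions on the same messy dataset so you can compare them; df_cleaned_alt is an illustrative name, and this is one alternative rather than the recommended way.

```r
library(tidyr)
library(dplyr)

# Same messy dataset as in the workflow above
df <- data.frame(
  Gene = c("TP53", "BRCA1", "EGFR"),
  Control_1 = c(12.3, NA, 7.8),
  Control_2 = c(10.5, 9.2, 8.9)
)

# Three-step version from the workflow above
df_cleaned <- df %>%
  pivot_longer(cols = starts_with("Control"),
               names_to = "Sample",
               values_to = "Expression") %>%
  separate(Sample, into = c("Condition", "Replicate"), sep = "_") %>%
  drop_na()

# One-step sketch: split the column names and drop NAs while pivoting
df_cleaned_alt <- df %>%
  pivot_longer(cols = starts_with("Control"),
               names_to = c("Condition", "Replicate"),
               names_sep = "_",
               values_to = "Expression",
               values_drop_na = TRUE)
```

One difference to keep in mind: values_drop_na only removes rows whose pivoted value is NA, whereas drop_na() with no arguments also removes rows with NA in any other column.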


📈 Key Takeaways

✅ tidyr is essential for reshaping and cleaning data.
✅ pivot_longer() and pivot_wider() make restructuring seamless.
✅ separate() and unite() allow flexible column manipulation.
✅ Handling missing values is easy with drop_na() and replace_na().
✅ Works perfectly alongside dplyr for efficient data workflows.

📌 Next up: Combining Data Efficiently – Joins & Merging with dplyr! Stay tuned! 🚀

👇 How often do you reshape data in your analysis? Let’s discuss!
