🔬 Tidyverse Series – Post 14: Bringing It All Together – A Full Tidyverse Workflow

🛠 Why Combine Tidyverse Packages?

Each Tidyverse package serves a specific purpose, but the real power emerges when they work together in a unified data pipeline. A complete data analysis workflow typically involves:

✔️ dplyr – Data manipulation
✔️ tidyr – Reshaping & cleaning
✔️ ggplot2 – Data visualization
✔️ forcats – Handling categorical variables
✔️ lubridate – Working with dates & times
✔️ stringr – String manipulation

By leveraging these tools in combination, we can clean, transform, visualize, and analyze data efficiently in a structured workflow.

📚 Case Study: Analyzing Flight Data

Let’s explore how multiple Tidyverse packages work together using an airline flight dataset from nycflights13.

➡️ Step 1: Load Required Packages & Data

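library(tidyverse)     # attaches dplyr, tidyr, ggplot2, forcats, stringr, ...
library(nycflights13)  # provides the flights data
library(lubridate)     # already attached by tidyverse >= 2.0.0; needed on older versions

flights <- nycflights13::flights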

✅ We load the flights dataset: every flight that departed New York City's three airports (JFK, LGA, EWR) in 2013, with departure times, delays, carriers, and other flight information.

🚀 Step 2: Data Cleaning & Transformation (dplyr & tidyr)

Before analysis, we must clean and structure the dataset:

flights_cleaned <- flights %>%
  drop_na(dep_delay) %>%  # tidyr: drop rows with a missing departure delay
  mutate(
    dep_hour = (dep_time %/% 100) %% 24,  # dep_time is HHMM; 2400 (midnight) maps to hour 0
    flight_date = make_date(year, month, day)  # lubridate: build a Date column
  ) %>%
  select(flight_date, dep_hour, carrier, origin, dep_delay)

✅ Removes rows with a missing departure delay
✅ Extracts the departure hour from the HHMM-encoded dep_time for time-based analysis
✅ Creates a proper Date column from year, month, and day using {lubridate}
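
The hour extraction in Step 2 relies on dep_time being stored as an HHMM integer (517 means 5:17 am), with midnight recorded as 2400. Here is a minimal sketch of that arithmetic on a hand-made tibble. The `toy` rows are invented for illustration and are not taken from the dataset:

```r
library(dplyr)
library(lubridate)

# Hypothetical rows standing in for flights; the values are made up
toy <- tibble(
  year = 2013L, month = 1L, day = c(1L, 1L, 2L),
  dep_time = c(517L, 1340L, 2400L)
)

toy_hours <- toy %>%
  mutate(
    dep_hour    = (dep_time %/% 100) %% 24,     # integer-divide HHMM, wrap 24 back to 0
    dep_minute  = dep_time %% 100,              # the remainder is the minute
    flight_date = make_date(year, month, day)   # lubridate builds a Date from the parts
  )

toy_hours
```

Integer division (`%/%`) gives the same result as `floor(x / 100)` here. The extra `%% 24` only matters for the midnight rows, which would otherwise show up as hour 24.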

📊 Step 3: Handling Categorical Variables (forcats)

Some airlines have very few flights, making analysis harder. {forcats} helps by grouping smaller categories into ‘Other’:

flights_cleaned <- flights_cleaned %>%
  mutate(carrier = fct_lump_n(carrier, n = 5))  # Keep top 5 carriers

✅ Groups smaller airlines under ‘Other’, simplifying visualizations.
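
To see what the lumping does, here is a small sketch on a made-up character vector. The carrier codes and their counts are hypothetical, chosen only so the result is easy to read:

```r
library(forcats)

# Hypothetical carrier codes; the counts are invented for illustration
carriers <- c(rep("UA", 5), rep("DL", 4), rep("AA", 3), "HA", "OO")

# Keep the 3 most frequent levels and lump the rest together
lumped <- fct_lump_n(carriers, n = 3)

fct_count(lumped)   # UA, DL, AA keep their names; HA and OO are counted under "Other"
levels(lumped)      # "Other" is placed last
```

The label can be changed with the `other_level` argument if ‘Other’ would clash with a real category name.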

🔍 Step 4: Text Processing – Identifying Flights from JFK (stringr)

If we need to analyze flights only from JFK, {stringr} makes it easy:

jfk_flights <- flights_cleaned %>%
  filter(str_detect(origin, "JFK"))

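✅ Finds all flights departing from JFK using regex-based filtering. Because origin holds exact three-letter codes, origin == "JFK" would return the same rows; str_detect() pays off when the match is a genuine pattern.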
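
As a quick illustration of how pattern matching differs from an exact comparison, here is a sketch on a short vector of airport codes (the vector itself is made up, though the three codes are the origins found in flights):

```r
library(stringr)

origins <- c("JFK", "LGA", "EWR", "JFK")

is_jfk        <- str_detect(origins, "JFK")          # TRUE FALSE FALSE TRUE
is_jfk_or_lga <- str_detect(origins, "^(JFK|LGA)$")  # TRUE TRUE FALSE TRUE
is_jfk_exact  <- origins == "JFK"                    # same result as is_jfk

is_jfk
is_jfk_or_lga
```

The anchors `^` and `$` force the whole string to match, which guards against accidental partial matches in longer text fields.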

📈 Step 5: Analyzing Delays by Carrier

Now, we summarize delays by airline carrier:

carrier_delays <- flights_cleaned %>%
  group_by(carrier) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE))

✅ Gives the mean departure delay for each carrier, with the lumped carriers reported together as ‘Other’.
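
A mean alone can hide how many flights stand behind it. One possible extension, sketched on a made-up tibble (the `delays` values are hypothetical), adds a flight count and a median and sorts the result:

```r
library(dplyr)

# Hypothetical delays in minutes; invented for illustration
delays <- tibble(
  carrier   = c("UA", "UA", "DL", "DL", "DL"),
  dep_delay = c(10, 30, -5, 5, 15)
)

delay_summary <- delays %>%
  group_by(carrier) %>%
  summarize(
    n_flights    = n(),                 # how many flights back each average
    avg_delay    = mean(dep_delay),
    median_delay = median(dep_delay),   # less sensitive to a few extreme delays
    .groups = "drop"
  ) %>%
  arrange(desc(avg_delay))              # worst average delay first

delay_summary
```

The same summarize() call should work unchanged on flights_cleaned, since it only needs the carrier and dep_delay columns.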

📊 Step 6: Visualizing Trends with ggplot2

➡️ Departure Delay Patterns by Hour

flights_cleaned %>%
  ggplot(aes(x = dep_hour, y = dep_delay, color = carrier)) +
  geom_point(alpha = 0.5) +  # one point per flight
  geom_smooth(se = FALSE) +  # smoothed trend per carrier
  theme_minimal() +
  labs(title = "Departure Delay by Hour of Departure",
       x = "Hour of Departure",
       y = "Departure Delay (minutes)",
       color = "Carrier")

✅ Reveals delay patterns by departure time and airline carrier.
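
The scatter plot above draws every flight. If the goal is the average delay per hour, one option is to aggregate first and plot the means. This is a sketch under that assumption, using a small made-up tibble (`toy_flights` and its values are hypothetical) in place of flights_cleaned:

```r
library(dplyr)
library(ggplot2)

# Hypothetical flights; the values are invented for illustration
toy_flights <- tibble(
  dep_hour  = c(6, 6, 18, 18, 6, 18),
  carrier   = c("UA", "UA", "UA", "UA", "DL", "DL"),
  dep_delay = c(0, 10, 20, 40, 5, 25)
)

# Aggregate to one row per hour and carrier before plotting
hourly <- toy_flights %>%
  group_by(dep_hour, carrier) %>%
  summarize(avg_delay = mean(dep_delay), .groups = "drop")

p <- ggplot(hourly, aes(x = dep_hour, y = avg_delay, color = carrier)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(title = "Average Departure Delay by Hour",
       x = "Hour of Departure",
       y = "Average Departure Delay (minutes)",
       color = "Carrier")

p
```

Aggregating first also keeps the plot light: it draws one point per hour and carrier instead of one per flight.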

📌 Key Takeaways

✅ The Tidyverse is most powerful when used as an integrated system.
✅ dplyr, tidyr, ggplot2, forcats, lubridate, and stringr work together to streamline analysis.
✅ Data cleaning, transformation, and visualization become seamless.
✅ Modular workflows make complex analyses simple and reproducible.

📌 Next up: Capstone Post – A Real-World Tidyverse Case Study! Stay tuned! 🚀

👇 How do you combine Tidyverse packages in your workflow? Let’s discuss!

#Tidyverse #DataScience #RStats #DataVisualization #Bioinformatics #OpenScience #ComputationalBiology