🔬 Tidyverse Series – Post 10: String Manipulation Made Easy with stringr Link to heading

🛠 Why {stringr}? Link to heading

Working with text data in base R can be cumbersome, especially when dealing with pattern matching, searching, and replacing text. {stringr} streamlines these operations by providing:

✔️ A consistent, easy-to-remember syntax (str_* functions)
✔️ Built-in regex support for complex pattern matching
✔️ Seamless integration with the Tidyverse
✔️ Efficient and scalable operations for large datasets

If you’ve ever struggled with gsub(), substr(), or paste(), {stringr} is your new best friend!


📚 Key {stringr} Functions Link to heading

Function Purpose
str_detect() Check if a pattern exists in a string
str_subset() Extract matching strings
str_replace() Replace text based on a pattern
str_extract() Extract specific parts of a string
str_length() Count the number of characters
str_to_lower(), str_to_upper() Change text case
str_split() Split a string into multiple parts
str_c() Concatenate (combine) strings

📊 Example 1: Detecting Patterns in Text with str_detect() Link to heading

Imagine we have a dataset of patient records, and we need to identify which descriptions mention cancer.

➡️ Sample Data Link to heading

library(dplyr)
library(stringr)

df <- tibble(
  ID = c(1, 2, 3, 4),
  Description = c("Patient diagnosed with lung cancer",
                  "No signs of malignancy",
                  "Early-stage breast cancer detected",
                  "Regular check-up, no issues")
)

➡️ Detect if ‘cancer’ is mentioned Link to heading

df <- df %>%
  mutate(Cancer_Flag = str_detect(Description, "cancer"))

Result: Link to heading

ID Description Cancer_Flag
1 Patient diagnosed with lung cancer TRUE
2 No signs of malignancy FALSE
3 Early-stage breast cancer detected TRUE
4 Regular check-up, no issues FALSE

✅ Quickly flags rows mentioning “cancer”, regardless of case or position.


📊 Example 2: Extracting Gene Names from Text with str_extract() Link to heading

Let’s say we have gene mutation reports, and we want to extract gene symbols from descriptions.

➡️ Sample Data Link to heading

df <- tibble(
  Report = c("Mutation in TP53 gene leads to cancer",
             "BRCA1 mutations increase cancer risk",
             "EGFR is linked to lung cancer")
)

➡️ Extracting Gene Symbols Link to heading

df <- df %>%
  mutate(Gene = str_extract(Report, "[A-Z0-9]+"))

Result: Link to heading

Report Gene
Mutation in TP53 gene leads to cancer TP53
BRCA1 mutations increase cancer risk BRCA1
EGFR is linked to lung cancer EGFR

✅ Extracts gene symbols while ignoring surrounding text.


📊 Example 3: Replacing Text with str_replace() Link to heading

We often need to standardize terminology in datasets.

➡️ Convert ’tumor’ to ‘cancer’ Link to heading

df <- df %>%
  mutate(Report = str_replace(Report, "tumor", "cancer"))

✅ Replaces ’tumor’ with ‘cancer’ across all records.


📊 Example 4: Splitting and Concatenating Strings Link to heading

➡️ Splitting Full Names into First & Last Name Link to heading

df <- tibble(Name = c("John Doe", "Jane Smith", "Alice Johnson"))

df <- df %>%
  mutate(Name_Split = str_split(Name, " ", simplify = TRUE))

str_split() separates full names into first and last name components.


📊 Example 5: Changing Text Case with str_to_upper() and str_to_lower() Link to heading

➡️ Convert all names to uppercase Link to heading

df <- df %>%
  mutate(Name_Upper = str_to_upper(Name))

✅ Standardizes text for case-insensitive comparisons.


📈 Complete Workflow: Cleaning Text Data with {stringr} Link to heading

Let’s put everything together to clean and standardize clinical text data.

library(dplyr)
library(stringr)

df <- tibble(
  Patient_ID = c(101, 102, 103),
  Diagnosis = c("Stage II lung cancer", "Breast tumor detected", "High BP - needs monitoring")
)

df <- df %>%
  mutate(
    Diagnosis_Clean = str_replace(Diagnosis, "tumor", "cancer"),
    Diagnosis_Flag = str_detect(Diagnosis, "cancer"),
    Diagnosis_Length = str_length(Diagnosis)
  )

Standardizes terminology, detects keywords, and measures text length in one step.


📌 Key Takeaways Link to heading

{stringr} makes text processing in R intuitive and consistent.
str_detect(), str_extract(), and str_replace() simplify pattern matching.
str_split() and str_c() enable string manipulation at scale.
Regex-powered functions make text analysis fast and flexible.

📌 Next up: The Power of Tidy Text Analysis with tidytext! Stay tuned! 🚀

👇 What’s your biggest challenge with text data in R? Let’s discuss!

#Tidyverse #stringr #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology