🔬 Tidyverse Series – Post 10: String Manipulation Made Easy with stringr
🛠 Why {stringr}?
Working with text data in base R can be cumbersome, especially when dealing with pattern matching, searching, and replacing text. {stringr} streamlines these operations by providing:
✔️ A consistent, easy-to-remember syntax (str_* functions)
✔️ Built-in regex support for complex pattern matching
✔️ Seamless integration with the Tidyverse
✔️ Vectorized operations, built on the stringi package, that scale to large datasets
If you’ve ever struggled with gsub(), substr(), or paste(), {stringr} is your new best friend!
📚 Key {stringr} Functions
Function | Purpose |
---|---|
str_detect() | Check if a pattern exists in a string |
str_subset() | Keep only the strings that match a pattern |
str_replace() | Replace the first match of a pattern (str_replace_all() replaces every match) |
str_extract() | Extract the first match of a pattern from a string |
str_length() | Count the number of characters |
str_to_lower(), str_to_upper() | Change text case |
str_split() | Split a string into multiple parts |
str_c() | Concatenate (combine) strings |
📊 Example 1: Detecting Patterns in Text with str_detect()
Imagine we have a dataset of patient records, and we need to identify which descriptions mention cancer.
➡️ Sample Data
library(dplyr)
library(stringr)
df <- tibble(
  ID = c(1, 2, 3, 4),
  Description = c("Patient diagnosed with lung cancer",
                  "No signs of malignancy",
                  "Early-stage breast cancer detected",
                  "Regular check-up, no issues")
)
➡️ Detect if ‘cancer’ is mentioned
df <- df %>%
  mutate(Cancer_Flag = str_detect(Description, "cancer"))
Result:
ID | Description | Cancer_Flag |
---|---|---|
1 | Patient diagnosed with lung cancer | TRUE |
2 | No signs of malignancy | FALSE |
3 | Early-stage breast cancer detected | TRUE |
4 | Regular check-up, no issues | FALSE |
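✅ Quickly flags rows mentioning “cancer” anywhere in the text. Matching is case-sensitive by default, so “Cancer” would not be flagged unless the pattern is wrapped in regex("cancer", ignore_case = TRUE).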
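Since str_detect() is case-sensitive out of the box, here is a minimal sketch of how case-insensitive matching could look. The `notes` vector is made up for illustration and is not part of the patient dataset above.

```r
library(stringr)

# Hypothetical descriptions with mixed capitalization (illustration only)
notes <- c("Lung CANCER confirmed", "Cancer ruled out", "benign cyst")

# Default matching is case-sensitive: only a lowercase "cancer" would match
strict <- str_detect(notes, "cancer")

# regex(ignore_case = TRUE) lets the same pattern match any capitalization
relaxed <- str_detect(notes, regex("cancer", ignore_case = TRUE))

# str_subset() returns the matching strings instead of a logical vector
hits <- str_subset(notes, regex("cancer", ignore_case = TRUE))
```

Here `strict` is FALSE for every element, while `relaxed` flags the first two notes and `hits` keeps just those two strings.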
📊 Example 2: Extracting Gene Names from Text with str_extract()
Let’s say we have gene mutation reports, and we want to extract gene symbols from descriptions.
➡️ Sample Data
df <- tibble(
  Report = c("Mutation in TP53 gene leads to cancer",
             "BRCA1 mutations increase cancer risk",
             "EGFR is linked to lung cancer")
)
➡️ Extracting Gene Symbols
df <- df %>%
  # Match a whole word of capitals and digits; a bare "[A-Z0-9]+" would return "M" from "Mutation"
  mutate(Gene = str_extract(Report, "\\b[A-Z][A-Z0-9]+\\b"))
Result:
Report | Gene |
---|---|
Mutation in TP53 gene leads to cancer | TP53 |
BRCA1 mutations increase cancer risk | BRCA1 |
EGFR is linked to lung cancer | EGFR |
✅ Extracts gene symbols while ignoring surrounding text.
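The exact pattern matters here. A pattern as loose as [A-Z0-9]+ stops at the first capital letter it meets, and str_extract() only ever returns the first match. The sketch below uses a made-up report (not from the dataset above) to compare a loose pattern, a stricter one, and str_extract_all(). The stricter pattern is only a heuristic: it would also pick up ordinary all-caps words such as "DNA".

```r
library(stringr)

# Hypothetical report mentioning two genes (illustration only)
report <- "Mutation in TP53 and BRCA1 genes detected"

# A loose pattern stops at the first capital letter it finds
loose <- str_extract(report, "[A-Z0-9]+")

# Word boundaries plus at least two uppercase/digit characters skip "Mutation"
first_gene <- str_extract(report, "\\b[A-Z][A-Z0-9]+\\b")

# str_extract_all() returns a list with one element per input string;
# [[1]] pulls out the matches for our single report
all_genes <- str_extract_all(report, "\\b[A-Z][A-Z0-9]+\\b")[[1]]
```

With this input, `loose` is "M", `first_gene` is "TP53", and `all_genes` holds both "TP53" and "BRCA1".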
📊 Example 3: Replacing Text with str_replace()
We often need to standardize terminology in datasets.
➡️ Convert ‘tumor’ to ‘cancer’
df <- df %>%
  mutate(Report = str_replace(Report, "tumor", "cancer"))
✅ Replaces the first ‘tumor’ in each record with ‘cancer’; use str_replace_all() to replace every occurrence.
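One detail worth seeing in action: str_replace() touches only the first match in each string, while str_replace_all() handles every match. A small sketch with a made-up note (not from the dataset above):

```r
library(stringr)

# Hypothetical note with the term appearing twice (illustration only)
note <- "Primary tumor resected; secondary tumor under observation"

# str_replace() changes only the first match in each string
first_only <- str_replace(note, "tumor", "cancer")

# str_replace_all() changes every match
every_match <- str_replace_all(note, "tumor", "cancer")
```

`first_only` still contains "secondary tumor", whereas `every_match` has both occurrences replaced.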
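📊 Example 4: Splitting and Concatenating Strings
➡️ Splitting Full Names into First & Last Name
df <- tibble(Name = c("John Doe", "Jane Smith", "Alice Johnson"))
# str_split_fixed() returns a character matrix with one column per piece
name_parts <- str_split_fixed(df$Name, " ", n = 2)
df <- df %>%
  mutate(First_Name = name_parts[, 1], Last_Name = name_parts[, 2])
✅ str_split_fixed() separates full names into first and last name columns. (str_split() with simplify = TRUE returns a similar matrix, which mutate() would store as a single matrix column rather than two separate columns.)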
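For the concatenating half of this example, str_c() goes the other way. A minimal sketch, with the `first` and `last` vectors written out by hand for illustration:

```r
library(stringr)

# Name parts written out by hand (illustration only)
first <- c("John", "Jane", "Alice")
last  <- c("Doe", "Smith", "Johnson")

# str_c() is vectorized: element i of each input is joined, with sep in between
full_names <- str_c(first, last, sep = " ")

# collapse joins the elements of a single vector into one string
roster <- str_c(full_names, collapse = "; ")
```

`full_names` rebuilds the three original names, and `roster` is the single string "John Doe; Jane Smith; Alice Johnson".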
📊 Example 5: Changing Text Case with str_to_upper() and str_to_lower()
➡️ Convert all names to uppercase
df <- df %>%
  mutate(Name_Upper = str_to_upper(Name))
✅ Standardizes text for case-insensitive comparisons.
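To make the case-insensitive comparison concrete, here is a short sketch. Both vectors are invented for illustration and stand in for the same names typed with different capitalization.

```r
library(stringr)

# Hypothetical entries of the same names with inconsistent capitalization
entered  <- c("john doe", "JANE SMITH", "Alice Johnson")
expected <- c("John Doe", "Jane Smith", "Alice Johnson")

# A direct comparison is case-sensitive
exact_match <- entered == expected

# Lowercasing both sides first makes the comparison case-insensitive
same_name <- str_to_lower(entered) == str_to_lower(expected)
```

Only the third pair matches exactly, but all three match once both sides are lowercased.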
📈 Complete Workflow: Cleaning Text Data with {stringr}
Let’s put everything together to clean and standardize clinical text data.
library(dplyr)
library(stringr)
df <- tibble(
  Patient_ID = c(101, 102, 103),
  Diagnosis = c("Stage II lung cancer", "Breast tumor detected", "High BP - needs monitoring")
)
df <- df %>%
  mutate(
    Diagnosis_Clean = str_replace(Diagnosis, "tumor", "cancer"),
    # Flag on the cleaned column so the standardized "tumor" record is caught too
    Diagnosis_Flag = str_detect(Diagnosis_Clean, "cancer"),
    Diagnosis_Length = str_length(Diagnosis)
  )
✅ Standardizes terminology, detects keywords, and measures text length in one step.
📌 Key Takeaways
✅ {stringr} makes text processing in R intuitive and consistent.
✅ str_detect(), str_extract(), and str_replace() simplify pattern matching.
✅ str_split() and str_c() enable string manipulation at scale.
✅ Regex-powered functions make text analysis fast and flexible.
📌 Next up: The Power of Tidy Text Analysis with tidytext! Stay tuned! 🚀
👇 What’s your biggest challenge with text data in R? Let’s discuss!
#Tidyverse #stringr #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology