🔬 Tidyverse Series – Post 10: String Manipulation Made Easy with `stringr`

🛠 Why `{stringr}`?

Working with text data in base R can be cumbersome, especially when dealing with pattern matching, searching, and replacing text. {stringr} streamlines these operations by providing:

✔️ A consistent, easy-to-remember syntax (str_* functions)
✔️ Built-in regex support for complex pattern matching
✔️ Seamless integration with the Tidyverse
✔️ Vectorized functions built on the stringi package, which handle large character vectors efficiently

If you’ve ever struggled with gsub(), substr(), or paste(), {stringr} is your new best friend!
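To make the comparison concrete, here is a minimal sketch contrasting a base R call with its stringr counterpart. The vector `x` is illustrative data, not taken from the examples in this post.

```r
library(stringr)

# Illustrative input (an assumption for this sketch)
x <- c("lung cancer", "breast tumor")

# Base R: the pattern comes first and the data comes last
base_result <- gsub("tumor", "cancer", x)

# stringr: the string is always the first argument, so it pipes cleanly
stringr_result <- str_replace_all(x, "tumor", "cancer")

# Both approaches give the same answer
identical(base_result, stringr_result)
```

The results are identical; the difference is the consistent argument order across the `str_*` family.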

📚 Key `{stringr}` Functions

| Function | Purpose |
|---|---|
| `str_detect()` | Check whether a pattern exists in a string |
| `str_subset()` | Keep only the strings that match a pattern |
| `str_replace()` | Replace the first match of a pattern (`str_replace_all()` replaces every match) |
| `str_extract()` | Extract the first match of a pattern from a string |
| `str_length()` | Count the number of characters |
| `str_to_lower()`, `str_to_upper()` | Change text case |
| `str_split()` | Split a string into multiple parts |
| `str_c()` | Concatenate (combine) strings |
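As a quick tour of three of these functions that the later examples use less, the sketch below runs them on a small made-up vector (`notes` is an assumption for illustration).

```r
library(stringr)

# Illustrative input (an assumption for this sketch)
notes <- c("lung cancer", "no findings", "breast cancer")

# Keep only the strings that contain the pattern
matching <- str_subset(notes, "cancer")

# Number of characters in each string
lengths <- str_length(notes)

# Prepend a fixed prefix to every element
labelled <- str_c("Note: ", notes)

matching
lengths
labelled
```

Each function takes the character vector first and returns a result of predictable shape: a shorter vector for `str_subset()`, and one element per input for the other two.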

📊 Example 1: Detecting Patterns in Text with `str_detect()`

Imagine we have a dataset of patient records, and we need to identify which descriptions mention cancer.

➡️ Sample Data

library(dplyr)
library(stringr)

df <- tibble(
  ID = c(1, 2, 3, 4),
  Description = c("Patient diagnosed with lung cancer",
                  "No signs of malignancy",
                  "Early-stage breast cancer detected",
                  "Regular check-up, no issues")
)

➡️ Detect if ‘cancer’ is mentioned

df <- df %>%
  mutate(Cancer_Flag = str_detect(Description, "cancer"))

Result:

| ID | Description | Cancer_Flag |
|---|---|---|
| 1 | Patient diagnosed with lung cancer | TRUE |
| 2 | No signs of malignancy | FALSE |
| 3 | Early-stage breast cancer detected | TRUE |
| 4 | Regular check-up, no issues | FALSE |

✅ Quickly flags rows mentioning “cancer” anywhere in the text. Note that matching is case-sensitive by default, so “Cancer” or “CANCER” would not be flagged.
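If the records mix capitalisation, one option is to wrap the pattern in `regex()` with `ignore_case = TRUE`. The sketch below uses a made-up `notes` vector (an assumption, not the post's data) to show the difference.

```r
library(stringr)

# Illustrative input with mixed capitalisation (an assumption for this sketch)
notes <- c("Lung Cancer confirmed", "CANCER ruled out", "no issues")

# Default matching is case-sensitive, so neither capitalised form matches
strict <- str_detect(notes, "cancer")

# regex(ignore_case = TRUE) matches any capitalisation
relaxed <- str_detect(notes, regex("cancer", ignore_case = TRUE))

strict   # FALSE FALSE FALSE
relaxed  # TRUE  TRUE  FALSE
```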

📊 Example 2: Extracting Gene Names from Text with `str_extract()`

Let’s say we have gene mutation reports, and we want to extract gene symbols from descriptions.

➡️ Sample Data

df <- tibble(
  Report = c("Mutation in TP53 gene leads to cancer",
             "BRCA1 mutations increase cancer risk",
             "EGFR is linked to lung cancer")
)

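➡️ Extracting Gene Symbols

df <- df %>%
  # Require a capital letter followed by more capitals or digits, bounded by word
  # edges; a bare "[A-Z0-9]+" would return the "M" of "Mutation" in the first report
  mutate(Gene = str_extract(Report, "\\b[A-Z][A-Z0-9]+\\b"))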

Result:

| Report | Gene |
|---|---|
| Mutation in TP53 gene leads to cancer | TP53 |
| BRCA1 mutations increase cancer risk | BRCA1 |
| EGFR is linked to lung cancer | EGFR |

✅ Extracts gene symbols while ignoring surrounding text.
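`str_extract()` returns only the first match in each string. When a report may mention several genes, `str_extract_all()` returns every match as a list. The sketch below assumes gene symbols look like runs of two or more capital letters and digits; this pattern is a rough heuristic for illustration (it would also pick up tokens such as "DNA" or "II"), and the `report` string is made up.

```r
library(stringr)

# Illustrative input mentioning two genes (an assumption for this sketch)
report <- "Mutations in TP53 and BRCA1 were found"

# Heuristic pattern: a capital letter followed by one or more capitals or digits
gene_pattern <- "\\b[A-Z][A-Z0-9]+\\b"

# First match only
first_gene <- str_extract(report, gene_pattern)

# All matches, returned as a list with one character vector per input string
all_genes <- str_extract_all(report, gene_pattern)

first_gene  # "TP53"
all_genes   # list containing c("TP53", "BRCA1")
```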

📊 Example 3: Replacing Text with `str_replace()`

We often need to standardize terminology in datasets.

➡️ Convert ‘tumor’ to ‘cancer’

df <- df %>%
  mutate(Report = str_replace(Report, "tumor", "cancer"))

✅ Replaces the first ‘tumor’ in each record with ‘cancer’. Use str_replace_all() when a record may contain the term more than once.
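The difference between the two functions only shows up when a string contains the pattern more than once. Here is a small sketch on made-up text (`x` and `mixed` are assumptions for illustration), including a regex that also covers the British spelling.

```r
library(stringr)

# Illustrative input with two occurrences of the term (an assumption for this sketch)
x <- "tumor size stable; tumor margins clear"

# str_replace() changes only the first match in each string
first_only <- str_replace(x, "tumor", "cancer")

# str_replace_all() changes every match
every_match <- str_replace_all(x, "tumor", "cancer")

# "tumou?r" makes the "u" optional, so both spellings are standardized
mixed <- "tumour and tumor"
both_spellings <- str_replace_all(mixed, "tumou?r", "cancer")

first_only      # "cancer size stable; tumor margins clear"
every_match     # "cancer size stable; cancer margins clear"
both_spellings  # "cancer and cancer"
```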

📊 Example 4: Splitting and Concatenating Strings

➡️ Splitting Full Names into First & Last Name

df <- tibble(Name = c("John Doe", "Jane Smith", "Alice Johnson"))

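df <- df %>%
  mutate(
    # str_split_fixed() returns a character matrix with one column per piece;
    # n = 2 splits at the first space only
    First_Name = str_split_fixed(Name, " ", n = 2)[, 1],
    Last_Name  = str_split_fixed(Name, " ", n = 2)[, 2]
  )

✅ str_split_fixed() separates each full name into first-name and last-name columns. Plain str_split() returns a list (or, with simplify = TRUE, a matrix), which is awkward to store in a single data frame column.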
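For the concatenating half of this example, `str_c()` goes the other way and joins pieces back together. The sketch below uses standalone `first` and `last` vectors (assumptions for illustration) rather than the data frame above.

```r
library(stringr)

# Illustrative name parts (assumptions for this sketch)
first <- c("John", "Jane", "Alice")
last  <- c("Doe", "Smith", "Johnson")

# sep joins the vectors element by element
last_first <- str_c(last, first, sep = ", ")

# collapse folds a whole vector into a single string
one_line <- str_c(first, collapse = "; ")

last_first  # "Doe, John" "Smith, Jane" "Johnson, Alice"
one_line    # "John; Jane; Alice"
```

`sep` and `collapse` can be combined when both behaviours are needed in one call.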

📊 Example 5: Changing Text Case with `str_to_upper()` and `str_to_lower()`

➡️ Convert all names to uppercase

df <- df %>%
  mutate(Name_Upper = str_to_upper(Name))

✅ Standardizes text for case-insensitive comparisons.
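A case-insensitive comparison can be sketched by normalising both sides to the same case before comparing. The vectors `a` and `b` below are made up for illustration.

```r
library(stringr)

# Two illustrative name lists with inconsistent capitalisation (assumptions)
a <- c("John Doe", "JANE SMITH", "alice johnson")
b <- c("john doe", "Jane Smith", "Alice Jonson")

# Lower-case both sides so only the letters themselves are compared
same_name <- str_to_lower(a) == str_to_lower(b)

same_name  # TRUE TRUE FALSE (the third pair differs in spelling, not case)
```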

📈 Complete Workflow: Cleaning Text Data with `{stringr}`

Let’s put everything together to clean and standardize clinical text data.

library(dplyr)
library(stringr)

df <- tibble(
  Patient_ID = c(101, 102, 103),
  Diagnosis = c("Stage II lung cancer", "Breast tumor detected", "High BP - needs monitoring")
)

df <- df %>%
  mutate(
    Diagnosis_Clean = str_replace(Diagnosis, "tumor", "cancer"),
    # Flag on the cleaned text so the standardized "Breast cancer detected" is caught
    Diagnosis_Flag = str_detect(Diagnosis_Clean, "cancer"),
    Diagnosis_Length = str_length(Diagnosis)
  )

✅ Standardizes terminology, detects keywords, and measures text length in one step.

📌 Key Takeaways

✅ {stringr} makes text processing in R intuitive and consistent.
✅ str_detect(), str_extract(), and str_replace() simplify pattern matching.
✅ str_split() and str_c() enable string manipulation at scale.
✅ Regex-powered functions make text analysis fast and flexible.

📌 Next up: The Power of Tidy Text Analysis with tidytext! Stay tuned! 🚀

👇 What’s your biggest challenge with text data in R? Let’s discuss!

#Tidyverse #stringr #RStats #DataScience #Bioinformatics #OpenScience #ComputationalBiology
