🧬 Mastering Bulk RNA-seq Analysis in R – Post 3: Data Import & Preprocessing
🎯 From Files to Analysis: The Essential Workflow
Ready to transform your RNA-seq count files into DESeq2-ready objects? This is where theory meets practice: taking real data from your sequencing pipeline and preparing it for statistical analysis.
The good news? DESeq2 makes this surprisingly straightforward once you understand the essential components.
🔍 What You Need
1. Count Matrix
- Rows: Genes
- Columns: Samples
- Values: Integer counts (no decimals!)
2. Sample Metadata
- Rows: Samples (matching count matrix column names)
- Columns: Experimental factors (condition, batch, etc.)
That’s it! Let’s put them together.
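To make the two inputs concrete, here is a minimal sketch in base R. The gene names, sample names, and values are made up for illustration; only the shapes matter.

```r
# Toy count matrix: genes in rows, samples in columns, integer values.
# matrix() fills column by column, so each line of values below is one sample.
counts <- matrix(
  c(10L, 0L, 25L, 3L,   # ctrl_1
    12L, 1L, 30L, 0L,   # ctrl_2
    50L, 2L, 5L, 4L,    # trt_1
    45L, 0L, 8L, 6L),   # trt_2
  nrow = 4,
  dimnames = list(
    c("geneA", "geneB", "geneC", "geneD"),
    c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
  )
)

# Toy sample metadata: one row per sample, row names equal to the count columns.
metadata <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  batch = factor(c("A", "B", "A", "B")),
  row.names = colnames(counts)
)

is.integer(counts)                                # integer storage, no decimals
identical(colnames(counts), rownames(metadata))   # same samples, same order
```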
🔧 The Core Workflow
Step 1: Load Libraries and Data
library(DESeq2)
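# Import count matrix (check.names = FALSE keeps sample names exactly as written)
counts <- read.csv("count_matrix.csv", row.names = 1, check.names = FALSE)
counts <- as.matrix(counts)
# Integer coercion truncates decimals silently, so confirm whole numbers first
stopifnot(all(counts == round(counts)))
storage.mode(counts) <- "integer"
# Import sample metadata
metadata <- read.csv("sample_info.csv", row.names = 1, stringsAsFactors = TRUE)
# Verify sample names match, in the same order (must return TRUE)
identical(colnames(counts), rownames(metadata))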
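If that check returns FALSE only because the samples are listed in a different order, the metadata can be reordered to follow the count matrix. The snippet below is a small sketch with hypothetical sample names, not part of the original workflow.

```r
# Hypothetical toy inputs: same samples, different order.
counts <- matrix(
  1:6, nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)
metadata <- data.frame(
  condition = factor(c("treated", "control", "control")),
  row.names = c("s3", "s1", "s2")
)

# Fail early if the two tables do not describe the same set of samples.
stopifnot(setequal(colnames(counts), rownames(metadata)))

# Reorder metadata rows to follow the count matrix columns.
metadata <- metadata[colnames(counts), , drop = FALSE]

identical(colnames(counts), rownames(metadata))
```

Using `drop = FALSE` keeps `metadata` a data frame even when it has a single column.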
Step 2: Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = metadata,
design = ~ condition
)
Step 3: Essential Preprocessing
# Remove low-count genes
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
# Set reference level for comparisons
dds$condition <- relevel(dds$condition, ref = "control")
Done! You now have an analysis-ready DESeqDataSet.
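The total-count filter above is deliberately loose. A commonly used stricter variant keeps a gene only if it reaches a minimum count in a minimum number of samples. The sketch below compares the two on a toy matrix; the thresholds `min_count` and `min_samples` are illustrative assumptions, not recommendations from this post.

```r
# Toy counts (made-up values), one gene per row.
counts <- matrix(
  c(0L, 0L, 0L, 12L,      # geneA: one sample carries all the reads
    15L, 20L, 11L, 30L,   # geneB: well expressed everywhere
    2L, 3L, 1L, 4L),      # geneC: low everywhere, but the total reaches 10
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("s", 1:4))
)

min_count <- 10    # assumed per-sample threshold
min_samples <- 2   # assumed, e.g. the size of the smallest group

# Filter used in this post: total reads across all samples.
keep_total <- rowSums(counts) >= 10

# Stricter variant: at least min_samples samples with min_count reads or more.
keep_per_sample <- rowSums(counts >= min_count) >= min_samples

data.frame(keep_total, keep_per_sample)
```

With a DESeqDataSet, the same expression works on `counts(dds)` in place of the toy matrix.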
⚡ Common Input Scenarios
From featureCounts
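# Read featureCounts output (the leading "#" command line is skipped as a comment)
counts_raw <- read.table("featurecounts_output.txt", header = TRUE, row.names = 1)
# Drop the five annotation columns (Chr, Start, End, Strand, Length); keep only counts
counts <- as.matrix(counts_raw[, 6:ncol(counts_raw)])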
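featureCounts labels each count column with the path of its input BAM file, so the column names usually need cleaning before they can match the metadata. Here is a sketch on a mock table; the `aligned.` prefix and `.sorted.bam` suffix are assumptions about a hypothetical pipeline, so adjust the patterns to your own file names.

```r
# Mock featureCounts table after read.table(): five annotation columns,
# then one count column per BAM file (names are hypothetical).
counts_raw <- data.frame(
  Chr = c("chr1", "chr2"),
  Start = c(100L, 500L),
  End = c(900L, 1500L),
  Strand = c("+", "-"),
  Length = c(801L, 1001L),
  aligned.ctrl_1.sorted.bam = c(10L, 0L),
  aligned.trt_1.sorted.bam = c(7L, 3L),
  row.names = c("geneA", "geneB")
)

# Keep only the count columns.
counts <- as.matrix(counts_raw[, 6:ncol(counts_raw)])

# Strip the assumed directory prefix and file suffix to recover sample IDs.
colnames(counts) <- sub("^aligned\\.", "", colnames(counts))
colnames(counts) <- sub("\\.sorted\\.bam$", "", colnames(counts))

colnames(counts)
```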
From tximport (transcript-level data)
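library(tximport)
# `files`: named vector of paths to Salmon quant.sf files, one per sample
# `tx2gene_map`: two-column data frame (transcript ID, then gene ID)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene_map)
dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ condition)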
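The tximport call relies on two objects, `files` and `tx2gene_map`, that have to be built beforehand. One way to build them is sketched below, assuming a hypothetical layout with one Salmon output folder per sample under `salmon_quant/`; the transcript and gene IDs are placeholders.

```r
# Hypothetical sample IDs and directory layout: salmon_quant/<sample>/quant.sf
samples <- c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
quant_dir <- "salmon_quant"

# Named character vector: the names become the sample (column) names.
files <- file.path(quant_dir, samples, "quant.sf")
names(files) <- samples

# Transcript-to-gene map: column 1 is the transcript ID, column 2 the gene ID.
# In practice this comes from your annotation; these rows are placeholders.
tx2gene_map <- data.frame(
  TXNAME = c("tx1", "tx2", "tx3"),
  GENEID = c("geneA", "geneA", "geneB")
)

# Before importing real data, check that every file is present:
# stopifnot(all(file.exists(files)))
files
```

Naming `files` with the same IDs used as metadata row names keeps the imported columns aligned with `colData`.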
⚠️ Critical Pitfalls to Avoid
1. Sample Name Mismatches
# Always verify this returns TRUE
identical(colnames(counts), rownames(metadata))
2. Non-Integer Counts
# DESeq2 needs raw counts, not normalized values
# Run this before coercing to integer, which would hide any decimals
all(counts == round(counts)) # Should be TRUE
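The order of operations matters here, because integer coercion drops decimals without any warning. A tiny sketch with made-up values shows what goes wrong if the check is skipped:

```r
# Toy matrix containing one non-integer value (e.g. a normalized estimate).
counts <- matrix(
  c(10, 3.7, 0, 25), nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2"))
)

# The whole-number check catches the problem...
is_whole <- all(counts == round(counts))
is_whole

# ...whereas coercion alone silently truncates 3.7 to 3.
truncated <- counts
storage.mode(truncated) <- "integer"
truncated["geneB", "s1"]
```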
3. Wrong Reference Level
# Control should be first in factor levels
levels(dds$condition) # Check this!
dds$condition <- relevel(dds$condition, ref = "control")
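Why does this need checking at all? R orders factor levels alphabetically by default, so the baseline is whichever label sorts first. Here is a base R sketch with hypothetical labels where the default picks the wrong one:

```r
# Hypothetical design: "knockout" sorts before "wildtype" alphabetically.
genotype <- factor(c("wildtype", "knockout", "wildtype", "knockout"))
levels_before <- levels(genotype)
levels_before   # "knockout" is first, so it would be treated as the baseline

# Make the intended baseline the first level.
genotype <- relevel(genotype, ref = "wildtype")
levels(genotype)
```

The same `relevel()` call applies to a column of a DESeqDataSet, as in the snippet above.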
✅ Quick Quality Check
# Basic statistics
print(dds)
colSums(counts(dds)) # Sequencing depth per sample
colSums(counts(dds) > 0) # Number of detected genes per sample
# Save for analysis
saveRDS(dds, "analysis_ready_dds.rds")
🚀 You’re Ready!
Your DESeqDataSet object now contains:
- ✅ Properly formatted count data
- ✅ Linked sample metadata
- ✅ Correct experimental design
- ✅ Filtered, analysis-ready genes
Next up: Post 4 will cover experimental design considerations and handling complex designs with batch effects and multiple factors.
Ready to run your first differential expression analysis? The statistical magic begins with DESeq(dds)! 🎯
💬 Share Your Thoughts!
What’s your most challenging data import scenario? Drop a comment below! 👇
#DESeq2 #RNAseq #DataImport #Bioinformatics #GeneExpression #DataPreprocessing #ComputationalBiology #BulkRNAseq #DataAnalysis #RStats