🧬 Mastering Bulk RNA-seq Analysis in R – Post 3: Data Import & Preprocessing

🎯 From Files to Analysis: The Essential Workflow

Ready to transform your RNA-seq count files into DESeq2-ready objects? This is where theory meets practice: taking real data from your sequencing pipeline and preparing it for statistical analysis.

The good news? DESeq2 makes this surprisingly straightforward once you understand the essential components.


🔍 What You Need

1. Count Matrix

  • Rows: Genes
  • Columns: Samples
  • Values: Integer counts (no decimals!)

2. Sample Metadata

  • Rows: Samples (same names, in the same order, as the count matrix columns)
  • Columns: Experimental factors (condition, batch, etc.)

That’s it! Let’s put them together.
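
As a rough illustration of these two inputs, here is a toy example built in base R. The gene IDs, sample names, and values are invented for illustration and do not come from a real dataset.

```r
# Toy count matrix: 3 genes (rows) x 4 samples (columns); all values are made up
counts <- matrix(
  c(120L,  98L, 310L, 287L,
      0L,   3L,   1L,   0L,
     45L,  52L,  40L,  61L),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
  )
)

# Toy sample metadata: one row per sample, row names equal to the count matrix column names
metadata <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  batch     = factor(c("b1", "b2", "b1", "b2")),
  row.names = colnames(counts)
)

dim(counts)                                      # genes x samples
is.integer(counts)                               # counts are stored as integers
identical(colnames(counts), rownames(metadata))  # samples line up
```

Real inputs should have the same shape: genes in rows, samples in columns, and one metadata row per sample.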


🔧 The Core Workflow

Step 1: Load Libraries and Data

library(DESeq2)

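# Import count matrix (check.names = FALSE keeps sample names exactly as written in the file)
counts <- read.csv("count_matrix.csv", row.names = 1, check.names = FALSE)
counts <- as.matrix(counts)
# Confirm the values are whole numbers first; integer coercion silently truncates decimals
stopifnot(all(counts == round(counts)))
storage.mode(counts) <- "integer"

# Import sample metadata
metadata <- read.csv("sample_info.csv", row.names = 1, stringsAsFactors = TRUE)

# Verify sample names match, in the same order (must return TRUE)
identical(colnames(counts), rownames(metadata))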

Step 2: Create DESeqDataSet

dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = metadata,
  design = ~ condition
)

Step 3: Essential Preprocessing

# Remove low-count genes (fewer than 10 reads in total across all samples)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Set reference level for comparisons
dds$condition <- relevel(dds$condition, ref = "control")

Done! You now have an analysis-ready DESeqDataSet.
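
The total-count filter in Step 3 can let through a gene whose reads all come from a single sample. A stricter variant, similar to the one suggested in the DESeq2 vignette, requires at least 10 reads in at least as many samples as the smallest experimental group. The sketch below compares the two on a plain matrix; the values and the group size of 3 are assumptions made for illustration.

```r
# Toy matrix, values invented for illustration: 4 genes x 6 samples (3 per group)
m <- matrix(
  c(15L, 12L, 20L, 11L, 18L, 25L,   # expressed everywhere
    60L,  0L,  0L,  0L,  0L,  0L,   # one large count in a single sample
     0L,  0L,  0L, 14L, 16L, 12L,   # expressed in one group only
     1L,  2L,  0L,  1L,  0L,  3L),  # low everywhere
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:6))
)

# Filter from Step 3: total count across all samples
keep_total <- rowSums(m) >= 10

# Variant: at least 10 reads in at least as many samples as the smallest group
smallest_group_size <- 3
keep_group <- rowSums(m >= 10) >= smallest_group_size

keep_total  # gene2 passes on the strength of one sample
keep_group  # gene2 is dropped, gene3 is kept
```

On a real object the same idea would read `dds <- dds[rowSums(counts(dds) >= 10) >= smallest_group_size, ]`, with the group size taken from your own design.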


⚡ Common Input Scenarios

From featureCounts

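# Geneid becomes the row names; the next five columns are annotation
# (Chr, Start, End, Strand, Length), so per-sample counts start at column 6
counts_raw <- read.table("featurecounts_output.txt", header = TRUE, row.names = 1)
# Remove annotation columns (keep only count data)
counts <- as.matrix(counts_raw[, 6:ncol(counts_raw), drop = FALSE])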
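
featureCounts labels each count column with the path of the BAM file it came from, so the column names usually need tidying before they can match the metadata row names. A minimal sketch, assuming the table was read with `check.names = FALSE` so the paths are intact; the paths shown are hypothetical.

```r
# Hypothetical column names as featureCounts would write them (paths to BAM files)
bam_cols <- c("aligned/ctrl_1.bam", "aligned/ctrl_2.bam", "aligned/trt_1.bam")

# Drop the directory, then the .bam extension
sample_ids <- sub("\\.bam$", "", basename(bam_cols))
sample_ids
```

The cleaned names can then be assigned with `colnames(counts) <- sample_ids`.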

From tximport (transcript-level data)

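library(tximport)
# files: named character vector of paths to each sample's Salmon quant.sf
# tx2gene_map: two-column data frame mapping transcript IDs to gene IDs
txi <- tximport(files, type = "salmon", tx2gene = tx2gene_map)
dds <- DESeqDataSetFromTximport(txi, colData = metadata, design = ~ condition)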
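
The `files` vector is not defined above. One way to build it is sketched here, assuming one Salmon output directory per sample under a hypothetical `salmon_out/` folder; the sample IDs and layout are illustrative, not taken from a real project.

```r
# Hypothetical sample IDs and directory layout: salmon_out/<sample>/quant.sf
sample_ids <- c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
files <- file.path("salmon_out", sample_ids, "quant.sf")

# Name the vector so the imported matrices carry sample IDs as column names
names(files) <- sample_ids

# Check that every file exists before calling tximport
file.exists(files)
```

Keeping `sample_ids` in the same order as the metadata rows avoids a name mismatch later.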

⚠️ Critical Pitfalls to Avoid

1. Sample Name Mismatches

# Always verify this returns TRUE
identical(colnames(counts), rownames(metadata))
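
When the check returns FALSE only because the samples are in a different order, the metadata can be reordered to follow the count matrix. A small sketch with invented sample names:

```r
# Toy inputs with the same samples in a different order (names are invented)
counts <- matrix(1:8, nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3", "s4")))
metadata <- data.frame(
  condition = factor(c("treated", "control", "treated", "control")),
  row.names = c("s3", "s1", "s4", "s2")
)

# Same set of samples? If not, reordering cannot fix the mismatch
stopifnot(setequal(colnames(counts), rownames(metadata)))

# Reorder metadata rows to follow the count matrix columns
metadata <- metadata[colnames(counts), , drop = FALSE]

identical(colnames(counts), rownames(metadata))
```

If the `setequal()` check fails, the names themselves differ (typos, suffixes, mangled characters) and need fixing at the source.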

2. Non-Integer Counts

# DESeq2 needs raw counts, not normalized values
# Run this before coercing to integer; after coercion it is always TRUE
all(counts == round(counts))  # Should be TRUE
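
The same check can be wrapped in a small helper that also catches missing and negative values. `is_raw_counts` is a hypothetical name introduced here for illustration; it is not part of DESeq2.

```r
# Toy matrix with one non-integer value, as normalized or estimated counts would have
counts <- matrix(c(10, 3, 7.5, 0), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

# TRUE only if every value is a non-missing, non-negative whole number
is_raw_counts <- function(x) {
  !anyNA(x) && all(x >= 0) && all(x == round(x))
}

is_raw_counts(counts)         # FALSE: 7.5 is not a whole number
is_raw_counts(round(counts))  # TRUE, but rounding normalized values only hides the problem
```

If the decimals come from Salmon estimates, the tximport route shown earlier is the intended way in, rather than rounding by hand.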

3. Wrong Reference Level

# Control should be first in factor levels
levels(dds$condition)  # Check this!
dds$condition <- relevel(dds$condition, ref = "control")
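
The reason this matters: R sorts factor levels alphabetically by default, so the reference is whichever label happens to sort first. A short base R illustration with hypothetical condition labels:

```r
# Hypothetical condition labels; factor() sorts levels alphabetically by default
condition <- factor(c("wildtype", "knockout", "wildtype", "knockout"))
levels(condition)   # "knockout" sorts first, so it would be the reference

# Make wildtype the reference level explicitly
condition <- relevel(condition, ref = "wildtype")
levels(condition)   # "wildtype" is now first
```

With "control" and "treated" the alphabetical default happens to be right, which is why the problem often goes unnoticed until the labels change.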

✅ Quick Quality Check

# Basic statistics
print(dds)
colSums(counts(dds))  # Sequencing depth per sample
colSums(counts(dds) > 0)  # Number of detected genes per sample

# Save for analysis
saveRDS(dds, "analysis_ready_dds.rds")
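
The per-sample numbers are easier to compare side by side in one table. A sketch on a toy matrix with invented values; on a real object, `m` would be `counts(dds)`.

```r
# Toy count matrix (invented values): 4 genes x 3 samples, filled column by column
m <- matrix(c(100L, 0L, 5L, 40L,
              250L, 2L, 0L, 80L,
               90L, 0L, 0L, 35L),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3")))

qc <- data.frame(
  library_size   = colSums(m),      # total reads assigned to genes, per sample
  detected_genes = colSums(m > 0)   # genes with at least one read, per sample
)
qc
```

A sample whose library size or detected-gene count is far from the others is worth a closer look before running the analysis.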

🚀 You’re Ready!

Your DESeqDataSet object now contains:

  • ✅ Properly formatted count data
  • ✅ Linked sample metadata
  • ✅ Correct experimental design
  • ✅ Filtered, analysis-ready genes

Next up: Post 4 will cover experimental design considerations and handling complex designs with batch effects and multiple factors.

Ready to run your first differential expression analysis? The statistical magic begins with DESeq(dds)! 🎯


💬 Share Your Thoughts!

What’s your most challenging data import scenario? Drop a comment below! 👇
