🧬 Mastering Bulk RNA-seq Analysis in R – Post 12: QC Heatmaps for Sample Quality! Link to heading
🔥 The Unsung Heroes of Quality Control Link to heading
After using PCA to get the big picture view of your data in Post 11, it’s time to zoom in on the fine details. While PCA shows you overall patterns, quality control heatmaps reveal the precise relationships between every pair of samples in your dataset.
These aren’t the gene expression heatmaps we covered earlier - these are diagnostic tools that focus entirely on sample quality and relationships. Think of them as the detailed inspection that follows PCA’s initial screening.
🎯 The Two QC Heatmap Champions Link to heading
🔥 Sample Correlation Heatmap Link to heading
What it shows: How similar each sample is to every other sample Values: Range from -1 to +1 (higher = more similar) Purpose: Reveals which samples have similar expression profiles
📊 Sample Distance Heatmap Link to heading
What it shows: Euclidean distances between samples in expression space Colors: Darker = more similar (counterintuitive but standard) Purpose: Shows clustering relationships and outliers
Both heatmaps should tell the same story, just with different visual approaches.
💡 Creating Your QC Heatmaps Link to heading
Basic Implementation Link to heading
library(DESeq2)
library(pheatmap)
# Use transformed data (essential!)
vst_data <- vst(dds, blind = FALSE)
# Sample correlation heatmap
sample_cor <- cor(assay(vst_data))
pheatmap(sample_cor,
main = "Sample Correlation Heatmap",
annotation_col = colData(dds)[c("condition")])
# Sample distance heatmap
sample_distances <- dist(t(assay(vst_data)))
sample_dist_matrix <- as.matrix(sample_distances)
pheatmap(sample_dist_matrix,
main = "Sample Distance Heatmap",
annotation_col = colData(dds)[c("condition")])
🌟 What Success Looks Like Link to heading
Sample Correlation Patterns Link to heading
High correlations within conditions (>0.8-0.9)
- Biological replicates should correlate strongly
- Shows consistent sample preparation and biology
Lower correlations between conditions (<0.7)
- Different treatments should be less similar
- Demonstrates your experimental manipulation worked
Clear block pattern
- Visual blocks of high correlation matching your experimental design
- Clean separation between treatment groups
Sample Distance Patterns Link to heading
Dark squares within conditions
- Samples from the same group cluster together
- Tight, consistent groupings
Light squares between conditions
- Clear separation between different treatments
- Hierarchical clustering follows biological logic
🚩 Red Flags and Troubleshooting Link to heading
Major Warning Signs Link to heading
Low correlations within replicates (<0.7)
- Possible sample mix-ups or technical failures
- Investigate individual samples immediately
One sample correlates poorly with its group
- Check for sample degradation or processing errors
- Consider removing if clearly problematic
Samples cluster by batch, not condition
- Batch effects are stronger than biological effects
- Need to model batch in your DESeq2 design
No clear clustering pattern
- Weak experimental effects or high technical noise
- May need more samples or stronger treatments
Quick Diagnostic Steps Link to heading
# Identify problematic samples
cor_matrix <- cor(assay(vst_data))
min_correlations <- apply(cor_matrix, 1, function(x) min(x[x < 1]))
potential_outliers <- names(which(min_correlations < 0.6))
print(paste("Potential outlier samples:", paste(potential_outliers, collapse = ", ")))
# Check batch effects
pheatmap(sample_cor, annotation_col = colData(dds)[c("condition", "batch")])
🔬 Real-World Interpretation Scenarios Link to heading
The Success Story Link to heading
What you see: Beautiful blocks of high correlation (>0.85) within each condition, clear separation between conditions, hierarchical clustering that perfectly matches your experimental design.
What this means: Your samples are high quality, your experimental conditions worked as intended, and your downstream analysis will be reliable.
The Investigation Needed Link to heading
What you see: One sample shows correlations of 0.6-0.7 with its supposed replicates but correlates well with a different condition.
What this means: Likely sample mix-up during processing. Check your sample tracking immediately and consider whether to reassign or remove the sample.
The Batch Effect Discovery Link to heading
What you see: Samples cluster perfectly by processing date rather than by treatment condition.
What this means: Technical variation is stronger than biological variation. You’ll need to include batch in your DESeq2 model or use batch correction methods.
⚡ Pro Tips for Success Link to heading
Essential Technical Points Link to heading
Always use transformed data: VST or rlog, never raw counts
- Raw counts will be dominated by sequencing depth differences
- Transformation makes correlations meaningful
Remove low-expression genes: Filter before calculating correlations
- Low-expression genes add noise to correlation calculations
- Focus on genes that are actually detectable
Use meaningful sample names: Make interpretation easier
- Clear naming helps spot patterns and problems
- Include key experimental factors in names
Interpretation Guidelines Link to heading
Correlation values:
-
0.9: Excellent replication
-
0.8-0.9: Good replication
-
0.7-0.8: Acceptable but investigate
-
<0.7: Problematic, needs investigation
Distance interpretation:
- Darker colors = more similar (yes, it’s counterintuitive)
- Look for clear blocks matching your experimental design
- Outliers appear as lighter rows/columns
🎉 Integration with Your QC Workflow Link to heading
The Complete Quality Control Pipeline Link to heading
- PCA plot: Overall patterns and major issues
- Sample correlation heatmap: Detailed similarity relationships
- Sample distance heatmap: Clustering validation
- Decision point: Proceed or investigate problems
Decision Framework Link to heading
If QC heatmaps look good:
- High within-group correlations (>0.8)
- Clear between-group separation
- Clustering matches experimental design
- → Proceed with differential expression analysis
If QC heatmaps show problems:
- Low correlations within groups
- Samples clustering by technical factors
- Clear outliers or batch effects
- → Investigate and fix before continuing
Quick Quality Assessment Link to heading
# Automated quality check
assess_sample_quality <- function(vst_data, condition_column) {
cor_matrix <- cor(assay(vst_data))
conditions <- colData(vst_data)[[condition_column]]
# Calculate within-group correlations
within_group_cors <- c()
for(cond in unique(conditions)) {
group_samples <- which(conditions == cond)
if(length(group_samples) > 1) {
group_cors <- cor_matrix[group_samples, group_samples]
within_group_cors <- c(within_group_cors, group_cors[upper.tri(group_cors)])
}
}
avg_within_cor <- mean(within_group_cors)
min_within_cor <- min(within_group_cors)
cat("Average within-group correlation:", round(avg_within_cor, 3), "\n")
cat("Minimum within-group correlation:", round(min_within_cor, 3), "\n")
if(min_within_cor > 0.8) {
cat("Sample quality: EXCELLENT\n")
} else if(min_within_cor > 0.7) {
cat("Sample quality: GOOD\n")
} else {
cat("Sample quality: NEEDS INVESTIGATION\n")
}
}
assess_sample_quality(vst_data, "condition")
🔍 The Bottom Line Link to heading
PCA shows the forest - overall patterns and major variation sources QC heatmaps show the trees - precise sample relationships and quality
Together, they provide a comprehensive view of your data quality. These plots take 5 minutes to generate but can save you months of chasing artifacts or drawing conclusions from problematic data.
Don’t skip this step. Quality control heatmaps are your safety net against sample mix-ups, batch effects, and technical failures that could invalidate your entire analysis.
🧪 What’s Next? Link to heading
Post 13: MA Plots and Dispersion Diagnostics will explore DESeq2’s built-in diagnostic plots that help you understand how well the statistical model fits your data and identify potential issues with the differential expression analysis itself.
💬 Share Your Thoughts! Link to heading
What sample quality issues have QC heatmaps helped you catch? Any tips for interpreting tricky correlation patterns?
#RNAseq #QualityControl #DESeq2 #SampleCorrelation #DistanceHeatmap #DataVisualization #Bioinformatics #VST #DiagnosticPlots #GeneExpression #RStats