🧬 Mastering Bulk RNA-seq Analysis in R – Post 12: QC Heatmaps for Sample Quality!

🔥 The Unsung Heroes of Quality Control

After using PCA to get the big-picture view of your data in Post 11, it’s time to zoom in on the fine details. While PCA shows you overall patterns, quality control heatmaps reveal the precise relationships between every pair of samples in your dataset.

These aren’t the gene expression heatmaps we covered earlier - these are diagnostic tools that focus entirely on sample quality and relationships. Think of them as the detailed inspection that follows PCA’s initial screening.

🎯 The Two QC Heatmap Champions

🔥 Sample Correlation Heatmap

What it shows: How similar each sample is to every other sample
Values: Range from -1 to +1 (higher = more similar)
Purpose: Reveals which samples have similar expression profiles

📊 Sample Distance Heatmap

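What it shows: Euclidean distances between samples in expression space
Colors: Darker = more similar (counterintuitive but standard). This holds for a sequential palette that maps small distances to dark shades; pheatmap's default palette is a blue-to-red diverging scale, so the color argument has to be set explicitly
Purpose: Shows clustering relationships and outliers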

Both heatmaps should tell the same story, just with different visual approaches.
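
Why do the two views agree? Here is a minimal base R sketch, using a simulated matrix in place of real VST output (the data and the names expr, z, r and d are illustrative assumptions, not part of the pipeline). Once each sample is z-scored, squared Euclidean distance is simply a rescaled version of 1 minus the correlation. On unscaled data the two can diverge somewhat, because distance also picks up differences in overall magnitude.

```r
set.seed(1)

# Hypothetical stand-in for a transformed expression matrix: 200 genes x 4 samples
expr <- matrix(rnorm(200 * 4), nrow = 200,
               dimnames = list(NULL, paste0("sample", 1:4)))

z <- scale(expr)             # z-score each sample (column): mean 0, sd 1
r <- cor(z)                  # Pearson correlation between samples
d <- as.matrix(dist(t(z)))   # Euclidean distance between samples
n <- nrow(z)

# For z-scored columns: d^2 = 2 * (n - 1) * (1 - r)
identity_holds <- all.equal(d^2, 2 * (n - 1) * (1 - r), check.attributes = FALSE)
print(identity_holds)
```

So a pair of samples with a high correlation necessarily has a small distance after scaling, which is why both heatmaps should show the same blocks.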

💡 Creating Your QC Heatmaps

Basic Implementation

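library(DESeq2)
library(pheatmap)
library(RColorBrewer)

# Use transformed data (essential!)
# blind = FALSE lets the transformation use the design; blind = TRUE gives a fully unsupervised QC view
vst_data <- vst(dds, blind = FALSE)

# pheatmap expects a plain data.frame whose row names match the sample names
sample_annot <- as.data.frame(colData(vst_data)[, "condition", drop = FALSE])

# Sample correlation heatmap
sample_cor <- cor(assay(vst_data))
pheatmap(sample_cor,
         main = "Sample Correlation Heatmap",
         annotation_col = sample_annot)

# Sample distance heatmap
sample_distances <- dist(t(assay(vst_data)))
sample_dist_matrix <- as.matrix(sample_distances)
pheatmap(sample_dist_matrix,
         # cluster on the sample distances themselves, not on distances between matrix rows
         clustering_distance_rows = sample_distances,
         clustering_distance_cols = sample_distances,
         # sequential palette: dark blue = small distance = similar samples
         color = colorRampPalette(rev(brewer.pal(9, "Blues")))(255),
         main = "Sample Distance Heatmap",
         annotation_col = sample_annot)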

🌟 What Success Looks Like

Sample Correlation Patterns

High correlations within conditions (>0.8-0.9)

Biological replicates should correlate strongly
Shows consistent sample preparation and biology

Lower correlations between conditions (<0.7 as a rough guide; the drop relative to within-condition values matters more than the exact number)

Different treatments should be less similar
Demonstrates your experimental manipulation worked

Clear block pattern

Visual blocks of high correlation matching your experimental design
Clean separation between treatment groups

Sample Distance Patterns

Dark squares within conditions

Samples from the same group cluster together
Tight, consistent groupings

Light squares between conditions

Clear separation between different treatments
Hierarchical clustering follows biological logic
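
To check "clustering follows biological logic" without eyeballing the dendrogram, one option is to cut the tree and cross-tabulate the clusters against the condition labels. This is a minimal sketch on simulated data: the matrix expr, the two condition labels and the effect sizes are all assumptions made up for illustration. With real data you would pass t(assay(vst_data)) to dist() instead.

```r
set.seed(42)
n_genes <- 500
condition <- rep(c("control", "treated"), each = 3)

# Hypothetical data: shared baseline, a shift in 50 genes for treated samples, plus noise
baseline <- rnorm(n_genes, mean = 8, sd = 2)
expr <- sapply(condition, function(cond) {
  shift <- if (cond == "treated") c(rep(2, 50), rep(0, n_genes - 50)) else 0
  baseline + shift + rnorm(n_genes, sd = 0.3)
})
colnames(expr) <- paste0(condition, "_", rep(1:3, times = 2))

# Same clustering pheatmap uses by default: Euclidean distance, complete linkage
hc <- hclust(dist(t(expr)))
clusters <- cutree(hc, k = length(unique(condition)))

# Rows = clusters, columns = conditions; a clean design has one non-zero cell per row
agreement <- table(cluster = clusters, condition = condition)
print(agreement)
```

If a cluster contains samples from more than one condition, that is the cue to look at batch, sample identity or outliers before moving on.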

🚩 Red Flags and Troubleshooting

Major Warning Signs

Low correlations within replicates (<0.7)

Possible sample mix-ups or technical failures
Investigate individual samples immediately

One sample correlates poorly with its group

Check for sample degradation or processing errors
Consider removing if clearly problematic

Samples cluster by batch, not condition

Batch effects are stronger than biological effects
Need to model batch in your DESeq2 design

No clear clustering pattern

Weak experimental effects or high technical noise
May need more samples or stronger treatments

Quick Diagnostic Steps

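# Identify problematic samples
cor_matrix <- cor(assay(vst_data))
diag(cor_matrix) <- NA  # drop self-correlations explicitly instead of relying on x < 1
min_correlations <- apply(cor_matrix, 1, min, na.rm = TRUE)
potential_outliers <- names(which(min_correlations < 0.6))

print(paste("Potential outlier samples:", paste(potential_outliers, collapse = ", ")))

# Check batch effects (assumes colData(dds) contains a "batch" column)
# pheatmap needs a plain data.frame, so convert the DataFrame first
batch_annot <- as.data.frame(colData(dds)[, c("condition", "batch")])
pheatmap(sample_cor, annotation_col = batch_annot)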

🔬 Real-World Interpretation Scenarios

The Success Story

What you see: Beautiful blocks of high correlation (>0.85) within each condition, clear separation between conditions, hierarchical clustering that perfectly matches your experimental design.

What this means: Your samples are high quality, your experimental conditions worked as intended, and your downstream analysis will be reliable.

The Investigation Needed

What you see: One sample shows correlations of 0.6-0.7 with its supposed replicates but correlates well with a different condition.

What this means: Likely sample mix-up during processing. Check your sample tracking immediately and consider whether to reassign or remove the sample.
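
A mix-up like this can also be flagged programmatically. The sketch below is one possible approach, not a standard function: flag_possible_mixups is a hypothetical helper that compares each sample's mean correlation with its labelled group against its mean correlation with every other group. It assumes every group has at least two samples, and the toy data (seven samples, one deliberately mislabelled) is simulated purely for illustration.

```r
# Hypothetical helper: which group does each sample resemble most?
flag_possible_mixups <- function(cor_matrix, conditions) {
  conditions <- as.character(conditions)
  diag(cor_matrix) <- NA                      # ignore self-correlation
  groups <- unique(conditions)
  # rows = samples, columns = groups: mean correlation of each sample with each group
  mean_by_group <- sapply(groups, function(g) {
    rowMeans(cor_matrix[, conditions == g, drop = FALSE], na.rm = TRUE)
  })
  best_group <- groups[apply(mean_by_group, 1, which.max)]
  data.frame(sample = rownames(cor_matrix),
             labelled = conditions,
             best_match = best_group,
             suspect = best_group != conditions,
             stringsAsFactors = FALSE)
}

# Simulated example: sample s4 is truly a control but labelled as treated
set.seed(7)
n_genes <- 300
true_group <- rep(c("control", "treated"), times = c(4, 3))
labelled   <- rep(c("control", "treated"), times = c(3, 4))
baseline <- rnorm(n_genes, mean = 8, sd = 2)
shift <- c(rep(2, 50), rep(0, n_genes - 50))
expr <- sapply(true_group, function(g) {
  baseline + (g == "treated") * shift + rnorm(n_genes, sd = 0.3)
})
colnames(expr) <- paste0("s", seq_along(true_group))

result <- flag_possible_mixups(cor(expr), labelled)
print(result)
```

With real data the call would look like flag_possible_mixups(cor(assay(vst_data)), vst_data$condition). A flagged sample is a prompt to check the sample sheet, not proof of a swap.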

The Batch Effect Discovery

What you see: Samples cluster perfectly by processing date rather than by treatment condition.

What this means: Technical variation is stronger than biological variation. You’ll need to include batch in your DESeq2 model or use batch correction methods.
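
What "include batch in your model" means in practice is adding batch as a term in the design formula, with the variable of interest last. The sketch below uses a made-up sample sheet (the samples data frame and its batch1/batch2 labels are assumptions) and base R's model.matrix to show the columns such a design produces. The commented lines show how the same formula would be assigned to a DESeqDataSet, assuming its colData has a batch column.

```r
# Hypothetical sample sheet: 2 conditions spread across 2 processing batches
samples <- data.frame(
  condition = factor(rep(c("control", "treated"), times = 4)),
  batch     = factor(rep(c("batch1", "batch2"), each = 4))
)

# Nuisance variable first, variable of interest last
design_formula <- ~ batch + condition
mm <- model.matrix(design_formula, data = samples)
print(colnames(mm))

# The design is only estimable if batch and condition are not confounded
is_full_rank <- qr(mm)$rank == ncol(mm)
print(is_full_rank)

# With a DESeqDataSet called dds (assumed to have a "batch" column in colData):
# design(dds) <- ~ batch + condition
# dds <- DESeq(dds)
```

If every treated sample sits in one batch and every control in another, the rank check fails and no model can separate the two effects. That is a design problem, not something to fix in software.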

⚡ Pro Tips for Success

Essential Technical Points

Always use transformed data: VST or rlog, never raw counts

Raw counts are dominated by sequencing depth differences and by a few very highly expressed genes
Transformation makes correlations meaningful

Remove low-expression genes: Filter before calculating correlations

Low-expression genes add noise to correlation calculations
Focus on genes that are actually detectable

Use meaningful sample names: Make interpretation easier

Clear naming helps spot patterns and problems
Include key experimental factors in names
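
For the low-expression filter, a common heuristic is to keep genes with at least some minimum count in at least as many samples as the smallest group. The sketch below applies it to a simulated count matrix. The thresholds (10 reads, 3 samples) and the counts matrix are assumptions for illustration, so tune them to your own design.

```r
set.seed(3)
n_genes <- 1000
n_samples <- 6

# Hypothetical raw counts: first 500 genes barely expressed, last 500 well expressed
gene_means <- rep(c(2, 200), each = 500)
counts <- matrix(rnbinom(n_genes * n_samples, mu = gene_means, size = 5),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))

min_count <- 10           # assumed threshold
smallest_group_size <- 3  # assumed size of the smallest condition group

# Keep genes reaching min_count in at least smallest_group_size samples
keep <- rowSums(counts >= min_count) >= smallest_group_size
filtered <- counts[keep, , drop = FALSE]

cat("Genes before:", nrow(counts), " after:", nrow(filtered), "\n")

# Equivalent on a DESeqDataSet, before running vst():
# keep <- rowSums(counts(dds) >= min_count) >= smallest_group_size
# dds <- dds[keep, ]
```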

Interpretation Guidelines

Correlation values:

>0.9: Excellent replication
0.8-0.9: Good replication
0.7-0.8: Acceptable but investigate
<0.7: Problematic, needs investigation
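
These bands are easy to turn into a lookup so that every replicate pair gets a label. A small sketch follows: label_correlation is a hypothetical helper, and the choice to put a value sitting exactly on a boundary into the higher band is an assumption, since the guideline does not say.

```r
# Hypothetical helper mapping correlations to the guideline bands
label_correlation <- function(r) {
  cut(r,
      breaks = c(-1, 0.7, 0.8, 0.9, 1),
      labels = c("problematic", "acceptable but investigate", "good", "excellent"),
      right = FALSE,          # intervals are [lower, upper)
      include.lowest = TRUE)  # so that r = 1 still lands in "excellent"
}

example_labels <- label_correlation(c(0.95, 0.85, 0.75, 0.55))
print(as.character(example_labels))

# On real data, label the replicate pairs of one condition, e.g.:
# group_cor <- sample_cor[idx, idx]   # idx = samples of one condition
# table(label_correlation(group_cor[upper.tri(group_cor)]))
```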

Distance interpretation:

Darker colors = more similar (yes, it’s counterintuitive)
Look for clear blocks matching your experimental design
Outliers appear as lighter rows/columns

🎉 Integration with Your QC Workflow

The Complete Quality Control Pipeline

PCA plot: Overall patterns and major issues
Sample correlation heatmap: Detailed similarity relationships
Sample distance heatmap: Clustering validation
Decision point: Proceed or investigate problems

Decision Framework

If QC heatmaps look good:

High within-group correlations (>0.8)
Clear between-group separation
Clustering matches experimental design
→ Proceed with differential expression analysis

If QC heatmaps show problems:

Low correlations within groups
Samples clustering by technical factors
Clear outliers or batch effects
→ Investigate and fix before continuing

Quick Quality Assessment

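# Automated quality check
assess_sample_quality <- function(vst_data, condition_column) {
  cor_matrix <- cor(assay(vst_data))
  conditions <- colData(vst_data)[[condition_column]]
  
  # Collect pairwise correlations between replicates of the same condition
  within_group_cors <- c()
  for(cond in unique(conditions)) {
    group_samples <- which(conditions == cond)
    if(length(group_samples) > 1) {
      group_cors <- cor_matrix[group_samples, group_samples]
      within_group_cors <- c(within_group_cors, group_cors[upper.tri(group_cors)])
    }
  }
  
  # Without replicates there is nothing to assess
  if(length(within_group_cors) == 0) {
    stop("No condition has more than one sample; cannot assess replicate correlation")
  }
  
  avg_within_cor <- mean(within_group_cors)
  min_within_cor <- min(within_group_cors)
  
  cat("Average within-group correlation:", round(avg_within_cor, 3), "\n")
  cat("Minimum within-group correlation:", round(min_within_cor, 3), "\n")
  
  # Thresholds follow the interpretation guidelines above
  if(min_within_cor > 0.9) {
    cat("Sample quality: EXCELLENT\n")
  } else if(min_within_cor > 0.8) {
    cat("Sample quality: GOOD\n")
  } else if(min_within_cor > 0.7) {
    cat("Sample quality: ACCEPTABLE, BUT INVESTIGATE\n")
  } else {
    cat("Sample quality: NEEDS INVESTIGATION\n")
  }
  
  # Return the numbers invisibly so they can be reused programmatically
  invisible(list(average = avg_within_cor, minimum = min_within_cor))
}

assess_sample_quality(vst_data, "condition")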

🔍 The Bottom Line

PCA shows the forest - overall patterns and major variation sources
QC heatmaps show the trees - precise sample relationships and quality

Together, they provide a comprehensive view of your data quality. These plots take 5 minutes to generate but can save you months of chasing artifacts or drawing conclusions from problematic data.

Don’t skip this step. Quality control heatmaps are your safety net against sample mix-ups, batch effects, and technical failures that could invalidate your entire analysis.

🧪 What’s Next?

Post 13: MA Plots and Dispersion Diagnostics will explore DESeq2’s built-in diagnostic plots that help you understand how well the statistical model fits your data and identify potential issues with the differential expression analysis itself.

What sample quality issues have QC heatmaps helped you catch? Any tips for interpreting tricky correlation patterns?

#RNAseq #QualityControl #DESeq2 #SampleCorrelation #DistanceHeatmap #DataVisualization #Bioinformatics #VST #DiagnosticPlots #GeneExpression #RStats