🔬 Bulk RNA-Seq Series – Post 6: From BAM to Count Matrices with featureCounts & HTSeq Link to heading
🛠 Why Counting Reads Matters Link to heading
After aligning your RNA-Seq reads to a reference genome, the next step is to quantify how many reads map to each gene. This generates the count matrix — a crucial input for differential expression tools such as DESeq2, edgeR, or limma-voom.
The count matrix forms the core of RNA-Seq analysis: - 📊 Rows = Genes (features) - 📊 Columns = Samples - 📊 Values = Number of mapped reads per gene
Let’s explore the two most common tools for generating count matrices: featureCounts and HTSeq-count.
🔧 featureCounts: Fast & Versatile Link to heading
Developed as part of the Subread package, featureCounts
is designed for efficient processing of large BAM files.
🔹 Why use featureCounts? Link to heading
✔️ Extremely fast and memory-efficient
✔️ Supports multi-threading for speed (-T
)
✔️ Accepts GTF and GFF annotations
✔️ Handles gene-level and exon-level counting
✔️ Easy integration into large pipelines
📘 Example Command: Link to heading
featureCounts -T 8 -a genes.gtf -o counts.txt aligned.bam
🔍 Option Breakdown Link to heading
Option | Meaning |
---|---|
-T 8 |
Use 8 processing threads |
-a genes.gtf |
Input annotation file (GTF format) |
-o counts.txt |
Output file for count matrix |
-s 1 |
Strand-specific mode (0 = unstranded, 1 = yes) |
✅ Outputs a tab-delimited file with gene IDs and raw counts.
🐍 HTSeq-count: Pythonic & Reliable Link to heading
HTSeq-count
is part of the HTSeq Python package and offers simplicity and reproducibility for RNA-Seq quantification.
🔹 Why use HTSeq-count? Link to heading
✔️ Excellent compatibility with Ensembl annotations
✔️ Reliable defaults and informative error messages
✔️ Ideal for scripting in custom workflows
✔️ Handles strandedness and different sorting methods
📘 Example Command: Link to heading
htseq-count -f bam -r pos -s no -i gene_id aligned.bam genes.gtf > counts.txt
🔍 Option Breakdown Link to heading
Option | Description |
---|---|
-f bam |
Input format (BAM) |
-r pos |
Input sorted by position |
-s no |
No strand specificity (yes or reverse if needed) |
-i gene_id |
Attribute in GTF to use as gene ID |
✅ Outputs a gene-wise count table to standard output.
📊 Key Considerations Before Counting Link to heading
To ensure accurate read counting:
- ✅ BAM files should be sorted by coordinate
- ✅ Annotations must match the genome version used during alignment
- ✅ Select the correct strand mode (
-s
) based on your library prep - ✅ Inspect logs and output for warnings or read assignment failures
🧮 What Does the Output Look Like? Link to heading
Both tools produce a matrix like this:
Gene ID | Sample1 | Sample2 | Sample3 |
---|---|---|---|
ENSG000001 | 2031 | 1987 | 2203 |
ENSG000002 | 412 | 398 | 441 |
This raw count matrix is the essential input for DESeq2 normalization, variance stabilization, and modeling steps.
📌 Key Takeaways Link to heading
✔️ featureCounts
is optimized for speed and scale, ideal for large datasets
✔️ HTSeq-count
is great for small-scale, scriptable analysis
✔️ Proper sorting, strandedness, and annotation matching are critical
✔️ Always check count summaries for anomalies
📌 Next up: Preprocessing Count Matrices for DESeq2! Stay tuned! 🚀
👇 Which counting tool do you prefer — featureCounts or HTSeq? Let’s discuss!
#RNASeq #featureCounts #HTSeq #CountMatrix #Transcriptomics #Genomics #Bioinformatics #ComputationalBiology #OpenScience #DataScience