🔬 Bulk RNA-Seq Series – Post 6: From BAM to Count Matrices with featureCounts & HTSeq Link to heading

🛠 Why Counting Reads Matters Link to heading

After aligning your RNA-Seq reads to a reference genome, the next step is to quantify how many reads map to each gene. This generates the count matrix — a crucial input for differential expression tools such as DESeq2, edgeR, or limma-voom.

The count matrix forms the core of RNA-Seq analysis: - 📊 Rows = Genes (features) - 📊 Columns = Samples - 📊 Values = Number of mapped reads per gene

Let’s explore the two most common tools for generating count matrices: featureCounts and HTSeq-count.


🔧 featureCounts: Fast & Versatile Link to heading

Developed as part of the Subread package, featureCounts is designed for efficient processing of large BAM files.

🔹 Why use featureCounts? Link to heading

✔️ Extremely fast and memory-efficient
✔️ Supports multi-threading for speed (-T)
✔️ Accepts GTF and GFF annotations
✔️ Handles gene-level and exon-level counting
✔️ Easy integration into large pipelines

📘 Example Command: Link to heading

featureCounts -T 8 -a genes.gtf -o counts.txt aligned.bam

🔍 Option Breakdown Link to heading

Option Meaning
-T 8 Use 8 processing threads
-a genes.gtf Input annotation file (GTF format)
-o counts.txt Output file for count matrix
-s 1 Strand-specific mode (0 = unstranded, 1 = yes)

✅ Outputs a tab-delimited file with gene IDs and raw counts.


🐍 HTSeq-count: Pythonic & Reliable Link to heading

HTSeq-count is part of the HTSeq Python package and offers simplicity and reproducibility for RNA-Seq quantification.

🔹 Why use HTSeq-count? Link to heading

✔️ Excellent compatibility with Ensembl annotations
✔️ Reliable defaults and informative error messages
✔️ Ideal for scripting in custom workflows
✔️ Handles strandedness and different sorting methods

📘 Example Command: Link to heading

htseq-count -f bam -r pos -s no -i gene_id aligned.bam genes.gtf > counts.txt

🔍 Option Breakdown Link to heading

Option Description
-f bam Input format (BAM)
-r pos Input sorted by position
-s no No strand specificity (yes or reverse if needed)
-i gene_id Attribute in GTF to use as gene ID

✅ Outputs a gene-wise count table to standard output.


📊 Key Considerations Before Counting Link to heading

To ensure accurate read counting:

  • ✅ BAM files should be sorted by coordinate
  • ✅ Annotations must match the genome version used during alignment
  • ✅ Select the correct strand mode (-s) based on your library prep
  • ✅ Inspect logs and output for warnings or read assignment failures

🧮 What Does the Output Look Like? Link to heading

Both tools produce a matrix like this:

Gene ID Sample1 Sample2 Sample3
ENSG000001 2031 1987 2203
ENSG000002 412 398 441

This raw count matrix is the essential input for DESeq2 normalization, variance stabilization, and modeling steps.


📌 Key Takeaways Link to heading

✔️ featureCounts is optimized for speed and scale, ideal for large datasets
✔️ HTSeq-count is great for small-scale, scriptable analysis
✔️ Proper sorting, strandedness, and annotation matching are critical
✔️ Always check count summaries for anomalies

📌 Next up: Preprocessing Count Matrices for DESeq2! Stay tuned! 🚀

👇 Which counting tool do you prefer — featureCounts or HTSeq? Let’s discuss!

#RNASeq #featureCounts #HTSeq #CountMatrix #Transcriptomics #Genomics #Bioinformatics #ComputationalBiology #OpenScience #DataScience