Mastering Single-Cell RNA-Seq Analysis in R – Post 15: From Marker Genes to Biological Pathways!
The single-cell series is back! It’s been a while, but we’re picking up right where we left off. In Post 14, we transformed overwhelming DEA tables into crystal-clear volcano plots and heatmaps — visualizations that let your eyes do the pattern recognition instead of drowning in endless rows of numbers.
But here’s the thing: even the most beautiful volcano plot still leaves you with a fundamental question. You can see NKG7 and GZMB lighting up like fireworks in your NK cell comparison — but what do these genes actually do together? What biological story are they telling as a group?
That’s exactly what pathway enrichment analysis answers. Today we’re bridging the gap from individual marker genes to biological mechanisms — and the best part? If you followed our bulk RNA-seq series, you already know how this works!
The Gene List Problem
After running FindMarkers() or FindAllMarkers(), you end up with a list of differentially expressed genes. For NK cells compared to B cells in our IFNB dataset, you might see something like:
- NKG7 — avg_log2FC: 8.2, p_adj: 1e-50
- GZMB — avg_log2FC: 7.1, p_adj: 2e-45
- PRF1 — avg_log2FC: 6.8, p_adj: 5e-40
- CD247 — avg_log2FC: 5.9, p_adj: 3e-38
- KLRB1 — avg_log2FC: 5.2, p_adj: 1e-35
Impressive numbers. But what’s the collective biological function these genes represent? Are they all part of the same process? Are there unexpected pathways hiding in your marker list?
This is the “so what?” moment that pathway enrichment analysis resolves. It takes you from a molecular inventory to a biological narrative — from “these 47 genes are upregulated” to “this cell type is specialized for cytotoxic immune response and viral defense.”
What Is Pathway Enrichment Analysis?
Pathway Enrichment Analysis (PEA) asks a deceptively simple question: Are my differentially expressed genes involved in specific biological processes more than expected by chance?
The concept is intuitive. Imagine you’ve caught 200 fish from a lake containing 20,000 fish. You notice that 50 of your fish are salmon — but salmon only make up 5% of the lake. You’d expect about 10 salmon by chance, yet you caught 50. That’s a 5-fold enrichment, and it’s statistically significant. Something about your fishing method is specifically attracting salmon.
Replace “fish” with “genes” and “salmon” with “immune response genes,” and you have Over-Representation Analysis (ORA) — the foundational method of pathway enrichment.
The math boils down to a hypergeometric test:
- Your gene list: Differentially expressed genes from DEA
- A pathway: A curated set of genes known to work together (e.g., “Natural Killer Cell Mediated Cytotoxicity”)
- The question: Are more of your DEA genes in this pathway than random chance would predict?
When the answer is yes, you’ve found an enriched pathway — and that pathway tells you what your cell type is actually doing.
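To make the fishing analogy concrete, here is a minimal sketch of the same calculation in base R using `phyper()`. The numbers are the ones from the lake example; the variable names are just illustrative.

```r
# Fish analogy as a hypergeometric (over-representation) test.
lake_size    <- 20000  # all fish in the lake (the gene "universe")
salmon_lake  <- 1000   # salmon in the lake: 5% of 20,000 (genes in the pathway)
catch_size   <- 200    # fish you caught (your DEG list)
salmon_catch <- 50     # salmon in your catch (DEGs that fall in the pathway)

salmon_frac <- salmon_lake / lake_size      # 0.05
expected    <- catch_size * salmon_frac     # salmon expected by chance
fold_enrich <- salmon_catch / expected      # observed / expected

# P(X >= 50): chance of catching at least 50 salmon at random.
# phyper(q, m, n, k) returns P(X <= q), so we pass q = observed - 1
# and ask for the upper tail.
p_value <- phyper(salmon_catch - 1,
                  m = salmon_lake,
                  n = lake_size - salmon_lake,
                  k = catch_size,
                  lower.tail = FALSE)

cat("Expected by chance:", expected, "\n")
cat("Fold enrichment:   ", fold_enrich, "\n")
cat("p-value:           ", p_value, "\n")
```

Swap the lake for your gene universe and the catch for your DEG list, and this is the test an ORA tool runs once per pathway.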
The Beautiful Truth: Single-Cell PEA = Bulk PEA
Here’s something that makes this topic much less intimidating: pathway enrichment analysis works identically for single-cell and bulk RNA-seq data.
Why? Because both approaches take DEA results as input. Whether your differentially expressed genes came from DESeq2 on bulk data or from FindMarkers() on single-cell clusters, the downstream enrichment analysis is the same. The biology hasn’t changed — just the resolution at which you discovered those genes.
If you followed our bulk RNA-seq series, you’ve already mastered these concepts:
- Post 13: GO Enrichment — the fishing expedition analogy and Over-Representation Analysis with Gene Ontology
- Post 14: KEGG Pathways — mapping genes to metabolic and signaling pathway maps
- Post 15: Reactome — human-curated pathways for disease and clinical research
- Post 16: GSEA — Gene Set Enrichment Analysis for coordinated pathway changes
Every principle, every tool, every interpretation strategy from those posts applies directly to your single-cell markers. The only difference is where your gene list came from.
ORA vs GSEA: Why We Start with ORA
There are two major approaches to pathway enrichment, and choosing the right one matters:
ORA (Over-Representation Analysis) asks: “Are my significant genes enriched in this pathway?” It takes a simple gene list — your DEGs that pass a significance and fold-change threshold — and tests for enrichment against curated databases.
GSEA (Gene Set Enrichment Analysis) asks: “Is this pathway trending across ALL my genes?” It uses a ranked list of every gene (not just significant ones) and looks for coordinated shifts in pathway gene sets.
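For a feel of what “a ranked list of every gene” means in practice, here is a small base R sketch. It assumes you rank by `avg_log2FC`, which is one common choice but not the only one. The three NK genes and their fold changes come from the example list earlier in this post; the two negative entries (CD79A, MS4A1) are made-up placeholders so the ranking has a bottom end.

```r
# Toy DEA table. The negative fold changes are hypothetical placeholders.
dea <- data.frame(
  gene       = c("PRF1", "CD79A", "NKG7", "MS4A1", "GZMB"),
  avg_log2FC = c(6.8, -6.5, 8.2, -7.3, 7.1)
)

# GSEA-style input: a named numeric vector, sorted from most up to most down.
ranks <- setNames(dea$avg_log2FC, dea$gene)
ranks <- sort(ranks, decreasing = TRUE)

print(ranks)
```

Note that nothing is filtered out here: every gene stays in the vector, which is exactly what separates GSEA input from the thresholded gene list ORA uses.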
For single-cell analysis, ORA is the natural starting point for several reasons:
- It works directly with FindMarkers output. Filter for significance and fold change, extract gene names, and you’re ready to go.
- No ranking complexity. GSEA requires a meaningful ranked gene list, which gets tricky with FindAllMarkers(), where you’re comparing one cluster against all others simultaneously.
- Clear, interpretable results. ORA gives you a straightforward answer: these pathways are enriched in your cell type markers.
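The “filter, then extract gene names” step can be sketched in a few lines of base R. The toy table below mimics the shape of `FindMarkers()` output (genes as row names, with `avg_log2FC` and `p_val_adj` columns). The first five rows reuse the example values from earlier in this post; the ACTB and CD79A rows are hypothetical, and the cutoffs (adjusted p < 0.05, log2FC > 1) are assumptions you should tune for your own data.

```r
# Toy stand-in for FindMarkers() output; last two rows are hypothetical.
markers <- data.frame(
  avg_log2FC = c(8.2, 7.1, 6.8, 5.9, 5.2, 0.3, -4.0),
  p_val_adj  = c(1e-50, 2e-45, 5e-40, 3e-38, 1e-35, 0.8, 1e-20),
  row.names  = c("NKG7", "GZMB", "PRF1", "CD247", "KLRB1", "ACTB", "CD79A")
)

# Assumed thresholds; adjust to taste.
padj_cutoff <- 0.05
lfc_cutoff  <- 1

# Keep significant, upregulated genes and pull out their names.
is_up    <- markers$p_val_adj < padj_cutoff & markers$avg_log2FC > lfc_cutoff
up_genes <- rownames(markers)[is_up]

print(up_genes)
```

With `FindAllMarkers()` the gene names live in a `gene` column instead of the row names, so you would also subset by `cluster` before extracting them.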
We’ll tackle GSEA in a later post — including ssGSEA with the escape package, which works directly on expression matrices rather than DEA results. That opens up an entirely different analytical approach for single-cell data. But ORA first!
The Three Databases You Need to Know
ORA tests your gene list against curated pathway databases. The three major ones each offer a different lens on your biology:
Gene Ontology (GO) organizes biological knowledge into three sub-ontologies:
- Biological Process (BP): What cells do — e.g., “immune effector process,” “cell killing”
- Cellular Component (CC): Where things happen — e.g., “cytoplasmic granule,” “cell surface”
- Molecular Function (MF): How genes work — e.g., “serine-type endopeptidase activity”
KEGG provides hand-drawn pathway maps showing how genes interact in metabolic and signaling cascades. Think of KEGG as a GPS for cellular processes — it doesn’t just tell you which genes are involved, it shows you how they connect.
Reactome offers expertly curated human pathways with a focus on disease mechanisms and clinical relevance. It’s the gold standard when your research has translational implications.
Start with GO (specifically BP) for the broadest biological picture, then layer in KEGG and Reactome for mechanistic and clinical depth.
What Good Results Look Like
When ORA works well on your single-cell markers, you should see:
- Biologically coherent terms — NK cell markers should enrich for cytotoxicity and immune defense, not random metabolic processes
- Significant p-values after correction — look for adjusted p-values below 0.05 using Benjamini-Hochberg correction
- Reasonable gene ratios — pathways with 5-50 genes in your list are most informative; very large or very small overlaps can be misleading
- Consistent story across databases — if GO says “immune response” and KEGG says “Natural Killer Cell Mediated Cytotoxicity,” you’re on solid ground
Red flags include only very broad terms (like “biological regulation”), contradictory results between databases, or no significant enrichment at all — which may indicate your DEA thresholds need adjusting.
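Two of these checks are easy to automate. The sketch below applies Benjamini-Hochberg correction with base R’s `p.adjust()` and flags terms whose overlap falls in the 5-50 gene range. The results table is entirely hypothetical (placeholder term names, made-up p-values and overlap counts), and treating 5-50 as a hard flag is an assumption for illustration, not a rule.

```r
# Hypothetical ORA results: term names, raw p-values, and overlaps are made up.
ora <- data.frame(
  term    = c("term_A", "term_B", "term_C", "term_D"),
  p_value = c(0.0001, 0.004, 0.03, 0.20),
  overlap = c(12, 8, 3, 60)   # number of your genes found in each term
)

# Benjamini-Hochberg correction across all tested terms.
ora$p_adjust <- p.adjust(ora$p_value, method = "BH")

# Flag terms that are significant AND have an overlap in the 5-50 range.
ora$significant <- ora$p_adjust < 0.05
ora$in_range    <- ora$overlap >= 5 & ora$overlap <= 50
ora$keep        <- ora$significant & ora$in_range

print(ora)
```

Here term_C passes the significance cutoff but rests on only three genes, which is the kind of result worth a second look before you build a story on it.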
Coming Up Next
Now that you understand the theory behind pathway enrichment analysis, it’s time to write some code! In Post 16, we’ll walk through a complete ORA workflow using clusterProfiler on our IFNB NK cell markers — from preparing your gene list to interpreting dotplots. Every step, every parameter, every visualization explained.
Ready to transform your marker gene lists into biological stories? Let’s do it!