Advanced scRNA-seq for Biologists — Post 2: Monocle3 Fundamentals! 🧭 Link to heading
In Post 1, we established when trajectory analysis makes biological sense! Now it’s time to actually do it! 🚀
Monocle3 is the workhorse package for trajectory inference in R! It’s well-maintained, handles complex branching trajectories, and integrates smoothly with Seurat workflows! By the end of this post, you’ll understand not just the code, but what each step does biologically — so you can make informed decisions about your own data! 🎯
Why Monocle3? 🤔 Link to heading
The trajectory inference landscape has dozens of methods! Why start here?
Active development! The Trapnell lab continues improving it! Documentation is solid! Community support exists! 📚
Handles branching! Unlike methods that assume a single linear trajectory, Monocle3 can capture cell fate decisions where lineages split! 🌳
Seurat integration! You can import your existing Seurat object — clusters, UMAP coordinates, everything — without starting from scratch! 🔗
Interpretable outputs! The results are visualizations and gene lists you can present to collaborators! No black-box mysteries! 📊
Is it the best method for every dataset? No! We’ll discuss method selection in Post 5! But it’s an excellent starting point and handles most common use cases well! ✅
The big picture workflow 🗺️ Link to heading
Before we look at code, here’s the conceptual flow:
- Convert your Seurat object to Monocle3’s format (cell_data_set) 🔄
- Learn the trajectory graph that connects your cells 🧠
- Order cells along that trajectory (pseudotime) ⏱️
- Analyze genes that change along the trajectory 🔬
Four steps! Each one builds on the previous! Let’s walk through them! 💪
Step 1: Converting from Seurat 🔄 Link to heading
Monocle3 uses its own data structure called a cell_data_set (cds)! The good news is that conversion from Seurat is straightforward using the SeuratWrappers package! ⚡
# Load required packages
library(Seurat)
library(monocle3)
library(SeuratWrappers)
# Convert Seurat object to cell_data_set
cds <- as.cell_data_set(seurat_obj)
This transfers your expression data, metadata, and dimensionality reduction! However, Monocle3 uses different terminology than Seurat:
- Seurat’s “reductions” become Monocle3’s “reducedDims”
- Seurat’s “meta.data” becomes Monocle3’s “colData”
- Seurat’s “assays” become Monocle3’s “assays”
After conversion, you need to calculate clusters in Monocle3’s framework — even if you already clustered in Seurat:
# Recreate UMAP for Monocle3
cds <- preprocess_cds(cds, method = "PCA")
cds <- reduce_dimension(cds, reduction_method = "UMAP")
cds <- cluster_cells(cds)
Alternatively, you can transfer your Seurat UMAP and clusters directly:
# Transfer UMAP coordinates from Seurat
cds@int_colData$reducedDims$UMAP <- seurat_obj@reductions$umap@cell.embeddings
# Transfer cluster labels
cds@clusters$UMAP$clusters <- seurat_obj$seurat_clusters
The second approach is faster and ensures consistency with your existing analysis! 🎯
Step 2: Learning the trajectory graph 🧠 Link to heading
This is where the magic happens! Monocle3 builds a principal graph — essentially a skeleton that captures the structure of your data! ✨
cds <- learn_graph(cds)
What’s happening under the hood? The algorithm finds a low-dimensional representation that describes the topology of your cells! Think of it as drawing the simplest possible “road network” that connects all your cells while respecting their relationships! 🛤️
The result is visible when you plot:
plot_cells(cds,
color_cells_by = "cluster",
label_cell_groups = FALSE,
label_leaves = TRUE,
label_branch_points = TRUE,
graph_label_size = 3)
You’ll see your UMAP with a black graph overlaid! Branch points are where lineages diverge! Leaves are endpoints (often mature cell types)! This graph structure is the foundation for pseudotime! 🌳
Key parameters to consider:
learn_graph(use_partition = TRUE)— keeps disconnected populations separate (usually what you want) ✅learn_graph(close_loop = FALSE)— prevents circular trajectories (default, usually appropriate) ✅
Step 3: Ordering cells in pseudotime ⏱️ Link to heading
Now we assign each cell a pseudotime value — its position along the trajectory! But pseudotime needs a starting point! 🏁
cds <- order_cells(cds)
This opens an interactive interface! You click on the graph to select the root — the point where pseudotime equals zero! All other cells get pseudotime values based on their distance from this root along the trajectory! 🎯
Choosing the root is a biological decision! This isn’t something the algorithm can determine automatically because it doesn’t know your biology! 🧬
Practical guidelines for root selection:
- For differentiation: Root should be your most undifferentiated cells! Look for stem cell markers, progenitor signatures, known early-stage identities! 🔬
- For disease progression: Root should be healthy or early-stage cells! 🏥
- For cell cycle: G1 cells are a natural starting point! 🔄
- If unsure: Pick the population you know best! If you’re confident that cluster 3 contains stem cells, root there! 🤔
After ordering, visualize pseudotime:
plot_cells(cds,
color_cells_by = "pseudotime",
label_cell_groups = FALSE,
label_leaves = FALSE,
label_branch_points = FALSE)
Cells should show a gradient from low pseudotime (your root) to high pseudotime (terminal states)! If the gradient looks wrong — say, your most differentiated cells have low pseudotime — you may have selected the wrong root! Re-run order_cells() and choose differently! 🔧
Step 4: Finding dynamic genes 🔬 Link to heading
With pseudotime assigned, you can identify genes that change along the trajectory! These are your differentiation markers — genes that turn on or off as cells progress! 📈
# Find genes that vary significantly with pseudotime
trajectory_genes <- graph_test(cds, neighbor_graph = "principal_graph")
# Filter for significant results
sig_genes <- subset(trajectory_genes, q_value < 0.05)
# Sort by significance
sig_genes <- sig_genes[order(sig_genes$q_value), ]
The output is a data frame with genes ranked by how significantly they vary across the trajectory! q_value is your multiple-testing-corrected p-value! Low values mean the gene’s expression changes systematically as pseudotime increases! 📊
To visualize specific genes:
plot_cells(cds,
genes = c("gene1", "gene2", "gene3"),
label_cell_groups = FALSE)
Or to see expression as a function of pseudotime:
plot_genes_in_pseudotime(cds[rownames(sig_genes)[1:5], ],
color_cells_by = "cluster")
This shows line plots of expression versus pseudotime! You’ll see which genes turn on early, which turn on late, and which are expressed in specific branches! 🎯
Handling branching trajectories 🌳 Link to heading
If your data contains cell fate decisions — a progenitor that can differentiate into two distinct lineages — you’ll want to analyze each branch separately!
# Identify cells in each branch/lineage
# This is based on which "partition" they belong to after the branch point
For branch-specific analysis, you can subset your cds to cells in one lineage and re-run gene finding! This lets you compare: which genes drive lineage A versus lineage B? 🤔
Common mistakes and how to avoid them ⚠️ Link to heading
Wrong root selection! If your pseudotime gradient doesn’t match expected biology (mature cells have low pseudotime), re-run order_cells()! This is the most common and most fixable error! 🔧
Forcing trajectories on discrete data! If you have genuinely separate cell types with no biological connection, the trajectory will be meaningless! Check Post 1 for guidance on when trajectory analysis is appropriate! ❌
Ignoring branch points! If learn_graph() identifies branch points but you analyze everything as one linear trajectory, you’re missing biological complexity! Investigate what’s branching! 🌳
Over-interpreting pseudotime values! Pseudotime is ordinal, not absolute! A cell with pseudotime 5 is “later” than one with pseudotime 3, but you can’t say it’s been differentiating “twice as long!” The numerical values are arbitrary! 📊
What good results look like ✅ Link to heading
When trajectory analysis works well:
- Pseudotime gradients align with known biology (stem markers at low pseudotime, mature markers at high) 🎯
- Dynamic genes include expected differentiation markers 🧬
- The trajectory structure (branches, endpoints) matches expected lineage relationships 🌳
- Results are reproducible across different root selections within the same biological starting point 🔄
When to be skeptical: 🤔
- Pseudotime correlates more with batch than biology ⚠️
- “Dynamic” genes are mostly housekeeping or technical artifacts ❌
- The trajectory connects populations that shouldn’t be connected 🚩
Complete minimal workflow 📋 Link to heading
Putting it all together:
library(Seurat)
library(monocle3)
library(SeuratWrappers)
# Convert from Seurat
cds <- as.cell_data_set(seurat_obj)
# Process for Monocle3
cds <- preprocess_cds(cds)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds)
# Learn trajectory
cds <- learn_graph(cds)
# Order cells (interactive - select root)
cds <- order_cells(cds)
# Visualize
plot_cells(cds, color_cells_by = "pseudotime")
# Find trajectory-dependent genes
trajectory_genes <- graph_test(cds, neighbor_graph = "principal_graph")
sig_genes <- subset(trajectory_genes, q_value < 0.05)
Bottom line 🎯 Link to heading
Monocle3 is GPS, not autopilot! It shows you possible routes through your data! Your job is to determine which route reflects real biology — and that requires bringing your biological knowledge to the interpretation! 🧠
The code is straightforward! The decisions are hard! In Post 3, we’ll focus on the hardest decision of all: what pseudotime values actually mean and how to validate that your trajectory reflects reality! 🔬
See you there! 🙏