Spatial Transcriptomics Data Analysis: A Practical Introduction
Spatial transcriptomics data analysis helps you read gene expression in place—not just how much a gene is expressed, but where it happens inside a tissue. That spatial context makes it possible to map cell neighborhoods, discover tissue micro-environments, and connect molecular changes to histology. This article is a clear, lab-friendly guide to getting from raw spots or cells to interpretable biological insights.
If you've done bulk or single-cell RNA-seq before, you'll feel at home. We'll keep the focus on decisions that matter in practice—how to design the experiment, set sensible QC thresholds, choose the right normalization, cluster and annotate regions, and integrate single-cell references for deconvolution. We also highlight widely used toolchains (e.g., Space Ranger, Seurat, Squidpy, Giotto) and when each fits.
What you'll get from this guide
- A step-by-step overview from preprocessing to reporting
- Practical QC tips that balance sensitivity and noise
- Clear routes for cell type mapping using scRNA-seq references
- Reproducible workflows you can adopt or adapt in your lab
Who this is for
Researchers who want a concise starting point with enough depth to run real analyses—without wading through excessive theory. We avoid hype and aim for choices you can justify in a methods section.
Development of spatial transcriptomics. (Du, Jun, et al., Journal of translational medicine 2023)
Standard Workflow for Spatial Transcriptomics Data Analysis
A typical spatial transcriptomics analysis follows a structured, stepwise workflow that transforms raw sequencing and image data into meaningful biological insights:
1. Preprocessing & Image Alignment
Run platform-specific pipelines (e.g., Space Ranger) to generate expression matrices and align tissue images.
2. Data Integration into an Analysis Framework
Load processed data into tools like Seurat, Squidpy, Giotto, or SpatialExperiment to organize spatial features and metadata.
3. Quality Control & Filtering
Identify and remove low-quality spots based on library size, gene counts, and mitochondrial content.
4. Normalization & Variable Gene Selection
Apply normalization methods (e.g., SCTransform, log-normalization) and select highly variable genes for downstream analysis.
5. Dimensionality Reduction & Clustering
Reduce dimensionality with PCA, cluster spots into transcriptionally distinct spatial domains, and use UMAP to visualize the result.
6. Spatially Variable Gene Detection
Identify genes showing significant spatial expression patterns (SVGs) across the tissue.
7. Cell Type Deconvolution
Integrate scRNA-seq references to estimate cell-type composition per spatial spot.
Typical structure of spatial transcriptomics analysis. (Williams, C.G., et al., Genome Med 2022)
Tip: Treat each step as a decision point. Your choices at early stages (like filtering or normalization) influence everything downstream.
Essential Tools for Spatial Transcriptomics Data Analysis
A wide range of open-source tools now supports spatial transcriptomics data analysis—from basic QC to high-level modeling. In this section, we introduce the most commonly used tools across five key areas of the analysis pipeline. Whether you prefer R or Python, or need flexible pipelines for multi-sample studies, these tools form the foundation of a reproducible and insightful workflow.
1. Preprocessing & Image Registration
Before analysis begins, raw data must be processed using the platform's own pipeline to align reads, generate count matrices, and register spatial coordinates with histological images.
- Space Ranger (10x Genomics): Processes Visium or Visium HD data; outputs expression matrices, tissue masks, aligned images, and QC metrics.
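- CosMx / GeoMx / Xenium vendor software: Comparable pipelines for NanoString (CosMx, GeoMx) and 10x Genomics (Xenium) platforms. CosMx and Xenium are imaging-based with single-cell resolution, while GeoMx profiles user-selected regions of interest.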
- Manual QC overlays: Always review spot overlays on tissue images to verify correct registration.
Why it matters: Misaligned images or incorrect spot calls at this stage can undermine the entire analysis downstream.
2. Analysis Frameworks (R or Python)
Once data are processed, the next step is to load and explore them in a dedicated framework that supports spatial features, clustering, visualization, and statistical testing.
- Seurat (R): Widely used for Visium data; offers modules for QC, normalization (including SCTransform), spatial clustering, and plotting.
- Squidpy (Python): Built on top of Scanpy, it includes spatial graphs, neighborhood-based analysis, and integration with histology images.
- Giotto (R): Offers both statistical analysis and a built-in interactive viewer; supports SVG detection and spatial interaction analysis.
- SpatialExperiment (Bioconductor, R): A standardized container for spatial datasets, ideal for reproducible workflows and cross-method benchmarking.
Overview of the SpatialExperiment class structure. (Righelli, D., et al., Bioinformatics, 2022)
Choosing a framework: Seurat and Squidpy are widely used and actively maintained, which makes them safe defaults. Giotto suits interactive exploration, while SpatialExperiment suits teams managing multiple spatial datasets within Bioconductor.
3. Cell-Type Mapping & Deconvolution
Many researchers want to infer which cell types are present in each spatial spot using single-cell RNA-seq as a reference. These tools help estimate cell-type proportions or assign labels based on transcriptomic profiles.
- cell2location (Python): Bayesian framework to map fine-grained cell types; supports multi-sample comparisons and uncertainty estimation.
- Tangram (Python): Learns spatial mappings between scRNA-seq and spatial datasets; fast and scalable for large tissues.
- RCTD (R): A popular deconvolution tool for Visium; works well with curated single-cell references.
Tip: Use tissue-matched or platform-matched single-cell references whenever possible, and validate predictions using known spatial markers or histology images.
4. Spatial Clustering & Domain Detection
Spatial transcriptomics data can reveal distinct tissue regions or microenvironments through clustering methods that incorporate spatial location and expression.
- BayesSpace (R): A Bayesian model that enhances resolution and refines spatial clusters beyond the spot level.
- SpaGCN (Python): Combines gene expression, spatial proximity, and histological texture using graph convolutional networks.
- stLearn (Python): Integrates spatial distance, morphology, and transcriptomics to reconstruct spatial trajectories and domains.
Best use: These methods are most effective when identifying subtle spatial patterns that standard clustering might overlook.
5. Reproducible Pipelines
To analyze multiple samples or standardize team workflows, automated pipelines save time and improve reproducibility.
- Spacemake (Python + Snakemake): Modular and scalable pipeline that handles preprocessing, clustering, and spatial modeling across multiple datasets and platforms.
- Panpipes (Python): Supports multimodal single-cell and spatial analysis including QC, integration, and cell-type mapping, built on Scanpy.
Why pipelines matter: They ensure that all steps—from filtering to reporting—are recorded, reproducible, and easy to scale up across experiments.
Summary:
There's no single best tool for all projects. Start with familiar platforms (Seurat or Squidpy), then extend to spatially-aware clustering or deconvolution as your questions evolve. For multi-sample studies or collaborative projects, invest early in a reproducible pipeline.
Common Challenges and How to Handle Them
Which tools should I use to analyze spatial transcriptomics data?
Most users start with Seurat or Squidpy for core analysis. For deconvolution, cell2location and Tangram are leading options. For spatial clustering, BayesSpace and SpaGCN are well-supported and widely used.
Are there pipelines for analyzing spatial data across multiple samples?
Yes. Spacemake and Panpipes are designed for batch processing and reproducibility, making them ideal for labs managing large projects or complex study designs.
Detailed Analysis Pipeline for Spatial Transcriptomics Data
Once you've generated spatial transcriptomics data and loaded it into your analysis environment, the real interpretation begins. This section walks through each major step of the analysis workflow—not just what to do, but how to think through each stage from a researcher's perspective.
1. Preprocessing & Data Loading
Preprocessing starts with converting raw sequencing outputs into usable expression data and registering them against histological images. This step is typically handled by the platform's own software (e.g., Space Ranger), but it's critical that you check the results carefully before moving on.
What to look for:
- Are the spots aligned properly with the tissue image?
- Are background spots filtered out correctly?
- Are spatial coordinates scaled and centered correctly in downstream software?
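One way to sanity-check coordinate scaling is to convert spot positions from full-resolution pixels into the pixel space of the image you actually plot, then confirm they land inside it. The sketch below assumes a Space Ranger-style scale factor (such as `tissue_hires_scalef` from `scalefactors_json.json`). The scale value, image size, spot positions, and helper names are hypothetical and only there for illustration.

```python
def to_image_pixels(fullres_xy, scale_factor):
    """Convert a full-resolution (x, y) position to scaled-image pixels."""
    x, y = fullres_xy
    return (x * scale_factor, y * scale_factor)


def spots_outside_image(spots_fullres, scale_factor, width, height):
    """Return IDs of spots whose scaled position falls outside the image."""
    outside = []
    for spot_id, xy in spots_fullres.items():
        px, py = to_image_pixels(xy, scale_factor)
        if not (0 <= px < width and 0 <= py < height):
            outside.append(spot_id)
    return outside


# Hypothetical values for illustration only.
scale_factors = {"tissue_hires_scalef": 0.125}
spots_fullres = {
    "spot_1": (4000, 8000),   # maps inside a 2000 x 2000 image
    "spot_2": (20000, 2400),  # maps outside the image width
}

misplaced = spots_outside_image(
    spots_fullres, scale_factors["tissue_hires_scalef"], width=2000, height=2000
)
print(misplaced)
```

Any spot reported here suggests that the wrong scale factor or image was paired with the coordinates, which is worth resolving before QC.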
Once preprocessed, the data are imported into an object structure that stores:
- Gene expression counts
- Spot/cell coordinates
- High-resolution images
- QC metadata
These structured objects allow for seamless progression through the rest of the analysis pipeline in R or Python environments.
2. Quality Control (QC) & Filtering
QC is not just about removing bad data—it's about ensuring you're working with biologically meaningful signals.
What should you evaluate?
- Low-complexity spots: Too few reads or genes detected
- High mitochondrial content: Often indicates dying or degraded cells
- Off-tissue spots: Can introduce background noise
Use violin plots, scatterplots, and spatial overlays to visualize QC metrics. Be flexible: an FFPE sample may require looser thresholds than a fresh frozen one.
Practical tip: Save your filtering logic as a script or notebook cell. It is often revisited later when you tweak thresholds based on downstream results.
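The three QC checks above can be sketched in plain Python to make the logic explicit. This is a minimal illustration, not how Seurat or Scanpy implement QC. The thresholds, the `MT-` prefix convention for mitochondrial genes, and the toy counts are all assumptions to replace with values suited to your tissue and species.

```python
from dataclasses import dataclass


@dataclass
class SpotQC:
    total_counts: int
    genes_detected: int
    mito_fraction: float


def spot_qc(counts, gene_names, mito_prefix="MT-"):
    """Compute QC metrics for one spot from its per-gene counts."""
    total = sum(counts)
    detected = sum(1 for c in counts if c > 0)
    mito = sum(
        c for c, g in zip(counts, gene_names) if g.upper().startswith(mito_prefix)
    )
    return SpotQC(total, detected, mito / total if total > 0 else 0.0)


def passes_qc(qc, in_tissue, min_counts, min_genes, max_mito):
    """Keep on-tissue spots with enough signal and acceptable mito content."""
    return (
        in_tissue
        and qc.total_counts >= min_counts
        and qc.genes_detected >= min_genes
        and qc.mito_fraction <= max_mito
    )


# Toy data: (counts per gene, in_tissue flag). Thresholds are placeholders.
gene_names = ["MT-CO1", "MT-ND1", "ACTB", "GFAP", "CD3E"]
spots = {
    "spot_A": ([5, 3, 40, 30, 22], True),     # healthy
    "spot_B": ([30, 25, 20, 15, 10], True),   # high mitochondrial fraction
    "spot_C": ([0, 0, 2, 0, 1], True),        # low complexity
    "spot_D": ([4, 2, 50, 30, 14], False),    # off tissue
}

kept = [
    spot_id
    for spot_id, (counts, in_tissue) in spots.items()
    if passes_qc(spot_qc(counts, gene_names), in_tissue,
                 min_counts=50, min_genes=3, max_mito=0.2)
]
print(kept)
```

Keeping the thresholds as explicit arguments makes it easy to record them and to loosen them for FFPE samples.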
3. Normalization & Feature Selection
Normalization adjusts for differences in sequencing depth and technical variation, making expression values comparable across spots.
Common approaches:
- SCTransform (Seurat) – variance-stabilizing, robust across samples
- Log-normalization (Scanpy/Squidpy) – straightforward and interpretable
Once normalized, select highly variable genes (HVGs) to reduce noise and focus on informative features. These genes will be used for PCA, clustering, and visualization.
Don't skip HVG filtering—without it, clustering can be driven by noise or housekeeping genes.
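To make the arithmetic concrete, here is a small sketch of log-normalization followed by a naive variable-gene ranking. The scaling target of 10,000 counts per spot is a common convention that is assumed here rather than required. Ranking by plain variance is a deliberate simplification: the HVG routines in Seurat and Scanpy model the mean-variance relationship, which this example does not attempt.

```python
import math
from statistics import pvariance


def log_normalize(counts, target_sum=10_000):
    """Scale one spot to a fixed library size, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [math.log1p(c / total * target_sum) for c in counts]


def top_variable_genes(matrix, gene_names, n_top):
    """Rank genes by variance across spots (rows = spots, columns = genes)."""
    variances = [
        pvariance([row[j] for row in matrix]) for j in range(len(gene_names))
    ]
    ranked = sorted(range(len(gene_names)), key=lambda j: variances[j], reverse=True)
    return [gene_names[j] for j in ranked[:n_top]]


# Toy counts: ACTB is constant across spots, GFAP and CD3E vary.
gene_names = ["ACTB", "GFAP", "CD3E"]
raw = [
    [50, 40, 10],
    [50, 10, 40],
    [50, 30, 20],
]

normalized = [log_normalize(row) for row in raw]
hvgs = top_variable_genes(normalized, gene_names, n_top=2)
print(hvgs)
```

The constant gene drops out of the ranking, which is the behavior the housekeeping-gene warning above is about.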
4. Dimensionality Reduction & Clustering
To reveal structure in your dataset, apply dimensionality reduction and clustering.
- PCA is the usual first step to capture global variation.
- UMAP or t-SNE then projects the top principal components into 2D for visualization.
- Clustering algorithms like Leiden or Louvain group spots with similar profiles.
Key principle: Always verify clusters on the actual tissue image. Expression-based clusters that don't correspond to anatomical structures may still be valid—but require closer scrutiny.
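Alongside visual inspection, a simple number can help flag clusterings that look scattered on the tissue. The sketch below computes the fraction of neighboring spot pairs that share a cluster label. It is an illustrative heuristic rather than an established metric from any of the tools named here, and the neighbor radius and toy grid are assumptions.

```python
import math


def neighbor_pairs(coords, radius):
    """List unordered pairs of spots whose centers lie within `radius`."""
    ids = list(coords)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(coords[a], coords[b]) <= radius:
                pairs.append((a, b))
    return pairs


def spatial_coherence(coords, labels, radius):
    """Fraction of neighboring pairs assigned to the same cluster."""
    pairs = neighbor_pairs(coords, radius)
    if not pairs:
        return float("nan")
    same = sum(labels[a] == labels[b] for a, b in pairs)
    return same / len(pairs)


# Toy 3 x 2 grid of spots with unit spacing.
coords = {
    "s1": (0, 0), "s2": (1, 0), "s3": (2, 0),
    "s4": (0, 1), "s5": (1, 1), "s6": (2, 1),
}
# Clusters forming contiguous regions vs. a checkerboard assignment.
contiguous = {"s1": "A", "s2": "A", "s3": "B", "s4": "A", "s5": "A", "s6": "B"}
checkerboard = {"s1": "A", "s2": "B", "s3": "A", "s4": "B", "s5": "A", "s6": "B"}

coherent_score = spatial_coherence(coords, contiguous, radius=1.1)
scattered_score = spatial_coherence(coords, checkerboard, radius=1.1)
print(coherent_score, scattered_score)
```

A low score does not prove a clustering is wrong, since some cell populations are genuinely interspersed, but it tells you where to look more closely on the image.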
5. Spatially Variable Gene (SVG) Detection
SVG detection identifies genes whose expression depends on location in the tissue, rather than genes that are simply highly or variably expressed.
When to perform SVG analysis:
- After QC and normalization
- Once clusters are established or a spatial structure is suspected
- To identify regional marker genes
Methods to consider:
- Non-parametric statistical tests (e.g., SPARK-X)
- Spatial autocorrelation on a neighbor graph (e.g., Moran's I or Geary's C in Squidpy)
- Gaussian process modeling (e.g., nnSVG)
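As a worked example of the spatial autocorrelation idea, the sketch below computes Moran's I for one gene with binary neighbor weights. This is the same statistic that Squidpy's `squidpy.gr.spatial_autocorr` can report, but the code here is a from-scratch illustration with an assumed distance cutoff and toy values, and it omits the significance testing that real tools provide.

```python
import math


def morans_i(coords, values, radius):
    """Moran's I with weight 1 for spot pairs within `radius`, else 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return float("nan")  # constant expression: statistic undefined
    num = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                num += dev[i] * dev[j]
                weight_sum += 1.0
    if weight_sum == 0:
        return float("nan")  # no neighbors at this radius
    return (n / weight_sum) * (num / denom)


# Toy transect of four spots with unit spacing.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
regional = [0.0, 0.0, 1.0, 1.0]      # expression confined to one side
alternating = [0.0, 1.0, 0.0, 1.0]   # neighbors always differ

i_regional = morans_i(coords, regional, radius=1.1)
i_alternating = morans_i(coords, alternating, radius=1.1)
print(i_regional, i_alternating)
```

Positive values indicate that neighboring spots tend to have similar expression, values near zero indicate no spatial structure, and negative values indicate that neighbors tend to differ.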
SVGs are often used downstream for:
- Marker gene discovery
- Region annotation
- Spatial trajectory modeling
Method schematic of SPARK-X and simulation results. (Zhu, J., et al., Genome Biol, 2021)
6. Cell Type Mapping via scRNA-seq Integration
Many spatial datasets (e.g., from 10x Visium) contain multi-cell spots. To understand cellular composition, you can map reference scRNA-seq data back to each spot.
Steps involved:
- Select or generate an annotated single-cell reference dataset.
- Harmonize gene identifiers between the spatial and scRNA-seq data, and restrict both to the shared genes.
- Use tools like cell2location, Tangram, or RCTD to infer cell-type proportions.
Validation is key:
- Visualize estimated abundances spatially
- Compare with known marker genes or histology
- Look for consistency across biological replicates
Common pitfall: Using a mismatched single-cell reference (wrong tissue, condition, or platform) can lead to misleading or uninterpretable mappings.
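To show what "inferring proportions" means mechanically, here is a deliberately simple sketch: non-negative least squares fitted by projected gradient descent, after restricting the spot and reference to shared genes. This is not the model used by cell2location, Tangram, or RCTD, which are considerably more sophisticated. The reference signatures, gene names, and spot values are invented for illustration.

```python
def deconvolve_spot(reference, spot, n_iter=2000):
    """Estimate cell-type proportions for one spot.

    reference: {cell_type: {gene: mean expression}}
    spot:      {gene: expression}
    """
    # Keep only genes present in the spot and in every reference signature.
    shared = set(spot).intersection(*(set(sig) for sig in reference.values()))
    genes = sorted(shared)
    if not genes:
        raise ValueError("no shared genes between spot and reference")
    types = list(reference)
    S = [[reference[t][g] for t in types] for g in genes]  # genes x cell types
    y = [spot[g] for g in genes]

    frob_sq = sum(v * v for row in S for v in row)
    if frob_sq == 0:
        raise ValueError("reference signatures are all zero")
    step = 1.0 / frob_sq  # conservative step size for gradient descent

    w = [0.0] * len(types)
    for _ in range(n_iter):
        resid = [sum(S[i][k] * w[k] for k in range(len(types))) - y[i]
                 for i in range(len(genes))]
        grad = [sum(S[i][k] * resid[i] for i in range(len(genes)))
                for k in range(len(types))]
        # Gradient step followed by projection onto non-negative weights.
        w = [max(0.0, w[k] - step * grad[k]) for k in range(len(types))]

    total = sum(w)
    return {t: (w[k] / total if total > 0 else 0.0) for k, t in enumerate(types)}


# Invented signatures; the spot is built as 3 x neuron + 1 x astrocyte.
reference = {
    "neuron":    {"SNAP25": 10.0, "GFAP": 0.0, "CD3E": 1.0},
    "astrocyte": {"SNAP25": 0.0,  "GFAP": 8.0, "CD3E": 1.0},
    "t_cell":    {"SNAP25": 1.0,  "GFAP": 1.0, "CD3E": 6.0},
}
spot = {"SNAP25": 30.0, "GFAP": 8.0, "CD3E": 4.0, "XIST": 5.0}  # XIST not in reference

proportions = deconvolve_spot(reference, spot)
print({t: round(p, 3) for t, p in proportions.items()})
```

The gene-intersection step at the top is the part most likely to cause silent problems in practice: if identifiers do not match, few genes are shared and the estimates become unreliable.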
7. Spatial Domain Annotation & Biological Interpretation
At this point, you have clusters, marker genes, and inferred cell types. The final step is biological interpretation.
Questions to ask:
- Do clusters align with known tissue regions (e.g., cortex, tumor margin, immune infiltrate)?
- Do SVGs or cell-type maps suggest new hypotheses about tissue architecture?
- How do spatial patterns relate to phenotype, condition, or treatment?
Integrate your spatial data with:
- Histology images
- Published atlases
- In situ hybridization (ISH) or IHC validation (if available)
8. Reproducible Workflow Design
Even if you're analyzing a single sample, plan for reproducibility.
Recommended practices:
- Save all filtering and clustering parameters
- Use reproducible scripts or notebooks
- Store intermediate files in structured folders
- Consider using pipelines (e.g., Panpipes, Spacemake, or MOSAIK) for multi-sample projects
Bonus tip: Add a README or workflow diagram to your project folder for future you—or future collaborators.
Practical Tips & Real-World Experience
While the formal analysis pipeline lays out a clear sequence of steps, actual research often brings surprises—batch effects, unexpected tissue variability, or analysis dead ends. This section collects practical advice from real-world spatial transcriptomics studies, grounded in both lab workflows and bioinformatics analysis.
Application scenarios of spatial transcriptomics. (Du, Jun, et al., Journal of translational medicine 2023)
1. Experimental Design Impacts Everything Downstream
Good data starts at the bench.
- Sample preservation matters:
  - Fresh frozen samples generally yield higher-quality data, but FFPE is more accessible for clinical specimens. Plan for lower complexity in FFPE datasets.
- Tissue thickness and coverage:
  - Thick sections may cause partial spot dropout due to incomplete imaging or RNA diffusion.
  - Trimmed or uneven samples often result in low-quality border regions; flag these early during image review.
- Spot resolution vs. biological question:
  - Visium is sufficient for most tissue-level questions.
  - Platforms like Xenium or CosMx offer single-cell or subcellular resolution but require much more data processing and storage.
Design tip: Align your platform choice and tissue preparation with your biological question and downstream analysis plan.
2. Be Realistic About Quality Control
QC thresholds aren't one-size-fits-all.
- Immune tissue may naturally show fewer genes per spot.
- Necrotic tumor cores or fibrotic zones may have low signal but still be biologically important.
- Brain tissue typically shows strong contrast between gray and white matter—expect distinct QC profiles.
Strategy: Use spatial plots to compare raw metrics visually, not just numerically. If necessary, segment QC by tissue region before applying global filters.
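Region-wise QC can be sketched as follows: group spots by an annotated region, then flag low outliers relative to each region's own median using the median absolute deviation (MAD). The choice of three MADs, the region labels, and the gene counts are assumptions for illustration. The region annotation itself would come from histology or a preliminary clustering.

```python
from collections import defaultdict
from statistics import median


def low_outliers(values, n_mads=3.0):
    """Flag values more than n_mads MADs below the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)  # no spread: nothing to flag
    cutoff = med - n_mads * mad
    return [v < cutoff for v in values]


def flag_low_quality_by_region(spots, n_mads=3.0):
    """spots: iterable of (spot_id, region, genes_detected)."""
    by_region = defaultdict(list)
    for spot_id, region, n_genes in spots:
        by_region[region].append((spot_id, n_genes))
    flagged = []
    for members in by_region.values():
        flags = low_outliers([n for _, n in members], n_mads)
        flagged.extend(sid for (sid, _), bad in zip(members, flags) if bad)
    return flagged


# Toy brain section: white matter has fewer genes per spot by nature.
spots = [
    ("g1", "gray", 3000), ("g2", "gray", 3100), ("g3", "gray", 2900),
    ("g4", "gray", 3050), ("g5", "gray", 900),
    ("w1", "white", 1000), ("w2", "white", 1100), ("w3", "white", 950),
    ("w4", "white", 1050), ("w5", "white", 990),
]

flagged = flag_low_quality_by_region(spots)
print(flagged)
```

In this toy data a single global cutoff of 1,500 genes would discard every white-matter spot, while the per-region rule removes only the one gray-matter spot that is out of line with its neighbors.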
3. Plan for Batch Effects and Sample Integration
Spatial transcriptomics datasets often come from multiple sections, patients, or timepoints. These can introduce technical variation that obscures true biological signals.
- Use batch-aware normalization or integration methods (e.g., SCTransform with covariate regression, Harmony, Scanorama).
- Track sample IDs and platform metadata from the beginning—don't retroactively reconstruct this.
- Visualize batch effects in PCA or UMAP before clustering or SVG analysis.
Pro tip: Run a small pilot analysis to confirm batch behavior before committing to full integration.
4. Interpret Results in Spatial and Biological Context
Gene expression patterns are only part of the story. Interpretation improves when layered with histology, tissue landmarks, or spatial features.
- Compare cluster outlines with H&E features—don't trust UMAP alone.
- SVGs near the tissue edge may reflect technical artifacts (e.g., tissue detachment).
- Enrichment of immune or stromal markers at margins may signal real biological boundaries—or sample sectioning effects.
Biology-first mindset: Always ask, "Does this pattern make sense biologically and spatially?"
5. Invest in Reproducibility Early
When a project moves from 1–2 samples to 10+, manual workflows break down fast. Early investment in organization pays off.
- Use consistent file naming (e.g., patientA_slide1.h5ad)
- Save intermediate objects at each stage (raw, filtered, normalized, clustered)
- Log your parameters (e.g., QC thresholds, clustering resolution) in plain text or markdown files
- Use version-controlled notebooks (e.g., R Markdown, Jupyter with Git)
If working with multiple people or samples, pipelines like Spacemake, MOSAIK, or Panpipes can prevent errors and accelerate consistency.
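For smaller projects that do not yet need a full pipeline, parameter logging can be as light as one JSON file per sample and stage. The sketch below uses only the standard library. The function name, file naming scheme, and parameter values are hypothetical, and it writes to a temporary directory so it can be run as-is.

```python
import json
import tempfile
from pathlib import Path


def save_run_params(out_dir, sample_id, stage, params):
    """Write analysis parameters for one sample and stage to a JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{sample_id}_{stage}_params.json"
    record = {"sample_id": sample_id, "stage": stage, "params": params}
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


# Example values only; substitute the thresholds you actually used.
log_dir = tempfile.mkdtemp()
params_path = save_run_params(
    log_dir,
    sample_id="patientA_slide1",
    stage="filtered",
    params={"min_counts": 500, "min_genes": 200, "max_mito": 0.2},
)

reloaded = json.loads(params_path.read_text())
print(params_path.name, reloaded["params"])
```

Because the file name mirrors the sample and stage naming used for the data objects, each intermediate file can be traced back to the settings that produced it.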
Common Pitfalls to Avoid
| Pitfall | Why it happens | How to avoid it |
|---|---|---|
| Clusters don't match histology | Overreliance on UMAP | Cross-check clusters on tissue images |
| Poor deconvolution | Mismatched scRNA-seq reference | Use tissue- and condition-matched references |
| Inconsistent QC | Thresholds applied globally | Visualize QC per tissue or sample |
| Lost file traceability | Manual renaming, no folder structure | Standardize naming and save scripts |
Summary & Suggested Next Steps
A spatial transcriptomics project moves from raw reads → clean spots → interpretable maps. Keep it structured, reproducible, and biology-first.
Workflow at a Glance
| Stage | Purpose |
|---|---|
| Preprocessing & Loading | Turn raw reads/images into counts + aligned coordinates |
| Quality Control | Remove low-quality or off-tissue spots |
| Normalization & HVGs | Stabilize signals; keep informative genes |
| DR & Clustering | Reveal transcriptomic domains |
| SVG Detection | Find genes with spatial patterns |
| Cell-Type Mapping | Estimate cell proportions using scRNA-seq references |
| Interpretation & Reporting | Anchor signals to histology and biology |
| Reproducible Design | Scale across samples and collaborators |
References
- Ståhl, P.L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353(6294), 78–82 (2016).
- Rodriques, S.G., Stickels, R.R., Goeva, A. et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
- Stickels, R.R., Murray, E., Kumar, P. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nature Biotechnology 39, 313–319 (2021).
- Righelli, D., Crowell, H.L., Weber, L.M. et al. SpatialExperiment: infrastructure for spatially-resolved transcriptomics data in R using Bioconductor. Bioinformatics 38(11), 3128–3131 (2022).
- Cable, D.M., Murray, E., Zou, L.S. et al. Robust decomposition of cell type mixtures in spatial transcriptomics. Nat Biotechnol 40, 517–526 (2022).
- Hu, J., Li, X., Coleman, K. et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods 18, 1342–1351 (2021).
- Zhu, J., Sun, S. & Zhou, X. SPARK-X: non-parametric modeling enables scalable and robust detection of spatial expression patterns for large spatial transcriptomic studies. Genome Biol 22, 184 (2021).
- Du, J. et al. Advances in spatial transcriptomics and related data analysis strategies. J Transl Med 21(1), 330 (2023).
- Williams, C.G., Lee, H.J., Asatsuma, T. et al. An introduction to spatial transcriptomics for biomedical research. Genome Med 14, 68 (2022).