Databases and Resources: How to Find Spatial Omics Datasets and Protocols
Spatial omics is reshaping how scientists study tissues by combining molecular measurements with spatial context. Unlike bulk and dissociated single-cell methods, which lose the location of gene activity, spatial omics keeps that information intact. In this article, we guide you through key resources such as SpatialDB and Bioconductor's SpatialDatasets, showing how to locate, compare, and use spatial omics data in your research workflows. Whether you're exploring tissue architecture or cell interactions, this overview will help you find the right datasets and strategies efficiently and reproducibly.
How can researchers find and download spatial omics datasets?
Finding publicly available spatial omics datasets is often the first step in designing a robust analysis workflow. Fortunately, several platforms make it easier to discover and retrieve data for exploratory studies, validation, or benchmarking new tools.
Where to Start
Here are trusted platforms widely used in the spatial biology community:
- SpatialDB – A curated browser for spatial transcriptomics datasets, allowing users to search by gene and explore spatial expression maps. Datasets span multiple platforms (e.g., ST, Slide-seq, seqFISH) and include both visualization tools and raw downloads. Ideal for initial gene-level screening and quick comparisons.
- GEO (Gene Expression Omnibus) – A broad repository of functional genomics data. Spatial datasets can be found using filters like "spatial transcriptomics" along with tissue, organism, or platform-specific terms. Processed and raw files are typically available.
- SRA and ENA Archives – If you're working with custom pipelines, these repositories offer access to raw sequencing files from published spatial studies. Useful when reproducibility or in-house reprocessing is a priority.
Note: If you're planning full pipeline integration or working in R, Bioconductor's SpatialDatasets package provides pre-structured datasets. It is listed again under the resource recommendations below.
Efficient Download Tips
- Use filters: On GEO or ENA, combine keywords with filters like tissue type or technology platform to refine your search.
- Always download metadata: Most datasets include annotations—such as sample origin, processing method, or spatial coordinates—that are essential for spatial analysis.
- Check associated publications: Review the original studies linked to datasets for context, protocols, and validation details.
- Choose formats carefully: Ensure that file formats (e.g., CSV, matrix, image stacks) match your analysis environment.
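The keyword-plus-filter advice above can also be applied programmatically. The following is a minimal sketch that only builds a search URL for the GEO DataSets database (`db=gds`) through NCBI's E-utilities `esearch` endpoint; it does not send the request. The helper names, the example keywords, and the choice of field tags (`[Organism]`, `[Entry Type]`) are assumptions for illustration, so check them against the GEO query documentation before relying on them.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; "gds" is the GEO DataSets database.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_term(keywords, organism=None, series_only=True):
    """Combine free-text keywords with optional field filters (illustrative helper)."""
    # OR the technology keywords together, quoting each so phrases stay intact.
    term = "(" + " OR ".join(f'"{k}"' for k in keywords) + ")"
    if organism:
        term += f' AND "{organism}"[Organism]'
    if series_only:
        # Restrict hits to GEO Series records (GSE accessions).
        term += " AND gse[Entry Type]"
    return term


def build_esearch_url(term, retmax=20):
    """Return the full esearch URL for a GEO DataSets query."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


term = build_geo_term(["spatial transcriptomics", "Visium"], organism="Mus musculus")
print(term)
print(build_esearch_url(term))
```

Keeping the query as code makes a search repeatable: the same term can be rerun later to see which series have been added since the last check.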
For hypothesis-driven work, many researchers begin with a visual scan on SpatialDB, then move to GEO or ENA to retrieve full datasets for deeper analysis. This two-step method combines intuitive exploration with structured analysis, helping balance discovery and rigor.
Choosing the Right Spatial Omics Platform or Dataset
Selecting the right spatial omics dataset—or deciding which platform best fits your study—is about matching technology capabilities to your biological questions. The goal isn't always to use the most advanced method, but to use the one that's most appropriate for your tissue, targets, and resolution needs.
Choosing the Right Platform: What Matters
Focus on aligning each platform's technical parameters with your research intent:
- Spatial resolution:
If your work requires resolving single cells or small structures (e.g., tumor boundaries, neuronal circuits), platforms like Slide-seq, Stereo-seq, or MERFISH offer finer resolution. For broader tissue-level patterns, Visium or ST may be sufficient.
- Target type:
RNA-based platforms like Slide-seq or Visium are ideal for transcriptome-wide exploration. For spatial proteomics (e.g., immune profiling), platforms like CODEX or MIBI-TOF are better suited.
- Sample type:
Consider whether your samples are fresh frozen or FFPE. Some platforms (e.g., Visium FFPE, Xenium) support archival samples, while others require freshly preserved tissue.
- Data complexity and scalability:
Ultra-high-resolution platforms generate large, complex datasets that require advanced computational resources. Choose platforms that align with your team's data handling capacity.
Selecting a Public Dataset: Practical Criteria
When browsing existing datasets for analysis, these tips can help:
- Match disease or tissue context:
Use datasets that reflect your biological model (e.g., colorectal cancer, brain development) to ensure relevance.
- Check for clean metadata:
Well-labeled samples with spatial coordinates, preparation details, and grouping variables will save time and increase reproducibility.
- Look for associated publications:
Studies linked to peer-reviewed articles often provide transparent methodology, which adds reliability to the dataset.
How to handle metadata and extract expression matrices
Once you've downloaded a spatial omics dataset, the real work begins. One of the most common pitfalls researchers encounter is treating the expression matrix as standalone data—without realizing that its true value lies in the spatial and experimental context captured in the accompanying metadata.
Why metadata isn't optional
In spatial omics, metadata is more than a supporting file—it's the backbone that links gene expression back to the biological environment. It may contain tissue region labels, platform type, experimental condition, and even pixel or coordinate maps that align molecular data with tissue structure. Losing this context, or failing to interpret it correctly, can compromise your entire analysis.
Researchers should always double-check the metadata before diving into any preprocessing steps. Was the tissue fresh frozen or FFPE? Are there batch variables? Were all samples collected under similar imaging conditions? These details shape how you normalize, filter, and compare datasets.
In our experience, inconsistent or missing metadata causes more troubleshooting issues than the data matrix itself.
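A quick automated pass over the metadata can catch many of these problems before preprocessing starts. The sketch below is one way to do it, assuming the metadata is a CSV table; the column names in `REQUIRED_FIELDS` are hypothetical and should be replaced with the fields your dataset actually provides.

```python
import csv
import io

# Hypothetical required columns; adapt to the fields your dataset actually provides.
REQUIRED_FIELDS = ["sample_id", "preservation", "batch", "x", "y"]


def check_metadata(rows, required=REQUIRED_FIELDS):
    """Return a list of human-readable problems found in metadata rows."""
    problems = []
    for i, row in enumerate(rows, start=1):
        for field in required:
            value = (row.get(field) or "").strip()
            if not value:
                problems.append(f"row {i}: missing '{field}'")
    # More than one preservation method in a dataset is worth flagging early.
    methods = {(r.get("preservation") or "").strip().lower() for r in rows} - {""}
    if len(methods) > 1:
        problems.append(f"mixed preservation methods: {sorted(methods)}")
    return problems


# Small in-memory example standing in for a real metadata file.
example = io.StringIO(
    "sample_id,preservation,batch,x,y\n"
    "S1,FFPE,1,10.5,20.0\n"
    "S2,fresh frozen,,11.0,21.5\n"
)
rows = list(csv.DictReader(example))
for problem in check_metadata(rows):
    print(problem)
```

For a real file, replace the `io.StringIO` object with an opened file handle. The function reports issues rather than raising errors, so you can review everything in one pass.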
Understanding expression matrix formats
Not all expression data are formatted the same. Some datasets provide per-spot or per-bead gene counts, which is common for platforms like 10x Visium or Slide-seq. Others, such as MERFISH or CODEX, report molecule-level data or intensity values over pixel grids. Before analysis, it's essential to identify what kind of expression output you're working with and whether it's raw, normalized, or log-transformed.
Here are a few quick questions to ask yourself before loading the matrix:
- Are spatial coordinates embedded or separate?
- Is the data pre-filtered, or do I need to apply QC?
- Does the expression represent absolute molecule counts or scaled intensities?
Answering these early helps you avoid redoing work or introducing bias.
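When the documentation does not say whether a matrix is raw or transformed, a rough look at the values can give a first hint. The function below is a heuristic sketch only: the function name and the thresholds are our own assumptions, not a standard, and the result should always be confirmed against the dataset's methods description.

```python
def describe_matrix(values):
    """Guess whether expression values look raw, log-transformed, or scaled.

    Heuristic only: the rules and thresholds below are assumptions.
    """
    values = list(values)
    if not values:
        return "empty"
    if any(v < 0 for v in values):
        # Negative values usually indicate centering or z-scoring.
        return "scaled (contains negative values)"
    if all(float(v).is_integer() for v in values):
        return "likely raw counts (non-negative integers)"
    if max(values) < 20:
        # Log-transformed counts stay small; 20 is an arbitrary cutoff.
        return "possibly log-transformed or normalized"
    return "non-integer values, likely normalized or intensity data"


print(describe_matrix([0, 3, 12, 0, 1]))
print(describe_matrix([0.0, 1.39, 2.56]))
print(describe_matrix([-1.2, 0.3, 0.9]))
```

In practice you would pass a flattened sample of the matrix, not every value, since a few thousand entries are usually enough to show the pattern.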
Practical tips from the lab
A simple habit that saves time: keep all files—metadata, expression matrices, and image data—in the same folder structure with consistent naming. For larger studies, creating a sample tracking sheet (even a spreadsheet) that links sample ID, file name, tissue source, and platform makes cross-sample comparison much smoother.
Another useful practice is to delay normalization until basic quality checks are complete. Filtering out low-expression spots or poor-quality regions before scaling ensures that downstream analyses reflect true biological signal rather than technical noise.
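The order of operations described here, quality filtering first and scaling second, can be sketched as follows. This is a simplified illustration using plain dictionaries; the `min_total` threshold and the scale factor are placeholder values, not recommendations, and library-size normalization followed by `log1p` is just one common choice.

```python
import math


def filter_then_normalize(counts, min_total=500, scale=10_000):
    """Drop low-count spots first, then library-size normalize and log1p.

    counts: dict mapping spot barcode -> dict of gene -> raw count.
    min_total and scale are illustrative defaults, not recommendations.
    """
    normalized = {}
    for barcode, genes in counts.items():
        total = sum(genes.values())
        if total < min_total:
            continue  # QC happens before any scaling
        normalized[barcode] = {
            gene: math.log1p(count / total * scale) for gene, count in genes.items()
        }
    return normalized


# Toy example: the second spot has too few counts and is removed.
raw = {
    "AAAC-1": {"GeneA": 600, "GeneB": 400},
    "TTTG-1": {"GeneA": 3, "GeneB": 1},
}
result = filter_then_normalize(raw)
print(sorted(result))
```

Filtering first means a spot with only a handful of counts never reaches the scaling step, where its few molecules would otherwise be inflated to look like strong expression.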
And remember: never rename barcode files unless you've also updated the matrix and spatial coordinate references. Renaming them in isolation is a quick way to break the link between data and spatial context.
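A simple consistency check can confirm that the link between expression data and spatial coordinates is still intact after any file handling. The sketch below assumes you have already read the barcode lists from the matrix and from the coordinate file; the function name and the returned keys are illustrative.

```python
def compare_barcodes(matrix_barcodes, coordinate_barcodes):
    """Report barcodes present in one source but not the other."""
    matrix_set = set(matrix_barcodes)
    coord_set = set(coordinate_barcodes)
    return {
        "missing_coordinates": sorted(matrix_set - coord_set),
        "missing_expression": sorted(coord_set - matrix_set),
        "shared": len(matrix_set & coord_set),
    }


report = compare_barcodes(
    ["AAAC-1", "CCGT-1", "TTTG-1"],
    ["AAAC-1", "CCGT-1", "GGAT-1"],
)
print(report)
```

If either "missing" list is non-empty, resolve the mismatch before any spatial analysis, since those spots cannot be placed on the tissue or have no expression values.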
Comparing Spatial Omics Datasets Across Platforms
If you're working with multiple spatial datasets—or considering integrating results across technologies—you'll quickly find that not all spatial omics platforms speak the same "language." Resolution, measurement type, and spatial granularity can vary dramatically between platforms, and understanding those differences is essential before drawing any biological conclusions.
What really differs between platforms
At first glance, most spatial datasets may look similar: tissue images, gene expression values, and some annotation files. But under the hood, the differences are meaningful.
For instance, a Slide-seq dataset might capture RNA at near-single-cell resolution across thousands of densely packed 10 µm barcoded beads, while a Visium dataset summarizes gene expression over larger 55 µm spots that typically cover several cells. That changes how you interpret spatial gradients, boundaries, or cell-type specificity.
The type of molecule being measured is also crucial. Platforms like MERFISH and seqFISH focus on targeted RNA molecules, allowing for high multiplexing of specific genes. In contrast, proteomic platforms like CODEX or MIBI-TOF detect spatial patterns of proteins, using antibody-based imaging. You're not just comparing apples to oranges—you might be comparing transcripts to proteins, and that matters.
Another often-overlooked factor is tissue preparation. Some platforms are FFPE-compatible; others require fresh-frozen samples. Differences in RNA integrity, imaging depth, or tissue section thickness can introduce variability that complicates comparison.
When does it make sense to compare across platforms?
Cross-platform comparison is most valuable when you're:
- Looking for shared spatial patterns (e.g., immune cell localization) across different molecule types
- Integrating multi-modal datasets in the same tissue (e.g., combining RNA and protein maps)
- Validating findings from one technology using another with higher resolution or sensitivity
But even in these cases, you'll want to align platforms carefully. Avoid comparing high-resolution MERFISH data directly with low-resolution spot-based transcriptomics unless you're aggregating spatial bins to make them comparable.
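Aggregating high-resolution data into coarser bins can be sketched as below. This is a minimal illustration under stated assumptions: molecule positions are given in the same physical units as `bin_size`, and square bins are used for simplicity, which only approximates the layout of round spots on a spot-based array.

```python
from collections import defaultdict


def bin_to_grid(points, bin_size):
    """Sum per-gene counts of high-resolution points into square bins.

    points: iterable of (x, y, gene, count), with x and y in the same
    units as bin_size (for example micrometres).
    """
    grid = defaultdict(lambda: defaultdict(float))
    for x, y, gene, count in points:
        # Integer bin index along each axis.
        key = (int(x // bin_size), int(y // bin_size))
        grid[key][gene] += count
    return {key: dict(genes) for key, genes in grid.items()}


# Toy molecule table; gene names are placeholders.
molecules = [
    (5.0, 7.0, "GeneA", 1),
    (40.0, 30.0, "GeneA", 1),
    (60.0, 10.0, "GeneB", 1),
]
binned = bin_to_grid(molecules, bin_size=55)
print(binned)
```

The bin size controls the trade-off: larger bins bring the data closer to spot-level resolution but discard the fine structure that motivated the high-resolution assay in the first place.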
What helps in cross-platform evaluation
Here are a few commonly used strategies that can help:
- Standardize your metadata early. Define common field names for tissue types, condition labels, and spatial scale units. This makes merging annotations across datasets much more manageable.
- Normalize spatial resolution thoughtfully. If datasets differ in spot size or spatial granularity, consider aggregating or interpolating data to a shared grid. Just don't oversmooth and erase meaningful biological variation.
- Look for biological overlap, not just technical similarity. It's more useful to compare two datasets from similar anatomical regions or disease contexts—even if platforms differ—than to force technical alignment on biologically unrelated samples.
- Use platform-aware analysis tools. Some software supports cross-platform integration natively. For example, Seurat offers anchor-based mapping between single-cell and spatial transcriptomics data, and deconvolution methods such as RCTD or cell2location estimate the cell-type composition of each spot.
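The first strategy, standardizing metadata early, can be sketched as a small harmonization step applied to every record before datasets are merged. The alias table, the unit table, and the function name below are hypothetical; pixel coordinates are not handled, because converting them requires the scale factor of the specific image.

```python
# Hypothetical mapping from dataset-specific column names to shared ones.
FIELD_ALIASES = {
    "tissue_type": "tissue",
    "Tissue": "tissue",
    "condition_label": "condition",
    "group": "condition",
}

# Conversion factors to micrometres for the units assumed here.
UNIT_TO_UM = {"um": 1.0, "nm": 0.001, "mm": 1000.0}


def harmonize_record(record, coord_unit="um"):
    """Rename known aliases and convert x/y coordinates to micrometres."""
    factor = UNIT_TO_UM[coord_unit]
    out = {}
    for key, value in record.items():
        out[FIELD_ALIASES.get(key, key)] = value
    for axis in ("x", "y"):
        if axis in out:
            out[axis] = float(out[axis]) * factor
    out["coord_unit"] = "um"
    return out


record = {"Tissue": "cortex", "group": "control", "x": "1.5", "y": "2"}
print(harmonize_record(record, coord_unit="mm"))
```

Recording the unit explicitly in every record makes it much harder to merge two datasets whose coordinates are silently on different scales.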
Field insight
In collaborative projects, we often encounter the temptation to directly compare spot-wise gene expression from Visium with single-molecule RNA data from seqFISH. While it's possible, it rarely works well unless the data are first transformed to a compatible spatial scale. One strategy we've found helpful is creating a "summary matrix" of gene expression averaged over common anatomical zones, which allows comparisons to focus on biology—not raw technical resolution.
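A summary matrix of this kind can be built with a few lines of code once each spot or cell has been assigned to an anatomical zone. The sketch below assumes those zone labels already exist (for example from histological annotation); the data structures and names are illustrative, and a gene missing from a spot is treated as zero for that spot.

```python
from collections import defaultdict


def zone_summary(expression, zone_of):
    """Average each gene's expression over the spots assigned to each zone.

    expression: dict of spot id -> dict of gene -> value.
    zone_of: dict of spot id -> anatomical zone label (annotated beforehand).
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_spots = defaultdict(int)
    for spot, genes in expression.items():
        zone = zone_of.get(spot)
        if zone is None:
            continue  # skip unannotated spots
        n_spots[zone] += 1
        for gene, value in genes.items():
            sums[zone][gene] += value
    # Divide by the number of spots in the zone, so absent genes count as zero.
    return {
        zone: {gene: total / n_spots[zone] for gene, total in genes.items()}
        for zone, genes in sums.items()
    }


expression = {
    "s1": {"GeneA": 2.0, "GeneB": 0.0},
    "s2": {"GeneA": 4.0, "GeneB": 2.0},
    "s3": {"GeneA": 10.0, "GeneB": 1.0},
}
zone_of = {"s1": "cortex", "s2": "cortex", "s3": "white_matter"}
summary = zone_summary(expression, zone_of)
print(summary)
```

Running the same function on each platform's data, with a shared set of zone labels, yields matrices with identical rows that can be compared directly.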
Resource recommendations and update strategies
Keeping up with spatial omics isn't just about analyzing existing datasets—it's about staying connected to a field that evolves quickly. New protocols, multi-omics platforms, and large-scale consortia are publishing spatial datasets at an increasing pace. For researchers, having a system to discover, save, and revisit high-quality data resources is not optional—it's part of doing reproducible, forward-looking science.
Trusted Resources to Bookmark
Here's a curated list of platforms and databases that continue to provide value in both daily work and long-term projects:
- SpatialDB
A user-friendly portal for browsing gene-level spatial expression across multiple studies and platforms. Especially useful for visual exploration before selecting a dataset for analysis.
- Bioconductor SpatialDatasets
Offers well-structured datasets optimized for analysis in R. Because it's tied to Bioconductor releases, updates are well-documented and version-controlled, which is ideal for reproducibility.
- HuBMAP (Human BioMolecular Atlas Program)
A major NIH-funded initiative building high-resolution spatial maps of human tissues. Data includes both transcriptomics and proteomics with standardized metadata and visualization tools.
- SPATCH
While not as widely known, SPATCH integrates datasets from multiple spatial studies and includes interactive tools for spatial domain analysis and data comparison.
- GEO and ArrayExpress
Though not spatial-specific, many spatial omics studies are deposited here. With proper keyword strategies (e.g., "spatial transcriptomics," "Visium," "MERFISH"), these archives remain valuable sources of both raw and processed data.
Tip: Save time by building a centralized "resource sheet" in your lab or project folder, with links to key portals, login info (if needed), and notes on what each resource is best used for.
Simple Habits to Stay Updated
Rather than chasing updates, let them come to you—here's how:
- Subscribe to database newsletters
Platforms like Bioconductor, HuBMAP, and SpatialDB offer mailing lists or RSS feeds. Subscribing ensures you hear about new releases, tools, or tutorials without having to check back manually.
- Follow key journals and preprint servers
Journals like Nature Methods, Genome Biology, and Cell Systems regularly publish new spatial technologies. Searching bioRxiv for spatial omics preprints once a month is a good habit for spotting upcoming techniques and datasets.
- Use versioning tools to track your own datasets
If you're curating your own dataset collection, version-control platforms like GitHub or simple lab wikis can help track what's been used, cleaned, normalized, or annotated. This is especially helpful for multi-user teams.
- Curate your own reference list
Maintaining a spreadsheet with dataset names, links, associated publications, organism/tissue type, and download status can prevent unnecessary re-searching and helps onboard collaborators more quickly.
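One lightweight way to support these tracking habits is a checksum manifest: a record of every file in a dataset folder with a hash of its contents, which can be committed to version control alongside your notes. This is our own suggested sketch, not a feature of any of the resources above; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Calling `build_manifest("path/to/dataset")` on a downloaded dataset and saving the result (for example as JSON) gives you a snapshot to compare against later. A changed hash tells you a file was modified or re-downloaded in a different version.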
Conclusion
Having access to good data is no longer the hard part—knowing how to organize, evaluate, and revisit it consistently is the real advantage. By combining trusted sources with lightweight update routines, you'll stay ahead in a field where staying current is just as important as staying rigorous.
How CD Genomics Can Help
At CD Genomics, we support researchers working with spatial omics data by offering reliable sequencing solutions and comprehensive data analysis services. Whether you're exploring tissue architecture, validating spatial patterns, or integrating multi-source datasets, our team can help streamline your workflow and enhance reproducibility.
Contact us to discuss how we can support your spatial biology research.
References
- Zheng, Y., Chen, Y., Ding, X., Wong, K. H., & Cheung, E. (2023). Aquila: a spatial omics database and analysis platform. Nucleic Acids Research, 51(D1), D827-D834.
- Lin, S., Zhao, F., Wu, Z., Yao, J., Zhao, Y., & Yuan, Z. (2024). Streamlining spatial omics data analysis with Pysodb. Nature Protocols, 19(3), 831-895.
- Fan, Z., Chen, R., & Chen, X. (2020). SpatialDB: a database for spatially resolved transcriptomes. Nucleic Acids Research, 48(D1), D233-D237.
- Marconato, L., Palla, G., Yamauchi, K. A., Virshup, I., Heidari, E., Treis, T., ... & Stegle, O. (2025). SpatialData: an open and universal data framework for spatial omics. Nature Methods, 22(1), 58-62.