The Human Cell Atlas is an impressive project that brings together a multitude of ’omics datasets generated by over 300 different labs from across the world. Its data collection currently comprises data from 20M cells representing 10 major organs, originating from over 2,300 donors. All of it is publicly accessible, forming a veritable cornucopia of data for anyone seeking to probe the intricacies of our cells and the systems of life.
What’s perhaps even more impressive, given the crowd-sourced nature of this enormous collection, is that the bulk of the data contributed to the HCA is additionally re-processed through uniform pipelines and made available in standardized, well-documented formats. This goes a long way toward enabling a wide range of researchers to reuse this data and incorporate it into their own analyses, thus multiplying the value of each bit (or byte?) of data collected.
A short hop from the HCA Data Portal to your Terra workspace
The HCA project makes its collection available through a Data Portal that offers multiple options for downloading or otherwise exporting data of interest. My personal favorite (I may be slightly biased) is the option to export metadata to a Terra workspace, from where you can access and analyze the data on the cloud without actually having to download anything.
The export process is relatively straightforward, as you can see in this short 5-minute demo video. It goes from browsing the Data Portal, through selecting a dataset and exporting it to Terra, to applying some basic analyses using a pre-configured workflow and a Jupyter Notebook running Seurat (happy birthday, Georges!).
The demo is based on a tutorial workspace created by my wonderful colleagues on the Terra User Education team, with expert input from our HCA and Bioconductor project partners, to help researchers get started with HCA data in Terra. The workspace includes pre-imported data from the SingleCellLiverLandscape project (MacParland et al., 2018) (which I re-import in the demo so you can see the full end-to-end process) as well as one workflow and four notebooks.
The workflow runs Cumulus, a pipeline for analyzing large-scale single-cell and single-nucleus RNA sequencing datasets (Li et al., 2020) that takes in a raw count matrix and performs filtering, normalization, and clustering. The four notebooks demonstrate popular single-cell analysis tools, including Bioconductor, Seurat, Scanpy, and Pegasus, in Terra, with a particular focus on providing reusable solutions for accessing and manipulating data on the cloud and for working with the data types and formats provided by the HCA project.
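If you are new to single-cell analysis, it may help to see what "filtering, normalization, and clustering" means on a toy scale. The sketch below is not Cumulus code and does not use its API. It is a minimal, standard-library-only illustration under several assumptions of my own: the tiny `raw_counts` matrix is made up, the function names and thresholds (`min_genes`, `target_sum`) are hypothetical, and a simple k-means stands in for the more sophisticated clustering methods that real pipelines typically use.

```python
import math

def filter_cells(counts, min_genes=2):
    """Keep cells (rows) that express at least `min_genes` genes."""
    return [row for row in counts if sum(1 for c in row if c > 0) >= min_genes]

def normalize(counts, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log1p."""
    normalized = []
    for row in counts:
        total = sum(row)  # non-zero for any cell that passed filtering
        normalized.append([math.log1p(c * target_sum / total) for c in row])
    return normalized

def kmeans(points, k=2, n_iter=10):
    """Toy k-means seeded with the first k points; returns one label per point."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each cell to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Move each centroid to the mean of its assigned cells.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical raw count matrix: rows are cells, columns are genes.
raw_counts = [
    [10, 8, 0, 0],
    [0, 1, 9, 12],
    [12, 9, 1, 0],
    [0, 0, 0, 1],   # expresses a single gene, so it gets filtered out
    [1, 0, 11, 10],
]

filtered = filter_cells(raw_counts)
normalized = normalize(filtered)
labels = kmeans(normalized, k=2)
print(labels)  # [0, 1, 0, 1]
```

The four remaining cells fall into two groups that mirror their expression patterns. On real HCA data you would of course let the workflow or one of the notebooks do this for you; the point here is only to make the three steps concrete.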
For more details, have a look at the full text instructions for the tutorial or check out the workspace and take it for a spin yourself — thanks to this great resource, you can replicate all the steps shown in the demo video, and more, in a matter of minutes!