Making large-scale single-cell RNA-seq analysis scalable and cost-effective with Cumulus

In this guest blog post, Bo Li, Principal Investigator at Massachusetts General Hospital and Assistant Professor of Medicine at Harvard Medical School, explains how using Terra enabled his group to develop a new single-cell transcriptomic analysis framework, Cumulus, to serve the needs of the Immune Cell Atlas project. Cumulus is open source and is described in the following paper:

Li, B., Gould, J., Yang, Y. et al. Cumulus provides cloud-based data analysis for large-scale single-cell and single-nucleus RNA-seq. Nat Methods 17, 793–798 (2020). https://doi.org/10.1038/s41592-020-0905-x

The human Immune Cell Atlas (ICA) project, launched by the Klarman Cell Observatory at the Broad Institute and contributing to the Immune System Biological Network of the international Human Cell Atlas initiative, involves profiling the transcriptomes of over 1.7 million single cells from a variety of immune-related tissues across more than 30 donors. As a result, the ICA project has accumulated over 5 terabytes of sequencing data, which holds an amazing amount of information but is far beyond the capacity of any local cluster we have access to. As the computational lead for the project, I was faced with the challenge of finding a way to process and analyze all that data that would be scalable, cost-effective and user-friendly. Terra came to my rescue by providing a platform that offers 1) flexible and rapid workflow development, 2) easy execution, even for biologists, and 3) strong support for reproducible science.

Moving to the Terra platform has been transformative for my team’s research

Building on the Terra platform, we developed Cumulus, a cloud-native framework for analyzing single-cell and single-nucleus RNA-seq data that scales up to millions of single cells. The Cumulus framework has already had a substantial impact: it is used in several large cell atlas projects, such as the ICA project and the Human Tumor Atlas Pilot Project (HTAPP), which is the pilot project leading to the Human Tumor Atlas Network (HTAN) efforts supported by the National Cancer Institute.

The Terra platform is the cornerstone for the success of Cumulus

Terra provides us with a convenient and flexible way to run analysis workflows on the cloud via its built-in Cromwell workflow management service, which dispatches computing loads to virtual machines in Google Cloud and can parallelize execution across a large number of nodes on demand. The workflows are written in the Workflow Description Language (WDL), which makes them very flexible: we can easily configure the computing resources allocated to each task, such as the number of CPUs/GPUs, memory and disk space, and we can embed custom Python code within individual tasks. With Terra, a well-trained programmer can deploy a basic bioinformatics workflow within a day, which significantly reduces the turnaround time to deliver results.
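To make the idea of per-task resources and embedded Python concrete, here is a minimal sketch. It is not taken from Cumulus: the task name, input name, default values and Docker image are all hypothetical, and the WDL text is rendered from a small Python helper purely for illustration. The runtime keys (docker, cpu, memory, disks) are the attributes Cromwell reads when it provisions a virtual machine on Google Cloud.

```python
from string import Template

# Hypothetical WDL 1.0 task used only for illustration (not a Cumulus task).
# The "$name" placeholders are filled in by Python; the "~{...}" placeholder
# is ordinary WDL interpolation and is left untouched by string.Template.
WDL_TASK = Template('''\
version 1.0

task $task_name {
  input {
    File counts_tsv
  }
  command <<<
    python <<CODE
    # Custom Python embedded in the task: count the data rows of the input.
    with open("~{counts_tsv}") as handle:
        n_rows = sum(1 for _ in handle) - 1
    print(n_rows)
    CODE
  >>>
  output {
    Int n_rows = read_int(stdout())
  }
  runtime {
    docker: "$docker_image"
    cpu: $cpu
    memory: "$memory_gb GB"
    disks: "local-disk $disk_gb HDD"
  }
}
''')


def render_task(task_name, docker_image, cpu=4, memory_gb=16, disk_gb=100):
    """Return WDL source for the example task with the requested resources."""
    return WDL_TASK.substitute(
        task_name=task_name,
        docker_image=docker_image,
        cpu=cpu,
        memory_gb=memory_gb,
        disk_gb=disk_gb,
    )


if __name__ == "__main__":
    # The image name is an assumption; any image that provides Python would do.
    print(render_task("count_rows", "python:3.11-slim", cpu=8, memory_gb=32))
```

Changing the cpu, memory_gb or disk_gb arguments changes only the runtime section, which is the point: the analysis logic stays the same while the resources are tuned per task.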

Terra also provides a powerful graphical user interface (GUI) so that biologists without much programming experience can execute workflows on their own. Given a pre-configured workflow such as those we include in Cumulus, the user only has to fill in a web form to specify workflow inputs and then click ‘RUN ANALYSIS’. One of my colleagues, who is an immunologist, was even able to run her single-cell RNA-seq data analysis during her flight back to the US by using the Cumulus framework on Terra.

Last but not least, Terra helps us ensure the computational reproducibility of our analyses. First, it guarantees that researchers will be able to use the exact same computing environment as the one we used to produce our results (e.g. Ubuntu Linux vs. Red Hat Linux, Python and dependency package versions). It does this by leveraging Docker containers to encapsulate all required software and dependencies, and by using a workflow language that is readily portable across a wide variety of computing platforms, both on-premises and in the cloud. Second, it enables anyone who wants to reproduce results to access the same type of computing resources (e.g. a big server with 100 CPUs and 1 TB of memory), since it is backed by the Google Cloud Platform, which as a public cloud vendor offers almost unlimited computing resources. In my experience, this combination of functionality is absolutely key to ensuring that researchers can reproduce, reuse and extend each other’s work.
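As a small illustration of the environment details a container pins down, the sketch below records the operating system, the Python version and the versions of a few packages using only the Python standard library. It is an illustrative helper, not part of Cumulus or Terra; the function name and the package list are assumptions.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect OS, Python and package versions for the current environment."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; substitute the ones your analysis imports.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Running this inside and outside a container makes the differences visible; with a pinned Docker image, the report should be identical for everyone who runs the workflow.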

We are now using Terra workspaces to publish our work in a more reproducible form

Terra makes it possible to package analysis materials, data and code into a shareable workspace that dramatically reduces the amount of setup time needed to reproduce an analysis. In fact, my team has developed a public workspace that demonstrates, on a subset of the original data, key analyses featured in our recently published paper describing the Cumulus framework, with the goal of making our methods more accessible to the research community. There is a lot more to discuss on that topic, so I plan to write another blog post once we’ve had a chance to get more external researchers to try out the workspace and give us their feedback.

In the meantime, I encourage you to check out the Cumulus workspace, and more generally I recommend Terra as a great cloud-native option for anyone who needs to run bioinformatics workflows, especially on large datasets!
