For researchers trying to understand and treat cancer, searching for useful datasets consumes valuable time. Even after accessing a dataset, researchers must figure out how—and where—to analyze it. After all, it is no mean task to wrangle petabytes of data.
Luckily, the Cancer Research Data Commons (CRDC) addresses many of these challenges. A recent paper in Cancer Research outlines how the CRDC accelerates researchers’ work by hosting large, cancer-related datasets on the cloud. Terra is proud to support this effort: an NCI-branded version of Terra (FireCloud) is one of three platforms that both host the CRDC’s datasets and provide cloud-based analysis tools to uncover the insights in that data.
How to access CRDC data in Terra
Through Terra, researchers can access several of the CRDC’s open- and controlled-access cancer datasets. These include reference genomes and files (such as those from the 1000 Genomes Project), as well as data from The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and the Cancer Cell Line Encyclopedia (CCLE). These datasets are multimodal: they include genomics, proteomics, and imaging data.
Rather than analyzing each dataset in isolation, researchers can combine and subset the data to build a cohort that is appropriate for a specific project. In addition, researchers can combine the CRDC’s cancer datasets with other data that are accessible through Terra—for example, from the Analysis, Visualization, and Informatics Lab-space (AnVIL)—or with data from researchers’ own labs.
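As a rough illustration of what combining and subsetting data into a cohort means, here is a minimal Python sketch. It is not Terra's interface or the CRDC data model: the records, field names (participant_id, program, disease, assay), and the build_cohort helper are all hypothetical, chosen only to show the idea of keeping participants who match a disease of interest and who also have data from a second source.

```python
# Hypothetical records: field names and values are illustrative only and
# do not reflect the actual CRDC or Terra data model.
clinical = [
    {"participant_id": "P1", "program": "TCGA", "disease": "breast"},
    {"participant_id": "P2", "program": "TCGA", "disease": "lung"},
    {"participant_id": "P3", "program": "TARGET", "disease": "breast"},
]
proteomics = [
    {"participant_id": "P1", "assay": "proteomics"},
    {"participant_id": "P3", "assay": "proteomics"},
]


def build_cohort(clinical_rows, assay_rows, disease):
    """Keep participants with the given disease who also have assay data."""
    # Collect the IDs that appear in the second dataset.
    with_assay = {row["participant_id"] for row in assay_rows}
    # Subset the clinical table by disease, then intersect with those IDs.
    return [
        row
        for row in clinical_rows
        if row["disease"] == disease and row["participant_id"] in with_assay
    ]


cohort = build_cohort(clinical, proteomics, "breast")
cohort_ids = [row["participant_id"] for row in cohort]
```

In this toy example the cohort spans two programs, which is the point of building cohorts across datasets rather than analyzing each one in isolation.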
How to analyze cancer data in Terra
Once researchers have collected the right data for their scientific questions, Terra provides easy access to several analysis tools to begin answering those questions. These include a library of existing workflows (automated pipelines for high-throughput analyses) and interactive analysis notebooks, with tools for long- and short-read variant calling, genome-wide association studies (GWAS), epigenomic and RNA-seq data processing, and fusion transcript detection.
Beyond giving researchers access to pre-existing tools from the research community, the cloud makes analyzing CRDC data much more lightweight than working on a personal computer or a high-performance computing (HPC) cluster. Large datasets require a specialized computational infrastructure, but there’s no need to set up or maintain this infrastructure when working in the cloud. There’s also no need to set up a data security system, because data analyzed in Terra remains safely in a FedRAMP- and FISMA-certified environment.
CRDC data in action: the Proteomic Data Commons
Researchers have already leveraged Terra’s CRDC resources to push the cancer field forward. For example, the Broad Institute’s Proteomics Platform integrated data from the Proteomic Data Commons (PDC) into Terra to make it easier for researchers to uncover cancer mechanisms and biomarkers.
The PDC is an important resource for cancer researchers because it hosts data generated by several large cancer programs—these include TCGA, TARGET, the Clinical Proteomic Tumor Analysis Consortium (CPTAC), and the Applied Proteogenomics Organizational Learning and Outcomes (APOLLO) Network. D.R. Mani’s team at the Proteomics Platform worked with the Terra team to build a way for researchers to export data from the PDC into a Terra workspace with the click of a button.
Once the data is in Terra, it’s ready to be analyzed with a common proteomics analysis toolkit: FragPipe. To make this process even more seamless, the group set up a template workspace with step-by-step instructions for selecting PDC data and importing it into Terra. The workspace also includes a FragPipe workflow and an interactive notebook that walks researchers through how to analyze PDC data with an example FragPipe tool.
With cloud-based integrations like this uniting CRDC data and analysis tools, researchers can focus on understanding and treating cancer, rather than accessing and wrangling data.
Use Terra’s CRDC resources in your own work
To explore Terra’s CRDC resources further, register for an account and browse Terra’s NCI featured workspaces. These workspaces provide further instructions for accessing connected datasets and suggest tools for analyzing the data. Note that you will need an eRA Commons account to access CRDC data on Terra.
Thank you to Alex Baumann, Emily LaPlante, and Katherine Thayer for their help preparing this post.