In the last decade, we’ve seen exponential growth in the amount and breadth of cancer research data. We’ve harnessed the power of cloud repositories to host large oncology datasets, including genomics, proteomics, imaging, and more! Although this diversity in datasets is a strength, combining these data to leverage the full range of information available to accelerate cancer research remains a challenge; there is no easy way to search across and retrieve data from the multitude of program- and data type-specific repositories in use today.
Now, thanks to an award from the National Cancer Institute (NCI), the Broad Institute will lead a collaborative effort with Seven Bridges and the Institute for Systems Biology to build a Cancer Data Aggregator (CDA), a query engine that will allow the research community to seamlessly discover, retrieve, and connect oncology datasets across the NCI’s Cancer Research Data Commons (CRDC) and partner repositories.
The Cancer Data Aggregator will streamline the complex process of searching across repositories
Currently, searching and aggregating results across all NCI data repositories is tedious and time-consuming. If you want to find, for example, clinical and genomics data for a particular cancer phenotype, you must first perform independent searches within two separate data type-specific repositories, using each repository’s unique metadata terms. Because the metadata are not harmonized between the repositories, you may need to use a different search term in each repository to identify the same phenotype of interest. If you’re unfamiliar with a repository’s metadata, you may miss relevant datasets that could have been discovered using a different search term. And once you’ve found the appropriate datasets within each repository, you still need to aggregate them before you can finally perform analyses.
With the CDA, researchers will be able to use a harmonized, interoperable, and open-source API to retrieve results from all CRDC and NCI-approved repositories simultaneously with just one query. This is possible because the CDA will be designed in collaboration with the CRDC’s Center for Cancer Data Harmonization (CCDH) to implement a common data model across repositories, so a single search term or query can retrieve related datasets across all the disparate data repositories. Not only will this make diverse and interrelated datasets more discoverable, it will also save time, allowing researchers to focus more on the science and less on the data retrieval. Additionally, the ability to search across scientific domains will empower new research collaborations to further expand cancer therapeutic options and improve diagnoses.
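To make the idea of a harmonized query concrete, here is a minimal Python sketch of how one search term might be translated into each repository's local metadata term and run everywhere at once. This is not the CDA API, which has not been released. The repository names, the term mappings, the toy records, and the `search_all` function are all hypothetical and exist only to illustrate the concept.

```python
# Hypothetical mapping from a harmonized term to each repository's
# local metadata term. Names and values are illustrative assumptions.
REPO_TERMS = {
    "clinical_repo": {"lung adenocarcinoma": "Lung Adenocarcinoma"},
    "genomics_repo": {"lung adenocarcinoma": "LUAD"},
}

# Toy records standing in for each repository's contents (made up).
REPO_RECORDS = {
    "clinical_repo": [
        {"id": "c1", "diagnosis": "Lung Adenocarcinoma"},
        {"id": "c2", "diagnosis": "Breast Carcinoma"},
    ],
    "genomics_repo": [
        {"id": "g1", "diagnosis": "LUAD"},
        {"id": "g2", "diagnosis": "BRCA"},
    ],
}


def search_all(harmonized_term):
    """Translate one harmonized term into each repository's local
    term and collect the matching records from every repository."""
    results = []
    for repo, records in REPO_RECORDS.items():
        local_term = REPO_TERMS[repo].get(harmonized_term.lower())
        if local_term is None:
            continue  # this repository has no mapping for the term
        for record in records:
            if record["diagnosis"] == local_term:
                results.append({"repository": repo, **record})
    return results


hits = search_all("lung adenocarcinoma")
print([hit["id"] for hit in hits])  # prints ['c1', 'g1']
```

Without the shared mapping, a researcher would need to know that one repository says "Lung Adenocarcinoma" and the other says "LUAD". With it, a single term finds both records.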
The CDA consortium will leverage its collaborative roots to bring more researchers to the cancer data ecosystem
The Broad Institute, Seven Bridges, and the Institute for Systems Biology will build on their deeply collaborative history as they officially join together to build the CDA. In 2014, the NCI launched the Cloud Pilots program, which was designed to help bring NCI data to the cloud and standardize data access. While each of these teams worked to build its own independent platform, they interacted closely with one another to optimize the analysis of the NCI’s diverse and powerful datasets and to engage the wide world of cancer researchers. Building the CDA is an exciting opportunity for these teams to leverage their years of experience working together to bring this valuable new asset to the cancer data ecosystem.
The CDA is set to launch in 2021. You can read more about it in the official news release, the NCI announcement, and the Seven Bridges announcement.