As the global battle against COVID-19 continues, researchers in the viral genomics group at the Broad Institute have been hard at work using viral sequencing and genomic epidemiology to understand the spread of SARS-CoV-2 close to home, yielding new insights into the arrival and spread of the epidemic in the greater Boston area.
The Broad’s Data Sciences Platform has been supporting this work and hopes to enable researchers anywhere in the world to do the same by harnessing Terra as a community data access and analysis platform for pathogen genomic surveillance. Using the Broad team’s SARS-CoV-2 genomic data as a pilot, this initiative will provide the data management, analysis tools, and compute needed to help make genomic data for public health epidemiology more accessible, interoperable, and user-friendly.
In their first data release, the Broad research team focused on the early phase of the local epidemic in the Boston area, from early March to early April, revealing that the virus was introduced from multiple sources, both domestic and international. They also investigated two “superspreading” events, one at an international conference and the other within a congregate living facility. A working draft of the manuscript describing these data and analyses can be found on Virological.org. The team has been working in close collaboration with their partners at Massachusetts General Hospital and the Massachusetts Department of Public Health, sharing their findings in real time.
The team is sharing their data and analytical tools openly in a newly featured COVID-19 Broad Viral NGS workspace in Terra, so other research and public health labs can explore and use these resources for their own work. The workspace is designed to enable researchers to go all the way from raw reads to phylogenetic trees using their own data, and to make batch data submission to GenBank easier.
The current dataset contains 331 high-quality assembled SARS-CoV-2 genomes generated from human-depleted, metagenomic Illumina short-read data, derived from nasal swab samples of confirmed COVID-19 patients tested at Massachusetts General Hospital. It also contains 3065 publicly available SARS-CoV-2 genomes from the global dataset on GenBank as of May 14th, 2020, for inclusion in phylogenetic analyses of the primary dataset.
Along with the data, the workspace contains road-tested analysis workflows for viral genomic data that allow researchers to easily:
- perform reference-based assembly of viral genomes from PCR-amplified or metagenomic Illumina reads (workflow: assemble_refbased)
- perform phylogenetic analysis and interactive visualization using the community-standard Nextstrain.org toolset (workflow: nextstrain_augur_viral_pipelines)
- simplify viral genome submissions to NCBI GenBank via direct bulk submission, in a manner consistent with SARS-CoV-2 community standards (workflow: genbank)
With this workspace, researchers are also able to recreate the phylogenetic trees shown in the Terra workspace dashboard and can run the workflows on their own data in Terra to create new phylogenetic trees and interactive visualizations. Both the Nextstrain visualization and the GenBank submission workflows can operate on genomes generated within Terra or imported from external assembly software.
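Genomes produced by external assembly software are commonly exchanged as FASTA files, so a quick sanity check before import can be useful. The following is a minimal sketch of such a check, not part of the workspace's workflows: the function names, the length cutoff, and the unambiguous-base cutoff are all hypothetical and chosen only for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unambiguous_fraction(seq):
    """Return the fraction of bases that are A, C, G, or T."""
    if not seq:
        return 0.0
    called = sum(1 for base in seq.upper() if base in "ACGT")
    return called / len(seq)


def filter_genomes(handle, min_length=25000, min_unambiguous=0.9):
    """Return headers of genomes meeting both (hypothetical) thresholds.

    The defaults are illustrative assumptions, not values taken from the
    Broad workflows; adjust them to your own quality criteria.
    """
    passing = []
    for header, seq in read_fasta(handle):
        if len(seq) >= min_length and unambiguous_fraction(seq) >= min_unambiguous:
            passing.append(header)
    return passing


if __name__ == "__main__":
    # Tiny toy input; real genomes would be read from a file opened in text mode.
    toy = StringIO(">sample_a\nACGTACGT\n>sample_b\nACNNNNNN\n")
    print(filter_genomes(toy, min_length=4, min_unambiguous=0.9))
```

For real data, the same function could be called on a file opened with `open(path)`; the toy thresholds in the demo exist only so the short sequences can pass or fail.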
All of these resources have also been distributed via public repositories: the genomic dataset has been released publicly on NCBI under BioProject PRJNA622837, and the featured, portable analysis workflows are published on the Dockstore tool repository service. The workflows have been validated to run on other commercial cloud platforms as well as local compute hardware. However, the viral genomics group has found from years of experience that Terra opens these tools to a much wider set of researchers, as it does not require traditional bioinformatics expertise.
Terra, a product of the Broad Institute in collaboration with Verily Life Sciences, leverages cloud resources to enable computation at a scale not possible in a lab with limited compute resources. Terra supports several flagship human genetics research initiatives such as NHGRI’s AnVIL, NHLBI’s BioData Catalyst, and NIH’s All of Us. Terra is designed to make it easier to share and access large datasets, run analyses at scale, and work collaboratively across groups and organizations without having to worry about the underlying computational infrastructure or data management. The team hopes that providing data and tools on Terra will help empower public health labs as they engage in viral sequencing work.
This initiative was sponsored in part by the National Institute of Allergy and Infectious Diseases and the Bill & Melinda Gates Foundation, and is being carried out in partnership with the Public Health Alliance for Genomic Epidemiology (PHA4GE) and the Virus Pathogen Resource (ViPR).