Behind the scenes: Bringing the analysis of COVID-19 data from greater Boston into the cloud

Christine Loreth is a project manager in the Data Sciences Platform at the Broad Institute. In this guest blog post, she tells the story of how she and colleagues in the DSP helped members of the Sabeti Lab, a leading infectious diseases research group at the Broad, use Terra to enable a genomic epidemiology study of the COVID-19 outbreak in the greater Boston area, led by Bronwyn MacInnis. The findings from the study are described in the following pre-print:

Lemieux et al. (2020) Phylogenetic analysis of SARS-CoV-2 in the Boston area highlights the role of recurrent importation and superspreading events. medRxiv.

All materials discussed here are publicly available in the study’s Featured Workspace.

At the start of 2020, I had just begun working with members of the Sabeti Lab to demonstrate the utility and flexibility of Terra as a platform for pathogen genomic surveillance (PGS). PGS refers to using genomics as an epidemiological tool for tracking sources of infection, detecting outbreaks, and understanding the evolution of an epidemic or pandemic. We had recently received funding from the Gates Foundation to adapt Terra for viral genomic data and genomic surveillance applications and to accelerate the adoption of genomic surveillance by public health researchers and practitioners around the world. Shortly before the world went into lockdown, the project launched successfully and we began planning how to use Ebola data previously generated by the Sabeti Lab to demonstrate sharing an analysis in Terra.

As the pandemic swept through Europe and then the USA, however, everything changed. The Sabeti Lab had begun formalizing plans with their partners at Massachusetts General Hospital and the Massachusetts Department of Public Health to sequence the viral genomes of COVID-19 patients in the greater Boston area. They needed a plan for handling that data, and fast. We realized very quickly that this was the right time to adapt our plans for sharing Ebola data and mobilize on Terra for SARS-CoV-2. This realization kick-started our work, and we immediately began to plan how everything would come together in Terra, from flow cell to downstream analysis.

I began by reviewing the lab’s workflows and typical processes, from wet bench to analytical outputs. With my formal background in molecular biology and significant bench experience in previous roles, I was able to ‘speak the language’ and understand the lab processes. One of our field engineers, Sushma Chaluvadi, jumped in with me to help get the lab’s analytical workflows set up in Terra.

Once we understood the processes, Sushma and I began porting their existing workflows from DNAnexus and GitHub to Terra. In parallel, I worked with our Pipeline Operations team to set up an automated workflow to transfer flow cell data from the Genomics Platform sequencing facility to Google Cloud Storage.

By this time, COVID-19 cases were rising sharply in the greater Boston area and the lab was receiving numerous plates of samples from their collaborators. As samples were sent for sequencing by our Genomics Platform, I tracked each run, so that the data could be made available in Terra as soon as possible. Once the raw data was in Terra, I ran demultiplexing and assembly workflows. Unsurprisingly, in working with such large amounts of data, there were some initial hiccups and challenges; for example, we had to figure out how to organize the data effectively, and we ran into some Google Cloud usage quotas. Various colleagues in the Data Sciences Platform stepped up to help us solve these issues, playing a critical role in getting the data demultiplexed and analyzed very quickly.

We also worked with the lab to create a workflow that uses Nextstrain, a collection of open-source phylogenetic analysis tools widely used by scientists, epidemiologists, and public health officials for studying pathogen spread and evolution, especially in outbreak scenarios. One of these tools, Augur, was developed for tracking pathogen evolution from sequencing data. We used it to build customized phylogenetic trees to analyze the evolutionary relationships between viral genomes isolated from cases of COVID-19, which enabled us to map the initial emergence and sustained transmission of the virus in the greater Boston area. You can see some of those outputs here.

The study of COVID-19 in the greater Boston area will continue; meanwhile, the initial study findings are available on medRxiv, pending peer review, and the data and tools are publicly available in the Featured Workspace. Our team will continue to support the Sabeti Lab and others in the research and public health communities around the world by enabling them to organize, analyze, and share their data. We hope this will empower public health labs as they scale their viral sequencing work.

I have been very lucky to be part of this important work this year. It has given both me and my colleagues in the Data Sciences Platform an incredible amount of insight into the needs of the researchers and public health officials who are engaged in pathogen surveillance and infectious diseases work. We plan to use the knowledge and perspective we gained to further improve how we support research and data sharing in Terra.

This initiative was sponsored in part by the National Institute of Allergy and Infectious Diseases and the Bill and Melinda Gates Foundation, and is carried out in partnership with the Public Health Alliance for Genomic Epidemiology (PHA4GE) and the Virus Pathogen Resource (ViPR).

