Guest blog post by Kylee Degatano, Product Manager for the Lantern Pipelines team in the Data Sciences Platform at the Broad Institute.
I’m very excited to announce the recent release of WDL Analysis Research Pipelines (WARP), a brand-new, public GitHub repository of cloud-optimized WDL workflows that are used in production at the Broad Institute. But before I get into specifics, I want to give you some context, in case you’re not already familiar with our organization, that explains why this is significant.
The Broad Institute of MIT and Harvard (or just “the Broad” for short) is a non-profit biomedical research institution that also happens to be one of the largest sequencing centers in the world. Our Sequencing Platform pumps out hundreds of genomes and exomes every day, amounting to over 45 TB/day, and the Institute overall is currently managing about 35 PB of data stored in the cloud.
The teams that develop and operate Terra at the Broad (in collaboration with Verily, an Alphabet company) are part of the Data Sciences Platform (DSP), a sister organization to the Sequencing Platform. The DSP is responsible for processing all the sequencing data generated at the Broad, as well as data produced by partner institutions in the context of large collaborative efforts such as gnomAD, All of Us, the Human Cell Atlas, and the BRAIN Initiative. This includes genomic data, of course, but also single-cell transcriptomic and epigenomic data.
To that end, we have a team of data engineers and bioinformaticians called “Lantern Pipelines” dedicated to building, optimizing, and maintaining workflows, using open-source tools that are either developed in-house (like GATK) or are produced by others in the research community (like STAR). Given the amount of data that needs to be processed, the workflows need to be very robust — we can’t afford to have our production operations screech to a halt anytime a sample comes through with some quirky formatting — and they need to be economical, to make every research dollar count.
The Lantern Pipelines team writes all their workflows in WDL, the Workflow Description Language maintained by the OpenWDL community, and optimizes them to run as quickly and efficiently as possible on Google Cloud using the Cromwell workflow management system — which is the same infrastructure that you have access to in Terra.
Note: Internally we use the term “pipeline” somewhat interchangeably with the term “workflow”, with the nuance that some pipelines are composed of several individual workflow scripts (typically one main workflow importing nested sub-workflows). Here, I use the term workflow throughout for consistency with the terminology used in Terra and its documentation.
As you can imagine, a huge amount of effort goes into developing and testing these workflows, so we’ve been wanting to share them with the world. We’d love to see others benefit from our team’s hard work, especially if that means they avoid having to reinvent wheels that we’ve already built!
That’s why I’m so thrilled that we’ve finally arrived at a good solution for sharing these workflows, which until now had been mostly siloed in private development repositories. We’ve set up a public GitHub repository called WARP, for WDL Analysis Research Pipelines, that collects all of our “blessed” workflows, meaning those that have been fully vetted for use at scale in our production operations. We’ve also registered each of them in Dockstore so they can be easily imported into Terra. In fact, many of these workflows are already available in public Terra workspaces; you can find those by searching for the “warp-pipelines” tag in Terra. We encourage you to clone those workspaces and try out the workflows on the included test data to familiarize yourself with their operation before running them on your own data.
If you have not run workflows in Terra before, check out the Workflows Quickstart video on YouTube for a step-by-step run-through of the process.
We are deeply committed to reproducibility, provenance, and transparency, so every workflow in WARP is released with a semantic version number, which gives your data processing clear provenance, and with release notes that outline any updates. The workflows are all open-source under a BSD 3-Clause license, and they call only open-source tools. Keep in mind, though, that the licenses of the individual tools may differ, and you are responsible for checking that your intended use is allowed by each tool’s license. In addition, all tools are packaged in public Docker containers distributed either by us or by their authors.
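To make the provenance point concrete, here is a minimal Python sketch of how semantic version numbers can be compared when checking whether two batches of data were processed compatibly. The version strings and function names are made up for illustration and are not taken from WARP; the sketch also assumes plain MAJOR.MINOR.PATCH versions with no pre-release or build tags.

```python
from typing import NamedTuple


class SemVer(NamedTuple):
    """A plain MAJOR.MINOR.PATCH version; tuples compare numerically, field by field."""
    major: int
    minor: int
    patch: int


def parse_semver(text: str) -> SemVer:
    """Parse a MAJOR.MINOR.PATCH string (hypothetical helper, no pre-release tags)."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return SemVer(*(int(p) for p in parts))


def same_major(a: str, b: str) -> bool:
    """True when both versions share a major number.

    Under semantic versioning, a major bump signals an incompatible change,
    so a shared major number is a reasonable first-pass compatibility check.
    """
    return parse_semver(a).major == parse_semver(b).major


# Hypothetical versions recorded for two batches of samples.
batch_one, batch_two = "2.3.1", "2.10.0"
print(same_major(batch_one, batch_two))
print(parse_semver(batch_two) > parse_semver(batch_one))
```

Parsing into integers matters here: compared as plain strings, "2.10.0" would sort before "2.3.1", which is the wrong order. The release notes remain the place to confirm what actually changed between two versions.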
Finally, the repo’s documentation portal, which we are currently fleshing out, will include a full list of available workflows and documentation for each of them, so be sure to check it out and see if any of these workflows might provide a solution to some of your analysis needs. Remember also to check the Terra workspaces list in case the one you like is already available in a preconfigured workspace.
On behalf of the entire Lantern Pipelines team, I hope you will find this new resource useful and I look forward to hearing your thoughts about how we could further improve it.
WARP speed ahead!