At its core, Terra is an open platform with a “bring your own tools” philosophy, designed to accommodate the heterogeneous and ever-evolving nature of life sciences research. In practice, though, we know there are specific tools that a lot of people want to use, so we work with their developers to make those tools available out of the box, which reduces the collective effort spent on setup rather than on actual research. GATK was one of the first toolkits we added to Terra’s built-in tool repertoire, and it’s a great illustration of how making complex computational tools available this way can benefit a wide range of researchers. Given that the GATK team recently released a major update, version 4.2, I thought we could mark the occasion with a quick recap of GATK’s history in Terra and a roundup of relevant resources.
GATK and Terra, a match made in the cloud
If you’re not already familiar with it, GATK stands for Genome Analysis Toolkit and is the world’s most widely used open-source toolkit for variant calling. It’s developed by colleagues of ours in the Broad Institute Data Sciences Platform; we’ve been working with them since the early days of Terra to make sure GATK tools and pipelines would run well on the platform. In fact, key components of Terra’s workflow execution system were originally developed specifically for the large-scale production version of the GATK Best Practices whole genome pipeline, which includes WGS data processing and short variant calling.
Today, the GATK team uses Terra as part of their development process, particularly for testing at scale, and as a platform for making their tools available to the research community in a form that goes well beyond just releasing executable code. For each of their pipelines, which span a range of use cases covering all major classes of variants, germline and somatic, the team produces public workspaces containing pre-configured workflows and test data along with instructions for how to run them. They also provide workshop materials in Terra workspaces that include detailed tutorials in the form of Jupyter notebooks. You can read more about their perspective on this — and browse a recent list of public GATK workspaces — in this GATK blog post.
Enjoy all the GATK 4.2 innovations in Terra
The GATK team recently released version 4.2, which includes updates ranging from performance upgrades for classic tools to brand-new tools that expand GATK’s scope of action, particularly in the realm of copy number and structural variant calling. And of course, it includes the “DRAGEN mode” functionality that was added in preparation for the upcoming DRAGEN-GATK short variants pipeline.
You can read all about the new features on the GATK blog: either just the highlights of the latest release, or a higher-level overview of what’s new since version 4.1, which covers two years’ worth of new developments.
All these improvements are available to use in Terra right away. The workflows corresponding to multi-tool pipelines are registered in Dockstore under the Broad Institute “Organization” and conveniently grouped into Collections such as GATK Best Practices, Long Read Pipelines, and GATK for Microbes, among others. From there, you can easily import any of them into Terra. That said, the most popular of these are preloaded in public workspaces, as I described earlier, so it’s worth checking the showcase of public workspaces before setting anything up yourself.
Leverage the new library of per-tool WDLs
All that is great, but one of the updates I’m personally most excited about in GATK 4.2 is the automated release of per-tool WDL workflows (and their registration in Dockstore) for a number of read-based tools. There’s a whole backstory to this that I’m not going to go into here, but basically, the result is that you’ll be able to run any of those tools individually in Terra without having to write a new WDL from scratch. Similarly, you’ll be able to import any of those tool WDLs into a higher-order workflow to use that tool without having to write a new task or define all the inputs.
One caveat is that the 4.2 release only includes a specific subset of tools. The team still has some work to do to activate the WDL generation feature for all tools, which involves adding special annotations in the source code for each tool, so it might take a bit of time before we have WDLs for all GATK tools. But that day will come, and it will be glorious.
The best place to ask for help
Finally, I want to mention one thing that’s a little bit special about using GATK on Terra, compared to other tools or other computing platforms: the support staff for GATK and for Terra are all part of the same team behind the scenes. And I don’t mean that metaphorically or in a vague “Go Team Broad!” way; I mean literally the same team, with the same manager and the same happy hours. The immediate benefit to you is that whenever you run into one of those weird cases (where it’s not clear why a run is failing, for example), they can work together to figure out your problem, regardless of whether you posted it in the GATK Forum or contacted the Terra Helpdesk directly. And if it’s a really weird case, they have direct access to both the GATK developers and the engineers who operate Terra. That level of combined expertise is hard to beat.
Key Resources
GATK Forum and User Guide
Terra Helpdesk and User Guide
For a comprehensive hands-on curriculum that integrates GATK, workflows, and Terra, see this blog post.