Megan Shand is a Computational Biologist in the Data Sciences Platform at the Broad Institute. As a member of the GATK development team, she manages methods and data pipelines for whole genome sequencing (WGS) data. In this guest blog post, Megan describes the new GATK pipeline for processing short-read WGS data produced by the new Ultima Genomics technology.
The biotech startup Ultima Genomics recently shared its new short-read sequencing technology, which generates high-throughput genomic data using sequencing-by-synthesis (SBS). In keeping with our goal of enabling the research community to use relevant data types, we’ve been collaborating with Ultima to adapt our whole genome analysis pipeline — aka the GATK Best Practices for short variant discovery — to handle the new data appropriately. We also made a Terra workspace preloaded with sample data and the fully configured analysis workflow so you can check it out for yourself without having to download or install anything.
Let’s start with a brief summary of how the technology works, then I’ll explain what we changed in the pipeline and how you can try it out for yourself.
Summary of the new sequencing approach
The company released a preprint describing their new technology in detail, so I won’t rehash it all here — you should really read it for yourself — but here’s a quick summary of how it works.
The most fundamental part of the approach should sound very familiar. There’s a “flowcell-like” substrate patterned with landing pads for sequencing beads, and the chemistry involves “mostly natural” sequencing-by-synthesis, which refers to the use of sparsely labeled non-terminating nucleotides. At each sequencing cycle, the beads are exposed to the mostly natural nucleotide (MNN) mix, and polymerase extension is performed to incorporate 0, 1, or several bases of a single nucleotide base type (dA, dC, dG or dT) into each growing strand, depending on the length of the homopolymer in the template. The labeled bases are detected by optical scanning, with the signal of each bead being proportional to the length of the homopolymer sequenced. The base-calling is done by on-board GPUs that employ a deep convolutional neural network to convert the raw signals into sequences.
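To make the “0, 1, or several bases per cycle” idea concrete, here is a minimal Python sketch that converts a base sequence into the number of bases incorporated at each flow. It is an illustration only, not the base-caller or any GATK code: the function name `to_flow_space` and the default flow order `TGCA` are assumptions made for this example.

```python
from itertools import cycle


def to_flow_space(read_seq: str, flow_order: str = "TGCA") -> list[int]:
    """Return how many bases are incorporated at each successive flow.

    read_seq is the sequence of the growing strand; flow_order is the
    repeating order in which single base types are offered (assumed here).
    """
    unknown = set(read_seq) - set(flow_order)
    if unknown:
        # Without this check, a base that is never offered would stall the loop.
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    counts = []
    pos = 0
    for base in cycle(flow_order):
        if pos >= len(read_seq):
            break
        run = 0
        # Incorporate the whole homopolymer run of the offered base type.
        while pos < len(read_seq) and read_seq[pos] == base:
            run += 1
            pos += 1
        counts.append(run)  # 0 means nothing was incorporated at this flow
    return counts


print(to_flow_space("TTGCAAAT"))  # [2, 1, 1, 3, 1]
print(to_flow_space("GA"))        # [0, 1, 0, 1]
```

Each number plays the role of the idealized signal for one flow: the `3` in the first example corresponds to the `AAA` homopolymer being read in a single step.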
However, instead of using a traditional linear flowcell, Ultima’s sequencer uses an “open fluidics” design, which consists of a circular 200mm silicon wafer that spins while reagents are released at its center. The spinning causes the reagents to spread out by inertial distribution, which reduces the volume of reagents needed per cycle. The imaging is also done by spinning the wafer — like reading a compact disc, as the preprint puts it.
Finally, the data produced by the Ultima technology is very similar to “classic” short read genomic data. It can be stored in the same file formats (FASTQ, SAM/BAM/CRAM, etc.), which means the tools we have all been using so far (BWA, GATK, etc.) can read and write the Ultima data out of the box.
For a more in-depth third-party review of the new technology and its implications, check out Keith Robison’s “Omics! Omics!” blog.
Adapting the GATK pipeline to handle Ultima data
Despite the overall similarities, we did find some differences in the error modes that affect the data generated by the new technology. Compared with the short-read data we’re used to, mismatches are less common, while homopolymer indels are more common. As a result, we had to modify a subset of the algorithms used in our pipeline to handle those differences.
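One way to build intuition for why homopolymer indels show up is to go back from per-flow counts to bases. The sketch below is a simplified illustration and not how GATK models errors: the helper `from_flow_space`, the flow order `TGCA`, and the example counts are all assumptions made for this example. It shows that misjudging the signal of a single flow by one unit changes the length of a homopolymer rather than substituting a base.

```python
def from_flow_space(counts: list[int], flow_order: str = "TGCA") -> str:
    """Rebuild a base sequence from per-flow incorporation counts."""
    return "".join(
        flow_order[i % len(flow_order)] * n for i, n in enumerate(counts)
    )


true_counts = [2, 1, 1, 3, 1]   # flows T, G, C, A, T
miscalled = true_counts.copy()
miscalled[3] = 2                # the AAA homopolymer is read as AA

print(from_flow_space(true_counts))  # TTGCAAAT
print(from_flow_space(miscalled))    # TTGCAAT
```

The miscalled read is one base shorter, which an aligner would report as a deletion inside the `A` homopolymer.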
Key tool/algorithm changes
– The Picard tool MarkDuplicates has been adapted to handle ambiguity in read alignment start positions;
– The contamination estimation step has been adapted to use only the highest quality SNV sites based on flow cycles and local realignment;
– In the HaplotypeCaller variant caller:
    – The classic Hidden Markov Model (HMM) that we previously used to calculate genotype likelihoods has been replaced by a new likelihood model (“flow-based”) that more accurately accounts for sequencing errors present in the data;
    – A new haplotype filtering step has been added to remove spurious alleles that don’t contribute significantly to the genotyping likelihood;
– At the variant filtering stage, VQSR and its Gaussian mixture model have been replaced with an external package developed by Ultima that applies a Random Forest model.
Overall, these are mostly “under the hood” changes to existing tools that we activate using configuration flags and parameters. The only step where we’re swapping out components is variant filtering, as described above.
You can read more about these changes in the GATK technical documentation.
New pipeline implementation
The resulting pipeline is different enough from our “generic” whole genome analysis pipeline to warrant its own implementation, which you can find here in the WARP repository. Like all our other pipelines, it is written in the Workflow Description Language, or WDL, which you can learn more about here.
Fortunately we were able to reuse a lot of existing code because of the modular design we’ve been using for the past few years: the bulk of the work done by the pipeline is divided into sub-workflow scripts that we can call from a top-level workflow script, sometimes in different combinations or with different parameters as needed. We originally adopted that design to maximize code reuse between the exome and whole genome versions of our variant calling pipeline, and now it’s also coming in handy to minimize the amount of homologous code that we need to maintain in parallel for running on different data types. Hopefully this will also make things easier for any of you who are also using our workflows to analyze your short read data.
If you develop your own GATK pipelines, make sure to update to the latest version of GATK so you can take advantage of the new functionality.
Take the data and the new pipeline out for a spin
Of course there’s no better way to really get to know a new data type than to poke at it yourself. To that end, our teams collaborated to create a public workspace in Terra that contains sample Ultima data as well as the updated whole genome analysis workflow preconfigured to run on the sample data. The workspace also includes all the relevant logs, intermediate outputs, and so on produced in the process of running the analysis workflow.
Having all of that bundled together in a workspace allows you to examine the input files and see exactly how the analysis workflow is set up, what all the parameter values are, how long it takes to run and what the quality control metrics and outputs look like, in full technicolor detail. If you’d like, you can even clone the workspace under your own account and try running the workflow yourself.
So check out the Ultima workspace today to form your own opinion of the new data and test the updated whole genome pipeline. We’d love to hear your feedback in either the Terra community forum or the GATK support forum.
If you have any technical questions about the GATK side of things, please post your question in the GATK support forum. For resources to get started using the Terra workflow system, see the “Getting Started” links in the workspace dashboard. Finally, for questions about the new technology, reach out to Ultima Genomics via their contact form.