Paper Spotlight: A complete reference genome improves analysis of human genetic variation

This blog post is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.


A complete reference genome improves analysis of human genetic variation

By Sergey Aganezov, Stephanie M. Yan, Daniela C. Soto, Melanie Kirsche, Samantha Zarate, Pavel Avdeyev, Dylan J. Taylor, Kishwar Shafin, Alaina Shumate, Michael C. Schatz et al., 2022

Science, Vol 376, Issue 6588 

Abstract: Compared to its predecessors, the Telomere-to-Telomere CHM13 genome adds nearly 200 million base pairs of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for clinical and functional study. We show how this reference universally improves read mapping and variant calling for 3202 and 17 globally diverse samples sequenced with short and long reads, respectively. We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.



What part of the work was done in Terra?

Excerpts from the paper’s Methods section:


Short-read variant calling 

To evaluate short-read small-variant calling between GRCh38 and T2T-CHM13, we used the NHGRI AnVIL (44) to align all 3,202 1KGP samples to CHM13 with BWA-MEM (45) and performed variant calling with GATK HaplotypeCaller (77) using a workflow modeled on the one developed by the New York Genome Center (NYGC) for 1KGP analysis performed on GRCh38 (28). As in the NYGC analysis, we recalibrated the variant calls with GATK VariantRecalibrator. We analyzed coverage statistics using samtools and AF using bedtools. To identify Mendelian-discordant variants, we used GATK VariantEval.


Note: NHGRI AnVIL is a project of the US National Human Genome Research Institute that brings together Terra and several complementary platforms into a powerful genomics analysis ecosystem. The AnVIL portal powered by Terra provides full access to Terra’s data and analysis capabilities.


How did they do it?

The authors developed WDL workflows for calling variants in the short-read sequencing data, based on a previous analysis by the New York Genome Center. They ran the workflows at scale on all 3,202 whole genomes in the 1000 Genomes Project cohort using Terra’s workflow execution service.
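To make the per-sample steps named in the Methods excerpt more concrete, here is a minimal Python sketch that assembles command lines for alignment with BWA-MEM, sorting and indexing with samtools, and calling with GATK HaplotypeCaller. It is not the authors' workflow, which was written in WDL. The function name, file names, sample ID, read-group fields, thread count, and the choice of GVCF output mode are all assumptions made for illustration. The sketch only builds and prints the commands; it does not run them.

```python
import shlex
from typing import Dict, List


def build_commands(
    sample: str,
    reference: str,
    fastq1: str,
    fastq2: str,
    threads: int = 4,
) -> Dict[str, List[str]]:
    """Return argument lists for one sample's alignment and calling steps.

    All file names are hypothetical. The reference FASTA is assumed to be
    already indexed for BWA, samtools, and GATK.
    """
    sorted_bam = f"{sample}.sorted.bam"
    gvcf = f"{sample}.g.vcf.gz"
    # BWA expands the literal "\t" sequences in the read-group string itself.
    # These read-group fields are an assumption; GATK needs a sample name (SM).
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return {
        # Align paired-end reads; BWA-MEM writes SAM to standard output.
        "align": ["bwa", "mem", "-t", str(threads), "-R", read_group,
                  reference, fastq1, fastq2],
        # Coordinate-sort the alignments read from standard input ("-").
        "sort": ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, "-"],
        # Index the sorted BAM so GATK can access it by region.
        "index": ["samtools", "index", sorted_bam],
        # Call variants per sample; GVCF mode is an assumption of this sketch.
        "call": ["gatk", "HaplotypeCaller", "-R", reference, "-I", sorted_bam,
                 "-O", gvcf, "-ERC", "GVCF"],
    }


if __name__ == "__main__":
    # Hypothetical sample ID and input paths, for illustration only.
    commands = build_commands("sample01", "chm13.fa",
                              "sample01_R1.fastq.gz", "sample01_R2.fastq.gz")
    for step, argv in commands.items():
        print(f"{step}: {shlex.join(argv)}")
```

In a WDL workflow, each of these steps would typically be its own task with declared inputs, outputs, and runtime resources, which lets the execution service run many samples in parallel.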

To learn more about the scaling challenges they faced and how they overcame them using Terra, see this blog post written by Samantha Zarate of the Schatz Lab.

To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.


Appendix: Data and code availability