There is a new, more complete human reference on the Terra reference disk!
When the Human Genome Project was declared complete in 2003, the reference still contained a number of gaps, due to the repetitive nature of much of the human genome and the limitations of the technologies available at the time. This left approximately 8% of the human genome out of the canonical reference sequence. To fill this gap, the first complete sequence of a human genome, also known as the Telomere-to-Telomere (T2T) reference, was announced in early 2022. The T2T reference was built using a combination of the latest sequencing technologies to resolve the missing 8%. It lets scientists access even the trickiest parts of the genome, such as centromeres, segmental duplications, and other complex regions, including around 100 new protein-coding genes. In addition to these gains, some errors found in the previous GRCh38 reference have been corrected, resulting in an overall higher-quality reference. Simply switching to this reference improves variant calling performance (see this writeup).
Scientific Background
Building a reference sequence is a special type of assembly project that combines data from multiple orthogonal technologies with careful analysis. In the case of the T2T reference, the CHM13 (complete hydatidiform mole) haploid human cell line was used to assemble the autosomes and the X chromosome. The full Y chromosome sequence was added in 2023 using DNA from a sample commonly known as HG002 (also frequently referred to as NA24385). A few versions of the T2T reference were included in the data release to facilitate different types of analyses. These differ in a handful of ways, described below.
The maskedY Reference
The pseudo-autosomal regions (PAR) are a pair of regions on chrX and chrY. Because they are homologous to each other, alignments to these regions are ambiguous and receive much lower mapping quality on both chromosomes. Traditionally, one would mask the corresponding regions on chrY for all samples and produce diploid variant calls in the PAR on chrX. The maskedY version of the T2T reference does just this: it replaces the PAR on chrY with hard-masked N bases, so that reads coming from these regions can be aligned unambiguously to chrX.
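To make the idea of hard-masking concrete, here is a minimal Python sketch. It is not the procedure used to build the maskedY reference; the function name `hard_mask`, the toy sequence, and the interval coordinates are all hypothetical and chosen only for illustration. Real PAR coordinates should be taken from the T2T data release.

```python
def hard_mask(sequence: str, intervals: list[tuple[int, int]]) -> str:
    """Replace the bases in each 0-based, half-open [start, end) interval with N."""
    bases = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(bases):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        # Overwrite the interval with the same number of N bases,
        # so the coordinates of everything else stay unchanged.
        bases[start:end] = "N" * (end - start)
    return "".join(bases)


# Toy stand-in for chrY with made-up "PAR" intervals at both ends.
toy_chr_y = "ACGTACGTACGTACGTACGT"
toy_par = [(0, 4), (16, 20)]
masked = hard_mask(toy_chr_y, toy_par)
print(masked)  # NNNNACGTACGTACGTNNNN
```

Because the masked bases are replaced one-for-one, the contig length and all downstream coordinates are preserved; an aligner simply has nothing to match in the masked intervals, so reads from the PAR land on chrX.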
The rCRS Sequence
The T2T release includes a new mitochondrial sequence derived from the CHM13 cell line. An alternate version, containing the older revised Cambridge Reference Sequence (rCRS) chrM sequence identical to the one in the human reference hg38, is also included in the T2T release. This is convenient for backwards compatibility with mitochondrial studies done with hg38.
The EBV Sequence
The Epstein-Barr virus (EBV) sequence is included as chrEBV in our version of the reference provided on Terra. This is consistent with our curated hg19 and hg38 reference files, as described in our reference documentation, and follows conventions from the All of Us project. This contig is useful during alignment for siphoning off viral sequence that is common in human samples, and it helps improve data quality for samples from lymphoblastoid cell lines such as the Coriell 1000 Genomes samples.
How to use the T2T Reference on Terra
You’ll find our recommended version of the T2T reference for most use cases on Terra under the name “T2T-v2.” The v2 corresponds to the T2T release v2.0, which includes the chrY sequence.
Our version reflects the following choices:
- Uses the maskedY version for generic alignment to allosomes
- Uses the rCRS chrM sequence for backwards compatibility with human mitochondrial studies
- Includes a copy of chrEBV for cleaner alignments in human samples containing the viral sequence
We also include precomputed index files for use with BWA, so you can drop the right files into your workflows and begin testing the improved reference immediately. For other versions of the T2T reference suited to more specialized applications, see the full data release.
For instructions on how to add reference files to your Terra workspace and reference them in your workflow inputs, see this guide.
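If you want to confirm that a set of BWA index files is complete before launching a workflow, a small check like the following Python sketch may help. It assumes the indexes follow the default naming used by the classic `bwa index` command, which appends `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa` to the FASTA file name; other aligners, such as BWA-MEM2, use different files. The helper names and the example path are hypothetical, not actual file names on the Terra reference disk.

```python
from pathlib import Path

# Suffixes that classic `bwa index` appends to the FASTA file name.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def expected_bwa_index_files(fasta: Path) -> list[Path]:
    """Return the index paths BWA would look for next to the given FASTA."""
    return [Path(str(fasta) + suffix) for suffix in BWA_INDEX_SUFFIXES]


def missing_bwa_index_files(fasta: Path) -> list[Path]:
    """Return the expected index paths that do not exist on disk."""
    return [path for path in expected_bwa_index_files(fasta) if not path.exists()]


if __name__ == "__main__":
    # Hypothetical local copy of the reference; substitute your own path.
    reference = Path("references/t2t_example.fasta")
    for path in missing_bwa_index_files(reference):
        print(f"missing index file: {path}")
```

An empty result from `missing_bwa_index_files` only shows that the files are present; it does not verify that they were built from the same FASTA.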
Learn more about T2T with NHGRI AnVIL
The Telomere-to-Telomere (T2T) consortium is an open, community-based effort to de novo assemble the first complete reference human genome. They share their data (T2T, chrY) and methods with the community using the NHGRI AnVIL ecosystem. In fact, Terra was used in a detailed analysis, across thousands of human samples from globally diverse ancestries, of how the new reference genome improves our understanding of human genetic variation. Terra was also used to evaluate variation on the Y chromosome using data from the 1000 Genomes Project and the Simons Genome Diversity Project.
We hope that adding the T2T reference to our list of provided references in Terra lets users leverage the same cutting-edge science in their own work. See if it can help yours when you check it out on Terra/AnVIL!
Acknowledgements
Thanks to Kate Balaconis (DSP), Eric Banks (DSP), Allie Cliffe (DSP), Fabio Cunial (DSP), Kylee Degatano (DSP), Laura Gauthier (DSP), Steve Huang (DSP), Vijeta Limbekar (DSP), Karen Miga (UCSC), Sam Novod (DSP), Adam Phillippy (NHGRI), Michael Schatz (JHU), Beth Sheets (DSP), Hang Su (DSP), Nick Watts (DSP), and Jessica Way (DSP) for helpful scientific and technical input in curating the data and preparing this blog post.