Terra Blog

Introducing the All of Us + AnVIL Imputation Service

The Broad Institute’s Data Sciences Platform has launched the All of Us + AnVIL Imputation Service, now available at https://allofus-anvil-imputation.terra.bio/. This is the first in a new suite of cloud-based scientific services, designed to make large-scale genomic research faster, more accurate, and more inclusive.

Genotype imputation plays a key role in genome-wide association studies and polygenic risk score analyses by expanding the portion of the genome that can be analyzed for phenotypic associations, at a fraction of the cost of sequencing. Imputation services let researchers leverage large reference panels without direct access to the underlying data, which protects participant privacy. They also eliminate the need to build and manage the complex computational infrastructure that imputation pipelines require.

The largest and most diverse panel in the world

Our imputation service features a diverse reference panel of genomes from over 515,000 participants from the All of Us Research Program and AnVIL Center for Common Disease Genomics, including more than 250,000 genomes from non-European inferred genetic ancestries.

Figure 1. The All of Us + AnVIL reference panel contains more than 515,000 total genomes from the following computed genetic ancestries: 254,416 European (49%), 101,982 African (20%), 90,553 Americas (18%), 13,226 East Asian (3%), 9,710 South Asian (2%), 1,065 Middle Eastern (0.2%), and 44,627 remaining individuals (9%).
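As a quick sanity check on the Figure 1 caption, the short Python sketch below recomputes the panel total, the non-European count, and each ancestry's share from the per-ancestry counts listed there. The counts are copied from the caption; the variable and function names are ours and purely illustrative.

```python
# Genome counts per computed genetic ancestry, copied from the Figure 1 caption.
counts = {
    "European": 254_416,
    "African": 101_982,
    "Americas": 90_553,
    "East Asian": 13_226,
    "South Asian": 9_710,
    "Middle Eastern": 1_065,
    "Remaining": 44_627,
}

total = sum(counts.values())
non_european = total - counts["European"]


def share(n, panel_total):
    """Return n as an unrounded percentage of the whole panel."""
    return 100 * n / panel_total


shares = {name: share(n, total) for name, n in counts.items()}

print(f"total genomes: {total:,}")
print(f"non-European genomes: {non_european:,}")
for name, pct in shares.items():
    print(f"{name:>15}: {pct:.1f}%")
```

The counts sum to 515,579 genomes, of which 261,163 fall outside the European category, consistent with the "over 515,000" and "more than 250,000" figures quoted in the text.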

The All of Us + AnVIL reference panel is the largest and most ancestrally diverse reference panel currently available, enabling researchers to enhance their datasets with greater accuracy. When we imputed 42 arrays representing an ancestrally diverse set of samples and compared the results against the whole genome sequences of the same samples, we found high confidence (R² of 0.7) in both imputed SNPs and indels, even at very low allele frequencies in most ancestries. You can read more about our reference panel in our documentation.

Figure 2. Mean R² values for imputed SNPs (top) and indels (bottom) across all chromosomes compared to their respective 30X whole genomes for 42 samples from diverse ancestries (African, African-American, Admixed American, East Asian, Non-Finnish European, South Asian). The X axis is allele frequency, and the Y axis is the mean R².
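For readers unfamiliar with the metric in Figure 2, here is a minimal Python sketch of one common way to compute an empirical imputation R²: the squared Pearson correlation between imputed dosages and sequenced genotypes for a variant, followed by averaging within allele-frequency bins. This post does not spell out the exact calculation behind the figure, so treat the definition, the function names, and the bin edges as assumptions for illustration only.

```python
import math


def imputation_r2(imputed_dosages, true_genotypes):
    """Squared Pearson correlation for one variant across samples.

    imputed_dosages: imputed alternate-allele dosages (0.0 to 2.0), one per sample.
    true_genotypes:  alternate-allele counts (0, 1, 2) from sequencing.
    Returns NaN when either vector has no variance, where R² is undefined.
    """
    if len(imputed_dosages) != len(true_genotypes):
        raise ValueError("inputs must have the same length")
    n = len(imputed_dosages)
    if n == 0:
        return math.nan
    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    sxx = sum((x - mean_x) ** 2 for x in imputed_dosages)
    syy = sum((y - mean_y) ** 2 for y in true_genotypes)
    if sxx == 0 or syy == 0:
        return math.nan
    return (sxy * sxy) / (sxx * syy)


def mean_r2_by_frequency_bin(variants, bin_edges):
    """Average R² within half-open allele-frequency bins [lo, hi).

    variants:  iterable of (allele_frequency, r2) pairs; NaN R² values are skipped.
    bin_edges: increasing list of bin boundaries (hypothetical, chosen by the caller).
    Returns one mean per bin, NaN for bins with no usable variants.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    for af, r2 in variants:
        if math.isnan(r2):
            continue
        for i in range(n_bins):
            if bin_edges[i] <= af < bin_edges[i + 1]:
                sums[i] += r2
                hits[i] += 1
                break
    return [s / c if c else math.nan for s, c in zip(sums, hits)]
```

Under this definition, a variant whose imputed dosages track the sequenced genotypes perfectly scores 1, and a variant whose dosages carry no information about the true genotype scores 0.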

Secure and scalable infrastructure

Behind the scenes, the imputation service leverages the Terra platform. This means you can expect the same level of security and privacy, including the ability to impute controlled access data that requires NIST-800-53 Rev5 FedRAMP Moderate compliance. You can learn more about Terra’s security at https://terra.bio/terra-security/.

Because our service runs in the cloud, you can expect to get your results quickly, without waiting in a queue behind other users. In early testing, a submission of 2,500 samples returned results the next day.

Get Started

The beta release of the service offers imputation of array data via a command-line tool called terralab. Users who sign up during the beta release may be eligible for credits that allow them to submit up to 2,500 samples at no cost, thanks to funding from the National Institutes of Health.

To learn how to use the service, visit our documentation page at https://broadscientificservices.zendesk.com/hc/en-us.

What to expect over the next year

Soon, we plan to release a point-and-click web user interface, as well as support for imputing cloud-hosted data.

With the rapid growth of genomic projects, we are preparing to meet increasing demand. After the beta period, we will transition to a paid model designed to support large-scale studies, expanded pre- and post-analysis options, and the release of imputation pipelines for low-pass genomes.

Acknowledgments

This work was made possible by the following National Institutes of Health (NIH) awards: (1) OT2OD035404, “All of Us Data and Research Center (DRC)”; (2) OT2OD03821, “Broad-Color: The Genome Center for the Future of All of Us”; (3) OT2OD002750, “The Broad-LMM-Color Genome Center for All of Us,” funded by the NIH Office of the Director; and (4) U24HG010262, “AnVIL: A National Resource for Genomic Data Analysis and Visualization,” funded by the National Human Genome Research Institute.

We gratefully acknowledge the All of Us and Centers for Common Disease Genomics participants, without whose contributions this research would not have been possible. We also thank the National Institutes of Health’s All of Us Research Program and NHGRI AnVIL for making the participant data available for the reference panel.