Terra Blog

Synthetic phenotypes for 1000 Genomes: Updated dataset for testing, training, and learning

One of the challenges we face in human genomic research is that there is a lot of data that we can’t freely share with one another, for legal and ethical reasons. This can be particularly vexing for tool developers who need data for testing, and for educators who need to provide example data to trainees. In both situations, it’s almost impossible to get access to protected data, even on a temporary basis, if it cannot be meaningfully de-identified.

Enter the 1000 Genomes Project, a widely used resource that includes fully public exome and genome sequencing data from over 2,000 participating individuals (hey, it’s better to have more samples than the name implies). We regularly use 1000 Genomes sequencing data for testing and training; in fact, we’re actively working on a new set of tutorial videos using that data for our upcoming workshop at the annual ASHG meeting, which takes place later this month (Oct 27-30, 2020).

Yet even the 1000 Genomes Project doesn’t include public phenotypic data, which is essential for many types of downstream analysis, such as genome-wide association studies (GWAS). To be clear, in this context phenotypic data means participant traits and health measurements such as age, height, weight, body-mass index (BMI), cholesterol levels, and so on.

Last year, we ran into this exact problem while developing a workshop for the 2019 annual meeting of the American Society of Human Genetics (ASHG), which aimed to teach researchers how to perform a GWAS in the cloud, specifically using Hail and Jupyter Notebooks in Terra. Fortunately, we were collaborating with Tim Majarian, a bioinformatician from the Manning Lab who was generously providing his time and expertise to the project.

To solve our problem, Tim created a dataset of made-up but realistic phenotypic data for all of the 1000 Genomes Project samples, which allowed us to run a complete GWAS analysis in the workshop without needing to deal with data access complications. And as an immensely valuable side benefit, we were able to distribute all of the workshop materials in a public Terra workspace that anyone can access and use to learn the techniques involved.

Why am I bringing this up now? Well, for one thing, not many people seem to realize that this synthetic phenotypic dataset exists at all, so I want to raise awareness of this valuable resource, which continues to be freely available in the original workshop workspace. The workspace also includes a summary description of how Tim generated the data in the first place, as well as links to the scripts he used to do so, which you could use as a starting point to generate your own synthetic phenotype dataset if the one we provide does not fit your needs.
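If you do want to roll your own, the general idea can be sketched in a few lines of Python. To be clear, this is not the method Tim used (the scripts linked from the workspace are the reference for that); it is a deliberately naive illustration that draws each phenotype independently from a normal distribution. The phenotype names, means, and standard deviations are made-up assumptions, and the three sample IDs are just a small example list.

```python
import csv
import io
import random

# Hypothetical phenotype specs as (mean, standard deviation).
# These numbers are made up for illustration only.
PHENOTYPE_SPECS = {
    "height_cm": (168.0, 9.0),
    "bmi": (26.0, 4.0),
}


def simulate_phenotypes(sample_ids, specs, seed=0):
    """Draw one independent normal value per sample and phenotype."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for sample_id in sample_ids:
        row = {"sample_id": sample_id}
        for name, (mu, sigma) in specs.items():
            row[name] = round(rng.gauss(mu, sigma), 1)
        rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0].keys()), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# A short example list of sample IDs; a real run would cover every sample.
sample_ids = ["HG00096", "HG00097", "HG00099"]
rows = simulate_phenotypes(sample_ids, PHENOTYPE_SPECS, seed=42)
tsv_text = to_tsv(rows)
print(tsv_text)
```

Because every value here is drawn independently, the phenotypes are unrelated to each other and to any genotype, so a dataset built this way would only be useful for exercising the mechanics of a pipeline.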

On top of that, we recently updated the dataset, so you may want to check out the new version even if you already knew about this resource. The key difference is that the ranges of values for the various phenotypes now reflect what you would expect for raw, unmodified patient data, as illustrated below:

Previously, some of the values fell in ranges that seemed unrealistic, such as negative values for the BMI and height phenotypes:

This was not a bug; Tim had created the dataset to reflect ranges of values as they would appear in the data after an investigator had applied some transformations to the data prior to analysis — a process that can have the side effect of making the values appear abnormal. We decided to update the dataset to show simulated raw values instead (as shown in the first table), to make it more intuitively accessible to learners.

The updated dataset can still be fed into the same analysis methods that are presented in the workspace notebook for demonstration purposes. However, you should keep in mind that an experienced analyst would apply some transformations to the data before proceeding.
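To make the raw vs. transformed distinction concrete, here is a minimal Python sketch of two transformations that analysts commonly apply to quantitative phenotypes: z-score standardization and a rank-based inverse normal transform. I am not claiming these are the specific transformations behind the earlier version of the dataset; they are assumptions chosen for illustration, and the BMI values are made up. Either one will turn perfectly ordinary raw values into numbers centered on zero, which is how you end up with a "negative BMI".

```python
from statistics import NormalDist, mean, stdev


def zscore(values):
    """Center on the mean and scale by the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]


def inverse_normal_transform(values, offset=0.5):
    """Map each value's rank to the matching standard normal quantile.

    Ties are not handled specially in this sketch: tied values receive
    distinct ranks in their original order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - offset) / n) for r in ranks]


# Made-up raw BMI values, all positive as you would expect.
raw_bmi = [19.8, 22.4, 24.1, 27.3, 31.6]

z_bmi = zscore(raw_bmi)
int_bmi = inverse_normal_transform(raw_bmi)

for raw, z, q in zip(raw_bmi, z_bmi, int_bmi):
    print(f"raw={raw:5.1f}  zscore={z:+.2f}  inverse_normal={q:+.2f}")
```

In both transformed columns, the below-average values come out negative even though every raw value is positive. The ordering of the samples is preserved; only the scale changes.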

So, there it is! I hope some of you may find this resource helpful, and I’d love to get some feedback on the choice of providing raw vs. pre-processed phenotype data. We could do a follow-up post to discuss that point in more detail if that proves to be of wide interest; let us know what you think in the comments below.

Postscript: Tim Majarian recently authored a guest blog post about the Manning Lab’s migration to the cloud and the many lessons they learned in the process, which you may find helpful if you are starting that process yourself.