Having access to good data is crucial for genomics research. Without large-scale datasets, there’s little chance of uncovering the genetic underpinnings of disease. This is particularly important when studying diseases whose risk is determined by a large number of genes, each of which has a relatively small effect.
But analyzing and managing these large datasets can be challenging. This is particularly true when the data are protected, so that only approved researchers can access them. Because of these challenges, many good scientific stories happen alongside equally interesting stories about data management.
In one such story, Terra’s data management tools allowed researchers to access data from the UK Biobank. In turn, this access has fueled many innovative projects that were only possible with large-scale data.
Getting the data onto the Cloud
The UK Biobank amasses genotypic and phenotypic data from roughly half a million participants in the United Kingdom. Research groups interested in working with the data can apply for access to this dataset. In 2017, a group of Broadies across different research interests (including Krishna Aragam, Andrea Ganna, and Mary Hross) understood that having a single copy of the dataset available on the Cloud (and on the Broad Cluster) would allow more researchers to study the genetic basis of disease, and simultaneously reduce storage costs. As a result, Ben Neale’s lab sponsored a project to undertake this work. The project’s storage is funded by Broad ITS and managed by Sam Bryant, who was a Senior Data Management Specialist at the time. Bryant has since become the Associate Director of Data Management at the Broad’s Stanley Center for Psychiatric Research.
Working with such a large dataset was no easy task. To start, it took a long time to obtain the files: at the time, the UK Biobank limited downloads to 10 concurrent files. So, Sam Bryant got approval to access the genotype and phenotype data, then spent about two weeks downloading them onto the Broad’s on-premises cluster. From there, he uploaded the data to a Terra workspace in the Cloud.
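To illustrate why a cap on concurrent downloads stretches out a bulk transfer, here is a minimal Python sketch of a bounded-concurrency download loop. It is not the tooling Bryant used. The `download_all` function and the `fetch_one` callback are hypothetical names introduced for this example, and `fetch_one` stands in for whatever download client a data provider supplies.

```python
from concurrent.futures import ThreadPoolExecutor

# The post mentions a limit of 10 concurrent files; the default below mirrors it.
MAX_CONCURRENT_DOWNLOADS = 10


def download_all(file_ids, fetch_one, max_concurrent=MAX_CONCURRENT_DOWNLOADS):
    """Fetch every file while never running more than max_concurrent at once.

    fetch_one is a hypothetical callback that downloads a single file and
    returns something describing the result (for example, a local path).
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map keeps results in the same order as file_ids and
        # waits until every download has finished.
        return list(pool.map(fetch_one, file_ids))
```

With a fixed number of worker slots, total transfer time grows roughly in proportion to the number of files divided by the cap, which is one reason a very large dataset can take weeks to pull down.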
Managing access to the data
With these data in hand, Bryant faced a second challenge: managing who could use them. The UK Biobank uses a careful approval process to protect its data, so Bryant needed a way to ensure that only approved users could access the Terra dataset.
The solution was to create a Terra group of approved researchers. Researchers who wanted to access the data, including the Neale lab’s collaborators, applied for approval from the UK Biobank, which then let Bryant know whom he could add to the Terra group. Once added to the group, collaborators could access and analyze the data from inside or outside of Terra.
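The bookkeeping behind a group-based approach like this can be sketched in a few lines of Python. This is only an illustration, not Terra’s API: `plan_group_changes` is a hypothetical helper, and the idea of flagging members for removal is an assumption added here (the post only describes adding approved researchers).

```python
def plan_group_changes(approved_emails, current_members):
    """Compare an approval list against a group's current membership.

    Both arguments are iterables of email addresses. Returns who should be
    added to the group and who is in the group without a matching approval.
    Hypothetical helper for illustration; it does not call any Terra API.
    """
    # Normalize so that case or stray whitespace does not cause mismatches.
    approved = {email.strip().lower() for email in approved_emails}
    current = {email.strip().lower() for email in current_members}
    return {
        "add": sorted(approved - current),
        # Assumption: members without a current approval should be reviewed.
        "remove": sorted(current - approved),
    }
```

A data manager could run a comparison like this each time a new approval list arrives, then apply the resulting changes to the group by hand or through whatever tooling they prefer.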
An alternative method: using DUOS to manage controlled data on Terra
In many cases, Bryant’s method is still the best way to share controlled-access data with collaborators. However, it works best when data managers are sharing their data with known collaborators. DUOS (the Data Use Oversight System) offers an alternative way to share controlled-access data from the Broad and the National Human Genome Research Institute (NHGRI). DUOS speeds up the data-access application process by automatically updating access permissions whenever someone is approved. This system makes it easy to share data with researchers the data manager does not already know, as well as with collaborators. You can learn more about how DUOS simplifies access to controlled-access data in Streamlining Data Access and in DUOS’ documentation.
The data’s scientific impact
Since 2017, research groups have accessed the UK Biobank dataset on Terra to answer questions about several aspects of human health. These include coronary artery disease (Fahed et al., 2022; Patel et al., 2022; Patel et al., 2023; Dron et al., 2023; Khera et al., 2022); Alzheimer’s disease (Paranjpe et al., 2022); obesity (Agrawal et al., 2022); liver disease (Haas et al., 2021); heteroplasmy (Gupta et al., 2023); and clonal hematopoiesis (Brown et al., 2023). These data have also supported a growing understanding of how the human genome and phenome are structured, for example by uncovering correlations within the human phenome (Carey et al., 2022). In addition, the UK Biobank dataset has helped researchers better understand the effects of a dataset’s size and diversity on machine learning models (Cui et al., 2023; Majara et al., 2023).
How can you access this data?
The Broad’s UK Biobank dataset is still available on Terra. This dataset is a subset of the full UK Biobank data; the remainder is accessible via DNAnexus.
If you’d like to leverage this dataset for your own research, and you’ve already applied for access to the UK Biobank, follow the instructions in this document. If you have not yet applied for access to the UK Biobank, contact Sam Bryant directly at firstname.lastname@example.org. And to learn more about sharing controlled data on Terra, see Best practices for sharing and protecting data resources and Managing access to shared data and tools with groups.
Many thanks to Sam Bryant, Caroline Cusick, and Jonathan Lawson for their help preparing this post.