This post is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.
Phenotype and genetic analysis of data collected within the first year of NeuroDev: A Pilot Study
By Patricia Kipkemoi, Heesu Ally Kim, Bjorn Christ, Emily O’Heir, Jake Allen, Christina Austin-Tse, Samantha Baxter, Harrison Brand, Sam Bryant, Nick Buser, Victoria de Menil, Emma Eastman, Alice Galvin, Martha Kombe, Collins Kipkoech, Alysia Lovgren, Daniel G. MacArthur, Brigitte Melly, Katini Mwangasha, Alba Sanchis-Juan, Moriel Singer-Berk, Michael E. Talkowski, Grace VanNoy, Celia van der Merwe, The NeuroDev Project, Charles Newton, Anne O’Donnell-Luria, Amina Abubakar, Kirsten A Donald, and Elise Robinson
Preprint in medRxiv (2022) https://doi.org/10.1101/2022.08.22.22278891
Abstract: Genetic association studies have made significant contributions to our understanding of the aetiology of neurodevelopmental disorders (NDDs). However, the vast majority of these studies have focused on populations of European ancestry, and few include individuals from the African continent. The NeuroDev project aims to address this diversity gap through detailed phenotypic and genetic characterization of children with NDDs from Kenya and South Africa. Here we present results from NeuroDev’s first year of data collection, including phenotype data from 206 cases and clinical genetic analysis of 99 parent-child trios. The majority of the cases met criteria for global developmental delay/intellectual disability (GDD/ID, 80.3%). Approximately half of the children with GDD/ID also met criteria for autism spectrum disorders (ASD), and 14.6% met criteria for ASD alone. Analysis of exome sequencing data identified a pathogenic or likely pathogenic variant in 13 (17%) of the 75 cases from South Africa and 9 (38%) of the 24 cases from Kenya. Candidate novel disease gene variants in 7 total cases were matched through MatchMaker Exchange. Data from the trio pilot cases has already been made publicly available, and the NeuroDev project will continue to develop resources for the global genetics community.
What part of the work was done in Terra?
Excerpts from the paper’s Methods section:
Genetic analyses of the trio pilot data
In brief, exome sequencing was performed on each of the trios, and the data was uploaded to the seqr platform for analysis. […]
Exome Sequencing & Data Processing Methods
Exome data was processed through a pipeline based on Picard, using base quality score recalibration and local realignment at known indels, aligned to the human genome build 38 using BWA, and jointly analyzed for single nucleotide variants (SNVs) and insertions/deletions (indels) using Genome Analysis Toolkit (GATK) Haplotype Caller package version 18.104.22.168. […] Basic functional annotation will be performed using Variant Effect Predictor (VEP), and then the joint variant call file will be uploaded to the seqr platform for further annotation and analysis.
Copy-number variants (CNVs) were discovered from the exome sequencing data following GATK-gCNV best practices. Read coverage was calculated for each exome using GATK CollectReadCounts. After coverage collection, all samples were subdivided into batches for gCNV model training and execution; these batches were determined based on a principal components analysis (PCA) of sequencing read counts. After batching, one gCNV model was trained per batch using GATK GermlineCNVCaller on a subset of training samples, and the trained model was then applied to call CNVs for each sample per batch. Finally, all raw CNVs were aggregated and post-processed using quality- and frequency-based filtering to produce a final CNV callset.
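To make the final step of that paragraph concrete, here is a minimal sketch of what "quality- and frequency-based filtering" of raw CNV calls can look like. It is an illustration only, not the authors' code or the GATK-gCNV implementation: the CnvCall record, the filter_cnv_calls function, the quality score, and both thresholds are hypothetical, and a "site" is simplified to an exact coordinate match, whereas production pipelines typically cluster overlapping calls.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CnvCall:
    """Hypothetical record for one raw CNV call (not a GATK data structure)."""
    sample: str
    contig: str
    start: int
    end: int
    kind: str       # "DEL" or "DUP"
    quality: float  # assumed per-call quality score; higher is better


def filter_cnv_calls(calls, n_samples, min_quality=50.0, max_site_frequency=0.01):
    """Keep calls that pass a quality cutoff and occur at a rare site.

    Both thresholds are placeholder values chosen for illustration.
    """
    # Apply the quality filter first so low-quality calls do not
    # inflate the apparent frequency of a site.
    passing = [c for c in calls if c.quality >= min_quality]

    # Count the distinct samples carrying each (contig, start, end, kind) site.
    carriers = defaultdict(set)
    for c in passing:
        carriers[(c.contig, c.start, c.end, c.kind)].add(c.sample)

    # Keep only calls whose site frequency in the cohort is at or below the cutoff.
    kept = []
    for c in passing:
        site = (c.contig, c.start, c.end, c.kind)
        if len(carriers[site]) / n_samples <= max_site_frequency:
            kept.append(c)
    return kept


# Toy cohort of four samples, with a deliberately loose frequency cutoff.
example_calls = [
    CnvCall("S1", "chr1", 1000, 5000, "DEL", 80.0),
    CnvCall("S2", "chr1", 1000, 5000, "DEL", 90.0),
    CnvCall("S3", "chr1", 1000, 5000, "DEL", 20.0),
    CnvCall("S1", "chr2", 200, 900, "DUP", 75.0),
    CnvCall("S4", "chr3", 10, 400, "DEL", 10.0),
]
example_kept = filter_cnv_calls(
    example_calls, n_samples=4, min_quality=50.0, max_site_frequency=0.25
)
```

In this toy cohort the chr1 deletion is dropped because two of the four samples carry it at passing quality, the chr3 deletion is dropped for low quality, and only the chr2 duplication in S1 remains. Filtering on quality before counting carriers is a design choice made for this sketch; a real callset may order or define these filters differently.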
Exome Sequencing Data Analysis Process
Upon completion of data generation, both the SNV/indel and CNV callsets were uploaded to seqr, the centralized genomic analysis platform used by the Broad Institute’s Center for Mendelian Genomics (CMG). […]
How did they do it?
The authors processed exome sequencing data and discovered CNVs using the GATK Best Practices GermlineCNVCaller workflow in Terra. The workflow is written in the Workflow Description Language (WDL) and is available in the Broad Methods Repository and in a public Terra workspace, where you can test it out on example data at very low cost.
Terra also supports importing workflows from Dockstore, a free and open-source platform for sharing reusable and scalable analytical tools and workflows.
To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.
The authors further analyzed the resulting variant calls using seqr, an intuitive browser-based system for analyzing rare disease exome and genome data on a family basis, which is available on AnVIL, powered by Terra.
Appendix: Data and code availability
– Controlled-access genetic and phenotypic data collected during the course of this study are available via the National Human Genome Research Institute (NHGRI) Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) platform.