This blog post is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied.
Molecular map of chronic lymphocytic leukemia and its impact on outcome
By Binyamin A. Knisbacher, Ziao Lin, Cynthia K. Hahn, Ferran Nadeu, Martí Duran-Ferrer, Gad Getz, Chip Stewart, Catherine J. Wu et al.
Nature Genetics (2022) https://doi.org/10.1038/s41588-022-01140-w
Abstract: Recent advances in cancer characterization have consistently revealed marked heterogeneity, impeding the completion of integrated molecular and clinical maps for each malignancy. Here, we focus on chronic lymphocytic leukemia (CLL), a B cell neoplasm with variable natural history that is conventionally categorized into two subtypes distinguished by extent of somatic mutations in the heavy-chain variable region of immunoglobulin genes (IGHV). To build the ‘CLL map,’ we integrated genomic, transcriptomic and epigenomic data from 1,148 patients. We identified 202 candidate genetic drivers of CLL (109 new) and refined the characterization of IGHV subtypes, which revealed distinct genomic landscapes and leukemogenic trajectories. Discovery of new gene expression subtypes further subcategorized this neoplasm and proved to be independent prognostic factors. Clinical outcomes were associated with a combination of genetic, epigenetic and gene expression features, further advancing our prognostic paradigm. Overall, this work reveals fresh insights into CLL oncogenesis and prognostication.
What part of the work was done in Terra?
Excerpts from the paper’s Methods section:
Sequence data processing and analysis
All sequencing data (WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing) were processed and analyzed using methods implemented in the Terra platform (https://app.terra.bio). The main Terra methods are available at https://app.terra.bio/#workspaces/broad-firecloud-wupo1/CLLmap_Methods_Apr2021 […]
RNA-seq data were processed in Terra using the GTEx V7 pipeline (https://github.com/broadinstitute/gtex-pipeline). Briefly, reads were aligned with STAR (v2.6.1d) to hg19 (b37) using the GENCODE v19 annotation, and quality control metrics and gene expression were computed with RNA-SeQC v2.3.6 (https://github.com/getzlab/rnaseqc). A collapsed version of the GENCODE annotation was used to quantify gene-level expression (available at gs://gtex-resources/GENCODE/gencode.v19.genes.v7.collapsed_only.patched_contigs.gtf). TPMs were used for sample clustering, whereas gene counts were used for differential gene expression, as required […]
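To illustrate the distinction between TPMs and raw gene counts mentioned in the excerpt above, here is a minimal Python sketch of the standard TPM calculation. It is not the authors' code: the paper computes expression with RNA-SeQC, and the function name, the toy counts and the gene lengths below are hypothetical values chosen for illustration.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw gene counts for one sample to TPM (transcripts per million).

    counts     -- read counts per gene
    lengths_bp -- gene lengths in base pairs, in the same order as counts
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Reads per kilobase: normalize each count by its gene length.
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rpk)
    if total == 0:
        return [0.0] * len(counts)
    # Rescale so the values within the sample sum to one million.
    return [r / total * 1e6 for r in rpk]


# Toy example (made-up numbers): three genes in a single sample.
toy_counts = [100, 200, 300]
toy_lengths = [1000, 2000, 3000]
toy_tpm = counts_to_tpm(toy_counts, toy_lengths)
```

Because TPM values are normalized for gene length and sequencing depth, they are comparable across samples, which suits clustering. Differential expression tools, by contrast, generally expect raw counts, which is consistent with the excerpt's "as required."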
DNA methylation data processing
DNA methylome data was analyzed for a total of 1,037 samples, including 490 samples profiled with Illumina 450K array previously analyzed (ref. 52 in the paper; European Genome-phenome Archive (EGA) accession EGAD00010001975), and 547 samples profiled using RRBS with either single-end or paired-end approaches. A pipeline was developed in Terra to obtain the CpG methylation estimates from RRBS data (Supplementary Note). The epitype classifier and the epiCMIT mitotic clock were previously developed for Illumina 450K and EPIC array data […]
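As a rough illustration of what a "CpG methylation estimate" means for bisulfite sequencing data such as RRBS, a per-site estimate is commonly computed as the fraction of reads that support methylation at that site. The sketch below is not the authors' pipeline, which is described in the paper's Supplementary Note. The function name, the `min_coverage` threshold and the toy read counts are all assumptions made for illustration.

```python
def methylation_fraction(methylated_reads, unmethylated_reads, min_coverage=10):
    """Return the fraction of reads supporting methylation at one CpG site.

    Returns None when the site has fewer than min_coverage reads, since an
    estimate from very few reads is unreliable. The default threshold of 10
    is an illustrative assumption, not a value taken from the paper.
    """
    total = methylated_reads + unmethylated_reads
    if total < min_coverage:
        return None
    return methylated_reads / total


# Toy example (made-up numbers): (methylated, unmethylated) reads per site.
toy_sites = {
    "site_A": (18, 2),  # well covered
    "site_B": (3, 1),   # below the coverage threshold
}
estimates = {
    site: methylation_fraction(meth, unmeth)
    for site, (meth, unmeth) in toy_sites.items()
}
```

In this toy data, `site_A` yields an estimate while `site_B` is reported as missing because it falls below the assumed coverage threshold.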
How did they do it?
The authors processed and analyzed all sequencing data (WES, WGS, RNA-seq, RRBS, and targeted NOTCH1 sequencing) using WDL workflows run in Terra. The methods used in this study can be found in the Terra workspace at https://app.terra.bio/#workspaces/broad-firecloud-wupo1/CLLmap_Methods_Apr2021.
If you are new to Terra, try running a workflow yourself with the Quickstart Tutorial Workspace.
Appendix: Data and code availability
- The molecular data used in this study are publicly available and are included in the following patient cohorts: DFCI, Dana-Farber Cancer Institute; GCLLSG, German CLL Study Group; ICGC, International Cancer Genome Consortium; MDACC, MD Anderson Cancer Center; NHLBI, National Heart Lung and Blood Institute; UCSD, University of California San Diego. Sequencing, expression, and genotyping data are available at EGA (http://www.ebi.ac.uk/ega/), which is hosted at the European Bioinformatics Institute, under accession number EGAS00000000092 (ICGC cohort), in dbGaP under accession numbers phs001473.v2.p1 (MDACC, NHLBI), phs000922.v2.p1 (GCLLSG), phs001431.v2.p1 (DFCI, UCSD), phs001091.v1.p1 (MDACC), phs000435.v3.p1 (DFCI), phs002297.v2.p1 (NHLBI) and phs000879.v1.p1 (DFCI), and in GEO under accession number GSE143673 (GCLLSG). 450K array data are available at EGA under accession number EGAD00010001975 (ICGC). The project data portal is available at https://cllmap.org.
- Terra methods used in the study can be found at https://app.terra.bio/#workspaces/broad-firecloud-wupo1/CLLmap_Methods_Apr2021. Source code used in the study can be found at https://github.com/getzlab/CLLmap. The RFcaller pipeline is available at https://github.com/xa-lab/RFcaller. The new epiCMIT suitable for Illumina arrays and NGS approaches as well as the CLL epitype classifier can be found at https://github.com/Duran-FerrerM/CLLmap-epigenetics.