Paper Spotlight: Single-cell analysis of human primary prostate cancer reveals the heterogeneity of tumor-associated epithelial cell states

This blog is part of our Paper Spotlight series, which features peer-reviewed research publications involving work done in Terra and highlights how the analysis methods were applied. 


Single-cell analysis of human primary prostate cancer reveals the heterogeneity of tumor-associated epithelial cell states

By Hanbing Song, Hannah N. W. Weinstein, Paul Allegakoen, Marc H. Wadsworth II, Alex K. Shalek, Franklin W. Huang et al., 2022

Nature Communications 13, 141 (2022)

Abstract: Prostate cancer is the second most common malignancy in men worldwide and consists of a mixture of tumor and non-tumor cell types. To characterize the prostate cancer tumor microenvironment, we perform single-cell RNA-sequencing on prostate biopsies, prostatectomy specimens, and patient-derived organoids from localized prostate cancer patients. We uncover heterogeneous cellular states in prostate epithelial cells marked by high androgen signaling states that are enriched in prostate cancer and identify a population of tumor-associated club cells that may be associated with prostate carcinogenesis. ERG-negative tumor cells, compared to ERG-positive cells, demonstrate shared heterogeneity with surrounding luminal epithelial cells and appear to give rise to common tumor microenvironment responses. Finally, we show that prostate epithelial organoids harbor tumor-associated epithelial cell states and are enriched with distinct cell types and states from their parent tissues. Our results provide diagnostically relevant insights and advance our understanding of the cellular states associated with prostate carcinogenesis.



What part of the work was done in Terra?

Excerpts from the paper’s Methods section:


Single-cell RNA sequencing

Sequencing was largely based on the Seq-Well S^3 protocol. […] The sequenced data were preprocessed and aligned using the dropseq_workflow on Terra. A digital gene expression matrix was generated for each sample, parsed, and analyzed following a customized pipeline. Additional details are provided below.

Sequencing results were returned as paired FASTQ reads […] Then, the paired FASTQ files were aligned against the reference genome using a STAR aligner (v2.7.6a) built within the dropseq workflow (Snapshot 7). The aligning pipeline output included aligned and corrected bam files, two digital gene expression (DGE) matrix text files (a raw read count matrix and a UMI-collapsed read count matrix where multiple reads that matched the same UMI would be collapsed into one single UMI count) and text-file reports of basic sample qualities such as the number of beads used in the sequencing run, total number of reads, alignment logs. 
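
Editor's illustration (not part of the paper's Methods): the difference between the raw read count matrix and the UMI-collapsed matrix can be sketched in a few lines of Python. This is a minimal sketch that assumes each aligned read has already been reduced to a (cell barcode, gene, UMI) triple and that UMIs are compared by exact match only. The function name `build_dge` and the toy reads are hypothetical; production tools may also merge near-identical UMIs to tolerate sequencing errors, which this sketch does not attempt.

```python
from collections import defaultdict


def build_dge(reads):
    """Build raw and UMI-collapsed counts from (cell_barcode, gene, umi) triples.

    Hypothetical helper for illustration only; UMIs are matched exactly.
    """
    raw_counts = defaultdict(int)   # every aligned read counts once
    umi_sets = defaultdict(set)     # distinct UMIs seen per (cell, gene)
    for cell, gene, umi in reads:
        raw_counts[(cell, gene)] += 1
        umi_sets[(cell, gene)].add(umi)
    # Reads sharing a UMI for the same cell and gene collapse to one count.
    umi_counts = {key: len(umis) for key, umis in umi_sets.items()}
    return dict(raw_counts), umi_counts


# Toy input: three reads for one cell/gene pair, two of which share a UMI.
reads = [
    ("CELL1", "KLK3", "AAAA"),
    ("CELL1", "KLK3", "AAAA"),
    ("CELL1", "KLK3", "CCCC"),
    ("CELL2", "AR", "GGGG"),
]
raw_counts, umi_counts = build_dge(reads)
print(raw_counts)
print(umi_counts)
```

In this toy input, the two reads that share the UMI "AAAA" count twice in the raw matrix but only once in the UMI-collapsed matrix, which is why the collapsed matrix is less sensitive to amplification duplicates.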

Whole-exome sequencing

The remaining frozen single cells from the prostate tumor and matching normal specimens (N = 3 patients) were processed for genomic DNA and underwent whole-exome sequencing (Novogene) […] Sequencing data were then aligned to the GRCh38/hg38 reference genome. Somatic single-nucleotide variants (SNVs) were identified using our in-house pipeline which integrated somatic variant caller Mutect2 and annotation using Funcotator. The list of SNVs was further filtered using the following criteria: (a) variants with less than a minimum read depth of ten reads were excluded, (b) variants with less than three supporting reads of the altered nucleotide were excluded, (c) variants with a variant allele frequency of less than 5% were excluded. Somatic copy number alterations were identified using the GATK somatic CNV pipeline.
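
Editor's illustration (not part of the paper's Methods): the three SNV filtering criteria quoted above can be expressed as a single predicate. This is a minimal sketch, not the authors' in-house pipeline. It assumes that "read depth" means the total number of reads covering the site and that variant allele frequency is computed as supporting reads divided by depth; the `Variant` class, the function name, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical minimal record for one called SNV."""
    name: str
    depth: int      # total reads covering the site (assumed definition)
    alt_reads: int  # reads supporting the altered nucleotide

    @property
    def vaf(self) -> float:
        # Variant allele frequency, assumed here to be alt_reads / depth.
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_filters(v, min_depth=10, min_alt_reads=3, min_vaf=0.05):
    """Apply criteria (a), (b), and (c) from the quoted Methods text."""
    return (
        v.depth >= min_depth             # (a) at least ten reads of depth
        and v.alt_reads >= min_alt_reads # (b) at least three supporting reads
        and v.vaf >= min_vaf             # (c) allele frequency of at least 5%
    )


# Made-up example variants, one per outcome.
variants = [
    Variant("kept", depth=50, alt_reads=5),
    Variant("low_depth", depth=8, alt_reads=4),
    Variant("few_alt_reads", depth=40, alt_reads=2),
    Variant("low_vaf", depth=100, alt_reads=4),
]
kept = [v.name for v in variants if passes_filters(v)]
print(kept)
```

Each made-up variant other than the first fails exactly one criterion, so only one variant survives the filter. Keeping the thresholds as keyword arguments makes it easy to see how the retained set changes if a cutoff is adjusted.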


How did they do it?

To process the single-cell sequencing data, the authors used the dropseq_workflow from the Cumulus framework (also described in this blog post and Nature Methods paper). This workflow is written in the Workflow Description Language (WDL) and shared in the Broad Methods Repository under cumulus/dropseq_workflow (requires sign-in). “Snapshot 7” in the methods text refers to the specific version (snapshot) of the workflow that they used.

Terra also supports importing workflows from Dockstore, a free and open source platform for sharing reusable and scalable analytical tools and workflows. 

To process the exome sequencing data, the authors used workflows based on the GATK Best Practices for somatic short variant discovery and copy number variant discovery, which are available, preconfigured to run on example data, in public Terra workspaces (here and here respectively).

The authors ran the workflows at scale using Terra’s workflow execution service.

To try your hand at running a workflow in Terra, check out this Quickstart Tutorial Workspace.


Appendix: Data and code availability

  • Raw single-cell RNA-sequencing FASTQ files and gene expression matrix files generated in this study have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE176031.
  • Whole-exome sequencing FASTQ files of the three primary prostate cancer patients in this study have been deposited in the European Genome-phenome Archive (EGA) under accession number EGAS00001005685.
  • See above for links to relevant workflow scripts and Terra workspaces.
  • Additional code used in the manuscript is available in the following GitHub repository: