New year, new partnership… and a new blog series highlighting papers that we think will be of interest to many of you. For this first iteration, we review a review paper (review-ception!) about workflow systems, coming out of C. Titus Brown’s lab at UC Davis and fresh off the virtual press over at GigaScience.
Taylor Reiter, Phillip T Brooks, Luiz Irber, Shannon E K Joslin, Charles M Reid, Camille Scott, C Titus Brown, N Tessa Pierce-Ward, Streamlining data-intensive biology with workflow systems, GigaScience, Volume 10, Issue 1, January 2021, giaa140, https://doi.org/10.1093/gigascience/giaa140
As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
Read on to learn why this paper is a must-read if you’re getting started with workflows.
This paper in a nutshell:
Everything you need to know to get started with bioinformatics workflows
Seriously, this review covers an impressive amount of ground.
It starts with an accessible explanation of what workflows are, and why they are such an important and rapidly growing part of biological data analysis, which I expect will be very helpful to anyone who might be new to the challenges posed by Really Large Datasets™.
Then, the authors provide a clear and concise review of the main types of workflows, languages and systems that you might encounter — including WDL, Terra’s current workflow language of choice, which they identify alongside CWL as “workflow specification formats that are more geared towards scalability, making them ideal for production-level pipelines with hundreds of thousands of samples” (yep, that checks out). They also touch on software management systems, including container systems (like Docker) and package managers (like Conda), and how these systems integrate with workflow systems.
That content alone is solidly informative, and we’re not even at the halfway point yet.
There’s a lot more in there, starting with a set of best-practice recommendations for managing a workflow-based project. This includes what to document (everything), how to document it (consistently) and what tools exist for visualization, version control and collaboration. I was nodding so hard reading that section, I pulled a neck muscle.
From there, the authors move to a series of practical recommendations for actually getting started with workflows, including finding and accessing compute resources. As stated in the abstract, these are presented “in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.” I found myself agreeing vehemently once more: the “Strategies for troubleshooting” section should be required reading for every researcher who ever comes within three feet (~1m) of a computer, regardless of their field of study.
I could go on, but frankly at this point you’d be better off just reading the review itself. It’s solidly researched and well supported, insightful, clearly written and just beautifully scoped overall — well worth your time if you’re somewhat or completely new to workflows. Or even if you’re not so new and you’re willing to consider that your habitual practices might still have some room for improvement!
For an introduction to running workflows on Terra, see the Workflows documentation.