This blog post is part of a series based on the paper “Ten simple rules for large-scale data processing” by Arkarachai Fungtammasan et al. (PLOS Computational Biology, 2022). Each installment reviews one of the rules proposed by the authors and illustrates how it can be applied when working in Terra. In this first installment, we cover data and tooling resources that Terra users can take advantage of to avoid doing unnecessary work.
We kick off this “Ten simple rules” series with “Don’t reinvent the wheel”, a classic maxim that is ubiquitous in programming advice forums yet tragically underappreciated in the world of research computing. Certainly a fitting start to any list of guiding principles for tackling computational science at scale.
In their paper, Arkarachai Fungtammasan and colleagues address this rule mainly from the point of view of data resources, emphasizing that, before you set out to process a large body of data, you should check whether the work might have been done for you already:
[…] In short, undertaking large-scale data processing will require substantial planning time, implementation time, and resources.
There are many data resources providing preprocessed data that may meet all or nearly all of one’s needs. For example, Recount3 [4,5], ARCHS4 , and refine.bio  provide processed transcriptomic data in various forms and processed with various tool kits. CBioPortal [1,8] provides mutation calls for many cancer studies. Cistrome provides both data and tool kit for transcription factor binding and chromatin profiling [9,10]. A research project can be substantially accelerated by starting with an existing data resource.
This focus on data surprised me a little, because in my experience, the “Don’t reinvent the wheel” rule is more commonly invoked to advocate for using existing bioinformatics tools and workflows rather than writing new ones. However the authors are not wrong to call out the usefulness of looking for already processed data, particularly in an age when large data generation initiatives are being developed specifically for the purpose of making data available for mining by the wider research community.
In the Terra ecosystem, there are multiple research consortia that are making data resources available in a form that has already been processed through standardized workflows, so that researchers can focus their resources on downstream analysis, and that can be readily imported into Terra for analysis. For example, the Human Cell Atlas provides a multitude of analysis-ready ‘omics data resources that can be imported into a Terra workspace via the HCA Data Portal, as does the BRAIN Initiative Cell Census Network (BICCN), which offers human, non-human primate and mouse ‘omics data through its Terra-connected Neuroscience Multi-Omics (NeMO) portal.
You can check out the Terra Dataset Library to browse the various public and access-controlled datasets (spanning multiple data types and research focus areas) that are available in repositories connected to Terra.
And now, to extend the scope of discussion a little compared to the paper…
Try to reuse existing code, tools, containers, and other assets
Unless what you’re doing is unusually cutting-edge, chances are someone has already tackled a similar problem, and you may be able to reuse some of their tooling. Not to get into the debate of when it’s appropriate to write a new genome aligner from scratch — but I think we can all agree that there are some well-established data processing operations like running a variant calling pipeline on human WGS data, or generating count matrices from single-cell RNAseq data, where you can often benefit from reusing existing tools and workflows rather than rolling your own. In some cases you may need to make some modifications to adapt them to your specific use case, but that’s still a lot less work than starting from nothing.
So where do you find existing tooling?
In the context of Terra, here’s a shortlist of the best places you can look for ready-to-use tools:
1. The Terra showcase features a growing collection of public workspaces that offer fully-configured workflows, Jupyter notebooks, example data and more for a wide range of use cases. Some of these workspaces are created by tool developers to serve as a demonstration of how to run their tools. Others are created by researchers, often as companions to published papers, to recapitulate an end-to-end analysis in a fully reproducible way. The great thing they all have in common is that they combine data, tools and configuration settings that have been shown to work, so you can see in practice how the different pieces are supposed to connect. You may not find a workspace that’s an exact match for your needs, but you may find one that is close enough to use as a starting point, which can dramatically shorten the amount of setup time you need to get your analysis going.
2. For interactive analysis, Terra’s Cloud Environments system provides a menu of pre-built environment images for running applications like Jupyter Notebook and RStudio that come with sets of popular packages pre-installed to get you up and running as quickly as possible. For example, the Bioconductor environment developed as part of the AnVIL project includes the Bioconductor Core packages.
4. The Terra Notebooks Playground is a great resource for finding code examples of how to perform a variety of operations in Terra notebooks. In addition, many researchers now share Jupyter Notebook files demonstrating how to run computational analyses that they have published; many of these can be run in Terra’s Cloud Environments with only minimal adaptations.
5. For running automated pipelines at scale, the Dockstore workflow repository offers a large collection of workflows contributed by research groups around the world, with a particular emphasis on large-scale analyses and optimizations for cloud platforms. Dockstore connects directly to Terra, so once you’ve found a workflow you’re interested in, you can import the workflow script and an example configuration file with a few clicks. Most WDL workflows that you find in Dockstore can be run in Terra without any modifications. If you do need to modify the workflow code to suit your use case, either fork the original code in github and register your version in Dockstore, or bring it into the Broad Methods Repository if you want basic version control and editing capabilities without having to deal with git. There are also other sources of WDLs out there that are not registered in Dockstore, like the BioWDL project; the OpenWDL community is a good starting point to track those down.
6. Tool container repositories like Dockerhub and Quay.io can be really handy if you’re writing your own workflows. Running workflows in the cloud requires the use of “containers”, which are a way to package command-line tools into a self-contained environment that can be run on a virtual machine. One of the things we hear researchers worry about when they start moving to the cloud is that they’re not comfortable with creating their own Docker containers. The good news is that creating your own containers is actually not as difficult as it’s sometimes made out to be (if you have the right tutorial) BUT we can all agree it’s even easier if you don’t have to do it at all. Fortunately, many tool developers now provide pre-built containers through container repositories such as those listed above, and for the rest, there are community-driven projects like BioTools that make containers available for a wide range of popular bioinformatics tools. So once again, chances are you can find what you need off the shelf and not have to do it yourself.
Finally, keep in mind that reusing existing tools will not only save you a whole lot of time and effort; you will also be more likely to generate outputs that are more directly compatible with other researchers’ work. This increases the comparability of results across different studies and opens up opportunities to aggregate results into federated analyses that will deliver greater power and broader insights.
And don’t forget to share your tools and data, so the next researcher in line can also avoid having to reinvent the wheel!