“This is not a question. I just wanted to say that I’ve avoided learning how to use Terra Data tables for a long time, primarily because I’ve used workflows that require sample sheet files, but I finally got to try it out and I must say it was quite powerful.” — A computational scientist, in a recent comment to the Terra helpdesk.
Data tables are probably the most misunderstood and underutilized Terra feature to date, despite being remarkably useful once you know how to use them effectively. Let’s take a few minutes to go over why data tables are so great, then I’ll point you to some brand new tutorials that will help you master the key functionality.
In a nutshell, Terra’s data tables are designed to allow you to describe your dataset in a structured way so that you can launch workflows on batches of data of arbitrary size nearly effortlessly, and have the results associated with the appropriate data entries automatically.
For example, let’s say that you have a very simple data table in which each row represents a biological sample. The columns include a sample identifier (‘sample_id’), links to where the sequencing data files are stored (‘bam’ and ‘bam_index’), and maybe some additional metadata related to the experiment.
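To make that concrete, here is a minimal Python sketch of such a table, held as a list of rows and written out as tab-separated text. The column names ‘sample_id’, ‘bam’ and ‘bam_index’ come from the example above; the bucket paths, the ‘tissue’ metadata column and the `to_tsv` helper are made up for illustration, and the sketch does not attempt to reproduce the exact header format Terra expects on import.

```python
import csv
import io

# Hypothetical rows, one per biological sample. All paths are placeholders.
samples = [
    {
        "sample_id": "sample_01",
        "bam": "gs://example-bucket/sample_01.bam",
        "bam_index": "gs://example-bucket/sample_01.bai",
        "tissue": "blood",  # assumed extra metadata column
    },
    {
        "sample_id": "sample_02",
        "bam": "gs://example-bucket/sample_02.bam",
        "bam_index": "gs://example-bucket/sample_02.bai",
        "tissue": "saliva",
    },
]


def to_tsv(rows):
    """Serialize a list of row dicts to tab-separated text, header first."""
    columns = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tsv_text = to_tsv(samples)
print(tsv_text)
```

Each row carries its own identifier and file links, which is what lets a workflow system pick out inputs by column name alone.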
Terra’s workflow configuration interface allows you to point the workflow system to that table and say, in effect, “run this workflow on every sample in this table, taking whatever is in the column called ‘bam’ as the data input for each one” (or do the same thing for just a subset of the data). Conveniently, when the processing is finished, the output files will be added to the relevant rows in the table, so you don’t have to go digging for them anywhere; they will automatically be associated with the original data.
This is conceptually very simple, but incredibly powerful because it scales to just about any cohort size that you might want to process.
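The run-and-write-back pattern described above can be sketched in a few lines of plain Python. This is a toy model of the idea and not Terra’s actual implementation or API: `run_on_table`, `toy_workflow`, the ‘vcf’ output column and the bucket paths are all hypothetical names chosen for illustration.

```python
def run_on_table(table, workflow, input_column, output_column, only_ids=None):
    """Run `workflow` once per row and store each result back on that row.

    `only_ids` optionally restricts the run to a subset of sample IDs;
    rows outside the subset are left untouched.
    """
    for row in table:
        if only_ids is not None and row["sample_id"] not in only_ids:
            continue
        row[output_column] = workflow(row[input_column])
    return table


def toy_workflow(bam_path):
    # Stand-in for a real workflow: derives an output file name from the input.
    return bam_path.replace(".bam", ".vcf")


# Hypothetical table; the paths are placeholders.
table = [
    {"sample_id": "sample_01", "bam": "gs://example-bucket/sample_01.bam"},
    {"sample_id": "sample_02", "bam": "gs://example-bucket/sample_02.bam"},
    {"sample_id": "sample_03", "bam": "gs://example-bucket/sample_03.bam"},
]

# Process only a subset of the samples, reading inputs from the 'bam' column.
run_on_table(table, toy_workflow, "bam", "vcf", only_ids={"sample_01", "sample_03"})

for row in table:
    print(row["sample_id"], row.get("vcf", "(not run)"))
```

The loop is the same whether the table has three rows or thirty thousand, and every output ends up on the row that produced it.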
Things get a little more complicated because real datasets usually involve multiple data entities organized in separate tables. For example, your samples are typically derived from study participants; samples and participants are two different data entities that are related to each other but possess different kinds of attributes (metadata, associated data files, etc.), so you’ll want to store them in different tables. Yet you’ll still want to link them together by including a back-reference to the relevant participant in each sample’s row.
The formal description of the various entities involved in your experimental design and how they relate to each other is called the data model.
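As a rough illustration of a two-entity data model, the sketch below keeps participants and samples in separate tables and links them through a back-reference column on each sample. It is a plain-Python analogy, not Terra’s internal representation: the ‘participant’ column name, the ‘age’ and ‘sex’ attributes, the IDs and the two helper functions are assumptions made for the example.

```python
from collections import defaultdict

# Participant table, keyed by participant ID. Attributes are hypothetical.
participants = {
    "participant_A": {"age": 54, "sex": "F"},
    "participant_B": {"age": 61, "sex": "M"},
}

# Sample table. Each row points back to the participant it was derived from.
samples = [
    {"sample_id": "sample_01", "participant": "participant_A"},
    {"sample_id": "sample_02", "participant": "participant_A"},
    {"sample_id": "sample_03", "participant": "participant_B"},
]


def dangling_references(sample_rows, participant_table):
    """Return IDs of samples whose back-reference matches no participant."""
    return [
        row["sample_id"]
        for row in sample_rows
        if row["participant"] not in participant_table
    ]


def samples_by_participant(sample_rows):
    """Group sample IDs under the participant each one refers to."""
    grouped = defaultdict(list)
    for row in sample_rows:
        grouped[row["participant"]].append(row["sample_id"])
    return dict(grouped)


print(dangling_references(samples, participants))
print(samples_by_participant(samples))
```

Keeping the link as an ID in the sample row means participant metadata lives in one place, and any sample can be traced back to it with a single lookup.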
There’s a whole raft of benefits that comes with having a well-structured data model in place. I’m not going to go further into detail about that here, but hopefully, I’ve sparked your interest enough that you want to learn more, yes?
If you haven’t worked with data tables yet, or if you have tried and struggled, be sure to check out our brand new tutorials on the basic use of data tables, importing or adding your own data tables, and working with “sets”.
Stay tuned for a follow-up blog post in which I’ll cover some more advanced topics related to data tables. On the docket: what happens when you import data from various repositories, how we’re thinking about data model harmonization for federated analyses, and an early implementation of namespaces for managing data table attributes.