I remember when I first started hearing about Docker containers; I didn’t really understand what they were, and anything I googled on the topic seemed awfully complicated. And I’ll admit, getting used to working with cloud storage was rough too. What do you mean it’s not a filesystem? What’s object storage? Buckets? Wait, did we suddenly shift to talking about closet organizers?
Real talk. Many of you have probably been thrust into the world of cloud computing without any formal preparation or training. And as much as Terra’s graphical interface tries to reduce the underlying complexity of cloud computing, the hard truth is that there are still some cloud infrastructure terms and concepts that you need to understand reasonably well to do your work successfully on this platform. You can of course turn to the Terra knowledge base, which is growing richer every day (check out the new videos on YouTube), but you might find yourself wishing for something more like a comprehensive course that walks you through it all, step by step, with detailed exercises.
This training gap is something that has bothered me for a long time, so I’m very pleased that I got the opportunity to pair up with Brian O’Connor (from the UCSC team that brought you Dockstore) to develop a textbook that would fill that gap, with the support of software publishing powerhouse O’Reilly Media (as in the animal books). And I’m even more pleased that the book finally came out earlier this month!
The book is called Genomics in the Cloud and is accordingly focused on genomic analyses, but the cloud concepts, tools and processes that it covers apply about equally well to other fields. In fact, aside from the two chapters that focus specifically on GATK Best Practices (Ch. 6 and Ch. 7), you could adapt pretty much all the exercises to run your favorite command line tools instead of HaplotypeCaller, the variant calling tool that we use as our show pony in most of the book.
Here is the table of contents:
- Foreword by Dr. Eric Lander, Founding Director of the Broad Institute
- Preface: Purpose, Audience and Scope of this book
- Ch. 1: Introduction
- Ch. 2: Genomics in a Nutshell: A Primer for Newcomers to the Field
- Ch. 3: Computing Technology Basics for Life Scientists
- Ch. 4: First Steps in the Cloud
- Ch. 5: First Steps with GATK
- Ch. 6: GATK Best Practices for Germline Short Variant Discovery
- Ch. 7: GATK Best Practices for Somatic Variant Discovery
- Ch. 8: Automating Analysis Execution with Workflows
- Ch. 9: Deciphering Real Genomics Workflows
- Ch. 10: Running Single Workflows at Scale with Pipelines API
- Ch. 11: Running Many Workflows Conveniently in Terra
- Ch. 12: Interactive Analysis in Jupyter Notebook
- Ch. 13: Assembling Your Own Workspace in Terra
- Ch. 14: Making a Fully Reproducible Paper
To be clear, we don’t have you start working in Terra straight away; we actually spend the first seven hands-on chapters (Ch. 4 through Ch. 10) working in a simple virtual machine on Google Cloud, pulling Docker containers manually and running commands in the terminal. Our goal there is to familiarize you with the components of cloud infrastructure well enough that by the time you get to Terra in Chapter 11, you understand what’s under the hood and can focus on learning to use Terra features optimally. Brian and I believe that this approach will also equip you with transferable skills in case you need to use other platforms.
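To give a rough flavor of what "pulling Docker containers manually and running commands in the terminal" involves, here is a minimal Python sketch that assembles and prints the two commands. It is not taken from the book. The image name (`broadinstitute/gatk`, the public GATK image on Docker Hub), the file names, the mount paths and the helper function names are all assumptions chosen for illustration; the book's exercises may use different images, tags and paths.

```python
import shlex

# Assumption: the public GATK image on Docker Hub; a real exercise
# would likely pin a specific version tag.
GATK_IMAGE = "broadinstitute/gatk"


def docker_pull_command(image):
    """Return the argv list that downloads a container image."""
    return ["docker", "pull", image]


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Return the argv list that runs a tool inside a container.

    host_dir is mounted at container_dir so the tool can read inputs
    and write outputs that remain on the VM after the container exits.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",
        image,
        *tool_args,
    ]


# Hypothetical file names and paths, for illustration only.
haplotypecaller_args = [
    "gatk", "HaplotypeCaller",
    "-R", "/data/ref.fasta",
    "-I", "/data/sample.bam",
    "-O", "/data/sample.vcf.gz",
]

pull_cmd = docker_pull_command(GATK_IMAGE)
run_cmd = docker_run_command(GATK_IMAGE, haplotypecaller_args, "/home/me/book")

# Print the commands rather than executing them, so the sketch runs
# on a machine that does not have Docker installed.
print(shlex.join(pull_cmd))
print(shlex.join(run_cmd))
```

On a machine that does have Docker, each list could be passed to `subprocess.run(cmd, check=True)` to actually execute it. Swapping `haplotypecaller_args` and the image for another command line tool and its container is all it would take to adapt the pattern to a different analysis.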
The Terra-focused chapters are accompanied by a fully public workspace called Genomics-in-the-Cloud-v1 that is set up to demonstrate workflow execution, both from files and using the data tables (Ch. 11), as well as several types of interactive analysis with Jupyter Notebooks (Ch. 12). If nothing else, I recommend you have a look at the notebook, which contains all the exercises from Chapter 12 and is the notebook I wish I had been given when I was getting started. The instructions are a little sparse since the detailed explanations are in the book itself, but they are calibrated to be sufficient to understand what’s going on even without reading the book.
Interested? If you have access to the O’Reilly Learning Library (sometimes called Safari) through your institution, you can start reading right away at https://oreil.ly/genomics-cloud. To get a sense of the writing and technical level, you can also browse several chapters in the Kindle version preview on Amazon. The paperback version is also previewable, though it’s printed in grayscale so the images don’t pop quite as much (especially screenshots).
If you do pick it up, I hope you’ll find it useful! Don’t hesitate to reach out if you run into any issues or have burning questions, either by commenting on this post or by writing to the Terra helpdesk; they know where to find me.