Community-maintained Notebook environments in Terra

Ever since we introduced Jupyter Notebooks in Terra, we’ve sought to provide default environments pre-loaded with software packages that are likely to interest you, to minimize the amount of setup necessary to get your work going. However, we’ve found that there’s a huge amount of variation in the needs and preferences of researchers, from the selection of packages themselves to the frequency at which people want to adopt new version updates. There’s clearly not a one-size-fits-all solution to that challenge, so we’ve developed a few complementary approaches: offering community-maintained environments, keeping legacy versions available, and providing options for building and sharing your own custom environments. Let’s take a closer look at what the first two entail, with a focus on the first; we’ll discuss custom environments in a follow-up blog post.

Community-maintained environments: Hail, Bioconductor, and Pegasus

We feel strongly that our role in this context is not to be the arbiters of what tools researchers should use, but to listen to what researchers say they need, on one hand, and on the other, to empower bioinformatics tool developers to make their tools available to the research community.

As a starting point, we try to identify software toolkits that are widely used within a particular research domain and can be provided in a dedicated Notebook environment in Terra. For example, Hail is a Python-based package for scalable data analysis specialized in genomics (e.g. genome-wide association studies); Bioconductor is a large R-based collection of bioinformatics tools, and Pegasus is a tool for single-cell and single-nucleus transcriptomics that can be used as a command-line tool or as a Python package.

To ensure that the pre-built environment for each toolkit will meet all of its requirements and provide a good user experience, we engage with the project maintainers to design the environment. For example, in the case of Bioconductor, where all of the packages cumulatively amount to many gigabytes (GB) of data, the pre-built environment does not include all of the project packages, but it contains everything you need to get started and install the packages you want very quickly.

But we don’t stop there. To make sure that the pre-built environment will stay up to date with the latest project developments, we also enable the project maintainers to update the environment themselves. This is important because many of you rely on having access to the very latest algorithm improvements and bug fixes to make progress in your work. So when the Hail team, for example, develops a bug fix in response to a bug report, they can submit an updated version of their Hail environment with the bug fix to the Terra team, who can then take in the update with minimal effort. That way, updates are effectively no longer gated on the Terra team’s availability, which speeds up the process by an order of magnitude (e.g. only weeks instead of months between updates).

This collaborative approach allows us to offer pre-built environments that are likely to be useful to many of you, and it offers developers a platform for making their tools readily available to you without requiring you to do any installation or configuration.

On that note, if you or someone you know is a developer of a widely used bioinformatics software package that can be called from Jupyter Notebooks, we’d love to discuss options for making it available in Terra! Contact us at info@terra.bio with information about use cases, important datasets the tool(s) could be applied to, and the estimated size of the tool’s user base.

Availability of legacy versions

With all our enthusiasm for bleeding-edge development, we do recognize that you don’t actually always want to use the very latest version of a software package. If you’re already deep into an analysis when a major update happens, you probably won’t want to risk breaking perfectly good code with a library update that you don’t really need. Alternatively, you might be trying to reproduce some work that you or a collaborator did some time ago with an older version of the tools.

Accordingly, we are now providing a range of versions for each pre-built environment, to accommodate the need for continued access to older versions, and you can see the changelog for all versions on GitHub (e.g. Hail, Bioconductor, Pegasus). On this point, we are working on ways to further improve the presentation and range of these options, and we are open to suggestions, so don’t hesitate to let us know what you’d like to see by leaving a comment below or posting in the Feature and Documentation Requests section of the community forum.
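
If reproducibility is your main concern, it can also help to record which package versions your analysis actually ran with, so you can tell later whether a change of environment altered anything. Here is a minimal sketch using only the Python standard library. The function names (snapshot_versions, changed_packages) and the package names in the usage example are illustrative assumptions, not part of Terra or of any of the toolkits.

```python
import json
from importlib import metadata


def snapshot_versions(dist_names):
    """Map each distribution name to its installed version (None if absent)."""
    snapshot = {}
    for name in dist_names:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = None
    return snapshot


def changed_packages(recorded, current):
    """Return {name: (recorded_version, current_version)} for every mismatch."""
    return {
        name: (recorded[name], current.get(name))
        for name in recorded
        if recorded[name] != current.get(name)
    }


if __name__ == "__main__":
    # Hypothetical list of packages to track; replace with the ones you use.
    tracked = ["numpy", "pandas"]
    print(json.dumps(snapshot_versions(tracked), indent=2))
```

You could save the printed snapshot alongside your notebook, then pass it to changed_packages together with a fresh snapshot after switching environment versions to see what moved.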

Try out the community-maintained environments today

How would you like to take one of these environments for a spin? If you already have an account set up, it’ll only take a few minutes; just hop in and follow the instructions below. (If you don’t already have an account, follow the instructions for getting started first: sign up, then set up billing with free credits from GCP.)

Let’s check out the Pegasus environment, which is maintained by Bo Li’s group at Massachusetts General Hospital as part of their Cumulus framework for single-cell and single-nucleus transcriptomics. First, go to the Cumulus workspace and clone it. Then, in your clone, open the “Cloud environment” control panel (top right corner, gear icon), and expand the “Application configuration” dropdown menu to display the pre-built environments, as shown in the screenshot below.

Note that we are still working on refining how the pre-built environments are displayed and categorized within this menu, so you may see something slightly different from the screenshot depending on when you follow these instructions.

[Screenshot: the “Cloud environment” control panel with the “Application configuration” dropdown menu expanded]

Select the Pegasus environment, then click the “Next” button at the bottom of the panel to have Terra create your new environment. This will take a couple of minutes, during which Terra talks to Google Cloud to provision a virtual machine and set it up for you using a container that holds the Pegasus software and its dependencies.

Once your cloud environment is ready (the widget will say “RUNNING”), head over to the Notebooks tab of your Cumulus workspace clone and open one of the Pegasus tutorial notebooks. You should now be able to run all the code in the notebook without any additional steps. That’s it! That’s all it takes.
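
If you want to double-check from inside a notebook that the environment really contains what you expect, a quick sanity check along these lines may help. It is only a sketch built on Python’s standard importlib.metadata module. The helper name installed_version is made up for illustration, and the distribution name "pegasuspy" is our assumption about how Pegasus is published, so adjust it if your environment reports something different.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed distribution names; edit this list to match your environment.
    for name in ("pegasuspy", "hail"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```

A package that is reported as not installed in, say, the Pegasus environment is not necessarily a problem: each pre-built environment is designed around its own toolkit.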

See the Notebooks Quickstart video if you’re not familiar with Jupyter Notebooks in Terra, and the Li lab’s Pegasus tutorial for more specific information about how to use the analysis package itself. Note that the Pegasus video shows an older version of the Terra cloud environments interface, which did not yet support community-maintained environments. Hopefully, the comparison illustrates why we are excited about this new functionality!

The steps for selecting other pre-built environments are the same starting from the “Cloud environment” control panel, which appears in the top right corner whenever you are in an open workspace. Note that you can get a complete list of the contents of any pre-built cloud environment by clicking the “What’s installed on this environment” text that appears below the dropdown menu when you have an environment selected.

Let us know what you think about this functionality in the comments below, and don’t hesitate to reach out to the Terra support team (via the helpdesk or the forum) if you run into any trouble.
