Matt Bookman is a Solutions Architect at Verily. In this guest blog post, Matt explains recent improvements to Terra that allow users to control where in the world their data is stored and computing occurs.
We are excited to announce the release of new Terra features that give Terra users greater control over data storage and computing resources by exposing the regional architecture of the Google Cloud Platform in key parts of Terra. This will empower many to reduce their storage and computing costs, as well as work with data in non-US regions.
Rest assured, however, that these new features are entirely optional. You do not need to change anything if you’re happy with how Terra helps reduce the complexity of cloud computing.
Before we dive into the detailed changes in Terra, let’s first cover a few basics of cloud architecture for context. Feel free to skip ahead if you already know how this works.
Quick primer on Cloud regions and zones
Despite the general idea that using “the cloud” means you don’t need to worry about where the physical computers that store and process your data are located, in practice Google Cloud maintains data centers in many parts of the world and lets you choose which ones you want to use. Regions are the top-level geographical distinction, with names that are reasonably descriptive, like us-central1, which refers to facilities in Council Bluffs, Iowa.
Each region is further divided into two or more zones, which correspond to separate data centers with their own physical infrastructure (power, network, etc.). This system plays an important role in limiting the impact of localized problems, and all major cloud providers use some version of this strategy.
These geographical constructs have cost consequences. For example, if you store data in a single region, you’ll pay less for that storage than if you use multi-region storage. However, if you then process that data using compute capacity in a different region, you’ll pay egress charges to move the data to that remote region.
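To make the region and zone relationship concrete, here is a minimal Python sketch. It relies on the Google Cloud naming convention in which a zone name is the region name plus a short suffix (for example, us-central1-a belongs to us-central1). The function names `region_of_zone` and `crosses_regions` are hypothetical helpers made up for this illustration and are not part of Terra or any Google library. The sketch only covers single-region buckets, not multi-region ones.

```python
def region_of_zone(zone: str) -> str:
    """Return the region part of a zone name, e.g. 'us-central1-a' -> 'us-central1'."""
    region, sep, suffix = zone.strip().rpartition("-")
    if not sep or not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def crosses_regions(bucket_region: str, compute_zone: str) -> bool:
    """True when a VM in compute_zone sits in a different region than a
    single-region bucket, which is the situation where egress charges can apply."""
    return bucket_region.strip().lower() != region_of_zone(compute_zone).lower()


if __name__ == "__main__":
    print(region_of_zone("us-central1-a"))
    print(crosses_regions("us-central1", "us-central1-b"))
    print(crosses_regions("northamerica-northeast1", "us-central1-a"))
```

A check like this can be handy before launching a large analysis, since it flags the case where data and compute sit in different regions.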
So where does Terra live?
The Terra “control plane” runs on Google Cloud Platform (GCP) services in the US region us-central1. Your workspace metadata (description, variables), data tables, and Job History (submission and orchestration metadata) from workflows are stored on Terra servers in that region as well. These centralized resources are all managed by Terra and cannot currently be assigned to other regions.
This is in contrast to the “distributed” computing and storage resources that you interact with directly in the course of your work in Terra: the storage buckets associated with your workspaces, as well as the virtual machines and disks you request for running analyses. Thanks to the new regional control features we have added, you can now override the default region settings that are normally assigned to those resources.
Resources powering the Terra control plane (top left) and storing workspace metadata (top middle) are located in the us-central1 region. Terra users can specify the region of workspace buckets (top right) as well as storage and compute resources for Cloud Environments (bottom left) and workflows (bottom right).
Region settings can affect your costs
When you run a workflow or create an interactive cloud environment, by default, Terra sets all computing resources to that same US region, us-central1, and when you create a new workspace, it sets the storage bucket to “US multi-region”. This ensures there are no egress charges for any work done using only compute and storage resources managed by Terra.
However, storage costs for multi-region buckets are higher than for buckets set to a specific region, so folks who have a lot of data in Terra-managed buckets end up paying more in the long run. Conversely, folks who are working with data stored in buckets managed outside of Terra and located in regions other than us-central1 may get charged for transfer between regions.
That is why this change is important: you can now decide whether to accept the default or specify a region for most of the data storage and compute resources that you use in Terra.
Why “most”? Because there are a few things (listed above) that are managed “centrally” by the Terra platform which we have not regionalized yet. For this feature release, we focused on regionalizing the things that have the most impact on cost and have been requested by the community. As a result, the current state may not satisfy all data residency requirements that your data may be subject to, so be sure to check the documentation on “Working with non-US data in Terra” to evaluate compliance fit.
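The tradeoff described in this section (cheaper single-region storage versus possible transfer charges between regions) comes down to simple arithmetic. The sketch below shows one way to set it up in Python. The function name and every price in the usage example are placeholders invented for illustration, not actual Google Cloud rates, so plug in the numbers from the current pricing pages before drawing any conclusions.

```python
def net_monthly_benefit(
    stored_gb: float,
    multi_region_price: float,    # price per GB-month, multi-region bucket
    single_region_price: float,   # price per GB-month, single-region bucket
    cross_region_gb: float = 0.0, # GB moved between regions each month
    transfer_price: float = 0.0,  # price per GB transferred between regions
) -> float:
    """Monthly storage savings from a single-region bucket, minus any
    cross-region transfer charges. Positive means single-region is cheaper."""
    storage_savings = stored_gb * (multi_region_price - single_region_price)
    transfer_cost = cross_region_gb * transfer_price
    return storage_savings - transfer_cost


if __name__ == "__main__":
    # Placeholder prices only; these are NOT real Google Cloud rates.
    benefit = net_monthly_benefit(
        stored_gb=10_000,
        multi_region_price=0.03,
        single_region_price=0.02,
        cross_region_gb=500,
        transfer_price=0.01,
    )
    print(f"net monthly benefit: {benefit:.2f}")
```

If the result is negative, the transfer charges outweigh the storage savings, which suggests either keeping the default or moving the compute into the same region as the data.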
Taking advantage of region control in Terra
As outlined above, you can now choose the region of the following resources at creation time:
- Workspace buckets (your data)
- Cloud Environment virtual machines (your interactive compute)
- Workflow virtual machines (your batch compute)
For workspace buckets and Cloud Environment VMs, you will now see a region selection option in the corresponding resource creation form. For workflow VMs, you can override the default region by specifying zones per-task in your WDL workflow code.
Keep in mind that once a resource has been created, its region can no longer be changed. For example, you cannot reassign an existing workspace bucket to a different region.
See the documentation in the Terra support center under “Customizing where your data are stored and analyzed” for detailed instructions.
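For the per-task workflow override mentioned above, the sketch below uses Python to assemble the text of a WDL runtime section with a zones attribute. It assumes the convention, as I understand it from the Cromwell Google backend, that zones takes a single space-separated string of zone names. The helper names `zones_attribute` and `runtime_block`, the docker image, and the specific zones are illustrative choices only, so confirm the exact syntax against the Terra documentation referenced above.

```python
from typing import Iterable


def zones_attribute(zones: Iterable[str]) -> str:
    """Join zone names into one space-separated string (assumed format
    for the WDL runtime `zones` attribute)."""
    cleaned = [z.strip() for z in zones if z.strip()]
    if not cleaned:
        raise ValueError("at least one zone is required")
    return " ".join(cleaned)


def runtime_block(docker_image: str, zones: Iterable[str]) -> str:
    """Render the text of a WDL task runtime section that pins the task's VM
    to the given zones. Image and zones are supplied by the caller."""
    return (
        "runtime {\n"
        f'  docker: "{docker_image}"\n'
        f'  zones: "{zones_attribute(zones)}"\n'
        "}"
    )


if __name__ == "__main__":
    # Example zones in the Canadian region mentioned in this post.
    print(runtime_block(
        "ubuntu:20.04",
        ["northamerica-northeast1-a", "northamerica-northeast1-b"],
    ))
```

Listing more than one zone from the same region keeps the task in that region while leaving some flexibility about which data center actually runs it.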
The following table summarizes the changes brought by the new features.
| Feature | Past | Now |
| --- | --- | --- |
| Workspace buckets | US Multi-region | Defaults to US Multi-region, but you can choose a GCP region. |
| Workflow Virtual Machines | Defaults to the region of the workspace bucket, or us-central1 for US Multi-region buckets. Can be overridden by workflow tasks (zones attribute). | No change |
| Cloud Environment Virtual Machines | us-central1 | Defaults to the region of the workspace bucket, or us-central1 for US Multi-region buckets. You can choose a GCP region. |
Future directions
This update is a big step forward for user control of storage and compute location. With the exception of certain types of metadata, you can now control where your data is stored and processed, and thereby reduce some of your costs. We have made regional control available in the US (all US regions) and Canada (northamerica-northeast1), with additional regions to be added in the coming months.
In the long term, we aim to add more options that will give researchers and data providers even greater control over cloud computing and storage resources through Terra as we continue to make Terra a great platform for life sciences research.
Additional resources
If you have data outside of the US, please review the data policies for your region and see “Working with non-US data in Terra”. That article also lays out the plans for adding more regions to Terra around the world.
If you have data in the US and are interested in learning more about how you can save money on Cloud storage costs, see “US Regional or Multi-regional US buckets: tradeoffs”.