There’s a common misconception that the cloud is not secure enough for working with sensitive data like human genome sequence and medical records. I say misconception because, uh, yeah it is secure enough — or rather, it can be if it’s done right — but I’m not going to argue why that’s the case in this post. If that’s what you’re looking for, I recommend you check out the Security resource page on the Terra website.
I’m more interested in talking about how we balance the requirement of protecting sensitive data with the need to be able to share data with collaborators in a way that doesn’t make it super painful to get any work done.
In Terra, we achieve that balance, without compromising security, by using the “workspace” as an organizing principle. Seen through the user interface, a workspace seems like it’s just a convenient way to bundle together data and tools and keep track of analysis work in the context of a specific project. And indeed, it is all that. But under the hood, it also plays another very important role: it acts as a security perimeter for the assets it contains and is the focus of the key mechanisms that we use for sharing and securing data.
Sharing data … and responsibility
The standard way you control access to a workspace in Terra is through the “sharing” mechanism. You can share access to a workspace with an individual or with a group of people (created through the Groups management panel), and at that time you can choose what level of access they should have — just read or edit, whether they can run computations, and whether they can share the workspace with others.
With that mode of control, it’s up to you to make sure everyone you’re sharing a workspace with has the right to access the information and data contained in the workspace. If you give people sharing permissions, it’s also your responsibility to tell them whom they can and cannot share the workspace with. You’re basically handing people your keys and trusting everyone to use them wisely.
This is fine for some purposes, like making training resources with open access data. But it’s not sufficient for managing access to workspaces that contain very sensitive information, such as genome sequencing data and medical records from human participants, or data derived from them through tertiary analyses. For that kind of data, you need an additional layer of protection.
Protecting your sensitive data with an Authorization Domain
In Terra, that additional layer of protection is called an Authorization Domain (or Auth domain for short). Sounds a little bit daunting but it’s surprisingly straightforward: the basic idea is that you make a user group listing all the people who are allowed to access the data you want to protect. When you create your workspace, you select that user group as the workspace’s Auth domain. Whenever someone tries to access the workspace, Terra will check whether they’re on the list, like a bouncer at an upscale club. As a result, only people on the list will be able to access your data.
But wait, you ask, how is that different from sharing my workspace with a group using the basic sharing mechanism? Aha! That’s the million-dollar question, and the answer is that it’s not so much a different mechanism as a complementary one. For one thing, slapping an Auth domain on a workspace does not actually share that workspace with the people on the list; it just ensures that if you or someone else with access later chooses to share the workspace with them, they will be allowed to access it. Conversely, if you share the workspace with someone who is not on the Auth domain list, they will not be allowed to access the workspace even though you explicitly shared it with them! This will occasionally be annoying if you tend to forget to add new lab members to your Auth domain user list, but it is very useful for protecting your data from accidental oversharing.
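To make the "complementary" part concrete, here is a minimal sketch of the rule described above: a person gets in only if the workspace was shared with them and they are on the Auth domain list. This is a toy model, not Terra's actual implementation or API; the `Workspace` class, the `can_access` function, and the user names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Workspace:
    """Hypothetical stand-in for a workspace; not Terra's real data model."""
    name: str
    # Users the workspace has been shared with (the basic sharing mechanism).
    shared_with: Set[str] = field(default_factory=set)
    # Members of the Auth domain group, or None if no Auth domain is set.
    auth_domain: Optional[Set[str]] = None


def can_access(user: str, ws: Workspace) -> bool:
    """Both conditions must hold: shared with the user AND on the list."""
    shared = user in ws.shared_with
    on_list = ws.auth_domain is None or user in ws.auth_domain
    return shared and on_list


ws = Workspace(
    name="protected-project",
    shared_with={"alice", "carol"},
    auth_domain={"alice", "bob"},
)

for user in ("alice", "bob", "carol"):
    print(user, can_access(user, ws))
```

In this sketch alice gets in (shared and on the list), bob does not (on the list, but nobody shared the workspace with him), and carol does not either (shared, but not on the list).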
Importantly, when you clone a workspace, the copy automatically inherits its parent’s Auth domain, so the contents continue to be protected on the same terms. You can add an additional Auth domain to the copy during the cloning process if you want to restrict access further, but you cannot remove an Auth domain once it has been applied to a workspace. So as long as you manage the Auth domain user list itself carefully, your private data and all derived data will remain safe from an accidental breach.
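The cloning rule can be sketched the same way. Again, this is an illustration under assumptions rather than Terra's real code: `ProtectedWorkspace`, `clone_workspace`, `passes_auth_domains`, and the group names are hypothetical, and the sketch assumes that a workspace with several Auth domains requires membership in every one of them (which is how adding one would "restrict access further").

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Set


@dataclass(frozen=True)
class ProtectedWorkspace:
    """Hypothetical model: frozen, so Auth domains cannot be removed later."""
    name: str
    auth_domains: FrozenSet[str] = frozenset()


def clone_workspace(
    parent: ProtectedWorkspace,
    new_name: str,
    extra_auth_domains: Iterable[str] = (),
) -> ProtectedWorkspace:
    # The copy always carries every Auth domain of its parent. The only
    # change allowed is adding more groups, which narrows access further.
    return ProtectedWorkspace(
        new_name, parent.auth_domains | frozenset(extra_auth_domains)
    )


def passes_auth_domains(
    user: str, ws: ProtectedWorkspace, groups: Dict[str, Set[str]]
) -> bool:
    # Assumption: the user must belong to every Auth domain group.
    return all(user in groups.get(g, set()) for g in ws.auth_domains)


# Made-up groups and members, for illustration only.
groups = {"tcga-approved": {"alice", "bob"}, "my-lab": {"alice"}}
original = ProtectedWorkspace("study", frozenset({"tcga-approved"}))
lab_copy = clone_workspace(original, "study-copy", extra_auth_domains={"my-lab"})

print(sorted(lab_copy.auth_domains))
print(passes_auth_domains("bob", original, groups))
print(passes_auth_domains("bob", lab_copy, groups))
```

Note that there is deliberately no function for removing an Auth domain: in this model, a clone can only ever be as open as its parent, never more.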
You can read more about how all that works in this documentation article.
Accessing data in the Library controlled by third parties
Many of the datasets that are available through the Terra Data Library, such as GTEx, TARGET and TCGA, contain sensitive data collected from human participants and are therefore subject to strict security requirements laid out by the custodians of the data. The key requirement is, of course, that the original data and immediately derived data can only be accessed by duly authorized researchers. So how does that work in Terra? Yep, you guessed it, we use Auth domains for this too. It may seem a little more complicated than the previous case because it involves authentication through third parties, but in practice, it works pretty smoothly.
Let’s use the example of TCGA, aka The Cancer Genome Atlas. The agency that serves as both funder and data custodian for TCGA is the National Cancer Institute (NCI), a subdivision of the US National Institutes of Health (NIH). The NCI maintains a registry of researchers who have applied for and received authorization to access the TCGA dataset through the NIH’s dbGaP system. Meanwhile, in Terra, the Library workspaces containing access-controlled TCGA data (collected in the FireCloud portal) are protected by an Auth domain that mirrors the dbGaP list of authorized users. As a result, only researchers with active authorized dbGaP credentials can access those workspaces, as well as any clones made from them.
So how does Terra know whether you have dbGaP-authorized access to TCGA? Good news: there is no emailing involved; Terra talks directly to the NIH systems. All you need to do is link your Terra account to your NIH account (via the eRA Commons system) in your Terra user profile.
From that point on, you’ll be able to access the relevant workspaces without jumping through any additional hoops (though you will need to renew your login once a month). As an added bonus, you’ll also be able to clone the TCGA workspaces and share the copies with other authorized researchers without having to verify their dbGaP status yourself. If their account is not linked or is not in good standing, Terra will simply deny them access until they correct the situation, which they can do without going through you (unless you’re their PI, in which case, there may be some emailing involved!).
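If it helps to see the conditions spelled out, here is a rough sketch of the checks implied above: the account must be linked, the login must have been renewed recently, and the researcher must be on the dbGaP list. None of this reflects how Terra or the NIH systems are actually implemented. The names `UserProfile`, `has_tcga_access`, and `LINK_LIFETIME` are invented, and modeling "once a month" as 30 days is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set

# Assumption: "renew your login once a month" is modeled as 30 days.
LINK_LIFETIME = timedelta(days=30)


@dataclass
class UserProfile:
    """Hypothetical user profile; not Terra's real schema."""
    email: str
    nih_username: Optional[str] = None    # set once the NIH account is linked
    linked_at: Optional[datetime] = None  # time of the most recent renewal


def has_tcga_access(
    profile: UserProfile, dbgap_authorized: Set[str], now: datetime
) -> bool:
    # 1. The Terra account must be linked to an NIH account.
    if profile.nih_username is None or profile.linked_at is None:
        return False
    # 2. The link must have been renewed recently enough.
    if now - profile.linked_at > LINK_LIFETIME:
        return False
    # 3. The NIH account must be on the mirrored dbGaP list.
    return profile.nih_username in dbgap_authorized


# Made-up users and dates, for illustration only.
now = datetime(2024, 3, 1)
authorized = {"alice_nih"}
fresh = UserProfile("alice@example.org", "alice_nih", datetime(2024, 2, 20))
stale = UserProfile("alice@example.org", "alice_nih", datetime(2024, 1, 1))
unlinked = UserProfile("dave@example.org")
unlisted = UserProfile("erin@example.org", "erin_nih", datetime(2024, 2, 20))

for p in (fresh, stale, unlinked, unlisted):
    print(p.email, p.nih_username, has_tcga_access(p, authorized, now))
```

The point of the sketch is that whoever shares a cloned workspace never appears in any of these checks, which is why you don't have to verify anyone's dbGaP status yourself.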
There are of course other third parties besides NIH agencies that provide access to controlled datasets through the Terra Data Library; we work with all of them to make the process of getting secure access to the data as straightforward as possible.
So there you have it; that’s not all there is to know about data security in Terra, but it certainly is a big chunk, so let’s leave it at that for today. If you have any questions, let us know in the comments below or reach out to the Terra Helpdesk.