This blog post is part of a series based on the paper “Ten simple rules for large-scale data processing” by Arkarachai Fungtammasan et al. (PLOS Computational Biology, 2022). Each installment reviews one of the rules proposed by the authors and illustrates how it can be applied when working in Terra. In this installment, we touch on questions of costs and benefits, regulatory constraints, and opportunities to leverage the scalability of cloud infrastructure.
This series has been cruising along quite smoothly, with two fairly self-explanatory rules so far — “Don’t reinvent the wheel” and “Document everything” — but this third rule might feel like a hard turn onto bumpy terrain. Less of a rule, perhaps, and more a checklist of considerations?
Buckle up as we tackle some of the key factors that Arkarachai Fungtammasan and colleagues recommend should go into choosing a computing platform for large-scale data processing.
Spoiler: you should probably use the cloud
To kick us off, the authors define the choice of such a platform as a “multi-objective optimization” problem. This sounds complicated, but at its simplest it boils down to balancing costs — “with respect to computing cost or quota, waiting time, and implementation time” — against the value of the output, e.g. the scientific insights that you’ll be able to extract from the data.
Looking at that enumeration of what counts as a cost, it’s great to see “waiting time and implementation time” taken explicitly into account. So many discussions in this space focus primarily on the monetary costs of on-premises computing vs. cloud infrastructure, and leave out the question of how long it takes for researchers to actually get work done. It’s important to remember that people’s time is valuable too, both at a personal level and from the point of view of opportunity cost for their organization. Public cloud infrastructure tends to yield very clear benefits on this front, because its whole point is to offer compute resources that are readily available on demand.
If you’re currently working with shared on-premises computing infrastructure, think about how much time you’ve spent waiting for your jobs to make it out of the queue, only to find out that you made a small mistake, need to tweak a parameter, and have to resubmit everything. What if you could get your work done in days instead of weeks, or months instead of years?
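To make that “multi-objective optimization” framing a bit more concrete, here is a toy back-of-the-envelope sketch in Python. Every number in it, including the flat “researcher-day” rate used to price people’s time, is invented purely for illustration; plug in your own estimates.

```python
# Toy cost model for comparing compute options. All numbers are made up
# for illustration -- substitute your own estimates before drawing conclusions.
from dataclasses import dataclass

@dataclass
class ComputeOption:
    name: str
    compute_cost_usd: float     # estimated bill for the processing itself
    queue_wait_days: float      # time spent waiting for resources
    implementation_days: float  # time spent adapting pipelines to the platform
    researcher_day_usd: float   # rough cost of a researcher-day to the organization

    def total_cost(self) -> float:
        """Monetary cost plus the opportunity cost of people's time."""
        people_days = self.queue_wait_days + self.implementation_days
        return self.compute_cost_usd + people_days * self.researcher_day_usd

options = [
    ComputeOption("shared on-prem cluster", compute_cost_usd=500,
                  queue_wait_days=14, implementation_days=5, researcher_day_usd=400),
    ComputeOption("on-demand cloud", compute_cost_usd=1200,
                  queue_wait_days=0.5, implementation_days=3, researcher_day_usd=400),
]

for opt in options:
    print(f"{opt.name}: ~${opt.total_cost():,.0f} all-in")
```

With these made-up numbers, the option with the bigger compute bill comes out well ahead once waiting and implementation time are priced in, which is exactly the kind of comparison the authors are asking you to run.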
Then there’s the closely related question of scale. The biggest driver of change in this space has been the need for greater scalability to cope with the flood of data brought on by a decade of intense technological development and the falling costs of data generation.
In this context, scalability doesn’t mean “just” being able to process a lot of data; it also means that you’re able to easily reapply the same methods to other data, or re-process cohorts of data to take advantage of new computational innovations. And as the authors point out in their paper:
“[…] the investment in infrastructure for repeatability and selecting and rerunning certain subsets of the data becomes much more valuable as the processing task grows.”
We’ve seen this play out with Terra itself, as over time more research consortia have flocked to the platform to take advantage of its large-scale workflow processing capabilities and built-in support for reproducibility and collaboration. I personally find it very exciting that a wide range of organizations are choosing to invest in shared, scalable infrastructure rather than each building out their own, because it means we’re able to go much further together, and deliver economies of scale that benefit the entire ecosystem.
Navigating regulatory constraints
This brings us to actual rules that you have to understand and comply with, or face potentially severe penalties, if you’re working with sensitive or identifiable data. The specifics vary substantially depending on the applicable jurisdiction, but generally speaking every locality has policies that govern certain types of data, particularly human clinical data:
“Certain types of clinical data can require a computing platform to meet specific standards for data security and access control.”
The specific requirements may involve the need for accreditation from a relevant authority; for example, in the US, there is a government program called FedRAMP that authorizes software and service providers for use by federal agencies. Terra is one of a very small number of life sciences data platforms to be authorized under the FedRAMP program to make controlled-access datasets available to authorized researchers on behalf of certain NIH agencies.
More globally, many jurisdictions also set limitations on where data can be stored and processed:
“[…] many countries have data locality policies prohibiting data transfer out of the country.”
These policies can complicate use of cloud infrastructure, and solutions vary depending on the level of stringency adopted by a particular country. You should always check the applicable policies before uploading data to any platform.
Yet the authors also note the following:
“Performing computing where the data is located […] may make it easier to meet regulatory guidelines and avoids the cost and transfer time associated with moving large datasets to different locations.”
From the point of view of cloud infrastructure, this is the flip side of the previous point: if the cloud platform you’re interested in offers data residency in the “right” jurisdiction relative to your needs, it can indeed be a lot easier to give external collaborators access to your data and analyses by inviting them into the platform, rather than trying to figure out how to send them copies of the data while complying with all applicable data residency requirements.
Terra currently only guarantees data residency within the US, though we recently released new features that make it possible to specify data storage and processing locations in specific regions of Google Cloud (including one in Canada, with additional non-US regions coming soon) for the purpose of optimizing costs. We are working toward support for data residency in other countries as a future improvement.
Technical questions to ask in a practical evaluation
Finally, the authors of the “Ten Simple Rules” paper close this rule with some questions that feel like the tie-breakers you would apply after having narrowed down your options based on the criteria we covered above. Here’s what the answers look like for Terra.
Is computing available on a first-come, first-served basis?
Essentially, yes, but since Terra is built on the cloud, the resources you need are typically ready to go whenever you are. If you request a specific resource that is in particularly high demand (like certain types of GPUs), you may have to wait for an instance to become available. Keep in mind that the cloud providers monitor usage of these resources and periodically expand their inventory based on demand, so these tend to be transient limitations.
Are there hard limits on resources per task or per user?
The cloud provider (not Terra) imposes some resource quotas; for example, a limitation on the number of virtual machines you can have running at the same time. Most of these quotas are fairly reasonable, and in practice you only start hitting them when operating at a very large scale. If and when you reach that point, you can get the provider to raise your quotas by making a request that establishes you’re a bona fide researcher (as opposed to a shady bitcoin mining bot farm).
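For a rough sense of why these quotas only bite at very large scale, here is a toy calculation of how a cap on concurrent virtual machines turns a big submission into successive waves. The quota, task count, and runtime below are hypothetical; check your own project’s quotas and your workflow’s actual runtimes.

```python
import math

# Hypothetical numbers -- real quotas and runtimes depend on your cloud project
# and your workflow.
concurrent_vm_quota = 2400  # max VMs the project may run at once
n_tasks = 50_000            # e.g. one task per sample in a large cohort
task_hours = 2.0            # average wall-clock time per task

waves = math.ceil(n_tasks / concurrent_vm_quota)
wall_clock_hours = waves * task_hours

print(f"{waves} waves of up to {concurrent_vm_quota} VMs "
      f"=> roughly {wall_clock_hours:.0f} hours of wall-clock time")
```

Even at fifty thousand tasks, the quota shows up as extra waves of waiting rather than a hard wall, and a successful quota-increase request shrinks the number of waves further.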
As a point of interest, the Terra system itself uses some internal queuing logic to keep everything running smoothly for everyone even when someone submits tens of thousands of samples. Fun fact: this feature is called “hog factors” because it prevents any one user from “hogging” the submission queue. Like many guardrails, this was originally spurred by a particular incident in which someone submitted a huge number of workflows on a Friday, exceeding their Google Cloud quota — so their workflows were queued by Google, and everyone else’s jobs were held up all weekend even though no one else was exceeding their own quota. Yikes! Since then, we’ve implemented the “hog factors” as well as a number of additional load balancing measures; the result is that Terra can enable individuals to submit very large processing requests without affecting anybody else’s work.
How do the specific features of hardware, network, and storage impact execution time?
Terra is set up to offer a lot of flexibility regarding the specific computing resources you can request, so if you have a need for speed, you can typically throw more powerful hardware at your problem.
For workflows, you can specify the number of processors and the amount of memory and storage space per task, which lets you customize resource use at a very granular level. This provides ample opportunity for optimization. Interestingly, selecting more powerful instances is not always more expensive: in some cases, even though the cost per minute of operation is higher, you end up paying less overall because the job takes less time to complete.
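Here is a toy illustration of that last point. The hourly rates and runtimes are invented rather than quoted cloud prices, and the example assumes the job genuinely benefits from the bigger machine (for instance, because the extra memory keeps the whole working set in RAM).

```python
# Illustrative only: rates and runtimes are made up, not real cloud pricing.
# A higher cost per hour can still mean a lower total bill if the job
# finishes enough faster on the bigger machine.
configs = [
    # (description, USD per hour, estimated runtime in hours)
    ("smaller instance", 0.20, 12.0),
    ("larger instance",  0.50,  3.0),
]

for name, rate, hours in configs:
    print(f"{name}: {hours:.0f} h x ${rate:.2f}/h = ${rate * hours:.2f} total")
```

In this sketch the larger instance costs two and a half times as much per hour but finishes four times faster, so the total bill is lower; whether that holds for your workload is something you can only find out by profiling it.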
For interactive applications, the basic usage is similar: you can dial the hardware specifications of your Cloud Environment up or down depending on what you need. And the system will turn off any idle instances on your behalf, so you won’t accrue any major charges while you’re away on vacation!
Whew, this was a big one. We’ll return next week with the more straightforward Rule #4: “Automate your workflows!”