Data Storage. The AnVIL platform meets the following desirable characteristics of a data repository:
Unique Persistent Identifiers – Researcher-assigned unique persistent identifiers and version control are supported, so that reuse by researchers is based on immutable data snapshots.
Long-Term Sustainability – The AnVIL platform has been in operation for almost ten years, supports thousands of researchers every month using two of three public clouds, and is funded by a breadth of organizations including government, nonprofit, and commercial organizations.
Metadata – AnVIL supports storage and analysis of many different data types, including scientific data (both unstructured and structured) and structured metadata, and manages the relationship between scientific files and their associated metadata.
Curation and Quality Assurance – AnVIL supports self-service ingestion and curation, optionally funded services, and dedicated support for data QC and metadata management.
Free and Easy Access – NIH-funded data access through AnVIL is free to researchers. After access is granted, researchers pay only the pass-through cloud costs associated with their own data storage and compute. Controlled data access is managed through DUOS, which makes data access requests more efficient and shortens the time it takes for researchers to be granted access to data.
Broad and Measured Reuse – AnVIL’s workspaces are a robust way to enable secure reuse of data and to facilitate collaboration, sharing, and public showcasing of analyses.
Clear Use Guidance – The AnVIL system ensures enforcement of data access controls and auditability of data use.
Security and Integrity – As AnVIL already handles government data, it meets a strict security standard aligned with the internationally recognized NIST SP 800-53 standard, which includes best-in-class capabilities around authentication, authorization, auditing, and threat detection.
Common Format – The AnVIL platform supports arbitrary schema definitions to ensure any and all datasets can be published through AnVIL using the de facto data standard of your research community. Through our work with NCPI and GA4GH, we continue to advocate for use of data and metadata standards to support the co-analysis of interoperable datasets.
Provenance – AnVIL supports a Data Custodian role that enables publishing versioned “snapshots” for data reuse. Snapshots are immutable collections of data allowing clear provenance and versioning of analyses. AnVIL workspaces also dynamically reference versions of data and tools to ensure ease of reproducibility and auditability.
Retention Policy – Because AnVIL supports a “bring your own data” model, data remains connected to AnVIL by the data owner until that owner determines that hosting the data is no longer relevant or needed. The specifics of this policy are documented in AnVIL’s Terms of Service.
The AnVIL platform meets the following additional criteria for storing human participant data:
Fidelity to Consent – AnVIL has developed a robust Identity and Access Management System to ensure that data access is only provided to users with approval. AnVIL also audits user activity in support of ad-hoc compliance verification.
Restricted Use Compliant – AnVIL is highly secure and does not allow redistribution of data to unauthorized users within the AnVIL platform.
Privacy – The security posture is documented and publicly available, detailing our approach to implementing the NIST SP 800-53 risk management framework and our authorization as a FedRAMP Moderate impact system. Access to human subjects data is additionally regulated by leveraging the GA4GH standard DUO (Data Use Ontology) and restricting access based on Data Access Requests facilitated by DUOS.
Plan for Breach – As a part of FedRAMP Moderate impact system authorization, we document response plans for any detected data breach. These procedures are annually audited by third parties.
Download Control – Across applications and infrastructure, no researcher is authorized to access or download data without approval. Researchers only have the lowest necessary access in adherence with the principle of least privilege. By default, authenticated researchers can access or download nothing (deny-by-default). They have to be explicitly authorized to access resources and all privilege escalations are logged.
Violations – Terms of service are publicly posted and describe appropriate use of data and resources in AnVIL. As a part of FedRAMP Moderate impact system authorization, we document response plans for any detected unauthorized behavior in the platform. These procedures are annually audited by third parties.
Request Review – The DUOS system leverages a consistent process across all datasets and data access committees (DACs) to standardize access request review, according to best practices from the Global Alliance for Genomics and Health (GA4GH) Data Access Committee Review Standards (DACReS) policy.
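The deny-by-default behavior described under Download Control can be illustrated with a minimal sketch. This is not AnVIL's implementation; the class AccessController and its methods grant and can_download are hypothetical names chosen only to show the idea that access is refused unless an explicit approval exists, and that grants and checks are recorded for audit.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class AccessController:
    """Hypothetical illustration of deny-by-default access with an audit trail."""

    # Explicit (researcher, dataset) approvals; anything absent is denied.
    grants: Set[Tuple[str, str]] = field(default_factory=set)
    # Append-only record of grants and access checks.
    audit_log: List[tuple] = field(default_factory=list)

    def grant(self, researcher: str, dataset: str, approved_by: str) -> None:
        # Record who approved the access so the escalation is auditable.
        self.grants.add((researcher, dataset))
        self.audit_log.append(("grant", researcher, dataset, approved_by))

    def can_download(self, researcher: str, dataset: str) -> bool:
        # No matching grant means no access: the default answer is "deny".
        allowed = (researcher, dataset) in self.grants
        self.audit_log.append(("check", researcher, dataset, allowed))
        return allowed
```

In this sketch a newly authenticated researcher is denied every dataset until an approval is recorded, and an approval for one dataset does not extend to any other.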
Data Access. DUOS satisfies the following DMS requirements around access to human participant data:
Fidelity to Consent – DUOS serves as a source of record for data use limitations (DULs) and DUL agreements.
Restricted Use Compliant – DUOS serves as the source of record for which researchers have access to which data.
Clear Use Guidance – The DUOS system is used to codify data access use limitations as well as data access approvals.
Request Review – The DUOS system enables data access committees to receive, review, and adjudicate data access requests through consistent review procedures that leverage the Global Alliance for Genomics and Health (GA4GH) Data Use Ontology (DUO).
Fidelity to Consent – The DUOS system facilitates mapping of participants’ consented data use preferences to the Global Alliance for Genomics and Health (GA4GH) Data Use Ontology (DUO), which maintains a high fidelity with NIH’s commonly used data use limitations (DULs).
Restricted Use Compliant – The DUOS system requires all researchers requesting access to data to agree to an attestation statement and data access agreement in which they affirm that they will not attempt to re-identify participants or inappropriately use or share data in ways that are inconsistent with participants’ consent.