Implement FAIR scientific data principles when building HCLS data lakes

July 5, 2023 Yadukishore Tatavarthi

The FAIR data principles were first proposed in a seminal paper published in 2016 in the journal Scientific Data. It was written by a group of international experts in data management and curation. To address the challenges that the research community is facing, they proposed the FAIR Principles as a framework for making data more findable, accessible, interoperable, and reusable.

These principles quickly gained traction across the scientific community. They have since been widely adopted as best practice for data management in a variety of fields, including healthcare and life sciences (HCLS). In this post, we will take a closer look at how HCLS organizations can build data-driven applications on AWS, by adopting the FAIR scientific data management principles.

FAIR aims to break down data silos by providing guidelines to make data:
Findable. Metadata and data should be searchable and should be easily located.
Accessible. Metadata and data should be accessible to authorized users through well-defined protocols.
Interoperable. Data should be formatted in a way that it can be stored, accessed, processed by multiple applications, and integrated with other data. Additionally, metadata should include qualified references to other metadata.
Reusable. Metadata should include rich business and technical context. It should be well described so that it can be replicated.

Real-world data challenges

The following are a few challenges commonly encountered when building data-driven applications in the HCLS space.

Data fragmentation: HCLS data is often fragmented across different systems, formats, and organizations, making it difficult to integrate and analyze. This can lead to inefficient research, duplication of effort, and missed opportunities for collaboration.

Privacy and security concerns: HCLS data is often sensitive and subject to strict regulatory requirements for privacy and security. This can make it challenging to share data between organizations, even when it could be beneficial for research or patient care.

Lack of standardization: HCLS data is often recorded in different formats, using different terminology, which can make it difficult to integrate and analyze. This lack of standardization can also lead to errors in data interpretation and miscommunication between researchers and clinicians.

Data silos: HCLS data is often stored in silos, such as electronic health record systems, clinical trial databases, or research repositories. This can make it difficult for researchers and clinicians to access the data they need, when they need it.

By applying FAIR Data Principles, organizations can accelerate data sharing, improve data literacy (comprehension of data), and increase overall transparency and reusability when working with data.

How to apply FAIR to HCLS Data with AWS

To address these challenges, AWS offers a rich set of services and features that can be used to build modern data platforms. We’ll now introduce each of the FAIR Principles and how AWS services can help you achieve them. To enable the FAIR principles, organizations must have a good data strategy. By aligning their data strategy with FAIR principles, organizations can optimize data management practices, promote collaboration, enable data-driven decision-making, and foster innovation.

Findable: How humans and machines discover data

The first principle of FAIR is that data should be findable. This means that data should have a unique and persistent identifier that can be used to locate it. In addition, data should be described with rich metadata, including information about the data creator, the data content, and the context in which the data was created. This information should be easily searchable and indexed, making it possible for others to discover and access the data.

How can AWS Services help?

Multiple AWS services can make data easier to discover. Modern data architectures built by customers acknowledge that data can exist in data lakes, data warehouses, and purpose-built analytical data stores.

Bringing data from multiple disparate systems into a single location is the core function of data lakes. Amazon Simple Storage Service (Amazon S3) can be used as an object storage service to store any type of data for different use cases. It is a foundation for a data lake.

The AWS Glue Data Catalog provides a uniform repository where disparate systems can store and find metadata (such as versioning, schema evolution, databases, tables, schemas, and tags). You can use metadata to query and transform data in a consistent manner across a wide variety of applications. The AWS Glue Data Catalog is the foundation for scalable data lakes, purpose-built analytical data stores, and services like Amazon EMR, Amazon Redshift, Amazon Athena, and Amazon Kinesis. Having a centralized catalog helps different personas discover the data more quickly.
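As a minimal sketch of catalog-based discovery, the Python below searches the AWS Glue Data Catalog by free text through the boto3 `search_tables` call. The helper name `find_tables` is an illustrative assumption, not something from the services themselves. The client is passed in as an argument, so the function can be exercised without AWS credentials.

```python
def find_tables(glue_client, text, max_results=25):
    """Return 'database.table' names whose catalog metadata matches `text`.

    `glue_client` is assumed to be a boto3 Glue client, for example
    boto3.client("glue"); any object exposing a compatible
    search_tables method also works.
    """
    response = glue_client.search_tables(SearchText=text, MaxResults=max_results)
    # Only the first page of results is read; a fuller version would
    # follow the NextToken value returned by the service.
    return [
        f"{table['DatabaseName']}.{table['Name']}"
        for table in response.get("TableList", [])
    ]
```

With credentials configured, a call such as `find_tables(boto3.client("glue"), "claims")` would list the tables whose names, descriptions, or properties mention claims.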

Organizations can further improve findability of data by using tags. Tags can be associated with Amazon S3 objects, as well as objects registered in AWS Glue Data Catalog, such as databases, tables, and columns. These tags provide additional context to different personas (business, operational, and technical). They can also be used for chargeback purposes. For example, a sales department can charge back clinical users based on their usage.
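The tagging idea can be sketched with the boto3 `put_object_tagging` call for Amazon S3. The helper names and the example tag keys are hypothetical. Note that this call replaces an object's existing tag set rather than merging into it, and that Amazon S3 allows at most 10 tags per object.

```python
MAX_S3_OBJECT_TAGS = 10  # Amazon S3 limit on tags per object


def build_tag_set(tags):
    """Convert a plain dict into the Tagging structure that S3 expects."""
    if len(tags) > MAX_S3_OBJECT_TAGS:
        raise ValueError("Amazon S3 allows at most 10 tags per object")
    return {"TagSet": [{"Key": k, "Value": v} for k, v in sorted(tags.items())]}


def tag_object(s3_client, bucket, key, tags):
    """Replace the tag set on one object.

    `s3_client` is assumed to be boto3.client("s3") or a compatible object.
    """
    s3_client.put_object_tagging(Bucket=bucket, Key=key, Tagging=build_tag_set(tags))
```

For example, tags such as `{"domain": "clinical", "cost-center": "cc-123"}` (both invented for illustration) would give business and chargeback context to a dataset file.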

Third-party data can be found using AWS Data Exchange, a data marketplace with over 3,500 products from over 300 providers. This data can be delivered through files, APIs, or Amazon Redshift queries directly to data lakes, applications, analytics, and machine learning (ML) models.

You can use Amazon Macie to secure your data. It uses machine learning and pattern matching to discover sensitive data at scale, including personally identifiable information (PII) such as names, addresses, and credit card numbers.

Catalog, discover, govern, share, and analyze your data using Amazon DataZone, a data management service. Searching the Amazon DataZone catalog can reduce the time needed to find data from weeks to days. Datasets are published to the catalog, and you can access and search for data through the Amazon DataZone portal. Search lists return results based on the cataloged data. Select your desired dataset and learn more about it in the business glossary. After you confirm your selected dataset, you can request access and start your analysis.

For example, users can catalog real-world data (RWD) sources, including clinical and claims data. RWD producers build business taxonomy using Amazon DataZone, and list their data products to make them more discoverable in the Amazon DataZone portal. They are then ready for data consumers using Amazon Athena. Users can enhance existing datasets by finding clinical data and mapping it to different data models, like Observational Medical Outcomes Partnership (OMOP). They can then list this new product for other consumers.

Customers can also build their own custom data marketplaces by using one or a combination of these services, based on their needs.

Accessible: How humans and machines access data

The second principle of FAIR is that data should be accessible. This means that data should be stored in a sustainable and trusted repository and be retrievable by their identifiers using a standardized communication protocol. Corresponding metadata should be easily accessible even when the data is no longer available, such as with deletes and updates.

How can AWS Services help?

Data stored in object stores or purpose-built analytical stores can be easily accessed using standard communication protocols. All the metadata gets stored in a centralized Data Catalog and can be accessed independently.

To make data accessible, store it in a durable and reliable repository. Amazon S3 is a popular storage service for storing data in the cloud, providing high durability, availability, and scalability. In addition, Amazon S3 provides a RESTful API, so data can be accessed over standard protocols such as HTTP or HTTPS. This makes it easier for different applications to access the data using a standard protocol.
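One way to illustrate access over a standard protocol is a presigned URL, which lets any HTTPS client download an object for a limited time. The sketch below uses the boto3 `generate_presigned_url` call; the helper name and the 15-minute default are assumptions made for this example.

```python
def presigned_download_url(s3_client, bucket, key, expires_in=900):
    """Return a time-limited HTTPS URL for downloading one S3 object.

    `s3_client` is assumed to be boto3.client("s3"). The URL carries the
    permissions of the credentials that signed it, and stops working
    after `expires_in` seconds.
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

The resulting URL can be fetched with any HTTP client, which is the point of the example: the consumer does not need an AWS SDK to retrieve the data.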

The metadata stored in the AWS Glue Data Catalog can be readily accessed from a variety of AWS services. These include AWS Glue, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, as well as third-party services. In addition, users can interact with the AWS Glue Data Catalog programmatically using software development kit (SDK) Libraries and AWS Command Line Interface (CLI) utilities.
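As a sketch of programmatic access to catalog metadata, the function below reads a table definition with the boto3 Glue `get_table` call and returns its columns. The helper name `describe_columns` is hypothetical, and the client is injected so the logic can be tested offline.

```python
def describe_columns(glue_client, database, table):
    """Return (name, type, comment) tuples for a cataloged table.

    `glue_client` is assumed to be boto3.client("glue"). Partition keys
    are appended after the regular columns.
    """
    definition = glue_client.get_table(DatabaseName=database, Name=table)["Table"]
    columns = list(definition.get("StorageDescriptor", {}).get("Columns", []))
    columns += definition.get("PartitionKeys", [])
    return [(c["Name"], c.get("Type", ""), c.get("Comment", "")) for c in columns]
```

Because the metadata lives in the catalog rather than alongside the files, this lookup works independently of the engine that will eventually query the data.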

AWS systems support many open protocols (for example, HTTP, HTTPS, TCP, SSH, and others) and formats (such as Apache Iceberg, Apache Hudi, Apache Parquet, CSV, and more). There are no vendor proprietary formats or lock-in.

To ensure that metadata remains accessible even when the data itself is no longer available, keep track of changes to data and metadata over time. AWS provides a number of tools for data versioning and change management, such as Amazon S3 versioning and AWS Glue versioning. These tools enable you to track changes to data and metadata over time, so you can always access the correct version of the data and metadata. Data lakes built on open table formats like Apache Iceberg and Apache Hudi support transactional capabilities and time travel.
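The Amazon S3 side of this can be sketched with two boto3 calls: `put_bucket_versioning` to turn versioning on, and `list_object_versions` to inspect the history of one object. The helper names are assumptions made for illustration.

```python
def enable_versioning(s3_client, bucket):
    """Turn on object versioning for a bucket (boto3 S3 client assumed)."""
    s3_client.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


def list_versions(s3_client, bucket, key):
    """Return (version_id, is_latest) pairs for exactly one object key."""
    response = s3_client.list_object_versions(Bucket=bucket, Prefix=key)
    # Prefix matching can return other keys that share the prefix, so
    # filter on the exact key. Only the first page of results is read.
    return [
        (version["VersionId"], version["IsLatest"])
        for version in response.get("Versions", [])
        if version["Key"] == key
    ]
```

A specific earlier version can then be retrieved by passing its `VersionId` to `get_object`.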

To ensure that data is accessible to authorized users, access controls should be implemented to prevent unauthorized access. AWS provides a number of tools for data access control, such as AWS Identity and Access Management (IAM) and AWS Lake Formation. Lake Formation provides central access controls for data in your data lake. You can define security policy-based rules for your users and applications by role in Lake Formation. You can integrate with AWS Identity and Access Management for authentication of those users and roles. This will enable you to provide fine-grained access controls, so that only authorized users can access the data.
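A fine-grained grant of this kind might look like the following sketch, which uses the boto3 Lake Formation `grant_permissions` call to allow `SELECT` on chosen columns only. The helper name, role ARN, and table names are hypothetical.

```python
def grant_column_select(lf_client, principal_arn, database, table, columns):
    """Grant SELECT on specific columns of one table to an IAM principal.

    `lf_client` is assumed to be boto3.client("lakeformation").
    """
    lf_client.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        Permissions=["SELECT"],
    )
```

Listing only non-identifying columns in `columns` is one way to let analysts query a clinical table without exposing direct patient identifiers.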

Amazon DataZone extends governance controls through AWS Glue Data Catalog, AWS IAM, and Lake Formation. The service operates within your infrastructure without relying on individual credentials. Amazon DataZone also provides simplified access to analytics with personalized views for data assets through a web-based application. You can use analytics without having to sign in to the AWS Management Console or understand the underlying AWS analytics services.

Interoperable: How different systems and data sources exchange data

The third principle of FAIR is that data should be interoperable. This means that data should be in a machine-readable format and should be easily exchanged between different systems and tools without needing specialized or proprietary software. Use of standardized vocabularies and ontologies should be promoted to ensure that data can be integrated and analyzed across different domains and disciplines.

Healthcare interoperability refers to health data exchange between information systems like electronic health record (EHR) systems, pharmacy systems, diagnostic imaging systems, laboratory systems, and claims systems. With interoperability, health data can be communicated between electronic systems, between organizations, and across geographical boundaries with standardized protocols.

Enabling interoperability requires that data be captured in electronic systems, with standards for the content, transport, vocabulary and terminology, privacy and security, and identifiers. The following are some recommendations for functional interoperability solutions.

  • Use a published standard to structure and transmit health data. Many standards are currently used across healthcare, such as Health Level Seven Version 2 (HL7 v2), HL7 Version 3 (HL7 v3), HL7 Fast Healthcare Interoperability Resources Release 4 (FHIR R4), X12 Electronic Data Interchange (EDI), and the Digital Imaging and Communications in Medicine (DICOM) standard.
  • Expose either an API or an integration server to other systems seeking to exchange data.
  • Check received data for conformance to the interoperability standard being used.
  • Apply transformations to data received or sent and map the internal formats of systems of record, like EHRs, to a structural and semantic form dictated by the standard.
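The last two recommendations, mapping internal formats to a standard and checking conformance, can be sketched in plain Python for a FHIR Patient resource. The internal field names (`patient_id`, `sex`, `dob`, and so on) and the gender code map are hypothetical. The checks are deliberately stricter and far smaller than real FHIR validation: FHIR also permits partial birth dates and treats gender as optional, and a production system would use a full validator.

```python
import re

ALLOWED_GENDERS = {"male", "female", "other", "unknown"}  # FHIR administrative gender codes
GENDER_MAP = {"M": "male", "F": "female", "O": "other"}   # hypothetical internal codes


def to_fhir_patient(record):
    """Map a hypothetical internal EHR row to a minimal FHIR Patient resource."""
    return {
        "resourceType": "Patient",
        "id": str(record["patient_id"]),
        "name": [{"family": record["last_name"], "given": [record["first_name"]]}],
        "gender": GENDER_MAP.get(record.get("sex"), "unknown"),
        "birthDate": record["dob"],
    }


def conformance_errors(resource):
    """Return a list of problems found by a few illustrative checks."""
    errors = []
    if resource.get("resourceType") != "Patient":
        errors.append("resourceType must be 'Patient'")
    if resource.get("gender") not in ALLOWED_GENDERS:
        errors.append("gender must be one of male, female, other, unknown")
    # Simplification: require a full date, although FHIR also allows YYYY and YYYY-MM.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(resource.get("birthDate", ""))):
        errors.append("birthDate must be formatted as YYYY-MM-DD")
    return errors
```

An empty list from `conformance_errors` means the resource passed these checks only; it does not certify full conformance to the FHIR specification.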

How can AWS Services help?

Use Amazon HealthLake for scenarios that require enrichment of health data in FHIR format or downstream analytics and ML. Amazon HealthLake can improve semantic interoperability by applying natural language processing (NLP). It then links concepts in unstructured data to terms in standard health ontologies, like ICD-10-CM, SNOMED CT, and RxNorm. Amazon HealthLake Analytics supports interoperable standards such as FHIR.

Amazon HealthLake Imaging, a new HIPAA-eligible capability, makes it easy to store, access, and analyze medical images in DICOM format at petabyte scale. This new capability is designed for fast, subsecond medical image retrieval in your clinical workflows. You can access it securely from anywhere (web, desktop, or phone) and with high availability. It can drive your existing medical viewers and analysis applications from a single encrypted copy of the same data in the cloud, with normalized metadata and advanced compression.

To detect and return useful information in unstructured clinical text, such as physicians’ notes, discharge summaries, test results, and case notes, use Amazon Comprehend Medical. It uses natural language processing (NLP) models to detect entities, which are textual references to medical information such as medical conditions, medications, or Protected Health Information (PHI).
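Entity detection of this kind can be sketched with the boto3 Comprehend Medical `detect_entities_v2` call. The helper name and the 0.8 confidence threshold are assumptions chosen for illustration, not recommended values.

```python
def extract_entities(cm_client, text, min_score=0.8):
    """Return (text, category, type) for entities at or above `min_score`.

    `cm_client` is assumed to be boto3.client("comprehendmedical").
    """
    response = cm_client.detect_entities_v2(Text=text)
    return [
        (entity["Text"], entity["Category"], entity["Type"])
        for entity in response.get("Entities", [])
        if entity.get("Score", 0.0) >= min_score
    ]
```

Filtering on the returned `Score` is one simple way to trade recall for precision before the entities are stored or mapped to ontology codes.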

Amazon Omics helps healthcare and life science organizations store, query, and analyze genomic, transcriptomic, and other omics [1] data at scale. By removing the undifferentiated heavy lifting, you can generate deeper insights from omics data to improve health and advance scientific discoveries.

With Amazon Omics, you can quickly ingest and transform genomics data formats such as (g)VCF, GFF3, and TSV/CSVs into Apache Parquet. You can make the genomics data accessible through analytics services such as Amazon Athena. You can transform both variant data (data from an individual sample) and annotation data (known information about positions in the genome).
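Once variant data is queryable as a table, analysis can start from Amazon Athena. The sketch below uses the boto3 Athena `start_query_execution` call to count rows in a table; the helper name, the table name passed in, and the Amazon S3 output location are hypothetical, and the real schema depends on how the data was ingested.

```python
import re


def start_row_count(athena_client, database, table, output_location):
    """Start an Athena query that counts rows in `table`; return the query ID.

    `athena_client` is assumed to be boto3.client("athena"), and
    `output_location` an s3:// URI where Athena may write results.
    """
    # Accept only simple identifiers so the table name cannot inject SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError("unexpected table name")
    response = athena_client.start_query_execution(
        QueryString=f'SELECT COUNT(*) AS row_count FROM "{table}"',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so a caller would poll `get_query_execution` with the returned ID and then read the rows with `get_query_results`.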

The Healthcare Interoperability Stack on AWS reference architecture describes how to build a modular interoperability platform to ingest, parse, and store healthcare data of any shape, size, and format.

Reusable: How to understand and interpret the data

The fourth principle of FAIR is that data should be reusable. This means that data should be easily understandable and interpretable, enabling others to use it in their research, analysis, and decision-making processes. This requires that data is well-documented and accompanied by clear and comprehensive metadata, so others can understand how the data was created and how it can be used.

Data reusability relies heavily on findability: data must be found before it can be reused. This is critical for life sciences companies that purchase market data from data providers. Add context to the published data with metadata. For example, when you publish the data and solution in the marketplace, you can provide additional information such as licensing agreements, a data dictionary, sample data, operational metadata, and more.

How can AWS Services help?

To help providers and subscribers access data without needing to physically migrate large volumes of it, use AWS Data Exchange.

AWS Marketplace offers a curated catalog of third-party solutions that can help you modernize care, improve patient outcomes, comply with regulations, and unlock the potential of healthcare data. Whether you are a provider, payor, or health-tech organization, AWS Marketplace can help you easily discover, procure, deploy, and manage cloud technology and data management solutions.

To help people discover and share datasets that are available via AWS resources, see the Registry of Open Data on AWS.

Amazon Redshift data sharing is a secure way to share live data for reading across Amazon Redshift clusters. An Amazon Redshift producer cluster can share objects with one or more Amazon Redshift consumer clusters for read purposes, without having to copy the data.
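The producer and consumer steps can be sketched as the SQL statements each side would run, generated here from Python for consistency with the other examples. The share, schema, table, and database names are hypothetical, and the namespace values are placeholders for the cluster namespace GUIDs. The inputs are interpolated directly, so they are assumed to come from trusted configuration.

```python
def producer_statements(share, schema, table, consumer_namespace):
    """SQL for the producer cluster: create a datashare and grant usage on it."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD TABLE {schema}.{table};",
        f"GRANT USAGE ON DATASHARE {share} TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statement(database, share, producer_namespace):
    """SQL for the consumer cluster: expose the datashare as a local database."""
    return (
        f"CREATE DATABASE {database} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}';"
    )
```

After the consumer creates the database, it can query the shared tables read-only, and no copy of the data is made.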

Amazon DataZone introduces data projects for teams to collaborate through datasets. Use data projects to manage and monitor data assets across projects. With data projects, you can create and access business use case groupings of data, people, and tools for collaboration.


In this post, we have discussed the importance of FAIR principles and current challenges in building data-driven applications in the HCLS space. We’ve demonstrated how you can adopt FAIR principles for your data with the help of various AWS services. We encourage you to consider how your organization can build architectures to support these principles.

[1] Omics refers to the study of specific factors within a cell, tissue, or organism. It uses technology to measure and characterize molecules.
