Common techniques to detect PHI and PII data using AWS Services

December 19, 2022 Navaneeth Jalagam

Identifying sensitive data such as protected health information (PHI) and personally identifiable information (PII), and taking appropriate action to safeguard it, is an important step toward Health Insurance Portability and Accountability Act (HIPAA) compliance in the healthcare and life sciences industry.

There are several ways to do this on Amazon Web Services (AWS), depending on your use case and the services you use. We will list the common techniques available to detect and, in some cases, mask sensitive data using various AWS services. We’ll also direct you to the appropriate resource (blog, video, tutorial, or documentation) to help you achieve your desired outcome.

What are PII and PHI?

What is PII?
PII stands for personally identifiable information. It is any information that can be used to identify or locate an individual, and it can appear in financial, medical, educational, or employment records.

What is PHI?
PHI stands for protected health information. It is health-related information, such as medical records, that is stored, transmitted, or disclosed in the course of providing healthcare services. PHI is a subset of PII: it ties an identifiable person to their past or present physical or mental health conditions.

Real-world challenges with PHI/PII data and compliance

The healthcare industry faced an overnight transformation due to the COVID-19 pandemic. People were no longer able to see their doctors in person, forcing both patients and healthcare providers to adapt rapidly to the new normal. Although evolving technology has helped the transition to digital healthcare, it has also introduced a new set of HIPAA compliance challenges.

Ways HIPAA compliance and data protection have become more difficult:

  • The growth of telehealth services has led to a significant increase in PHI transmission
  • The migration to remote work requires securing more employee data
  • Patients are demanding more control over their healthcare information
  • Cyberattacks and other threats to PHI are increasing

Best practices for HIPAA Compliance

Most HIPAA violations can be prevented by building HIPAA requirements into your policies and procedures, and by ensuring that everyone with access to patient information receives proper training. Below are best practices for keeping your organization HIPAA compliant.

  • Identify where all your data is. Compliance starts with mapping the data in all your databases, removable devices, archives, and cloud storage. To implement effective safeguards, you need a blueprint of PHI/PII storage locations across your organization.
  • Encrypt patient data. Encryption is an important method to secure data, as the data becomes unintelligible to unauthorized persons. Encryption also provides a way to verify the origin and integrity of the data, reducing the risk of accessing data from suspicious sources.
  • Apply de-identification methods. One de-identification technique recognized under the HIPAA Privacy Rule is the “Safe Harbor” method, which requires removing specified identifiers (such as names, geographic data, dates, telephone numbers, Social Security numbers, and medical record numbers). The rationale is that once these identifiers are removed, it is reasonable to believe that the health data is no longer individually identifiable and is therefore no longer considered PHI/PII.
  • Conduct security awareness training. Security is a shared responsibility. Company policies should require all employees to undergo regular security awareness training to learn how to recognize, report, or eliminate potential threats. Informed employees who are fully aware of the consequences of data breaches can reduce the risk of unauthorized use and disclosure of patient PHI/PII.
  • Dispose of old data. When there is no longer any legal requirement to retain patient data, covered entities should take steps to dispose of it. HIPAA recommends that physical copies of patient records that contain PHI be shredded, burned, or pulverized so that they are unreadable. For ePHI, the options are clearing (using software or hardware products to overwrite media with non-sensitive data), purging (degaussing, or exposing the media to a strong magnetic field to disrupt the recorded magnetic domains), or destroying the media (disintegration, pulverization, melting, incineration, or shredding).
  • Data access control and monitoring. Access to sensitive PHI should only be granted to employees who “need to know” to perform their jobs effectively. Log management systems should be enabled to monitor the use and access to said data.
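As an illustration of the de-identification practice above, here is a minimal Python sketch that drops or generalizes a handful of identifier fields in a patient record. The field names (`name`, `ssn`, `phone`, `mrn`, `email`, `birth_date`, `zip`) and the record layout are hypothetical, and the rules are deliberately simplified. The Safe Harbor method lists 18 identifier categories and carries extra conditions (for example, for ages over 89 and for ZIP codes in low-population areas) that this sketch does not implement.

```python
from copy import deepcopy

# Hypothetical field names; adapt them to your own schema.
DIRECT_IDENTIFIER_FIELDS = {"name", "ssn", "phone", "mrn", "email"}


def deidentify_record(record: dict) -> dict:
    """Return a copy of `record` with direct identifiers dropped and a few
    quasi-identifiers generalized (simplified Safe Harbor-style rules)."""
    result = {
        key: deepcopy(value)
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIER_FIELDS
    }

    # Keep only the year of a date given as "YYYY-MM-DD".
    if "birth_date" in result:
        result["birth_year"] = str(result.pop("birth_date"))[:4]

    # Keep only the first three digits of a ZIP code.
    if "zip" in result:
        result["zip3"] = str(result.pop("zip"))[:3]

    return result
```

The function returns a new dictionary and leaves the input untouched, so the original record can stay in a restricted store while the de-identified copy moves on to analytics.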

Techniques to detect/mask PHI and PII data on AWS

Let’s look at a few available techniques on AWS to detect and/or mask sensitive data. We’ve grouped these approaches based on use cases and intended outcomes.

Data stored in Amazon Simple Storage Service (Amazon S3)
To begin, we’ll talk about customers who use Amazon S3 to store their data as part of a data lake or application data. To detect sensitive data residing within their Amazon S3 buckets, Amazon Macie (Macie) is a good place to start. Macie is a fully managed data security and data privacy service to help you discover, monitor, and protect sensitive data in your AWS environment.

With Macie, you can automate discovery and reporting of sensitive data by creating and running sensitive data discovery jobs. A sensitive data discovery job analyzes objects in Amazon S3 buckets to determine whether they contain sensitive data. If Macie detects sensitive data in an object, it creates a sensitive data finding for you. The finding provides a detailed report of the sensitive data that Macie found.

Macie uses a combination of criteria and techniques, including machine learning (ML) and pattern matching, to detect sensitive data. These criteria and techniques, collectively referred to as managed data identifiers, can detect a large and growing list of sensitive data types for many countries and regions. These include multiple types of PII and PHI.
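A discovery job can also be created programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its `macie2` client. The account ID, bucket names, and job name are placeholders, and the helper function names are ours for illustration. It builds the request for a one-time job that uses all managed data identifiers, and keeps the API call in a separate function so the request can be reviewed before it is submitted.

```python
import uuid


def build_macie_job_request(account_id: str, bucket_names: list[str],
                            job_name: str) -> dict:
    """Build keyword arguments for the Macie `create_classification_job` call."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "jobType": "ONE_TIME",
        "name": job_name,
        "managedDataIdentifierSelector": "ALL",
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }


def start_macie_job(request: dict) -> str:
    """Submit the job. Requires boto3, AWS credentials, and Macie enabled."""
    import boto3  # imported lazily so the builder above runs without boto3

    macie = boto3.client("macie2")
    response = macie.create_classification_job(**request)
    return response["jobId"]
```

For example, `start_macie_job(build_macie_job_request("111122223333", ["my-data-lake-bucket"], "phi-scan"))` would start a scan of one (hypothetical) bucket.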

Application access to data in Amazon S3
In some use cases, an application requests access to data residing in Amazon S3, and that data may contain PII that needs to be redacted. Amazon S3 Object Lambda is a capability that allows you to add your own code to process data retrieved from Amazon S3 before returning it to an application. The video tutorial Configure PII redaction using Amazon S3 Object Lambda Access Points guides you through the process of using Amazon S3 Object Lambda in conjunction with Amazon Comprehend to achieve this outcome. You can also follow Tutorial: Detecting and redacting PII data with Amazon S3 Object Lambda and Amazon Comprehend for a step-by-step approach.
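To make the Object Lambda flow concrete, here is a minimal sketch of a handler in Python. It is not the code from the tutorial. It assumes the standard Object Lambda event fields (`getObjectContext` with `inputS3Url`, `outputRoute`, and `outputToken`) and the S3 `write_get_object_response` API. The `make_handler` factory and the `redact` callable are hypothetical names; `redact` stands in for whatever PII redaction you use, such as a wrapper around Amazon Comprehend.

```python
import urllib.request


def fetch_url(url: str) -> str:
    """Download the original object through the presigned URL S3 provides."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def make_handler(s3_client, redact, fetch=fetch_url):
    """Create an S3 Object Lambda handler.

    `redact` is any callable str -> str (hypothetical placeholder for your
    redaction logic). `s3_client` is a boto3 S3 client in a real deployment.
    """
    def handler(event, context):
        ctx = event["getObjectContext"]
        original_text = fetch(ctx["inputS3Url"])
        # Send the transformed object back to the application that called GetObject.
        s3_client.write_get_object_response(
            Body=redact(original_text).encode("utf-8"),
            RequestRoute=ctx["outputRoute"],
            RequestToken=ctx["outputToken"],
        )
        return {"statusCode": 200}

    return handler
```

In a deployed function you would create the handler once at module level, for example `lambda_handler = make_handler(boto3.client("s3"), my_redact)`. Passing the client and the redaction function in as arguments also makes the handler easy to exercise locally with stand-ins.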

Data Preparation
If you are a data analyst or a data scientist, AWS Glue DataBrew may be a good fit. It is a visual data preparation tool that makes it straightforward to clean and normalize data in preparation for analytics and machine learning. The blog Introducing PII data identification and handling using AWS Glue DataBrew walks through a solution that identifies potential PII in a sample dataset, applies various transformations to handle the sensitive data, and stores the processed, masked, and encrypted data securely in Amazon S3. The same outcome can also be achieved through AWS Glue Studio’s Detect PII transform. AWS Glue Studio is a new graphical interface to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue.

Medical Records in Free Form Text and Imaging
Healthcare organizations often have protected health information both in free-form text (such as online medical forms, physician notes, and claims data) and in images (such as scanned copies of patient forms, X-rays, or lab reports). The presence, location, and format of this information often vary widely, and this is where an AI-powered masking solution can help detect and mask PHI/PII in both text and image formats. The AI-Powered Health Data Masking solution details how to do just that in a secure, scalable, serverless fashion.

Anonymize or de-identify data for Analytics
We also see organizations that want to use the valuable medical information in their datasets while meeting their compliance obligations for protected health information. To do this, they typically anonymize or de-identify their records so that the records can be used to drive insights from analytics or to train AI/ML models. The blog Identifying and working with sensitive healthcare data with Amazon Comprehend Medical demonstrates how you can use AWS Step Functions and Amazon Comprehend Medical to identify sensitive health data and help support your compliance objectives.

Detect and Redact PII using Amazon Comprehend
On the other hand, if you only need to detect and mask PII and want to quickly try Amazon Comprehend using either the web console or the CLI, the blog Detecting and redacting PII using Amazon Comprehend is a great place to start. You can quickly see Amazon Comprehend’s PII detection and masking capabilities either in near real time (taking a string of characters as input) or in an asynchronous mode that processes multiple files as a batch.
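In the near real-time mode, the Amazon Comprehend `detect_pii_entities` API returns entity types and character offsets rather than redacted text, so the caller applies the redaction. The following minimal sketch shows one way to do that in Python with boto3. The function names and the `min_score` threshold are ours for illustration, not recommendations.

```python
def redact_pii(text: str, entities: list[dict], min_score: float = 0.5) -> str:
    """Replace each detected span with its entity type, for example "[NAME]".

    `entities` follows the shape of the `Entities` list returned by
    Amazon Comprehend `detect_pii_entities` (Type, Score, BeginOffset,
    EndOffset). `min_score` is an illustrative threshold.
    """
    kept = [e for e in entities if e["Score"] >= min_score]
    # Work from the end of the string so that earlier offsets stay valid.
    for entity in sorted(kept, key=lambda e: e["BeginOffset"], reverse=True):
        start, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:start] + f"[{entity['Type']}]" + text[end:]
    return text


def detect_and_redact(text: str, language_code: str = "en") -> str:
    """Call Amazon Comprehend, then redact. Needs boto3 and AWS credentials."""
    import boto3  # lazy import so redact_pii can be used without boto3

    comprehend = boto3.client("comprehend")
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact_pii(text, response["Entities"])
```

Replacing spans from right to left avoids recomputing offsets after each substitution, which matters because the replacement tag is usually a different length from the text it replaces.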

For use cases that involve processing streaming data from a variety of sources in near real time, Amazon Kinesis Data Firehose is a fully managed service that makes it easy to capture, transform, and load massive volumes of streaming data from hundreds of thousands of sources. The blog Redact sensitive data from streaming data in near-real time using Amazon Comprehend and Amazon Kinesis Data Firehose details how to integrate Amazon Comprehend into your streaming architectures to redact PII entities in near real time using Amazon Kinesis Data Firehose and AWS Lambda.
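The Lambda function in such a pipeline follows the Firehose data transformation contract: it receives base64-encoded records and returns each one with its `recordId`, a `result` status, and the transformed `data`. Below is a minimal sketch of that contract in Python. It is not the code from the blog, and the `make_firehose_handler` factory and the `redact` callable are hypothetical names. `redact` stands in for the Amazon Comprehend-based redaction.

```python
import base64


def make_firehose_handler(redact):
    """Create a Firehose data-transformation handler.

    `redact` is any callable str -> str (hypothetical placeholder for the
    redaction logic).
    """
    def handler(event, context):
        output = []
        for record in event["records"]:
            try:
                payload = base64.b64decode(record["data"]).decode("utf-8")
                redacted = redact(payload).encode("utf-8")
                data = base64.b64encode(redacted).decode("ascii")
                result = "Ok"
            except Exception:
                # Return the record unchanged and flag it as failed.
                data, result = record["data"], "ProcessingFailed"
            output.append(
                {"recordId": record["recordId"], "result": result, "data": data}
            )
        return {"records": output}

    return handler
```

Marking a record as `ProcessingFailed` instead of raising lets the rest of the batch continue. Whether failed records should be retried, dropped, or routed elsewhere is a design decision for your pipeline.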

Data Warehouse
For data that resides in a data warehouse such as Amazon Redshift, the recently announced dynamic data masking feature helps protect sensitive data in your Amazon Redshift data warehouse. With dynamic data masking, you control access to your data through SQL-based masking policies that determine how Amazon Redshift returns sensitive data to the user at query time.
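To give a feel for what a SQL-based masking policy looks like, the sketch below uses a small Python helper to render a `CREATE MASKING POLICY` statement and an `ATTACH MASKING POLICY` statement. The statement shapes reflect our understanding of the Amazon Redshift syntax and should be checked against the Redshift documentation before use. The helper name, policy name, table, and column are hypothetical.

```python
def masking_policy_sql(policy: str, table: str, column: str,
                       column_type: str, mask_expression: str,
                       grantee: str = "PUBLIC") -> list[str]:
    """Return CREATE and ATTACH MASKING POLICY statements for one column.

    Values are interpolated as-is, so pass only trusted identifiers and
    expressions (this is an illustration, not a query builder).
    """
    create = (
        f"CREATE MASKING POLICY {policy} "
        f"WITH ({column} {column_type}) "
        f"USING ({mask_expression});"
    )
    attach = (
        f"ATTACH MASKING POLICY {policy} "
        f"ON {table}({column}) TO {grantee};"
    )
    return [create, attach]


if __name__ == "__main__":
    statements = masking_policy_sql(
        policy="mask_credit_card_full",
        table="credit_cards",
        column="credit_card",
        column_type="VARCHAR(256)",
        mask_expression="'000000XXXX0000'::TEXT",
    )
    print("\n".join(statements))
```

With this policy attached, users covered by the grantee see the constant masked value at query time while the stored data stays unchanged.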

Application Logs
Another area where sensitive data could be exposed is application logs. Amazon CloudWatch Logs (CloudWatch Logs) is used to monitor, store, and access your log files from various AWS sources. Amazon CloudWatch Logs data protection is a new set of capabilities for CloudWatch Logs that leverages pattern matching and machine learning to detect and protect sensitive log data. This feature can be enabled to detect and mask sensitive log data as it is ingested into CloudWatch Logs or as it is in transit. The blog Protect Sensitive Data with Amazon CloudWatch Logs walks you through the process of enabling this feature.
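Data protection is configured per log group through a policy document that pairs an audit statement with a de-identify (mask) statement. The following minimal sketch builds such a document in Python and applies it with the boto3 `put_data_protection_policy` call. The helper names, policy name, and description are ours for illustration, and you should confirm the managed data identifier names you need (the example uses `EmailAddress`) in the CloudWatch Logs documentation.

```python
import json

IDENTIFIER_ARN_PREFIX = "arn:aws:dataprotection::aws:data-identifier/"


def build_data_protection_policy(identifier_names: list[str]) -> str:
    """Build a policy document that audits and masks the given identifiers."""
    arns = [IDENTIFIER_ARN_PREFIX + name for name in identifier_names]
    policy = {
        "Name": "data-protection-policy",  # illustrative name
        "Description": "Mask sensitive data in application logs",
        "Version": "2021-06-01",
        "Statement": [
            {
                "Sid": "audit-policy",
                "DataIdentifier": arns,
                "Operation": {"Audit": {"FindingsDestination": {}}},
            },
            {
                "Sid": "redact-policy",
                "DataIdentifier": arns,
                "Operation": {"Deidentify": {"MaskConfig": {}}},
            },
        ],
    }
    return json.dumps(policy)


def apply_policy(log_group: str, policy_document: str) -> None:
    """Attach the policy to a log group. Needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder above runs without boto3

    logs = boto3.client("logs")
    logs.put_data_protection_policy(
        logGroupIdentifier=log_group, policyDocument=policy_document
    )
```

For example, `apply_policy("my-app-logs", build_data_protection_policy(["EmailAddress"]))` would mask email addresses in a (hypothetical) log group named `my-app-logs`.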

Data Migration
AWS Database Migration Service (AWS DMS) is a managed migration and replication service that helps move your database and analytics workloads to AWS quickly, securely, and with minimal downtime and zero data loss. The blog Data masking using AWS DMS provides a solution to implement data masking while replicating data using AWS DMS from an Amazon Aurora PostgreSQL cluster to Amazon S3.


Conclusion

Protecting real-world data that contains sensitive information such as PHI and PII is a challenge for healthcare and life sciences organizations, and HIPAA compliance should always be an ongoing goal. Applying the best practices above, along with the techniques we’ve described to detect and, in some cases, mask, redact, or de-identify sensitive data using various AWS services, can help you work toward that goal. To learn how AWS can help, contact an AWS Representative.
