What is Data De-identification and Why is It Important?

Sophie Stalla-Bourdillon
on May 7, 2021
Last edited: November 4, 2024

Data de-identification is a form of dynamic data masking that refers to breaking the link between data and the individual with whom the data is initially associated. Essentially, this requires removing or transforming personal identifiers. Once personal identifiers are removed or transformed using the data de-identification process, it is much easier to reuse and share the data with third parties.

Data de-identification is expressly governed under HIPAA, which is why most people associate the data de-identification process with medical data. However, data de-identification is also important for businesses or agencies that want or need to mask identities under other frameworks, such as CCPA and CPRA, or even GDPR.

HIPAA names two different methods of de-identifying data: Safe Harbor and Expert Determination.

Safe Harbor

The Safe Harbor method of de-identification requires removing 18 types of identifiers, like those listed below, so that residual information cannot be used for identification:

  • Names
  • Dates, except the year
  • Telephone numbers
  • Geographic data
  • Fax numbers
  • Social Security numbers
  • Email addresses
  • Medical record numbers
  • Account numbers
  • Health plan beneficiary numbers
  • Certificate/license numbers
  • Vehicle identifiers and serial numbers, including license plates
  • Web URLs
  • Device identifiers and serial numbers
  • Internet protocol addresses
  • Full face photos and comparable images
  • Biometric identifiers
  • Any unique identifying number, characteristic, or code

Any of these identifiers can classify health information as protected health information (PHI), which limits use and disclosure. Data de-identification tools with sensitive data discovery can detect and mask such information.

The Safe Harbor method, which is usually praised for its simplicity and low cost, is not well adapted for all use cases: It can either be overly restrictive, leaving too little utility within the data, or overly permissive, leaving too many indirect identifiers in the clear.
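As a rough illustration of the Safe Harbor idea, the sketch below drops a handful of identifier fields from a record and reduces dates to the year. The field names (`name`, `ssn`, `birth_date`, and so on) are hypothetical, and the rule set is deliberately incomplete: it covers only a few of the 18 identifier types and is not a compliance tool.

```python
from datetime import date

# Hypothetical field names for illustration; map these to your own schema.
FIELDS_TO_DROP = {"name", "phone", "fax", "ssn", "email", "medical_record_number"}
DATE_FIELDS = {"birth_date", "discharge_date"}


def safe_harbor_scrub(record):
    """Drop listed identifier fields and keep only the year of each date."""
    scrubbed = {}
    for field, value in record.items():
        if field in FIELDS_TO_DROP:
            continue  # remove the identifier entirely
        if field in DATE_FIELDS and isinstance(value, date):
            # "Dates, except the year": keep the year, discard month and day
            scrubbed[field.replace("_date", "_year")] = value.year
        else:
            scrubbed[field] = value
    return scrubbed


patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "birth_date": date(1984, 3, 9),
    "diagnosis": "asthma",
}
print(safe_harbor_scrub(patient))
```

A fixed rule set like this is simple to apply, which is also its weakness: any indirect identifier that is not on the list passes through untouched.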

Expert Determination

Expert determination involves applying statistical and scientific principles to data to achieve a very small risk of re-identification. This method makes it possible to tailor the de-identification process to the use case at hand while also maximizing utility; it is therefore praised for its flexibility.

Expert determination is sometimes considered too costly to use because it requires the involvement of an expert in statistics, who can be expensive to source. However, the expert determination method enables the use of quantitative methods to lower the re-identification risk, which opens the door for leveraging data generalization and automation.

Limited Data Sets

HIPAA also allows limited data sets to be released for research, public health, or healthcare operations. Personal identifiers are removed from these data sets, with the exception of date of birth, date of death, age, location, and dates of treatment and discharge. Since limited data sets do still contain some identifying information, they remain protected as PHI under HIPAA.

What is the Value of De-identified Information?

There are several benefits to de-identifying data:

  • Since de-identified data is no longer considered to be identifying, you may not be required to report breaches or data leaks involving it. This can limit your risk exposure and protect individuals.
  • De-identifying data facilitates reuse and makes it easier to share with third parties, through, for example, secure data licensing. Another scenario may be a pharmaceutical company that licenses de-identified patient data under HIPAA to analyze trends and prescription patterns that help verify efficacy and identify market opportunities. De-identifying data can also allow researchers to provide public health warnings without revealing PHI. By analyzing de-identified data in aggregate, researchers and officials can identify trends and potential red flags, and take the necessary steps to mitigate risks to the general public.

It is important to note that de-identification is not a guarantee that data is being processed fairly and ethically; assessing the impact of the processing is necessary to achieve that goal.

Data de-identification has been particularly valuable in the medical field, and it is at the heart of research that has led to breakthroughs and discoveries that improve patient care. Kaiser Permanente, for example, uses de-identified data in partnership with Samsung to improve remote monitoring of cardiac rehab patients. Early results show significantly lower readmission rates when using a smartwatch-based program than traditional rehabilitation regimens.

Innovation partnerships that leverage de-identified data also have the potential for other advances in medical research. McKinsey estimates that artificial intelligence and machine learning using health records could save the US medical industry $100 billion annually by improving the efficiency of research and clinical trials.

How to De-Identify Data

Data de-identification is typically managed in a two-step process.

The first step consists of classifying and tagging direct and indirect identifiers. Identifiers that are unique to a single individual, such as Social Security numbers, passport numbers, and taxpayer identification numbers, are known as “direct identifiers.” The remaining types of identifiers are known as “indirect identifiers,” and generally consist of personal attributes that are not unique to a specific individual on their own. Examples of indirect identifiers include height, ethnicity, hair color, and more. Though not independently unique, indirect identifiers can be used in combination to single out an individual’s records.

Once the data classifiers have been verified and represent what is within data sources, it is possible to automate the tagging process. This makes the de-identification process vastly more efficient for data teams.
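The classification and tagging step might be sketched as follows. This is a minimal example that assumes column names carry recognizable keywords; the rule table and the `tag_columns` helper are hypothetical, and real sensitive data discovery usually inspects values as well as names.

```python
DIRECT = "direct"
INDIRECT = "indirect"

# Hypothetical classifier rules: a keyword found in a column name -> identifier type.
CLASSIFIER_RULES = {
    "ssn": DIRECT,
    "passport": DIRECT,
    "taxpayer": DIRECT,
    "height": INDIRECT,
    "ethnicity": INDIRECT,
    "hair_color": INDIRECT,
}


def tag_columns(columns):
    """Return a mapping of column name to identifier type for recognized columns."""
    tags = {}
    for column in columns:
        lowered = column.lower()
        for keyword, kind in CLASSIFIER_RULES.items():
            if keyword in lowered:
                tags[column] = kind
                break  # first matching rule wins
    return tags


print(tag_columns(["patient_ssn", "Height_cm", "visit_notes"]))
```

Columns that match no rule are left untagged, so the rules should be reviewed against the actual data sources before the tagging is trusted and automated.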

Data can then be de-identified through the combination of various dynamic data masking techniques and data access controls. These technical and organizational measures impact both the data’s appearance and its environment, including who can access the data and for which purposes, among other contingencies. One of the primary techniques data engineering and operations teams use to mask data before de-identifying it is pseudonymization.

Pseudonymization 

Pseudonymization, despite being a very useful security method, generally does not achieve de-identification on its own. That is due in part to the fact that pseudonymization is usually defined as masking direct identifiers, so it does not necessarily take indirect identifiers into account.

Pseudonymization can transform direct identifiers through various masking techniques, though some are stronger than others. Salted hashes, for example, offer a formal guarantee that hidden values cannot be reasonably connected to individually identifiable information without knowledge of the salt, or the random input data. Recent privacy and data protection legislation requires that this random data be kept separate from pseudonymized data through technical and organizational measures, making salted hashes one of the stronger masking techniques.
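A minimal sketch of salted hashing, using only the Python standard library, is shown below. It assumes the salt is treated as a secret and uses HMAC-SHA-256 keyed with the salt, which is one common way to mix secret random input into a hash; the function names are illustrative.

```python
import hashlib
import hmac
import secrets


def new_salt():
    """Generate random input data; store it separately from the pseudonymized data."""
    return secrets.token_bytes(32)


def pseudonymize(value, salt):
    """Return a hex pseudonym for a direct identifier using HMAC-SHA-256."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


salt = new_salt()
token = pseudonymize("123-45-6789", salt)
# The same value and salt always give the same pseudonym, so joins still work.
print(token == pseudonymize("123-45-6789", salt))
```

Because the pseudonym is deterministic for a given salt, records can still be linked across tables, while anyone without the salt cannot recompute the mapping by hashing guessed identifiers.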

The HIPAA Safe Harbor method, on the other hand, aims to remove both direct and indirect identifiers. However, its fixed list of indirect identifiers does not work well for all use cases and does not necessarily achieve a very small re-identification risk, as Expert Determination does. This is particularly true when specific types of indirect identifiers, such as gender or ethnicity, are present in the data source.

Data masking tools and solutions simplify the process of masking identifiers with hashing, regular expressions, rounding, conditional masking, and replacing with null or constants. Format-preserving masking maintains the length and type of the value, making it possible to derive greater utility. It is also possible to allow data users to submit an unmasking request for very sensitive attribute values.
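A few of these masking techniques can be sketched in plain Python. The helper names are hypothetical, and the format-preserving example is a simple random substitution that keeps length and character type; it is not format-preserving encryption and cannot be reversed.

```python
import random
import re
import string


def mask_digits_with_regex(value):
    """Regular-expression masking: hide every digit except the last four."""
    return re.sub(r"\d(?=(?:\D*\d){4})", "X", value)


def round_to_base(value, base):
    """Rounding: coarsen a numeric value to the nearest multiple of base."""
    return base * round(value / base)


def format_preserving_mask(value, rng):
    """Replace digits with digits and letters with letters, keeping punctuation."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # separators such as "-" are kept as-is
    return "".join(out)


print(mask_digits_with_regex("123-45-6789"))
print(round_to_base(183, 10))
print(format_preserving_mask("AB-1234", random.Random()))
```

Keeping the original shape of a value means downstream validation and parsing code continues to work on the masked data.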

Once direct identifiers have been masked, data engineering and operations teams can apply methods of de-identification.

De-identification Methods

There are two primary de-identification methods: generalizing and randomizing.

Generalizing (k-anonymization)

K-anonymization is a data generalization technique that is implemented once direct identifiers have been masked. The k-anonymization process reduces re-identification risks by hiding individuals in groups and suppressing indirect identifiers for groups smaller than a predetermined number, k. This aims to mitigate identity and relational inference attacks. This de-identification technique can help reduce the need for data redaction in data sets, which helps increase their utility without compromising data privacy.

Immuta enables data teams to apply dynamic k-anonymization at query time from any of your organization’s databases, allowing you to safely, seamlessly prepare sensitive data for use – without legal and privacy concerns or risk-prone data copies.

In some data sets, even when direct identifiers are masked, it may be possible to determine a patient’s identity from other available information in the data. A well-known study showed, for example, that 87% of the US population could be identified using only three indirect identifiers: gender, birthdate, and zip code. Generalization, or k-anonymization, can reduce this risk of re-identification by ensuring individuals within the same cohort share the same indirect identifiers.
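The generalize-then-suppress idea can be sketched as follows. This is a simplified illustration rather than a full k-anonymization algorithm: the generalization rules (truncating zip codes, bucketing birth years by decade) and the column names are assumptions chosen for the example.

```python
from collections import Counter


def generalize(row):
    """Coarsen indirect identifiers: 3-digit zip prefix and birth decade."""
    out = dict(row)
    out["zip"] = row["zip"][:3] + "**"
    out["birth_year"] = row["birth_year"] // 10 * 10
    return out


def k_anonymize(rows, quasi_identifiers, k):
    """Suppress indirect identifiers for groups smaller than k."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    sizes = Counter(key(row) for row in rows)
    result = []
    for row in rows:
        out = dict(row)
        if sizes[key(row)] < k:
            for q in quasi_identifiers:
                out[q] = None  # suppress: the group is too small to hide in
        result.append(out)
    return result


rows = [
    {"zip": "43201", "birth_year": 1984, "diagnosis": "A"},
    {"zip": "43202", "birth_year": 1987, "diagnosis": "B"},
    {"zip": "43215", "birth_year": 1951, "diagnosis": "C"},
]
generalized = [generalize(r) for r in rows]
for r in k_anonymize(generalized, ["zip", "birth_year"], k=2):
    print(r)
```

In this example the first two records end up sharing the same generalized values, while the third has its indirect identifiers suppressed. Note that the sensitive `diagnosis` column is left untouched, which is why generalization is typically paired with other protections for sensitive attributes.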

Experts caution that in today’s evolving data landscape, singular approaches cannot guarantee protection against re-identification — especially in the healthcare industry. K-anonymization works best in combination with attribute-based access control and real-time data use monitoring, as well as randomization to protect sensitive attributes.

Randomizing (differential privacy and randomized response)

Differential privacy is a randomization technique that is implemented once direct identifiers have been masked. There are two approaches to differential privacy: local and global.

Local differential privacy is a data randomization method that usually applies to sensitive attributes and offers a mathematical guarantee against attribute-based inference attacks. This is accomplished by randomizing attribute values in a way that limits how much personal information an attacker can infer about any specific record, while still preserving some analytic utility across the data set as a whole. Individuals whose data is included in the queried data set are therefore able to deny the specific attributes attached to their records. Technology companies like Google and Apple, which collect a wide range and huge amount of personal data, have adopted local differential privacy.
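Randomized response is a classic way to achieve this for a yes/no attribute. The sketch below uses the textbook coin-flip variant: with probability 1/2 the true value is reported, otherwise a random value is reported. The function names are illustrative.

```python
import random


def randomized_response(truth, rng):
    """Report the true value half the time, otherwise a fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(responses):
    """Recover the population rate p from noisy responses.

    P(reported yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5.
    """
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5


rng = random.Random()
population = [rng.random() < 0.3 for _ in range(20000)]  # roughly 30% true "yes"
reported = [randomized_response(t, rng) for t in population]
print(round(estimate_true_rate(reported), 2))
```

Any single reported value may be the result of the coin flip, which gives each individual deniability, yet the aggregate rate can still be estimated from many responses.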

Global differential privacy is a method that randomizes aggregate data. This approach constrains data users to only formulate aggregate queries (e.g. count, average, max, min, etc.), and offers a mathematical guarantee against identity-, attribute-, participation-, and relational-based inference attacks. Individuals whose data is included in the queried data set are therefore able to deny their participation in the data set as a whole. The US Census Bureau, for example, employs global differential privacy because aggregation on its own is insufficient to preserve privacy.
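For a counting query, a common way to achieve this is the Laplace mechanism: add noise drawn from a Laplace distribution with scale 1/ε, since adding or removing one person changes a count by at most 1. The sketch below is illustrative only; the function names are hypothetical and floating-point noise generation like this is not hardened for production use.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Noisy count of matching values; a count has sensitivity 1."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1 / epsilon, rng)


ages = [34, 71, 68, 45, 80, 23, 67]
rng = random.Random()
# Noisy answer to "how many patients are 65 or older?"
print(dp_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng))
```

A smaller ε means more noise and stronger privacy; a larger ε means answers closer to the true count.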

With a data security platform that enables dynamic data masking, like Immuta, data teams are able to automate both randomized response and differential privacy. Randomized response helps achieve local differential privacy for specific columns that require a high level of protection, while global differential privacy enables computation of aggregate statistics in a privacy-preserving fashion.

Is De-identified Data Considered PHI?

So long as proper de-identification processes are followed and, in practice, a data audit trail is created, once data is de-identified it is no longer considered PHI under HIPAA.

For this reason, data de-identification is crucial when it comes to public health emergencies — while the need for real-time data is essential, so is guaranteeing privacy, confidentiality, and compliance. De-identifying data allows important health information to be disseminated without sacrificing privacy or confidentiality.

Reducing Compliance Costs

Immuta’s ability to automatically enforce policies for HIPAA Safe Harbor or Expert Determination on-read means data teams can avoid copying the data, and identifiers in the data set remain in the database for those with authorized access and need. This article demonstrates how Immuta’s automated Safe Harbor method policies and auditing can be used to de-identify a data set stored in Amazon RDS for PostgreSQL.

De-identify Data at Scale

Automation makes it possible to scrub rich data sets at scale, but this should only be done with the right policies in place. Rules must take into account who will be using the data, what purpose the data will be used for, and when it will be used. Immuta’s dynamic approach to automated data security and access control solves this with attribute- and purpose-based restrictions, which are applied at query time and therefore are easily scalable. Data security and compliance teams are able to create managed rules based on data usage, with no technical expertise required.
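The general idea of attribute- and purpose-based restrictions applied at query time can be illustrated with a small sketch. This is a generic example, not Immuta's API: the `User` class, the `apply_policy` function, and the attribute and purpose names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    attributes: frozenset  # e.g. department or training flags
    purpose: str           # the declared purpose for this query


def apply_policy(row, user, masked_columns, allowed_purposes, required_attribute):
    """Return the row unmasked only if both attribute and purpose checks pass."""
    allowed = user.purpose in allowed_purposes and required_attribute in user.attributes
    if allowed:
        return dict(row)
    # Otherwise mask the protected columns at read time; stored data is unchanged.
    return {c: (None if c in masked_columns else v) for c, v in row.items()}


row = {"patient_id": "a1b2", "zip": "43201", "diagnosis": "asthma"}
analyst = User(attributes=frozenset({"hipaa_trained"}), purpose="marketing")
print(apply_policy(row, analyst, {"zip"}, {"public_health_research"}, "hipaa_trained"))
```

Because the rule is evaluated per query against the user's attributes and purpose, the same underlying table can serve different users without maintaining separate masked copies.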


Depending on the use case, access and monitoring can be increased or decreased so you can easily customize and contextualize rules.

Reduce Reliance on Old Data

For Cognoa, a digital behavioral health company, Immuta natively and dynamically applies purpose-based restrictions to data, and enforces access and policy restrictions in real-time based on data users’ needs. This allows multiple parties to view the same data set, while blocking access to unauthorized portions of the data for individual uses.

In practice, Immuta’s dynamic data masking capabilities eliminated the use of data snapshots that could be months old and had to be cleansed and imported into a separate database. Immuta captures these snapshots to help track data and how it has been changed over time, including the impact of policy changes on outputs. Now, when Cognoa’s data users run a query, they are interacting with current data and are assured it meets compliance standards, and a complete data audit trail is provided to document compliance.

By automatically monitoring, logging, and providing reporting on every action within your data platform, Immuta shows you data access, policy changes, data use purposes, and exact queries in real-time. These reports and audit logs help provide transparency for data engineering and compliance teams. Immuta Detect also provides continuous data security monitoring and posture management by identifying potential threats to data and anomalous activity.

Transform the Way You Secure and Share Sensitive Data

Data de-identification and pseudonymization do not have to be complicated or difficult to implement and monitor. The Immuta Data Security Platform streamlines what used to be time-consuming, risk-prone approaches to data protection, enabling data teams to be more efficient and extract more value from their data. In addition to de-identification, Immuta offers a broad set of data security and privacy tools that can be scaled across cloud data platforms.

Request a demo to learn how Immuta can simplify operations, improve data security, and unlock more value from your data.
