How Does Data Classification Help Protect Data Privacy?

SOPHIE STALLA-BOURDILLON on April 28, 2023
Last edited: November 4, 2024

As data breaches and cyber attacks become more common, protecting data privacy is an increasingly important concern for companies that use data to compete. According to Cybercrime Magazine, the total cost of cybercrimes is an estimated $8 trillion, and is expected to climb more than 30% in the next two years. But despite their best efforts to avoid becoming a statistic, companies face significant challenges in protecting their sensitive data from potential threats, while meeting regulatory and internal compliance standards. There are simply too many data assets, users, and policies to manage, particularly in modern decentralized environments.

One of the first steps – and most effective approaches – to protecting data privacy is data classification. At a high level, this process helps identify and categorize data based on content and context and, in doing so, determine its classification and sensitivity level. In this blog, we will explore how data classification helps protect data privacy and the benefits it provides for companies that need to unlock more value from their data.

Why Is Data Privacy Important?

If companies’ sensitive data is inadvertently shared, stolen, or exposed, they risk facing financial losses, reputational harm, loss of competitive advantage, or – in the worst case – all of these consequences. Prioritizing data privacy isn’t just a matter of avoiding fines and penalties, or about using sensitive data in a way that’s ethical, transparent, and trustworthy; it’s also about gains in operational efficiency from streamlining access to controlled data.

However, in practice, the sheer volume of data, users, and cloud platforms poses significant challenges. Beyond that, categorizing sensitive data is itself a significant challenge for data protection. The category is broad, and may range from highly sensitive data such as PHI (Protected Health Information) and PII (Personally Identifiable Information) to seemingly non-sensitive data such as zip code, race, or gender. Even so, these “non-sensitive” data types could be indirect identifiers, meaning that in the right scenario they could be used to identify an individual, and thus violate data privacy standards. Therefore, sensitive data must be masked, anonymized, or excluded from analytics data sets, depending on regulatory requirements and a company’s internal data governance policies.

Data Classification 101

Next, we are going to explain what data classification is and how it relates to sensitivity level. In doing so, we will explain what organizational compliance postures are, why automatic data classification first requires that data is categorized, and how both categorization and classification depend critically on data context (at rest and in use).

Finally, we’ll explain the three-step process for data classification in Immuta, and how to leverage the Immuta Data Security Framework to quickly implement data classification for your organization.

What is Data Classification?

How do organizations deal with potential for data misuse? One option might just be to put everything through the same processes: lock it down, guard it tightly, and review every access with extreme scrutiny.

While safeguarding everything on the same level as the most sensitive data certainly gets the job done, it’s not a very appealing option. For one, it’s expensive, both in terms of human effort and lost organizational efficiency, as data becomes effectively unavailable outside of narrow silos. At scale, it’s hard to even know what data an organization has, let alone how to process it.

The problem is clear. Not all data is created equal, so why should we expect every data set to have the same impact if released or misused? Instead, modern organizations typically classify data by its potential impact.

Data Classification Standards

In fact, just like other cybersecurity standards and guidance (see SOC 2 or the NIST Cybersecurity Framework, for example), ISO/IEC 27001 – the ubiquitous global standard for information security management – emphasizes the importance of classification in protecting information and managing access control. Annex A of ISO/IEC 27001 goes on to recommend that organizations develop a data classification policy that sorts information into a defined set of classification levels according to stated criteria, as specified in ISO/IEC 27002 section 5.12. Specifically:

The classification can be determined by the level of impact that the information’s compromise would have for the organization. Each level defined in the scheme should be given a name that makes sense in the context of the classification scheme’s application.

What Does This Look Like In Practice?

Many organizations ultimately end up with a data classification policy specifying a three or four level classification system. Common systems utilize some variation of Public, Internal, Confidential, and Restricted as classification levels. Typically, these are broken down along the following lines:

| Classification | Description | Internal Sensitivity/Availability | External Sensitivity/Availability |
| --- | --- | --- | --- |
| Public | Data is approved for public release. | Low sensitivity. Broad internal availability. | Low sensitivity. Broad external availability. |
| Internal | Data is not approved for public release, but misuse or improper release is unlikely to cause more than minimal harm. | Low to moderate sensitivity. Broad internal availability. | Moderate sensitivity. Available to select third parties by approval. |
| Confidential | Misuse or improper release may reasonably cause moderate harm. | Moderate to high sensitivity. Internal availability limited to approved groups, may be further restricted by use case. | High sensitivity. Available to select third parties by approval and through a restrictive agreement. |
| Restricted | Misuse or improper release may reasonably cause substantial harm. | High sensitivity. Internal availability is extremely limited, requiring a narrowly approved use case. | Extremely high sensitivity. Possibly governed by law. Available only to select third parties by approval and through a restrictive agreement. |

These levels typically come along with a policy that specifies general treatment of data by classification including, for example, that restricted data must be stored and processed on approved systems. It also includes specific rules for certain categories of data, which say things like: Employee Performance Review data is to be marked Confidential and shall be available only to Human Resources.

Generically, an organization’s data classification policy spells out which categories of data receive which classifications, as well as any usage restrictions. These policies specify in great detail what could broadly be summarized as the organization’s compliance posture.
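
As a rough illustration of how such a policy might be written down in machine-readable form, here is a minimal Python sketch. It is not Immuta's API: the `Level`, `PolicyRule`, `POLICY`, and `may_access` names are hypothetical, the "Press Release" category is invented for contrast, and denying unknown categories by default is an assumption rather than something the policy above prescribes.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The four-level scheme described above, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class PolicyRule:
    level: Level
    allowed_groups: frozenset  # an empty set means "no group restriction"


# Hypothetical policy entries; category and group names are illustrative.
POLICY = {
    "Employee Performance Review": PolicyRule(
        Level.CONFIDENTIAL, frozenset({"Human Resources"})
    ),
    "Press Release": PolicyRule(Level.PUBLIC, frozenset()),
}


def may_access(category: str, group: str) -> bool:
    """Return True if `group` may access data in `category` under POLICY."""
    rule = POLICY.get(category)
    if rule is None:
        # Assumption: categories the policy does not mention are denied.
        return False
    return not rule.allowed_groups or group in rule.allowed_groups
```

With these entries, `may_access("Employee Performance Review", "Human Resources")` is allowed while the same request from any other group is refused, which mirrors the example rule quoted above.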

Compliance Posture

Roughly speaking, compliance posture represents an organization’s position on the classification and use of various categories of data. A compliance posture is informed by one or more compliance laws and regulations applicable to the organization. Common themes, and sometimes interesting differences, become apparent when looking across many organizations in the same industry vertical. Postures tend to vary somewhat predictably as a function of industry vertical, operational jurisdiction, and – to the extent the data pertains to people – the nature and locations of the data subjects.

For example, due to HIPAA, clinical health care providers in the United States tend to view any data stored in conjunction with the patient’s medical record as protected health information (PHI), while companies operating under CCPA may sometimes store (non-patient) health data along with non-health data.

In effect, this means that policies restricting the availability of medical information look very different in the two kinds of organizations. In the clinical setting, the policy pertains to all fields in the data, including items that in isolation are non-medical: anything joined to the patient record in processing becomes part of it. In the broader setting under CCPA, the policy only applies to select data elements and can be effectively realized with column-level controls.

Context is Key. Context is Dynamic.

The distinction in the organizational understandings of medical data in the preceding example highlights the importance of understanding context. An address, in isolation, is just an address. An address as part of a patient medical record in a U.S. clinical context? Well that’s protected health information. Classifying the data therefore first requires categorizing the data. Moreover, to separate non-clinical addresses from protected health information, this categorization must account for the context.

Now context is a tricky thing. It moves. Imagine that we have two tables. One contains anonymized health information, devoid of all identifiers. The other contains patient identifiers without any clinical information. The Safe Harbor provisions of the HIPAA Privacy Rule allow the de-identified information to be legally exempted from the title of protected health information. At rest we have two tables, neither of which contains protected health information. However, if the two tables are queried together, then at query time we create, process, and possibly output protected health information.
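
The two-table scenario can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a statement of HIPAA's legal test or of how any product evaluates queries: the table names, column names, and tags are hypothetical, and the rule "health data plus a direct identifier in the same context counts as PHI" is an assumption made for the example.

```python
# Hypothetical column tags for two tables at rest.
TABLE_TAGS = {
    "deidentified_visits": {
        "diagnosis_code": {"health"},
        "visit_year": set(),
    },
    "patient_directory": {
        "patient_name": {"direct_identifier"},
        "address": {"direct_identifier"},
    },
}


def context_tags(tables):
    """Return the union of all tags on all columns of the given tables."""
    tags = set()
    for table in tables:
        for column_tags in TABLE_TAGS[table].values():
            tags |= column_tags
    return tags


def contains_phi(tables) -> bool:
    """Assumed rule: health data plus a direct identifier in one context."""
    tags = context_tags(tables)
    return "health" in tags and "direct_identifier" in tags
```

Evaluated one table at a time, `contains_phi` is false for both tables. Evaluated over the pair, as a query touching both would be, it is true. The classification depends on the context being evaluated, not only on the data at rest.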

Good Compliance Posture Goes Beyond Good Compliance

Having access to anonymized health data is handy. The relaxed conditions on the data mean that medical researchers can more easily access it. Fewer hoops to jump through is a big win for operational efficiency. This data would otherwise be off limits, outside of patient treatment. In other words, good data classification hygiene streamlines access to tricky data!

However, this comes with a new risk: the accidental re-identification of de-identified data, a risk that somewhat paradoxically increases as researchers are able to share the data more broadly internally. In this case, a database query took in two data sets, neither of which contained protected health information, and accidentally created PHI in the process. Situations like this are why Immuta Detect offers Dynamic Query Classification.

The Data Classification Journey

Data classification in Immuta can be thought of as a three-step process comprising discovery, categorization, and sensitivity determination.

| Immuta System Component | Process Step | Rule Definitions |
| --- | --- | --- |
| Data Classification | Sensitivity Determination | Customer Framework Implementing Data Classification Policy |
| Data Classification | Categorize | Zero or More Frameworks, e.g. DSF |
| Sensitive Data Discovery (SDD) | Discover | SDD Pattern Rules |

Discover

First, sensitive data discovery (SDD) performs a process known as entity discovery to infer the semantic meanings of data elements. It’s in this phase that information like email addresses, credit card numbers, names, etc. get labeled as such.
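
To give a feel for what pattern-based entity discovery involves, here is a minimal Python sketch. It is not how SDD is implemented: the two regular expressions are intentionally crude, and the `PATTERNS` table, the `discover` function, and the 80% match threshold are all assumptions chosen for illustration.

```python
import re

# Illustrative patterns only; production detectors need far more care
# (checksums, international formats, surrounding context, and so on).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    # Four groups of four digits with one consistent separator (or none).
    "credit_card": re.compile(r"^\d{4}([ -]?)\d{4}\1\d{4}\1\d{4}$"),
}


def discover(values, threshold=0.8):
    """Label a column from a sample of its values.

    Returns every label whose pattern matches at least `threshold`
    of the non-empty sampled values.
    """
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return set()
    labels = set()
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.match(v))
        if hits / len(sample) >= threshold:
            labels.add(label)
    return labels
```

Working from a sample and a threshold, rather than requiring every value to match, is one common way to tolerate dirty data such as placeholder strings in an otherwise well-formed column.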

Categorize

Next, the framework categorization engine analyzes the data in the context in which it appears, under one or more frameworks, to infer the roles of various data elements from their relationships to each other.

What is a framework? It’s a set of categorization rules for implementing a compliance posture. To make things easy, we supply a base framework, called the Immuta Data Security Framework (DSF), which contains rules and categorizations for identifying when a given data context, such as a table at rest or a query being processed, contains – among other things – personal, financial, and/or health information.

For example, say your organization falls under HIPAA. To define a framework implementing a HIPAA-like compliance posture, you:

  1. Introduce a framework tag called HIPAA.PHI and set its sensitivity level at 2, which in this example is chosen to be at a level above other forms of non-medical personal information.
  2. Next, define a pair of rules. The first leverages the DSF to further categorize any data categorized as Immuta DSF.Health as HIPAA.PHI. The second rule categorizes any column accessed in the same context as HIPAA.PHI as HIPAA.PHI.

The effect of this framework is simple. The first rule says that any column recognized as Health data by the DSF (which includes, for example, medical diagnostic codes), should be categorized as PHI under this framework. The next rule says that any data appearing in a table or query along with anything categorized as PHI must also be categorized as PHI.
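
The logic of these two rules can be sketched in Python as follows. This is only an illustration of the reasoning, not Immuta's rule syntax or engine: the `apply_hipaa_framework` function is hypothetical, and the tag strings `"DSF.Health"` and `"HIPAA.PHI"` are plain-string stand-ins for the framework tags named above.

```python
def apply_hipaa_framework(column_tags):
    """Apply the two rules to the columns of one context.

    `column_tags` maps column name -> set of tags already assigned.
    A context is either a table at rest or the columns touched by a query.
    The input is not modified; a new mapping is returned.
    """
    result = {col: set(tags) for col, tags in column_tags.items()}

    # Rule 1: anything the base framework tagged as health data is PHI.
    for tags in result.values():
        if "DSF.Health" in tags:
            tags.add("HIPAA.PHI")

    # Rule 2: if any column in the context is PHI, every column is PHI.
    if any("HIPAA.PHI" in tags for tags in result.values()):
        for tags in result.values():
            tags.add("HIPAA.PHI")

    return result
```

For a context with a diagnosis-code column tagged `"DSF.Health"` and an untagged address column, both columns come back tagged `"HIPAA.PHI"`. A context with no health data is returned unchanged.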

If needed, even the DSF can be edited. Say the HIPAA organization merges with a hospital system that uses a previously unrecognized medical record number (MRN) format. Here, the organization would simply add a pattern rule for Discovery to recognize and tag the former competitor’s MRN as Former Competitor.MRN, and add rules to the DSF to ensure it is classified as DSF.Medical Record Name. From this, internal logic in the DSF will automatically further determine that it is a health-related personal direct identifier, and higher-level categorization rules will continue to work automatically. For example, a Former Competitor.MRN appearing among billing information would then automatically be understood by the DSF as Personal Financial information, among other things.

Classify

Finally, zero or more classifications are assigned which, as a function of context, determine the sensitivity level.

Continuing our example and leveraging the ability to have frameworks that build on each other, an organization utilizing the Public, Internal, Confidential, Restricted system implements a [Framework] with the four categories. For each of the four classification levels, the company sets its sensitivity levels accordingly, with say, Public at 0, through Restricted at 3. If the HIPAA organization is using such a system, it may choose to update the rules of this framework to add one that categorizes HIPAA.PHI as Restricted.
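
A minimal Python sketch of this last step might look like the following. It is an illustration under stated assumptions rather than Immuta's implementation: the text above only fixes Public at 0 and Restricted at 3, so the values 1 and 2 for Internal and Confidential are assumed, and the `CLASSIFICATION_RULES` mapping and both functions are hypothetical. Taking the highest level present as the sensitivity of a context is also an assumption.

```python
# Sensitivity per classification level; 1 and 2 are assumed intermediate values.
SENSITIVITY = {"Public": 0, "Internal": 1, "Confidential": 2, "Restricted": 3}

# Hypothetical rules mapping category tags to classification levels.
CLASSIFICATION_RULES = {
    "HIPAA.PHI": "Restricted",
    "Employee Performance Review": "Confidential",
}


def classifications(tags):
    """Return the (possibly empty) set of classification levels for `tags`."""
    return {CLASSIFICATION_RULES[t] for t in tags if t in CLASSIFICATION_RULES}


def sensitivity(tags):
    """Return the highest sensitivity among assigned levels, or None."""
    levels = classifications(tags)
    if not levels:
        return None
    return max(SENSITIVITY[level] for level in levels)
```

Returning `None` when no rule applies reflects the "zero or more classifications" wording. Whether unclassified data should then fall back to a default level is a policy decision this sketch leaves open.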

Functionally, both the Categorize and Classify steps are carried out by the same Immuta system component, namely frameworks, but for different purposes. Here, the Classify step implements the company’s data classification policy by further categorizing data into a named classification level, thereby indicating sensitivity.

Wrapping Up

Most organizations have sensitive data and access policies to protect data privacy. However, applying the right level of sensitivity and correctly enforcing appropriate policies can be complex and overwhelming for data owners. As a result, sensitive data is often left unprotected and vulnerable to unauthorized access.

Data Classification is an important tool in the data security arsenal to automatically classify data based on sensitivity and context. This allows organizations to effortlessly, and with utmost flexibility, enforce their data privacy policies and protect sensitive data from unauthorized access according to its sensitivity level.
