We’re accustomed to new technological advances and capabilities emerging all the time. However, mere availability does not necessarily translate into immediate adoption.
This is at least somewhat true for differential privacy. The first seminal contribution on the topic was published in 2006 by Microsoft Distinguished Scientist Cynthia Dwork, who said that “in many cases, extremely accurate information about the database can be provided while simultaneously ensuring very high levels of privacy.” Yet despite being one of the strongest existing privacy enhancing technologies (PETs) for de-identification, it’s not always top of mind for data engineering, data science, and data governance teams.
Why is this, and how is differential privacy a game changer for data teams looking to reduce re-identification risks and derive insights and utility from their data? Let’s explore what data-driven organizations stand to gain with differential privacy.
What is differential privacy?
Differential privacy is a privacy enhancing technology in which randomized noise is injected into the data analysis process. Unlike PETs that are not process-based, such as masking or generalization, differential privacy makes it possible to calibrate the noise to each query at the time it is made. This lets teams navigate the trade-off between privacy and utility precisely, maximizing data’s value while preserving data privacy.
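To make that calibration concrete, here is a minimal sketch in Python (using NumPy) of the Laplace mechanism, one standard way to add query-calibrated noise; the function name and the example values are illustrative, not any particular product’s API.

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Return a noisy, differentially private version of a numeric query result.

    sensitivity: the most the answer can change if one record is added or removed
    epsilon:     the privacy parameter; smaller epsilon means more noise
    """
    scale = sensitivity / epsilon              # noise is calibrated to this query
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# A count query has sensitivity 1: one person changes the count by at most 1.
private_count = laplace_mechanism(true_answer=1204, sensitivity=1.0, epsilon=0.5)
print(round(private_count))
```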
What are the most common approaches to differential privacy?
There are two common approaches to differential privacy:
1. Global Differential Privacy
Global differential privacy ensures that an individual whose data is being queried can plausibly deny their participation in the data set used to produce analysis results. In other words, being included in the data set will not significantly increase the likelihood of re-identification.
Global differential privacy offers a number of desirable properties:
- It protects against post-processing, including processing by adversaries with access to external or even future information, because differential privacy modifies the analysis process so that query results do not depend heavily on any single item or data point within the database.
- Its use of randomization also guarantees that any result produced on the full database is almost as likely to have been produced on a version of the database that omits any given record. Observing a particular result therefore reveals essentially nothing about whether any single data point is present.
With global differential privacy, you can only ask questions that will generate aggregates (e.g. minimum, maximum, average, count and sum). In principle, aggregates can be used in a variety of cases, such as to analyze data to improve processes, products, or services, to create customer profiles to ensure product or service maintenance, to derive insights that drive development of new goods or services, and more. Creating profiles on the basis of aggregates is likely the least obvious use case — and one that requires skill and expertise — but is feasible.
What does this look like in practice? Let’s say you’d like to find out how many customers in Company Z’s database bought product B after buying product A. Although an individual, John Smith, appears to have purchased both products, the presence of his record will have only a slight influence on the resulting count. This is because differential privacy ensures that the result obtained by running this particular count occurs with nearly the same probability as it would on a version of the database that does not include John’s data. As a result, John could plausibly argue that he never bought any product from Company Z.
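As a rough illustration of that claim (the counts below are made up), the following sketch releases the same noisy count from two neighboring databases, one with John’s record and one without, to show how little his presence shifts what gets published:

```python
import numpy as np

epsilon = 0.5
scale = 1.0 / epsilon            # a count has sensitivity 1

count_with_john = 347            # hypothetical: customers who bought A, then B
count_without_john = 346         # the same query if John's record were removed

rng = np.random.default_rng(7)
released_with = count_with_john + rng.laplace(0.0, scale)
released_without = count_without_john + rng.laplace(0.0, scale)

# The two noisy outputs come from distributions that differ by at most a
# factor of e^epsilon, so the published count reveals almost nothing about
# whether John's record was in the data.
print(round(released_with), round(released_without))
```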
2. Local Differential Privacy
If aggregates are too broad for the analysis you wish to perform, for instance if you need access to individual-level data, local differential privacy is a strong alternative. Unlike global differential privacy, with local differential privacy an individual cannot deny participation in the data set, but can deny the contents of their record. The output of the process is therefore a set of individually noised records. Local differential privacy has great potential for supervised machine learning but is generally underutilized.
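For numeric data, a minimal sketch of what individually noised records can look like is below; the spending column, its assumed upper bound, and the epsilon value are all illustrative:

```python
import numpy as np

def noise_record(value: float, upper_bound: float, epsilon: float) -> float:
    """Locally noise a single numeric value before it is collected.

    upper_bound: the assumed maximum possible value, which acts as the
                 sensitivity in the local model (here, a monthly spend cap).
    """
    scale = upper_bound / epsilon
    return value + np.random.laplace(0.0, scale)

# Each individual's device perturbs its own record; the analyst only ever
# sees the noised values, yet averages over many records remain useful.
monthly_spend = [120.0, 85.5, 310.0, 42.0]
noised = [noise_record(v, upper_bound=500.0, epsilon=1.0) for v in monthly_spend]
print(noised)
```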
The key to fully implementing differential privacy, and local differential privacy in particular, is to understand that machine learning models should be built within controlled environments that rely on strict data access control and clear role allocation across several lines of defense. In such an environment, the workflow (sketched in code after this list) is to:
- Start by building a version of the model without differential privacy, note its baseline performance, then throw it away.
- Next, iteratively build models with more and more noise until you reach a minimum acceptable threshold for performance or a maximum acceptable threshold for privacy loss.
- Release the model into production, assuming the privacy loss is acceptable.
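A rough sketch of that loop is below; train_with_noise, evaluate, and privacy_loss are stand-ins for whatever differentially private training, validation, and privacy accounting routines your team uses, and the thresholds and fake numbers exist only to make the example run.

```python
def train_with_noise(noise: float) -> dict:
    """Placeholder for a differentially private training routine (e.g. DP-SGD)."""
    return {"noise": noise}

def evaluate(model: dict) -> float:
    """Placeholder validation step; fakes accuracy degrading as noise grows."""
    return 0.95 - 0.08 * model["noise"]

def privacy_loss(noise: float) -> float:
    """Placeholder privacy accountant; more noise means a smaller epsilon."""
    return 2.0 / (noise + 0.5)

MIN_ACCURACY = 0.80   # minimum acceptable model performance
MAX_EPSILON = 3.0     # maximum acceptable privacy loss

# 1. Baseline without differential privacy, noted and then thrown away.
baseline = evaluate(train_with_noise(0.0))

# 2. Iteratively add more noise until a threshold is crossed.
chosen = None
for noise in (0.5, 1.0, 1.5, 2.0, 2.5):
    model = train_with_noise(noise)
    if evaluate(model) < MIN_ACCURACY or privacy_loss(noise) > MAX_EPSILON:
        break
    chosen = model

# 3. Release `chosen` to production only if its privacy loss is acceptable.
print(baseline, chosen)
```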
Why is differential privacy a game changer?
Data anonymization approaches have been around for a long time. The most widely used technique is masking values in data to hide their true meaning while still preserving some utility. Years ago, cutting the last digits from a zip code to remove precision was considered sufficient to protect data subjects’ exact locations, but with today’s advanced technology, that method no longer cuts it.
To put a fine point on the shortcomings of traditional, non-differentially private anonymization methods, consider this example: It’s 1985 and you’ve been asked to name the run time of The Terminator. What do you do? Probably get in your car, drive to the nearest video store, and read the VHS tape’s box; a time commitment, to say the least. If you got that question today, you could answer in a matter of seconds using a search engine. Now, let’s flip the script: It’s 1985 and you’re asked to name the popular movie made in 1984 with a run time of 107 minutes. Difficult then, not so much now.
Without access to additional information or resources, as would be the case in 1985, masking the title, actors, and synopsis might be enough to keep the movie from being identified. This is how traditional anonymization techniques function. But with more data than ever available at our fingertips, that approach is risky: knowing just the release year and run time is enough to link the masked record back to The Terminator. That’s why statistical bureaus could rely on traditional anonymization techniques last century, yet when Netflix released supposedly anonymized viewing data, researchers were able to re-identify individuals by linking it to publicly available information. This is called a linkage attack, and the proliferation of accessible information makes such attacks increasingly easy to pull off.
Here’s where differential privacy comes in. Differential privacy offers what traditional techniques cannot: a mathematical guarantee that the results released from a data set reveal almost nothing about any individual record. It goes without saying that providing a statistical guarantee of privacy makes sharing data much simpler. Data sharing comes with a host of benefits, like unlocking secondary use cases for existing data (not originally collected for that purpose), selling data, collaborating with skilled external data engineers and scientists, executing data exchanges that make you and your collaborator more powerful together, or even identifying philanthropic use cases, all while protecting your data subjects’ privacy.
How does the math behind differential privacy work?
Typically, when you read about differential privacy you get a math equation thrown at you with unintelligible explanations. But when you remove the fine details, it’s actually pretty simple.
Let’s look at an example where there’s sensitive information to protect: finding out the proportion of people who hide purchasing information from their spouses. You gather 100 people in a room and ask them to pick a number between 1-10, but not to share their answers. Next, you ask them to raise their hand if they hide purchases from their spouse or picked a three in the previous question. By asking two questions you’ve injected noise into the response, providing plausible deniability to everyone who raises their hand. Based on the number of hands raised, and knowing the probability of randomly choosing a three, you can estimate the true proportion of people who hide purchasing information from their spouses while also protecting their privacy. In essence, the noise injected into the responses protects individuals’ privacy.
Now, what if you ask the group to raise their hands if they hide purchases, chose a three in the first question, and are wearing a pink shirt. People may be more apprehensive because with few pink shirts in the crowd, the question is more sensitive. In this case, you’d probably need more noise. This could be done by saying everyone has to pick a number between 1-3 instead of 1-10.
This technique of adding noise at data collection time is known as randomized response, and it is a local differentially private mechanism. Google has documented how it uses randomized response to collect anonymous usage statistics in the Chrome browser.
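Under the assumptions of the example above, a one-in-ten chance of having picked a three and an illustrative true rate of 30%, here is a small simulation showing how the true proportion can be recovered from the noisy show of hands:

```python
import random

random.seed(42)
N = 100_000          # simulate a much larger room so the estimate is stable
TRUE_RATE = 0.30     # illustrative true share of people who hide purchases
P_THREE = 1 / 10     # chance of having picked a three

# Each person raises a hand if they hide purchases OR they picked a three.
raised = sum(
    1 for _ in range(N)
    if random.random() < TRUE_RATE or random.random() < P_THREE
)

# Invert the noise: P(raise) = p + (1 - p) * P_THREE,
# so p = (P(raise) - P_THREE) / (1 - P_THREE).
observed = raised / N
estimate = (observed - P_THREE) / (1 - P_THREE)
print(f"observed raise rate: {observed:.3f}, estimated true rate: {estimate:.3f}")
```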
What are the common challenges to implementing differential privacy?
If the math behind differential privacy is so straightforward, why aren’t more data teams leveraging it? Historically, a few major challenges have stood in the way:
Challenge 1: Aggregate Questions Only
As mentioned, differential privacy requires restricting questions to aggregates only. Since noise is added to the response, answers must be numerical. This means you’re unable to ask for literal rows of the data, only aggregate questions about it.
Challenge 2: Determining Sensitivity
It’s difficult to judge how sensitive a question is just by looking at the database query. Doing so requires knowing in advance which data is sensitive and assigning noise accordingly, while also accounting for the possibility that sensitive attributes could appear in any combination within the group being analyzed. In the example from the previous section, you’d have to know there aren’t many people wearing pink shirts and plan for every possible number of pink shirts being present in the group. It’s a nearly impossible scenario to anticipate.
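A toy illustration of why this matters: the scale of the noise has to match how much a single record can move the answer, and that differs from query to query (the figures below, including the 10,000 purchase cap, are made up):

```python
import numpy as np

epsilon = 0.5

# A count changes by at most 1 when one record is added or removed.
count_sensitivity = 1.0

# A sum of purchase amounts can change by the largest single purchase,
# which has to be known or bounded in advance; here it is assumed to be 10,000.
sum_sensitivity = 10_000.0

noisy_count = 523 + np.random.laplace(0.0, count_sensitivity / epsilon)
noisy_sum = 1_250_000 + np.random.laplace(0.0, sum_sensitivity / epsilon)

# The sum needs 10,000 times more noise than the count, and getting the
# bound wrong (like not anticipating the lone pink shirt) breaks the guarantee.
print(round(noisy_count), round(noisy_sum))
```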
Challenge 3: Privacy Budget
If you’ve ever been to a restaurant when it opens, you know how enjoyable it is not having to yell or lean across the table to have a conversation; you can hear everything clearly. But as the restaurant gets busier, you have to talk more loudly and listen more closely, and before long, can you be 100% sure that what you heard your fellow diner say was actually what they said? If differential privacy is a restaurant and the queries being run are the other diners, it’s easy to see how, as queries are added, the data gets noisier and less reliable. The privacy budget is the equivalent of the restaurant’s capacity limit: in differential privacy, it caps the total privacy loss, and therefore the number of questions, that a data set can absorb. However, just as a capacity limit means some would-be diners are turned away, the privacy budget limits data exploration and insight gathering.
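In code terms, a privacy budget is often just a running total of epsilon that gets checked before each query is answered. Here is a minimal sketch; the budget size and the per-query cost are illustrative:

```python
class PrivacyBudget:
    """Track cumulative privacy loss (epsilon) and refuse queries past the cap."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, query_epsilon: float) -> bool:
        """Record the spend and return True if the query fits in the budget."""
        if self.spent + query_epsilon > self.total_epsilon:
            return False                     # the restaurant is at capacity
        self.spent += query_epsilon
        return True

budget = PrivacyBudget(total_epsilon=3.0)
for i in range(10):
    if not budget.charge(query_epsilon=0.5):
        print(f"query {i + 1} refused: privacy budget exhausted")
        break
```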
Challenge 4: Noisy Data
Watch any interview and you’ll see that when the interviewer asks their subject a sensitive question, the answer is filled with vague or irrelevant information. Similarly, with differential privacy, highly sensitive queries will likely return noisy data — data that’s not well suited for analysis. This is often a dead end since it maximizes data privacy but minimizes its utility.
How is differential privacy implemented?
Many data teams struggle with differential privacy implementation, but protecting data with differential privacy doesn’t have to be complicated. Modern data access control tools like Immuta dynamically enforce differential privacy on data without requiring a custom database or custom query language.
Immuta’s differential privacy technique works like any other policy and can be easily added to data exposed from any database in your organization through our advanced data policy builder, so noise is only added for users that don’t meet the policy’s conditions.
Since adding noise based on the sensitivity of a question is the heart of differential privacy, and because noise isn’t injected during data collection, Immuta dynamically adds it to query results at run time. In fact, Immuta can add noise in such a way that statistically guarantees the privacy of individual records, on the fly, just like all other policies in Immuta.
Let’s revisit the common challenges of differential privacy implementation to see how Immuta helps overcome them:
- Aggregate Questions Only: Immuta acts as a virtual control plane between data analysts/scientists and databases, which provides a natural injection point for restricting the SQL statements that can be run, so only aggregate questions get through.
- Determining Sensitivity: Immuta’s sensitive data discovery capability quickly and dynamically detects the sensitivity of a question relative to the available data by intercepting and managing the query.
- Privacy Budget: Immuta sits between data and data consumers, allowing it to capture questions that have already been answered and provide a previously-calculated noisy response instead of generating a new one. Based on how often your data is changing — since most data isn’t static — you’re able to tell Immuta how often to refresh the noise in its responses.
- Noisy Data: With the aforementioned sensitive data discovery, Immuta can understand a question’s sensitivity. Instead of adding a significant amount of noise for very sensitive questions, Immuta simply blocks the query and instructs data consumers to ask something less sensitive. This aids data scientists in learning to use differential privacy and protects them from using responses that are wildly inaccurate.
With Immuta’s dynamic privacy enhancing technologies, like differential privacy, data teams have achieved 100% growth in permitted use cases by safely unlocking sensitive data while increasing data engineering productivity by 40%. Clearly, differential privacy doesn’t have to be a challenge, but it should be top of mind for data science and governance teams.
Curious how Immuta’s differential privacy and other features work in action? See for yourself when you request a demo.