Data has been called “the new gold” for its ability to transform and automate business processes; it has also been called “the new uranium” for its ability to violate the human right to privacy on a massive scale. And just as nuclear engineers can effortlessly enumerate the fundamental differences between gold and uranium, so too must data engineers learn to instinctively identify and separate dangerous data from the benign.
Take, for instance, the famous “link attack” that re-identified the medical records of several high-profile patients of Massachusetts General Hospital. In 1997, MGH released about 15,000 records from which names and patient IDs had been stripped. Despite these precautions, Harvard researcher Latanya Sweeney was able to connect publicly available voter registration data to the anonymized medical records by joining them on three indirect identifiers: zip code, birth date, and gender. The join left Sweeney only a handful of records to sift through, and she re-identified many individuals, most notably the governor of Massachusetts.
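To make the mechanics concrete, here is a minimal sketch of such a join in Python with pandas. The datasets, column names, and values are hypothetical illustrations, not Sweeney's actual sources.

```python
import pandas as pd

# Public voter roll: names alongside the quasi-identifiers.
voters = pd.DataFrame({
    "name":       ["Alice Example", "Bob Sample"],
    "zip_code":   ["02138", "02139"],
    "birth_date": ["1950-01-01", "1972-06-15"],
    "gender":     ["F", "M"],
})

# "Anonymized" medical extract: names stripped, quasi-identifiers intact.
medical = pd.DataFrame({
    "zip_code":   ["02138", "02139"],
    "birth_date": ["1950-01-01", "1972-06-15"],
    "gender":     ["F", "M"],
    "diagnosis":  ["hypertension", "asthma"],
})

# An ordinary join re-attaches identities to diagnoses; no hacking required.
reidentified = medical.merge(voters, on=["zip_code", "birth_date", "gender"])
print(reidentified[["name", "diagnosis"]])
```

The attack needs nothing more exotic than a merge: any record whose quasi-identifier combination is unique in both tables falls out with a name attached.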
More than twenty years later, every business is an MGH and every person with internet access is a potential Latanya Sweeney. Yet, we all want a world where data is handled responsibly, shared cautiously, and leveraged only for the right purposes. Our greatest limitation in realizing that world is not one of possibility but responsibility; it’s not a question of “How?” but “Who?”.
I believe data engineers must be the ones to take ownership of the problem and lead. Controlling the re-identifiability of records in a single dashboard is good analytics hygiene, but what matters more is preserving privacy in the platform that delivers the data. Managing privacy loss is a systemic problem demanding systemic solutions, and data engineers build the systems.
The mandate to protect privacy does not translate to a boring exercise in implementing business logic; it presents exciting new technical challenges. How can we quantify the degree of privacy protection we are providing? How can we rebuild data products — and guarantee they still function — after an individual requests that their data be deleted? How can we translate sprawling legal regulations into comprehensible data policies while satisfying data-hungry consumers?
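On the first of those questions, one widely used answer comes from differential privacy (named in the next paragraph): its epsilon parameter puts a number on how much any one person's data can affect a published result. Below is a minimal sketch of a Laplace-noised count; the function name and figures are hypothetical, not a production mechanism.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count perturbed with Laplace noise of scale sensitivity / epsilon.

    Smaller epsilon means more noise and a stronger, explicitly quantified
    privacy guarantee; larger epsilon means a more accurate but leakier answer.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# The same hypothetical patient count released under two privacy budgets.
print(noisy_count(15_000, epsilon=0.1))
print(noisy_count(15_000, epsilon=1.0))
```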
We will need to formulate a new set of engineering best practices that extends beyond the familiar domains of security and system design. Determining what is best practice requires much practice, though. It is essential that engineering leaders push their teams to understand and address the pertinent issues: the strengths and weaknesses of data masking, techniques like k-anonymity and differential privacy, and emerging technologies such as federated learning. Ultimately, data engineers should know the practice of privacy by design as intuitively as they do the principle of least privilege.
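As a feel for one of those techniques, here is a minimal sketch of measuring k-anonymity over quasi-identifier groups. The helper and the tiny dataset are hypothetical illustrations, not a vetted implementation.

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Return k: the size of the smallest group sharing a quasi-identifier combination.

    A release is k-anonymous if every combination of quasi-identifier values
    appears at least k times; k = 1 means some record is unique and therefore
    linkable, exactly the situation exploited in the MGH example above.
    """
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical extract checked over the same three indirect identifiers.
records = pd.DataFrame({
    "zip_code":   ["02138", "02138", "02139"],
    "birth_date": ["1950-01-01", "1950-01-01", "1972-06-15"],
    "gender":     ["F", "F", "M"],
    "diagnosis":  ["hypertension", "flu", "asthma"],
})
print(k_anonymity(records, ["zip_code", "birth_date", "gender"]))  # prints 1
```

Computing k is cheap; the hard engineering judgment is deciding which columns count as quasi-identifiers and what k (or epsilon) is acceptable for a given release.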
The alternative, if history is any guide, is a world in which institutions publish “anonymized” data, and clever people and organizations reconstruct and repurpose it for their own ends. Managing privacy, far from being an abstract concern for philosophers and lawyers alone, has become a concrete problem perfectly suited for data engineers. It’s time they made it their own.
To read more about this topic and other essential tips for data engineers, check out 97 Things Every Data Engineer Should Know, available here.
Stephen Bailey is Director of Data & Analytics at Immuta, where he strives to implement privacy best practices while delivering business value from data. He loves to teach and learn, on just about any subject. He holds a PhD in educational cognitive neuroscience from Vanderbilt and enjoys reading philosophy.