Data Governance Anti-Patterns: The Copy & Paste Data Sharing Method

Published January 8, 2019

Last edited: November 5, 2024

Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data governance, they’re everywhere. Today’s anti-pattern probably isn’t thought of as a “pattern” at all because it feels so obvious and, on the surface, so “easy”: you share data by giving someone a copy. This is plain wrong, and we’ll explain why.

Imagine you run a small mom-and-pop pizza place and have been diligent about storing all of your inventory, delivery, and order data in a database. You know you could use powerful machine learning techniques to be more predictive about all that you do, but you don’t have data scientists on staff. So you hire external consultants and share your data and business goals with them so that they can build the models for you.

This isn’t an uncommon scenario. In big organizations, such as banks, this sharing occurs internally between groups. Also, large organizations share with each other in order to gain value from their combined data.

The key word here is “share.” How does that actually happen? In almost every organization we work with, the method is consistent and problematic (at least before Immuta comes in): copy and paste. Specifically, the data is copied out of the database, likely transformed in some way, and then shared as a static file. This explains why data scientists are so familiar with working with files: they’re always at the tail end of the copy and paste process.

Why does everyone gravitate to copy and paste? A few reasons.

Anonymization of data must occur: Going back to our pizza joint example, it’s likely that the owners would want to mask their customers’ names and personal information before sharing their data with the consultants. This requires some kind of extract/transform/load (ETL) process to copy the data out of the database, mask it appropriately, and then dump it somewhere to use as the final output to share. This technique isn’t limited to small pizza shops. It’s used in almost every large organization we meet with.
Legal: Anonymization techniques can be very complicated and involve sign-off from experts as well as auditing of where data is being shared (think GDPR). This requires manual workflows and the involvement of several employees beyond just IT. In many cases, data owners use this as an excuse not to share at all.
Database security: IT doesn’t want to manage more database accounts, especially for third parties, so they copy and paste instead. Remember how complex adding accounts can get if IT is also following our anti-patterns.
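
To make the copy and paste workflow concrete, here is a minimal sketch of the kind of ETL job described above, written against an in-memory SQLite database. The `orders` table, its columns, and the `mask` and `export_masked_snapshot` helpers are all hypothetical, and hashing a name is only a stand-in for whatever masking a real legal review would approve.

```python
import csv
import hashlib
import io
import sqlite3


def mask(value: str) -> str:
    """Replace a personal value with a short one-way hash (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def export_masked_snapshot(conn: sqlite3.Connection) -> str:
    """Extract every order, mask the customer name, and return a static CSV."""
    rows = conn.execute(
        "SELECT id, customer_name, pizza, total FROM orders ORDER BY id"
    )
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "customer", "pizza", "total"])
    for order_id, name, pizza, total in rows:
        writer.writerow([order_id, mask(name), pizza, total])
    return out.getvalue()


# Hypothetical pizza-shop database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, customer_name TEXT, pizza TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_name, pizza, total) VALUES (?, ?, ?)",
    [("Ana Diaz", "margherita", 12.5), ("Bo Chen", "pepperoni", 14.0)],
)

# The "copy": this string is the file that gets sent to the consultants.
snapshot = export_masked_snapshot(conn)

# Business continues, but the shared copy never sees the new order.
conn.execute(
    "INSERT INTO orders (customer_name, pizza, total) VALUES (?, ?, ?)",
    ("Cy Ward", "veggie", 13.0),
)
live_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
snapshot_count = len(snapshot.strip().splitlines()) - 1  # minus header row
print(live_count, snapshot_count)  # prints "3 2"
```

Even in this toy version, the shared file is stale the moment a new order arrives, and every additional audience with different masking needs would require another export job and another stored copy.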

The Problems With Copy and Paste

It turns out that the copy and paste method of data sharing directly hurts data science programs and leads to massive frustrations for the downstream data consumers. There are a few reasons behind this:

Data consumers have no process to discover or request access to the data across the organization.
They typically have to wait months (yes, months) for the data to arrive.
They’re working with a static snapshot of the data, which is typically months old – if not simply outdated – because of the above process.
They’re required to sign data usage agreements and therefore must be very careful about how they subsequently share the data with their colleagues.

And on top of that, the organization is frustrated because:

They lose insight into who has what data and how it’s being used.
They significantly increase storage costs, as many anonymized versions of the data need to be stored for different user scenarios.
They have complex ETL “spaghetti” to manage the creation of all the anonymized copies.
It isn’t clear how the anonymization policies are actually implemented (or if they are correct) across different data systems (see anti-pattern 1).
Biggest of all, their data science initiatives are stymied because, frankly, none of this works and nobody can access data in the way they need.

So how should you avoid this anti-pattern?

A great analogy is how Netflix led to the demise of Blockbuster. Blockbuster was the copy and paste method. After searching through the store on foot, you got the raw video, watched it, remembered to rewind, and returned it. Netflix changed that. Instead of copy and paste, they provided a live feed to the movie over the internet. The value here was discovering the movie you wanted through a web search and then having immediate access from your living room, without moving from your couch.

Data science programs need Netflix, not Blockbuster. With Blockbuster, they fall apart.

The Value of a Data Control Plane

In order to provide live access to data, we recommend implementing what Immuta terms a “data control plane.” This control plane is placed on top of your databases as an abstraction layer which provides discoverability, policy authoring, data access, full audit and request history, and dynamic anonymization at data-access-time. Taking this approach resolves all of the above issues.

Data owners can share their data with complex access and anonymization policies that are enforced at request and query-time. No ETL jobs, no extra storage.
Policies can be reviewed or enhanced by legal and compliance – the policies should be written in a way that’s simple to understand by all employees.
The control plane, which reduces your surface area for a security breach, acts as an abstraction to your database and doesn’t require new accounts to be created.
Data scientists can rapidly discover data, request access, and immediately be entitled to the data based on the logic of the access policy.
The data scientists are accessing live, up-to-date data through industry standard access patterns.
The data scientists 1) are comforted knowing they followed a recognized access process and 2) can share work (code) with their colleagues, knowing they’ll be able to access the data through this same control plane.
A full audit of all actions is captured, with reporting capability to fully understand who is using what data for what purpose.
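
The idea of enforcing policy at query time can be illustrated with a small sketch. This is not Immuta's API: the `User`, `Policy`, and `ControlPlane` classes, the purpose names, and the `orders` table are all hypothetical, and a real control plane would handle far more (discovery, request workflows, many policy types). The sketch only shows one access check, one masking rule, and an audit trail applied to live data rather than to a copy.

```python
import hashlib
import sqlite3
from dataclasses import dataclass, field


def mask(value: str) -> str:
    """Replace a personal value with a short one-way hash (illustrative only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


@dataclass
class User:
    name: str
    purposes: set[str]


@dataclass
class Policy:
    """Hypothetical policy: who may read a table and which columns are masked."""
    table: str
    required_purpose: str
    masked_columns: set[str]
    unmask_purpose: str


@dataclass
class ControlPlane:
    conn: sqlite3.Connection
    policies: dict[str, Policy]
    audit_log: list[tuple[str, str, str]] = field(default_factory=list)

    def query(self, user: User, table: str) -> list[dict]:
        """Check entitlement, read live rows, and mask them per the policy."""
        policy = self.policies.get(table)
        if policy is None or policy.required_purpose not in user.purposes:
            self.audit_log.append((user.name, table, "denied"))
            raise PermissionError(f"{user.name} is not entitled to {table}")
        self.audit_log.append((user.name, table, "allowed"))
        # The table name comes from a registered policy, never from user input.
        cursor = self.conn.execute(f"SELECT * FROM {policy.table}")
        columns = [col[0] for col in cursor.description]
        see_raw = policy.unmask_purpose in user.purposes
        results = []
        for row in cursor:
            record = dict(zip(columns, row))
            if not see_raw:
                for col in policy.masked_columns:
                    record[col] = mask(str(record[col]))
            results.append(record)
        return results


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_name TEXT, pizza TEXT)"
)
conn.execute(
    "INSERT INTO orders (customer_name, pizza) VALUES ('Ana Diaz', 'margherita')"
)

plane = ControlPlane(
    conn,
    {"orders": Policy("orders", "analytics", {"customer_name"}, "billing")},
)
consultant = User("consultant", {"analytics"})
owner = User("owner", {"analytics", "billing"})

masked_rows = plane.query(consultant, "orders")  # customer_name is hashed
raw_rows = plane.query(owner, "orders")          # customer_name is visible

# A new order is visible on the next query: no export job, no second copy.
conn.execute(
    "INSERT INTO orders (customer_name, pizza) VALUES ('Bo Chen', 'pepperoni')"
)
fresh_rows = plane.query(consultant, "orders")
```

Because masking happens when the rows are read, the same table serves both users, and the audit log records each request along with whether it was allowed.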

The data control plane is likely overkill for the pizza joint, but is absolutely critical for large organizations with disparate data silos as well as small organizations with very complex data policies to enforce (such as HIPAA and GDPR). With Immuta, this control plane can serve as the foundation for your data privacy initiatives.
