One of the most consistent challenges that we have seen when working with large global enterprises is a tendency to treat data catalogs and data marketplaces as the same objective, rather than separate ones. Conflating the two may also explain why data and analytical engineering teams, along with business users, feel disconnected from or apathetic toward data catalog initiatives. And when these stakeholders aren’t bought in, it’s only a matter of time until projects stall.
In short, it’s clear from our experience that modern data stacks require both a data catalog and a data marketplace – the key is to understand their differences and how best to leverage them.
A Primer on Data Catalogs vs. Data Marketplaces
Before diving into the benefits of a tech stack that includes a data catalog and data marketplace, we will first investigate the differences between them.
There are two simple rules when thinking about data catalogs versus data marketplaces:
- Data Catalogs are for builders
- Data Marketplaces are for consumers
Taking these rules further, a great analogy to consider is the “App Store” on your smartphone.
Before building an app – and committing resources to doing so – a company must believe it to be valuable. From there, development teams use software development kits (SDKs), tools, and libraries – provided by Apple and Google for their respective App Stores – to build applications faster and within the bounds of the platforms’ requirements. As more apps are published more quickly and with better quality, Apple, Google, and the original development company make more money.
Once an app is created, it is published to the App Store so consumers can use – and sometimes buy – it. You, the consumer, search the App Store for apps that will be helpful or entertaining, request a download, have the app provisioned to your phone, and then use it.
Here is a diagram that explains the application lifecycle:
Green represents where the builders interact. Blue represents where the consumers interact.
Builders are responsible for application publishing, but at that point they are no longer building – they are delivering what they built, which is why it is gray in the figure above. Builders also monitor the app’s usage metrics and reviews in order to guide new requirements, completing the circle.
This App Store analogy also aligns with the philosophy of data mesh. In a data mesh, different business units own the creation and publishing of data products, just as different companies own the building and publishing of apps in the App Store. This article won’t dig deep into data mesh, but it does assume that data mesh’s decentralized ownership model is in place.
Data Catalogs are for Builders
Your data engineering and/or analytical engineering teams – your builders – are responsible for developing valuable data products that help your organization grow and innovate. If those data products are not valuable, your data engineering team is not valuable. And worse, your organization will fall short on its data-driven initiatives, which are likely the lifeblood of your business.
Therefore, data products should be treated no differently than apps. They should have a vision and a product roadmap that spans from idea to R&D, release, maintenance, and retirement. And to ensure they are valuable, you must build them quickly, reliably, and correctly, and keep them well maintained to avoid downstream cascading breaking changes.
But while both Apple and Google give developers tools, libraries, and SDKs for building apps, providing an equivalent framework for building data products is a much harder job, because building data products requires:
- Exploring the lineage and observability of models, their tests, and metrics
- Exploring the reliability, quality, goals, sensitivity, classification, and handling procedures of data inventory
There is no out-of-the-box solution for this. However, there is a tool to help make it easier: the Data Catalog. Data catalogs provide the frameworks and interfaces to manage and collect the information above, so that data product creation can happen from the inventoried building blocks maintained in the catalog.
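To make the idea of inventoried building blocks concrete, here is a minimal sketch of what a catalog entry and a lineage lookup might look like. It is an illustration only: the `CatalogEntry` fields, the `full_lineage` helper, and the sample table names are all hypothetical and do not reflect any particular catalog product's data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One inventoried building block. Every field name here is hypothetical."""
    name: str
    owner: str
    classification: str          # sensitivity label, e.g. "internal" or "restricted"
    quality_tests_passed: bool   # stand-in for richer quality and reliability metadata
    upstream: list[str] = field(default_factory=list)  # direct lineage parents


def full_lineage(catalog: dict[str, CatalogEntry], name: str) -> list[str]:
    """Return every transitive upstream dependency of `name`, nearest first."""
    seen: list[str] = []
    queue = list(catalog[name].upstream)
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue
        seen.append(parent)
        # Sources that were never inventoried have no recorded parents.
        if parent in catalog:
            queue.extend(catalog[parent].upstream)
    return seen


# Hypothetical inventory: a raw source, a cleaned model, and a final data product.
catalog = {
    "raw_orders": CatalogEntry("raw_orders", "ingest-team", "restricted", True),
    "clean_orders": CatalogEntry(
        "clean_orders", "sales-eng", "internal", True, ["raw_orders"]
    ),
    "revenue_by_region": CatalogEntry(
        "revenue_by_region", "sales-eng", "internal", True, ["clean_orders"]
    ),
}

lineage = full_lineage(catalog, "revenue_by_region")
```

In this sketch a builder can ask which inventoried blocks feed `revenue_by_region` before changing any of them. That is the kind of question a catalog exists to answer.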
Data Marketplaces are for Consumers
Consumers thirst for valuable data products. If there were a simple way to find them and understand the value they provide, consumers would use them. This is what a data marketplace is for.
To be clear, some marketplaces are used to sell data externally. But by and large, almost all large organizations we work with implement marketplaces to deliver data products internally, and in turn drive their corporate missions and data-driven initiatives. Marketplaces may be termed data mesh or data exchanges, but are the same in principle – each is meant to facilitate the discovery and delivery of data products within your organization.
Data product delivery, or provisioning, is key to a marketplace. Seeing what data products exist but being unable to actually query them through your data platform or BI tool is, quite frankly, useless. This would be like finding the perfect app on the App Store, but having no way to download it – you’d have to file a ticket with Apple and wait a week. A true data marketplace supports approval workflows and will provision data access controls that apply direct grants to your data platform(s) at request fulfillment time, providing near-real-time access upon approval.
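The request, approval, and provisioning flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a real marketplace API: the `Marketplace` class, its method names, and the in-memory `grants` set (standing in for direct grants on a data platform) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class AccessRequest:
    user: str
    data_product: str
    status: Status = Status.PENDING


class Marketplace:
    """Toy marketplace. `grants` stands in for grants on a real data platform."""

    def __init__(self) -> None:
        self.grants: set[tuple[str, str]] = set()

    def request_access(self, user: str, data_product: str) -> AccessRequest:
        # A consumer asks for a data product; nothing is provisioned yet.
        return AccessRequest(user, data_product)

    def decide(self, request: AccessRequest, approved: bool) -> None:
        if request.status is not Status.PENDING:
            raise ValueError("request has already been decided")
        request.status = Status.APPROVED if approved else Status.DENIED
        if approved:
            # Provision at fulfillment time, so access follows approval directly.
            self.grants.add((request.user, request.data_product))

    def can_query(self, user: str, data_product: str) -> bool:
        return (user, data_product) in self.grants
```

The design point is that approval and provisioning happen in the same step. No separate ticket has to be filed once the approver says yes.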
Finally, it’s important to note that builders can also be consumers. Ideally, new data products can be created using upstream data products. This makes the new, downstream data products more powerful, reliable, and well understood. To do this, builders can use their data catalog to understand data products’ technical details, but request access to them through the marketplace, just like any other consumer.
Why a Catalog-Only Strategy Fails
In short, a catalog-only strategy fails because it inadequately tries to serve two use cases and two types of users, rather than focusing on sufficiently serving one use case and type of user. By making the gray stages from our earlier diagram green, you make the catalog useless to your consumers and builders.
This is not the fault of the catalog vendors – remember, they are simply giving you the framework and interfaces. It’s the builders’ responsibility to implement that framework to match your own data engineering strategies, metadata, and rules – which should not include marketplaces.
Why Does the Catalog-Only Strategy Fail the Builders?
If catalogs are meant for builders, why would adding consumer use cases impact them?
Let’s return to our App Store analogy. Imagine for a moment that the App Store didn’t just publish your final, complete app – it also exposed all the libraries, code, and data models behind your app. And in order to publish, you were required to explain each of those elements to your consumers – and the consumers could rate them.
This is the case when data catalogs are used as marketplaces – all of the intermediate data used to build the final data products are also exposed. Not to mention, data can come from different sources, so a single builder team may absorb the full responsibility of maintaining all the metadata in the catalog – no matter where the data came from. By default, that one team is perceived as accountable for the entire data catalog.
Certainly there are some catalogs that allow visibility permissioning around objects published to them. But it is still on the builders to manage those visibility controls, in addition to the access controls they must manage in the data platform.
Bottom line: using a data catalog as a marketplace places much more unwanted work and implicit responsibility on the builders. Essentially, every move they make is exposed to everyone – or they have to work hard to not make that the case. This drives data and analytical engineering teams – the builders – back to their individual data platforms to manage their work in secrecy, which in turn makes the catalog less useful. This phenomenon is also a driver behind data platforms such as Databricks and Snowflake adding catalog capabilities such as lineage, quality, and tagging.
Furthermore, what good is a data product if nobody can find it amongst the sea of inventory in the catalog? Going back to our analogy, discovery is a solved problem for apps because of the Apple- and Google-provided App Stores – but it is not a solved problem for data products without a marketplace. Therefore, builders become frustrated by being unable to deliver value to the business because they quite literally have no way to deliver it.
Why Does the Catalog-Only Strategy Fail the Consumers?
In order for the catalog to be useful to the builders, they need all of their data to exist there. This means data consumers are left searching and wondering what data is curated and “gold standard”-ready as a data product that they can consume, versus “bronze data” that they can’t trust.
The consumers must do an enormous amount of work self-documenting and rating tables in the catalog because:
- It’s too much work for the smaller builder teams to do since everything – even data they don’t own – is exposed, or
- The builders have already abandoned the catalog for the reasons mentioned above.
In general, since consumers such as analysts only see the end state of the data, they don’t have the full picture and can’t maintain a useful data catalog.
Even if only the gold data is exposed to consumers in the catalog, potentially as formal data products, data catalogs still have lots of bells and whistles. These are generally relevant to the builders, but not to the consumers – distracting focus from their core mission of finding useful data and getting access to it.
Data catalogs also do not actually provision access. They provide inventory and details about your data estate and processing, but they are not data access management tools. So it is unrealistic to expect your catalog to provision access for your consumers, and expecting it to will only frustrate them.
Relatedly, catalogs do not possess approval flows. If you are provisioning access based on data user requests, you need approval configurations and flows natively integrated into the marketplace. Without them, teams are driven to externalize not only the access provisioning, but the request and approval process as well. As a result, either the data consumers need to leave the catalog, or the catalog must have complex integrations with the request tool. In the end, this creates a convoluted architecture that is difficult to maintain, and difficult for the consumers (and approvers) to use.
Data Catalog + Data Marketplace: The Full Solution
Complementing your data catalog with a data marketplace offers you a more robust, performant solution. With this strategy, you have two purpose-built solutions for both use cases and both types of users, just like our App Store analogy. As you can see, the green parts of the lifecycle are catalog-specific, and the gray parts of the lifecycle are marketplace-specific.
In this structure, the builders are able to implement the catalog to meet their needs – but not more – without concern that all of their work is exposed. And since the builders no longer need to document everything in the data platform (including data they don’t own), they can focus on documenting what matters: the data products they own and publish. They are incentivized to do so, since the data products are how they drive value in the organization – a better measure of success than how well they maintain a generalist data catalog.
On the other side of the equation, the consumers are able to enter a clean, purpose-built marketplace interface to find and learn about data products, request access, and have that access automatically provisioned when approved through native approval flows.
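One way to picture the split between the two tools is a marketplace listing derived from the catalog inventory. The sketch below is hypothetical throughout, including the record fields, the `tier` and `published` flags, and the `marketplace_listing` function. It assumes builders mark which items are published data products.

```python
def marketplace_listing(inventory):
    """Keep only published gold data products, with consumer-facing fields."""
    return [
        {
            "name": item["name"],
            "description": item["description"],
            "owner": item["owner"],
        }
        for item in inventory
        if item["tier"] == "gold" and item["published"]
    ]


# Hypothetical catalog inventory: builders see all of it, consumers should not.
inventory = [
    {
        "name": "raw_orders",
        "tier": "bronze",
        "published": False,
        "owner": "ingest-team",
        "description": "Unvalidated order events",
        "lineage": [],
    },
    {
        "name": "revenue_draft",
        "tier": "gold",
        "published": False,  # built, but not yet released as a data product
        "owner": "sales-eng",
        "description": "Work-in-progress revenue model",
        "lineage": ["raw_orders"],
    },
    {
        "name": "revenue_by_region",
        "tier": "gold",
        "published": True,
        "owner": "sales-eng",
        "description": "Monthly revenue per sales region",
        "lineage": ["raw_orders"],
    },
]

listing = marketplace_listing(inventory)
```

Only `revenue_by_region` reaches consumers here, and builder-oriented detail such as `lineage` stays in the catalog.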
The governance users are able to thoroughly understand what data is being requested and decide whether to approve access, all natively in the marketplace, where the data products and user activity monitoring are also present.
Conclusion
Your data catalog is a powerful tool for your builders, but data consumers are becoming increasingly important and powerful within organizations – they need a tool that’s built with their needs in mind. Consider combining both a data catalog and a data marketplace that offers integrated workflows for automated data access provisioning with native approval flows. You’ll get the best of both worlds – control for builders, fast access for the consumers – and drive greater efficiency and innovation for the company.
Read more about how data marketplaces facilitate collaboration in this blog from analyst Sanjeev Mohan.