Data Mesh is a topic that has gained a lot of momentum in the past few weeks due to the launch of the Data Mesh Learning community. I first learned about Data Mesh when Prof. Frank Neven pointed me to Zhamak Dehghani’s article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.” Note that Gartner’s Mark Beyer introduced a similar idea under the same name in a Maverick research article back in 2016.
Data Mesh was music to my ears because it centers enterprise data management around people and process instead of technology. That is the main message of my talk “The Socio-Technical Phenomena of Data Integration and Knowledge Graphs” (short version at the Stanford Knowledge Graph course, long version at UCSD). I have also been part of, and know colleagues who have led, implementations of approaches that I would consider Data Mesh.
In this post, I want to share my point of view on data mesh. Note that these are my views and opinions and do not necessarily reflect the position of my employer, data.world.
The way we have been managing enterprise data basically hasn’t changed in the past 30 years. In my opinion, the fundamental problems of enterprise data management are:
1. We have defined success from a technical point of view: physically integrating data into a single location (warehouse/lake) is the end goal. In reality, success should also be defined from a social perspective: can the people who need to consume the data actually answer their business questions?
2. We do not treat data with the respect it deserves! We would never push software code to a master branch without comments, without tests, without peer review. But we do that all the time with data. Who is responsible for the data in your organization?
These problems have motivated my research and industry career. I was thrilled to discover Data Mesh and Zhamak Dehghani’s article because it clearly articulates a lot of the work that I have done before and has given me a lot of ideas to think about.
Data mesh is NOT a technology. It is a paradigm shift towards a distributed architecture that attempts to find an ideal balance between centralization and decentralization of metadata and data management.
Its success is highly dependent on the culture (people, processes) within an organization, and not just on the tools and technology, hence the paradigm shift. Data mesh is a step in changing the mindset of enterprise data management.
Important principles of data mesh
In my opinion, the two key principles of a data mesh are:
1. Treat data as a first-class citizen: data as a product
2. Just as in databases you always want to push down filters, etc., why not push the data back down to the domain experts and owners?
Organization/Social/Culture
Let’s talk about organization/social/culture first, and technology after. I have experienced the ideal balance between centralization and decentralization of metadata and data management as follows:
Centralize the core business model (concepts and attributes). For example, the core model of an e-commerce company is simple: Order, Order Line, Product, Customer, Address. An Order has an order date, currency, gross sales, net sales, etc. These core concepts and attributes should be defined by a centralized data council. They provide the definitions, including the schema definitions (firstname vs first_name vs fname). It is CORE, which means that they do not boil the ocean. In one past customer experience, when we started, the core model had 15 concepts and 40 attributes; three years later, it was at 120 concepts and 500 attributes. Every concept and attribute has a reason for existence.
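As a rough sketch (the table and column names below are hypothetical, chosen only for illustration), a slice of such a core model could be captured like this:

```sql
-- Hypothetical slice of a centralized core business model.
-- Names, types, and keys are illustrative, not prescriptive.
CREATE TABLE core_order (
    order_id    VARCHAR(36) PRIMARY KEY,   -- the identity of the Order concept
    order_date  DATE        NOT NULL,
    currency    CHAR(3)     NOT NULL,      -- e.g. an ISO 4217 code
    gross_sales DECIMAL(12, 2),
    net_sales   DECIMAL(12, 2)             -- gross sales minus taxes minus discounts
);

CREATE TABLE core_order_line (
    order_id    VARCHAR(36) REFERENCES core_order (order_id),
    line_number INTEGER,
    product_id  VARCHAR(36),
    quantity    INTEGER,
    PRIMARY KEY (order_id, line_number)
);
```

The point is not the physical schema; the point is that there is a single, agreed-upon vocabulary (Order, net sales, first_name) that everything else maps to.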
Decentralize the mapping of the application data to the core business model. This needs to be done by the owners of the data and applications because they are the ones who understand best what that data means.
“But everyone is going to have a different way of defining Customer!” Yes, and guess what… that’s fine! We need to be comfortable with, and even encourage, this friction. People will start to complain, and this is exactly how we know what the differences are. This friction will eventually bubble up to the centralized data council, who can help sort things out. A typical result is that new concepts get added to the core business model. This friction helps prioritize the importance and usage of the data. Document this friction.
“My concept doesn’t exist in the core business model!” That’s fine! Create a new one. Document it. If people start using it and/or if there is friction, the centralized data council will find out and it will eventually become part of the core business model.
If you are an application/data owner and you don’t use the core business model, you are not being a good data citizen. People will complain. Therefore, you will be incentivized to make use of the core business model, and to extend it when necessary, as sketched below.
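For instance, here is a sketch (again with hypothetical names) of a domain creating a concept that does not exist in the core model, while still reusing the core Order identity so the link back is never lost:

```sql
-- Hypothetical domain extension owned by the marketing team.
-- The concept is new, but it reuses the core Order identity, so the
-- data council can discover it and promote it into the core model later.
CREATE TABLE marketing_campaign_order (
    campaign_id VARCHAR(36),
    order_id    VARCHAR(36) REFERENCES core_order (order_id),
    attribution VARCHAR(50)   -- e.g. 'email', 'paid_search'
);
```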
People are at the center of the data mesh paradigm
We must have Data Product Managers. Just like we have product managers in software, we need to have product managers for data. Data Consumer to Data Product Manager: “is your data fit for my purpose?”. Data Product Manager to Data Consumer: “is your purpose fit for my data?”
We must have Knowledge Scientists: data scientists complain that they spend 80% of their time cleaning data. That is true, and in reality it is crucial knowledge work that needs to be done: understanding what is actually meant by “net sales of an order” and how that is physically calculated in the database. For example, the business core model states that the concept Order has an attribute netsales, which is defined by the business as gross sales minus taxes minus discounts. However, there is no netsales attribute in the application database. The mapping could be defined as a SQL query: SELECT ordered, sales - tax - discount AS netsales FROM order
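To make this concrete, here is a minimal sketch of how that mapping could be published as a reusable view, i.e., the data product that consumers query instead of the raw application tables (the application table and column names are hypothetical):

```sql
-- Hypothetical: the knowledge scientist's mapping, published as a view.
-- Consumers see the core vocabulary (netsales), not the application columns.
CREATE VIEW order_netsales AS
SELECT
    order_id,
    order_date,
    sales - tax - discount AS netsales   -- the business definition of net sales
FROM app_order;

-- A consumer can now answer a business question using core terms only:
SELECT SUM(netsales) AS total_netsales
FROM order_netsales
WHERE order_date >= DATE '2021-01-01';
```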
For largish organizations, business units will have their own data product managers and knowledge scientists.
Technology is part of the data mesh too
A data catalog is key to understanding what data exists and to documenting the friction. (Disclaimer: my employer is data.world, a data catalog vendor.) Data catalogs are needed to catalog the as-is, application-centric databases, the database systems that consist of thousands of tables and columns, in order to understand what exists. This is used by the data engineers and knowledge scientists to do their job of creating the data products. Data catalogs are also used to enable the discovery and (re-)use of data products: each domain will create data products to satisfy business requirements, and these data products need to be discovered by other folks in the organization. Therefore, the data catalog is also used to catalog the data products themselves so others can find them.
Technologies such as data virtualization and data federation can be used to create the clean data views which can then be consumed by others. Hence, they can be used as implementations for a data mesh.
Knowledge Graphs are a modern manifestation of integrating knowledge and data at scale in the form of a graph, and are therefore well suited to support the data mesh paradigm. Furthermore, the RDF graph stack is ideal because data and schema are both first-class citizens in the model: the core business models can be implemented as ontologies using OWL, they can be mapped to databases using the R2RML mapping language, and the data can be queried with SPARQL in a centralized or federated manner. Even though the data is in the RDF graph model, it can be serialized as JSON. There is no schema language for property graphs. (Disclaimer: my academic roots are in the semantic web and knowledge graph community, I have been part of W3C standards, and I am currently the chair of the Property Graph Schema Working Group.)
Core identities should probably be maintained by the centralized data council, which can offer an entity resolution service. Another advantage of using the RDF graph stack is that universal identifiers are part of the data model itself, in the form of URIs.
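Outside of RDF, one possible (hypothetical) shape for such an entity resolution service is a crosswalk maintained by the data council, mapping every application-local identifier to a single core identity:

```sql
-- Hypothetical identity crosswalk maintained by the data council.
-- Each local, application-specific id resolves to one core identity
-- (which, in the RDF stack, would simply be a URI).
CREATE TABLE core_customer_identity (
    core_customer_id  VARCHAR(36),
    source_system     VARCHAR(50),    -- e.g. 'crm', 'billing'
    local_customer_id VARCHAR(100),
    PRIMARY KEY (source_system, local_customer_id)
);
```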
The linkage between the core business models and the application models consists of source-to-target mappings, which can be seen as transformations represented in a declarative language such as SQL and managed with tools such as dbt. Another advantage of using RDF knowledge graphs is that there are standards to implement this: OWL to represent the core business models, and R2RML to represent the mappings from the application database models to the core business models.
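On the SQL/dbt route, a sketch of what such a mapping might look like as a dbt model (the source name and columns are hypothetical):

```sql
-- models/core/order.sql  (hypothetical dbt model)
-- dbt resolves {{ source(...) }} to the physical application table, so the
-- source-to-target mapping lives in version control and can be tested.
SELECT
    order_id,
    order_date,
    currency,
    sales                  AS gross_sales,
    sales - tax - discount AS netsales
FROM {{ source('ecommerce_app', 'orders') }}
```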
There are existing standard vocabularies, such as the W3C Data Catalog Vocabulary (DCAT), that can (and should) be used to represent metadata.
Another approach that is very aligned with Data Mesh is the Data Centric Architecture.
Final Words
Data Mesh is about being resilient; efficiency comes later. It is disruptive, due to the many changes involved, so it will probably be inefficient in the beginning and will enable just a few teams and use cases. But it will be the starting point of the data snowball.
The push needs to come from the top, at the executive level. This is aspirational; there is no short-term ROI.
We also need to encourage the bottom-up. Data mesh is a way for each business unit to be autonomous and not be bottlenecked by IT.
The power of the data mesh is that everyone governs for their own use case, and the important use cases get leveled up so they can be consumed by higher-level business use cases without over-engineering.
Finally, check out our Catalog and Cocktails podcast episode on Data Mesh (and the takeaways).
Want to learn more about Data Mesh? Barr Moses wrote a Data Mesh 101 with pointers to many other articles. (Barr Moses will be a guest on Catalog and Cocktails in a few weeks!)