As always, it is a tremendous honor to be invited to a Dagstuhl seminar. Last week, I attended the seminar on “Big Graph Processing Systems”.
During the first day, every participant had five minutes to present where they were coming from and what their research interests are. There was an interesting mix of large-scale processing and graph database systems researchers, with a handful of theoreticians. My goal was to push for the need to get users involved in the data integration process, and I believe I accomplished that goal.
The organizers pre-selected three areas to ground the discussions:
Abstraction: While imperative programming models, such as vertex-centric or edge-centric programming models, are popular, they lack a high-level exposition for the end user (a minimal sketch of the vertex-centric style follows this list). To increase the power of graph processing systems and foster the use of graph analytics in applications, we need to design high-level graph processing abstractions. It is currently completely open what future declarative graph processing abstractions could look like.
Ecosystems: In modern setups, graph-processing is not a self-sustained, independent activity, but rather part of a larger big-data processing ecosystem with many system alternatives and possible design decisions. We need a clear understanding of the impact and the trade-offs of the various decisions in order to effectively guide the developers of big graph processing applications.
Performance: Traditional measures of performance and scalability, e.g., FLOPS, throughput, or speedup, are difficult to apply to graph processing, especially since performance depends non-trivially on the platform, the algorithm, and the dataset. Moreover, running graph-processing workloads in the cloud raises additional challenges. Identifying such performance-related issues is key to designing and building widely recognized benchmarks for graph processing.
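To make the contrast behind the Abstraction point concrete, here is a minimal, system-agnostic sketch of the vertex-centric style. It is not the API of any particular engine; the graph, function name, and message-passing loop are purely illustrative. In each superstep a vertex receives messages from its neighbors, updates its local state, and sends new messages; here this computes connected components by propagating the minimum vertex id.

```python
# A minimal sketch of vertex-centric ("think like a vertex") computation.
# Not any particular system's API; purely illustrative.
# Computes connected components of an undirected graph (edges listed in both
# directions) by propagating the minimum vertex id, one superstep at a time.

def connected_components(vertices, edges):
    label = {v: v for v in vertices}          # every vertex starts labeled by itself

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in vertices}
    for v in vertices:
        for n in edges.get(v, []):
            inbox[n].append(label[v])

    changed = True
    while changed:                            # each loop iteration is one superstep
        changed = False
        next_inbox = {v: [] for v in vertices}
        for v in vertices:
            if not inbox[v]:
                continue                      # vertex is inactive: no messages
            new_label = min(inbox[v])
            if new_label < label[v]:          # state changed: notify neighbors
                label[v] = new_label
                changed = True
                for n in edges.get(v, []):
                    next_inbox[n].append(new_label)
        inbox = next_inbox
    return label

# Two components: {1, 2, 3} and {4, 5}
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(connected_components([1, 2, 3, 4, 5], edges))   # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

The point is the shape of the program: the developer writes imperative, per-vertex logic and a message schedule, rather than declaring the desired result and letting the system choose an execution strategy.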
I participated in the Abstractions group because it touches more on topics of interest to me, such as graph data models, schemas, etc. Thus, this report only covers the discussions I had in that group.
Setting the Stage
During a late-night wine conversation with Marcelo Arenas (wine and late-night conversations are a crucial aspect of Dagstuhl), we talked about the two kinds of truth:
“An ordinary truth is a statement whose opposite is a falsehood. A profound truth is a statement whose opposite is also a profound truth” – Niels Bohr
If we apply this to a vision, we can consider an ordinary vision and a profound vision.
An example of an ordinary vision: we need to make faster graph processing systems. This is ordinary because the opposite is false: we would not want to design slower graph processing systems.
With this framework in mind, we should be thinking about profound visions.
Graph Abstractions
There seems to be an understanding, and even an agreement in the room, that graphs are a natural way of representing data. The question is WHY?
Let’s start with a few observations:
Observation 1: there have been numerous types of data models and corresponding query languages. Handwaving a bit, we can group these into tabular, graph, and tree, with many different flavors of each.
Observation 2: What goes around comes around. We have seen many data models come and go several times in the past 50 years. See the Survey of Graph Database Models by Renzo Angles and Claudio Gutierrez and even our manuscript on the History of Knowledge Graphs.
So, why do we keep inventing new data models?
Two threads came out of our discussions:
1) Understand the relationship between data models
Over time, there have been manifold data models. Even though the relational model remains the strongest, graph data models are increasingly popular, specifically RDF Graphs and Property Graphs. And who knows, tomorrow we may have new data models that gain traction. With all of these data models, it is paramount to understand how they relate to each other.
We have already seen approaches that study how these data models relate to each other. During the 90s, there was a vast amount of work on connecting XML (a tree data model) with the relational data model. The work we did on mapping relational data to RDF graphs led to the foundation of the W3C RDB2RDF Direct Mapping standard. And the work of Olaf Hartig on RDF* connects RDF Graphs with Property Graphs.
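To give a flavor of what such a mapping does, below is a toy sketch in the spirit of a direct mapping from a relational table to an RDF-like set of triples. It is not the W3C RDB2RDF Direct Mapping specification itself, and the base IRI, table, and data are made up: each row becomes a subject, each column a predicate, and each non-null cell value an object.

```python
# A toy sketch in the spirit of a table-to-graph direct mapping (NOT the W3C
# RDB2RDF Direct Mapping spec; the base IRI, table, and data are made up).
# Each row becomes a subject, each column a predicate, each non-null cell an object.

BASE = "http://example.org/"

def direct_map(table_name, primary_key, rows):
    """rows: list of dicts (column name -> value). Returns (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = f"{BASE}{table_name}/{primary_key}={row[primary_key]}"
        for column, value in row.items():
            if value is None:          # skip NULLs, as a table-to-graph mapping typically would
                continue
            predicate = f"{BASE}{table_name}#{column}"
            triples.append((subject, predicate, value))
    return triples

people = [
    {"id": 1, "name": "Ada", "advisor": 2},
    {"id": 2, "name": "Grace", "advisor": None},
]
for triple in direct_map("people", "id", people):
    print(triple)
```

Even a toy like this immediately raises the questions below: is the mapping invertible, does it preserve queries, and can it be composed with a further mapping, say from these triples to a property graph?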
These approaches all have the same intent: understand the relationship between data model A and data model B. However, all of these independent approaches remain disconnected.
The question is: what is a principled approach to understand the relationship between different data models?
Many questions come to mind:
- How do we create mappings between different data models?
- Or should we create a dragon data model that rules them all, such that all data models can be mapped to the dragon data model? If so, what are all the abstract features that a data model should support?
- What is the formalism to represent mappings? Logic? Algebra? Category Theory?
- What properties should mappings have? Information, query, and semantics preservation; composability; etc.
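As one concrete reading of “information preserving”: a round-trip condition, where mapping a source instance into the target model and back yields the original instance. The sketch below uses hypothetical placeholder mappings (forward, backward), not any real standard, just to make the property testable on an example.

```python
# A minimal sketch of "information preserving" as a round-trip property:
# backward(forward(instance)) == instance. The forward/backward mappings here
# are hypothetical placeholders, not any standardized mapping.

def forward(table_rows):
    """Hypothetical mapping: list of row-dicts -> set of (subject, predicate, object) triples."""
    return {
        (("row", i), column, value)
        for i, row in enumerate(table_rows)
        for column, value in row.items()
    }

def backward(triples):
    """Hypothetical inverse mapping: triples -> list of row-dicts."""
    rows = {}
    for (_, i), column, value in triples:
        rows.setdefault(i, {})[column] = value
    return [rows[i] for i in sorted(rows)]

def information_preserving(instance):
    """Round-trip check for one concrete instance."""
    return backward(forward(instance)) == instance

print(information_preserving([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]))  # True
```

Query preservation and composability could be phrased in a similar spirit: every query over the source has an equivalent query over the target, and the composition of two such mappings is again a mapping with the same properties.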
2) Understand the relationship between data models, query languages, and users
It is our understanding (“our feeling”) that a graph data model should be the ultimate data model for data integration.
Why?
Because graphs bridge the conceptualization gap between how end users think about data and how data is physically stored. Over and over again we kept saying that “graphs are a natural way of representing data”.
But, what does natural even mean? Natural to whom? For what?
Our hypothesis is that the lack of understanding between data and users is the reason why we keep inventing new data models and query languages. We really need to understand the relationship between data models, query languages, and users. We need to understand how users perceive the way data is modeled and represented. We need to work with scientists and experts from other communities to design methodologies, experiments, and user studies. We also need to work with users from different fields (data journalists, political scientists, life scientists, etc.) to understand users’ intents.
Bottom line: we need to realize that user studies are important, and we need to work with the right people.
This trip report barely scratches the surface. There were so many other discussions that I wish I had been part of. We are all working on a vision paper that will be published as a group. We are expecting to have a public draft by March 2020.
Overall, this was a fantastic week and the organizers did a phenomenal job.
Absolutely love this paragraph!
“Our hypothesis is that the lack of understanding between data and users is the reason why we keep inventing new data models and query languages. We really need to understand the relationship between data models, query languages, and users. We need to understand how users perceive the way data is modeled and represented. We need to work with scientists and experts from other communities to design methodologies, experiments, and user studies. We also need to work with users from different fields (data journalists, political scientists, life scientists, etc.) to understand users’ intents.”
It would be best to also take into account the hardware constraints, which will shape the data model at some level of the data tower. The frontend is still up for grabs, as there are many practitioners with various skills.
I still think that there is no way around needing more skilled knowledge workers.