Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century Dagstuhl Trip Report

It was an incredible honor to be invited to the Dagstuhl Seminar on Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century, organized by Paul Groth, Elena Simperl, Marieke van Erp, and Denny Vrandecic. This was my second visit to Dagstuhl this year!

I was thrilled that we started the seminar talking about the history of knowledge engineering, a topic very dear to my heart. 

What is Knowledge Engineering? Bradley Allen, one of the very first knowledge engineers of the expert systems era, defined it in a crisp and succinct manner: the practice of building processes that produce knowledge. He also reminded us of Ed Feigenbaum’s definition: the applied side of AI. Brad also touched on why expert systems failed, which can be answered from two perspectives: 1) they could not be commercialized because the 1980s were a mainframe era and expert systems were built on their own dedicated machines (Prolog, Lisp); 2) they didn’t fail at all, because their techniques simply became common practice. The main issue was a lack of developer buy-in.

Deborah McGuinness also gave one of the historical talks and reminded everyone that access to Knowledge Engineering has always had a very high bar: complicated software that required training and was expensive. The creators of the tools never met with the developers, and the field focused too much on formalism and tractability without practicality.

The main takeaways I had revolved around two topics: 1) Users and Methodologies, and 2) Language Models.

Users and Methodologies

Elena Simperl kicked off the seminar by challenging us to think about what an upgraded Knowledge Engineering reference book would look like. For example, what would the NeOn book or the CommonKADS book look like today?

Elena also presented the notion of user-centric knowledge engineering with a set of questions to consider:

– Who are the users?
– What are the users’ tasks and goals?
– How does the user interact with the knowledge graph?
– What are the users’ experience levels with it, or with similar environments?
– What functions do the users need?
– What information might the users need, and in what form do they need it?
– How do users think knowledge engineering tools should work?
– Is the user multitasking?
– Are they working on a mobile phone, a desktop computer, etc.?
– Does the interface utilize different input modes, such as touch, speech, gestures or orientation?
– How can we support multi-disciplinary teams? How can we support remote work, decision making, conflict resolution?

What thrills me the most is that knowledge engineering truly bridges the social and technical aspects of computer science. This gets us outside our comfort zone because we need to consider doing case studies, literature reviews, and user studies to:

– understand personas, scenarios, use cases, tasks, emerging practices
– define process blueprints, design patterns, requirements for tool support

In a breakout group with Sören Auer, Eva Blomqvist, Deborah McGuinness, Valentina Presutti, Marta Sabou, Stefan Schlobach, and Steffen Staab, we started out by sharing existing knowledge engineering methodologies, discussing our experiences, and devising what methodologies should look like today.

My position is that knowledge graphs and the investment in Knowledge Engineering are there to address the known use cases of today and the unknown use cases of tomorrow. Past methodologies focus just on the knowledge/ontology side and not on the data side, namely how to map data to knowledge. Furthermore, I believe there are two distinct scenarios to consider: the enterprise scenario, where the goal is to answer fairly specific questions, and the common-sense scenario, where the goal is exploratory search to answer questions on generic topics (this is the Amazon, Google world). Additionally, understanding the producer and consumer personas is crucial, because at the end of the day, the consumers want data products. Figuring out how to devise methodologies in a distributed and decentralized world is going to be paramount. Schema.org is concrete evidence that it can happen; we just need to figure out how to embrace the complexity. Finally, the data world needs more data modeling training!

The themes that came up through our discussions were data creation, data integration, ontology engineering, systems development, requirements engineering, and the business value case, where each of these segments needs to be synced and evaluated.

We unanimously agreed that there is a need for a synthesis of knowledge engineering methodologies, mapping them to the tools, settings, and requirements of today, with the goal of defining upgraded knowledge engineering methodologies that can serve as general-purpose education. Furthermore, we should start by editing a Knowledge Engineering book similar to the Description Logic Handbook, which compiles the state of the art such that it can be used in knowledge engineering courses today. The ultimate goal is a textbook that could be used for a bachelor’s course. Ambitious, but we need to think big!

I appreciated a final reminder from Steffen: a methodology is like a cooking recipe; you don’t need to follow it exactly as-is.

Language Models

Language Models, such as BERT and GPT-3, have shown impressive results in language generation and question answering. Lise Stork gave an overview of the state of the art on Automated Knowledge Graph Construction, surveying automated methods to construct knowledge graphs from unstructured sources. This immediately raised the questions: what about 1) automatic knowledge graph construction from structured data (e.g., tabular, relational), namely, given structured data as input, the output is a knowledge graph, and 2) automatically mapping structured data to a knowledge graph, namely, given structured data and an existing knowledge graph as input, the output is an augmented knowledge graph?

This second stream of work involves the traditional data integration challenges of schema matching and entity linking. Language models have common sense knowledge; however, to do the mapping, they would also need specific business/domain knowledge, which may not exist in language models today.

During our breakout session, Sören started to experiment with the OpenAI Playground (https://beta.openai.com/playground), and we gathered anecdotal evidence that language models can perform some form of data mapping on typical textbook examples, but they quickly fail when data structures are more enterprise-y (and thus less likely to be included in the language model training data).
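For readers who want to try this themselves, here is a rough sketch of the kind of experiment we ran, in Python rather than the Playground UI. The column and property names are invented, and the call assumes the legacy OpenAI Completions API and the text-davinci-002 model that were current at the time; treat it as an illustration, not a recipe.

```python
# A minimal sketch of the breakout experiment: ask a language model to map
# columns of an enterprise-looking table to a target schema. Column and
# property names are made up for illustration, and the call assumes the
# legacy OpenAI Completions API (text-davinci-002) that the Playground
# exposed at the time; adapt to whatever model/SDK you use today.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

SOURCE_COLUMNS = ["CUST_NM", "ADDR_LN_1", "ORD_TOT_AMT", "CRNCY_CD"]          # hypothetical
TARGET_PROPERTIES = ["customerName", "streetAddress", "orderTotal", "currency"]  # hypothetical

prompt = (
    "Map each source column to the most likely target property.\n"
    f"Source columns: {', '.join(SOURCE_COLUMNS)}\n"
    f"Target properties: {', '.join(TARGET_PROPERTIES)}\n"
    "Answer as 'source -> target', one pair per line:\n"
)

response = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=100,
    temperature=0,  # keep the output deterministic-ish for a mapping task
)
print(response["choices"][0]["text"].strip())
```

In our anecdotal experience, prompts with textbook-style column names worked reasonably well, while cryptic enterprise abbreviations like the ones above are exactly where the models started to struggle.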

This is an area ripe for research. The SemTab challenge deals with the second stream of work, but the participants up to now do not use language models (correction: most do not, but DAGOBAH, the leading system at SemTab 2021, mixes language models with heuristics to get the proper interpretation; see the comment below by Raphael Troncy). I’m very eager to follow the upcoming Table Representation Learning workshop at NeurIPS.

We should be careful not to just jump on the language model bandwagon and start pounding on that hammer to see what works/sticks. Having said that, the notion of machine behavior is an interesting one, because we should study the behavior of these AI systems and understand why something works/sticks. This may be the opportunity to delve into other areas of research design, such as single-subject design, which I have recently become interested in.

Additional Observations

Manually authored knowledge from subject matter experts is precious. Therefore we need automatic generation of knowledge graphs at scale. However, human curation of automatically generated knowledge is needed for trust. Thus people provide precious knowledge and trust, while machines provide scale.

Knowledge Engineering “is mostly, or even all, about the processes and ecosystems” – Deborah McGuinness

Knowledge acquisition and maintenance is expensive! Need to herd cats across all departments. Initial developer buy-in can be hard to achieve, leading to less than enthusiastic support. 

The FAANG knowledge problems focus on general and common sense knowledge and are interesting and challenging. On the other hand, enterprise problems are much more specific. The issue is that academia is mostly exposed to general/common sense problems through efforts such as DBpedia and Wikidata, and does not have access to concrete enterprise problems. How can we get academia exposed to enterprise problems? By releasing sample schemas, data, queries, etc.?

The topic of bias was presented by Harald Sack. The following types of biases were discussed: 1) data bias, arising from how the data for the knowledge graph is collected or simply from the data that is available; 2) schema bias, arising from the chosen ontology or embedded in ontologies themselves; and 3) inferential bias, arising from drawing inferences. Furthermore, biases in knowledge graph embeddings may also arise from the embedding method. I heard from the bias breakout group that their takeaway is that they don’t know what they don’t know. There is definitely a need for more socio-technical bridging and for working with people who know what they are talking about.

Additional Questions

Bradley posed the question: how can we convince others that knowledge engineering is mainstream software engineering? What is the narrative to convince other communities why they should care? It’s all about methodologies, and it should be tied to current processes. For example, we already define Product Requirement Documents (PRDs) in software, which contain knowledge and requirements about the software. We should take this as an inspiration.

Knowledge Engineering can be very expensive. How can we reduce this cost?

What kind of Knowledge Engineering methodologies and processes (in addition to tooling and training) are needed? 

The seminal Knowledge Engineering paper by Studer et al. extended Gruber’s definition of ontology to “a formal, explicit specification of a shared conceptualisation.” I ask myself, what does “shared” mean today?

How do we let humans efficiently check a large amount of data before a product launch? This is where metadata plays a key role. How good is the data, and what does “good” mean? Do we know where the data comes from? Do we know how to audit our data to make it less biased? Do we know how the data came about? Do we know how the data is used? We can write rules that discover inconsistencies and incompleteness and that flag anomalies. But how would we classify feedback from end users? How is the feedback channeled? These are questions that are being addressed by the data catalog market, so academia can and should learn from the state of the art, be critical, see what’s missing, and devise opportunities. For example, is there, or shall I say, where is the bias in a metadata knowledge graph? If metadata is being reported from only a subset of systems, then that is reporting bias. If recommendations are made, they may be biased because metadata from other systems is not cataloged. What level of metadata granularity should be captured, and what type of bias would that introduce?

How can we be overly inclusive with knowledge in order to get more folks “on our side”? Ontologies can be defined and stored in multiple forms. Even spreadsheets, that’s inclusive!

What can knowledge graphs capture? What can’t they capture? How do they represent what is changing in the world vs. what is static? It seems like we are going back to the traditional discussions of finding the balance of expressiveness in Knowledge Representation and Reasoning languages (i.e., description logics!), but dynamicity (a fast-paced world) is the phenomenon of today.

Final Thought

The title of this seminar was “Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century” and, surprisingly, there was little emphasis on graphs. This is a good thing because the focus was KNOWLEDGE! Paul Groth suggested that we go back to the terms Knowledge Base and Knowledge-Based Systems.

In our Communications of the ACM Knowledge Graph article, Claudio Gutierrez and I wrote:

If we were to summarize in one paragraph the essence of the developments of the half century we have presented, it would be the following: Data was traditionally considered a commodity, moreover, a material commodity—something given, with no semantics per se, tied to formats, bits, matter. Knowledge traditionally was conceived as the paradigmatic “immaterial” object, living only in people’s minds and language. We have tried to show that since the second half of the 20th century, the destinies of data and knowledge became bound together by computing.

(Claudio and I stayed at Dagstuhl a few years ago to start writing that paper)

Today, our quest is to combine knowledge and data at scale. I would argue that the Semantic Web community has focused on integrating Data in the form of a graph, namely RDF, and Knowledge in the form of ontologies, namely OWL. However, the future of Data and Knowledge should consider all types of data and knowledge: tables, graphs, documents, text, ontologies, rules, embeddings, language models, etc. We are definitely heading into uncharted territory. And that is why it’s called science!

Thanks again to all the organizers for bringing us together and special thanks to Dagstuhl for being the magical place where amazing things happen! Can’t wait to be invited to come back. 

Bringing Graph Databases and Network Visualization Together Dagstuhl Seminar Trip Report

I had the honor to organize my first Dagstuhl Seminar, on Bringing Graph Databases and Network Visualization Together. I’ve been lucky to have the opportunity to attend multiple times, but this one was particularly important to me because I was an organizer.

Let me start out by saying that I’ve always been skeptical about the value of graph and network visualization. That skepticism is what motivated me to organize a seminar to bring these two communities together. I was very lucky to have met Hsiang-Yun Wu and Da Yan at a previous Dagstuhl Seminar on Graph Databases and then be introduced to Karsten Klein in order to organize this seminar. 

My main takeaways:

– Not surprisingly, there is a sizable gap between the Graph Database and Network Visualization communities.
– There is low-hanging fruit for bridging the gap, both from an academic and a practical point of view.
– Biggest open problem: cool vs. useful.

Network Visualization layouts focus on the graph structure

Graph layouts focus on the structure of the graph. The layouts are evaluated based on aesthetics such as minimizing edge crossings, edge length, etc. The AHA moment I had is when I learned that layout algorithms consider the input graph to be undirected and unlabeled. Thus, graph layouts focus on the structure (i.e., syntax) of the graph and not the meaning (i.e., semantics). This may explain why many knowledge graphs look unusable when they are visualized with default layouts such as force-directed layouts: knowledge graphs have … knowledge, represented as directions on the edges and labels on the nodes! This realization was a huge AHA moment for everyone.
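To make this concrete, here is a small sketch (assuming networkx and its force-directed spring_layout) showing that the positions a layout computes depend only on which nodes are connected; the labels that carry the knowledge never enter the computation.

```python
# A small illustration of the point above using networkx's spring_layout
# (a classic force-directed layout): the layout only sees connectivity,
# so stripping all the node/edge labels (the "knowledge") leaves every
# position exactly where it was.
import networkx as nx

G = nx.DiGraph()
G.add_edge("Tim_Berners-Lee", "WWW", label="invented")
G.add_edge("WWW", "Internet", label="runsOn")
G.add_edge("Tim_Berners-Lee", "W3C", label="founded")

pos_labeled = nx.spring_layout(G, seed=42)   # layout of the labeled graph

# Same structure, all edge labels dropped.
H = nx.DiGraph()
H.add_edges_from(G.edges())
pos_bare = nx.spring_layout(H, seed=42)

same = all((pos_labeled[n] == pos_bare[n]).all() for n in G)
print("identical positions despite dropping the labels:", same)
```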

Visualizing “large” graphs

Fabrizio Montecchiani gave an overview of visualizing large graphs. Network visualization research focuses on scalable layout algorithms for “large” graphs. I put large in quotes on purpose because large is hard to define. 1 million nodes? 10 million nodes? For me those are small graphs. For others, they are large. Current research on scalable graph layout focuses on implementations based on GPUs, Parallel and distributed algorithms, and big data frameworks. Some examples: 

– Force-directed layout on an Intel Xeon CPU X5650 @ 2.67 GHz and an nVidia GF100 [Quadro 5000]: 10M vertices and 20M edges took about 36 seconds per iteration (but the number of iterations might be very large!). See Yunis et al. Scalable Force Directed Graph Layout Algorithms Using Fast Multipole Methods. ISPDC 2012
– Multilevel algorithm based on maxent-stress optimization. Parallel implementation based on OpenMP (shared memory). Workstation: Octa-Core Intel Xeon E5-4640 processors (32 cores, 64 threads) @ 2.4 GHz. 1M vertices and 3M edges took about 27 seconds. See Meyerhenke et al. Drawing Large Graphs by Multilevel Maxent-Stress Optimization. IEEE Trans. Vis. Comput. Graph. (2018)
– Implementation based on Apache Giraph https://multigila.graphdrawing.cloud/

These are surveys to consider:
– Hu and Shi. Visualizing large graphs. WIREs Comput Stat, 7: 115-136 (2015)
– von Landesberger et al. Visual Analysis of Large Graphs: State-of-the-Art and Future Research Challenges. Comput. Graph. Forum 30(6): 1719-1749 (2011)

Another AHA moment we had is realizing that the network visualization community expects graphs to mainly be representing (large) data instead of (small) metadata or schemas. Different sizes, different use cases, different users, different problems. 

Who are the users?

Tatiana von Landesberger gave a presentation on qualitative evaluation. My personal realization is that we, the graph database community, are missing out on the possibility of learning so much more because we do not do qualitative evaluations. We focus on building systems and on quantitative evaluations while ignoring the end users. The opportunity is that we are exposed to many types of users. The big AHA moment is that the network visualization community considers that there is only one user: the subject matter expert who is the final user making a decision about an analysis (e.g., a doctor, journalist, or scientist). From a database perspective, there are many types of users: data engineers, data analysts, database developers, ontologists, taxonomists, knowledge scientists, data stewards, analytics engineers, and of course the final “business user”. Explaining the entire flow of data, how it eventually gets integrated into a (graph) database, and how it involves a variety of personas was an AHA moment for the network visualization folks.

So many layouts!

It was great that the CTO of yWorks, Sebastian Müller, attended the seminar. yWorks is a spin-off of the University of Tübingen founded in 2001. They attend most of the graph drawing and network visualization conferences, learn about the latest graph layouts, and implement them. This is why they have a library of hundreds of layouts. I have to say, this is very overwhelming. How do I know which layout to use? I sat down with Sebastian and told him about a specific problem, task, and user, and he immediately suggested a couple of layouts to consider. Thank you Sebastian… but this isn’t scalable!

Cool vs Useful

A common phrase is “wow, this is cool… but how useful is it?” It seems like the community is just starting to scratch the surface on this topic. Stephen Kobourov shared the thought that we should look at this as phases of usefulness. A graph visualization with a cool WOW factor should drive engagement, which should then lead to serendipitous discovery and ultimately to accomplishing a specific task. For example, a graph visualization can provide context, which drives trust. Catia Pesquita described how a subject matter expert confirming the results of an ontology matching system could benefit from visualizing parts of the ontology in order to get context. Visualizing that two concepts match is not useful by itself. However, visualizing the surrounding concepts can provide important context to confirm whether that match is trustworthy or not.

Visualization and Querying

Walter Didimo, Beppe Liotta and Fabrizio Montecchiani presented work on Visual Graph Query and Analysis for Tax Evasion Discovery, done in conjunction with the Italian Revenue Agency. I really like how this research group is working on applying network visualization research. Other aspects discussed included visualizing results, specifically the how/why provenance of a result and the paths that may be returned in a result.

Opportunities

The big open problem is defining what is useful. This is definitely a socio-technical phenomenon and there is a lot of work to be done.

A way to start tackling usefulness is by refocusing on the different types of users and the variety of tasks they may have. For example, we started to outline all the roles and the possible interactions between each of those roles. Each interaction has a specific task which may or may not be supported by a visualization. This is how we were trying to break down the problem to smaller pieces. 

By understanding the different roles, interactions and tasks, these could then be associated with existing graph layouts and metaphors. It would be great to have this taxonomy of graph layouts, connected to use cases, roles, tasks, etc. 

Final Words

I am extremely lucky to have had the opportunity to spend time with some of the brightest minds in Graph Databases and Network Visualization. There were 7 people from Graph Databases and 15 people from Network Visualization, all of us in person except for Da Yan. On behalf of the organizers, thank you to each and every one of you for coming in person to this seminar during these hard times: Michael Behrisch, Walter Didimo, Nadezhda Doncheva, Henry Ehlers, George Fletcher, Carsten Görg, Katja Hose, Pavel Klinov, Stephen Kobourov, Oliver Kohlbacher, Beppe Liotta, Fabrizio Montecchiani, Sebastian Müller, Catia Pesquita, Falk Schreiber, Hannes Voigt, Tatiana von Landesberger (https://visva.cs.uni-koeln.de/), Markus Wallinger

This is a strong reminder that science is a social process and in-person meetings are extremely valuable. 

It truly is amazing how a diverse group of people can get together and very quickly start working together. At one point, all of us were collaboratively working on a Google doc defining the outline of a vision paper we plan to publish together. We also had a lot of bonding time, and one evening we each gave short talks about random, non-work topics while enjoying beer and wine: Airplane accidents, Music is NOT a universal language, The other pandemic: Bananas, Dead Languages in Japan, History of money and debt, Oldest bank in the world, Portuguese Nuns and Eggs, Everesting, and Underwater rugby.

Dagstuhl was also phenomenal in how they managed Covid. We all had to present negative Covid tests before arriving, and we got tested on Monday, Wednesday and Friday. Luckily nobody tested positive.

We have several next steps:

– Start a Slack workspace to build a community bridging graph databases and network visualization (to be announced soon!)
– Write a vision paper, similar to the CACM The Future is Big Graphs paper
– Organize workshops and panels in database and network visualization conferences

Our goal is to foster new research opportunities and we plan to meet again in Dagstuhl in a couple of years, with a larger group, to review the progress that will hopefully be made.  

My Takeaways from the Data Architecture Panel at Knowledge Graph Conference

I had the honor to moderate the Data Architecture panel at the 2021 Knowledge Graph Conference. The panelists were:

Zhamak Dehghani, Director of Emerging Technologies at ThoughtWorks and founder of the Data Mesh concept
Teresa Tung, Chief Technologist of Accenture’s Cloud First group
Jay Yu, Distinguished Architect and Director, Enterprise Architecture and Technology Futures Group at Intuit

This panel was a special edition of the Catalog and Cocktails podcast that I host, an honest, no-bs, non-salesy conversation about enterprise data management. We will be releasing the panel as a podcast episode soon, so stay tuned!

Live depiction of the panel

In the meantime, these are my takeaways from the panel:

What are the incentives?

– Need to understand the incentives for every business unit.
– Consider the common good of the whole, instead of individualism
– Example of an incentive: put OKRs and bonuses on the shareability of your data products and the growth of their users

Knowledge Graph and Data Mesh

– Knowledge Graph is an evolution of master data management.
– Data Mesh is an evolution of data lake.
– Knowledge Graph and Data Mesh complement each other. They need to go together.
– However, we still need to figure out how to put them together.

Centralization vs Decentralization

This was the controversial part of the discussion.
– Jay’s position is that the ultimate goal is unified data, and that decentralized ownership of domains is a step in that direction. Zhamak and Teresa do not fully agree.
– Intuit’s approach: there are things that should be fixed (can’t change, e.g., address), flexible (able to be extended), and customizable (if you need to hit the ground running)
– Is the goal to unify data or have unifiable data?
– Centralization and decentralization are two sides of the same coin
– Centralize within the same line of business that is trying to solve the same problem, but don’t expect to keep up with all the new demands for data in the world.

People

– Need to have an answer to the “what’s in it for me?” question. See the incentives takeaway.
– Consider Maslow’s hierarchy of needs
– Be bold, challenge the status quo
– Follow the playbook on change management

Honest, no-bs: What is a Data Product?

– Native data products are close to the raw data; intelligent data products are derived from the native data products
– A data product is complete, clean, documented, and useful, with knowledge about the data, an explanation of how people can use it, and its freshness and lineage
– If you find something wrong with the data product, you need to have ways of providing feedback.
– Data products need to have usability characteristics.
– Data has a Heartbeat: it needs to be alive. The code keeps the data alive. Code and data need to be together. Otherwise it’s like the body separated from the soul. (Beautifully said Zhamak!)

What is the deal with the Data Mesh?

Data Mesh is a topic that has gained a lot of momentum in the past few weeks due to the launch of the Data Mesh Learning community. I first learned about Data Mesh when Prof Frank Neven pointed me to Zhamak Dehghani’s article “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.” Note that Gartner’s Mark Beyer introduced a similar thought under the same name in a Maverick research article back in 2016.

Data Mesh was music to my ears because it centers enterprise data management around people and process instead of technology. That is the main message of my talk “The Socio-Technical Phenomena of Data Integration and Knowledge Graphs” (short version at the Stanford Knowledge Graph course, long version at UCSD). I’ve also been part of implementations of, and know colleagues who have implemented, approaches that I would consider Data Mesh.

In this post, I want to share my point of view on data mesh. Note, that these are my views and opinions and do not necessarily reflect the position of my employer, data.world.

The way we have been managing enterprise data basically hasn’t changed in the past 30 years. In my opinion, the fundamental problems of enterprise data management are:

1. We have defined success from a technical point of view: physically integrating data into a single location (warehouse/lake) is the end goal. In reality, success should also be defined from a social perspective: by those who need to consume the data to answer business questions.

2. We do not treat data with the respect it deserves! We would never push software code to a master branch without comments, without tests, without peer review. But we do that all the time with data. Who is responsible for the data in your organization?

These problems have motivated my research and industry career. I was thrilled to discover Data Mesh and Zhamak Dehghani’s article because it clearly articulates a lot of the work that I have done before and has given me a lot of ideas to think about.

Data mesh is NOT a technology. It is a paradigm shift towards a distributed architecture that attempts to find an ideal balance between centralization and decentralization of metadata and data management.

Its success is highly dependent on the culture (people, processes) within an organization, and not just the tools and technology; hence the paradigm shift. Data mesh is a step in changing the mindset of enterprise data management.

Important principles of data mesh

In my opinion, the two key principles of a data mesh are:

1. Treat data as a first class citizen: data as a product
2. Just as in databases you always want to push down filters, push the data work back down to the domain experts and owners.

Organization/Social/Culture

Let’s talk about organization/social/culture first. Technology after. I have experienced the ideal balance between centralization and decentralization of metadata and data management as follows:

Centralize the core business model (concepts and attributes). For example, the core model of an e-commerce company is simple: Order, Order Line, Product, Customer, Address. An Order has an order date, currency, gross sales, net sales, etc. These core concepts and attributes should be defined by a centralized data council. They provide the definitions and the schema conventions (firstname vs first_name vs fname). It is CORE, which means that they do not boil the ocean. In one past customer experience, the core model started out with 15 concepts and 40 attributes; three years later, it is at 120 concepts and 500 attributes. Every concept and attribute has a reason for existence.
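As an illustration of how small such a core model can start, here is a hedged sketch of a handful of centrally governed concepts and attributes expressed as an OWL ontology with rdflib; the namespace and names are invented for illustration, not an actual customer model.

```python
# A minimal sketch of a centrally governed core model, expressed as OWL with
# rdflib. The namespace, classes, and attributes are invented; a real data
# council would own, define, and version these.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

CORE = Namespace("https://example.com/core/")   # hypothetical namespace
g = Graph()
g.bind("core", CORE)

# Core concepts agreed on by the data council.
for concept in ("Order", "OrderLine", "Product", "Customer", "Address"):
    g.add((CORE[concept], RDF.type, OWL.Class))

# A few core attributes of Order, with agreed names and datatypes.
for attr, dtype in (("orderDate", XSD.date),
                    ("currency", XSD.string),
                    ("grossSales", XSD.decimal),
                    ("netSales", XSD.decimal)):
    g.add((CORE[attr], RDF.type, OWL.DatatypeProperty))
    g.add((CORE[attr], RDFS.domain, CORE.Order))
    g.add((CORE[attr], RDFS.range, dtype))
    g.add((CORE[attr], RDFS.comment,
           Literal(f"Definition of {attr} as agreed by the data council")))

print(g.serialize(format="turtle"))
```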

Decentralize the mapping of the application data to the core business model. This needs to be done by the owners of the data and applications because they are the ones who understand best what that data means.

“But everyone is going to have a different way of defining Customer!” Yes, and guess what… that’s fine! We need to be comfortable with, and even encourage, this friction. People will start to complain, and this is exactly how we learn what the differences are. This friction will eventually bubble up to the centralized data council, who can help sort things out. A typical result is that new concepts get added to the core business model. This friction helps prioritize the importance and usage of the data. Document this friction.

“My concept doesn’t exist in the core business model!” That’s fine! Create a new one. Document it. If people start using it and/or if there is friction, the centralized data council will find out and it will eventually become part of the core business model.

If you are an application/data owner, and you don’t use the core business model, you are not being a good data citizen. People will complain. Therefore, you will be incentivized to make use of the core business model, and extend it when necessary.

People are at the center of the data mesh paradigm

We must have Data Product Managers. Just like we have product managers in software, we need to have product managers for data. Data Consumer to Data Product Manager: “is your data fit for my purpose?”. Data Product Manager to Data Consumer: “is your purpose fit for my data?”

We must have Knowledge Scientists: data scientists complain that they spend 80% of their time cleaning data. That is true, and in reality it is crucial knowledge work that needs to be done: understanding what is actually meant by “net sales of an order” and how it is physically calculated in the database. For example, the business core model states that the concept Order has an attribute netsales, which the business defines as gross sales minus taxes minus discounts. However, there is no netsales attribute in the application database. The mapping could be defined as a SQL query such as SELECT order_id, sales - tax - discount AS netsales FROM orders.
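As a sketch of what that mapping could look like in practice, the following snippet computes the business definition of netsales with SQL over a stand-in application table and lifts the result into the core model as RDF. The table, columns, and namespaces are hypothetical, and in a real project this would more likely be an R2RML mapping or a dbt model than a one-off script.

```python
# A hedged sketch of the kind of mapping a knowledge scientist might write:
# derive netsales (gross sales minus taxes minus discounts) from the
# application table, then express it using the core model's vocabulary.
import sqlite3
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

CORE = Namespace("https://example.com/core/")       # hypothetical core-model namespace
DATA = Namespace("https://example.com/data/order/")

# Stand-in for the application database; table and column names are invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT, sales REAL, tax REAL, discount REAL)")
db.execute("INSERT INTO orders VALUES ('o-1', 100.0, 8.0, 5.0)")

g = Graph()
g.bind("core", CORE)

# The mapping itself: the SQL encodes the business definition of netsales.
for order_id, netsales in db.execute(
        "SELECT order_id, sales - tax - discount AS netsales FROM orders"):
    g.add((DATA[order_id], RDF.type, CORE.Order))
    g.add((DATA[order_id], CORE.netSales, Literal(netsales, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```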

For largish organizations, business units will have their own data product managers and knowledge scientists.

Technology is part of the data mesh too

A data catalog is key to understanding what data exists and to documenting the friction. (Disclaimer: my employer is data.world, a data catalog vendor.) Data catalogs are needed to catalog the as-is, application-centric databases, the systems that consist of thousands of tables and columns, in order to understand what exists. This will be used by the data engineers and knowledge scientists to do their job of creating the data products. Data catalogs will also be used to enable the discovery and (re)use of data products: each domain will create data products to satisfy business requirements, and these data products need to be discovered by other folks in the organization. Therefore, the data catalog will also catalog these data products so others can discover them.

Technologies such as data virtualization and data federation can be used to create the clean data views which can then be consumed by others. Hence, they can be used as implementations for a data mesh.

Knowledge graphs are a modern manifestation of integrating knowledge and data at scale in the form of a graph. Therefore they are perfectly suited to support the data mesh paradigm. Furthermore, the RDF graph stack is ideal because data and schema are both first-class citizens in the model. The core business models can be implemented as ontologies using OWL, they can be mapped to databases using the R2RML mapping language, and the data can be queried with SPARQL in a centralized or federated manner. Even though the data is in the RDF graph model, it can be serialized as JSON. There is no schema language for property graphs. (Disclaimer: my academic roots are in the semantic web and knowledge graph community, I’ve been part of W3C standards, and I’m currently the chair of the Property Graph Schema Working Group.)

Core identities should probably be maintained by the centralized data council, offering an entity resolution service. Another advantage of using the RDF graph stack is that universal identifiers are part of the data model itself, via URIs.

The linkages between the core business models and the application models are source-to-target mappings, which can be seen as transformations that can be represented in a declarative language like SQL and tools such as dbt. Another advantage of using RDF knowledge graphs is that you have standards to implement this: OWL to represent the business core models and R2RML to represent mappings from application database models to the business core models.

There are existing standard vocabularies such as W3C Data Catalog Vocabulary that can (and should) be used to represent metadata.

Another approach that is very aligned with Data Mesh is the Data Centric Architecture.

Final Words

Data Mesh is about being resilient; efficiency comes later. It’s disruptive, due to many changes, and it will probably be inefficient in the beginning. It will enable just a few teams and use cases at first, but it will be the starting point of the data snowball.

The push needs to come from the top, the executive level. This is aspirational; the ROI is not short term.

We need to encourage the bottom up. Data mesh is a way for each business unit to be autonomous and not be bottlenecked by IT.

The power of the data mesh is that everyone governs for their own use case, and the important use cases get leveled up so they can be consumed by higher-level business use cases without over-engineering.

Finally, check out our Catalog and Cocktails podcast episode on Data Mesh (and takeaways)

A Catalog and Cocktails podcast episode on Data Mesh

Want to learn more about Data Mesh? Barr Moses wrote a Data Mesh 101 with pointers to many other articles. (Barr Moses will be a guest on Catalog and Cocktails in a few weeks!)

My Most Memorable Events of 2020: New Podcast, New Book and 20+ Talks

At the start of every year I like to reflect on the most memorable events of the previous year (this was 2019). It’s the start of 2021, and I ask myself: what was my most memorable event of 2020? I couldn’t come up with just one, so here are a few:

Honest, no-BS, non-salesy data podcast

Who would have thought that I would start a podcast! With my partner in crime, Tim Gasper, we host Catalog and Cocktails, an honest, no-bs, non-salesy podcast about enterprise data management. We record it live every Wednesday 4pm CT. We use the first 30 minutes of the show to record the podcast episode, and then open up the Zoom call right after for everyone to join in the discussion.

We began this podcast in May 2020, and it’s turned into something greater than we could have ever imagined. Throughout the past 30 episodes we have discussed a wide range of topics: data governance, data quality, data lineage, knowledge graphs, data culture, build vs buy, ROI and much more. 

We have had guests to chat on various topics:

– Claire Cahill from The Zebra on the role of the data product manager 
– Dean Allemang, Fabien Gandon, and James Hendler, authors of the book Semantic Web for the Working Ontologist
– Dwayne Desaulniers from AP on evolving data culture practices
– Jeremy Baksht from Ascential on data marketplace 
– Jeff Feng from Airbnb on how they built their internal data catalog

In 2021, our podcast is going to evolve and will have many guests to join the conversation. Listen to it on your favorite podcast app (Apple Podcast, Spotify), like and subscribe!

“Designing and Building Enterprise Knowledge Graphs” Book

Ora Lassila and I submitted a complete first draft of the book “Designing and Building Enterprise Knowledge Graphs” to the publisher on Dec 31! 

I’ve been writing this book for a while now (longer than I want to admit). A silver lining of the pandemic is that I was able to focus more time on the book. Additionally, it was an honor that Ora joined me as a co-author. If you are interested in a sneak peek, let me know!

20+ talks 

I value the opportunity to share my thoughts and ideas about data management with a wider audience. In 2020 I gave over 20 invited talks!

Back in October 2019, I gave a keynote at the Ontology Matching Workshop: The Socio-Technical Phenomena of Data Integration and Knowledge Graphs:

Data Integration has been an active area of computer science research for over two decades. A modern manifestation is Knowledge Graphs, which integrate not just data but also knowledge at scale. Tasks such as domain modeling and schema/ontology matching are fundamental in the data integration process. Research has focused on studying the data integration phenomena from a technical point of view (algorithms and systems) with the ultimate goal of automating this task.

In the process of applying scientific results to real world enterprise data integration scenarios to design and build Knowledge Graphs, we have experienced numerous obstacles. In this talk, I will share insights about these obstacles. I will argue that we need to think outside of a technical box and further study the phenomena of data integration with a human-centric lens: from a socio-technical point of view. 

The talk was very well received and I got numerous invitations to give it again:

– DSG Seminar at University of Waterloo (Invited by Semih Salihoglu) – Video
– Ghent University Data Science Seminar (Invited by Ruben Verborgh)
– Hasselt University (Invited by Frank Neven)
– Invited Lecture, CS520 Knowledge Graph at Stanford (Invited by Vinay Chaudhri) – Video
– Knowledge Graph Conference
– Tech Innovations Forum at Columbia University
– Guest Lecture at Lehigh University (Invited by Jeff Heflin)
– Guest Lecture at University of Texas at Austin (Invited by Ying Ding)
– Guest Lecture at Universitat Politècnica de Catalunya (Invited by Oscar Romero)
– Keynote at the 8th Linked Data in Architecture and Construction Workshop (LDAC2020)
– Guest Lecture at University of British Columbia (Invited by Laks Lakshmanan)
– Data Lab Seminar at Northeastern University (Invited by Wolfgang Gatterbauer)
– Distinguished Speaker Series in Data Science and AI at University of Illinois Chicago (Invited by Isabel Cruz)
– Database Lab Research Seminar at UC San Diego (Invited by Arun Kumar) – Video

I started giving talks on the History of Knowledge Graphs. I gave a keynote talk at the OSLC Fest (video) and a longer version as a tutorial with Prof. Claudio Gutierrez at the Conference on Information and Knowledge Management (CIKM 2020).

At data.world I get to work on how to combine open and enterprise data catalogs. I was invited to give a talk on this topic titled, Open to an Enterprise Data Catalog and Back in the European Data Portal webinar series (video).  

I closed the year giving a talk at the Knowledge Connexions Conference with Bryon Jacob titled (DataCatalog)<-[poweredBy]->(KnowledgeGraph).

I also gave numerous invited talks to large companies and startups.

Final Thoughts

As expected I did not travel a lot in 2020 (my last trip was March 11). During the first months of 2020, I flew 37,000 miles and visited Canada, Belgium, Netherlands and India (in 2019 I flew 143,000 miles and visited 13 countries). Can’t wait to get back to travel in the second half of 2021 hopefully!

International Semantic Web Conference (ISWC) 2020 Trip Report

The International Semantic Web Conference (ISWC) is my “home” conference. I’ve been attending since 2008 (I only missed 2010) and it’s the place where I reconnect with friends and colleagues and get to meet new people. It is always a highlight of my year. ISWC 2017 was in Vienna, ISWC 2018 was in Monterey, California, ISWC 2019 was in Auckland, and this year we were supposed to be in Athens! Oh COVID!
My takeaways:

  • Realization that we need to understand users!
  • Are we educating the new generation of computer scientists enough? No, they need to learn about knowledge engineering!
  • Creative RDF Graph Data Management
  • Data, data data
  • Of course… embeddings, neuro-symbolic and explainable AI were hot topics
  • This is an eclectic community!

Users, users, users

My current research interest is on understanding the social-technical phenomena of data integration. Therefore my eyes and ears were focused on topics about users. I know I’m biased here, but for me, one of the strongest topics at ISWC this year was about users.

It all started with AnHai Doan‘s keynote at the Ontology Matching workshop. The main takeaway: evaluation is not about how much you can automate, but about how much user productivity increases.

This was music to my ears! In previous conversations I’ve had with AnHai, I was happily surprised to learn that we were both tackling problems in similar ways: let’s break down the problem into several steps, let’s figure out how to solve it manually, and that becomes the baseline from which we can improve. AnHai has been focusing on entity linking, while I have been focusing on schema/ontology matching. There were many lessons to learn from AnHai’s experience (his startup was acquired by Informatica earlier this year).

Users came up on the topic of ontology/schema matching:

I had a “hallway” chat with Catia Pesquita and she mentioned the need for a “usefulness metric”, which I think is spot on. In the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, I mentioned the following in our Slack discussion:

Food for thought: how about some sort of User Metric that measures the productivity of a user. For example, if System A has lower precision than System B but it takes a lot of effort to set up/maintain/… System B, then maybe I would prefer System A. This is just an example. I’m bringing this up because I’m seeing a trend throughout the conference about the realization that we need to understand more how users are involved.

The community is in agreement that we need to expand the SemTab challenge to take users into account. I’m excited to be part of the organizing committee of this challenge for next year.

Cassia Trojahn presented a paper “Generating Expressive Correspondences: An Approach Based on User Knowledge Needs and A-Box Relation Discovery” which tackles two things that I’m interested in: complex matches (the real world is not about simple 1-1 schema matches) and users’ needs. Unfortunately, there was no user evaluation. This is the next step that we need to take as a community. Evaluate the cost of having users and how we can decrease that cost.

Larry Hunter‘s keynote was on using semantic technologies in the life sciences. I was fascinated by the honest discussion on how the data gets integrated: by paying experts to manually verify it. Larry clearly acknowledged that we need genuine experts to validate the mappings. In his case, the experts are doctors and they are expensive. Therefore, he needed a budget in order to hire people to do the expert mappings. And this requires time. We lack methodologies to do this in a systematic way. While listening to his presentation, all I could think about was the need to focus on user productivity.

Users also came up from a developer standpoint: one thing that makes me cringe is when I read or hear people making claims about users or developers without any evidence supporting those claims. The work on LDflex and OBA tackles how to make RDF and Linked Data more accessible to developers. During the discussion of these papers, Ruben Taelman and Daniel Garijo made statements such as “developers like X, they prefer Y, etc.” These are anecdotes. The funny thing is that the anecdotes of the LDflex and OBA authors contradicted each other. My takeaway: it’s still not clear what developers would prefer.

Miriam Fernandez gave a splendid vision talk. She questioned the following: “the web we have created… the view is of the creators… is this really a shared conceptualization? How much of the knowledge we have created that is on the web contains alternative facts?” and posed these questions to everyone.

In Peter Fox‘s vision talk, he asked, “Are we educating enough of the new generation?” IMO, we are not (see next section). He also reminded us about humans in the loop, a topic that is gaining a lot of traction in many other fields of Computer Science.

My overall takeaway here is that by looking at traditional problems and their incremental solutions from a socio-technical perspective, we can make the science much more interesting. We need to define evaluation metrics for users. As I mentioned in a Slack discussion:

As a community, we need to push ourselves into the uncomfortable position of doing user evaluations. Is this hard? OF COURSE!! but heck, that’s what makes it interesting. Life can’t be that easy

Knowledge Science (a.k.a Knowledge Engineering 2.0)

Another interest of mine is understanding how to bridge the gap between data producers and data consumers. I’ve argued for the need for Knowledge Scientists and Data Product Managers (listen to our podcast episode on this topic) to fill that gap.

Oh, was I happy to see this topic as Elena Simperl‘s vision. She asked: “What do we know about the technical, scientific and social aspects involved in building, maintaining and using knowledge based systems?” This community has a lot to say because many of us come from the Knowledge Acquisition community. A lot of hard questions are still open: how do we capture common sense knowledge (there was a tutorial on that), culture, and diversity? There are many HARD questions that we need to ask ourselves about modeling knowledge that are outside of our comfort zone (how do we model negative facts or ignorance?). From a tooling perspective, where is the equivalent of Jupyter notebooks for knowledge modeling (Gra.fo is a step in that direction, combining knowledge modeling with collaboration)? Elena stated that the next wave of AI will not succeed unless we study these hard questions, and I fully agree. If we do not understand the knowledge about our data, it is going to continue to be garbage in, garbage out. Similarly, Peter Fox asked: “Are we educating enough of the new generation?” IMO, we are not! My takeaway: we need to teach knowledge engineering to the next generation of students, and we need to research knowledge engineering again, taking into account the data disciplines of the 20th century (knowledge engineering was popular in the 1990s) and interdisciplinary methods. Additionally, we need to work with other communities.

Together with Cogan Shimizu, Rafael Gonçalves, and Ruben Verborgh, we organized the PRAXIS workshop: The Semantic Web in Practice: Tools and Pedagogy. Our goal was to have a WORKshop and gather a community focused on the collection, development, and curation of pedagogical best practices, and the tools that support them, for the Semantic Web and Knowledge Graph communities.

We had a successful event discussing the need to create a syllabus for a bachelor’s course on Knowledge Engineering/Science. We acknowledged that courses in a master’s program are already too late. We are going to start cataloging existing semantic web, knowledge graph, and knowledge engineering courses. Stay tuned, because we will need your help!

Overall, I believe we are realizing that we need to reinvent the role of Knowledge Engineering for the 2020s: Knowledge Science (a.k.a Knowledge Engineering 2.0)

Creative RDF Graph Data Management

Around a decade ago, there was a lot of RDF data management work at ISWC that resembled work that could have been published at a database conference. Largely, that type of work has gone away. This year I was happily surprised to see this topic come back with novel and creative approaches.

Tentris is a tensor-based triple store where an RDF graph is mapped to a tensor and SPARQL queries are mapped to Einstein summations, leveraging worst-case optimal multi-join algorithms. Juan Reutter gave the keynote “AGM bound and worst-case optimal joins + Semantic Web” at the 4th Workshop on Storing, Querying, and Benchmarking the Web of Data. The AGM bound is “one of the most important results in DB theory this century,” and it has led to the rise of “worst-case optimal” join algorithms. This is a very popular topic in the database community that the semantic web community should look into. Trident is a new graph storage engine for very large knowledge graphs with the goal of improving scalability while maintaining generality and support for reasoning, and it runs on cheap hardware.
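For readers who, like me, needed a refresher, here is the textbook illustration of what the AGM bound says for the triangle query (standard material, not something specific to the keynote):

\[
Q(x,y,z) \leftarrow R(x,y) \wedge S(y,z) \wedge T(z,x),
\qquad
|Q| \;\le\; |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2}
\]

So with |R| = |S| = |T| = N, the output has at most N^{3/2} tuples, and worst-case optimal join algorithms such as Leapfrog Triejoin run in roughly that time, whereas a pairwise join plan can materialize on the order of N^2 intermediate results.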

I was very excited to see the work on HDTCat, a tool to merge the contents of two HDT files with a low memory footprint. RDF HDT is a compressed format for RDF (basically the Parquet of RDF). This is a problem we encountered at data.world a while back. Every dataset ingested in data.world is a named graph represented in RDF-HDT, so when running a SPARQL query over multiple named graphs we hit this issue when the data was large: it just used too much memory. It was very nice to see that the solution presented in HDTCat is similar to what we did at data.world to solve this problem.

I enjoyed seeing how the community is looking at extending SPARQL in many different ways:

– making it Turing complete to support graph analytics (e.g., PageRank)
– adding similarity joins
– combining graph data with raster and vector data (the GeoSPARQL+ paper was a best student paper winner)

There was also work studying SPARQL query logs and user sessions to understand user behavior.

Data, data, data

The vision talks speakers had some fantastic insights about data.

Barend Mons reemphasized the need for FAIR data and how we should keep metadata and data separate; for many applications, you need to consume only the metadata initially. Barend made a bold and strong statement: invest 5% of research funds in ensuring data are reusable. Jeni Tennison gave a heartfelt message: don’t use data for negative aspects of life; data and access to data are political; access to data should be the norm; and we need a world where data works for everyone. Fabien Gandon reminded us that the web connects ALL things. Stefan Decker made an important call to take persistent identifiers seriously:

This seems like a small issue but it is CRUCIAL. If we were to think about persistent identifiers correctly from the beginning, I postulate that many of the data integration problems we suffer from would go away.

Oh, and I believe Peter Fox coined the term semantilicious. Is your data semantilicious?

Of Course…

Of course the expected hot topics were present (take a look at the list of accepted papers).

Of course …there was a lot of work presented about embeddings!

Of course… the combination of neuro-symbolic approaches was a hot topic. This was in Uli Sattler‘s vision. Take a look at the Common Sense Knowledge Graph tutorial.

Of course… Explainable AI was a topic. In particular, I appreciated Helena Deus‘ vision of incorporating the bias into the model such that the model can be avoided when it’s not applicable (if the model is trained on lung images, don’t use it on brain images).

More Notes

We had a lot of great social events: Ask Me Anything with Craig Knoblock, Jim Hendler, Natasha Noy, Elena Simperl, Mayank Kejriwal. We also had meetups: Women in Semantic Web and Semantic Web Research in Asia. The Remo platform worked very well for “hallway” conversations.

The vision talks and sister conference presentations are awesome. Please keep that!

In an AMA conversation with Jim, he shared Tim Berners-Lee’s pitch for the semantic web: my document can point to your document, but my spreadsheet can’t point to yours. In other words, my data can’t point to your data.

Need to take a look at “G2GML: Graph to Graph Mapping Language for Bridging RDF and Property Graphs” – http://g2gml.fun/

Need to take a look at “FunMap: Efficient Execution of Functional Mappings for Scaled-Up Knowledge Graph Creation” – https://github.com/SDM-TIB/FunMap

Need to take a look at “Tab2Know: Building a Knowledge Base from Scientific Tables”

From what I heard, the Wikidata workshop was a huge hit.

My friends at UPM gave a Knowledge Graph Construction tutorial. I believe this topic has a lot of interesting scientific challenges when users come into play. A lot of opportunities here!

Chris Mungall gave an interesting keynote at the Ontology Design Patterns workshop on how to use design patterns to align ontologies in the life sciences. What I appreciated about his talk is the practicality of his work. He is putting theory into practice.

How do you represent negative facts in a knowledge graph? See “Enriching Knowledge Bases with Interesting Negative Statements“. Larry Hunter brought up something related in his keynote: how do you represent ignorance?

How can we make RDF/Linked Data/Knowledge Graphs friendly for developers? See the SPARQL Endpoints and Web API Tutorial, “OBA: An Ontology-Based Framework for Creating REST APIs for Knowledge Graphs”, and LDflex.

I realized that I didn’t spend time at industry/in-use talks (Last year I spent most of my time at in-use/industry talks). Need to review those papers.

Something new to add to my history of knowledge graph work:

Kavitha Srinivas gave a cool keynote on knowledge graphs for code.

RDF* always seems to come up

Oh, and in case you need some ideas:

My Daily takeaways

Congrats to all the winners

My main takeaway: this is an eclectic community!

The semantic web community is truly an eclectic community. At this conference you can see work and talk to people about Artificial Intelligence, Knowledge Representation and Reasoning, Ontology Engineering, Machine Learning, Explainable AI, Databases, Data Integration, Graphs, Data Mining, Data Visualization, Human Computer/Data Interaction, Streaming Data, Open Data, Programming Languages, Question Answering, NLP and, of course, the Web! Therefore, if you feel that you don’t fully fit in your research community because you dabble in other areas, the semantic web community may be the place for you!

This is also a diverse community

I’m very proud to be part of the community and to consider it home! I miss hanging out with all my friends, having inspiring conversations, dancing, eating, and making new friends.

Massive THANK YOU to the entire organizing committee to make this an amazing virtual event, specially the general chair Lalana Kagal!

Hopefully “see” you next year in Albany, NY!

SIGMOD 2020: A Virtual Trip Report

This past week was the SIGMOD conference. I’ve always known that there is a lot of overlapping work between the database and semantic web communities, so I have been attending regularly since 2015. The topics I’m most interested in are data integration and graph databases, which are the focus of this trip report.

I was really looking forward to going to Portland but that was not possible due to Covid. Every time I attend SIGMOD I get to know more and more people, so I was a bit worried that I wouldn’t get the same experience this year. This virtual conference was FANTASTIC. The organizers pulled off a phenomenal event. Everything ran smoothly. Slack enabled deep and thoughtful discussions because people spent time organizing their thoughts. There were social and networking events. Gather was an interesting experience to simulate real-world hallway conversations. The Zoomside chats were AMA (ask me anything) sessions with researchers on different topics. The Research Zoomtables were relaxed discussions between researchers about research topics. The panels on Women in DB, The Next 5 Years, and Startups generated a lot of discussion afterwards on Slack. Oh, and how can we forget the most publicized and popular event of all: Mohan’s Retirement Party.

Science is a social process. That is why I value conferences so much: they are the opportunity to reconnect with many colleagues, meet new folks, discuss ideas and projects, etc. The downside of a virtual conference is that your normal work week continues (unless you decide to disconnect 100% from work during the week). I truly hope that next year we will be back to some sense of normality and we can meet again f2f.

My takeaways from SIGMOD this year are the following:

1) There is still a gap between real world and academia when it comes to data integration. I personally believe that there is science (not just engineering) that needs to be done to bridge this gap.
2) Academia is starting to study data lakes and data catalogs. There is a huge opportunity (see my previous point).
3) There is interest from academia to come up with novel interfaces to access data. However, it will just be an academic exercise with very little real-world impact if we don’t understand who the user is. To do that, we need to connect more with the real world.
4) Graphs continue to gain more and more traction in industry.

I’m very excited that this community is looking into the needs and features of data catalogs. This is a topic dear to my heart: I am the Principal Scientist at data.world, which is the only enterprise data catalog that is cloud-native SaaS with virtualization and federation, powered by a Knowledge Graph.

RESEARCH

There was a very interesting slack discussion about research and the “customer” that was sparked after the panel “The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?”.

AnHai Doan commented that the community understands the artificial problems in research papers instead of understanding the real problems that customers face. Therefore, there is a need to identify the common use cases (not corner cases) that address 80% of customers’ needs, and to own those problems.

To that, Raul Castro Fernandez pointed out that systems work is disincentivized because reviewers always come back with “just engineering.” Personally, if there is a clear hypothesis and research question, with experiments that provide evidence to support the hypothesis, then the engineering is also science. Otherwise, it is engineering.

Joe Hellerstein chimed in with spot-on comments that I won’t try to summarize, so here they are verbatim:

I would never discourage work that is detached from current industrial use; I think it’s not constructive to suggest that you need customers to start down a line of thinking. Sounds like a broadside against pure curiosity-driven research, and I LOVE the idea of pure curiosity-driven research. In fact, for really promising young thinkers, this seems like THE BEST reason to go into research rather than industry or startups

What I tend to find often leads to less-than-inspiring work is variant n+1 on a hot topic for large n. What Stonebraker calls “polishing a round ball”.

Bottom line, my primary advice to folks is to do research that inspires you.

“...if you are searching for relevance, you don’t need to have a friend who is an executive at a corporation. Find 30-40 professionals on LinkedIn who might use software like you’re considering, and interview them to find out how they spend their time. Don’t ask them “do you think my idea is cool” (because they’ll almost always say yes to be nice). Ask them what they do all day, what bugs them.  I learned this from Jeff Heer and Sean Kandel, who did this prior to our Wrangler research, that eventually led to Trifacta. It’s a very repeatable model that simply requires different legwork than we usually do in our community.

– Joe Hellerstein in a slack discussion

DATA INTEGRATION

Most of the data integration work continues to be on the topics of data cleaning and data matching/entity matching/entity resolution/…. This makes sense to me: it is an area with a lot of data and plenty of opportunities to keep automating. The following papers are on my to-read list:

Given how data lineage is an important feature of data catalogs, I was keen to attend the Provenance session. At data.world, we represent data lineage as provenance using PROV-O. Unfortunately I missed it and was only able to catch the tail of the Zoomtable discussion. My biased perception is that the academic discussions on provenance are disconnected from reality when it comes to data integration. I shared the following with a group of folks: “From an industry perspective, data lineage is something that companies ask for from data integration/catalog/governance companies. The state of the art in the industry is to extract lineage from SQL queries, stored procedures, ETL tools and represent this visually. This can now be done. Not much science here IMO. There is a push to get lineage from scripts/code in Java, Python. What is the academic state of the art of reverse engineering Java/Python/… code used for ETLing?“

Zach Ives responded that there has been progress in incorporating UDFs and certain kinds of ETL operations, with human expertise incorporated, but he wasn’t aware of work that does this automatically from Java/Python code.

Join this webinar if you are interested in learning about how we are using data lineage to tackle complex business problems.
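Since I mentioned PROV-O, here is a minimal sketch of how a derived table’s lineage might be expressed with it, in Python with rdflib. This is an illustration only, not our production model at data.world; the dataset and job names are made up.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical identifiers

g = Graph()
g.bind("prov", PROV)

raw = EX["raw_sales_csv"]            # source file
cleaned = EX["cleaned_sales_table"]  # derived table
etl_job = EX["etl_job_42"]           # the ETL run that produced it

g.add((raw, RDF.type, PROV.Entity))
g.add((cleaned, RDF.type, PROV.Entity))
g.add((etl_job, RDF.type, PROV.Activity))

# Lineage: the cleaned table was derived from the raw file via the ETL job.
g.add((cleaned, PROV.wasDerivedFrom, raw))
g.add((cleaned, PROV.wasGeneratedBy, etl_job))
g.add((etl_job, PROV.used, raw))

print(g.serialize(format="turtle"))
```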

I was pointed to the following that I need to dig into:

As AnHai noted in a slack discussion, there is still a need to bridge the gap between academia and the real world. Quoting him:

For example, you said “we understand the problems in entity matching and provenance”. But the truth is: we understand the artificial problems that we define for our research papers. Not the real problems that customers face and we should solve. For instance, a very simple problem in entity matching is: develop an end-to-end solution that uses supervised ML to match two tables of entities. Amazingly, for this simple problem, our field offers very little. We did not solve all pain points for this solution. We do not have a theory on when it works and when it doesn’t. Nor do we have any system that real users can use. And yet this is the very first problem that most customers will face: I want to apply ML to solve entity matching. Can you help me?

– AnHai Doan in a slack discussion


DATA ACCESS

I’m observing more work that intends to lower the barrier for accessing data via non-SQL interfaces such as natural language, visual interfaces and even speech! The session on “Usability and Natural Language User Interfaces” was my favorite one because the topics were “out of the box” and “curiosity-driven”. I am very intrigued by the QueryVis work to provide diagrams to understand complicated SQL queries. I think there is an opportunity here, but the devil is in the details. The SpeakSQL paper sparked a lot of discussion. Do we expect people in practice to dictate a SQL query? In the Duoquest paper, the researchers combine the Natural Language Interface approach with Programming-by-Example (PBE), where a user provides a sample query result. I’ve seen this PBE approach over and over in the literature, specifically for schema matching. At a glance it seems an interesting approach, but I do not see the real-world applicability… or at least I’ve never been exposed to a use case where the end user has, and/or is willing to provide, a sample answer. However, I may be wrong about this. This reminds me of AnHai’s comments on corner cases, but at the same time, this is curiosity-driven research.

Papers on my to-read list:

Another topic very dear to me is data catalogs. Over the past couple of years I’ve been seeing research on the topics of dataset search/recommendation, join similarity, etc., that are important features for data catalogs. I’m looking forward to digging into these two papers:

On this topic of data lakes, I’m really really bummed that I missed several keynotes:

I can’t wait to watch the videos.

GRAPHS

If you look at the SIGMOD research program, there are ~20 papers on the general topic of graphs out of ~140 research papers, plus all the papers from the GRADES-NDA workshop. The graph work I was most attracted to came from industry: Alibaba, Microsoft, IBM, TigerGraph, SAP, Tencent, Neo4j.

I found it intriguing that Alibaba and Tencent are both creating large scale knowledge graphs to represent and model the common sense of their users. Cyc has been at it for decades. Many researchers believe that this is the wrong approach. But then 10 years ago schema.org came out as a high-level ontology that web content producers are adhering to. Now we are seeing these large companies creating knowledge bases (i.e. knowledge graphs) that integrate not just knowledge and data at scale, but also common sense. Talk about “what goes around comes around.”

Every year that I attend SIGMOD, it is a reminder that the database and semantic web communities must talk to each other more and more. Case in point: IBM presented DB2 Graph, where they retrofit graph queries (Property Graph model and Tinkerpop) on top of relationally-stored data. I need to dig into this work, but I have the suspicion that it overlaps with work from the semantic web community. For example, Ultrawrap, Ontop and Morph, among others, are systems that execute SPARQL graph queries on relational databases (note: Ultrawrap was part of my PhD and the foundation for my company Capsenta, which was acquired by data.world last year). There are even W3C standards to map relational data to RDF graphs (i.e. Direct Mapping, R2RML). Obviously the semantic web community has studied these problems from the perspective of RDF Graphs and SPARQL. Nevertheless, it’s all just a graph, so the work is overlapping. In the spirit of cross-communication, I was thrilled to see Katja Hose‘s keynote at GRADES-NDA where she presented work from the semantic web community such as SPARQL Federation, Linked Data Fragments, etc.
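To give a flavor of that overlap, here is a toy Python sketch of the Direct Mapping idea: relational rows become triples whose subject IRIs are built from the table name and primary key. This is an illustration only, not the W3C specification; the table, base IRI and naming scheme are made up.

```python
import sqlite3

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# A toy relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ada'), (2, 'Grace')")

BASE = Namespace("http://example.org/base/")  # hypothetical base IRI
g = Graph()

# Direct-Mapping flavor: each row becomes a subject IRI, each column a predicate.
for row_id, name in conn.execute("SELECT id, name FROM person"):
    subject = URIRef(BASE + f"person/id={row_id}")
    g.add((subject, RDF.type, URIRef(BASE + "person")))
    g.add((subject, URIRef(BASE + "person#name"), Literal(name)))

print(g.serialize(format="turtle"))
```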

Another topic that was brought up by Semih Salihoglu was the practical use of graph analytics algorithms. This discussion was sparked by the paper “Graph Based Benchmark Suite.” It was very neat to learn that Neo4j has actually started to categorize the graph algorithms that are used in practice. In their graph data science library, algorithms exist within three tiers: production-quality, beta and alpha. These tiers serve as proxies for what is being used in the real world.
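For a taste of what calling one of those production-tier algorithms looks like, here is a minimal sketch using the Neo4j Python driver. Everything here is assumed: a local Neo4j instance with the Graph Data Science plugin installed, a toy graph of Person nodes (with a name property) connected by KNOWS relationships, and the credentials.

```python
from neo4j import GraphDatabase

# Assumed local instance with the Graph Data Science plugin installed.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Project an in-memory graph, then run a production-tier algorithm (PageRank).
    session.run("CALL gds.graph.project('people', 'Person', 'KNOWS')")
    result = session.run("""
        CALL gds.pageRank.stream('people')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).name AS name, score
        ORDER BY score DESC LIMIT 5
    """)
    for record in result:
        print(record["name"], record["score"])

driver.close()
```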

Papers on my to-read list:

WHO IS THE USER?

A topic that came up in the “Next 5 Years” panel was the need for results to be “used in the real world” and for tools to be “easy to use”. This is inevitable in research because the opposite would be a falsehood (doing research so that it’s used in an artificial world and hard to use). I believe that a missing link between “used in the real world” and “easy to use” is understanding the USER. I also believe it is paramount that the database research community understands who the users in the real world are. It’s not just data scientists. We have data engineers, data stewards, data analysts, BI developers, knowledge scientists, product managers, business users, etc. I believe that we need to look at data integration and the general data management problem not just from a technical point of view (which is what the database community has been doing for 20+ years), but from a social aspect: understanding the users, the processes, and how they are connected by end-to-end technology solutions. This takes us out of our comfort zone, but this is what is going to push the needle and maximize the impact.

For the past year, I’ve been advocating for researching the phenomenon of data integration from a socio-technical angle (see my guest lecture at the Stanford Knowledge Graph course), for methodologies to create ontologies and mappings, and for the new role of the Knowledge Scientist.

Joe Hellerstein provided another great comment during our slack discussion:

Building data systems for practitioners who know the data but not programming (current case in point—public health researcher) is a huge challenge that we largely have a blindspot for in SIGMOD. To fix that blindspot we should address it directly. And educate our students about the data analysis needs of users outside of the programmer community.

– Joe Hellerstein in a slack discussion

While watching the presentations of the “Usability and Natural Language User Interfaces” session, I kept asking myself: who is the user? What are the characteristics that define that user? Sometimes this is well defined, sometimes it is not. Sometimes it is connected with the real world, sometimes it is not.

The HILDA workshop and community is addressing this area and I’m very excited to get involved more. All the HILDA papers are on my to-read list. I’m leaving with a very long list of papers to read and new connections.

Thanks again to the organizers for an amazing event. Can’t wait to see what happens next year.

Additional notes:

Oh, and a final reminder to students:

My Most Memorable Event of 2019

I travelled a lot in 2019, actually a bit more than in 2018, but stayed more in Austin. I continue to be a United 1K, for my fourth consecutive year. I flew 143,812 miles, which is equivalent to 5.8 times around the earth. I was on 101 flights and spent 350 hours (~14.6 days) on a plane. I visited 13 countries (including 3 new ones): Chile, Colombia, Cuba (new), Ecuador, France, Germany, Greece, Italy, Mexico, Netherlands, New Zealand (new), Paraguay (new) and Switzerland. I was on the road for a total of 142 days (~40% of the year), of which I spent 34 days in Europe, 28 in Colombia and 18 in Chile, among others.

Given all this travel and everything I did in 2019 and everything that occurred throughout the year, I asked myself: what was my most memorable event of 2019?

In this post, I’ll focus on the business aspect 🙂

In the blog post where I announced the acquisition of Capsenta by data.world, I stated two main reasons why I was excited: a perfect technical match and a perfect mission/vision match. After 6 months, those reasons still hold!

My personal career aspiration is to have a cycle which starts from a basic research problem that is motivated by industry needs. However at the beginning of the cycle, the industry may not understand the importance and value of the basic research problem. Time goes by, the research matures and solutions are developed. Industry continues to evolve and eventually their problems and needs catch up to the solution that has been developed in the research. Time to commercialize! During commercialization is when you truly bridge research and industry; theory and practice (giving a talk on this topic was my most memorable event of 2018).

I started this cycle with the following basic research question: “How, and to what extent, can Relational Databases be integrated with the Semantic Web?” (see page 4 of my PhD dissertation). When I got exposed to the Semantic Web in the mid 2000s, I always thought that companies would want to put “a semantic web” on top of their relational databases, but they would not be able to move their data out of those databases. Lo and behold, around 2010 we started to get knocks on our door asking how to map relational databases to RDF graphs and OWL ontologies, and that is how Capsenta started.

During commercialization, one of the many lessons I learned is that the business cares about the solutions to their problems and not necessarily the technology behind the solution. This is a mistake I see all the time: marketing and selling the technology instead of the solution. At Capsenta, we started to focus on the problem that business users are not able to answer their own questions in BI tools due to a conceptualization gap between the business user’s mental model of the data and the actual physical representation of the data. We solved that problem using semantic web technologies.

With the acquisition of Capsenta by data.world, I feel that I have closed this first cycle. It took 10 years!

This cycle has been a tremendous learning experience for me. I will be writing a post about all the lessons learned through this cycle.

2020 will definitely bring a lot of exciting developments. Stay tuned!

Trip Report on Big Graph Processing Systems Dagstuhl Seminar

As always, it is a tremendous honor to be invited to a Dagstuhl seminar. Last week, I attended the seminar on “Big Graph Processing Systems”.

During the first day, every participant presented for 5 minutes on where they were coming from and what their research interests are. There was an interesting mix of large scale processing and graph database systems researchers with a handful of theoreticians. My goal was to push for the need to get users involved in the data integration process, and I believe I accomplished it.

The organizers pre-selected three areas to ground the discussions: 

Abstraction: While imperative programming models, such as vertex-centric or edge-centric programming models, are popular, they lack a high-level exposition for the end user. To increase the power of graph processing systems and foster the usage of graph analytics in applications, we need to design high-level graph processing abstractions. It is currently completely open what future declarative graph processing abstractions could look like.

Ecosystems: In modern setups, graph-processing is not a self-sustained, independent activity, but rather part of a larger big-data processing ecosystem with many system alternatives and possible design decisions. We need a clear understanding of the impact and the trade-offs of the various decisions in order to effectively guide the developers of big graph processing applications.

Performance: Traditional measures of efficiency and scalability, e.g. FLOPS, throughput, or speedup, are difficult to apply to graph processing, especially since performance depends non-trivially on platform, algorithm, and dataset. Moreover, running graph-processing workloads in the cloud raises additional challenges. Addressing such performance-related issues is key to identifying, designing, and building widely recognized benchmarks for graph processing.

I participated in the Abstractions group because it touches more on topics of my interest such as graph data models, schemas, etc. Thus this report only takes into account the discussions I had in this group. 

Setting the Stage

During a late night wine conversation with Marcelo Arenas (wine and late night conversations are a crucial part of Dagstuhl), we talked about the two kinds of truth: 

“An ordinary truth is a statement whose opposite is a falsehood. A profound truth is a statement whose opposite is also a profound truth” – Niels Bohr

If we apply this to a vision, we can consider an ordinary vision and a profound vision. 

An example of an ordinary vision: we need to make faster graph processing systems. This is ordinary because the opposite is false: we would not want to design slower graph processing systems. 

With this framework in mind, we should be thinking about profound visions. 

Graph Abstractions

There seems to be an understanding, and even an agreement in the room, that graphs are a natural way of representing data. The question is WHY?

Let’s start with a few observations: 

Observation 1: there have been numerous types of data models and corresponding query languages. Hand-waving a bit, we can group these into tabular, graph, and tree, with many different flavors. 

Observation 2: What goes around comes around. We have seen many data models come and go several times in the past 50 years. See the Survey of Graph Database Models by Renzo Angles and Claudio Gutierrez, and even our manuscript on the History of Knowledge Graphs.

So, why do we keep inventing new data models? 

Two threads came out of our discussions

1) Understand the relationship between data models

Over time, there have been manifold data models. Even though the relational model continues to be the strongest, graph data models are increasingly popular, specifically RDF Graphs and Property Graphs. And who knows, tomorrow we may have new data models that gain force. With all of these data models, it is paramount to understand how they relate to each other. 

We have seen approaches that study how these data models relate to each other. During the 90s, there was a vast amount of work connecting XML (a tree data model) with the relational data model. The work we did on mapping relational data to RDF graphs led to the foundation of the W3C RDB2RDF Direct Mapping standard. Olaf Hartig’s work on RDF* relates RDF Graphs to Property Graphs.

These approaches all have the same intent: understand the relationship between data model A and data model B. However, these independent approaches are disconnected from each other.

The question is: what is a principled approach to understand the relationship between different data models? 

Many questions come to mind: 

  • How do we create mappings between different data models? 
  • Or should we create a dragon data model that rules them all, such that all data models can be mapped to the dragon data model? If so, what are all the abstract features that a data model should support? 
  • What is the formalism to represent mappings? Logic? Algebra? Category Theory? 
  • What are the properties that mappings should have? Information, Query and Semantics preserving, composability, etc.

2) Understand the relationships between data models and query languages with users

It is our understanding (“our feeling”) that a graph data model should be the ultimate data model for data integration. 

Why? 

Because graphs bridge the conceptualization gap between how end users think about data and how data is physically stored. Over and over again we kept saying that “graphs are a natural way of representing data”. 

But, what does natural even mean? Natural to whom? For what? 

Our hypothesis is that this lack of understanding between data and users is the reason why we keep inventing new data models and query languages. We really need to understand the relationship between data models and query languages and their users. We need to understand how users perceive the way data is modeled and represented. We need to work with scientists and experts from other communities to design methodologies, experiments and user studies. We also need to work with users from different fields (data journalists, political scientists, life scientists, etc.) to understand their intents. 

Bottom line, we need to realize that user studies are important and we need to work with the right people.

This trip report barely scratches the surface. There were so many other discussions that I wish I was part of. We are all working on a vision paper that will be published as a group. We are expecting to have a public draft by March 2020.

Overall, this was a fantastic week and the organizers did a phenomenal job.


International Semantic Web Conference (ISWC) 2019 Trip Report

My takeaways from this year’s ISWC:

  • Less is sufficient
  • Theory and practice is happening more and more and it’s getting rewarded
  • We need to think bigger
  • Semantic and Knowledge Graph technologies are doing well in industry

So let’s start! This was a very exciting and busy ISWC for me. 

For the past two years, Claudio Gutierrez and I have been researching the history of knowledge graphs (see http://knowledgegraph.today/). We culminated this work with a paper and a tutorial at ISWC. It was very well received:

I also gave the keynote “The socio-technical phenomena of data integration” at the Ontology Matching Workshop

Part of my message was to push the need for the Knowledge Scientist role.

I gave a talk on our in-use paper “A Pay-as-you-go Methodology to Design and Build Enterprise Knowledge Graphs from Relational Databases” 

In conjunction with Dave Griffith, I also gave an industry presentation on how we are building a hybrid data cloud at data.world. Finally, I was also on an industry panel.

Oh, and data.world had a table

And our socks were a hit

Let’s not forget about Knowledge

Jerome Euzenat’s keynote was philosophical. His key message was that we have gotten too focused on data and are forgetting about knowledge and how knowledge evolves. 

I agree with him. In the beginning of the semantic web, I would argue that the focus was on ontologies (i.e. knowledge). From the mid 2000s, the focus shifted to data (Linked Data, LOD) and that is where we have been. We should not forget about knowledge. And it’s because of this:

I would actually rephrase Jerome’s message: it’s not just that we should not forget about knowledge; we should not forget about combining data and knowledge at scale. 

And don’t forget:

Outrageous Idea

There was a track called Outrageous Idea, and the outrageous issue was that most of the submissions were rejected because they weren’t considered outrageous by the reviewers. This led to an interesting panel discussion. 

The semantic web community has a track record of being outrageous:

  • The idea of linking data on the web was crazy and many thought it would not happen. 
  • Even though SPARQL is not a predominant query language, one of the largest repositories of knowledge, Wikidata, is all in RDF and SPARQL. 
  • Querying the web as if it were a database was envisioned in the early 90s, and given the linked data on the web, it actually became possible (see Olaf Hartig’s PhD dissertation and all the work it spawned. No wonder his 2009 paper received the 10-year prize this year. Congrats my friend!). 
  • Heck, the semantic web itself is an outrageous idea (that hasn’t yet been fulfilled). 

However, there is a sentiment that this community is stuck and focused on incremental advances. Something needs to change. For example, we should have a venue/track to publish work that may lack a bit of scientific rigor because it is visionary, may not have well-defined research questions or a clearly stated hypothesis (because we are dreaming!), or whose evaluation is preliminary/lacking because it’s still not understood how to evaluate it or what to compare it to. Rumor has it that there will be some sort of a vision track next year. Let’s see!

Pragmatism in Science

It was great to see scientific contributions combining theory and implementation, thus being more pragmatic. A catalyst, in my opinion, was the Reproducibility Initiative. Several papers had a “Reproduced” tag to note that the results were implemented, the code was available and that a third party reproduced the results. One of the best research paper nominees, “Absorption-Based Query Answering for Expressive Description Logics” won the Best Reproducibility award. These researchers are well respected theoreticians, and it’s very interesting to see how they are interested in bridging their theory with practice.  

The best research paper, “Validating SHACL constraints over a SPARQL endpoint”, which is highly theoretical, also has experimental results and makes its code available: SHACL2SPARQL.
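On a purely practical note, SHACL validation is also easy to try from code these days. Below is a minimal sketch using the pySHACL library (this is not the paper’s SHACL2SPARQL system, just the standard validation path; the ex: data and shape are made up).

```python
from pyshacl import validate

# Toy data graph: a Person with no name.
data = """
@prefix ex: <http://example.org/> .
ex:alice a ex:Person .
"""

# Toy shapes graph: every Person must have at least one ex:name.
shapes = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:property [ sh:path ex:name ; sh:minCount 1 ] .
"""

conforms, report_graph, report_text = validate(
    data_graph=data,
    shacl_graph=shapes,
    data_graph_format="turtle",
    shacl_graph_format="turtle",
)
print(conforms)      # False: ex:alice is missing ex:name
print(report_text)
```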

I’m seeing this trend also in the database community. For example, the Graph Query Language (GQL) for Property Graphs standardization process, will be accompanied by a definition of formal semantics, which is being led by theoreticians including Leonid Libkin.

I’m also starting to see this interest in the other direction: researchers who focus more on building systems are being more rigorous with their theory and experiments. For example, the best student research paper was “RDF Explorer: A Visual SPARQL Query Builder” (see rdfexplorer.org). The computer science team partnered with an HCI researcher to run a user study, providing scientific rigor to their work (and ultimately getting nominated for and winning an award). 

Bottom line, it’s my perception that the theoreticians want to make sure that their theory is actually used, and systems builders are focusing more and more on the science and not just the engineering. This is FANTASTIC!

Table to Knowledge Graph Matching 

One of the big topics at the conference was the “Tabular Data to Knowledge Graph Matching” challenge. The challenge consisted of three tasks:

  • CTA: Assigning a class (:Actor) from a Knowledge Graph to a column
  • CEA: Matching a cell to an entity (:HarrisonFord) in the Knowledge Graph 
  • CPA: Assigning a property (:actedIn) from the Knowledge Graph to the relationship between two columns 

The matching was to DBpedia. The summary of the challenge in one slide: 

For example, the team from USC, Tabularisi, at a high level created candidate matches using DBpedia Spotlight together with TF-IDF, and that was sufficient to get decent results. 
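To give a feel for how lightweight such a baseline can be, here is a minimal Python sketch that looks up candidate DBpedia entities for a cell value via the public DBpedia Spotlight service. This is my own illustration, not the team’s actual code; the endpoint URL, parameters and response shape are assumptions based on the public service and may change.

```python
import requests

# Public DBpedia Spotlight annotation endpoint (assumed).
SPOTLIGHT = "https://api.dbpedia-spotlight.org/en/annotate"

def candidates(cell_value: str, confidence: float = 0.5):
    """Return candidate DBpedia entity URIs for a table cell value."""
    resp = requests.get(
        SPOTLIGHT,
        params={"text": cell_value, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    return [r["@URI"] for r in resp.json().get("Resources", [])]

print(candidates("Harrison Ford"))
# e.g. ['http://dbpedia.org/resource/Harrison_Ford', ...]
```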

The winner of the challenge, Mtab, in my opinion, over-engineered their approach for DBpedia, which is how they were able to win the challenge. 

DAGOBAH, from Orange Labs, had two approaches: a baseline that used DBpedia Spotlight, and a more sophisticated approach using embeddings. The embedding approach was slightly better, but more expensive. 

There were other approaches such as CSV2KG and MantisTable:

My takeaway: “less is sufficient.” Seems like we can get sufficient quality by not being too sophisticated. In a way, this is good. 

More notes

Olaf Hartig gave a keynote at the Workshop on Querying and Benchmarking the Web of Data (QuWeDa). His message: 

Albert Meroño presented work on modeling and querying lists in RDF Graphs (Paper, Slides). Really interesting.

I really need to check out VLog, a new rule-based reasoner for Knowledge Graphs (VLog code; a Java library based on the VLog rule engine; Paper).

SHACL and Validation

OSTRICH is an RDF triple store that allows multiple versions of a dataset to be stored and queried at the same time. Code. Slides.

Interesting to see how the Microsoft Academic Knowledge Graph was created in RDF. http://ma-graph.org/

An interesting dataset, FoodKG: A Semantics-Driven Knowledge Graph for Food Recommendation https://foodkg.github.io/index.html

Translating SPARQL to Spark SQL is getting more attention. Clever stuff in the poster: Exploiting Wide Property Tables Empowered by Inverse Properties for Efficient Distributed SPARQL Query Evaluation

There wasn’t much material on machine learning and embeddings, only one session (not surprising, because I guess that type of work gets sent to machine learning conferences). A couple of things I saw (not a complete list):

Need to check out http://ottr.xyz/ 

I missed the GraphQL tutorial

Industry

Even though this is an academic/scientific conference, there were still a fair number of industry attendees. 

Dougal Watt (former IBM NZ Chief Technologist and founder of Meaningful Technology) gave a keynote, where he was preaching Dave McComb’s message of being data centric. I liked how he introduced the phrase “knowledge centric” which is where we should be heading.

Pinterest and Stanford won the best in-use paper award for “Use of OWL and Semantic Web Technologies at Pinterest”.

Bosch presented their use case of combining semantics and NLP. They are creating a search engine for material scientists to find documents. 

Google was present:

Joint work between Springer and KMi was presented

The Amazon Neptune team presented a demo, “Enabling an Enterprise Data Management Ecosystem using Change Data Capture with Amazon Neptune”, and an industry talk, “Transactional Guarantees for SPARQL Query Execution with Amazon Neptune”.

I learned about Ampligraph.org from Accenture, an “Open source library based on TensorFlow that predicts links between concepts in a knowledge graph.”

Great to see Orange Labs participating in the table to knowledge graph matching challenge (more above). 

Always great to connect with Peter Haase from Metaphacts and meet new folks like Jonas Almeida from NCI.

And that’s a wrap

ISWC is always a lot of fun. In addition to all the scientific and technical content, there is also a sense of community. I always enjoy being part of the mentoring lunch:

We had a fantastic gala dinner

And we even got all the hispanics together:

Take a look at other trip reports (I can now read them after I published mine!)

Avijit Thawani: https://medium.com/@avijitthawani/iswc-2019-new-zealand-bd15fe02d3d4

Sven Lieber: https://sven-lieber.org/en/2019/11/05/iswc-2019/

Cogan Shimizu: https://daselab.cs.ksu.edu/blog/iswc-2019

Armin Haller: https://www.linkedin.com/pulse/knowledge-graphs-modelling-took-center-stage-iswc-2019-armin-haller/

With that, see you next year: