It was an incredible honor to be invited to the Dagstuhl Seminar on Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century, organized by Paul Groth, Elena Simperl, Marieke van Erp, and Denny Vrandecic. It was my second visit to Dagstuhl this year!
I was thrilled that we started the seminar talking about the history of knowledge engineering, a topic very dear to my heart.
What is Knowledge Engineering? Bradley Allen, one of the very first knowledge engineers of the expert systems era, defined it in a crisp and succinct manner: the practice of building processes that produce knowledge. He also reminded us of Ed Feigenbaum's definition: the applied side of AI. Brad also touched on why expert systems failed, which can be answered from two different perspectives: 1) they weren't able to commercialize because the 80s was a mainframe era and expert systems ran on their own dedicated machines (Prolog, Lisp); 2) they didn't fail at all, because the techniques simply became common practice. The main issue was a lack of developer buy-in.
Deborah McGuinness also gave one of the historical talks and reminded everyone that access to Knowledge Engineering has always had a very high bar: complicated software, required training, and high cost. The creators of the tools never met with the developers, and the field put too much focus on formalism and tractability without practicality.
The main takeaways I had revolved around two topics: 1) Users and Methodologies, and 2) Language Models.
Users and Methodologies
Elena Simperl kicked off the seminar by challenging us to think about what an upgraded Knowledge Engineering reference book would look like. For example, what would the NeOn book or the CommonKADS book look like today?
Elena also presented the notion of user-centric knowledge engineering with a set of questions to consider:
– Who are the users?
– What are the users’ tasks and goals?
– How does the user interact with the knowledge graph?
– What are the users’ experience levels with the knowledge graph, or with similar environments?
– What functions do the users need?
– What information might the users need, and in what form do they need it?
– How do users think knowledge engineering tools should work?
– Is the user multitasking?
– Are they working on a mobile phone, a desktop computer, etc.?
– Does the interface utilize different input modes, such as touch, speech, gestures or orientation?
– How can we support multi-disciplinary teams? How can we support remote work, decision making, conflict resolution?
What thrills me the most is that knowledge engineering truly bridges the social and technical aspects of computer science. This gets us outside our comfort zone because we need to consider doing case studies, literature reviews, and user studies to:
– understand personas, scenarios, use cases, tasks, emerging practices
– define process blueprints, design patterns, requirements for tool support
In a breakout group with Sören Auer, Eva Blomqvist, Deborah McGuinness, Valentina Presutti, Marta Sabou, Stefan Schlobach, and Steffen Staab, we started out by sharing existing knowledge engineering methodologies, discussing our experiences, and devising what methodologies should look like today.
My position is that knowledge graphs, and the investment in Knowledge Engineering, exist to address the known use cases of today and the unknown use cases of tomorrow. Past methodologies focus just on the knowledge/ontology side and not the data side, namely how to map data to knowledge. Furthermore, I believe there are two distinct scenarios to consider: the enterprise scenario, where the goal is to answer fairly specific questions, and the common-sense scenario, where the goal is exploratory search to answer questions on generic topics (this is the Amazon, Google world). Additionally, understanding the producer and consumer personas is crucial, because at the end of the day, the consumers want data products. Figuring out how to devise methodologies in a distributed and decentralized world is going to be paramount. Schema.org is concrete evidence that it can happen; we just need to figure out how to embrace the complexity. Finally, the data world needs more data modeling training!
The themes that came up through our discussions were data creation, data integration, ontology engineering, systems development, requirements engineering, and the business value case, where each of these segments needs to be synced and evaluated.
We unanimously agreed that there is a need for a synthesis of knowledge engineering methodologies, mapped to today's tools, settings, and requirements, with the goal of defining upgraded methodologies that can serve as general-purpose education. Furthermore, we should start by editing a Knowledge Engineering book, similar to the Description Logic Handbook, that compiles the state of the art so it can be used in knowledge engineering courses today. The ultimate goal is to define a textbook that could be used for a bachelor's course. Ambitious, but we need to think big!
I appreciated a final reminder from Steffen: a methodology is like a cooking recipe; you don't need to follow it exactly as-is.
Language Models
Language Models, such as BERT and GPT-3, have shown impressive results in terms of language generation and question answering. Lise Stork gave an overview of the state of the art on Automated Knowledge Graph Construction, surveying automated methods to construct knowledge graphs from unstructured sources. This immediately raised two questions: what about 1) automatic knowledge graph construction from structured data (e.g., tabular, relational), where given structured data as input, the output is a knowledge graph, and 2) automatic mapping of structured data to an existing knowledge graph, where given structured data and an existing knowledge graph as input, the output is an augmented knowledge graph?
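To make the first stream concrete, here is a minimal sketch of direct construction, turning table rows into triples with rdflib. The table, namespace, and column-to-property mapping are all hypothetical illustrations on my part; real pipelines (e.g., R2RML-based ones) are considerably more sophisticated.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

# Hypothetical input: a tiny "employees" table as a list of rows.
rows = [
    {"id": "e1", "name": "Ada Lovelace", "dept": "engineering"},
    {"id": "e2", "name": "Grace Hopper", "dept": "research"},
]

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

for row in rows:
    subject = EX[f"employee/{row['id']}"]
    g.add((subject, RDF.type, EX.Employee))                      # each row becomes an entity
    g.add((subject, EX.name, Literal(row["name"])))              # each column becomes a property
    g.add((subject, EX.department, EX[f"dept/{row['dept']}"]))   # codes become linked entities

print(g.serialize(format="turtle"))
```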
The second stream of work considers the traditional data integration challenges of schema matching and entity linking. Language models have common-sense knowledge; however, to do the mapping, they would also need specific business/domain knowledge, which may not exist in language models today.
During our breakout session, Sören started to experiment with OpenAI (https://beta.openai.com/playground), and we gathered anecdotal evidence that language models can perform some form of data mapping on typical textbook examples, but they quickly fail when data structures are more enterprise-y (and thus less likely to be included in the language model training data).
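The flavor of those experiments looked roughly like the sketch below. The prompt, model name, and column names are my illustrative assumptions, not a transcript of what we typed; the point is that textbook-style columns map easily, while cryptic enterprise-style names trip the model up.

```python
import openai  # assumes the 2022-era completions API, with an API key in OPENAI_API_KEY

# Hypothetical schema-matching prompt: map source columns to target ontology properties.
prompt = """Map each source column to the best matching target property.
Target properties: schema:name, schema:birthDate, schema:worksFor

Source columns: full_name, dob, employer
Mapping: full_name -> schema:name, dob -> schema:birthDate, employer -> schema:worksFor

Source columns: EMP_NM_TXT, HR_DT_STRT, ORG_CD_LVL3
Mapping:"""

response = openai.Completion.create(
    model="text-davinci-002",  # a Playground model of the time
    prompt=prompt,
    max_tokens=64,
    temperature=0,
)
print(response.choices[0].text.strip())
# Textbook-like names usually map correctly; the cryptic enterprise-style
# names above tend to produce guesses or nonsense.
```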
This is an area ripe for research. The SemTab challenge deals with the second stream of work, but the participants up to now do not use language models (CORRECTION: most do not, but DAGOBAH, the leading system at SemTab 2021, does mix language models with heuristics to get the proper interpretation; see the comment below by Raphael Troncy). I'm very eager to follow the upcoming Table Representation Learning workshop at NeurIPS.
We should be careful not to just jump on the language model bandwagon and start pounding on that hammer to see what sticks. Having said that, the notion of machine behavior is an interesting one, because we should study the behavior of these AI systems and understand why something works. This may be the opportunity to delve into other areas of research design, such as single-subject design, which I have recently become interested in.
Additional Observations
Manually authored knowledge from subject matter experts is precious. Therefore, we need to define automatic generation of knowledge graphs at scale. However, human curation of automatically generated knowledge is needed for trust. Thus, people provide precious knowledge and trust, while machines provide scale.
Knowledge Engineering, “it is mostly, or even all about the processes and ecosystems” – Deborah McGuinness
Knowledge acquisition and maintenance are expensive! You need to herd cats across all departments. Initial developer buy-in can be hard to achieve, leading to less-than-enthusiastic support.
The FAANG knowledge problems focus on general and common-sense knowledge and are interesting and challenging. On the other hand, enterprise problems are much more specific. The issue is that academia is mostly exposed to general/common-sense issues through efforts such as DBpedia and Wikidata, and does not have access to concrete enterprise problems. How can we get academia exposed to enterprise problems? By releasing sample schemas, data, queries, etc.?
The topic of bias was presented by Harald Sack. The following types of biases were discussed: 1) data bias, arising from how the data for the knowledge graph is collected, or simply from what data is available; 2) schema bias, arising from the chosen ontology, or simply embedded in ontologies; and 3) inferential bias, the result of drawing inferences. Furthermore, biases in knowledge graph embeddings may also arise from the embedding method itself. I heard from the bias breakout group that their takeaway is that they don't know what they don't know. There is definitely a need for more socio-technical bridging and for working with people who know what they are talking about.
Additional Questions
Bradley posed the question: how can we convince others that knowledge engineering is mainstream software engineering? What is the narrative to convince other communities why they should care? It's all about methodologies, and it should be tied to current processes. For example, we already define Product Requirements Documents (PRDs) in software, which contain knowledge and requirements about the software. We should take this as an inspiration.
Knowledge Engineering can be very expensive. How can we reduce this cost?
What kind of Knowledge Engineering methodologies and processes (in addition to tooling and training) are needed?
The seminal Knowledge Engineering paper by Studer et al. extended Gruber's definition of ontology to “a formal, explicit specification of a shared conceptualisation.” I ask myself: what does “shared” mean today?
How do we let humans efficiently check a large amount of data before a product launch? This is where metadata plays a key role. How good is the data, and what does “good” mean? Do we know where the data comes from? Do we know how to audit our data to make it less biased? Do we know how the data came about? Do we know how the data is used? We can write rules that discover inconsistencies and incompleteness and suggest anomalies. But how would we classify feedback from end users? How is the feedback channeled? These are questions that are being addressed by the data catalog market, so academia can and should learn from the state of the art, be critical, see what's missing, and devise opportunities. For example, is there, or shall I say, where is the bias in a metadata knowledge graph? If metadata is being reported from only a subset of systems, then that could be reporting bias. If recommendations are made, they may be biased because the catalog lacks metadata from other systems. What level of metadata granularity should be captured, and what type of bias would that have?
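As a flavor of what such rules might look like, here is a minimal sketch of a completeness check over a metadata knowledge graph, assuming a hypothetical ex: vocabulary and a catalog.ttl file of my own invention; production data catalogs implement far richer rule engines and feedback loops.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")  # hypothetical metadata vocabulary

g = Graph()
g.parse("catalog.ttl")  # hypothetical metadata knowledge graph

# Rule: every dataset must declare a provenance source and an owner.
# Violations are surfaced as anomalies for human review before launch.
for dataset in g.subjects(RDF.type, EX.Dataset):
    if (dataset, EX.source, None) not in g:
        print(f"ANOMALY: {dataset} has no recorded provenance source")
    if (dataset, EX.owner, None) not in g:
        print(f"ANOMALY: {dataset} has no owner to channel feedback to")
```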
How can we be overly inclusive with knowledge in order to get more folks “on our side”? Ontologies can be defined and stored in multiple forms. Even spreadsheets; that's inclusive!
What can knowledge graphs capture? What can't they capture? How do they represent what is changing in the world versus what is static? It seems like we're going back to the traditional Knowledge Representation and Reasoning discussions of finding the balance of expressive languages (i.e., description logic!), but dynamicity (our fast-paced world) is the phenomenon of today.
Final Thought
The title of this seminar was “Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century” and, surprisingly, there was little emphasis on graphs. This is a good thing because the focus was KNOWLEDGE! Paul Groth suggested that we go back to the terms Knowledge Base and Knowledge-Based Systems.
In our Communications of the ACM Knowledge Graph article, Claudio Gutierrez and I wrote:
If we were to summarize in one paragraph the essence of the developments of the half century we have presented, it would be the following: Data was traditionally considered a commodity, moreover, a material commodity—something given, with no semantics per se, tied to formats, bits, matter. Knowledge traditionally was conceived as the paradigmatic “immaterial” object, living only in people’s minds and language. We have tried to show that since the second half of the 20th century, the destinies of data and knowledge became bound together by computing.
(Claudio and I stayed at Dagstuhl a few years ago to start writing that paper)
Today, our quest is to combine knowledge and data at scale. I would argue that the Semantic Web community has focused on integrating Data in the form of a graph, namely RDF, and Knowledge in the form of ontologies, namely OWL. However, the future of Data and Knowledge should consider all types of data and knowledge: tables, graphs, documents, text, ontologies, rules, embeddings, language models, etc. We are definitely heading into uncharted territory. And that is why it’s called science!
Thanks again to all the organizers for bringing us together and special thanks to Dagstuhl for being the magical place where amazing things happen! Can’t wait to be invited to come back.
Comment from Raphael Troncy: Thanks Juan for this trip report!
One correction:
« This is an area ripe for research. The SemTab challenge deals with the second stream of work but the participants up to now do not use language models. »
This is inaccurate in the case of DAGOBAH, the leading system at SemTab 2021 (and so far doing well in 2022), where we do mix language models with heuristics to get the proper interpretation. See also our RadarStation paper, to be presented at ISWC 2022!
Reply from Juan: Apologies for the inaccuracy.