2019 STI2 Semantic Summit Trip Report

Earlier this month, I attended the 2019 STI2 Semantic Summit. The goal of this meeting was to

  1. discuss medium- and long-term research perspectives of semantic technologies and
  2. define areas of research that we recommend deserve more attention.

This event was attended by 21 researchers: 1 Korean, 3 Americans, and the rest from Europe. This is an example of why Europe is so dominant in this area.

The meeting followed the open space technology method, which is a pretty cool approach to run meetings. I really like the law of two feet: “If at any time during our time together you find yourself in any situation where you are neither learning nor contributing, use your two feet, go someplace else.”

In my view, three main topics came out of this meeting that deserve much more attention from the research community: Knowledge Science, Decentralization, and Human Centric Semantics.

Knowledge Science

This is a topic that I pushed in the meeting. For a while, I’ve been thinking about the relationships between a few areas: 

  1. How do we teach people to create knowledge graphs efficiently?
  2. The 80% of their time that data scientists spend cleaning data.
  3. The Knowledge Engineering era of the 1980s and 1990s.

I believe that that 80% of the cleaning work is much more complicated than it seems and deserves its own role, which should build upon the extensive Knowledge Engineering work developed over 25 years; all of this is needed to create knowledge graphs.
During the summer at SIGMOD, I had great conversations with Paul Groth and George Fletcher on this topic. We decided that it would be valuable to put our thoughts down on paper. Before arriving at the Semantic Summit, I met with Paul and George in Amsterdam to brainstorm and write a manuscript in which we introduce the role of a Knowledge Scientist, which is basically the Knowledge Engineer 2.0. The following is an excerpt of what George, Paul and I wrote:


In typical organizations, the knowledge work to create reliable data is ad hoc, and the results and practices are not shared. Furthermore, in data science teams, a data scientist or data engineer might do this knowledge work, but is not equipped, trained or incentivized to do so. Indeed, from our experience, the knowledge work (e.g. 8-hour conference calls, discussions, documentation, long Slack chats, Confluence spelunking) required to create reliable data is often not valued by managers or by employees themselves. The tasks and functions of creating reliable data are never fully articulated, and thus responsibility is diffuse or non-existent. Who should be responsible?

The Knowledge Scientist is responsible. 

The Knowledge Scientist is the person who builds bridges between business requirements/questions/needs and data. Their goal is to document knowledge by gathering information from Business Users, Data Scientists, Data Engineers and their environment, in order to make reliable data that can then be used effectively in a data-driven organization.


It was not a coincidence that this was a main topic; making it one was my agenda.

At the Semantic Summit, in one of the breakout groups, we discussed our war stories on creating knowledge graphs. Mark Musen talked about the Protege team's experience working with Pinterest to help them build a taxonomy/ontology using Protege. Pinterest originally started by outsourcing this to a team in Asia and building it in a Google Sheet. Very quickly it got out of control. The big questions for them were 1) how do we know we are done, and 2) how do we know that what we are doing is correct.

Umut Simsek discussed how they are crawling tourism data on the web that is described with schema.org. The advantage is that they focus on a specific schema, but that doesn't mean that everybody who describes their data with schema.org does so with the same semantics in mind. Furthermore, people using schema.org get lost because it's so big; they don't know where to start.

Valentina Presutti discussed their use cases of creating knowledge graphs from cultural open data. They are mandated to release open data and they want their data to be used, but in the end, there is no specific task defined. I also shared my enterprise knowledge graph war stories.

All of these problems are very similar to the ones encountered by Knowledge Engineers in the 1980s (e.g. the knowledge acquisition bottleneck). We asked ourselves: how is this different today? We discussed two main differences:

  1. Before, data was not at center stage. Today, we have big data.
  2. Before, the tasks were well defined. Today, they seem to be more ambiguous, or at least they start out being ambiguous (e.g. search has changed our lives).

When I arrived at the event, I was very intrigued to see what the folks in the room would think about my proposal to push this notion of Knowledge Science as an important topic. Would they say that I'm just proposing a reinvention of the wheel? Or is there something novel and challenging here? I think the answer is somewhere in the middle, and the idea gained enthusiasm during the event. It was very cool to have these discussions with the old-timers in the community, such as Mark Musen, Rudi Studer and Dieter Fensel, and have them agree with me (apparently I can put on my CV now that Dieter Fensel agreed with me).

Decentralization 

This is a topic that has continued to come up over the past couple of years but still hasn't gained the traction that it deserves (in my opinion). I think that's because it is a very hard problem and it is not clear where to start. Ruben Verborgh describes this problem best as querying large amounts of small data.

Today we are comfortable with querying small amounts of big data, whether by dumping things into a data lake or doing federation. However, in a decentralized approach, by definition you can't dump things into a lake. Standard federation techniques do not apply because you need to know all the sources beforehand. Think about building queries without knowing where all the data is. This is a problem at a different scale.
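
To make that assumption concrete, here is a minimal sketch of classic federation over SPARQL endpoints in Python using SPARQLWrapper. The endpoint URLs are real public services and the query is a trivial placeholder; the point is that the hand-picked ENDPOINTS list is exactly the up-front knowledge of sources that a decentralized web of millions of small data pods no longer gives you.

```python
# A sketch of standard federation: run the same query against a fixed,
# hand-picked list of endpoints and merge the bindings. In a decentralized
# setting, this explicit list of sources is precisely what we don't have.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINTS = [
    "https://query.wikidata.org/sparql",
    "https://dbpedia.org/sparql",
]

# Trivial placeholder query; any SELECT query would illustrate the same point.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 3"

def federated_query(endpoints, query):
    """Query every known endpoint and merge the result bindings."""
    merged = []
    for url in endpoints:
        client = SPARQLWrapper(url, agent="federation-sketch/0.1")
        client.setQuery(query)
        client.setReturnFormat(JSON)
        results = client.query().convert()
        merged.extend(results["results"]["bindings"])
    return merged

if __name__ == "__main__":
    for row in federated_query(ENDPOINTS, QUERY):
        print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```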

In 2017, there was a Dagstuhl seminar on Federated Semantic Data Management. That report is a gold mine of problems to be addressed. It just doesn't seem to be a priority in the community, when I personally think it should be! Want to do a PhD? Read that Dagstuhl report!

Human Centric Semantics 

This was probably the coolest topic that came up. It was very philosophical and hard to grasp at the beginning (and still is!). In one group, we were chatting with Aldo Gangemi and he brought up the point that AI and semantics should not forget about human needs. Another group was discussing common sense reasoning (e.g. ConceptNet, Cyc, etc.). Then, I believe, a Machine Learning breakout group was discussing how ML systems should be culturally aware (more and more countries are becoming multicultural and multilingual). When we got together as a group, all these ideas morphed into a really interesting topic: AI systems can push our social skills by being able to understand relationships between cultures. With this understanding, they can take action, provide recommendations, etc., in ways that show culturally aware behavior.

The question is: can common sense knowledge be culturally biased? 

Do we need to distinguish common sense knowledge that can be characterized as universal (e.g. the laws of physics) from culturally biased common sense (e.g. social norms)? Or is it an oxymoron to ask this question?

We observe that acquiring common sense knowledge (e.g. via crowdsourcing) will reflect cultural stereotypes and biases, and we should take that into account. This type of knowledge is an essential building block for the explainability of AI systems.
The opportunity is to investigate what it means to combine cultural sensitivity with common sense knowledge in order to come up with innovative approaches to reasoning, which can be used for the explainability of AI systems, among other things.

By coincidence, the post Understanding the geopolitics of tech ecosystems by Yann Lechelle, CTO of SNIPS (considered one of the best AI startups in France), was published during our event and was brought to our attention by Raphael Troncy. An interesting quote:

“In addition, there is an opportunity to create an AI and technologies that honor our European values, which are the Enlightenment values enhanced by post-war idealism. We must therefore seize these tools to resume the race, another race. Trading performance for privacy. Trading quantity for quality. Trading the economy of attention for the valorization of concentration.”

Additional thoughts

  • Knowledge and data are evolving rapidly. Therefore we need clear mechanisms to keep track of that evolution (i.e. provenance) and use them to reason with the data (i.e. provide context). We should be able to define WHAT is being done to the data in order to explain its evolution. We should also define WHY those changes have been made to the data. We have mechanisms in place to describe the WHAT. However, it is not clear how we describe the WHY and how we can reason with it. This is also needed for explainability.
  • Human vs. machine in the loop: Today we hear a lot about Human in the Loop, where the machine is in charge and then the human comes in. But what if we flip that around? What if the human is in charge and we let the machine come in? Why? Not everything can be automated from start to finish, or at least it shouldn't be (think about bias in data, etc.). Let's give control back to the human.
  • Machine learning was an obvious topic. Machine Learning systems are black boxes. How do we explain what is going on inside these black boxes? Explainable AI is already a hot topic that is far from being solved. Turning the knobs and figuring out which parameter made the most impact is not sufficient for explainability. Semantics plays a key role in explaining not just the WHAT but also the WHY.
  • We have spam. We have fake news. What about fake knowledge? How do we combat and manage false knowledge? I recently found this article which resonates with this topic.
  • I got a demo from Pedro Szekely of their latest work on mapping tabular data to Wikidata. They have defined a mapping language and system, T2WML, where you can define which parts of an Excel file should be mapped to entities and properties in Wikidata. They have defined a YAML syntax for this mapping language. The goal is to augment data with new data, adding new features (i.e. columns of data) to train models. Traditionally in ontology-based data integration, an ontology is used as the hub for integration. In this case, however, they are using Wikidata, both its schema and its data, as the hub of integration. A rough sketch of the mapping idea is shown below, after this list.
  • I learned about PyTorch-BigGraph (PBG) from Facebook: “open-sourcing PyTorch-BigGraph (PBG), a tool that makes it much faster and easier to produce graph embeddings for extremely large graphs … the first published embeddings of the full Wikidata graph of 50 million Wikipedia concepts.” Pedro has used this as-is; for their matching tasks, they immediately got 80% F-measure! Think about it: there is an existing model that can be reused (no need to create training data, etc.) as-is, off the shelf, and it gives them fairly decent accuracy. A sketch of how such pre-trained embeddings could be reused follows after this list.
  • It was great to finally spend time chatting with Mark Musen and learn more about the Center for Expanded Data Annotation and Retrieval (CEDAR), which is a metadata management system for scientists.
  • Raphael and Pedro are both participating in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. I am keeping a close eye on this challenge because it's tackling problems I'm seeing in the real world.
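
On the T2WML bullet above: here is a rough, hypothetical sketch of the mapping idea, not T2WML's actual YAML syntax or API. It just illustrates declaring which spreadsheet columns identify Wikidata items and which columns become statements under a fixed Wikidata property; the column names and file name are made up, while P1082 (population) and P571 (inception) are real Wikidata properties.

```python
# Hypothetical illustration of mapping spreadsheet cells to Wikidata statements
# (NOT T2WML's actual syntax): one column holds the item identifier, the other
# columns are emitted as (item, property, value) statements.
import pandas as pd

MAPPING = {
    "item_column": "country_qid",       # made-up column holding Wikidata Q-ids
    "property_columns": {
        "population": "P1082",          # Wikidata property: population
        "inception_year": "P571",       # Wikidata property: inception
    },
}

def spreadsheet_to_statements(path, mapping):
    """Turn each row of a spreadsheet into Wikidata-style statements."""
    df = pd.read_csv(path)  # pd.read_excel(path) for .xlsx files
    statements = []
    for _, row in df.iterrows():
        item = row[mapping["item_column"]]
        for column, prop in mapping["property_columns"].items():
            statements.append((item, prop, row[column]))
    return statements

# Usage with a hypothetical file:
# statements = spreadsheet_to_statements("countries.csv", MAPPING)
```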
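
And on the PyTorch-BigGraph bullet: a minimal sketch of reusing pre-trained graph embeddings off the shelf. It assumes the published embeddings come as a TSV with one line per entity (identifier first, vector components after); the file name and the two entity identifiers are placeholders, and the similarity computation is plain cosine similarity rather than any PBG-specific API.

```python
# Sketch: stream a huge embedding TSV, keep only the entities of interest,
# and compare them with cosine similarity. File name and identifiers are
# placeholders; the real Wikidata release uses full entity URIs.
import numpy as np

def load_vectors(path, wanted):
    """Stream the TSV and keep only the entities we care about (the file is huge)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if parts[0] in wanted:
                vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
            if len(vectors) == len(wanted):
                break
    return vectors

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

wanted = {"<http://www.wikidata.org/entity/Q42>", "<http://www.wikidata.org/entity/Q5>"}
vecs = load_vectors("wikidata_embeddings.tsv", wanted)
if len(vecs) == 2:
    a, b = vecs.values()
    print("cosine similarity:", cosine(a, b))
```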

Thanks to John Domingue and Andreas Harth for organizing this event. It was great to hang out with a group of really smart people to discuss the future of knowledge graphs and semantic technology.

A final report, the Chania Declaration on Knowledge Graph Research, will be published that presents recommendations on research areas that need significantly more attention. I'm happy to say that my voice was heard and made an impact on this agenda.