A Pay-as-you-go Methodology to Design and Build Knowledge Graphs

At the 18th International Semantic Web Conference I will be presenting our in-use paper:

A Pay-as-you-go Methodology to Design and Build Enterprise Knowledge Graphs from Relational Databases

Business users must answer business questions quickly to address Business Intelligence (BI) needs. The bottleneck is understanding the complex database schemas; only a few people in the IT department truly understand them. A holy grail is to empower business users to ask and answer their own questions with minimal IT support. Semantic technologies, now dubbed Knowledge Graphs, become useful here. Even though the research and industry communities have provided evidence that semantic technologies work in the real world, our experience is that there continues to be a major challenge: the engineering of ontologies and mappings covering enterprise databases containing thousands of tables with tens of thousands of attributes. In this paper, we present a novel and unique pay-as-you-go methodology that addresses the aforementioned difficulties. We provide a case study with a large-scale e-commerce company where Capsenta’s Ultrawrap has been deployed in production for over 3 years.

This is joint work with Will Briggs, Daniel Miranker and Wayne Heideman. The paper documents our experience and lessons learned, while at Capsenta, designing and building enterprise knowledge graphs from disparate, heterogeneous and complex relational databases.

The Problem: how do we get non-semantic-aware folks to design and build ontologies, and subsequently create mappings from the complex schemas of enterprise databases (1000s of tables and 10000s of attributes) to those ontologies?

Our answer: a methodology that combines ontologies and mappings, is iterative, and focuses on answering business questions to avoid boiling the ocean.

It is great to see how we are applying this methodology with our customers at data.world. 

Interested in learning more? Please read the paper! Still have questions? Reach out to me!

Finally, it is an amazing honor that this paper has been nominated for best paper. We pride ourselves on striving for excellence.

2019 STI2 Semantic Summit Trip Report

Earlier this month, I attended the 2019 STI2 Semantic Summit. The goal of this meeting was to

  1. discuss medium- and long-term research perspectives of semantic technologies and
  2. define areas of research that we recommend deserve more attention.

This event was attended by 21 researchers: 1 Korean, 3 Americans and the rest from Europe. This is an example of why Europe is so dominant in this area.

The meeting followed the open space technology method, which is a pretty cool approach to run meetings. I really like the law of two feet: “If at any time during our time together you find yourself in any situation where you are neither learning nor contributing, use your two feet, go someplace else.”

In my view, there were three main topics that came out of this meeting that deserve much more attention from the research community: Knowledge Science, Decentralization and Human Centric Semantics.

Knowledge Science

This is a topic that I pushed in the meeting. For a while, I’ve been thinking about the relationships between a few areas: 

  1. How do we teach people to create knowledge graphs efficiently?
  2. The 80% of their time that data scientists reportedly spend on cleaning data.
  3. The Knowledge Engineering era of the 80s and 90s.

I believe that this 80% of cleaning work is much more complicated than it appears and deserves its own role, one that should build upon the extensive Knowledge Engineering work developed over 25 years; all of this is needed to create knowledge graphs.
During the summer at SIGMOD I had great conversations with Paul Groth and George Fletcher on this topic. We decided that it would be valuable to put our thoughts down on paper. Before arriving at the Semantic Summit, I met with Paul and George in Amsterdam to brainstorm and write a manuscript where we introduce the role of a Knowledge Scientist, which is basically the Knowledge Engineer 2.0. The following is an excerpt of what George, Paul and I wrote:


In typical organizations, the knowledge work to create reliable data is ad hoc and the results and practices are not shared. Furthermore, in data science teams, a data scientist or data engineer might do this knowledge work, but is not equipped, trained or incentivized to do so. Indeed, from our experience, the knowledge work (e.g. 8-hour conference calls, discussions, documentation, long Slack chats, Confluence spelunking) required to create reliable data is often not valued by managers or by employees themselves. The tasks and functions of creating reliable data are never fully articulated and thus responsibility is diffuse or non-existent. Who should be responsible?

The Knowledge Scientist is responsible. 

The Knowledge Scientist is the person who builds bridges between business requirements/questions/needs and data. Their goal is to document knowledge by gathering information from Business Users, Data Scientists, Data Engineers and their environment with the goal of making reliable data that can then be used effectively in a data driven organization.


It is not a coincidence that this was a main topic; it was my agenda to make it one.

At the Semantic Summit, in one of the breakout groups we discussed our war stories on creating knowledge graphs. Mark Musen talked about the Protégé team’s experience working with Pinterest to help them build a taxonomy/ontology using Protégé. Pinterest originally started by outsourcing this to a team in Asia and building it in Google Sheets. Very quickly it got out of control. The big questions for them were 1) how do we know we are done and 2) how do we know what we are doing is correct.

Umut Simsek discussed how they are crawling tourism data on the web that is described with schema.org. The advantage is that they focused on a specific schema, but that doesn’t mean that everybody who describes their data with schema.org does so with the same semantics in mind. Furthermore, people using schema.org get lost because it’s so big; they don’t know where to start.

Valentina Presutti discussed their use cases of creating knowledge graphs from cultural open data. They are mandated to release open data and they want their data to be used, but in the end, there is no specific task that is defined. I also shared my own enterprise knowledge graph war stories.

All of these problems are very similar to the ones encountered by Knowledge Engineers in the 1980s (i.e. knowledge acquisition bottleneck, etc). We asked ourselves: How is this different today? We discussed two main differences: 

  1. Before, data was not at the center stage. Today, we have big data. 
  2. Before, the tasks were well defined. Today, they seem to be more ambiguous, or at least they start out being ambiguous (i.e. search has changed our lives). 

When I arrived at the event, I was very intrigued to see what the folks in the room would think about my proposal to push this notion of Knowledge Science as an important topic. Would they say that I’m just proposing a reinvention of the wheel? Or is there something novel and challenging here? I think the answer is somewhere in the middle, and the idea gained enthusiasm during the event. It was very cool to have these discussions with the old-timers in the community such as Mark Musen, Rudi Studer and Dieter Fensel and have them agree with me (apparently I can put on my CV now that Dieter Fensel agreed with me).

Decentralization 

This is a topic that has continued to come up over the past couple of years but still hasn’t gained the traction that it deserves (in my opinion). I think it’s because it is a very hard problem and it is not clear where to start. Ruben Verborgh best describes this problem as: querying large amounts of small data.

Today we are comfortable with querying small amounts of big data, whether by dumping everything into a data lake or by doing federation. However, in a decentralized approach, by definition you can’t dump things into a lake. Standard federation techniques do not apply because you need to have knowledge beforehand of all the sources. Think about building queries without knowing where all the data is. This is a different scale of the problem.
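To make the contrast concrete, here is a minimal sketch in SPARQL 1.1 federated query syntax (the endpoint URLs and vocabulary are hypothetical). Standard federation forces you to enumerate every source up front in SERVICE clauses, which is exactly what a decentralized setting of thousands of small personal data sources cannot give you:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?interest WHERE {
  # Every source must be known and named in advance:
  SERVICE <https://example.org/people/sparql>    { ?person foaf:name ?name . }
  SERVICE <https://example.org/interests/sparql> { ?person foaf:interest ?interest . }
}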

In 2017, there was a Dagstuhl seminar on Federated Semantic Data Management. That report is a gold mine of problems to be addressed. It just doesn’t seem to be a priority in the community, when I personally think it should be! Want to do a PhD? Read that Dagstuhl report!

Human Centric Semantics 

This was probably the coolest topic that came up. It was very philosophical and hard to grasp at the beginning (and still is!). In one group, we were chatting with Aldo Gangemi and he brought up the need for AI and semantics not to forget about human needs. Another group was discussing common sense reasoning (e.g. ConceptNet, Cyc, etc). Then, I believe, a Machine Learning breakout group was discussing how ML systems should be culturally aware (more and more countries are becoming multicultural and multilingual). When we got together as a group, all these ideas morphed into a really interesting topic. AI systems can push our social skills by being able to understand relationships between cultures. With this understanding, they can take action, provide recommendations, etc., exhibiting culturally-aware behavior.

The question is: can common sense knowledge be culturally biased? 

Do we need to distinguish between the common sense knowledge which can be characterized as universal (i.e. laws of physics) from culturally biased common sense (e.g. social norms)? Or is it an oxymoron to ask this question? 

We observe that acquiring common sense knowledge (e.g. via crowdsourcing) will reflect cultural stereotypes and biases, and we should take that into account. This type of knowledge is an essential building block for the explainability of AI systems.
The opportunity is to investigate what it means to combine cultural sensitivity with common sense knowledge in order to come up with innovative approaches of reasoning which can be used for explainability of AI systems, among others.

By coincidence, the post Understanding the geopolitics of tech ecosystems by Yann Lechelle, CTO of SNIPS (considered one of the best AI startups in France), was published during our event and brought to our attention by Raphael Troncy. An interesting quote:

“In addition, there is an opportunity to create an AI and technologies that honor our European values, which are the Enlightenment values enhanced by post-war idealism. We must therefore seize these tools to resume the race, another race. Trading performance for privacy. Trading quantity for quality. Trading the economy of attention for the valorization of concentration.”

Additional thoughts

  • Knowledge and data evolve rapidly. Therefore we need clear mechanisms to keep track of that evolution (i.e. provenance) and use it to reason with the data (i.e. provide context). We should be able to describe WHAT has been done to the data in order to explain its evolution, and also WHY those changes were made. We have mechanisms in place to describe the WHAT; however, it is not clear how we describe the WHY and how we can reason with it. This is also needed for explainability. (See the PROV-O sketch after this list.)
  • Human vs machine in the loop: Today we are hearing a lot about Human in the Loop, where the machine is in charge and then the human comes in. But what if we flip that around: the human is in charge and we let the machine come in? Why? Not everything can be automated from start to finish, or at least it shouldn’t be (think about bias in data, etc). Let’s give control back to the human.
  • Machine learning was an obvious topic. Machine Learning systems are black boxes. How do we explain what is going on in these black boxes? Explainable AI is already a hot topic that is far from being solved. Turning the knobs and figuring out which parameter made the most impact is not sufficient for explainability. Semantics plays a key role in explaining not just the WHAT but also the WHY.
  • We have spam. We have fake news. What about fake knowledge? How do we combat and manage false knowledge? I recently found this article which resonates with this topic.
  • I got a demo from Pedro Szekely of their latest work on mapping tabular data to Wikidata. They have defined a mapping language and system, T2WML, where you can define which parts of an Excel file should be mapped to entities and properties in Wikidata. They have defined a YAML syntax for this mapping language. The goal is to augment data with new data, adding new features (i.e. columns of data) to train models. Traditionally in ontology-based data integration, an ontology is used as the hub for integration. In this case, however, they are using Wikidata (both its schema and its data) as the hub of integration.
  • I learned about PyTorch-BigGraph from Facebook: “open-sourcing PyTorch-BigGraph (PBG), a tool that makes it much faster and easier to produce graph embeddings for extremely large graphs … the first published embeddings of the full Wikidata graph of 50 million Wikipedia concepts.” Pedro has used this as-is; for their matching tasks they immediately got 80% F-measure! Think about it: there is an existing model that can be reused (no need to create training data, etc.) as-is, off the shelf, which gives them fairly decent accuracy.
  • It was great to finally spend time chatting with Mark Musen and learn more about the Center for Expanded Data Annotation and Retrieval (CEDAR), a metadata management system for scientists.
  • Raphael and Pedro are both participating in the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. I am keeping a close eye on this challenge because it’s tackling problems I’m seeing in the real world.
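On the provenance point above (the WHAT vs the WHY), here is a minimal PROV-O sketch of what we can already say today, with hypothetical IRIs. The WHAT is straightforward to record; the WHY has no obvious standard home:

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# The WHAT: version 2 of a dataset was derived from version 1 by a cleaning activity.
ex:dataset-v2  a prov:Entity ;
    prov:wasDerivedFrom  ex:dataset-v1 ;
    prov:wasGeneratedBy  ex:cleaning-run-42 .

ex:cleaning-run-42  a prov:Activity ;
    prov:used               ex:dataset-v1 ;
    prov:wasAssociatedWith  ex:alice .

# The WHY (e.g. “duplicates removed to comply with a new retention policy”)
# has no dedicated slot; a free-text comment or an ad-hoc property is the best we can do today.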

Thanks to John Domingue and Andreas Harth for organizing this event. It was great to hang out with a group of really smart people to discuss the future of knowledge graphs and semantic technology.

A final report, the Chania Declaration on Knowledge Graph Research, will be published, presenting recommendations of research areas that significantly need more attention. I’m happy to say that my voice was heard and will help shape this agenda.

Thank you Capsenta! Hello data.world!

Together with my PhD advisor, Prof Daniel Miranker at the University of Texas at Austin, we founded Capsenta in 2014 for the following reasons:

1) We truly believe that there is a commercial opportunity in helping companies connect their relational databases with semantic web technologies.
2) We are both passionate about commercializing research via startups.

Over the years we have been using Ultrawrap, and recently Gra.fo, to address data integration and business intelligence problems using semantic web technologies. Business users do not understand their complex data sources. IT struggles to understand the thousands of tables, millions of attributes and how the data all works together. We deliver a beautiful view of these myriad, complex relational data sources by designing an ontology (i.e., knowledge graph schema), mapping it to the complex data sources via a pay-as-you-go methodology and then using the mappings to integrate the data in a virtual (NoETL) or materialized (ETL) way. Our ultimate goal is to take complex data and turn it into beautiful data.

We are now experiencing more interest and uptake from the industry. Knowledge Graphs are the new cool kid on the block. Graph databases are hot. The Semantic Web community continues to constantly provide evidence that semantic technology works in the real world (just see all the papers in the in-use and industry tracks at ISWC and ESWC, the recent Knowledge Graph Conference, etc.). The industry is really starting to care!

There is one company who really, really cares: data.world.

And I am extremely excited to share that

Capsenta has been acquired by data.world!

Who is data.world?

data.world is a data platform where anybody can add their data, integrate it, share it, query it and much more. Data on the data.world platform becomes part of a web of linked data. The coolest thing is that it runs 100% on semantic web technology. Literally! They have made use of many research results from the semantic web community. For example, every single dataset is stored as RDF HDT. They make use of Apache Jena. You can query all the data in SPARQL, and even federate queries across different datasets. When you import relational or CSV data, they use the RDB2RDF and CSV2RDF direct mappings. They have even created their own SQL-to-SPARQL translator, thus enabling tabular data to be queried in SQL in addition to SPARQL. All changes are tracked and the provenance is represented in PROV-O and queryable. Heck, they even support SHACL! data.world is a true semantic web platform.
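As a small illustration of what the RDB2RDF Direct Mapping does (the table, row and base IRI below are hypothetical; data.world’s actual identifiers will differ), a relational row is turned into triples whose IRIs are derived from the table name, primary key and column names:

@base <http://data.example.com/> .

# A row of table PEOPLE with primary key ID = 7 and column NAME = "Alice" becomes:
<PEOPLE/ID=7>  a  <PEOPLE> ;
    <PEOPLE#ID>    7 ;
    <PEOPLE#NAME>  "Alice" .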

data.world started out in 2016 by creating a community of open data, which has been called a kind of “GitHub for data”. Now, data.world is the world’s largest collaborative data community and that community has come together to upload and curate hundreds of thousands of data sets.

data.world is also a Public Benefit Corporation with the following ambitious mission:

– Build the most meaningful, collaborative and abundant data resource in the world in order to maximize data’s societal problem-solving utility.
– Advocate publicly for improving the adoption, usability, and proliferation of open data and linked data. (YES, you read that correctly! Their mission is to improve the adoption of linked data!!!!)
– Serve as an accessible historical repository of the world’s data.

It’s now time to start the next phase of taking data.world to the enterprise. This is where Capsenta comes in.

Why am I excited?

There are two main reasons why I am excited:

Perfect Technology Match: We both breathe and eat semantic web. Ultrawrap is a component that will help data.world create a hybrid data platform. We have enterprise customers who want to keep their data in place and not move it to the cloud. This is where Ultrawrap NoETL plays a crucial role. Furthermore, we both acknowledge that we need to make semantic web technology easy to use. data.world’s consumer-grade UI is a valuable differentiator. At Capsenta we created Gra.fo because there wasn’t an easy-to-use ontology/knowledge graph schema editor for business users.

Perfect Mission/Vision Match: We are both heading towards the same goal. The way data is managed within enterprises is ugly and complicated. We have to address this problem from a holistic point of view. At Capsenta, our goal is to change the way the world models, governs and integrates data by generating beautiful data that business users can consume to start solving their business problems. We want to democratize data, or, as data.world states it, humanize data. It’s clear to us that data integration is not just about the technology but also about the people. We need to empower the different stakeholders to be part of the conversation. That is why data.world is all about collaboration. Capsenta’s Gra.fo allows users to share their documents and have conversations via comments.

Oh, and we are both in Austin! How cool is that!

How did we get here?

When I transferred to UT Austin to finish my undergrad in Computer Science in 2006, by serendipity, I met Prof. Daniel Miranker. He was also intrigued by the Semantic Web. Our research started with a very basic question: what is the relationship between relational databases and the semantic web? It was clear to us that if the semantic web were to be successful, it had to incorporate relational databases, because that is where the majority of data is located. After I finished my undergrad, I wanted to continue this same line of research and keep working with Dan. One of the main reasons I wanted to do a PhD was the potential to start a company from our research. If semantic web technologies were to take off, then we would be seeing a lot of companies wanting to integrate their relational databases with the semantic web… and we would have the solution! Capsenta was founded to commercialize my PhD research.

With this fantastic technical basis in hand, Wayne Heideman joined the journey as CEO to guide the commercialization of these ideas, the technology and its productization. Since then we have demonstrated that our technology works to integrate data within very large enterprises in industries such as healthcare, e-commerce, oil and gas, and pharma, and have enjoyed commercial success with millions of dollars of customer revenue.

Personally, I have learned A LOT about how to work with data in large enterprise settings (our smallest customer is a billion dollar revenue company), from both the technical and social aspects. It is very satisfying to see our research being used in the real world to solve challenging data integration problems… and that we get paid to do it.

In order to scale Capsenta’s business, we needed more fuel. Given the alignment that we have with data.world, it makes complete sense to join forces.

What’s next?

The entire Capsenta team has joined data.world! I am now Principal Scientist at data.world. I continue to wear my scientific hat and collaborate with many research partners, attending and presenting at conferences, participating in program committees and editorial boards, supervising students and more. I also wear a business hat where I support engineering, technical sales and work with customers to understand their problems and tie them back to R&D.

Capsenta and data.world had already been working together for over a year as partners and Ultrawrap NoETL was already integrated as the virtualization mechanism for data.world before the acquisition. It will be very fun to further integrate Capsenta’s technology within data.world. We also plan to continue to support all of our customers and continue development and support for Ultrawrap and Gra.fo.

Parting Thoughts

With my scientific hat

I am very proud to be part of a startup coming out of research done at the Department of Computer Science at the University of Texas at Austin. I’m looking forward to seeing more startups coming out of UTCS.

There is so much fun research to be done! It’s going to be fun organizing our research plans for the short, medium and long term. Stay tuned!

With my business hat

This is a huge win for companies who are looking to deploy an Enterprise Knowledge Graph. If you are learning and starting small, we can help you. If you are advanced and you know exactly what you want, we can help you too. Together, we now have the best platform in the world to create knowledge graphs!

Personally,

Thanks to the entire Capsenta team, past and present. We are starting this new chapter thanks to all of you.

Thanks to the investors for trusting us in this endeavor.

Thanks to Dan Miranker for believing in me.

Thanks to Wayne Heideman for teaching me so much about business and technology.

Thanks to my family for supporting me every step of the way.

It’s clear that both Capsenta and data.world are heading in the same direction. We are honored and humbled to be invited to be part of the data.world journey and are excited at what it holds for us all.

Thank you Capsenta!

Hello data.world!

We are one team now.

2019 Knowledge Graph Conference Trip Report

I just got back from the 2019 Knowledge Graph Conference organized by the School of Professional Studies of Columbia University and chaired by François Scharffe. I was honored to be invited by François to be a member of the Program Committee. Our task was to invite speakers, go over the submitted proposals and help shape the program of this event. 

As always, I asked myself: what does success look like? For me, success means we learn what real-world problems various industries are tackling with Knowledge Graphs and how they are tackling them, and the skeptics leave less skeptical and eager to engage more with Knowledge Graphs.

I can report that, per my definition, this was a successful event! It actually surpassed my expectations. The event was packed and all 200 tickets were sold out. 

My main takeaways: 1) Finance is all over Knowledge Graphs, 2) more and more industries are now starting to pay attention, 3) roadblocks are social, not technical (the technology works!) and 4) virtual knowledge graphs are gaining a lot of interest (keep the data where it is).

Before I dive into details of this trip report, I believe it is paramount to highlight a problem that was observed by many: the lack of gender diversity. This was also observed in the past W3C Graph Data Workshop. We have a vibrant graph community, but why are we lacking gender diversity in this graph community? While we were creating the program, we invited many female speakers. Unfortunately, some couldn’t make it and some cancelled at the last minute.  It was great to see a larger female representation within the audience. A diverse group brings diverse ideas and fosters increased creativity. As a community, we need to make sure that all voices are included. This lack of diversity worries me tremendously. We need to support the community at large and encourage people from all diverse backgrounds to participate and speak at next year’s Knowledge Graph Conference (yes, this event will take place next year!).

This event did have a diversity of industries attending. The talks and break discussions were very broad. I’m going to organize this report by the following topics: Finance, Other Industries, Unicorns, Virtual Graphs, Machine Learning, Vision and Vendors.

Finance

Given that we were in New York, there was a great representation of financial services companies.

The day was kicked off with a talk by Christos Boutsidis from Goldman Sachs.

It takes them 1 week to construct the graph (if they are lucky) and 1 day to update the graph with deltas. They infer and extract knowledge from the graph by running a series of standard graph algorithms: edge weights to understand how strong the relation between a client and an employee is, vertex centrality (i.e. PageRank) to identify influencers, vertex similarity to match Marcus applicants with politically exposed people, all-pairs shortest paths to connect people with the firm, and community detection (clustering) to find sets of accounts that participate in the same money transfer. The compliance applications are insider threat, insider trading, AML, Marcus lending/banking and co-branded credit cards.

The next talk was by Patricia Branum and Bethany Sehon from Capital One. Their goal was to attach an ontology to their existing Customer 360 data in order to enhance definitions, standardize metadata and then further improve the metadata.

When asked how they got sponsorship and internal buy-in, they explained it was an easy sell within Capital One because they see themselves as a data-driven company (shouldn’t everybody be one?!). Given that their sponsors were in risk management, which deals with a lot of data, it was easy to fund the pilot. Capital One is planning to take this into production. They are also looking into reasoning.

What were their challenges? They weren’t technical; they were social (a topic that I discussed during my talk).

I also really liked their definition of an ontology (and see the replies to my tweet to see other interesting discussions)

David Newman from Wells Fargo, a long-timer in this space, also presented.

Tim Baker from Refinitiv (formerly Thomson Reuters Financial & Risk) presented their Knowledge Graph used to track bad actors.

Vivek Khetan from Accenture discussed combining knowledge graphs and NLP to understand regulatory press releases.

It’s been known for a while that the financial industry uses semantic/graph technology. But why has broader adoption been taking so long? I think Dean Allemang’s First Mover slide below sums it up:

More Industry Real-World Use Cases

Joe Pindell from Pitney Bowes and Colin Puri from Accenture jointly presented a customer service use case. With their knowledge graph they are 1) providing context and guidance, 2) discovering resolutions via relationships and 3) modeling & merging data views.

Lambert Hogenhout from the United Nations shared with the audience the reasons why the UN needs knowledge graphs.

The UN also needs to deal with many multilingual issues. They are just starting out.

Chris Brockmann from Eccenca discussed how Knowledge Graphs are used to integrate supply chain data and provide a great ROI.

Tom Plasterer from AstraZeneca discussed how their main challenge is that data is all over the place. Their approach is to build a knowledge graph following the FAIR principles.

Parsa Mirhaji from Montefiore Hospital discussed how it is still challenging to do analytics with health data.

Steven Gustafson from MAANA shared their experience of creating knowledge graphs in the oil and gas industry. The popular term in this industry is Digital Transformation and he provided an interesting definition: Knowledge Graph + Function Graph = Digital Transformation, where a function graph is a graph of methods (i.e. functions) and how they interact with each other.

Unicorns

By unicorns I mean companies that are very different from the mainstream (not everybody is a Google). It was very exciting to have representatives from Airbnb, Amazon, Diffbot, Uber and Wikidata.

Xiaoya Wei from Airbnb presented their knowledge graph:

They built the infrastructure from scratch. From a storage and data partitioning perspective, the nodes and edges are stored separately, by source. The node schema and edge payload are defined in Thrift binary format. It is horizontally scalable. The goal is to avoid broadcasts for queries with large fan-out. From a query perspective, the objective is to traverse a subgraph and retrieve nodes and edges from the traversal. Data is ingested via an asynchronous framework that continuously imports data. Diffs are calculated and then published on Kafka. Finally, why did they build the infrastructure from scratch? Because they built upon the infrastructure that they already support internally (i.e. they don’t want to bring in more software and have to support it).

Three use cases were discussed: 1) Navigation via a taxonomy that describes the inventory, 2) recommendation and 3) provide more context. 

Data quality and consistency is a key challenge. A human team checks data quality. That is why access control is important for them: a user can only make changes to the data that they know about.

Subhabrata Mukherjee from Amazon (now at Microsoft Research) discussed how the Amazon Product Graph is being built. 

Human-in-the-loop techniques are required to clean up noisy training labels. Additionally, the information extraction system returns triples of strings, so the strings need to be mapped to concepts (things, not strings!) in order to truly integrate the knowledge.
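A tiny sketch of the difference (the identifiers below are hypothetical and not Amazon’s actual schema): an extracted triple of strings such as ("iPhone 11", "manufactured by", "Apple") only becomes integrable knowledge once each string is resolved to a concept:

@prefix ex: <http://example.org/product-graph/> .

# Strings resolved to things:
ex:iPhone11  ex:manufacturedBy  ex:AppleInc .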

Even though Diffbot is a startup, I’m putting them in the unicorn category because they are doing something very unique that not everybody needs to do: create a knowledge graph by crawling the web. Effectively, they are competing against Google and offering services that Google doesn’t. Mike Tung, CEO of Diffbot, presented:

Great quote:

Josh Shinavier described lessons learned from creating a Knowledge Graph at Uber.  Josh also confirmed Airbnb’s comments about why they built their own infrastructure from scratch: they want to reuse the support capabilities that they already have and not bring new software into the mix.

For more details, check out the article Uber’s graph expert bears the scars of billions of trips.

Finally, Denny Vrandecic from Google AI talked about Wikidata. Check out Vivek Khetan’s twitter thread on Denny’s talk.

Virtual Graphs

Capital One, AstraZeneca, Uber and Wells Fargo all publicly stated that they are looking into virtual graphs. This means they want to be able to keep the data in its original source and have a way to virtualize it as a Knowledge Graph.

This is music to my ears because this is what my PhD was all about and it is the premise on which Capsenta was founded: a NoETL (i.e. virtualized) approach to data integration via semantic/graph technology.

I had a lot of discussions during the breaks with other folks about this topic. There is an agreement that moving the data to a centralized location has been the status quo and it’s getting more and more expensive. I’m also glad to see other vendors talking about virtualization such as data.world and Stardog.

Machine Learning

Subhabrata Mukherjee’s talk provided a lot of details into their machine learning process. Take a look at Vivek’s twitter thread.

Alfio Gliozzo from IBM Research discussed how to extend Knowledge Graphs using Distantly Supervised Deep Nets. The challenge: developing hand-labeled data. The ML folks in the audience agreed. Vivek also has a detailed thread on this talk.

Freddy Lecue from Thales discussed explainable AI. 

Vision

Given the hype of Machine Learning, Deep Learning, AI, etc., I’ve been asking myself if we will ever automate the creation of Knowledge Graphs. I had a great discussion with Subhabrata Mukherjee on this topic. He thinks that we will get there when the source data is unstructured, because there is so much overlapping data within the same domain. On the other hand, when the source is structured data, we both agreed that the future doesn’t look as bright; there simply isn’t enough overlapping domain data. As I mentioned in my talk, I never thought that I would be working on methodologies, but we need to empower humans and machines to work together.

We were very lucky to have Pierre Haren, a pioneer in AI and rule systems and founder of ILOG. He spoke about the future evolution of knowledge graphs to causal graphs, where the relationships (edges) are causal.

Personally, I was thrilled to finally meet him and get his input for our upcoming tutorial on the History of Knowledge Graph’s Main Ideas at ISWC2019.

Vicky Froyen discussed where Collibra is heading.

Vendors

We had representatives from many vendors: AllegroGraph/Franz, Amazon Neptune, Data Chemists, Datastax, data.world, GraphDB/Ontotext, Neo4j, Stardog, TigerGraph and yours truly, Capsenta!

I gave a 20 min version of my talk Designing and Building Enterprise Knowledge Graphs from Relational Databases in the Real World (which is an evolution of my previous talk on Integrating Semantic Web in the Real World: A Journey between Two Cities ). I’m happy to share that the talk was very well received. Check out this twitter thread.

I was also very thrilled to give demos of Gra.fo, our visual collaborative and real-time knowledge graph schema editor. I love seeing the faces of people when they see Gra.fo for the first time. I am so proud of the entire Capsenta team for developing Gra.fo!

Nav Mathur from Neo4j discussed how they build knowledge graphs

Jesús Barrasa shared an objective comparison between RDF Graphs and Property Graphs (more later)

Brad Bebee shared lessons learned from Amazon Neptune’s customers

Bryon Jacob from data.world discussed how they sneak knowledge graphs to users without them even knowing it

Nasos Kyriakov from Ontotext shared a marketing intelligence use case  

The grand finale was a genuine and honest discussion between all the vendors which I had the honor to moderate.

My takeaway is that there is NOT an RDF Graph vs Property Graph “battle”. It was agreed that if your goal is to share data, then use RDF. But that doesn’t stop you from using a property graph. Jesús was very emphatic that you can use Neo4j as your storage model and still support RDF (probably not natively) on top of Neo4j. Jeremy from Datastax shared that with the upcoming TinkerPop 4 you can compile anything into the internals of TinkerPop, be it Cypher or SPARQL. Amazon supports both because their customers want both.

However, some of the RDF folks, like Stardog and DataChemist, are more “pedantic”. Finally, DataChemist is proposing a new graph language which has features that have been well defined in G-CORE and are going into GQL.

I asked everybody to give a two-floor elevator pitch to convince the audience that they should spend their time evaluating their technology. Basically everybody’s response was the same: just sign up/download our system and try it out.

My takeaway from the panel: we are turning into a warm, fuzzy, open and comfortable graph community. This confirms my takeaway from the W3C Graph Workshop.

Final Thoughts

  • Word on the street is that people really regret not attending this event.
  • Congrats to all the organizers: Francois, Thomas, Will and all the student collaborators. You all ran an impeccable event!
  • Kudos to all the speakers who stayed within the 20 minute slots for their talks.
  • Even though the majority were new faces, it was great to see old-timers like Dieter Fensel, Dean Allemang and Sören Auer, renowned figures in the semantic web community.
  • Check out all the #kgc2019 tweets
  • Beautiful location and the weather was PERFECT!
  • Check out Denny’s trip report 
  • Check out Vivek’s trip report
  • Talks were recorded! It will take a while but they will be made public. So stay tuned!
  • See you May 2020 back in NY!

Finally, check out what some of the attendees had to say

Gra.fo, six months later! What have we been up to?

I can’t believe that it has already been six months since we first announced Gra.fo. Time flies when you are having fun! I am really excited to share with everybody some of the major features we have been up to: New Exports, Graph Schema Documentation, Multi-select and Import Mapping.

New Exports

It was clear to us from the beginning that Gra.fo was in the position to support both RDF Graph and Property Graph communities. We started out by exporting the schemas as OWL ontologies in Turtle and RDF/XML syntaxes. However, we lacked support for Property Graph schemas.

Throughout the past few months, we have been thrilled to see the interest of the Property Graph community in schemas and Gra.fo. (I’m honored to be chairing the Property Graph Schema Working Group within the context of the GQL standardization effort.)

That is why we are excited to announce three new property graph schema export formats:

There is a clear need for a general purpose graph schema modeling tool. We are lucky to have this opportunity where Gra.fo can be a bridge between both graph communities.

Graph Schema Documentation

Exporting the graph schema to a PNG or SVG image sure is pretty, but it may not be sufficient. The image does not show attributes or detailed descriptions.

An important need is to provide documentation about the graph schema in a way that can be easily consumed by humans. This type of documentation can serve as requirements documentation, a project deliverable, etc.

Now you are able to view the documentation of the graph schema on a separate page. Go to File > Graph Documentation.

The documentation has its own URL of the form https://app.gra.fo/documentation/a1b2c3. You can now easily share that link with others who also have permission to view the document.

Need a PDF? Simply print and save as PDF.

Multi-select

What if you need to move multiple concepts at the same time? Before, you would need to move each one independently. That was very annoying.

Not any more! Now you can select multiple concepts at the same time and move them all at once. And it even works in real-time when you have multiple users on the document.

Simply select multiple concepts by pressing shift on your keyboard and clicking on each concept that you want to move. Additionally you can click on the canvas while pressing shift on your keyboard and then drag/drop to create a bounding box.

In addition to moving multiple concepts at once, you can also change the colors and delete them.

Import Mapping

Designing a graph schema is just the first part. You have to do something with it. Our customers’ common use case is data integration. Their need is to map complex source relational databases into the graph schema which models the business users’ view of the world.

One way of representing these mappings is using the W3C’s R2RML: Relational Databases to RDF Mapping Language. This standard was ratified back in 2012, together with the Direct Mapping standard (I am one of the editors).

R2RML is a declarative language that defines how RDF triples are generated from SQL tables or queries. For example, the following R2RML snippet defines that all the rows of the OMS_ORDER table will be instances of the class ec:Order and that the subjects of the triples are generated by a template that uses the values of the OrderId attribute.

@prefix rr:    <http://www.w3.org/ns/r2rml#> .
@prefix map:   <http://capsenta.com/mappings#> .
@prefix ec:    <http://gra.fo/e-commerce/schema/> .

map:Order  a rr:TriplesMap ;
    rr:logicalTable  [ rr:tableName "OMS_ORDER" ] ;
    rr:subjectMap    [ rr:class ec:Order ;
                       rr:template "http://www.e-commerce.com/data/order/{OrderId}"
                     ] .
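To make that concrete, assuming a row of OMS_ORDER whose OrderId is 1001 (a made-up value), the mapping above yields:

<http://www.e-commerce.com/data/order/1001>  a  ec:Order .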

These mappings can be created using editors, or by hand if you are an RDF geek 🙂. Capsenta offers Ultrawrap Mapper, our mapping management system.

I’m really excited about this initial feature: import an existing R2RML mapping into a graph schema document. Go to Mapping > Manage.

Once a mapping has been imported, you will see an icon on the left panel if a mapping exists for a Concept, Attribute or Relationship.

If you click on the icon, you will see the mapping details. In this example, we are showing the previous R2RML snippet.

In real-world, enterprise relational databases, the mappings will consist of complex SQL queries defined in an R2RML mapping.
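For illustration, a sketch of such a mapping might look like the following; the SQL query, the OMS_ORDER_LINE table and the ec:orderTotal property are hypothetical, but the pattern of using rr:sqlQuery as the logical table is what these real-world mappings rely on:

@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix map: <http://capsenta.com/mappings#> .
@prefix ec:  <http://gra.fo/e-commerce/schema/> .

map:OrderTotal  a rr:TriplesMap ;
    # The logical table is a SQL query instead of a base table.
    rr:logicalTable [ rr:sqlQuery """
        SELECT o.OrderId, SUM(l.Quantity * l.UnitPrice) AS Total
        FROM   OMS_ORDER o JOIN OMS_ORDER_LINE l ON o.OrderId = l.OrderId
        GROUP  BY o.OrderId""" ] ;
    rr:subjectMap [ rr:class ec:Order ;
                    rr:template "http://www.e-commerce.com/data/order/{OrderId}" ] ;
    rr:predicateObjectMap [ rr:predicate ec:orderTotal ;
                            rr:objectMap  [ rr:column "Total" ] ] .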

Once you have the mappings, you have to do something with them. You can use the mappings to physically convert the relational data into graphs (ETL) or to virtualize the relational databases as if they were a graph database (NoETL). Capsenta also offers Ultrawrap Data Integrator, where you can use mappings in an ETL or NoETL mode, or even a hybrid.

So what’s next?

There are a lot more exciting features coming soon.

  • Gra.fo/Mapper: Importing a mapping is just the beginning. We want you to be able to create your mappings all from within Gra.fo.
  • API: We want to empower users to create their own apps that interact with Gra.fo. Everything that you can do through the frontend, you will also be able to accomplish through an API.
  • Gra.fo Documentation: We have a phenomenal UI/UX team who strive to make Gra.fo very intuitive. Nevertheless, we acknowledge the necessity of having documentation.

We truly appreciate all the feedback that we have been getting from our users. Please keep it coming!


2nd U.S. Semantic Technologies Symposium 2019 Trip Report

I would summarize my experience at the 2nd U.S. Semantic Technologies Symposium as follows:

Frustrated but optimistic

First of all, I share my criticism with the utmost respect I have for everybody in this community. Furthermore, I acknowledge that this is my biased view based on the sessions and hallway conversations I had. Therefore, please take my comments with a grain of salt.

Frustration

Where is the excitement?

I left last year after the 1st US2TS very excited. It actually took me another week to fully process the excitement. This year I simply didn’t feel the excitement. This was also echoed in the town hall meeting we had. I acknowledge that this is my feeling and it may not be shared by others. I was expecting to see people sharing their new research grants (per the previous year, there is a lot of NSF money), companies sharing what they are doing (there was a bit of this but not much), and newcomers asking questions about how to bring semantic technologies into practice. All of this was missing. I think the community is stuck in the same ol’, same ol’.

Same ol’, Same ol’
A common theme I was hearing was the “same ol’, same ol’”: we need better tools, we need to lower the barrier, it’s too hard to use, etc. … (insert here a phrase often used towards someone who states the obvious).

This was also my takeaway from Deborah McGuinness’ keynote

This was probably my main source of frustration. I’ve been in this community for over a decade and I’ve been hearing the same thing for a decade. 

Where is the community?

Per the US2TS website “The goal of the U.S. Semantic Technologies Symposium series is to bring together the U.S. Semantic Web community and begin forming such a research network.” Given that this was the second edition, I was expecting that we would be seeing a community forming.  I did not feel that this was happening. Again, this is just my personal perception and others may disagree.

I did meet a few newcomers and, based on private conversations, I had the impression (and they also confessed) that a lot of the discussions were way over their heads. Are we being open to the newcomers? Do they feel welcome? I don’t think so.

What is a Knowledge Graph? 

I understand that we need to define our terms in order to make sure that we are talking about the same thing. But we have to be careful not to end up going down a black hole:

All of this discussion was a reminder of the “drama” we went through in Dagstuhl on this same topic of defining what is a knowledge graph. As I mentioned in my Dagstuhl trip report:


Throughout the week, there was a philosophical and political discussion about the definition. Some academics wanted to come up with a precise definition. Others wanted it to be loose. Academics will inevitably come up with different (conflicting) definitions (they already have), but given that industry uptake of Knowledge Graphs is already taking place, we shouldn’t do anything that hinders the current industry uptake. For example, we don’t want people searching for “Knowledge Graphs” and finding a bunch of papers, problems, etc. instead of real world solutions (this is what happened with the Semantic Web). A definition should be open and inclusive. A definition I liked, from Aidan Hogan and Antoine Zimmermann, was “a graph of data with the intention to encode knowledge”.

Excerpt from Trip Report on Knowledge Graph Dagstuhl Seminar

I had the opportunity to provide “my definition” of knowledge graph and I did it in a controversial way

I find it funny/ironic that really smart academics are providing a definition for a marketing term that came up in a blogpost in 2012!

Optimistic

Now that I have shared my frustration, let me share my optimism.

It was confirmed over and over that semantic technologies do work in the real world. This was clearly shown in Deborah’s and Helena’s keynotes.

We do have newcomers who bring in a completely different perspective.

The newcomers are bringing in lessons learned

This community is full of incredibly smart people. 

What should the 3rd edition of US2TS look like?

We need to provide elements to help form a community:

  • [UPDATE, idea after chatting with Anna Lisa Gentile] Given that there is a cap on attendees, in order to register, people should submit a position statement indicating 1) why they want to attend, 2) what they have to contribute and 3) what they expect to take away. This is what the W3C Workshop on Graph Data did and the conversations were very lively.
  • How about organizing a barcamp, with user-generated content on the fly?
  • Have a wall of ideas where people can post the topics they are interested in.
  • Speed dating so people can find others that have similar interests.

We need more industry

  • I think we should strive to have a 50/50 split between industry and academia (I think it was 60% academia this time).
  • Industry participants should have sessions explaining their pain points. 
  • Startups can share their latest developments and the help they may need.

We need an academic curriculum

  • If we already have a group of academics in the room, why not spend some time organizing an undergrad and post-grad curriculum for semantic technologies that can be shared?

Even though I left frustrated, I’m optimistic that next year we can have an exciting event.

Final Notes

  • The event was very well organized. Kudos to all the organizers!
  • Duke University is very beautiful and the room we were in was very bright.
  • Should this event be rebranded to Knowledge Graphs?
  • Chris Mungall wrote a report
  • Folks appreciated my call for knowing our history

W3C Graph Data Workshop Trip Report

This week, March 4-6 2019, was the W3C Graph Data Workshop – Creating Bridges: RDF, Property Graph and SQL

When I come to meetings/workshops like this, I always ask myself what does success look like: “IF X THEN this will have been a successful meeting”. So, I told myself:


IF there is a consensus within the community that we need to standardize mappings between Property Graphs and RDF Graphs THEN this will have been a successful meeting.  


I can report that, per my definition, this was a successful meeting! It actually surpassed my expectations.


In order to keep track of the main outcomes of each talk/session I attend, I follow a technique of immediately summarizing the takeaways in a crisp and succinct manner (and if I can’t, that means I didn’t understand). What better way of doing that than in a tweet (or two or three)? Therefore the majority of this trip report is pointers to my tweets 🙂. In a nutshell, the tl;dr:
– There is a unified and vibrant graph community.
– A W3C Business Group will be formed and serve as a liaison between different interested parties.
– There is a push for RDF*/SPARQL* to be a W3C Member submission.
– There is interest to standardize a Property Graph data model with a schema.
– There is interest to standardize mappings between Property Graphs and RDF.


Kudos to Dave Raggett and Alastair Green for chairing this event. The organization was fantastic. Additionally, the website has all the position papers, lightning talk slides and minutes in Google Docs for every single session. Please go there to get all the detailed information directly from the source.

Brad Bebee’s Keynote

The workshop started with a keynote by Brad Bebee from Amazon Neptune. The main takeaway of his talk was:


We all know that the common use cases for graphs are: social networks, recommendations, fraud detection, life sciences, and network & IT operations. In addition to the common use cases, Brad said something that highly resonated with me, specifically w.r.t. Knowledge Graphs (paraphrasing):


“Use graphs to link information together to transform the business. Link things that were never connected before. This is really exciting.”


Some other important takeaways

Coexistence or Competition

After discussions about how standardization works within W3C and ISO, there was a mini panel session on “Coexistence or Competition” with Olaf Hartig, Alastair Green and Peter Eisentraut. The takeaways:

Lightning talks

The day ended with over 25 lightning talks. The moderators were excellent timekeepers. The two main themes that caught my attention were the following:

Many independent bridges are already being formed: many approaches were presented that build bridges between Property Graphs, RDF Graphs and SQL. A few of the lightning talks:


However, as Olaf Hartig was alluding, we should not focus on creating ad-hoc implementations of bridges. We need to clearly understand what that bridge means (i.e. what are the semantics!). Olaf’s RDF*/SPARQL* proposal to annotate statements in RDF, which can serve as a bridge between Property Graphs and RDF, has been very well received in the community. As a matter of fact, this approach has already been implemented in commercial systems such as Cambridge Semantics and Blazegraph.
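For readers unfamiliar with the proposal, here is a minimal sketch of the idea in RDF* (RDF-star) Turtle syntax, with hypothetical terms: a statement is itself annotated, which is essentially what a property graph does with edge properties.

@prefix ex: <http://example.org/> .

# Annotate the statement "alice knows bob" with an edge property:
<< ex:alice ex:knows ex:bob >>  ex:since  2010 .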


Personally, I avoid (and actually stop) discussions on syntax. In my opinion, that should not be the first topic of discussion. We first need to agree on the meaning.
Note: I think there may be interesting science in here.


GraphQL is popular: I was surprised to see GraphQL being a constant topic of discussion. It was presented as the global layer over heterogeneous data sources (i.e. OBDA), as an interface to RDF graphs, and also as a schema language for Property Graphs. You could hear a lot of GraphQL discussions in the hallway.
Note: I think this is engineering. Not clear if there is science here.


The second day consisted of three simultaneous tracks: Interoperation, Problems & Opportunities, and Standards Evolution, for a total of 12 sessions. By coincidence (?), all the sessions I was interested in were in the Interoperation track.

Graph Data Interchange

Graph Query Interoperation

Specifying a Standard

I find it very cool that Filip Murlak and colleagues defined a formal, readable, and executable semantics of Cypher in Prolog, which is based on the formal semantics defined by the folks from the University of Edinburgh. This reminds me of when I took a course with J.C. Browne on Verification and Validation of Software Systems and learned about Tony Hoare’s and Jay Misra’s Verification Grand Challenge.

Finally, Andy Seaborne made a very important point:

Graph Schema

I was glad to have the opportunity to moderate this session because this is a topic very dear to me (hello Gra.fo!) and I am chairing an informal Property Graph Schema Working Group (PGSWG).


Inspired by our work in G-CORE, which was a very nice mix of industry and academia members and which influenced the GQL manifesto that led to the GQL standardization effort, I was asked to chair this informal working group. I was able to share what we have accomplished up to now.

George Fletcher provided a quick overview of the lessons learned in the academic survey. He condensed the lessons learned into: 1) start small, 2) start from foundations and 3) start with flexibility in mind. Oskar van Rest presented an overview of what the existing industry-based graph databases support; this is still work in progress. I presented the use case and requirements document, which is the starting point to drive the discussions towards features that address concrete use cases. Olaf presented how GraphQL could be a schema language for Property Graphs, in other words a syntax that could be reused. This sparked the discussion of syntax, syntax, syntax. As I previously mentioned, I avoid discussions that jump immediately into syntax because we should first focus on the understanding/semantics.


The top desirable feature was … KEEP IT SIMPLE! Other top features were: enable future extensibility, allow for permissive vs restrictive, allow for open world vs closed world, have a simple clean formalization, and again… keep it simple (don’t make mistakes like XML Schema). Josh Shinavier remotely mentioned that “historically, property graphs were somewhat of a reaction to the complexity of RDF. A complex standard will not be accepted by the developer community.”


To summarize our 1.5 hour discussion:

Finally, 1.5 hours is not enough to discuss graph schemas, so a group of us stayed the next day and kept working on it.

Up to now, the PGSWG has been informal. There was a consensus that it should gain some sort of formality by becoming a task force within the Linked Data Benchmark Council (LDBC). More info soon!

What are the next steps?

The goal of the third and final day was to first offer a summary of each session and then to discuss the concrete next steps.
My concrete proposal for next steps:

I was also proposing to standardize mappings from Relational Databases to Property Graphs, and I was happy to learn that this work is already underway within ISO.


Following the building-bridges analogy, we need to have aligned piers in order to know how to build the bridge. RDF is standardized and formalized. Property Graphs are not. Therefore the first task is to lift the Property Graph pier so it can be aligned with RDF. Subsequently, we will be in a position to start addressing interoperability needs between Property Graphs and RDF Graphs by means of establishing direct and customizable mappings.


Furthermore, given the commercial uptake and interest of RDF*/SPARQL*, this will drive discussions towards a new version of RDF in the very near future.


The official outcome (I believe) is that a W3C Business Group will be created in order to coordinate with all the interested parties and existing W3C community groups, and to be a liaison with ISO (where the GQL and SQL/PG work is going on). An official report will come soon.

Lack of Diversity

We have a vibrant graph community. However, this community lacks diversity, as was observed on Twitter by Margaret, who wasn’t even at the event:

There were definitely over 100 people attending this meeting, and 96+ were men. I believe only 5 women were attending (thanks to Christophe for the clarification). I had the chance to meet with them.

 

– Dörthe Arndt: a moderator for the Rules and Reasoning session and a researcher at Ghent University who believes that rules should be part of data. Unfortunately I did not have the opportunity to speak more with Dörthe.
– Marlène Hildebrand: this was the first time I met Marlène. She is at EPFL working on data integration using RDF, so we discussed a lot about converting different sources to RDF, mappings, and methodologies for creating ontologies and mappings.
– Petra Selmer: a member of the Query Languages Standards and Research Group at Neo4j with vast experience in graph databases.
– Monika Solanki: well known in the Semantic Web community and always a pleasure to interact with at conferences.
– Natasa Varytimou: it was great to finally meet Natasa in real life after interacting a lot via email. She is an Information Architect at Refinitiv (the finance company of Thomson Reuters) and one of the brains behind the large-scale Refinitiv Knowledge Graph.

The lack of diversity worries me and I strongly urge that we, as a community, take action on this matter.

Final quick notes

– We seem to be converging into a unified graph community! Not individual RDF and PG communities. I didn’t hear any RDF vs PG conversations.
– However, Gremlin was underrepresented. If it weren’t for Josh Shinavier, who was constantly participating remotely, we would have missed valuable input.
– Thank you Josh and Uber for offering a virtual connection. I believe everything has been recorded and you can find the details in the minutes.
– BMW is starting to get on board the Knowledge Graph bandwagon. After chatting with Daniel Alvarez, it seems that they are still in an early innovator phase. Nevertheless, very exciting.
– It was a great idea to have a two-day event spread across three days. That way you could technically arrive on the first day of the event and leave on the last day.
– The W3C RDB2RDF Standard editors meet again! I was one of the editors of the Direct Mapping, while Richard Cyganiak was one of the editors of R2RML.

– Adrian Gschwend has his summary in a Twitter thread:

– Gautier Poupeau has his summary in a Twitter thread in French:

– Find a lot more tweets by searching for the #W3CGraphWorkshop hashtag.

My Most Memorable Event of 2018

I travelled a lot in 2018, but actually a bit less than in 2017. I flew 132,064 miles, which is equivalent to 5.3x around the Earth. I was on 93 flights. I spent almost 329 hours (~14 days) on a plane. I visited 11 countries: Colombia, France, Germany, India, Italy, Japan, Mexico, Spain, South Korea, Turkey, and the UK. I was in Austin (my home) for 174 days, 53 days in Europe, 31 days in Colombia, and 14 days in Mexico, India, and Japan.

Given all this travel and everything I did in 2018, I asked myself: what was my most memorable event of 2018?

Answer: The 14 times I gave my talk “Integrating Semantic Web in the Real World: A Journey between Two Cities” all around the world.

Abstract: An early vision in Computer Science has been to create intelligent systems capable of reasoning on large amounts of data. Today, this vision can be delivered by integrating Relational Databases with the Semantic Web using the W3C standards: a graph data model (RDF), ontology language (OWL), mapping language (R2RML) and query language (SPARQL). The research community has successfully been showing how intelligent systems can be created with Semantic Web technologies, dubbed now as Knowledge Graphs.
However, where is the mainstream industry adoption? What are the barriers to adoption? Are these engineering and social barriers or are they open scientific problems that need to be addressed?
This talk will chronicle our journey of deploying Semantic Web technologies with real world users to address Business Intelligence and Data Integration needs, describe technical and social obstacles that are present in large organizations, and scientific and engineering challenges that require attention.

It all started when Oscar Corcho invited me to be a keynote speaker at KCAP 2017. I wanted to give a talk that described the journey I’ve been going through with Capsenta, which is commercializing the research I did in my PhD, and the lessons learned throughout the process. Apparently the talk was very well received, and I quickly started to get invitations.

I gave the talk at:

1. Imperial College London. London, UK. Jan 2018. Invited by Bob Kowalski.
2. Knowledge Media Institute at the Open University. Milton Keynes, UK. Feb 2018. Invited by Miriam Fernandez.
3. University of Oxford. Oxford, UK. Feb 2018. Invited by Dan Olteanu. (tweet)
4. TU Dresden. Dresden, Germany. April 2018. Invited by Hannes Voigt.
5. Big Data Kompetenzzentrum Leipzig (ScaDS Dresden/Leipzig) Universität Leipzig. Leipzig, Germany. April 2018
6. Information Sciences Institute at the University of Southern California. Marina del Rey, USA. May 2018. Invited by Mayank Kejriwal.
7. Pacific Northwest National Laboratory. Richland, Washington, USA. June 2018. Invited by Eric Stephan.
8. Ontology Engineering Group at the Universidad Politecnica de Madrid (UPM). Madrid, Spain. July 2018. Invited by Oscar Corcho.
9. Free University of Bolzano. Bolzano, Italy. July 2018. Invited by Enrico Franconi.
10. Keynote for the 45th Japanese Society for Artificial Intelligence Semantic Web and Ontology Conference. Tokyo, Japan. August 2018. Invited by Ryutaro Ichise and Patrik Schneider.
11. University of Edinburgh. Edinburgh, Scotland. Sept 2018. Invited by Leonid Libkin.
12. University of Erlangen-Nuremberg. Nuremberg, Germany. Sept 2018. Invited by Andreas Harth.
13. University of California – Santa Cruz. Santa Cruz, USA. Oct 2018. Invited by Phokion Kolaitis.
14. Manipal Institute of Technology. Manipal, India. November 2018.

I deliver this talk wearing two hats: science and business. The goal of the talk is to provide an answer to the following question: Why is it so hard to deploy Semantic Web technologies in the real world?

I start by describing the research I did in my PhD and what was productized at Capsenta, and then describe the status quo of data integration that we see in the real world. I share five observations that we have made when trying to use Semantic Web technologies to address data integration needs:

1. We are boiling the ocean because we want to create the ontology first.
2. Real world databases schemas are hard… really hard!
3. Real world mappings are hard… really hard!
4. Knowledge Hoarding
5. Tools are made for citizens of the Semantic City

I present ideas and solutions that we are working on at Capsenta to address these issues and bridge the chasm between the Semantic and Non-Semantic cities. Essentially, we need Knowledge Engineers, who need to be empowered with methodologies and tools. A final call to arms is made: we need to study the socio-technical aspects of data integration.

A theme throughout the talk is that we need to know our history. Too much wheel reinventing is going on.

It has been a true honor to have the opportunity to give this talk so many times in 2018. I want to thank everybody who invited me, who listened to the talk, asked questions and fostered discussions. I’m extremely lucky to have had so many enlightening discussions which have sparked new research-industry collaborations. 2019 is going to be very exciting!

Without further ado, here is a recording of the talk at KMI in Feb 2018.

 

How I am Avoiding a Burnout

Earlier this month I saw this tweet:

and it got me thinking. I provided a short answer:

I kept reflecting on what I’ve been doing this year to avoid a burnout, so I decided to write this up.

Make Lists

Write down everything you need to get done. Just write it down. It doesn’t matter if it’s short, medium, or long term. After that, you understand the lay of the land and you can start organizing and prioritizing. Every day I look at the list and ask myself, “What do I need to cross off my list to consider that I had a successful day?” I focus on those few items. If you want to get more sophisticated, follow the Getting Things Done (GTD) time-management method.

Learn to say No

One of the hardest things to do. This is something that everybody told me during grad school and everybody I talk to acknowledges that it’s a hard thing to do. Nevertheless, strive to say NO to more things.

Delegate

If you can, delegate. And when you do, don’t worry about the task at hand. I know this is easier said than done, and it is only possible if you work in a team.

Read Magazines and Books. Watch Documentaries

During grad school I always felt guilty spending time reading anything other than research papers, because I always had a large stack of papers to read. I still feel that guilt. However, I realized that by reading other material, I get a different perspective on the world, and this helps with diversity of ideas. If you ask around, highly successful people spend a lot of their time reading, even with their hectic schedules.



If there is one magazine you should read, I highly recommend Bloomberg Businessweek. I recently read Creative Selection: Inside Apple’s Design Process During the Golden Age of Steve Jobs. On my to-read list I have: a biography of Skinner, Why We Sleep: Unlocking the Power of Sleep and Dreams, The Third Wave: An Entrepreneur’s Vision of the Future, and The Next 100 Years: A Forecast for the 21st Century.

When do I get my reading done? At night: leave your phone in a different room and read before going to bed.


I also enjoy watching documentaries so I can learn new things. I’ve been enjoying Explained, World War II in Colour, and the CNN miniseries on the ’60s, ’70s, ’80s, and ’90s.

Have Fun!

I go dancing and try to go at least once a week. Cooking is my relaxation. I also avoid working on Saturdays.

Just Relax!

Sometimes I feel like I didn’t accomplish anything during the day and I feel guilty. It’s fine if I don’t feel productive. Just relax. I know that I will probably have another moment where I will be extremely productive.

Be Healthy

Last but not least, focusing on my health has been a game changer. This involves going to the gym and eating healthily. I’ve never been a gym-going person. However, I did find an amazing gym, Dane’s Body Shop, which is community oriented. When I’m in Austin, I look forward to going to the gym every day!


I also started working with Veronica Bumpass, the nutritionist at the gym, to support and guide me in organizing a healthy lifestyle, especially when I travel. I’m very conscious of what I eat and of the workouts I can do when I travel, all while still enjoying the fun lifestyle that I like to have.


A couple of tips:
1) Work out 20 minutes every day. No excuses. You can find plenty of workouts you can do at home with no equipment. Just make sure you use correct form; to learn it, start working out with a trainer.
2) SLEEP 8 HOURS! No excuses.
3) When you are eating, ask yourself, “Are those calories worth it?”
4) Do you have to be on the phone? Take a walk while you are on the call.


If you are in Austin, you MUST check out Dane’s Body Shop.

Conclusion

If you learn to say NO and delegate other tasks, you will have more time. Having a healthy lifestyle will make you feel stronger, physically and mentally. Give your brain a break by relaxing, having fun, reading, etc. Every day, focus on the top-priority items on your list.


Finally, by coincidence, I saw this today:

I love the answer, which emphasizes the importance of being healthy!

International Semantic Web Conference (ISWC) 2018 Trip Report

ISWC has been my go-to conference every year. This time it was very special for two reasons. First of all, it was my 10-year anniversary of attending ISWC (the first one was ISWC2008 in Karlsruhe, where I presented a poster that ultimately became the basis of my PhD research and also the foundational software of Capsenta). Too bad my dear friend and partner in crime, Olaf Hartig, missed out (but for good reasons!). I only missed ISWC2010 in Shanghai; other than that, I’ve attended each one and I plan to continue attending them (New Zealand next year!).

The other reason why this was a special ISWC is because we officially launched Gra.fo, a visual, collaborative, real-time ontology and knowledge graph schema editor, which we have been working on for over 2 years in stealth mode.


THE Workshop: This year at ISWC, I co-organized THE Workshop on Open Problems and Emerging New Topics in Semantic Web Technology. The goal was to organize a true workshop where attendees would actually discuss and get work done.

Let’s say we may have been a bit ambitious, but in the end it turned out very well. In the first part of the morning, everybody was encouraged to stand up and talk for a minute, on the spot, about their problem. We gathered 19 topics. For the rest of the morning, we self-organized into clusters; each group continued the discussion and we finished with a wrap-up.

The goal was to then submit the problems to THE Workshop website. It looks like the attendees have not done their homework (you know who you are!). We had great feedback about this format, so we will consider proposing it again next year with an improved format.


VOILA: I’ve been attending the Visualization and Interaction for Ontologies and Linked Data (VOILA) Workshop for the past couple of years (guess why 🙂 ), and luckily I was able to catch the last part of it. My takeaway is that there are a lot of cool things going on in this area, but the research problems being addressed are not always clear. Furthermore, prototypes are engineered and evaluated, but it’s not clear who the tool is for. Who is your user? I brought this up in my trip report from last year. This community MUST partner with researchers in HCI and Social Science in order to harden the scientific rigor. Additionally, there are cool ideas for which it would be interesting to explore commercial viability.


SHACL: I attended the Validating RDF Data tutorial by Jose Emilio Labra Gayo. I came in trying to find an answer to the following question: is SHACL ready for industry prime time? The answer is complicated, but unfortunately I have to say: not yet. First of all, even though SHACL is the W3C recommendation, there is another proposal called ShEx from Jose Emilio’s group. He acknowledges his bias, but if you look at ShEx and SHACL side by side, you can argue objectively for one or the other. For example, ShEx supports recursive constraints, but SHACL doesn’t (there was a research paper on this topic, Semantics and Validation of Recursive SHACL, … but it’s research!). Nevertheless, the current SHACL specification is stable and technically ready to be used in prime time. The problem is the lack of commercial tools for enterprise data. Jose Emilio keeps a list of SHACL/ShEx implementations, but all of them, except for TopQuadrant’s, are (academic) prototypes. It seems Stardog is planning to officially support it in their 6.0 release. At this stage, I was expecting to see a standalone SHACL validator that can take RDF data or a SPARQL endpoint as input and run the validations. With all due respect, these kinds of situations are embarrassing for this community and industry: apparently a standard is needed, a recommendation is made, but in the end there is no industry implementation and uptake (one or two is not enough). We live in a “build it and they will come” world and this does not make us look good. </rant>. On a positive note, I think we are very close to the following: create a SHACL-to-SPARQL translator that starts out by supporting a simple profile of SHACL (e.g., cardinality constraints), as sketched below. This way anybody could use it on any RDF graph database. Somebody should build this, and we should support it as a community, not just academics but with industry users behind it too.
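
To be concrete about the kind of translator I have in mind, here is a rough, hypothetical sketch (my own; it is not an existing tool and not how any vendor implements SHACL) covering the simplest piece of that profile, a sh:minCount-style constraint: given a target class, a property, and a minimum count, it emits a SPARQL query whose results are the violating focus nodes, so it can be run against any SPARQL endpoint or RDF graph database.

```python
# Hypothetical sketch of a "SHACL cardinality constraint -> SPARQL" translator.
# Only the minCount case is shown; names and query shape are my own illustration.

def min_count_violations_query(target_class: str, path: str, min_count: int) -> str:
    """Build a SPARQL query returning focus nodes of `target_class` that have
    fewer than `min_count` values for the property `path`."""
    return f"""
SELECT ?focus (COUNT(?value) AS ?actual)
WHERE {{
  ?focus a <{target_class}> .
  OPTIONAL {{ ?focus <{path}> ?value }}
}}
GROUP BY ?focus
HAVING (COUNT(?value) < {min_count})
"""

if __name__ == "__main__":
    # Example: every ex:Person must have at least one ex:name (sh:minCount 1).
    print(min_count_violations_query(
        "http://example.org/Person",
        "http://example.org/name",
        1,
    ))
```

A maxCount constraint would be the symmetric query with a greater-than comparison (and no OPTIONAL needed), and from there such a translator could grow to cover more of the SHACL Core vocabulary.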

Hat tip to Jose Emilio for the nice SHACL/ShEx Playground, and to Eric, Iovka, and Dimitris for making their book, Validating RDF, available for free (HTML version).


SOLID: I missed out on the Decentralizing the Semantic Web workshop. I heard it was packed, and I guess it did help that Tim Berners-Lee was there presenting on Solid. Later on, I had the chance to talk to TimBL about Solid and his new startup, Inrupt. The way I understood Solid and what Inrupt is doing is through the following analogy: they have designed a brand new phone and the app store infrastructure around it (i.e. Solid). However, people already have phones (web apps that store your data), so they need to convince others to use their phone. Who would they convince and how? Ideally, they want to convince everybody on earth… literally, but they can start out with people who are concerned about data ownership and privacy. My skepticism is that the majority of the people in the world don’t care about it. Jennifer Golbeck’s keynote touched on this topic and stated that young people don’t care about privacy, but the older you get, the more you start caring. Solid is definitely solving a problem, but I question the size of the market (i.e. who cares about this problem). Good luck, Inrupt team!

Enterprise Knowledge Graphs: One of the highlights of ISWC was the Enterprise Knowledge Graph panel. This was actually a great panel (commonly I find that panels are very boring). The participants were from Microsoft, Facebook, Ebay, Google and IBM. I had two main takeaways.
1) For all of these large companies, the biggest challenge is identity resolution. Decades of Record Linkage/Entity Resolution/etc. research and we are still far away from solving this problem… at scale. Context is the main issue.
2) The most important takeaway from the entire conference was: NONE OF THESE COMPANIES USE RDF/OWL/SPARQL… AND IT DOESN’T MATTER! I was actually very happy to hear them say this in front of the entire Semantic Web academic community. In the end, the ideas of linking data, using triples, having tight/loose schemas, and reasoning, all at scale, came out of the Semantic Web research community and have started to permeate industry. It’s fine if they are not using the exact W3C Semantic Web standards. The important thing is that the ideas are being passed on to the real world. It’s time to listen to the real world, see what problems they have, and bring them back for research. This is part of the scientific method!
Notes from each panelist:

Another possible answer to Yolanda Gil’s question is the recently launched dataCommons.org.
The final question to the panel: what are the challenges that the scientific community should be working on? Their answers:


Not everybody is a Google: The challenges stated by the Enterprise Knowledge Graph panelists are for the Googles of the world. Not everybody is a Google. For a while now, I have felt that a large part of the research focus is on tackling problems for the Googles of the world. But what about the other end of the spectrum? My company Capsenta is building knowledge graphs for very large companies, and I can tell you that building a beautiful, clean knowledge graph from even a single structured data source, let alone a dozen, is not easy. I believe that the Semantic Web community, and even the database community, have forgotten about this problem and dismissed it as a day-to-day engineering challenge. The talk “Integrating Semantic Web in the Real World: A Journey between Two Cities” that I have been giving this year details the open engineering, scientific, and social challenges we are encountering. One of those problems is defining mappings from source to target schemas. Even though the Ontology Matching workshop and the Ontology Alignment Evaluation Initiative have been going on for over a decade, the research results and systems do not address the real-world problems that we see at Capsenta in our day-to-day work. We need to research the real-world socio-technical phenomena of data integration. One example is dealing with complex mappings. I was very excited to see the work of Wright State University and their best-resource-paper-nominated work “A Complex Alignment Benchmark: GeoLink Dataset”. This is off to a good start, but there is still a lot of work to be done. Definitely a couple of PhDs could come out of this.


Natasha Noy’s keynote:  I really enjoyed her keynote, which I summarized: 

She also provided some insight on Google Dataset search:


Vanessa Evers’ keynote was incredibly refreshing because it successfully brought to the attention of the Semantic Web community the problems encountered in creating socially intelligent robots. Guess what’s missing? Semantics and reasoning!


Industry: I was pleasantly surprised to see a lot of industry folks this year. The session I chaired had about 100 people.

Throughout the week I saw and met with startups like Diffbot and Kobai; folks from Finance: FINRA, Moodys, Federal Reserve, Intuit, Bloomberg, Thomson Reuters/Refinitiv, Credit Suisse; Graph Databases companies: Amazon Neptune, Allegrograph, Marklogic, Ontotext’s GraphDB, Stardog; Healthcare: Montefiore Health Systems, Babylon Health, Numedii; the big companies: Google, Microsoft, IBM, Facebook, Ebay; and many others such as Pinterest, Springer, Elsevier, Expert Systems, Electronic Arts. Great to see so much industry attending ISWC! All the Industry papers are available online.

Best Papers: The best papers highlighted the theme of the conference: Knowledge Graphs and Real World relevance. The best paper went to an approach to provide explanations of facts in a Knowledge Graph.

The best student research paper was a theoretical paper on canonicalisation of monotone SPARQL queries, which has a clear real world usage: improve caching for SPARQL endpoints.

The best resource paper addressed the problem of creating a gold-standard dataset for data linking, a crucial task for creating Knowledge Graphs at scale. They present an open-source software framework for building Games with a Purpose that help create gold-standard data by motivating users through fun incentives.

The best in-use paper went to the paper that describes the semantic technology underpinning Wikidata, the Wikipedia Knowledge Graph.

Finally, the best poster award went to VoCaLS: Describing Streams on the Web, and the best demo award went to WebVOWL Editor.


DL: Seems like DL this year meant Deep Learning and not Description Logic. I don’t think there was any paper on Description Logic, a big switch from past years.


Students and Mentoring:  I enjoyed hanging out with PhD students and offering advice at the career panel during the Doctoral Consortium and at the mentoring lunch.

During the lunch on Wednesday we talked about science being a social process, and it was very nice that this also came up on Thursday during Natasha’s keynote.


Striving for Gender Equality: I am extremely proud of the Semantic Web research community because they are an example of always striving for gender equality. This year there was a powerful statement: the conference was organized entirely by women (plus Denny and Rafael), and there were 3 amazing keynotes by women. Additionally, the local organizers did a tremendous job!

Furthermore, Ada Lovelace Day, which is held every year on the second Tuesday of October, occurred during ISWC. So what did the organizers do? They held the Ada Lovelace celebration, where we had a fantastic panel discussing efforts toward gender equality in the sciences (check out sciencestories.io!).

The event ended with a Wikipedia Edit-a-thon where we created and edited Wikipedia pages of female scientists. In particular, we created Wikipedia pages for female scientists in our community: Natasha Noy, Yolanda Gil, and Lora Aroyo. It was a true honor to have the opportunity to create the English Wikipedia page of Asunción Gómez Pérez, who has been incredibly influential in my life.

More trip reports: check out Helena Deus’ and Paul Groth’s ISWC trip reports (which I haven’t read yet, so they wouldn’t bias mine).

What an awesome research community: I am very lucky to consider the Semantic Web community my research home. It’s a work hard, play hard community.

We were at a very beautiful venue:

We like to sing:

We like to have great dinners and dance:

We even throw jam sessions and parties:

And just like last year, I recorded the Jam session:

ISWC Jam Session

Posted by Juan Sequeda on Thursday, October 11, 2018

See you next year in New Zealand

… and then in 2020 … Athens, Greece!