I just got back from the 2019 Knowledge Graph Conference organized by the School of Professional Studies of Columbia University and chaired by François Scharffe. I was honored to be invited by François to be a member of the Program Committee. Our task was to invite speakers, go over the submitted proposals and help shape the program of this event.
As always, I asked myself: what does success look like? Success would mean learning which real-world problems various industries are tackling with Knowledge Graphs, and how they are doing it. Additionally, skeptics should leave less skeptical and eager to engage more with Knowledge Graphs.
I can report that, per my definition, this was a successful event! It actually surpassed my expectations. The event was packed and all 200 tickets were sold out.
My main takeaways: 1) Finance is all over Knowledge Graphs, 2) more and more industries are now starting to pay attention, 3) roadblocks are social, not technical (the technology works!) and 4) virtual knowledge graphs are gaining a lot of interest (keep the data where it is).
Before I dive into details of this trip report, I believe it is paramount to highlight a problem that was observed by many: the lack of gender diversity. This was also observed in the past W3C Graph Data Workshop. We have a vibrant graph community, but why are we lacking gender diversity in this graph community? While we were creating the program, we invited many female speakers. Unfortunately, some couldn’t make it and some cancelled at the last minute. It was great to see a larger female representation within the audience. A diverse group brings diverse ideas and fosters increased creativity. As a community, we need to make sure that all voices are included. This lack of diversity worries me tremendously. We need to support the community at large and encourage people from all diverse backgrounds to participate and speak at next year’s Knowledge Graph Conference (yes, this event will take place next year!).
This event did have a diversity of industries attending. The talks and break discussions were very broad. I’m going to organize this report by the following topics: Finance, Other Industries, Unicorns, Virtual Graphs, Machine Learning, Vision and Vendors.
Finance
Given that we were in New York, there was a great representation of financial services companies.
The day was kicked off with a talk by Christos Boutsidis from Goldman Sachs.
It takes them 1 week to construct the graph (if they are lucky) and 1 day to update the graphs with deltas. They infer and extract knowledge from the graph by running a series of standard graph algorithms: edge weights to understand how strong a relation between a client and an employee is, vertex centrality (e.g. PageRank) to identify influencers, vertex similarity to match Marcus applicants with politically exposed people, all-pairs all-paths to find shortest paths that connect people with the firm, and community detection (clustering) to find sets of accounts that participate in the same money transfer. The compliance applications are insider threat, insider trading, AML, Marcus lending/banking and co-branded credit cards.
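To make two of these algorithms concrete, here is a minimal sketch in plain Python of vertex centrality (PageRank by power iteration) and shortest paths (breadth-first search). The toy graph, the names and the function signatures are all my own assumptions for illustration; nothing here reflects how Goldman Sachs actually implements its pipeline.

```python
from collections import deque

# Hypothetical toy graph of contacts (undirected); not real data.
EDGES = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("dave", "erin"),
]

def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def pagerank(adj, damping=0.85, iterations=50):
    """Power-iteration PageRank; assumes every vertex has a neighbour."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in adj}
        for v, neighbours in adj.items():
            share = damping * rank[v] / len(neighbours)
            for u in neighbours:
                new_rank[u] += share
        rank = new_rank
    return rank

def shortest_path(adj, start, goal):
    """Breadth-first search; returns one shortest path or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for u in sorted(adj[v]):
            if u not in parent:
                parent[u] = v
                queue.append(u)
    return None

adj = build_adjacency(EDGES)
ranks = pagerank(adj)
most_central = max(ranks, key=ranks.get)
path = shortest_path(adj, "carol", "erin")
```

In this toy graph the best-connected vertex ends up with the highest rank, and the path from `carol` to `erin` runs through it. At the scale described in the talk you would reach for a graph engine rather than in-memory dictionaries.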
The next talk was by Patricia Branum and Bethany Sehon from Capital One. Their goal was to attach an ontology to their existing Customer 360 data in order to enhance definitions, standardize metadata and then further improve that metadata.
When asked how they got sponsorship and internal buy-in, they said it was an easy sell within Capital One because they see themselves as a data-driven company (shouldn’t everybody be one?!!). Given that their sponsors were in risk management, which deals with a lot of data, it was easy to fund the pilot. Capital One is planning to take this into production. They are also looking into reasoning.
What were their challenges? They weren’t technical, they were social (a topic that I discussed during my talk).
I also really liked their definition of an ontology (and see the replies to my tweet to see other interesting discussions)
David Newman from Wells Fargo, a long-timer, also presented.
Tim Baker from Refinitiv (formerly Thomson Reuters Financial & Risk) presented their Knowledge Graph used to track bad actors.
Vivek Khetan from Accenture discussed combining knowledge graphs and NLP to understand regulatory press releases.
It’s well known that the financial industry has been using semantic/graph technology for a while. But why has it been taking so long? I think Dean Allemang’s First mover slide below sums it up:
More Industry Real World Use cases
Joe Pindell from Pitney Bowes and Colin Puri from Accenture jointly presented a customer service use case. With their knowledge graph they are 1) providing context and guidance, 2) discovering resolutions via relationships and 3) modeling & merging data views.
Lambert Hogenhout from the United Nations shared with the audience the reasons why the UN needs knowledge graphs.
The UN also needs to deal with many multilingual issues. They are just starting out.
Chris Brockmann from Eccenca discussed how Knowledge Graphs are used to integrate data in the supply chain and provide a great ROI.
Tom Plasterer from AstraZeneca discussed that their main challenge is that data is all over the place. Their approach is to build a knowledge graph following the FAIR principles.
Parsa Mirhaji from Montefiore Hospital discussed how it is still challenging to do analytics with health data.
Steven Gustafson from MAANA shared their experience of creating knowledge graphs in the oil and gas industry. The popular term in this industry is Digital Transformation and he provided an interesting definition: Knowledge Graph + Function Graph = Digital Transformation, where a function graph is a graph of methods (i.e. functions) and how they interact with each other.
Unicorns
By unicorns I mean companies that are very different from the mainstream (not everybody is a Google). It was very exciting to have representatives from Airbnb, Amazon, Diffbot, Uber and Wikidata.
Xiaoya Wei from Airbnb presented their knowledge graph:
They built the infrastructure from scratch. From a storage and data partitioning perspective, the nodes and edges are stored separately, by source. Node schema and edge payload are defined by thrift binary. It is horizontally scalable. The goal is to avoid broadcast for queries with large fan out. From a query perspective, the objective is to traverse a subgraph and retrieve nodes and edges from the traversal. Data is being ingested via an asynchronous framework to continuously import data. Diffs are calculated and then published on Kafka. Finally, why did they build the infrastructure from scratch? Because they built upon the infrastructure that they currently support internally (i.e. they don’t want to bring in more software and have to support it).
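The "diffs are calculated and then published" step can be pictured with a small sketch. This is only my illustration of the general idea, assuming each snapshot is a set of (source node, edge label, target node) tuples; the function name and the data are hypothetical, and the Kafka publishing step is left out entirely.

```python
def diff_snapshots(old_edges, new_edges):
    """Compare two edge snapshots and report what changed.

    Each edge is assumed to be a (source, label, target) tuple.
    """
    old, new = set(old_edges), set(new_edges)
    return {
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }

# Hypothetical snapshots from two consecutive imports.
yesterday = [
    ("listing:1", "located_in", "city:paris"),
    ("listing:2", "located_in", "city:rome"),
]
today = [
    ("listing:1", "located_in", "city:paris"),
    ("listing:3", "located_in", "city:rome"),
]

delta = diff_snapshots(yesterday, today)
```

Only the contents of `delta` would need to travel downstream, which is the appeal of publishing diffs instead of full snapshots.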
Three use cases were discussed: 1) Navigation via a taxonomy that describes the inventory, 2) recommendation and 3) provide more context.
Data quality and consistency is a key challenge. A human team checks data quality. That is why access control is important for them: a user can only make changes to the data that they know.
Subhabrata Mukherjee from Amazon (now at Microsoft Research) discussed how the Amazon Product Graph is being built.
Human-in-the-loop techniques are required to clean up noisy training labels. Additionally, the information extraction system returns triples of strings; therefore, the strings need to be mapped to concepts (things not strings!) in order to truly integrate the knowledge.
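Here is a minimal sketch of what "things not strings" means in practice, assuming a hand-built alias table that maps surface strings to concept identifiers. The table, the identifiers and the function names are hypothetical and are not Amazon's approach; real entity linking is far more involved than a dictionary lookup.

```python
# Hypothetical alias table: normalized surface string -> concept identifier.
ALIASES = {
    "apple": "brand:Apple",
    "apple inc.": "brand:Apple",
    "iphone x": "product:iPhoneX",
    "iphone 10": "product:iPhoneX",
}

def normalize(text):
    """Lowercase and collapse whitespace so variants compare equal."""
    return " ".join(text.lower().split())

def link_triple(triple, aliases=ALIASES):
    """Map a (subject, predicate, object) triple of strings to concepts.

    Returns None when either end is unknown, so the triple can be
    routed to a human reviewer instead of being guessed at.
    """
    subject, predicate, obj = triple
    s = aliases.get(normalize(subject))
    o = aliases.get(normalize(obj))
    if s is None or o is None:
        return None
    return (s, predicate, o)

linked = link_triple(("iPhone  10", "madeBy", "Apple Inc."))
unknown = link_triple(("Pixel", "madeBy", "Google"))
```

Two differently spelled strings now resolve to the same concept, and the unresolved triple is exactly where a human in the loop would step in.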
Even though Diffbot is a startup, I’m putting them in the unicorn category because they are doing something very unique that not everybody needs to do: create a knowledge graph by crawling the web. Effectively, they are competing against Google and offering services that Google doesn’t. Mike Tung, CEO of Diffbot presented:
Great quote:
Josh Shinavier described lessons learned from creating a Knowledge Graph at Uber. Josh also confirmed Airbnb’s comments about why they built their own infrastructure from scratch: they want to reuse the support capabilities that they already have and not bring new software into the mix.
For more details, check out the article Uber’s graph expert bears the scars of billions of trips.
Finally, Denny Vrandecic from Google AI talked about Wikidata. Check out Vivek Khetan’s twitter thread on Denny’s talk.
Virtual Graphs
Capital One, AstraZeneca, Uber and Wells Fargo all publicly stated that they are looking into virtual graphs. This means they want to be able to keep the data in its original source and have a way to virtualize it as a Knowledge Graph.
This is music to my ears because this is what my PhD was all about and the premise on which Capsenta was founded: a NoETL (i.e. virtualize) approach to data integration via semantic/graph technology.
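To illustrate the idea of keeping the data where it is, here is a toy sketch using Python's built-in `sqlite3` module. Triples are produced on demand by querying the relational source through a mapping, and nothing is copied into a separate graph store. The table, the `ex:` identifiers and the mapping are all made up for this example; real systems use mapping languages such as R2RML and rewrite graph queries into SQL.

```python
import sqlite3

def make_source():
    """Stand-in for an existing relational database (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?, ?)",
        [(1, "Ada", "New York"), (2, "Grace", "Austin")],
    )
    return conn

# Hypothetical mapping from graph predicates to relational columns.
MAPPING = {"ex:name": "name", "ex:city": "city"}

def triples(conn, predicate=None):
    """Yield (subject, predicate, object) by querying the source on demand."""
    wanted = [predicate] if predicate else sorted(MAPPING)
    for pred in wanted:
        # The column name comes from the fixed mapping, never from user input.
        column = MAPPING[pred]
        query = f"SELECT id, {column} FROM customer ORDER BY id"
        for row_id, value in conn.execute(query):
            yield (f"ex:customer/{row_id}", pred, value)

conn = make_source()
before = list(triples(conn, "ex:city"))

# The source changes; the virtual graph reflects it with no reload step.
conn.execute("UPDATE customer SET city = 'Boston' WHERE id = 2")
after = list(triples(conn, "ex:city"))
```

The second call sees the updated city immediately, because there was never a copy to go stale.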
I had a lot of discussions during the breaks with other folks about this topic. There is an agreement that moving the data to a centralized location has been the status quo and it’s getting more and more expensive. I’m also glad to see other vendors talking about virtualization such as data.world and Stardog.
Machine Learning
Subhabrata Mukherjee’s talk provided a lot of details into their machine learning process. Take a look at Vivek’s twitter thread.
Alfio Gliozzo from IBM Research discussed how to extend Knowledge Graphs using Distantly Supervised Deep Nets. The challenge: developing hand-labeled data. The ML folks in the audience agreed. Vivek also has a detailed thread on this talk.
Freddy Lecue from Thales discussed explainable AI.
Vision
Given the hype of Machine Learning, Deep Learning, AI, etc., I’ve been asking myself if we will ever automate the creation of Knowledge Graphs. I had a great discussion with Subhabrata Mukherjee on this topic. He thinks that we will get there assuming the source of data is unstructured because there is so much overlapping data within the same domain. On the other hand, when the source is structured data, we both agreed that the future doesn’t look bright. There simply isn’t enough overlapping domain data. As I mentioned in my talk, I never thought that I would be working on methodologies, but we need to empower humans and machines to work together.
We were very lucky to have Pierre Haren, a pioneer in AI and rule systems and founder of ILOG. He spoke about the future evolution of knowledge graphs to causal graphs, where the relationships (edges) are causal.
Personally, I was thrilled to finally meet him and get his input for our upcoming tutorial on the History of Knowledge Graph’s Main Ideas at ISWC2019.
Vicky Froyen discussed where Collibra is heading.
Vendors
We had representatives from many vendors: AllegroGraph/Franz, Amazon Neptune, Data Chemists, Datastax, data.world, GraphDB/Ontotext, Neo4j, Stardog, TigerGraph and yours truly, Capsenta!
I gave a 20 min version of my talk Designing and Building Enterprise Knowledge Graphs from Relational Databases in the Real World (which is an evolution of my previous talk on Integrating Semantic Web in the Real World: A Journey between Two Cities). I’m happy to share that the talk was very well received. Check out this twitter thread.
I was also very thrilled to give demos of Gra.fo, our visual collaborative and real-time knowledge graph schema editor. I love seeing the faces of people when they see Gra.fo for the first time. I am so proud of the entire Capsenta team for developing Gra.fo!
Nav Mathur from Neo4j discussed how they build knowledge graphs.
Jesús Barrasa shared an objective comparison between RDF Graphs and Property Graphs (more later).
Brad Bebee shared lessons learned from Amazon Neptune’s customers.
Bryon Jacob from data.world discussed how they sneak knowledge graphs in front of users without them even knowing about it.
Nasos Kyriakov from Ontotext shared a marketing intelligence use case.
The grand finale was a genuine and honest discussion between all the vendors which I had the honor to moderate.
My takeaway is that there is NOT an RDF graph vs Property Graph “battle”. It was agreed that if your goal is to share data, then use RDF. But that doesn’t stop you from using a property graph. Jesús was very emphatic that you can use Neo4j as the storage model and still support RDF (probably not natively). Jeremy from Datastax shared that with the upcoming TinkerPop 4 you can compile anything into the internals of TinkerPop, be it Cypher or SPARQL. Amazon supports both because their customers want both.
However, some of the RDF folks, like Stardog and Datachemist, are more “pedantic”. Finally, Datachemist is proposing a new graph language which has features that have been well defined in G-CORE and are going into GQL.
I asked each vendor to give a 2-floor elevator pitch to convince the audience to spend time evaluating that vendor’s technology. Basically everybody’s response was the same: just sign up/download our system and try it out.
My takeaway from the panel: we are turning into a fuzzy open warm comfortable graph community. This confirms my takeaway from the W3C Graph workshop.
Final Thoughts
- Word on the street is that people really regret not attending this event.
- Congrats to all the organizers: François, Thomas, Will and all the student collaborators. You all ran an impeccable event!
- Kudos to all the speakers who stayed within the 20 minute slots for their talks.
- Even though the majority were new faces, it was great to see old-timers like Dieter Fensel, Dean Allemang and Sören Auer, renowned figures in the semantic web community.
- Check out all the #kgc2019 tweets
- Beautiful location and the weather was PERFECT!
- Check out Denny’s trip report
- Check out Vivek’s trip report
- Talks were recorded! It will take a while but they will be made public. So stay tuned!
- See you in May 2020 back in NY!
Finally, check out what some of the attendees had to say