This past week was the SIGMOD conference. I’ve always known that there is a lot of overlapping work between the database and semantic web communities, so I have been attending regularly since 2015. The topics I’m most interested in are data integration and graph databases, which are the focus of this trip report.
I was really looking forward to going to Portland, but that was not possible due to Covid. Every time I attend SIGMOD I get to know more and more people, so I was a bit worried that I wouldn’t get the same experience this year. This virtual conference was FANTASTIC. The organizers pulled off a phenomenal event. Everything ran smoothly. Slack enabled deep and thoughtful discussions because people spent time organizing their thoughts. There were social and networking events. Gather was an interesting experiment in simulating real-world hallway conversations. The Zoomside chats were AMA (ask me anything) sessions with researchers on different topics. The Research Zoomtables were relaxed discussions among researchers about research topics. The panels on Women in DB, The Next 5 Years and Startups generated a lot of discussion afterwards on Slack. Oh, and how can we forget the most publicized and popular event of all: Mohan’s Retirement Party.
Science is a social process. That is why I value conferences so much: they are the opportunity to reconnect with many colleagues, meet new folks, discuss ideas and projects, etc. The downside of a virtual conference is that your normal work week continues (unless you decide to disconnect 100% from work during the week). I truly hope that next year we will be back to some sense of normality and we can meet again f2f.
My takeaways from SIGMOD this year are the following:
1) There is still a gap between real world and academia when it comes to data integration. I personally believe that there is science (not just engineering) that needs to be done to bridge this gap.
2) Academia is starting to study data lakes and data catalogs. There is a huge opportunity (see my previous point).
3) There is interest from academia in coming up with novel interfaces to access data. However, this will just be an academic exercise with very little real-world impact if we don’t understand who the user is. To do that, we need to connect more with the real world.
4) Graphs continue to gain more and more traction in industry.
I’m very excited that this community is looking into the needs and features of data catalogs. This topic is dear to my heart because I am the Principal Scientist at data.world, which is the only enterprise data catalog that is cloud-native SaaS with Virtualization and Federation, powered by a Knowledge Graph.
RESEARCH
There was a very interesting Slack discussion about research and the “customer”, sparked by the panel “The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?”.
AnHai Doan commented that the community understands the artificial problems defined in research papers instead of understanding the real problems that customers face. Therefore, there is a need to identify the common use cases (not corner cases) that address 80% of customers’ needs, and to own those problems.
To that, Raul Castro Fernandez pointed out that systems work is disincentivized because reviewers always come back with “just engineering.” Personally, I believe that if there is a clear hypothesis and research question, with experiments that provide evidence to support the hypothesis, then the engineering is also science. Otherwise, it is engineering.
Joe Hellerstein chimed in with spot-on comments that I won’t try to summarize, so here they are verbatim:
“I would never discourage work that is detached from current industrial use; I think it’s not constructive to suggest that you need customers to start down a line of thinking. Sounds like a broadside against pure curiosity-driven research, and I LOVE the idea of pure curiosity-driven research. In fact, for really promising young thinkers, this seems like THE BEST reason to go into research rather than industry or startups“
– Joe Hellerstein in a slack discussion
“What I tend to find often leads to less-than-inspiring work is variant n+1 on a hot topic for large n. What Stonebraker calls “polishing a round ball”.“
“Bottom line, my primary advice to folks is to do research that inspires you.“
“...if you are searching for relevance, you don’t need to have a friend who is an executive at a corporation. Find 30-40 professionals on LinkedIn who might use software like you’re considering, and interview them to find out how they spend their time. Don’t ask them “do you think my idea is cool” (because they’ll almost always say yes to be nice). Ask them what they do all day, what bugs them. I learned this from Jeff Heer and Sean Kandel, who did this prior to our Wrangler research, that eventually led to Trifacta. It’s a very repeatable model that simply requires different legwork than we usually do in our community.“
DATA INTEGRATION
Most of the data integration work continues to be on the topic of data cleaning and data matching/entity matching/entity resolution/…. This makes sense to me because it is an area with a lot of data and, therefore, opportunities for further automation. The following papers are on my to-read list:
- A Comprehensive Benchmark Framework for Active Learning Methods in Entity Matching
- ZeroER: Entity Resolution using Zero Labeled Examples
- Learning Over Dirty Data Without Cleaning (Video)
- Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks (Video) (Code)
Given that data lineage is an important feature of data catalogs, I was keen to attend the Provenance session. At data.world, we represent data lineage as provenance using PROV-O (see the small sketch after the links below). Unfortunately I missed the session and was only able to catch the tail end of the Zoomtable discussion. My biased perception is that the academic discussions on provenance are disconnected from reality when it comes to data integration. I shared the following with a group of folks: “From an industry perspective, data lineage is something that companies ask for from data integration/catalog/governance companies. The state of the art in the industry is to extract lineage from SQL queries, stored procedures, ETL tools and represent this visually. This can now be done. Not much science here IMO. There is a push to get lineage from scripts/code in Java, Python. What is the academic state of the art of reverse engineering Java/Python/… code used for ETLing?“.
Zach Ives responded that there has been progress in incorporating UDFs and certain kinds of ETL operations, with human expertise incorporated, but that he wasn’t aware of work that does this automatically from Java/Python code.
I was pointed to the following, which I need to dig into:
- http://www.cs.iit.edu/~dbgroup/projects/gprom.html
- https://github.com/PierreSenellart/provsql
- http://gems-uff.github.io/noworkflow/
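Since I mentioned PROV-O above, here is a minimal sketch of how table-level lineage can be stated with PROV-O terms using Python’s rdflib. The dataset and job names are hypothetical, and this illustrates the vocabulary only, not data.world’s actual implementation:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # hypothetical dataset/job identifiers

g = Graph()
g.bind("prov", PROV)

# Entities (datasets) and the activity (an ETL job) that connects them
g.add((EX.customers_clean, RDF.type, PROV.Entity))
g.add((EX.customers_raw, RDF.type, PROV.Entity))
g.add((EX.dedup_job, RDF.type, PROV.Activity))

# Lineage edges: the clean table was derived from the raw table by the dedup job
g.add((EX.customers_clean, PROV.wasDerivedFrom, EX.customers_raw))
g.add((EX.customers_clean, PROV.wasGeneratedBy, EX.dedup_job))
g.add((EX.dedup_job, PROV.used, EX.customers_raw))

print(g.serialize(format="turtle"))
```

In this representation, those three properties capture the table-level lineage edges; the hard part, as my quote above suggests, is extracting them automatically from SQL, ETL tools and code.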
As AnHai noted in a Slack discussion, there is still a need to bridge the gap between academia and the real world. Quoting him:
“For example, you said “we understand the problems in entity matching and provenance”. But the truth is: we understand the artificial problems that we define for our research papers. Not the real problems that customers face and we should solve. For instance, a very simple problem in entity matching is: develop an end-to-end solution that uses supervised ML to match two tables of entities. Amazingly, for this simple problem, our field offers very little. We did not solve all pain points for this solution. We do not have a theory on when it works and when it doesn’t. Nor do we have any system that real users can use. And yet this is the very first problem that most customers will face: I want to apply ML to solve entity matching. Can you help me?“
– AnHai Doan in a slack discussion
DATA ACCESS
I’m observing more work that intends to lower the barrier to accessing data via non-SQL interfaces such as natural language, visual interfaces and even speech! The session on “Usability and Natural Language User Interfaces” was my favorite one because the topics were “out of the box” and “curiosity-driven”. I am very intrigued by the QueryVis work, which provides diagrams to help understand complicated SQL queries. I think there is an opportunity here, but the devil is in the details. The SpeakQL paper sparked a lot of discussion. Do we expect people in practice to dictate a SQL query? In the Duoquest paper, the researchers combine the Natural Language Interface approach with Programming-by-Example (PBE), where a user provides a sample query result. I’ve seen this PBE approach over and over in the literature, specifically for schema matching. At a glance it seems an interesting approach, but I do not see the real-world applicability… or at least I’ve never been exposed to a use case where the end user has, and/or is willing to provide, a sample answer. However, I may be wrong about this. This reminds me of AnHai’s comments on corner cases but, at the same time, this is curiosity-driven research.
Papers on my to-read list:
- QueryVis: Logic-based Diagrams help Users Understand Complicated SQL Queries Faster (Video) (More Details)
- Duoquest: A Dual-Specification System for Expressive SQL Queries
- SQLCheck: Automated Detection and Diagnosis of SQL Anti-Patterns
- SpeakQL: Towards Speech-driven Multimodal Querying of Structured Data
- Tutorial: State of the Art and Open Challenges in Natural Language Interfaces to Data
- DEMO: AURORA: Data-driven Construction of Visual Graph Query Interfaces for Graph Databases
Another topic very dear to me is data catalogs. Over the past couple of years I’ve been seeing research on topics such as dataset search/recommendation, join similarity, etc., which are important features for data catalogs. I’m looking forward to digging into these two papers:
- Organizing Data Lakes for Navigation
- Finding Related Tables in Data Lakes for Interactive Data Science
On this topic of data lakes, I’m really really bummed that I missed several keynotes:
- Natasha Noy – When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web
- Awez Syed – The Challenge of Building Effective, Enterprise-scale Data Lakes
- Renee Miller – Data Discovery and Schema Inference Over Data Lakes
I can’t wait to watch the videos.
GRAPHS
If you look at the SIGMOD research program, there are ~20 papers on the general topic of graphs out of ~140 research papers, plus all the papers from the GRADES-NDA workshop. The graph work that I was most attracted to came from industry: Alibaba, Microsoft, IBM, TigerGraph, SAP, Tencent, Neo4j.
I found it intriguing that Alibaba and Tencent are both creating large-scale knowledge graphs to represent and model the common sense of their users. Cyc has been on it for decades. Many researchers believe that this is the wrong approach. But then, 10 years ago, schema.org came out as a high-level ontology that web content producers are adhering to. Now we are seeing these large companies creating knowledge bases (i.e. knowledge graphs) that integrate not just knowledge and data at scale, but also common sense. Talk about “what goes around comes around.”
Every year that I attend SIGMOD, it is a reminder that the database and semantic web communities must talk to each other more and more. Case in point: IBM presented Db2 Graph, where they retrofit graph queries (the Property Graph model and TinkerPop) on top of relationally stored data. I need to dig into this work, but I have the suspicion that it overlaps with work from the semantic web community. For example, Ultrawrap, Ontop, Morph, among others, are systems that execute SPARQL graph queries on relational databases (note: Ultrawrap was part of my PhD and the foundation for my company Capsenta, which was acquired by data.world last year). There are even W3C standards to map relational data to RDF graphs (i.e. Direct Mapping and R2RML). Obviously, the semantic web community has studied these problems from the perspective of RDF graphs and SPARQL. Nevertheless, it’s all just a graph, so the work overlaps. In the spirit of cross-communication, I was thrilled to see Katja Hose’s keynote at GRADES-NDA, where she presented work from the semantic web community such as SPARQL Federation, Linked Data Fragments, etc.
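To make the overlap concrete, here is a toy sketch of the direct-mapping idea (one subject IRI per row keyed by its primary key, one predicate per column) in Python with rdflib. The employee table and base IRI are hypothetical, and this is only an illustration of the mapping pattern, not how Ultrawrap, Ontop, Morph or any R2RML processor is actually implemented:

```python
import sqlite3
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

BASE = Namespace("http://example.org/db/")  # hypothetical base IRI

# An in-memory table standing in for a relational source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("INSERT INTO employee VALUES (7, 'Ada', 'Research')")

g = Graph()
for emp_id, name, dept in conn.execute("SELECT id, name, dept FROM employee"):
    # Row -> subject IRI keyed by the primary key; column -> predicate; cell -> literal
    subject = BASE[f"employee/id={emp_id}"]
    g.add((subject, RDF.type, BASE["employee"]))
    g.add((subject, BASE["employee#name"], Literal(name)))
    g.add((subject, BASE["employee#dept"], Literal(dept)))

print(g.serialize(format="turtle"))
```

Systems like the ones above do this virtually, rewriting SPARQL into SQL over the original tables instead of materializing triples, but the underlying correspondence between rows/columns and a graph is the same.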
Another topic that was brought up by Semih Salihoglu was the practical use of graph analytics algorithms. This discussion was sparked by the paper “Graph Based Benchmark Suite.” It was very neat to learn that Neo4j has actually started to categorize the graph algorithms that are used in practice. In their graph data science library, algorithms exist within three tiers: production-quality, beta and alpha. These tiers serve as proxies for what is being used in the real world.
Papers on my to-read list:
- AliCoCo: Alibaba E-commerce Cognitive Concept Net (Industry)
- A1: A Distributed In-Memory Graph Database (Industry)
- IBM Db2 Graph: Supporting Synergistic and Retrofittable Graph Queries Inside IBM Db2 (Industry) (Video: https://www.youtube.com/watch?v=CwT_898Zkzk&feature=youtu.be)
- An Ontology-Based Conversation System for Knowledge Bases (Industry)
- Aggregation Support for Modern Graph Analytics in TigerGraph (Industry)
- GIANT: Scalable Creation of a Web-scale Ontology (Industry)
- On the Optimization of Recursive Relational Queries: Application to Graph Queries (Research)
- SHARQL: Shape Analysis of Recursive SPARQL Queries (Demo)
WHO IS THE USER?
A topic that came up in the “Next 5 Years” panel was the need for results to be “used in the real world” and for tools to be “easy to use”. This is inevitable in research because the opposite would be absurd (doing research so that it is used in an artificial world and is hard to use). I believe that a missing link between “used in the real world” and “easy to use” is to understand the USER. I also believe it is paramount that the database research community understands who the users in the real world are. It’s not just data scientists. We have data engineers, data stewards, data analysts, BI developers, knowledge scientists, product managers, business users, etc. I believe that we need to look at data integration and the general data management problem not just from a technical point of view (which is what the database community has been doing for 20+ years), but from a social aspect: understanding the users, the processes, and how they are connected using end-to-end technology solutions. This takes us out of our comfort zone, but this is what is going to move the needle in order to maximize the impact.
For the past year, I’ve been advocating to research the phenomenon of data integration from a socio-technical angle (see my guest lecture at the Stanford Knowledge Graph course), to provide methodologies for creating ontologies and mappings, and to define the new role of the Knowledge Scientist.
Joe Hellerstein provided another great comment during our slack discussion:
“Building data systems for practitioners who know the data but not programming (current case in point—public health researcher) is a huge challenge that we largely have a blindspot for in SIGMOD. To fix that blindspot we should address it directly. And educate our students about the data analysis needs of users outside of the programmer community.“
– Joe Hellerstein in a slack discussion
While watching the presentations of the “Usability and Natural Language User Interfaces” session, I kept asking myself: who is the user? What are the characteristics that define that user? Sometimes this is well defined, sometimes it is not. Sometimes it is connected with the real world, sometimes it is not.
The HILDA workshop and community are addressing this area and I’m very excited to get more involved. All the HILDA papers are on my to-read list. I’m leaving with a very long list of papers to read and new connections.
Thanks again to the organizers for an amazing event. Can’t wait to see what happens next year.
Additional notes:
- Congrats to all the award winners.
- Zeyuan Hu’s SIGMOD trip report
- Eugenie Lai’s SIGMOD trip report
Oh, and a final reminder to students: