Fueled by advances in information extraction and by societal trends that value institutional openness and transparency, structured data are being produced and shared at an overwhelming speed. Open data sharing is central to supporting institutional transparency, but transparency is not achieved if shared data cannot be found and effectively aligned with other data being studied by data scientists, journalists, and others. This project will fundamentally contribute to the new science of open data sharing by laying the theoretical foundations of data discovery (identification, alignment, and integration of tables) within table repositories. It will contribute both to developing the right conceptual framework for studying this problem and to designing systems that solve the table discovery and alignment problems at scale. The key focus is the development of an approach to table discovery that finds both a set of alignable tables and the best way to integrate (or align) the new data with a query table. In this new paradigm, called "table-as-query", the user does not need to know a priori on which attributes the various tables in a repository are best aligned.

Our search paradigm: Table-as-query extends a query table with an ideal subset of tables from a repository.

People & Affiliations

PhD students & Postdocs
Aristotelis Leventidis (PhD student)
Laura Di Rocco (Post-Doc)
Nikolaos Tziavelis (PhD student)
Senior Personnel
Renée Miller (PI)
Mirek Riedewald (Co-PI)
Northeastern University, Khoury College of Computer Sciences

Publications

DomainNet: Homograph Detection for Data Lake Disambiguation
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
(Best Paper Award)

Best Paper Award Citation: The paper presents DomainNet, a system that disambiguates values from heterogeneous datasets by creating a network representing co-occurring values and computing their graph centrality. The system is unsupervised, its accuracy outperforms the state-of-the-art, and it is accompanied by an open benchmark. The paper is of high significance: the problem is important, the proposed solution is effective, and the benchmark facilitates further research.

Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries
Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
PODS, 2021

Funding

This work has been supported in part by the National Science Foundation (NSF) under award number IIS-1956096. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
