Table-as-Query: Unifying Data Discovery and Alignment

Fueled by advances in information extraction and societal trends that value institutional openness and transparency, structured data are being produced and shared at an overwhelming speed. Open data sharing is central to supporting institutional transparency, but transparency is not achieved if shared data cannot be found and effectively aligned with other data being studied by data scientists, journalists, and others. This project will contribute to the new science of open data sharing by laying the theoretical foundations of data discovery (identification, alignment, and integration of tables) within table repositories. It will contribute both to developing the right conceptual framework for studying this problem and to designing systems that solve the table discovery and alignment problems at scale. The key focus is the development of an approach to table discovery that finds both a set of alignable tables and the best way to integrate (or align) the new data with a query table. In this new paradigm, called "table-as-query", the user does not need to know a priori on which attributes the various tables in a repository are best aligned.

Our search paradigm: Table-as-query extends a query table with an ideal subset of tables from a repository.
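
To make the paradigm concrete, here is a minimal sketch of how a table-as-query search could look. It is illustrative only and is not the project's method: the Jaccard overlap score, the function names (`best_alignment`, `table_as_query`), and the toy tables are all assumptions introduced for this example. The point it shows is that the user supplies only a query table, and the alignment attributes are discovered rather than specified.

```python
from typing import Dict, List, Tuple

# Hypothetical table representation: column name -> list of cell values.
Table = Dict[str, List[str]]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_alignment(query: Table, candidate: Table) -> Tuple[float, str, str]:
    """Return (score, query_column, candidate_column) for the best-overlapping pair."""
    best = (0.0, "", "")
    for q_col, q_vals in query.items():
        for c_col, c_vals in candidate.items():
            score = jaccard(set(q_vals), set(c_vals))
            if score > best[0]:
                best = (score, q_col, c_col)
    return best


def table_as_query(query: Table, repository: Dict[str, Table], k: int = 3):
    """Rank repository tables by their best column alignment with the query table."""
    ranked = []
    for name, table in repository.items():
        score, q_col, c_col = best_alignment(query, table)
        if score > 0:  # keep only tables with some alignable column
            ranked.append((score, name, q_col, c_col))
    ranked.sort(key=lambda r: (-r[0], r[1]))  # best score first, name breaks ties
    return ranked[:k]


# Toy data (assumed for illustration).
query = {"city": ["Boston", "Toronto", "Berlin"], "population": ["650k", "2.9M", "3.6M"]}
repository = {
    "airports": {"code": ["BOS", "YYZ", "BER"], "location": ["Boston", "Toronto", "Berlin"]},
    "rivers": {"name": ["Charles", "Spree"], "flows_through": ["Boston", "Berlin"]},
    "colors": {"name": ["red", "green"]},
}
results = table_as_query(query, repository)
```

In this toy repository, the query never names a join attribute, yet the search pairs `city` with `location` in `airports` and with `flows_through` in `rivers`, and drops `colors` because no column overlaps. A real system would need a more robust alignment score and an index to work at repository scale.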

People & Affiliations

Aristotelis Leventidis (PhD student)

Laura Di Rocco (Post-Doc)

Nikolaos Tziavelis (PhD student)

Renée Miller (PI)

Ricardo Baeza-Yates

Wolfgang Gatterbauer (Co-PI)

Mirek Riedewald (Co-PI)

Publications

DomainNet: Homograph Detection for Data Lake Disambiguation

Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

EDBT, pp. 13-24, 2021

doi | op | preprint | video (10min) | arXiv:2103.09940 | code | datasets | gs | bib

(best paper award, announcement)

Best Paper Award Citation: The paper presents DomainNet, a system that disambiguates values from heterogeneous datasets by creating a network representing co-occurring values and computing their graph centrality. The system is unsupervised, its accuracy outperforms the state-of-the-art, and it is accompanied by an open benchmark. The paper is of high significance: the problem is important, the proposed solution is effective, and the benchmark facilitates further research.

Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries

Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald

PODS, 2021

doi | preprint | arXiv:2012.11965 | gs

Funding

This work has been supported in part by the National Science Foundation (NSF) under award number IIS-1956096. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.