Fueled by advances in information extraction and societal trends that value institutional openness and transparency, structured data are being produced and shared at an overwhelming speed. Open data sharing is central to supporting institutional transparency, but transparency is not achieved if shared data cannot be found and effectively aligned with other data being studied by data scientists, journalists, and others. This project will contribute to the new science of open data sharing by laying the theoretical foundations of data discovery (the identification, alignment, and integration of tables) within table repositories. It will contribute both to developing the right conceptual framework for studying this problem and to designing systems that solve the table discovery and alignment problems at scale. The key focus is an approach to table discovery that finds both a set of alignable tables and the best way to integrate (or align) the new data with a query table. In this new paradigm, called "table-as-query", the user does not need to know a priori on which attributes the tables in a repository are best aligned.

Our search paradigm: Table-as-query extends a query table with an ideal subset of tables from a repository.
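
To make the table-as-query idea concrete, here is a minimal sketch. It is not the project's method: it assumes that alignability can be approximated by the Jaccard overlap of column values, and the names `Table`, `jaccard`, `best_alignment`, and `table_as_query` are hypothetical, introduced only for illustration. The point it shows is that the user supplies a whole table and the search itself proposes which attributes to align on.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory table model: column name -> list of cell values.
Table = Dict[str, List[str]]


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two value sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_alignment(query: Table, candidate: Table) -> Tuple[float, Optional[str], Optional[str]]:
    """Return (score, query_column, candidate_column) for the column pair
    with the highest value overlap. Assumption: overlap stands in for alignability."""
    best: Tuple[float, Optional[str], Optional[str]] = (0.0, None, None)
    for q_col, q_vals in query.items():
        for c_col, c_vals in candidate.items():
            score = jaccard(set(q_vals), set(c_vals))
            if score > best[0]:
                best = (score, q_col, c_col)
    return best


def table_as_query(query: Table, repository: Dict[str, Table], k: int = 1):
    """Rank repository tables by their best alignment with the query table.
    Tables with no overlapping column are dropped."""
    results = []
    for name, table in repository.items():
        score, q_col, c_col = best_alignment(query, table)
        if q_col is not None:
            results.append((name, q_col, c_col, score))
    results.sort(key=lambda r: r[3], reverse=True)
    return results[:k]


# Toy data, invented for this example.
query = {
    "city": ["Boston", "Toronto", "Chicago"],
    "population": ["650000", "2800000", "2700000"],
}
repository = {
    "transit": {"municipality": ["Boston", "Chicago", "Denver"], "lines": ["4", "8", "2"]},
    "animals": {"species": ["jaguar", "puma"], "habitat": ["forest", "mountain"]},
}
top = table_as_query(query, repository, k=1)
print(top)  # [('transit', 'city', 'municipality', 0.5)]
```

In this toy run the user never states that `city` should be matched with `municipality`; the search proposes that pairing itself. A system working at repository scale would need indexing and a more robust alignability measure than this exhaustive pairwise comparison.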

People & Affiliations

PhD students & Postdocs
Aristotelis Leventidis (PhD student)
Laura di Rocco (Postdoc)
Senior Personnel
Renée Miller (PI)
Mirek Riedewald (Co-PI)
Northeastern University, Khoury College of Computer Sciences


DomainNet: Homograph Detection for Data Lake Disambiguation
Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
EDBT 2021.


This work has been supported in part by the National Science Foundation (NSF) under award number IIS-1956096. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
