What is Reverse Data Management (RDM)?

Reverse Data Management problems are an important, well-studied class of problems in database literature. Unlike traditional Forward Data Management, where the focus is on applying transformations (queries) on data to produce results, Reverse Data Management involves determining the changes needed to achieve a desired outcome post transformation.

In other words, Reverse Data Management asks, "What modifications (interventions) should be made to the input data so that when a given transformation (query) is applied to it, the resulting output satisfies specific conditions?"

Decades of research on traditional data managment has lead to the modern Query Optimizer, which can efficiently find the optimal execution plans for a variety of different Data Management problems. However, no similar solution exists for Reverse Data Management problems. The known algorithms and tractability criteria for different RDM problems are significantly different. Furthermore, in cases of bag semantics or queries with self-joins, many computational complexity questions are still open.

Intersection.png

What is Unified Reverse Data Management?

We discuss a novel approach for solving these problems: instead of creating dedicated algorithms for easy (PTIME) and hard (NP-complete) cases, we suggest using a unified algorithm that is guaranteed to terminate in PTIME for all easy cases. Our algorithm is also unified in that it can solve both previously studied restrictions and new cases (e.g., CQs with self-joins under set or bag semantics).

Our approach opens up the door to new variants and new fine-grained analysis: we have discovered new tractable cases for the problem of minimal factorization of provenance formulas as well as dichotomies under bag semantics for the problems of resilience and causal responsibility.

Intersection.png

Papers

A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations
Neha Makhija, Wolfgang Gatterbauer
Proposes a unified approach for solving reverse data management problems: all cases (including self-joins, bags, PTIME or NP-complete cases) are solved with the same algorithm. The algorithm just "happens" to terminate in PTIME for all known PTIME cases. We apply our techniques with sucess to the problems of resilience and causal responsibility, and envision that this approach can be applied to other problems in reverse data management - including various interventions for fairness and explanations. Also suggests a non-trivial Disjunctive Logic Program that can automatically find "hardness certificates" (think gadgets) for the resilience problem over conjunctive queries.
Minimally Factorizing the Provenance of Self-join Free Conjunctive Queries
Neha Makhija, Wolfgang Gatterbauer
Proposes the problem of finding the minimal-size factorization of the provenance of queries. Shows it to be a natural generalization ofand to recover all known PTIME cases of the problem of exact probabilistic inference. Makes first steps towards a dichotomy of the data complexity of the problem.
Discovering Dichotomies for Problems in Database Theory
Neha Makhija
VLDB 2023 PhD Workshop

Video tutorial and Key ideas

The recordings of our conference talks at SIGMOD 2024 and PODS 2024 give an easy and intuitive introduction to the key ideas.

The SIGMOD 2024 talk is an overview of the unified approach and its application to the problems of resilience and causal responsibility. The PODS 2024 talk is a overview of the minimal factorization of provenance formulas and how we can apply the unified approach for this problem as well.

SIGMOD 2024:

PODS 2024:

Authors from the DATA Lab at Northeastern University

Northeastern University, Khoury College of Computer Sciences

Funding

This work has been supported in part by the National Science Foundation (NSF) under award numbers IIS-1762268 and IIS-1956096, and conducted in part while the authors were visiting the Simons Institute for the Theory of Computing. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the Funding Agencies.

National Science Foundation Simons Institute for the Theory of Computing