Unified Reverse Data Management

Reverse Data Management (RDM) problems are an important, well-studied class of problems in the database literature. Unlike traditional Forward Data Management, which applies transformations (queries) to data to produce results, Reverse Data Management determines the changes needed to achieve a desired outcome after the transformation.

In other words, Reverse Data Management asks, "What modifications (interventions) should be made to the input data so that when a given transformation (query) is applied to it, the resulting output satisfies specific conditions?"
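As a minimal sketch of this question, consider resilience, one of the RDM problems studied in the papers listed on this page: what is the smallest number of input tuples to delete so that a Boolean query becomes false? The relations `R` and `S`, the query, and the brute-force search below are illustrative assumptions, not the algorithm from the papers.

```python
from itertools import combinations

# Hypothetical toy database: two relations stored as sets of tuples.
R = {(1, 2), (3, 2), (4, 5)}
S = {(2,), (5,)}

def query_holds(r, s):
    """Forward direction: Boolean query q :- R(x, y), S(y)."""
    return any((y,) in s for (_, y) in r)

def resilience(r, s):
    """Reverse direction: fewest tuple deletions that make q false.

    Brute force over all deletion sets, smallest first (exponential time,
    for illustration only).
    """
    tuples = [("R", t) for t in sorted(r)] + [("S", t) for t in sorted(s)]
    for k in range(len(tuples) + 1):
        for deleted in combinations(tuples, k):
            r_left = r - {t for name, t in deleted if name == "R"}
            s_left = s - {t for name, t in deleted if name == "S"}
            if not query_holds(r_left, s_left):
                return k, deleted

k, deleted = resilience(R, S)
print(k, deleted)
```

The forward question only evaluates `query_holds(R, S)`. The reverse question searches over interventions on the input, which is what makes these problems harder in general.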

Decades of research on traditional data management have led to the modern query optimizer, which can efficiently find optimal execution plans for a wide variety of data management problems. However, no similar solution exists for Reverse Data Management problems. The known algorithms and tractability criteria differ significantly from one RDM problem to another. Furthermore, under bag semantics or for queries with self-joins, many computational complexity questions are still open.

What is Unified Reverse Data Management?

We discuss a novel approach for solving these problems: instead of creating dedicated algorithms for easy (PTIME) and hard (NP-complete) cases, we suggest using a unified algorithm that is guaranteed to terminate in PTIME for all easy cases. Our algorithm is also unified in that it can solve both previously studied restrictions and new cases (e.g., CQs with self-joins under set or bag semantics).

Our approach opens up the door to new variants and new fine-grained analysis: we have discovered new tractable cases for the problem of minimal factorization of provenance formulas as well as dichotomies under bag semantics for the problems of resilience and causal responsibility.

Papers

Is Integer Linear Programming All You Need for Deletion Propagation? A Unified and Practical Approach for Generalized Deletion Propagation

Neha Makhija, Wolfgang Gatterbauer

VLDB 2025

arxiv:2411.17603 | gs

Unifies various instances of Deletion Propagation (and the smallest witness problem, which was so far seen as fundamentally *different*) in a single generalized framework that (1) not only captures all prior deletion propagation variants, but (2) also solves all prior known tractable cases in guaranteed PTIME, and (3) even introduces a whole family of new and well-motivated problems.

Resilience for Regular Path Queries: Towards a Complexity Classification

Antoine Amarilli, Wolfgang Gatterbauer, Neha Makhija, Mikaël Monet

PACMMOD, 3(2), Article 108, June 2025. (PODS'25)

ACM | arxiv:2412.09411 | Code: Computational Verification of Hardness Gadgets

Studies the computational complexity of the resilience problem for Regular Path Queries (RPQs) over graph databases. Shows that resilience for so-called "local languages" can be computed in PTIME (in combined complexity) and that the problem is NP-complete for languages that are "four-legged" or that are finite and have repeated letters.
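To make the problem statement concrete, here is a small sketch of resilience for a path query over an edge-labeled graph: the fewest edges to remove so that no walk spells a word of the language. The graph, the finite language `{"ab"}`, unit edge costs, and the brute-force search are assumptions chosen for illustration; the paper's algorithms and its language classes are not reproduced here.

```python
from itertools import combinations

# Hypothetical edge-labeled graph: (source, label, target).
edges = [(1, "a", 2), (3, "a", 2), (2, "b", 4)]
language = {"ab"}  # finite, non-empty words only (assumed for this sketch)

def has_match(edge_list, words):
    """True if some walk in the graph spells a word of the language."""
    def walk(node, rest):
        if not rest:
            return True
        return any(
            walk(target, rest[1:])
            for (source, label, target) in edge_list
            if source == node and label == rest[0]
        )
    nodes = {s for s, _, _ in edge_list} | {t for _, _, t in edge_list}
    return any(walk(n, w) for n in nodes for w in words)

def rpq_resilience(edge_list, words):
    """Fewest edge deletions after which no walk matches (brute force)."""
    for k in range(len(edge_list) + 1):
        for removed in combinations(edge_list, k):
            remaining = [e for e in edge_list if e not in removed]
            if not has_match(remaining, words):
                return k

print(rpq_resilience(edges, language))
```

In this instance both "a" edges lead into the single "b" edge, so removing that one edge breaks every match.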

A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations

Neha Makhija, Wolfgang Gatterbauer

PACMMOD, 1(4), Article 228, December 2023. (SIGMOD'24)

Proposes a unified approach for solving reverse data management problems: all cases (including self-joins, bags, PTIME or NP-complete cases) are solved with the same algorithm. The algorithm just "happens" to terminate in PTIME for all known PTIME cases. We successfully apply our techniques to the problems of resilience and causal responsibility, and envision that this approach can be applied to other problems in reverse data management, including various interventions for fairness and explanations. Also suggests a non-trivial Disjunctive Logic Program that can automatically find "hardness certificates" (think gadgets) for the resilience problem over conjunctive queries.
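The idea of encoding resilience as an integer program can be sketched as follows: introduce one 0/1 variable per input tuple (1 means "delete") and one constraint per witness of the query, requiring that at least one tuple of that witness is deleted. This is a hitting-set style encoding written from the description above. The witness list is hypothetical, the paper's exact formulation may differ, and the exhaustive enumeration is only a stand-in for a real ILP solver.

```python
from itertools import product

# Hypothetical witnesses of a Boolean query: each witness is the set of
# input tuples that together produce one query result.
witnesses = [
    {"R(1,2)", "S(2)"},
    {"R(3,2)", "S(2)"},
    {"R(4,5)", "S(5)"},
]
variables = sorted(set().union(*witnesses))

def is_feasible(x):
    """Constraint: every witness loses at least one tuple (sum of x >= 1)."""
    return all(sum(x[t] for t in w) >= 1 for w in witnesses)

def solve_by_enumeration():
    """Minimize the number of deleted tuples over all 0/1 assignments.

    Stand-in for an ILP solver; exponential in the number of tuples.
    """
    best = None
    for bits in product((0, 1), repeat=len(variables)):
        x = dict(zip(variables, bits))
        if is_feasible(x) and (best is None or sum(bits) < sum(best.values())):
            best = x
    return best

solution = solve_by_enumeration()
print(sum(solution.values()), [t for t, v in solution.items() if v == 1])
```

Allowing each variable to take any value in [0, 1] instead of {0, 1} gives the LP relaxation of the same program, which a linear programming solver can handle in polynomial time.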

Minimally Factorizing the Provenance of Self-join Free Conjunctive Queries

Neha Makhija, Wolfgang Gatterbauer

PACMMOD, 2(2), Article 104, May 2024. (PODS'24)

ACM | talk @ PODS 2024 (19min) | slides (June 2024 @ PODS) | arXiv:2105.14307 (long version)

Proposes the problem of finding the minimal-size factorization of the provenance of queries. Shows that it is a natural generalization of the problem of exact probabilistic inference and that it recovers all known PTIME cases of that problem. Makes first steps towards a dichotomy of the data complexity of the problem.
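A tiny example may help picture what a factorization is: the provenance `r1*s1 + r1*s2` of two witnesses sharing the tuple `r1` can be rewritten as `r1*(s1 + s2)`, which mentions fewer variables. The sketch below checks that the two formulas are equivalent and compares their sizes. The formula encoding and the size measure (number of variable occurrences) are assumptions for illustration, and the code does not search for a minimal factorization.

```python
from itertools import product

# A formula is a variable name (str) or a tuple ("and" | "or", child, ...).
dnf = ("or", ("and", "r1", "s1"), ("and", "r1", "s2"))
factorized = ("and", "r1", ("or", "s1", "s2"))

def evaluate(formula, assignment):
    """Evaluate a formula under a truth assignment {variable: bool}."""
    if isinstance(formula, str):
        return assignment[formula]
    op, *children = formula
    values = [evaluate(c, assignment) for c in children]
    return all(values) if op == "and" else any(values)

def size(formula):
    """Number of variable occurrences (leaves) in the formula."""
    if isinstance(formula, str):
        return 1
    return sum(size(c) for c in formula[1:])

def equivalent(f, g, variables):
    """Compare f and g on every truth assignment of the variables."""
    return all(
        evaluate(f, dict(zip(variables, bits))) == evaluate(g, dict(zip(variables, bits)))
        for bits in product((False, True), repeat=len(variables))
    )

print(equivalent(dnf, factorized, ["r1", "s1", "s2"]), size(dnf), size(factorized))
```

Here every variable occurs exactly once in the factorized form, so no equivalent formula over the same variables can be smaller under this size measure.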

Discovering Dichotomies for Problems in Database Theory

Neha Makhija

VLDB 2023 PhD Workshop

VLDB PhD Workshop | arXiv:2308.13172

Video tutorial and Key ideas

The recordings of our conference talks at SIGMOD 2024 and PODS 2024 give an easy and intuitive introduction to the key ideas.

The SIGMOD 2024 talk is an overview of the unified approach and its application to the problems of resilience and causal responsibility. The PODS 2024 talk is an overview of the minimal factorization of provenance formulas and of how the unified approach applies to this problem as well.

Authors from the DATA Lab at Northeastern University

Funding

This work has been supported in part by the National Science Foundation (NSF) under award numbers IIS-1762268 and IIS-1956096, and conducted in part while the authors were visiting the Simons Institute for the Theory of Computing. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the Funding Agencies.
