Distributinator: Scalable Big-Data Analytics

We design novel distributed algorithms for data-intensive computations in the cloud, asking questions such as how input data should be partitioned and assigned to worker machines, and how much input each worker must receive.

We have focused on SQL queries involving joins, but we are working on extensions to other applications and problem types, including graphs and data lakes.

Problem Overview

While we have explored a variety of data-intensive computation problems, the discussion here is limited to selected highlights on the distributed computation of joins. The join operator is arguably the most important building block of any non-trivial SQL query. The join computation pattern also appears in many other contexts, including graph-pattern search, exploratory analysis, and correlation analysis. For instance, finding a path of length 2 in a graph is equivalent to a self-join of the edge set, selecting all pairs of edges with matching endpoints (an equality condition between the "to" node of the first edge and the "from" node of the second). Note that we are not arguing for or against using a relational DBMS for graph analytics; we are simply interested in the computational pattern and take a fresh look at its distributed implementation.
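To make the join pattern concrete, here is a minimal, self-contained sketch (with made-up data and helper names of our choosing) of finding length-2 paths via a hash self-join of an edge list:

```python
from collections import defaultdict

# Hypothetical toy edge list; a path of length 2 is a pair of edges where the
# first edge's "to" node equals the second edge's "from" node.
edges = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]

# Hash join of the edge set with itself: index edges by their "from" node,
# then probe the index with every edge's "to" node.
by_from = defaultdict(list)
for frm, to in edges:
    by_from[frm].append((frm, to))

paths = [(e1, e2) for e1 in edges for e2 in by_from[e1[1]]]
print(paths)
# [(('a', 'b'), ('b', 'c')), (('a', 'b'), ('b', 'd')), (('b', 'c'), ('c', 'd'))]
```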

So, let's dive right in and explore a specific type of (expensive) join: the Cartesian product aka cross product. Given two input tables S and T, it returns all pairs (s,t) of tuples s from S and t from T. Assume we have two worker machines w1 and w2. How should we partition S and T?

Let's cut S in half into S1 and S2, and assign S1 to w1 and S2 to w2. What happens to T? Since all S-tuples have to be paired up with all T-tuples, we must copy T and send it to both workers! The figure below illustrates this idea for 4 workers. (The join matrix represents the space of all possible pairs of S- and T-tuples.)

Partition S, broadcast T

What if we have 100 workers w1 to w100? We can cut S into 100 equal pieces and distribute them evenly over the 100 workers. Unfortunately, we then need 100 copies of T to be able to compute the full Cartesian product. This approach is the classic partition-broadcast: S is partitioned, while T has to be broadcast.
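A minimal sketch of partition-broadcast, with a hypothetical helper name and toy data (not code from our systems):

```python
# Sketch of partition-broadcast: S is split into one chunk per worker,
# while T is copied to every worker.
def partition_broadcast(S, T, num_workers):
    chunk = (len(S) + num_workers - 1) // num_workers  # ceil(|S| / num_workers)
    return [(S[w * chunk:(w + 1) * chunk], T) for w in range(num_workers)]

S, T = list(range(8)), ["x", "y"]
for w, (s_part, t_copy) in enumerate(partition_broadcast(S, T, 4)):
    print(f"worker {w}: S-part={s_part}, T-copy={t_copy}")
# Each worker computes s_part x t_copy locally; together these local products
# cover the full Cartesian product because every S-tuple meets every T-tuple.
```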

Can we do better? What if we cut S into 10 equal-sized pieces S1,..., S10 and do the same for T? How do we assign these S and T partitions to the 100 workers? Consider the following: the first 10 workers all receive S1, but each gets a different partition of T; the next 10 workers all receive S2, again each with a different partition of T; and so on. Each worker then computes the local Cartesian product of the S and T partitions it received. This is illustrated below for a 4-worker scenario.

Symmetric partitioning of S and T into two partitions each

Will this correctly compute the Cartesian product of S and T? This is easy to prove. Consider any output tuple (s,t). Clearly, s has to be in some partition Si of S, and t in some partition Tj of T. The way we assigned partitions guarantees that exactly one worker received both Si and Tj. That worker will output (s,t), and no other worker can produce it. (Nine other workers received Si, but none of them received Tj; and vice versa.)
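The same argument can be checked mechanically. Below is a sketch of the grid scheme for r-by-c partitioning (helper names are ours, not from the papers); the assertion verifies that the union of the local products is exactly the full Cartesian product, with no duplicates:

```python
from itertools import product

# Grid scheme for w = r * c workers: the worker in cell (i, j) receives
# S-partition i and T-partition j, so every (Si, Tj) pair lands on exactly
# one worker.
def grid_join(S, T, r, c):
    s_parts = [S[i::r] for i in range(r)]  # r roughly equal S-partitions
    t_parts = [T[j::c] for j in range(c)]  # c roughly equal T-partitions
    out = []
    for i, j in product(range(r), range(c)):          # one cell per worker
        out.extend(product(s_parts[i], t_parts[j]))   # local Cartesian product
    return out

S, T = list(range(6)), list("abcd")
# Full product, each pair produced exactly once:
assert sorted(grid_join(S, T, 2, 2)) == sorted(product(S, T))
```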

Is this really better than partition-broadcast? Partition-broadcast uses 1 copy of S and 100 copies of T, assigning 0.01|S| + |T| input data to each worker. (We use |X| to denote the size of set X; for simplicity we do not distinguish between the size and the cardinality of a set here.) The 10-by-10 scheme assigns 0.1|S| + 0.1|T| per worker. The table below compares per-worker input for different relative sizes of S and T. Interestingly, neither scheme dominates the other: depending on the relative sizes of S and T, one or the other may assign up to an order of magnitude less input per worker!

Size of S | Size of T | Worker input (partition-broadcast) | Worker input (10-by-10) | Ratio (partition-broadcast / 10-by-10)
n         | n         | 1.01n                              | 0.2n                    | 5.05
n         | 1000n     | 1000.01n                           | 100.1n                  | 9.99
1000n     | n         | 11n                                | 100.1n                  | 0.11
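The table's arithmetic is easy to check; a few lines of Python (sizes expressed in units of n, 100 workers assumed) reproduce all three rows:

```python
# Per-worker input under both schemes, for 100 workers.
def per_worker_input(size_s, size_t):
    pb = size_s / 100 + size_t        # partition-broadcast: 1/100 of S, all of T
    grid = size_s / 10 + size_t / 10  # 10-by-10 grid: 1/10 of S, 1/10 of T
    return pb, grid

for size_s, size_t in [(1, 1), (1, 1000), (1000, 1)]:  # in units of n
    pb, grid = per_worker_input(size_s, size_t)
    print(f"|S|={size_s}n, |T|={size_t}n: PB={pb}n, grid={grid}n, ratio={pb / grid:.2f}")
```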

Result Highlights

We discuss selected highlights; for details, check out our papers below. Many challenges remain in this exciting research area, and we are looking for PhD students and postdocs to explore them!

Near-optimal distributed theta-joins: The theta-join is defined as any subset of the Cartesian product, identified by a Boolean condition that returns true or false for a given pair (s,t) of tuples from S and T, respectively. Starting from the above observations for the Cartesian product, we developed a variety of strong theoretical and empirical results.
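Conceptually, a theta-join is the Cartesian product filtered by a condition, which is why any scheme that covers the join matrix (such as the grid scheme above) is also correct for theta-joins; the interesting question is how much of the matrix can be skipped. A toy illustration (made-up data, inequality condition):

```python
from itertools import product

# A theta-join is the Cartesian product filtered by a Boolean condition theta.
def theta_join(S, T, theta):
    return [(s, t) for s, t in product(S, T) if theta(s, t)]

print(theta_join([1, 5, 9], [2, 6], lambda s, t: s < t))
# [(1, 2), (1, 6), (5, 6)]
```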

Near-optimal distributed equi-joins: The equi-join is arguably the most common type of theta-join. In relational DBMS it is frequently used to join tables based on foreign keys; in graph DBMS it appears in the context of pattern search. It has received a lot of attention in the research community, but we were still able to make exciting new contributions.

Typical partitioning for ExpVar

Near-optimal band-joins: A band-join matches tuples whose join-attribute values differ by at most a given band width, i.e., s joins t when |s.A - t.A| <= eps. Band-joins therefore generalize both the equi-join (band width zero) and the Cartesian product (unbounded band width), but the 1-Bucket algorithm is suboptimal for band-joins with relatively small output.
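A toy illustration of the band-join condition (hypothetical data and band width): with eps = 0 it degenerates to an equi-join, while a sufficiently large eps yields the full Cartesian product.

```python
# Band-join as a special theta-join (made-up data, band width eps = 2):
eps = 2
S = [1, 4, 10]
T = [2, 5, 20]
band = [(s, t) for s in S for t in T if abs(s - t) <= eps]
print(band)  # [(1, 2), (4, 2), (4, 5)]
```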

Band-join partitioning approaches

Different partitioning strategies for band-joins are shown in the figure above. 1-Bucket and CSIO partition the join matrix, in which rows correspond to S-tuples and columns to T-tuples. Grid and RecPart partition the space defined by the join attributes.

RecPart near-optimality

Flexible, easy-to-train distributed cost models: Estimating the end-to-end running time of a distributed computation is notoriously difficult. Modern machine-learning approaches have been explored, but they require large amounts of training data, which in turn requires executing large benchmarks for profiling.
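To picture the general idea, here is a deliberately simplified sketch (not the formulation from our papers, and with made-up numbers): model running time as a linear function of a few abstract per-worker metrics and fit the weights from a small number of profiled runs.

```python
import numpy as np

# Abstract metrics per profiled run: [input_mb, shuffle_mb, output_mb].
# Both the metrics and the measured runtimes below are made-up examples.
metrics = np.array([
    [100,  50,  10],
    [200,  80,  40],
    [400, 300, 120],
    [800, 500, 300],
])
runtimes = np.array([12.0, 21.0, 55.0, 110.0])  # measured seconds

# Fit one weight per metric by least squares, then predict.
weights, *_ = np.linalg.lstsq(metrics, runtimes, rcond=None)
predicted = metrics @ weights
print(weights, predicted)
```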

Publications and Software

R. Li, W. Gatterbauer, M. Riedewald. Near-Optimal Distributed Band-Joins through Recursive Partitioning. To appear in Proc. ACM SIGMOD Int. Conf. on Management of Data, 2020 [preprint]

R. Li, N. Mi, M. Riedewald, Y. Sun, Y. Yao. Abstract Cost Models for Distributed Data-Intensive Computations. Distributed and Parallel Databases, 37(3): 411-439, Springer, 2019 [preprint]

R. Li, M. Riedewald, X. Deng. Submodularity of Distributed Join Computation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 1237-1252, 2018 [preprint]

R. Li, N. Mi, M. Riedewald, Y. Sun, Y. Yao. A Case for Abstract Cost Models for Distributed Execution of Analytics Operators. In Proc. Int. Conf. on Big Data Analytics and Knowledge Discovery (DaWaK), pages 149-163, 2017 (invited to special issue of Distributed and Parallel Databases presenting the Best Papers of DaWaK) [paper]

A. Okcan, M. Riedewald. Processing Theta-Joins using MapReduce. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 949-960, 2011

People

DATA Lab Team

Xinyan Deng (MS student, first job after graduation: Microsoft)
Wolfgang Gatterbauer
Rundong Li (PhD student, first job after graduation: Google)
Mirek Riedewald

Acknowledgements

Our work on this project was supported by the National Science Foundation (NSF) under award nos. 1017793 and 1762268, as well as the National Institutes of Health (NIH) under award no. R01 NS091421. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or NIH.
