CS 7290, Fall 2017: Special Topics in Data Science:
Foundations in Scalable Data Management

Description, Project, Grading, Calendar, Readings

Coordinates: Tue 11:45 am - 1:25 pm, Thu 2:50 pm - 4:30 pm, in Ryder Hall 126

Instructors: Wolfgang Gatterbauer and Mirek Riedewald

Office hours: Directly after class and via email: {wolfgang,mirek)@ccs.neu.edu

Mailing list: Through Blackboard

Description

After taking this class you will be able to understand latest trends and research problems in large-scale data management & analysis and have gained significant research experience in this field.

Topics: This course explores research topics in analysis and management of large data, with a focus on distributed & parallel approaches, join processing, and imprecise data/approximation. We will discuss and analyze papers covering applications, algorithms, systems, and theory--with a focus on the most recent developments. This course is designed for PhD students, as well as advanced Masters students with a solid background in algorithms and one or more data-oriented areas of computer science, incl. machine learning, AI, logics, information retrieval, and security. A desired outcome of the course project is the creation of research results that are publishable in a peer-reviewed conference.

Course format: Classes are intended to be discussion-based. Think of each class as an extended brainstorming session or roundtable discussion about a specific problem or paper. The goal is to create the maximum possible interaction for the maximum possible understanding. For this purpose, a typical class will be a mix of presentations of core material and research papers by the instructors (e.g., to introduce the research areas during the first 10 lectures), and presentations of research papers by students (during the remaining classes), with lots of interactive discussions.

Prerequisites: For CCIS PhD students, there are no formal requirements other than an interest in the topics covered. For PhD students from other programs and MS students, instructor approval is required in order to ensure that this course is a good fit for you. This will be done case-by-case, usually based on strong performance (grade A) in CCIS graduate-level courses on big data topics (or equivalent outside CCIS). Please have a look at at the papers in our reading list and decide for yourself: if the content is *entirely* impenetrable for you, this class will be very time-intensive for you: We are happy to provide you with extra background reading material that you will need to study in order to understand the weekly readings and do well on regular written paper summaries and in our class discussions.

A note to first and second year PhD students: Prof. Gatterbauer is actively looking for multiple PhD students to work with him. This course provides an excellent opportunity to explore relevant topics.

Intended Learning Outcomes:

1. Learn about recent research topics and industry trends in large-scale data management.

2. Acquire a technical skill set and background knowledge that equips you to perform data science research.

3. Be able to critically read, evaluate, and present research papers.

4. Gain hands-on research experience during the project, ideally resulting in publishable results.

Course Project

A big component of this course is a research project. By default each team will have 2 members (exceptions require instructor approval). We will facilitate group formation through informal socializing sessions at the end of the first few lectures, but it generally is each student’s responsibility to form and manage groups. Please start looking for project partners right away and pick a topic that is related to this class. We are happy to discuss ideas with you, and to suggest a list of project topics, but you are free to select a project outside of this list. You should meet with us periodically throughout the semester, updating us on the progress of your project. We will help with the direction, but unless you take the initiative to actively explore the topic you choose, you are unlikely to accomplish much.

For effective collaboration, we will work with tools such as Latex/Git or Overleaf for document processing and Git or bitbucket for coding.

Project timeline

F, Oct 6 (noon)

Project proposal

  • What is your thesis for his project (read this excellent advice: http://web.cs.ucla.edu/~palsberg/shivers.html)? What is the problem you are solving? What is the key new idea or what other work influenced your approach? What will be your contribution to the state-of-the art (e.g., identifying a new problem, or solving an existing problem better)?
  • How are you going to evaluate success of your project?
  • What are major risks why your project might not succeed?
  • Maximal 1 page

T, Oct 17

Proposal presentation

  • Maximal 5 min overview of your project idea

M, Nov 13 (noon)

Interim report

  • Show us that your team is on the right track

T, Nov 14

Interim presentation

  • Maximal 10 min update on your project progress

R, Nov 30

T, Dec 5

Final presentation

  • Maximal 25 min interactive presentation of your project. Be prepared to answer questions during the presentation, not just at the end
  • Optional demo your system / approach / algorithm

M, Dec 11 (noon)

Final report

  • Incorporate feedback received during the presentations

Grading and Deadlines

There are no exams or traditional homework assignments. However, the class requires (1) an extensive project, (2) students are required to read, analyze, and review about 2-4 research papers per week, and (3) present 2-3 papers over the course of the class. Depending on previous knowledge, some papers might require reading of additional material to obtain the necessary background.

Paper reviews: Students will provide on average 2-4 brief paper reviews per week. We highly encourage you to form study groups, enabling you to jointly read and discuss the papers before coming to class. You still submit your paper review individually, one for each required paper. The review and optional discussions with your study group will prepare you to be ready for the discussion in class. Reviews have to be submitted by 6pm the day before class via the following Google form https://goo.gl/forms/93IGjkVOCxp0fde62

Paper presentations: Starting in October, students (alone or in team of 2) will present research papers during the lecture, sometimes following a presentation of introductory material by the instructors. Those presenting a paper are also the "discussion leaders;" all other students will participate in the discussion and ask questions. Send your slides to the instructors by 6pm before the class you are presenting (we may respond with suggestions) and optionally come to office hours if you are stuck.

Class and Reading

30%

  • Pre-class paper reviews
  • In-class participation

Paper presentation

20%

  • You will present and lead the discussion in some lectures        

Course Project

50%

  • Project proposal
  • Intermediate report
  • Presentation
  • Final report

Calendar

This calendar is very preliminary and will be updated as we progress

Overview classes (led by instructors)

01

R, Sept 7

Course introduction

Preparation

  • The Beckman Report on Database Research. Abadi et al. CACM 2016 (paper)
  • Research Directions for Principles of Data Management. Abiteboul et al. Dagstuhl Workshop 16151 (paper)

02
T, Sept 12

Data exploration

Papers to review

  • Controlling False Discoveries During Interactive Data Exploration. Zhao, De Stefani, Zgraggen, Binnig, Upfal, Kraska. SIGMOD 2017 (paper)

03

R, Sept 14

Joins in databases

Papers to review

  • Skew Strikes Back: New Developments in the Theory of Join Algorithms. Ngo, Ré, Rudra. SIGMOD record 2013 (paper)

Additional preparation

  • Section 6.4 in Alice book: Computing with Acyclic Joins (website)
  • Slides on GYO reduction (slides)
  • Slides on Yannakakis algorithm (slides)

04

T, Sept 19

Parallel query processing

Papers to review

  • Processing theta-joins using MapReduce. Okcan, Riedewald. SIGMOD 2011 (paper)
  • Load Balancing and Skew Resilience for Parallel Joins. Vitorovic, Elseidy, Koch. ICDE 2016 (paper)

Additional preparation

  • "Mining book" Section 2.3.7 Computing Natural Join by MapReduce (website)
  • "Mining book" Section 2.5.3 Multiway Joins (website)

05

R, Sept 21

Graphs

Papers to review

  • Linearized and single-pass belief propagation. Gatterbauer, Günnemann, Koutra, Faloutsos. VLDB 2015 (paper) (Youtube video)
  • "Network estimation" (Paper shared on Blackboard)

Additional preparation

  • Sections 1-3 from Graph algorithms in the language of linear algebra. Kepner, Gilbert. 2011
  • Chapter 14.6 from Networks, crowds, and markets: reasoning about a highly connected world. Easley, Kleinberg. 2010 (book website)

06

T, Sept 26

Dichotomy theorems in databases: Guest speaker Cibele Freire

Papers to review

  • The complexity of resilience and responsibility for self-join-free conjunctive queries. Freire, Gatterbauer, Immerman, Meliou. VLDB 2016 (paper)

07

R, Sept 28

Data exploration: Guest speaker Jonathan Ullman 

Papers / blog post to review

  • The reusable holdout: Preserving validity in adaptive data analysis. Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. Science 2015 (paper)
  • Competing in a data science contest without reading the data. Hardt. (blog post)

Optional preparation

  • Preserving Statistical Validity in Adaptive Data Analysis. Dwork, Feldman, Hardt, Pitassi, Reingold, Roth. STOC 2015 (long version)
  • A firm foundation for private data analysis. Dwork. CACM 2011 (paper)
  • The Ladder: A Reliable Leaderboard for Machine Learning Competitions. Blum, Hardt. ICML 2015 (paper)

08

T, Oct 3

Uncertainty

Papers to review

  • Dissociation and propagation for approximate lifted inference with standard relational database management systems. Gatterbauer, Suciu. VLDBJ 2016 (paper)

Optional preparation

  • Oblivious bounds on the probability of Boolean functions. Gatterbauer, Suciu. TODS 2014 (paper)
  • Probabilistic databases: diamonds in the dirt. Dalvi, Re, Suciu. CACM 2009 (paper)

09

R, Oct 5

Uncertainty (continued)

Continued from last class

10

T, Oct 10

Project discussions & "Factorized Evaluation"

Papers to review

  • Factorized Databases. Olteanu, Schleich. SIGMOD record 2016 (paper)

Optional preparation

11

R, Oct 12

 "Modern recommender systems: matrices, bandits, and neurons": Guest speaker Georgia Koutrika

Papers to review

  • CourseNavigator: interactive learning path exploration. Li, Papaemmanouil, Koutrika. ExploreDB 2016 (paper)

Optional preparation

  • Finding Related Forum Posts through Content Similarity over Intention-Based Segmentation. Papadimitriou, Koutrika, Velegrakis, Mylopoulos. TKDE 2017 (paper)
  • Recsplorer: recommendation algorithms based on precedence mining. Parameswaran, Koutrika, Bercovitz, Garcia-Molina. SIGMOD 2010 (paper)
  • FlexRecs: expressing and combining flexible recommendations. Koutrika, Bercovitz, Garcia-Molina. SIGMOD 2009 (paper)

We will post a link to the slides as soon as Georgia shares them with us. The talk is based on a recent longer 3 part tutorial, and for copyright reasons with ACM, she cannot share them with us, and we cannot post them yet:
https://summerschool.acm.org/user-analytics-for-recommender-systems/

12

T, Oct 17

Project proposal presentations & Factorized Evaluation (continued)

Factorized evaluation: Continued from last Tuesday

13

R, Oct 19

Factorized Evaluation (continued)

Factorized evaluation continued

In-depth classes (led by students)

14

T, Oct 24

One-slide summaries & Parallel query processing: Equi-joins

Papers to review

  • Optimizing Joins in a Map-Reduce Environment. Afrati, Ullman EDBT 2010 (paper)
    Rundong + Aditya

15

R, Oct 26

Parallel query processing: Equi-joins (continued)

Papers to review

  • Optimizing Joins in a Map-Reduce Environment. Afrati, Ullman EDBT 2010 (paper)
    Rundong + Aditya

16

T, Oct 31

"Factors": Dataset versioning

Papers to review

  • Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff. Bhattacherjee, Chavan, Huang, Deshpande, Parameswaran. VLDB 2015
    Joe

17

R, Nov 2

Incremental graph maintenance 1/2

Papers to review

  • Storing and Analyzing Historical Graph Data at Scale. Khurana, Deshpande. EDBT 2016 (paper)

Additional preparation

  • ImmortalGraph: A System for Storage and Analysis of Temporal Graphs. Miao et al. TOS 2015 (paper)
    Joe

18

T, Nov 7

Incremental graph maintenance 2/2

Papers to review

  • The G* graph database: efficiently managing large distributed dynamic graphs. Labouseur, Birnbaum, Olsen Jr., Spillane, Vijayan, Hwang, Han. Distributed and Parallel Databases. 2016 (paper)
    Joe

19

R, Nov 9

Uncertainty: Learning probabilities: Guest lecture Niccolo Meneghetti

Papers to review

  • Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning. Meneghetti, Kennedy, Gatterbauer. SIGMOD 2017 (paper)

Additional preparation

  • Learning Tuple Probabilities. Dylla, Theobald. Arxiv 2016 (paper)

20

T, Nov 14

Project interim presentations 

21

R, Nov 16

Distributing Frank-Wolfe via Map-Reduce: Guest lecture: Stratis Ioannidis

Paper to review

  • Distributing Frank-Wolfe via Map-Reduce. Moharrer, Ioannidis. ICDM 2017 (paper), (PPTX slides)

Optional preparation

  • Short optimization tutorial. Jaggi. 2015 (slides)
  • The Recent Revival of the Frank–Wolfe Algorithm. Jaggi, Harchaoui. 2014 (paper)

22

T, Nov 21

Approximate Query Processing ("AQP")

Papers to review

  • Wander Join: Online Aggregation via Random Walks. Li, Wu, Yi, Zhao. SIGMOD record 2017 [best of SIGMOD 2016]
    Rundong + Aditya

R, Nov 23

NO CLASS (Thanksgiving)

24

T, Nov 28

Parallel query processing: University of Washington

Papers to review

  • From theory to practice: Efficient join query evaluation in a parallel database system. Chu, Balazinska, Suciu. SIGMOD 2015 (paper)
    Rundong + Aditya

25
R, Nov 30

Data exploration & uncertainty: University of Washington

Papers to review

  • Probabilistic Database Summarization for Interactive Data Exploration. Orr, Balazinska, Suciu. VLDB 2017
    Xinyan + Xiaofeng

Project presentations

T, Dec 5

Project presentations

R, Dec 7

Project presentations

M, Dec 11

Final report due

Topic and Reading list

1. Basic Database Background and overview

2. Data Exploration

3. Join Algorithms

4. Parallel Query Processing

5. "Factors" (Factorized databases / Dataset versioning / Incremental updates)

6. Dichotomy Theorems in Databases

7. Graphs

8. Uncertainty

9. Approximate Query Processing (AQP)

10. Distributed Transactions

Basic Research Setup (Please read these on your own)