CS 7240: Principles of scalable data management:

theory, algorithms and database systems (SP 2023)

Course Description

This course is a graduate-level introduction to the algorithms, core principles, and foundational concepts for managing data at scale. Our goal is to expose students to general abstractions, tools and algorithms developed across various branches of the data management community and to draw relationships between them. The emphasis will be on both, the high-level theoretical intuitions and principles underlying scalable data management, as well as technical details. The examples throughout the class are naturally biased by the expertise of the instructor and are chosen with an eye towards ongoing research in our research group.

Intended topics include: data models, query languages, logic & relational calculus, relational algebra, Datalog & recursion, stable model semantics, complexity of query evaluation, constraint satisfaction, query containment, homomorphisms, polymorphisms, universal algebra, acyclic query optimization, variable elimination orders, worst-case optimal join algorithms, provenance, reverse data management, tree decompositions and hypertree decompositions, algebraic structures for factorized representations, linear algebra, iterative graph algorithms, computation graphs, knowledge compilation, axiomatic approaches for defining problem solutions, normalization, managing uncertain and incomplete data.

Prerequisites: The course is fast-paced and requires standard knowledge of discrete math, algorithms and complexity theory (e.g., from textbooks such as [Ericson'19], [Dasgupta, Papadimitriou, Vazirani'06], [Cormen, Leiserson, Rivest, Stein'09], [Kleinberg, Tardos'05], or [Lehman, Leighton, and Meyer'15]).

PhD program: The course fulfills the PhD breath requirements for the ''Artificial Intelligence and Data Science'' breath area.

Administrative Information

Time/location

Lectures: Tue/Fri 1:35pm - 3:15pm, Snell Library 125 in person. There is a chance that two classes may have to be taught remotely via Zoom. Classes will not be recorded, but all slides from class will be made available within 2 days after each lecture.
Office hours: directly after class, or by appointment in person at 450 West Village H, or via Microsoft Teams or Zoom. Please email me with 3 time slots possible for you.

Instructor: Wolfgang Gatterbauer

Teaching Assistant: Neha Makhija

Contact: Please use Piazza (via direct access from within Canvas) for all questions related to lectures, coursework, and the project. Notice you can post questions anonymously to all other students, or anonymously even to me. Alternatively, please use our anonymous feedback form to send comments that are anonymous to me and that only I can see. Please use my email (wgatterbauer@northeastern.edu) for scheduling office hours and everything else.

Coursework/Evaluation

50%: Course project: The main component of this course will be a research project in the latter part of this class. This project can be a new application of one of the techniques presented or theoretically-oriented. The topic will be flexible, allowing students to explore scalable data management and analysis aspects related to their individual PhD research. This will involve an initial project proposal, an intermediate report, a project presentation and a final report. The final report should resemble a conference paper, and will be evaluated on the basis of soundness, significance, novelty, and clarity. Deliverables and dates are posted on the project page.

35%: Class scribes: Students take turns in "illustrating" 7 lectures (to be consistent with standard notation we refer to it as "scribes"). Graduate theory classes often ask students to scribe the lecture content. Yet since I already provide you with well-curated lectures slides, we change the rule of the game. Rather than scribing (= repeating and summarizing) the content of the class, I ask you to "illustrate" some interesting aspect in the covered topics with imaginative and ideally tricky illustrating examples. Those illustrations are great if they in turn can help other students practice and solidify their understanding of the topics discussed.

Justification: Georg Cantor is quoted as saying: "To ask the right question is harder than to answer it." In that spirit, our class scribes are closer to research than assignments: What particular aspect in a class is worthy to be "illustrated"? That's already part of the question. Scribes are done in PowerPoint and are due 1 week after class at midnight (Tue for Tue classes, Fri for Fri classes). They can be done individually or in teams of two (if you work in teams of two, you are expected to illustrate 14 classes instead of 7 classes). For some more pedagogic motivation see videos by Tim Brown on asking questions and reframing problems being key to creativity, Dan Meyer on formulating problem being more important than just solving them, Derek Muller on increasing learning by including possible misconceptions into stories, a blog post on example-based reasoning, and an older text of mine of the educational value of temporarily misleading the spectator before giving the correct answer.
Submission on Canvas: Please start from our PPTX template, and submit it as PDF to Canvas, naming it "cs7240-sp23-[YOUR NAME]-scribe[NUMBER]-[SOME DESCRIPTIVE TITLE].PDF".
Secondary submission on Piazza: Please also post this document to Piazza (optionally anonymized by removing the title page). If you prefer, you can make your post anonymous to other students. I will post my comments on each submission both on Canvas and copy it too on Piazza. That way your document together with my comments will also help other students.

15%: Class participation: Classes will be interactive and require concentration and participation of the students. Participate when we discuss the merits or shortcomings of algorithms, or when we have small group break-out sessions with exercises. Ask questions, during class or on Piazza. Questions that make me ponder or create new illustrating examples are all great examples of class participation. Also, *never* hesitate to point out to me errors you spot in my slides, even if minor. You can post anonymously to the other students on Piazza (and even anonymously to me, though then I would not be able to associate you with your participation). Also, don't hesitate to point out to me any interesting links to interesting related material. It can only count towards class participation. Finally notice that while the class provides extensive readings for those interested, these pointers are almost exclusively optional (unless otherwise stated in class).

Textbooks

The course is based on papers, lectures, chapters from textbooks, and original presentations the topics. Pointers to those works are available on the website on a per-topic basis on the calendar page. This class provides a very broad overview and you are not expected to read all the texts. Yet for those interested to dig deeper, they will provide useful background.

Acknowledgements

This course builds upon the structure of several existing great database classes. Some content courtesy to Dan Suciu, Benny Kimelfeld, Phokion Kolaitis, Moshe Vardi, Reinhard Pichler, Chris Re, Jeff Ullman, and Andreas Pieris. A more recent and closely related class is by Dan Olteanu. I tried to give credit to the original inspirations on each individual slide. In case I missed something, please let me know. Any errors and non-acknowledged content is the result of me trying to explain concepts to my students and myself with minimal examples necessary, and refining them every time I teach and get student feedback.

CS 7240: Principles of scalable data management: theory, algorithms and database systems (SP 2023)