CS 7240: Principles of scalable data management:
theory, algorithms and database systems (SP 2021)

Course Description

This course is a graduate-level introduction to the algorithms, core principles, and foundational concepts for managing data at scale. Our goal is to expose students to general abstractions, tools and algorithms developed across various branches of the data management community and to draw relationships between them. The emphasis will be on both, the high-level theoretical intuitions and principles underlying scalable data management, as well as technical details.

Topics include: data models, query languages, worst-case optimal join algorithms, relational algebra, relational calculus, Datalog, query optimization, complexity of big-data analysis, parallel data processing, uncertain and incomplete data, provenance, algebraic structures for factorized representations, linear algebra.

Prerequisites: The course requires standard knowledge of discrete math, algorithms and complexity theory (e.g., from textbooks such as [Ericson'19], [Dasgupta, Papadimitriou, Vazirani'06], [Cormen, Leiserson, Rivest, Stein'09], [Kleinberg, Tardos'05], or [Lehman, Leighton, and Meyer'15].

PhD program: The course fulfills the PhD breath requirements for the Artificial Intelligence and Data Science area.

Administrative Information

Time/location

Instructor: Wolfgang Gatterbauer

Contact: Please use Piazza (via direct access from within Canvas) for all questions related to lectures, coursework, and the project. Notice you can post anonymously to all other students. Alternatively, please use our anonymous feedback form to send me comments anonymously. Please use my email (wgatterbauer@northeastern.edu) for scheduling office hours and everything else.

Coursework/Evaluation

50%: Course project: The main component of this course will be a research project in the latter part of this class. This project can be a new application of one of the techniques presented or theoretically-oriented. The topic will be flexible, allowing students to explore scalable data management and analysis aspects related to their PhD research. This will involve an initial project proposal, an intermediate report, a project presentation and a final report. The final report should resemble a conference paper, and will be evaluated on the basis of soundness, significance, novelty, and clarity. Deliverables and dates are posted on the project page.

35% 20%: Class scribes: Students take turns in "illustrating" 7 lectures (to be consistent with standard notation we refer to it as "scribing"). Graduate theory classes often ask students to scribe the lecture content. Yet since this graduate course provides well-curated lectures slides, we change the rule of the game. Rather than scribing (repeating and summarizing) the content of the class, the students are asked to "illustrate" some interesting aspect in the covered topics with imaginative and ideally tricky illustrating examples. Those in turn can help other students practice and solidify their understanding of the topics discussed.
Some pedagogic motivation: see videos by Tim Brown on asking questions and reframing problems being key to creativity, Dan Meyer on formulating problem being more important than just solving them, Derek Muller on increasing learning by including possible misconceptions into stories, and an older text of mine of the educational value of temporarily misleading the spectator before giving the correct answer.
Scribes are due 1 week after class at midnight (Tue for Tue classes, Fri for Fri classes) and can be done individually or in teams of two. If you work in teams of two, you are expected to illustrate classes more often. We have a signup sheet in Piazza. You can either use PowerPoint or Latex. For PPTX, please use our PPTX template, name your document "cs7240-sp21-[YOUR NAME]-scribe[NUMBER]-[SOME DESCRIBTIVE TITLE].pptx", and send it to via email using the document name in the subject. For Latex, please use our Overleaf latex template then share it with my Northeastern email with edit rights via Overleaf.

15%: Three assignments: We have three assignments in the first part of this class. Like research projects, assignment solutions must be typeset in LaTeX. I replaced the assignments with more class scribes. Georg Cantor is quoted as saying: "To ask the right question is harder than to answer it." In that spirit, our class scribes are closer to research than assignments. What particular aspect in a class is worthy to be "illustrated"? That's already part of the question.

15%: Class participation: Classes will be interactive and require participation of the students in class (discussing contributions or shortcomings of algorithms covered, small group break-out sessions with exercises). The class provides extensive readings for those interested, yet those are optional unless otherwise stated in class.

Textbooks

The course is based on papers, lectures, chapters from textbooks, and original presentations on scalable algorithms in data management. Pointers to those works are available on the website on a per-topic basis on the calendar page. This class provides a very broad overview and you aren't expected to read all the texts. Yet for those interested to dig deeper, they will provide useful background.

Acknowledgements

This course builds upon the structure and content of several existing great database classes. Some content courtesy to Dan Suciu, Benny Kimelfeld, Phokion Kolaitis, Moshe Vardi, Reinhard Pichler, Chris Re, Jeff Ullman, and Andreas Pieris.