This course is a graduate-level introduction to the algorithms, core principles, and foundational concepts for managing data at scale. Our goal is to expose students to general abstractions, tools and algorithms developed across various branches of the data management community and to draw relationships between them. The emphasis will be on both, the high-level theoretical intuitions and principles underlying scalable data management, as well as technical details.
Topics include: data models, query languages, worst-case optimal join algorithms, relational algebra, relational calculus, Datalog, query optimization, complexity of big-data analysis, parallel data processing, uncertain and incomplete data, provenance, algebraic structures for factorized representations, linear algebra.
Prerequisites: The course requires standard knowledge of discrete math, algorithms and complexity theory (e.g., from textbooks such as [Ericson'19], [Dasgupta, Papadimitriou, Vazirani'06], [Cormen, Leiserson, Rivest, Stein'09], [Kleinberg, Tardos'05], or [Lehman, Leighton, and Meyer'15].
PhD program: The course fulfills the PhD breath requirements for the Artificial Intelligence and Data Science area.
Time/location
Instructor: Wolfgang Gatterbauer
Contact: Please use Piazza (access code is posted on Blackboard) for all questions related to lectures, coursework, project, and scheduling extra office hours. Please use my email (wgatterbauer@northeastern.edu) for everything else.
40%: Course project: The main component of this course will be a research project in the latter part of this class. This project can be a new application of one of the techniques presented or theoretically-oriented. The topic will be flexible, allowing students to explore scalable data management and analysis aspects related to their PhD research. This will involve an initial project proposal, an intermediate report, a project presentation and a final report. The final report should resemble a conference paper, and will be evaluated on the basis of soundness, significance, novelty, and clarity. Deliverables and dates are posted on the project page.
30%: 3 assignments: We have three assignments in the first part of this class. Like research projects and scribe notes, assignment solutions must be typeset in LaTeX.
15%: Midterm exam (Fri 3/13): The midterm exam tests fundamental concepts we saw in the first part of the class and will be "in class" (i.e. not take home).
15%: Class participation: Classes will be interactive and require participation of the students in class (discussing contributions or shortcomings of algorithms covered, small group break-out sessions with in-class exercises). The class provides extensive readings for those interested, yet those are optional unless otherwise stated in class. Students also take turns in "illustrating" (formerly called "scribing") 3-5 lectures. Graduate theory classes often ask students to scribe the lecture content. Yet since this course provides typed lectures notes in the forms of slides, we change the rule of the game for this class. Rather than scribing the content of the class, the students are aksed to "illustrate" the covered topics with imaginative and ideally tricky illustrating examples. Those in turn can help other students practice and solidify their understanding of the class material. Scribes are due 1 week after class at midnight (Tue for Tue classes, Fri for Fri classes) and can be done individually or in teams of two. We have a signup sheet in Piazza. Please use our Overleaf latex template and share your write-up with me on Overleaf.
We have no prescribed textbook. Instead online readings will be posted on the calendar page.
This course builds upon the structure and content of several existing great database classes. Some content courtesy to Dan Suciu, Benny Kimelfeld, Phokion Kolaitis, Moshe Vardi, Reinhard Pichler, Chris Re, Jeff Ullman, Paris Koutris, and Andreas Pieris.