CS 7240/7280: Principles of scalable data management: theory, algorithms and database systems

Administrative Information

Instructors:

Office hours:

Time/location: 2:50 - 4:30pm Mon/Wed in Ryder Hall 233. Spring 2019.

Piazza site: http://piazza.com/northeastern/spring2019/cs7280

Prerequisites: The course requires standard CS knowledge of algorithms and complexity theory (e.g., from textbooks such as [Ericson], [Dasgupta, Papadimitriou, Vazirani], or [Cormen Leiserson Rivest Stein]).

Course Description

This course will provide a rigorous introduction to the algorithms, core principles, and foundational concepts for managing data at scale. We will emphasize both the high-level theoretical intuitions and principles underlying scalable data management and the technical details. Topics include data models and query languages, query optimization, complexity of big-data analysis, data stream processing, parallel data processing, and probabilistic data management. Students will gain a deep algorithmic understanding through interactive classes and a project with regular feedback. The project will be flexible, allowing students to explore aspects of scalable data management and analysis related to their PhD research.

Topics and approximate agenda

Topic 1: Data models and query languages: L1 (1/7) - L5 (1/23)

Topic 2: Complexity of big data analysis: the theory of conjunctive queries and beyond: L6 (1/28) - L10 (2/11)

Topic 3: Query execution and optimization: L11 (2/13)

Topic 4: Transactions: L12 (2/20) - L13 (2/25)

Topic 5: Parallel data processing: L14 (2/27) - L17 (3/18)

Topic 6: Data stream processing: L18 (3/20) - L19 (3/25)

Topic 7: Logic and uncertainty: L20 (3/27) - L23 (4/8)

Topic 8: Linear algebra

Topic 9: Factorized databases

Coursework


Literature list

The relational model, query languages, SQL, logic

Normalization

Setting up a DBMS

Alternative data models and NoSQL

Complexity of query evaluation

Research directions in databases