CS 7280: Principles of scalable data management: theory, algorithms and database systems
Instructors:
Office hours:
-
Directly after class or arranged via email: {wolfgang,mirek)@ccs.neu.edu
Time/location: 2:50 - 4:30pm Mon/Wed in Ryder Hall 233. Spring 2019.
Piazza site:
http://piazza.com/northeastern/spring2019/cs7280
Prerequisites:
The course requires standard CS knowledge of algorithms and complexity theory (e.g., from textbooks such as
[Ericson],
[Dasgupta, Papadimitriou, Vazirani], or
[Cormen Leiserson Rivest Stein].
This course will provide a rigorous introduction to the algorithms, core principles, and foundational concepts for managing data at scale.
Our emphasis will be on both, the high-level theoretical intuitions and principles underlying
scalable data management, as well as technical details.
Topics include data models and query languages, query optimization, complexity of big-data analysis, data stream processing, parallel data processing, and probabilistic data management.
Students will gain deep algorithmic understanding through interactive classes and a project with regular feedback.
The latter will be flexible, allowing students to explore scalable data management and analysis aspects related to their PhD research.
Topic 1: Data models and query languages: L1 (1/7) - L5 (1/23)
-
SQL refresher (SAMS, Cow book Ch 3 & 5, Complete book Ch 6)
-
Relational algebra,
Relational calculus,
Datalog, Codd's theorem
(Alice book Ch 3 & 4 & 24.1,
Cow book Ch 4,
Complete book 2 & 5)
-
Normal forms and their information-theoretic justification
(Complete book Ch 3, [Arenas & Libkin 2005])
-
Alternative data models and NoSQL (Hellerstein & Stonebraker, Sadalage & Fowler
Topic 2: Complexity of big data analysis: the theory of conjunctive queries and beyond: L6 (1/28) - L10 (2/11)
-
Semantics of Conjunctive queries
-
Query containment and Query minimization
-
Query complexity (data vs query complexity)
-
Acyclic joins: Query hypergraph, GYO, Dynamic programming and Yannakakis (Alice book Ch 6.4), Any-k algorithms
-
Cyclic joins: Tree decomposition, query widths, AGM bounds, worst-case optimal join algorithms
-
Hardness dichotomies: resilience
Topic 3: Query execution and optimization: L11 (2/13)
-
lifecycle of a query plan,
query optimization
Topic 4: Transactions: L12 (2/20) - L13 (2/25)
-
locking,
2-phase commit,
ARIES,
CAP and the rise of NoSQL databases,
NewSQL
Topic 5: Parallel data processing: L14 (2/27) - L17 (3/18)
-
parallel DBMS,
MapReduce and Spark,
pipeline versus partitioned parallelism,
complexity of parallel data management,
theta-joins vs equi-joins
Topic 6: Data stream processing: L18 (3/20) - L19 (3/25)
-
data models and query languages,
one-pass algorithms,
sketches (AMS sketch, count-min, spatial sketches)
Topic 7: Logic and uncertainty: L20 (3/27) - L23 (4/8)
-
probabilistic models,
complexity of probabilistic inference,
MaxEntropy,
scalable approximations to probabilistic inference,
connections to models of propagation on graphs and hypergraphs,
connections to approximate query processing
Topic 8: Linear algebra
-
linear algebra and its connection to relational algebra optimizations
Topic 9: Factorized databases
-
dataset versions,
optimal factorizations,
models for Web table integration
-
Reading assignments and class participation (20%): There will be weekly reading assignments in preparation for the topics covered that week.
Each reading assignment requires submission of an interesting and non-trivial multiple-choice question about the reading assignments that requires understanding the material in order to correctly answer it.
We are using an online platform to manage the questions, which will be answered by everybody in class.
Classes will be interactive and require participation of the students (discussing contributions or shortcomings of algorithms covered, small group break-out sessions with in-class exercises)
-
Exam (30%): One comprehensive in-class exam
-
Research project (50%):
One research project with final presentation, starting about 1/3rd into the semester.
The topic will be flexible, allowing students to explore scalable data management and analysis aspects related to their PhD research.
The relational model, query languages, SQL, logic
-
''Cow book'': Ramakrishnan, Gehrke. Database Management Systems. 3rd ed 2003.
Ch 3: The relational model,
Ch 4: Relational algebra and calculus,
Ch 5: SQL,
Ch 24.1-24.2: Datalog
-
''Stanford book'': Garcia-Molina, Ullman, Widom. Database Systems. 2014.
Ch 2: The relational model,
Ch 3: Design theory,
Ch 5: Algebraic and logical query languages,
Ch 6: SQL
-
"SAMS": Forta. Teach yourself SQL in 10min. 4th ed.
[Safari books eBook (NEU free online access)],
[EBSCOhost eBook (NEU free online access)],
[Amazon (30$)],
[SQL files for SAMS book]
-
''Alice book'': Abiteboul, Hull, Vianu. Foundations of Databases. 1995.
Ch 3: The relational model,
-
Ian Barland, Phokion Kolaitis, Moshe Vardi, Matthias Felleisen, John Greiner.
Intro to Logic: online exercises.
First-Order Logic: bound variables, free variables
,
First-Order Logic: equivalences
,
Exercises for First-Order Logic
-
Halpern, Harper, Immerman, Kolaitis, Vardi, Vianu.
On the Unusual Effectiveness of Logic in Computer Science.
Bulletin of Symbolic Logic, Volume 7, Issue 2, June 2001.
Normalization
Setting up a DBMS
Alternative data models and NoSQL
-
Sadalage, Fowler: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. 2012
[Safari books eBook (NEU free online access)]
Ch 8: Key-value stoes,
Ch 9: Document databases,
Ch 10: Column-famility stores,
Ch 11: Graph databases
-
Harrison: Next Generation Databases: NoSQL, NewSQL, and Big Data. 2016
[Safari books eBook (NEU free online access)]
p43-p51: CAP, Eventual Consistency, Hashing,
p61-p62: MongoDB,
p65-p68: Graphs,
p75-p80: Columns,
p145-156: Data models,
p163, 164, 166: Secondary indexing
-
Stonebraker, Hellerstein: What Goes Around Comes Around. 2005
[Online PDF]
Also see
various discussions of the paper.
Complexity of query evaluation
-
''Alice book'': Abiteboul, Hull, Vianu. Foundations of Databases. 1995.
Ch 4: Conjunctive queries,
Ch 6.2: Query containment,
Ch 6.4: Acyclic joins
-
"AGM": Atserias, Grohe, Marx. Size bounds and query plans for relational joins. SIAM J. Comput. 2013
-
Ngo, Porat, Re, Rudra. Worst-case Optimal Join Algorithms. JACM 2018.
-
Veidhulzen. Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm. ICDT 2014.
-
Freire, Gatterbauer, Immerman, Meliou. The complexity of resilience and responsibility for self-join-free conjunctive queries. VLDB 2016.
Research directions in databases
-
Abadi et al. The Beckman Report on Database Research. CACM 2016
[paper]
-
Abiteboul et al. Research Directions for Principles of Data Management.Dagstuhl Workshop 16151
[paper]