CS 7840: Foundations and Applications of Information Theory (Fall 2024)

Topics and approximate agenda

This schedule will be updated regularly as the class progresses. Check back frequently. We will usually post lecture slides by the end of the day following a lecture (thus the next day), or latest two days after class. Posted slides are accumulative per topic.

PART 1: Information Theory (the basics)

Covers the basic mathematical framework behind entropy and its various forms.

Lecture 1 (Wed 9/4)
Course introduction and encoding example
Lecture 2 (Mon 9/9)
Basics of Probability (1/2) [random experiment, independence, conditional probability, chain rule, Bayes' theorem, random variables]
Lecture 3 (Wed 9/11)
Basics of Probability (2/2) [expectation, variance, Markov chains]
Basics of information theory (1/6) [measures of Information, intuition behind entropy]
Lecture 4 (Mon 9/16)
Compression (1/3) [algorithmic derivation of entropy via compression]
Lecture 5 (Wed 9/18)
Basics of information theory (2/6) [conditional entropy, binary entropy, max entropy]
Lecture 6 (Mon 9/23)
Basics of information theory (3/6) [joint entropy, conditional entropy, mutual information, cross entropy]
Compression (2/3) [uniquely decodable codes]
Lecture 7 (Wed 9/25)
Compression (3/3) [uniquely decodable codes continued]
Lecture 8 (Mon 9/30)
Compression (2/3) [Asymptotic Equipartition Property (AEP)]
Basics of information theory (4/6) [multivariate entropies]
Python notebook: entropies
Lecture 9 (Wed 10/2)
Basics of information theory (5/6) [multivariate entropies, interaction information, Markov chains, data processing inequality]
Lecture 10 (Mon 10/7)
Basics of information theory (6/6) [data processing inequality, sufficient statistics, information inequalities]

PART 2: The axiomatic approach (deriving formulations from first principles)

Covers the axiomatic approach from multiple angles: a few simple principles (axioms) leading to entropy or the laws of probability up to factors. Starting from a list of postulates leading to particular solution is a powerful approach that has been used across different areas of computer science (e.g. how to define the right scoring function for achieving a desired outcome)

Lecture TBD: Derivation of Hartley measure and entropy function from first principles
Lecture TBD: Cox's theorem: a derivation of the laws of probability theory from a certain set of postulates. Contrast with Kolmogorov’s “probability axioms”

Part 3: Selected Applications to data management, machine learning and information retrieval

Covers example approaches of basic ideas from information theory to practical problems in data management, machine learning, and information retrieval. Topics and discussed papers may vary over years.

Lecture 11 (Wed 10/9)
Compression (1/4) [Cover,Thomas'06: Ch 5.9, Ch 13.3]
(Mon 10/14): no class (Indigenous Peoples Day)
Lecture 12 (Wed 10/16):
Compression (2/4) [Cover Thomas'06: Ch 5.11]
Lecture 13 (Mon 10/21):
Compression (3/4) [Cover Thomas'06: Ch 13.4, Ch 13.5]
Lecture 14 (Wed 10/23):
Compression (4/4)
Decision trees (1/2) [Hunt's algorithm, information gain, gini, gain ratio]
Lecture 15 (Mon 10/28):
Decision trees (2/2) [MDL for decision trees]
Logistic Regression (1/2) [Deriving multinomial logistic regression as maximum entropy model, Lagrange multipliers, softmax]
Python notebooks: 202, 204, 208, 212
Lecture 16 (Wed 10/30):
Method of Types (1/2)
Lecture 17 (Mon 11/4):
Method of Types (2/2) [Sanov's theorem, large deviation theory]
Lecture 18 (Wed 11/6):
Maximum Entropy (1/2) [Deriving the Maximum Entropy Principle]
Python notebooks: 224
(Mon 11/11): no class (Veterans Day)
Lecture 19 (Wed 11/13):
Logistic Regression (2/2) [Luce's choice axiom, Bradley-Terry model]
Maximum Entropy (2/2) [Occam, Kolmogorov, Minimum Description Length (MDL)]
Lecture 20 (Mon 11/18):
Channel capacity [Cover Thomas'06: Ch 7]
Lecture 21 (Wed 11/20):
Distortion Theory (1/2) [Cover Thomas'06: Ch 10]
Lecture 22 (Mon 11/25):
Distortion Theory (2/2) [Cover Thomas'06: Ch 10]
Python notebooks: 232
(Wed 11/27): no class (Fall break)
Lecture 23 (Mon 12/2):
Information Bottleneck Theory (1/2)
Lecture 24 (Wed 12/4):
Information Bottleneck Theory (2/2)

Project presentations

Lecture 25 (Mon 12/9): P4 Project presentations
Lecture 26 (Wed 12/11): P4 Project presentations

Literature list

Part 1: Information Theory (the basics)

[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006: Ch 2.2 Joint Entropy and Conditional Entropy, Ch 2.3 Relative Entropy and Mutual Information, Ch 2.5 Chain Rules, Ch 2.8 Data-Processing Inequality, Ch 2.9 Sufficient Statistics, Ch 2.10 Fano’s Inequality, Ch 3 Asymptotic Equipartition Property (AEP), Ch 4 Markov chains/entropy rates, Ch 5 Data Compression (5.2 Kraft inequality, 5.6 Huffman codes), Ch 12 Maximum Entropy, Ch 14 Komogorov Complexity, Ch 28 Occam's Razor & Minimum description length (MDL)
[Pishro-Nik'14] Introduction to Probability, Statistics, and Random Processes, Online book, 2014: Section 5.1.3 is a nice refresher on conditioning and independence
[Schneider'13] Information Theory Primer, With an Appendix on Logarithms. 2013: nice refresher on logarithms
[Moser'18] Information Theory (lecture Notes), 6th ed, 2018. Ch 1 Shannon's measure of information, Ch 3.1 Relative entropy
[Olah'15] Visual Information Theory (blog entry). 2015
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003: Ch 2 Probability Entropy and Inference, Ch 4 The Source Coding Theorem
[MIT 6.004] Chris Terman. L01: Basics of Information (lecture notes and videos), 6.004 Computation Structures, 2015-2017.
[3Blue1Brown Bayes] Bayes theorem, the geometry of changing beliefs (Youtube). 2019
[Godoy'18] Understanding binary cross-entropy / log loss: a visual explanation (blog entry). 2018. (Youtube video)
[Murphy'12] Machine Learning: a Probabilistic Perspective, MIT press, 2012: Ch 2 Probability (including information theory), Ch 8 Logistic regression
[Murphy'22] Probabilistic Machine Learning: An Introduction, MIT press, 2022: Ch 2.4 Bernoulli and binomial distributions (including Sigmoid), Ch 2.5 Categorical and multinomial distributions (including Softmax), Ch 6 Information Theory, Ch 10 Logistic regression
[Casella,Berger'24] Statistical inference (2nd ed), CRC press, 2024: Ch 6 Principles of data reduction, Ch 6.2.1 Sufficient statistics
[Fithian'24] Statistics 210a: Theoretical Statistics (Lecture 4 sufficiency), Berkeley, 2014.
[Yeung'08] Information Theory and Network Coding. 2008: Ch 2.6 The basic inequalities, Ch 2.7 Some Useful Information Inequalities, Ch 3.5 Information Diagrams, Ch 13 Information inequalities, Ch 14 Shannon-type inequalities, Ch 15 Beyond Shannon-type inequalities
[Suciu'23] Applications of Information Inequalities to Database Theory Problems, LICS keynote 2023. (Slides)
[Suciu'23b] CS294-248: Topics in Database Theory, Berkeley 2023 (slides and videos): Unit 5a Information Inequalities, Unit 5b Database Constraints
Python notebook: entropies

Part 2: The axiomatic approach (deriving formulations from first principles)

[Klir'06] Uncertainty and Information: Foundations of Generalized Information Theory, Wiley, 2006. Ch 3.2.1. Simple Derivation of the Shannon Entropy, Ch 3.2.2. Uniqueness of the Shannon Entropy, Ch 9.3.1. Principle of Maximum Entropy
[Cox'46] Probability, frequency and reasonable expectation, American Journal of Physics, 1946.
[Shannon'48] A Mathematical Theory of Communication, The Bell System Technical Journal, 1948.
[Jaynes'03] Probability theory: the logic of science, Cambridge press, 2003. Section 11. Discrete prior probabilities: the entropy principle. (read the reviews on Amazon)
[Van Horn'03] Constructing a logic of plausible inference: a guide to Cox's theorem, International Journal of Approximate Reasoning, 2003.
[Fleming,Wallace'86] How not to lie with statistics: the correct way to summarize benchmark results. CACM 1986.
[Kleinberg'02] An Impossibility Theorem for Clustering. NeurIPS 2002.
[wikipedia Probability axioms] Probability axioms.
[wikipedia Cox's theorem] Cox's theorem.

Topic 3: Selected Applications to data management, machine learning and information retrieval

Compression

[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006: Ch 5.9 Shannon–Fano–Elias Coding, Ch 13.3 Arithmetic Coding, Ch 5.11 Generation of Discrete Distributions from Fair Coins, Ch 13.4 Lempel–Ziv Coding, Ch 13.5 Optimality of Lempel–Ziv Algorithms
[Moser'18] Information Theory (lecture Notes), 6th ed, 2018. Ch 4.8.2 Shannon code, Ch 4.8.3 Fanno code, Ch 4.9 Huffman code, Ch 5.3 Arithmetic coding, Ch 7.3 Lempel-Ziv
[LNTwww] Compression According to Lempel, Ziv and Welch. 2023: Example 3
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003: Ch 6 Stream codes (Arithmetic Coding, Lempel-Ziv)
[wikipedia Arithmetic Coding] Arithmetic coding

Method of Types

[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006: Ch 11.1 Method of Types, Ch 11.4 Large Deviation Theory (Sanov's theorem), Ch 11.5 Examples of Sanov's Theorem, Ch 11.6 Conditional Limit Theorem

Information Gain in Decision trees

[Tan+'18] Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd ed, 2018. Ch 3 Classification with Decision trees (freely available chapter)
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009. Ch 9.2 Tree-based methods
[Mitchell'97] Introduction to Machine Learning, McGraw-Hill, 1997. (Online PDF posted by Tom Mitchell). Ch 3.6.2 Occam's Razor
[wikipedia Information Gain] Information gain in decision trees
[Hyafil,Rivest'76] Constructing optimal binary decision trees is NP-complete, Information Processing Letters, 1976.
[Russell,Norvig'20] Artificial Intelligence: A Modern Approach, 4th ed, 2020. Ch 19.3 Learning Decision Trees
[MLU-explain'22] Wilber, Santamaria. Decision Trees: The unreasonable power of nested decision rules. (an interactive guide)

Logistic regression, softmax, maximum entropy, cross-entropy

[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009. Ch 4.4 Logistic regression
[Hui,Belkin'21] Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, ICLR 2021.
[Mount'11] The equivalence of logistic regression and maximum entropy models
[wikipedia Softmax] Softmax function
[Cramer'03] The origins and development of the logit model, 2003 (extended book chapter)
[Zombori+'23] Zombori, Agapi Rissaki, Szabo, Gatterbauer, Benedikt. Towards Unbiased Exploration in Partial Label Learning
Python notebook: logistic regression

Bradley-Terry model, Luce's choice axiom, Item Response Theory

[Luce'08] Luce's choice axiom. Scholarpedia, 2008.
[wikipedia Luce's choice axiom] Luce's choice axiom
[wikipedia Bradley-Terry] Bradley-Terry model
[wikipedia Expected utility] Expected utility hypothesis
[Hunter'04] MM algorithms for generalized Bradley-Terry models. Annals of Statistics. 2004
[Newman'23] Efficient Computation of Rankings from Pairwise Comparisons. JMRL 2023.
[Chen+'24] Chen, Mitra, Ravi, Gatterbauer. HITSNDIFFS: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property, ICDE 2024.

Minimum Description Length (MDL), Kolmogorov Complexity

[wikipedia MDL] Minimum description length
[Gruenwald'04] A Tutorial Introduction to the Minimum Description Length Principle, book chapter 2005. (free preprint)
[Vreeken,Yamanishi'19] Modern MDL meets Data Mining: Insights, Theory, and Practice (tutorial), KDD 2019.
[Li,Vitanyi'19] An introduction to Kolmogorov complexity and its applications, 4th ed, Springer, 2019. Ch 5.4 Hypothesis Identification by MDL
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003: Ch 28.3 Minimum Description Length (MDL)
[Domingos'99] The role of Occam's Razor in Knowledge Discovery. 1999
[Mitchell'97] Introduction to Machine Learning, McGraw-Hill, 1997. (Online PDF posted by Tom Mitchell). Ch 6.6 MDL
[Sutskever'23] An Observation on Generalization (recorded talk at Simon Institute for the Theory of Computing / Berkeley)

Rate Distortion & Information bottleneck theory

[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006: Ch 10 Rate distortion theory
[Tishby+'99] Tishby, Pereira, Bialek. The information bottleneck method. The 37th annual Allerton Conference on Communication, Control, and Computing. pp. 368–377.
[Harremoes,Tishby'07] The Information Bottleneck Revisited or How to Choose a Good Distortion Measure. International Symposium on Information Theory, 2007.
[Zaslavsky+'18] Zaslavsky, Kemp, Regier, Tishby. The Efficient compression in color naming and its evolution. PNAS, 2018.
[Webb+'24] Webb, Frankland, Altabaa, Segert, Krishnamurthy, Campbell, Russin, Giallanza, Dulberg, OReilly, Lafferty, Cohen. The Relational Bottleneck as an Inductive Bias for Efficient Abstraction. Trends in Cognitive Science, 2024.
[Segert'24] Maximum Entropy, Symmetry, and the Relational Bottleneck: Unraveling the Impact of Inductive Biases on Systematic Reasoning. PhD thesis, Neuroscience @ Princeton, 2024.
[Ren,Li,Leskovec'20] Graph Information Bottleneck, NeurIPS, 2020.
[Zidi+'20] Zidi, Estella-Aguerri, Shamai. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy, 2020.

Machine Learning: VC dimensions

[Shalev-Shwartz,Ben-David'14] Understanding machine learning: from theory to algorithms, Cambridge University Press, 2014. Ch 6 The VC-dimension
[Mohri+'18] Mohri, Rostamizadeh, Talwalkar. Foundations of Machine learning. MIT press, 2018. Ch 3 Rademacher Complexity and VC-Dimension
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009. Ch 7.9 VC dimension
[Piantadosi'18] One parameter is always enough. 2018
[Boue'19] Real numbers, data science and chaos: How to ﬁt any dataset with a single parameter, 2019
[wiki VC dimension] Wikipedia, Vapnik–Chervonenkis dimension.

Data Management: Normal forms

[Elmasri, Navathe'15] Fundamentals of Database Systems. 7th ed 2015. Ch 14 & 15: Normal forms,
[Silberschatz+'20] Silberschatz, Korth, Sudarshan. Database system concepts. 7th ed 2020. Ch 7.5 & 7.6: Relational design with decomposition.
[Arenas, Libkin'05] An information-theoretic approach to normal forms for relational and XML data. JACM 2005.
[Lee'87] An Information-Theoretic Analysis of Relational Databases-Part I: Data Dependencies and Information Metric. IEEE Transactions on Software Engineering 1987.

Data Management: Approximate acylic schemas

[Kenig,Suciu'22] Integrity Constraints Revisited: From Exact to Approximate Implication. LMCS. 2022
[Kenig,Weinberger'23] Quantifying the Loss of Acyclic Join Dependencies. PODS. 2023

Data Management: Information inequalities and cardinality estimation

[Suciu'23] Applications of Information Inequalities to Database Theory Problems. (slides)
[Abo Khamis+'17] Abo Khamis, Ngo, Suciu.What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another? PODS 2017.
[Abo Khamis+'24] Abo Khamis, Nakos, Olteanu, Suciu. Join Size Bounds using LP-Norms on Degree Sequences PODS 2024.

Data Management: Approximate functional dependencies

[Parciak+'23] Measuring Approximate Functional Dependencies: a Comparative Study

Data Management: Explanation tables

[El Gebaly+'18] El Gebaly, Feng, Golab, Korn, Srivastava. Explanation Tables. IEEE Debull 2018

Information Retrieval: Inverse Document Frequency

[Robertson'04] Understanding Inverse Document Frequency: On theoretical arguments for IDF, Journal of Documentation, 2004.
[Aizawa'03] An information-theoretic perspective of tf–idf measures, IP&M 2003.
[Lin'98] An Information-Theoretic Definition of Similarity. ICML 1998.

Research best practice

[Azuma'19] Everything I wanted to know about C.S. graduate school at the beginning but didn’t learn until later. 1997-2019
[Dalio'17] Principles by Ray Dalio. Also see Youtubue videos.
[MIT'20] Version control (git)
[LaTeX'03] Indian Tex User group Latex Tutorial. An old but still nice comprehensive quick intro to LaTeX
[Oetiker'21] The Not So Short Introduction to LaTeX.
[Trettin, Fenn'07] An essential guide to LaTeX usage: Obsolete commands and packages.
[Kurose'09] 15 pieces of advice I wish my PhD advisor had given me.