CS 7840: Foundations and Applications of Information Theory (Fall 2025)
Topics and approximate agenda
This schedule will be updated regularly as the class progresses. Check back frequently.
I will post lecture slides by the end of the day following a lecture (i.e., the *next* day).
The reason is that I often walk away from lecture with ideas on how to improve the slides, and incorporating them takes time. You can always glance at the slides from a previous edition of this class.
Posted slides are cumulative per topic.
PART 1: Information Theory (the basics)
Covers the basic mathematical framework behind entropy and its various forms. Starts with a probability primer.
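As a quick preview of the quantities covered in this part, here is a minimal Python sketch (illustrative only, not one of the posted course notebooks; the joint distribution p(x,y) is made up) computing entropy, joint and conditional entropy, and mutual information:

import numpy as np

p_xy = np.array([[0.25, 0.25],   # rows index x, columns index y; entries sum to 1
                 [0.40, 0.10]])

def H(p):
    """Shannon entropy in bits of a probability array (convention: 0 log 0 = 0)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_xy = H(p_xy)                   # joint entropy H(X,Y)
H_x = H(p_xy.sum(axis=1))        # marginal entropy H(X)
H_y = H(p_xy.sum(axis=0))        # marginal entropy H(Y)
H_x_given_y = H_xy - H_y         # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_xy = H_x + H_y - H_xy          # mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)

print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}  "
      f"H(X|Y)={H_x_given_y:.3f}  I(X;Y)={I_xy:.3f}")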
-
(Thu 9/4)
No class (lecturer is unfortunately unavailable)
-
Lecture 1 (Mon 9/8)
Course introduction with end-to-end encoding example
-
Lecture 2 (Thu 9/11)
Basics of Probability [random experiment, independence, conditional probability, conditional independence, chain rule, Bayes' theorem, random variables, expectation, variance, Markov chains]
-
Lecture 3 (Mon 9/15)
Basics of information theory (1/5) [measures of information, intuition behind entropy]
-
Lecture 4 (Thu 9/18)
Basics of information theory (2/5) [conditional entropy, binary entropy, max entropy]
-
Lecture 5 (Mon 9/22)
Basics of information theory (3/5) [joint entropy, conditional entropy, mutual information, cross entropy]
-
Lecture 6 (Thu 9/25)
Basics of information theory (4/5) [multivariate entropies, interaction information, Markov chains, data processing inequality]
7840 Python notebook: entropies
-
Lecture 7 (Mon 9/29)
Basics of information theory (5/5) [data processing inequality, sufficient statistics, information inequalities]
PART 2: Compression
Covers an algorithmic derivation of entropy via compression:
we establish entropy as the fundamental limit for the compression of information and hence as a natural measure of efficient description length. Entropy then falls out as a simple consequence of deriving optimal codes for compression.
We may (or may not) cover the method of types, a powerful combinatorial tool in information theory for analyzing the probabilities of sequences, and use it to see how entropy and relative entropy naturally emerge in probability estimates and to give short, intuitive proofs of Shannon's coding theorems (channel capacity, source coding).
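To preview the connection between optimal codes and entropy, here is a minimal Python sketch (illustrative only, not lecture code; the source distribution is made up and chosen to be dyadic) that builds a Huffman code and compares its expected code length L with the entropy H. For an optimal prefix code, H <= L < H + 1.

import heapq
from math import log2

# Made-up (dyadic) source distribution; for dyadic probabilities L equals H exactly.
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

def huffman(probs):
    """Return an optimal prefix code {symbol: bitstring} via Huffman's algorithm."""
    # Heap entries: (probability, tie-breaker, partial code for the merged subtree).
    heap = [(q, i, {s: ''}) for i, (s, q) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        q1, _, c1 = heapq.heappop(heap)   # the two least likely subtrees ...
        q2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}   # ... get prefixed with 0/1
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (q1 + q2, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman(p)
H = -sum(q * log2(q) for q in p.values())          # source entropy in bits/symbol
L = sum(p[s] * len(w) for s, w in code.items())    # expected code length
print(code)
print(f"H = {H:.3f} bits, L = {L:.3f} bits")       # optimal prefix code: H <= L < H + 1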
-
Lecture 8 (Thu 10/2)
Compression (1/5) [algorithmic derivation of entropy via compression]
-
Lecture 9 (Mon 10/6) / P1 Project ideas
Compression (2/5)
[uniquely decodable codes]
-
Lecture 10 (Thu 10/9)
Compression (3/5)
-
(Mon 10/13): no class (Indigenous Peoples Day, formerly Columbus Day)
-
Lecture 11 (Thu 10/16)
Compression (4/5)
-
Lecture 12 (Mon 10/20)
Compression (5/5)
-
Lecture 13 (Thu 10/23) / P2 Project proposal
Method of Types
[Sanov's theorem, large deviation theory]
PART 3: The axiomatic approach (deriving formulations from first principles)
Covers the axiomatic approach from multiple angles: a few simple principles (axioms) lead to entropy, or to the laws of probability, uniquely up to constant factors. Starting from a list of postulates and deriving the particular solution they imply is a powerful approach that has been used across different areas of computer science (e.g., how to define the right scoring function for achieving a desired outcome).
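For concreteness, here is a sketch of one standard axiom set, essentially Shannon's own from [Shannon'48]; the lectures may use a different but equivalent list. Require of an uncertainty measure $H(p_1,\dots,p_n)$:
(1) $H$ is continuous in the $p_i$;
(2) for uniform distributions, $A(n) := H(\tfrac{1}{n},\dots,\tfrac{1}{n})$ is monotonically increasing in $n$;
(3) grouping: if a choice is broken down into two successive choices, $H$ is the weighted sum of the individual values, e.g.
$$H(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}) = H(\tfrac{1}{2},\tfrac{1}{2}) + \tfrac{1}{2}\, H(\tfrac{2}{3},\tfrac{1}{3}).$$
The only $H$ satisfying (1)-(3) is
$$H(p_1,\dots,p_n) = -K \sum_{i=1}^{n} p_i \log p_i, \qquad K > 0,$$
i.e., Shannon entropy is unique up to the constant $K$ (the choice of logarithm base), and the uniform case recovers the Hartley measure $A(n) = K \log n$.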
-
Lecture 14 (Mon 10/27)
Derivation of Hartley measure and entropy function from first principles
-
Lecture 15 (Thu 10/30)
Cox's theorem: a derivation of the laws of probability theory from a certain set of postulates.
Contrast with Kolmogorov’s “probability axioms”
-
TBD
Shapley value
PART 4: Selected Applications to data management, machine learning and information retrieval
Covers example applications of basic ideas from information theory to practical problems in data management, machine learning, and information retrieval. Topics and discussed papers may vary from year to year.
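As one running example for this part, here is a minimal Python sketch (illustrative only, not lecture code; the toy play/outlook data is made up) of the information gain of a decision-tree split, i.e., the reduction in label entropy achieved by splitting on a feature:

from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of `labels` minus the weighted entropy after splitting on `feature_values`."""
    n = len(labels)
    branches = {}
    for y, v in zip(labels, feature_values):
        branches.setdefault(v, []).append(y)
    cond = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - cond

# Made-up toy data: how much does splitting on 'outlook' tell us about 'play'?
play    = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes']
outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny']
print(f"IG(play; outlook) = {information_gain(play, outlook):.3f} bits")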
-
Lecture 16 (Mon 11/3)
Decision trees (1/2) [Hunt's algorithm, information gain, gini, gain ratio]
-
Lecture 17 (Thu 11/6)
Decision trees (2/2)
[MDL for decision trees]
Logistic Regression (1/2)
[Deriving multinomial logistic regression as maximum entropy model, Lagrange multipliers, softmax]
Python notebooks: 202, 204, 208, 212
-
Lecture 18 (Mon 11/10)
Maximum Entropy (1/2)
[Deriving the Maximum Entropy Principle]
Python notebooks: 224
-
Lecture 19 (Thu 11/13) / P3 Intermediate report
Logistic Regression (2/2)
[Luce's choice axiom, Bradley-Terry model]
Maximum Entropy (2/2)
[Occam, Kolmogorov, Minimum Description Length (MDL)]
-
Lecture 20 (Mon 11/17)
Channel capacity [Cover,Thomas'06: Ch 7]
-
Lecture 21 (Thu 11/20)
Rate Distortion Theory (1/2)
[Cover Thomas'06: Ch 10]
-
Lecture 22 (Mon 11/24)
Rate Distortion Theory (2/2)
[Cover Thomas'06: Ch 10]
Python notebooks: 232
-
(Thu 11/27): no class (Fall break)
-
Lecture 23 (Mon 12/1)
Information Bottleneck Theory
PART 5: Project presentations
-
Lecture 24 (Thu 12/4): P4 Project presentations / P5 Final report
-
Lecture 25 (Mon 12/8): P4 Project presentations / P5 Final report
-
Lecture 26 (Thu 12/11): P4 Project presentations / P5 Final report
Literature list
Part 1: Information Theory (the basics)
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 2.2 Joint Entropy and Conditional Entropy,
Ch 2.3 Relative Entropy and Mutual Information,
Ch 2.5 Chain Rules,
Ch 2.8 Data-Processing Inequality,
Ch 2.9 Sufficient Statistics,
Ch 2.10 Fano’s Inequality,
Ch 3 Asymptotic Equipartition Property (AEP),
Ch 4 Markov chains/entropy rates,
Ch 5 Data Compression (5.2 Kraft inequality, 5.6 Huffman codes),
Ch 12 Maximum Entropy,
Ch 14 Kolmogorov Complexity,
Ch 28 Occam's Razor & Minimum description length (MDL)
-
[Pishro-Nik'14] Introduction to Probability, Statistics, and Random Processes, Online book, 2014: Section 5.1.3 is a nice refresher on conditioning and independence
-
[Schneider'13] Information Theory Primer, With an Appendix on Logarithms. 2013: nice refresher on logarithms
-
[Moser'18] Information Theory (Lecture Notes), 6th ed, 2018.
Ch 1 Shannon's measure of information,
Ch 3.1 Relative entropy
-
[Olah'15] Visual Information Theory (blog entry). 2015
-
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 2 Probability Entropy and Inference,
Ch 4 The Source Coding Theorem
-
[MIT 6.004] Chris Terman. L01: Basics of Information (lecture notes and videos), 6.004 Computation Structures, 2015-2017.
-
[3Blue1Brown Bayes] Bayes theorem, the geometry of changing beliefs (Youtube). 2019
-
[Godoy'18] Understanding binary cross-entropy / log loss: a visual explanation (blog entry). 2018. (Youtube video)
-
[Murphy'12] Machine Learning: a Probabilistic Perspective, MIT press, 2012:
Ch 2 Probability (including information theory),
Ch 8 Logistic regression
-
[Murphy'22] Probabilistic Machine Learning: An Introduction, MIT press, 2022:
Ch 2.4 Bernoulli and binomial distributions (including Sigmoid),
Ch 2.5 Categorical and multinomial distributions (including Softmax),
Ch 6 Information Theory,
Ch 10 Logistic regression
-
[Casella,Berger'24] Statistical inference (2nd ed), CRC press, 2024:
Ch 6 Principles of data reduction,
Ch 6.2.1 Sufficient statistics
-
[Fithian'24] Statistics 210a: Theoretical Statistics (Lecture 4: sufficiency), Berkeley, 2024.
-
[Yeung'08] Information Theory and Network Coding. 2008:
Ch 2.6 The basic inequalities,
Ch 2.7 Some Useful Information Inequalities,
Ch 3.5 Information Diagrams,
Ch 13 Information inequalities,
Ch 14 Shannon-type inequalities,
Ch 15 Beyond Shannon-type inequalities
-
[Suciu'23] Applications of Information Inequalities to Database Theory Problems, LICS keynote 2023.
(Slides)
-
[Suciu'23b] CS294-248: Topics in Database Theory, Berkeley 2023 (slides and videos):
Unit 5a Information Inequalities,
Unit 5b Database Constraints
-
cs7840 Python notebook: entropies
Part 2: Compression and Method of Types
-
Compression
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 5.9 Shannon–Fano–Elias Coding,
Ch 13.3 Arithmetic Coding,
Ch 5.11 Generation of Discrete Distributions from Fair Coins,
Ch 13.4 Lempel–Ziv Coding,
Ch 13.5 Optimality of Lempel–Ziv Algorithms
-
[Moser'18] Information Theory (Lecture Notes), 6th ed, 2018.
Ch 4.8.2 Shannon code,
Ch 4.8.3 Fano code,
Ch 4.9 Huffman code,
Ch 5.3 Arithmetic coding,
Ch 7.3 Lempel-Ziv
-
[LNTwww] Compression According to Lempel, Ziv and Welch. 2023:
Example 3
-
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 6 Stream codes (Arithmetic Coding, Lempel-Ziv)
-
[wikipedia Arithmetic Coding] Arithmetic coding
-
Method of Types
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 11.1 Method of Types,
Ch 11.4 Large Deviation Theory (Sanov's theorem),
Ch 11.5 Examples of Sanov's Theorem,
Ch 11.6 Conditional Limit Theorem
Part 3: The axiomatic approach (deriving formulations from first principles)
-
[Klir'06] Uncertainty and Information: Foundations of Generalized Information Theory, Wiley, 2006.
Ch 3.2.1. Simple Derivation of the Shannon Entropy, Ch 3.2.2. Uniqueness of the Shannon Entropy, Ch 9.3.1. Principle of Maximum Entropy
-
[Cox'46] Probability, frequency and reasonable expectation, American Journal of Physics, 1946.
-
[Shannon'48] A Mathematical Theory of Communication, The Bell System Technical Journal, 1948.
-
[Jaynes'03] Probability theory: the logic of science, Cambridge press, 2003. Section 11. Discrete prior probabilities: the entropy principle. (read the reviews on Amazon)
-
[Van Horn'03] Constructing a logic of plausible inference: a guide to Cox's theorem, International Journal of Approximate Reasoning, 2003.
-
[Fleming,Wallace'86] How not to lie with statistics: the correct way to summarize benchmark results. CACM 1986.
-
[Kleinberg'02] An Impossibility Theorem for Clustering. NeurIPS 2002.
-
[wikipedia Probability axioms] Probability axioms.
-
[wikipedia Cox's theorem] Cox's theorem.
Part 4: Selected Applications to data management, machine learning and information retrieval
-
Information Gain in Decision trees
-
[Tan+'18] Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd ed, 2018.
Ch 3 Classification with Decision trees (freely available chapter)
-
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009.
Ch 9.2 Tree-based methods
-
[Mitchell'97] Machine Learning, McGraw-Hill, 1997.
(Online PDF posted by Tom Mitchell).
Ch 3.6.2 Occam's Razor
-
[wikipedia Information Gain] Information gain in decision trees
-
[Hyafil,Rivest'76] Constructing optimal binary decision trees is NP-complete, Information Processing Letters, 1976.
-
[Russell,Norvig'20] Artificial Intelligence: A Modern Approach, 4th ed, 2020. Ch 19.3 Learning Decision Trees
-
[MLU-explain'22] Wilber, Santamaria. Decision Trees: The unreasonable power of nested decision rules. (an interactive guide)
-
Logistic regression, softmax, maximum entropy, cross-entropy
-
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009.
Ch 4.4 Logistic regression
-
[Hui,Belkin'21] Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, ICLR 2021.
-
[Mount'11] The equivalence of logistic regression and maximum entropy models
-
[wikipedia Softmax] Softmax function
-
[Cramer'03] The origins and development of the logit model, 2003 (extended book chapter)
-
[Zombori+'23] Zombori, Rissaki, Szabo, Gatterbauer, Benedikt. Towards Unbiased Exploration in Partial Label Learning, 2023.
-
Python notebook: logistic regression
-
Bradley-Terry model, Luce's choice axiom, Item Response Theory
-
Minimum Description Length (MDL), Kolmogorov Complexity
-
[wikipedia MDL] Minimum description length
-
[Gruenwald'04] A Tutorial Introduction to the Minimum Description Length Principle, book chapter 2005. (free preprint)
-
[Vreeken,Yamanishi'19] Modern MDL meets Data Mining: Insights, Theory, and Practice (tutorial), KDD 2019.
-
[Li,Vitanyi'19] An introduction to Kolmogorov complexity and its applications, 4th ed, Springer, 2019. Ch 5.4 Hypothesis Identification by MDL
-
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 28.3 Minimum Description Length (MDL)
-
[Domingos'99] The role of Occam's Razor in Knowledge Discovery. 1999
-
[Mitchell'97] Machine Learning, McGraw-Hill, 1997.
(Online PDF posted by Tom Mitchell).
Ch 6.6 MDL
-
[Sutskever'23] An Observation on Generalization (recorded talk at the Simons Institute for the Theory of Computing, Berkeley)
-
Rate Distortion & Information bottleneck theory
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 10 Rate distortion theory
-
[Tishby+'99] Tishby, Pereira, Bialek. The information bottleneck method. The 37th annual Allerton Conference on Communication, Control, and Computing. pp. 368–377.
-
[Harremoes,Tishby'07] The Information Bottleneck Revisited or How to Choose a Good Distortion Measure. International Symposium on Information Theory, 2007.
-
[Zaslavsky+'18] Zaslavsky, Kemp, Regier, Tishby. Efficient compression in color naming and its evolution. PNAS, 2018.
-
[Webb+'24] Webb, Frankland, Altabaa, Segert, Krishnamurthy, Campbell, Russin, Giallanza, Dulberg, O'Reilly, Lafferty, Cohen. The Relational Bottleneck as an Inductive Bias for Efficient Abstraction. Trends in Cognitive Sciences, 2024.
-
[Segert'24] Maximum Entropy, Symmetry, and the Relational Bottleneck: Unraveling the Impact of Inductive Biases on Systematic Reasoning. PhD thesis, Neuroscience @ Princeton, 2024.
-
[Ren,Li,Leskovec'20] Graph Information Bottleneck, NeurIPS, 2020.
-
[Zaidi+'20] Zaidi, Estella Aguerri, Shamai. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy, 2020.
-
Machine Learning: VC dimensions
-
[Shalev-Shwartz,Ben-David'14] Understanding machine learning: from theory to algorithms, Cambridge University Press, 2014. Ch 6 The VC-dimension
-
[Mohri+'18] Mohri, Rostamizadeh, Talwalkar. Foundations of Machine learning. MIT press, 2018. Ch 3 Rademacher Complexity and VC-Dimension
-
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009. Ch 7.9 VC dimension
-
[Piantadosi'18] One parameter is always enough. 2018
-
[Boue'19] Real numbers, data science and chaos: How to fit any dataset with a single parameter, 2019
-
[wiki VC dimension] Wikipedia, Vapnik–Chervonenkis dimension.
-
Data Management: Normal forms
-
Data Management: Approximate acyclic schemas
-
Data Management: Information inequalities and cardinality estimation
-
Data Management: Approximate functional dependencies
-
Data Management: Explanation tables
-
[El Gebaly+'18] El Gebaly, Feng, Golab, Korn, Srivastava. Explanation Tables. IEEE Data Engineering Bulletin, 2018.
-
Information Retrieval: Inverse Document Frequency
Research best practice