CS 7840: Foundations and Applications of Information Theory (Fall 2025)
Topics and approximate agenda
This schedule will be updated regularly as the class progresses. Check back frequently.
I will post lecture slides by the end of the day following a lecture (i.e., the *next* day).
The reason is that I often walk away from lecture with ideas on how to improve the slides, and incorporating them takes time. You can always glance at the slides from a previous edition of this class.
Posted slides are cumulative per topic.
PART 1: Information Theory (the basics)
Covers the basic mathematical framework behind entropy and its various forms. Starts with a probability primer.
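As a quick preview of the quantities covered in this part, here is a minimal Python sketch (illustrative only, not one of the posted course notebooks; the joint distribution p(x,y) is made up) computing entropy, joint and conditional entropy, and mutual information:

import numpy as np

p_xy = np.array([[0.25, 0.25],   # rows index x, columns index y; entries sum to 1
                 [0.40, 0.10]])

def H(p):
    """Shannon entropy in bits of a probability array (convention: 0 log 0 = 0)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

H_xy = H(p_xy)                   # joint entropy H(X,Y)
H_x = H(p_xy.sum(axis=1))        # marginal entropy H(X)
H_y = H(p_xy.sum(axis=0))        # marginal entropy H(Y)
H_x_given_y = H_xy - H_y         # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_xy = H_x + H_y - H_xy          # mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)

print(f"H(X)={H_x:.3f}  H(Y)={H_y:.3f}  H(X,Y)={H_xy:.3f}  "
      f"H(X|Y)={H_x_given_y:.3f}  I(X;Y)={I_xy:.3f}")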
-
(Thu 9/4)
No class (lecturer is unfortunately unavailable)
-
Lecture 1 (Mon 9/8)
Course introduction with end-to-end encoding example
-
Lecture 2 (Thu 9/11)
Basics of Probability [random experiment, independence, conditional probability, conditional independence, chain rule, Bayes' theorem, random variables, expectation, variance, Markov chains]
-
Lecture 3 (Mon 9/15)
Basics of information theory (1/5) [measures of information, intuition behind entropy]
-
Lecture 4 (Thu 9/18)
Basics of information theory (2/5) [conditional entropy, binary entropy, max entropy]
-
Lecture 5 (Mon 9/22)
Basics of information theory (3/5) [joint entropy, conditional entropy, mutual information, cross entropy]
-
Lecture 6 (Thu 9/25)
Basics of information theory (4/5) [multivariate entropies, interaction information, Markov chains, data processing inequality]
7840 Python notebook: entropies
-
Lecture 7 (Mon 9/29)
Basics of information theory (5/5) [data processing inequality, sufficient statistics, information inequalities]
PART 2: Compression
Covers an algorithmic derivation of entropy via compression:
we establish entropy as the fundamental limit for the compression of information and hence as a natural measure of efficient description length. Entropy then falls out as a simple consequence of deriving optimal codes for compression.
We may (or may not) cover the method of types, a powerful combinatorial tool in information theory for analyzing the probabilities of sequences, and use it to see how entropy and relative entropy naturally emerge in probability estimates and to give short, intuitive proofs of Shannon's coding theorems (channel capacity, source coding).
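To preview the connection between optimal codes and entropy, here is a minimal Python sketch (illustrative only, not lecture code; the source distribution is made up and chosen to be dyadic) that builds a Huffman code and compares its expected code length L with the entropy H. For an optimal prefix code, H <= L < H + 1.

import heapq
from math import log2

# Made-up (dyadic) source distribution; for dyadic probabilities L equals H exactly.
p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

def huffman(probs):
    """Return an optimal prefix code {symbol: bitstring} via Huffman's algorithm."""
    # Heap entries: (probability, tie-breaker, partial code for the merged subtree).
    heap = [(q, i, {s: ''}) for i, (s, q) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        q1, _, c1 = heapq.heappop(heap)   # the two least likely subtrees ...
        q2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}   # ... get prefixed with 0/1
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (q1 + q2, tie, merged))
        tie += 1
    return heap[0][2]

code = huffman(p)
H = -sum(q * log2(q) for q in p.values())          # source entropy in bits/symbol
L = sum(p[s] * len(w) for s, w in code.items())    # expected code length
print(code)
print(f"H = {H:.3f} bits, L = {L:.3f} bits")       # optimal prefix code: H <= L < H + 1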
-
Lecture 8 (Thu 10/2)
Compression (1/5) [algorithmic derivation of entropy via compression]
-
Lecture 9 (Mon 10/6) / P1 Project ideas
Compression (2/5)
[uniquely decodable codes]
-
Lecture 10 (Thu 10/9)
Compression (3/5)
-
(Mon 10/13): no class (Indigenous Peoples Day, formerly Columbus Day)
-
Lecture 11 (Thu 10/16)
Compression (4/5)
-
Lecture 12 (Mon 10/20)
Compression (5/5)
-
Lecture 13 (Thu 10/23) / P2 Project proposal
Method of Types
[Sanov's theorem, large deviation theory]
PART 3: The axiomatic approach (deriving formulations from first principles)
Covers the axiomatic approach from multiple angles: a few simple principles (axioms) lead to entropy, or to the laws of probability, uniquely up to constant factors. Starting from a list of postulates and deriving the particular solution they imply is a powerful approach that has been used across different areas of computer science (e.g., how to define the right scoring function for achieving a desired outcome).
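For concreteness, here is a sketch of one standard axiom set, essentially Shannon's own from [Shannon'48]; the lectures may use a different but equivalent list. Require of an uncertainty measure $H(p_1,\dots,p_n)$:
(1) $H$ is continuous in the $p_i$;
(2) for uniform distributions, $A(n) := H(\tfrac{1}{n},\dots,\tfrac{1}{n})$ is monotonically increasing in $n$;
(3) grouping: if a choice is broken down into two successive choices, $H$ is the weighted sum of the individual values, e.g.
$$H(\tfrac{1}{2},\tfrac{1}{3},\tfrac{1}{6}) = H(\tfrac{1}{2},\tfrac{1}{2}) + \tfrac{1}{2}\, H(\tfrac{2}{3},\tfrac{1}{3}).$$
The only $H$ satisfying (1)-(3) is
$$H(p_1,\dots,p_n) = -K \sum_{i=1}^{n} p_i \log p_i, \qquad K > 0,$$
i.e., Shannon entropy is unique up to the constant $K$ (the choice of logarithm base), and the uniform case recovers the Hartley measure $A(n) = K \log n$.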
-
Lecture 14 (Mon 10/27)
Derivation of Hartley measure and entropy function from first principles
-
Lecture 15 (Thu 10/30)
Cox's theorem: a derivation of the laws of probability theory from a certain set of postulates.
Contrast with Kolmogorov’s “probability axioms”
-
TBD
Shapley value
PART 4: Selected Applications to data management, machine learning and information retrieval
Covers example applications of basic ideas from information theory to practical problems in data management, machine learning, and information retrieval. Topics and discussed papers may vary from year to year.
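As one running example for this part, here is a minimal Python sketch (illustrative only, not lecture code; the toy play/outlook data is made up) of the information gain of a decision-tree split, i.e., the reduction in label entropy achieved by splitting on a feature:

from collections import Counter
from math import log2

def entropy(labels):
    """Empirical entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of `labels` minus the weighted entropy after splitting on `feature_values`."""
    n = len(labels)
    branches = {}
    for y, v in zip(labels, feature_values):
        branches.setdefault(v, []).append(y)
    cond = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - cond

# Made-up toy data: how much does splitting on 'outlook' tell us about 'play'?
play    = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes']
outlook = ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny']
print(f"IG(play; outlook) = {information_gain(play, outlook):.3f} bits")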
-
Lecture 16 (Mon 11/3)
Decision trees (1/2) [Hunt's algorithm, information gain, gini, gain ratio]
-
Lecture 17 (Thu 11/6)
Decision trees (2/2)
[MDL for decision trees]
Logistic Regression (1/2)
[Deriving multinomial logistic regression as maximum entropy model, Lagrange multipliers, softmax]
Python notebooks: 202, 204, 208, 212
-
Lecture 18 (Mon 11/10)
Maximum Entropy (1/2)
[Deriving the Maximum Entropy Principle]
Python notebooks: 224
-
Lecture 19 (Thu 11/13) / P3 Intermediate report
Logistic Regression (2/2)
[Luce's choice axiom, Bradley-Terry model]
Maximum Entropy (2/2)
[Occam, Kolmogorov, Minimum Description Length (MDL)]
-
Lecture 20 (Mon 11/17)
Channel capacity [Cover,Thomas'06: Ch 7]
-
Lecture 21 (Thu 11/20)
Rate Distortion Theory (1/2)
[Cover Thomas'06: Ch 10]
-
Lecture 22 (Mon 11/24)
Rate Distortion Theory (2/2)
[Cover Thomas'06: Ch 10]
Python notebooks: 232
-
(Thu 11/27): no class (Fall break)
-
Lecture 23 (Mon 12/1)
Information Bottleneck Theory
PART 5: Project presentations
-
Lecture 24 (Thu 12/4): P4 Project presentations / P5 Final report
-
Lecture 25 (Mon 12/8): P4 Project presentations / P5 Final report
-
Lecture 26 (Thu 12/11): P4 Project presentations / P5 Final report
Literature list
Part 1: Information Theory (the basics)
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 2.2 Joint Entropy and Conditional Entropy,
Ch 2.3 Relative Entropy and Mutual Information,
Ch 2.5 Chain Rules,
Ch 2.8 Data-Processing Inequality,
Ch 2.9 Sufficient Statistics,
Ch 2.10 Fano’s Inequality,
Ch 3 Asymptotic Equipartition Property (AEP),
Ch 4 Markov chains/entropy rates,
Ch 5 Data Compression (5.2 Kraft inequality, 5.6 Huffman codes),
Ch 12 Maximum Entropy,
Ch 14 Kolmogorov Complexity,
Ch 28 Occam's Razor & Minimum description length (MDL)
-
[Pishro-Nik'14] Introduction to Probability, Statistics, and Random Processes, Online book, 2014: Section 5.1.3 is a nice refresher on conditioning and independence
-
[Schneider'13] Information Theory Primer, With an Appendix on Logarithms. 2013: nice refresher on logarithms
-
[Moser'18] Information Theory (Lecture Notes), 6th ed, 2018.
Ch 1 Shannon's measure of information,
Ch 3.1 Relative entropy
-
[Olah'15] Visual Information Theory (blog entry). 2015
-
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 2 Probability Entropy and Inference,
Ch 4 The Source Coding Theorem
-
[MIT 6.004] Chris Terman. L01: Basics of Information (lecture notes and videos), 6.004 Computation Structures, 2015-2017.
-
[3Blue1Brown Bayes] Bayes theorem, the geometry of changing beliefs (Youtube). 2019
-
[Godoy'18] Understanding binary cross-entropy / log loss: a visual explanation (blog entry). 2018. (Youtube video)
-
[Murphy'12] Machine Learning: a Probabilistic Perspective, MIT press, 2012:
Ch 2 Probability (including information theory),
Ch 8 Logistic regression
-
[Murphy'22] Probabilistic Machine Learning: An Introduction, MIT press, 2022:
Ch 2.4 Bernoulli and binomial distributions (including Sigmoid),
Ch 2.5 Categorical and multinomial distributions (including Softmax),
Ch 6 Information Theory,
Ch 10 Logistic regression
-
[Casella,Berger'24] Statistical inference (2nd ed), CRC press, 2024:
Ch 6 Principles of data reduction,
Ch 6.2.1 Sufficient statistics
-
[Fithian'24] Statistics 210a: Theoretical Statistics (Lecture 4: sufficiency), Berkeley, 2024.
-
[Yeung'08] Information Theory and Network Coding. 2008:
Ch 2.6 The basic inequalities,
Ch 2.7 Some Useful Information Inequalities,
Ch 3.5 Information Diagrams,
Ch 13 Information inequalities,
Ch 14 Shannon-type inequalities,
Ch 15 Beyond Shannon-type inequalities
-
[Suciu'23] Applications of Information Inequalities to Database Theory Problems, LICS keynote 2023.
(Slides)
-
[Suciu'23b] CS294-248: Topics in Database Theory, Berkeley 2023 (slides and videos):
Unit 5a Information Inequalities,
Unit 5b Database Constraints
-
cs7840 Python notebook: entropies
Part 2: Compression and Method of Types
-
Compression
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 5.9 Shannon–Fano–Elias Coding,
Ch 13.3 Arithmetic Coding,
Ch 5.11 Generation of Discrete Distributions from Fair Coins,
Ch 13.4 Lempel–Ziv Coding,
Ch 13.5 Optimality of Lempel–Ziv Algorithms
-
[Moser'18] Information Theory (Lecture Notes), 6th ed, 2018.
Ch 4.8.2 Shannon code,
Ch 4.8.3 Fano code,
Ch 4.9 Huffman code,
Ch 5.3 Arithmetic coding,
Ch 7.3 Lempel-Ziv
-
[LNTwww] Compression According to Lempel, Ziv and Welch. 2023:
Example 3
-
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 6 Stream codes (Arithmetic Coding, Lempel-Ziv)
-
[wikipedia Arithmetic Coding] Arithmetic coding
-
Method of Types
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 11.1 Method of Types,
Ch 11.4 Large Deviation Theory (Sanov's theorem),
Ch 11.5 Examples of Sanov's Theorem,
Ch 11.6 Conditional Limit Theorem
Part 3: The axiomatic approach (deriving formulations from first principles)
-
[Klir'06] Uncertainty and Information: Foundations of Generalized Information Theory, Wiley, 2006.
Ch 3.2.1. Simple Derivation of the Shannon Entropy, Ch 3.2.2. Uniqueness of the Shannon Entropy, Ch 9.3.1. Principle of Maximum Entropy
-
[Cox'46] Probability, frequency and reasonable expectation, American Journal of Physics, 1946.
-
[Shannon'48] A Mathematical Theory of Communication, The Bell System Technical Journal, 1948.
-
[Jaynes'03] Probability theory: the logic of science, Cambridge press, 2003. Section 11. Discrete prior probabilities: the entropy principle. (read the reviews on Amazon)
-
[Van Horn'03] Constructing a logic of plausible inference: a guide to Cox's theorem, International Journal of Approximate Reasoning, 2003.
-
[Fleming,Wallace'86] How not to lie with statistics: the correct way to summarize benchmark results. CACM 1986.
-
[Kleinberg'02] An Impossibility Theorem for Clustering. NeurIPS 2002.
-
[wikipedia Probability axioms] Probability axioms.
-
[wikipedia Cox's theorem] Cox's theorem.
Part 4: Selected Applications to data management, machine learning and information retrieval
-
Information Gain in Decision trees
-
[Tan+'18] Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd ed, 2018.
Ch 3 Classification with Decision trees (freely available chapter)
-
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009.
Ch 9.2 Tree-based methods
-
[Mitchell'97] Machine Learning, McGraw-Hill, 1997.
(Online PDF posted by Tom Mitchell).
Ch 3.6.2 Occam's Razor
-
[wikipedia Information Gain] Information gain in decision trees
-
[Hyafil,Rivest'76] Constructing optimal binary decision trees is NP-complete, Information Processing Letters, 1976.
-
[Russell,Norvig'20] Artificial Intelligence: A Modern Approach, 4th ed, 2020. Ch 19.3 Learning Decision Trees
-
[MLU-explain'22] Wilber, Santamaria. Decision Trees: The unreasonable power of nested decision rules. (an interactive guide)
-
Logistic regression, softmax, maximum entropy, cross-entropy
-
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009.
Ch 4.4 Logistic regression
-
[Hui,Belkin'21] Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, ICLR 2021.
-
[Mount'11] The equivalence of logistic regression and maximum entropy models
-
[wikipedia Softmax] Softmax function
-
[Cramer'03] The origins and development of the logit model, 2003 (extended book chapter)
-
[Zombori+'23] Zombori, Rissaki, Szabo, Gatterbauer, Benedikt. Towards Unbiased Exploration in Partial Label Learning, 2023.
-
Python notebook: logistic regression
-
Bradley-Terry model, Luce's choice axiom, Item Response Theory
-
Minimum Description Length (MDL), Kolmogorov Complexity
-
[wikipedia MDL] Minimum description length
-
[Gruenwald'04] A Tutorial Introduction to the Minimum Description Length Principle, book chapter 2005. (free preprint)
-
[Vreeken,Yamanishi'19] Modern MDL meets Data Mining: Insights, Theory, and Practice (tutorial), KDD 2019.
-
[Li,Vitanyi'19] An introduction to Kolmogorov complexity and its applications, 4th ed, Springer, 2019. Ch 5.4 Hypothesis Identification by MDL
-
[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 28.3 Minimum Description Length (MDL)
-
[Domingos'99] The role of Occam's Razor in Knowledge Discovery. 1999
-
[Mitchell'97] Machine Learning, McGraw-Hill, 1997.
(Online PDF posted by Tom Mitchell).
Ch 6.6 MDL
-
[Sutskever'23] An Observation on Generalization (recorded talk at the Simons Institute for the Theory of Computing, Berkeley)
-
Rate Distortion & Information bottleneck theory
-
[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 10 Rate distortion theory
-
[Tishby+'99] Tishby, Pereira, Bialek. The information bottleneck method. The 37th annual Allerton Conference on Communication, Control, and Computing. pp. 368–377.
-
[Harremoes,Tishby'07] The Information Bottleneck Revisited or How to Choose a Good Distortion Measure. International Symposium on Information Theory, 2007.
-
[Zaslavsky+'18] Zaslavsky, Kemp, Regier, Tishby. Efficient compression in color naming and its evolution. PNAS, 2018.
-
[Webb+'24] Webb, Frankland, Altabaa, Segert, Krishnamurthy, Campbell, Russin, Giallanza, Dulberg, O'Reilly, Lafferty, Cohen. The Relational Bottleneck as an Inductive Bias for Efficient Abstraction. Trends in Cognitive Sciences, 2024.
-
[Segert'24] Maximum Entropy, Symmetry, and the Relational Bottleneck: Unraveling the Impact of Inductive Biases on Systematic Reasoning. PhD thesis, Neuroscience @ Princeton, 2024.
-
[Ren,Li,Leskovec'20] Graph Information Bottleneck, NeurIPS, 2020.
-
[Zaidi+'20] Zaidi, Estella Aguerri, Shamai. On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. Entropy, 2020.
-
Machine Learning: VC dimensions
-
[Shalev-Shwartz,Ben-David'14] Understanding machine learning: from theory to algorithms, Cambridge University Press, 2014. Ch 6 The VC-dimension
-
[Mohri+'18] Mohri, Rostamizadeh, Talwalkar. Foundations of Machine learning. MIT press, 2018. Ch 3 Rademacher Complexity and VC-Dimension
-
[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009. Ch 7.9 VC dimension
-
[Piantadosi'18] One parameter is always enough. 2018
-
[Boue'19] Real numbers, data science and chaos: How to fit any dataset with a single parameter, 2019
-
[wiki VC dimension] Wikipedia, Vapnik–Chervonenkis dimension.
-
Data Management: Normal forms
-
Data Management: Approximate acyclic schemas
-
Data Management: Information inequalities and cardinality estimation
-
Data Management: Approximate functional dependencies
-
Data Management: Explanation tables
-
[El Gebaly+'18] El Gebaly, Feng, Golab, Korn, Srivastava. Explanation Tables. IEEE Data Engineering Bulletin, 2018.
-
Information Retrieval: Inverse Document Frequency
Research best practice