CS 7840: Foundations and Applications of Information Theory (Fall 2024)
Topics and approximate agenda
This schedule will be updated regularly as the class progresses. Check back frequently.
We will usually post lecture slides by the end of the day following a lecture (i.e., the next day), and at the latest two days after class. Posted slides are cumulative per topic.
PART 1: Information Theory (the basics)
Covers the basic mathematical framework behind entropy and its various forms.

Lecture 1 (Wed 9/4)
Course introduction and encoding example

Lecture 2 (Mon 9/9)
Basics of Probability (1/2) [random experiment, independence, conditional probability, chain rule, Bayes' theorem, random variables]

Lecture 3 (Wed 9/11)
Basics of Probability (2/2) [expectation, variance, Markov chains]
Basics of information theory (1/6) [measures of information, intuition behind entropy]

Lecture 4 (Mon 9/16)
Compression (1/3) [algorithmic derivation of entropy via compression]

Lecture 5 (Wed 9/18)
Basics of information theory (2/6) [conditional entropy, binary entropy, max entropy]

Lecture 6 (Mon 9/23)
Basics of information theory (3/6) [joint entropy, conditional entropy, mutual information, cross entropy]
Compression (2/3) [uniquely decodable codes]

Lecture 7 (Wed 9/25)
Compression (3/3) [uniquely decodable codes continued]

Lecture 8 (Mon 9/30)
Compression (continued) [Asymptotic Equipartition Property (AEP)]
Basics of information theory (4/6) [multivariate entropies]
Python notebook: entropies
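The notebook itself is posted separately. As a rough preview of the kind of computation it covers, the following minimal NumPy sketch (with a made-up 2x2 joint distribution, not data from the notebook) computes entropy, joint entropy, conditional entropy, and mutual information:

import numpy as np

def entropy(p):
    # Shannon entropy in bits of a distribution given as an array of probabilities.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # convention: 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# Made-up joint distribution P(X, Y) over two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

px, py = pxy.sum(axis=1), pxy.sum(axis=0)     # marginals P(X), P(Y)
H_X, H_Y = entropy(px), entropy(py)
H_XY = entropy(pxy.ravel())                   # joint entropy H(X, Y)
H_Y_given_X = H_XY - H_X                      # chain rule: H(Y|X) = H(X, Y) - H(X)
I_XY = H_X + H_Y - H_XY                       # mutual information I(X; Y)

print(H_X, H_Y, H_XY, H_Y_given_X, I_XY)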

Lecture 9 (Wed 10/2)
Basics of information theory (5/6) [multivariate entropies, interaction information, Markov chains, data processing inequality]

Lecture 10 (Mon 10/7)
Basics of information theory (6/6) [data processing inequality, sufficient statistics, information inequalities]
PART 2: The axiomatic approach (deriving formulations from first principles)
Covers the axiomatic approach from multiple angles: a few simple principles (axioms) that lead to entropy, or to the laws of probability, uniquely up to constant factors. Starting from a list of postulates and deriving the particular solution they force is a powerful approach that has been used across different areas of computer science (e.g., how to define the right scoring function for achieving a desired outcome).
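As one concrete preview of the flavor (in the spirit of Shannon'48 and Klir'06, Ch 3.2, from the literature list below): require only that an uncertainty measure H(p_1, ..., p_n) be continuous, that it increase with n on uniform distributions, and that it satisfy the grouping (branching) rule

H(p_1, p_2, p_3, \dots, p_n) = H(p_1 + p_2, p_3, \dots, p_n) + (p_1 + p_2)\, H\!\left(\tfrac{p_1}{p_1 + p_2}, \tfrac{p_2}{p_1 + p_2}\right).

The only functions satisfying these postulates are

H(p_1, \dots, p_n) = -K \sum_{i=1}^{n} p_i \log p_i, \qquad K > 0,

i.e., Shannon entropy up to a constant factor (equivalently, up to the base of the logarithm).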

Lecture TBD:
Derivation of Hartley measure and entropy function from first principles

Lecture TBD:
Cox's theorem: a derivation of the laws of probability theory from a certain set of postulates.
Contrast with Kolmogorov’s “probability axioms”
PART 3: Selected Applications to data management, machine learning and information retrieval
Covers example applications of basic ideas from information theory to practical problems in data management, machine learning, and information retrieval. Topics and discussed papers may vary from year to year.

Lecture 11 (Wed 10/9)
Compression (1/4)
[Cover, Thomas'06: Ch 5.9, Ch 13.3]

(Mon 10/14): no class (Indigenous Peoples Day)

Lecture 12 (Wed 10/16):
Compression (2/4)
[Cover Thomas'06: Ch 5.11]

Lecture 13 (Mon 10/21):
Compression (3/4)
[Cover Thomas'06: Ch 13.4, Ch 13.5]

Lecture 14 (Wed 10/23):
Compression (4/4)
Decision trees (1/2)

Lecture 15 (Mon 10/28):
Decision trees (2/2)
Logistic Regression (1/...) [connections between (multinomial) logistic regression, maximum entropy models, Lagrange multipliers, Occam's razor, softmax]
Python notebook: logistic regression
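The posted notebook is the reference for this lecture. The following minimal NumPy sketch (toy random data, no bias term, plain gradient descent; all names are illustrative) only shows the softmax and cross-entropy mechanics behind multinomial logistic regression:

import numpy as np

def softmax(z):
    # Turn arbitrary scores into a probability distribution per row.
    z = z - z.max(axis=1, keepdims=True)      # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, p):
    # Average cross-entropy between one-hot labels and predicted probabilities.
    return -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))

# Toy data: n examples, d features, k classes (labels are random, so this only shows mechanics).
rng = np.random.default_rng(0)
n, d, k = 200, 3, 4
X = rng.normal(size=(n, d))
y = rng.integers(k, size=n)
Y = np.eye(k)[y]                              # one-hot labels

W = np.zeros((d, k))                          # weight matrix of multinomial logistic regression
lr = 0.5
for _ in range(500):                          # gradient descent on the cross-entropy loss
    P = softmax(X @ W)
    W -= lr * (X.T @ (P - Y)) / n             # gradient of the loss with respect to W

print("final cross-entropy:", cross_entropy(Y, softmax(X @ W)))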

Lecture 16 (Wed 10/30):
Method of Types (1/2)

Lecture 17 (Mon 11/4):
Method of Types (2/2)
[Sanov's theorem, large deviation theory]

Lecture 18 (Wed 11/6):
Logistic Regression (2/...) [Occam's razor, maximum entropy, cross entropy, Bradley-Terry model, Luce's choice axiom, Item Response Theory]

(Mon 11/11): no class (Veterans Day)

Lecture 19 (Wed 11/13):
Minimum Description Length (MDL), Kolmogorov Complexity

Lecture 20 (Mon 11/18):
Rate Distortion Theory, Information Bottleneck Theory

Lecture 21 (Wed 11/20):

Lecture 22 (Mon 11/25):

(Wed 11/27): no class (Fall break)

Lecture 23 (Mon 12/2):
Inverse document frequency

Lecture 24 (Wed 12/4):
Placeholder
Project presentations

Lecture 25 (Mon 12/9): P4 Project presentations

Lecture 26 (Wed 12/11): P4 Project presentations
Literature list
Part 1: Information Theory (the basics)

[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 2.2 Joint Entropy and Conditional Entropy,
Ch 2.3 Relative Entropy and Mutual Information,
Ch 2.5 Chain Rules,
Ch 2.8 Data-Processing Inequality,
Ch 2.9 Sufficient Statistics,
Ch 2.10 Fano’s Inequality,
Ch 3 Asymptotic Equipartition Property (AEP),
Ch 4 Markov chains/entropy rates,
Ch 5 Data Compression (5.2 Kraft inequality, 5.6 Huffman codes),
Ch 12 Maximum Entropy,
Ch 14 Kolmogorov Complexity,
Ch 28 Occam's Razor & Minimum description length (MDL)

[Pishro-Nik'14] Introduction to Probability, Statistics, and Random Processes, Online book, 2014: Section 5.1.3 is a nice refresher on
conditioning and independence

[Schneider'13] Information Theory Primer, With an Appendix on Logarithms. 2013: nice refresher on logarithms

[Moser'18] Information Theory (Lecture Notes), 6th ed, 2018.
Ch 1 Shannon's measure of information,
Ch 3.1 Relative entropy

[Olah'15] Visual Information Theory (blog entry). 2015

[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 2 Probability Entropy and Inference,
Ch 4 The Source Coding Theorem

[MIT 6.004] Chris Terman. L01: Basics of Information (lecture notes and videos), 6.004 Computation Structures, 2015-2017.

[3Blue1Brown Bayes] Bayes theorem, the geometry of changing beliefs (YouTube). 2019

[Godoy'18] Understanding binary cross-entropy / log loss: a visual explanation (blog entry). 2018. (YouTube video)

[Murphy'12] Machine Learning: a Probabilistic Perspective, MIT press, 2012:
Ch 2 Probability (including information theory),
Ch 8 Logistic regression

[Murphy'22] Probabilistic Machine Learning: An Introduction, MIT press, 2022:
Ch 2.4 Bernoulli and binomial distributions (including Sigmoid),
Ch 2.5 Categorical and multinomial distributions (including Softmax),
Ch 6 Information Theory,
Ch 10 Logistic regression

[Casella,Berger'24] Statistical inference (2nd ed), CRC press, 2024:
Ch 6 Principles of data reduction,
Ch 6.2.1 Sufficient statistics

[Fithian'24] Statistics 210a: Theoretical Statistics (Lecture 4 sufficiency), Berkeley, 2024.

[Yeung'08] Information Theory and Network Coding. 2008:
Ch 2.6 The basic inequalities,
Ch 2.7 Some Useful Information Inequalities,
Ch 3.5 Information Diagrams,
Ch 13 Information inequalities,
Ch 14 Shannon-type inequalities,
Ch 15 Beyond Shannon-type inequalities

[Suciu'23] Applications of Information Inequalities to Database Theory Problems, LICS keynote 2023.
(Slides)

[Suciu'23b] CS294-248: Topics in Database Theory, Berkeley 2023 (slides and videos):
Unit 5a Information Inequalities,
Unit 5b Database Constraints

Python notebook: entropies
Part 2: The axiomatic approach (deriving formulations from first principles)

[Klir'06] Uncertainty and Information: Foundations of Generalized Information Theory, Wiley, 2006.
Ch 3.2.1 Simple Derivation of the Shannon Entropy,
Ch 3.2.2 Uniqueness of the Shannon Entropy,
Ch 9.3.1 Principle of Maximum Entropy

[Cox'46] Probability, frequency and reasonable expectation, American Journal of Physics, 1946.

[Shannon'48] A Mathematical Theory of Communication, The Bell System Technical Journal, 1948.

[Jaynes'03] Probability theory: the logic of science, Cambridge University Press, 2003. Ch 11 Discrete prior probabilities: the entropy principle. (read the reviews on Amazon)

[Van Horn'03] Constructing a logic of plausible inference: a guide to Cox's theorem, International Journal of Approximate Reasoning, 2003.

[Fleming,Wallace'86] How not to lie with statistics: the correct way to summarize benchmark results. CACM 1986.

[Kleinberg'02] An Impossibility Theorem for Clustering. NeurIPS 2002.

[wikipedia Probability axioms] Probability axioms.

[wikipedia Cox's theorem] Cox's theorem.
Part 3: Selected Applications to data management, machine learning and information retrieval

Compression

[Cover,Thomas'06] Elements of Information Theory. 2nd ed, 2006:
Ch 5.9 Shannon–Fano–Elias Coding,
Ch 13.3 Arithmetic Coding,
Ch 5.11 Generation of Discrete Distributions from Fair Coins,
Ch 13.4 Lempel–Ziv Coding,
Ch 13.5 Optimality of Lempel–Ziv Algorithms

[Moser'18] Information Theory (Lecture Notes), 6th ed, 2018.
Ch 4.8.2 Shannon code,
Ch 4.8.3 Fano code,
Ch 4.9 Huffman code,
Ch 5.3 Arithmetic coding,
Ch 7.3 Lempel–Ziv

[LNTwww] Compression According to Lempel, Ziv and Welch. 2023:
Example 3

[MacKay'03] Information Theory, Inference, and Learning Algorithms. 2003:
Ch 6 Stream codes (Arithmetic Coding, Lempel–Ziv)

[wiki arithmetic coding] Wikipedia, Arithmetic coding

Information Gain in Decision trees

[Tan+'18] Tan, Steinbach, Karpatne, Kumar. Introduction to Data Mining, 2nd ed, 2018.
Ch 3 Classification with Decision trees (freely available chapter)

[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009.
Ch 9.2 Tree-based methods

[Mitchell'97] Machine Learning, McGraw-Hill, 1997.
(Online PDF posted by Tom Mitchell).
Ch 3.6.2 Occam's Razor

[wiki information gain] Wikipedia, Information gain in decision trees

[Hyafil,Rivest'76] Constructing optimal binary decision trees is NP-complete, Information Processing Letters, 1976.

[Russell,Norvig'20] Artificial Intelligence: A Modern Approach, 4th ed, 2020. Ch 19.3 Learning Decision Trees

Logistic regression, softmax, maximum entropy, cross-entropy

[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009.
Ch 4.4 Logistic regression

[Hui,Belkin'21] Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks, ICLR 2021.

[Mount'11] The equivalence of logistic regression and maximum entropy models

[wiki Softmax] Wikipedia. Softmax function

[Cramer'03] The origins and development of the logit model, 2003 (extended book chapter)

Python notebook: logistic regression

[Zombori+'23] Zombori, Rissaki, Szabo, Gatterbauer, Benedikt. Towards Unbiased Exploration in Partial Label Learning. 2023

Bradley-Terry model, Luce's choice axiom, Item Response Theory

Minimum Description Length (MDL), Kolmogorov Complexity

Information bottleneck theory

Machine Learning: VC dimensions

[Shalev-Shwartz,Ben-David'14] Understanding machine learning: from theory to algorithms, Cambridge University Press, 2014. Ch 6 The VC-dimension

[Mohri+'18] Mohri, Rostamizadeh, Talwalkar. Foundations of Machine Learning. MIT press, 2018. Ch 3 Rademacher Complexity and VC-Dimension

[Hastie+'09] Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, Springer, 2nd ed, 2009. Ch 7.9 VC dimension

[Piantadosi'18] One parameter is always enough. 2018

[Boue'19] Real numbers, data science and chaos: How to fit any dataset with a single parameter, 2019

[wiki VC dimension] Wikipedia, Vapnik–Chervonenkis dimension.

Data Management: Normal forms

Data Management: Approximate acyclic schemas

Data Management: Information inequalities and cardinality estimation

Data Management: Approximate functional dependencies

Data Management: Explanation tables

[El Gebaly+'18] El Gebaly, Feng, Golab, Korn, Srivastava. Explanation Tables. IEEE Debull 2018

Information Retrieval: Inverse Document Frequency
Research best practice