A Bit of Information Theory • Unsupervised Learning Working Group • Assaf Oron, Oct. 15, 2003 • Based mostly upon: Cover & Thomas, “Elements of Information Theory”, 1991
Contents • Coding and Transmitting Information • Entropy etc. • Information Theory and Statistics • Information Theory and “Machine Learning”
What is Coding? (1) • We are coding all the time • Crucial requirement for coding: the “source” and the “receiver” agree on the key • Modern coding: telegraph -> radio -> … • Practical problem: how efficient can we make it? Tackled from the 1920s on • 1940s: Claude Shannon
What is Coding? (2) • Shannon’s greatness: solving the “specific” problem by working on the “general” problem • Namely: how does one quantify information, its coding, and its transmission? • For ANY type of information
Information Complexity of Some Coded Messages • Let’s think about written numbers: • k digits → 10^k possible messages • How about written English? • k letters → 26^k possible messages • k words → D^k possible messages, where D is the English dictionary size • ∴ Length ~ log(complexity)
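A minimal Python sketch of the “length ~ log(complexity)” relationship (my own illustration, not from the slides): an alphabet of size A can encode A^k distinct messages with k symbols, so the length needed for N possible messages grows like log(N).

```python
def length_needed(n_messages: int, alphabet_size: int) -> int:
    """Smallest k such that alphabet_size ** k >= n_messages."""
    k = 0
    while alphabet_size ** k < n_messages:
        k += 1
    return k

print(length_needed(1_000_000, 10))  # 6 decimal digits
print(length_needed(1_000_000, 2))   # 20 bits   (2**20 = 1,048,576)
print(length_needed(1_000_000, 26))  # 5 letters (26**5 = 11,881,376)
```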
Information Entropy • The expected length (in bits) of a binary message conveying x-type information • Other common descriptions: “code complexity”, “uncertainty”, “missing/required information”, “expected surprise”, “information content” (BAD), etc.
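A minimal Python sketch of the definition behind this slide, H(X) = −Σ p(x) log₂ p(x) in bits (the formula is the standard Cover & Thomas one; the example values are mine):

```python
import math

def entropy(probs):
    """H(X) in bits; terms with p = 0 contribute nothing (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin toss
print(entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin carries less "surprise"
print(entropy([0.25] * 4))   # 2.0 bits: a fair four-sided die
```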
Why “Entropy”? • Thermodynamics (mid-19th century): “amount of unusable heat in a system” • Statistical physics (late 19th century): “log(complexity of the current system state)” • ⇉ the amount of “mess” in the system • The two were proven to be equivalent • Statistical entropy is proportional to information entropy when p(x) is uniform • 2nd Law of Thermodynamics… • Entropy never decreases (more later)
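For the uniform case specifically (a standard gloss, not spelled out on the slide): if the system has $\Omega$ equally likely states, then

$$ H = \log_2 \Omega, \qquad S = k_B \ln \Omega = (k_B \ln 2)\, H, $$

so Boltzmann’s statistical entropy is just the information entropy times a fixed constant.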
Kullback-Leibler Divergence (“Relative Entropy”) • In words: “the excess message length incurred by using a q(x)-optimized code for messages that actually follow p(x)” • Properties, relation to H:
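A minimal Python sketch of the standard definition D(p‖q) = Σ p(x) log₂[p(x)/q(x)] and its “excess code length” reading (the numerical example is my own):

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits: expected excess code length when the code is
    optimized for q but the symbols actually follow p."""
    return sum(
        pi * math.log2(pi / qi) if qi > 0 else math.inf
        for pi, qi in zip(p, q)
        if pi > 0
    )

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))   # ~0.74 bits of excess length per symbol
print(kl_divergence(q, p))   # ~0.53 bits: D is not symmetric
print(kl_divergence(p, p))   # 0.0: no penalty when the code matches the source

# Relation to H: for the uniform distribution u over K symbols,
#   H(p) = log2(K) - kl_divergence(p, u),  and D >= 0 always.
```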
Mutual Information • Relationship to D and H (hint: conditional probability): • Properties, examples:
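A minimal Python sketch (my own toy example) using the standard identity I(X;Y) = Σ p(x,y) log₂[p(x,y)/(p(x)p(y))] = D(p(x,y) ‖ p(x)p(y)) = H(X) − H(X|Y):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Perfectly dependent: knowing X determines Y, so I = H(X) = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
# Independent: the joint factors as p(x)p(y), so I = 0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```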
Entropy for Continuous RVs • “Little” h, defined in the “natural” way (the sum becomes an integral) • However, it is not the same measure: • h of a discrete RV is −∞, and H of a continuous RV is infinite (measure theory…) • For many continuous distributions, h is ½·log(variance) plus some constant • Why?
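The standard definition and the scaling fact behind the “log(variance)” bullet (standard results from Cover & Thomas; my summary, not the slide’s own formulas):

$$ h(X) = -\int f(x)\,\log f(x)\,dx, \qquad h(aX) = h(X) + \log|a|, $$

so within any location-scale family $h = \tfrac{1}{2}\log(\mathrm{variance}) + \text{const}$; for example, $X \sim N(\mu,\sigma^2)$ gives $h(X) = \tfrac{1}{2}\log\!\big(2\pi e\,\sigma^2\big)$.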
The Statistical Connection (1) • K-L D ⇔ likelihood ratio • The law of large numbers can be rephrased as a limit on D • Among distributions with the same variance, the normal is the one with maximum h • (2nd law of thermodynamics revisited) • h is an average quantity. Is the CLT, then, a “law of nature”?… (I think: “YES”!)
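In symbols (standard results, stated here as my summary): for data $X_1,\dots,X_n$ i.i.d. from $p$,

$$ D(p\|q) = E_p\!\left[\log\frac{p(X)}{q(X)}\right], \qquad \frac{1}{n}\sum_{i=1}^{n}\log\frac{p(X_i)}{q(X_i)} \;\longrightarrow\; D(p\|q) \quad\text{(LLN)}, $$

i.e. the average log-likelihood ratio converges to the K-L divergence; and among all densities with variance $\sigma^2$, $h(X) \le \tfrac{1}{2}\log(2\pi e\,\sigma^2)$, with equality only for the normal.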
The Statistical Connection (2) • Mutual information is very useful • Certainly for discrete RVs • Also for continuous ones (no distributional assumptions!) • A lot of implications for stochastic processes, as well • I just don’t quite understand them • English?
Machine Learning? (1) • So far, we haven’t mentioned noise • In information theory, noise lives in the channel • Channel capacity: max(mutual information) between the “source” and the “receiver” • Noise directly decreases the capacity • Shannon’s “biggest” result: this capacity can be (almost) achieved with (almost) zero error • Known as the “Channel Coding Theorem”
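A minimal Python sketch for the binary symmetric channel, whose capacity is the standard C = 1 − H(p) bits per use (the loop values are my own illustration of how noise eats capacity):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    """Capacity of a binary symmetric channel: C = 1 - H(flip_prob) bits/use."""
    return 1.0 - binary_entropy(flip_prob)

for p in (0.0, 0.01, 0.1, 0.5):
    print(p, bsc_capacity(p))
# More noise, less capacity; at flip_prob = 0.5 the channel carries nothing.
```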
Machine Learning? (2) • The CCT inspired practical developments • Now it all depends on code and channel! • Smarter, “error-correcting” codes • Tech developments focus on channel capacity
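A toy Python sketch (mine, not from the slides) of the very simplest error-correcting code, a 3x repetition code with majority-vote decoding; real codes are far smarter, but the error-reduction idea is the same:

```python
import random

def encode_repetition(bits, n=3):
    """Repeat each bit n times (the crudest error-correcting code)."""
    return [b for b in bits for _ in range(n)]

def decode_repetition(received, n=3):
    """Majority vote within each block of n repeated bits."""
    return [int(sum(received[i:i + n]) > n / 2) for i in range(0, len(received), n)]

def noisy_channel(bits, flip_prob, rng):
    """Flip each bit independently with probability flip_prob."""
    return [b ^ (rng.random() < flip_prob) for b in bits]

rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(10_000)]
received = noisy_channel(encode_repetition(message), flip_prob=0.1, rng=rng)
decoded = decode_repetition(received)
error_rate = sum(m != d for m, d in zip(message, decoded)) / len(message)
print(error_rate)  # roughly 0.028 (= 3*0.1**2*0.9 + 0.1**3), vs. 0.1 with no coding
```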
Machine Learning? (3) • Can you find an analogy between coding and classification/clustering? (Can it be useful??)
Machine Learning? (4) • Information theory tells us that: • We CAN find a nearly optimal classification or clustering rule (“coding”) • We CAN find a nearly optimal parameterization + classification combo • Perhaps the newer wave of successful but statistically “intractable” methods (boosting, etc.) works by increasing channel capacity (i.e., high-dimensional parameterization)?