A Bit of Information Theory
Unsupervised Learning Working Group
Assaf Oron, Oct. 15, 2003
Based mostly upon: Cover & Thomas, “Elements of Information Theory”, 1991
Contents • Coding and Transmitting Information • Entropy etc. • Information Theory and Statistics • Information Theory and “Machine Learning”
What is Coding? (1) • We keep coding all the time • Crucial requirement for coding: “source” and “receiver” agree on the key. • Modern coding: telegraph → radio → … • Practical problem: how efficient can we make it? Tackled from the 1920s on. • 1940s: Claude Shannon
What is Coding? (2) • Shannon’s greatness: finding a solution to the “specific” problem by working on the “general” problem. • Namely: how does one quantify information, its coding, and its transmission? • For ANY type of information
Information Complexity of Some Coded Messages • Let’s think about written numbers: • k digits → 10^k possible messages • How about written English? • k letters → 26^k possible messages • k words → D^k possible messages, where D is the size of the English dictionary ∴ Length ~ log(complexity)
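A minimal numerical sketch of “length ~ log(complexity)” (the Python below, including the name num_messages, is my illustration, not part of the talk):

import math

# The number of distinct messages of length k over an alphabet of size D is D**k,
# so the length needed grows like the log of the number of possible messages.
def num_messages(alphabet_size, k):
    return alphabet_size ** k

for name, D in [("decimal digits", 10), ("English letters", 26)]:
    k = 5                                # message length, chosen arbitrarily
    n = num_messages(D, k)
    print(name, n, math.log(n, D))       # the log (base D) recovers k = 5

In base 2, the same count, log2(n), gives the length in bits.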
Information Entropy • The expected length (in bits) of a binary message conveying the value of X • Other common descriptions: “code complexity”, “uncertainty”, “missing/required information”, “expected surprise”, “information content” (BAD), etc.
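For concreteness, the textbook definition behind all of these descriptions (Cover & Thomas, Ch. 2):

H(X) = -\sum_x p(x) \log_2 p(x)  [bits]

For example, a fair coin has H = 1 bit, and a fair six-sided die has H = \log_2 6 \approx 2.58 bits.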
Why “Entropy”? • Thermodynamics (mid-19th century): “amount of un-usable heat in the system” • Statistical physics (late 19th century): “log(complexity of the current system state)” • ⇒ the amount of “mess” in the system • The two were proven to be equivalent • Statistical entropy is proportional to information entropy when p(x) is uniform • 2nd Law of Thermodynamics… • Entropy never decreases (more later)
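Spelling out the proportionality claim: for a uniform distribution over W equally likely states, H = \log_2 W, while Boltzmann’s statistical entropy is S = k_B \ln W, so S = (k_B \ln 2)\, H.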
Kullback-Leibler Divergence (“Relative Entropy”) • In words: “the excess message length incurred by using a p(x)-optimized code for messages that actually follow q(x)” • Properties, relation to H:
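Written out to match the wording above (messages follow q(x), but the code is optimized for p(x)):

D(q \| p) = \sum_x q(x) \log_2 \frac{q(x)}{p(x)} \;\ge\; 0, with equality iff q = p.

Relation to H: the expected length under the mismatched code is the cross-entropy H(q) + D(q \| p); in particular, D(p \| \text{uniform}) = \log_2 |\mathcal{X}| - H(p). Note that D is not symmetric and is not a metric.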
Mutual Information • Relationship to D, H (hint: conditional probability): • Properties, examples:
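Making the hint explicit:

I(X;Y) = D\big(p(x,y) \,\|\, p(x)p(y)\big) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y)

Properties: I(X;Y) \ge 0, with equality iff X and Y are independent; and I(X;X) = H(X), so entropy is “self-information”.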
Entropy for Continuous RVs • “Little” h, defined in the “natural” way • However, it is not the same measure: • h of a discrete RV degenerates (it diverges to −∞), and H of a continuous RV is infinite (measure theory…) • For many continuous distributions, h is ½·log(variance) plus some constant • Why?
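The “natural” definition, and the answer to “Why?” in the normal case:

h(X) = -\int f(x) \log f(x)\, dx

For X \sim N(\mu, \sigma^2), h(X) = \tfrac{1}{2}\log(2\pi e \sigma^2), i.e. half the log-variance plus a constant; other location–scale families behave similarly.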
The Statistical Connection (1) • K-L divergence ⇔ likelihood ratio • The law of large numbers can be rephrased as a limit involving D • For distributions with the same variance, the normal is the one with maximum h • (2nd law of thermodynamics revisited) • h is an average quantity. Is the CLT, then, a “law of nature”?… (I think: “YES”!)
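The first two bullets in symbols: for i.i.d. observations X_1, \dots, X_n from p, the average log-likelihood ratio converges, by the LLN, to the divergence:

\frac{1}{n}\sum_{i=1}^n \log \frac{p(X_i)}{q(X_i)} \;\to\; E_p\!\left[\log \frac{p(X)}{q(X)}\right] = D(p \| q)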
The Statistical Connection (2) • Mutual information is very useful • Certainly for discrete RV’s • Also for continuous (no dist. assumptions!) • A lot of implications for stochastic processes, as well • I just don’t quite understand them • English?
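A rough plug-in sketch of estimating I for discrete data from a table of counts (the NumPy code and the name mutual_information_bits are mine, for illustration only):

import numpy as np

def mutual_information_bits(counts):
    """Plug-in estimate of I(X;Y), in bits, from a joint table of counts."""
    p_xy = counts / counts.sum()             # joint empirical distribution
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X (column vector)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y (row vector)
    indep = p_x @ p_y                        # the joint under independence
    nz = p_xy > 0                            # skip 0 * log(0) terms
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / indep[nz])))

# A strongly dependent pair (about 1 bit) vs. an independent one (0 bits):
print(mutual_information_bits(np.array([[50, 0], [0, 50]])))
print(mutual_information_bits(np.array([[25, 25], [25, 25]])))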
Machine Learning? (1) • So far, we haven’t mentioned noise • In information theory, noise lives in the channel • Channel capacity: max (mutual information) between “source” and “receiver” • Noise directly decreases the capacity • Shannon’s “biggest” result: the capacity can be (almost) achieved with (almost) zero error • Known as the “Channel Coding Theorem”
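In symbols, C = \max_{p(x)} I(X;Y). For the binary symmetric channel that flips each bit with probability p, C = 1 - H(p) bits per use, where H(p) = -p\log_2 p - (1-p)\log_2(1-p); so any noise (p > 0) directly eats into capacity.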
Machine Learning? (2) • The CCT inspired practical developments • Now it all depends on code and channel! • Smarter, “error-correcting” codes • Tech developments focus on channel capacity
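A toy illustration of the error-correcting idea (a 3-fold repetition code; entirely my sketch, and far from the capacity-approaching codes the slide alludes to):

import random

def transmit(bit, flip_prob):
    """One use of a binary symmetric channel: flip the bit with probability flip_prob."""
    return bit ^ (random.random() < flip_prob)

def send_repeated(bit, flip_prob, n=3):
    """Encode by repeating the bit n times, decode by majority vote."""
    received = [transmit(bit, flip_prob) for _ in range(n)]
    return int(sum(received) > n // 2)

random.seed(0)
trials = 10_000
raw = sum(transmit(1, 0.1) != 1 for _ in range(trials)) / trials
coded = sum(send_repeated(1, 0.1) != 1 for _ in range(trials)) / trials
print(raw, coded)   # the error rate drops from about 0.10 to about 0.03

The price here is a three-fold drop in rate; the Channel Coding Theorem says far better trade-offs exist near capacity.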
Machine Learning? (3) • Can you find an analogy between coding and classification/clustering? (can it be useful??)
Machine Learning? (4) • Inf. theory tells us that: • We CAN find a nearly optimal classification or clustering rule (“coding”) • We CAN find a nearly optimal parameterization + classification combo • Perhaps the newer wave of successful but statistically “intractable” methods (boosting, etc.) works by increasing channel capacity (i.e., high-dimensional parameterization)?