
A Bit of Information Theory

A Bit of Information Theory. Unsupervised Learning Working Group Assaf Oron, Oct. 15 2003. Based mostly upon: Cover & Thomas, “Elements of Inf. Theory”, 1991. Contents. Coding and Transmitting Information Entropy etc. Information Theory and Statistics


Presentation Transcript


  1. A Bit of Information Theory Unsupervised Learning Working Group Assaf Oron, Oct. 15 2003 Based mostly upon: Cover & Thomas, “Elements of Inf. Theory”, 1991

  2. Contents • Coding and Transmitting Information • Entropy etc. • Information Theory and Statistics • Information Theory and “Machine Learning”

  3. What is Coding? (1) • We keep coding all the time • Crucial requirement for coding: “source” and “receiver” agree on the key. • Modern coding: telegraph->radio->… • Practical problems: How efficient can we make it? Tackled from the 1920’s on. • 1940’s: Claude Shannon

  4. What is Coding? (2) • Shannon’s greatness: finding a solution of the “specific” problem, by working on the “general” problem. • Namely: how does one quantify information, its coding and its transmission? • ANY type of information

  5. Some Day-to-Day Codes

  6. Information Complexity of Some Coded Messages • Let’s think written numbers: • k digits → 10^k possible messages • How about written English? • k letters → 26^k possible messages • k words → D^k possible messages, where D is English dictionary size ∴ Length ~ log(complexity)
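The "Length ~ log(complexity)" relation on this slide can be checked numerically. The following is a minimal sketch, assuming all messages of a given length are equally likely; the function name `message_length_bits` is introduced here for illustration and is not from the slides.

```python
import math


def message_length_bits(alphabet_size: int, k: int) -> float:
    """Bits needed to index one of alphabet_size**k equally likely messages.

    log2(alphabet_size**k) = k * log2(alphabet_size), so the length grows
    linearly in k while the number of messages grows exponentially.
    """
    return k * math.log2(alphabet_size)


# Assumed alphabets: 10 decimal digits and 26 English letters, as on the slide.
for name, size in [("digits", 10), ("letters", 26)]:
    for k in (1, 5, 10):
        count = size ** k
        bits = message_length_bits(size, k)
        print(f"{k:2d} {name}: {count} possible messages, {bits:.2f} bits")
```

Doubling k doubles the number of bits but squares the number of possible messages, which is the sense in which length scales like the logarithm of complexity.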

  7. Information Entropy • The expected length (bits) of a binary message conveying x-type information • other common descriptions: “code complexity”, “uncertainty”, “missing/required information”, “expected surprise”, “information content” (BAD), etc.
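The transcript does not preserve the formula from this slide. As a hedged illustration, the sketch below uses the standard Shannon definition H(X) = -Σ p(x) log2 p(x) (as in Cover & Thomas); the function name `entropy_bits` and the example distributions are assumptions made for this example.

```python
import math


def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)) in bits.

    Terms with p == 0 are skipped, following the convention 0 * log(0) = 0.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Illustrative distributions (not from the slides).
fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6

for name, dist in [("fair coin", fair_coin),
                   ("biased coin", biased_coin),
                   ("fair die", fair_die)]:
    print(f"{name}: {entropy_bits(dist):.3f} bits")
```

The biased coin needs fewer bits per toss on average than the fair one, which matches the "expected surprise" reading of entropy.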

  8. Why “Entropy”? • Thermodynamics (mid 19th): “amount of un-usable heat in system” • Statistical Physics (end 19th): “log (complexity of current system state)” • ⇉ amount of “mess” in the system • The two were proven to be equivalent • Statistical entropy is proportional to information entropy if p(x) is uniform • 2nd Law of Thermodynamics… • Entropy never decreases (more later)

  9. Entropy Properties, Examples

  10. Kullback-Leibler Divergence (“Relative Entropy”) • In words: “the excess message length needed to use p(x)-optimized code for messages based on q(x)” • Properties, Relation to H:
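The slide's formula is missing from the transcript, so the sketch below falls back on the standard definition D(a ‖ b) = Σ a(x) log2(a(x)/b(x)). To avoid guessing the slide's notation, the arguments are named by role: `source` generates the messages and `assumed` is the distribution the code was optimized for. These names are assumptions of this example.

```python
import math


def kl_divergence_bits(source, assumed):
    """D(source || assumed) in bits.

    Interpreted as the expected number of extra bits per symbol when symbols
    drawn from `source` are encoded with a code optimized for `assumed`.
    """
    total = 0.0
    for s, a in zip(source, assumed):
        if s == 0:
            continue  # convention: 0 * log(0 / a) = 0
        if a == 0:
            return math.inf  # source emits a symbol the code cannot represent
        total += s * math.log2(s / a)
    return total


# Illustrative distributions (not from the slides).
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(f"fair source, biased code: {kl_divergence_bits(fair, biased):.3f} bits")
print(f"biased source, fair code: {kl_divergence_bits(biased, fair):.3f} bits")
print(f"matched code:             {kl_divergence_bits(fair, fair):.3f} bits")
```

The two mismatched cases give different values, a reminder that the divergence is not symmetric and therefore not a distance in the metric sense.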

  11. Mutual Information • Relationship to D, H (hint: cond. prob.): • Properties, Examples:
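The relationships on this slide did not survive in the transcript. A minimal sketch, assuming the standard definition I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))], which is the K-L divergence between the joint distribution and the product of its marginals. The joint tables and the function name are illustrative assumptions.

```python
import math


def mutual_information_bits(joint):
    """I(X;Y) in bits for a joint probability table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


# Illustrative joint distributions over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]   # knowing X says nothing about Y
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X
noisy_copy = [[0.4, 0.1], [0.1, 0.4]]        # Y equals X 80% of the time

for name, table in [("independent", independent),
                    ("identical", identical),
                    ("noisy copy", noisy_copy)]:
    print(f"{name}: {mutual_information_bits(table):.3f} bits")
```

For the identical case the result equals H(X), consistent with I(X;Y) = H(X) - H(X|Y) when H(X|Y) = 0.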

  12. Entropy for Continuous RV’s • “Little” h, defined in the “natural” way • However it is not the same measure: • h of discrete RV’s is always 0, and H of continuous RV’s is infinite (measure theory…) • For many continuous distributions, h is ½ log (variance) plus some constant • Why?
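One way to answer the "Why?" on this slide is a scaling argument. The derivation below is a sketch using the standard definition of differential entropy; the symbols Z, σ and μ are introduced here and are not taken from the slides.

```latex
h(X) = -\int f(x)\,\log f(x)\,dx, \qquad
h(aX + b) = h(X) + \log|a| \quad (a \neq 0).

\text{Writing } X = \sigma Z + \mu \text{ with } \operatorname{Var}(Z) = 1:\quad
h(X) = h(Z) + \log\sigma = \tfrac{1}{2}\log\sigma^{2} + h(Z).

\text{Normal case: } h(X) = \tfrac{1}{2}\log\!\left(2\pi e\,\sigma^{2}\right).
```

So within any location-scale family, h depends on the variance only through the term ½ log σ², and the constant h(Z) is determined by the shape of the family.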

  13. The Statistical Connection (1) • K-L D ⇔ Likelihood Ratio • Law of large numbers can be rephrased as a limit on D • For dist.’s with same variance, normal is the one with maximum h. • (2nd law of thermodynamics revisited) • h is an average quantity. Is the CLT, then, a “law of nature”?… (I think: “YES”!)
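The maximum-entropy claim for the normal distribution can be illustrated with closed-form differential entropies. This is a sketch, not a proof: the uniform and Laplace distributions are comparison families chosen for this example, and all three are matched to the same standard deviation `sigma`.

```python
import math


def normal_entropy_nats(sigma: float) -> float:
    """h of N(mu, sigma^2): 0.5 * ln(2 * pi * e * sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)


def uniform_entropy_nats(sigma: float) -> float:
    """h of a uniform distribution with variance sigma^2.

    A uniform on an interval of width w has variance w^2 / 12 and h = ln(w).
    """
    width = sigma * math.sqrt(12)
    return math.log(width)


def laplace_entropy_nats(sigma: float) -> float:
    """h of a Laplace distribution with variance sigma^2.

    A Laplace with scale b has variance 2 * b^2 and h = 1 + ln(2 * b).
    """
    scale = sigma / math.sqrt(2)
    return 1 + math.log(2 * scale)


sigma = 1.0
for name, fn in [("normal", normal_entropy_nats),
                 ("Laplace", laplace_entropy_nats),
                 ("uniform", uniform_entropy_nats)]:
    print(f"{name}: h = {fn(sigma):.4f} nats")
```

Changing `sigma` shifts all three values by the same amount, ln(sigma), so the ordering is the same at every variance.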

  14. The Statistical Connection (2) • Mutual information is very useful • Certainly for discrete RV’s • Also for continuous (no dist. assumptions!) • A lot of implications for stochastic processes, as well • I just don’t quite understand them • English?

  15. Machine Learning? (1) • So far, we haven’t mentioned noise • In inf. theory, noise exists in the channel • Channel capacity: max(mutual information) between “source”, “receiver” • Noise directly decreases the capacity • Shannon’s “Biggest” result: this capacity can be (almost) achieved with (almost) zero error • Known as the “Channel Coding Theorem”
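To see how noise lowers capacity, consider the binary symmetric channel, a textbook example that is not mentioned in the slides: each transmitted bit is flipped independently with probability `flip_prob`. For this channel, maximizing the mutual information over input distributions gives the known closed form C = 1 - H_b(flip_prob), which the sketch below evaluates.

```python
import math


def binary_entropy_bits(p: float) -> float:
    """Entropy of a Bernoulli(p) variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity_bits(flip_prob: float) -> float:
    """Capacity of a binary symmetric channel, in bits per channel use."""
    return 1.0 - binary_entropy_bits(flip_prob)


# Illustrative noise levels.
for flip_prob in (0.0, 0.01, 0.1, 0.25, 0.5):
    print(f"flip prob {flip_prob:.2f}: capacity {bsc_capacity_bits(flip_prob):.3f} bits/use")
```

At a flip probability of 0.5 the output is independent of the input and the capacity drops to zero.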

  16. Machine Learning? (2) • The CCT inspired practical developments • Now it all depends on code and channel! • Smarter, “error-correcting” codes • Tech developments focus on channel capacity
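As a contrast to the "smarter" codes this slide refers to, here is the simplest error-correcting scheme: a repetition code with majority-vote decoding. This example is an assumption added for illustration and is not from the slides. It computes the exact decoding error on a binary symmetric channel with bit-flip probability `flip_prob`.

```python
from math import comb


def repetition_error_prob(n: int, flip_prob: float) -> float:
    """Probability that majority vote over n copies decodes the wrong bit.

    Decoding fails when more than half of the n transmitted copies are flipped.
    n must be odd so that the vote cannot tie.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    return sum(
        comb(n, k) * flip_prob ** k * (1 - flip_prob) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


flip_prob = 0.1  # illustrative noise level
for n in (1, 3, 5, 11):
    err = repetition_error_prob(n, flip_prob)
    print(f"n = {n:2d}: rate = 1/{n}, decoding error = {err:.6f}")
```

The error falls as n grows, but the rate 1/n falls toward zero as well. The Channel Coding Theorem says that better codes can drive the error toward zero while keeping the rate close to capacity.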

  17. Machine Learning? (3) • Can you find analogy between coding and classification/clustering? (can it be useful??)

  18. Machine Learning? (4) • Inf. theory tells us that: • We CAN find a nearly optimal classification or clustering rule (“coding”) • We CAN find a nearly optimal parameterization+classification combo • Perhaps the newer wave of successful, but statistically “intractable” methods (boosting etc.) works by increasing channel capacity (i.e., high-dim parameterization)?
