2. Mathematical Foundations Foundations of Statistical Natural Language Processing 2001. 7. 10. Artificial Intelligence Lab, 성경희
Contents – Part 1 1. Elementary Probability Theory • Conditional probability • Bayes’ theorem • Random variable • Joint and conditional distributions • Standard distributions
Conditional probability (1/2) • P(A) : the probability of the event A • Ex1> A coin is tossed 3 times. Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, A = {HHT, HTH, THH} : exactly 2 heads, P(A) = 3/8, B = {HHH, HHT, HTH, HTT} : first toss is a head, P(B) = 1/2 • Conditional probability: P(A|B) = P(A ∩ B) / P(B) = (2/8) / (1/2) = 1/2
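As an illustration (not part of the original slide), a minimal Python sketch that enumerates the sample space of Ex1 and recovers P(A|B):

    # Enumerate the sample space of three coin tosses and check P(A|B) = P(A and B) / P(B).
    from itertools import product

    omega = list(product("HT", repeat=3))          # 8 equally likely outcomes
    A = {w for w in omega if w.count("H") == 2}    # exactly two heads
    B = {w for w in omega if w[0] == "H"}          # first toss is a head

    p = lambda E: len(E) / len(omega)
    print(p(A), p(B), p(A & B) / p(B))             # 0.375 0.5 0.5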
Conditional probability (2/2) • Multiplication rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A) • Chain rule: P(A1 ∩ … ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) … P(An|A1 ∩ … ∩ An-1) • Two events A, B are independent if P(A ∩ B) = P(A) P(B)
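A small numeric check of the multiplication rule and of independence, reusing the coin-toss space of Ex1 (the event C, "third toss is a head", is an added illustration):

    # Multiplication rule P(A ∩ B) = P(B) P(A|B), and an independence check P(B ∩ C) = P(B) P(C).
    from itertools import product

    omega = list(product("HT", repeat=3))
    p = lambda E: len(E) / len(omega)

    A = {w for w in omega if w.count("H") == 2}   # exactly two heads
    B = {w for w in omega if w[0] == "H"}         # first toss is a head
    C = {w for w in omega if w[2] == "H"}         # third toss is a head (illustrative extra event)

    pA_given_B = len(A & B) / len(B)
    print(p(A & B), p(B) * pA_given_B)            # both 0.25: the multiplication rule
    print(p(B & C) == p(B) * p(C))                # True: B and C are independent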
Bayes’ theorem (1/2) • Generally, if A ⊆ ∪i Bi and the Bi are disjoint, then P(A) = Σi P(A|Bi) P(Bi) • Bayes’ theorem: P(B|A) = P(A|B) P(B) / P(A); with the Bi as above, P(Bj|A) = P(A|Bj) P(Bj) / Σi P(A|Bi) P(Bi)
Bayes’ theorem (2/2) • Ex2> G : the event of a sentence having a parasitic gap, T : the event of the test being positive • Even when the test is positive, P(G|T) remains small: this poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
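A hedged sketch of the computation, assuming the illustrative figures commonly used for this example (a prior of 1 in 100,000, a test that answers yes for 95% of sentences with a parasitic gap and for 0.5% of sentences without one); the exact numbers on the original slide are not preserved here:

    # Bayes' theorem with assumed illustrative numbers:
    # P(G) = 1/100000, P(T|G) = 0.95, P(T|not G) = 0.005.
    p_g = 1e-5
    p_t_given_g = 0.95
    p_t_given_not_g = 0.005

    p_t = p_t_given_g * p_g + p_t_given_not_g * (1 - p_g)   # total probability of a positive test
    p_g_given_t = p_t_given_g * p_g / p_t                   # posterior probability
    print(round(p_g_given_t, 4))                             # about 0.0019: still very unlikely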
Random variable • Ex3> Random variable X for the sum of two dice: S = {2, …, 12} • Probability mass function (pmf): p(x) = P(X = x), written X ~ p(x) • Expectation: E[X] = Σx x p(x) • Variance: Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2 • If X : Ω → {0, 1}, then X is called an indicator random variable or a Bernoulli trial
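A short Python check of Ex3, enumerating the 36 equally likely dice outcomes:

    # Expectation and variance of X = sum of two fair dice (E[X] = 7, Var(X) = 35/6).
    from itertools import product

    outcomes = [a + b for a, b in product(range(1, 7), repeat=2)]
    pmf = {x: outcomes.count(x) / 36 for x in range(2, 13)}

    E = sum(x * p for x, p in pmf.items())
    Var = sum((x - E) ** 2 * p for x, p in pmf.items())
    print(E, Var)   # 7.0 and 5.833... (= 35/6)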
Joint and conditional distributions • The joint pmf for two discrete random variables X, Y: p(x, y) = P(X = x, Y = y) • Marginal pmfs, which total up the probability mass for the values of each variable separately: pX(x) = Σy p(x, y), pY(y) = Σx p(x, y) • Conditional pmf: pY|X(y|x) = p(x, y) / pX(x), defined for x such that pX(x) > 0
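A small illustration with made-up joint probabilities, showing how the marginal and conditional pmfs are obtained:

    # Marginal and conditional pmfs from a tiny joint pmf (values chosen for illustration only).
    joint = {("x1", "y1"): 0.2, ("x1", "y2"): 0.3,
             ("x2", "y1"): 0.1, ("x2", "y2"): 0.4}

    p_X, p_Y = {}, {}
    for (x, y), p in joint.items():
        p_X[x] = p_X.get(x, 0) + p        # sum out y
        p_Y[y] = p_Y.get(y, 0) + p        # sum out x

    # conditional pmf p(y|x) = p(x, y) / pX(x), defined where pX(x) > 0
    p_Y_given_X = {(x, y): p / p_X[x] for (x, y), p in joint.items() if p_X[x] > 0}
    print(p_X, p_Y, p_Y_given_X)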
Standard distributions (1/3) • Discrete distributions: the binomial distribution • Arises when one has a series of trials with only two outcomes, each trial being independent of all the others • The number r of successes out of n trials, given that the probability of success in any trial is p: b(r; n, p) = C(n, r) p^r (1 - p)^(n - r), where C(n, r) = n! / (r! (n - r)!) • Expectation: np, variance: np(1 - p)
Standard distributions (2/3) • Discrete distributions: The binomial distribution
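A sketch of the binomial pmf and its moments (n = 10 and p = 0.3 are arbitrary illustrative values):

    # Binomial pmf b(r; n, p) = C(n, r) p^r (1-p)^(n-r); mean np, variance np(1-p).
    from math import comb

    def binomial_pmf(r, n, p):
        return comb(n, r) * p**r * (1 - p) ** (n - r)

    n, p = 10, 0.3
    pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
    mean = sum(r * pr for r, pr in enumerate(pmf))
    var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
    print(round(mean, 6), round(var, 6))   # 3.0 and 2.1, i.e. np and np(1-p)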
Standard distributions (3/3) • Continuous distributions: the normal distribution • For mean μ and standard deviation σ, the probability density function (pdf) is n(x; μ, σ) = exp(-(x - μ)^2 / (2σ^2)) / (σ √(2π))
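A direct transcription of the pdf into Python (the values shown are for the standard normal):

    # Normal pdf n(x; mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)).
    from math import sqrt, pi, exp

    def normal_pdf(x, mu=0.0, sigma=1.0):
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    print(round(normal_pdf(0.0), 6))        # about 0.398942 for the standard normal at its mean
    print(round(normal_pdf(1.0, 0, 1), 6))  # about 0.241971 one standard deviation away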
Contents – Part 2 2. Essential Information Theory • Entropy • Joint entropy and conditional entropy • Mutual information • The noisy channel model • Relative entropy or Kullback-Leibler divergence
Shannon’s Information Theory • Maximizing the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line • Theoretical maximum for data compression: entropy H • Theoretical maximum for the transmission rate: channel capacity C
Entropy (1/4) • The entropy H (or self-information) is the average uncertainty of a single random variable X: H(X) = -Σx p(x) log2 p(x), where p(x) is the pmf of X • Entropy is a measure of uncertainty: the more we know about something, the lower the entropy will be • We can use entropy as a measure of the quality of our models • Entropy measures the amount of information in a random variable (measured in bits)
Entropy (2/4) • The entropy of a weighted coin: the horizontal axis shows the probability that the coin comes up heads, and the vertical axis shows the entropy of tossing the corresponding coin once
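A small sketch computing H for weighted coins, reproducing the shape of the curve described above (maximum of 1 bit at p = 0.5):

    # Entropy H(X) = -sum p(x) log2 p(x); for a weighted coin it peaks at 1 bit when p = 0.5.
    from math import log2

    def entropy(pmf):
        return -sum(p * log2(p) for p in pmf if p > 0)

    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(p, round(entropy([p, 1 - p]), 3))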
Entropy (3/4) • Ex7> The result of rolling an 8-sided die (uniform distribution): H(X) = -Σx (1/8) log2 (1/8) = 3 bits • Entropy is the average length of the message needed to transmit an outcome of that variable: in terms of the expectation E, H(X) = E[log2 (1/p(X))]
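The same computation for the 8-sided die:

    # The uniform 8-sided die: H(X) = -8 * (1/8) * log2(1/8) = 3 bits,
    # so an outcome can be transmitted with a fixed 3-bit code.
    from math import log2
    print(-sum((1/8) * log2(1/8) for _ in range(8)))   # 3.0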
Entropy (4/4) • Ex8> Simplified Polynesian • We can design a code that on average takes 2.5 bits to transmit a letter • Entropy can be interpreted as a measure of the size of the 'search space' consisting of the possible values of a random variable
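A sketch assuming the per-letter probabilities used in the textbook version of this example (t and a at 1/4; p, k, i, u at 1/8):

    # Simplified Polynesian, assuming P(t) = P(a) = 1/4 and P(p) = P(k) = P(i) = P(u) = 1/8.
    from math import log2
    pmf = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
    print(-sum(p * log2(p) for p in pmf.values()))   # 2.5 bits per letter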
Joint entropy and conditional entropy (1/3) • The joint entropy of a pair of discrete random variables X, Y ~ p(x, y): H(X, Y) = -Σx Σy p(x, y) log2 p(x, y) • The conditional entropy: H(Y|X) = Σx p(x) H(Y|X = x) = -Σx Σy p(x, y) log2 p(y|x) • The chain rule for entropy: H(X, Y) = H(X) + H(Y|X)
Joint entropy and conditional entropy (2/3) • Ex9> Simplified Polynesian revisited • All words consist of sequences of CV (consonant-vowel) syllables • Per-syllable joint distribution p(C, V), with the marginal probabilities in the last column and bottom row:

          p      t      k
    a    1/16   3/8    1/16    1/2
    i    1/16   3/16   0       1/4
    u    0      3/16   1/16    1/4
         1/8    3/4    1/8

• Per-letter basis probabilities (half the per-syllable marginals): p = 1/16, t = 3/8, k = 1/16, a = 1/4, i = 1/8, u = 1/8
Joint entropy and conditional entropy (3/3) • The entropies of the per-syllable distribution above follow from these definitions and the chain rule (see the sketch below)
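A sketch that computes H(C), H(V), H(V|C), and H(C, V) from the per-syllable distribution above (the joint values are the ones assumed in the reconstructed table, taken from the textbook example):

    # Entropies for the per-syllable distribution above; verifies H(C, V) = H(C) + H(V|C).
    from math import log2

    joint = {("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
             ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
             ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16}

    def H(ps):
        return -sum(p * log2(p) for p in ps if p > 0)

    p_C = {c: sum(p for (c2, v), p in joint.items() if c2 == c) for c in "ptk"}  # consonant marginal
    p_V = {v: sum(p for (c, v2), p in joint.items() if v2 == v) for v in "aiu"}  # vowel marginal

    H_C, H_V, H_CV = H(p_C.values()), H(p_V.values()), H(joint.values())
    H_V_given_C = H_CV - H_C                      # chain rule: H(V|C) = H(C, V) - H(C)
    print(round(H_C, 3), round(H_V, 3), round(H_V_given_C, 3), round(H_CV, 3))
    # roughly 1.061, 1.5, 1.375, 2.436 bits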
Mutual information (1/2) • By the chain rule for entropy, H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y), hence H(X) - H(X|Y) = H(Y) - H(Y|X) = I(X; Y) : the mutual information • Mutual information between X and Y: the amount of information one random variable contains about another (symmetric, non-negative) • It is 0 only when two variables are independent • It grows not only with the degree of dependence, but also according to the entropy of the variables • It is actually better to think of it as a measure of independence
Mutual information (2/2) • Since H(X|X) = 0, I(X; X) = H(X) - H(X|X) = H(X), which is why entropy is also called self-information • I(X; Y) = H(X) + H(Y) - H(X, Y) = Σx Σy p(x, y) log2 [p(x, y) / (p(x) p(y))] • Conditional MI: I(X; Y | Z) = H(X|Z) - H(X|Y, Z) • Chain rule: I(X1…Xn; Y) = Σi I(Xi; Y | X1, …, Xi-1) • Pointwise MI between particular values x and y: I(x, y) = log2 [p(x, y) / (p(x) p(y))]
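A sketch computing I(C; V) for the same assumed per-syllable distribution, both to illustrate the definition and to show that C and V are not independent:

    # Mutual information I(C; V) as the expectation of pointwise MI over the joint pmf,
    # for the same assumed per-syllable distribution as above.
    from math import log2

    joint = {("p", "a"): 1/16, ("t", "a"): 3/8,  ("k", "a"): 1/16,
             ("p", "i"): 1/16, ("t", "i"): 3/16, ("k", "i"): 0,
             ("p", "u"): 0,    ("t", "u"): 3/16, ("k", "u"): 1/16}
    p_C = {c: sum(p for (c2, v), p in joint.items() if c2 == c) for c in "ptk"}
    p_V = {v: sum(p for (c, v2), p in joint.items() if v2 == v) for v in "aiu"}

    I_CV = sum(p * log2(p / (p_C[c] * p_V[v]))     # p(x, y) * pointwise MI, summed
               for (c, v), p in joint.items() if p > 0)
    print(round(I_CV, 3))   # 0.125 bits: small but nonzero, so C and V are not independent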
Noisy channel model • Channel capacity: the rate at which one can transmit information through the channel, achieved with the optimal input distribution: C = max over p(X) of I(X; Y) • Binary symmetric channel with crossover probability p: C = 1 - H(p) • Since entropy is non-negative, C ≤ 1 bit per use of the channel
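A sketch of the binary symmetric channel capacity C = 1 - H(p):

    # Binary symmetric channel: capacity C = 1 - H(p), where p is the crossover probability.
    from math import log2

    def bsc_capacity(p):
        h = 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))
        return 1 - h

    for p in (0.0, 0.1, 0.5):
        print(p, round(bsc_capacity(p), 3))   # 1.0, 0.531, 0.0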
Relative entropy or Kullback-Leibler divergence • Relative entropy for two pmfs p(x), q(x): D(p || q) = Σx p(x) log2 [p(x) / q(x)] • A measure of how close two pmfs are • Non-negative, and D(p || q) = 0 iff p = q • Conditional relative entropy: D(p(y|x) || q(y|x)) = Σx p(x) Σy p(y|x) log2 [p(y|x) / q(y|x)] • Chain rule: D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))
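A minimal KL-divergence sketch with made-up distributions p and q, also showing that D(p || q) is not symmetric:

    # Relative entropy D(p || q) = sum_x p(x) log2(p(x) / q(x)); 0 iff p = q, and not symmetric.
    from math import log2

    def kl_divergence(p, q):
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.25]
    q = [1/3, 1/3, 1/3]
    print(round(kl_divergence(p, q), 3), round(kl_divergence(q, p), 3))   # different values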