SNLP Chapter 2: Mathematical Foundations
Artificial Intelligence Laboratory (인공지능연구실), 정성원
Contents – Part 1
1. Elementary Probability Theory
• Conditional probability
• Bayes' theorem
• Random variables
• Joint and conditional distributions
Probability spaces
• Probability theory deals with predicting how likely it is that something will happen.
• The collection of basic outcomes (or sample points) for our experiment is called the sample space Ω.
• An event is a subset of the sample space; the collection of events forms a σ-field F.
• Probabilities are numbers between 0 and 1, where 0 indicates impossibility and 1 certainty.
• A probability function (or distribution) distributes a probability mass of 1 throughout the sample space.
• A well-founded probability space consists of a sample space Ω, a σ-field of events F, and a probability function P.
Conditional probability (1/2)
• P(A): the probability of the event A
• Conditional probability: P(A|B) = P(A ∩ B) / P(B)
• Ex1> A coin is tossed 3 times.
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
A = {HHT, HTH, THH}: exactly 2 heads, P(A) = 3/8
B = {HHH, HHT, HTH, HTT}: first toss is a head, P(B) = 1/2
P(A|B) = P(A ∩ B) / P(B) = (2/8) / (1/2) = 1/2
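To make the definition concrete, here is a minimal Python sketch of Ex1 that enumerates the sample space and computes P(A|B) directly from the definition (the names omega, A, B, p are just illustrative):

```python
from itertools import product

# Enumerate the sample space of three coin tosses: 8 equally likely outcomes.
omega = [''.join(t) for t in product('HT', repeat=3)]

A = {w for w in omega if w.count('H') == 2}   # exactly two heads
B = {w for w in omega if w[0] == 'H'}         # first toss is a head

p = lambda event: len(event) / len(omega)     # uniform probability measure

# P(A|B) = P(A ∩ B) / P(B)
print(p(A), p(B), p(A & B) / p(B))            # 0.375 0.5 0.5
```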
Conditional probability (2/2)
• Multiplication rule: P(A ∩ B) = P(B) P(A|B) = P(A) P(B|A)
• Chain rule: P(A1 ∩ … ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) … P(An | A1 ∩ … ∩ An−1)
• Two events A, B are independent if P(A ∩ B) = P(A) P(B)
• A and B are conditionally independent given C if P(A ∩ B | C) = P(A|C) P(B|C)
Bayes' theorem (2/2)
• Bayes' theorem: P(G|T) = P(T|G) P(G) / P(T), where P(T) = P(T|G) P(G) + P(T|¬G) P(¬G)
• Ex2> G: the event of the sentence having a parasitic gap
T: the event of the test being positive
• This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
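A short sketch of how Bayes' theorem produces this "poor result". The prior, sensitivity, and false-positive rate below are illustrative assumptions in the spirit of the example, not values taken from the slide:

```python
# Illustrative (assumed) numbers: a rare phenomenon and a fairly accurate test.
p_G = 0.00001           # prior: P(G), sentence contains a parasitic gap
p_T_given_G = 0.95      # sensitivity: P(T | G)
p_T_given_notG = 0.005  # false positive rate: P(T | ~G)

# Bayes' theorem: P(G|T) = P(T|G) P(G) / [P(T|G) P(G) + P(T|~G) P(~G)]
p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)
p_G_given_T = p_T_given_G * p_G / p_T
print(round(p_G_given_T, 4))   # ~0.0019: still very unlikely, because the prior is so low
```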
Random variable
• Ex3> Random variable X for the sum of two dice, with values S = {2, …, 12}
• Probability mass function (pmf): p(x) = P(X = x), written X ~ p(x)
• Expectation: E(X) = Σ_x x p(x)
• Variance: Var(X) = E((X − E(X))²) = E(X²) − (E(X))²
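A small sketch for Ex3, building the pmf of the sum of two fair dice and computing its expectation and variance from the definitions above:

```python
from itertools import product
from collections import Counter

# pmf of X = sum of two fair dice: p(x) = P(X = x) for x in {2, ..., 12}
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in counts.items()}

E = sum(x * p for x, p in pmf.items())               # expectation E[X]
Var = sum((x - E) ** 2 * p for x, p in pmf.items())  # variance E[(X - E[X])^2]
print(E, round(Var, 4))                              # 7.0 5.8333
```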
Joint and conditional distributions
• The joint pmf for two discrete random variables X, Y: p(x, y) = P(X = x, Y = y)
• Marginal pmfs, which total up the probability mass for the values of each variable separately:
p_X(x) = Σ_y p(x, y), p_Y(y) = Σ_x p(x, y)
• Conditional pmf: p_{X|Y}(x|y) = p(x, y) / p_Y(y), for y such that p_Y(y) > 0
• Chain rule: p(x, y) = p_X(x) p_{Y|X}(y|x)
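A minimal sketch of marginalization, conditioning, and the chain rule on a small assumed joint pmf (the table values are made up for illustration):

```python
# A small assumed joint pmf p(x, y) over X in {0, 1}, Y in {'a', 'b'}.
joint = {(0, 'a'): 0.1, (0, 'b'): 0.3, (1, 'a'): 0.2, (1, 'b'): 0.4}

# Marginal pmfs: total up the probability mass for each variable separately.
p_X = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in {0, 1}}
p_Y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in {'a', 'b'}}

# Conditional pmf p(x|y) = p(x, y) / p_Y(y), defined for y with p_Y(y) > 0.
p_X_given_Y = {(x, y): joint[(x, y)] / p_Y[y] for (x, y) in joint if p_Y[y] > 0}

# Chain rule check: p(x, y) = p_Y(y) * p(x|y)
assert abs(joint[(1, 'a')] - p_Y['a'] * p_X_given_Y[(1, 'a')]) < 1e-12
print(p_X, p_Y)   # {0: 0.4, 1: 0.6} {'a': 0.3, 'b': 0.7}
```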
Contents – Part 2
2. Essential Information Theory
• Entropy
• Joint entropy and conditional entropy
• Mutual information
• The noisy channel model
• Relative entropy or Kullback-Leibler divergence
Shannon's Information Theory
• Concerned with maximizing the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line.
• Theoretical maximum for data compression: the entropy H
• Theoretical maximum for the transmission rate: the channel capacity
Entropy (1/4)
• The entropy H (or self-information) is the average uncertainty of a single random variable X:
H(X) = −Σ_x p(x) log2 p(x), where p(x) is the pmf of X
• Entropy is a measure of uncertainty: the more we know about something, the lower the entropy will be.
• We can use entropy as a measure of the quality of our models.
• Entropy measures the amount of information in a random variable (measured in bits).
Entropy (2/4)
• [Figure] The entropy of a weighted coin: the horizontal axis shows the probability that the coin comes up heads; the vertical axis shows the entropy of tossing the corresponding coin once.
Entropy (3/4)
• Ex7> The result of rolling a fair 8-sided die (uniform distribution):
H(X) = −Σ_{i=1}^{8} (1/8) log2(1/8) = log2 8 = 3 bits
• Entropy: the average length of the message needed to transmit an outcome of that variable.
• In terms of the expectation E: H(X) = E(log2(1/p(X)))
Entropy (4/4)
• Ex8> Simplified Polynesian
• We can design a code that on average takes 2.5 bits to transmit a letter.
• Entropy can be interpreted as a measure of the size of the 'search space' consisting of the possible values of a random variable.
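A small sketch of the entropy formula applied to Ex7 and Ex8. The Simplified Polynesian letter probabilities below are the standard textbook values and are assumed here rather than read off the slide:

```python
import math

def H(pmf):
    """Entropy in bits: H(X) = -sum_x p(x) log2 p(x), with 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Ex7: a fair 8-sided die has entropy log2(8) = 3 bits.
print(H([1/8] * 8))                               # 3.0

# Ex8: assumed Simplified Polynesian letter distribution:
# p, t, k, a, i, u with probabilities 1/8, 1/4, 1/8, 1/4, 1/8, 1/8.
print(H([1/8, 1/4, 1/8, 1/4, 1/8, 1/8]))          # 2.5 bits per letter
```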
Joint entropy and conditional entropy (1/3)
• The joint entropy of a pair of discrete random variables X, Y ~ p(x, y):
H(X, Y) = −Σ_x Σ_y p(x, y) log2 p(x, y)
• The conditional entropy:
H(Y|X) = Σ_x p(x) H(Y|X = x) = −Σ_x Σ_y p(x, y) log2 p(y|x)
• The chain rule for entropy: H(X, Y) = H(X) + H(Y|X)
Joint entropy and conditional entropy (2/3)
• Ex9> Simplified Polynesian revisited
• All words consist of sequences of CV (consonant-vowel) syllables.
• [Table] Joint distribution P(C, V) over consonants {p, t, k} and vowels {a, i, u}, with marginal probabilities on a per-syllable basis and the corresponding per-letter probabilities.
Joint entropy and conditional entropy (3/3)
• [Table] The joint distribution P(C, V) again, used to compute the joint and conditional entropies (see the sketch below).
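A sketch of the joint and conditional entropy computation, assuming the standard textbook joint distribution P(C, V) for Simplified Polynesian (the numbers below are that assumed table, not recovered from the slide):

```python
import math
from collections import defaultdict

# Assumed joint distribution P(C, V): columns are consonants p, t, k; rows vowels a, i, u.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
    ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

p_C, p_V = defaultdict(float), defaultdict(float)
for (c, v), p in joint.items():
    p_C[c] += p
    p_V[v] += p

H_CV = H(joint.values())                     # joint entropy H(C, V)
H_C, H_V = H(p_C.values()), H(p_V.values())
H_V_given_C = H_CV - H_C                     # chain rule: H(C, V) = H(C) + H(V|C)
print(round(H_C, 3), round(H_V, 3), round(H_V_given_C, 3))   # ~1.061 1.5 1.375
```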
Mutual information (1/2)
• By the chain rule for entropy: H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
• Therefore H(X) − H(X|Y) = H(Y) − H(Y|X) = I(X; Y), the mutual information between X and Y
• The amount of information one random variable contains about another (symmetric, non-negative).
• It is 0 only when the two variables are independent.
• It grows not only with the degree of dependence, but also with the entropy of the variables.
• For this reason it is actually better to think of it as a measure of independence: I(X; Y) = 0 cleanly characterizes independence, while its magnitude is confounded by entropy.
Mutual information (2/2)
• Since H(X) = H(X) − H(X|X) = I(X; X), entropy is also called self-information.
• Conditional MI: I(X; Y | Z) = H(X|Z) − H(X|Y, Z)
• Chain rule: I(X1…Xn; Y) = Σ_i I(Xi; Y | X1, …, X_{i−1})
• Pointwise MI between two particular outcomes x and y: I(x, y) = log2 [ p(x, y) / (p(x) p(y)) ]
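A minimal sketch of mutual information and pointwise MI computed from a joint pmf; the joint tables used below are assumed examples:

```python
import math
from collections import defaultdict

def marginals(joint):
    p_X, p_Y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_X[x] += p
        p_Y[y] += p
    return p_X, p_Y

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]  (in bits)."""
    p_X, p_Y = marginals(joint)
    return sum(p * math.log2(p / (p_X[x] * p_Y[y]))
               for (x, y), p in joint.items() if p > 0)

def pmi(joint, x, y):
    """Pointwise MI of a single pair: log2 [ p(x,y) / (p(x) p(y)) ]."""
    p_X, p_Y = marginals(joint)
    return math.log2(joint[(x, y)] / (p_X[x] * p_Y[y]))

# Independence gives I(X;Y) = 0, since p(x,y) = p(x)p(y) for every pair.
independent = {(x, y): px * py for x, px in [(0, 0.5), (1, 0.5)]
                               for y, py in [(0, 0.25), (1, 0.75)]}
print(round(mutual_information(independent), 12))            # 0.0

# A dependent joint has positive MI; pointwise MI can be read off per pair.
dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(round(mutual_information(dependent), 4), round(pmi(dependent, 0, 0), 4))  # 0.2781 0.6781
```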
Noisy channel model (1/2)
• Channel capacity: the rate at which one can transmit information through the channel optimally:
C = max_{p(X)} I(X; Y)
• Binary symmetric channel with crossover probability p: C = 1 − H(p)
• Since entropy is non-negative, C ≤ 1.
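A short sketch of the binary symmetric channel capacity C = 1 − H(p):

```python
import math

def H2(p):
    """Binary entropy: H(p) = -p log2 p - (1-p) log2 (1-p)."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1 - H2(p)

print(bsc_capacity(0.0), bsc_capacity(0.5), round(bsc_capacity(0.1), 3))
# 1.0 (noiseless)   0.0 (useless channel)   0.531
```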
Relative entropy or Kullback-Leibler divergence
• Relative entropy for two pmfs p(x), q(x):
D(p || q) = Σ_x p(x) log2 [ p(x) / q(x) ]
• A measure of how close two pmfs are (not symmetric, so not a true distance).
• Non-negative, and D(p || q) = 0 iff p = q.
• Conditional relative entropy: D(p(y|x) || q(y|x)) = Σ_x p(x) Σ_y p(y|x) log2 [ p(y|x) / q(y|x) ]
• Chain rule: D(p(x, y) || q(x, y)) = D(p(x) || q(x)) + D(p(y|x) || q(y|x))
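A minimal sketch of relative entropy on a pair of assumed pmfs, illustrating D(p||p) = 0 and asymmetry:

```python
import math

def kl(p, q):
    """D(p || q) = sum_x p(x) log2 [ p(x) / q(x) ], with the convention 0 log 0 = 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]          # an assumed pair of pmfs, just for illustration
print(round(kl(p, p), 6))                        # 0.0 (D(p||q) = 0 iff p = q)
print(round(kl(p, q), 4), round(kl(q, p), 4))    # ~0.085 vs ~0.0817: not symmetric
```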
The relation to language: Cross entropy
• Use entropy as a measure of the quality of our models of language.
• Cross entropy of a random variable X with true pmf p(x) and model m:
H(X, m) = H(X) + D(p || m) = −Σ_x p(x) log2 m(x)
• Pointwise entropy of a single outcome under the model: −log2 m(x)
• Since H(X) is fixed by the true distribution, minimizing the cross entropy is equivalent to minimizing D(p || m).
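A small sketch verifying the decomposition H(X, m) = H(X) + D(p || m) on assumed distributions p and m:

```python
import math

def H(p):
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) log2 m(x); requires m(x) > 0 wherever p(x) > 0."""
    return -sum(px * math.log2(mx) for px, mx in zip(p, m) if px > 0)

def kl(p, m):
    return sum(px * math.log2(px / mx) for px, mx in zip(p, m) if px > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (assumed for illustration)
m = [0.4, 0.4, 0.2]     # a model of p

# Decomposition: H(p, m) = H(p) + D(p || m), so minimizing cross entropy
# over models m is the same as minimizing D(p || m).
print(round(cross_entropy(p, m), 6), round(H(p) + kl(p, m), 6))   # both ~1.571928
```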