LING / C SC 439/539 Statistical Natural Language Processing Lecture 11 part 2 2/18/2013
Recommended Reading • Manning & Schutze Chapter 2, Mathematical Foundations • Bayesian networks • http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html • http://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf • http://www.autonlab.org/tutorials/bayesnet.html • http://en.wikipedia.org/wiki/Bayesian_network
Outline • Probability theory • Some probability problems • Minimal encoding of probability distributions
Probability topics • Random variables and sample spaces • Probability distribution • Frequentist probability estimation • Expected value • Joint probability • Conditional probability • Marginal probability • Independence • Conditional independence • Product rule • Chain rule • Bayes rule • Subjective probability
1. Discrete random variables • A discrete random variable takes on a range of values, or events • The set of possible events is the sample space, Ω • Example: rolling a die Ω = {1 dot, 2 dots, 3 dots, 4 dots, 5 dots, 6 dots} • The occurrence of a random variable taking on a particular value from the sample space is a trial
2. Probability distribution • A set of data can be described as a probability distribution over a set of events • Definition of a probability distribution: • We have a set of events x drawn from a finite sample space Ω • Probability of each event is between 0 and 1 • Sum of probabilities of all events is 1: Σx∈Ω p(x) = 1
Example: Probability distribution • Suppose you have a die that is equally weighted on all sides. • Let X be the random variable for the outcome of a single roll. p(X=1 dot) = 1 / 6 p(X=2 dots) = 1 / 6 p(X=3 dots) = 1 / 6 p(X=4 dots) = 1 / 6 p(X=5 dots) = 1 / 6 p(X=6 dots) = 1 / 6
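A minimal Python sketch of this definition: represent the fair-die distribution as a dictionary and check both conditions (the representation is just for illustration).

```python
# Fair-die distribution: outcome (number of dots) -> probability
p = {n: 1/6 for n in range(1, 7)}

# Condition 1: every probability is between 0 and 1
assert all(0 <= prob <= 1 for prob in p.values())

# Condition 2: the probabilities sum to 1 (allowing floating-point error)
assert abs(sum(p.values()) - 1.0) < 1e-9
```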
3. Frequentist probability estimation • Suppose you have a die and you don’t know how it is weighted. • Let X be the random variable for the outcome of a roll. • Want to produce values for p̂(X), which is an estimate of the probability distribution of X. • Read as “p-hat” • Do this through Maximum Likelihood Estimation (MLE): the probability of an event is the number of times it occurs, divided by the total number of trials.
Example: roll a die; random variable X • Data: roll a die 60 times, record the frequency of each event • 1 dot 9 rolls • 2 dots 10 rolls • 3 dots 9 rolls • 4 dots 12 rolls • 5 dots 9 rolls • 6 dots 11 rolls
Example: roll a die; random variable X • Maximum Likelihood Estimate: p̂(X=x) = count(x) / total_count_of_all_events • p̂( X = 1 dot) = 9 / 60 = 0.150 p̂( X = 2 dots) = 10 / 60 = 0.167 p̂( X = 3 dots) = 9 / 60 = 0.150 p̂( X = 4 dots) = 12 / 60 = 0.200 p̂( X = 5 dots) = 9 / 60 = 0.150 p̂( X = 6 dots) = 11 / 60 = 0.183 Sum = 60 / 60 = 1.0
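The same estimate as a Python sketch, with the counts copied from the slide above:

```python
from collections import Counter

# Frequencies from 60 rolls of the die
counts = Counter({"1 dot": 9, "2 dots": 10, "3 dots": 9,
                  "4 dots": 12, "5 dots": 9, "6 dots": 11})

total = sum(counts.values())                      # 60 trials
p_hat = {event: c / total for event, c in counts.items()}

print(p_hat["4 dots"])      # 0.2
print(sum(p_hat.values()))  # 1.0 (up to floating point)
```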
Convergence of p̂(X) • Suppose we know that the die is equally weighted. • We observe that our values for p̂(X) are close to p(X), but not all exactly equal. • We would expect that as the number of trials increases, p̂(X) will get closer to p(X). • For example, we could roll the die 1,000,000 times. Probability estimate will improve with more data.
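A quick simulation sketch of this convergence, assuming a fair die (random.choice gives each outcome equal probability); p̂(X = 4 dots) should approach 1/6 ≈ 0.167 as the number of trials grows.

```python
import random
from collections import Counter

outcomes = ["1 dot", "2 dots", "3 dots", "4 dots", "5 dots", "6 dots"]

# Estimate p-hat(X = 4 dots) from increasingly many simulated rolls
for n_trials in (60, 6_000, 600_000):
    counts = Counter(random.choice(outcomes) for _ in range(n_trials))
    print(n_trials, counts["4 dots"] / n_trials)
```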
Simplify notation • People are often not precise, and write “p(X)” when they mean “p̂(X)” • We will do this also • Can also leave out the name of the random variable when it is understood • Example: p(X=4 dots) p(4 dots)
4. Expected value • Roll the die, get these results: p( X = roll 1) = 3 / 20 p( X = roll 4) = 2 / 20 p( X = roll 2) = 2 / 20 p( X = roll 5) = 1 / 20 p( X = roll 3) = 4 / 20 p( X = roll 6) = 8 / 20 • On average, if I roll the die, how many dots will there be? • Answer is not ( 1 + 2 + 3 + 4 + 5 + 6 ) / 6 = 3.5 • Need to consider the probability of each event
Expected value of a random variable • The expected value of a random variable X is a weighted sum of the values of X: E[X] = Σx∈Ω x · p(X=x) • i.e., for each event x in the sample space for the random variable X, multiply the probability of the event by the value of the event, and sum these • The expected value is not necessarily equal to one of the events in the sample space.
Expected value: example • The expected value of a random variable X is a weighted sum of the values of X. • Example: the average number of dots that I rolled Suppose: p( X = roll 1) = 3 / 20 p( X = roll 4) = 2 / 20 p( X = roll 2) = 2 / 20 p( X = roll 5) = 1 / 20 p( X = roll 3) = 4 / 20 p( X = roll 6) = 8 / 20 • E[X] = (3/20)*1 + (2/20)*2 + (4/20)*3 + (2/20)*4 + (1/20)*5 + (8/20)*6 = 4.0
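The same weighted sum as a Python sketch, with the probabilities copied from this slide:

```python
# Distribution over the number of dots rolled
p = {1: 3/20, 2: 2/20, 3: 4/20, 4: 2/20, 5: 1/20, 6: 8/20}

# E[X] = sum over outcomes x of x * p(x)
expected = sum(x * prob for x, prob in p.items())
print(expected)   # 4.0 (up to floating point)
```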
5. Joint prob.: multiple random variables • Complex data can be described as a combination of values of multiple random variables • Example: 2 random variables • COLOR ∈ { blue, red } • SHAPE ∈ { square, circle } • Frequency of events: • count(COLOR=blue, SHAPE=square) = 1 • count(COLOR=red, SHAPE=square) = 2 • count(COLOR=red, SHAPE=circle) = 3 • count(COLOR=blue, SHAPE=circle) = 2
Probability dist. over events that are combinations of random variables p(COLOR=blue, SHAPE=square) = 1 / 8 p(COLOR=red, SHAPE=square) = 2 / 8 p(COLOR=red, SHAPE=circle) = 3 / 8 p(COLOR=blue, SHAPE=circle) = 2 / 8 Sum = 8 / 8 = 1.0 Joint probability distribution
May omit name of random variableif it’s understood • Joint probability distribution p: • p( blue, square ) = 1 / 8 = .125 • p( red, square ) = 2 / 8 = .250 • p( red, circle ) = 3 / 8 = .375 • p( blue, circle ) = 2 / 8 = .250 • Sum = 8 / 8 = 1.0
6. Conditional probability • Example: • You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • Read as “probability of pink given puppy” • In conditional probability: • the probability calculation is restricted to a subset of events in the joint distribution • that subset is determined by the values of the random variables being conditioned on
Conditional probability • Sample space for probability calculation is restricted to particular events in the joint distribution • p( SHAPE = square | COLOR = red ) = 2 / 5 • p( SHAPE = circle | COLOR = red ) = 3 / 5 • p( COLOR = blue | SHAPE = square ) = 1 / 3 • p( COLOR = red | SHAPE = square ) = 2 / 3 • p( COLOR = blue | SHAPE = circle ) = 2 / 5
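A sketch of conditioning as restriction, using the eight objects of this running example (the list below just spells them out as (color, shape) pairs):

```python
from collections import Counter

# 1 blue square, 2 red squares, 3 red circles, 2 blue circles
objects = ([("blue", "square")] * 1 + [("red", "square")] * 2 +
           [("red", "circle")] * 3 + [("blue", "circle")] * 2)

# Condition on COLOR=red: restrict the sample space to red objects
red_shapes = [shape for color, shape in objects if color == "red"]
print(Counter(red_shapes)["square"] / len(red_shapes))   # 2/5 = 0.4
```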
Compare to unconditional probability • Unconditional probability: sample space for probability calculation is unrestricted • p( SHAPE = square ) = 3 / 8 • = p( SHAPE = square | COLOR=blue or COLOR=red) = 3 / 8 • p( SHAPE = circle ) = 5 / 8 • p( COLOR = blue ) = 3 / 8 • p( COLOR = red ) = 5 / 8
7. Marginal (unconditional) probability • Probability for a subset of the random variable(s), ignoring other random variable(s) • If you know only the joint distribution, you can calculate the marginal probability of a random variable • Sum over values of all other random variables: p(X=x) = Σy p(X=x, Y=y)
Marginal probability: example • p(COLOR=blue) = ? • Calculate by counting blue objects: 3/8 • Calculate through marginal probability: p(COLOR=blue) = p(COLOR=blue,SHAPE=circle) + p(COLOR=blue,SHAPE=square) = 2/8 + 1/8 = 3/8
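The same calculation as a Python sketch, summing the joint probabilities over SHAPE:

```python
# Joint distribution from the earlier slide, keyed by (color, shape)
joint = {("blue", "square"): 1/8, ("red", "square"): 2/8,
         ("red", "circle"): 3/8, ("blue", "circle"): 2/8}

# Marginalize out SHAPE: sum over all shapes with COLOR=blue
p_blue = sum(prob for (color, shape), prob in joint.items()
             if color == "blue")
print(p_blue)   # 0.375 = 3/8
```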
Why it’s called “marginal probability”: margins of the joint prob. table • Sum probs. in each row and column to get marginal probs:

              SHAPE=square   SHAPE=circle   p(COLOR)
COLOR=blue    1/8            2/8            3/8
COLOR=red     2/8            3/8            5/8
p(SHAPE)      3/8            5/8            8/8 = 1 (total probability: p(COLOR, SHAPE))
Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(B|A) = p(A, B) / p(A) • Probability of events of B, restricted to events of A • For numerator, only consider events that occur in both A and B
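A sketch of the quotient formula, computing p(SHAPE | COLOR) from the joint distribution above (the helper function is just for illustration):

```python
joint = {("blue", "square"): 1/8, ("red", "square"): 2/8,
         ("red", "circle"): 3/8, ("blue", "circle"): 2/8}

def p_shape_given_color(shape, color):
    """p(SHAPE=shape | COLOR=color) = p(color, shape) / p(color)."""
    p_color = sum(p for (c, s), p in joint.items() if c == color)
    return joint[(color, shape)] / p_color

print(p_shape_given_color("square", "red"))   # 2/5 = 0.4
```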
8. Independence • Two random variables A and B are independent if p(A, B) = p(A) * p(B) • i.e., if the joint probability equals the product of the marginal probabilities • “Independent”: a random variable has no effect on the distribution of another random variable
Independence: example • Flip a fair coin: p(heads) = .5, p(tails) = .5 • Flip the coin twice. • Let X be the random variable for the 1st flip. • Let Y be the random variable for the 2nd flip. • The two flips don’t influence each other, so you would expect that p(X, Y) = p(X) * p(Y) • p(X=heads, Y=tails) = p(X=heads) * p(Y=tails) = .5*.5 = .25
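A simulation sketch of this: flip two independent fair coins many times, and the joint relative frequency should come out close to the product of the marginals.

```python
import random

# 100,000 trials of two independent fair-coin flips
flips = [(random.choice("HT"), random.choice("HT")) for _ in range(100_000)]
n = len(flips)

p_joint = sum(1 for x, y in flips if (x, y) == ("H", "T")) / n
p_x = sum(1 for x, y in flips if x == "H") / n
p_y = sum(1 for x, y in flips if y == "T") / n

print(p_joint, p_x * p_y)   # both close to 0.25
```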
Non-independence: example • Suppose a class has a midterm and a final, and the final is cumulative. No one drops out of the class. • Midterm: 200 pass, 130 fail • Final: 180 pass, 150 fail • Contingency table shows marginal total counts (330 students in total) • Rate of failure increases over time
p(MIDTERM, FINAL) • This table shows values for joint probability • Divide each cell’s count by total count of 330 • Margins show marginal probabilities • Example: p(MIDTERM=fail) = 0.394
p(MIDTERM) * p(FINAL) • Suppose MIDTERM and FINAL are independent. • Then p(MIDTERM, FINAL) = p(MIDTERM) * p(FINAL) • Expected probabilities assuming independence: For each cell, p(MIDTERM=x, FINAL=y) = p(MIDTERM=x) * p(FINAL=y) Example: p(MIDTERM=fail, FINAL=pass) = p(MIDTERM=fail) * p(FINAL=pass) = .394 * .545 = .215
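The expected table under independence can be computed from the marginal counts alone, as in this sketch:

```python
# Marginal probabilities from the contingency table (330 students)
p_midterm = {"pass": 200/330, "fail": 130/330}
p_final = {"pass": 180/330, "fail": 150/330}

# Under independence, each joint cell is the product of its marginals
for m, pm in p_midterm.items():
    for f, pf in p_final.items():
        print(f"p(MIDTERM={m}, FINAL={f}) = {pm * pf:.3f}")
# e.g. p(MIDTERM=fail, FINAL=pass) = 0.215, as on the slide
```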
MIDTERM and FINAL are not independent: p(MIDTERM, FINAL) != p(MIDTERM) * p(FINAL) • Compare the observed joint probability table with the joint prob. table expected under independence
Calculate conditional probability through joint and marginal probability • Conditional probability is the quotient of joint and marginal probability: p(A|B) = p(B, A) / p(B) • Probability of events of A, restricted to events of B • For numerator, only consider events that occur in both A and B
9. Conditional independence • A and B are conditionally independent given C if p(A, B | C) = p(A|C) * p(B|C) • In the subset of the data specified by C, A and B are independent • Does not necessarily mean that A and B are independent
Conditional independence: example • 3 random variables: • COLOR ∈ {red, blue} • SHAPE ∈ {circle, square} • KITTY ∈ {True, False} • COLOR and SHAPE are not independent. • For example, p(blue, circle) = 2/8 • but p(blue)*p(circle) = 4/8 * 5/8 = 20/64 = 2.5/8
Conditional independence: example • COLOR and SHAPE are conditionally ind. given KITTY=TRUE: • p(COLOR, SHAPE|K=TRUE) = p(COLOR|K=T)*p(SHAPE|K=T) • p(red|K=T) = 2/4 = 1/2, p(blue|K=T) = 2/4 = 1/2 • p(circle|K=T) = 2/4 = 1/2, p(square|K=T) = 2/4 = 1/2 • p(red, circle|K=T) = 1/4 p(red|K=T)*p(circle|K=T) = 1/2 * 1/2 = 1/4 • p(red, square|K=T) = 1/4 p(red|K=T)*p(square|K=T) = 1/2 * 1/2 = 1/4 • p(blue, circle|K=T) = 1/4 p(blue|K=T)*p(circle|K=T) = 1/2 * 1/2 = 1/4 • p(blue, square|K=T) = 1/4 p(blue|K=T)*p(square|K=T) = 1/2 * 1/2 = 1/4
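A sketch that checks this numerically. The eight (color, shape, kitty) objects below are an assumption: one hypothetical dataset consistent with the counts used on these slides.

```python
from collections import Counter

# Hypothetical objects consistent with the slide's counts (an assumption)
data = [("red", "circle", True), ("red", "square", True),
        ("blue", "circle", True), ("blue", "square", True),
        ("blue", "circle", False), ("blue", "square", False),
        ("red", "circle", False), ("red", "circle", False)]

# Restrict to KITTY=True, then test p(c, s | K) == p(c | K) * p(s | K)
kitty = [(c, s) for c, s, k in data if k]
n = len(kitty)
for (c, s), count in Counter(kitty).items():
    p_c = sum(1 for c2, _ in kitty if c2 == c) / n
    p_s = sum(1 for _, s2 in kitty if s2 == s) / n
    print((c, s), count / n, p_c * p_s)   # each row: 0.25 == 0.25
```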
10. Product rule • Conditional probability: P(B | A) = P(A, B) / P(A) • Product rule: P(A) * P(B | A) = P(A, B) • Generates joint probability from an unconditional probability and a conditional probability
Product rule, conditional probability, and independence • Product rule: P(A) * P(B | A) = P(A, B) • Suppose A and B are independent: P(A) * P(B) = P(A, B) • Then p(B | A) = p(B) • Explanation: B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events also in B does not change from the unrestricted sample space.
Conditional probability and independence • B has a particular probability in the sample space. When restricted to the subset of events belonging to A, the proportion of events in B does not change. • Example: • p(COLOR=blue) = 3/9 = 1/3 • P(COLOR=blue|SHAPE=square) = 1/3 • P(COLOR=blue|SHAPE=circle) = 1/3 • p(COLOR=red) = 6/9 = 2/3 • P(COLOR=red|SHAPE=square) = 2/3 • P(COLOR=red|SHAPE=circle) = 2/3 • Therefore p(COLOR) = p(COLOR|SHAPE)
11. Chain rule • Product rule: P(A) * P(B | A) = P(A, B) • Chain rule: generalization of the product rule to N random variables • p(X1, …, Xn) = p(X1, ..., Xn-1) * p(Xn | X1, ..., Xn-1) = p(X1) * p(X2 | X1) * … * p(Xn | X1, ..., Xn-1) • Example: N = 3 • p(A, B, C) = p(A, B) * p(C | A, B) = p(A) * p(B | A) * p(C | A, B)
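A sketch of the chain rule applied to a three-word sequence, as in language modeling; the probability values are made up for illustration.

```python
# Hypothetical conditional probabilities (illustrative values only)
p_w1 = 0.1              # p(w1)
p_w2_given_w1 = 0.4     # p(w2 | w1)
p_w3_given_w1w2 = 0.25  # p(w3 | w1, w2)

# Chain rule: p(w1, w2, w3) = p(w1) * p(w2 | w1) * p(w3 | w1, w2)
p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1w2
print(p_sequence)   # 0.01
```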
12. Bayes rule • Thomas Bayes 1702 - 1761
Inconsistent terminology • Bayes’ theorem • Bayes theorem • Bayes’s theorem • Bayes’ rule • Bayes rule preferable? • Bayes’s rule • Baye’s theorem • Baye’s rule • Bayesian theorem • Bayesian rule
Bayes Rule • One conditional probability can be obtained from the other • Product rule: • p(B)*p(A|B) = p(A, B) • p(A)*p(B|A) = p(A, B) • Calculate p(A|B) from p(B|A), p(A), and p(B): • p(A|B) = p(A)*p(B|A) / p(B)
Product rule: • p(B)*p(A|B) = p(A, B) • p(A)*p(B|A) = p(A, B) • Calculate p(A|B) from p(B|A), p(A), and p(B): • Derivation: p(B)*p(A|B) = p(A, B), so p(A|B) = p(A, B) / p(B); substituting p(A, B) = p(A)*p(B|A) gives p(A|B) = p(A)*p(B|A) / p(B)
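A sketch of the rule as a function, checked against the COLOR/SHAPE numbers from the earlier slides (p(square | blue) = 1/3, p(blue) = 3/8, p(square) = 3/8):

```python
def bayes(p_b_given_a, p_a, p_b):
    """p(A|B) = p(A) * p(B|A) / p(B)."""
    return p_a * p_b_given_a / p_b

# p(blue | square) from p(square | blue), p(blue), p(square)
print(bayes(p_b_given_a=1/3, p_a=3/8, p_b=3/8))   # 1/3, matching the
                                                  # conditional prob. slide
```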
13. Subjective probability • Two schools of thought in the interpretation of probability • 1. Frequentist interpretation • Probability is the chance of occurrence of an event • Probability is estimated from measurements • 2. Bayesian, or subjective interpretation • Probability is one’s degree of belief about an event • Probability estimation involves both measurements, and numerical estimates of your beliefs about data
Bayesian interpretation of conditional probability as additional evidence • Unconditional probability: p(A) • Belief about an event without any additional information • Conditional probability: p(A|B) • Belief about an event after it has been modified by additional knowledge about the value of B
Example: belief in COLOR changes when you know SHAPE • Unconditional belief of COLOR (no knowledge of value of SHAPE) • P(COLOR=blue) = P(COLOR=blue | SHAPE=circle or SHAPE=square) = .375 • P(COLOR=red) = P(COLOR=red | SHAPE=circle or SHAPE=square) = .625 • Knowledge of SHAPE changes belief in COLOR • P(COLOR=blue | SHAPE=square ) = .333 (decreases from unconditional prob.) • P(COLOR=red | SHAPE=square ) = .667 (increases from unconditional prob.)
Prior, posterior, and likelihood • Bayes rule: p(A|B) = p(B|A) * p(A) / p(B) • Prior probability: p( A ) • Belief about A, without any additional evidence • Example: p( rain ) = .2 • Posterior probability: p( A | B ) • Probabilities of events change with new evidence • Example: p ( rain | hurricane ) = .999 • Likelihood: p( B | A ) • How likely is B in the first place, given A ? • Example: p( hurricane | rain ) = .000001
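As a sketch, the three numbers above pin down the missing one: rearranging Bayes rule gives p(hurricane) = p(hurricane | rain) * p(rain) / p(rain | hurricane).

```python
p_rain = 0.2                       # prior: p(rain)
p_rain_given_hurricane = 0.999     # posterior: p(rain | hurricane)
p_hurricane_given_rain = 0.000001  # likelihood: p(hurricane | rain)

# Rearranged Bayes rule:
# p(hurricane) = p(hurricane | rain) * p(rain) / p(rain | hurricane)
p_hurricane = p_hurricane_given_rain * p_rain / p_rain_given_hurricane
print(p_hurricane)   # about 2e-7: hurricanes are rare
```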
Outline • Probability theory • Some probability problems • Minimal encoding of probability distributions
#1. Sample space, joint and conditional probability • You have 4 pink puppies, 5 pink kitties, and 2 blue puppies. What is p(pink | puppy) ? • I have two children. What is the probability that both are girls? • I have two children. At least one of them is a girl. What is the probability that both are girls?
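One way to check answers to the last two problems is to enumerate the equally likely sample space, as in this sketch (it assumes girls and boys are equally likely and the two children are independent):

```python
from itertools import product

# Equally likely outcomes for two children: GG, GB, BG, BB
space = list(product("GB", repeat=2))

# p(both girls) over the full sample space
print(sum(1 for c in space if c == ("G", "G")) / len(space))   # 1/4

# Condition on "at least one girl" by restricting the sample space
girls = [c for c in space if "G" in c]                         # GG, GB, BG
print(sum(1 for c in girls if c == ("G", "G")) / len(girls))   # 1/3
```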