COMP 791A: Statistical Language Processing Mathematical Essentials Chap. 2
Motivations • Statistical NLP aims to do statistical inference for the field of NL • Statistical inference consists of: • taking some data (generated in accordance with some unknown probability distribution) • then making some inference about this distribution • Ex. of statistical inference: language modeling • how to predict the next word given the previous words • to do this, we need a model of the language • probability theory helps us find such a model
Notions of Probability Theory • Probability theory • deals with predicting how likely it is that something will happen • Experiment (or trial) • the process by which an observation is made • Ex. tossing a coin twice
Sample Spaces and events • Sample space Ω : • set of all possible basic outcomes of an experiment • Coin toss: Ω = {head, tail} • Tossing a coin twice: Ω = {HH, HT, TH, TT} • Uttering a word: |Ω| = vocabulary size • Every observation (element in Ω) is a basic outcome or sample point • An event A is a set of basic outcomes with A ⊆ Ω • Ω is then the certain event • Ø is the impossible (or null) event • Example - rolling a die: • Sample space Ω = {1, 2, 3, 4, 5, 6} • Event A that an even number occurs A = {2, 4, 6}
Events and Probability • The probability of an event A is denoted p(A) • also called the prior probability • i.e. the probability before we consider any additional knowledge • Example: experiment of tossing a coin 3 times • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} • event with two or more tails: • A = {HTT, THT, TTH, TTT} • P(A) = |A|/|Ω| = ½ (assuming uniform distribution) • event with all heads: • A = {HHH} • P(A) = |A|/|Ω| = ⅛ (see the sketch below)
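A minimal sketch (in Python, not part of the original slides) of computing these probabilities by enumerating the sample space under a uniform distribution:

```python
from itertools import product
from fractions import Fraction

# Sample space for 3 coin tosses: all sequences of H/T of length 3
omega = list(product("HT", repeat=3))       # 8 equally likely outcomes

# Event A: two or more tails
A = [o for o in omega if o.count("T") >= 2]
print(Fraction(len(A), len(omega)))         # P(A) = |A|/|Omega| = 1/2

# Event B: all heads
B = [o for o in omega if o.count("H") == 3]
print(Fraction(len(B), len(omega)))         # P(B) = 1/8
```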
Probability Properties • A probability function P (or probability distribution): • Distributes a probability mass of 1 over the sample space Ω • P(A) ∈ [0,1] • P(Ω) = 1 • For disjoint events Ai (ie : Ai ∩ Aj = Ø for all i ≠ j) • P(∪i Ai) = Σi P(Ai) • Immediate consequences: • P(Ø) = 0 • P(Ā) = 1 - P(A) • A ⊆ B ==> P(A) ≤ P(B) • Σa∈Ω P(a) = 1
Joint probability • Joint probability of A and B: • P(A,B) = P(A ∩ B) • (Venn diagram: A, B and their intersection A ∩ B inside Ω)
Conditional probability • Prior (or unconditional) probability • Probability of an event before any evidence is obtained • P(A) = 0.1 P(rain today) = 0.1 • i.e. Your belief about A given that you have no evidence • Posterior (or conditional) probability • Probability of an event given that all we know is B (some evidence) • P(A|B) = 0.8 P(rain today| cloudy) = 0.8 • i.e. Your belief about A given that all you know is B
Conditional probability (con’t) • P(A|B) = P(A ∩ B) / P(B) • (Venn diagram: P(A|B) is the proportion of B that also lies in A)
Chain rule • With 3 events, the probability that A, B and C occur is: • The probability that A occurs • Times, the probability that B occurs, assuming that A occurred • Times, the probability that C occurs, assuming that A and B have occurred • i.e. P(A,B,C) = P(A) × P(B|A) × P(C|A,B) • With multiple events, we can generalize to the Chain rule: P(A1, A2, A3, A4, ..., An) = ∏i P(Ai|A1,…,Ai-1) = P(A1) × P(A2|A1) × P(A3|A1,A2) × ... × P(An|A1,A2,A3,…,An-1) • (important to NLP — see the sketch below)
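As a quick illustration (not from the slides), the chain rule can be applied to a word sequence; the conditional probabilities below are made-up values, used only to show the decomposition:

```python
# P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
# Hypothetical probabilities for the sequence "Mary is reading"
p_w1 = 0.001            # P("Mary")
p_w2_given_w1 = 0.1     # P("is" | "Mary")
p_w3_given_w1w2 = 0.05  # P("reading" | "Mary is")

p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1w2
print(p_sequence)       # 5e-06
```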
So? • we typically want to know: P(Cause | Effect) ex: P(Disease | Symptoms) ex: P(linguistic phenomenon | linguistic observations) • But this information is hard to gather • However P(Effect | Cause) is easier to gather (from training data) • So we use Bayes’ theorem: P(Cause | Effect) = P(Effect | Cause) × P(Cause) / P(Effect)
Example • Rare syntactic construction occurs in 1/100,000 sentences • A system identifies sentences with such a construction, but it is not perfect • If sentence has the construction --> system identifies it 95% of the time • If sentence does not have the construction --> system says it does 0.5% of the time • Question: • if the system says that sentence S has the construction… what is the probability that it is right?
Example (con’t) • What is P(sentence has the construction | the system says yes) ? • Let: • cons = sentence has the construction • yes = system says yes • not_cons = sentence does not have the construction • we have: • P(cons) = 1/100,000 = 0.00001 • P(yes | cons) = 95% = 0.95 • P(yes | not_cons) = 0.5% = 0.005 • P(yes) = ? • P(B) = P(B|A) P(A) + P(B|Ā) P(Ā) • P(yes) = P(yes | cons) × P(cons) + P(yes | not_cons) × P(not_cons) = 0.95 × 0.00001 + 0.005 × 0.99999 ≈ 0.005
Example (con’t) • So: P(cons | yes) = P(yes | cons) × P(cons) / P(yes) = (0.95 × 0.00001) / 0.005 ≈ 0.0019 • So in only about 1 out of every 500 sentences for which the system says yes is it actually right!!! (see the numeric check below)
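A short sanity check in Python (not part of the slides), plugging the slide's numbers into the total-probability and Bayes formulas:

```python
# Bayes' theorem check for the rare-construction example (values from the slides)
p_cons = 0.00001                 # P(sentence has the construction)
p_yes_given_cons = 0.95          # P(system says yes | construction)
p_yes_given_not_cons = 0.005     # P(system says yes | no construction)

# Total probability: P(yes) = P(yes|cons)P(cons) + P(yes|not_cons)P(not_cons)
p_yes = p_yes_given_cons * p_cons + p_yes_given_not_cons * (1 - p_cons)

# Bayes: P(cons | yes) = P(yes | cons) * P(cons) / P(yes)
p_cons_given_yes = p_yes_given_cons * p_cons / p_yes
print(round(p_cons_given_yes, 4))   # ~0.0019, i.e. about 1 in 500
```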
Statistical Independence vs. Statistical Dependence • How likely are we to have Head in a coin toss, given that it is raining today? • A: having a head in a coin toss • B: raining today • Some variables are independent… • How likely is the word “ambulance” to appear, given that we’ve seen “car accident”? • Words in text are not independent
Independent events • Two events A and B are independent: • if the occurrence of one of them does not influence the occurrence of the other • i.e. A is independent of B if P(A) = P(A|B) • If A and B are independent, then: • P(A,B) = P(A|B) x P(B) (by chain rule) = P(A) x P(B) (by independence) • In NLP, we often assume independence of variables
Bayes’ Theorem revisited (a golden rule in statistical NLP) • If we are interested in which event B is most likely to occur given an observation A • we can choose the B with the largest P(B|A) • P(A) • is a normalization constant (to ensure 0…1) • is the same for all possible Bs (and is hard to gather anyway) • so we can drop it • So Bayesian reasoning: B* = argmax_B P(B|A) = argmax_B P(A|B) × P(B) • In NLP: most likely analysis = argmax over analyses of P(observations | analysis) × P(analysis) (see the sketch below)
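A minimal sketch (Python) of this argmax reasoning: each candidate B is scored by P(A|B) × P(B) and the shared normalizer P(A) is dropped. The categories and probabilities are invented for illustration:

```python
# Hypothetical categories B with priors P(B) and likelihoods P(A|B)
# for some observation A; the numbers are made up for illustration.
candidates = {
    "sports":   {"prior": 0.30, "likelihood": 0.020},
    "politics": {"prior": 0.50, "likelihood": 0.001},
    "weather":  {"prior": 0.20, "likelihood": 0.010},
}

# Score each B by P(A|B) * P(B); P(A) is the same for all B, so we drop it.
scores = {b: v["likelihood"] * v["prior"] for b, v in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])   # sports 0.006
```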
Application of Bayesian Reasoning • Diagnostic systems: • P(Disease | Symptoms) • Categorization: • P(Category of object| Features of object) • Text classification: P(sports-news | words in text) • Character recognition: P(character | bitmap) • Speech recognition: P(words | signals) • Image processing: P(face-person | image) • …
Random Variables • A random variable X is a function • X: Ω --> R^n (typically n = 1) • Example – tossing 2 dice • Ω = {(1,1), (1,2), (1,3), … (6,6)} • X : Ω --> Rx assigns to each point in Ω the sum of the 2 dice • X(1,1) = 2, X(1,2) = 3, … X(6,6) = 12 • Rx = {2,3,4,5,6,7,8,9,10,11,12} • A random variable X is discrete if: • X: Ω --> S where S is a countable subset of R • In particular, if X: Ω --> {0,1} • then X is called a Bernoulli trial. • A random variable X is continuous if: • X: Ω --> S where S is a continuum of numbers
Probability distribution of an RV • Let X be a finite random variable • Rx = {x1, x2, x3,… xn} • A probability mass function f gives the probability of X at different points in Rx • f(xk) = P(X=xk) = p(xk) • p(xk) ≥ 0 • Σk p(xk) = 1
Example: Tossing 2 dice • X = sum of the faces • X: Ω --> S • Ω = {(1,1), (1,2), (1,3), …, (6,6)} • S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12} • X = maximum of the faces • X: Ω --> S • Ω = {(1,1), (1,2), (1,3), …, (6,6)} • S = {1, 2, 3, 4, 5, 6}
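A small Python sketch (not from the slides) that builds the probability mass functions of both random variables by enumerating the 36 equally likely outcomes:

```python
from itertools import product
from collections import Counter
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))     # 36 equally likely outcomes

def pmf(rv):
    """Probability mass function of a random variable rv: outcome -> value."""
    counts = Counter(rv(o) for o in omega)
    return {x: Fraction(c, len(omega)) for x, c in sorted(counts.items())}

pmf_sum = pmf(lambda o: o[0] + o[1])   # X = sum of the faces
pmf_max = pmf(max)                     # X = maximum of the faces

print(pmf_sum[7])    # 1/6   (six of the 36 outcomes sum to 7)
print(pmf_max[6])    # 11/36 (eleven outcomes have maximum 6)
```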
Expectation • The expectation (μ) is the mean (or average or expected value) of a random variable X • E(X) = Σk xk p(xk) • Intuitively, it is: • the weighted average of the outcomes • where each outcome is weighted by its probability • ex: the average sum of the dice • If X and Y are 2 random variables on the same sample space, then: • E(X+Y) = E(X) + E(Y)
Example • The expectation of the sum of the faces on two dice? (the average sum of the dice) • If the sums were equiprobable… (2+3+4+5+…+12)/11 • But the sums are not equiprobable, so each sum must be weighted by its probability: E(SUM) = 2×1/36 + 3×2/36 + … + 12×1/36 = 7 • Or more simply: • E(SUM) = E(Die1+Die2) = E(Die1) + E(Die2) • Each face on 1 die is equiprobable E(Die) = (1+2+3+4+5+6)/6 = 3.5 E(SUM) = 3.5 + 3.5 = 7
Variance and standard deviation • The variance of a random variable X is a measure of whether the values of the RV tend to be consistent over trials or to vary a lot • Var(X) = σ² = E((X - E(X))²) = E(X²) - E(X)² • The standard deviation of X is the square root of the variance • Both measure the weighted “spread” of the values xi around the mean E(X)
Example • What is the variance of the sum of the faces on two dice? • For one die: E(Die²) = (1+4+9+16+25+36)/6 = 91/6, so Var(Die) = 91/6 - 3.5² = 35/12 ≈ 2.92 • The two dice are independent, so Var(SUM) = Var(Die1) + Var(Die2) = 35/6 ≈ 5.83 (see the numeric check below)
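A numeric check in Python (not from the slides), computing the mean and variance of the dice sum directly from the uniform distribution over the 36 outcomes:

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(omega))                      # each of the 36 outcomes is equally likely

# E(X) = sum over outcomes of X(o) * P(o), with X = sum of the faces
mean = sum((a + b) * p for a, b in omega)        # 7
# Var(X) = E(X^2) - E(X)^2
var = sum((a + b) ** 2 * p for a, b in omega) - mean ** 2
print(mean, var, float(var))                     # 7 35/6 5.833...
```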
Back to NLP • What is the probability that someone says the sentence: “Mary is reading a book.” • In general, for language events, the probability function P is unknown • We need to estimate P (or a model M of the language) by looking at a sample of data (training set) • 2 approaches: • Frequentist statistics • Bayesian statistics (which we will not cover)
Frequentist Statistics • To estimate P, we use the relative frequency of the outcome in a sample of data • i.e. the proportion of times a certain outcome o occurs: P(o) ≈ C(o)/N • Where C(o) is the number of times o occurs in N trials • For N --> ∞ the relative frequency stabilizes to some number: the estimate of the probability function • Two approaches to estimate the probability function: • Parametric (assuming a known distribution) • Non-parametric (distribution free)… which we will not cover • (see the sketch below)
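A minimal sketch (Python, with a made-up toy corpus) of estimating word probabilities by relative frequency:

```python
from collections import Counter

# Toy corpus (invented for illustration); in practice this would be a large training set
tokens = "the cat sat on the mat the dog sat".split()

counts = Counter(tokens)          # C(o) for each outcome o
N = len(tokens)                   # number of trials

# Relative-frequency estimate: P(o) ~ C(o) / N
p_hat = {word: c / N for word, c in counts.items()}
print(p_hat["the"])               # 3/9 = 0.333...
print(p_hat["dog"])               # 1/9 = 0.111...
```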
Parametric Methods • Assume that some phenomenon in language is modeled by a well-known family of distributions (ex. binomial, normal) • The advantages: • we have an explicit probabilistic model of the process by which the data was generated • determining a particular probability distribution within the family requires only the specification of a few parameters (so, less training data) • But: • Our assumption on the probability distribution may be wrong…
Non-Parametric Methods • No assumption is made about the underlying distribution of the data • For ex, we can simply estimate P empirically by counting a large number of random events • But: because we use less prior information (no assumption on the distribution), more training data is needed
Standard Distributions • Many applications give rise to the same basic form of a probability distribution - but with different parameters. • Discrete Distributions: • the binomial distribution (2 outcomes) • the multinomial distribution (more than 2 outcomes) • … • Continuous Distributions: • the normal distribution (Gaussian) • …
Binomial Distribution (discrete) • Counts the number of successes in a series of Bernoulli trials (the single-trial, two-outcome case is the Bernoulli distribution) • Each trial has only two outcomes (success or failure) • The probability of success is the same for each trial • The trials are independent • There are a fixed number of trials • Distribution has 2 parameters: • nb of trials n • probability of success p in 1 trial • Ex: Flipping a coin 10 times and counting the number of heads that occur • Can only get a head or a tail (2 outcomes) • For each flip there is the same chance of getting a head (same prob.) • The coin flips do not affect each other (independence) • There are 10 coin flips (n = 10)
Examples • b(n,p) = b(10, 0.1): nb trials = 10, Prob(head) = 0.1 • b(n,p) = b(10, 0.7): nb trials = 10, Prob(head) = 0.7 • (plots of the two binomial distributions)
Binomial probability function • let: • n = nb of trials • p = probability of success in any trial • r = nb of successes out of the n trials • P(X = r) = b(r; n, p) = C(n,r) × p^r × (1-p)^(n-r) • C(n,r) = n!/(r!(n-r)!) is the number of ways of having r successes in n trials • (1-p)^(n-r) is the probability of having n-r failures • p^r is the probability of having r successes
Example • What is the probability of rolling higher than 4 in exactly 2 of 3 dice rolls? • n trials = 3 • p = probability of success in 1 trial = P(roll > 4) = 2/6 = 1/3 • r successes = 2 • P(X = 2) = C(3,2) × (1/3)² × (2/3)¹ = 3 × 1/9 × 2/3 = 2/9 ≈ 0.22 (see the sketch below)
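A small Python sketch (not part of the slides) of the binomial probability function, checked against this example:

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for a binomial with n independent trials and success probability p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Rolling higher than 4 (a 5 or a 6) has probability p = 2/6 = 1/3 per roll
print(binomial_pmf(2, 3, 1/3))          # 0.2222... = 2/9

# Full distribution for n=10 coin flips with p=0.7 (the second plot on the Examples slide)
print([round(binomial_pmf(r, 10, 0.7), 3) for r in range(11)])
```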
Properties of binomial distribution • B(n,p) • Mean E(X) = μ = np • Ex: • Flipping a coin 10 times • E(head) = 10 × ½ = 5 • Variance σ² = np(1-p) • Ex: • Flipping a coin 10 times • σ² = 10 × ½ × ½ = 2.5
Binomial distribution in NLP • Works well for tossing a coin • But, in NLP we do not always have complete independence from one trial to the next • Consecutive sentences are not independent • Consecutive POS tags are not independent • So, binomial distribution in NLP is an approximation (but a fair one) • When we count how many times something is present or absent • And we ignore the possibility of dependencies between one trial and the next • Then, we implicitly use the binomial distribution • Ex: • Count how many sentences contain the word “the” • Assume each sentence is independent • Count how many times a verb is used as transitive • Assume each occurrence of the verb is independent of the others…
Normal Distribution (continuous) • Also known as Gaussian distribution (or Bell curve) • to model a random variable X on an infinite sample space • (ex. height, length…) • X is a continuous random variable if there is a function f(x) defined on the real line R = (-∞, +∞) such that: • f is non-negative f(x) ≥ 0 • The area under the curve of f is one • The probability that X lies in the interval [a,b] is equal to the area under f between x=a and x=b
Normal Distribution (con’t) • has 2 parameters: • mean μ • standard deviation σ • density: f(x) = 1/(σ√(2π)) × exp(-(x-μ)²/(2σ²)) • (plots: n(μ,σ) = n(0,1) with μ=0, σ=1; n(μ,σ) = n(1.5,2) with μ=1.5, σ=2)
The standard normal distribution • if μ = 0 and σ = 1, then it is called the standard normal distribution Z • any normal X can be standardized: Z = (X - μ)/σ (see the sketch below)
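A minimal Python sketch (not from the slides) that computes the probability that a normal variable lies in an interval [a,b], using the standard normal CDF Φ and the standardization z = (x − μ)/σ; the interval and parameters are chosen arbitrarily for illustration:

```python
from math import erf, sqrt, exp, pi

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of n(mu, sigma) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ n(mu, sigma), via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(a <= X <= b) = area under f between a and b, here for n(1.5, 2) on [0, 3]
mu, sigma = 1.5, 2.0
print(normal_cdf(3, mu, sigma) - normal_cdf(0, mu, sigma))   # ~0.547
print(normal_pdf(0))                                         # ~0.3989 (peak of n(0,1))
```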
Frequentist vs Bayesian Statistics • Assume we toss a coin 10 times, and get 8 heads: • Frequentists will conclude (from the observations) that a head comes up with probability 8/10 -- the Maximum Likelihood Estimate (MLE) • if we look at the coin, we would be reluctant to accept 8/10… because we have prior beliefs • Bayesian statisticians will use an a-priori probability distribution (their belief) • will update the beliefs when new evidence comes in (a sequence of observations) • by calculating the Maximum A Posteriori (MAP) distribution • The MAP probability becomes the new prior probability and the process repeats on each new observation (see the sketch below)
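A hedged sketch (Python) of the difference, assuming a Beta(α, β) prior on the head probability — a standard choice, though the slide does not specify one. The MLE ignores the prior; the MAP estimate combines the prior's pseudo-counts with the observed counts:

```python
# 10 tosses, 8 heads (from the slide)
heads, tails = 8, 2

# Frequentist: maximum likelihood estimate
mle = heads / (heads + tails)                      # 0.8

# Bayesian: assume a Beta(alpha, beta) prior expressing a belief that the coin is fair.
# (The prior is our assumption; the slide does not specify one.)
alpha, beta = 10, 10                               # 10 "imaginary" heads and 10 tails
map_estimate = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(mle)            # 0.8
print(map_estimate)   # 17/28 ~ 0.607, pulled back toward 0.5 by the prior
```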
Essential Information Theory • Developed by Shannon in the 40s • To maximize the amount of information that can be transmitted over an imperfect communication channel (the noisy channel) • Notion of entropy (informational content): • How informative is a piece of information? • ex. How informative is the answer to a question • If you already have a good guess about the answer, the actual answer is less informative… low entropy
Entropy - intuition • Ex: Betting 1$ on the flip of a coin • If the coin is fair: • Expected gain is ½ (+1) + ½ (-1) = 0$ • So you’d be willing to pay up to 1$ for advance information (1$ - 0$ average win) • If the coin is rigged • P(head) = 0.99 • P(tail) = 0.01 • assuming you bet on head (!) • Expected gain is 0.99(+1) + 0.01(-1) = 0.98$ • So you’d be willing to pay up to 2¢ for advance information (1$ - 0.98$ average win) • Advance information is worth more for the fair coin (1$) than for the rigged coin (0.02$): the fair coin has higher entropy
Entropy • Let X be a discrete RV • Entropy (or self-information): H(X) = - Σx p(x) log2 p(x) • measures the amount of information in a RV • average uncertainty of a RV • the average length of the message needed to transmit an outcome xi of that variable • the size of the search space consisting of the possible values of a RV and its associated probabilities • measured in bits • Properties: • H(X) ≥ 0 • If H(X) = 0 then the RV provides no new information
Example: The coin flip • Fair coin: H(X) = -½ log2 ½ - ½ log2 ½ = 1 bit • Rigged coin (P(head) = 0.99): H(X) = -0.99 log2 0.99 - 0.01 log2 0.01 ≈ 0.08 bits • (plot: entropy as a function of P(head), peaking at 1 bit when P(head) = 0.5)
Example: Simplified Polynesian • In simplified Polynesian, we have 6 letters with frequencies: • P(p) = 1/8, P(t) = 1/4, P(k) = 1/8, P(a) = 1/4, P(i) = 1/8, P(u) = 1/8 • The per-letter entropy is H(P) = - Σ P(letter) log2 P(letter) = 4 × (1/8 × 3) + 2 × (1/4 × 2) = 2.5 bits • We can design a code that on average takes 2.5 bits to transmit a letter • Can be viewed as the average nb of yes/no questions you need to ask to identify the outcome (ex: is it a ‘t’? Is it a ‘p’?) (see the sketch below)
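A short Python sketch (not in the slides) of the entropy formula, applied to the fair coin, the rigged coin, and the simplified-Polynesian letter distribution given above:

```python
from math import log2

def entropy(probs):
    """H(X) = -sum p(x) * log2 p(x), in bits (terms with p = 0 contribute 0)."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                    # 1.0 bit   (fair coin)
print(entropy([0.99, 0.01]))                  # ~0.081 bits (rigged coin)

# Letter distribution of simplified Polynesian (p, t, k, a, i, u)
polynesian = [1/8, 1/4, 1/8, 1/4, 1/8, 1/8]
print(entropy(polynesian))                    # 2.5 bits
```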
Entropy in NLP • Entropy is a measure of uncertainty • The more we know about something, the lower its entropy • So if a language model captures more of the structure of the language, then its entropy should be lower • in NLP, language models are compared by using their entropy • ex: given 2 grammars and a corpus, we use entropy to determine which grammar better matches the corpus.