Corpora and Statistical Methods Albert Gatt
Part 2 Probability distributions
Example 1: Book publishing • Case: • publishing house considers whether to publish a new textbook on statistical NLP • considerations include: production cost, expected sales, net profits (given cost) • Problem: • to publish or not to publish? • depends on expected sales and profits • if published, how many copies? • depends on demand and cost
Example 1: Demand & cost figures • Suppose: • book costs €35, of which: • publisher gets €25 • bookstore gets €6 • author gets €4 • To make a decision, publisher needs to estimate profits as a function of the probability of selling n books, for different values of n. • profit = (€25 * n) – overall production cost
Terminology • Random variable • In this example, the expected profit from selling n books is our random variable • It takes on different values, depending on n • We use uppercase (e.g. X) to denote the random variable • Distribution • The different values of X (denoted x) form a distribution. • If each value x can be assigned a probability (the probability of making a given profit), then we can plot each value x against its probability.
Definitions • Random variable • A variable whose numerical value is determined by chance. Formally, a function that returns a unique numerical value determined by the outcome of an uncertain situation. • Can be discrete (our exclusive focus) or continuous • Probability distribution • For a discrete random variable X, the probability distribution p(x) gives the probabilities for each value x of X. • The probabilities p(x) of all possible values of X sum to 1. • The distribution tells us how much out of the overall probability space (the “probability mass”), each value of x takes up.
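To make the definition concrete, here is a minimal Python sketch (the values and probabilities are made up) of a discrete distribution represented as a mapping from values of X to p(x), together with a check that the probabilities sum to 1:

```python
# A discrete probability distribution as a mapping from each value x of X
# to its probability p(x). The values and probabilities here are made up.
p = {0: 0.5, 1: 0.3, 2: 0.2}

# The probabilities of all possible values of X must sum to 1.
assert abs(sum(p.values()) - 1.0) < 1e-9
```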
Uses of a probability distribution • Computation of: • mean: the expected value of X in the long run • based on the specific values of X and their probabilities • NB: NOT interpreted as the value observed in a sample of data, but as the expected (future) value based on the sample • standard deviation & variance: the extent to which actual values of X will differ from the mean • skewness: the extent to which our distribution is “balanced”, i.e. whether it’s symmetrical
In graphics… • Mean: expected value in the long run • SD & variance: how much actual values deviate from the mean overall • Skewness: symmetry or “tail” of our distribution
The expected value (mean) • The expected value of a discrete random variable X, denoted E[X] or μ, is a weighted average of the values of X • weighted, because not all values x will have the same probability • estimated by summing, for all values of X, the product of x and its probability p(x): E[X] = μ = Σ x p(x)
More on expected value • The mean or expected value tells us that, in the long run, we can expect X to have the value μ. • E.g. in our example, our book publisher can expect long-term profits of: (-150,000 * .2) + (-50,000 * .4) + (150,000 * .25) + (350,000 * .1) + (550,000 * .05) = €50,000
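As an illustration, the following Python sketch reproduces this calculation, taking the profit figures and probabilities from the example above (E[X] = Σ x p(x)):

```python
# Possible profits (in euros) and their probabilities, from the example above.
profits = {-150_000: 0.20, -50_000: 0.40, 150_000: 0.25, 350_000: 0.10, 550_000: 0.05}

# Expected value: E[X] = sum over all x of x * p(x)
expected_profit = sum(x * px for x, px in profits.items())
print(expected_profit)  # 50000.0 -- the publisher's long-run expected profit
```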
Variance • Mean is the expected value of X, E[X] • Variance (σ²) reflects the extent to which the actual outcomes deviate from expectation (i.e. from E[X]) • σ² = E[(X – μ)²] = Σ (x – μ)² p(x) • i.e. the weighted sum of squared deviations • Reasons for squaring: • eliminates the distinction between +ve and –ve deviations • gives larger deviations disproportionately more weight • e.g. one deviation of 10 counts as much as 4 deviations of 5
Standard deviation • Variance gives the overall dispersion or variation, but in squared units • Standard deviation (σ) expresses the dispersion of possible outcomes in the original units of X; it indicates how spread out the distribution is • estimated as the square root of the variance: σ = √σ²
The book publishing example again • Recall that for our new book on stat NLP, expected profit is €50,000 • What’s the standard deviation? • need to estimate (x – 50,000)² for all x • multiply by p(x) in each case and sum • take the square root of the result • This is left as an exercise…
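For reference, here is a sketch of the computation the exercise describes, reusing the profit distribution from the expected-value slide (the figures are those given earlier):

```python
from math import sqrt

profits = {-150_000: 0.20, -50_000: 0.40, 150_000: 0.25, 350_000: 0.10, 550_000: 0.05}
mu = sum(x * px for x, px in profits.items())  # expected profit: 50000.0

# Variance: sigma^2 = sum over x of (x - mu)^2 * p(x); SD is its square root.
variance = sum((x - mu) ** 2 * px for x, px in profits.items())
sd = sqrt(variance)
print(variance, sd)  # sd comes out at roughly 189,737 euros
```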
Skewness • The mean gives us the “centre” of a distribution. • Standard deviation gives us dispersion. • Skewness (denoted γ “gamma”) is a measure of the symmetry of the outcomes.
Skewness, continued • The formula divides the average cubed deviation by the standard deviation cubed: γ = E[(X – μ)³] / σ³ • Why cubed? • The cube of a positive deviation is positive; the cube of a negative deviation is negative. We want both, since we care about deviations to the left (–ve) and to the right (+ve) of the mean. • Like the variance estimation, this emphasises large deviations in either direction. • If the outcomes are symmetrical around the mean, then +ve and –ve deviations balance out, and skewness is 0.
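A minimal sketch of the skewness formula γ = E[(X – μ)³] / σ³, applied to a made-up symmetric distribution (so the result should be 0):

```python
from math import sqrt

def skewness(dist):
    """Skewness of a discrete distribution given as {value: probability}."""
    mu = sum(x * px for x, px in dist.items())
    sigma = sqrt(sum((x - mu) ** 2 * px for x, px in dist.items()))
    # gamma = E[(X - mu)^3] / sigma^3
    return sum((x - mu) ** 3 * px for x, px in dist.items()) / sigma ** 3

# A symmetric (made-up) distribution: +ve and -ve deviations balance out.
print(skewness({-1: 0.25, 0: 0.5, 1: 0.25}))  # 0.0
```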
Graphical display of skewness • Positive skewness: tail going right • Negative skewness: tail going left
Skewness and language • By Zipf’s law (next week), word frequencies do not cluster around the mean. • There are a few highly frequent words (making up a large proportion of overall word frequency) • There are many highly infrequent words (f = 1 or f = 2) • So the Zipfian distribution is highly skewed. • We will hear more on the Zipfian distribution in the next lecture.
What is information? • Main ingredient: • an information source, which “transmits” symbols from a finite alphabet S • every symbol is denoted si • we call a sequence of such symbols a text • assume a probability distribution s.t. every si has probability p(si) • Example: • a die is an information source; every throw yields a symbol from the alphabet {1,2,3,4,5,6} • 6 successive throws yield a text of 6 symbols
Quantifying information • Intuition: • the more probable a symbol is, the less information it yields • “something seen very often is not very surprising” • So information is the (log) inverse probability of the symbol: I(s) = logb (1/p(s)) = –logb p(s), for some base b > 1. Usually we use base 2 (bits) • Another term for I(s) is surprisal
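A one-function Python sketch of surprisal (base 2, as on the slide):

```python
from math import log2

def surprisal(p):
    """I(s) = -log2 p(s): the information, in bits, of a symbol with probability p."""
    return -log2(p)  # equivalently log2(1 / p)

print(surprisal(1.0))    # 0.0 -- a certain symbol carries no information
print(surprisal(0.5))    # 1.0 bit
print(surprisal(1 / 8))  # 3.0 bits -- one face of a fair 8-sided die
```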
Properties of I • Non-negative: I(s) ≥ 0 • If p(s) = 1, I(s) = 0 • If 2 events s1, s2 are independent, then: I(s1s2) = I(s1) + I(s2) • Monotonic: as p(s) increases, I(s) decreases; slight changes in probability result in slight changes in I
Aggregate measure of information • What is the information content of a text (sequence of symbols)? • this is the same as finding the average information of a random variable • the measure is called Entropy, denoted H • Define X as a random variable over the symbols in our alphabet: P(s) = P(X=s) for all s in our alphabet • Estimate H(P)
Entropy • The entropy (or information) of a probability distribution is: H(P) = –Σs p(s) log2 p(s) • entropy is the expected value (mean) of the surprisal: H(P) = Σs p(s) I(s) • the value is interpreted as the number of “bits” of information
Entropy example • Source = an 8-sided die • Alphabet S = {1,2,3,4,5,6,7,8} • every si has p = 1/8 • H(P) = –Σs (1/8) log2 (1/8) = log2 8 = 3 bits
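The same computation as a short Python sketch, confirming the 3-bit result for the fair 8-sided die:

```python
from math import log2

def entropy(dist):
    """H(P) = -sum over s of p(s) * log2 p(s), in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Fair 8-sided die: every face has probability 1/8.
die = {face: 1 / 8 for face in range(1, 9)}
print(entropy(die))  # 3.0
```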
Interpretation of entropy • The information contained in the distribution P (the more unpredictable the outcomes, the higher the entropy) • The average message length if messages were generated according to P and coded optimally
Interpretation cont/d • For the 8-sided die example, the result H(P)=3 tells us we need 3 bits on average to “transmit” the result of rolling an 8-sided die • We can’t do it in less than 3 bits
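One way to see this is to write down a fixed-length 3-bit code for the eight faces (the particular codewords below are an illustrative choice, not something fixed by the theory) and check that its average length matches H(P):

```python
# Assign each face a distinct 3-bit codeword: 1 -> 000, 2 -> 001, ..., 8 -> 111.
code = {face: format(face - 1, "03b") for face in range(1, 9)}

# Average code length = sum over faces of p(face) * length of its codeword.
avg_len = sum((1 / 8) * len(cw) for cw in code.values())
print(avg_len)  # 3.0 bits, matching H(P)
```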
Entropy for multiple variables • So far we have dealt with a single random variable • The joint entropy of a pair of RVs: H(X,Y) = –Σx Σy p(x,y) log2 p(x,y)
Conditional Entropy • Given X and Y, how much information about Y do we gain if we know X? • a version of entropy using conditional probability: H(Y|X) = –Σx Σy p(x,y) log2 p(y|x) = H(X,Y) – H(X)
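A sketch of both quantities in Python, assuming the joint distribution is represented as a dict keyed by (x, y) pairs (the example distribution is made up):

```python
from math import log2
from collections import defaultdict

def joint_entropy(pxy):
    """H(X,Y) = -sum over (x,y) of p(x,y) * log2 p(x,y)."""
    return -sum(p * log2(p) for p in pxy.values() if p > 0)

def conditional_entropy(pxy):
    """H(Y|X) = -sum over (x,y) of p(x,y) * log2 p(y|x), with p(y|x) = p(x,y) / p(x)."""
    px = defaultdict(float)
    for (x, _), p in pxy.items():
        px[x] += p
    return -sum(p * log2(p / px[x]) for (x, _), p in pxy.items() if p > 0)

# A small made-up joint distribution over pairs (x, y).
pxy = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.3}
print(joint_entropy(pxy), conditional_entropy(pxy))
```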
Mutual information • Just as probability can change based on posterior knowledge, so can information. • Suppose our distribution gives us the probability P(a) of observing the symbol a. • Suppose we first observe the symbol b. • If a and b are not independent, this should alter our information state with respect to the probability of observing a. • i.e. we can compute p(a|b)
Mutual info between two symbols • The change in our information about a on observing b is: I(a;b) = log2 [ p(a|b) / p(a) ] = log2 [ p(a,b) / (p(a) p(b)) ] • If a and b are completely independent, I(a;b) = 0.
Averaging mutual information • We want to average mutual information between all values of a random variable A and those of a random variable B: I(A;B) = Σa Σb p(a,b) log2 [ p(a|b) / p(a) ] • And similarly: I(A;B) = Σa Σb p(a,b) log2 [ p(b|a) / p(b) ]
Combining the two… I(A;B) = Σa Σb p(a,b) log2 [ p(a,b) / (p(a) p(b)) ] • Thus, mutual info involves taking the joint probability and dividing by the individual probabilities • I.e. a comparison of the likelihood of observing a and b together vs. separately.
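Putting the pieces together, a sketch of average mutual information computed directly from a joint distribution (again keyed by (a, b) pairs; the distributions are made up):

```python
from math import log2
from collections import defaultdict

def mutual_information(pab):
    """I(A;B) = sum over (a,b) of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in pab.items():
        pa[a] += p
        pb[b] += p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in pab.items() if p > 0)

# If A and B are independent, the joint equals the product of the marginals,
# so every log term is log2(1) = 0 and I(A;B) = 0.
independent = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.25, ("b", 1): 0.25}
print(mutual_information(independent))  # 0.0
```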
Mutual Information: summary • Gives a measure of reduction in uncertainty about a random variable X, given knowledge of Y • quantifies how much information about X is contained in Y
Some more on I(X;Y) • In statistical NLP, we often calculate pointwise mutual information • this is the mutual information between two particular outcomes (points), I(x;y), rather than between whole random variables, I(X;Y) • used for some applications in lexical acquisition
Mutual Information -- example • Suppose we’re interested in the collocational strength of two words x and y • e.g. bread and butter • mutual information quantifies how much more likely we are to observe x and y together (in some window) than we would expect by chance • If there is no interesting relationship, knowing about bread tells us nothing about the likelihood of encountering butter • In that case, P(x,y) = P(x)P(y) and I(x;y) = 0 • This is the Church and Hanks (1990) approach. • NB. The approach uses pointwise MI
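A sketch of pointwise MI estimated from corpus counts, in the spirit of the collocation example (the counts below are invented purely for illustration; in practice they would come from counting words and word pairs in a corpus):

```python
from math import log2

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information from corpus counts:
    I(x;y) = log2( P(x,y) / (P(x) * P(y)) ),
    with probabilities estimated as relative frequencies over n observations."""
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return log2(p_xy / (p_x * p_y))

# Invented counts: the pair co-occurs far more often than chance predicts,
# so the PMI is well above 0 (a strong collocation like "bread"/"butter").
print(pmi(count_xy=30, count_x=200, count_y=150, n=1_000_000))  # about 9.97
```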