Probability for linguists John Goldsmith
What is probability theory? • It is the quantitative theory of evidence. • It derives, historically, from the study of games of chance – gambling – and of the insurance business. • To this day, the basic examples of probability are often presented in terms that grow out of gambling: toss of a die, the draw of a card. • Alas.
Knowledge of language? • Knowledge of language is the long-term knowledge that speakers bring to bear in the analysis, and the generation, of specific utterances in particular contexts. • Long-term knowledge? • Understanding a specific utterance is a task that combines knowledge of the sounds that were uttered and long-term knowledge of the language in question.
But before we can get to that… • The basic game plan of all of probability theory is to • define a universe (“sample space” or “universe of possible outcomes”) • define a method of dividing up “probability mass” (whose total weight is 1.0) over all of the sample space. • That’s it!
Again: For this discrete case, probability theory obliges us to: • Establish the universe of outcomes that we are interested in; • Establish a means for distributing probability mass over the outcomes, in such a way that the total amount of probability distributed sums to 1.0 – no more, and no less.
The role of a model • In every interesting case, the probability assigned to the basic members of the sample space is calculated according to some model that we devise – and therein lies our contribution. • Let’s turn to some examples.
Letters • Let our sample space be the 26 letters of the Latin alphabet. • To make a probabilistic model, we need an assignment of probability to each letter a…z: prob(a) ≥ 0, …, prob(z) ≥ 0; • and the sum of all of these probabilities is 1: prob(a) + prob(b) + … + prob(z) = 1.
Why? • Probabilities are our eventual contact with reality, and with observations. • Probabilities are our model’s way of making quantitative predictions. • We will (eventually) evaluate our model by seeing how good its probabilities are.
Counts, frequencies, and probabilities • Counts are integral – at least they are for now; eventually we drop that assumption. For now, they are the result of empirical measurement. • Frequency (relative or proportional) is the ratio of a count of a subset to the count of the whole universe. • Probability is a parameter in a probabilistic model.
Distribution • A distribution is a set of non-negative numbers which sum to 1.
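A minimal sketch of this definition in Python (not from the original slides): a distribution is just a mapping from outcomes to non-negative numbers that sum to 1. The letter example and the tolerance value are illustrative choices.

```python
import string

# A distribution over the 26 letters a..z: here, the uniform one.
prob = {letter: 1 / 26 for letter in string.ascii_lowercase}

def is_distribution(p, tol=1e-9):
    """True if every value is non-negative and the values sum to 1."""
    return all(v >= 0 for v in p.values()) and abs(sum(p.values()) - 1.0) < tol

print(is_distribution(prob))  # True
```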
Brief digression: some niceties • The sample space Ω need not be finite or discrete, and we actually care about a large set F of subsets of Ω. The sets in F are called events, and we require that F be a σ-algebra (so F is closed under complementation and under countable union and intersection). A probability measure P on (Ω, F) is a function P such that P(Ω) = 1, P(∅) = 0, and the probability of a countable union of pairwise disjoint members of F is the sum of their probabilities: P(A₁ ∪ A₂ ∪ …) = P(A₁) + P(A₂) + … whenever the Aᵢ are pairwise disjoint events.
Uniform distribution • A uniform distribution is one in which all elements in the sample space have the same probability. • If there are 26 letters in our sample space, each letter is assigned probability 1/26 by the uniform distribution.
Uniform distribution (2) • If the sample space is rolls of a die, and the distribution is uniform, then each die-roll has probability equal to 1/6. • The sample space can be complex – e.g., sequences of 3 rolls of a die. In that case, there are 6^3 = 216 members, and each has probability 1/216 if the distribution is uniform. • If the sample space is N rolls of a die, then there are 6^N members, each with probability 1/6^N.
More examples • If the sample space is all sequences of 1, 2 or 3 rolls of a die, then the size of the sample space is 6 + 6^2 + 6^3 = 258, and if the distribution is uniform, each sequence has probability 1/258.
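A quick check of that arithmetic (a sketch, not part of the original slides; the variable names are mine):

```python
die_faces = 6

# Exactly 3 rolls: 6^3 = 216 equally likely outcomes.
print(die_faces ** 3, 1 / die_faces ** 3)          # 216  0.00463...

# All sequences of 1, 2, or 3 rolls: 6 + 36 + 216 = 258 outcomes.
size = sum(die_faces ** n for n in (1, 2, 3))
print(size, 1 / size)                              # 258  0.003875...
```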
Words in the sample space • Suppose our sample space has 1,000 elements, and each is associated with one of the 1,000 most frequent words in the Brown corpus; each is also associated with a number (between 1 and 1,000) giving that word's frequency rank.
Probability distribution • We could associate a uniform distribution to that sample space; or we could assign a probability to each word based on its frequency in a corpus. • Let N = total number of words in the corpus (1,160,322 in the Brown corpus). • frequency of word_i = count(word_i) / N, where count(word_i) is the number of times word_i occurs in the corpus.
E.g., frequency (“the”) = 69,681 / 1,160,322 = .0600… Is it clear that the sum of all these frequencies is necessarily 1.0?
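A sketch of this computation in Python (the toy token list is mine, standing in for the Brown corpus). Because every token is counted exactly once, the frequencies necessarily sum to 1.0:

```python
from collections import Counter

# A stand-in token list; the real computation would run over the Brown corpus.
corpus = ["the", "woman", "arrived", "at", "the", "house"]

counts = Counter(corpus)
N = len(corpus)                                   # total number of word tokens
frequency = {w: c / N for w, c in counts.items()}

print(frequency["the"])                           # 2/6 = 0.333... in this toy corpus
print(abs(sum(frequency.values()) - 1.0) < 1e-9)  # True: frequencies sum to 1
```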
Sequences When we think about sequences, we see that the probability assigned to each element (=each sequence) is assigned through a model, not directly. In the simplest case, the probability of a sequence is the product of the probabilities of the individual elements.
Random variable (r.v.) • A random variable is a function from our probability space (the sample space, endowed with a probability distribution) to the real numbers. • Example: Our sample space has 1,000 elements, and each is associated with one of the 1,000 most frequent words in the Brown corpus; our r.v. R maps each word to its frequency rank (a number between 1 and 1,000).
Sample space: words • Suppose we replace our usual die with one with 1,000 faces, and each face has a number (1-1,000) and one of the 1,000 most frequent words from the Brown corpus on it (1=the, 2=of, etc.). Now we choose numbers at random: for example, 320 990 646 94 756 • which translates into: whether designed passed must southern. • For the moment, assume a uniform distribution…
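A small sketch of that word die in Python (not from the slides). Only the few rank-to-word pairs mentioned above are filled in; the rest of the hypothetical table is elided:

```python
import random

# A few of the 1,000 faces: rank -> word (taken from the example above).
rank_to_word = {1: "the", 2: "of", 94: "must", 320: "whether",
                646: "passed", 756: "southern", 990: "designed"}

# Under the uniform distribution, every rank 1..1000 is equally likely.
rolls = [random.randint(1, 1000) for _ in range(5)]
print(rolls)
print([rank_to_word.get(r, "<word of rank %d>" % r) for r in rolls])
```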
We’re more interested in the probability our model assigns to sentences that already exist. For example, (1) In the beginning was the word. Since all six of these words are among the 1,000 most frequent words of the Brown corpus, each has probability 1/1,000, and the probability of (1) is (1/1,000)^6 = 10^-18.
Any sequence of 6 words in the top 1,000 words from the Brown Corpus will have this probability. Is this good, or bad? • We could increase our vocabulary to the entire 47,885 words of the Brown Corpus. Then sentence (1) would have probability (1/47,885)^6 ≈ 8.3 × 10^-29.
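A quick check of both figures (a sketch, using the vocabulary sizes quoted above):

```python
n_words = 6

p_top_1000 = (1 / 1_000) ** n_words      # 1e-18: 1 in 10^18
p_full_vocab = (1 / 47_885) ** n_words   # about 8.3e-29

print(p_top_1000, p_full_vocab)
```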
Or: use frequency as our probability • pr(“the”) = 0.0600, as above • pr(“leaders”) = its relative frequency in the corpus (“leaders” is the word ranked 1,000). Be clear on the difference between frequencies and probabilities.
What is the probability of the sentence “The woman arrived”? • What is the sample space? Strings of 3 words from the Brown Corpus. • Probability of each word is its frequency in the Brown Corpus: pr(“the”) = 0.060 080; pr(“woman”) = 0.000 212 030; pr(“arrived”) = 0.000 053 482.
Some notation • S = “the woman arrived” • S[1] = “the”, S[2] = “woman”, S[3] = “arrived” • pr (S) = pr (S[1] = “the” & S[2] = “woman” & S[3] = “arrived”)
Stationary model • For all sentences S, all words w, and all positions i and j: prob(S[i] = w) = prob(S[j] = w). • Is this true?
Back to the sentence • Pr(“the woman arrived”) = 6.008 020 × 10^-2 × 2.122 030 × 10^-4 × 5.348 207 × 10^-5 = 6.818 × 10^-10
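A sketch of that product in Python, using the Brown-corpus frequencies quoted above (the function name is mine):

```python
# Unigram model: the probability of a sentence is the product of the
# frequencies of its individual words.
frequency = {"the": 6.008020e-2, "woman": 2.122030e-4, "arrived": 5.348207e-5}

def sentence_probability(words, freq):
    p = 1.0
    for w in words:
        p *= freq[w]
    return p

print(sentence_probability(["the", "woman", "arrived"], frequency))  # about 6.8e-10
```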
“in the beginning was the word” • We calculated that this sentence’s probability was 1 in 10^18 in the sample space of all sentences of exactly 6 words, selected from the 1,000 most frequent words in the Brown Corpus. • About 8.3 × 10^-29 in the sample space of all six-word sentences built from the full Brown vocabulary. • And in the frequency-based distribution: about 1.84 × 10^-14.
We prefer a model that scores better, by assigning a higher probability, to sentences that actually occur: we prefer that model to any other model that assigns a lower probability to the actual corpus.
Some win, some lose • In order for a model to assign higher probability to the sentences we actually observe, it must assign less probability mass to other sentences (the total probability mass at its disposal sums to 1.000, and no more). So of course it assigns lower probability to a great many unobserved strings.
Word order? • Alas, word order counts for nothing, so far.
Probability mass • It is sometimes helpful to think of a distribution as a way of sharing an abstract goo called probability mass around all of the members of the universe of basic outcomes (also known as the sample space). Think of there being 1 kilogram of goo, and it is cut up and assigned to the various members of the universe. None can have more than 1.0 kg, and none can have a negative amount, and the total amount must add up to 1.0 kg. And we can modify the model by moving probability mass from one outcome to another if we so choose.
Conditional probability • Sometimes we want to shift the universe of discussion to a more restricted sub-universe – this is always a case of having additional information, or at least of acting as if we had additional information. • Universe = sample space.
We look at our universe of outcomes, with its probability mass spread out over the set of outcomes, and we say, let us consider only a sub-universe, and ignore all possibilities outside of that sub-universe. We then must ask: how do we have to change the probabilities inside that sub-universe so as to ensure that the probabilities inside it add up to 1.0 (to make it a distribution)? • Answer: Divide probabilities by total amount of probability mass in the sub-universe.
How do we get the new information? • Peek and see the color of a card. • Maybe we know the word will be a noun. • Maybe we know what the last word was. Generally: we have to pick an outcome, and we have some case-particular information that bears on the choice.
Cards: pr(Queen of Hearts | red) • Pr(Queen of Hearts) = 1/52; • Pr(red card) = 1/2; • Pr(Queen of Hearts, given red) = (1/52) / (1/2) = 1/26.
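The same renormalization, written out as a small Python sketch (not from the slides):

```python
from fractions import Fraction

# Uniform distribution over a 52-card deck; the sub-universe is "red card".
p_queen_of_hearts = Fraction(1, 52)   # the Queen of Hearts is itself red
p_red = Fraction(26, 52)              # 26 of the 52 cards are red

# Conditioning: divide by the probability mass of the sub-universe.
p_qh_given_red = p_queen_of_hearts / p_red
print(p_qh_given_red)                 # 1/26
```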
Guessing a word, given knowledge of the previous word: • pr(S[i] = w_j given that S[i-1] = w_k), which is usually written in this way: • pr(S[i] = w_j | S[i-1] = w_k) • Pr(“the”) is high, but not if the preceding word is “a”.
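A sketch of how such conditional probabilities are estimated from bigram counts (the toy token list and function name are mine):

```python
from collections import Counter

# A stand-in token list; the real estimate would use a large corpus.
tokens = ["in", "the", "beginning", "was", "the", "word"]

bigram_counts = Counter(zip(tokens, tokens[1:]))
prev_counts = Counter(tokens[:-1])

def p_next(next_word, prev_word):
    """Relative frequency of next_word among the words that follow prev_word."""
    return bigram_counts[(prev_word, next_word)] / prev_counts[prev_word]

print(p_next("word", "the"))        # 0.5 in this toy corpus
print(p_next("beginning", "the"))   # 0.5
```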
Almost nothing about language or about English in particular has crept in. The fact that we have considered conditioning our probabilities of a word based on what word preceded is entirely arbitrary; we could just as well look at the conditional probability of words conditioned on what word follows, or even conditioned on what the word was two words to the left.
One of the most striking things is how few nouns, and how many adjectives, there are among the most frequent words here -- that's probably not what you would have guessed. None of them are very high in frequency; none place as high as 1 percent of the total.
Among the words that follow "of", one word accounts for over 25% of the cases: "the". So not all words are equally helpful as guides to what the next word will be.
If you know a word is "the", then the probability that the word-after-next is "of" is greater than 15% -- which is quite a bit.
Exercise: What do you think the probability distribution is for the 10th word after "the"? What are the two most likely words? Why?