Unsupervised learning of natural language morphology John Goldsmith University of Chicago May 2005
The goal is to learn about language structure. Controversy: how much structure is innate, hence need not be learned? Avoid that concern by looking at problems for which the innatist hypothesis is implausible: segmentation. How does evidence lead the learner to solutions of the segmentation problem?
Working with Yu Hu and Irina Matveeva (Department of Computer Science) and Colin Sprague (Department of Linguistics)
1. Word breaking: de Marcken. 2. Learning morphology: learning signatures. … Inadequacy of this approach for many rich languages. 3. The string-edit-distance (SED) heuristic: learning templates on the way to learning an FST (finite-state transducer).
How I got started on this… • Problem of word-breaking, tackled by Carl de Marcken: Unsupervised Language Acquisition, MIT thesis (CS), 1996
Broad outline • Goal: take a large unbroken corpus (no indication of where word boundaries are), find the best analysis of the corpus into words. • “Best”? Interpret the goal in the context of MDL (Minimum Description Length) theory
Description length
For each model, calculate:
• the length of the model, in bits; the model must be one that assigns probabilities to the relevant data (as an approximation, we take the shortest statement of the model and ignore longer implementations of the same idea);
• the log probability of the data under that model, taken as a positive log ("p-log"), in bits.
The description length of the data under the model is the sum of these two. Find the model whose description length is a minimum, given some data set.
If we assume that the relevant chunks of the corpus are just the individual letters, the model will be small, but the log prob of the data will be very large. If we assume that the relevant chunk of the corpus is the whole corpus, the model will be huge, and the log prob will be 0. We need to find a happy medium. • Conjecture: the happy medium settles on human words. [Any guesses where the problems will be?]
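To make the trade-off concrete, here is a minimal sketch (not de Marcken's encoding; the 5-bits-per-letter model cost is an assumption chosen purely for illustration). It compares the letter-by-letter analysis, the whole-corpus-as-one-chunk analysis, and a middle ground on a toy string.

```python
import math
from collections import Counter

def description_length(corpus, segmentation):
    """Toy description length in bits: model cost + data cost.

    `segmentation` is a list of chunks whose concatenation equals `corpus`.
    Model cost: an assumed 5 bits per letter to spell out each distinct chunk.
    Data cost: the p-log (negative log2 probability) of the chunk sequence
    under the chunks' own relative frequencies.
    """
    assert "".join(segmentation) == corpus
    counts = Counter(segmentation)
    total = sum(counts.values())
    model_bits = 5 * sum(len(chunk) for chunk in counts)
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return model_bits + data_bits

corpus = "therenttherent"
print(description_length(corpus, list(corpus)))         # letters only: small model, large p-log
print(description_length(corpus, [corpus]))             # one giant chunk: huge model, p-log = 0
print(description_length(corpus, ["the", "rent"] * 2))  # a happy medium wins
```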
Is it clear why having a chunk is so often better than not having one?
blah-blah-blah-t-h-blah-t-h-blah-t-h-blah
Log prob: X + 3 log prob(t) + 3 log prob(h)
blah-blah-blah-th-blah-th-blah-th-blah
Difference in log prob: add 3 log prob(th); subtract 3 log prob(t) + 3 log prob(h).
The log prob of the remaining t's and h's is greater (numerator effect), but the log prob of all other chunks is less (denominator effect): the denominator went from Z to Z-3, where Z is the total number of chunks in the given analysis of the corpus (remember that the chunking is a hidden, not an observed, variable).
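A quick numerical check of the two effects, with made-up counts rather than the slide's corpus: merging three t+h pairs into the chunk th shrinks Z from 17 to 14, and the total p-log of the corpus comes out smaller.

```python
import math

def corpus_plog(counts):
    """P-log (negative log2 probability) of a corpus given chunk counts,
    using maximum-likelihood probabilities count / Z."""
    Z = sum(counts.values())
    return -sum(c * math.log2(c / Z) for c in counts.values())

# Made-up counts: some 'blah' chunks plus five t's and five h's,
# three of which occur as adjacent t-h pairs.
before = {"blah": 7, "t": 5, "h": 5}                 # Z = 17
after  = {"blah": 7, "t": 2, "h": 2, "th": 3}        # Z = 14 after merging three t-h pairs

print(corpus_plog(before), corpus_plog(after))       # roughly 26.6 vs 24.9 bits
```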
de Marcken's approach
A lot depends on exactly what you assume the lexicon looks like. In de Marcken's approach, there is a Corpus, and there is a Lexicon.
We encode the corpus, so the encoded length is the sum of the lengths of the encodings of the (hypothesized) chunks. • There is (thus) assumed to be no syntax: sentences are just sequences of independent words. (Sorry.) • The lexicon provides a length for each word, and these lengths are (roughly) the p-logs of a probability distribution. • Thus the encoded length of the corpus is approximately its p-log probability under the model (= the lexicon).
Lexicon How large is the lexicon? • Encoding enters into most lexical items too. An item’s length is the sum of the lengths of the items that compose it.
de Marcken's description length: |u| + Σ |wᵢ|, where
• |u| = Σ p-log(wᵢ), summing over the tokens in the corpus;
• |wᵢ| is 0 for the atomic letters; and
• |wᵢ| = Σ |m|, summing over the chunks m that compose wᵢ, where |·| here is the p-log of that chunk.
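A sketch of this two-part length with assumed toy probabilities and compositions (none of these numbers come from a real corpus): atomic letters cost nothing in the lexicon, and a composed entry costs the sum of the p-logs of its parts.

```python
import math

# Assumed chunk probabilities (they sum to 1) and assumed compositions.
prob  = {"t": 0.2, "h": 0.2, "e": 0.3, "th": 0.2, "the": 0.1}
parts = {"th": ["t", "h"], "the": ["th", "e"]}       # the other entries are atomic letters

def plog(w):
    return -math.log2(prob[w])

def entry_length(w):
    """|w|: 0 bits for an atomic letter, else the sum of the p-logs of its parts."""
    return 0.0 if w not in parts else sum(plog(m) for m in parts[w])

def total_length(corpus_tokens):
    """de Marcken's |u| + sum over lexical entries w of |w|."""
    u_bits = sum(plog(w) for w in corpus_tokens)       # |u|: p-log of the corpus tokens
    lexicon_bits = sum(entry_length(w) for w in prob)  # sum of |w| over the lexicon
    return u_bits + lexicon_bits

print(total_length(["the", "the", "t", "h", "e"]))
```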
A quick look at implementation • de Marcken begins with a lexicon consisting of all of the atomic symbols of the corpus (e.g., the 27 letters, or 53 letters, or whatever), and • builds up from there.
Iterate several times • Construct tentative new entries for the lexicon, with tentative counts; from the counts, calculate rough probabilities. • EM (Expectation/Maximization): iterate 5 times: • Expectation: find all possible occurrences of each lexical entry in the corpus; assign relative weights to each occurrence found, based on its probability; use this to assign (non-integral!) counts of words in the corpus. • Maximization: convert the counts into probabilities. • Test each lexical entry to see whether the description length is better without it in the lexicon. If so, remove it. • Find the best parse (the Viterbi parse), the one with highest probability.
T H E R E N T I S D U E
Lexicon: D E H I N R S T U
Counts: T 2, E 3, all others 1
Total count: 12
Step 0 • Initialize the lexicon with all of the symbols in the corpus (the alphabet, the set of phonemes, whatever it is). • Each symbol has a probability, which is simply its frequency. • There are no (non-trivial) chunks in the lexicon.
Step 1 • 1.1 Create tentative members: TH HE ER RE EN NT TI IS SD DU UE • Give each of these a count of 1. • Now the total count of “words” in the corpus is 12 + 11 = 23. • Calculate new probabilities: pr(E) = 3/23; pr(TH) = 1/23. • The probabilities of the lexicon entries form a distribution.
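Steps 0 and 1 on the toy corpus, as a short sketch: the atomic symbols get their frequencies, the eleven two-letter candidates each get a count of 1, and the probabilities come out as above (3/23 for E, 1/23 for TH).

```python
from collections import Counter

corpus = "THERENTISDUE"

# Step 0: the lexicon starts as the atomic symbols with their frequencies (T 2, E 3, others 1).
counts = Counter(corpus)                          # total count 12

# Step 1: tentative two-letter members TH HE ER RE EN NT TI IS SD DU UE, each with count 1.
for i in range(len(corpus) - 1):
    counts[corpus[i:i + 2]] += 1                  # total count is now 12 + 11 = 23

total = sum(counts.values())
probs = {w: c / total for w, c in counts.items()}
print(probs["E"], probs["TH"])                    # 3/23 ≈ 0.130 and 1/23 ≈ 0.043
```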
Expectation/Maximization (EM), iterative: • This is a widely used algorithm that does something important and almost miraculous: it finds the best values for hidden parameters. • Expectation: find all occurrences of each lexical item in the corpus, and compute maximum-likelihood parameters. Use the Forward/Backward algorithm.
Forward algorithm • Find all ways of parsing the corpus from the beginning to each point, and associate with each point the sum of the probabilities for all of those ways. We don’t know which is the right one, really.
Forward
Start at position 1, after T: THERENTISDUE. The only way to get there and put a word break there (“T HERENTISDUE”) uses the word(?) “T”, whose probability is 2/23. So Forward(1) = 2/23.
Now, after position 2, after TH, there are 2 ways to get there:
(a) T H ERENTISDUE, with probability 2/23 × 1/23 = 2/529 = 0.00378
(b) TH ERENTISDUE, with probability 1/23 = 0.0435
So the Forward probability after letter 2 (after “TH”) is 0.0472.
After letter 3 (after “THE”), we have to consider the possibilities (1) T-HE, (2) TH-E and (3) T-H-E.
(1) T-HE has probability: prob of a break after “T” (= 2/23 = 0.0869) × prob(HE) (= 1/23 = 0.0434) = 0.00378.
We combine cases (2) and (3), which both put a break after position 2 (the H): together they contribute the Forward value there, already calculated as 0.0472, × prob(E) = 0.0472 × 0.13 = 0.00616.
So Forward(3) = 0.00378 + 0.00616 = 0.00994.
Forward
[Diagram on the slides: the possible paths over T H E.] The value of Forward at a given break is the sum of the probabilities of all the paths that reach it. You only need to go back (from where you are) the length of the longest lexical entry (which is now 2).
Conceptually • We are computing for each break (between letters) what the probability is that there is a break there, by considering all possible chunkings of the (prefix) string, the string up to that point from the left. • This is the Forward probability of that break.
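Here is a sketch of the Forward pass, reusing the probs dictionary from the Step 1 sketch above; the values after letters 1, 2 and 3 match the hand calculation (2/23, 0.0472, 0.00994).

```python
def forward(corpus, probs, max_len=2):
    """F[i] = total probability of all ways of chunking the first i letters,
    using lexicon chunks of length up to max_len. F[0] = 1 (the empty prefix)."""
    F = [0.0] * (len(corpus) + 1)
    F[0] = 1.0
    for i in range(1, len(corpus) + 1):
        for k in range(1, min(max_len, i) + 1):   # only look back the longest entry length
            chunk = corpus[i - k:i]
            if chunk in probs:
                F[i] += F[i - k] * probs[chunk]
    return F

F = forward("THERENTISDUE", probs)
print(F[1], F[2], F[3])                            # ≈ 0.0870, 0.0472, 0.00994
```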
Backward • We do exactly the same thing from right to left, giving us a backward probability: • …. D U E
Now the tricky step:
T H E R E N T I S D U E
Note that we know the probability of the entire string: it is Forward(12), the sum of the probabilities of all the ways of chunking the string; call it Pr(string).
What is the probability that -R- is a word, given the string?
T H E R E N T I S D U E • That is, we’re wondering whether the R here is a chunk, or part of the chunk ER, or part of the chunk RE. It can’t be all three, but we’re not in a position (yet) to decide which it is. How do we count it? • We take the count of 1, and divide it up among the three options in proportion to their probabilities.
T H E R E N T I S D U E
• The probability that R is a word here is given by the expression Forward(3) × pr(R) × Backward(4) / Pr(string), where Forward(3) covers THE and Backward(4) covers ENTISDUE. This is the fractional count that goes to R.
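A sketch of the Backward pass and of the fractional count for R, reusing probs and F from the sketches above; the final line is the standard forward-backward soft-count expression just described.

```python
def backward(corpus, probs, max_len=2):
    """B[i] = total probability of all ways of chunking the corpus from
    position i to the end. B[len(corpus)] = 1 (the empty suffix)."""
    n = len(corpus)
    B = [0.0] * (n + 1)
    B[n] = 1.0
    for i in range(n - 1, -1, -1):
        for k in range(1, min(max_len, n - i) + 1):
            chunk = corpus[i:i + k]
            if chunk in probs:
                B[i] += probs[chunk] * B[i + k]
    return B

B = backward("THERENTISDUE", probs)
pr_string = F[12]                                  # = B[0]: probability of the whole string

# Fractional (soft) count for R as a chunk: it spans the 4th letter.
count_R = F[3] * probs["R"] * B[4] / pr_string
print(count_R)
```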
Do this for all members of the lexicon • Compute Forward and Backward just once for the whole corpus, or for each sentence or subutterance if you have that information. • Compute the counts of all lexical items that conceivably could occur (in each sentence, etc.). • End of Expectation. • Maximization: use these “soft counts” to recalculate the parameters.
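Putting the pieces together, a sketch of one full Expectation step plus the Maximization step, built on the forward and backward functions above: every chunk occurrence that could conceivably be there receives a soft count, and the soft counts are then renormalized into new probabilities.

```python
from collections import Counter

def em_step(corpus, probs, max_len=2):
    """E-step: forward-backward soft counts for every chunk occurrence.
    M-step: renormalize the soft counts into new probabilities."""
    F = forward(corpus, probs, max_len)
    B = backward(corpus, probs, max_len)
    pr_string = F[len(corpus)]
    soft = Counter()
    for i in range(len(corpus)):
        for k in range(1, min(max_len, len(corpus) - i) + 1):
            chunk = corpus[i:i + k]
            if chunk in probs:
                soft[chunk] += F[i] * probs[chunk] * B[i + k] / pr_string
    Z = sum(soft.values())                         # total (non-integral) count of chunks
    return {w: c / Z for w, c in soft.items()}

probs = em_step("THERENTISDUE", probs)             # iterate this a few times (the slide says 5)
```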
So how good is this? • First: 3749 sentences, 400,000 characters. • TheFultonCountyGrandJurysaidFridayaninvestigationofAtlanta'srecentprimaryelectionproducednoevidencethatanyirregularitiestookplace. • Thejuryfurthersaidinterm-endpresentmentsthattheCityExecutiveCommittee,whichhadover-allchargeoftheelection,deservesthepraiseandthanksoftheCityofAtlantaforthemannerinwhichtheelectionwasconducted (etc)
The Fulton County Grand Ju ry s aid Friday an investi gation of At l anta 's recent prim ary e lection produc ed no e videnc e that any ir regul ar it i es took place . • Thejury further s aid in term - end present ment s thatthe City Ex ecutive Commit t e e ,which had over - all charg e ofthe e lection , d e serv e s the pra is e and than k softhe City of At l anta forthe man ner in whichthe e lection was conduc ted.
Doesn’t get better with more data:
Smaller corpus, 400K characters: The Fulton County Grand Ju ry s aid Friday an investi gation of At l anta 's recent prim ary e lection produc ed no e videnc e that any ir regul ar it i es took place .
Bigger corpus, 4,900K characters (51,000 lines): The Fulton County Grand Ju ry s aid Frid ay an investi g ationof At l ant a 's recent primar y e le ction produc ed no e vidence thatany ir regula r iti es tookplac e .
Errors • Chunks too big: ofthe Thejury • Chunks too small: • Ju ry s aid Friday • Commit t e e
Like the man with a hammer: • This system sees chunks wherever there is structure, because that’s the only tool it has for dealing with “departures from uniform distribution of data” (Zellig Harris). • But language has structure at many levels simultaneously, including morphology and syntax.
This is the end of the beginning. • This model knows nothing of word structure, and knows nothing of grammatical (syntactic) structure. No improvements without that. • What does word structure = morphology look like from this perspective of automatic learning? • Will it be possible to formulate an objective function in such a way that the optimization of that function solves a scientific (linguistic) problem? • Can linguistics be conceived of as an optimization problem?
Part 2: Morphology Signatures
What is structure? • In syntax: there is probably more than one kind of structure; S-CF-PSGs represent part of the answer. • In morphology… • In phonology: phonotactics – what segments prefer to be adjacent to what segments…and longer-distance effects too.
Essence of MDL We are given • a corpus, and • a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
How can we measure the compactness of a morphology? • Let’s consider a naïve version of description length: count the number of letters. • This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length
Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (61 letters in total)
Analysis:
• Stems: jump, laugh, sing, sang, dog (20 letters)
• Suffixes: s, ing, ed (6 letters)
• Unanalyzed: the (3 letters)
• Total: 29 letters
Notice that the description length goes UP if we analyze sing into s+ing (we would have to add the stem s while still keeping the stem sing for singing).
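The naive letter count in code, as a small sketch of the comparison above: 61 letters for the unanalyzed word list versus 29 for the analyzed morphology.

```python
words = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
         "sing", "sang", "singing", "the", "dog", "dogs"]

stems      = ["jump", "laugh", "sing", "sang", "dog"]   # 20 letters
suffixes   = ["s", "ing", "ed"]                         # 6 letters
unanalyzed = ["the"]                                    # 3 letters

def letters(word_list):
    """Naive description length: just count the letters."""
    return sum(len(w) for w in word_list)

print(letters(words))                                   # the unanalyzed word list: 61 letters
print(letters(stems) + letters(suffixes) + letters(unanalyzed))   # the analysis: 29 letters
```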
Essence of MDL… The best overall theory of a corpus is the one for which the sum of • the p-log prob(corpus), plus • the length of the morphology (that sum is the description length) is the smallest.