350 likes | 536 Views
Unsupervised Learning of Natural Language Morphology using MDL. John Goldsmith November 9, 2001. Unsupervised learning. Input: untagged text in orthographic or phonetic form with spaces (or punctuation) separating words. But no tagging or text preparation. Output.
E N D
Unsupervised Learning of Natural Language Morphology using MDL John Goldsmith November 9, 2001
Unsupervised learning • Input: untagged text in orthographic or phonetic form • with spaces (or punctuation) separating words. • But no tagging or text preparation.
Output • List of stems, suffixes, and prefixes • List of signatures. • A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. • Hence, a stem in a corpus has a unique signature. • A signature has a unique set of stems associated with it • …
(example of signature in English) • NULL.ed.ing.s ask call point = ask asked asking asks call called calling calls point pointed pointing points
Minimum Description Length (MDL) • Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) • Work by Michael Brent and Carl de Marcken on word-discovery using MDL
Essence of MDL We are given • a corpus, and • a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. (“Given”? Given by who? We’ll get back to that.) (Remember: a distribution is a set of non-negative numbers summing to 1.0.)
The higher the probability is that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data. • Better said: -1 * log probability (corpus) is a measure of how well the morphology models the data: the smaller that number is, the better the morphology models the data. This is known as the optimal compressed length of the data, given the model. Using base 2 logs, this number is a measure in information theoretic bits.
Essence of MDL… • The goodness of the morphology is also measured by how compact the morphology is. • We can measure the compactness of a morphology in information theoretic bits.
How can we measure the compactness of a morphology? • Let’s consider a naïve version of description length: count the number of letters. • This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length Corpus: jump, jumps, jumping laugh, laughed, laughing sing, sang, singing the, dog, dogs total: 62 letters Analysis: Stems: jump laugh sing sang dog (20 letters) Suffixes: s ing ed (6 letters) Unanalyzed: the (3 letters) total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ing
Essence of MDL… The best overall theory of a corpus is the one for which the sum of • log prob (corpus) + • length of the morphology (that’s the description length) is the smallest.
Overall logic • Search through morphology space for the morphology which provides the smallest description length.
Corpus Pick a large corpus from a language -- 5,000 to 1,000,000 words.
Corpus Feed it into the “bootstrapping” heuristic... Bootstrap heuristic
Corpus Bootstrap heuristic Out of which comes a preliminary morphology, which need not be superb. Morphology
Corpus Bootstrap heuristic Feed it to the incremental heuristics... Morphology incremental heuristics
Corpus Out comes a modified morphology. Bootstrap heuristic Morphology modified morphology incremental heuristics
Corpus Is the modification an improvement? Ask MDL! Bootstrap heuristic Morphology modified morphology incremental heuristics
Corpus If it is an improvement, replace the morphology... Bootstrap heuristic modified morphology Morphology Garbage
Corpus Send it back to the incremental heuristics again... Bootstrap heuristic modified morphology incremental heuristics
Continue until there are no improvements to try. Morphology modified morphology incremental heuristics
1. Bootstrap heuristic • A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes. • In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so search space has at least members.
Better bootstrap heuristics Heuristic, not perfection! Several good heuristics. Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut;
Successor frequency n g o v e r Empirically, only one letter follows “gover”: “n”
Successor frequency e i m g o v e r n o s # Empirically, 6 letters follows “govern”: “m”
Successor frequency g o v e r n m e Empirically, 1 letter follows “governm”: “e” g o v e r 1 n 6 m 1 e peak of successor frequency
Lots of errors… 9 18 11 6 4 1 2 1 1 2 1 1 c o n s e r v a t i v e s wrong right wrong
Even so… We set conditions: Accept cuts with stems at least 5 letters in length; Demand that successor frequency be a clear peak: 1… N … 1 (e.g. govern-ment) Then for each stem, collect all of its suffixes into a signature; and accept only signatures with at least 5 stems to it.
2. Incremental heuristics Course-grained to fine-grained • 1. Stems and suffixes to split: • Accept any analysis of a word if it consists of a known stem and a known suffix. • 2. Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. • Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis. We’ll return to this in a moment.
Incremental heuristic • 3.Slide stem-suffix boundary to the left: Again, use MDL to decide. How do we use MDL to decide?
Using MDL to judge a potential stem act, acted, action, acts, acting. We have the suffixes NULL, ed, ion, ing, and s, but no signature NULL.ed.ion.ing.s Let’s compute cost versus savings of signature NULL.ed.ion.ing.s Savings: Stem savings: 4 copies of the stem act: that’s 3 x 4 = 12 letters = almost 60 bits.
Cost of NULL.ed.ing.s • A pointer to each suffix: To give a feel for this: Total cost of suffix list: about 30 bits. Cost of pointer to signature: total cost is
Cost of signature: about 45 bits • Savings: about 60 bits so MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
Repair heuristics: using MDL We could compute the entire MDL in one state of the morphology; make a change; compute the whole MDL in the proposed (modified) state; and compared the two lengths. Original morphology + Compressed data Revised morphology+ compressed data < >