Learning linguistic structure

John Goldsmith, Computer Science Department, University of Chicago. February 7, 2003

A large part of the field of computational linguistics has moved during the 1990s from • developing grammars, speech recognition engines, etc., that simply work, to • developing systems that learn language-specific parameters from large amounts of data.

Credo… • Statistically driven methods of data analysis, when applied to natural language data, will produce results that shed light on linguistic structure.

Unsupervised learning Input: large texts in a natural language, with no prior knowledge of the language.

A bit more about the goal • What’s the input? • “Data” – which comes to the learner, in acoustic form, unsegmented: • Sentences not broken up into words • Words not broken up into their components (morphemes). • Words not assigned to lexical categories (noun, verb, article, etc.) • With a meaning representation?

Idealization of the language-learning scheme • Segment the soundstream into words; the words form the lexicon of the language. • Discover internal structure of words; this is the morphology of the language. • Infer a set of lexical categories for words; each word is assigned to (at least) one lexical category. • Infer a set of phrase-structure rules for the language.

Idealization? • While these tasks are individually coherent, we make no assumption that any one must be completed before another can be begun.

Today’s task • To develop an algorithm capable of learning the morphology of a language, given knowledge of the words of the language, and of a large sample of utterances.

Goals Given a corpus, learn: • The set of word-roots, prefixes, and suffixes, and principles of combination. • Principles of automatic alternations (e.g., e drops before the suffixes -ing, -ity and -ed, but not before -s). • The fact that some suffixes have one grammatical function (-ness) while others have more (e.g., -s: song-s versus sing-s).

Why? Practical applications: • Automatic stemming for multilingual information retrieval • A corpus broken into morphemes is far superior to a corpus broken into words for statistically-driven machine translation • Develop morphologies for speech recognition automatically

Theoretically • There is a strong bias currently in linguistics to underestimate the difficulty of language learning: for example, to identify language learning with the selection of a phrase-structure grammar, or with the independent setting of a small number of parameters.

Morphology • The learning of morphology is a very difficult task, in the sense that every word W of length |W| can potentially be divided into 1, 2, …, L morphemes m_i, constrained only by Σ_i |m_i| = |W| – and that’s ignoring labeling (which is the stem, which the affix). • The number of potential morphologies for a given corpus is enormous.
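To get a feel for the size of this space: a word of length n has n - 1 internal positions, each of which either is or is not a cut, so it has 2^(n-1) segmentations into non-empty morphemes, of which C(n-1, k-1) use exactly k morphemes. The Python sketch below tabulates these counts; the function names are illustrative and are not taken from Linguistica.

```python
from math import comb


def segmentations_with_k_morphemes(word_length: int, k: int) -> int:
    """Ways to cut a word of this length into exactly k non-empty pieces."""
    # Choose k - 1 cut points among the word_length - 1 internal positions.
    return comb(word_length - 1, k - 1)


def total_segmentations(word_length: int) -> int:
    """Each of the word_length - 1 internal positions is either cut or not."""
    return 2 ** (word_length - 1)


n = len("governments")
for k in range(1, n + 1):
    print(k, segmentations_with_k_morphemes(n, k))
print("total:", total_segmentations(n))
```

These counts cover a single word and ignore labeling; a morphology for a whole corpus must choose a segmentation for every word, which is why exhaustive search is not an option.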

So the task is a reality check for discussions of language learning

Ideally We would like to pose the problem of grammar-selection as an optimization problem, and cut our task into two parts: • Specification of the objective function to be optimized, and • Development of practical search techniques to find optima in reasonable time.

Current status • Linguistica: a C++ Windows-based program available for download at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000 • Technical discussion in Computational Linguistics (June 2001) • Good results with 5,000 words, very fine-grained results with 500,000 words (corpus length, not lexicon count), especially in European languages.

Today’s talk 1. Specify the task in explicit terms 2. Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. 3. Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. 4. Morphology assigns a probability distribution over its words. 5. Computing the length of the morphology.

Today’s talk (continued) 6. Results 7. Some work in progress: learning syntax to learn about morphology

Given a text (but no prior knowledge of its language), we want: • List of stems, suffixes, and prefixes • List of signatures. • A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. • Hence, a stem in a corpus has a unique signature. • A signature has a unique set of stems associated with it

Example of signature in English • The signature NULL.ed.ing.s with the stems ask, call, point summarizes: ask, asked, asking, asks; call, called, calling, calls; point, pointed, pointing, points.
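The relation between stems and signatures can be sketched in a few lines of Python. This is a minimal illustration that assumes the words have already been analyzed into (stem, suffix) pairs, with the empty suffix written "NULL"; the function name is hypothetical.

```python
from collections import defaultdict


def build_signatures(analyses):
    """Map each signature to the sorted list of stems that carry it.

    `analyses` is an iterable of (stem, suffix) pairs. A signature is the
    alphabetized list of a stem's suffixes, joined with dots.
    """
    suffixes_of_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of_stem[stem].add(suffix)

    stems_of_signature = defaultdict(list)
    for stem, suffixes in suffixes_of_stem.items():
        # Plain string sorting puts the uppercase "NULL" before lowercase suffixes.
        signature = ".".join(sorted(suffixes))
        stems_of_signature[signature].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of_signature.items()}


analyses = [(stem, suffix)
            for stem in ["ask", "call", "point"]
            for suffix in ["NULL", "ed", "ing", "s"]]
print(build_signatures(analyses))
```

Because each stem contributes exactly one suffix set, every stem ends up under exactly one signature, which is the uniqueness property stated above.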

We would like to characterize the discovery of a signature as an optimization problem • Reasonable tack: formulate the problem in terms of Minimum Description Length (Rissanen, 1989)

Minimum Description Length (MDL) • Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) • Work by Michael Brent and Carl de Marcken on word-discovery using MDL in the mid-1990s.

Essence of MDL If we are given • a corpus, and • a probabilistic morphology (technically, a distribution over certain strings of stems and affixes), then we can compute an overall measure (“description length”), which we can seek to minimize over the space of all possible analyses.

Description length of a corpus C, given a morphology M • The length, in bits, of the shortest formulation of the morphology expressible on a given Turing machine, plus • the optimal compressed length of the corpus, using that morphology.

Probabilistic morphology • To serve this function, the morphology must assign a probability distribution over the set of words it generates, so that the optimal compressed length of an actual, occurring corpus (the one we’re learning from) is -1 * the log of the probability the morphology assigns to that corpus.
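As a rough illustration of compressed length as negative log probability, the Python sketch below scores a corpus in bits (base-2 logarithm). It assumes, for simplicity only, that tokens are generated independently from a unigram word distribution; the talk's morphology assigns word probabilities through signatures, stems, and suffixes instead, and the function names here are hypothetical.

```python
import math
from collections import Counter


def unigram_model(corpus_tokens):
    """Relative frequency of each word: a stand-in for a real morphology."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def compressed_length_bits(corpus_tokens, word_probability):
    """Sum of -log2 P(word) over all tokens, i.e. -log2 P(corpus)."""
    return sum(-math.log2(word_probability[word]) for word in corpus_tokens)


tokens = "the dog jumps the dog sang".split()
print(compressed_length_bits(tokens, unigram_model(tokens)))
```

A model that assigns the observed words higher probability yields a shorter compressed length, which is the first of the two terms in the description length.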

Essence of MDL… • The goodness of the morphology is also measured by how compact the morphology is. • We can measure the compactness of a morphology in information theoretic bits.

How can we measure the compactness of a morphology? • Let’s consider a naïve version of description length: count the number of letters. • This naïve version is nonetheless helpful in seeing the intuition involved.

Naive Minimum Description Length • Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs. Total: 61 letters. • Analysis: stems jump, laugh, sing, sang, dog (20 letters); suffixes s, ing, ed (6 letters); unanalyzed the (3 letters). Total: 29 letters. • Notice that the description length goes UP if we analyze sing into s+ing.
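The letter counts on this slide can be checked mechanically. The Python sketch below only re-does the naive count (letters, not bits) for the word list and the analysis given above; a direct count of the twelve words gives 61 letters, against 29 for the analysis. The variable names are illustrative.

```python
def letter_count(strings):
    """Naive description length of a list: the total number of letters."""
    return sum(len(s) for s in strings)


corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

# Cost of listing every word whole, versus listing stems, suffixes, and leftovers once.
whole_word_cost = letter_count(corpus)
analyzed_cost = letter_count(stems) + letter_count(suffixes) + letter_count(unanalyzed)

print("words listed whole:", whole_word_cost)
print("stems + suffixes + unanalyzed:", analyzed_cost)
```

The naive count leaves out the cost of saying which stem combines with which suffix; the bit-based version of the measure is what accounts for that.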

Essence of MDL… The best overall theory of a corpus is the one for which the sum of • -1 * log prob (corpus) + • length of the morphology (that’s the description length) is the smallest.

Overall logic • Search through morphology space for the morphology which provides the smallest description length.

Brief foreshadowing of our calculation of the length of the morphology • A morphology is composed of three lists: a list of stems, a list of suffixes (say), and a list of ways in which the two can be combined (“signatures”). Information content of a list =

Stem list

Bootstrap heuristic • Find a method to locate likely places to cut a word. • Allow no more than 1 cut per word (i.e., maximum of 2 morphemes). • Assume this is stem + suffix. • Associate with each stem an alphabetized list of its suffixes; call this its signature. • Accept only those word analyses associated with robust signatures…

…where a robust signature is one with a minimum of 5 stems (and at least two suffixes). Robust signatures are pieces of secure structure.

Heuristic to find likely cuts… • The best is a modification of a good idea of Zellig Harris (1955). • Current variant: cut words at certain peaks of successor frequency. • Problems: can over-cut; can under-cut; and can put cuts too far to the right (the “aborti-” problem). [Not a problem!]

Successor frequency • Empirically, only one letter follows “gover”: “n”.

Successor frequency • Empirically, 6 symbols follow “govern”: “e”, “i”, “m”, “o”, “s”, and “#” (end of word).

Successor frequency • Empirically, 1 letter follows “governm”: “e”. • Successor frequencies along the word: gover 1, govern 6, governm 1. The 6 after “govern” is a peak of successor frequency.
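Successor frequency is straightforward to compute from a word list. The Python sketch below is a minimal illustration: it records, for every prefix, the set of symbols that follow it, using "#" for end of word. The six-word list is a toy example chosen to reproduce the counts above, not data from the talk, and the function names are hypothetical.

```python
from collections import defaultdict

END = "#"  # marks the end of a word, so that "govern" itself counts as a continuation


def successor_sets(words):
    """Map each prefix to the set of symbols that follow it in the word list."""
    successors = defaultdict(set)
    for word in set(words):
        padded = word + END
        for i in range(1, len(padded)):
            successors[padded[:i]].add(padded[i])
    return successors


def successor_profile(word, successors):
    """Successor frequency after each prefix of `word`, shortest prefix first."""
    return [(word[:i], len(successors[word[:i]])) for i in range(1, len(word) + 1)]


words = ["govern", "governed", "governing", "government", "governor", "governs"]
successors = successor_sets(words)
for prefix, count in successor_profile("government", successors):
    print(prefix, count)
```

With this toy list the profile is flat at 1 except after "govern", where six different symbols can follow.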

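Lots of errors… • Successor frequencies after each successive letter of “conservatives”: c 9, o 18, n 11, s 6, e 4, r 1, v 2, a 1, t 1, i 2, v 1, e 1. • Peaks fall after “co” (wrong), after “conserv” (right), and after “conservati” (wrong).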

Even so… We set conditions: • Accept only cuts whose stems are at least 5 letters in length. • Demand that successor frequency be a clear peak: 1 … N … 1 (e.g. govern-ment). • Then for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
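The conditions above can be put together in a small Python sketch. It is a simplified reading, not Linguistica's code: it assumes that a "clear peak" means a successor frequency above 1 flanked by successor frequencies of exactly 1, that a cut at the very end of a word yields the NULL suffix, and that only the first qualifying cut in a word is kept. The function names and the toy word list are hypothetical.

```python
from collections import defaultdict


def successor_counts(words):
    """Number of distinct symbols ("#" = end of word) that follow each prefix."""
    successors = defaultdict(set)
    for word in words:
        padded = word + "#"
        for i in range(1, len(padded)):
            successors[padded[:i]].add(padded[i])
    return {prefix: len(symbols) for prefix, symbols in successors.items()}


def find_cut(word, sf, min_stem_length=5):
    """Return (stem, suffix) at the first clear peak 1 ... N ... 1, or None."""
    for i in range(min_stem_length, len(word) + 1):
        is_peak = sf.get(word[:i], 0) > 1 and sf.get(word[:i - 1], 0) == 1
        falls_back_to_one = i == len(word) or sf.get(word[:i + 1], 0) == 1
        if is_peak and falls_back_to_one:
            return word[:i], (word[i:] or "NULL")
    return None


def robust_signatures(words, min_stem_length=5, min_stems=5, min_suffixes=2):
    """Signatures (dot-joined sorted suffixes) that pass the robustness test."""
    words = set(words)
    sf = successor_counts(words)
    suffixes_of_stem = defaultdict(set)
    for word in words:
        cut = find_cut(word, sf, min_stem_length)
        if cut is not None:
            stem, suffix = cut
            suffixes_of_stem[stem].add(suffix)
    stems_of_signature = defaultdict(list)
    for stem, suffixes in suffixes_of_stem.items():
        stems_of_signature[".".join(sorted(suffixes))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of_signature.items()
            if len(stems) >= min_stems and sig.count(".") + 1 >= min_suffixes}


toy_stems = ["attend", "depart", "extend", "govern", "import"]
toy_words = [stem + suffix for stem in toy_stems for suffix in ["", "ed", "ing", "s"]]
toy_words += ["conserve", "conserving"]  # one stem only, so its signature is rejected
print(robust_signatures(toy_words))
```

In the toy list, "conserve" and "conserving" do yield a cut (conserv + e, conserv + ing), but the resulting signature has a single stem and is discarded, which is the filtering role robust signatures play in the bootstrap.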

Words->SuccessorFreq1(GetStems_Suffixed(), GetSuffixes(), GetSignatures(), SF1);
CheckSignatures();
ExtendKnownStemsToKnownSuffixes();
TakeSignaturesFindStems();
ExtendKnownStemsToKnownSuffixes();
FromStemsFindSuffixes();
ExtendKnownStemsToKnownSuffixes();
LooseFit();
CheckSignatures();

2. Incremental heuristics • Enormous amount of detail being skipped…let’s look at one simple case: • Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. • Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis.
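The collection step of loose fit can be sketched as follows. This Python fragment covers only the gathering of candidate stems and their apparent suffixes; the MDL decision is left out. It assumes known suffixes are given as plain strings and that a candidate stem gets the NULL suffix when it also occurs as a word on its own. The function name is hypothetical.

```python
def loose_fit_candidates(words, known_suffixes):
    """Map each string that precedes a known suffix to its apparent suffixes."""
    vocabulary = set(words)
    candidates = {}
    for word in vocabulary:
        for suffix in known_suffixes:
            # Require a non-empty remainder before the suffix.
            if suffix and word.endswith(suffix) and len(word) > len(suffix):
                stem = word[:-len(suffix)]
                candidates.setdefault(stem, set()).add(suffix)
    for stem, suffixes in candidates.items():
        if stem in vocabulary:
            suffixes.add("NULL")  # the bare stem is itself a word
    return {stem: sorted(suffixes) for stem, suffixes in candidates.items()}


print(loose_fit_candidates(["act", "acted", "action", "acts"], ["ed", "ion", "s"]))
```

Each candidate stem and its apparent signature would then be passed to the MDL comparison of cost against savings.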

Using MDL to judge a potential stem and potential signature • Suppose we find: act, acted, action, acts. • We have the suffixes NULL, ed, ion, and s, but not the signature NULL.ed.ion.s. • Let’s compute cost versus savings of signature NULL.ed.ion.s.

Savings • Stem savings: 3 copies of the stem act: that’s 3 x 3 = 9 letters = 40.5 bits (taking 4.5 bits/letter). • Suffix savings: ed, ion, s: 6 letters, another 27 bits. • Total of 67.5 bits.

Cost of NULL.ed.ion.s • A pointer to each suffix. To give a feel for this: total cost of suffix list is about 30 bits. • Cost of pointer to signature: total cost is -- all the stems using it chip in to pay for its cost, though.

Cost of signature: about 43 bits • Savings: about 67 bits • Slight worsening in the compressed length of these 4 words. • So MDL says: Do it! Analyze the words as stem + suffix. • Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
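The arithmetic behind this decision can be replayed in a few lines of Python. The sketch uses the talk's figures of 4.5 bits per letter and a signature cost of about 43 bits; it leaves out the slight worsening in compressed length, which the slides do not quantify, so the net figure is an upper bound on the gain. The variable names are illustrative.

```python
BITS_PER_LETTER = 4.5           # figure quoted in the talk
SIGNATURE_COST_BITS = 43.0      # approximate figure quoted in the talk

stem = "act"
words = ["act", "acted", "action", "acts"]
spelled_out_suffixes = ["ed", "ion", "s"]

# The stem is stored once instead of once per word: 3 copies saved.
stem_savings = len(stem) * (len(words) - 1) * BITS_PER_LETTER

# The suffix letters no longer need to be spelled out inside each word.
suffix_savings = sum(len(s) for s in spelled_out_suffixes) * BITS_PER_LETTER

total_savings = stem_savings + suffix_savings
net_gain = total_savings - SIGNATURE_COST_BITS

print("stem savings (bits):", stem_savings)
print("suffix savings (bits):", suffix_savings)
print("total savings (bits):", total_savings)
print("net gain before compression penalty (bits):", net_gain)
```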

Frequency of analyzed word • W is analyzed as belonging to signature σ, with stem T and suffix F. • [x] means the count of x’s in the corpus (token count), where [W] is the total number of words. • Actually what we care about is the log of this:
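The formula itself is not preserved in this copy. One reading consistent with the definitions on the slide, assumed here rather than taken from it, factors the frequency into the probability of the signature, of the stem given the signature, and of the suffix given the signature. The word is written w to keep it apart from the total word count [W], and [F in σ], a notation assumed here, counts the tokens of suffix F that occur with stems of signature σ.

```latex
\begin{align*}
\mathrm{freq}(w) &= \frac{[\sigma]}{[W]} \cdot \frac{[T]}{[\sigma]} \cdot \frac{[F \text{ in } \sigma]}{[\sigma]} \\
\log \mathrm{freq}(w) &= \log\frac{[\sigma]}{[W]} + \log\frac{[T]}{[\sigma]} + \log\frac{[F \text{ in } \sigma]}{[\sigma]}
\end{align*}
```

Under this reading, the negative of the second line, summed over all word tokens, is the compressed length of the corpus.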
