
Learning linguistic structure

Learning linguistic structure. John Goldsmith Computer Science Department University of Chicago February 7, 2003. A large part of the field of computational linguistics has moved during the 1990s from developing grammars, speech recognition engines, etc., that simply work , to



  1. Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003

  2. A large part of the field of computational linguistics has moved during the 1990s from • developing grammars, speech recognition engines, etc., that simply work, to • developing systems that learn language-specific parameters from large amounts of data.

  3. Credo… • Statistically driven methods of data analysis, applied to natural language data, will produce results that shed light on linguistic structure.

  4. Unsupervised learning Input: large texts in a natural language, with no prior knowledge of the language.

  5. A bit more about the goal • What’s the input? • “Data” – which comes to the learner, in acoustic form, unsegmented: • Sentences not broken up into words • Words not broken up into their components (morphemes). • Words not assigned to lexical categories (noun, verb, article, etc.) With a meaning representation?

  6. Idealization of the language-learning scheme • Segment the soundstream into words; the words form the lexicon of the language. • Discover internal structure of words; this is the morphology of the language. • Infer a set of lexical categories for words; each word is assigned to (at least) one lexical category. • Infer a set of phrase-structure rules for the language.

  7. Idealization? • While these tasks are individually coherent, we make no assumption that any one must be completed before another can be begun.

  8. Today’s task • To develop an algorithm capable of learning the morphology of a language, given knowledge of the words of the language, and of a large sample of utterances.

  9. Goals Given a corpus, learn: • The set of word-roots, prefixes, and suffixes, and principles of combination; • Principles of automatic alternations (e.g., e drops before the suffixes -ing, -ity, and -ed, but not before -s); • That some suffixes have one grammatical function (-ness) while others have more (e.g., -s: song-s versus sing-s).

  10. Why? Practical applications: • Automatic stemming for multilingual information retrieval • A corpus broken into morphemes is far superior to a corpus broken into words for statistically-driven machine translation • Develop morphologies for speech recognition automatically

  11. Theoretically There is a strong bias currently in linguistics to underestimate the difficulty of language learning – For example, to identify language learning with the selection of a phrase-structure grammar, or with the independent setting of a small number of parameters.

  12. Morphology • The learning of morphology is a very difficult task, in the sense that every word W of length |W| can potentially be divided into 1, 2, …, L morphemes $m_i$, constrained only by $\sum_i |m_i| = |W|$; and that’s ignoring labeling (which is the stem, which the affix). • The number of potential morphologies for a given corpus is enormous.
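
To give a feel for the size of this space, here is a minimal sketch that enumerates every split of a single word into non-empty contiguous pieces. The function names are illustrative only and are not taken from Linguistica. Each of the |W| - 1 gaps between letters is either cut or not, so a word has 2^(|W| - 1) segmentations before any labeling.

```python
from itertools import combinations

def segmentations(word):
    """Yield every way to split `word` into non-empty contiguous pieces."""
    n = len(word)
    for k in range(n):  # k = number of internal cut points (0 .. n-1)
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

def count_segmentations(word):
    """Closed form: each of the len(word) - 1 gaps is cut or not."""
    return 2 ** (len(word) - 1) if word else 0

if __name__ == "__main__":
    for pieces in segmentations("jump"):
        print("+".join(pieces))
    print(count_segmentations("jumping"))
```

The count grows exponentially per word, and a morphology must choose a split for every word in the lexicon at once.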

  13. So the task is a reality check for discussions of language learning

  14. Ideally We would like to pose the problem of grammar-selection as an optimization problem, and cut our task into two parts: • Specification of the objective function to be optimized, and • Development of practical search techniques to find optima in reasonable time.

  15. Current status • Linguistica: a C++ Windows-based program available for download at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000 • Technical discussion in Computational Linguistics (June 2001) • Good results with 5,000 words, very fine-grained results with 500,000 words (corpus length, not lexicon count), especially in European languages.

  16. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

  17. Today’s talk (continued) 6. Results 7. Some work in progress: learning syntax to learn about morphology

  18. Given a text (but no prior knowledge of its language), we want: • List of stems, suffixes, and prefixes • List of signatures. • A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. • Hence, a stem in a corpus has a unique signature. • A signature has a unique set of stems associated with it

  19. Example of signature in English • The signature NULL.ed.ing.s with the stems ask, call, point summarizes: ask, asked, asking, asks; call, called, calling, calls; point, pointed, pointing, points.
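
The relation between stems and signatures can be sketched as follows, assuming the stem + suffix analyses are already given. The function name and the use of an empty string for the NULL suffix are conventions chosen for this sketch, not the program's own.

```python
from collections import defaultdict

def build_signatures(analyses):
    """Group stems by their alphabetized suffix list.

    `analyses` is an iterable of (stem, suffix) pairs; an empty suffix
    stands for NULL. Returns {signature_name: set_of_stems}.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix or "NULL")

    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        # Each stem gets exactly one signature: all of its suffixes, sorted.
        stems_of[".".join(sorted(suffixes))].add(stem)
    return dict(stems_of)

if __name__ == "__main__":
    analyses = [(stem, suffix)
                for stem in ("ask", "call", "point")
                for suffix in ("", "ed", "ing", "s")]
    print(build_signatures(analyses))
```

Because every stem contributes all of its suffixes, a stem lands in exactly one signature, which is the uniqueness property stated on the previous slide.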

  20. We would like to characterize the discovery of a signature as an optimization problem • Reasonable tack: formulate the problem in terms of Minimum Description Length (Rissanen, 1989)

  21. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

  22. Minimum Description Length (MDL) • Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) • Work by Michael Brent and Carl de Marcken on word-discovery using MDL in the mid-1990s.

  23. Essence of MDL If we are given • a corpus, and • a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. Then we can compute an over-all measure (“description length”) which we can seek to minimize over the space of all possible analyses.

  24. Description length of a corpus C, given a morphology M: the length, in bits, of the shortest formulation of the morphology expressible on a given Turing machine + the optimal compressed length of the corpus, using that morphology.

  25. Probabilistic morphology • To serve this function, the morphology must assign a distribution over the set of words it generates, so that the optimal compressed length of an actual, occurring corpus (the one we’re learning from) is -1 * the log probability the morphology assigns to that corpus.
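
A minimal sketch of these two quantities, assuming the morphology has already been reduced to a word-probability table and that words are generated independently. The toy probabilities and the 100-bit morphology length are invented for illustration.

```python
import math

def compressed_length_bits(tokens, prob):
    """Optimal code length of the corpus: sum of -log2 P(word) over tokens."""
    return sum(-math.log2(prob[word]) for word in tokens)

def description_length_bits(morphology_bits, tokens, prob):
    """Length of the morphology plus compressed length of the corpus."""
    return morphology_bits + compressed_length_bits(tokens, prob)

if __name__ == "__main__":
    # Hypothetical distribution over four word types (sums to 1).
    prob = {"jump": 0.5, "jumps": 0.25, "jumping": 0.125, "the": 0.125}
    tokens = ["jump", "jump", "jumps", "the"]
    print(compressed_length_bits(tokens, prob))          # 1 + 1 + 2 + 3 bits
    print(description_length_bits(100.0, tokens, prob))  # 100.0 is made up
```

A morphology that assigns higher probability to the observed words shortens the second term, but usually at the price of a longer first term.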

  26. Essence of MDL… • The goodness of the morphology is also measured by how compact the morphology is. • We can measure the compactness of a morphology in information theoretic bits.

  27. How can we measure the compactness of a morphology? • Let’s consider a naïve version of description length: count the number of letters. • This naïve version is nonetheless helpful in seeing the intuition involved.

  28. Naive Minimum Description Length
      Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs (total: 61 letters)
      Analysis:
      Stems: jump, laugh, sing, sang, dog (20 letters)
      Suffixes: s, ing, ed (6 letters)
      Unanalyzed: the (3 letters)
      Total: 29 letters.
      Notice that the description length goes UP if we analyze sing into s+ing.
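
The letter counts in this naive version are easy to check mechanically. The sketch below just sums word lengths; counting the twelve listed words directly gives 61 letters for the raw corpus and 29 for the analysis.

```python
def letters(items):
    """Naive description length: total number of letters in a list."""
    return sum(len(item) for item in items)

corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

raw_length = letters(corpus)
analysis_length = letters(stems) + letters(suffixes) + letters(unanalyzed)

if __name__ == "__main__":
    print(raw_length, analysis_length)
```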

  29. Essence of MDL… The best overall theory of a corpus is the one for which the sum of • -1 * log prob (corpus) + • length of the morphology (that’s the description length) is the smallest.

  30. Essence of MDL…

  31. Overall logic • Search through morphology space for the morphology which provides the smallest description length.

  32. Brief foreshadowing of our calculation of the length of the morphology • A morphology is composed of three lists: a list of stems, a list of suffixes (say), and a list of ways in which the two can be combined (“signatures”). Information content of a list =

  33. Stem list

  34. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

  35. Bootstrap heuristic • Find a method to locate likely places to cut a word. • Allow no more than 1 cut per word (i.e., maximum of 2 morphemes). • Assume this is stem + suffix. • Associate with each stem an alphabetized list of its suffixes; call this its signature. • Accept only those word analyses associated with robust signatures…

  36. …where a robust signature is one with a minimum of 5 stems (and at least two suffixes). Robust signatures are pieces of secure structure.
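
The robustness filter can be sketched as below, assuming signatures are stored as a mapping from a dot-joined suffix list to a set of stems. That representation and the function name are assumptions of this sketch; the thresholds are the ones stated on the slide.

```python
def robust_signatures(signatures, min_stems=5, min_suffixes=2):
    """Keep signatures with at least `min_stems` stems and `min_suffixes` suffixes.

    `signatures` maps a name such as "NULL.ed.ing.s" to its set of stems.
    """
    return {
        name: stems
        for name, stems in signatures.items()
        if len(stems) >= min_stems and len(name.split(".")) >= min_suffixes
    }

if __name__ == "__main__":
    candidates = {
        "NULL.ed.ing.s": {"ask", "call", "point", "walk", "jump"},
        "NULL.s": {"dog", "cat"},                         # too few stems
        "ing": {"sing", "br", "th", "k", "r"},            # only one suffix
    }
    print(sorted(robust_signatures(candidates)))
```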

  37. Heuristic to find likely cuts… Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut; and can put cuts too far to the right (“aborti-” problem). [Not a problem!]

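  38. Successor frequency [Figure: letter tree for g-o-v-e-r followed by n] Empirically, only one letter follows “gover”: “n”.

  39. Successor frequency [Figure: letter tree for g-o-v-e-r-n followed by e, i, m, o, s, #] Empirically, 6 letters follow “govern”: “e”, “i”, “m”, “o”, “s”, and “#” (end of word).

  40. Successor frequency [Figure: letter tree for g-o-v-e-r-n-m followed by e] Empirically, 1 letter follows “governm”: “e”. The successor frequency is 1 after “gover”, 6 after “govern”, and 1 after “governm”, so “govern” is a peak of successor frequency.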
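
Successor frequency can be computed directly from a word list. The sketch below is one straightforward way to do it, not the program's own code; the toy lexicon is hypothetical, and treating end-of-word as a follower written "#" is an assumption of the sketch.

```python
def successor_frequencies(word, lexicon):
    """For each proper prefix of `word`, count the distinct symbols that
    follow that prefix anywhere in `lexicon` ("#" marks end of word)."""
    result = []
    for i in range(1, len(word)):
        prefix = word[:i]
        followers = {
            w[i] if len(w) > i else "#"
            for w in lexicon
            if w.startswith(prefix)
        }
        result.append((prefix, len(followers)))
    return result

if __name__ == "__main__":
    # Hypothetical lexicon chosen to mirror the "govern" example.
    lexicon = ["govern", "governed", "governing", "governor",
               "governs", "government", "governments"]
    for prefix, count in successor_frequencies("government", lexicon):
        print(prefix, count)
```

With this lexicon the count is 1 after "gover", 6 after "govern", and 1 after "governm".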

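  41. Lots of errors… Successor frequencies after each letter of “conservatives”: c 9, o 18, n 11, s 6, e 4, r 1, v 2, a 1, t 1, i 2, v 1, e 1. [Figure labels on the candidate cuts, left to right: wrong, right, wrong]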

  42. Even so… We set conditions: • Accept cuts with stems at least 5 letters in length; • Demand that successor frequency be a clear peak: 1 … N … 1 (e.g. govern-ment). Then for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
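
These two conditions can be sketched as a filter over a list of successor frequencies. The sketch reads "1 … N … 1" as "both neighbours are exactly 1 and the peak is greater than 1", which is one possible interpretation, and it assumes the k-th number is the frequency after the first k letters. The input is the list of counts from the "conservatives" slide.

```python
def peak_cuts(freqs, min_stem_len=5):
    """Return stem lengths at which to cut.

    freqs[i] is the successor frequency after the prefix of length i + 1.
    A cut is accepted when the stem has at least `min_stem_len` letters
    and the frequency profile around it is 1, N, 1 with N > 1.
    """
    cuts = []
    for i in range(1, len(freqs) - 1):
        stem_len = i + 1
        clear_peak = freqs[i - 1] == 1 and freqs[i] > 1 and freqs[i + 1] == 1
        if stem_len >= min_stem_len and clear_peak:
            cuts.append(stem_len)
    return cuts

if __name__ == "__main__":
    word = "conservatives"
    freqs = [9, 18, 11, 6, 4, 1, 2, 1, 1, 2, 1, 1]
    for stem_len in peak_cuts(freqs):
        print(word[:stem_len] + "-" + word[stem_len:])
```

Under these assumptions the large early peak after "co" is rejected, while cuts after "conserv" and "conservati" both survive, so the later signature filter is still needed.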

  43. Words->SuccessorFreq1(GetStems_Suffixed(), GetSuffixes(), GetSignatures(), SF1);
      CheckSignatures();
      ExtendKnownStemsToKnownSuffixes();
      TakeSignaturesFindStems();
      ExtendKnownStemsToKnownSuffixes();
      FromStemsFindSuffixes();
      ExtendKnownStemsToKnownSuffixes();
      LooseFit();
      CheckSignatures();

  44. 2. Incremental heuristics • Enormous amount of detail being skipped…let’s look at one simple case: • Loose fit: suffixes and signatures to split: Collect any string that precedes a known suffix. • Find all of its apparent suffixes, and use MDL to decide if it’s worth it to do the analysis.
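
The candidate-collection half of loose fit can be sketched as follows. This is not the LooseFit() routine itself: the function name, the treatment of NULL, and the requirement of at least two apparent suffixes are assumptions of the sketch, and the MDL test that decides whether to keep each candidate is left out.

```python
from collections import defaultdict

def loose_fit_candidates(words, known_suffixes, min_suffixes=2):
    """Collect strings that precede a known suffix, with their apparent suffixes.

    `known_suffixes` holds non-empty suffix strings; NULL is added when the
    bare candidate stem is itself a word.
    """
    lexicon = set(words)
    found = defaultdict(set)
    for word in lexicon:
        for suffix in known_suffixes:
            if suffix and len(word) > len(suffix) and word.endswith(suffix):
                found[word[:-len(suffix)]].add(suffix)
    for stem in found:
        if stem in lexicon:
            found[stem].add("NULL")
    return {stem: sufs for stem, sufs in found.items() if len(sufs) >= min_suffixes}

if __name__ == "__main__":
    words = ["act", "acted", "action", "acts", "the"]
    print(loose_fit_candidates(words, ["ed", "ion", "s"]))
```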

  45. Using MDL to judge a potential stem and potential signature Suppose we find: act, acted, action, acts. We have the suffixes NULL, ed, ion, and s, but not the signature NULL.ed.ion.s Let’s compute cost versus savings of signature NULL.ed.ion.s

  46. Savings • Stem savings: 3 copies of the stem act: that’s 3 x 3 = 9 letters = 40.5 bits (taking 4.5 bits/letter). • Suffix savings: ed, ion, s: 6 letters, another 27 bits. • Total: 67.5 bits.

  47. Cost of NULL.ed.ion.s • A pointer to each suffix: To give a feel for this: Total cost of suffix list: about 30 bits. Cost of pointer to signature: total cost is -- all the stems using it chip in to pay for its cost, though.

  48. • Cost of signature: about 43 bits • Savings: about 67 bits • Slight worsening in the compressed length of these 4 words. So MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
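
The arithmetic behind this decision for the act example can be sketched as below. The 4.5 bits per letter and the signature cost of about 43 bits are the figures quoted on the slides; the slight worsening of the compressed corpus length is not quantified there, so the sketch leaves it out.

```python
BITS_PER_LETTER = 4.5       # figure used on the slides
SIGNATURE_COST_BITS = 43.0  # "about 43 bits", taken from the slides

stem = "act"
words = ["act", "acted", "action", "acts"]
suffixes = ["ed", "ion", "s"]  # non-NULL suffixes no longer spelled out per word

# The stem is stored once instead of once per word.
stem_savings = (len(words) - 1) * len(stem) * BITS_PER_LETTER
suffix_savings = sum(len(s) for s in suffixes) * BITS_PER_LETTER
total_savings = stem_savings + suffix_savings
net_gain = total_savings - SIGNATURE_COST_BITS

if __name__ == "__main__":
    print(stem_savings, suffix_savings, total_savings, net_gain)
```

A positive net gain means the analysis shortens the description, provided the loss in compressed length stays smaller than that gain.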

  49. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

  50. Frequency of an analyzed word • W is analyzed as belonging to signature σ, stem T, and suffix F. • [x] means the count of x’s in the corpus (token count), where [W] is the total number of words. • Actually what we care about is the log of this:
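
The formula on this slide is not preserved in the transcript. As an assumption, and not as the slide's own equation, one decomposition that fits the bracket notation is P(W) = [σ]/[W] × [T]/[σ] × [F in σ]/[σ]: the probability of the signature, times the probability of the stem given the signature, times the probability of the suffix given the signature. The sketch below computes the base-2 log of that assumed product; all argument names are illustrative.

```python
import math

def log2_prob_analyzed_word(n_words, n_sig, n_stem, n_suffix_in_sig):
    """Base-2 log of the ASSUMED product [sig]/[W] * [T]/[sig] * [F in sig]/[sig].

    n_words          token count of the whole corpus, [W]
    n_sig            token count of words belonging to the signature
    n_stem           token count of words with this stem
    n_suffix_in_sig  token count of this suffix within the signature
    """
    p = (n_sig / n_words) * (n_stem / n_sig) * (n_suffix_in_sig / n_sig)
    return math.log2(p)

if __name__ == "__main__":
    # Made-up counts chosen so that every factor is a power of two.
    bits = -log2_prob_analyzed_word(1024, 64, 16, 32)
    print(bits)  # compressed length of this word in bits
```

The negative of this log is the number of bits the word contributes to the compressed length of the corpus.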
