Induction of a Simple Morphology for Highly-Inflecting Languages
{Mathias.Creutz, Krista.Lagus}@hut.fi
nyky+ratkaisu+i+sta+mme • kahvi+n+juo+ja+lle+kin • tietä+isi+mme+kö+hän • open+mind+ed+ness • un+believ+able
Current Themes in Computational Phonology and Morphology, 7th Meeting of the ACL Special Interest Group in Computational Phonology, ACL-2004. Barcelona, 26 July 2004

Goals and challenges
• Learn representations of
  • the smallest meaningful units of language (morphemes)
  • and their interaction
  • in an unsupervised manner from raw text
  • making as general and language-independent assumptions as possible.
• Evaluate
  • against a given gold-standard morphological analysis of word forms
  • first step: learn and evaluate a morpheme segmentation of word forms
  • integrated in NLP applications (speech recognition)
Mathias Creutz

Focus: Agglutinative morphology
• Finnish words often consist of lengthy sequences of morphemes (stems, suffixes, and prefixes):
  • kahvi + n + juo + ja + lle + kin
    (coffee + of + drink + -er + for + also)
  • nyky + ratkaisu + i + sta + mme
    (current + solution + -s + from + our)
  • tietä + isi + mme + kö + hän
    (know + would + we + INTERR + indeed)
• Huge number of different possible word forms
• Important to know the inner structure of words in NLP
• The number of morphemes per word varies greatly

Learning from data
1. MDL model (Creutz & Lagus, 2002) (inspired by work of, e.g., J. Goldsmith)
• Morph lexicon: "invent" a set of distinct strings = morphs
  α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a
• Corpus / word list: pick morphs from the lexicon and place them in a sequence
  α β γ δ β ε θ γ ζ = tä ssä pala peli ssä on tuhat pala a
• Aim at the most concise representation possible

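The idea of a "most concise representation" can be illustrated with a two-part code length: the cost of writing down the morph lexicon plus the cost of writing the corpus as a sequence of lexicon entries. The sketch below is only a toy version and not the coding scheme of Creutz & Lagus (2002): the flat `bits_per_char` letter cost, the function name `description_length`, and the grouping of the slide's morphs into words are all assumptions made for illustration.

```python
import math
from collections import Counter

def description_length(segmented_corpus, bits_per_char=5.0):
    """Toy two-part code length in bits: lexicon cost + corpus cost.

    Assumptions (not from the slides): every letter of a lexicon morph
    costs a flat `bits_per_char`, and every morph token in the corpus
    costs -log2 of its relative frequency.
    """
    tokens = [morph for word in segmented_corpus for morph in word]
    counts = Counter(tokens)
    total = len(tokens)
    # Each distinct morph is spelled out once in the lexicon.
    lexicon_bits = sum(len(morph) * bits_per_char for morph in counts)
    # Each occurrence in the corpus is a pointer to a lexicon entry.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_bits + corpus_bits

# The slide's morph sequence, grouped into words (grouping is assumed),
# compared with the same words left unsegmented.
segmented = [["tä", "ssä"], ["pala", "peli", "ssä"], ["on"], ["tuhat"], ["pala", "a"]]
unsegmented = [["tässä"], ["palapelissä"], ["on"], ["tuhat"], ["palaa"]]

print(description_length(segmented))
print(description_length(unsegmented))
```

Under these assumptions the segmented version is cheaper, because the reused morphs "ssä" and "pala" are stored only once in the lexicon.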
2. Probabilistic formulation (Creutz, 2003) (inspired by work of, e.g., M. R. Brent and M. G. Snover)
• Morph lexicon: "invent" a set of distinct strings = morphs
  α = tä, β = ssä, γ = pala, δ = peli, ε = on, θ = tuhat, ζ = a
• Priors: length prior, frequency prior
• Corpus / word list: pick morphs from the lexicon and place them in a sequence
  α β γ δ β ε θ γ ζ = tä ssä pala peli ssä on tuhat pala a

Reflections on solutions 1 and 2
• "Dumb" text compression algorithms
• Common substrings of words appear as one segment, even when they have compositional structure, e.g.:
  • keskustelussa (keskustel + u + ssa; "discuss + ion in")
  • biggest (bigg + est)
• Rare substrings of words are split, even when they have no compositional structure, e.g.:
  • a + den + auer (Adenauer; German politician)
  • in + s + an + e (in + sane)
• Structural constraints are too weak, e.g., suffixes are recognized at the beginning of words:
  • s + can (scan)

3. Category-learning probabilistic model
• Word structure captured by a regular expression:
  • word = ( prefix* stem suffix* )+
• Morph sequences (words) are generated by a hidden Markov model:
  • Example: # nyky ratkaisu i sta mme #
  • Transition probs, e.g., p(STM | PRE), p(SUF | SUF)
  • Emission probs, e.g., p('nyky' | PRE), p('mme' | SUF)

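As a rough illustration of the two ingredients on this slide, the sketch below checks a tag sequence against the word pattern and scores a tagged morph sequence with transition and emission probabilities. The function names `is_valid_tagging` and `log_prob`, the one-letter tag encoding, and every probability value are invented for the example; none of them come from the authors' model.

```python
import math
import re

# word = ( prefix* stem suffix* )+ over one-letter codes:
# P = prefix, S = stem, X = suffix (the encoding is an assumption).
WORD_PATTERN = re.compile(r"(?:P*SX*)+")
TAG_CODE = {"PRE": "P", "STM": "S", "SUF": "X"}

def is_valid_tagging(tags):
    """True if the tag sequence matches ( prefix* stem suffix* )+."""
    encoded = "".join(TAG_CODE[tag] for tag in tags)
    return WORD_PATTERN.fullmatch(encoded) is not None

def log_prob(morphs, tags, transition, emission):
    """Log-probability of a tagged word under a first-order HMM.

    '#' marks the word boundary at both ends, as on the slide.
    """
    logp = 0.0
    previous = "#"
    for morph, tag in zip(morphs, tags):
        logp += math.log(transition[(previous, tag)])
        logp += math.log(emission[(tag, morph)])
        previous = tag
    return logp + math.log(transition[(previous, "#")])

# Toy probabilities, made up for illustration only.
transition = {("#", "PRE"): 0.2, ("PRE", "STM"): 0.9, ("STM", "SUF"): 0.5,
              ("SUF", "SUF"): 0.4, ("SUF", "#"): 0.6}
emission = {("PRE", "nyky"): 0.1, ("STM", "ratkaisu"): 0.01,
            ("SUF", "i"): 0.2, ("SUF", "sta"): 0.1, ("SUF", "mme"): 0.1}

morphs = ["nyky", "ratkaisu", "i", "sta", "mme"]
tags = ["PRE", "STM", "SUF", "SUF", "SUF"]

print(is_valid_tagging(tags))            # the slide's example word
print(is_valid_tagging(["SUF", "STM"]))  # a suffix cannot start a word
print(log_prob(morphs, tags, transition, emission))
```

Because the pattern requires every group to contain a stem, a tagging such as suffix + stem is rejected, which is the structural constraint the earlier models lacked.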
Category algorithm
1. Start with an existing baseline morph segmentation (Creutz, 2003):
   nyky + rat + kaisu + ista + mme
2. Initialize category membership probs for each morph, e.g., p(PRE | 'nyky'). Assume asymmetries between the categories:

Initialization of category membership probs
• Introduce a noise category for cases where none of the proper classes is likely:
• Distribute remaining probability mass proportionally, e.g.,

Category algorithm (continued)
1. Start with an existing baseline morph segmentation:
   nyky + rat + kaisu + ista + mme
2. Initialize category membership probs for each morph.
3. Tag morphs as prefix, stem, suffix, or noise. Then run EM on taggings:
   nyky + rat + kaisu + ista + mme
4. Split morphs that consist of other known morphs. Then EM:
   nyky + rat + kaisu + i + sta + mme
5. Join noise morphs with their neighbours. Then EM:
   nyky + ratkaisu + i + sta + mme

Experiments
• Algorithms
  • Baseline model (Bayesian formulation)
  • Category-Learning model
  • Goldsmith's "Linguistica" (MDL formulation)
• Data
  • Finnish data sets (CSC + STT)
    • 10 000 words, 50 000 words, 250 000 words, 16 million words
  • English data sets (Brown corpus)
    • 10 000 words, 50 000 words, 250 000 words
[Figure: stems believ, hop, liv, mov, us combined with suffixes e, ed, es, ing]

"Gold standard" used in evaluation
• Morpheme segmentation obtained for Finnish and English words
  • by processing the output of two-level morphology analyzers (FINTWOL and ENGTWOL by Lingsoft, Inc.)
• Some "fuzzy morpheme boundaries" allowed
  • mainly stem-final alternation, treated as a seam or joint that may belong to either the stem or the suffix, e.g.,
    • Windsori + n or Windsor + in; Windsore + i + lla or Windsor + ei + lla (cf. Windsor)
    • invite + s or invit + es; invite or invit + e (cf. invit + ing)
• Compute precision and recall of correctly discovered morpheme boundaries

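Boundary precision and recall can be sketched as follows. This is a strict version that assumes exactly one gold segmentation per word, so it does not implement the "fuzzy morpheme boundaries" described above. The function names and the small proposed/gold dictionaries are illustrative assumptions, not the authors' evaluation code or data.

```python
def boundaries(morphs):
    """Character offsets of the morph boundaries inside one word.

    ['un', 'believ', 'able'] -> {2, 8}
    """
    positions, offset = set(), 0
    for morph in morphs[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def precision_recall(proposed, gold):
    """Boundary precision and recall over all words in `gold`.

    Both arguments map a word to its list of morphs.
    """
    hits = n_proposed = n_gold = 0
    for word, gold_morphs in gold.items():
        p = boundaries(proposed[word])
        g = boundaries(gold_morphs)
        hits += len(p & g)        # boundaries found in both segmentations
        n_proposed += len(p)
        n_gold += len(g)
    precision = hits / n_proposed if n_proposed else 1.0
    recall = hits / n_gold if n_gold else 1.0
    return precision, recall

# Toy data: the gold segmentations follow the title slide, the proposed
# ones are made up to show one missed and one spurious boundary.
gold = {
    "unbelievable": ["un", "believ", "able"],
    "openmindedness": ["open", "mind", "ed", "ness"],
    "scan": ["scan"],
}
proposed = {
    "unbelievable": ["un", "believable"],
    "openmindedness": ["open", "mind", "edness"],
    "scan": ["s", "can"],
}

print(precision_recall(proposed, gold))  # (0.75, 0.6)
```

Three of the four proposed boundaries are correct (precision 0.75), and three of the five gold boundaries are found (recall 0.6).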
Results (evaluated against the gold standard)
[Figure: plotted results for Baseline, Categories, and Linguistica; data points labelled 10k, 250k, and 16M]

Discussion
• The Category algorithm
  • overcomes many of the shortcomings of the Baseline algorithm
    • excessive or too little segmentation
    • suffixes at the beginning of words
  • generalizes more than Linguistica, e.g.,
    • allus + ion + s (Categories) vs. allusions (Linguistica)
    • Dem + i (Categories) vs. Demi (Linguistica)
  • invents its own solutions
    • aihe + e + sta vs. aihe + i + sta ("about [the] topic/-s")
    • phrase, phrase + s, phrase + d

Future directions
• The Category algorithm could be expressed more elegantly
  • not as a post-processing procedure making use of a baseline segmentation
• Segmentation into morphs is useful
  • e.g., n-gram language modeling in speech recognition
• Detection of allomorphy, i.e., segmentation into morphemes, would be even more useful
  • e.g., information retrieval (?)

Public demo
• A demo of the baseline and category-learning algorithms is available on the Internet at http://www.cis.hut.fi/projects/morpho/.
• Test it on your own Finnish or English input!

Search for the optimal segmentation of the words in a corpus
• Randomly shuffle words
• Recursive binary splitting
• Convergence of descr. length? No: repeat. Yes: done.
[Figure: flowchart with example words (opening, openminded, reopened, conferences), intermediate parts (reopen, minded), and morphs (mind, open, re, ed)]
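The loop in this flowchart can be sketched as a greedy search: shuffle the words, re-split each one by recursive binary splitting, and stop once the description length no longer decreases. The code below is a simplified sketch and not the authors' implementation. The cost function (a flat `BITS_PER_CHAR` per lexicon letter plus -log2 relative frequency per morph token), the function names, the `max_epochs` cap, and the toy word list are all assumptions.

```python
import math
import random
from collections import Counter

BITS_PER_CHAR = 5.0  # assumed flat cost per letter in the lexicon

def total_cost(counts):
    """Toy description length in bits for a Counter of morph tokens."""
    total = sum(counts.values())
    lexicon = sum(len(m) * BITS_PER_CHAR for m in counts)
    corpus = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon + corpus

def split_recursively(morph, counts):
    """Keep `morph` whole or take its cheapest binary split, then recurse."""
    best_cost = total_cost(counts + Counter([morph]))
    best_parts = None
    for i in range(1, len(morph)):
        parts = [morph[:i], morph[i:]]
        cost = total_cost(counts + Counter(parts))
        if cost < best_cost:
            best_cost, best_parts = cost, parts
    if best_parts is None:
        counts[morph] += 1  # accept the morph as it is
        return [morph]
    left = split_recursively(best_parts[0], counts)
    right = split_recursively(best_parts[1], counts)
    return left + right

def segment(words, max_epochs=10, seed=0):
    """Shuffle, re-split every word, repeat until the cost stops falling."""
    rng = random.Random(seed)
    segmentation = {word: [word] for word in words}
    counts = Counter(list(segmentation))  # start with whole words
    previous = total_cost(counts)
    for _ in range(max_epochs):
        order = list(segmentation)
        rng.shuffle(order)
        for word in order:
            counts.subtract(segmentation[word])  # remove the old analysis
            counts = +counts                     # drop zero counts
            segmentation[word] = split_recursively(word, counts)
        current = total_cost(counts)
        if previous - current < 1e-9:            # converged
            break
        previous = current
    return segmentation

# Toy word list, partly taken from the slide's example words.
words = ["open", "opening", "openminded", "reopened", "minded", "conferences"]
for word, morphs in segment(words).items():
    print(word, "->", " + ".join(morphs))
```

The result depends on the shuffling order and on the assumed cost, so the segmentations printed here should not be read as the output of the authors' system.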