nyky+ratkaisu+i+sta+mme • kahvi+n+juo+ja+lle+kin • tietä+isi+mme+kö+hän • open+mind+ed+ness • un+believ+able Inducing the Morphological Lexicon of a Natural Language from Unannotated Text { Mathias.Creutz, Krista.Lagus }@hut.fi International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05) Espoo, 17 June 2005
Challenge for NLP: too many words • E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes): • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also) • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our) • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed) • Huge number of different possible word forms • Important to know the inner structure of words • The number of morphemes per word varies widely Mathias Creutz
Goal: Morfessor • Learn representations of • the smallest individually meaningful units of language (morphemes) • and their interaction • in an unsupervised and data-driven manner from raw text • making assumptions that are as general and language-independent as possible.
State of the art • Rule-based systems • accurate, but language-dependent and hard to adapt • Unsupervised word segmentation (e.g., Morfessor Baseline) • sentences can be of different length • context-insensitive, hence poor modeling of syntax: • undersegmentation of frequent strings (“forthepurposeof”) • oversegmentation of rare strings (“in + s + an + e”) • no syntactic / morphotactic constraints (“s + can”)
State of the art (cont’d) • Morphology learning • Beyond segmentation: allomorphy (“foot – feet, goose – geese”) • Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”) • Learning of paradigms (e.g., John Goldsmith’s Linguistica) [Figure: example paradigm with stems believ, hop, liv, mov, us and suffixes e, ed, es, ing] • Very restricted syntax / morphotactics in terms of number of morphemes per word form!
Morfessor with morpheme categories • Lexicon / Grammar dualism • Word structure captured by a regular expression: word = ( prefix* stem suffix* )+ • Morph sequences (words) are generated by a Hidden Markov model: [Figure: HMM for the morph sequence “# over simpl ific ation s #”, with transition probabilities such as P(STM | PRE) and P(SUF | SUF) and emission probabilities such as P(’over’ | PRE) and P(’s’ | SUF)]
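The two ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the category sequence PRE STM SUF SUF SUF for “over simpl ific ation s” and every probability value below are hypothetical placeholders, not numbers from the talk, and the names `TRANSITION`, `EMISSION`, `is_valid_word` and `log_prob` are invented for this example.

```python
import math
import re

# Hypothetical transition probabilities P(next category | previous category).
# "#" is the word boundary. All values are illustrative placeholders.
TRANSITION = {
    ("#", "PRE"): 0.2, ("#", "STM"): 0.8,
    ("PRE", "PRE"): 0.1, ("PRE", "STM"): 0.9,
    ("STM", "STM"): 0.1, ("STM", "SUF"): 0.5,
    ("STM", "PRE"): 0.05, ("STM", "#"): 0.35,
    ("SUF", "SUF"): 0.4, ("SUF", "STM"): 0.05,
    ("SUF", "PRE"): 0.05, ("SUF", "#"): 0.5,
}

# Hypothetical emission probabilities P(morph | category).
EMISSION = {
    ("PRE", "over"): 0.05,
    ("STM", "simpl"): 0.001,
    ("SUF", "ific"): 0.01,
    ("SUF", "ation"): 0.02,
    ("SUF", "s"): 0.1,
}

# word = ( prefix* stem suffix* )+ written over category labels.
WORD_PATTERN = re.compile(r"(?:(?:PRE )*STM (?:SUF )*)+")


def is_valid_word(categories):
    """Check a category sequence against ( prefix* stem suffix* )+."""
    encoded = "".join(category + " " for category in categories)
    return WORD_PATTERN.fullmatch(encoded) is not None


def log_prob(tagged_morphs):
    """Log-probability of a tagged morph sequence, boundaries included."""
    previous = "#"
    total = 0.0
    for morph, category in tagged_morphs:
        total += math.log(TRANSITION[(previous, category)])
        total += math.log(EMISSION[(category, morph)])
        previous = category
    # Close the word with a transition back to the boundary symbol.
    total += math.log(TRANSITION[(previous, "#")])
    return total


WORD = [("over", "PRE"), ("simpl", "STM"), ("ific", "SUF"),
        ("ation", "SUF"), ("s", "SUF")]
print(is_valid_word([category for _, category in WORD]))
print(log_prob(WORD))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long morph sequences.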
Lexicon
Morph (string)   Length   Frequency   Right perplexity   Left perplexity
over             4        14029       136                1
s                1        17259       1                  4618
simpl            5        41          4                  1
...
“Form”: string, length • “Meaning”: frequency, right perplexity, left perplexity
How meaning affects morphotactic role • Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’) • Assume asymmetries between the categories:
How meaning affects role (cont’d) • There is an additional non-morpheme category for cases where none of the proper classes is likely. • Distribute the remaining probability mass proportionally among the proper categories.
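The formulas for these priors are not preserved in the slide text, so the following Python sketch rests on assumptions: graded prefix-, suffix- and stem-likeness are modelled as sigmoids of right perplexity, left perplexity and length respectively, the non-morpheme probability is taken as the product of the complements, and the thresholds and slope are arbitrary. Only the last step, distributing the remaining mass proportionally, is taken directly from the slide.

```python
import math


def sigmoid(x, threshold, slope):
    """Smooth step from 0 to 1 centred on `threshold` (assumed shape)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - threshold)))


def category_priors(right_ppl, left_ppl, length,
                    ppl_threshold=10.0, len_threshold=3.0, slope=1.0):
    """Return P(category | morph) for PRE, STM, SUF and NON.

    Thresholds and slope are hypothetical values chosen for illustration.
    """
    prefix_like = sigmoid(right_ppl, ppl_threshold, slope)
    suffix_like = sigmoid(left_ppl, ppl_threshold, slope)
    stem_like = sigmoid(length, len_threshold, slope)

    # Non-morpheme: none of the proper categories is likely (assumption).
    p_non = (1 - prefix_like) * (1 - stem_like) * (1 - suffix_like)

    # Distribute the remaining probability mass proportionally.
    remaining = 1.0 - p_non
    total = prefix_like + stem_like + suffix_like
    return {
        "PRE": remaining * prefix_like / total,
        "STM": remaining * stem_like / total,
        "SUF": remaining * suffix_like / total,
        "NON": p_non,
    }


# Feature values taken from the lexicon slide.
for morph, features in {"over": (136, 1, 4),
                        "s": (1, 4618, 1),
                        "simpl": (4, 1, 5)}.items():
    priors = category_priors(*features)
    print(morph, max(priors, key=priors.get))
```

With these assumed settings the asymmetry is visible: a morph followed by many different continuations leans towards prefix, one preceded by many different contexts leans towards suffix, and a long morph leans towards stem.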
Morfessor Categories-MAP • Maximum a posteriori optimization: balance accuracy of representation of data against size of lexicon • Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically) [Figure: lexicon entries over (14029, 136, 1, 4), s (17259, 1, 4618, 1) and simpl (41, 4, 1, 5) shown next to the HMM for “# over simpl ific ation s #” with P(STM | PRE), P(SUF | SUF), P(’over’ | PRE), P(’s’ | SUF)]
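The balance described here is the standard maximum a posteriori decomposition by Bayes' rule. The notation below is generic and not copied from the slides: the prior term penalizes large lexicons and the likelihood term rewards an accurate representation of the corpus.

```latex
\[
\hat{\text{lexicon}}
  = \arg\max_{\text{lexicon}} P(\text{lexicon} \mid \text{corpus})
  = \arg\max_{\text{lexicon}}
      \underbrace{P(\text{corpus} \mid \text{lexicon})}_{\text{accuracy of representation}}
      \cdot
      \underbrace{P(\text{lexicon})}_{\text{size of lexicon}}
\]
```

Dropping the prior term leaves a pure maximum-likelihood criterion, which is why the older Categories-ML version has to control the lexicon heuristically.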
Over- and undersegmentation still a problem? • Rare strings are split into smaller parts (e.g., morgan + a) • Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands) • Probability of adding an entry to the lexicon vs. probability of sequences in the corpus: “# hands #” vs. “# hand s #”
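Why frequency drives this behaviour can be seen in a toy two-part cost. The sketch below is not the Categories-MAP model: it swaps in a much simpler code-length calculation (a fixed number of bits per letter for lexicon entries plus the negative log-probability of the morph tokens), and all counts are invented for illustration.

```python
import math


def lexicon_cost(morphs, bits_per_char=5.0):
    """Bits needed to spell out each entry plus an end marker (assumption)."""
    return sum((len(morph) + 1) * bits_per_char for morph in morphs)


def corpus_cost(token_counts):
    """Bits needed to encode the tokens under their relative frequencies."""
    total = sum(token_counts.values())
    return -sum(count * math.log2(count / total)
                for count in token_counts.values())


def cost_unsplit(n_hands, n_hand=50, n_s=200):
    """Total cost when "hands" gets its own lexicon entry."""
    counts = {"hand": n_hand, "s": n_s, "hands": n_hands}
    return lexicon_cost(counts) + corpus_cost(counts)


def cost_split(n_hands, n_hand=50, n_s=200):
    """Total cost when every "hands" is encoded as "hand" + "s"."""
    counts = {"hand": n_hand + n_hands, "s": n_s + n_hands}
    return lexicon_cost(counts) + corpus_cost(counts)


for n in (1, 1000):
    cheaper = "unsplit" if cost_unsplit(n) < cost_split(n) else "split"
    print(n, cheaper)
```

In this toy setting a string seen once is cheaper to split, because a new lexicon entry costs more than it saves, while a string seen a thousand times is cheaper to store whole, because one token per occurrence beats two.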
Solution: Hierarchical structures in lexicon • Make morphs consist of submorphs. • Expand the tree when performing morpheme segmentation. • Do not expand morphs consisting of non-morphemes. [Figure: tree for oppositio + kansanedustaja: oppositio splits into op + positio; kansanedustaja splits into kansan + edustaja, then kansan into kansa + n and edustaja into edusta + ja; nodes are marked as non-morpheme, stem or suffix]
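The expansion rule in the three bullets can be sketched as a small recursive function. The `Morph` class and the category labels assigned to each node below are assumptions made for this example (the slide text does not preserve which node carries which label), and a morph is treated as "consisting of non-morphemes" if any of its submorphs is tagged `NON`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Morph:
    """A lexicon entry that may be built from submorphs (hypothetical type)."""
    text: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: Tuple["Morph", ...] = ()


def expand(morph: Morph) -> List[str]:
    """Expand a morph into its finest segmentation that avoids non-morphemes."""
    has_non_morpheme = any(child.category == "NON" for child in morph.children)
    if not morph.children or has_non_morpheme:
        # Leaf, or splitting would expose non-morphemes: keep it whole.
        return [morph.text]
    segments: List[str] = []
    for child in morph.children:
        segments.extend(expand(child))
    return segments


# Category labels here are assumed for illustration.
oppositio = Morph("oppositio", "STM",
                  (Morph("op", "NON"), Morph("positio", "NON")))
kansanedustaja = Morph("kansanedustaja", "STM", (
    Morph("kansan", "STM", (Morph("kansa", "STM"), Morph("n", "SUF"))),
    Morph("edustaja", "STM", (Morph("edusta", "STM"), Morph("ja", "SUF"))),
))

example = expand(oppositio) + expand(kansanedustaja)
print(" + ".join(example))
```

Keeping the full tree in the lexicon means frequent strings can still be stored as single entries while their inner structure remains recoverable at segmentation time.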
Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard) • Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation (Hutmegs) • Covers • 1.4 million Finnish word forms • 120 000 English word forms • Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.
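The slides do not state which metric is used for this comparison. One common choice, assumed here purely for illustration, is precision and recall over morpheme boundary positions within a word; the function names below are invented for this sketch.

```python
def boundaries(segmentation):
    """Return the character offsets at which a word is split."""
    positions = set()
    offset = 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(proposed, gold):
    """Boundary precision and recall for one word.

    An empty boundary set counts as perfect (1.0) for that measure,
    which is a convention chosen for this sketch.
    """
    proposed_set = boundaries(proposed)
    gold_set = boundaries(gold)
    correct = len(proposed_set & gold_set)
    precision = correct / len(proposed_set) if proposed_set else 1.0
    recall = correct / len(gold_set) if gold_set else 1.0
    return precision, recall


# Undersegmentation lowers recall; oversegmentation lowers precision.
print(precision_recall(["hands"], ["hand", "s"]))
print(precision_recall(["in", "s", "an", "e"], ["insane"]))
print(precision_recall(["un", "believ", "able"], ["un", "believ", "able"]))
```

A corpus-level score would accumulate the boundary counts over all evaluated word forms before dividing, rather than averaging per-word values.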
Evaluation against the Hutmegs Gold Standard [Figure: results for Finnish and English; methods compared: Categories-MAP, heuristic (Categories-ML), context-insensitive (Baseline), paradigms (Linguistica)]
Example segmentations
Discussion • Possibility to extend the model • rudimentary features used for “meaning” • more fine-grained categories • beyond concatenative phenomena (e.g., goose – geese) • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful) • Already useful in applications • automatic speech recognition (Finnish, Turkish)
Morpho project page and demo: http://www.cis.hut.fi/projects/morpho/