Inducing the Morphological Lexicon of a Natural Language from Unannotated Text


  1. Inducing the Morphological Lexicon of a Natural Language from Unannotated Text
     { Mathias.Creutz, Krista.Lagus }@hut.fi
     International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), Espoo, 17 June 2005
     • nyky + ratkaisu + i + sta + mme
     • kahvi + n + juo + ja + lle + kin
     • tietä + isi + mme + kö + hän
     • open + mind + ed + ness
     • un + believ + able

  2. Challenge for NLP: too many words
     • E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
        • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
        • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
        • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
     • Huge number of different possible word forms
     • Important to know the inner structure of words
     • The number of morphemes per word varies greatly
     Mathias Creutz

  3. Goal: Morfessor
     • Learn representations of
        • the smallest individually meaningful units of language (morphemes)
        • and their interaction
     • in an unsupervised and data-driven manner from raw text,
     • making assumptions that are as general and language-independent as possible.

  4. State of the art
     • Rule-based systems
        • accurate, language-dependent, adaptivity issues
     • Unsupervised word segmentation
        • sentences can be of different length
        • context-insensitive (e.g., Morfessor Baseline) → poor modeling of syntax:
           • undersegmentation of frequent strings (“forthepurposeof”)
           • oversegmentation of rare strings (“in + s + an + e”)
           • no syntactic / morphotactic constraints (“s + can”)

  5. State of the art (cont’d)
     • Morphology learning
        • Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
        • Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
        • Learning of paradigms (e.g., John Goldsmith’s Linguistica)
          [Figure: stems believ, hop, liv, mov, us combined with suffixes e, ed, es, ing]
     • Very restricted syntax / morphotactics in terms of the number of morphemes per word form!

  6. Morfessor with morpheme categories
     • Lexicon / Grammar dualism
     • Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
     • Morph sequences (words) are generated by a Hidden Markov model:
       [Figure: the morph sequence "# over simpl ific ation s #", annotated with transition probabilities such as P(STM | PRE) and P(SUF | SUF), and emission probabilities such as P('over' | PRE) and P('s' | SUF)]
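The word model on this slide can be sketched in a few lines of Python. This is only an illustration: the probability values below are invented, the category codes `p`, `s`, `x` are a convenience for the regular expression, and the function names are hypothetical rather than part of Morfessor.

```python
import math
import re

# Hypothetical probabilities, chosen only for illustration.
TRANSITION = {
    ("#", "PRE"): 0.2, ("#", "STM"): 0.8,
    ("PRE", "PRE"): 0.1, ("PRE", "STM"): 0.9,
    ("STM", "STM"): 0.1, ("STM", "SUF"): 0.5, ("STM", "#"): 0.4,
    ("SUF", "SUF"): 0.4, ("SUF", "STM"): 0.1, ("SUF", "#"): 0.5,
}
EMISSION = {
    ("PRE", "over"): 0.01, ("STM", "simpl"): 0.001,
    ("SUF", "ific"): 0.005, ("SUF", "ation"): 0.02, ("SUF", "s"): 0.1,
}

# One-letter codes so that word = (prefix* stem suffix*)+ becomes a regex.
CODE = {"PRE": "p", "STM": "s", "SUF": "x"}
WORD_PATTERN = re.compile(r"(?:p*sx*)+")


def is_valid_word(categories):
    """Check a category sequence against (prefix* stem suffix*)+."""
    code = "".join(CODE[c] for c in categories)
    return WORD_PATTERN.fullmatch(code) is not None


def log_prob(morphs, categories):
    """Log-probability of a tagged morph sequence under the toy HMM."""
    if len(morphs) != len(categories):
        raise ValueError("one category per morph is required")
    logp = 0.0
    prev = "#"  # word boundary
    for morph, cat in zip(morphs, categories):
        logp += math.log(TRANSITION[(prev, cat)])
        logp += math.log(EMISSION[(cat, morph)])
        prev = cat
    return logp + math.log(TRANSITION[(prev, "#")])


morphs = ["over", "simpl", "ific", "ation", "s"]
cats = ["PRE", "STM", "SUF", "SUF", "SUF"]
print(is_valid_word(cats), log_prob(morphs, cats))
```

The regular expression rejects sequences such as a suffix before any stem, which is the kind of morphotactic constraint the context-insensitive models lack.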

  7. Lexicon
     Each morph is stored with its “Form” (String, Length) and its “Meaning” (Frequency, Right perplexity, Left perplexity):

     String   Length   Frequency   Right perplexity   Left perplexity
     over     4        14029       136                1
     s        1        17259       1                  4618
     simpl    5        41          4                  1
     ...
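The slide does not define how the perplexities are computed. A minimal sketch, assuming the standard definition (the exponential of the entropy of the morphs observed immediately to the right, or to the left, of a given morph), could look like this; the function name is hypothetical.

```python
import math
from collections import Counter


def perplexity(context_morphs):
    """Perplexity of the empirical distribution of neighbouring morphs.

    `context_morphs` lists, with repetitions, the morphs seen directly
    after (right perplexity) or directly before (left perplexity) a morph.
    """
    if not context_morphs:
        raise ValueError("at least one observed context is required")
    counts = Counter(context_morphs)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)


# A morph always followed by the same neighbour is fully predictable.
print(perplexity(["#", "#", "#"]))
# A morph followed by many different neighbours behaves like a prefix.
print(perplexity(["simpl", "load", "come", "look"]))
```

Under this reading, a high right perplexity means many different morphs can follow (typical of prefixes such as "over"), and a high left perplexity means many can precede (typical of suffixes such as "s").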

  8. How meaning affects morphotactic role
     • Prior probability distributions for category membership of a morph, e.g., P(PRE | 'over')
     • Assume asymmetries between the categories: [equations not preserved in the transcript]

  9. How meaning affects role (cont’d)
     • There is an additional non-morpheme category for cases where none of the proper classes is likely: [equation not preserved in the transcript]
     • Distribute remaining probability mass proportionally, e.g., [equation not preserved in the transcript]
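The equations on these two slides did not survive in the transcript, so the following is only a sketch of one way the idea could be realised. It assumes sigmoid-shaped "prefix-like", "suffix-like" and "stem-like" scores driven by right perplexity, left perplexity and length, treats the non-morpheme probability as the chance that none of the three applies, and shares the remaining mass in proportion to the scores raised to a power `q`. The functional form, the parameters `a`, `b_ppl`, `b_len`, `q` and their values are all assumptions made for illustration.

```python
import math


def sigmoid(x, a, b):
    """Smooth threshold: close to 0 well below b, close to 1 well above b."""
    return 1.0 / (1.0 + math.exp(-a * (x - b)))


def category_priors(right_ppl, left_ppl, length,
                    a=1.0, b_ppl=10.0, b_len=3.0, q=2.0):
    """Toy P(category | morph) from the morph's 'meaning' and 'form'."""
    prefix_like = sigmoid(right_ppl, a, b_ppl)  # many different right neighbours
    suffix_like = sigmoid(left_ppl, a, b_ppl)   # many different left neighbours
    stem_like = sigmoid(length, a, b_len)       # long strings look like stems
    # Non-morpheme: none of the proper classes is likely.
    p_non = (1 - prefix_like) * (1 - suffix_like) * (1 - stem_like)
    # Distribute the remaining mass proportionally.
    weights = {"PRE": prefix_like ** q, "STM": stem_like ** q,
               "SUF": suffix_like ** q}
    norm = sum(weights.values())
    priors = {cat: (1 - p_non) * w / norm for cat, w in weights.items()}
    priors["NON"] = p_non
    return priors


# Values taken from the lexicon slide: (right perplexity, left perplexity, length)
print(category_priors(136, 1, 4))    # "over"
print(category_priors(1, 4618, 1))   # "s"
```

With these illustrative settings "over" comes out as most likely a prefix and "s" as most likely a suffix, while a short morph with low perplexity on both sides ends up mostly in the non-morpheme category.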

  10. Morfessor Categories-MAP
     • Maximum a posteriori optimization: balance accuracy of representation of data against size of lexicon
     • Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)
     [Figure: lexicon entries over (14029, 136, 1, 4), s (17259, 1, 4618, 1), simpl (41, 4, 1, 5), ... next to the HMM sequence "# over simpl ific ation s #" with P(STM | PRE), P(SUF | SUF), P('over' | PRE), P('s' | SUF)]
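The balance described on this slide is the usual maximum a posteriori trade-off. Written out with Bayes' rule, in notation that is an assumption here rather than the slide's own:

```latex
\hat{\mathcal{L}}
  = \arg\max_{\mathcal{L}} P(\mathcal{L} \mid \text{corpus})
  = \arg\max_{\mathcal{L}} \; P(\text{corpus} \mid \mathcal{L}) \, P(\mathcal{L})
```

Here $P(\text{corpus} \mid \mathcal{L})$ rewards an accurate representation of the data and the prior $P(\mathcal{L})$ penalises a large lexicon. A maximum-likelihood version drops the prior term, which is why the lexicon size then has to be controlled by other means.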

  11. Over- and undersegmentation still a problem?
     • Rare strings are split into smaller parts (e.g., morgan + a)
     • Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)
     • Probability of adding an entry to the lexicon: [equation not preserved in the transcript]
     • Probability of sequences in the corpus: "# hands #" vs. "# hand s #"
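Why a frequent string such as "hands" tends to stay unsplit can be seen in a toy cost calculation. The sketch below is a deliberate simplification and not the Categories-MAP model: it assumes a flat cost per character for every lexicon entry and a unigram code for the corpus, with costs measured in bits. All names and the tiny background corpus are hypothetical.

```python
import math
from collections import Counter

# Assumption: 26 letters plus an end-of-morph marker, uniformly coded.
BITS_PER_CHAR = math.log2(27)


def lexicon_cost(morphs):
    """Bits needed to spell out every distinct morph once."""
    return sum((len(m) + 1) * BITS_PER_CHAR for m in set(morphs))


def corpus_cost(tokens):
    """Bits needed to encode the morph tokens with a unigram model."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(n * math.log2(n / total) for n in counts.values())


def total_cost(segmented_words):
    tokens = [m for word in segmented_words for m in word]
    return lexicon_cost(tokens) + corpus_cost(tokens)


def compare(n):
    """Cost of keeping 'hands' whole vs. splitting it, for n occurrences."""
    background = [["hand"], ["cat", "s"]]
    unsplit = total_cost(background + [["hands"]] * n)
    split = total_cost(background + [["hand", "s"]] * n)
    return unsplit, split


for n in (1, 100):
    unsplit, split = compare(n)
    print(n, round(unsplit, 1), round(split, 1))
```

In this toy setting the split analysis is cheaper when "hands" occurs once, because the lexicon entry is saved, but the unsplit entry wins at 100 occurrences, because one token per occurrence is cheaper than two.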

  12. Solution: Hierarchical structures in lexicon
     • Make morphs consist of submorphs.
     • Expand the tree when performing morpheme segmentation.
     • Do not expand morphs consisting of non-morphemes.
     [Figure: tree for oppositio + kansanedustaja. oppositio → op + positio; kansanedustaja → kansan + edustaja; kansan → kansa + n; edustaja → edusta + ja. Node types in the legend: non-morpheme, stem, suffix]
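The expansion rule can be sketched as a small recursive function. The tree follows the example on the slide, but the category assigned to each node is an assumption, since the transcript keeps only the legend and not which node carries which label; the `Morph` class and `expand` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: Optional[Tuple["Morph", "Morph"]] = None


def expand(morph: Morph) -> List[str]:
    """Expand a morph into its finest segmentation without non-morphemes."""
    if morph.children is None:
        return [morph.string]
    # Do not expand morphs that consist of non-morphemes.
    if any(child.category == "NON" for child in morph.children):
        return [morph.string]
    segments: List[str] = []
    for child in morph.children:
        segments.extend(expand(child))
    return segments


# Assumed categories for the slide's example.
oppositio = Morph("oppositio", "STM",
                  (Morph("op", "NON"), Morph("positio", "STM")))
kansan = Morph("kansan", "STM", (Morph("kansa", "STM"), Morph("n", "SUF")))
edustaja = Morph("edustaja", "STM", (Morph("edusta", "STM"), Morph("ja", "SUF")))
kansanedustaja = Morph("kansanedustaja", "STM", (kansan, edustaja))
word = Morph("oppositiokansanedustaja", "STM", (oppositio, kansanedustaja))

print(" + ".join(expand(word)))
```

Keeping the full tree in the lexicon means a frequent string can be stored as one entry and still be split into its parts when a segmentation is requested.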

  13. Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)
     • Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs
     • Covers
        • 1.4 million Finnish word forms
        • 120 000 English word forms
     • Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.
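The transcript does not say which measure is used for the comparison. One common choice, assumed here purely for illustration, is precision and recall of the morph boundary positions. The sketch takes plain lists of morphs as input and does not model the actual Hutmegs file format; the function names are hypothetical.

```python
def boundaries(segmentation):
    """Character offsets at which a word is split."""
    positions = set()
    offset = 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_scores(predicted, gold):
    """Precision and recall of predicted boundaries over a list of words."""
    hits = n_predicted = n_gold = 0
    for pred_seg, gold_seg in zip(predicted, gold):
        if "".join(pred_seg) != "".join(gold_seg):
            raise ValueError("segmentations must cover the same word")
        pred_b, gold_b = boundaries(pred_seg), boundaries(gold_seg)
        hits += len(pred_b & gold_b)
        n_predicted += len(pred_b)
        n_gold += len(gold_b)
    precision = hits / n_predicted if n_predicted else 1.0
    recall = hits / n_gold if n_gold else 1.0
    return precision, recall


predicted = [["un", "believable"], ["hand", "s"]]
gold = [["un", "believ", "able"], ["hand", "s"]]
print(boundary_scores(predicted, gold))
```

In this example every predicted boundary is correct, but the boundary in "believ + able" is missed, so precision is perfect while recall is two out of three.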

  14. Evaluation against the Hutmegs Gold Standard
     [Figure: two panels, Finnish and English, comparing Categories-MAP, Heuristic (Categories-ML), Ctxt-insens. (Baseline) and Paradigms (Linguistica)]

  15. Example segmentations

  16. Discussion
     • Possibility to extend the model
        • only rudimentary features are used for “meaning”
        • more fine-grained categories
        • beyond concatenative phenomena (e.g., goose – geese)
        • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
     • Already useful in applications
        • automatic speech recognition (Finnish, Turkish)

  17. Morpho project page: http://www.cis.hut.fi/projects/morpho/

  18. Demo 6: http://www.cis.hut.fi/projects/morpho/

  19. Demo 7
