Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

nyky+ratkaisu+i+sta+mme • kahvi+n+juo+ja+lle+kin • tietä+isi+mme+kö+hän • open+mind+ed+ness • un+believ+able

{ Mathias.Creutz, Krista.Lagus }@hut.fi
International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), Espoo, 17 June 2005

Challenge for NLP: too many words
• E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes, and prefixes):
  • kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
  • nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
  • tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)
• Huge number of different possible word forms
• Important to know the inner structure of words
• The number of morphemes per word varies greatly
Mathias Creutz

Goal
• Morfessor: learn representations of
  • the smallest individually meaningful units of language (morphemes)
  • and their interaction
• in an unsupervised and data-driven manner from raw text,
• making assumptions that are as general and language-independent as possible.

State of the art
• Rule-based systems
  • accurate, language-dependent, adaptivity issues
• Unsupervised word segmentation (e.g., Morfessor Baseline)
  • sentences can be of different length
  • context-insensitive, hence poor modeling of syntax:
    • undersegmentation of frequent strings (“forthepurposeof”)
    • oversegmentation of rare strings (“in + s + an + e”)
    • no syntactic / morphotactic constraints (“s + can”)

State of the art (cont’d)
• Morphology learning
  • Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
  • Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
  • Learning of paradigms (e.g., John Goldsmith’s Linguistica)
[Figure: example paradigm linking the stems believ, hop, liv, mov, us to the suffixes e, ed, es, ing]
• Very restricted syntax / morphotactics in terms of number of morphemes per word form!

Morfessor with morpheme categories
• Lexicon / Grammar dualism
• Word structure captured by a regular expression: word = ( prefix* stem suffix* )+
• Morph sequences (words) are generated by a Hidden Markov model:
[Figure: HMM generating # over simpl ific ation s #, with transition probabilities such as P(STM | PRE) and P(SUF | SUF) and emission probabilities such as P(’over’ | PRE) and P(’s’ | SUF)]
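The word-structure regular expression and the HMM over morph categories can be illustrated with a small sketch. Everything numeric here is a hypothetical toy value chosen for illustration, and the single-letter encoding of categories is an assumption made only so that the regular expression can be checked with Python's `re` module; none of it is taken from Morfessor itself.

```python
import math
import re

# Word structure from the slide: word = ( prefix* stem suffix* )+
# Categories are encoded as single letters: P = prefix, S = stem, X = suffix.
WORD_PATTERN = re.compile(r"(?:P*SX*)+")
CODE = {"PRE": "P", "STM": "S", "SUF": "X"}

# Hypothetical toy parameters; "#" marks the word boundary.
TRANSITION = {
    ("#", "PRE"): 0.2, ("#", "STM"): 0.8,
    ("PRE", "PRE"): 0.1, ("PRE", "STM"): 0.9,
    ("STM", "STM"): 0.1, ("STM", "SUF"): 0.6, ("STM", "#"): 0.3,
    ("SUF", "SUF"): 0.4, ("SUF", "STM"): 0.1, ("SUF", "#"): 0.5,
}
EMISSION = {
    ("PRE", "over"): 0.1,
    ("STM", "simpl"): 0.01,
    ("SUF", "ific"): 0.02,
    ("SUF", "ation"): 0.05,
    ("SUF", "s"): 0.3,
}


def is_valid_word(tags):
    """Check a category sequence against ( prefix* stem suffix* )+."""
    return WORD_PATTERN.fullmatch("".join(CODE[t] for t in tags)) is not None


def word_log_prob(morphs, tags):
    """Log-probability of a tagged morph sequence under the toy HMM."""
    if not is_valid_word(tags):
        raise ValueError("category sequence violates the word structure")
    total, previous = 0.0, "#"
    for morph, tag in zip(morphs, tags):
        total += math.log(TRANSITION[(previous, tag)])  # transition prob
        total += math.log(EMISSION[(tag, morph)])       # emission prob
        previous = tag
    return total + math.log(TRANSITION[(previous, "#")])  # end of word


morphs = ["over", "simpl", "ific", "ation", "s"]
tags = ["PRE", "STM", "SUF", "SUF", "SUF"]
print(word_log_prob(morphs, tags))
```

Because the pattern requires at least one stem per group, a sequence such as suffix followed by stem, or a lone prefix, is rejected before any probability is computed.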

Lexicon (column groups on the slide: “Form” and “Meaning”)

String   Length   Frequency   Right perplexity   Left perplexity
over     4        14029       136                1
s        1        17259       1                  4618
simpl    5        41          4                  1
...
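To make the perplexity features concrete, here is a minimal sketch of how right and left perplexity could be estimated from a set of segmented words. It assumes that the perplexity of a morph is the exponential of the entropy of the distribution of its immediate neighbors, and that a word boundary (`#`) counts as a neighbor; both are assumptions for illustration, and the function name `context_perplexity` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_perplexity(segmented_words, direction="right"):
    """Perplexity of the neighbor distribution of each morph.

    direction="right" looks at the morph that follows,
    direction="left" at the morph that precedes.
    """
    neighbors = defaultdict(Counter)
    for morphs in segmented_words:
        sequence = list(morphs) if direction == "right" else list(reversed(morphs))
        sequence.append("#")  # the word boundary counts as a neighbor
        for current, following in zip(sequence, sequence[1:]):
            neighbors[current][following] += 1

    perplexity = {}
    for morph, counts in neighbors.items():
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        perplexity[morph] = math.exp(entropy)
    return perplexity


toy_words = [
    ["over", "simpl"], ["over", "load"], ["over", "come"],
    ["load", "s"], ["come", "s"],
]
right = context_perplexity(toy_words, "right")
left = context_perplexity(toy_words, "left")
print(right["over"], left["s"])
```

In this toy data "over" is followed by three different morphs and never preceded by anything, so its right perplexity is high and its left perplexity is 1, which mirrors the prefix-like profile of "over" in the lexicon. The suffix "s" shows the opposite profile.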

How meaning affects morphotactic role
• Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)
• Assume asymmetries between the categories: [figure not preserved]

How meaning affects role (cont’d)
• There is an additional non-morpheme category for cases where none of the proper classes is likely: [formula not preserved]
• Distribute remaining probability mass proportionally, e.g., [formula not preserved]
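A minimal sketch of how such category priors could be computed from the lexicon features follows. It assumes sigmoid-shaped "likeness" scores (prefix-likeness from right perplexity, suffix-likeness from left perplexity, stem-likeness from length), a non-morpheme probability equal to the product of the complements, and proportional sharing of the rest based on squared scores. The functional form, the exponent `q`, and the parameters `a`, `b`, `c`, `d` are all assumptions for illustration, not values taken from the slides.

```python
import math


def sigmoid(x, slope, threshold):
    """Smooth step from 0 to 1 centered at `threshold`."""
    return 1.0 / (1.0 + math.exp(-slope * (x - threshold)))


def category_priors(right_ppl, left_ppl, length, a=1.0, b=10.0, c=1.0, d=3.0, q=2):
    """Return P(category | morph) for PRE, STM, SUF and NON (non-morpheme)."""
    likeness = {
        "PRE": sigmoid(right_ppl, a, b),  # many different right contexts
        "SUF": sigmoid(left_ppl, a, b),   # many different left contexts
        "STM": sigmoid(length, c, d),     # stems tend to be longer
    }
    # Non-morpheme: none of the proper classes is likely.
    p_non = math.prod(1.0 - v for v in likeness.values())
    # Distribute the remaining mass proportionally to likeness ** q.
    norm = sum(v ** q for v in likeness.values())
    priors = {cat: (1.0 - p_non) * v ** q / norm for cat, v in likeness.items()}
    priors["NON"] = p_non
    return priors


# Feature values from the lexicon slide: (right perplexity, left perplexity, length)
over = category_priors(136, 1, 4)
s = category_priors(1, 4618, 1)
simpl = category_priors(4, 1, 5)
for name, priors in [("over", over), ("s", s), ("simpl", simpl)]:
    print(name, max(priors, key=priors.get))
```

With these toy parameters the three example morphs come out as most likely prefix, suffix, and stem respectively, which is the kind of asymmetry the slide describes.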

Morfessor Categories-MAP
• Maximum a posteriori optimization
• Balance accuracy of representation of data against size of lexicon
• Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)
[Figure: lexicon entries (over: 14029, 136, 1, 4; s: 17259, 1, 4618, 1; simpl: 41, 4, 1, 5) shown next to the HMM generating # over simpl ific ation s #, with P(STM | PRE), P(SUF | SUF), P(’over’ | PRE), P(’s’ | SUF)]

Over- and undersegmentation still a problem?
• Probability of adding an entry to the lexicon
• Probability of sequences in the corpus: # hands # vs. # hand s #
• Rare strings are split into smaller parts (e.g., morgan + a)
• Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)
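The trade-off between lexicon cost and corpus cost can be illustrated with a deliberately simplified two-part code length. This is not the Morfessor prior: it assumes a flat per-letter cost for lexicon entries, a unigram model over morph tokens for the corpus, and a hypothetical alphabet size of 27 symbols.

```python
import math
from collections import Counter

ALPHABET_SIZE = 27  # hypothetical: 26 letters plus an end-of-morph symbol


def lexicon_cost(lexicon):
    """Bits needed to spell out every lexicon entry, letter by letter."""
    bits_per_letter = math.log2(ALPHABET_SIZE)
    return sum((len(morph) + 1) * bits_per_letter for morph in lexicon)


def corpus_cost(segmented_corpus):
    """Bits needed to encode the corpus as a sequence of morph tokens."""
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def total_cost(segmented_corpus):
    """Two-part cost: -log P(lexicon) - log P(corpus | lexicon), in bits."""
    lexicon = {m for word in segmented_corpus for m in word}
    return lexicon_cost(lexicon) + corpus_cost(segmented_corpus)


# "hands" seen once: splitting it saves a lexicon entry.
rare_unsplit = [["hand"], ["hands"]]
rare_split = [["hand"], ["hand", "s"]]

# "hands" seen 1000 times: every split costs extra tokens in the corpus.
frequent_unsplit = [["hand"]] + [["hands"]] * 1000
frequent_split = [["hand"]] + [["hand", "s"]] * 1000

print(total_cost(rare_unsplit), total_cost(rare_split))
print(total_cost(frequent_unsplit), total_cost(frequent_split))
```

Under these assumptions the split analysis is cheaper for the rare form and the unsplit analysis is cheaper for the frequent form, which is the behavior the slide describes: rare strings get split and frequent strings stay whole.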

Solution: Hierarchical structures in lexicon
• Make morphs consist of submorphs.
• Expand the tree when performing morpheme segmentation.
• Do not expand morphs consisting of non-morphemes.
[Figure: tree for oppositio + kansanedustaja. oppositio splits into op + positio; kansanedustaja splits into kansan + edustaja, which split further into kansa + n and edusta + ja. Nodes are labeled as non-morpheme, stem, or suffix.]
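The expansion rule can be sketched as a small recursive traversal. The `Morph` class is hypothetical, and the category assigned to each node below is an assumption for illustration (the transcript preserves only the legend of the figure, not the label of each node). "Consisting of non-morphemes" is read here as "having at least one non-morpheme submorph".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morph:
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: Optional[Tuple["Morph", "Morph"]] = None


def expand(morph: Morph) -> List[str]:
    """Expand a morph into its finest segmentation without non-morphemes."""
    if morph.children is None:
        return [morph.string]
    # Do not expand a morph whose submorphs include a non-morpheme.
    if any(child.category == "NON" for child in morph.children):
        return [morph.string]
    segments: List[str] = []
    for child in morph.children:
        segments.extend(expand(child))
    return segments


# Node categories are assumed for illustration.
oppositio = Morph("oppositio", "STM", (Morph("op", "NON"), Morph("positio", "STM")))
kansan = Morph("kansan", "STM", (Morph("kansa", "STM"), Morph("n", "SUF")))
edustaja = Morph("edustaja", "STM", (Morph("edusta", "STM"), Morph("ja", "SUF")))
kansanedustaja = Morph("kansanedustaja", "STM", (kansan, edustaja))
word = Morph("oppositiokansanedustaja", "STM", (oppositio, kansanedustaja))

print(" + ".join(expand(word)))
```

Keeping the unexpanded parent in the lexicon is what lets a frequent string stay a single entry while its inner structure remains recoverable at segmentation time.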

Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)
• Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation (Hutmegs)
• Covers
  • 1.4 million Finnish word forms
  • 120 000 English word forms
• Publicly available and described in the technical report: M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.
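One common way to compare a proposed segmentation with a gold standard is to score the positions of morpheme boundaries. The sketch below assumes boundary-level precision, recall, and F-measure; the transcript does not state which measure was used, and the function names and toy data are hypothetical.

```python
def boundaries(segmentation):
    """Character offsets of the morpheme boundaries inside one word."""
    positions, offset = set(), 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_scores(predicted, gold):
    """Precision, recall and F-measure over all morpheme boundaries."""
    hits = proposed = wanted = 0
    for word, gold_segmentation in gold.items():
        p = boundaries(predicted[word])
        g = boundaries(gold_segmentation)
        hits += len(p & g)
        proposed += len(p)
        wanted += len(g)
    precision = hits / proposed if proposed else 1.0
    recall = hits / wanted if wanted else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = {
    "hands": ["hand", "s"],
    "openmindedness": ["open", "mind", "ed", "ness"],
}
predicted = {
    "hands": ["han", "ds"],                      # boundary in the wrong place
    "openmindedness": ["open", "minded", "ness"],  # one boundary missed
}
precision, recall, f_measure = boundary_scores(predicted, gold)
print(precision, recall, f_measure)
```

Oversegmentation shows up as low precision and undersegmentation as low recall, so the two failure modes discussed earlier can be told apart in the scores.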

Evaluation against the Hutmegs Gold Standard
[Figure: results for Finnish and English, comparing Categories-MAP, Heuristic (Categories-ML), Context-insensitive (Baseline), and Paradigms (Linguistica)]

Example segmentations

Discussion
• Possibility to extend the model
  • rudimentary features used for “meaning”
  • more fine-grained categories
  • beyond concatenative phenomena (e.g., goose – geese)
  • allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)
• Already useful in applications
  • automatic speech recognition (Finnish, Turkish)

Morpho project page: http://www.cis.hut.fi/projects/morpho/

Demo 6 (http://www.cis.hut.fi/projects/morpho/)

Demo 7
