
Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis


Presentation Transcript


  1. Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad Ben-Gurion University ACL 2008, Columbus, Ohio

  2. Unknown Words - English • The draje of the tagement starts rikking with Befa. • [Diagram: morphological and syntactic analysis maps each unknown word to a weighted set of analyses, analysis1 prob1 … analysisn probn]

  3. Unknown Words - English • The draje of the tagement starts rikking with Befa. • Morphology • tagement, rikking, Befa • Syntax • The draje of the tagement starts rikking with Befa.

  4. Unknown-Word Resolution in English • Baseline method • PN tag for capitalized words • Uniform distribution over open-class tags • Evaluation • 12 open-class tags • 45% of capitalized unknowns • Overall, 70% of the unknowns were tagged correctly
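A minimal sketch of this baseline in Python. The 12-tag open-class inventory below is an assumed placeholder (the slide does not list the actual tags):

```python
# Baseline unknown-word tagger sketch: PN for capitalized tokens, otherwise
# a uniform distribution over open-class tags. The tag inventory is
# illustrative, not the one used in the talk.

OPEN_CLASS_TAGS = ["NN", "NNS", "JJ", "JJR", "RB", "VB", "VBD",
                   "VBG", "VBN", "VBP", "VBZ", "PN"]  # 12 tags, assumed

def baseline_distribution(token: str) -> dict:
    if token[0].isupper():
        return {"PN": 1.0}
    p = 1.0 / len(OPEN_CLASS_TAGS)
    return {tag: p for tag in OPEN_CLASS_TAGS}

print(baseline_distribution("Befa"))     # {'PN': 1.0}
print(baseline_distribution("rikking"))  # uniform over the 12 tags
```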

  5. Unknown-Word Resolution in Hebrew • The baseline method resolves only 5% of the Hebrew unknown tokens! • Why? How can we improve?

  6. Unknown Words - Hebrew • The draje of the tagement starts rikking with Befa. • דרג הטגמנט התחיל לנפן עם בפה • drj htgmnt hthil lnpn `m bfh

  7. Unknown Words - Hebrew • drj htgmnt hthil lnpn `m bfh • Morphology • No capitalization: PN is always a candidate • Many open-class tags (> 3,000) • Syntax • Unmarked function words • preposition, definite article, conjunction, pronominal suffix • e.g., 'the drj' and 'of drj' are each written as a single token • Function-word ambiguity • htgmnt: VBP/VBI, DEF+MM • `m: PREP (with), NN (people) • bfh: PREP+NN/PN/JJ/VB, PREP+DEF+NN/JJ…

  8. Outline • Characteristics of Hebrew Unknowns • Previous Work • Unsupervised Lexicon-based Approaches • Letters Model • Pattern Model • Linear-context Model • Evaluation • Conclusion

  9. Hebrew Text Analysis System • [Pipeline diagram] text → Tokenizer → tokenized text → Morphological Analyzer (with Lexicon): known words get an analysis distribution; Unknown Tokens Analyzer: unknown words get an analysis distribution → Morphological Disambiguator (HMM) → disambiguated text → Proper-name Classifier (SVM) → disambiguated text with PN → Noun-Phrase Chunker (ME), Named-Entity Recognizer (SVM) • Demo: http://www.cs.bgu.ac.il/~nlpproj/demo
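The data flow of this pipeline can be sketched as below. Every stage is a toy stub (the real system uses the MILA lexicon, an HMM disambiguator, and SVM/ME classifiers), so only the plumbing between stages is meaningful:

```python
# Mock of the analysis pipeline: tokenizer -> morphological analyzer
# (lexicon) -> unknown-tokens analyzer for out-of-lexicon tokens ->
# HMM disambiguator. All components are stand-ins.

def tokenize(text):
    return text.split()

def unknown_tokens_analyzer(token):
    # Placeholder: producing this distribution is the topic of the talk.
    return {"properName": 0.5, "noun": 0.5}

def analyze_token(token, lexicon):
    # Known tokens get their lexicon analyses; unknowns are delegated.
    return lexicon.get(token) or unknown_tokens_analyzer(token)

def hmm_disambiguate(distributions):
    # Stand-in for the HMM: pick the most likely analysis per token.
    return [max(d, key=d.get) for d in distributions]

lexicon = {"hthil": {"verb": 0.9, "noun": 0.1}}   # toy one-entry lexicon
tokens = tokenize("drj htgmnt hthil")
dists = [analyze_token(t, lexicon) for t in tokens]
print(hmm_disambiguate(dists))  # ['properName', 'properName', 'verb']
```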

  10. Hebrew Unknowns • Unknown tokens • Tokens that are not recognized by the lexicon • NN: פרזנטור (presenter) • VB: התחרדן (got warm in the sun) • Unknown analyses • The set of analyses suggested by the lexicon does not contain the correct analysis for a given token • PN: שמעון פרס (Shimon Peres; the lexicon suggests only readings such as 'that a dorm sliced…') • RB: לעומתי (oppositional; the lexicon suggests only 'compared with me')

  11. Hebrew Unknowns - Evidence • Unknown tokens (4%) • Only 50% of the unknown tokens are PN • Selecting PN as a default POS is not sufficient • More than 30% of the unknown tokens are neologisms • Neologism detection is needed • Unknown analyses (3.5%) • 60% of the unknown analyses are proper names • Other POS cover 15% of the unknowns (only 1.1% of the tokens) • A PN classifier is sufficient for unknown analyses

  12. Hebrew Unknown Tokens Analysis • Objective • Given an unknown token, extract all possible morphological analyses and assign a likelihood to each analysis • Example: • התחרדן (got warm in the sun) • verb.singular.masculine.third.past 0.6 • proper noun 0.2 • noun.def.singular.masculine 0.1 • noun.singular.masculine.absolute 0.05 • noun.singular.masculine.construct 0.001 • …

  13. Previous Work - English • Heuristics [Weischedel et al. 95] • Tag-specific heuristics • Spelling features: capitalization, hyphens, suffixes • Guessing rules learned from raw text [Mikheev 97] • HMM with tag-suffix transitions [Thede & Harper 99]

  14. Previous Work - Arabic • Root-pattern features for morphological analysis and generation of Arabic dialects [Habash & Rambow 06] • Combination of a lexicon-based and a character-based tagger [Mansour et al. 07]

  15. Our Approach • Resources • A large amount of unlabeled data (unsupervised) • A comprehensive lexicon (lexicon-based) • Hypothesis • The characteristics of unknown tokens are similar to those of known tokens • Method • Tag-distribution models, based on the morphological analysis of the known tokens in the corpus: • Letters model • Pattern model • Linear-context model

  16. Notation • Token • A sequence of characters bounded by spaces • בצל bcl • Prefixes • The formative prefixes, according to each analysis • Preposition+noun (under a shadow): ב b • Base-word • The token without its prefixes (one base-word per analysis) • Noun (an onion): בצל bcl • Preposition+noun (under a shadow): צל cl
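A sketch of enumerating (prefix, base-word) segmentations against a lexicon, using the bcl example above. The prefix list and the two-entry lexicon are toy stand-ins for MILA:

```python
# Enumerate candidate segmentations of a token into prefix + base-word,
# keeping only those whose base-word the lexicon recognizes.

PREFIXES = ["", "b", "h", "w", "k", "l", "m", "bh", "wh"]   # illustrative
LEXICON = {"bcl": "noun (an onion)", "cl": "noun (a shadow)"}

def segmentations(token):
    """Yield (prefix, base-word, gloss) for every analysis the lexicon licenses."""
    for pre in PREFIXES:
        if token.startswith(pre):
            base = token[len(pre):]
            if base in LEXICON:
                yield pre, base, LEXICON[base]

for pre, base, gloss in segmentations("bcl"):
    print(pre or "(none)", base, gloss)
# (none) bcl noun (an onion)
# b cl noun (a shadow)
```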

  17. Letters Model • [Diagram: raw-text corpus + lexicon → ME letters model] • For each possible analysis of a given token: • Features • Positioned uni-, bi- and trigram letters of the base-word • The prefixes of the base-word • The length of the base-word • Value • A full set of morphological properties (as given by the lexicon)

  18. Letters Model – An Example • Known token: בצל bcl • Analyses • An onion • Base-word: bcl • Features • Grams: b:1 c:2 l:3 b:-3 c:-2 l:-1 bc:1 cl:2 bc:-2 cl:-1 bcl:1 bcl:-1 • Prefix: none • Length of base-word: 3 • Value • noun.singular.masculine.absolute • Under a shadow • Base-word: cl • Features • Grams: c:1 l:2 c:-2 l:-1 cl:1 cl:-1 • Prefix: b • Length of base-word: 2 • Value • preposition+noun.singular.masculine.absolute
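The positioned-n-gram feature extraction above can be sketched as follows; the `gram:position` string format mirrors the slide, everything else is an assumption:

```python
# Positioned uni-/bi-/trigram features over the base-word, plus the prefix
# and the base-word length, for one candidate segmentation.

def letter_features(base, prefix):
    feats = []
    n = len(base)
    for size in (1, 2, 3):          # uni-, bi-, and trigram letters
        count = n - size + 1        # how many grams of this size fit
        for i in range(count):
            gram = base[i:i + size]
            feats.append(f"{gram}:{i + 1}")      # position from the start
            feats.append(f"{gram}:{i - count}")  # position from the end
    feats.append(f"prefix={prefix or 'none'}")
    feats.append(f"len={n}")
    return feats

print(letter_features("bcl", ""))   # matches the grams listed on the slide
print(letter_features("cl", "b"))   # c:1 c:-2 l:2 l:-1 cl:1 cl:-1 ...
```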

  19. Pattern Model • Word formation in Hebrew is based on root+template and/or affixation. • Based on [Nir 93], we defined 40 common neologism formation patterns, e.g. • Verb • Template: miCCeC מחזר, tiCCeC תארך • Noun • Suffixation: ut שיפוטיות, iya בידוריה • Template: tiCCoCet תפרוסת, miCCaCa מגננה • Adjective • Suffixation: ali סטודנטיאלי, oni טלויזיוני • Adverb • Suffixation: it לעומתית
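Template patterns like these, with C standing for any root consonant, can be checked with regular expressions over a transliteration. The pattern list below is a tiny illustrative subset of the 40 patterns, and the transliteration alphabet is assumed; the binary-feature encoding matches the next slide:

```python
import re

# Assumed one-letter-per-consonant transliteration; C = any root consonant.
CONSONANT = "[bgdhwzxTyklmnsVpcqrSt']"

# Tiny illustrative subset of the 40 patterns (not the authors' full list).
TEMPLATES = {
    "miCCeC":     f"mi{CONSONANT}{CONSONANT}e{CONSONANT}",
    "tiCCoCet":   f"ti{CONSONANT}{CONSONANT}o{CONSONANT}et",
    "suffix-ut":  r"\w+ut",
    "suffix-ali": r"\w+ali",
}

def pattern_features(token):
    """One binary feature per pattern, plus a 'no pattern' feature."""
    feats = {name: int(bool(re.fullmatch(rx, token)))
             for name, rx in TEMPLATES.items()}
    feats["no-pattern"] = int(not any(feats.values()))
    return feats

print(pattern_features("tiproset"))  # fits tiCCoCet (cf. tprwst on the slide)
```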

  20. Pattern Model • [Diagram: raw-text corpus + lexicon + patterns → ME pattern model] • For each possible analysis of a given token: • Features • For each pattern: 1 if the token fits the pattern, 0 otherwise • A 'no pattern' feature • Value • A full set of morphological properties (as given by the lexicon)

  21. Letters+Pattern Model • [Diagram: raw-text corpus + lexicon + patterns → ME letters+pattern model] • For each possible analysis of a given token: • Features • Letters features • Pattern features • Value • A full set of morphological properties (as given by the lexicon)

  22. Linear-context Model • The draje of the tagement starts rikking with Befa. • P(t|w) is hard to estimate for unknown tokens • P(noun|draje), P(adjective|draje), P(verb|draje) • Alternatively, P(t|c) can be learned for known contexts • P(noun|The, of), P(adjective|The, of), P(verb|The, of) • Observed context information • Lexical distribution • Word given context, P(w|c) – P(draje|The, of) • Context given word, P(c|w) – P(The, of|draje) • Relative frequencies over all the words in the corpus • Morpho-lexical distribution of known tokens • P(t|wi) – P(determiner|The)…, P(preposition|of)… • Similar-words algorithm [Levinger et al. 95; Adler 07; Goldberg et al. 08]

  23. Linear-context Model • Notation: w – known word, c – context of a known word, t – tag • Initial conditions • p(w|c) and p(c|w): relative frequencies from the raw-text corpus • p(t|w): initialized from the lexicon • Expectation • p(t|c) = ∑w p(t|w) p(w|c) • Maximization • p(t|w) = ∑c p(t|c) p(c|w)
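A toy, runnable version of this iteration. The counts and the two-tag tagset are invented; the real model estimates p(w|c) and p(c|w) from the 42M-token corpus, seeds p(t|w) from the lexicon, and follows the similar-words scheme of [Levinger et al. 95]:

```python
# Alternate p(t|c) = sum_w p(t|w) p(w|c) and p(t|w) = sum_c p(t|c) p(c|w)
# over toy (context, word) counts, so an unknown word inherits the tag
# distribution of known words that share its contexts.

from collections import defaultdict

counts = {("The_of", "draje"): 3, ("The_of", "house"): 7,
          ("starts_with", "rikking"): 2, ("starts_with", "running"): 8}

c_tot, w_tot = defaultdict(float), defaultdict(float)
for (c, w), n in counts.items():
    c_tot[c] += n
    w_tot[w] += n
p_w_given_c = {(c, w): n / c_tot[c] for (c, w), n in counts.items()}
p_c_given_w = {(c, w): n / w_tot[w] for (c, w), n in counts.items()}

# Initial p(t|w): "lexicon" entries for known words, uniform for unknowns.
p_t_w = {"house": {"noun": 1.0}, "running": {"verb": 1.0},
         "draje": {"noun": 0.5, "verb": 0.5},
         "rikking": {"noun": 0.5, "verb": 0.5}}

for _ in range(10):
    # Expectation: p(t|c) = sum_w p(t|w) p(w|c)
    p_t_c = defaultdict(lambda: defaultdict(float))
    for (c, w), p in p_w_given_c.items():
        for t, q in p_t_w[w].items():
            p_t_c[c][t] += q * p
    # Maximization: p(t|w) = sum_c p(t|c) p(c|w)
    new = defaultdict(lambda: defaultdict(float))
    for (c, w), p in p_c_given_w.items():
        for t, q in p_t_c[c].items():
            new[w][t] += q * p
    p_t_w = {w: dict(d) for w, d in new.items()}

print(p_t_w["draje"])  # mass shifts toward 'noun', like its context-mate 'house'
```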

  24. Evaluation • Resources • Lexicon: MILA • Corpus • Train: an unlabeled 42M-token corpus • Test: annotated news articles, 90K token instances (3% unknown tokens, 2% unknown analyses) • PN classifier

  25. Evaluation - Models • Baseline • Most frequent tag (proper name) for all possible segmentations of the token • Letters model • Pattern model • Letters + Pattern model • Letters, Linear-context • Pattern, Linear-context • Letters + Pattern, Linear-context

  26. Evaluation - Criteria • Suggested analysis set • Coverage of the correct analysis • Ambiguity level (average number of analyses) • Average probability of the correct analysis • Disambiguation accuracy • Number of correct analyses picked by the complete system
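The three set-level criteria can be computed as below; the representation of each test item as an (analysis-distribution, gold-analysis) pair is assumed:

```python
# Coverage, ambiguity level, and average probability of the correct analysis
# over a list of (dict analysis -> prob, gold_analysis) pairs.

def analysis_set_metrics(items):
    coverage = sum(gold in dist for dist, gold in items) / len(items)
    ambiguity = sum(len(dist) for dist, _ in items) / len(items)
    avg_prob = sum(dist.get(gold, 0.0) for dist, gold in items) / len(items)
    return coverage, ambiguity, avg_prob

demo = [({"verb": 0.6, "properName": 0.2, "noun": 0.2}, "verb"),
        ({"properName": 1.0}, "noun")]
print(analysis_set_metrics(demo))  # (0.5, 2.0, 0.3)
```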

  27. Evaluation – Full Morphological Analysis • [Results table]

  28. Evaluation – Word Segmentation and POS Tagging • [Results table]

  29. Evaluation - Conclusion • Error reduction > 30% over a competitive baseline, on a large-scale dataset of 90K tokens • Full morphological disambiguation: 79% accuracy • Word segmentation and POS tagging: 70% accuracy • The unsupervised linear-context model is as effective as a model that uses hand-crafted patterns • Effective combination of textual observations from unlabeled data with the lexicon • Effective combination of an ME model for tag distribution and an SVM model for PN classification • Overall, error reduction of 5% for the whole disambiguation system

  30. Summary • The characteristics of known words can help resolve unknown words • Unsupervised (unlabeled data), lexicon-based approach • A language-independent algorithm for computing the distribution p(t|w) for unknown words • The nature of agglutinated prefixes in Hebrew [Ben-Eliahu et al. 2008]

  31. תנקס tnqs (thanks) • foreign 0.4 • proper noun 0.3 • noun.plural.feminine.absolute 0.2 • verb.singular.feminine.third.future 0.08 • verb.singular.masculine.second.future 0.02
