310 likes | 487 Views
Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis. Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad Ben-Gurion University ACL 2008, Columbus, Ohio. Unknown Words - English. The draje of the tagement starts rikking with Befa. Morphology.
E N D
Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad Ben-Gurion University ACL 2008, Columbus, Ohio
Unknown Words - English • The draje of the tagement starts rikking with Befa. Morphology Syntax analysisi probi unknown word analysis1 prob1 … analysisn probn Motivation
Unknown Words - English • The draje of the tagement starts rikking with Befa. • Morphology • tagement, rikking, Befa • Syntax: • The draje of the tagement starts rikking with Befa. Motivation
Unknowns Resolution Method in English • Baseline Method • PN tag for capitalized words • Uniform distribution over open-class tags • Evaluation • 12 open-class tags • 45% of capitalized unknowns • Overall, 70% of the unknowns were tagged correctly Motivation
Unknowns Resolution Method in Hebrew The baseline method resolves: only 5% of the Hebrew unknown tokens! Why? How can we improve? Motivation
Unknown Words - Hebrew • The draje of the tagement starts rikking with Befa. • עם בפהדרג הטגמנט התחיל לנפן drj htgmnt hthil lnpn `m bfh Motivation
Unknown Words - Hebrew • drj htgmnt hthil lngn `m bfh • Morphology • No capitalization: PN always candidate • Many open-class tags (> 3,000) • Syntax • Unmarked function words • preposition, definite article, conjunction, pronominal pronoun. • the drj of drj • Function words ambiguity • htgmnt: VBP/VBI, DEF+MM • `m: PREP (with), NN (people) • bfh: PREP+NN/PN/JJ/VB, PREP+DEF+NN/JJ… Motivation
Outline • Characteristics of Hebrew Unknowns • Previous Work • Unsupervised Lexicon-based Approaches • Letters Model • Pattern Model • Linear-context Model • Evaluation • Conclusion
Hebrew Text Analysis System text Tokenizer tokenized text Morphological Analyzer Lexicon known words analysis distribution Unknown Tokens Analyzer ? unknown words analysis distribution Morphological Disambiguator HMM disambiguated text Proper-name Classifier SVM disambiguated text with PN ME Noun-Phrase Chunker SVM Named-Entity Recognizer http://www.cs.bgu.ac.il/~nlpproj/demo Hebrew Unknowns Characteristics
Hebrew Unknowns • Unknown tokens • Tokens which are not recognized by the lexicon • NN: פרזנטור (presenter) • VB:התחרדן (got warm under the sun) • Unknown analyses • The set of analyses suggested by the lexicon does not contain the correct analysis for a given token • PN: שמעון פרס (Shimon Peres, that a dorm cut…) • RB: לעומתי (opposition, compared with me) Hebrew Unknowns Characteristics
Hebrew Unknowns - Evidence • Unknown Tokens (4%) • Only 50% of the unknown tokens are PN • Selection of default PN POS is not sufficient • More than 30% of the unknown tokens are Neologism • Neologism detection • Unknown Analyses (3.5%) • 60% of the unknown analyses are proper name • Other POS cover 15% of the unknowns (only 1.1% of the tokens) • PN classifier is sufficient for unknown analyses Hebrew Unknowns Characteristics
Hebrew Unknown Tokens Analysis • Objectives • Given an unknown token, extract all possible morphological analyses, and assign likelihood for each analysis • Example: • התחרדן (got warm in the sun) • verb.singular.masculine.third.past 0.6 • Proper noun 0.2 • noun.def.singular.masculine 0.1 • noun.singular.masculine.absolute 0.05 • noun.singular.masculine.construct 0.001 • … Hebrew Unknowns Characteristics
Previous Work - English • Heuristics [Weischedel et al. 95] • Tag-specific heuristics • Spelling features: capitalization, hyphens, suffixes • Guessing rules learned from raw text [Mikheev 97] • HMM with tag-suffix transitions [Thede & Harper 99] Previous Work
Previous Work - Arabic • Root-pattern-features for morphological analysis and generation of Arabic dialects [Habash & Rambow 06] • Combination of lexicon-based and character-based tagger [Mansour et al. 07] Previous Work
Our Approach • Resources • A large amount of unlabeled data (unsupervised) • A comprehensive lexicon (lexicon-based) • Hypothesis • Characteristics of unknown tokens are similar to known tokens • Method • Tag distribution model, based on morphological analysis of the known tokens in the corpus: • Letters model • Pattern model • Linear-context model Unsupervised Lexicon-based Approach
Notation • Token • A sequence of characters bounded with spaces • בצלbcl • Prefixes • The prefixes according to each analysis • Preposition+noun (under a shadow): בb • Base-word • Token without prefixes (for each analysis) • Noun (an onion) בצלbcl • Preposition+noun (under a shadow): צלcl Unsupervised Lexicon-based Approach
Raw-text corpus Lexicon Letters Model • For each possible analyses of a given token: • Features • Positioned uni-, bi- and trigram letters of the base-word • The prefixes of the base-word • The length of the base-word • Value • A full set of the morphological properties (as given by the lexicon) ME Letters Model Unsupervised Lexicon-based Approach
Letters Model – An example • Known token: בצלbcl • Analyses • An onion • Base-word: bcl • Features • Grams: b:1 c:2 l:3 b:-3 c:-2 l:-1 bc:1 cl:2 bc:-2 cl:-1 bcl:1 bcl:-1 • Prefix: none • Length of base-word: 3 • Value • noun.singular.masculine.absolute • Under a shadow • Features • Grams: c:1 l:2 c:-1l:-2 cl:1 cl:-1 • Prefix: b • Length of base-word: 2 • Value • preposition+noun.singular.masculine.absolute Unsupervised Lexicon-based Approach
Pattern Model • Word formation in Hebrew is based on root+template and/or affixation. • Based on [Nir 93], we defined 40 common neologism formation patterns, e.g. • Verb • Template: miCCeC מחזר, tiCCeC תארך • Noun • Suffixation: ut שיפוטיות, iya בידוריה • Template: tiCCoCet תפרוסת, miCCaCa מגננה • Adjective • Suffixation: ali סטודנטיאלי, oni טלויזיוני • Adverb • Suffixation: it לעומתית Unsupervised Lexicon-based Approach
Raw-text corpus Lexicon Patterns Pattern Model • For each possible analyses of a given token: • Features • For each pattern, 1 – if the token fits the pattern, 0- otherwise • ‘no pattern’ feature • Value • A full set of the morphological properties (as given by the lexicon) ME Pattern Model Unsupervised Lexicon-based Approach
Raw-text corpus Lexicon Patterns Letters+Pattern Model • For each possible analyses of a given token: • Features • Letters features • Patterns features • Value • A full set of the morphological properties (as given by the lexicon) ME Letters + Pattern Model Unsupervised Lexicon-based Approach
Linear-context Model Thedrajeof the tagement starts rikking with Befa. • P(t|w) is hard to estimate for unknown tokens • P(noun| draje), P(adjective| draje), P(verb| draje) • Alternatively, P(t|c), can be learned for known contexts • P(noun| The, of), P(adjective| The, of), P(verb| The, of) • Observed Context Information • Lexical distribution • Word given context P(w|c) - P(drage|The,of) • Context given word P(c|w) - P(The, of | drage) • Relative frequencies over all the words in the corpus • Morpho-lexical distribution of known tokens • P(t|wi), - P(determiner|The)…, P(preposition|of)… • Similar words alg. [Levinger et al. 95] [Adler 07] [Goldberg et al. 08] Unsupervised Lexicon-based Approach
Linear-context Model • Notation: w – known word, c – context of a known word, t - tag • Initial Conditions • Expectation p(w|c), p(c|w) raw-text corpus lexicon p(t|w) • Maximization p(t|w) = ∑cp(t|c)p(c|w) p(t|c) = ∑wp(t|w)p(w|c) Unsupervised Lexicon-based Approach
Evaluation • Resources • Lexicon: MILA • Corpus • Train: unlabeled 42M tokens corpus • Test: annotated news articles of 90K token instances (3% unknown tokens, 2% unknown analyses) • PN Classifier Evaluation
Evaluation - Models • Baseline • Most frequent tag - proper name - for all possible segmentations of the token • Letters model • Pattern model • Letters + Pattern model • Letters, Linear-context • Pattern, Linear-context • Letters + Pattern, Linear-context Evaluation
Evaluation - Criteria • Suggested Analysis Set • Coverage of correct analysis • Ambiguity level (average number of analyses) • Average probability of correct analysis • Disambiguation accuracy • Number of correct analyses, picked in the complete system
Evaluation – Full Morphological Analysis Evaluation
Evaluation - Conclusion • Error reduction > 30% over a competitive baseline, for a large-scale dataset of 90K tokens • Full morphological disambiguation: 79% accuracy • Word segmentation and POS tagging: 70% accuracy • Unsupervised linear-context model is as effective as a model which uses hand-crafted patterns • Effective combination of textual observation from unlabeled data and lexicon • Effective combination of ME model for tag distribution and SVM model for PN classification • Overall, error reduction of 5% for the whole disambiguation system Evaluation
Summary • The characteristics of known words can help resolve unknown words • Unsupervised (unlabeled data) lexicon-based approach • Language independent algorithm for computing the distribution p(t|w) for unknown words • Nature of agglutinated prefixes in Hebrew [Ben-Eliahu et al. 2008]
תנקס tnqs (thanks) • foreign 0.4 • propernoun 0.3 • noun.plural.feminine.absolute 0.2 • verb.singular.feminine.3.future 0.08 • verb.singular.masculine.2.future 0.02