A memory-based learning-plus-inference approach to morphological analysis Antal van den Bosch With Walter Daelemans, Ton Weijters, Erwin Marsi, Abdelhadi Soudi, and Sander Canisius ILK / Language and Information Sciences Dept. Tilburg University, The Netherlands FLaVoR Workshop, 17 November 2006, Leuven
Learning plus inference • Paradigmatic solution to natural language processing tasks • Decomposition: • The disambiguation of local, elemental ambiguities in context • A holistic, global coordination of local decisions over the entire sequence
Learning plus inference • Example: grapheme-phoneme conversion • Local decisions • The mapping of a vowel letter in context to a vowel phoneme with primary stress • Global coordination • Making sure that there is only one primary stress
Learning plus inference • Example: dependency parsing • Local decisions • The relation between a noun and a verb is of the “subject” type • Global coordination • The verb only has one subject relation
Learning plus inference • Example: named entity recognition • Local decisions • A name that can be a location or a person is a location in this context • Global coordination • Everywhere in the text, this name refers to the location
Learning plus inference • Local decision making by learning • All NLP decisions can be recast as classification tasks • (Daelemans, 1996: segmentation or identification) • Global coordination by inference • Given local proposals that may conflict, find the best overall solution • (e.g. minimizing conflict, or adhering to language model) • Collins and colleagues; Manning and Klein and colleagues; Dan Roth & colleagues; Marquez and Carreras; etc.
L+I and morphology • Segmentation boundaries, spelling changes, and PoS tagging recast as classification • Global inference checks for • Noun stem followed by noun inflection • Infix in a noun-noun compound is surrounded by two nouns • Etc.
Talk overview • English morphological segmentation • Easy learning • Inference not really needed • Dutch morphological analysis • Learning operations rather than simple decisions • Reasonably complex inference • Arabic morphological analysis • Learning as an attempt at lowering the massive ambiguity • Inference as an attempt to separate the chaff from the grain
English segmentation • (Van den Bosch, Daelemans, Weijters, NeMLaP 1996) • Morphological segmentation as classification • Versus traditional approach: • E.g. MITalk's DECOMP, analyzing scarcity: • First analysis: scar|city - both stems found in morpheme lexicon, and validated as a possible analysis • Second analysis: scarc|ity - stem scarce found through application of an e-deletion rule; suffix -ity found; validated as a possible analysis • Cost-based heuristic prefers stem|derivation over stem|stem • Ingredients: morpheme lexicons, finite-state analysis validator, spelling-change rules, cost heuristics • Validator, rules, and cost heuristics are costly knowledge-based resources
English segmentation • Segmentations as local decisions • To segment or not to segment • If segment, identify start (or end) of • Stem • Affixes • Inflectional morpheme
English segmentation • Three tasks: given a letter in context, is it the start of • a segment or not • a derivational morpheme (stem or affix), inflection, or not • a stem, a stress-affecting affix, a stress-neutral affix, an inflection, or not
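All three tasks start from the same instance encoding: each letter of a word becomes one example, described by a fixed window of left and right context letters plus a class label. A minimal sketch in Python (the function name and the example boundary indices are illustrative, not taken from the original system):

```python
def window_instances(word, boundaries, width=5):
    """Turn a word into one instance per letter: a fixed window of
    left/right context letters plus a label saying whether the focus
    letter starts a new segment (task M1)."""
    padded = "_" * width + word + "_" * width
    instances = []
    for i, letter in enumerate(word):
        left = padded[i:i + width]                       # 5 letters to the left
        right = padded[i + width + 1:i + 2 * width + 1]  # 5 letters to the right
        label = 1 if i in boundaries else 0              # segment start or not
        instances.append((list(left) + [letter] + list(right), label))
    return instances

# Illustrative boundary indices for "abnormalities"
insts = window_instances("abnormalities", {0, 8, 11})
```

Tasks M2 and M3 use the same windows; only the label set becomes more fine-grained.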
Local classification • Memory-based learning • k-nearest neighbor classification • (Daelemans & Van den Bosch, 2005) • E.g. instance #9: m a l i t i e → ? • Nearest neighbors: a lot of evidence for class "2":

  Instance        Class  Distance  Clones
  m a l i t i e     2       0        2x
  t a l i t i e     2       1        3x
  u a l i t i e     2       1        2x
  i a l i t i e     2       1       11x
  g a l i t i e     2       1        2x
  n a l i t i e     2       1        7x
  r a l i t i e     2       1        5x
  c a l i t i e     2       1        7x
  p a l i t i e     2       1        2x
  h a l i t i c     s       2        1x
  …
Memory-based learning • Similarity function: Δ(X,Y) = Σ_{i=1..n} w_i δ(x_i, y_i), with δ(x_i, y_i) = 0 if x_i = y_i, 1 otherwise • X and Y are instances • n is the number of features • x_i is the value of the ith feature of X • w_i is the weight of the ith feature
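A toy version of this classifier, assuming the weighted overlap metric of the similarity function above (in practice the feature weights would come from e.g. information gain; all names here are illustrative):

```python
from collections import Counter


def overlap_distance(x, y, weights):
    """Weighted overlap metric: sum the feature weights w_i over every
    position i where the two instances disagree."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)


def knn_classify(query, memory, weights, k=1):
    """Memory-based classification: retrieve the k stored instances
    nearest to the query and return the majority class among them."""
    ranked = sorted(memory, key=lambda ex: overlap_distance(query, ex[0], weights))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Note that the "training" step is simply storing all instances in memory, which is why the same structure doubles as a lexicon.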
Generalizing lexicon • A memory-based morphological analyzer is • A lexicon: 100% accurate reconstruction of all examples in training material • At the same time, capable of processing unseen words • In essence, unseen words are the only remaining problem • CELEX Dutch has over 300k words; average coverage of text is 90%-95% • Evaluation should focus solely on unseen words • So, a held-out test set from CELEX is fairly representative of unseen words
Experiments • CELEX English • 65,558 segmented words • 573,544 instances • 10-fold cross-validation • Measuring accuracy • M1: 88.0% correct test words • M2: 85.6% correct test words • M3: 82.4% correct test words
Add inference • (Van den Bosch and Canisius, SIGPHON 2006) • Original approach: only learning • Now: inference • Constraint satisfaction inference • Based on Van den Bosch and Daelemans (CoNLL 2005) trigram prediction
Constraint satisfaction inference • Predict trigrams, and use them as completely as possible • Formulate the inference procedure as a constraint satisfaction problem • Constraint satisfaction • Assigning values to a number of variables while satisfying certain predefined constraints • Constraint satisfaction for inference • Each token maps to a variable, the domain of which corresponds to the candidate labels • Constraints are derived from the predicted trigrams
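The decomposition of predicted trigrams into unigram, bigram, and trigram constraints can be sketched as follows (a hypothetical helper; the example predictions in the test match the "hand" example on the next slide):

```python
def constraints_from_trigrams(trigram_preds):
    """Each position i predicts a trigram: labels for positions
    i-1, i, and i+1.  Decompose every prediction into trigram,
    bigram, and unigram constraints on the label variables."""
    constraints = []  # each constraint: (positions_tuple, labels_tuple)
    for i, (prev, cur, nxt) in enumerate(trigram_preds):
        constraints.append(((i - 1, i, i + 1), (prev, cur, nxt)))  # trigram
        constraints.append(((i - 1, i), (prev, cur)))              # bigram
        constraints.append(((i, i + 1), (cur, nxt)))               # bigram
        for pos, lab in ((i - 1, prev), (i, cur), (i + 1, nxt)):   # unigrams
            constraints.append(((pos,), (lab,)))
    # drop constraints referring to positions outside the sequence
    n = len(trigram_preds)
    return [c for c in constraints if all(0 <= p < n for p in c[0])]
```

Overlapping predictions typically produce duplicate and conflicting constraints; resolving those is what the inference step is for.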
Constraint satisfaction inference

  Input: h a n d    Predicted trigrams (one per position):
    (1) h → _,h,{
    (2) a → h,{,n
    (3) n → {,n,t
    (4) d → n,d,_

  Derived constraints:
    Trigram:  h,a,n → h,{,n    a,n,d → {,n,t
    Bigram:   h,a → h,{ (2x)   a,n → {,n (2x)   n,d → n,t   n,d → n,d
    Unigram:  h → h (2x)   a → { (3x)   n → n (3x)   d → t   d → d
Constraint satisfaction inference • Conflicting constraints: n,d → n,t versus n,d → n,d (and d → t versus d → d) cannot all be satisfied at once
Weighted constraint satisfaction • Extension of constraint satisfaction to deal with overconstrainedness • Each constraint has a weight associated with it • The optimal solution assigns those values to the variables that optimise the sum of the weights of the satisfied constraints • For constraint satisfaction inference, a constraint's weight should reflect the classifier's confidence in its correctness
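A brute-force sketch of this optimisation (exhaustive enumeration of label assignments, which is feasible for short words; the constraint weights in the test are made up for illustration):

```python
from itertools import product


def weighted_csi(domains, constraints):
    """Weighted constraint satisfaction by exhaustive search: pick the
    label assignment that maximises the summed weight of the
    constraints it satisfies.

    domains:      per-position list of candidate labels
    constraints:  (positions_tuple, labels_tuple, weight) triples
    """
    best, best_score = None, float("-inf")
    for assignment in product(*domains):
        score = sum(w for pos, labs, w in constraints
                    if all(assignment[p] == lab for p, lab in zip(pos, labs)))
        if score > best_score:
            best, best_score = assignment, score
    return best
```

A real implementation would prune the search or use dynamic programming, but the objective being maximised is the same.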
Example instances (abnormalities)

  Left       Focus  Right      Uni  Tri
  _ _ _ _ _    a    b n o r m    2   -20
  _ _ _ _ a    b    n o r m a    0   20s
  _ _ _ a b    n    o r m a l    s   0s0
  _ _ a b n    o    r m a l i    0   s00
  _ a b n o    r    m a l i t    0   000
  a b n o r    m    a l i t i    0   000
  b n o r m    a    l i t i e    0   000
  n o r m a    l    i t i e s    0   001
  o r m a l    i    t i e s _    1   010
  r m a l i    t    i e s _ _    0   100
  m a l i t    i    e s _ _ _    0   000
  a l i t i    e    s _ _ _ _    0   00i
  l i t i e    s    _ _ _ _ _    i   0i-
Results • Only learning: • M3: 82.4% correct unseen words • Learning + CSI: • M3: 85.4% correct unseen words • Mild effect.
Dutch morphological analysis • (Van den Bosch & Daelemans, 1999; Van den Bosch & Canisius, 2006) • Task expanded to • Spelling changes • Part-of-speech tagging • Analysis generation • Dutch is mildly productive • Compounding • A bit more inflection than in English • Infixes, diminutives, …
Dutch morphological analysis (abnormaliteiten)

  Left       Focus  Right      Uni     Tri
  _ _ _ _ _    a    b n o r m    A      -,A,0
  _ _ _ _ a    b    n o r m a    0      A,0,0
  _ _ _ a b    n    o r m a l    0      0,0,0
  _ _ a b n    o    r m a l i    0      0,0,0
  _ a b n o    r    m a l i t    0      0,0,0
  a b n o r    m    a l i t e    0      0,0,0
  b n o r m    a    l i t e i    0      0,0,+Da
  n o r m a    l    i t e i t    +Da    0,+Da,A_->N
  o r m a l    i    t e i t e    A_->N  +Da,A_->N,0
  r m a l i    t    e i t e n    0      A_->N,0,0
  m a l i t    e    i t e n _    0      0,0,0
  a l i t e    i    t e n _ _    0      0,0,0
  l i t e i    t    e n _ _ _    0      0,0,plural
  i t e i t    e    n _ _ _ _    plural 0,plural,0
  t e i t e    n    _ _ _ _ _    0      plural,0,-
Spelling changes • Deletion, insertion, replacement

  b n o r m a l i t e i    0
  n o r m a l i t e i t    +Da
  o r m a l i t e i t e    A_->N

• abnormaliteiten analyzed as [[abnormaal]A iteit]N [en]plural • Root form has a double a; the wordform drops one a
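Replaying such predicted operations recovers the root form from the wordform. A sketch under the assumption that "+Dx" means "reinsert the deleted letter x before the focus letter" and "0" means no change (the exact encoding of the operation classes is illustrative here):

```python
def restore_root(segment, ops):
    """Replay predicted spelling-change operations on a wordform
    segment to recover its root form.  Illustrative encoding:
    '+Dx' reinserts the deleted letter x before the focus letter,
    '0' leaves the focus letter unchanged."""
    out = []
    for letter, op in zip(segment, ops):
        if op.startswith("+D"):
            out.append(op[2])  # reinsert the letter dropped in the wordform
        out.append(letter)
    return "".join(out)


# wordform segment 'abnormal' with +Da on the final 'l'
# restores the double a of the root 'abnormaal'
root = restore_root("abnormal", ["0"] * 7 + ["+Da"])
```

Insertion undoes a deletion in the wordform; deletion and replacement operations would be handled analogously.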
Part-of-speech • Selection processes in derivation

  n o r m a l i t e i t    +Da
  o r m a l i t e i t e    A_->N
  r m a l i t e i t e n    0

• Stem abnormaal is an adjective • Affix -iteit seeks an adjective to its left to turn it into a noun
Experiments • CELEX Dutch: • 336,698 words • 3,209,090 instances • 10-fold cross validation • Learning only: 41.3% correct unseen words • With CSI: 51.9% correct unseen words • Useful improvement
Arabic analysis • Problem of undergeneration and overgeneration of analyses • Undergeneration: at k=1, • 7 out of 10 generated analyses of unknown words are correct, but • 4 out of 5 of the real analyses are not generated • Overgeneration: at k=10, • Only 3 out of 5 real analyses are missed, but • Half of the generated analyses are incorrect • Best balance at k=3 (F-score 0.42)
Discussion (1) • Memory-based morphological analysis • Lexicon and analyzer in one • Extremely simple algorithm • Unseen words are the remaining problem • Learning: local classifications • From simple boundary decisions • To complex operations • And trigrams • Inference: • More complex morphologies need more inference effort
Discussion (2) • Ceiling not reached yet; good solutions still wanted • Particularly for unknown words with unknown stems • Also, recent work by De Pauw! • External evaluation needed • Integration with part-of-speech tagging (software packages forthcoming) • Effect on IR, IE, QA • Effect in ASR
Thank you. http://ilk.uvt.nl Antal.vdnBosch@uvt.nl