This talk presents a memory-based learning-plus-inference approach to morphological analysis: local ambiguities are resolved by classification, and global inference coordinates the local decisions. It draws parallels with other NLP tasks such as grapheme-phoneme conversion, dependency parsing, and named entity recognition, and reports experiments on English, Dutch, and Arabic.
A memory-based learning-plus-inference approach to morphological analysis Antal van den Bosch With Walter Daelemans, Ton Weijters, Erwin Marsi, Abdelhadi Soudi, and Sander Canisius ILK / Language and Information Sciences Dept. Tilburg University, The Netherlands FLaVoR Workshop, 17 November 2006, Leuven
Learning plus inference • Paradigmatic solution to natural language processing tasks • Decomposition: • The disambiguation of local, elemental ambiguities in context • A holistic, global coordination of local decisions over the entire sequence
Learning plus inference • Example: grapheme-phoneme conversion • Local decisions • The mapping of a vowel letter in context to a vowel phoneme with primary stress • Global coordination • Making sure that there is only one primary stress
Learning plus inference • Example: dependency parsing • Local decisions • The relation between a noun and a verb is of the “subject” type • Global coordination • The verb only has one subject relation
Learning plus inference • Example: named entity recognition • Local decisions • A name that can be either a location or a person is a location in this context • Global coordination • Everywhere in the text this name always refers to the location
Learning plus inference • Local decision making by learning • All NLP decisions can be recast as classification tasks • (Daelemans, 1996: segmentation or identification) • Global coordination by inference • Given local proposals that may conflict, find the best overall solution • (e.g. minimizing conflict, or adhering to language model) • Collins and colleagues; Manning and Klein and colleagues; Dan Roth & colleagues; Marquez and Carreras; etc.
L+I and morphology • Segmentation boundaries, spelling changes, and PoS tagging recast as classification • Global inference checks for • Noun stem followed by noun inflection • Infix in a noun-noun compound is surrounded by two nouns • Etc.
Talk overview • English morphological segmentation • Easy learning • Inference not really needed • Dutch morphological analysis • Learning operations rather than simple decisions • Reasonably complex inference • Arabic morphological analysis • Learning as an attempt at lowering the massive ambiguity • Inference as an attempt to separate the chaff from the grain
English segmentation • (Van den Bosch, Daelemans, Weijters, NeMLaP 1996) • Morphological segmentation as classification • Versus traditional approach: • E.g. MITalk's DECOMP, analysing scarcity: • First analysis: scar|city - both stems found in morpheme lexicon, and validated as a possible analysis • Second analysis: scarc|ity - stem scarce found due to application of e-deletion rule; suffix -ity found; validated as a possible analysis • Cost-based heuristic prefers stem|derivation over stem|stem • Ingredients: morpheme lexicons, finite-state analysis validator, spelling-changing rules, cost heuristics • Validator, rules, and cost heuristics are costly knowledge-based resources
English segmentation • Segmentations as local decisions • To segment or not to segment • If segment, identify start (or end) of • Stem • Affixes • Inflectional morpheme
English segmentation • Three tasks: given a letter in context, is it the start of • a segment or not • a derivational morpheme (stem or affix), inflection, or not • a stem, a stress-affecting affix, a stress-neutral affix, an inflection, or not
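The windowed-classification setup behind these tasks can be sketched as follows (an illustrative reconstruction, not the original code; the function name, the `_` padding symbol, and the toy label strings are assumptions):

```python
def make_instances(word, labels, width=5):
    """One fixed-width instance per letter: a window of (left context,
    focus letter, right context), paired with that letter's class label
    (e.g. a segment-boundary code)."""
    padded = "_" * width + word + "_" * width
    instances = []
    for i, label in enumerate(labels):
        window = tuple(padded[i:i + 2 * width + 1])
        instances.append((window, label))
    return instances

# toy example: in 'books', '1' marks the start of the inflection -s
for window, label in make_instances("books", "00001", width=2):
    print("".join(window), "->", label)
```

Each word of n letters thus yields n classification cases, which is why the instance counts in the experiments below are roughly nine times the word counts.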
Local classification • Memory-based learning • k-nearest neighbor classification • (Daelemans & Van den Bosch, 2005) • E.g. instance #9: m a l i t i e → ? • Nearest neighbors: a lot of evidence for “2”:

instance         class  distance  clones
m a l i t i e    2      0         2x
t a l i t i e    2      1         3x
u a l i t i e    2      1         2x
i a l i t i e    2      1         11x
g a l i t i e    2      1         7x
n a l i t i e    2      1         7x
r a l i t i e    2      1         5x
c a l i t i e    2      1         7x
p a l i t i e    2      1         2x
h a l i t i c    s      2         1x
…
Memory-based learning • Similarity function: Δ(X, Y) = Σ_{i=1..n} wᵢ · δ(xᵢ, yᵢ), where δ(xᵢ, yᵢ) = 0 if xᵢ = yᵢ and 1 otherwise • X and Y are instances • n is the number of features • xᵢ is the value of the i-th feature of X • wᵢ is the weight of the i-th feature
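A minimal sketch of k-NN classification with this weighted overlap distance (illustrative only: the function names are invented, the uniform weights stand in for real feature weights such as information gain, and ties are broken arbitrarily):

```python
from collections import Counter

def distance(x, y, weights):
    # Delta(X, Y) = sum_i w_i * delta(x_i, y_i); delta is 0 on a match, 1 otherwise
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)

def knn_classify(memory, query, weights, k=1):
    """memory: list of (instance, class) pairs. IB1-style k-NN: keep all
    instances at the k smallest distinct distances, return the majority class."""
    dists = sorted({distance(x, query, weights) for x, _ in memory})
    kept = set(dists[:k])
    votes = Counter(c for x, c in memory if distance(x, query, weights) in kept)
    return votes.most_common(1)[0][0]

# tiny memory in the spirit of the neighbor table above
memory = [(("m", "a", "l", "i", "t", "i", "e"), "2"),
          (("t", "a", "l", "i", "t", "i", "e"), "2"),
          (("h", "a", "l", "i", "t", "i", "c"), "s")]
print(knn_classify(memory, ("m", "a", "l", "i", "t", "i", "e"), (1,) * 7))  # -> 2
```

Note that with all weights equal this reduces to plain overlap counting; weighting the features makes mismatches on informative positions (e.g. near the focus letter) cost more.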
Generalizing lexicon • A memory-based morphological analyzer is • A lexicon: 100% accurate reconstruction of all examples in training material • At the same time, capable of processing unseen words • In essence, unseen words are the only remaining problem • CELEX Dutch has over 300k words; average coverage of running text is 90%-95% • Evaluation should focus solely on unseen words • So, a held-out test set from CELEX is fairly representative of unseen words
Experiments • CELEX English • 65,558 segmented words • 573,544 instances • 10-fold cross-validation • Measuring accuracy • M1: 88.0% correct test words • M2: 85.6% correct test words • M3: 82.4% correct test words
Add inference • (Van den Bosch and Canisius, SIGPHON 2006) • Original approach: only learning • Now: inference • Constraint satisfaction inference • Based on Van den Bosch and Daelemans (CoNLL 2005) trigram prediction
Constraint satisfaction inference • Predict trigrams, and use them as complete as possible • Formulate the inference procedure as a constraint satisfaction problem • Constraint satisfaction • Assigning values to a number of variables while satisfying certain predefined constraints • Constraint satisfaction for inference • Each token maps to a variable, the domain of which corresponds to the three candidate labels • Constraints are derived from the predicted trigrams
Constraint satisfaction inference

input  predicted trigram
(1) h  _ h {
(2) a  h { n
(3) n  { n t
(4) d  n d _

Trigram constraints: h,a,n → h,{,n • a,n,d → {,n,t
Bigram constraints: h,a → h,{ (2x) • a,n → {,n (2x) • n,d → n,t • n,d → n,d
Unigram constraints: h → h (2x) • a → { (3x) • n → n (3x) • d → t • d → d
Constraint satisfaction inference • Conflicting constraints: the trigram predicted at position (3) says the final label is t, while the trigram at position (4) says it is d; the derived bigram (n,d → n,t vs. n,d → n,d) and unigram (d → t vs. d → d) constraints conflict accordingly
Weighted constraint satisfaction • Extension of constraint satisfaction to deal with overconstrainedness • Each constraint has a weight associated with it • The optimal solution assigns those values to the variables that maximise the summed weight of the satisfied constraints • For constraint satisfaction inference, a constraint's weight should reflect the classifier's confidence in its correctness
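The inference step can be sketched like this (a brute-force illustration with uniform constraint weights; in the actual method weights come from classifier confidence, and the function and variable names here are invented):

```python
from itertools import product

def csi(trigrams, weight=1.0):
    """Weighted constraint-satisfaction inference over predicted label
    trigrams, one per position ('_' pads the sequence edges).

    Each position's domain is every label an overlapping trigram proposes
    for it; each trigram also contributes bigram and unigram constraints.
    The label sequence maximising the summed weight of satisfied
    constraints wins. Brute force: word-length sequences only."""
    n = len(trigrams)
    domains = []
    for i in range(n):
        cands = {trigrams[i][1]}              # centre of own trigram
        if i > 0:
            cands.add(trigrams[i - 1][2])     # right label of previous trigram
        if i < n - 1:
            cands.add(trigrams[i + 1][0])     # left label of next trigram
        domains.append(cands)
    constraints = []                          # (start position, labels, weight)
    for i, (l, c, r) in enumerate(trigrams):
        for start, span in ((i - 1, (l, c, r)), (i - 1, (l, c)), (i, (c, r)),
                            (i - 1, (l,)), (i, (c,)), (i + 1, (r,))):
            if start >= 0 and start + len(span) <= n:
                constraints.append((start, span, weight))
    def score(seq):
        return sum(w for s, span, w in constraints
                   if tuple(seq[s:s + len(span)]) == span)
    return list(max(product(*domains), key=score))

# the 'hand' example: position (3) predicts final t, position (4) predicts d
tris = [("_", "h", "{"), ("h", "{", "n"), ("{", "n", "t"), ("n", "d", "_")]
print(csi(tris))  # -> ['h', '{', 'n', 't']: more constraints back the t reading
```

On the conflicting example above, the devoiced t wins because three constraints support it against two for d; with confidence-derived weights the margin would instead depend on how sure the classifier was about each trigram.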
Example instances (abnormalities)

Left       focus  right      uni  tri
_ _ _ _ _  a      b n o r m  2    - 2 0
_ _ _ _ a  b      n o r m a  0    2 0 s
_ _ _ a b  n      o r m a l  s    0 s 0
_ _ a b n  o      r m a l i  0    s 0 0
_ a b n o  r      m a l i t  0    0 0 0
a b n o r  m      a l i t i  0    0 0 0
b n o r m  a      l i t i e  0    0 0 0
n o r m a  l      i t i e s  0    0 0 1
o r m a l  i      t i e s _  1    0 1 0
r m a l i  t      i e s _ _  0    1 0 0
m a l i t  i      e s _ _ _  0    0 0 0
a l i t i  e      s _ _ _ _  0    0 0 i
l i t i e  s      _ _ _ _ _  i    0 i -
Results • Only learning: • M3: 82.4% correct unseen words • Learning + CSI: • M3: 85.4% correct unseen words • Mild effect.
Dutch morphological analysis • (Van den Bosch & Daelemans, 1999; Van den Bosch & Canisius, 2006) • Task expanded to • Spelling changes • Part-of-speech tagging • Analysis generation • Dutch is mildly productive • Compounding • A bit more inflection than in English • Infixes, diminutives, …
Dutch morphological analysis (abnormaliteiten)

Left       focus  right      uni     tri
_ _ _ _ _  a      b n o r m  A       - A 0
_ _ _ _ a  b      n o r m a  0       A 0 0
_ _ _ a b  n      o r m a l  0       0 0 0
_ _ a b n  o      r m a l i  0       0 0 0
_ a b n o  r      m a l i t  0       0 0 0
a b n o r  m      a l i t e  0       0 0 0
b n o r m  a      l i t e i  0       0 0 +Da
n o r m a  l      i t e i t  +Da     0 +Da A_->N
o r m a l  i      t e i t e  A_->N   +Da A_->N 0
r m a l i  t      e i t e n  0       A_->N 0 0
m a l i t  e      i t e n _  0       0 0 0
a l i t e  i      t e n _ _  0       0 0 0
l i t e i  t      e n _ _ _  0       0 0 plural
i t e i t  e      n _ _ _ _  plural  0 plural 0
t e i t e  n      _ _ _ _ _  0       plural 0 -
Spelling changes • Deletion, insertion, replacement

b n o r m  a  l i t e i  0
n o r m a  l  i t e i t  +Da
o r m a l  i  t e i t e  A_->N

• abnormaliteiten analyzed as [[abnormaal]A iteit]N [en]plural • Root form has a double a; the wordform drops one a
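A toy sketch of how such per-letter operation labels can be unwound into a segmented root-form analysis (the label conventions below are a simplification assumed for illustration; the real label set and decoding are richer):

```python
def reconstruct(letters, labels):
    """Undo spelling changes and insert morpheme boundaries.
    Assumed label conventions:
      '0'    no operation, keep the letter
      '+Dx'  the letter(s) x were deleted before the focus: re-insert them
      other  any other label opens a new morpheme at the focus letter"""
    out = []
    for ch, lab in zip(letters, labels):
        if lab.startswith("+D"):
            out.append(lab[2:] + ch)      # restore the deleted letter(s)
        elif lab == "0":
            out.append(ch)
        else:
            out.append("|" + ch)          # morpheme boundary
    return "".join(out).lstrip("|")

# the uni labels from the abnormaliteiten table above
labels = ["A", "0", "0", "0", "0", "0", "0", "+Da",
          "A_->N", "0", "0", "0", "0", "plural", "0"]
print(reconstruct(list("abnormaliteiten"), labels))
# -> abnormaal|iteit|en
```

The +Da label both marks the spelling change and restores the root form's double a, so the segments line up with the analysis [[abnormaal]A iteit]N [en]plural.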
Part-of-speech • Selection processes in derivation

n o r m a  l  i t e i t  +Da
o r m a l  i  t e i t e  A_->N
r m a l i  t  e i t e n  0

• Stem abnormaal is an adjective • Affix -iteit seeks an adjective to its left to turn it into a noun
Experiments • CELEX Dutch: • 336,698 words • 3,209,090 instances • 10-fold cross validation • Learning only: 41.3% correct unseen words • With CSI: 51.9% correct unseen words • Useful improvement
Arabic analysis • Problem of undergeneration and overgeneration of analyses • Undergeneration: at k=1, • 7 out of 10 generated analyses of unknown words are correct, but • 4 out of 5 of the real analyses are not generated • Overgeneration: at k=10, • Only 3 out of 5 real analyses are missed, but • Half of the generated analyses are incorrect • Best balance at k=3 (F-score 0.42)
Discussion (1) • Memory-based morphological analysis • Lexicon and analyzer in one • Extremely simple algorithm • Unseen words are the remaining problem • Learning: local classifications • From simple boundary decisions • To complex operations • And trigrams • Inference: • More complex morphologies need more inference effort
Discussion (2) • Ceiling not reached yet; good solutions still wanted • Particularly for unknown words with unknown stems • Also, recent work by De Pauw! • External evaluation needed • Integration with part-of-speech tagging (software packages forthcoming) • Effect on IR, IE, QA • Effect in ASR
Thank you. http://ilk.uvt.nl Antal.vdnBosch@uvt.nl