Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques • Guy De Pauw (guy.depauw@ua.ac.be) • Walter Daelemans (walter.daelemans@ua.ac.be) • CNTS – Language Technology Group • http://www.cnts.ua.ac.be
Morpho-Syntactic Analysis using Machine Learning Techniques • Why? • As an NLP tool in its own right (!) • Annotate new datasets (e.g. Mediargus) • Extra information source for language modeling • How? • Machine learning techniques: memory-based learning (MBL) + maximum entropy (maxent) • Shallow linguistic analysis
Shallow linguistic analysis • For many NLP applications, a full analysis is not necessary • e.g. morphological analysis of uitzonderingsgevallen ("exceptional cases") • FULL: ((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m] • SHALLOW: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m • Shallow analysis: fast + robust
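In this example the shallow analysis is simply the sequence of leaves of the full bracketed analysis. The sketch below makes that relation concrete. It assumes the bracket notation shown on the slide, `(morpheme)[TAG]`, and the function name `flatten_analysis` is hypothetical, not part of the authors' tools.

```python
import re

# A leaf of the bracketed analysis looks like "(morpheme)[TAG]".
# Tags on larger constituents follow a ")" rather than "(morpheme)", so they are skipped.
LEAF = re.compile(r"\((\w+)\)\[([^\]]+)\]")

def flatten_analysis(full: str) -> str:
    """Turn a nested bracketed analysis into the flat morpheme@TAG form."""
    return " + ".join(f"{morph}@{tag}" for morph, tag in LEAF.findall(full))

full = "((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m]"
print(flatten_analysis(full))
```

The hierarchical information (which morphemes combine first) is discarded; only the segmentation and the per-morpheme tags survive.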
Shallow linguistic analysis • Example ("Now the unsuspecting polar traveller comes across rubbish dumps among the icebergs"): [ADVP nu] [SMAIN tref+t] [NP de niets+vermoed+end+e pool+reiziger] [NP vuilnis+belt+en] [PP tussen] [NP de ijs+berg+en] [SVP aan] .
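The bracketed output combines two layers: phrase chunks (`[LABEL ...]`) and morpheme boundaries inside words (`+`). A minimal sketch of reading that notation back into a data structure follows; the function name `parse_chunks` is hypothetical and the format is inferred from the single example on the slide.

```python
import re

# "[LABEL word word ...]" with optional whitespace before the closing bracket.
CHUNK = re.compile(r"\[(\w+)\s+([^\]]*?)\s*\]")

def parse_chunks(line):
    """Return (label, words) pairs; each word is a list of its morphemes."""
    chunks = []
    for label, body in CHUNK.findall(line):
        words = [word.split("+") for word in body.split()]
        chunks.append((label, words))
    return chunks

line = ("[ADVP nu] [SMAIN tref+t] [NP de niets+vermoed+end+e pool+reiziger] "
        "[NP vuilnis+belt+en] [PP tussen] [NP de ijs+berg+en] [SVP aan] .")

for label, words in parse_chunks(line):
    print(label, words)
```

Tokens outside any chunk, such as the final period, are ignored by this sketch.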
Morphological Analysis • Input: parelvissers • Segmentation: parel+viss+er+s • Tagging: parel@N+viss@V+er@N|V.+s@INFLm • Alternation: parel@N+vis@V+er@N|V.+s@INFLm
Morphological Segmentation • Trained and evaluated on (adapted) morphological database of CELEX • Experimental Results (full word score): • FS (minimal boundaries + unigram): 86.7% • Morpheme Boundary Prediction: 89.2% • FS + Morpheme Prediction: 94.8%
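The slide reports a "full word score" without defining it. The sketch below assumes it means that a word counts as correct only if every morpheme boundary in it is predicted correctly. The toy data and the function names are hypothetical.

```python
def boundaries(segmentation: str) -> set:
    """Character offsets in the unsegmented word where a '+' boundary occurs."""
    positions, offset = set(), 0
    for morph in segmentation.split("+")[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions

def full_word_score(gold, predicted):
    """Fraction of words whose complete set of boundaries matches the gold standard."""
    correct = sum(boundaries(g) == boundaries(p) for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy data (not from CELEX): the second word misses one boundary.
gold = ["parel+viss+er+s", "aan+lop+en", "ijs+berg+en"]
predicted = ["parel+viss+er+s", "aan+lopen", "ijs+berg+en"]
print(full_word_score(gold, predicted))
```

Under this assumption a single missed boundary makes the whole word wrong, which is why word-level scores are stricter than per-boundary scores.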
Morphological Analysis • Input: parelvissers • Segmentation: parel+viss+er+s • Tagging: parel@N+viss@V+er@N|V.+s@INFLm • Alternation: parel@N+vis@V+er@N|V.+s@INFLm • 96%
Alternation • Map: • parel+viss+er+s -> parel+vis+er+s • aan+lop+en -> aan+loop+en • but also aan+ge+bracht -> aan+ge+breng
Alternation • Grapheme based alternation • 99.4% of morphemes correctly alternated • Including complex alternations like bracht->breng
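The slides do not spell out how the grapheme-based setup works. One common formulation, assumed here, treats alternation as a classification task per grapheme: each letter is presented with a window of surrounding letters, and the class is the string that replaces it (possibly empty or longer than one letter). The alignments below are hand-made for illustration, and all names are hypothetical.

```python
def grapheme_instances(morpheme, labels, width=2, pad="_"):
    """One (window, label) instance per grapheme; the focus letter is the window's center."""
    if len(morpheme) != len(labels):
        raise ValueError("need exactly one label per grapheme")
    padded = pad * width + morpheme + pad * width
    return [(padded[i:i + 2 * width + 1], label) for i, label in enumerate(labels)]

def apply_labels(labels):
    """Concatenate the per-grapheme outputs into the alternated morpheme."""
    return "".join(labels)

# Hand-aligned toy examples (assumption): surface morpheme -> per-grapheme output.
examples = {
    "viss": ["v", "i", "s", ""],                # viss -> vis
    "lop": ["l", "oo", "p"],                    # lop -> loop
    "bracht": ["b", "r", "e", "n", "g", ""],    # bracht -> breng
}

for surface, labels in examples.items():
    print(surface, "->", apply_labels(labels))
    for window, label in grapheme_instances(surface, labels):
        print("   ", window, "=>", repr(label))
```

In such a setup a classifier would be trained on the (window, label) pairs; irregular cases like bracht -> breng then come down to several per-grapheme decisions that each depend on local context.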
Morphological Analysis • Use the morphological analysis cascade to analyze all words in CGN and Mediargus that are not in CELEX, e.g. • F1: flowerpower-afstammelingen • F2: flowerpower-@N+af@P+stamm@V+eling@N|V.+en@INFLm • F3: flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm • F4: m • Huge morphological database of ±2.7M words
Part-of-Speech Tagging • Trained and evaluated on CGN + STIL • Some experimental results (unknown-word score in parentheses) • Contextual + orthographic features: 96.6% (82.5%) • + morphological information (tags of morphemes, lemma, inflection tag): 97.2% (86.9%)
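To illustrate what "contextual + orthographic features" plus morphological information could look like as classifier input, here is a small sketch. The exact feature set used for the reported results is not given in the slides, so every feature name below is an assumption; only the analysis format (`morpheme@TAG` joined by `+`) and the tag `LID` are taken from the slides.

```python
def context_features(words, i, prev_tags):
    """Contextual and orthographic features for the word at position i."""
    word = words[i]
    return {
        "word": word.lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        "has_hyphen": "-" in word,
        "has_digit": any(c.isdigit() for c in word),
    }

def morph_features(analysis):
    """Features derived from a shallow morphological analysis string."""
    tags = [part.split("@", 1)[1] for part in analysis.split("+")]
    return {"morph_tags": " ".join(tags), "last_morph_tag": tags[-1]}

words = ["de", "parelvissers"]
features = {
    **context_features(words, 1, prev_tags=["LID"]),
    **morph_features("parel@N+vis@V+er@N|V.+s@INFLm"),
}
print(features)
```

For a word the tagger has never seen, the word identity feature carries no information, so the suffix and morpheme-tag features do most of the work. That is consistent with the larger gain the slide reports for unknown words.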
Shallow linguistic analysis • 89.5% tagging accuracy • 87.4 F-score
System for morpho-syntactic analysis • Morphological analysis: ±5 w/s • Tagging + Phrase Chunking: ±450 w/s • Used to annotate entire Mediargus corpus • Morphological analysis (±2B morphemes) • Part-of-speech tags • Phrase chunks ::demo:: http://www.cnts.ua.ac.be/flavor
Language Modeling • Problem1: input is not a sequence of words, but a sequence of morphemes • Problem2: scoring hypotheses using shallow linguistic annotation
Language Modeling • Problem1: input is a sequence of morphemes: Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan • Disambiguate between word and morpheme boundaries • Use morphologically analyzed Mediargus as training material • Approach: morpheme sequence tagging
Language Modeling • Problem1: input is a sequence of morphemes: Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan • Output: [w nu] [w tref t] [w de] [w niets vermoed end e] … • Word boundaries: 97.2% • Morpheme boundaries: 93.1% • F-score of 92.3%
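Once every morpheme carries a boundary tag, rebuilding words is a simple grouping step. The sketch below assumes a two-tag scheme (`W` = morpheme starts a new word, `M` = morpheme continues the current word); the actual tag set used by the sequence tagger is not given in the slides.

```python
def group_morphemes(morphemes, tags):
    """Group morphemes into words; 'W' opens a new word, 'M' continues the current one."""
    words = []
    for morph, tag in zip(morphemes, tags):
        if tag == "W" or not words:
            words.append([morph])
        else:
            words[-1].append(morph)
    return words

def bracket(words):
    """Render grouped morphemes in the slide's [w ...] notation."""
    return " ".join("[w " + " ".join(word) + "]" for word in words)

morphemes = "nu tref t de niets vermoed end e".split()
tags = ["W", "W", "M", "W", "W", "M", "M", "M"]  # hypothetical tagger output
print(bracket(group_morphemes(morphemes, tags)))
```

The hard part is predicting the tags; this step only shows how a tag sequence determines the word boundaries.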
Language Modeling • (Big) remaining problem: • aanlopen -> aan+lop+en or aan+loop+en • gebracht -> ge+bracht or ge+breng • But not: aan+loop+en and ge+bracht • Information not available in CELEX • But: • Orthography: closest guess • True pronounced morphemes: quite workable • Decent accuracy on harder task? • Regular expressions + grapheme-to-phoneme conversion • Not yet integrated in recognizer
Language Modeling • Turn morphemes into word forms (+ reverse alternation) • Re-analyze word form • Tag + shallow parse sequence of words ::demo:: www.cnts.ua.ac.be/flavor
Language Modeling • Problem2: scoring hypotheses • Option1: n-gram models trained on annotated Mediargus corpus • Morpheme n-grams: de niets vermoed end <e> • Tagged-morpheme n-grams: Ewb B V A|BV. <INFLPWB> • Word n-grams • Part-of-speech tag n-grams • Shallow parsing tag n-grams • Combination: de@LID@NP <kan@WW@NP> or <kan@N1@NP> • Interpolate LM scores
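The slide says the scores of the different n-gram models are interpolated but not how. A minimal sketch of linear interpolation follows, assuming each model assigns a probability to the same token position and that the weights are tuned on held-out data. The numbers and function names are hypothetical.

```python
import math

def interpolate(probabilities, weights):
    """Linear interpolation: sum over models k of weight_k * P_k(token | history)."""
    if len(probabilities) != len(weights):
        raise ValueError("need one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probabilities))

def hypothesis_log_score(per_token_probs, weights):
    """Log-probability of a hypothesis under the interpolated model."""
    return sum(math.log(interpolate(probs, weights)) for probs in per_token_probs)

# Two tokens, each scored by two models (e.g. a word 3-gram and a POS-tag 3-gram).
per_token_probs = [[0.2, 0.4], [0.5, 0.5]]
weights = [0.5, 0.5]
print(hypothesis_log_score(per_token_probs, weights))
```

Models defined over different units (morphemes versus words) would first have to be aligned to a common token sequence before their scores can be combined this way.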
Language Modeling • Problem2: scoring hypotheses • Option2: classifier “certainty” • Use maximum entropy classifiers, which can output proper probabilities • Quite informative for the WSJ LM task
Language Modeling • Problem2: scoring hypotheses • Option3: Maxent classifier as LM • Information Source: surrounding context (words, morphemes, linguistic annotation) • To classify: word (or morpheme) • VERY slow training time
Language Modeling: circumstantial evidence • Wall Street Journal: n-gram rescoring (+ maxent classifier probabilities + POS 3-grams) • VP set: 8.11% -> 7.57% • NVP set: 8.08% -> 7.74% • Mediargus: perplexity • Word 3-gram: 148.42 • Morpheme 3-gram: 56.36 • Tagged morpheme 3-gram: 53.17
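For reference, perplexity is the exponential of the average negative log-probability a model assigns per token. The sketch below uses the standard definition with toy probabilities; it does not reproduce the Mediargus figures.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))

# A model that gives every token probability 1/4 has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Because the average is taken per token, a perplexity computed over morphemes and one computed over words are normalized by different token counts. That may be part of why the slide presents these numbers as circumstantial evidence rather than a direct comparison.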
Limitations • Morpheme representation problematic for integration in recognizer • Efficiency as LM not yet properly evaluated for Dutch
Available Tools & Data • Tools: • All-in-one morpho-syntactic analyzer for Dutch • Morphological analyzer • Part-of-speech tagger • Phrase chunker • Word vs morpheme boundary detector for Dutch • Promising outlook for Dutch n-gram LM using extra annotation layers • Data: • Adjusted version of CELEX (incl. segmented orthographic forms) • 2.7M-word database of morphologically analyzed words • Morphologically analyzed, tagged & shallow-parsed Mediargus