Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques

Guy De Pauw (guy.depauw@ua.ac.be), Walter Daelemans (walter.daelemans@ua.ac.be)
CNTS – Language Technology Group
http://www.cnts.ua.ac.be

Morpho-Syntactic Analysis using Machine Learning Techniques
• Why?
  • As an NLP tool proper (!)
  • Annotate new datasets (e.g. Mediargus)
  • Extra information source for language modeling
• How?
  • Machine learning techniques (MBL + maxent)
  • Shallow linguistic analysis

Shallow linguistic analysis
• For many NLP applications, a full analysis is often not necessary
• e.g. morphological analysis of uitzonderingsgevallen ("exceptional cases"):
  FULL: ((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m]
  SHALLOW: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m
• Shallow analysis: fast + robust
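As a minimal sketch of how the shallow notation could be consumed downstream, the snippet below splits an analysis string into (morpheme, tag) pairs. The function names `parse_shallow` and `surface_segmentation` are hypothetical and not part of the authors' tools; the only assumption is that "+" separates morphemes and "@" separates a morpheme from its tag, as in the example above.

```python
from typing import List, Tuple


def parse_shallow(analysis: str) -> List[Tuple[str, str]]:
    """Split a shallow analysis like 'parel@N+vis@V' into (morpheme, tag) pairs.

    Assumption: '+' separates morphemes and '@' separates morpheme from tag.
    """
    pairs = []
    for chunk in analysis.split("+"):
        # partition() keeps everything after the first '@' as the tag,
        # so tags such as 'N|V.' or 'N-m' survive unchanged.
        morpheme, _, tag = chunk.strip().partition("@")
        pairs.append((morpheme, tag))
    return pairs


def surface_segmentation(analysis: str) -> str:
    """Drop the tags and return only the '+'-joined morphemes."""
    return "+".join(morpheme for morpheme, _ in parse_shallow(analysis))


example = "uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m"
pairs = parse_shallow(example)
segmentation = surface_segmentation(example)
```

Because the shallow form is flat, a single left-to-right split is enough; the FULL bracketed form would need a recursive parser.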

Shallow linguistic analysis
[ADVP nu] [SMAIN tref+t] [NP de niets+vermoed+end+e pool+reiziger] [NP vuilnis+belt+en] [PP tussen] [NP de ijs+berg+en] [SVP aan] .

Morphological Analysis
• Input: parelvissers
• Segmentation: parel+viss+er+s
• Tagging: parel@N+viss@V+er@N|V.+s@INFLm
• Alternation: parel@N+vis@V+er@N|V.+s@INFLm

Morphological Segmentation
• Trained and evaluated on the (adapted) morphological database of CELEX
• Experimental results (full-word score):
  • FS (minimal boundaries + unigram): 86.7%
  • Morpheme boundary prediction: 89.2%
  • FS + morpheme prediction: 94.8%

Morphological Analysis
• Input: parelvissers
• Segmentation: parel+viss+er+s
• Tagging: parel@N+viss@V+er@N|V.+s@INFLm
• Alternation: parel@N+vis@V+er@N|V.+s@INFLm
• 96%

Alternation
• Map parel+viss+er+s to parel+vis+er+s
• aan+lop+en to aan+loop+en
• but also aan+ge+bracht to aan+ge+breng

Alternation
• Grapheme-based alternation
• 99.4% of morphemes correctly alternated
• Including complex alternations like bracht -> breng

Morphological Analysis
• Use the morphological analysis cascade to analyze all words in CGN and Mediargus (not in CELEX), e.g.
  F1: flowerpower-afstammelingen
  F2: flowerpower-@N+af@P+stamm@V+eling@N|V.+en@INFLm
  F3: flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm
  F4: m
• Huge morphological database of ±2.7M words

Part-of-Speech Tagging
• Trained and evaluated on CGN + STIL
• Some experimental results:
  • Contextual + orthographic features: 96.6% (uw 82.5%)
  • + morphological information: 97.2% (uw 86.9%)
    • Tags of morphemes
    • Lemma
    • Inflection tag

Shallow linguistic analysis
• 89.5% tagging accuracy
• 87.4 F-score

System for morpho-syntactic analysis
• Morphological analysis: ±5 words/s
• Tagging + phrase chunking: ±450 words/s
• Used to annotate the entire Mediargus corpus:
  • Morphological analysis (±2B morphemes)
  • Part-of-speech tags
  • Phrase chunks
• Demo: http://www.cnts.ua.ac.be/flavor

Language Modeling
• Problem 1: input is not a sequence of words, but a sequence of morphemes
• Problem 2: scoring hypotheses using shallow linguistic annotation

Language Modeling
• Problem 1: input is a sequence of morphemes
  Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan
• Disambiguate between word and morpheme boundaries
• Use morphologically analyzed Mediargus as training material
• Approach: morpheme sequence tagging

Language Modeling
• Problem 1: input is a sequence of morphemes
  Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan
  [w nu ] [w tref t] [w de ] [w niets vermoed end e ] …
• Word boundaries: 97.2%
• Morpheme boundaries: 93.1%
• F-score of 92.3%
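One way to cast word-versus-morpheme boundary detection as a sequence tagging task is to label each morpheme as either starting a new word or continuing the current one. The sketch below shows only that encoding and its inverse; the tag names "W" and "M" and the function names are assumptions made for illustration, and the classifier that would predict the tags is not shown.

```python
from typing import List, Tuple


def encode(words: List[List[str]]) -> List[Tuple[str, str]]:
    """Turn words (lists of morphemes) into a flat (morpheme, tag) sequence.

    'W' marks a morpheme that starts a new word, 'M' one that continues it.
    """
    tagged = []
    for morphemes in words:
        for position, morpheme in enumerate(morphemes):
            tagged.append((morpheme, "W" if position == 0 else "M"))
    return tagged


def decode(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Rebuild the word grouping from a tagged morpheme sequence."""
    words: List[List[str]] = []
    for morpheme, tag in tagged:
        if tag == "W" or not words:
            words.append([morpheme])  # open a new word
        else:
            words[-1].append(morpheme)  # extend the current word
    return words


bracketed = [["nu"], ["tref", "t"], ["de"], ["niets", "vermoed", "end", "e"]]
tags = encode(bracketed)
restored = decode(tags)
```

With this encoding, training material can be derived directly from a morphologically analyzed corpus, since every word there already comes with its morpheme segmentation.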

Language Modeling
• (Big) remaining problem:
  • aanlopen -> aan+lop+en or aan+loop+en
  • gebracht -> ge+bracht or ge+breng
  • But not: aan+loop+en and ge+bracht
• Information not available in CELEX
• But: Orthography closest guess True pronounced morphemes quite workable Decent accuracy on harder task ?? Regular expression + grapheme-to-phoneme conversion
• Not yet integrated in recognizer

Language Modeling
• Turn morphemes into word forms (+ reverse alternation)
• Re-analyze word form
• Tag + shallow parse sequence of words
• Demo: www.cnts.ua.ac.be/flavor

Language Modeling
• Problem 2: scoring hypotheses
• Option 1: n-gram models trained on annotated Mediargus corpus
  • Morpheme n-grams: de niets vermoed end <e>
  • Tagged-morpheme n-grams: Ewb B V A|BV. <INFLPWB>
  • Word n-grams
  • Part-of-speech tag n-grams
  • Shallow-parsing tag n-grams
  • Combination: de@LID@NP <kan@WW@NP> or <kan@N1@NP>
• Interpolate LM scores
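The slide does not say how the LM scores are interpolated. A common choice is linear interpolation, sketched below under that assumption: each n-gram model contributes a probability for the same event, and the mixture weights sum to one. The function name, the example probabilities, and the weights are all hypothetical.

```python
from typing import Sequence


def interpolate(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Linearly interpolate probabilities from several language models.

    probs[i] is model i's probability for the same event; weights[i] is
    its mixture weight. Weights must be non-negative and sum to 1.
    """
    if len(probs) != len(weights):
        raise ValueError("need exactly one weight per model")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * p for w, p in zip(weights, probs))


# Hypothetical probabilities from a word 3-gram, a morpheme 3-gram
# and a POS-tag 3-gram for the same hypothesis token.
mixed = interpolate([0.02, 0.10, 0.30], [0.5, 0.3, 0.2])
```

In practice the weights would be tuned on held-out data, for example by minimizing perplexity.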

Language Modeling
• Problem 2: scoring hypotheses
• Option 2: classifier “certainty”
  • Use maximum entropy classifiers that can output proper probabilities
  • Quite informative for WSJ LM task
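To illustrate why a maximum entropy classifier yields proper probabilities, here is a minimal sketch of the standard conditional maxent form: the summed weights of the active (feature, label) pairs are exponentiated and normalized over all labels. The feature string, the weight value, and the function name are invented for illustration; only the tags WW and N1 are taken from the slides.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple

Weights = Dict[Tuple[str, str], float]  # (feature, label) -> weight


def maxent_probs(
    features: Iterable[str], labels: Sequence[str], weights: Weights
) -> Dict[str, float]:
    """Return p(label | features) under a conditional maxent model."""
    active = list(features)
    scores = {
        label: sum(weights.get((feat, label), 0.0) for feat in active)
        for label in labels
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    normalizer = sum(exps.values())
    return {label: value / normalizer for label, value in exps.items()}


# Hypothetical toy model: after the article "de", a noun reading of "kan"
# gets extra weight compared with the verb reading.
toy_weights: Weights = {("prev=de", "N1"): 1.0}
dist = maxent_probs(["prev=de"], ["WW", "N1"], toy_weights)
```

The probability assigned to the chosen label can then serve as the classifier's "certainty" when rescoring a hypothesis.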

Language Modeling
• Problem 2: scoring hypotheses
• Option 3: maxent classifier as LM
  • Information source: surrounding context (words, morphemes, linguistic annotation)
  • To classify: word (or morpheme)
  • VERY slow training time

Language Modeling: circumstantial evidence
• Wall Street Journal: n-gram rescoring (+ maxent classifier probabilities + POS 3-grams)
  • VP set: 8.11% → 7.57%
  • NVP set: 8.08% → 7.74%
• Mediargus: perplexity
  • Word 3-gram: 148.42
  • Morpheme 3-gram: 56.36
  • Tagged morpheme 3-gram: 53.17
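For reference, perplexity is the exponential of the average negative log-probability a model assigns to each token of a test sequence. The sketch below computes it from a list of per-token probabilities; the function name and the example values are assumptions, not figures from the experiments above.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity of a sequence given the model probability of each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        raise ValueError("probabilities must be strictly positive")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# A model that is always uniformly unsure between four options
# has a perplexity of (approximately) 4.
uniform_ppl = perplexity([0.25] * 8)
```

Since perplexity is normalized per token, word-level and morpheme-level figures are averaged over different token counts, which is worth keeping in mind when reading the numbers side by side.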

Limitations
• Morpheme representation problematic for integration in recognizer
• Efficiency as LM not yet properly evaluated for Dutch

Available Tools & Data
Tools:
• All-in-one morpho-syntactic analyzer for Dutch
  • Morphological analyzer
  • Part-of-speech tagger
  • Phrase chunker
• Word vs. morpheme boundary detector for Dutch
• Promising outlook for Dutch n-gram LM using extra annotation layers
Data:
• Adjusted version of CELEX (incl. segmented orthographic forms)
• 2.7M-word database of morphologically analyzed words
• Morphologically analyzed, tagged & shallow-parsed Mediargus
