A Splitter for German compound words
Pasquale Imbemba
Free University of Bozen-Bolzano
Supervisor: Dr. Raffaella Bernardi
Scenario
• User input
• Compound words in German
• Problem for IR
  - Retrieval of German books
  - No direct keyword matching
• Problem for CLIR
  - Retrieval of IT & EN books
  - No direct translation in dictionary
Problem: German compound words
Compounding is productive:
• Pre-existing morphemes are combined to form a new word (aka Univerbierung)
• Noun compounds are the most frequent case
• User input may not be in the lexicon used by CLIR search engines
  - Donau + Dampf + Schiff + Fahrt (tr.: steam navigation on the Danube)
• User input may be a lexicalized “compound” word
  - Malerei (tr.: painting), not Maler + Ei (tr.: painter and egg)
• Hence, a splitter is needed that handles both cases
• Furthermore, language evolves continuously (neologisms), so the lexical resources must be kept up to date
State of the art
• TAGH (Berlin-Brandenburg Academy of Sciences / University of Potsdam)
  - Weighted FSA: choose the combination with the least cost
• MORPHY (University of Paderborn)
  - Reduce to base form and affixes, look them up
• MORPA (Tilburg University)
  - Probabilistic calculus to determine segmentation
• De Rijke/Monz (University of Amsterdam)
  - Shallow approach
  - Given a word, if a substring is in the lexicon, subtract it. Repeat until no substring is left.
Tools
• Splitter
  - Mechanism to segment nouns
  - Implemented, evaluated and improved the De Rijke/Monz algorithm in Java
• Lexicon
  - Morphy (57,000 nouns), dated (Lezius)
  - deWaC (440,000 nouns), recent (Baroni & Kilgarriff)
  - Lexical resource to run lookups against
  - Nouns extracted from Morphy & deWaC
  - Regular-expression filtering on deWaC
  - Resources indexed with Lucene
De Rijke/Monz algorithm
[Figure: the word Ölpreis is scanned prefix by prefix; once the prefix Öl is found in the lexicon, r = split(rest of the word) is computed, giving r = preis.]

    split(word)
      for i := 1 to length(word) - 1 do
        if isInNounLex(substr(word, 1, i)) and split(substr(word, i+1, length(word))) != "" then
          r := split(substr(word, i+1, length(word)))
          return concat(substr(word, 1, i), "+", r)
      if isInNounLex(word) then return word
      else return ""
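The recursion above can be sketched in Java roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `NaiveCompoundSplitter`, the in-memory `Set<String>` standing in for the noun lexicon, and the lowercasing of the input are all assumptions made for the example. Unlike the pseudocode, the recursive result is stored so that `split` is called only once per candidate prefix.

```java
import java.util.Locale;
import java.util.Set;

/** Sketch of the shallow De Rijke/Monz-style splitter (names are hypothetical). */
final class NaiveCompoundSplitter {
    // Assumption: the noun lexicon is a plain in-memory set of lowercase nouns.
    private final Set<String> nounLexicon;

    NaiveCompoundSplitter(Set<String> nounLexicon) {
        this.nounLexicon = nounLexicon;
    }

    /** Returns the parts joined by '+', or "" if no segmentation is found. */
    String split(String word) {
        return splitLower(word.toLowerCase(Locale.GERMAN));
    }

    private String splitLower(String w) {
        // Try every proper prefix, shortest first.
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            if (nounLexicon.contains(head)) {
                // Recurse on the remainder; keep the result instead of recomputing it.
                String rest = splitLower(w.substring(i));
                if (!rest.isEmpty()) {
                    return head + "+" + rest;
                }
            }
        }
        // No split worked: accept the word only if it is itself a known noun.
        return nounLexicon.contains(w) ? w : "";
    }
}
```

With a lexicon containing `dampf`, `schiff`, `maler`, `ei` and `malerei`, `split("Dampfschiff")` returns `dampf+schiff`, while `split("Malerei")` returns `maler+ei`: because splitting is attempted before the whole-word lookup, the lexicalized word is segmented incorrectly.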
Enhanced Splitter workflow
• Cascading lexical resources
  - Increases split correctness
  - Improves overall correctness
• Lookup first
  - Lexicalized elements
  - Reduces amount of incorrect splits
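One way to read the "lookup first" and "cascading" ideas is sketched below. The slides do not spell out the exact workflow, so this is an assumption: the whole word is first looked up in every resource, and only if it is unknown is a split attempted, resource by resource, in a fixed order. The class name `CascadingSplitter`, the use of in-memory sets, and the order of the resources are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch of a lookup-first splitter over cascaded lexicons (hypothetical names). */
final class CascadingSplitter {
    // Assumption: lexicons are consulted in list order, e.g. one set per resource.
    private final List<Set<String>> lexicons;

    CascadingSplitter(List<Set<String>> lexicons) {
        this.lexicons = lexicons;
    }

    /** Returns the word itself if lexicalized, else a '+'-joined split, else "". */
    String analyze(String word) {
        String w = word.toLowerCase(Locale.GERMAN);
        // Step 1 (lookup first): a lexicalized word is never split.
        for (Set<String> lexicon : lexicons) {
            if (lexicon.contains(w)) {
                return w;
            }
        }
        // Step 2 (cascade): try to split with each resource in turn.
        for (Set<String> lexicon : lexicons) {
            String parts = split(w, lexicon);
            if (!parts.isEmpty()) {
                return parts;
            }
        }
        return "";
    }

    // Shallow recursive split against a single lexicon.
    private static String split(String w, Set<String> lexicon) {
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            if (lexicon.contains(head)) {
                String rest = split(w.substring(i), lexicon);
                if (!rest.isEmpty()) {
                    return head + "+" + rest;
                }
            }
        }
        return lexicon.contains(w) ? w : "";
    }
}
```

With a first lexicon containing `maler`, `ei` and `malerei` and a second containing `dampf` and `schiff`, `analyze("Malerei")` returns `malerei` unsplit, and `analyze("Dampfschiff")` falls through to the second resource and returns `dampf+schiff`.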
MuSiL Integration: Split and Translate
[Figure: processing pipeline for a query]
• Query input: Donaudampfschifffahrt
• Name recognition: Donau | Dampfschifffahrt
• Morphological analysis: Dampfschifffahrt_N
• Multiword recognition: Dampfschifffahrt_N
• Splitter, followed by lookup in the Multilingual Dictionary and the Multilingual Thesaurus
  - EN: vapour_N | steam_N (...) 2; IT: vapore_nm | (...)
  - EN: ship_N | (...) 1; IT: nave_nf | (...)
  - EN: drive_N | navigation_N (...) 3; IT: guida_nf | navigazione_nf (...)
Evaluation
• Total correctness improved
  - By increasing the number of non-splits (words kept whole) with deWaC and Morphy
Complexity of the split function
• De Rijke/Monz
  - Best case: the input word is scanned once, from first to last position
  - Worst case: the number of calls to split grows exponentially
• Our splitter
  - Best case: the word is found immediately in the lexical resources of nouns
  - Worst case: the function is executed recursively every time a prefix is found in the lexicon and the remaining substring is not empty (as in De Rijke/Monz)
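The exponential worst case can be made concrete by counting calls on an adversarial input. The sketch below is an illustration under assumptions, not a measurement from the slides: the lexicon holds `a`, `aa`, ..., up to n-1 letters, and the word is n-1 times `a` followed by `b`, so every prefix is accepted but every remainder fails. Under these assumptions the call count satisfies T(n) = 1 + T(1) + ... + T(n-1), which gives T(n) = 2^(n-1). The class name `SplitCallCounter` is hypothetical, and the recursion is called once per accepted prefix (calling it twice, as in the pseudocode, would only increase the count).

```java
import java.util.HashSet;
import java.util.Set;

/** Counts recursive calls of the shallow splitter on a worst-case input. */
final class SplitCallCounter {
    private final Set<String> lexicon;
    long calls = 0;

    SplitCallCounter(Set<String> lexicon) {
        this.lexicon = lexicon;
    }

    String split(String w) {
        calls++; // one unit of work per invocation
        for (int i = 1; i < w.length(); i++) {
            String head = w.substring(0, i);
            if (lexicon.contains(head)) {
                String rest = split(w.substring(i));
                if (!rest.isEmpty()) {
                    return head + "+" + rest;
                }
            }
        }
        return lexicon.contains(w) ? w : "";
    }

    /** Builds the adversarial lexicon and word of length n, returns the call count. */
    static long worstCaseCalls(int n) {
        Set<String> lex = new HashSet<>();
        StringBuilder as = new StringBuilder();
        for (int i = 1; i < n; i++) {
            as.append('a');
            lex.add(as.toString()); // a, aa, aaa, ...
        }
        SplitCallCounter counter = new SplitCallCounter(lex);
        counter.split(as + "b"); // every prefix matches, every remainder fails
        return counter.calls;
    }
}
```

For example, `worstCaseCalls(5)` yields 16 calls and `worstCaseCalls(10)` yields 512, doubling with each extra character.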
Performance on MuSiL
• Increased number of retrieved documents
• More relevant documents are top ranked
Conclusion and future work
• Good:
  - Cascade method
  - Deals with lexicalized elements
• Open topics:
  - Choose the correct segmentation among alternatives
  - Metrics for correctness of segmentation
  - Weights, probability …