A Splitter for German compound words

Pasquale Imbemba, Free University of Bozen-Bolzano. Supervisor: Dr. Raffaella Bernardi.

Presentation Transcript


  1. A Splitter for German compound words
     Pasquale Imbemba
     Free University of Bozen-Bolzano
     Supervisor: Dr. Raffaella Bernardi

  2. Scenario
     • User input
     • Compound words in German
     • Problem for IR (information retrieval)
       - retrieval of German books
       - no direct keyword matching
     • Problem for CLIR (cross-language information retrieval)
       - retrieval of IT & EN books
       - no direct translation in the dictionary

  3. Problem: German compound words
     Compounding is productive:
     • Pre-existing morphemes are combined to form a new word (aka Univerbierung)
     • Noun compounds are the most frequent case
     • User input may not be in the lexicon used by CLIR search engines
       • Donau + Dampf + Schiff + Fahrt (tr.: steam navigation on the Danube)
     • User input may be a lexicalized “compound” word
       • Malerei (tr.: painting), not Maler + Ei (tr.: painter + egg)
     • Hence, a splitter is needed that handles both cases
     • Furthermore, language is in continuous evolution (neologisms), so lexical resources must be kept constantly up to date

  4. State of the art
     • TAGH (Berlin-Brandenburg Academy of Sciences / University of Potsdam)
       • Weighted FSA: choose the combination with the least cost
     • MORPHY (University of Paderborn)
       • Reduce to base form and affixes, then look them up
     • MORPA (Tilburg University)
       • Probabilistic calculus to determine the segmentation
     • De Rijke/Monz (University of Amsterdam)
       • Shallow approach
       • Given a word, if a substring is in the lexicon, subtract it. Repeat until no substring is left.

  5. Tools
     • Splitter
       • Mechanism to segment nouns
       • Implemented, evaluated and improved the De Rijke/Monz algorithm in Java
     • Lexicon
       • Morphy (57,000 nouns), dated (Lezius)
       • deWaC (440,000 nouns), recent (Baroni & Kilgarriff)
       • Lexical resource against which lookups are executed
       • Nouns extracted from Morphy & deWaC
       • Regular expression filtering on deWaC
       • Resources indexed with Lucene

  6. De Rijke/Monz algorithm
     split(word):
       for i := 1 to length(word) - 1 do
         if isInNounLex(substr(word, 1, i)) and split(substr(word, i+1, length(word))) != "" then
           r := split(substr(word, i+1, length(word)))
           return concat(substr(word, 1, i), "+", r)
       if isInNounLex(word) then return word
       else return ""
     [Figure: animation of split on "Ölpreis". The prefix "Öl" is found in the noun lexicon, and the recursive call on the remainder gives r = "preis".]
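The pseudocode on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the author's Java implementation: the lexicon is a plain in-memory set instead of a Lucene index, all words are assumed to be lowercased, and the toy lexicon entries are chosen only for the example. Unlike the pseudocode, the sketch calls `split` on the remainder once and reuses the result.

```python
def split(word, lexicon):
    """Return a '+'-joined segmentation of word, or "" if none is found."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split(tail, lexicon)  # one recursive call, result reused
            if rest != "":
                return head + "+" + rest
    # No split worked: accept the word only if it is itself a known noun.
    return word if word in lexicon else ""


# Toy lexicon (assumption): lowercase forms only.
lexicon = {"öl", "preis", "donau", "dampf", "schiff", "fahrt",
           "maler", "ei", "malerei"}

print(split("ölpreis", lexicon))                # öl+preis
print(split("donaudampfschifffahrt", lexicon))  # donau+dampf+schiff+fahrt
print(split("malerei", lexicon))                # maler+ei
```

The last call shows the weakness with lexicalized words: because the loop tries prefixes before checking the whole word, "malerei" is split into "maler+ei" even though it is in the lexicon.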

  7. Enhanced Splitter workflow
     • Cascading lexical resources
       • Increases split correctness
       • Improves overall correctness
     • Lookup first
       • Lexicalized elements
       • Reduces the number of incorrect splits
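One way to read "lookup first" and "cascading lexical resources" is sketched below. The slide does not say in which order the resources are consulted or how their results are combined, so the order of `lexicons`, the function name `enhanced_split`, and the two toy sets are all assumptions made for illustration.

```python
def split(word, lexicon):
    """Basic recursive splitter: '+'-joined segmentation or ""."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            rest = split(tail, lexicon)
            if rest != "":
                return head + "+" + rest
    return word if word in lexicon else ""


def enhanced_split(word, lexicons):
    """Look the whole word up first, then try each lexicon in turn."""
    word = word.lower()  # assumption: lexicons store lowercase forms
    # Lookup first: a lexicalized word is returned unsplit.
    for lexicon in lexicons:
        if word in lexicon:
            return word
    # Cascade: fall through to the next resource when a split fails.
    for lexicon in lexicons:
        result = split(word, lexicon)
        if result != "":
            return result
    return ""


# Hypothetical stand-ins for a smaller and a larger noun resource.
small_lexicon = {"maler", "ei"}
large_lexicon = {"malerei", "maler", "ei", "öl", "preis"}

print(enhanced_split("Malerei", [small_lexicon, large_lexicon]))  # malerei
print(enhanced_split("Ölpreis", [small_lexicon, large_lexicon]))  # öl+preis
```

With the lookup placed first, "Malerei" is no longer split into "maler+ei", and "Ölpreis" is still split once the second resource supplies the missing parts.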

  8. Splitter diagram

  9. MuSiL Integration
     • Query input: Donaudampfschifffahrt
     • Name recognition: Donau | Dampfschifffahrt
     • Morphological analysis: Dampfschifffahrt_N
     • Multiword recognition: Dampfschifffahrt_N
     • Split and translate (Splitter, Multilingual Dictionary, Multilingual Thesaurus)
       • EN: vapour_N | steam_N (...) 2
       • EN: ship_N | (...) 1
       • EN: drive_N | navigation_N (...) 3
       • IT: vapore_nm | (...)
       • IT: nave_nf | (...)
       • IT: guida_nf | navigazione_nf (...)

  10. Evaluation
      • Total correctness improved
        • By increasing the number of non-splits with deWaC and Morphy

  11. Complexity of the split function
      De Rijke/Monz:
      • Best case: the input word is scanned once, from first to last position
      • Worst case: the number of calls to split grows exponentially
      Our splitter:
      • Best case: the word is found immediately in the lexical resources of nouns
      • Worst case: the function is executed recursively every time we encounter a word in the lexicon and the remaining substring is not empty (see De Rijke/Monz)
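The best and worst cases can be made concrete by counting calls. The sketch below instruments a Python version of the recursive splitter; the adversarial input (a run of "a" followed by "b", with every run of "a" in the lexicon) is a constructed example, not data from the evaluation. Each prefix matches, but no segmentation can cover the final "b", so a word of k "a"s plus "b" triggers 2^k calls.

```python
def count_split_calls(word, lexicon):
    """Run the recursive splitter and return (result, number_of_calls)."""
    calls = 0

    def split(w):
        nonlocal calls
        calls += 1
        for i in range(1, len(w)):
            if w[:i] in lexicon:
                rest = split(w[i:])
                if rest != "":
                    return w[:i] + "+" + rest
        return w if w in lexicon else ""

    return split(word), calls


# Best case: no prefix matches, so a single call scans the word once.
print(count_split_calls("preis", {"preis"}))  # ('preis', 1)

# Worst case (constructed): every prefix matches, but no split succeeds.
lexicon = {"a" * n for n in range(1, 11)}
for k in (2, 4, 8):
    print(k, count_split_calls("a" * k + "b", lexicon)[1])  # 4, 16, 256
```

The pseudocode as written calls split twice per matching prefix (once in the condition, once in the assignment), so its call count would grow faster still than in this sketch.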

  12. Performance on MuSiL
      • Increased number of retrieved documents
      • More relevant documents are top-ranked

  13. Conclusion and future work
      • Good:
        • Cascade method
        • Deals with lexicalized elements
      • Open topics:
        • Choosing the correct segmentation among alternatives
        • Metrics for correctness of segmentation
        • Weights, probability …
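To illustrate what "alternatives" means in the open topics, the sketch below enumerates every segmentation a lexicon licenses instead of returning only the first one found. It is an illustration of the problem, not a proposed solution: the function name `all_segmentations` and the toy lexicon are assumptions, and no ranking, weight or probability is applied.

```python
def all_segmentations(word, lexicon):
    """Return every way to cover word with lexicon entries, as lists of parts."""
    results = []
    if word in lexicon:
        results.append([word])  # the unsplit reading
    for i in range(1, len(word)):
        head = word[:i]
        if head in lexicon:
            for rest in all_segmentations(word[i:], lexicon):
                results.append([head] + rest)
    return results


# Toy lexicon (assumption): lowercase forms only.
lexicon = {"maler", "ei", "malerei"}
print(all_segmentations("malerei", lexicon))
# [['malerei'], ['maler', 'ei']]
```

A metric or weighting scheme would then have to choose among the returned candidates, here between the lexicalized reading and "maler" + "ei".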
