
Development of a German-English Translator

This presentation reports progress on a German-English translator. Quarter 1 produced rule-based part-of-speech tagging, morphological analysis, and a dictionary. The scope has since expanded to statistical language processing, with new components including a lemmatizer, dictionary lookup, noun-phrase chunking, noun-verb agreement, and inflection. Open problems and future research directions are also discussed.


Presentation Transcript


  1. Felix Zhang Development of a German-English Translator

  2. Summary of Quarter 1 • Rule-based part of speech tagging • Morphological analysis • Created dictionary • Completely avoided statistical methods

  3. Scope • Expanded • Now includes statistical methods • Part of speech tagging using corpus • Rule-based only as backup

  4. Statistical language processing • State-of-the-art • Find chances that n-grams will translate into something else • Method is much simpler than current techniques • Context-free • Based on frequency of occurrence

  5. New Components • Lemmatizer • Noun-verb agreement • Inflection • Lookup • Noun-phrase chunking • Statistical part of speech tagging

  6. Lemmatizer • Breaks words down into their root form • Takes info from morphological analysis • Does not consider stop words • Sample input: “Der Mann macht die Kinder” (“the man makes the children”) • [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Ki', 'Kinde', 'Kind']]]
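
The lemmatizer slide can be illustrated with a minimal sketch. It assumes the tag names seen in the program output ('art', 'nou', 'ver') and uses hypothetical suffix lists; the slides do not show the project's actual stripping rules, so the candidate lists here differ from the sample output above.

```python
# Hypothetical suffix tables; the real project's rules are not shown in the slides.
STOP_TAGS = {"art"}                                  # articles are skipped as stop words
NOUN_PLURAL_SUFFIXES = ("er", "en", "e", "n", "s")   # assumed plural endings
VERB_ENDINGS = ("est", "et", "st", "en", "e", "t")   # assumed present-tense endings


def lemma_candidates(word, tag, readings):
    """Return possible root forms for one word, given its morphological readings."""
    if tag == "nou":
        candidates = [word]
        # Only strip suffixes if the morphological analysis allows a plural reading.
        if any("pl" in reading for reading in readings):
            for suffix in NOUN_PLURAL_SUFFIXES:
                if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                    stem = word[: -len(suffix)]
                    if stem not in candidates:
                        candidates.append(stem)
        return candidates
    if tag == "ver":
        # Strip the first matching ending and restore the infinitive ending "-en".
        for ending in VERB_ENDINGS:
            if word.endswith(ending) and len(word) - len(ending) >= 2:
                return [word[: -len(ending)] + "en"]
    return [word]


def lemmatize(analysed):
    """analysed: list of (word, tag, readings) triples; stop words are dropped."""
    return [
        [word, lemma_candidates(word, tag, readings)]
        for word, tag, readings in analysed
        if tag not in STOP_TAGS
    ]


sentence = [
    ("der", "art", []),
    ("Mann", "nou", [["nom", "mas"], ["dat", "fem"]]),
    ("macht", "ver", [["3", "sing"], ["2", "pl"]]),
    ("die", "art", []),
    ("Kinder", "nou", [["nom", "fem"], ["akk", "fem"], ["nom", "pl"], ["akk", "pl"]]),
]
print(lemmatize(sentence))
```

Over-generating noun candidates is harmless as long as a later dictionary lookup keeps only the candidates it recognizes.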

  7. Dictionary Lookup • All pronouns and definite articles • Small sample of nouns and verbs for testing • Looks up lemmatized words • [['der', 'the'], ['Mann', 'man'], ['macht', 'make'], ['die', 'the'], ['Kinder', 'child']]
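
A small sketch of how looking up lemmatized words might work: each word carries a list of candidate roots, and the first candidate found in the dictionary supplies the translation. The `DICTIONARY` contents and the function name `translate_root` are illustrative assumptions, not the project's code.

```python
# Toy dictionary covering only the sample sentence (assumed contents).
DICTIONARY = {
    "der": "the",
    "die": "the",
    "Mann": "man",
    "machen": "make",
    "Kind": "child",
}


def translate_root(word, candidates):
    """Return [word, translation] using the first candidate root the dictionary knows."""
    for candidate in candidates:
        if candidate in DICTIONARY:
            return [word, DICTIONARY[candidate]]
    return [word, None]  # unknown word: no translation available


print(translate_root("Kinder", ["Kinder", "Ki", "Kinde", "Kind"]))
print(translate_root("macht", ["machen"]))
```

Spurious candidates such as 'Ki' or 'Kinde' simply fail the lookup, which is why the lemmatizer can afford to over-generate.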

  8. Noun phrase chunking • Group noun phrases into “chunks” • “The old man greets young children.” • Groupings: [The old man], [greets], [young children] • Use for parse trees and noun-verb agreement
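
The grouping on this slide can be reproduced with a very simple chunker. This is a sketch under assumptions: the 'adj' tag is invented here (the slides only show 'art', 'nou', and 'ver'), and a chunk is taken to be any run of articles and adjectives closed by a noun.

```python
NP_TAGS = {"art", "adj", "nou"}  # 'adj' is an assumed tag name


def chunk(tagged):
    """Group (word, tag) pairs into noun-phrase chunks; other words stand alone."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            current.append(word)
            if tag == "nou":          # a noun closes the current noun phrase
                chunks.append(current)
                current = []
        else:
            if current:               # flush an unfinished phrase before a non-NP word
                chunks.append(current)
                current = []
            chunks.append([word])
    if current:
        chunks.append(current)
    return chunks


tagged = [
    ("The", "art"), ("old", "adj"), ("man", "nou"),
    ("greets", "ver"),
    ("young", "adj"), ("children", "nou"),
]
print(chunk(tagged))
```

Closing a chunk at every noun keeps the sketch short but would split multi-noun phrases, so a real chunker would need richer rules.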

  9. Statistical Tagging • Monolingual corpus – TIGER Corpus in German • Based on frequency of tag occurrence
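
Frequency-based tagging can be sketched as follows: count how often each word carries each tag in a tagged corpus, then assign the most frequent tag. The toy corpus below stands in for the TIGER Corpus (whose file format is not shown in the slides), and the capitalization fallback is a hypothetical example of a rule-based backup.

```python
from collections import Counter, defaultdict


def train(tagged_corpus):
    """Count tag occurrences per (lowercased) word."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return counts


def rule_based_fallback(word):
    """Hypothetical backup rule: German nouns are capitalized."""
    return "nou" if word[:1].isupper() else None


def tag_word(word, counts, fallback=rule_based_fallback):
    """Return the most frequent tag for a word, or the fallback's guess if unseen."""
    seen = counts.get(word.lower())
    if seen:
        return seen.most_common(1)[0][0]
    return fallback(word)


# Toy stand-in for a tagged corpus; 'pro' is an assumed tag name.
corpus = [("die", "art"), ("die", "art"), ("die", "pro"), ("Kinder", "nou"), ("macht", "ver")]
counts = train(corpus)
print([[w, tag_word(w, counts)] for w in ["die", "Kinder", "macht", "Mann"]])
```

Because each word is tagged independently of its neighbors, this matches the context-free, frequency-based approach described earlier in the deck.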

  10. Noun-verb Agreement • Disambiguation • Der Mann sieht die Kinder. (The man sees the children) • Der Mann: feminine singular indirect object or masculine singular subject • Die Kinder: feminine subject / direct object or plural subject / direct object • Sieht: singular, third person; or plural, second person • Der Mann “agrees” with verb – Same number, person if masculine singular subject
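
The agreement check can be sketched as a filter over reading pairs. The reading format follows the program output elsewhere in the deck ([case, gender-or-'pl'] for nouns, [person, number] for verbs); the function names and the exact filtering logic are assumptions, not the project's implementation.

```python
def number_of(noun_reading):
    """A gendered reading is singular; 'pl' marks a plural reading."""
    return "pl" if noun_reading[1] == "pl" else "sing"


def agree(noun_readings, verb_readings):
    """Keep only the noun/verb readings under which the noun can be the subject.

    A subject must be nominative, and a noun subject takes a third-person verb
    with the same number.
    """
    noun_keep, verb_keep = [], []
    for noun in noun_readings:
        if noun[0] != "nom":
            continue
        for verb in verb_readings:
            person, number = verb
            if person == "3" and number == number_of(noun):
                if noun not in noun_keep:
                    noun_keep.append(noun)
                if verb not in verb_keep:
                    verb_keep.append(verb)
    return noun_keep, verb_keep


mann = [["nom", "mas"], ["dat", "fem"]]
sieht = [["3", "sing"], ["2", "pl"]]
print(agree(mann, sieht))
```

For 'Der Mann' this leaves the masculine nominative reading and the third-person singular verb reading, which is the disambiguation the slide describes.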

  11. Inflection • Simple in English • Plurals – Add –s or –es • Singular verb – Add –s or –es • Not yet added: Past tense
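
The "add -s or -es" rule can be written in a few lines. This sketch assumes the usual spelling convention that -es follows s, x, z, ch, and sh, and it deliberately ignores irregular forms and other spelling changes (such as y becoming ies).

```python
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")  # endings that take "-es"


def add_s(word):
    """Append -es after a sibilant ending, otherwise -s."""
    return word + ("es" if word.endswith(SIBILANT_ENDINGS) else "s")


# The same rule serves both noun plurals and third-person singular verbs.
pluralize = add_s
third_person_singular = add_s

print(third_person_singular("make"))   # makes
print(third_person_singular("watch"))  # watches
print(pluralize("child"))              # childs
```

The last call shows the limit of a purely rule-based approach: irregular plurals such as "children" would have to come from the dictionary.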

  12. Full run of program
      • fzhang@kilauea ~/research $ python proj.py
      • Part of speech tags: [['der', 'art'], ['Mann', 'nou'], ['macht', 'ver'], ['die', 'art'], ['Kinder', 'nou']]
      • Morphological analysis: [[['Mann', 'nou'], [['nom', 'mas'], ['dat', 'fem']]], [['macht', 'ver'], [['3', 'sing'], ['2', 'pl']], 'pres'], [['Kinder', 'nou'], [['nom', 'fem'], ['akk', 'fem'], ['nom', 'pl'], ['akk', 'pl']]]]
      • Disambiguated after noun-verb agreement: [[['Mann', 'nou'], [['nom', 'nou']]], [['macht', 'ver'], [['3', 'sing']], 'pres'], [['Kinder', 'nou'], [['nom', 'fem'], ['akk', 'fem'], ['nom', 'pl'], ['akk', 'pl']]]]
      • Lemmatized: [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Ki', 'Kinde', 'Kind']]]
      • Root translated: [['der', 'the'], ['Mann', 'man'], ['macht', 'make'], ['die', 'the'], ['Kinder', 'child']]
      • Inflected: man makes child child childs childs

  13. Results • NV-agreement can disambiguate and determine subject, reduce to 2-3 possibilities • Statistical methods are NOT too complex to implement • Tagging should reach 90% accuracy

  14. Problems • Irregular verbs – stem changes • Singular conjugation • Expected: Lesen → er lest • Actual: Lesen → er liest • Strong verbs vs. Weak verbs – Past tenses • Weak: Machen → gemacht • Strong: Gehen → gegangen • Must include past tense in dictionary
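
The problem can be made concrete with a sketch of the regular (weak) patterns plus exception tables. The table names and function names are hypothetical; the only entries are the examples from this slide. Stems ending in -t or -d, inseparable prefixes, and other special cases are not handled.

```python
# Exception tables holding only the slide's examples (assumed structure).
IRREGULAR_THIRD_SINGULAR = {"lesen": "liest"}
STRONG_PARTICIPLES = {"gehen": "gegangen"}


def stem_of(infinitive):
    """Drop the infinitive ending -en, if present."""
    return infinitive[:-2] if infinitive.endswith("en") else infinitive


def third_person_singular(infinitive, exceptions=IRREGULAR_THIRD_SINGULAR):
    """Regular pattern: stem + t, unless the verb is listed as irregular."""
    if infinitive in exceptions:
        return exceptions[infinitive]
    return stem_of(infinitive) + "t"


def past_participle(infinitive, exceptions=STRONG_PARTICIPLES):
    """Weak pattern: ge + stem + t, unless the verb is listed as strong."""
    if infinitive in exceptions:
        return exceptions[infinitive]
    return "ge" + stem_of(infinitive) + "t"


print(third_person_singular("machen"))                # regular rule
print(third_person_singular("lesen", exceptions={}))  # rule alone gives the wrong "lest"
print(third_person_singular("lesen"))                 # exception table gives "liest"
print(past_participle("machen"), past_participle("gehen"))
```

Storing irregular forms alongside the dictionary entry is the simplest fix, and it matches the slide's conclusion that past tenses must be included in the dictionary.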

  15. Problems • Corpus file is huge – 42 megabytes • Impractical, takes a long time to run

  16. Future research • Implement more statistical methods • Morphological info • Actual translation – bilingual corpus • Create parse tree – Actual grammar • Method for predicting stem changes in strong verbs
