Development of a German-English Translator

Felix Zhang

Summary of Quarter 1 • Rule-based part of speech tagging • Morphological analysis • Created dictionary • Completely avoided statistical methods

Scope • Expanded • Now includes statistical methods • Part of speech tagging using corpus • Rule-based only as backup

Statistical language processing • Statistical methods are the state of the art • Estimate the probability that an n-gram translates into a given target phrase • The method used here is much simpler than current techniques • Context-free • Based on frequency of occurrence
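As a rough illustration of the frequency-based idea, the sketch below estimates translation probabilities by relative frequency. It assumes a list of already aligned (source n-gram, target phrase) pairs, which the project does not yet have; the function name and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict

def translation_probabilities(aligned_pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        counts[source][target] += 1
    probabilities = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        probabilities[source] = {t: c / total for t, c in targets.items()}
    return probabilities

# Toy data, invented for illustration only (not taken from any corpus).
pairs = [
    ("die", "the"),
    ("die", "the"),
    ("die", "that"),
    ("der Mann", "the man"),
]
probs = translation_probabilities(pairs)
print(probs["die"])
```

The estimate is context-free in the sense used on the slide: each n-gram is scored only by how often it was paired with a target, not by its neighbours.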

New Components • Lemmatizer • Noun-verb agreement • Inflection • Lookup • Noun-phrase chunking • Statistical part of speech tagging

Lemmatizer • Break down words into root form • Takes info from morphological analysis • Does not consider stop words • Sample input: “Der Mann macht die Kinder” (“the man makes the children”) • [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Ki', 'Kinde', 'Kind']]]
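A minimal sketch of suffix-stripping lemmatization in this style might look as follows. The ending lists, the minimum stem length, and the choice of articles as stop words are all assumptions, not the rules used in proj.py, so the candidate lists differ slightly from the sample output above.

```python
# Hypothetical ending lists; the original program's rules are not shown.
NOUN_ENDINGS = ("ern", "er", "en", "e", "n", "r", "s")
VERB_ENDINGS = ("st", "en", "et", "e", "t")
STOP_TAGS = {"art"}   # assumption: articles are the stop words that get skipped
MIN_STEM_LENGTH = 4   # assumption: avoids very short stems such as 'Man'

def lemma_candidates(word, tag):
    """Return possible root forms for one tagged word."""
    if tag == "ver":
        # Strip the first matching ending and restore the infinitive -en.
        for ending in VERB_ENDINGS:
            if word.endswith(ending):
                return [word[: -len(ending)] + "en"]
        return [word]
    # Nouns: keep the surface form and add every suffix-stripped variant.
    candidates = [word]
    for ending in NOUN_ENDINGS:
        if not word.endswith(ending):
            continue
        stem = word[: -len(ending)]
        if len(stem) >= MIN_STEM_LENGTH and stem not in candidates:
            candidates.append(stem)
    return candidates

def lemmatize(tagged):
    """Lemmatize every tagged word that is not a stop word."""
    return [[w, lemma_candidates(w, t)] for w, t in tagged if t not in STOP_TAGS]

tagged = [["der", "art"], ["Mann", "nou"], ["macht", "ver"],
          ["die", "art"], ["Kinder", "nou"]]
lemmas = lemmatize(tagged)
print(lemmas)
```

Over-generating candidates is harmless as long as a later dictionary lookup keeps only the forms that actually exist.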

Dictionary Lookup • All pronouns and definite articles • Small sample of nouns and verbs for testing • Looks up lemmatized words • [['der', 'the'], ['Mann', 'man'], ['macht', 'make'], ['die', 'the'], ['Kinder', 'child']]
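The lookup step can be sketched as trying each lemma candidate in turn until one is found in the dictionary. The dictionary contents and the function name below are illustrative assumptions; the lemma lists are the ones shown on the lemmatizer slide.

```python
# Tiny illustrative dictionary; the real one covers all pronouns and articles.
DICTIONARY = {
    "der": "the",
    "die": "the",
    "Mann": "man",
    "machen": "make",
    "Kind": "child",
}

def translate_roots(tokens, lemmas):
    """Pair each token with the translation of its first known root form."""
    lemma_map = {word: candidates for word, candidates in lemmas}
    result = []
    for token in tokens:
        # Stop words have no lemma entry, so they are looked up directly.
        forms = lemma_map.get(token, [token])
        translation = next((DICTIONARY[f] for f in forms if f in DICTIONARY), None)
        result.append([token, translation])
    return result

tokens = ["der", "Mann", "macht", "die", "Kinder"]
lemmas = [["Mann", ["Mann"]], ["macht", ["machen"]],
          ["Kinder", ["Kinder", "Ki", "Kinde", "Kind"]]]
root_translated = translate_roots(tokens, lemmas)
print(root_translated)
```

Spurious candidates such as 'Ki' and 'Kinde' simply fail the lookup, so only 'Kind' contributes a translation.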

Noun phrase chunking • Group noun phrases into “chunks” • “The old man greets young children.” • Groupings: [The old man], [greets], [young children] • Use for parse trees and noun-verb agreement
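One simple way to produce these groupings is a single pass over the tagged words that collects articles and adjectives until a noun closes the chunk. This is a sketch under assumptions: the 'adj' tag and the chunking rule are hypothetical, since only 'art', 'nou', and 'ver' appear in the program output.

```python
def chunk_noun_phrases(tagged):
    """Group (word, tag) pairs into noun-phrase chunks and single-word chunks."""
    chunks = []
    current = []
    for word, tag in tagged:
        if tag in ("art", "adj"):
            # Articles and adjectives open or extend a noun phrase.
            current.append(word)
        elif tag == "nou":
            # A noun closes the current noun phrase.
            current.append(word)
            chunks.append(current)
            current = []
        else:
            # Any other tag ends an unfinished phrase and stands alone.
            if current:
                chunks.append(current)
                current = []
            chunks.append([word])
    if current:
        chunks.append(current)
    return chunks

sentence = [("The", "art"), ("old", "adj"), ("man", "nou"),
            ("greets", "ver"), ("young", "adj"), ("children", "nou")]
chunks = chunk_noun_phrases(sentence)
print(chunks)
```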

Statistical Tagging • Monolingual corpus – TIGER Corpus in German • Based on frequency of tag occurrence
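Tagging by frequency of tag occurrence can be sketched as a most-frequent-tag lookup with a rule-based backup for unseen words. Reading the TIGER Corpus is not shown here; the toy tagged words, the 'pro' tag, and the capitalization rule used as backup are assumptions for illustration.

```python
from collections import Counter, defaultdict

def count_tags(tagged_words):
    """Count how often each tag occurs for each word in a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    return counts

def rule_based_backup(word):
    """Hypothetical backup rule: German nouns are capitalized."""
    return "nou" if word[:1].isupper() else "ver"

def tag_word(word, counts, backup=rule_based_backup):
    """Return the most frequent tag seen for the word, else use the backup."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return backup(word)

# Toy stand-in for a tagged corpus (invented, not TIGER data).
corpus = [("die", "art"), ("die", "art"), ("die", "pro"),
          ("Mann", "nou"), ("macht", "ver"), ("macht", "ver"), ("macht", "nou")]
counts = count_tags(corpus)
tags = [[w, tag_word(w, counts)] for w in ["die", "Mann", "macht", "Kinder"]]
print(tags)
```

Because the tagger ignores context, its accuracy is bounded by how often a word's most frequent tag is the right one.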

Noun-verb Agreement • Disambiguation • Der Mann sieht die Kinder. (The man sees the children) • Der Mann: feminine singular indirect object or masculine singular subject • Die Kinder: feminine subject / direct object or plural subject / direct object • Sieht: singular, third person; or plural, second person • Der Mann “agrees” with the verb (same number and person) if it is read as the masculine singular subject
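The agreement check can be sketched as keeping only the pairs of noun and verb readings that are compatible: the noun must be nominative, the verb must be third person, and their numbers must match. The reading format follows the program's morphological analysis output for “Der Mann macht die Kinder”; the function names and the exact filtering rule are assumptions.

```python
def number_of(noun_reading):
    """Masculine and feminine readings are singular; 'pl' marks plural."""
    _case, gender_or_number = noun_reading
    return "pl" if gender_or_number == "pl" else "sing"

def agreeing_readings(noun_readings, verb_readings):
    """Return (noun reading, verb reading) pairs where the noun can be subject."""
    pairs = []
    for noun in noun_readings:
        if noun[0] != "nom":
            continue  # only a nominative noun can be the subject
        for verb in verb_readings:
            person, number = verb
            # A noun subject is always third person.
            if person == "3" and number == number_of(noun):
                pairs.append((noun, verb))
    return pairs

mann = [["nom", "mas"], ["dat", "fem"]]
macht = [["3", "sing"], ["2", "pl"]]
kinder = [["nom", "fem"], ["akk", "fem"], ["nom", "pl"], ["akk", "pl"]]

print(agreeing_readings(mann, macht))
print(agreeing_readings(kinder, macht))
```

Both nouns leave one agreeing reading here, which shows why agreement alone narrows the analysis to a few possibilities rather than a single one.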

Inflection • Simple in English • Plurals – Add –s or –es • Singular verb – Add –s or –es • Not yet added: Past tense
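The –s / –es rule above can be sketched in a few lines. The list of endings that trigger –es is a common rule of thumb rather than the program's actual rule, and irregular forms are deliberately not handled, which is why 'child' becomes 'childs' just as in the program output.

```python
ES_ENDINGS = ("s", "x", "z", "ch", "sh")  # assumption: endings that take -es

def add_s(word):
    """Add -es after sibilant endings, otherwise -s."""
    if word.endswith(ES_ENDINGS):
        return word + "es"
    return word + "s"

def pluralize(noun):
    """Regular English plural; irregular nouns are not handled."""
    return add_s(noun)

def third_person_singular(verb):
    """Regular English third person singular present."""
    return add_s(verb)

print(third_person_singular("make"))
print(pluralize("child"))
print(pluralize("box"))
```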

Full run of program
• fzhang@kilauea ~/research $ python proj.py
• Part of speech tags: [['der', 'art'], ['Mann', 'nou'], ['macht', 'ver'], ['die', 'art'], ['Kinder', 'nou']]
• Morphological analysis: [[['Mann', 'nou'], [['nom', 'mas'], ['dat', 'fem']]], [['macht', 'ver'], [['3', 'sing'], ['2', 'pl']], 'pres'], [['Kinder', 'nou'], [['nom', 'fem'], ['akk', 'fem'], ['nom', 'pl'], ['akk', 'pl']]]]
• Disambiguated after noun-verb agreement: [[['Mann', 'nou'], [['nom', 'nou']]], [['macht', 'ver'], [['3', 'sing']], 'pres'], [['Kinder', 'nou'], [['nom', 'fem'], ['akk', 'fem'], ['nom', 'pl'], ['akk', 'pl']]]]
• Lemmatized: [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Ki', 'Kinde', 'Kind']]]
• Root translated: [['der', 'the'], ['Mann', 'man'], ['macht', 'make'], ['die', 'the'], ['Kinder', 'child']]
• Inflected: man makes child child childs childs

Results • Noun-verb agreement can disambiguate and identify the subject, reducing the analysis to 2-3 possibilities • Statistical methods are NOT too complex to implement • Tagging should reach 90% accuracy

Problems • Irregular verbs – stem changes • Singular conjugation • Expected: Lesen → er lest • Actual: Lesen → er liest • Strong verbs vs. Weak verbs – Past tenses • Weak: Machen → gemacht • Strong: Gehen → gegangen • Must include past tense in dictionary
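One way to read “must include past tense in dictionary” is a table of irregular forms that is consulted before the regular suffix rules. The sketch below illustrates that idea only; the table contents, the function name, and the regular rules are assumptions, not the project's implementation.

```python
# Hypothetical table of irregular forms and their infinitives.
IRREGULAR_FORMS = {
    "liest": "lesen",      # stem change in the singular
    "gegangen": "gehen",   # strong past participle
}

def verb_lemma(form):
    """Map an inflected verb form to its infinitive."""
    if form in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[form]
    if form.startswith("ge") and form.endswith("t"):
        # Weak past participle: ge + stem + t.
        return form[2:-1] + "en"
    if form.endswith("t"):
        # Regular third person singular: stem + t.
        return form[:-1] + "en"
    return form

for form in ["macht", "gemacht", "liest", "gegangen"]:
    print(form, "->", verb_lemma(form))
```

Without the table entry, the regular rule would turn 'liest' into the non-existent 'liesen', which is the stem-change problem described above.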

Problems • Corpus file is huge – 42 megabytes • Impractical, takes a long time to run

Future research • Implement more statistical methods • Morphological info • Actual translation – bilingual corpus • Create parse tree – Actual grammar • Method for predicting stem changes in strong verbs
