This summary covers progress on a German-English translator since Quarter 1. The scope now includes statistical language processing alongside the earlier rule-based part-of-speech tagging and morphological analysis, and the new components are a lemmatizer, dictionary lookup, noun-phrase chunking, noun-verb agreement, and inflection. Open problems and future research directions are also discussed.
Felix Zhang Development of a German-English Translator
Summary of Quarter 1 • Rule-based part of speech tagging • Morphological analysis • Created dictionary • Completely avoided statistical methods
Scope • Expanded • Now includes statistical methods • Part of speech tagging using corpus • Rule-based only as backup
Statistical language processing • State-of-the-art • Estimate the probability that a given n-gram translates into a particular target phrase • The method used here is much simpler than current techniques • Context-free • Based on frequency of occurrence
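A frequency-based estimate of this kind can be sketched as follows. This is only an illustration under stated assumptions: the function name `translation_probabilities` and the hand-made aligned pairs are hypothetical, and the project itself does not yet use a bilingual corpus.

```python
from collections import Counter, defaultdict

def translation_probabilities(aligned_pairs):
    """Estimate P(target | source) as a relative frequency of aligned n-gram pairs."""
    pair_counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        pair_counts[source][target] += 1
    probabilities = {}
    for source, targets in pair_counts.items():
        total = sum(targets.values())
        probabilities[source] = {t: c / total for t, c in targets.items()}
    return probabilities

# Toy, hand-made pairs (hypothetical; not taken from any corpus).
pairs = [
    ("die Kinder", "the children"),
    ("die Kinder", "the children"),
    ("die Kinder", "the kids"),
    ("der Mann", "the man"),
]
probs = translation_probabilities(pairs)
print(probs["die Kinder"])
```

Because the estimate ignores surrounding words, it is context-free in the sense used on the slide: each n-gram is scored only by how often it was seen with each translation.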
New Components • Lemmatizer • Noun-verb agreement • Inflection • Lookup • Noun-phrase chunking • Statistical part of speech tagging
Lemmatizer • Break down words into root form • Takes info from morphological analysis • Does not consider stop words • Sample input: “Der Mann macht die Kinder” (“the man makes the children”) • [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Ki', 'Kinde', 'Kind']]]
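One way to generate candidate root forms is suffix stripping, sketched below. The suffix tables, the stop-word set, and the function names are assumptions made for illustration; they are not the project's actual rules, and this sketch yields fewer candidates for "Kinder" than the sample output above.

```python
# Hypothetical suffix rules: (ending to strip, ending to add back).
NOUN_RULES = [("", ""), ("er", ""), ("e", ""), ("en", ""), ("s", "")]
VERB_RULES = [("t", "en"), ("st", "en"), ("e", "en"), ("en", "en")]
# Assumed stop words; the slides only say stop words are not considered.
STOP_WORDS = {"der", "die", "das", "den", "dem", "des"}

def lemma_candidates(word, tag):
    """Return possible root forms for a word, given its part-of-speech tag."""
    rules = VERB_RULES if tag == "ver" else NOUN_RULES
    candidates = []
    for strip, add in rules:
        if word.endswith(strip) and len(word) > len(strip):
            candidate = word[:len(word) - len(strip)] + add
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates

def lemmatize(tagged_words):
    """Lemmatize every tagged word that is not a stop word."""
    return [[word, lemma_candidates(word, tag)]
            for word, tag in tagged_words
            if word.lower() not in STOP_WORDS]

tagged = [["der", "art"], ["Mann", "nou"], ["macht", "ver"],
          ["die", "art"], ["Kinder", "nou"]]
print(lemmatize(tagged))
# [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Kind']]]
```

Returning several candidates defers the choice of the real lemma to the dictionary lookup, which keeps only forms it actually contains.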
Dictionary Lookup • All pronouns and definite articles • Small sample of nouns and verbs for testing • Looks up lemmatized words • [['der', 'the'], ['Mann', 'man'], ['macht', 'make'], ['die', 'the'], ['Kinder', 'child']]
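The lookup step can be sketched as trying the surface word and then each lemma candidate until one is found. The tiny `DICTIONARY` and the `lookup` function below are illustrative assumptions, not the project's data or code.

```python
# Hypothetical miniature dictionary keyed by root form.
DICTIONARY = {
    "der": "the",
    "die": "the",
    "Mann": "man",
    "machen": "make",
    "Kind": "child",
}

def lookup(word, candidates):
    """Return the translation of the first form found in the dictionary, else None."""
    for form in [word] + list(candidates):
        if form in DICTIONARY:
            return DICTIONARY[form]
    return None

print(lookup("Kinder", ["Kinder", "Kind"]))  # child
print(lookup("macht", ["machen"]))           # make
```

Unknown words return `None`, so a caller can decide whether to leave the German word in place or report it.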
Noun phrase chunking • Group noun phrases into “chunks” • “The old man greets young children.” • Groupings: [The old man], [greets], [young children] • Use for parse trees and noun-verb agreement
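The grouping above can be sketched with a simple pass over tagged words. This assumes a noun closes its phrase and that articles and adjectives belong to the following noun; the `adj` tag and the function name `chunk` are assumptions, since the slides only show the tags `art`, `nou`, and `ver`.

```python
NP_TAGS = {"art", "adj", "nou"}  # tags allowed inside a noun phrase (assumed)

def chunk(tagged_words):
    """Group tagged words into noun-phrase chunks and single-word chunks."""
    chunks, current = [], []
    for word, tag in tagged_words:
        if tag in NP_TAGS:
            current.append(word)
            if tag == "nou":          # a noun closes the current phrase
                chunks.append(current)
                current = []
        else:
            if current:               # flush an unfinished phrase
                chunks.append(current)
                current = []
            chunks.append([word])
    if current:
        chunks.append(current)
    return chunks

sentence = [("The", "art"), ("old", "adj"), ("man", "nou"),
            ("greets", "ver"), ("young", "adj"), ("children", "nou")]
print(chunk(sentence))
# [['The', 'old', 'man'], ['greets'], ['young', 'children']]
```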
Statistical Tagging • Monolingual corpus – TIGER Corpus in German • Based on frequency of tag occurrence
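Tagging by frequency of tag occurrence can be sketched as choosing, for each word, the tag it carries most often in the corpus, with a rule-based fallback for unseen words. The toy corpus below stands in for the TIGER Corpus, and `train`, `rule_based_tag`, and `tag` are hypothetical names; the capitalization rule is only a placeholder for the real rule-based tagger.

```python
from collections import Counter, defaultdict

def train(tagged_corpus):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def rule_based_tag(word):
    """Placeholder backup rule (assumption): capitalized words are nouns."""
    return "nou" if word[:1].isupper() else "ver"

def tag(words, model):
    """Tag words from corpus frequencies, falling back to rules for unseen words."""
    return [[w, model.get(w, rule_based_tag(w))] for w in words]

# Toy corpus (hypothetical); the short tags follow the program output in the slides.
toy_corpus = [("der", "art"), ("der", "art"), ("der", "pro"),
              ("Mann", "nou"), ("macht", "ver"), ("macht", "nou"),
              ("macht", "ver"), ("die", "art")]
model = train(toy_corpus)
print(tag(["der", "Mann", "macht", "die", "Kinder"], model))
```

Here "Kinder" never appears in the toy corpus, so it is tagged by the fallback rule, which mirrors the "rule-based only as backup" design.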
Noun-verb Agreement • Disambiguation • Der Mann sieht die Kinder. (The man sees the children) • Der Mann: feminine singular indirect object or masculine singular subject • Die Kinder: feminine subject / direct object or plural subject / direct object • Sieht: singular, third person; the –t ending alone would also suggest plural, second person (as with macht), although the actual second-person plural form is seht • Der Mann “agrees” with the verb (same number and person) only when read as a masculine singular subject
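The agreement check can be sketched as keeping only those noun and verb readings that are compatible with the noun being the subject. The reading format follows the program output shown in the slides (`['nom', 'mas']`, `['3', 'sing']`); the function names and the assumption that a gender label implies singular are mine.

```python
def noun_number(reading):
    """A gender label ('mas', 'fem') is taken as singular; 'pl' as plural (assumption)."""
    _case, gender_or_number = reading
    return "pl" if gender_or_number == "pl" else "sing"

def agree(noun_readings, verb_readings):
    """Keep nominative noun readings and third-person verb readings that share a number."""
    nouns_kept, verbs_kept = [], []
    for noun in noun_readings:
        if noun[0] != "nom":          # only a nominative noun can be the subject
            continue
        for verb in verb_readings:
            person, number = verb
            if person == "3" and number == noun_number(noun):
                if noun not in nouns_kept:
                    nouns_kept.append(noun)
                if verb not in verbs_kept:
                    verbs_kept.append(verb)
    return nouns_kept, verbs_kept

mann = [["nom", "mas"], ["dat", "fem"]]
macht = [["3", "sing"], ["2", "pl"]]
print(agree(mann, macht))
# ([['nom', 'mas']], [['3', 'sing']])
```

An empty result for a noun means it cannot be the subject of that verb, which is how the check helps pick out the subject.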
Inflection • Simple in English • Plurals – Add –s or –es • Third-person singular verb – Add –s or –es • Not yet added: Past tense
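The –s / –es rule can be sketched in a few lines. The list of endings that take –es is a common simplification and an assumption here, as are the function names; irregular forms are deliberately not handled.

```python
def add_s(word):
    """Append -es after sibilant endings, otherwise -s."""
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"

def pluralize(noun):
    """Regular English plural only."""
    return add_s(noun)

def third_person_singular(verb):
    """Regular English third-person singular present only."""
    return add_s(verb)

print(third_person_singular("make"))  # makes
print(pluralize("box"))               # boxes
print(pluralize("child"))             # childs
```

The last line shows the limitation of a purely regular rule: irregular plurals such as "children" would need their own dictionary entries.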
Full run of program • fzhang@kilauea ~/research $ python proj.py • Part of speech tags: [['der', 'art'], ['Mann', 'nou'], ['macht', 'ver'], ['die', 'art'], ['Kinder', 'nou']] • Morphological analysis: [[['Mann', 'nou'], [['nom', 'mas'], ['dat', 'fem']]], [['macht', 'ver'], [['3', 'sing'], ['2', 'pl']], 'pres'], [['Kinder', 'nou'], [['nom', 'fem'], ['akk', 'fem'], ['nom', 'pl'], ['akk', 'pl']]]] • Disambiguated after noun-verb agreement: [[['Mann', 'nou'], [['nom', 'nou']]], [['macht', 'ver'], [['3', 'sing']], 'pres'], [['Kinder', 'nou'], [['nom', 'fem'], ['akk', 'fem'], ['nom', 'pl'], ['akk', 'pl']]]] • Lemmatized: [['Mann', ['Mann']], ['macht', ['machen']], ['Kinder', ['Kinder', 'Ki', 'Kinde', 'Kind']]] • Root translated: [['der', 'the'], ['Mann', 'man'], ['macht', 'make'], ['die', 'the'], ['Kinder', 'child']] • Inflected: man makes child child childs childs
Results • Noun-verb agreement can disambiguate analyses and determine the subject, reducing the analysis to 2-3 possibilities • Statistical methods are NOT too complex to implement • Tagging is expected to reach 90% accuracy
Problems • Irregular verbs – stem changes • Singular conjugation • Expected from the regular rule: lesen → er lest • Actual: lesen → er liest • Strong verbs vs. weak verbs – Past tenses • Weak: machen → gemacht • Strong: gehen → gegangen • Past-tense forms must be included in the dictionary
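Storing irregular forms in the dictionary can be sketched as an exception table that is consulted before the regular rule. The tables and function names below are hypothetical and contain only the verbs named above.

```python
# Hand-entered exceptions (hypothetical tables, limited to the slide's examples).
IRREGULAR_3SG = {"lesen": "liest"}
STRONG_PARTICIPLES = {"gehen": "gegangen"}

def stem(infinitive):
    """Strip the infinitive ending -en (or -n) to get the regular stem."""
    return infinitive[:-2] if infinitive.endswith("en") else infinitive[:-1]

def third_person_singular(infinitive):
    """Use a stored irregular form if present, else the regular stem + t."""
    return IRREGULAR_3SG.get(infinitive, stem(infinitive) + "t")

def past_participle(infinitive):
    """Use a stored strong participle if present, else the weak pattern ge + stem + t."""
    return STRONG_PARTICIPLES.get(infinitive, "ge" + stem(infinitive) + "t")

print(third_person_singular("machen"), third_person_singular("lesen"))  # macht liest
print(past_participle("machen"), past_participle("gehen"))              # gemacht gegangen
```

Without the exception entry, the regular rule would produce "lest" for "lesen", which is exactly the mismatch described in the slide.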
Problems • Corpus file is huge – 42 megabytes • Impractical: takes a long time to run
Future research • Implement more statistical methods • Morphological info • Actual translation – bilingual corpus • Create parse tree – Actual grammar • Method for predicting stem changes in strong verbs