Eliciting a corpus of word-aligned phrases for MT
Lori Levin, Alon Lavie, Erik Peterson
Language Technologies Institute, Carnegie Mellon University
Introduction
• Problem: Building Machine Translation systems for languages with scarce resources:
  • Not enough data for Statistical MT and Example-Based MT
  • Not enough human linguistic expertise for writing rules
• Approach:
  • Elicit high-quality, word-aligned data from bilingual speakers
  • Learn transfer rules from the elicited data
[Figure: Modules of the AVENUE/MilliRADD rule learning system and MT system. Diagram labels: Word-aligned elicited data; English Language Model; Learning Module; Run Time Transfer System; Word-to-Word Translation Probabilities; Transfer Rules; Decoder; Lattice; Translation Lexicon. Example transfer rule shown in the diagram: {PP,4894} ;;Score:0.0470 PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))]
Outline
• Demo of elicitation interface
• Description of elicitation corpus
• Overview of automated rule learning
Demo of Elicitation Tool
• Speaker needs to be bilingual and literate: no other knowledge necessary
• Mappings between words and phrases: many-to-many, one-to-none, many-to-none, etc.
• Create phrasal mappings
• Fonts and character sets:
  • Including Hindi, Chinese, and Arabic
• Add morpheme boundaries to target language
• Add alternate translations
• Notes and context
Testing of Elicitation Tool
• DARPA Hindi Surprise Language Exercise
  • Around 10 Hindi speakers
  • Around 17,000 phrases translated and aligned
• Elicitation corpus
  • NPs and PPs from Treebanked Brown Corpus
Elicitation Corpus: Basic Principles
• Minimal pairs
• Syntactic compositionality
• Special semantic/pragmatic constructions
• Navigation based on language typology and universals
• Challenges
Elicitation Corpus: Minimal Pairs
Mapudungun (M): Spoken by around one million people in Chile and Argentina.
• Eng: I fell. | Sp: Caí | M: Tranün
• Eng: You (John) fell. | Sp: Tu (Juan) caiste | M: Eymi tranimi (Kuan)
• Eng: You (Mary) fell. | Sp: Tu (María) caiste | M: Eymi tranimi (Maria)
• Eng: I am falling. | Sp: Estoy cayendo | M: Tranmeken
• Eng: You (John) are falling. | Sp: Tu (Juan) estás cayendo | M: Eimi (Kuan) tranmekeymi
Using feature vectors to detect minimal pairs
• np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien
• cl1:(subj np1).intr-ag.past.complete
• Eng: You (John) fell. | Sp: Tu (Juan) caiste | M: Eymi tranimi (Kuan)

• np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien
• cl1:(subj np1).intr-ag.past.complete
• Eng: You (Mary) fell. | Sp: Tu (María) caiste | M: Eymi tranimi (Maria)

Feature vectors can be extracted from the output of a parser for English or Spanish. (Except for features that English and Spanish do not have…)
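The comparison of the two feature vectors above can be sketched in Python. This is a minimal illustration under stated assumptions, not the AVENUE implementation: the function names `parse_vector`, `differing_features`, and `is_minimal_pair` are hypothetical, and the sketch assumes that features are dot-separated and that comparable vectors list the same feature slots in the same order.

```python
def parse_vector(line):
    """Split 'np1:(subj-of cl1).pro-pers.hum...' into (label, [features])."""
    label, _, body = line.partition(":")
    return label.strip(), [f.strip() for f in body.split(".") if f.strip()]


def differing_features(a, b):
    """Return (position, feature_a, feature_b) for every slot that differs.

    Assumption: both vectors have the same label and the same slot order.
    """
    label_a, feats_a = parse_vector(a)
    label_b, feats_b = parse_vector(b)
    if label_a != label_b or len(feats_a) != len(feats_b):
        raise ValueError("vectors are not comparable")
    return [(i, x, y) for i, (x, y) in enumerate(zip(feats_a, feats_b)) if x != y]


def is_minimal_pair(sent_a, sent_b):
    """Two sentences (lists of vectors) form a minimal pair here if
    exactly one feature differs across all of their vectors."""
    if len(sent_a) != len(sent_b):
        return False
    diffs = [d for va, vb in zip(sent_a, sent_b) for d in differing_features(va, vb)]
    return len(diffs) == 1


# The two examples from the slide: "You (John) fell." / "You (Mary) fell."
john = [
    "np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
]
mary = [
    "np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien",
    "cl1:(subj np1).intr-ag.past.complete",
]
print(is_minimal_pair(john, mary))
```

For the two sentences above, the only difference is the gender slot of np1 (masc vs. fem), so they count as a minimal pair under this definition.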
Syntactic Compositionality
• The tree
• The tree fell.
• I think that the tree fell.
• We learn rules for smaller phrases
  • E.g., NP
• Their root nodes become non-terminals in the rules for larger phrases.
  • E.g., S containing an NP
• Meaning of a phrase is predictable from the meanings of the parts.
Special Semantic and Pragmatic Constructions
• Meaning may not be compositional
  • Not predictable from the meanings of the parts
• May not follow normal rules of grammar.
  • Suggestion: Why not go?
• Word-for-word translation may not work.
• Tend to be sources of MT mismatches
  • Comparative:
    • English: Hotel A is [closer than Hotel B]
    • Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
      Gloss: Hotel A TOP Hotel B than close is
    • “Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.
Examples of Semantic/Pragmatic Categories
• Speech Acts: requests, suggestions, etc.
• Comparatives and Equatives
• Modality: possibility, probability, ability, obligation, uncertainty, evidentiality
• Correlatives (e.g., "the more the merrier")
• Causatives
• Etc.
A Challenge: Combinatorics
• Person (1, 2, 3, 4)
• Number (sg, pl, du, paucal)
• Gender/Noun Class (?)
• Animacy (animate/inanimate)
• Definiteness (definite/indefinite)
• Proximity (near, far, very far, etc.)
• Inclusion/exclusion
• Multiply with: tenses and aspects (complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.)
• Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.
• (Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not object outranks subject in person or animacy.)
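To give a rough sense of scale, the following Python sketch multiplies out only the values explicitly listed above. It is an illustration, not a figure from the project: open-ended entries ("etc.", the unspecified Gender/Noun Class) are left out, the two values for inclusion/exclusion are an assumption, and the dimensions are treated as fully independent, which real languages do not guarantee.

```python
from math import prod

# Values listed on the slide. Gender/Noun Class is omitted because the slide
# gives no values for it, and every "etc." is ignored.
NOMINAL = {
    "person": ["1", "2", "3", "4"],
    "number": ["sg", "pl", "du", "paucal"],
    "animacy": ["animate", "inanimate"],
    "definiteness": ["definite", "indefinite"],
    "proximity": ["near", "far", "very far"],
    "inclusion": ["inclusive", "exclusive"],  # assumed two-way split
}
TENSE_ASPECT = [
    "complete", "incomplete", "real", "unreal", "iterative", "habitual",
    "present", "past", "recent past", "future", "recent future",
    "non-past", "non-future",
]
VERB_CLASS = [
    "agentive intransitive", "non-agentive intransitive",
    "transitive", "ditransitive",
]


def count_combinations(dimensions):
    """Size of the cross product of the given value lists."""
    return prod(len(values) for values in dimensions)


nominal_count = count_combinations(NOMINAL.values())
total = nominal_count * len(TENSE_ASPECT) * len(VERB_CLASS)
print(nominal_count, total)
```

Even with these omissions, the listed nominal features alone give 384 combinations, and multiplying by the 13 listed tense/aspect values and 4 verb classes gives 19,968 candidate sentences for a single argument position.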
Solutions to Combinatorics
• Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.
• Use known universals to eliminate features: e.g., languages without plurals don’t have duals.
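The two ideas above can be sketched together in Python. This is a hedged illustration only: the `UNIVERSALS` table, the (feature, value) encoding, and the function names are hypothetical, and the sketch covers paradigm generation and pruning but not the generation of sentences for each vector.

```python
from itertools import product

# Hypothetical encoding of implicational universals: if the key value is not
# distinguished in a language, the listed values are assumed absent as well.
UNIVERSALS = {
    ("number", "pl"): [("number", "du")],  # no plural implies no dual
}


def excluded_values(absent):
    """Close the set of absent (feature, value) pairs under UNIVERSALS."""
    closed = set(absent)
    frontier = list(absent)
    while frontier:
        for implied in UNIVERSALS.get(frontier.pop(), []):
            if implied not in closed:
                closed.add(implied)
                frontier.append(implied)
    return closed


def generate_paradigm(dimensions, absent=()):
    """Yield one feature vector (a dict) per remaining combination."""
    excluded = excluded_values(absent)
    names = list(dimensions)
    pools = [[v for v in dimensions[n] if (n, v) not in excluded] for n in names]
    for combo in product(*pools):
        yield dict(zip(names, combo))


dims = {"person": ["1", "2", "3"], "number": ["sg", "pl", "du"]}
full = list(generate_paradigm(dims))
pruned = list(generate_paradigm(dims, absent=[("number", "pl")]))
print(len(full), len(pruned))
```

Once elicitation shows that a language does not mark plural, the dual vectors are dropped as well, so the paradigm in this toy example shrinks from 9 vectors to 3.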
Other Challenges of Computer-Based Elicitation
• Inconsistency of human translation and alignment
• Bias toward word order of the elicitation language
• Need to provide discourse context for given and new information
• How to elicit things that aren’t grammaticalized in the elicitation language:
  • Evidential: I see that it is raining / Apparently it is raining / It must be raining.
  • Context: You are inside the house. Your friend comes in wet.
Transfer Rule Formalism
A rule contains:
• Type information
• Part-of-speech/constituent information
• Alignments
• x-side constraints
• y-side constraints
• xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the man, TL: der Mann
NP::NP [DET N] -> [DET N]
(
  (X1::Y1)
  (X2::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X2 AGR) = *3-SING)
  ((X2 COUNT) = +)
  ((Y1 AGR) = *3-SING)
  ((Y1 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y1 GENDER))
)
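One way to picture the parts of the NP rule above is as a small Python data structure. This is a sketch under stated assumptions, not the AVENUE transfer engine: the class `TransferRule` and its fields are hypothetical, only the x-side constraints are modeled, and plain equality checks stand in for whatever constraint mechanism the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class TransferRule:
    src_type: str    # type information, source side (NP in NP::NP)
    tgt_type: str    # type information, target side
    src_rhs: list    # source constituent sequence, e.g. ["DET", "N"]
    tgt_rhs: list    # target constituent sequence
    alignments: list # (x_index, y_index) pairs, 1-based as in (X1::Y1)
    # x-side constraints as {(x_index, feature): required value}
    x_constraints: dict = field(default_factory=dict)

    def matches(self, constituents):
        """constituents: list of (category, feature dict) on the source side."""
        if [cat for cat, _ in constituents] != self.src_rhs:
            return False
        return all(
            constituents[i - 1][1].get(feat) == value
            for (i, feat), value in self.x_constraints.items()
        )


np_rule = TransferRule(
    src_type="NP", tgt_type="NP",
    src_rhs=["DET", "N"], tgt_rhs=["DET", "N"],
    alignments=[(1, 1), (2, 2)],
    x_constraints={
        (1, "AGR"): "*3-SING", (1, "DEF"): "*DEF",
        (2, "AGR"): "*3-SING", (2, "COUNT"): "+",
    },
)

the_man = [
    ("DET", {"AGR": "*3-SING", "DEF": "*DEF"}),
    ("N", {"AGR": "*3-SING", "COUNT": "+"}),
]
print(np_rule.matches(the_man))
```

The feature values assigned to "the man" here are illustrative; in practice they would come from the source-side analysis.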
Rule Learning - Overview
• Goal: Acquire Syntactic Transfer Rules
• Use available knowledge from the source side (grammatical structure)
• Three steps:
  • Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  • Compositionality: use previously learned rules to add hierarchical structure
  • Seeded Version Space Learning: refine rules by learning appropriate feature constraints
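The first of the three steps, flat seed generation, can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than the project's learning module: the function name `flat_seed_rule` is hypothetical, and the sketch assumes that tag sequences for both sides and the word alignment from elicitation are already available. Compositionality and seeded version space learning are not covered.

```python
def flat_seed_rule(phrase_type, src_tags, tgt_tags, alignment):
    """Build a flat transfer rule string from one word-aligned example.

    alignment is a list of (x_index, y_index) pairs, 1-based.
    """
    aligned = " ".join(f"(X{x}::Y{y})" for x, y in sorted(alignment))
    src = " ".join(src_tags)
    tgt = " ".join(tgt_tags)
    return f"{phrase_type}::{phrase_type} [{src}] -> [{tgt}] ({aligned})"


# SL: the man, TL: der Mann
seed = flat_seed_rule("NP", ["DET", "N"], ["DET", "N"], [(1, 1), (2, 2)])
print(seed)
```

The result has the same shape as the rules shown elsewhere in this presentation (type, constituent sequences, alignments) but carries no feature constraints yet; those would be added in the later steps.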
Manual Transfer Rules: Example
;; PASSIVE OF SIMPLE PAST (NO AUX) WITH LIGHT VERB
;; passive of 43 (7b)
{VP,28}
VP::VP : [V V V] -> [Aux V]
(
  (X1::Y2)
  ((x1 form) = root)
  ((x2 type) =c light)
  ((x2 form) = part)
  ((x2 aspect) = perf)
  ((x3 lexwx) = 'jAnA')
  ((x3 form) = part)
  ((x3 aspect) = perf)
  (x0 = x1)
  ((y1 lex) = be)
  ((y1 tense) = past)
  ((y1 agr num) = (x3 agr num))
  ((y1 agr pers) = (x3 agr pers))
  ((y2 form) = part)
)
Manual Transfer Rules: Example
Source tree: [NP [PP [NP [N1 [N jIvana]]] [P ke]] [NP1 [Adj eka] [N aXyAya]]]
Target tree: [NP [NP1 [Adj one] [N chapter]] [PP [P of] [NP [N1 [N life]]]]]

; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
;     ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
  (X1::Y2)
  (X2::Y1)
; ((x2 lexwx) = 'kA')
)
{NP,13}
NP::NP : [NP1] -> [NP1]
(
  (X1::Y1)
)
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
(
  (X1::Y2)
  (X2::Y1)
)
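The reordering performed by rules {NP,12} and {PP,12} on the example "jIvana ke eka aXyAya" can be traced with a small Python sketch. This is an illustration under stated assumptions, not the run-time transfer system: the tree encoding, the `RULES` table, and the `transfer` function are hypothetical, the source tree is slightly simplified, the word translations are taken from the example above, and constraints and target category labels are ignored.

```python
# Reordering rules keyed by (parent, source children). The value gives, for
# each target position, the 1-based index of the aligned source child.
RULES = {
    ("NP", ("PP", "NP1")): [2, 1],    # {NP,12}: (X1::Y2) (X2::Y1)
    ("PP", ("NP", "Postp")): [2, 1],  # {PP,12}: (X1::Y2) (X2::Y1)
}

# Word translations as shown in the example trees.
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}


def transfer(node):
    """node is (category, word) for a leaf, or (category, [children])."""
    cat, body = node
    if isinstance(body, str):
        return [LEXICON[body]]
    child_cats = tuple(child[0] for child in body)
    # Without a matching rule, children keep their source order.
    order = RULES.get((cat, child_cats), range(1, len(body) + 1))
    words = []
    for i in order:
        words.extend(transfer(body[i - 1]))
    return words


tree = ("NP", [
    ("PP", [("NP", [("N", "jIvana")]), ("Postp", "ke")]),
    ("NP1", [("Adj", "eka"), ("N", "aXyAya")]),
])
print(" ".join(transfer(tree)))
```

The sketch prints "one chapter of life": {NP,12} moves the NP1 ahead of the PP, and {PP,12} turns the postposition into a preposition placed before its NP.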