Hindi SLE Debriefing
AVENUE Transfer System
July 3, 2003
Summary of our Final Hindi-to-English Transfer System
• Overview of our Lexical Resources and how they were used in the system
• Grammar Development
• Transfer System Runtime Configuration
• Dev-test Evaluation Results
• Observations and Lessons Learned
Elicited Data Collection
• Goal: acquire high-quality, word-aligned Hindi-English data to support system development, especially grammar development and automatic grammar learning
• We recruited a sizeable team of bilingual speakers – Rachel…
• The “original” Elicitation Corpus was translated into Hindi
• A corpus of phrases extracted from the Brown Corpus (NPs and PPs) was broken into files and assigned to translators, here and in India
• Result: a total of 17589 word-aligned translated phrases
Summary of Lexical Resources
• Manual: manually written phrase transfer rules (72)
• Postpos: manually written postposition rules (105)
• Bigram: translations of the 500 most frequent bigrams in Hindi (from Ralf)
• Elicited: elicited data from the controlled corpus and Brown, w-to-w and p-to-p, total of 84619 lexical and phrase rules
• LDC: “master” bilingual dictionary from LDC, frequency sorted; Richard and Shobha manually cleaned up the top 12% of entries; total of 87902 rules
• NE: Named Entity lists from the LDC website and from Fei, total of 1237 + 2109 = 3346 rules
• IBM: statistical w-to-w and p-to-p lexicon from IBM, sorted by translation probability, 81664 rules
• JOY: SMT system w-to-w and p-to-p lexicon, sorted by translation probability, 189583 rules
• TOTAL: 447791 rules
Ordering of Lexical Resources
• Corresponds to the three passes of the system:
  • Phrase-to-phrase (used in first pass)
  • POS-tagged w-to-w pass (morph, enhanced, sorted, can feed into grammar)
  • LEX-tagged w-to-w pass (full forms, can only be used for w-to-w, no grammar)
Ordering of Lexical Resources
• Man rules (p-to-p, w-to-w)
• Postpos (w-to-w)
• Bigrams (p-to-p)
• LDC (w-to-w, enhanced, sorted)
• Etposrules (w-to-w, enhanced, sorted)
• NE (p-to-p, w-to-w)
• Etlexrules (w-to-w, sorted)
• Etphraserules (p-to-p)
• IBM (p-to-p, w-to-w, sorted)
• JOY (p-to-p, w-to-w, sorted)
Cleaned up and duplicates removed
Total Rules in Global Lexicon: xxx
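The slide says only that the resources are ordered, cleaned up, and de-duplicated. A minimal sketch of one way to do this, assuming duplicates are exact (Hindi, English) pairs and that the earlier resource in the ordering wins. The function name `merge_lexicons` and the toy entries are hypothetical, not the actual AVENUE tooling:

```python
from typing import Iterable, List, Set, Tuple

Entry = Tuple[str, str]            # (hindi, english); assumed entry shape
Tagged = Tuple[str, str, str]      # (hindi, english, source resource)


def merge_lexicons(resources: Iterable[Tuple[str, List[Entry]]]) -> List[Tagged]:
    """Concatenate resources in priority order, keeping the first copy of each pair."""
    seen: Set[Entry] = set()
    merged: List[Tagged] = []
    for source, entries in resources:
        for hindi, english in entries:
            if (hindi, english) in seen:
                continue               # an earlier, higher-priority resource already has it
            seen.add((hindi, english))
            merged.append((hindi, english, source))
    return merged


# Toy usage: the IBM copy of "ke lie -> for" is dropped in favour of the manual rule.
toy = [
    ("MANUAL", [("ke lie", "for")]),
    ("IBM", [("ke lie", "for"), ("senA", "army")]),
]
global_lexicon = merge_lexicons(toy)
```

Keeping the source tag on each entry is what would make per-source debug output and histograms possible later.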
Grammar Development
• Grammar covers mostly VPs (verb complexes)
• 73 grammar rules, covering all tenses, active and passive, subjunctive
• We also experimented with simple NP and PP rules (moving Hindi postpositions to English prepositions); these hurt performance
• Problems in grammar testing and debugging – Ari…
Example Grammar Rule
;; SIMPLE PRESENT AND PAST (depends on the tense of the Aux)
; Ex: (tu) bolta hE -> (I) (usually) speak
; Ex: (maiM) sotA hUM -> (I) sleep (now)
; Ex: (maiM) sotA thA -> (I) slept (used to sleep)
{VP,5}
VP::VP : [V Aux] -> [V]
(
  (X1::Y1)
  ((x1 form) = part)
  ((x1 aspect) = imperf)
  ((x2 lexwx) = 'honA')
  ((x2 tense) = (*NOT* fut))
  ((x2 tense) = (*NOT* subj))
  ((x0 tense) = (x2 tense))
  ((x0 agr num) = (x2 agr num))
  ((x0 agr pers) = (x2 agr pers))
  (x0 = x1)
  ((y1 tense) = (x0 tense))
  ((y1 agr num) = (x0 agr num)) ; not always agrees, try commenting
  ((y1 agr pers) = (x0 agr pers))
)
Transfer Runtime System
• Three passes:
  • Pass 1: match against p-to-p entries, halt if match found (ver2 allows continuing)
  • Pass 2: morphologically analyze the word and match against all w-to-w resources, halt if match found
  • Pass 3: match the original word against all w-to-w resources; provides only w-to-w output, no feeding into grammar rules
• Selection of best set of arcs: greedy left-to-right search that prefers longer input segments
• Unknown-word policy: replace with English “the”
• Post-processing:
  • Remove be/give at end of sentence (eos) if preceded by a verb
  • Replace all remaining “be” with “is”
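The runtime behaviour above can be sketched roughly as follows. This is a simplification under stated assumptions, not the AVENUE implementation: the real system builds a lattice of arcs and applies grammar rules, whereas this sketch folds lookup and greedy selection into one loop. The toy dictionaries, `identity_morph`, the `verbs` set, and the treatment of a final "." as sentence punctuation are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy resources; the real lexicons hold hundreds of thousands of rules.
PHRASES: Dict[Tuple[str, ...], str] = {("ke", "lie"): "for"}   # pass 1: p-to-p
POS_LEXICON: Dict[str, str] = {"senA": "army"}                 # pass 2: analyzed forms
LEX_LEXICON: Dict[str, str] = {"amerikI": "american"}          # pass 3: full forms


def identity_morph(word: str) -> str:
    """Placeholder for the morphological analyzer (assumption: returns a lookup key)."""
    return word


def translate(tokens: List[str],
              morph: Callable[[str], str] = identity_morph,
              max_phrase_len: int = 4) -> List[str]:
    """Greedy left-to-right cascade that prefers longer input segments."""
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Pass 1: try the longest phrase starting at position i first.
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + n])
            if segment in PHRASES:
                output.append(PHRASES[segment])
                i += n
                break
        else:
            word = tokens[i]
            key = morph(word)
            if key in POS_LEXICON:        # Pass 2: morphologically analyzed word
                output.append(POS_LEXICON[key])
            elif word in LEX_LEXICON:     # Pass 3: original full form
                output.append(LEX_LEXICON[word])
            else:                         # Unknown-word policy from the slide
                output.append("the")
            i += 1
    return output


def postprocess(words: List[str], verbs: Set[str]) -> List[str]:
    """Drop a sentence-final be/give after a verb, then map remaining 'be' to 'is'."""
    words = list(words)
    end = len(words) - 1
    if end >= 0 and words[end] == ".":    # assumption: skip final punctuation
        end -= 1
    if end >= 1 and words[end] in ("be", "give") and words[end - 1] in verbs:
        del words[end]
    return ["is" if w == "be" else w for w in words]
```

For example, `translate(["amerikI", "senA", "ke", "lie", "xyz"])` takes the two-word phrase match for "ke lie" over any single-word match and falls back to "the" for the unknown token "xyz".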
Development Testing
• Three dev-test sets:
  • India Today: 59 sentences, single reference
  • Full ISI: 358 sentences, newswire, single reference
  • Small ISI: first 25 sentences of Full ISI
• Full ISI was the most meaningful test set; we tested on India Today (IT) earlier on, and also used it to check for over-fitting
Final Performance: ISI-Full
• Lexicon xferdict.0630-al-4
Debug Output with Sources
amerikI senA ne kahA hE ki irAka kI galiyoM meM cAro waraPa vyApwa aparAXa ko niyaMwriwa karane ke lie uMhoMne irAkiyoM ko senA ke kAma meM Ane vAle haWiyAra sOMpane ke lie 2 sapwAha kA samaya xiyA hE .
<AMERICAN---ETLEX> <ARMY---SORTED> <SAID---ETLEX> <BE---SORTED> <THAT ---POSTPOS> <IRAQ---IBM> <OF---MANUALERIK> <LANES---SORTED> <IN---MANUALERIK> <ROUND---IBM> <SIDE---SORTED> <PERVADING---IBM> <TO THE CRIME---ETPHRASE> <CONTROLLED---IBM> <TO MAKE---ETLEX> <THE> <IRAQIS---MANUALARI> <TO---MANUALERIK> <ARMY---SORTED> <WORK---JOY> <CAME TO---JOY> <ONES---SORTED> <WEAPONS---SORTED> <CHARGE---SORTED> <FOR---BIGR> <2> <WEEK---SORTED> <OF---MANUALERIK> <TIME---SORTED> <THEY HAVE---JOY> <.>
Histogram of Source Information
SORTED total = 2425
IBM total = 447
JOY total = 483
MANUAL ERIK total = 619
MANUAL ARI total = 139
BIGR total = 196
POSTPOS total = 510
TIMEEXP total = 4
ETLEX total = 0
ETPHRASE total = 0
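A histogram like this can be tallied from debug output in the `<WORD---SOURCE>` form shown on the previous slide. The sketch below assumes that tag format exactly as it appears there (output text, three hyphens, an uppercase source name) and ignores tags without a source such as `<THE>` or `<.>`. The function name `source_histogram` is hypothetical.

```python
import re
from collections import Counter

# Matches tags such as <ARMY---SORTED> or <THAT ---POSTPOS>; untagged <THE> is skipped.
TAG = re.compile(r"<([^<>]*?)---([A-Z]+)>")


def source_histogram(debug_output: str) -> Counter:
    """Count how many output segments came from each lexical source."""
    return Counter(source for _text, source in TAG.findall(debug_output))


sample = "<AMERICAN---ETLEX> <ARMY---SORTED> <THAT ---POSTPOS> <THE> <WEEK---SORTED> <.>"
hist = source_histogram(sample)
for source, total in hist.most_common():
    print(f"{source} total = {total}")
```

Running the same function over the debug output of a whole test set would give per-source totals in the format of the slide.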
Things we Tried at Last Minute
• Allowing the second pass to take place even when the first pass finds phrase matches: no improvement in score
• Throwing in NP rules and solving the lost unigrams by a clever final pass that replaces the choices for words: hurt score slightly…
Real Eval Set Transfer Run
• Eval set consisted of 450 sentences from a variety of newswire sources
• Suspicion that some sentences were drawn from the dev data!
• We submitted XFER-ONLY and XFER-ONLY+CASE
• Aggregate stats from our run:
  • Coverage: 88.3%
  • Compounds matched: 2279 (token)
  • Went through morph and matched: 6256/9605
  • Unknown Hindi words: 1122
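The slide does not say how coverage was computed. One reading that is consistent with the reported numbers, offered here only as an assumption, is that 9605 is the number of word tokens looked up and coverage is the share of those that were not unknown. The variable names below are hypothetical.

```python
# Figures taken from the slide; the interpretation of 9605 as the token total is assumed.
tokens_looked_up = 9605
matched_after_morph = 6256
unknown_words = 1122

# Share of tokens that received some translation (i.e. were not unknown).
coverage_pct = 100 * (tokens_looked_up - unknown_words) / tokens_looked_up

# Share of tokens matched after morphological analysis.
morph_match_pct = 100 * matched_after_morph / tokens_looked_up

print(f"coverage: {coverage_pct:.1f}%")
print(f"matched after morph: {morph_match_pct:.1f}%")
```

Under this reading the computed coverage rounds to the 88.3% on the slide, and roughly 65% of the tokens were matched after morphological analysis.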
Limited Resource Scenario
• The “rules of the game” were skewed against us in this evaluation:
  • 1.5 million words of parallel text
  • Noisy statistical lexical resources
  • We don’t have a strong statistical selection model
• How do we do in the minority language scenario, with our limited resources?
• Kathrin ran a test with a lexicon constructed just from Man rules, bigrams, postpos, LDC dict and Elicited data
• We will also test EBMT and SMT under the same scenario!
Results: ISI-Full
Observations and Lessons
• Serious grammar development occurred very late in the process (last few days)
• We had a very hard time getting the grammar to start pulling performance numbers up
• Grammar rules are often blocked from applying because of phrasal matches
• It is rather hard to find cases where they were supposed to apply and didn’t
• NP/PP rules did not help, partly because NP boundaries were not adequately found
• Strange phenomenon of losing unigrams when NP rules apply; we need to debug this thoroughly
Things we should find out
• What sources did output come from in the real eval test run? Get histogram…
• What is the marginal contribution of the various resources to our performance?
• Conduct runs with individual resources omitted, in particular, without:
  • Our elicited data
  • The IBM data
  • The “Joy” data
  • The LDC Lexicon
  • The phrase-to-phrase pass
• Can more grammar development help?
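The proposed omission runs amount to a leave-one-out ablation. A minimal sketch of how the configurations could be enumerated is shown below. The resource labels follow the lexical resource summary, while `score` stands in for a full build-translate-evaluate cycle and is purely hypothetical. No real scores are implied.

```python
from typing import Callable, Dict, List

RESOURCES: List[str] = ["MANUAL", "POSTPOS", "BIGRAM", "ELICITED", "LDC", "NE", "IBM", "JOY"]


def leave_one_out(resources: List[str]) -> Dict[str, List[str]]:
    """Map each omitted resource to the list of resources kept for that run."""
    return {omitted: [r for r in resources if r != omitted] for omitted in resources}


def marginal_contributions(resources: List[str],
                           score: Callable[[List[str]], float]) -> Dict[str, float]:
    """Full-system score minus the score with one resource removed, per resource."""
    full = score(resources)
    return {omitted: full - score(kept)
            for omitted, kept in leave_one_out(resources).items()}


# Toy scoring function (assumption): one point per resource, just to exercise the code.
toy_deltas = marginal_contributions(RESOURCES, lambda kept: float(len(kept)))
```

The run without the phrase-to-phrase pass is a change to the runtime configuration rather than to the lexicon, so it would need a separate switch that this sketch does not model.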
Further Work on our Hindi System
• Our Hindi system is an ideal platform for advanced thesis-related research work from now on
• The eval test set will remain unseen test data for future experimentation (reference translations will be available soon)
• Low-paced further system development throughout July (grammar, bug fixes)
• Worthwhile new results to be reported at the August PI meeting