Coping with Surprise: Multiple CMU MT Approaches
Alon Lavie, Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking
Language Technologies Institute, Carnegie Mellon University
Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen
Main Hindi SLE Efforts
• Data Collection
  • Elicited Data Collection
  • Data from contacts in India
  • Web Crawling
• Language Processing Utilities
  • Morphology
  • Encoding identification and conversion
• MT system development
  • XFER system
  • SMT system
  • EBMT system
TIDES PI Meeting/ SLE
Elicited Data Collection
• Goal: acquire high-quality, word-aligned Hindi-English data to support XFER system development (grammar learning)
• Recruited a team of ~20 bilingual speakers at CMU and in India
• Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
• Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
• Resulting in a total of 17,589 word-aligned translated phrases (~50K words)
The CMU Elicitation Tool
Elicited Data Collection: High-quality, word-aligned data
• Controlled elicitation corpus, translated and aligned by Hindi speakers
  • Typologically diverse, limited vocabulary

Elicited Data Collection: High-quality, word-aligned data
• Uncontrolled elicitation corpus: English phrases extracted from the Brown Corpus, translated by Hindi speakers
  • Specific constituent types, large vocabulary

Elicited Data Collection: High-quality, word-aligned data
• Variety of phrase complexities and phrase lengths
Elicited Data Collection
• Problems and issues:
  • The English-to-Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
  • However, the bilingual informants were not well accustomed to typing Hindi, which led to some typos
  • Limits utility of the data, little effect on accuracy
  • Using the WSJ portion of the Penn TreeBank may have been a better fit for genre
Main CMU Contributions to SLE Shared Resources
• Elicited Data Corpus (~50KB)
• Indian Government Parallel Text: ERDC.tgz (338 MB)
• CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
• Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
• CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
• CMU Phrases and sentences: CMU-phrases+sentences.zip (468 KB)
• Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54 KB)
Web Crawling:
• Most sites with possible parallel texts had Hindi in proprietary encodings
• Osho: http://www.osho.com/Content.cfm?Language=Hindi
Hindi Morphological Analyzer
• http://www.iiit.net/ltrc/morph/index.htm
• High-quality, high-coverage morphological analyzer from IIIT
• Input: fully inflected forms (RomanWX encoding)
• Output: root form + collection of features
• Installing it as a local server required some effort, e.g. UTF-8 to RomanWX conversion
• Used primarily in our XFER system
Other Hindi Processing Utilities
• Encoding identification and conversion tools
  • Built two automatic encoding identifiers, used for web data collection
  • Located and installed encoding converters for a variety of encodings
  • Most widely used was UTF-8 to RomanWX
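As a rough illustration of what an encoding identifier has to decide, the sketch below checks whether a byte string is valid UTF-8 and consists mostly of Devanagari characters. It is not one of the two CMU identifiers mentioned on the slide; the function names and the 0.5 threshold are assumptions made for this example only.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def looks_like_utf8_hindi(raw: bytes, threshold: float = 0.5) -> bool:
    """Hypothetical check: valid UTF-8 whose text is mostly Devanagari.

    The threshold is an illustrative assumption, not a tuned value.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes in a proprietary font encoding or ISCII usually fail here.
        return False
    return devanagari_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_like_utf8_hindi("हिन्दी भाषा".encode("utf-8")))
    print(looks_like_utf8_hindi(b"plain English text"))
```

A page that fails this check would still need a separate classifier to tell the remaining candidate encodings apart; this sketch only covers the UTF-8 case.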
XFER System for Hindi
• Three transfer strategies:
  • Match against phrase-to-phrase entries (full forms, no morphology)
  • Morphologically analyze input words and match against lexicon
    • Matches feed into manual and learned transfer rules
  • Match original word against lexicon: provides word-to-word translation as a fallback for input not otherwise covered
• Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
• “Strong” decoding with lattices + LM: NIST 5.47
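The "simple decoding" strategy can be sketched as a greedy longest-match pass over the input. This is a minimal illustration under stated assumptions, not the XFER decoder itself: the lexicon is a plain dictionary from source-token tuples to target strings, the romanized entries are toy examples, and uncovered tokens are simply passed through.

```python
from typing import Dict, List, Tuple


def greedy_translate(
    tokens: List[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_len: int = 4,
) -> List[str]:
    """Greedy left-to-right search that prefers the longest matching segment."""
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate segment first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + n])
            if segment in lexicon:
                output.append(lexicon[segment])
                i += n
                break
        else:
            # No entry covers this token: pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return output


if __name__ == "__main__":
    # Toy lexicon with made-up romanized entries (illustrative only).
    toy_lexicon = {
        ("nayii", "dillii"): "New Delhi",
        ("nayii",): "new",
        ("meM",): "in",
    }
    print(greedy_translate(["nayii", "dillii", "meM", "xyz"], toy_lexicon))
```

Because the search commits to the first (longest) match at each position, it never revisits a choice; the lattice-plus-LM decoding on the slide instead keeps alternative segmentations and lets a language model choose among them.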
Examples of Learned Rules
SMT System for Hindi
• Resources
  • Trained on commonly available bilingual corpora
  • Used bilingual Hindi-English dictionary
  • Named Entities
  • 70-million-word English LM
• CMU SMT System
  • Tuned on ISI devtest data
  • Monotone decoding, as reordering did not result in improvement on this test set
  • Mixed casing based on Named Entities and simple rules
• NIST score: 6.74
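The mixed-casing step can be pictured as a small post-processing pass over lowercase output. The sketch below is an assumption about what "Named Entities and simple rules" might look like, not the CMU implementation: the named-entity table is a hypothetical dictionary from lowercase tokens to their cased forms, and the only rules are sentence-initial capitalization and the pronoun "I".

```python
from typing import Dict, List


def restore_case(tokens: List[str], named_entities: Dict[str, str]) -> List[str]:
    """Re-case lowercase MT output using a named-entity table plus two simple rules."""
    cased: List[str] = []
    for idx, tok in enumerate(tokens):
        if tok in named_entities:
            # Named-entity table takes priority over the generic rules.
            cased.append(named_entities[tok])
        elif idx == 0 or tok == "i":
            # Rule 1: capitalize the first token; rule 2: the pronoun "I".
            cased.append(tok.capitalize())
        else:
            cased.append(tok)
    return cased


if __name__ == "__main__":
    entities = {"delhi": "Delhi"}  # illustrative entry
    print(" ".join(restore_case("the minister visited delhi".split(), entities)))
```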
EBMT System for Hindi
• Training data: same as SMT + a few hand-written equivalence class generalizations
• English LM built from the APW portion of the GigaWord Corpus (600M words)
• Encoding variation: raw training data in a variety of different encodings, all converted to UTF-8 (already supported by EBMT)
• Preprocessing of example phrases to improve word matching:
  • Match Hindi possessive with English 's
• NIST score: 5.98
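One plausible reading of the possessive preprocessing step is that the English clitic 's is split off as its own token, so that it can be matched against the separate Hindi possessive postposition. The slide does not give the details, so the sketch below is an assumption, and the function name is hypothetical.

```python
import re

# Matches a word character followed by the clitic 's at a word boundary.
_POSSESSIVE = re.compile(r"(\w)'s\b")


def split_possessive(english: str) -> str:
    """Separate the English possessive clitic so it forms its own token."""
    return _POSSESSIVE.sub(r"\1 's", english)


if __name__ == "__main__":
    print(split_possessive("the minister's office"))
```

With the clitic as a separate token, a word-level matcher sees one English token per Hindi token in the possessive construction instead of a fused form.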
A Truly Limited Data Scenario for Hindi-to-English
• Put together a scenario with very miserly data resources:
  • Elicited Data corpus: 17,589 phrases
  • Cleaned portion (top 12%) of LDC dictionary: ~2,725 Hindi words (23,612 translation pairs)
  • Manually acquired resources during the SLE:
    • 500 manual bigram translations
    • 72 manually written phrase transfer rules
    • 105 manually written postposition rules
    • 48 manually written time expression rules
  • No additional parallel text!
• Results presented tomorrow…
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: [From TidesSLList Archive website]
• Vogel email 6/2
  • Hindi Language Resources: http://www.cs.colostate.edu/~malaiya/hindilinks.html
  • General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
  • Dictionaries at: http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
  • English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
  • A small English to Urdu dictionary: http://www.cs.wisc.edu/~navin/india/urdu.dictionary
  • The Bible at: http://www.gospelcom.net/ibs/bibles/
  • The Emille Project: http://www.emille.lancs.ac.uk/home.htm
  • [Hardcopy phrasebook references]
  • A Monthly Newsletter of Vigyan Prasar: http://www.vigyanprasar.com/dream/index.asp
  • Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
• Tribble email, via Vogel 6/2
  Possible parallel websites:
  • http://www.bbc.co.uk (English)
  • http://www.bbc.co.uk/urdu/ (Hindi)
  • http://sify.com/news_info/news/
  • http://sify.com/hindi/
  • http://in.rediff.com/index.html (English)
  • http://www.rediff.com/hindi/index.html (Hindi)
  • http://www.indiatoday.com/itoday/index.html
  • http://www.indiatodayhindi.com
• Vogel email 6/2
  • http://us.rediff.com/index.html
  • http://www.rediff.com/hindi/index.html [Already listed]
  • http://www.niharonline.com/
  • http://www.niharonline.com/hindi/index.html
  • http://www.boloji.com/hindi/index.html
  • http://www.boloji.com/hindi/hindi/index.htm
  • The Gita Supersite: http://www.gitasupersite.iitk.ac.in/
  • Press Information Bureau, Government of India
    • English: http://pib.nic.in/
    • Hindi: http://pib.nic.in/urdu/hindimain.html
Other CMU Contributions to SLE Shared Resources
FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
• 6/20 Parallel Hindi/English webpages:
  • GAIL (Natural Gas Co.): http://gail.nic.in/ (UTF-8) [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: [From TidesSLList Archive website]
• Frederking email 6/3 [announced], 6/4 [provided]
  • Ralf Brown's idenc encoding classifier
• Frederking email 6/5
  • PDF extractions from LanguageWeaver URLs:
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
    http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
• Frederking email 6/5
  • Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
• Frederking email 6/11
  • Erik Peterson here has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz
Other CMU Contributions to SLE Shared Resources
SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.) [From TidesSLList Archive website]
• Levin email 6/13
  • Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
• Frederking email 6/20
  • Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
  • PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
• Frederking email 6/24
  • Several individual parallel webpages; sites may have more:
    www.commerce.nic.in/setup.htm
    www.commerce.nic.in/hindi/setup.html
    mohfw.nic.in/kk/95/books1.htm
    mohfw.nic.in/oph.htm
    wwww.mp.nic.in