
Coping with Surprise: Multiple CMU MT Approaches

Alon Lavie, Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking. Language Technologies Institute, Carnegie Mellon University. Joint work with: Katharina Probst, Erik Peterson, Joy Zhang,


Presentation Transcript


  1. Coping with Surprise: Multiple CMU MT Approaches
     Alon Lavie, Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking
     Language Technologies Institute, Carnegie Mellon University
     Joint work with: Katharina Probst, Erik Peterson, Joy Zhang, Fei Huang, Alicia Tribble, Ariadna Font-Llitjos, Rachel Reynolds, Richard Cohen

  2. Main Hindi SLE Efforts
     • Data Collection
       • Elicited Data Collection
       • Data from contacts in India
       • Web Crawling
     • Language Processing Utilities
       • Morphology
       • Encoding identification and conversion
     • MT system development
       • XFER system
       • SMT system
       • EBMT system
     TIDES PI Meeting/ SLE

  3. Elicited Data Collection
     • Goal: Acquire high-quality, word-aligned Hindi-English data to support XFER system development (grammar learning)
     • Recruited a team of ~20 bilingual speakers at CMU and in India
     • Extracted a corpus of phrases (NPs and PPs) from the Brown Corpus section of the Penn TreeBank
     • Controlled Elicitation Corpus (typologically diverse, limited vocabulary) also translated into Hindi
     • Resulted in a total of 17589 word-aligned translated phrases (~50K words)

  4. The CMU Elicitation Tool

  5. Elicited Data Collection: High quality, word-aligned data
     Controlled elicitation corpus translated and aligned by Hindi speakers - typologically diverse, vocabulary limited

  6. Elicited Data Collection: High quality, word-aligned data
     Uncontrolled elicitation corpus: English phrases extracted from the Brown Corpus, translated by Hindi speakers - specific constituent types, large vocabulary

  7. Elicited Data Collection: High quality, word-aligned data
     Variety of phrase complexities and phrase lengths

  8. Elicited Data Collection
     • Problems and issues:
       • English → Hindi direction allowed us to use the Penn TreeBank to extract accurate phrases
       • However, bilingual informants were not well accustomed to typing Hindi → some typos
       • Limits utility of the data, little effect on accuracy
       • Using the WSJ portion of the Penn TreeBank may have been a better fit for genre

  9. Main CMU Contributions to SLE Shared Resources
     • Elicited Data Corpus (~50KB)
     • Indian Government Parallel Text: ERDC.tgz (338 MB)
     • CMU Phrase Lexicon: Joyphrase.gz (3.5 MB)
     • Cleaned IBM lexicon: ibmlex-cleaned.txt.gz (1.5 MB)
     • CMU Aligned Sentences: CMU-aligned-sentences.tar.gz (1.3 MB)
     • CMU Phrases and sentences: CMU-phrases+sentences.zip (468 KB)
     • Bilingual Named Entity List: IndiaTodayLPNETranslists.tar.gz (54 KB)
     Web Crawling:
     • Most sites with possible parallel texts had Hindi in proprietary encodings
     • Osho: http://www.osho.com/Content.cfm?Language=Hindi

  10. Hindi Morphological Analyzer
      • http://www.iiit.net/ltrc/morph/index.htm
      • High-quality, high-coverage morphological analyzer from IIIT
      • Input: full inflected forms (RomanWX encoding)
      • Output: root form + collection of features
      • Installing as a local server required some effort, e.g. UTF-8 → RomanWX
      • Used primarily in our XFER system

  11. Other Hindi Processing Utilities
      • Encoding identification and conversion tools
        • Built two automatic encoding identifiers, used for web data collection
        • Located and installed encoding converters from a variety of encodings
        • Most widely used was UTF-8 to RomanWX
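The idea of automatic encoding identification can be illustrated with a minimal sketch. This is not the CMU identifiers named in the slides (their internals are not described here); it is a toy heuristic that assumes the candidate encodings are Unicode encodings available in Python, and scores each one by the share of decoded characters that fall in the Devanagari block (U+0900 to U+097F). The function names `devanagari_ratio` and `guess_encoding` are hypothetical. Proprietary font encodings, which the slides report as the common case on the web, would need dedicated converters and are out of scope for this sketch.

```python
def devanagari_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


def guess_encoding(data: bytes, candidates=("utf-8", "utf-16-le", "utf-16-be")):
    """Return (encoding, score) for the candidate that yields the most Devanagari.

    Returns (None, 0.0) if no candidate decodes to any Devanagari text.
    The candidate list is an assumption made for this illustration.
    """
    best_name, best_score = None, 0.0
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # byte sequence is not valid in this encoding
        score = devanagari_ratio(text)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


sample = "हिन्दी भाषा"
print(guess_encoding(sample.encode("utf-8")))
print(guess_encoding(sample.encode("utf-16-le")))
```

A production identifier would more likely compare byte n-gram statistics against per-encoding models; the script-range check above only works once the bytes decode to Unicode.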

  12. XFER System for Hindi
      • Three transfer strategies:
        • match against phrase-to-phrase entries (full forms, no morphology)
        • morphologically analyze input words and match against lexicon
          • matches feed into manual and learned transfer rules
        • match original word against lexicon - provides word-to-word translation as fall-back for input not otherwise covered
      • Simple decoding: greedy left-to-right search that prefers longer input segments: NIST 5.35
      • “Strong” decoding with lattices + LM: NIST 5.47
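The "simple decoding" strategy above, a greedy left-to-right search that prefers longer input segments, can be sketched as follows. This is an illustration under stated assumptions, not the XFER system's decoder: the function name `greedy_decode`, the dictionary-based `phrase_table`, the `max_len` limit, and the pass-through of uncovered tokens are all hypothetical choices, and the tokens are placeholders rather than real Hindi.

```python
def greedy_decode(tokens, phrase_table, max_len=4):
    """Translate left to right, always taking the longest matching source span.

    phrase_table maps tuples of source tokens to a target string (assumed format).
    Tokens with no matching entry are copied through unchanged.
    """
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, then shrink down to a single token.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            source = tuple(tokens[i:i + span])
            if source in phrase_table:
                output.append(phrase_table[source])
                i += span
                break
        else:
            # No entry covers this position: fall back to the source token.
            output.append(tokens[i])
            i += 1
    return " ".join(output)


# Placeholder entries, not real translation pairs.
table = {("a", "b"): "AB", ("a",): "A", ("b",): "B", ("c",): "C"}
print(greedy_decode(["a", "b", "c", "d"], table))  # prints "AB C d"
```

Because the search commits to the first longest match and never backtracks, it cannot trade a long early segment for a better overall cover, which is one reason a lattice search with a language model can score higher.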

  13. Examples of Learned Rules

  14. SMT System for Hindi
      • Resources
        • Trained on commonly available bilingual corpora
        • Used bilingual Hindi-English dictionary
        • Named Entities
        • 70 million word English LM
      • CMU SMT System
        • Tuned on ISI devtest data
        • Monotone decoding, as reordering did not result in improvement on this test set
        • Mixed casing based on Named Entities and simple rules
        • NIST score: 6.74
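Restoring mixed case from a named-entity list plus simple rules might look like the sketch below. The slides do not say which rules the SMT system used, so the two rules here (capitalize sentence-initial tokens, capitalize the pronoun "i") and the single-token entity dictionary are assumptions for illustration, and `restore_case` is a hypothetical name.

```python
SENTENCE_END = {".", "?", "!"}


def restore_case(tokens, named_entities):
    """Recase lowercase MT output.

    named_entities maps a lowercase token to its cased form (assumed format).
    """
    cased = []
    for idx, tok in enumerate(tokens):
        if tok in named_entities:
            # Entity list takes priority over positional rules.
            cased.append(named_entities[tok])
        elif idx == 0 or tokens[idx - 1] in SENTENCE_END:
            # Simple rule: capitalize the first token of each sentence.
            cased.append(tok.capitalize())
        elif tok == "i":
            # Simple rule: the English pronoun is always uppercase.
            cased.append("I")
        else:
            cased.append(tok)
    return cased


entities = {"delhi": "Delhi", "india": "India"}
tokens = "the minister visited delhi . he left india".split()
print(" ".join(restore_case(tokens, entities)))
```

Multi-word entities would need span matching rather than per-token lookup; that is left out to keep the sketch short.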

  15. EBMT System for Hindi
      • Training data: same as SMT + a few hand-written equivalence class generalizations
      • English LM built from APW portion of GigaWord Corpus (600M words)
      • Encoding variation: raw training data in a variety of different encodings → all converted to UTF-8 (already supported by EBMT)
      • Preprocessing of example phrases to improve word matching:
        • Match Hindi possessive with English 's
      • NIST score: 5.98
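One plausible reading of the possessive preprocessing step is that English 's is split off as its own token, so that it can align with the separate Hindi possessive postposition (का / के / की). The sketch below shows only that English-side split; it is an assumption about the approach, not the EBMT system's actual preprocessing, and `split_possessive` is a hypothetical name.

```python
import re

# Matches 's (straight or curly apostrophe) directly after a word character.
_POSSESSIVE = re.compile(r"(\w)['\u2019]s\b")


def split_possessive(text: str) -> str:
    """Detach English 's into a separate token: "India's" -> "India 's"."""
    return _POSSESSIVE.sub(r"\1 's", text)


print(split_possessive("India's capital"))  # prints "India 's capital"
```

Note that this pattern also splits contractions such as "it's"; telling the two apart would need part-of-speech information, which this sketch does not use.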

  16. A Truly Limited Data Scenario for Hindi-to-English
      • Put together a scenario with very miserly data resources:
        • Elicited Data corpus: 17589 phrases
        • Cleaned portion (top 12%) of LDC dictionary: ~2725 Hindi words (23612 translation pairs)
        • Manually acquired resources during the SLE:
          • 500 manual bigram translations
          • 72 manually written phrase transfer rules
          • 105 manually written postposition rules
          • 48 manually written time expression rules
      • No additional parallel text!!
      • Results presented tomorrow…

  17. Other CMU Contributions to SLE Shared Resources
      FOUND RESOURCES not on LDC Website: [From TidesSLList Archive website]
      • Vogel email 6/2
        • Hindi Language Resources: http://www.cs.colostate.edu/~malaiya/hindilinks.html
        • General Information on Hindi Script: http://www.latrobe.edu.au/indiangallery/devanagari.htm
        • Dictionaries at: http://www.iiit.net/ltrc/Dictionaries/Dict_Frame.html
        • English to Hindi dictionary in different formats: http://sanskrit.gde.to/hindi/
        • A small English to Urdu dictionary: http://www.cs.wisc.edu/~navin/india/urdu.dictionary
        • The Bible at: http://www.gospelcom.net/ibs/bibles/
        • The Emille Project: http://www.emille.lancs.ac.uk/home.htm
        • [Hardcopy phrasebook references]
        • A Monthly Newsletter of Vigyan Prasar: http://www.vigyanprasar.com/dream/index.asp
        • Morphological Analyser: http://www.iiit.net/ltrc/morph/index.htm

  18. Other CMU Contributions to SLE Shared Resources
      FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
      • Tribble email, via Vogel 6/2
        Possible parallel websites:
        • http://www.bbc.co.uk (English)
        • http://www.bbc.co.uk/urdu/ (Hindi)
        • http://sify.com/news_info/news/
        • http://sify.com/hindi/
        • http://in.rediff.com/index.html (English)
        • http://www.rediff.com/hindi/index.html (Hindi)
        • http://www.indiatoday.com/itoday/index.html
        • http://www.indiatodayhindi.com
      • Vogel email 6/2
        • http://us.rediff.com/index.html
        • http://www.rediff.com/hindi/index.html [Already listed]
        • http://www.niharonline.com/
        • http://www.niharonline.com/hindi/index.html
        • http://www.boloji.com/hindi/index.html
        • http://www.boloji.com/hindi/hindi/index.htm
        • The Gita Supersite: http://www.gitasupersite.iitk.ac.in/
        • Press Information Bureau, Government of India
          • English: http://pib.nic.in/
          • Hindi: http://pib.nic.in/urdu/hindimain.html

  19. Other CMU Contributions to SLE Shared Resources
      FOUND RESOURCES not on LDC Website: (cont.) [From TidesSLList Archive website]
      • 6/20 Parallel Hindi/English webpages:
        • GAIL (Natural Gas Co.) http://gail.nic.in/ UTF-8. [Found by CMU undergrad Web team] [Mike Maxwell, LDC, found it at the same time.]
      SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: [From TidesSLList Archive website:]
      • Frederking email 6/3 [announced], 6/4 [provided]
        • Ralf Brown's idenc encoding classifier
      • Frederking email 6/5
        • PDF extractions from LanguageWeaver URLs:
          http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/English/
          http://progress.is.cs.cmu.edu/surprise/Hindi/ParDoc/06-04-2003/Hindi/
      • Frederking email 6/5
        • Richard Wang's Perl ident.pl encoding classifier and ISCII-UTF8.pl converter
      • Frederking email 6/11
        • Erik Peterson has put together a Perl wrapper for the IIIT Morphology package, so that the input can be UTF-8: http://progress.is.cs.cmu.edu/surprise/morph_wrapper.tar.gz

  20. Other CMU Contributions to SLE Shared Resources
      SHARED PROCESSED RESOURCES NOT ON LDC WEBSITE: (cont.) [From TidesSLList Archive website:]
      • Levin email 6/13
        • Directory of Elicited Word-Aligned English-Hindi Translated Phrases: http://progress.is.cs.cmu.edu/surprise/Elicited-Data/
      • Frederking email 6/20
        • Undecoded but believed to be parallel webpages: http://progress.is.cs.cmu.edu/surprise/merged_urls.txt
        • PDF extractions from same: http://progress.is.cs.cmu.edu/surprise/merged_urls/
      • Frederking email 6/24
        • Several individual parallel webpages; sites may have more:
          www.commerce.nic.in/setup.htm
          www.commerce.nic.in/hindi/setup.html
          mohfw.nic.in/kk/95/books1.htm
          mohfw.nic.in/oph.htm
          wwww.mp.nic.in
