Cross-lingual Information Access by Natural Language

Presentation Transcript


  1. Cross-lingual Information Access by Natural Language Kishore Papineni IBM T.J. Watson Research Center Yorktown Heights, NY

  2. Information access by natural language. User: "Where is the Taj Mahal?" Source: doc collection or the Web. System: "Do you mean the one in India or New Jersey?" Docs in many languages.

  3. Global Internet User Population [Charts for 2000 and 2005, each with the English share labeled] Source: Global Reach

  4. Type of Content and Answer [Grid with axes Content: Structured (DB) vs. Unstructured (docs), and Answer: Structured vs. Unstructured; cells labeled IR & Search, Transactional Systems (demo later), QA (‘demo’..)]

  5. NL Components • Statistical parsing • NLU, Named Entities, multilinguality • Dialog management • conversational web/telephony • Statistical Machine Translation • learning multilingual statistical dictionaries from corpora

  6. Named Entities: Numbers, dates, names of locations, people, … Eight Nobel Peace Prize winners, after being refused entry to Burma, came to neighboring Thailand to start the campaign for their fellow laureate. Ms. Suu Kyi won the prize in 1991 and has been held under house arrest for over three years by Rangoon's military junta.

  7. Named Entity Detection (shallow parsing) [Entity labels shown on the annotated passage: Date, Location, Location, Location, Organization] • Statistical named entity system • 1st place 2000 DARPA Mandarin Named Entity task Named entity components are used in QA, TDT, labeling clusters/taxonomies, ...

  8. Machine Translation - 1998 [Chart of languages by MT quality level (Readable, Editable, Marginal): Chinese, Spanish, Portuguese, Russian, Japanese, German, Korean, French, Vietnamese, Polish, Arabic, Ukrainian, Italian, Farsi, Serbo-Croatian, Thai, Dutch, Indonesian, Hungarian, Greek, Czech, Swedish] • The World - 1998* • ~228 Countries • >6,700 Languages • >39,000 Language, dialect, and alternate names * http://www.sil.org/ethnologue/

  9. Machine Translation for Search • High speed • Lower quality tolerable • Word order and inflection not that important • Training data: parallel or comparable corpora

  10. Statistical Translation Models: Fertility: p_n(n | f-, f, f+). Sense translation: p(e | f-, f, f+).
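
A minimal sketch of how the fertility and sense translation models on slide 10 might combine for retrieval-oriented translation. It assumes f- and f+ denote the source words immediately before and after f, which the slide does not state. The probability tables, the `<s>`/`</s>` boundary tokens, and the function name `translate_bag` are hypothetical. The sketch picks the most probable fertility n for each source word and emits the n most probable target words, ignoring word order; it is not claimed to be the decoder used in the talk.

```python
from typing import Dict, List, Tuple

Context = Tuple[str, str, str]  # (f-, f, f+): previous, current, next source word

# Toy probability tables (hypothetical values, for illustration only).
FERTILITY: Dict[Context, Dict[int, float]] = {
    ("<s>", "maison", "blanche"): {0: 0.05, 1: 0.90, 2: 0.05},
    ("maison", "blanche", "</s>"): {1: 0.80, 2: 0.20},
}
SENSE: Dict[Context, Dict[str, float]] = {
    ("<s>", "maison", "blanche"): {"house": 0.7, "home": 0.3},
    ("maison", "blanche", "</s>"): {"white": 0.9, "blank": 0.1},
}


def translate_bag(source: List[str]) -> List[str]:
    """Translate word by word into an unordered bag of target words."""
    padded = ["<s>"] + source + ["</s>"]
    output: List[str] = []
    for i in range(1, len(padded) - 1):
        ctx = (padded[i - 1], padded[i], padded[i + 1])
        fert = FERTILITY.get(ctx)
        sense = SENSE.get(ctx)
        if not fert or not sense:
            # Unknown context: pass the source word through unchanged.
            output.append(padded[i])
            continue
        n = max(fert, key=fert.get)  # most probable fertility
        ranked = sorted(sense, key=sense.get, reverse=True)
        output.extend(ranked[:n])  # n most probable target words
    return output
```

For example, `translate_bag(["maison", "blanche"])` yields a bag containing `house` and `white`. Because the output is only used for IR scoring, the sketch makes no attempt to restore target word order.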

  11. X-lingual Retrieval: Query Translation [Diagram: English query -> online E => C MT -> Chinese query -> IR scoring over Chinese docs -> ranked docs -> online C => E MT -> “English” for gisting] Caveat: MT isn’t perfect and queries tend to be very short.
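
The query-translation setup on slide 11 can be sketched roughly as follows. This is an illustration under stated assumptions, not the system from the talk: the dictionary `E2C` is a hypothetical stand-in for the E => C MT step, documents are assumed to be pre-segmented into tokens, and the score is a plain term-frequency sum rather than a real IR scoring function.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical English-to-Chinese dictionary standing in for the E => C MT step.
E2C: Dict[str, List[str]] = {
    "bank": ["银行"],
    "loan": ["贷款"],
}


def translate_query(query: List[str]) -> List[str]:
    """Translate each query word; words with no translation are dropped."""
    translated: List[str] = []
    for word in query:
        translated.extend(E2C.get(word, []))
    return translated


def rank(query_en: List[str], docs: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Score pre-segmented Chinese docs by summed frequency of translated terms."""
    terms = translate_query(query_en)
    scores = []
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores.append((doc_id, sum(tf[t] for t in terms)))
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda s: (-s[1], s[0]))
```

The sketch also makes the slide's caveat concrete: with a two-word query, a single word missing from the dictionary removes half of the evidence available for ranking.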

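  12. X-lingual Retrieval: Query & Document Translation [Diagram: Chinese docs are translated offline by C => E MT into “English” docs and IR-scored against the English query; the English query is also translated online by E => C MT into Chinese and IR-scored against the Chinese docs; score merging combines the results into ranked docs, shown as “English” for gisting]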
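
Slide 12 names a score-merging step but does not say how the scores are merged. One common option, shown here purely as an assumption, is to min-max normalize each run and interpolate linearly. The function names and the weight `alpha` are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant run maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def merge_scores(
    english_side: Dict[str, float],
    chinese_side: Dict[str, float],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Interpolate two normalized runs; a doc missing from one run scores 0 there."""
    a, b = min_max(english_side), min_max(chinese_side)
    merged = {
        doc: alpha * a.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
        for doc in set(a) | set(b)
    }
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalizing first matters because the two IR runs score different text (translated documents vs. original documents), so their raw scores are not on a shared scale.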

  13. TREC 9: X-lingual IR on HK corpus. TransWhiz: commercial MT from Taiwan. IBM SMT: statistical MT on HK corpus. [Chart comparing IBM SMT and TransWhiz across runs labeled Q, D, Q+D, Q+D+T, and Monolingual]

  14. Post-TREC: More data is better • Quality/quantity of training data (more data 200k sentences) • > 2x performance!

  15. Comparable Corpus Example: SDA newswire (French, German). Stories independently written, not translations; report same events. Comparable corpus advantages: availability; match domain to task; broader coverage. Parallel corpus advantages: match languages to each other; linguistic structure, details.

  16. IBM Documents & queries can be in any of English, French, German, or Italian. IBM placed 1st in ’98 and ’99 evaluations.

  17. Question Answering. Question: Who won the Nobel Peace Prize in 1991? Answer: Suu Kyi. Passage: Eight Nobel Peace Prize winners, after being refused entry to Burma, came to neighboring Thailand to start the campaign for their fellow laureate. Ms. Suu Kyi won the prize in 1991 and has been held under house arrest for over three years by Rangoon's military junta.

  18. [Diagram components: Question, Answer Class Tag, Query Expansion, IR Engine, Fast Match over ENCYC DB and TREC-9 DB; size labels: 70 Documents, 1 Million Documents] Question: Who won the Nobel Peace Prize in 1991? Answer class tag: PERSON. Example passages: "Eight Nobel Peace Prize winners, after being refused entry to Burma, came to neighboring Thailand to start the campaign for their fellow laureate. Ms. Suu Kyi won the prize in 1991 and has been held under house arrest for over three years by Rangoon's military junta." and "Other Wolf prize winners for 1991 include Norbert..."

  19. Question-Answering Architecture [Diagram components: Question, Answer Class Tag, Query Expansion, IR Engine (ENCYC DB, TREC-9 DB), Named Entity, Answer Selection, Answer] Question: Who won the Nobel Peace Prize in 1991? Answer class tag: PERSON. Passage: Eight Nobel Peace Prize winners, after being refused entry to Burma, came to neighboring Thailand to start the campaign for their fellow laureate. Ms. Suu Kyi won the prize in 1991 and has been held under house arrest for over three years by Rangoon's military junta.

  20. NE Example • The study was conducted by Dr. Aubrey Milunsky of the Center for Human Genetics at the Boston University School of Medicine and colleagues . • the study be conduct by Dr. Aubrey Milunsky of the Center for Human Genetics at the Boston University School of Medicine and colleague . • DT NN VBD VBN IN NP NP NP IN DT NP IN NP NP IN DT NP NP NP IN NP CC NNS . • Other The study was conducted by Dr. <b_enamex TYPE="PERSON"> Aubrey Milunsky <e_enamex> of the <b_enamex TYPE="ORGANIZATION"> Center for Human Genetics <e_enamex> at the <b_enamex TYPE="ORGANIZATION"> Boston University School of Medicine <e_enamex> and colleagues .
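
The tagged output in the last bullet of slide 20 can be turned back into (type, text) pairs with a small parser. This is a sketch that assumes the `<b_enamex TYPE="..."> ... <e_enamex>` markup appears exactly as on the slide and that entities are never nested; the function name `extract_entities` is hypothetical.

```python
import re
from typing import List, Tuple

# Matches one begin/end pair in the slide's markup; assumes no nesting.
ENAMEX = re.compile(r'<b_enamex TYPE="([A-Z]+)">\s*(.*?)\s*<e_enamex>')


def extract_entities(tagged: str) -> List[Tuple[str, str]]:
    """Return (entity type, entity text) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENAMEX.finditer(tagged)]


# The tagged sentence from slide 20.
TAGGED = (
    'The study was conducted by Dr. <b_enamex TYPE="PERSON"> Aubrey Milunsky '
    '<e_enamex> of the <b_enamex TYPE="ORGANIZATION"> Center for Human Genetics '
    '<e_enamex> at the <b_enamex TYPE="ORGANIZATION"> Boston University School '
    'of Medicine <e_enamex> and colleagues .'
)

if __name__ == "__main__":
    for entity_type, text in extract_entities(TAGGED):
        print(entity_type, "|", text)
```

On the slide's sentence this yields one PERSON entity and two ORGANIZATION entities.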

  21. Answer Selection • Lowering the granularity of IR [Diagram: Relevant Docs -> Window Generation over sentences i-1, i, i+1 -> weighted features (word match, thesaurus match, cluster match, NE match, dispersion match, missing words) -> argmax -> Ranked Answers]
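
A rough sketch of the window scoring on slide 21, under several assumptions: only three of the listed features are implemented (word match, NE match, missing words), the thesaurus, cluster, and dispersion features are omitted, windows span one or two adjacent sentences, and the weights in `WEIGHTS` are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

# (lowercased tokens, named-entity types detected in the sentence)
Sentence = Tuple[List[str], Set[str]]

# Hypothetical feature weights; the slide does not give values.
WEIGHTS: Dict[str, float] = {
    "word_match": 1.0,
    "ne_match": 2.0,
    "missing_words": -0.5,
}


def score_window(window: List[Sentence], query: Set[str], answer_class: str) -> float:
    """Weighted sum of the three implemented features for one window."""
    tokens = {t for toks, _ in window for t in toks}
    types = {ty for _, tys in window for ty in tys}
    matched = len(query & tokens)
    missing = len(query - tokens)
    has_answer_type = 1.0 if answer_class in types else 0.0
    return (
        WEIGHTS["word_match"] * matched
        + WEIGHTS["ne_match"] * has_answer_type
        + WEIGHTS["missing_words"] * missing
    )


def best_window(doc: List[Sentence], query: Set[str], answer_class: str) -> Tuple[int, int]:
    """Return the (start, end) sentence range with the highest score (argmax)."""
    candidates = [
        (i, j)
        for i in range(len(doc))
        for j in (i + 1, i + 2)
        if j <= len(doc)
    ]
    if not candidates:
        raise ValueError("document has no sentences")
    return max(candidates, key=lambda w: score_window(doc[w[0]:w[1]], query, answer_class))
```

Scoring windows of sentences rather than whole documents is what the slide calls lowering the granularity of IR: the selected window is small enough that a named entity of the expected answer class inside it is a plausible answer.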

  22. Research Issues • Machine translation • broad domain MT key component for Translingual QA • single objective measure • Monolingual QA • deal with “noisy” input from recognized audio or MT output • Information Extraction • deal with broad domain and noisy input • Topic Detection and Tracking • explore more detailed document analysis than just clustering/threading of documents • Document understanding/Summarization/NLG

  23. Speech Recognition (WER)

  24. Conclusion • Information access by natural language is important: info to everyone from anywhere! • Information is multilingual • Machine Translation is critical: • Bring the world closer! • Play with lots of data, supercomputers • Learn many languages and have lots of fun!

  25. Demos

  26. QA response 1 QA response 2
