
Cross-Lingual IR

Salim Roukos, IBM T. J. Watson Research Center, 9/11/02.

Presentation Transcript


  1. Cross-Lingual IR. Salim Roukos, IBM T. J. Watson Research Center, 9/11/02

  2. Assumptions for 2010 (Asilomar Report)
     • 1 TB memory, 1000 TB disk, 1B users
     • 1T devices => 1B servers
     • Self-managing, very secure, and very reliable
     • Auto-x: install, heal, adaptive, auto-tuning wizard
     • Information discovery: metadata for describing schema, cast operations
     • Federation across 1k, 1m databases
     • "Find the average enterprise-wide employee salary."
     • "Are there any really good Italian restaurants within 5 miles of where I live?"

  3. Exploit multilingual information streams
     • Xinhua, SDA, AFP, AP, ...
     • Parallel vs comparable documents
     • Build translingual search

  4. X-lingual Retrieval
     [Diagram: xxx docs, English docs, French docs, and Chinese docs online; MT components E => X, E => C, and X => E; English query; IR scoring; ranked docs; "English" for gisting]
     • Caveat: Machine translation isn't perfect, and queries tend to be short.
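The query-translation path in the diagram above (English query, E => X translation, IR scoring over foreign-language documents) can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the lexicon `TOY_LEXICON`, the helper names, and the overlap-count scoring are all assumptions standing in for a real MT engine and a real IR scoring function.

```python
from collections import Counter

# Hypothetical toy English->French lexicon; a real system would call an MT engine.
TOY_LEXICON = {
    "notebook": ["portable", "ordinateur"],
    "market": ["marché"],
    "share": ["part"],
}

def translate_query(query, lexicon):
    """Replace each query word with all of its dictionary translations."""
    translated = []
    for word in query.lower().split():
        # Words missing from the lexicon are passed through unchanged.
        translated.extend(lexicon.get(word, [word]))
    return translated

def overlap_score(doc_tokens, query_tokens):
    """Count occurrences of translated query terms in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_tokens)

def rank(docs, query_tokens):
    """Return document ids sorted by descending overlap score."""
    scores = {
        doc_id: overlap_score(text.lower().split(), query_tokens)
        for doc_id, text in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "fr1": "la part de marché du portable augmente",
    "fr2": "le restaurant italien est ouvert",
}
ranked = rank(docs, translate_query("notebook market share", TOY_LEXICON))
```

With this toy data, `ranked` puts `fr1` first. Because short queries give the translation step little context, the sketch keeps every dictionary alternative instead of committing to one translation.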

  5. From information need to query
     • Who has the largest market share for notebooks: IBM or Dell?
     • Q1: notebook market share
     • Q2: laptop market share IBM Dell
     • Q3: ThinkPad IBM Dell ?
     [Diagram: information need I, query q, documents D]
     • P(q | I) = p(q | D is R, C)
     • P(v in q | w in D): stemming, synonyms, translation
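One common way to use a word-mapping probability such as P(v in q | w in D) is to sum it over the words of a document, giving p(v | D) = sum over w of t(v | w) p(w | D). The slide does not give this formula, so the sketch below is an assumption; the table `T` and its values are hypothetical and would normally be estimated from stemming rules, synonym lists, or translation data.

```python
from collections import Counter

# Hypothetical mapping table t(v | w): probability that document word w
# surfaces as query word v (through stemming, synonymy, or translation).
T = {
    "notebooks": {"notebook": 0.6, "laptop": 0.3, "notebooks": 0.1},
    "thinkpad": {"thinkpad": 0.7, "laptop": 0.2, "notebook": 0.1},
}

def query_word_prob(v, doc_tokens, table):
    """Compute p(v | D) = sum over document words w of t(v | w) * p(w | D)."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    prob = 0.0
    for w, c in counts.items():
        # Words missing from the table are assumed to map only to themselves.
        t_vw = table.get(w, {w: 1.0}).get(v, 0.0)
        prob += t_vw * c / total
    return prob

doc = ["ibm", "thinkpad", "notebooks", "sales"]
p_laptop = query_word_prob("laptop", doc, T)
```

Here the document never contains "laptop", yet `p_laptop` is 0.125, because both "thinkpad" and "notebooks" contribute some mapping mass to it. This is how Q1, Q2, and Q3 on the slide can all reach the same documents.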

  6. Probabilistic Models of IR
     • D = document, C = doc collection, q = query
     • P(D is R | q, C) ∝ P(q | D is R, C) P(D is R | C)
     • Prior P(D is R | C): link analysis, other?
     • LM P(q | D is R, C): beyond 1g (unigrams)?
     • Currently: P(q | D is R) = k p(q | D) + (1 - k) p(q)
     • Need training data to estimate the model
     • Order of 100k queries (not 1k)
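The ranking rule on this slide can be sketched as a smoothed unigram query-likelihood model plus an optional document prior. Two assumptions are made here that the slide does not state: the mixture k p(q | D) + (1 - k) p(q) is applied per query term (the usual unigram reading), and the background model p(q) is estimated from collection counts. The function names and the default k = 0.5 are illustrative.

```python
import math
from collections import Counter

def query_log_likelihood(query_tokens, doc_tokens, collection_counts, k=0.5):
    """log P(q | D is R), mixing per term: k * p(t | D) + (1 - k) * p(t)."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    coll_len = sum(collection_counts.values())
    log_p = 0.0
    for t in query_tokens:
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_bg = collection_counts.get(t, 0) / coll_len if coll_len else 0.0
        mixed = k * p_doc + (1 - k) * p_bg
        if mixed == 0.0:
            # Term unseen in both the document and the collection.
            return float("-inf")
        log_p += math.log(mixed)
    return log_p

def rank(query_tokens, docs, log_prior=None, k=0.5):
    """Rank by log P(q | D is R, C) + log P(D is R | C)."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    log_prior = log_prior or {}  # uniform prior when none is supplied
    scores = {
        doc_id: query_log_likelihood(query_tokens, tokens, collection_counts, k)
        + log_prior.get(doc_id, 0.0)
        for doc_id, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": ["laptop", "market", "share", "ibm"],
    "d2": ["italian", "restaurants", "near", "me"],
}
ranking = rank(["laptop", "market", "share"], docs)
```

The `log_prior` argument is where a link-analysis score could enter as P(D is R | C). The background term keeps `d2` from scoring zero probability even though it shares no words with the query.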

  7. Probabilistic Model of What?
     • P(R | a, D, q, C)
     • Many features in ME/MIX models: word n-grams, synonyms, WordNet, ontologies
     • Hidden: topics, top N docs, ..
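A feature-based relevance model of this kind can be sketched as a binary maximum-entropy (logistic) model over a few hand-built features. Everything specific below is an assumption for illustration: the synonym table `SYNONYMS` stands in for WordNet, the weights are made up and untrained, and the variable a from the slide is left out because the slide does not define it.

```python
import math

# Hypothetical synonym table standing in for WordNet or an ontology.
SYNONYMS = {"laptop": {"notebook"}, "notebook": {"laptop"}}

def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def features(query, doc):
    """Feature vector: bias, unigram overlap, bigram matches, synonym matches."""
    q_set, d_set = set(query), set(doc)
    syn_hits = sum(1 for t in q_set if SYNONYMS.get(t, set()) & d_set)
    return [
        1.0,
        len(q_set & d_set) / len(q_set) if q_set else 0.0,
        float(len(ngrams(query, 2) & ngrams(doc, 2))),
        float(syn_hits),
    ]

def p_relevant(query, doc, weights):
    """Binary maximum-entropy model: P(R | D, q) = sigmoid(w . f(q, D))."""
    z = sum(w * f for w, f in zip(weights, features(query, doc)))
    return 1.0 / (1.0 + math.exp(-z))

WEIGHTS = [-2.0, 2.0, 1.0, 1.5]  # illustrative values, not trained

query = ["laptop", "market", "share"]
p_good = p_relevant(query, ["ibm", "notebook", "market", "share"], WEIGHTS)
p_bad = p_relevant(query, ["italian", "restaurants", "nearby"], WEIGHTS)
```

In practice the weights would be fit on labeled query-document pairs. Hidden variables such as topics or the top N retrieved documents would add further features.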

  8. Goal: Give users the info they are seeking in context
     • Is XIR different from IR?
     • Translingual search => improved monolingual retrieval?
     • Monolingual vs multilingual users
     • How are XIR and MT related?
     • How can we scale up?
     • Create training sets to foster probabilistic modeling research for IR (100k queries)
     • Modeling multilingual web: content and link structure
     • Dialog interaction
     • It’s about modeling!
