Cross-Lingual IR. Salim Roukos, IBM T. J. Watson Research Center, 9/11/02.
Assumptions for 2010 (Asilomar Report)
• 1 TB memory, 1000 TB disk, 1B users
• 1T devices => 1B servers
• Self-managing, very secure, and very reliable
• Auto-x: install, heal, adaptive, auto-tuning wizard
• Information discovery: metadata for describing schema, cast operations
• Federation across 1k, 1m databases
• "Find the average enterprise-wide employee salary."
• "Are there any really good Italian restaurants within 5 miles of where I live?"
Exploit multilingual information streams
• Xinhua
• SDA
• AFP
• AP
• ...
- Parallel vs comparable documents
- Build translingual search
X-lingual Retrieval
[Diagram: online document collections (xxx, English, French, Chinese); MT components E => X, E => C, and X => E; an English query; IR scoring; ranked docs; Chinese; "English" output for gisting.]
Caveat: Machine translation isn't perfect, and queries tend to be short.
From information need to query
• Who has the largest market share for notebooks: IBM or Dell?
• Q1: notebook market share
• Q2: laptop market share IBM Dell
• Q3: ThinkPad IBM Dell ?
[Diagram: information need I, query q, documents D]
P(q | I) = p(q | D is R, C)
P(v in q | w in D) = [right-hand side missing in the transcript; labeled: stemming, synonyms, translation]
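The slide does not give the right-hand side of P(v in q | w in D), so the following is only a sketch of one common way such a term-mapping probability can be used, not the author's model. It assumes a hypothetical table TERM_MAP holding P(v | w) (which stemming, synonym lists, or a translation lexicon could supply) and scores a query term v against a document by summing over the document's words: p(v | D) = sum over w of P(v | w) p(w | D). The table values and the function name are made up for illustration.

```python
from collections import Counter

# Hypothetical mapping table: TERM_MAP[w][v] = P(query term v | document word w).
# The numbers are illustrative only; real values would have to be estimated.
TERM_MAP = {
    "notebook": {"notebook": 0.7, "laptop": 0.3},
    "thinkpad": {"thinkpad": 0.6, "notebook": 0.2, "laptop": 0.2},
}


def p_term_given_doc(v, doc_tokens, table):
    """Return sum_w P(v | w) * p(w | D), with p(w | D) the relative frequency of w in D."""
    total = len(doc_tokens)
    if total == 0:
        return 0.0
    counts = Counter(doc_tokens)
    prob = 0.0
    for w, c in counts.items():
        # Assumption: words missing from the table map only to themselves.
        mapping = table.get(w, {w: 1.0})
        prob += (c / total) * mapping.get(v, 0.0)
    return prob


doc = "ibm notebook sales".split()
p_laptop = p_term_given_doc("laptop", doc, TERM_MAP)      # matched via "notebook"
p_ibm = p_term_given_doc("ibm", doc, TERM_MAP)            # exact match
p_dell = p_term_given_doc("dell", doc, TERM_MAP)          # no match at all
```

With this kind of mapping, the Q2 term "laptop" can still receive nonzero probability from a document that only says "notebook", which is the mismatch the Q1/Q2/Q3 examples point at.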
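Probabilistic Models of IR
D = document, C = doc collection, q = query
P(D is R | q, C) ∝ P(q | D is R, C) P(D is R | C)
(by Bayes' rule; the normalizer P(q | C) is the same for every document, so it does not affect the ranking)
• Prior P(D is R | C): link analysis, other?
• LM P(q | D is R, C): beyond 1g (unigram)?
• Currently: P(q | D is R) = k p(q | D) + (1 - k) p(q)
• Need training data to estimate model
• Order 100k queries (not 1k)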
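A minimal sketch of how the ranking rule above might be computed, under stated assumptions: the slide writes the interpolation at the query level, while this sketch applies it per query word with unigram models (the usual linear-interpolation smoothing), takes p(q) to be a background model estimated from the whole collection, and uses a uniform document prior by default. The function names, the toy documents, and k = 0.5 are all illustrative.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram model: word -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def log_score(query, doc, background, k=0.5, log_prior=0.0):
    """log P(D is R | C) + sum over query words of log[k p(w|D) + (1-k) p(w)]."""
    p_doc = unigram(doc)
    score = log_prior  # a uniform prior contributes the same constant to every doc
    for w in query:
        p = k * p_doc.get(w, 0.0) + (1 - k) * background.get(w, 0.0)
        if p == 0.0:
            # Word unseen in both the document and the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (illustrative data only).
docs = {
    "d1": "ibm thinkpad notebook market share grows".split(),
    "d2": "dell laptop sales fall in europe".split(),
}
background = unigram([w for d in docs.values() for w in d])

query = "notebook market share".split()
ranked = sorted(docs, key=lambda d: log_score(query, docs[d], background), reverse=True)
```

A non-uniform prior (for example one derived from link analysis) would enter through log_prior; estimating k and the prior is where the training queries mentioned on the slide would be needed.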
Probabilistic Model of What?
P(R | a, D, q, C)
Many features in ME/MIX models:
• word n-grams
• synonyms
• WordNet ontologies
• hidden: topics, top N docs, ...
Goal -- Give users info they are seeking in context
• Is XIR different from IR?
• Translingual search improved monolingual retrieval?
• Monolingual vs multilingual users
• How are XIR and MT related?
• How can we scale up?
• Create training sets to foster probabilistic modeling research for IR (100k queries)
• Modeling multilingual web: content and link structure
• Dialog interaction
• It's about modeling!