Distributed Information Retrieval Jamie Callan Carnegie Mellon University callan@cs.cmu.edu
Multi-Database Solutions: Distributed Information Retrieval
[Figure: an information need ("?") and a set of search engines, Engine 1 through Engine n]
• Common scenarios:
  • Multiple partitions, single service
  • Independent engines, single organization
  • Independent engines, affiliated organizations
  • Independent engines, unaffiliated organizations
• Defining dimensions:
  • Cooperative vs. uncooperative engines
  • Centralized vs. decentralized solutions
© 2002 Jamie Callan
Multi-Database Solutions
• Browsing model
  • Manual selection, no support for results merging, etc.
• Web-search (single database) model
• Distributed information retrieval
  • Automatic or interactive DB selection
  • Support for results merging
• Peer-to-peer systems
  • DB self-selection, mostly based on filename matching
  • No support for results merging
Distributed IR: The Issues Usually Addressed
• Site description: contents, search engine, services, etc.
• Resource ranking: ranking resources by how likely they are to contain the desired content
• Resource selection: selecting the best subset from a ranked list
• Searching: interoperability
• Result merging: merging a set of document rankings that come from
  • different underlying corpus statistics
  • different search engines
Resource Selection
• Resource Descriptions
  • Characterization of a given database
  • Typical solution: word histograms
  • Example: query-based sampling to learn a unigram language model
• Resource Ranking and Selection
  • Based on comparing the resource descriptions on a per-query basis
  • Current techniques are ad hoc
    • E.g., treat collections like big documents
• Language models are one way of describing and selecting resources
  • By comparing query language models one might be able to produce good resource descriptions
  • Has been done (Si et al.) in the case where the search engine is the same across databases; performance is better than CORI
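A minimal Python sketch of the two steps named on this slide: query-based sampling to learn a unigram description of each database, then ranking the databases per query by treating each description as one big document. The function names, the `search(term, k)` callable standing in for a remote engine, the Jelinek-Mercer smoothing, and the weight `lam` are all assumptions made for illustration; the slide does not specify them, and this is not CORI or the method of Si et al.

```python
import math
import random
from collections import Counter


def query_based_sample(search, seed_terms, num_queries=20, docs_per_query=4, rng=None):
    """Learn unigram term counts for one database by probing it with queries.

    `search(term, k)` is a hypothetical callable that returns up to k
    documents (plain strings) matching a one-word query.
    """
    rng = rng or random.Random(0)
    counts = Counter()
    vocabulary = list(seed_terms)
    seen = set()
    for _ in range(num_queries):
        if not vocabulary:
            break
        term = rng.choice(vocabulary)
        for doc in search(term, docs_per_query):
            if doc in seen:
                continue
            seen.add(doc)
            tokens = doc.lower().split()
            counts.update(tokens)
            # Later probe terms are drawn from the documents seen so far.
            vocabulary.extend(tokens)
    return counts


def rank_resources(query, descriptions, lam=0.5):
    """Rank databases by smoothed log-likelihood of the query.

    `descriptions` maps a database name to its sampled term counts.
    Each description is scored as if it were a single large document.
    """
    background = Counter()
    for counts in descriptions.values():
        background.update(counts)
    bg_total = sum(background.values())
    scores = {}
    for name, counts in descriptions.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p_db = counts[term] / total if total else 0.0
            # Add-one background estimate keeps the log argument positive.
            p_bg = (background[term] + 1) / (bg_total + len(background) + 1)
            score += math.log(lam * p_db + (1 - lam) * p_bg)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Resource selection in the sense of the earlier slide would then take a prefix of the returned list, for example the top few databases.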
Merging Results
• General problem: multiple ranked lists of documents
  • Meta-search: single DB or several DBs with overlapping content
  • Distributed IR: multiple DBs with (more or less) disjoint contents
• Solutions:
  • Rerank at client
  • Ad hoc
  • Semi-supervised learning of normalizing functions
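One way to picture a learned normalizing function is sketched below in Python. It assumes that a few returned documents also have a score from a single reference scorer (for instance, documents that happen to be in a centrally indexed sample), and it fits a per-database linear map from database-specific scores to reference scores. The linear form, the fallback to raw scores, and all names are assumptions for illustration; the slide does not say how the normalizing functions are learned.

```python
def fit_linear(pairs):
    """Least-squares fit of y = a * x + b to a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var_x
    return a, mean_y - a * mean_x


def merge_results(result_lists, reference_scores):
    """Merge per-database rankings into one list on a common score scale.

    result_lists: {db_name: [(doc_id, db_score), ...]}
    reference_scores: {doc_id: score} for the documents that also have a
        score from the shared reference scorer (hypothetical input).
    Returns [(doc_id, db_name, normalized_score), ...], best first.
    """
    merged = []
    for db, results in result_lists.items():
        pairs = [(s, reference_scores[d]) for d, s in results if d in reference_scores]
        if len(pairs) >= 2:
            a, b = fit_linear(pairs)
        else:
            a, b = 1.0, 0.0  # assumption: fall back to raw scores
        merged.extend((a * s + b, d, db) for d, s in results)
    merged.sort(key=lambda item: item[0], reverse=True)
    return [(d, db, score) for score, d, db in merged]
```

With two engines that score on different scales (say 0 to 10 and 0 to 1), sorting on raw scores would put every document from the first engine ahead of the second; mapping both onto the reference scale lets the lists interleave.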
Distributed IR: State of the Art
• Acquiring database descriptions
  • Good techniques for cooperative and uncooperative environments
• Automatic resource selection
  • Good techniques for large numbers of databases
  • Current techniques are very close to “single database” accuracy ... in research environments
  • Theory says they can be better than “single DB” accuracy ... but we don’t know how to do it yet
• Merging results from multiple databases
  • Good techniques for cooperative and uncooperative environments
Distributed IR: What Lies Ahead
• Language modeling
  • So far it’s as good as (but not better than) ad hoc techniques
• Query expansion
  • So far it doesn’t help for resource selection (!)
• Multilingual / cross-lingual environments
• Database summarization
  • “ACLU Search” is a mystery if you don’t know about the ACLU
• Automatic categorization of databases into classification hierarchies
  • An automatic “Invisible Web” site
• Decentralization
  • Most of the current solutions are centralized
  • Peer-to-peer is a possible solution