Distributed Information Retrieval

Distributed Information Retrieval
Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu

Multi-Database Solutions: Distributed Information Retrieval
[Figure: an information need routed, via "?", to Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n]
• Common scenarios:
  • Multiple partitions, single service
  • Independent engines, single organization
  • Independent engines, affiliated organizations
  • Independent engines, unaffiliated organizations
• Defining dimensions:
  • Cooperative vs. uncooperative engines
  • Centralized vs. decentralized solutions
© 2002 Jamie Callan

Multi-Database Solutions
• Browsing model
  • Manual selection, no support for results merging, etc.
• Web-search (single database) model
• Distributed information retrieval
  • Automatic or interactive DB selection
  • Support for results merging
• Peer-to-peer systems
  • DB self-selection, mostly based on filename matching
  • No support for results merging

Distributed IR: The Issues Usually Addressed
• Site description: contents, search engine, services, etc.
• Resource ranking: ranking resources by how likely they are to contain the desired content
• Resource selection: selecting the best subset from a ranked list
• Searching: interoperability
• Result merging: merging a set of document rankings
  • Different underlying corpus statistics
  • Different search engines

Resource Selection
• Resource descriptions
  • Characterization of a given database
  • Typical solution: word histograms
  • Example: query-based sampling to learn a unigram language model
• Resource ranking and selection
  • Based on comparing the resource descriptions on a per-query basis
  • Current techniques are ad hoc
    • E.g., treat collections like big documents
  • Language models are one way of describing and selecting resources
    • By comparing query language models one might be able to produce good resource descriptions
    • Has been done (Si et al.) for the case where the search engine is the same across databases; performance was better than CORI
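To make the two steps on this slide concrete, here is a minimal Python sketch under stated assumptions. It builds a word histogram for each database by query-based sampling, then ranks databases by a smoothed unigram query likelihood. It is not CORI and not the method of Si et al.; the function names, the `search` callback signature, the toy in-memory search helper, and the smoothing weight `lam` are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def query_based_sample(search, seed_terms, num_queries=20, docs_per_query=4, rng=None):
    """Probe a database with one-term queries and return word counts.

    `search(term)` is an assumed callback returning a list of
    (doc_id, text) pairs; `seed_terms` must be non-empty.
    """
    rng = rng or random.Random(0)
    counts = Counter()
    seen = set()
    vocab = list(seed_terms)
    for _ in range(num_queries):
        term = rng.choice(vocab)
        for doc_id, text in search(term)[:docs_per_query]:
            if doc_id not in seen:
                seen.add(doc_id)
                counts.update(text.lower().split())
        if counts:
            # Later probe terms are drawn from the documents sampled so far.
            vocab = sorted(counts)
    return counts


def rank_resources(query, descriptions, lam=0.5):
    """Rank databases by log-likelihood of the query under each
    database's unigram model, interpolated with a pooled background model."""
    background = Counter()
    for counts in descriptions.values():
        background.update(counts)
    bg_total = sum(background.values())
    scores = {}
    for name, counts in descriptions.items():
        total = sum(counts.values())
        score = 0.0
        for term in query.lower().split():
            p_db = counts[term] / total if total else 0.0
            p_bg = background[term] / bg_total if bg_total else 0.0
            p = lam * p_db + (1 - lam) * p_bg
            # Floor the probability so unseen terms do not produce log(0).
            score += math.log(max(p, 1e-12))
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def make_toy_search(docs):
    """Hypothetical stand-in for a remote search engine: exact term match."""
    return lambda term: [(i, t) for i, t in docs.items() if term in t.split()]
```

A caller would sample each database once, keep the resulting counts as its resource description, and call `rank_resources` per query. Treating each description as one large bag of words mirrors the "collections like big documents" idea on the slide.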

Merging Results
• General problem: multiple ranked lists of documents
  • Meta-search: single DB or several DBs with overlapping content
  • Distributed IR: multiple DBs with (more or less) disjoint contents
• Solutions:
  • Rerank at client
  • Ad hoc
  • Semi-supervised learning of normalizing functions
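One plausible reading of "learning of normalizing functions" is sketched below in Python: for each database, fit a linear map from its own scores onto a common reference scale, then merge all results by the mapped score. The slide does not say how the functions are learned, so the least-squares fit, the source of the (database score, reference score) training pairs, and every name here are assumptions made for illustration.

```python
def fit_linear(pairs):
    """Least-squares fit of reference = a * score + b.

    `pairs` is a list of (database_score, reference_score) tuples.
    Falls back to the identity map when a fit is not possible.
    """
    n = len(pairs)
    if n < 2:
        return 1.0, 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        return 1.0, 0.0
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov / var_x
    return a, mean_y - a * mean_x


def merge_results(result_lists, training_pairs):
    """Merge per-database rankings into one list on a common scale.

    `result_lists` maps a database name to [(doc_id, score), ...];
    `training_pairs` maps a database name to the pairs used by fit_linear.
    Returns [(database, doc_id, normalized_score), ...], best first.
    """
    merged = []
    for db, results in result_lists.items():
        a, b = fit_linear(training_pairs.get(db, []))
        for doc_id, score in results:
            merged.append((a * score + b, db, doc_id))
    merged.sort(key=lambda item: item[0], reverse=True)
    return [(db, doc_id, score) for score, db, doc_id in merged]
```

The point of the mapping is that raw scores from different engines, or from the same engine over different corpus statistics, are not directly comparable; a per-database normalizing function puts them on one scale before sorting.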

Distributed IR: State of the Art
• Acquiring database descriptions
  • Good techniques for cooperative and uncooperative environments
• Automatic resource selection
  • Good techniques for large numbers of databases
  • Current techniques are very close to “single database” accuracy ... in research environments
  • Theory says they can be better than “single DB” accuracy ... but we don’t know how to do it yet
• Merging results from multiple databases
  • Good techniques for cooperative and uncooperative environments

Distributed IR: What Lies Ahead
• Language modeling
  • So far it’s as good as (but not better than) ad hoc techniques
• Query expansion
  • So far it doesn’t help for resource selection (!)
• Multilingual / cross-lingual environments
• Database summarization
  • “ACLU Search” is a mystery if you don’t know about the ACLU
• Automatic categorization of databases into classification hierarchies
  • An automatic “Invisible Web” site
• Decentralization
  • Most of the current solutions are centralized
  • Peer-to-peer is a possible solution
