Distributed Information Retrieval


Presentation Transcript


  1. Distributed Information Retrieval Jamie Callan Carnegie Mellon University callan@cs.cmu.edu

  2. Multi-Database Solutions: Distributed Information Retrieval
     [Figure: an information need directed at Engine 1, Engine 2, Engine 3, Engine 4, ..., Engine n]
     • Common scenarios:
       • Multiple partitions, single service
       • Independent engines, single organization
       • Independent engines, affiliated organizations
       • Independent engines, unaffiliated organizations
     • Defining dimensions:
       • Cooperative vs. uncooperative engines
       • Centralized vs. decentralized solutions
     © 2002 Jamie Callan

  3. Multi-Database Solutions
     • Browsing model
       • Manual selection, no support for results merging, etc.
     • Web-search (single database) model
     • Distributed information retrieval
       • Automatic or interactive DB selection
       • Support for results merging
     • Peer-to-peer systems
       • DB self-selection, mostly based on filename matching
       • No support for results merging

  4. Distributed IR: The Issues Usually Addressed
     • Site description: contents, search engine, services, etc.
     • Resource ranking: ranking resources by how likely they are to contain the desired content
     • Resource selection: selecting the best subset from a ranked list
     • Searching: interoperability
     • Result merging: merging a set of document rankings that come from
       • different underlying corpus statistics
       • different search engines

  5. Resource Selection
     • Resource descriptions
       • Characterization of a given database
       • Typical solution: word histograms
       • Example: query-based sampling to learn a unigram language model
     • Resource ranking and selection
       • Based on comparing the resource descriptions on a per-query basis
       • Current techniques are ad hoc
         • E.g., treat collections like big documents
     • Language models are one way of describing and selecting resources
       • By comparing query language models one might be able to produce good resource descriptions
       • Has been done (Si et al.) for the case where the search engine is the same across databases; performance was better than CORI
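
As a rough illustration of the resource ranking and selection step on this slide, the sketch below builds a unigram language model from documents sampled from each database and ranks the databases by smoothed query likelihood. It is one simple language-modeling approach, not CORI and not the method of Si et al. The function names, the smoothing weight `lam`, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def unigram_model(sampled_docs):
    """Count terms in the documents sampled from one database."""
    counts = Counter()
    for doc in sampled_docs:
        counts.update(doc.lower().split())
    return counts


def rank_resources(query, samples, lam=0.5):
    """Rank databases by query log-likelihood with Jelinek-Mercer smoothing.

    samples maps a database name to a list of sampled document strings.
    lam is the (assumed) weight given to the database's own model.
    """
    models = {name: unigram_model(docs) for name, docs in samples.items()}

    # Background model: all sampled documents pooled together.
    background = Counter()
    for counts in models.values():
        background.update(counts)
    bg_total = sum(background.values())
    bg_vocab = len(background)

    terms = query.lower().split()
    scores = {}
    for name, counts in models.items():
        total = sum(counts.values())
        score = 0.0
        for term in terms:
            p_db = counts[term] / total if total else 0.0
            # Add-one smoothing keeps the background probability above zero.
            p_bg = (background[term] + 1) / (bg_total + bg_vocab + 1)
            score += math.log(lam * p_db + (1 - lam) * p_bg)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def select_resources(ranking, k):
    """Resource selection: keep the k best-ranked databases."""
    return ranking[:k]


if __name__ == "__main__":
    toy_samples = {
        "medical": ["heart disease treatment", "clinical trial of heart drug"],
        "sports": ["football match result", "tennis match final"],
        "law": ["court ruling on appeal", "civil rights court case"],
    }
    ranking = rank_resources("heart disease", toy_samples)
    print(select_resources(ranking, 2))
```

In an uncooperative setting the `samples` dictionary would be filled by query-based sampling, that is, by sending probe queries to each engine and keeping the returned documents; that step is omitted here.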

  6. Merging Results
     • General problem: multiple ranked lists of documents
       • Meta-search: single DB or several DBs with overlapping content
       • Distributed IR: multiple DBs with (more or less) disjoint contents
     • Solutions:
       • Rerank at client
       • Ad hoc
       • Semi-supervised learning of normalizing functions
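
The last solution on this slide, learning normalizing functions, might be sketched as follows. The sketch assumes that for a few documents from each database both the database-specific score and a score on a common reference scale are known (for example, from a centralized index of sampled documents). A per-database linear map is fitted by least squares and applied to every returned score before merging. The linear form, the function names, and the data layout are assumptions for illustration, not a description of a specific published system.

```python
def fit_linear(pairs):
    """Least-squares fit of reference = a * local + b.

    pairs is a list of (local_score, reference_score) tuples for documents
    whose score is known on both scales.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        # All local scores identical: fall back to a constant mapping.
        return 0.0, mean_y
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def merge_results(results, training_pairs):
    """Map each database's scores to the reference scale, then merge.

    results maps a database name to a list of (doc_id, local_score).
    training_pairs maps a database name to its (local, reference) pairs.
    """
    merged = []
    for db, ranked in results.items():
        a, b = fit_linear(training_pairs[db])
        merged.extend((doc, a * score + b) for doc, score in ranked)
    return sorted(merged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    results = {
        "A": [("a1", 0.9), ("a2", 0.5)],    # engine A scores in [0, 1]
        "B": [("b1", 80.0), ("b2", 20.0)],  # engine B scores in [0, 100]
    }
    training_pairs = {
        "A": [(0.9, 0.9), (0.5, 0.5)],
        "B": [(100.0, 1.0), (0.0, 0.0)],
    }
    for doc, score in merge_results(results, training_pairs):
        print(doc, round(score, 3))
```

Without the normalizing step, raw scores from engine B would dominate the merged list simply because that engine uses a larger score range.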

  7. Distributed IR: State of the Art
     • Acquiring database descriptions
       • Good techniques for cooperative and uncooperative environments
     • Automatic resource selection
       • Good techniques for large numbers of databases
       • Current techniques are very close to “single database” accuracy ... in research environments
       • Theory says they can be better than “single DB” accuracy ... but we don’t know how to do it yet
     • Merging results from multiple databases
       • Good techniques for cooperative and uncooperative environments

  8. Distributed IR: What Lies Ahead
     • Language modeling
       • So far it’s as good as (but not better than) ad hoc techniques
     • Query expansion
       • So far it doesn’t help for resource selection (!)
     • Multilingual / cross-lingual environments
     • Database summarization
       • “ACLU Search” is a mystery if you don’t know about the ACLU
     • Automatic categorization of databases into classification hierarchies
       • An automatic “Invisible Web” site
     • Decentralization
       • Most of the current solutions are centralized
       • Peer-to-peer is a possible solution
