
DISTRIBUTED INFORMATION RETRIEVAL




  1. DISTRIBUTED INFORMATION RETRIEVAL 2003. 07. 23 Lee Won Hee

  2. Abstract • A multi-database model of distributed information retrieval • In this model, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging the results returned by the different databases • This paper presents algorithms for each task

  3. Introduction • Multi-database model of distributed information retrieval • Reflects the distributed location and control of information in a wide-area computer network • Resource description • The contents of each text database must be described • Resource selection • Given an information need and a set of resource descriptions, a decision must be made about which database(s) to search • Results merging • Integrating the ranked lists returned by each database into a single, coherent ranked list

  4. Multi-database Testbeds • Marcus, 1983 • Addressed resource description and selection in the EXPERT CONIT system • The creation of the TREC corpora • The text collections created by the U.S. National Institute of Standards and Technology (NIST) for its TREC conferences • Sufficiently large and varied that they can be divided into smaller databases • Summary statistics for three distributed IR testbeds

  5. Resource Description • Unigram language model (Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al., 1995) • Represents each database by a description consisting of the words that occur in the database and their frequencies of occurrence • Compact, and can be obtained automatically by examining the documents in a database or the document indexes • Can be extended easily to include phrases, proper names, and other text features • A resource description based on terms and frequencies is a small fraction of the size of the original text database • Such descriptions can also be acquired without cooperation via a technique called query-based sampling (slides 12-13); a sketch of building one follows
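
A minimal Python sketch of building such a unigram resource description (illustrative only; the naive tokenizer and the tiny document set are assumptions, not from the slides):

```python
from collections import Counter

def build_resource_description(documents):
    """Unigram resource description: every word that occurs in the
    database, with its frequency of occurrence."""
    description = Counter()
    for doc in documents:
        description.update(doc.lower().split())  # naive whitespace tokenizer
    return description

# Hypothetical three-document database.
docs = ["distributed information retrieval",
        "ranking databases by expected ability",
        "merging ranked results from many databases"]
print(build_resource_description(docs).most_common(3))
```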

  6. Resource Selection(1/4) • Distributed Information Retrieval System • Resource selection • The process of selecting databases relevant to the query • Collections are treated analogously to documents in a database • The CORI database selection algorithm is used

  7. Resource Selection(2/4) • The CORI Algorithm (Callan et al., 1995) • The belief p(rk|Ri) that database Ri can satisfy a query term rk: T = df / (df + 50 + 150 · cw / avg_cw), I = log((C + 0.5) / cf) / log(C + 1.0), p(rk|Ri) = B + (1 − B) · T · I - df : the number of documents in Ri containing rk - cw : the number of indexing terms in resource Ri - avg_cw : the average number of indexing terms in each resource - C : the number of resources - cf : the number of resources containing term rk - B : the minimum belief component (usually 0.4)
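
The belief formula above translates directly into code. A short sketch (the constants 50 and 150 follow the published formula; the example values are invented):

```python
import math

def cori_belief(df, cw, avg_cw, C, cf, B=0.4):
    """p(rk|Ri): belief that resource Ri can satisfy query term rk."""
    T = df / (df + 50 + 150 * cw / avg_cw)            # term-frequency component
    I = math.log((C + 0.5) / cf) / math.log(C + 1.0)  # inverse collection frequency
    return B + (1 - B) * T * I

# Hypothetical term: occurs in 120 documents of a 50,000-term database,
# and appears in 4 of the 10 available resources.
print(cori_belief(df=120, cw=50_000, avg_cw=40_000, C=10, cf=4))
```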

  8. Resource Selection(3/4) • INQUERY query operators (Turtle, 1990; Turtle and Croft, 1991) • Can be used for ranking databases as well as documents • Combine the per-term beliefs, e.g. #sum = (p1 + … + pn) / n, #and = p1 · p2 · … · pn, #or = 1 − (1 − p1)(1 − p2)…(1 − pn) - pj : p(rj|Ri)
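
A sketch of how such operators combine per-term beliefs into a single database score (the definitions shown are the commonly cited INQUERY combination rules, stated here as an assumption rather than quoted from the slides):

```python
def op_sum(beliefs):
    """#sum: the mean of the term beliefs."""
    return sum(beliefs) / len(beliefs)

def op_and(beliefs):
    """#and: the product of the term beliefs."""
    result = 1.0
    for p in beliefs:
        result *= p
    return result

def op_or(beliefs):
    """#or: complement of the product of the complements."""
    result = 1.0
    for p in beliefs:
        result *= 1.0 - p
    return 1.0 - result

beliefs = [0.45, 0.52, 0.40]  # hypothetical p(rj|Ri) values for a 3-term query
print(op_sum(beliefs), op_and(beliefs), op_or(beliefs))
```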

  9. Resource Selection(4/4) • Effectiveness of a resource ranking algorithm • Compares a given database ranking at rank n to a desired database ranking at rank n: R(n) = Σi=1..n rgi / Σi=1..n rdi - rgi : number of relevant documents in the i-th ranked database under the given ranking - rdi : number of relevant documents in the i-th ranked database under a desired ranking in which databases are ordered by the number of relevant documents they contain
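
Computing R(n) is a two-line exercise; the counts below are invented for illustration:

```python
def recall_at_n(rg, rd, n):
    """R(n): relevant documents reachable in the top n databases of the
    given ranking, relative to the desired (optimal) ranking."""
    return sum(rg[:n]) / sum(rd[:n])

given   = [30, 5, 12, 0, 8]   # rgi: relevant docs per database, given ranking
desired = [30, 12, 8, 5, 0]   # rdi: same counts, sorted into the desired order
print(recall_at_n(given, desired, n=3))  # 47 / 50 = 0.94
```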

  10. Merging Document Rankings(1/2) • After a set of databases is searched • The ranked results from each database must be merged into a single ranking • Difficult when the individual databases are not cooperative • Each database's scores are based on different corpus statistics, representations, and/or retrieval algorithms • Result-merging techniques • Cooperative approaches • Use of global idf or the same ranking algorithm • Recomputing document scores at the search client • Non-cooperative approach • Estimate normalized document scores as a combination of the score of the database and the score of the document

  11. Merging Document Rankings(2/2) • Estimated normalized document score: D'' = D · (1 + N · (Ri − avg_R) / avg_R) - N : number of resources searched - D'' : the product of the unnormalized document score D and a weight derived from the database score - Ri : the score of database Ri - avg_R : the average database score
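
Under that reading of the slide's variables, the merge step is a one-line weighting of each returned document's raw score (a sketch under the stated assumption; all scores are invented):

```python
def merged_score(D, R_i, avg_R, N):
    """D'': raw document score D scaled by how far its database's score
    R_i sits above or below the average database score avg_R."""
    weight = 1 + N * (R_i - avg_R) / avg_R
    return D * weight

# A document scoring 0.45 from a database whose score 0.52 exceeds the
# average 0.48 of the N = 3 searched databases gets boosted.
print(merged_score(D=0.45, R_i=0.52, avg_R=0.48, N=3))  # 0.5625
```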

  12. Acquiring Resource Descriptions(1/2) • Query-based sampling (Callan et al., 1999; Callan & Connell, 2001) • Does not require the cooperation of the databases • Queries the database with single-term queries and samples the returned documents • The initial query term is selected from a large dictionary of terms • Subsequent query terms are drawn from the documents sampled from the database

  13. Acquiring Resource Descriptions(2/2) • Query-based sampling algorithm (a sketch of the loop follows below) • 1. Select an initial query term • 2. Run a one-term query on the database • 3. Retrieve the top N documents returned by the database • 4. Update the resource description based on the characteristics of the retrieved documents • Extract words & frequencies from the top N documents • Add the words and their frequencies to the learned resource description • 5. If a stopping criterion has not yet been reached • Select a new query term • Go to Step 2
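
A compact sketch of this loop (the `search(term, n)` interface, the dictionary argument, and the vocabulary-size stopping criterion are assumptions; real implementations vary):

```python
import random
from collections import Counter

def query_based_sampling(search, dictionary, n_docs=4, max_vocab=500):
    """Learn a resource description by sampling an uncooperative database.
    `search(term, n)` is assumed to return the top n documents (strings)
    for a one-term query; nothing else about the database is known."""
    description = Counter()
    query = random.choice(dictionary)                # Step 1: initial term
    while len(description) < max_vocab:              # Step 5: stopping criterion
        for doc in search(query, n_docs):            # Steps 2-3: query, top N docs
            description.update(doc.lower().split())  # Step 4: words & frequencies
        # The next query term is drawn from the sampled documents themselves.
        query = random.choice(list(description) or dictionary)
    return description
```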

  14. Accuracy of Unigram Language Models(1/3) • Test corpora for query-based sampling experiments • Ctf ratio • Measures how well the learned vocabulary matches the actual vocabulary: ctf ratio = Σi∈V' ctfi / Σi∈V ctfi - V' : the learned vocabulary - V : the actual vocabulary - ctfi : the number of times term i occurs in the database
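
A direct implementation of the ctf ratio (the toy frequency table is invented):

```python
def ctf_ratio(learned_vocab, actual_ctf):
    """Fraction of all term occurrences in the database covered by V'."""
    covered = sum(ctf for term, ctf in actual_ctf.items() if term in learned_vocab)
    return covered / sum(actual_ctf.values())

actual  = {"retrieval": 120, "database": 80, "merge": 10}  # ctfi over V
learned = {"retrieval", "database"}                        # V'
print(ctf_ratio(learned, actual))  # 200 / 210 ≈ 0.95
```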

  15. Accuracy of Unigram Language Models(2/3) • Spearman rank correlation coefficient • Measures how well the learned term frequencies indicate the frequency of each term in the database • The rank correlation coefficient is • 1 : the two orderings are identical • 0 : they are uncorrelated • -1 : they are in reverse order • Tie-corrected form: ρ = [1 − (6 / (n³ − n)) · (Σ di² + Σ(fk³ − fk)/12 + Σ(gm³ − gm)/12)] / [√(1 − Σ(fk³ − fk)/(n³ − n)) · √(1 − Σ(gm³ − gm)/(n³ − n))] - di : the rank difference of common term i - n : the number of terms - fk : the number of ties in the kth group of ties in the learned resource description - gm : the number of ties in the mth group of ties in the actual resource description
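
Rather than evaluating the closed-form expression, a sketch can compute the same tie-aware coefficient as the Pearson correlation of tie-averaged ranks, which is mathematically equivalent (the sample frequencies are invented):

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Learned vs. actual frequencies of five hypothetical terms (note the ties).
print(spearman([10, 8, 8, 2, 1], [9, 9, 7, 3, 1]))
```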

  16. Accuracy of Unigram Language Models(3/3) • Experiment

  17. Accuracy of Resource Rankings • Experiment

  18. Accuracy of Document Rankings • Experiment

  19. Summary and Conclusions • Techniques for acquiring descriptions of resources controlled by uncooperative parties • Using resource descriptions to rank text databases by their likelihood of satisfying a query • Merging the document rankings returned by different text databases • The major remaining weakness • The algorithm for merging document rankings produced by different databases • Computational cost of parsing and reranking the documents • Many traditional IR tools, such as relevance feedback, have yet to be applied to multi-database environments
