DISTRIBUTED INFORMATION RETRIEVAL 2003. 07. 23 Lee Won Hee
Abstract • A multi-database model of distributed information retrieval • Full-text information retrieval in this model consists of: • Discovering database contents • Ranking databases by their expected ability to satisfy the query • Searching a small number of databases • Merging the results returned by different databases • This paper presents algorithms for each of these tasks
Introduction • Multi-database model of distributed information retrieval • Reflects the distributed location and control of information in a wide-area computer network • Resource description • The contents of each text database must be described • Resource selection • Given an information need and a set of resource descriptions, a decision must be made about which database(s) to search • Result merging • Integrating the ranked lists returned by the search of each database into a single, coherent ranked list
Multi-database Testbeds • Marcus (1983) • Addressed resource description and selection in the EXPERT CONIT system • The creation of the TREC corpora • The text collections created by the U.S. National Institute of Standards and Technology (NIST) for its TREC conferences • Sufficiently large and varied that they could be divided into smaller databases • Summary statistics for the three distributed IR testbeds
Resource Description • Unigram language model • Gravano et al., 1994; Gravano and Garcia-Molina, 1995; Callan et al., 1995 • Represent each database by a description consisting of the words that occur in the database and their frequencies of occurrence • Compact, and can be obtained automatically by examining the documents in a database or the document indexes • Can be extended easily to include phrases, proper names, and other text features • A resource description based on terms and frequencies is a small fraction of the size of the original text database • Such resource descriptions can also be acquired without cooperation, using a technique called query-based sampling (described later)
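As an illustration only (the helper below is hypothetical, not code from the paper), a unigram resource description can be built by counting word occurrences across a database's documents:

```python
from collections import Counter

def build_resource_description(documents):
    """Hypothetical sketch: a unigram resource description is the set of
    words occurring in a database together with their frequencies."""
    description = Counter()
    for doc in documents:
        description.update(doc.lower().split())  # word -> frequency of occurrence
    return description

# Example: a tiny "database" of two documents and its learned description
db1 = ["distributed information retrieval", "information retrieval systems"]
print(build_resource_description(db1))
# Counter({'information': 2, 'retrieval': 2, 'distributed': 1, 'systems': 1})
```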
Resource Selection (1/4) • Distributed information retrieval system • Resource selection • The process of selecting databases with respect to the query • Collections are treated analogously to documents in a database • The CORI database selection algorithm is used
Resource Selection (2/4) • The CORI algorithm (Callan et al., 1995) - df : the number of documents in R_i containing term r_k - cw : the number of indexing terms in resource R_i - avg_cw : the average number of indexing terms in each resource - C : the number of resources - cf : the number of resources containing term r_k - B : the minimum belief component (usually 0.4)
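The belief computation itself appears to have been an image in the original slide. Reconstructed from the variable definitions above, the standard CORI belief formula (Callan et al., 1995), with its usual constants 50 and 150, is:

```latex
T = \frac{df}{df + 50 + 150 \cdot \frac{cw}{avg\_cw}}
\qquad
I = \frac{\log\!\left(\frac{C + 0.5}{cf}\right)}{\log\,(C + 1.0)}
\qquad
p(r_k \mid R_i) = B + (1 - B) \cdot T \cdot I
```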
Resource Selection (3/4) • INQUERY query operators (Turtle, 1990; Turtle and Croft, 1991) • Can be used for ranking both databases and documents - p_j : p(r_j | R_i), the belief in query term r_j given resource R_i
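The operator definitions were likewise shown as an image; as a hedged reconstruction (after Turtle & Croft, 1991), the common INQUERY belief operators combine the term beliefs p_j as:

```latex
bel_{\#sum} = \frac{1}{n}\sum_{j=1}^{n} p_j \qquad
bel_{\#and} = \prod_{j=1}^{n} p_j \qquad
bel_{\#or}  = 1 - \prod_{j=1}^{n} (1 - p_j) \qquad
bel_{\#wsum} = \frac{\sum_{j=1}^{n} w_j\, p_j}{\sum_{j=1}^{n} w_j}
```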
Resource Selection (4/4) • Effectiveness of a resource ranking algorithm • Compares a given database ranking at rank n to a desired database ranking at rank n - rg_i : the number of relevant documents in the i-th ranked database under the given ranking - rd_i : the number of relevant documents in the i-th ranked database under a desired ranking, in which databases are ordered by the number of relevant documents they contain
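The metric itself did not survive conversion; written out from the definitions above, it is the recall-style ratio at rank n used in earlier database-ranking work:

```latex
\mathcal{R}_n = \frac{\sum_{i=1}^{n} rg_i}{\sum_{i=1}^{n} rd_i}
```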
Merging Document Rankings (1/2) • After a set of databases is searched • The ranked results from each database must be merged into a single ranking • Difficult when the individual databases are not cooperative • Scores from each database are based on different corpus statistics, representations, and/or retrieval algorithms • Result merging techniques • Cooperative approaches • Use of global idf or the same ranking algorithm • Recomputing document scores at the search client • Non-cooperative approach • Estimate normalized document scores: a combination of the score of the database and the score of the document
Merging Document Rankings (2/2) • Estimating the normalized document score - N : the number of resources searched - D : the unnormalized document score - D'' : the normalized document score - R_i : the score of database R_i - avg_R : the average database score
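The normalization formula was also lost in conversion. The following is only a hedged reconstruction consistent with the variable definitions above; the weight placed on the database score and the divisor are assumptions and should be checked against Callan (2000):

```latex
D'' = \frac{D \cdot \left(1 + N \cdot \frac{R_i - avg\_R}{avg\_R}\right)}{N + 1}
```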
Acquiring Resource Descriptions (1/2) • Query-based sampling (Callan et al., 1999; Callan & Connell, 2001) • Does not require the cooperation of the databases • The database is queried with single-term queries chosen quasi-randomly • The initial query term is selected from a large dictionary of terms • Subsequent query terms are selected from documents already sampled from the database
Acquiring Resource Descriptions (2/2) • Query-based sampling algorithm • 1. Select an initial query term • 2. Run a one-term query on the database • 3. Retrieve the top N documents returned by the database • 4. Update the resource description based on the characteristics of the retrieved documents • Extract words and frequencies from the top N documents returned by the database • Add the words and their frequencies to the learned resource description • 5. If a stopping criterion has not yet been reached: • Select a new query term • Go to Step 2
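A minimal sketch of this loop, assuming a `search(term, n)` callable that stands in for the remote database's one-term query interface (the function, its parameters, and the sample-size stopping criterion are hypothetical, not taken from the paper):

```python
import random
from collections import Counter

def query_based_sampling(search, seed_terms, docs_per_query=4,
                         max_docs=300, max_queries=500):
    """Sketch of query-based sampling; `search(term, n)` returns the top-n
    documents (as strings) for a one-term query against the database."""
    description = Counter()              # learned unigram resource description
    seen_docs = set()                    # avoid counting a document twice
    term = random.choice(seed_terms)     # Step 1: initial term from a dictionary

    for _ in range(max_queries):
        if len(seen_docs) >= max_docs:   # Step 5: stopping criterion (sample size)
            break
        for doc in search(term, docs_per_query):   # Steps 2-3: query, keep top N docs
            if doc not in seen_docs:
                seen_docs.add(doc)
                description.update(doc.lower().split())  # Step 4: add words & frequencies
        # Step 5 (continued): pick the next query term from documents sampled so far,
        # falling back to the dictionary if nothing has been retrieved yet
        pool = list(description) or seed_terms
        term = random.choice(pool)
    return description
```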
Accuracy of Unigram Language Models (1/3) • Test corpora for the query-based sampling experiments • ctf ratio • Measures how well the learned vocabulary covers the actual vocabulary - V' : the learned vocabulary - V : the actual vocabulary - ctf_i : the number of times term i occurs in the database
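Written out from the definitions above, the ctf ratio is the fraction of term occurrences in the database that are covered by terms in the learned vocabulary:

```latex
\text{ctf ratio} = \frac{\sum_{i \in V'} ctf_i}{\sum_{i \in V} ctf_i}
```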
Accuracy of Unigram Language Models (2/3) • Spearman rank correlation coefficient • Measures how well the learned term frequencies indicate the frequency of each term in the database • The rank correlation coefficient is • 1 : the two orderings are identical • 0 : they are uncorrelated • -1 : they are in reverse order - d_i : the rank difference of common term i - n : the number of terms - f_k : the number of ties in the k-th group of ties in the learned resource description - g_m : the number of ties in the m-th group of ties in the actual resource description
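The coefficient itself was an image in the original slide; a hedged reconstruction of the tie-corrected Spearman coefficient (following the Press et al. formulation, which matches the definitions above) is:

```latex
\rho = \frac{1 - \frac{6}{n^{3}-n}\left(\sum_{i} d_i^{2}
        + \frac{1}{12}\sum_{k}\bigl(f_k^{3}-f_k\bigr)
        + \frac{1}{12}\sum_{m}\bigl(g_m^{3}-g_m\bigr)\right)}
       {\sqrt{1-\frac{\sum_{k}\bigl(f_k^{3}-f_k\bigr)}{n^{3}-n}}\;
        \sqrt{1-\frac{\sum_{m}\bigl(g_m^{3}-g_m\bigr)}{n^{3}-n}}}
```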
Accuracy of Unigram Language Models (3/3) • Experiment
Accuracy of Resource Rankings • Experiment
Accuracy of Document Rankings • Experiment
Summary and Conclusions • Techniques for acquiring descriptions of resources controlled by uncooperative parties • Using resource descriptions to rank text databases by their likelihood of satisfying a query • Merging the document rankings returned by different text databases • The major remaining weakness • The algorithm for merging document rankings produced by different databases • Parsing and re-ranking documents at the search client imposes a computational cost • Many of the traditional IR tools, such as relevance feedback, have yet to be applied to multi-database environments