Federated text retrieval from uncooperative overlapped collections. Milad Shokouhi, RMIT University, Melbourne, Australia. Justin Zobel, RMIT University, Melbourne, Australia. SIGIR 2007 (Collection representation in distributed IR). 2009-03-13
Federated text retrieval from uncooperative overlapped collections • Milad Shokouhi, RMIT University, Melbourne, Australia • Justin Zobel, RMIT University, Melbourne, Australia • SIGIR 2007 (Collection representation in distributed IR) • 2009-03-13 • Presented by JongHeum Yeon, IDS Lab., Seoul National University
Abstract • [Figure: a user, a broker, and three collections] • Federated information retrieval (FIR) • The query is sent to multiple collections • A central broker merges the results and ranks them • Duplicated documents in collections • The final results may contain a high number of duplicates • The authors propose a method for estimating the rate of overlap among collections based on sampling • Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results
Federated Information Retrieval (FIR) • The query is sent simultaneously to several collections • Each collection evaluates the query and returns its results to the broker • Advantages • No need to access the index of the collections • Searches the latest version of the documents without crawling and indexing • The broker selects the collections that are most likely to return relevant documents • Collection selection problem • Collection representation problem • Result merging problem
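The broker flow on this slide can be illustrated with a minimal sketch. It is not the authors' system: the function name `federated_search`, the `(doc_id, score)` result shape, and the toy collections are all hypothetical, and the sketch assumes that scores from different collections are directly comparable and that duplicates share an identical document id (in practice, neither holds automatically, which is what the result merging problem is about).

```python
from typing import Callable, Dict, List, Tuple

# A search result as (document id, score); both the id scheme and the
# score scale are assumptions made for this sketch.
Result = Tuple[str, float]
SearchFn = Callable[[str], List[Result]]


def federated_search(
    query: str, collections: Dict[str, SearchFn], top_k: int = 10
) -> List[Result]:
    """Send the query to every selected collection and merge the answers."""
    best: Dict[str, float] = {}
    for _name, search in collections.items():
        for doc_id, score in search(query):
            # Keep a single copy of a document returned by several collections.
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Rank by descending score; break ties by document id for determinism.
    merged = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return merged[:top_k]


# Two toy collections that overlap on document "d2".
toy_collections: Dict[str, SearchFn] = {
    "news_a": lambda q: [("d1", 0.9), ("d2", 0.7)],
    "news_b": lambda q: [("d2", 0.8), ("d3", 0.4)],
}
merged_results = federated_search("election", toy_collections)
```

Here the duplicate "d2" costs one result slot from each collection but contributes only one document to the merged list, which is the waste that overlap-aware selection tries to avoid.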
Collection Selection Problem • FIR techniques assume that the degree of overlap among collections is either none or negligible • However, many collections have a significant degree of overlap • Bibliographic databases • News resources • Selecting collections that are likely to return the same results introduces duplicate documents into the final results, which • Wastes costly resources • Degrades search effectiveness • Authors propose … • A method that estimates the degree of overlap among collections by sampling from each collection using random queries • Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results
Related Work • Cooperative collection selection techniques • Collections provide the broker with their index statistics and other useful information • CORI, GlOSS, CVV • Uncooperative collection selection techniques • Collections do not provide their index statistics to the broker • The broker samples documents from each collection • ReDDE uses sampled documents for … • Estimates the number of relevant documents in collections • Ranks collections according to the number of highly ranked sampled documents
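The sample-based ranking idea in the last two bullets can be sketched as follows. This is a simplified illustration in the spirit of ReDDE, not the full algorithm: it assumes the broker already has a relevance-ordered list of sampled documents for the query and size estimates for each collection, and every name here (`rank_collections_by_samples`, `ranked_samples`, `top_n`) is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_collections_by_samples(
    ranked_samples: List[Tuple[str, str]],  # (doc_id, collection), best first
    collection_sizes: Dict[str, int],       # estimated number of documents
    sample_sizes: Dict[str, int],           # number of sampled documents
    top_n: int,
) -> List[Tuple[str, float]]:
    """Rank collections by their estimated number of relevant documents."""
    estimated_relevant: Dict[str, float] = defaultdict(float)
    for _doc_id, coll in ranked_samples[:top_n]:
        # Assumption: each highly ranked sampled document stands for
        # (collection size / sample size) documents in its collection.
        estimated_relevant[coll] += collection_sizes[coll] / sample_sizes[coll]
    return sorted(estimated_relevant.items(), key=lambda kv: (-kv[1], kv[0]))


toy_ranking = rank_collections_by_samples(
    ranked_samples=[("a1", "A"), ("b1", "B"), ("a2", "A"), ("c1", "C")],
    collection_sizes={"A": 1000, "B": 5000, "C": 200},
    sample_sizes={"A": 100, "B": 100, "C": 100},
    top_n=3,
)
```

In the toy data, collection B has only one sampled document in the top three, but that document represents 50 documents of a large collection, so B is ranked above A.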
Overlap Estimation • [Figure: collections C1 and C2, samples S1 and S2, and K] • Uses the documents downloaded by query-based sampling to estimate the rate of overlap; requires no additional information • Expected number of documents • Equation annotations: subset of sample documents; size of m; the probability that any given document from m1 is also available in m2
Overlap Estimation (cont’d) • P(i) follows a binomial distribution
Overlap Estimation (cont’d) • Binomial theorem • Expected number of documents in m1 ∩ m2 • The number of overlap documents is independent of the collection size
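The equations on these slides did not survive extraction, so the following is only a sketch of the binomial step under stated assumptions, not the authors' exact formulation. It assumes that K is the number of documents shared by the two collections, that m1 and m2 are the numbers of shared documents that happen to fall into each sample, and that each of the m1 documents appears among the m2 documents independently with probability m2/K (uniform sampling). The function names are hypothetical.

```python
from math import comb


def overlap_pmf(i: int, m1: int, m2: int, k: int) -> float:
    """P(i): probability that exactly i of the m1 documents are also in m2.

    Assumes a Binomial(m1, m2 / k) model, i.e. uniform random sampling.
    """
    p = m2 / k
    return comb(m1, i) * p**i * (1 - p) ** (m1 - i)


def expected_sample_overlap(m1: int, m2: int, k: int) -> float:
    """Expected size of the sample intersection, summed term by term."""
    return sum(i * overlap_pmf(i, m1, m2, k) for i in range(m1 + 1))


m1, m2, k = 20, 30, 200
expected = expected_sample_overlap(m1, m2, k)
closed_form = m1 * m2 / k  # mean of a binomial: trials times probability
```

Under this model the term-by-term sum agrees with the closed form m1 * m2 / K, which is what allows an observed number of duplicates between two samples to be related back to the overlap between the collections.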
The ‘RELAX’ Selection Method • Graph G = {(u, v) | vertices u and v are collections; an edge indicates overlapping documents between its two vertices} • Output: a final merged document list with minimized duplicates
Overlap Filtering for ReDDE • F-ReDDE • The overlaps among collections are estimated as described for the RELAX selection method • Collections are ranked using a resource selection algorithm such as ReDDE • Each collection is compared with the previously selected collections, and is removed from the list if it has a high overlap (greater than γ) with any of them • The authors empirically choose γ = 30% and leave methods for finding the optimum value as future work
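The filtering step described on this slide can be sketched as a single greedy pass over the ranked list. This is an illustrative sketch, not the authors' code: `filter_overlapping` and the toy overlap table are hypothetical, and the sketch assumes that overlap is given as a symmetric fraction between 0 and 1 for each pair of collections.

```python
from typing import Callable, Dict, FrozenSet, List


def filter_overlapping(
    ranked: List[str],
    overlap: Callable[[str, str], float],
    gamma: float = 0.30,
) -> List[str]:
    """Walk the ranked list and drop collections that overlap too much."""
    selected: List[str] = []
    for candidate in ranked:
        # Keep the candidate only if its estimated overlap with every
        # previously selected collection is not greater than gamma.
        if all(overlap(candidate, chosen) <= gamma for chosen in selected):
            selected.append(candidate)
    return selected


# Toy overlap estimates; missing pairs are treated as zero overlap.
toy_overlap: Dict[FrozenSet[str], float] = {
    frozenset({"c1", "c2"}): 0.6,
    frozenset({"c1", "c3"}): 0.1,
    frozenset({"c2", "c3"}): 0.5,
    frozenset({"c3", "c4"}): 0.3,
}


def lookup(a: str, b: str) -> float:
    return toy_overlap.get(frozenset({a, b}), 0.0)


kept = filter_overlapping(["c1", "c2", "c3", "c4"], lookup)
```

In the toy data, c2 is dropped because of its overlap with c1, c3 survives because c2 was never selected, and c4 is kept because an overlap of exactly 30% is not greater than γ.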
Testbeds • The authors create three new testbeds with overlapping collections based on the documents available in the TREC GOV dataset • Qprobed-280 • The 360 most frequent queries in a search engine in the .gov domain • A random number of documents (between 5000 and 20000) is downloaded as a collection • Generates 280 collections with an average size of 12194 documents • Qprobed-300 • Every twentieth collection is merged into a single large collection • Sliding-115 • Uses a sliding window of 30 000 documents • Generate 112 collections
Testbeds (cont’d) • Qprobed-280 • 74492 collection pairs < 10% overlap • 79 pairs < 90% • 1.1% of collection pairs > 50% overlap • Qprobed-300 • 1.9% of collection pairs > 50% overlap • Sliding-115 • 2.5% of collection pairs > 50% overlap
Results • The initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated • Document retrieval models are biased towards returning some popular documents for many queries • Samples produced by query-based sampling are not random
Conclusion & Discussion • Pros • Proposes an efficient algorithm for handling duplicates • Cons • Experiments show improved performance • Does this hold in a practical environment?