Federated text retrieval from uncooperative overlapped collections

Milad Shokouhi, RMIT University, Melbourne, Australia
Justin Zobel, RMIT University, Melbourne, Australia
SIGIR 2007 (Collection representation in distributed IR)
2009-03-13
Presented by JongHeum Yeon, IDS Lab., Seoul National University

Abstract
[Figure: user, broker, and three collections]
• Federated information retrieval (FIR)
• The query is sent to multiple collections
• A central broker merges the results and ranks them
• Collections may contain duplicated documents
• The final results can therefore contain a high number of duplicates
• The authors propose a method for estimating the rate of overlap among collections based on sampling
• Using the estimated overlap statistics, they propose two collection selection methods that aim to maximize the number of unique relevant documents in the final results

Federated Information Retrieval (FIR)
• Query is sent simultaneously to several collections
• Each collection evaluates the query and returns the results to the broker
• Advantages
• No need to access the index of the collections
• Search over the latest version of documents without crawling and indexing
• Broker selects collections that are most likely to return relevant documents
• Collection selection problem
• Collection representation problem
• Result merging problem
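The broker flow described on this slide can be sketched roughly as follows. This is only an illustration: the `Collection` callable type, the `federated_search` name, and the assumptions that scores are directly comparable across collections and that duplicates share an identical document ID are all hypothetical simplifications, not part of the presented work.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration: a collection is any callable that
# takes a query string and returns (doc_id, score) pairs.
Result = Tuple[str, float]
Collection = Callable[[str], List[Result]]


def federated_search(query: str,
                     collections: Dict[str, Collection],
                     top_k: int = 10) -> List[Result]:
    """Send the query to every selected collection and merge the results."""
    merged: Dict[str, float] = {}
    for _name, search in collections.items():
        for doc_id, score in search(query):
            # Assumption: a duplicate is recognised by an identical doc_id.
            # Keep the best score seen for a document returned several times.
            if doc_id not in merged or score > merged[doc_id]:
                merged[doc_id] = score
    # Assumption: scores from different collections are comparable as-is.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

Even in this toy version, a document returned by two collections costs two result slots on the wire but contributes only one entry to the merged list, which is the waste that overlap-aware selection tries to avoid.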

Collection Selection Problem
• FIR techniques assume that the degree of overlap among collections is either none or negligible
• However, many collections have a significant degree of overlap
• Bibliographic databases
• News resources
• Selecting collections that are likely to return the same results introduces duplicate documents into the final results, which:
• Wastes costly resources
• Degrades search effectiveness
• Authors propose …
• A method that estimates the degree of overlap among collections by sampling from each collection using random queries
• Two collection selection techniques that use the estimated overlap statistics to maximize the number of unique relevant documents in the final results

Related Work
• Cooperative collection selection techniques
• Collections provide the broker with their index statistics and other useful information
• CORI, GlOSS, CVV
• Uncooperative collection selection techniques
• Collections do not provide their index statistics to the broker
• The broker samples documents from each collection
• ReDDE uses sampled documents for …
• Estimates the number of relevant documents in collections
• Ranks collections according to the number of highly ranked sampled documents

Overlap Estimation
[Figure: collections C1 and C2 with samples S1 and S2; K marks their overlap]
• Uses the documents downloaded by query-based sampling to estimate the rate of overlap; does not require any additional information
• Expected number of documents
• m: subset of sample documents
• Size of m
• The probability of any given document from m1 being available in m2

Overlap Estimation (cont’d)
• P(i) follows a binomial distribution

Overlap Estimation (cont’d)
• Binomial theorem
• Expected number of documents in m1 ∩ m2
• The number of overlapping documents is independent of the collection size
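As a rough illustration of sample-based overlap estimation, the sketch below inverts the expected number of duplicates between two samples. It assumes each sample is a uniform random subset of its collection, drawn independently, and that the collection sizes are known or separately estimated; it is not necessarily the exact derivation on the slides, and `estimate_overlap` is a hypothetical name. Under those assumptions, each of the K shared documents lands in both samples with probability (|S1|/|C1|)·(|S2|/|C2|), so the expected duplicate count is K·|S1|·|S2| / (|C1|·|C2|).

```python
from typing import Iterable


def estimate_overlap(sample1: Iterable[str], sample2: Iterable[str],
                     size1: int, size2: int) -> float:
    """Estimate K, the number of documents shared by two collections.

    sample1/sample2: document IDs sampled from each collection.
    size1/size2: (estimated) collection sizes |C1| and |C2|.
    Assumption: uniform, independent sampling from each collection.
    """
    s1, s2 = set(sample1), set(sample2)
    if not s1 or not s2:
        raise ValueError("both samples must be non-empty")
    duplicates = len(s1 & s2)  # documents observed in both samples
    # Invert E[duplicates] = K * |S1| * |S2| / (|C1| * |C2|).
    k_hat = duplicates * (size1 * size2) / (len(s1) * len(s2))
    # K can never exceed the smaller collection.
    return min(k_hat, float(min(size1, size2)))
```

For example, two samples of 10 and 20 documents that share 2 documents, drawn from collections of 50 and 40 documents, give an estimate of 2 · (50 · 40) / (10 · 20) = 20 shared documents. When samples are biased towards popular documents, as query-based samples tend to be, an estimator of this kind overestimates the overlap.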

The ‘RELAX’ Selection Method
• Graph G: vertices are collections; an edge (u, v) indicates overlapping documents between collections u and v
• Output: a final merged document list with as few duplicates as possible

The ‘RELAX’ Selection Method (cont’d)
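The steps of the Relax method itself are not preserved in this transcript, so the sketch below should not be read as the authors' algorithm. It only illustrates the general idea of overlap-aware selection on a collection graph with a plain greedy heuristic: pick the collection with the most estimated relevant documents, then discount its neighbours by the relevant documents they are estimated to share with it. The function name and both input dictionaries are hypothetical.

```python
from typing import Dict, List, Tuple


def select_with_overlap_discount(
    relevant: Dict[str, float],
    shared_relevant: Dict[Tuple[str, str], float],
    k: int,
) -> List[str]:
    """Greedily pick up to k collections by estimated *new* relevant documents.

    relevant: estimated number of relevant documents per collection (vertex).
    shared_relevant: estimated relevant documents shared by a pair (edge).
    Both inputs are assumptions made for this illustration.
    """
    def shared(a: str, b: str) -> float:
        # Edges are treated as undirected; a missing edge means no overlap.
        return shared_relevant.get((a, b), shared_relevant.get((b, a), 0.0))

    remaining = dict(relevant)  # working copy of the per-collection estimates
    selected: List[str] = []
    while remaining and len(selected) < k:
        # Highest remaining estimate wins; ties are broken by name.
        best = max(sorted(remaining), key=lambda name: remaining[name])
        selected.append(best)
        del remaining[best]
        # Discount every other collection by what it shares with the pick.
        for name in remaining:
            remaining[name] = max(0.0, remaining[name] - shared(name, best))
    return selected
```

With estimates A = 10, B = 9, C = 5 and 8 relevant documents shared between A and B, selecting two collections yields A and then C, because B contributes only about one new relevant document once A has been chosen.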

Overlap Filtering for ReDDE
• F-ReDDE
• The overlaps among collections are estimated as described for the Relax selection
• Collections are ranked using a resource selection algorithm such as ReDDE
• Each collection is compared with the previously selected collections and is removed from the list if its overlap with any of them is high (greater than γ)
• The authors empirically choose γ = 30% and leave methods for finding the optimum value as future work
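The filtering step can be sketched as a single pass over an already ranked list. This is a minimal illustration under stated assumptions: the ranking is supplied by some resource selection algorithm, the pairwise overlap rates are given as fractions in a dictionary and treated as symmetric, and `filter_overlapping` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def filter_overlapping(ranked: List[str],
                       overlap: Dict[Tuple[str, str], float],
                       gamma: float = 0.30) -> List[str]:
    """Drop collections that overlap too much with earlier selections.

    ranked: collection names, best first (e.g. from a ReDDE-style ranker).
    overlap: estimated overlap rate per pair, as a fraction in [0, 1].
    gamma: overlap threshold; 0.30 mirrors the 30% quoted on the slide.
    """
    def rate(a: str, b: str) -> float:
        # Assumption: overlap is symmetric; a missing pair means no overlap.
        return overlap.get((a, b), overlap.get((b, a), 0.0))

    selected: List[str] = []
    for candidate in ranked:
        # Keep the candidate only if its overlap with every previously
        # selected collection is at most gamma.
        if all(rate(candidate, chosen) <= gamma for chosen in selected):
            selected.append(candidate)
    return selected
```

Note that a candidate is compared only with collections that were actually kept, so a collection that overlaps heavily with a removed one can still be selected.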

Testbeds
• The authors create three new testbeds with overlapping collections, based on the documents available in the TREC GOV dataset
• Qprobed-280
• 360 most frequent queries in a search engine in the .gov domain
• A random number of documents (between 5000 and 20000) is downloaded as a collection
• Generates 280 collections with an average size of 12194 documents
• Qprobed-300
• Every twentieth collection is merged into a single large collection
• Sliding-115
• Uses a sliding window of 30 000 documents
• Generates 112 collections

Testbeds (cont’d)
• Qprobed-280
• 74492 collection pairs < 10% overlap
• 79 pairs < 90%
• 1.1% of collection pairs > 50% overlap
• Qprobed-300
• 1.9% of collection pairs > 50% overlap
• Sliding-115
• 2.5% of collection pairs > 50% overlap

Results
• The initial estimated values for D(i, j) suggested that the degree of overlap among collections is usually overestimated
• Document retrieval models are biased towards returning some popular documents for many queries
• Samples produced by query-based sampling are not random

Results (cont’d)

Conclusion & Discussion
• Pros
• Proposes an efficient algorithm for handling duplicates
• Cons
• Experiments show improved performance
• In a practical environment?
