
Relevant Document Distribution Estimation Method for Resource Selection



Presentation Transcript


  1. Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University lsi@cs.cmu.edu callan@cs.cmu.edu

2. Abstract
Task: Distributed Information Retrieval in uncooperative environments.
Contributions:
• Sample-Resample method to estimate DB size
• ReDDE (Relevant Document Distribution Estimation) resource selection algorithm, which directly estimates the distribution of relevant documents among databases
• Modified ReDDE algorithm for better retrieval performance

3. What is Distributed Information Retrieval (Federated Search)?
[Figure: many search engines (Engine 1 … Engine n) connected to a client through (1) Resource Representation, (2) Resource Selection, and (4) Results Merging]
Four steps (a toy end-to-end sketch follows below):
(1) Find out what each DB contains
(2) Decide which DBs to search
(3) Search the selected DBs
(4) Merge the results returned by the DBs
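To make the four steps concrete, here is a minimal, self-contained Python sketch. The Engine class, its term-overlap scoring, and the interleaved merge are toy assumptions for illustration only; they are not the representations or algorithms presented in this talk.

```python
# Toy federated search pipeline; all names and scoring rules are
# illustrative assumptions, not the paper's methods.

class Engine:
    def __init__(self, name, docs):
        self.name, self.docs = name, docs

    def profile(self):
        # (1) Resource representation: here, just the engine's vocabulary.
        return {w for d in self.docs for w in d.split()}

    def search(self, query):
        # (3) Search: rank this engine's docs by term overlap with the query.
        q = set(query.split())
        return sorted(self.docs, key=lambda d: len(q & set(d.split())), reverse=True)

def federated_search(query, engines, k=2):
    q = set(query.split())
    # (2) Resource selection: pick the k engines whose profiles best match the query.
    selected = sorted(engines, key=lambda e: len(q & e.profile()), reverse=True)[:k]
    # (4) Results merging: naive interleave of the selected engines' result lists.
    lists = [e.search(query) for e in selected]
    return [doc for tier in zip(*lists) for doc in tier]

engines = [Engine("A", ["federated search merges results", "cats purr loudly"]),
           Engine("B", ["resource selection picks databases", "dogs bark often"])]
print(federated_search("resource selection for federated search", engines))
```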

4. Previous Work: Resource Representation
Resource representation (content representation):
• Query-based sampling (Callan et al., 1999): submit randomly generated queries and analyze the returned documents; needs no cooperation from individual DBs
Resource representation (database size estimation):
• Capture-recapture model (Liu and Yu, 1999): estimate the total number of documents in a DB from the overlap between repeated query-based samples (a simplified sketch follows below)
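The core capture-recapture idea can be illustrated with the classic Lincoln-Petersen estimator: sample the DB twice and scale by the overlap, N ≈ n1·n2/|overlap|. This sketch uses direct random sampling of document ids for simplicity; that simplification and the function name are assumptions, and Liu and Yu's actual model is more elaborate.

```python
import random

def capture_recapture_estimate(db_doc_ids, sample_size, rng=random.Random(0)):
    capture = set(rng.sample(db_doc_ids, sample_size))    # first sampling round
    recapture = set(rng.sample(db_doc_ids, sample_size))  # second sampling round
    overlap = len(capture & recapture)
    if overlap == 0:
        return None  # too few transactions to form an estimate
    # Lincoln-Petersen estimator: N ≈ n1 * n2 / |overlap|
    return len(capture) * len(recapture) / overlap

docs = list(range(10000))                     # a toy DB of 10,000 documents
print(capture_recapture_estimate(docs, 385))  # close to 10000 on average
```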

5. Previous Work: Resource Selection & Results Merging
Resource selection:
• gGlOSS (Gravano et al., 1995): represents DBs and queries as vectors and computes their similarities
• CORI (Callan et al., 1995): a Bayesian inference network model; has been shown to be effective on different testbeds
Results merging:
• CORI results-merging algorithm (Callan et al., 1995): a linear heuristic model with fixed parameters (a sketch follows below)
• Semi-Supervised Learning algorithm (Si and Callan, 2002): a linear model whose parameters are learned from training data
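The CORI merging heuristic is commonly described as combining min-max-normalized document and resource scores via (D + 0.4·D·R)/1.4. The sketch below assumes that form and hypothetical input structures; the original algorithm's exact normalization details may differ.

```python
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def cori_merge(result_lists, resource_scores):
    """result_lists: {db: [(doc_id, doc_score), ...]}, resource_scores: {db: R}."""
    r_lo, r_hi = min(resource_scores.values()), max(resource_scores.values())
    merged = []
    for db, docs in result_lists.items():
        r = minmax(resource_scores[db], r_lo, r_hi)  # normalized resource score
        d_vals = [s for _, s in docs]
        d_lo, d_hi = min(d_vals), max(d_vals)
        for doc_id, s in docs:
            d = minmax(s, d_lo, d_hi)                # normalized document score
            merged.append((doc_id, (d + 0.4 * d * r) / 1.4))
    return sorted(merged, key=lambda t: t[1], reverse=True)
```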

6. Previous Work: Thoughts
• The original capture-recapture method incurs a very large cost to obtain relatively accurate DB size estimates
• Most resource selection algorithms have not been studied in environments with skewed DB size distributions
• Existing algorithms do not directly optimize the number of relevant docs contained in the selected DBs, which is the goal of resource selection
• There is an inconsistency between the goals of resource selection (high recall) and retrieval (high precision)

7. Experimental Data
Testbeds:
• Trec123_100col: 100 DBs, organized by source and publication date; DB sizes and the distribution of relevant documents are rather uniform
• Trec123_AP_WSJ_60col ("Relevant"): 62 DBs; 60 from the above, plus 2 built by merging the AP and WSJ DBs; DB sizes are skewed and the large DBs have many more relevant docs
• Trec123_FR_DOE_81col ("Non-Relevant"): 83 DBs; 81 from the above, plus 2 built by merging the FR and DOE DBs; DB sizes are skewed and the large DBs have few relevant docs
• Trec4_kmeans: 100 DBs, organized by topic; DB sizes and the distribution of relevant documents are moderately skewed
• Trec123_10col: 10 large DBs, each built by merging 10 DBs of Trec123_100col in a round-robin fashion

8. A New Approach to DB Size Estimation: The Sample-Resample Algorithm
The idea:
• Assumption: the search engine indicates the number of docs that match a one-term query
• Strategy: estimate the df of a term in the sampled docs, obtain its df in the whole DB via a one-term resample query, and scale the number of sampled docs to get the DB size (a sketch follows below):
df_samp,j / N_samp,j ≈ df_j / N̂_j, so N̂_j = df_j · N_samp,j / df_samp,j
where df_samp,j is the df of the term in the docs sampled from the jth DB, df_j is its df in the whole jth DB, N_samp,j is the number of docs sampled from the jth DB, and N̂_j is the DB size estimate
Definitions:
• Centralized sample DB (CSDB): built from all the sampled docs
• Centralized complete DB (CCDB): imaginary DB built from all the docs in all DBs
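A minimal sketch of that estimate; the function name, the token-set document representation, and averaging per-term estimates over several probe terms are illustrative assumptions.

```python
def sample_resample_size(sampled_docs, probe_terms, engine_hit_count):
    """sampled_docs: list of token sets sampled from one DB;
    engine_hit_count(term): df reported by the engine for a one-term query."""
    n_samp = len(sampled_docs)
    estimates = []
    for term in probe_terms:
        df_samp = sum(1 for doc in sampled_docs if term in doc)  # df in the sample
        if df_samp == 0:
            continue                      # term unusable as a probe
        df_full = engine_hit_count(term)  # one "resample" transaction
        estimates.append(df_full * n_samp / df_samp)  # N_hat = df_j * N_samp,j / df_samp,j
    return sum(estimates) / len(estimates) if estimates else None
```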

9. Experimental Results: DB Size Estimation
Both methods were allowed the same number of transactions with a DB:
• Capture-recapture: about 385 queries (transactions)
• Sample-resample: 80 queries and 300 downloaded docs for the sample, plus 5 resample queries = 385 transactions
Measure: absolute error ratio = |estimated DB size − actual DB size| / actual DB size
Note: the original capture-recapture ("Top 1") selects only the top 1 doc per query to build the sample; more experiments are in the paper

10. A New Approach to Resource Selection: The ReDDE Algorithm
• The goal of resource selection: select the (few) DBs that have the most relevant documents
• Common strategy: pick the DBs that are the "most similar" to the query; but similarity measures don't always normalize well for DB size
• Desired strategy: rank DBs by the number of relevant documents they contain; it hasn't been clear how to do this
• An approximation of the desired strategy: rank DBs by the percentage of relevant documents they contain; this can be estimated a little more easily… but we need to make some assumptions

11. The ReDDE Algorithm: Estimating the Distribution of Relevant Documents
• Run the query against the CSDB and map each retrieved sampled document's CSDB rank to an estimated rank in the CCDB: each sampled document from the jth DB stands for N̂_j / N_samp,j documents of the complete DB, where N̂_j is the estimated DB size and N_samp,j is the number of docs sampled from the jth DB (scale by DB size)
• Assume "everything at the top is (equally) relevant": a sampled document counts as relevant, with some constant probability C_q, if its estimated CCDB rank falls within a fixed top fraction of the CCDB
• Sum these scaled counts per DB, then normalize across DBs to eliminate the constant C_q; the result is the estimated distribution of relevant documents (a sketch follows below)
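A sketch of that estimation under stated assumptions: the CSDB ranking is represented simply as the list of owning-DB ids in rank order, and the ratio default is illustrative rather than the paper's tuned value.

```python
def redde_scores(csdb_ranking, db_size_hat, n_sampled, ratio=0.003):
    """csdb_ranking: owning-DB id of each sampled doc, in CSDB rank order."""
    total = sum(db_size_hat.values())          # estimated size of the CCDB
    threshold = ratio * total                  # "everything at the top is relevant"
    rel = {j: 0.0 for j in db_size_hat}
    ccdb_rank = 0.0                            # estimated rank in the complete DB
    for j in csdb_ranking:
        scale = db_size_hat[j] / n_sampled[j]  # docs each sampled doc stands for
        if ccdb_rank < threshold:
            rel[j] += scale                    # constant C_q omitted; it normalizes out
        ccdb_rank += scale
    z = sum(rel.values()) or 1.0
    return {j: v / z for j, v in rel.items()}  # normalized distribution over DBs
```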

12. Experimental Results: Resource Selection
Measure: percentage of the number of relevant docs included, compared with the relevance-based (best) ranking.
[Figures: evaluated ranking vs. best ranking on Trec123-100col (100 DBs), Trec4-kmeans (100 DBs), the "Relevant" testbed (2 large, 60 small DBs; the large DBs are relevant), and the "Non-Relevant" testbed (2 large, 81 small DBs; the large DBs are non-relevant)]

13. Modified ReDDE for Retrieval Performance
The ReDDE algorithm has a parameter ("ratio") that tunes it for high precision or high recall:
• High precision focuses attention at the top of the rankings
• High recall focuses attention on retrieving more relevant documents
• Usually high precision is the goal in interactive environments
• But for some databases the sampled data is sparse, so high-precision settings yield (inaccurate) estimates of zero relevant documents in a DB
Solution: Modified ReDDE with two ratios (a sketch follows below):
• Use the high-precision setting if possible: rank the DBs that have large values under the smaller ratio, i.e. DistRel_r1,j ≥ backoff_thres
• Else use the high-recall setting: rank all the DBs by their values under the larger ratio, DistRel_r2,j
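A sketch of the two-ratio back-off, reusing the hypothetical redde_scores from the slide-11 sketch; the parameter defaults here are illustrative assumptions, not the paper's tuned values.

```python
def modified_redde(csdb_ranking, db_size_hat, n_sampled,
                   r1=0.0005, r2=0.003, backoff_thres=0.1):
    # High-precision attempt: the small ratio looks only at the very top ranks.
    dist_r1 = redde_scores(csdb_ranking, db_size_hat, n_sampled, ratio=r1)
    confident = {j: s for j, s in dist_r1.items() if s >= backoff_thres}
    if confident:
        return sorted(confident, key=confident.get, reverse=True)
    # High-recall back-off: the larger ratio gives nonzero scores to more DBs.
    dist_r2 = redde_scores(csdb_ranking, db_size_hat, n_sampled, ratio=r2)
    return sorted(dist_r2, key=dist_r2.get, reverse=True)
```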

14. Experimental Results: Retrieval Performance
Precision at different document ranks using the CORI and Modified ReDDE resource selection algorithms. Results were averaged over 50 queries; 3 DBs were selected.

15. Conclusion and Future Work
Conclusions:
• The Sample-Resample algorithm gives relatively accurate DB size estimates at low communication cost
• Database size is an important factor for resource selection algorithms, especially in environments with skewed relevant-document distributions
• ReDDE performs at least as well as, and often better than, CORI across different environments
• Modified ReDDE results in better retrieval performance
Future work:
• Adjust the parameters of the ReDDE algorithm automatically
