
Relevant Document Distribution Estimation Method for Resource Selection



Presentation Transcript


  1. Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University lsi@cs.cmu.edu callan@cs.cmu.edu

2. Abstract
Task: Distributed Information Retrieval in uncooperative environments.
Contributions:
• Sample-Resample method to estimate DB size
• ReDDE (Relevant Document Distribution Estimation) resource selection algorithm, which directly estimates the distribution of relevant documents among databases
• Modified ReDDE algorithm for better retrieval performance

3. What is Distributed Information Retrieval (Federated Search)?
[Figure: many search engines (Engine 1 … Engine n) connected to a client through (1) Resource Representation, (2) Resource Selection, and (4) Results Merging]
Four steps (a toy end-to-end sketch follows below):
(1) Find out what each DB contains
(2) Decide which DBs to search
(3) Search the selected DBs
(4) Merge the results returned by the DBs
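To make the four steps concrete, here is a minimal, self-contained Python sketch. The Engine class, its term-overlap scoring, and the interleaved merge are toy assumptions for illustration only; they are not the representations or algorithms presented in this talk.

```python
# Toy federated search pipeline; all names and scoring rules are
# illustrative assumptions, not the paper's methods.

class Engine:
    def __init__(self, name, docs):
        self.name, self.docs = name, docs

    def profile(self):
        # (1) Resource representation: here, just the engine's vocabulary.
        return {w for d in self.docs for w in d.split()}

    def search(self, query):
        # (3) Search: rank this engine's docs by term overlap with the query.
        q = set(query.split())
        return sorted(self.docs, key=lambda d: len(q & set(d.split())), reverse=True)

def federated_search(query, engines, k=2):
    q = set(query.split())
    # (2) Resource selection: pick the k engines whose profiles best match the query.
    selected = sorted(engines, key=lambda e: len(q & e.profile()), reverse=True)[:k]
    # (4) Results merging: naive interleave of the selected engines' result lists.
    lists = [e.search(query) for e in selected]
    return [doc for tier in zip(*lists) for doc in tier]

engines = [Engine("A", ["federated search merges results", "cats purr loudly"]),
           Engine("B", ["resource selection picks databases", "dogs bark often"])]
print(federated_search("resource selection for federated search", engines))
```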

4. Previous Work: Resource Representation
Resource representation (content representation):
• Query-based sampling (Callan et al., 1999): submit randomly generated queries and analyze the returned documents; needs no cooperation from individual DBs
Resource representation (database size estimation):
• Capture-recapture model (Liu and Yu, 1999): estimate the total number of documents in a DB from the overlap between repeated query-based samples (a simplified sketch follows below)
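The core capture-recapture idea can be illustrated with the classic Lincoln-Petersen estimator: sample the DB twice and scale by the overlap, N ≈ n1·n2/|overlap|. This sketch uses direct random sampling of document ids for simplicity; that simplification and the function name are assumptions, and Liu and Yu's actual model is more elaborate.

```python
import random

def capture_recapture_estimate(db_doc_ids, sample_size, rng=random.Random(0)):
    capture = set(rng.sample(db_doc_ids, sample_size))    # first sampling round
    recapture = set(rng.sample(db_doc_ids, sample_size))  # second sampling round
    overlap = len(capture & recapture)
    if overlap == 0:
        return None  # too few transactions to form an estimate
    # Lincoln-Petersen estimator: N ≈ n1 * n2 / |overlap|
    return len(capture) * len(recapture) / overlap

docs = list(range(10000))                     # a toy DB of 10,000 documents
print(capture_recapture_estimate(docs, 385))  # close to 10000 on average
```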

5. Previous Work: Resource Selection & Results Merging
Resource selection:
• gGlOSS (Gravano et al., 1995): represents DBs and queries as vectors and computes their similarities
• CORI (Callan et al., 1995): a Bayesian inference network model; has been shown to be effective on different testbeds
Results merging:
• CORI results-merging algorithm (Callan et al., 1995): a linear heuristic model with fixed parameters (a sketch follows below)
• Semi-Supervised Learning algorithm (Si and Callan, 2002): a linear model whose parameters are learned from training data
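The CORI merging heuristic is commonly described as combining min-max-normalized document and resource scores via (D + 0.4·D·R)/1.4. The sketch below assumes that form and hypothetical input structures; the original algorithm's exact normalization details may differ.

```python
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo) if hi > lo else 0.0

def cori_merge(result_lists, resource_scores):
    """result_lists: {db: [(doc_id, doc_score), ...]}, resource_scores: {db: R}."""
    r_lo, r_hi = min(resource_scores.values()), max(resource_scores.values())
    merged = []
    for db, docs in result_lists.items():
        r = minmax(resource_scores[db], r_lo, r_hi)  # normalized resource score
        d_vals = [s for _, s in docs]
        d_lo, d_hi = min(d_vals), max(d_vals)
        for doc_id, s in docs:
            d = minmax(s, d_lo, d_hi)                # normalized document score
            merged.append((doc_id, (d + 0.4 * d * r) / 1.4))
    return sorted(merged, key=lambda t: t[1], reverse=True)
```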

6. Previous Work: Thoughts
• The original capture-recapture method incurs a very large cost to obtain relatively accurate DB size estimates
• Most resource selection algorithms have not been studied in environments with skewed DB size distributions
• Existing algorithms do not directly optimize the number of relevant docs contained in the selected DBs, which is the goal of resource selection
• There is an inconsistency between the goals of resource selection (high recall) and retrieval (high precision)

7. Experimental Data
Testbeds:
• Trec123_100col: 100 DBs, organized by source and publication date; DB sizes and the distribution of relevant documents are rather uniform
• Trec123_AP_WSJ_60col ("Relevant"): 62 DBs; 60 from the above, plus 2 built by merging the AP and WSJ DBs; DB sizes are skewed and the large DBs have many more relevant docs
• Trec123_FR_DOE_81col ("Non-Relevant"): 83 DBs; 81 from the above, plus 2 built by merging the FR and DOE DBs; DB sizes are skewed and the large DBs have few relevant docs
• Trec4_kmeans: 100 DBs, organized by topic; DB sizes and the distribution of relevant documents are moderately skewed
• Trec123_10col: 10 large DBs, each built by merging 10 DBs of Trec123_100col in a round-robin fashion

8. A New Approach to DB Size Estimation: The Sample-Resample Algorithm
The idea:
• Assumption: the search engine indicates the number of docs that match a one-term query
• Strategy: estimate the df of a term in the sampled docs, obtain its df in the whole DB via a one-term resample query, and scale the number of sampled docs to get the DB size (a sketch follows below):
df_samp,j / N_samp,j ≈ df_j / N̂_j, so N̂_j = df_j · N_samp,j / df_samp,j
where df_samp,j is the df of the term in the docs sampled from the jth DB, df_j is its df in the whole jth DB, N_samp,j is the number of docs sampled from the jth DB, and N̂_j is the DB size estimate
Definitions:
• Centralized sample DB (CSDB): built from all the sampled docs
• Centralized complete DB (CCDB): imaginary DB built from all the docs in all DBs
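A minimal sketch of that estimate; the function name, the token-set document representation, and averaging per-term estimates over several probe terms are illustrative assumptions.

```python
def sample_resample_size(sampled_docs, probe_terms, engine_hit_count):
    """sampled_docs: list of token sets sampled from one DB;
    engine_hit_count(term): df reported by the engine for a one-term query."""
    n_samp = len(sampled_docs)
    estimates = []
    for term in probe_terms:
        df_samp = sum(1 for doc in sampled_docs if term in doc)  # df in the sample
        if df_samp == 0:
            continue                      # term unusable as a probe
        df_full = engine_hit_count(term)  # one "resample" transaction
        estimates.append(df_full * n_samp / df_samp)  # N_hat = df_j * N_samp,j / df_samp,j
    return sum(estimates) / len(estimates) if estimates else None
```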

9. Experimental Results: DB Size Estimation
Both methods were allowed the same number of transactions with a DB:
• Capture-recapture: about 385 queries (transactions)
• Sample-resample: 80 queries and 300 downloaded docs for the sample, plus 5 resample queries = 385 transactions
Measure: absolute error ratio = |estimated DB size − actual DB size| / actual DB size
Note: the original capture-recapture ("Top 1") selects only the top 1 doc per query to build the sample; more experiments are in the paper

10. A New Approach to Resource Selection: The ReDDE Algorithm
• The goal of resource selection: select the (few) DBs that have the most relevant documents
• Common strategy: pick the DBs that are the "most similar" to the query; but similarity measures don't always normalize well for DB size
• Desired strategy: rank DBs by the number of relevant documents they contain; it hasn't been clear how to do this
• An approximation of the desired strategy: rank DBs by the percentage of relevant documents they contain; this can be estimated a little more easily… but we need to make some assumptions

11. The ReDDE Algorithm: Estimating the Distribution of Relevant Documents
• Run the query against the CSDB and map each retrieved sampled document's CSDB rank to an estimated rank in the CCDB: each sampled document from the jth DB stands for N̂_j / N_samp,j documents of the complete DB, where N̂_j is the estimated DB size and N_samp,j is the number of docs sampled from the jth DB (scale by DB size)
• Assume "everything at the top is (equally) relevant": a sampled document counts as relevant, with some constant probability C_q, if its estimated CCDB rank falls within a fixed top fraction of the CCDB
• Sum these scaled counts per DB, then normalize across DBs to eliminate the constant C_q; the result is the estimated distribution of relevant documents (a sketch follows below)
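A sketch of that estimation under stated assumptions: the CSDB ranking is represented simply as the list of owning-DB ids in rank order, and the ratio default is illustrative rather than the paper's tuned value.

```python
def redde_scores(csdb_ranking, db_size_hat, n_sampled, ratio=0.003):
    """csdb_ranking: owning-DB id of each sampled doc, in CSDB rank order."""
    total = sum(db_size_hat.values())          # estimated size of the CCDB
    threshold = ratio * total                  # "everything at the top is relevant"
    rel = {j: 0.0 for j in db_size_hat}
    ccdb_rank = 0.0                            # estimated rank in the complete DB
    for j in csdb_ranking:
        scale = db_size_hat[j] / n_sampled[j]  # docs each sampled doc stands for
        if ccdb_rank < threshold:
            rel[j] += scale                    # constant C_q omitted; it normalizes out
        ccdb_rank += scale
    z = sum(rel.values()) or 1.0
    return {j: v / z for j, v in rel.items()}  # normalized distribution over DBs
```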

12. Experimental Results: Resource Selection
Measure: percentage of the number of relevant docs included, compared with the relevance-based (best) ranking.
[Figures: evaluated ranking vs. best ranking on Trec123-100col (100 DBs), Trec4-kmeans (100 DBs), the "Relevant" testbed (2 large, 60 small DBs; the large DBs are relevant), and the "Non-Relevant" testbed (2 large, 81 small DBs; the large DBs are non-relevant)]

13. Modified ReDDE for Retrieval Performance
The ReDDE algorithm has a parameter ("ratio") that tunes it for high precision or high recall:
• High precision focuses attention at the top of the rankings
• High recall focuses attention on retrieving more relevant documents
• Usually high precision is the goal in interactive environments
• But for some databases the sampled data is sparse, so high-precision settings yield (inaccurate) estimates of zero relevant documents in a DB
Solution: Modified ReDDE with two ratios (a sketch follows below):
• Use the high-precision setting if possible: rank the DBs that have large values under the smaller ratio, i.e. DistRel_r1,j ≥ backoff_thres
• Else use the high-recall setting: rank all the DBs by their values under the larger ratio, DistRel_r2,j
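A sketch of the two-ratio back-off, reusing the hypothetical redde_scores from the slide-11 sketch; the parameter defaults here are illustrative assumptions, not the paper's tuned values.

```python
def modified_redde(csdb_ranking, db_size_hat, n_sampled,
                   r1=0.0005, r2=0.003, backoff_thres=0.1):
    # High-precision attempt: the small ratio looks only at the very top ranks.
    dist_r1 = redde_scores(csdb_ranking, db_size_hat, n_sampled, ratio=r1)
    confident = {j: s for j, s in dist_r1.items() if s >= backoff_thres}
    if confident:
        return sorted(confident, key=confident.get, reverse=True)
    # High-recall back-off: the larger ratio gives nonzero scores to more DBs.
    dist_r2 = redde_scores(csdb_ranking, db_size_hat, n_sampled, ratio=r2)
    return sorted(dist_r2, key=dist_r2.get, reverse=True)
```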

14. Experimental Results: Retrieval Performance
Precision at different document ranks using the CORI and Modified ReDDE resource selection algorithms. Results were averaged over 50 queries; 3 DBs were selected.

15. Conclusion and Future Work
Conclusions:
• The Sample-Resample algorithm gives relatively accurate DB size estimates at low communication cost
• Database size is an important factor for resource selection algorithms, especially in environments with skewed relevant-document distributions
• ReDDE performs at least as well as, and often better than, CORI across different environments
• Modified ReDDE results in better retrieval performance
Future work:
• Adjust the parameters of the ReDDE algorithm automatically
