Improve text database selection by enhancing content summaries using shrinkage and database classification. Learn how to obtain better word probability estimates for more accurate information retrieval.
When one Sample is not Enough: Improving Text Database Selection using Shrinkage. Panos Ipeirotis, Luis Gravano. Computer Science Department, Columbia University
“Regular” Web Pages and Text Databases • “Regular” Web • Link structure • Crawlable • Documents indexed by search engines • Text Databases (a.k.a. “Hidden Web”, “Deep Web”…) • Usually no link structure • Documents “hidden” in databases • Documents not indexed by search engines • Need to query each collection individually Panos Ipeirotis - Columbia University
Text Databases: Examples • Search on U.S. Patent and Trademark Office (USPTO) database: • [wireless network] 26,012 matches • (USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html) • Search on Google restricted to USPTO database site: • [wireless network site:patft.uspto.gov] 0 matches as of June 10th, 2004
Metasearchers Provide Access to Distributed Text Databases • Database selection relies on simple content summaries: vocabulary and word frequencies • Example content summary for PubMed (11,868,552 documents): aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 27,960; … • [Figure: a metasearcher must decide where to send the query [thrombopenia] among PubMed, NYTimesArchives, and USPTO; the per-database match counts shown are 27,960, 42, and 0]
Extracting Content Summaries from Autonomous Text Databases • Step 1: Send queries to the database • Step 2: Retrieve the top matching documents • Step 3: If the stopping criterion is met (e.g., sample > 300 docs), exit; otherwise go to Step 1 • The content summary contains the words in the sample and the document frequency of each word • Problem: Summaries from small samples are highly incomplete
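The sampling loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `search` is a hypothetical callable standing in for a database's query interface, and picking the next query word at random from the words seen so far is an assumption, as are the parameter names and defaults.

```python
import random
from collections import Counter

def sample_summary(search, seed_words, max_docs=300, top_k=4,
                   max_queries=1000, rng=None):
    """Build an approximate content summary by query-based sampling.

    `search(word, k)` is a hypothetical stand-in for the database's search
    interface: it returns up to k (doc_id, text) pairs matching `word`.
    Returns (document frequency of each word in the sample, sample size).
    """
    rng = rng or random.Random(0)
    sample = {}                    # doc_id -> set of words in that document
    vocabulary = list(seed_words)  # candidate query words
    for _ in range(max_queries):
        if len(sample) >= max_docs or not vocabulary:
            break                  # Step 3: stopping criterion met
        word = rng.choice(vocabulary)             # Step 1: send a query
        for doc_id, text in search(word, top_k):  # Step 2: top matches
            if doc_id not in sample:
                words = set(text.lower().split())
                sample[doc_id] = words
                # Assumption: sampled words become future query candidates.
                vocabulary.extend(sorted(words))
    df = Counter(w for words in sample.values() for w in words)
    return df, len(sample)
```

Because the stopping criterion is checked once per query, the final sample can slightly exceed `max_docs`, which mirrors the "sample > 300 docs" wording on the slide.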
Problem: Summaries Derived from Small Samples Fundamentally Incomplete • Many words appear in “relatively few” documents (Zipf’s law) • Low-frequency words are often important • Small document samples miss many low-frequency words • [Figure: log(frequency) vs. rank for the 10% most frequent words in the PubMed database, with the sample size of 300 marked; example: endocarditis appears in ~9,000 docs (~0.1%)]
Improving Sample-based Content Summaries Main Idea: Database Classification Helps • Similar topics ↔ Similar content summaries • Extracted content summaries complement each other • Classification available from directories (e.g., Open Directory) or derived automatically (e.g., QProber) Challenge: Improve content summary quality without increasing sample size
Databases with Similar Topics • Cancerlit contains “metastasis”, not found during sampling • CancerBacup contains “metastasis” • Databases under same category have similar vocabularies, and can complement each other
Content Summaries for Categories • Databases under same category share similar vocabulary • Higher-level category content summaries provide additional useful estimates of “word probabilities” • Can use all estimates in category path
Enhancing Summaries Using “Shrinkage” • Word-probability estimates from database content summaries can be unreliable • Category content summaries are more reliable (based on larger samples) but less specific to database • By combining estimates from category and database content summaries we get better estimates
Shrinkage-based Estimations • Adjust probability estimates: Pr[metastasis | D] = λ1 * 0.002 + λ2 * 0.05 + λ3 * 0.092 + λ4 * 0.000 • Select the λi weights to maximize the probability that the summary of D is from a database under all its parent categories
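The adjusted estimate on this slide is a weighted sum of the per-level estimates. A small sketch of that arithmetic follows; the four probabilities come from the slide, while the λ values are made-up placeholders (the slides do not give them) and the requirement that the weights sum to 1 is the usual mixture assumption rather than something the slide states.

```python
def shrunk_probability(estimates, weights):
    """Combine per-level word-probability estimates with mixture weights.

    Assumes one weight per estimate and that the weights sum to 1.
    """
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return sum(lam * p for lam, p in zip(weights, estimates))

# Estimates for "metastasis" from the slide, in the order they appear.
metastasis_estimates = [0.002, 0.05, 0.092, 0.000]
# Hypothetical weights, for illustration only.
example_weights = [0.1, 0.2, 0.4, 0.3]

example_estimate = shrunk_probability(metastasis_estimates, example_weights)
```

Even though the last estimate is 0.000, the combined estimate is non-zero as long as some other component with a positive estimate has a positive weight.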
Computing Shrinkage-based Summaries • Category path: Root → Health → Cancer → D • Pr[metastasis | D] = λ1 * 0.002 + λ2 * 0.05 + λ3 * 0.092 + λ4 * 0.000 • Pr[treatment | D] = λ1 * 0.015 + λ2 * 0.12 + λ3 * 0.179 + λ4 * 0.184 • … • Automatic computation of λi weights using an EM algorithm • Computation performed offline, so there is no query overhead • Avoids the “sparse data” problem and decreases estimation risk
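One common way to fit mixture weights like the λi with EM is sketched below. It is a generic EM for a mixture with fixed components, not necessarily the exact procedure behind these slides. The function name, the uniform initialization, and the fixed iteration count are assumptions. The sketch also assumes `counts` come from documents that were not used to build the database's own component; otherwise the weight of that component tends to drift toward 1.

```python
def em_mixture_weights(components, counts, iterations=50, floor=1e-12):
    """Estimate mixture weights for fixed component distributions.

    components: list of dicts mapping word -> probability (one per level).
    counts:     dict mapping word -> observed count (assumed held out).
    """
    m = len(components)
    weights = [1.0 / m] * m  # assumption: start from uniform weights
    for _ in range(iterations):
        expected = [0.0] * m
        for word, c in counts.items():
            parts = [weights[i] * components[i].get(word, 0.0)
                     for i in range(m)]
            denom = sum(parts)
            if denom <= floor:
                continue  # no component explains this word; skip it
            # E-step: split the word's count across components.
            for i in range(m):
                expected[i] += c * parts[i] / denom
        norm = sum(expected)
        if norm == 0.0:
            break
        # M-step: new weights are the normalized expected counts.
        weights = [e / norm for e in expected]
    return weights
```

Since the weights depend only on the summaries and not on any query, this step can run once offline, which matches the "no query overhead" point on the slide.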
Shrinkage Weights and Summary • [Table: old estimates vs. new estimates] • Shrinkage: • Increases word-probability estimates for underestimates (e.g., metastasis) • Decreases word-probability estimates for overestimates (e.g., aids) • Also introduces spurious words with small probabilities (e.g., football)
Is Shrinkage Always Necessary? • Shrinkage is used to reduce the uncertainty (variance) of estimations • Small samples of large databases → high variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 10,000,000 documents? • Small samples of small databases → small variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 200 documents? • Shrinkage is less useful (or even harmful) when uncertainty is low
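One way to see why the same sample says more about a small database is the textbook standard error of a sample proportion with a finite population correction. This is an illustrative statistic swapped in here, not the uncertainty analysis used in the talk, and it assumes simple random sampling without replacement, which query-based sampling only approximates.

```python
from math import sqrt

def proportion_standard_error(hits, sample_size, database_size):
    """Standard error of hits/sample_size as an estimate of the
    database-wide fraction, with finite population correction."""
    if not 0 < sample_size <= database_size:
        raise ValueError("need 0 < sample_size <= database_size")
    if database_size == 1:
        return 0.0
    p = hits / sample_size
    fpc = (database_size - sample_size) / (database_size - 1)
    return sqrt(p * (1 - p) / sample_size * fpc)

# The two cases from the slide: 10 of 100 sampled documents contain the word.
se_large = proportion_standard_error(10, 100, 10_000_000)
se_small = proportion_standard_error(10, 100, 200)
```

Under these assumptions the error on the fraction is somewhat smaller for the 200-document database, and the error on the absolute document count (the standard error times the database size) differs by several orders of magnitude.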
Adaptive Application of Shrinkage • Database selection algorithms assign scores to databases for each query • When word frequency estimates are uncertain, the assigned score has high variance → shrinkage improves score estimates • When word frequency estimates are reliable, the assigned score has small variance → shrinkage unnecessary • [Figure: two probability distributions over the database score for a query (0 to 1); panels: “Unreliable Score Estimate: Use shrinkage” and “Reliable Score Estimate: Shrinkage might hurt”] • Solution: Use shrinkage adaptively in a query- and database-specific manner
Searching Algorithm • One-time process: • Extract document samples • Get database classification • Compute shrinkage-based summaries • For every query Q: • For each database D: • Use a regular database selection algorithm to compute the query score for D using the old, “unshrunk” summary • Analyze the uncertainty of the score • If the uncertainty is high, use the new, shrinkage-based summary instead and compute a new query score for D • Evaluate Q over the top-k scoring databases
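The per-query part of this algorithm can be sketched as below. The control flow follows the slide; everything else is a placeholder. `toy_score` (a product of word probabilities) and `toy_is_uncertain` (a query word missing from the unshrunk summary) are hypothetical stand-ins for a real selection algorithm such as CORI and for the talk's score-uncertainty analysis, neither of which is specified in enough detail here to reproduce.

```python
from math import prod

def select_databases(query, summaries, score, is_uncertain, k=3):
    """Rank databases for a query, applying shrinkage adaptively.

    summaries: dict name -> (unshrunk_summary, shrunk_summary), where each
    summary maps word -> probability. `score` and `is_uncertain` are
    caller-supplied callables.
    """
    scores = {}
    for name, (unshrunk, shrunk) in summaries.items():
        s = score(query, unshrunk)            # score with the old summary
        if is_uncertain(query, unshrunk, s):  # analyze score uncertainty
            s = score(query, shrunk)          # fall back to shrunk summary
        scores[name] = s
    # The query would then be evaluated over these top-k databases.
    return sorted(scores, key=scores.get, reverse=True)[:k]

def toy_score(query, summary):
    # Hypothetical scoring function: product of query-word probabilities.
    return prod(summary.get(w, 0.0) for w in query)

def toy_is_uncertain(query, summary, score_value):
    # Hypothetical uncertainty test: some query word was never sampled.
    return any(w not in summary for w in query)
```

With these toy functions, a database whose sample missed a query word is rescored with its shrinkage-based summary instead of being scored zero, while databases whose samples cover the query keep their unshrunk scores.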
Evaluation: Goals • Examine quality of shrinkage-based summaries • Examine effect of shrinkage on database selection
Experimental Setup • Three data sets: • Two standard testbeds from TREC (“Text Retrieval Conference”): • 200 databases • 100 queries with associated human-assigned document relevance judgments • 315 real Web databases • Two sets of experiments: • Content summary quality • Database selection accuracy
Results: Content Summary Quality • Recall: How many words in the database are also in the summary? Shrinkage-based summaries include 10%-90% more words than unshrunk summaries • Precision: How many words in the summary are also in the database? Shrinkage-based summaries include 5%-15% words not in the actual database
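The two questions on this slide correspond to simple set ratios over vocabularies. The sketch below computes the unweighted versions (each word counts equally); whether the reported numbers are weighted by word frequency is not stated on the slide, so treat this as an assumption. The function name is illustrative.

```python
def vocabulary_recall_precision(summary_words, database_words):
    """Unweighted recall and precision of a summary's vocabulary.

    recall:    fraction of database words that also appear in the summary.
    precision: fraction of summary words that also appear in the database.
    """
    summary_words = set(summary_words)
    database_words = set(database_words)
    shared = summary_words & database_words
    recall = len(shared) / len(database_words) if database_words else 0.0
    precision = len(shared) / len(summary_words) if summary_words else 0.0
    return recall, precision
```

A spurious word such as "football" in the summary of a medical database lowers precision without affecting recall, which is the trade-off shrinkage introduces.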
Results: Content Summary Quality • Rank correlation: Is the word ranking in the summary similar to the ranking in the database? Shrinkage-based summaries demonstrate better word rankings than unshrunk summaries • Kullback-Leibler divergence: Is the probability distribution in the summary similar to the distribution in the database? Shrinkage improves the bad cases but makes the very good ones worse • This motivates the adaptive application of shrinkage!
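A sketch of the Kullback-Leibler comparison mentioned on this slide follows. The slide does not say how words missing from the summary are handled; restricting the computation to words with positive probability in both distributions and renormalizing is an assumption made here so that the divergence stays finite.

```python
from math import log

def kl_divergence(database_probs, summary_probs):
    """KL divergence of the summary distribution from the database
    distribution, over the words both assign positive probability."""
    shared = [w for w, p in database_probs.items()
              if p > 0.0 and summary_probs.get(w, 0.0) > 0.0]
    if not shared:
        raise ValueError("the two distributions share no words")
    p_total = sum(database_probs[w] for w in shared)
    q_total = sum(summary_probs[w] for w in shared)
    divergence = 0.0
    for w in shared:
        p = database_probs[w] / p_total  # renormalized database probability
        q = summary_probs[w] / q_total   # renormalized summary probability
        divergence += p * log(p / q)
    return divergence
```

The result is 0 when the two distributions agree on the shared words and grows as the summary's probabilities drift away from the database's.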
Results: Database Selection • Metric: R(K) = X / Y • X = # of relevant documents in the selected K databases • Y = # of relevant documents in the best K databases • Results shown for CORI (a state-of-the-art database selection algorithm) with stemming over one TREC testbed
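The R(K) metric defined on this slide can be computed as in the sketch below. The function and argument names are illustrative, and returning 0.0 when no database holds a relevant document is a convention chosen here, not something the slide specifies.

```python
def r_at_k(selected, relevant_counts, k):
    """R(K) = X / Y for one query.

    selected:        database names ranked by the selection algorithm.
    relevant_counts: dict name -> number of relevant documents it holds.
    """
    # X: relevant documents in the K databases the algorithm selected.
    x = sum(relevant_counts.get(name, 0) for name in selected[:k])
    # Y: relevant documents in the best possible K databases.
    y = sum(sorted(relevant_counts.values(), reverse=True)[:k])
    return x / y if y else 0.0
```

For example, with relevant counts A=10, B=5, C=0, D=20 and a selection order of A, C, D, the top-2 choice captures 10 of the best possible 30 relevant documents.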
Other Experiments • Choice of database selection algorithm (CORI, bGlOSS, Language Modeling) • Comparison with VLDB’02 hierarchical database selection algorithm • Universal vs. adaptive application of shrinkage • Effect of stemming • Effect of stop-word elimination
Conclusions • Developed a strategy to automatically summarize the contents of hidden-web text databases • Content summaries are critical for efficient metasearching • Strategy assumes no cooperation from databases • Shrinkage improves content summary quality by exploiting topical similarity • Shrinkage is efficient: no increase in document sample size required • Developed an adaptive database selection strategy that decides whether to apply shrinkage in a database- and query-specific way
Thank you! Shrinkage-based content summary generation implemented and available for download at: http://sdarts.cs.columbia.edu Questions?