This presentation discusses the challenges of classifying and searching hidden-web text databases, along with strategies for interaction and classification. Examples and methods are provided.
Classifying and Searching "Hidden-Web" Text Databases. Panos Ipeirotis, Department of Information Systems, New York University.
Motivation: "Surface" Web vs. "Hidden" Web • "Surface" Web • Link structure • Crawlable • Documents indexed by search engines • "Hidden" Web • No link structure • Documents "hidden" in databases • Documents not indexed by search engines • Need to query each collection individually
Hidden-Web Databases: Examples • Search on U.S. Patent and Trademark Office (USPTO) database: • [wireless network] 29,051 matches • (USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html) • Search on Google restricted to USPTO database site: • [wireless network site:patft.uspto.gov] 0 matches as of Oct 28th, 2004
Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories (populated manually) • InvisibleWeb.com • SearchEngineGuide.com • Searching: Metasearchers
Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases
Hierarchically Classifying the ACM Digital Library [Figure: classifying the ACM DL into a topic hierarchy]
Text Database Classification: Definition • For a text database D and a category C: • Coverage(D,C) = number of docs in D about C • Specificity(D,C) = fraction of docs in D about C • Assign a text database to a category C if: • Database coverage for C at least Tc (Tc: coverage threshold, e.g., > 100 docs in C) • Database specificity for C at least Ts (Ts: specificity threshold, e.g., > 40% of docs in C)
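To make the decision rule concrete, here is a minimal sketch (illustrative Python, not from the talk); the per-category document counts are assumed to come from the query-probing step described in the following slides:

```python
# Minimal sketch of the coverage/specificity decision rule.
# doc_counts maps each category to the (estimated) number of documents
# in database D that are about that category; names are illustrative.
def classify_database(doc_counts, t_coverage=100, t_specificity=0.4):
    total = sum(doc_counts.values())
    assigned = []
    for category, count in doc_counts.items():
        coverage = count                                 # Coverage(D, C)
        specificity = count / total if total else 0.0    # Specificity(D, C)
        if coverage >= t_coverage and specificity >= t_specificity:
            assigned.append(category)
    return assigned

# Example: a database dominated by health-related documents.
print(classify_database({"Health": 4200, "Sports": 150, "Computers": 30}))
# -> ['Health']
```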
Brute-Force Classification "Strategy" • Extract all documents from database • Classify each document by topic (using state-of-the-art classifiers: SVMs, C4.5, RIPPER, …) • Classify database according to topic distribution • Problem: No direct access to full contents of Hidden-Web databases
Classification: Goal & Challenges • Goal: Discover database topic distribution • Challenges: • No direct access to full contents of Hidden-Web databases • Only limited search interfaces available • Should not overload databases • Key observation: Only queries "about" database topic(s) generate large number of matches
Query-based Database Classification: Overview • Train document classifier • Extract queries from classifier • Adaptively issue queries to database • Identify topic distribution based on adjusted number of query matches • Classify database [Pipeline: TRAIN CLASSIFIER → EXTRACT QUERIES (e.g., Sports: +nba +knicks; Health: +sars) → QUERY DATABASE (+sars → 1254 matches) → IDENTIFY TOPIC DISTRIBUTION → CLASSIFY DATABASE]
Training a Document Classifier • Get training set (set of pre-classified documents) • Select best features to characterize documents (Zipf's law + information-theoretic feature selection) [Koller and Sahami 1996] • Train classifier (SVM, C4.5, RIPPER, …) • Output: A "black-box" model for classifying documents
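A minimal sketch of this training step, using scikit-learn as a stand-in for the feature-selection and classifier choices listed above; the tiny document set and labels are placeholders:

```python
# Minimal sketch of document-classifier training (illustrative stand-in).
# The talk used Zipf's-law pruning plus information-theoretic feature
# selection (Koller & Sahami 1996) and classifiers such as RIPPER, C4.5, SVMs.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = [
    "the knicks won the nba game",
    "nba draft picks announced today",
    "sars outbreak hits local hospitals",
    "new sars treatment enters trial",
]
train_labels = ["Sports", "Sports", "Health", "Health"]

classifier = make_pipeline(
    CountVectorizer(binary=True),   # bag-of-words features
    SelectKBest(chi2, k=4),         # keep the most discriminative terms
    LinearSVC(),                    # the resulting "black-box" model
)
classifier.fit(train_docs, train_labels)
print(classifier.predict(["nba playoff schedule"]))  # likely ['Sports']
```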
Extracting Query Probes [ACM TOIS 2003] • Transform classifier model into queries • Trivial for "rule-based" classifiers (RIPPER) • Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules) • Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve-Bayes, …) • Example query for Sports: +nba +knicks
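As an illustration of the idea, the sketch below derives simple conjunctive probes from a linear classifier by taking its highest-weight terms per category; this is a simplification, not the talk's exact rule-extraction method:

```python
# Illustrative sketch: turn a linear text classifier into per-category query
# probes by taking its highest-weight terms (simplified rule extraction).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["knicks win nba title", "nba knicks trade rumors",
        "sars vaccine research", "hospital reports new sars cases"]
labels = ["Sports", "Sports", "Health", "Health"]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
model = LinearSVC().fit(X, labels)

vocab = np.array(vectorizer.get_feature_names_out())
probes = {}
for i, category in enumerate(model.classes_):
    # With two classes LinearSVC keeps one weight vector: positive weights
    # favor classes_[1], negative weights favor classes_[0].
    weights = model.coef_[0] if i == 1 else -model.coef_[0]
    top_terms = vocab[np.argsort(weights)[::-1][:2]]
    probes[category] = " ".join("+" + term for term in top_terms)

print(probes)   # e.g. {'Health': '+sars +vaccine', 'Sports': '+nba +knicks'}
```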
Querying Database with Extracted Queries [SIGMOD 2001, ACM TOIS 2003] • Issue each query to the database to obtain the number of matches without retrieving any documents • Increase the coverage of the rule's category accordingly (e.g., #Sports = #Sports + 706)
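A small sketch of this accumulation step; get_match_count is a hypothetical wrapper around a database's search interface that returns only the number of matches for a query:

```python
# Sketch: accumulate raw coverage estimates from query-probe match counts.
def estimate_coverage(probes, get_match_count):
    coverage = {}
    for category, queries in probes.items():
        coverage[category] = sum(get_match_count(q) for q in queries)
    return coverage

# Example with canned counts (e.g., "+nba +knicks" -> 706 matches).
canned = {"+nba +knicks": 706, "+sars": 1254}
print(estimate_coverage({"Sports": ["+nba +knicks"], "Health": ["+sars"]},
                        lambda q: canned.get(q, 0)))
# -> {'Sports': 706, 'Health': 1254}
```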
Identifying Topic Distribution from Query Results • Document classifiers not perfect: rules for one category match documents from other categories • Querying not perfect: queries for the same category might overlap; queries do not match all documents in a category • Hence, query-based estimates of the topic distribution are not perfect • Solution: Learn to adjust the results of query probes
Confusion Matrix Adjustment of Query Probe Results • The (incorrect) topic distribution derived from query probing equals the confusion matrix M multiplied by the correct (but unknown) topic distribution; each row of the product expands to sums such as 800+500+0, 80+4250+2, and 20+750+48 • An entry of M captures cross-category matches, e.g., 10% of "sports" documents match queries for "computers" • This "multiplication" can be inverted to get a better estimate of the real topic distribution from the probe results
Confusion Matrix Adjustment of Query Probe Results • Coverage(D) ≈ M⁻¹ · ECoverage(D), where ECoverage(D) is the probing-based estimate and Coverage(D) is the adjusted estimate of the topic distribution • M is usually diagonally dominant for "reasonable" document classifiers, hence invertible • The adjustment compensates for errors in query-based estimates of the topic distribution
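A minimal numeric sketch of this adjustment (the matrix entries and counts below are illustrative, not from the talk):

```python
# Sketch of the confusion-matrix adjustment Coverage(D) ≈ M^-1 · ECoverage(D).
# M[i][j] is assumed to be the fraction of category-j documents that match
# the probes for category i.
import numpy as np

categories = ["Computers", "Sports", "Health"]
M = np.array([
    [0.80, 0.10, 0.02],   # probes for Computers also match 10% of Sports docs
    [0.05, 0.85, 0.03],
    [0.02, 0.01, 0.90],
])
ecoverage = np.array([1316.0, 4324.0, 790.0])   # raw counts from query probing

coverage = np.linalg.solve(M, ecoverage)        # equivalent to M^-1 · ECoverage(D)
for category, estimate in zip(categories, coverage):
    print(f"{category}: ~{estimate:.0f} documents")
# -> roughly 1000, 5000, 800 documents
```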
Classification Algorithm (Again) • One-time process: 1. Train document classifier 2. Extract queries from classifier • For every database: 3. Adaptively issue queries to the database 4. Identify topic distribution based on adjusted number of query matches 5. Classify the database
Experimental Setup • 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles (April-May 2000): • Newsgroups assigned by hand to hierarchy nodes (example newsgroups: comp.hardware, rec.music.classical, rec.photo.*) • RIPPER trained with 54,000 articles (1,000 articles per leaf), 27,000 articles used to construct the confusion matrix • 500 "Controlled" databases built using 419,000 newsgroup articles (to run detailed experiments) • 130 real Web databases picked from InvisibleWeb (first 5 under each topic)
Experimental Results: Controlled Databases • Accuracy (using F-measure): • Above 80% for most <Tc, Ts> threshold combinations tried • Degrades gracefully with hierarchy depth • Confusion-matrix adjustment helps • Efficiency: Relatively small number of queries (<500) needed for most <Tc, Ts> threshold combinations tried
Experimental Results: Web Databases • Accuracy (using F-measure): • ~70% for best <Tc, Ts> combination • Learned thresholds that reproduce human classification • Tested threshold choice using 3-fold cross validation • Efficiency: • 120 queries per database on average needed for choice of thresholds, no documents retrieved • Only small part of hierarchy "explored" • Queries are short: 1.5 words on average; 4 words maximum (easily handled by most Web databases)
Other Experiments [ACM TOIS 2003; IEEE Data Engineering Bulletin 2002] • Effect of choice of document classifiers: • RIPPER • C4.5 • Naïve Bayes • SVM • Benefits of feature selection • Effect of search-interface heterogeneity: Boolean vs. vector-space retrieval models • Effect of query-overlap elimination step • Over crawlable databases: query-based classification orders of magnitude faster than "brute-force" crawling-based classification
Hidden-Web Database Classification: Summary • Handles autonomous Hidden-Web databases accurately and efficiently: • ~70% F-measure • Only 120 queries issued on average, with no documents retrieved • Handles a large family of document classifiers (and can hence exploit future advances in machine learning)
Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases
Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories • Searching: Metasearchers • In both cases, the content is not accessible through Google [Figure: a metasearcher routes a user query to databases such as the NYTimes Archives, PubMed, USPTO, and the Library of Congress]
Metasearchers Provide Access to Distributed Databases • Database selection relies on simple content summaries: vocabulary and word frequencies • Example: for the query [thrombopenia], the metasearcher consults each database's content summary (e.g., PubMed, 11,868,552 documents: aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 24,826) and routes the query to the databases with the most matching documents (24,826 vs. 18 vs. 0 matches in the example) • Problem: databases typically do not export such summaries!
Extracting Content Summaries from Autonomous Hidden-Web Databases [Callan & Connell 2001] • 1. Send random queries to databases • 2. Retrieve top matching documents • 3. If 300 documents retrieved, stop; else go to Step 1 • The content summary contains the words in the sample and the document frequency of each word • Problems: • Random sampling retrieves non-representative documents • Frequencies in summary "compressed" to sample size range • Summaries from small samples are highly incomplete
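A minimal sketch of query-based sampling in the style of Callan & Connell (2001); search() is a hypothetical wrapper around the database's query interface:

```python
# Sketch of query-based sampling and summary construction (illustrative).
import random

def sample_database(search, seed_words, target_docs=300, top_k=4, max_queries=1000):
    """search(query, k) is assumed to return [(doc_id, text), ...] for the top k matches."""
    sample, seen_ids = [], set()
    for _ in range(max_queries):                    # guard against unproductive queries
        if len(sample) >= target_docs:
            break
        query = random.choice(seed_words)           # send a random one-word query
        for doc_id, text in search(query, top_k):   # retrieve top matching documents
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                sample.append(text)
    return sample

def content_summary(sample):
    """Words in the sample together with their document frequencies."""
    df = {}
    for text in sample:
        for word in set(text.lower().split()):
            df[word] = df.get(word, 0) + 1
    return df
```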
Extracting Representative Document Sample Problem 1: Random sampling retrieves non-representative documents • Train a document classifier • Create queries from the classifier • Adaptively issue queries to databases • Retrieve top-k matching documents for each query • Save #matches for each one-word query • Identify topic distribution based on adjusted number of query matches • Categorize the database • Generate content summary from document sample • Sampling retrieves documents only from "topically dense" areas of the database
Sample Frequencies vs. Actual Frequencies Problem 2: Frequencies in summary "compressed" to sample size range • PubMed (11,868,552 docs): cancer 1,562,477; heart 691,360 • PubMed sample (300 documents): cancer 45; heart 16 • Key Observation: Query matches reveal frequency information
Adjusting Document Frequencies [VLDB 2002] • Zipf's law empirically connects word frequency f and rank r: f = A · (r + B)^c • We know the document frequency and rank r of the words in the sample • We know the real document frequency f of some words from one-word queries • We use curve-fitting to estimate the absolute frequency of all words in the sample
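A minimal sketch of this curve-fitting step; the pairing of the slide's example ranks (1, 12, 78) with the PubMed counts is illustrative:

```python
# Fit f = A * (r + B)^c to (rank, actual frequency) pairs known from
# one-word query probes, then extrapolate frequencies for sampled words.
import numpy as np
from scipy.optimize import curve_fit

def zipf(r, A, B, c):
    return A * (r + B) ** c

# Words whose true document frequency we learned from one-word query matches,
# together with their rank in the sample (rank 1 = most frequent in sample).
known_ranks = np.array([1.0, 12.0, 78.0])
known_freqs = np.array([1_562_477.0, 121_491.0, 24_826.0])  # e.g. cancer, aids, thrombopenia

params, _ = curve_fit(zipf, known_ranks, known_freqs,
                      p0=(1e6, 1.0, -0.8),
                      bounds=([0, 0, -5], [np.inf, np.inf, 0]))  # keep r + B > 0
A, B, c = params

# Estimate the database-wide frequency of any sampled word from its sample rank.
for word, rank in [("heart", 5), ("hepatitis", 30), ("basketball", 900)]:
    print(f"{word}: estimated document frequency ≈ {zipf(rank, A, B, c):,.0f}")
```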
Actual PubMed Content Summary • Extracted automatically • ~27,500 words in extracted content summary • Fewer than 200 queries sent • At most 4 documents retrieved per query • PubMed content summary: Number of Documents: 8,691,360 (Actual: 11,868,552); Category: Health, Diseases; … cancer 1,562,477; heart 581,506 (Actual: 691,360); aids 121,491; hepatitis 73,481 (Actual: 121,129); … basketball 907 (Actual: 1,063); cpu 598 • (heart, hepatitis, basketball were not in 1-word probes)
Sampling and Incomplete Content Summaries Problem 3: Summaries from small samples are highly incomplete • Many words appear in "relatively few" documents (Zipf's law): 95% of the words in PubMed appear in less than 0.1% of its documents • Low-frequency words are often important (e.g., endocarditis: ~10,000 docs, ~0.1% of PubMed) • Small document samples (e.g., 300 documents) miss many of these low-frequency words [Figure: log-log plot of frequency vs. rank for the 10% most frequent words in the PubMed database]
Sample-based Content Summaries Main Idea: Database Classification Helps • Similar topics ↔ Similar content summaries • Extracted content summaries complement each other Challenge: Improve content summary quality without increasing sample size
Databases with Similar Topics • CANCERLIT contains "metastasis", not found during sampling • CancerBACUP contains "metastasis" • Databases under same category have similar vocabularies, and can complement each other
Content Summaries for Categories • Databases under same category share similar vocabulary • Higher-level category content summaries provide additional useful estimates • All estimates in category path are potentially useful
Enhancing Summaries Using "Shrinkage" [SIGMOD 2004] • Estimates from database content summaries can be unreliable • Category content summaries are more reliable (based on larger samples) but less specific to the database • By combining estimates from category and database content summaries we get better estimates
Shrinkage-based Estimations • Adjusted estimate for metastasis in D: λ1 * 0.002 + λ2 * 0.05 + λ3 * 0.092 + λ4 * 0.000 • Select λi weights to maximize the probability that the summary of D is from a database under all its parent categories • Avoids the "sparse data" problem and decreases estimation risk
Computing Shrinkage-based Summaries • For a database D classified under Root > Health > Cancer: Pr[metastasis | D] = λ1 * 0.002 + λ2 * 0.05 + λ3 * 0.092 + λ4 * 0.000; Pr[treatment | D] = λ1 * 0.015 + λ2 * 0.12 + λ3 * 0.179 + λ4 * 0.184; … (one λ per node on the path from the root to D) • Automatic computation of λi weights using an EM algorithm • Computation performed offline: no query overhead • Avoids the "sparse data" problem and decreases estimation risk
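A minimal sketch of the shrinkage combination itself; the λ values and category probabilities below are illustrative, and the EM computation of λ is not shown:

```python
# Mix the database's own word-probability estimates with those of its
# ancestor categories using shrinkage weights λ (illustrative values).
def shrunk_probability(word, summaries, lambdas):
    """summaries: list of {word: probability} dicts ordered Root, ..., Category, D."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * summary.get(word, 0.0)
               for lam, summary in zip(lambdas, summaries))

root     = {"metastasis": 0.002, "treatment": 0.015}
health   = {"metastasis": 0.05,  "treatment": 0.12}
cancer   = {"metastasis": 0.092, "treatment": 0.179}
database = {"metastasis": 0.000, "treatment": 0.184}   # metastasis missed in the sample

lambdas = [0.1, 0.2, 0.4, 0.3]   # illustrative weights, one per node on the path
print(shrunk_probability("metastasis", [root, health, cancer, database], lambdas))
# -> a nonzero estimate, even though the sample never contained "metastasis"
```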
Shrinkage Weights and Summary [Table: word-probability estimates before (old) and after (new) shrinkage] Shrinkage: • Increases estimates for underestimated words (e.g., metastasis) • Decreases word-probability estimates for overestimated words (e.g., aids) • …it also introduces (with small probabilities) spurious words (e.g., football)
Is Shrinkage Always Necessary? • Shrinkage is used to reduce uncertainty (variance) of estimations • Small samples of large databases → high variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 10,000,000 documents? • Small samples of small databases → small variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 200 documents? • Shrinkage less useful (or even harmful) when uncertainty is low
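A back-of-the-envelope illustration of this variance argument (not from the talk), using a normal approximation with a finite-population correction:

```python
# Same sample (10 of 100 documents contain "metastasis"), very different
# uncertainty in the extrapolated count depending on database size.
import math

def count_estimate(successes, sample_size, database_size, z=1.96):
    p = successes / sample_size
    fpc = (database_size - sample_size) / (database_size - 1)  # finite-population correction
    se = math.sqrt(p * (1 - p) / sample_size * fpc)
    return p * database_size, z * se * database_size

for db_size in (10_000_000, 200):
    estimate, margin = count_estimate(10, 100, db_size)
    print(f"{db_size:>10,} docs: ~{estimate:,.0f} ± {margin:,.0f} documents with 'metastasis'")
# -> ~1,000,000 ± ~588,000 for the large database, ~20 ± ~8 for the small one
```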
Adaptive Application of Shrinkage • Database selection algorithms assign scores to databases for each query • When word-frequency estimates are uncertain, the assigned score has high variance → shrinkage improves score estimates • When word-frequency estimates are reliable, the assigned score has small variance → shrinkage is unnecessary (and might hurt) • Solution: Use shrinkage adaptively, in a query- and database-specific manner [Figure: score distributions for an unreliable score estimate (use shrinkage) vs. a reliable score estimate (shrinkage might hurt)]
Searching Algorithm • One-time process: classify databases and extract document samples; adjust frequencies in samples • For every query, for each database D: • Assign a score to D (using its extracted content summary) • Examine the uncertainty of the score • If uncertainty is high, apply shrinkage and re-score; else keep the existing score • Query only the top-K scoring databases
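A minimal sketch of this per-query loop; score_database, score_uncertainty, and shrunk_score are hypothetical helpers standing in for a metasearcher's scoring function (e.g., CORI-style) without and with shrinkage:

```python
# Sketch of adaptive, query- and database-specific application of shrinkage
# during database selection (helper functions are assumed, not defined here).
def select_databases(query, databases, score_database, score_uncertainty,
                     shrunk_score, k=3, uncertainty_threshold=0.5):
    scored = []
    for db in databases:
        score = score_database(query, db)            # score from extracted summary
        if score_uncertainty(query, db) > uncertainty_threshold:
            score = shrunk_score(query, db)          # apply shrinkage and re-score
        scored.append((score, db))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [db for _, db in scored[:k]]              # query only the top-K databases
```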
Extracting Content Summaries: Problems Solved • Problem 1: Random sampling may retrieve non-representative documents. Solution: Focus querying on "topically dense" areas of the database • Problem 2: Frequencies are "compressed" to the sample size range. Solution: Exploit the number of matches per query and adjust estimates using curve fitting • Problem 3: Summaries based on small samples are highly incomplete. Solution: Exploit database classification and augment summaries using samples from topically similar databases