Classifying and Searching "Hidden-Web" Text Databases. Panos Ipeirotis, Department of Information Systems, New York University
Motivation: “Surface” Web vs. “Hidden” Web • “Surface” Web • Link structure • Crawlable • Documents indexed by search engines • “Hidden” Web • No link structure • Documents “hidden” in databases • Documents not indexed by search engines • Need to query each collection individually
Hidden-Web Databases: Examples • Search on U.S. Patent and Trademark Office (USPTO) database: • [wireless network] 39,270 matches • (USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html) • Search on Google restricted to USPTO database site: • [wireless network site:patft.uspto.gov] 1 match as of Oct 3rd, 2005
Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories, populated manually • InvisibleWeb.com • SearchEngineGuide.com • Searching: Metasearchers
Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • Managing Changes in Hidden-Web Databases
Hierarchically Classifying the ACM Digital Library [Figure: a topic hierarchy with question marks indicating the categories to which the ACM DL might be assigned]
Text Database Classification: Definition • For a text database D and a category C: • Coverage(D,C) = number of docs in D about C • Specificity(D,C) = fraction of docs in D about C • Assign a text database to a category C if: • Database coverage for C at least Tc (coverage threshold, e.g., > 100 docs in C) • Database specificity for C at least Ts (specificity threshold, e.g., > 40% of docs in C)
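To make the decision rule above concrete, here is a minimal Python sketch (not from the talk; the category names, counts, and thresholds are made-up examples):

```python
# Hedged illustration of the Coverage/Specificity decision rule; numbers are invented.
def classify_database(docs_per_category, total_docs, t_c=100, t_s=0.4):
    """Return the categories C with Coverage(D,C) >= t_c and Specificity(D,C) >= t_s."""
    assigned = []
    for category, coverage in docs_per_category.items():
        specificity = coverage / total_docs
        if coverage >= t_c and specificity >= t_s:
            assigned.append(category)
    return assigned

# Example: a 1,000-document database.
print(classify_database({"Health": 650, "Sports": 90, "Computers": 260}, total_docs=1000))
# -> ['Health']  (Sports fails the coverage test, Computers fails the specificity test)
```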
Brute-Force Classification “Strategy” • Extract all documents from database • Classify each document by topic (using state-of-the-art classifiers: SVMs, C4.5, RIPPER, …) • Classify database according to topic distribution • Problem: No direct access to full contents of Hidden-Web databases
Classification: Goal & Challenges • Goal: Discover database topic distribution • Challenges: • No direct access to full contents of Hidden-Web databases • Only limited search interfaces available • Should not overload databases • Key observation: Only queries “about” the database topic(s) generate a large number of matches
Query-based Database Classification: Overview • Train document classifier • Extract queries from classifier • Adaptively issue queries to database • Identify topic distribution based on adjusted number of query matches • Classify database [Pipeline figure: TRAIN CLASSIFIER → EXTRACT QUERIES (e.g., Sports: +nba +knicks; Health: +sars) → QUERY DATABASE (e.g., +sars returns 1,254 matches) → IDENTIFY TOPIC DISTRIBUTION → CLASSIFY DATABASE]
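A minimal sketch of this probe-and-count loop, assuming a hypothetical count_matches(db, query) helper that returns the number of matches reported by the database's search interface (the adaptive probing and the confusion-matrix adjustment of the next slides are omitted here):

```python
from collections import defaultdict

def probe_database(db, queries_per_category, count_matches):
    """queries_per_category: e.g. {"Sports": ["+nba +knicks"], "Health": ["+sars"]}.
    count_matches(db, query) is assumed to return the reported #matches (no documents retrieved)."""
    estimated_coverage = defaultdict(int)
    for category, queries in queries_per_category.items():
        for query in queries:
            estimated_coverage[category] += count_matches(db, query)
    total = sum(estimated_coverage.values()) or 1
    estimated_specificity = {c: n / total for c, n in estimated_coverage.items()}
    return dict(estimated_coverage), estimated_specificity
```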
Training a Document Classifier • Get training set (set of pre-classified documents) • Select best features to characterize documents (Zipf’s law + information-theoretic feature selection) [Koller and Sahami 1996] • Train classifier (SVM, C4.5, RIPPER, …) • Output: A “black-box” model for classifying documents
Extracting Query Probes [ACM TOIS 2003] Transform classifier model into queries: • Trivial for “rule-based” classifiers (RIPPER) • Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules) • Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve Bayes, …) • Example query for Sports: +nba +knicks
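As a rough sketch of the idea behind rule extraction from linear classifiers, one can turn each category's highest-weighted terms into a conjunctive query probe; the actual method in the ACM TOIS 2003 paper is more elaborate, so treat the following only as an illustration:

```python
import numpy as np

def probes_from_linear_classifier(weights, vocabulary, categories, terms_per_probe=2):
    """weights: (num_categories x vocab_size) matrix of a linear classifier.
    Returns one conjunctive query per category built from its top-weighted terms."""
    probes = {}
    for i, category in enumerate(categories):
        top = np.argsort(weights[i])[::-1][:terms_per_probe]   # highest-weight terms
        probes[category] = " ".join("+" + vocabulary[j] for j in top)
    return probes

# Toy example: two categories over a four-word vocabulary.
vocab = ["nba", "knicks", "sars", "flu"]
W = np.array([[2.1, 1.7, -0.3, 0.0],     # Sports
              [-0.5, 0.0, 2.4, 1.1]])    # Health
print(probes_from_linear_classifier(W, vocab, ["Sports", "Health"]))
# -> {'Sports': '+nba +knicks', 'Health': '+sars +flu'}
```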
Querying Database with Extracted Queries [SIGMOD 2001, ACM TOIS 2003] • Issue each query to the database to obtain the number of matches without retrieving any documents • Increase coverage of the rule’s category accordingly (e.g., #Sports = #Sports + 706)
Identifying Topic Distribution from Query Results • Document classifiers not perfect: • Rules for one category match documents from other categories • Querying not perfect: • Queries for same category might overlap • Queries do not match all documents in a category • Query-based estimates of topic distribution are therefore not perfect • Solution: Learn to adjust the results of query probes
Confusion Matrix Adjustment of Query Probe Results [Figure: a confusion matrix M maps the correct (but unknown) topic distribution to the incorrect topic distribution derived from query probing; e.g., 10% of “sports” documents match queries for “computers”] This “multiplication” can be inverted to get a better estimate of the real topic distribution from the probe results
Confusion Matrix Adjustment of Query Probe Results • Coverage(D) ≈ M⁻¹ · ECoverage(D): adjusted estimate of the topic distribution, computed from the probing results • M usually diagonally dominant for “reasonable” document classifiers, hence invertible • Compensates for errors in query-based estimates of topic distribution
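A small numpy illustration of the adjustment Coverage(D) ≈ M⁻¹ · ECoverage(D); the 3×3 matrix and probe counts below are invented for the example (only the 10% sports-to-computers entry echoes the previous slide):

```python
import numpy as np

# Hypothetical categories, in this order: Computers, Sports, Health.
# M[i, j] = fraction of documents from category j that match the query probes of category i,
# so ECoverage(D) = M . Coverage(D); e.g., 10% of "sports" documents match "computers" probes.
M = np.array([[0.80, 0.10, 0.02],
              [0.05, 0.85, 0.03],
              [0.02, 0.05, 0.90]])

ecoverage = np.array([250.0, 1300.0, 450.0])   # counts returned by the query probes (invented)
coverage = np.linalg.solve(M, ecoverage)       # equivalent to M^-1 . ECoverage, but more stable
print(np.round(coverage))                      # adjusted estimate of the topic distribution
```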
Classification Algorithm (Again) • One-time process: train document classifier, extract queries from classifier • For every database: adaptively issue queries to the database, identify topic distribution based on adjusted number of query matches, classify database
Experimental Setup • 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles: • Newsgroups assigned by hand to hierarchy nodes (e.g., comp.hardware, rec.music.classical, rec.photo.*) • RIPPER trained with 54,000 articles (1,000 articles per leaf), 27,000 articles to construct confusion matrix • 500 “Controlled” databases built using 419,000 newsgroup articles (to run detailed experiments) • 130 real Web databases picked from InvisibleWeb (first 5 under each topic)
Experimental Results: Controlled Databases • Accuracy (using F-measure): • Above 80% for most <Tc, Ts> threshold combinations tried • Degrades gracefully with hierarchy depth • Confusion-matrix adjustment helps • Efficiency: Relatively small number of queries (<500) needed for most threshold <Tc, Ts> combinations tried
Experimental Results: Web Databases • Accuracy (using F-measure): • ~70% for best <Tc, Ts> combination • Learned thresholds that reproduce human classification • Tested threshold choice using 3-fold cross validation • Efficiency: • 120 queries per database on average needed for choice of thresholds, no documents retrieved • Only small part of hierarchy “explored” • Queries are short: 1.5 words on average; 4 words maximum (easily handled by most Web databases)
Hidden-Web Database Classification: Summary • Handles autonomous Hidden-Web databases accurately and efficiently: • ~70% F-measure • Only 120 queries issued on average, with no documents retrieved • Handles a large family of document classifiers (and can hence exploit future advances in machine learning)
Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • Managing Changes in Hidden-Web Databases
Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories • Searching: Metasearchers • In both cases, the content is not accessible through Google [Figure: a metasearcher forwards a query to Hidden-Web databases such as the NYTimes Archives, PubMed, USPTO, and the Library of Congress]
Metasearchers Provide Access to Distributed Databases • Database selection relies on simple content summaries: vocabulary, word frequencies [Figure: for the query [thrombopenia], the metasearcher consults per-database content summaries; e.g., the PubMed summary (11,868,552 documents) lists aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 24,826; the databases (PubMed, NYTimes Archives, USPTO) report 24,826, 18, and 0 matches for thrombopenia] • Databases typically do not export such summaries!
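A toy sketch of content-summary-based database selection: each database is scored by the smoothed frequency of the query words in its summary, and the highest-scoring databases are queried. Real metasearchers use more sophisticated scoring (e.g., CORI, mentioned later); the summaries and document totals below are illustrative, with only the PubMed figures taken from the slide.

```python
def score_databases(query_words, summaries, num_docs):
    """summaries[db][word] = #documents in db containing word, per the content summary."""
    scores = {}
    for db, summary in summaries.items():
        score = 1.0
        for word in query_words:
            score *= (summary.get(word, 0) + 0.5) / (num_docs[db] + 1.0)  # smoothed p(word | db)
        scores[db] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

summaries = {"PubMed": {"thrombopenia": 24826, "cancer": 1562477},
             "NYTimes Archives": {"thrombopenia": 18},
             "USPTO": {}}
num_docs = {"PubMed": 11868552, "NYTimes Archives": 1000000, "USPTO": 3000000}
print(score_databases(["thrombopenia"], summaries, num_docs))  # PubMed ranks first
```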
Extracting a Representative Document Sample: Focused Sampling • Train a document classifier • Create queries from the classifier • Adaptively issue queries to databases • Retrieve top-k matching documents for each query • Save #matches for each one-word query • Identify topic distribution based on adjusted number of query matches • Categorize the database • Generate content summary from document sample • Focused sampling retrieves documents only from “topically dense” areas of the database
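A small sketch of how a content summary could be assembled from the sample: document frequencies come from the retrieved documents, while the one-word probe queries contribute the exact match counts the database reported; the data layout is an assumption, not the paper's exact procedure.

```python
from collections import Counter

def build_content_summary(sampled_docs, one_word_match_counts):
    """sampled_docs: list of token lists retrieved via the query probes.
    one_word_match_counts: word -> exact #matches reported by the database
    for the single-word probe queries."""
    document_frequency = Counter()
    for tokens in sampled_docs:
        document_frequency.update(set(tokens))        # count each word once per document
    summary = dict(document_frequency)
    summary.update(one_word_match_counts)             # exact counts override sample-based ones
    return summary
```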
Sampling and Incomplete Content Summaries Problem: Summaries from small samples are highly incomplete • Many words appear in “relatively few” documents (Zipf’s law) • Low-frequency words are often important • Small document samples (e.g., 300 documents) miss many low-frequency words [Figure: log-log plot of frequency vs. rank for the 10% most frequent words in the PubMed database; 95% of the words appear in less than 0.1% of the database, e.g., endocarditis appears in roughly 10,000 documents, only ~0.1% of the database]
Sample-based Content Summaries Main Idea: Database Classification Helps • Similar topics ↔ Similar content summaries • Extracted content summaries complement each other Challenge: Improve content summary quality without increasing sample size
Databases with Similar Topics • CANCERLIT contains “metastasis”, which was not found during sampling • CancerBACUP contains “metastasis” • Databases under the same category have similar vocabularies, and can complement each other
Content Summaries for Categories • Databases under the same category share a similar vocabulary • Higher-level category content summaries provide additional useful estimates • All estimates in the category path are potentially useful
Enhancing Summaries Using “Shrinkage” [SIGMOD 2004] • Estimates from database content summaries can be unreliable • Category content summaries are more reliable (based on larger samples) but less specific to the database • By combining estimates from category and database content summaries we get better estimates
Shrinkage-based Estimations • Adjust the estimate for metastasis in D: λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000 • Select the λi weights to maximize the probability that the summary of D is from a database under all its parent categories • Avoids the “sparse data” problem and decreases estimation risk
Computing Shrinkage-based Summaries [Figure: category path Root → Health → Cancer → D] • Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000 • Pr[treatment | D] = λ1 · 0.015 + λ2 · 0.12 + λ3 · 0.179 + λ4 · 0.184 • … • Automatic computation of the λi weights using an EM algorithm • Computation performed offline: no query overhead • Avoids the “sparse data” problem and decreases estimation risk
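A minimal sketch of the shrinkage combination and of an EM-style fit of the λ weights on D's document sample; the components, counts, and smoothing are invented, and the exact EM formulation of the SIGMOD 2004 paper may differ:

```python
def shrunk_prob(word, components, lambdas):
    """Pr[word | D] as a weighted mix of the summaries along the category path,
    e.g. components = [root, health, cancer, d_summary] (word -> probability dicts)."""
    return sum(lam * comp.get(word, 0.0) for lam, comp in zip(lambdas, components))

def fit_lambdas(sample_counts, components, iterations=50):
    """EM-style estimation of the mixture weights from the words in D's document sample."""
    k = len(components)
    lambdas = [1.0 / k] * k
    for _ in range(iterations):
        expected = [1e-9] * k
        for word, count in sample_counts.items():
            parts = [lam * comp.get(word, 1e-9) for lam, comp in zip(lambdas, components)]
            norm = sum(parts)
            for i in range(k):
                expected[i] += count * parts[i] / norm   # E-step: responsibilities
        total = sum(expected)
        lambdas = [e / total for e in expected]          # M-step: re-normalize weights
    return lambdas
```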
Shrinkage Weights and Summary [Table: old vs. new (shrinkage-adjusted) word-probability estimates] Shrinkage: • Increases estimates for underestimated words (e.g., metastasis) • Decreases word-probability estimates for overestimated words (e.g., aids) • …it also introduces (with small probabilities) spurious words (e.g., football)
Is Shrinkage Always Necessary? • Shrinkage is used to reduce uncertainty (variance) of estimations • Small samples of large databases → high variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 10,000,000 documents? • Small samples of small databases → small variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 200 documents? • Shrinkage less useful (or even harmful) when uncertainty is low
Adaptive Application of Shrinkage • Database selection algorithms assign scores to databases for each query • When word-frequency estimates are uncertain, the assigned score has high variance → shrinkage improves score estimates • When word-frequency estimates are reliable, the assigned score has small variance → shrinkage unnecessary [Figure: probability distributions of the database score for a query; a wide distribution (unreliable score estimate) calls for shrinkage, while a narrow one (reliable score estimate) means shrinkage might hurt] • Solution: Use shrinkage adaptively in a query- and database-specific manner
Searching Algorithm • One-time process: classify databases and extract document samples; adjust frequencies in samples • For every query, for each database D: • Assign score to database D (using extracted content summary) • Examine uncertainty of score • If uncertainty high, apply shrinkage and give new score; else keep existing score • Query only the top-K scoring databases
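A sketch of the per-query loop, assuming hypothetical helpers score(), score_variance(), and shrunk_score() (none of these names come from the talk), and a simple variance threshold as the uncertainty test:

```python
def select_databases(query, databases, k=3, variance_threshold=0.05):
    """Score each database from its extracted content summary, apply shrinkage only when
    the score estimate is uncertain, and return the top-k scoring databases."""
    scored = []
    for db in databases:
        s = score(query, db.summary)                         # score from sampled summary
        if score_variance(query, db) > variance_threshold:   # uncertain estimate?
            s = shrunk_score(query, db)                      # re-score using shrinkage
        scored.append((s, db))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [db for _, db in scored[:k]]
```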
Results: Database Selection • Metric: R(K) = X / Y • X = # of relevant documents in the selected K databases • Y = # of relevant documents in the best K databases • Results reported for CORI (a state-of-the-art database selection algorithm) with stemming over the TREC6 testbed
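For example (illustrative numbers, not from the talk): if the K = 3 selected databases contain 45 relevant documents in total while the best possible 3 databases contain 60, then R(3) = 45 / 60 = 0.75.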
Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • Managing Changes in Hidden-Web Databases
Never-update Policy • Naive practice: construct summary once, never update • Extracted (old) summary may: • Miss new words (from new documents) • Contain obsolete words (from deleted documents) • Provide inaccurate frequency estimates [Example: NY Times summary, Oct 29, 2004: tsunami (0 docs), recount 2,302, grokster 2; NY Times summary, Mar 29, 2005: tsunami 250, recount (0), grokster 78]
Updating Content Summaries: Questions [ICDE 2005] • Do content summaries change over time? • Which database properties affect the rate of change? • How to schedule updates with constrained resources?
Data for our Study: 152 Web Databases • Study period: Oct 2002 – Oct 2003 • 52 weekly snapshots for each database • 5 million pages in each snapshot (approx.) • 65 GB per snapshot (3.3 TB total) • Examined differences between snapshots: • 1 week: 5% new words, 5% old words disappeared • 20 weeks: 20% new words, 20% old words disappeared
Survival Analysis • Survival Analysis: A collection of statistical techniques for predicting “the time until an event occurs” • Initially used to measure length of survival of patients under different treatments (hence the name) • Used to measure the effect of different parameters (e.g., weight, race) on survival time • We want to predict “time until next update” and find the database properties that affect this time
Survival Analysis for Summary Updates • “Survival time of summary”: Time until the current database summary is “sufficiently different” from the old one (i.e., an update is required) • Old summary changes at time t if: KL divergence(current, old) > τ (τ: change-sensitivity threshold) • Survival analysis estimates the probability that a database summary changes within time t
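A small sketch of the change test: compute a smoothed KL divergence between the current and old word distributions and compare it against τ; the smoothing is an assumption, not the exact scheme of the ICDE 2005 paper.

```python
import math

def kl_divergence(current, old, epsilon=1e-9):
    """current, old: word -> probability (content summaries); smoothed KL(current || old)."""
    kl = 0.0
    for word in set(current) | set(old):
        p = current.get(word, 0.0) + epsilon
        q = old.get(word, 0.0) + epsilon
        kl += p * math.log(p / q)
    return kl

def summary_changed(current, old, tau=0.5):
    return kl_divergence(current, old) > tau   # tau: change-sensitivity threshold
```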
Survival Times and “Incomplete” Data [Figure: “survival times” for a database, measured in weeks; several cases are still unchanged at week 52, the end of the study, and are therefore “censored”] • Many observations are “incomplete” (aka “censored”) • Censored data give partial information (the database did not change during the observation period)
Using “Censored” Data [Figure: three fits of the survival function S(t): ignoring censored data, using censored data “as-is”, and properly using censored data] • By ignoring censored cases we get underestimates → we perform more update operations than needed • By using censored cases “as-is” we get (again) underestimates • Survival analysis “extends” the lifetime of “censored” cases
Database Properties and Survival Times For our analysis, we use Cox Proportional Hazards Regression • Effectively uses “censored” data (i.e., the database did not change within time T) • Derives the effect of database properties on the rate of change • E.g., “if you double the size of a database, it changes twice as fast” • No assumptions about the form of the survival function
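One way to run such a regression in Python is the lifelines package; this is my choice for illustration, not necessarily the tooling used in the study, and the covariates and tiny synthetic dataset below are made up.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Tiny made-up dataset: weeks until the summary changed ("duration"),
# whether a change was observed or the case was censored at week 52 ("changed"),
# and two hypothetical database properties used as covariates.
data = pd.DataFrame({
    "duration": [4, 12, 52, 30, 26, 8, 20, 45, 52, 15],
    "changed":  [1, 1, 0, 1, 1, 1, 1, 1, 0, 1],   # 0 = censored (no change by end of study)
    "log_size": [6.2, 4.1, 3.0, 5.5, 2.8, 6.9, 4.8, 5.1, 3.4, 5.9],
    "is_gov":   [0, 1, 1, 0, 0, 0, 1, 1, 1, 0],
})

cph = CoxPHFitter()
cph.fit(data, duration_col="duration", event_col="changed")
cph.print_summary()   # hazard ratios indicate how each property affects the rate of change
```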
Baseline Survival Functions by Domain Effect of domain: • GOV changes slower than any other domain • EDU changes fast in the short term, but slower in the long term • COM and other commercial sites change faster than the rest