Classifying and Searching "Hidden-Web" Text Databases

Learn about hidden-web databases, classification methods, and interacting with hidden content. Explore querying techniques for efficient search. Discover how to manage changes in these databases.


Presentation Transcript


  1. Classifying and Searching "Hidden-Web" Text Databases Panos Ipeirotis Department of Information Systems New York University

  2. Motivation: “Surface” Web vs. “Hidden” Web • “Surface” Web • Link structure • Crawlable • Documents indexed by search engines • “Hidden” Web • No link structure • Documents “hidden” in databases • Documents not indexed by search engines • Need to query each collection individually

  3. Hidden-Web Databases: Examples • Search on U.S. Patent and Trademark Office (USPTO) database: • [wireless network] → 39,270 matches • (USPTO database is at http://patft.uspto.gov/netahtml/search-bool.html) • Search on Google restricted to USPTO database site: • [wireless network site:patft.uspto.gov] → 1 match as of Oct 3rd, 2005

  4. Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories (populated manually) • InvisibleWeb.com • SearchEngineGuide.com • Searching: Metasearchers

  5. Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • Managing Changes in Hidden-Web Databases

  6. Hierarchically Classifying the ACM Digital Library • Where does the ACM Digital Library (ACM DL) fit in a topic hierarchy?

  7. Text Database Classification: Definition • For a text database D and a category C: • Coverage(D,C) = number of docs in D about C • Specificity(D,C) = fraction of docs in D about C • Assign a text database to a category C if: • Database coverage for C at least Tc (Tc: coverage threshold, e.g., > 100 docs in C) • Database specificity for C at least Ts (Ts: specificity threshold, e.g., > 40% of docs in C)
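In code, the threshold rule on the slide is a one-liner per category. A minimal sketch, assuming we already have per-category document counts for a database (the names doc_counts, t_c, and t_s are illustrative, not from the talk):

```python
def classify_database(doc_counts, total_docs, t_c=100, t_s=0.4):
    """Assign the database to every category whose coverage and specificity
    exceed the thresholds Tc and Ts (illustrative sketch)."""
    assigned = []
    for category, count in doc_counts.items():
        coverage = count                      # number of docs about the category
        specificity = count / total_docs      # fraction of docs about the category
        if coverage >= t_c and specificity >= t_s:
            assigned.append(category)
    return assigned

# Example: a 1,000-document database with 450 docs about "Health"
print(classify_database({"Health": 450, "Sports": 30}, total_docs=1000))
```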

  8. Brute-Force Classification “Strategy” • Extract all documents from database • Classify documents on topic (use state-of-the-art classifiers: SVMs, C4.5, RIPPER, …) • Classify database according to topic distribution • Problem: No direct access to full contents of Hidden-Web databases

  9. Classification: Goal & Challenges • Goal: Discover database topic distribution • Challenges: • No direct access to full contents of Hidden-Web databases • Only limited search interfaces available • Should not overload databases • Key observation: Only queries “about” database topic(s) generate a large number of matches

  10. Query-based Database Classification: Overview • Train document classifier • Extract queries from classifier (e.g., Sports: +nba +knicks; Health: +sars) • Adaptively issue queries to database (e.g., +sars → 1,254 matches) • Identify topic distribution based on adjusted number of query matches • Classify database

  11. Training a Document Classifier • Get training set (set of pre-classified documents) • Select best features to characterize documents (Zipf’s law + information-theoretic feature selection) [Koller and Sahami 1996] • Train classifier (SVM, C4.5, RIPPER, …) • Output: a “black-box” model for classifying documents
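A minimal sketch of this training step using scikit-learn, which is an assumption here (the talk’s experiments use RIPPER); it chains bag-of-words features, a chi-squared feature selector as a stand-in for the information-theoretic selection, and a linear classifier:

```python
# Sketch of training a topic classifier with feature selection.
# scikit-learn is an assumption; the talk's experiments used RIPPER.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["the knicks won the nba game", "sars outbreak reported by health officials"]
labels = ["Sports", "Health"]

classifier = make_pipeline(
    CountVectorizer(stop_words="english"),  # bag-of-words features
    SelectKBest(chi2, k=5),                 # keep only the most discriminative terms
    LinearSVC(),                            # linear classifier (stand-in for SVM/C4.5/RIPPER)
)
classifier.fit(docs, labels)
print(classifier.predict(["nba playoff schedule"]))
```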

  13. Extracting Query Probes [ACM TOIS 2003] • Transform classifier model into queries • Trivial for “rule-based” classifiers (RIPPER) • Easy for decision-tree classifiers (C4.5), for which rule generators exist (C4.5rules) • Trickier for other classifiers: we devised rule-extraction methods for linear classifiers (linear-kernel SVMs, Naïve Bayes, …) • Example query for Sports: +nba +knicks
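For linear classifiers, one simple reading of the slide is to turn each category’s highest-weight terms into a conjunctive query probe. The sketch below follows that reading with made-up weights; it is not the exact rule-extraction algorithm of the TOIS 2003 paper:

```python
import numpy as np

def extract_query_probes(feature_names, class_names, weight_matrix, terms_per_query=2):
    """For each category, turn the top positive weights of a linear model
    into a conjunctive query probe, e.g. Sports -> '+nba +knicks'."""
    probes = {}
    for i, category in enumerate(class_names):
        top = np.argsort(weight_matrix[i])[::-1][:terms_per_query]
        probes[category] = " ".join("+" + feature_names[j] for j in top)
    return probes

features = ["nba", "knicks", "sars", "vaccine"]
weights = np.array([[2.1, 1.8, -0.5, -0.2],    # Sports (illustrative weights)
                    [-0.4, -0.1, 2.7, 1.9]])   # Health
print(extract_query_probes(features, ["Sports", "Health"], weights))
# {'Sports': '+nba +knicks', 'Health': '+sars +vaccine'}
```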

  14. Querying Database with Extracted Queries [SIGMOD 2001, ACM TOIS 2003] • Issue each query to the database to obtain the number of matches without retrieving any documents (e.g., +sars → 1,254 matches) • Increase coverage of the rule’s category accordingly (#Sports = #Sports + 706)
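A sketch of the probing step, assuming a hypothetical helper get_match_count(query) that submits a query to the database’s search form and parses the reported number of matches (the counts are the slide’s examples):

```python
from collections import Counter

def probe_database(probes, get_match_count):
    """Accumulate raw coverage estimates per category from query-probe match counts.
    `get_match_count(query)` is a hypothetical helper that queries the database's
    search interface and returns the reported number of matches."""
    estimated_coverage = Counter()
    for category, query in probes.items():
        estimated_coverage[category] += get_match_count(query)  # e.g. +nba +knicks -> 706
    return estimated_coverage

# Example with a stubbed match-count function:
fake_counts = {"+nba +knicks": 706, "+sars": 1254}
print(probe_database({"Sports": "+nba +knicks", "Health": "+sars"}, fake_counts.get))
```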

  15. Identifying Topic Distribution from Query Results • Document classifiers not perfect: • Rules for one category match documents from other categories • Querying not perfect: • Queries for same category might overlap • Queries do not match all documents in a category • Query-based estimates of topic distribution not perfect • Solution: Learn to adjust results of query probes

  16. Confusion Matrix Adjustment of Query Probe Results • A confusion matrix M relates the correct (but unknown) topic distribution to the incorrect topic distribution derived from query probing • Example entry: 10% of “sports” documents match queries for “computers” • The slide walks through a numeric example of the matrix-vector multiplication (rows such as 800+500+0, 80+4250+2, 20+750+48) • This “multiplication” can be inverted to get a better estimate of the real topic distribution from the probe results

  17. Confusion Matrix Adjustment of Query Probe Results • Coverage(D) ≈ M⁻¹ · ECoverage(D): the adjusted estimate of the topic distribution is the inverse confusion matrix applied to the probing results • M is usually diagonally dominant for “reasonable” document classifiers, hence invertible • Compensates for errors in query-based estimates of topic distribution
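The adjustment itself is one linear-algebra step. A NumPy sketch with illustrative numbers (not the matrix from the slides):

```python
import numpy as np

# Rows/columns: Sports, Health, Computers (illustrative values).
# M[i][j] = fraction of documents of real category j matched by queries for category i.
M = np.array([[0.80, 0.05, 0.10],
              [0.10, 0.90, 0.05],
              [0.10, 0.05, 0.85]])

ecoverage = np.array([706.0, 1254.0, 120.0])   # raw match counts from query probing

# Coverage(D) ≈ M^-1 · ECoverage(D); a diagonally dominant M is invertible.
coverage = np.linalg.solve(M, ecoverage)        # more stable than explicitly inverting M
print(np.round(coverage))
```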

  18. Classification Algorithm (Again) • One-time process: • Train document classifier • Extract queries from classifier • For every database: • Adaptively issue queries to database • Identify topic distribution based on adjusted number of query matches • Classify database

  19. Experimental Setup • 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles: • Newsgroups assigned by hand to hierarchy nodes (e.g., comp.hardware, rec.music.classical, rec.photo.*) • RIPPER trained with 54,000 articles (1,000 articles per leaf); 27,000 articles used to construct the confusion matrix • 500 “Controlled” databases built using 419,000 newsgroup articles (to run detailed experiments) • 130 real Web databases picked from InvisibleWeb (first 5 under each topic)

  20. Experimental Results: Controlled Databases • Accuracy (using F-measure): • Above 80% for most <Tc, Ts> threshold combinations tried • Degrades gracefully with hierarchy depth • Confusion-matrix adjustment helps • Efficiency: Relatively small number of queries (<500) needed for most <Tc, Ts> threshold combinations tried

  21. Experimental Results: Web Databases • Accuracy (using F-measure): • ~70% for best <Tc, Ts> combination • Learned thresholds that reproduce human classification • Tested threshold choice using 3-fold cross validation • Efficiency: • 120 queries per database on average needed for choice of thresholds, no documents retrieved • Only small part of hierarchy “explored” • Queries are short: 1.5 words on average; 4 words maximum (easily handled by most Web databases)

  22. Hidden-Web Database Classification: Summary • Handles autonomous Hidden-Web databases accurately and efficiently: • ~70% F-measure • Only 120 queries issued on average, with no documents retrieved • Handles a large family of document classifiers (and can hence exploit future advances in machine learning)

  23. Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • Managing Changes in Hidden-Web Databases

  24. Interacting With Hidden-Web Databases • Browsing: Yahoo!-like directories • Searching: Metasearchers • In both cases, content not accessible through Google • A metasearcher forwards a query to Hidden-Web databases such as the NYTimes Archives, PubMed, USPTO, and the Library of Congress

  25. Metasearchers Provide Access to Distributed Databases • Database selection relies on simple content summaries: vocabulary, word frequencies • Example: for the query [thrombopenia], the metasearcher must choose among PubMed (thrombopenia in 24,826 documents), the NYTimes Archives (18), and USPTO (0) • Sample content summary for PubMed (11,868,552 documents): aids 121,491; cancer 1,562,477; heart 691,360; hepatitis 121,129; thrombopenia 24,826; … • Databases typically do not export such summaries!
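To see why such summaries matter, here is a toy, frequency-only database selection sketch; real metasearchers use richer scoring functions such as CORI, so treat this purely as an illustration:

```python
def select_databases(query_terms, summaries, k=1):
    """Rank databases by a naive score: the sum of per-term document frequencies
    from each database's content summary (a simplified stand-in for CORI-style scoring)."""
    scores = {
        db: sum(freqs.get(term, 0) for term in query_terms)
        for db, freqs in summaries.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

summaries = {
    "PubMed": {"thrombopenia": 24826, "cancer": 1562477},
    "NYTimes Archives": {"thrombopenia": 18},
    "USPTO": {"thrombopenia": 0},
}
print(select_databases(["thrombopenia"], summaries))   # ['PubMed']
```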

  26. Extracting a Representative Document Sample: Focused Sampling • Train a document classifier • Create queries from classifier • Adaptively issue queries to databases • Retrieve top-k matching documents for each query • Save #matches for each one-word query • Identify topic distribution based on adjusted number of query matches • Categorize the database • Generate content summary from document sample • Focused sampling retrieves documents only from “topically dense” areas of the database
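A sketch of turning the focused document sample into a content summary, assuming a hypothetical retrieve_top_k(query) helper that returns the text of the top-k documents matching a probe query:

```python
from collections import Counter
import re

def build_content_summary(probes, retrieve_top_k):
    """Build an approximate content summary (word -> number of sampled documents
    containing the word) from a focused document sample. `retrieve_top_k(query)`
    is a hypothetical helper returning the text of the top-k matching documents."""
    summary = Counter()
    sample_size = 0
    for query in probes:
        for doc in retrieve_top_k(query):
            sample_size += 1
            for word in set(re.findall(r"[a-z]+", doc.lower())):
                summary[word] += 1       # document frequency within the sample
    return summary, sample_size

# Example with a stubbed retrieval function:
stub = lambda q: ["metastasis treatment study", "cancer treatment trial"]
summary, n = build_content_summary(["+cancer"], stub)
print(summary["treatment"], "of", n, "sampled documents")
```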

  27. Sampling and Incomplete Content Summaries • Problem: Summaries from small samples are highly incomplete • Many words appear in “relatively few” documents (Zipf’s law) • (Plot: frequency vs. rank, on a log scale, of the 10% most frequent words in the PubMed database; 95% of the words appear in less than 0.1% of the database)

  28. Sampling and Incomplete Content Summaries • Problem: Summaries from small samples are highly incomplete • Many words appear in “relatively few” documents (Zipf’s law) • Low-frequency words are often important (e.g., endocarditis: ~10,000 docs, ~0.1% of PubMed)

  29. Sampling and Incomplete Content Summaries • Problem: Summaries from small samples are highly incomplete • Many words appear in “relatively few” documents (Zipf’s law) • Low-frequency words are often important • Small document samples (e.g., 300 documents) miss many low-frequency words such as endocarditis (~9,000 docs, ~0.1% of PubMed)

  30. Sample-based Content Summaries • Main Idea: Database Classification Helps • Similar topics ↔ Similar content summaries • Extracted content summaries complement each other • Challenge: Improve content summary quality without increasing sample size

  31. Databases with Similar Topics • CANCERLIT contains “metastasis”, not found during sampling • CancerBACUP contains “metastasis” • Databases under same category have similar vocabularies, and can complement each other

  32. Content Summaries for Categories • Databases under same category share similar vocabulary • Higher level category content summaries provide additional useful estimates • All estimates in category path are potentially useful

  33. Enhancing Summaries Using “Shrinkage” [SIGMOD 2004] • Estimates from database content summaries can be unreliable • Category content summaries are more reliable (based on larger samples) but less specific to the database • By combining estimates from category and database content summaries we get better estimates

  34. Shrinkage-based Estimations • Adjust estimate for metastasis in D: λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000 • Select λi weights to maximize the probability that the summary of D is from a database under all its parent categories → Avoids “sparse data” problem and decreases estimation risk

  35. Computing Shrinkage-based Summaries • Category path for D: Root → Health → Cancer → D • Pr[metastasis | D] = λ1 · 0.002 + λ2 · 0.05 + λ3 · 0.092 + λ4 · 0.000 • Pr[treatment | D] = λ1 · 0.015 + λ2 · 0.12 + λ3 · 0.179 + λ4 · 0.184 • … • Automatic computation of λi weights using an EM algorithm • Computation performed offline → No query overhead • Avoids “sparse data” problem and decreases estimation risk
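The shrinkage combination itself is just a λ-weighted mixture over the category path. A sketch with illustrative λ values (the paper learns them with EM; these are not the learned weights):

```python
def shrink_estimate(word, summaries_on_path, lambdas):
    """Shrinkage-based word-probability estimate: a lambda-weighted mixture of the
    estimates along the category path (Root, Health, Cancer, D in the slide)."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * summary.get(word, 0.0)
               for lam, summary in zip(lambdas, summaries_on_path))

path = [                      # Pr[word | node], from Root down to database D
    {"metastasis": 0.002, "treatment": 0.015},   # Root
    {"metastasis": 0.05,  "treatment": 0.12},    # Health
    {"metastasis": 0.092, "treatment": 0.179},   # Cancer
    {"metastasis": 0.000, "treatment": 0.184},   # D (sample-based summary)
]
lambdas = [0.1, 0.2, 0.4, 0.3]                   # illustrative, normally fit by EM
print(shrink_estimate("metastasis", path, lambdas))   # > 0, even though D's sample missed it
```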

  36. Shrinkage Weights and Summary • (Table on slide compares old and new estimates) • Shrinkage: • Increases word-probability estimates for underestimates (e.g., metastasis) • Decreases word-probability estimates for overestimates (e.g., aids) • …it also introduces (with small probabilities) spurious words (e.g., football)

  37. Is Shrinkage Always Necessary? • Shrinkage used to reduce uncertainty (variance) of estimations • Small samples of large databases → high variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 10,000,000 documents? • Small samples of small databases → small variance • In sample: 10 out of 100 documents contain metastasis • In database: ? out of 200 documents? • Shrinkage less useful (or even harmful) when uncertainty is low

  38. Adaptive Application of Shrinkage • Database selection algorithms assign scores to databases for each query • When word frequency estimates are uncertain, the assigned score has high variance → shrinkage improves score estimates • When word frequency estimates are reliable, the assigned score has small variance → shrinkage unnecessary (and might hurt) • (Plots: probability distributions of the database score for a query, for an unreliable vs. a reliable score estimate) • Solution: Use shrinkage adaptively in a query- and database-specific manner

  39. Searching Algorithm • One-time process: • Classify databases and extract document samples • Adjust frequencies in samples • For every query: • For each database D: • Assign score to database D (using extracted content summary) • Examine uncertainty of score • If uncertainty is high, apply shrinkage and assign a new score; else keep the existing score • Query only the top-K scoring databases
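A sketch of the per-query loop with adaptive shrinkage. The helpers score_fn, score_variance_fn, and the precomputed shrunk_summaries are hypothetical stand-ins for a real selection algorithm (e.g., CORI) and its score-uncertainty estimate:

```python
def score_databases(query, summaries, shrunk_summaries,
                    score_fn, score_variance_fn,
                    variance_threshold=0.05, top_k=3):
    """Per-query database selection with adaptive shrinkage (sketch).
    `score_fn(query, summary)` scores a database from its content summary;
    `score_variance_fn(query, summary)` estimates how uncertain that score is."""
    scores = {}
    for db, summary in summaries.items():
        score = score_fn(query, summary)
        if score_variance_fn(query, summary) > variance_threshold:
            # Uncertain estimate: rescore using the shrinkage-adjusted summary.
            score = score_fn(query, shrunk_summaries[db])
        scores[db] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```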

  40. Results: Database Selection • Metric: R(K) = X / Y • X = # of relevant documents in the selected K databases • Y = # of relevant documents in the best K databases • Results reported for CORI (a state-of-the-art database selection algorithm) with stemming over the TREC6 testbed
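Computing R(K) is straightforward once the number of relevant documents per database is known; a small sketch with made-up counts:

```python
def r_at_k(relevant_per_db, selected, k):
    """R(K) = relevant docs in the K selected databases / relevant docs in the best K databases."""
    best = sorted(relevant_per_db.values(), reverse=True)[:k]
    chosen = [relevant_per_db[db] for db in selected[:k]]
    return sum(chosen) / sum(best)

relevant = {"PubMed": 120, "NYTimes Archives": 30, "USPTO": 5}   # illustrative counts
print(r_at_k(relevant, ["PubMed", "USPTO"], k=2))                # 125 / 150 ≈ 0.83
```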

  41. Outline of Talk • Classification of Hidden-Web Databases • Search over Hidden-Web Databases • Managing Changes in Hidden-Web Databases

  42. Never-update Policy • Naive practice: construct summary once, never update • Extracted (old) summary may: • Miss new words (from new documents) • Contain obsolete words (from deleted documents) • Provide inaccurate frequency estimates • Example (NY Times, word / #docs) – Oct 29, 2004: tsunami 0, recount 2,302, grokster 2; Mar 29, 2005: tsunami 250, recount 0, grokster 78

  43. Updating Content Summaries: Questions [ICDE 2005] • Do content summaries change over time? • Which database properties affect the rate of change? • How to schedule updates with constrained resources?

  44. Data for our Study: 152 Web Databases • Study period: Oct 2002 – Oct 2003 • 52 weekly snapshots for each database • 5 million pages in each snapshot (approx.) • 65 GB per snapshot (3.3 TB total) • Examined differences between snapshots: • 1 week apart: 5% new words, 5% of old words disappeared • 20 weeks apart: 20% new words, 20% of old words disappeared

  45. Survival Analysis • Survival Analysis: A collection of statistical techniques for predicting “the time until an event occurs” • Initially used to measure length of survival of patients under different treatments (hence the name) • Used to measure effect of different parameters (e.g., weight, race) on survival time • We want to predict “time until next update” and find database properties that affect this time

  46. Survival Analysis for Summary Updates • “Survival time of summary”: Time until the current database summary is “sufficiently different” from the old one (i.e., an update is required) • The old summary is considered changed at time t if: KL divergence(current, old) > τ (τ: change-sensitivity threshold) • Survival analysis estimates the probability that a database summary changes within time t
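A sketch of the change test, using a smoothed KL divergence between the old and current summaries treated as word distributions (the smoothing and the threshold value are illustrative simplifications, not the paper’s exact setup):

```python
import math

def kl_divergence(current, old, epsilon=1e-6):
    """KL divergence between the current and old content summaries, viewed as
    word-probability distributions; epsilon smoothing avoids log(0) for words
    missing from one summary."""
    vocab = set(current) | set(old)
    p = {w: current.get(w, 0.0) + epsilon for w in vocab}
    q = {w: old.get(w, 0.0) + epsilon for w in vocab}
    zp, zq = sum(p.values()), sum(q.values())
    return sum((p[w] / zp) * math.log((p[w] / zp) / (q[w] / zq)) for w in vocab)

old_summary = {"recount": 0.4, "election": 0.6}
new_summary = {"tsunami": 0.5, "election": 0.5}
tau = 0.5                                            # illustrative change-sensitivity threshold
print(kl_divergence(new_summary, old_summary) > tau)  # True -> the summary needs an update
```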

  47. Survival Times and “Incomplete” Data • (Timeline diagram: “survival times” for a database, in weeks; summaries still unchanged at week 52, the end of the study, are “censored” cases) • Many observations are “incomplete” (aka “censored”) • Censored data give partial information (the database did not change within the observation period)

  48. Using “Censored” Data • (Plot: three best-fit survival functions S(t): ignoring censored data, using censored data “as-is”, and properly using censored data) • By ignoring censored cases we get underestimates → perform more update operations than needed • By using censored cases “as-is” we (again) get underestimates • Survival analysis “extends” the lifetime of “censored” cases

  49. Database Properties and Survival Times • For our analysis, we use Cox Proportional Hazards Regression • Uses “censored” data effectively (i.e., the database did not change within time T) • Derives the effect of database properties on the rate of change • E.g., “if you double the size of a database, it changes twice as fast” • Makes no assumptions about the form of the survival function
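Fitting such a model is routine with the lifelines Python package, which is an assumption here (the talk does not say what software was used); the data below are made up purely for illustration:

```python
# Sketch of a Cox proportional hazards fit; `lifelines` is an assumption,
# and the data are invented for illustration only.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "weeks_until_change": [3, 10, 52, 7, 30, 52],   # observed survival time of the summary
    "changed":            [1, 1, 0, 1, 1, 0],       # 0 = censored (no change by end of study)
    "log_db_size":        [6.2, 4.1, 3.0, 5.8, 3.5, 4.4],
    "is_com_domain":      [1, 0, 0, 1, 1, 0],
})

cph = CoxPHFitter(penalizer=0.1)   # small ridge penalty keeps the toy fit stable
cph.fit(df, duration_col="weeks_until_change", event_col="changed")
cph.print_summary()                # hazard ratios show how each property affects the rate of change
```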

  50. Baseline Survival Functions by Domain • Effect of domain: • GOV changes slower than any other domain • EDU changes fast in the short term, but slower in the long term • COM and other commercial sites change faster than the rest
