Probe, Count, and Classify: Categorizing Hidden Web Databases. Panagiotis G. Ipeirotis, Luis Gravano (Columbia University), Mehran Sahami (E.piphany Inc.). DIMACS Summer School Tutorial on New Frontiers in Data Mining, Theme: Web, Thursday, August 16, 2001
Surface Web vs. Hidden Web • Surface Web: link structure, crawlable • Hidden Web: no link structure, documents “hidden” behind search forms
Do We Need the Hidden Web? Example: PubMed/MEDLINE • PubMed (www.ncbi.nlm.nih.gov/PubMed), search “cancer”: 1,341,586 matches • AltaVista, “cancer site:www.ncbi.nlm.nih.gov”: 21,830 matches
Interacting With Searchable Text Databases • Searching: metasearchers • Browsing: Yahoo!-like web directories: • InvisibleWeb.com • SearchEngineGuide.com • Example from InvisibleWeb.com: Health > Publications > PubMed (created manually!)
Classifying Text Databases Automatically: Outline • Definition of classification • Classification through query probing • Experiments
Database Classification: Two Definitions • Coverage-based classification: • Database contains many documents about a category • Coverage: #docs about this category • Specificity-based classification: • Database contains mainly documents about a category • Specificity: #docs about this category / |DB|
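Written as formulas (a direct transcription of the slide's definitions, with D the database and C a category):

$$
\mathrm{Coverage}(D, C) = \left|\{\, d \in D : d \text{ is about } C \,\}\right|,
\qquad
\mathrm{Specificity}(D, C) = \frac{\mathrm{Coverage}(D, C)}{|D|}
$$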
Database Classification: An Example • Category: Basketball • Coverage-based classification • ESPN.com, NBA.com, not KnicksTerritory.com • Specificity-based classification • NBA.com, KnicksTerritory.com, not ESPN.com
Database Classification: More Details • Thresholds for coverage and specificity are “editorial” choices: • Tc: coverage threshold (e.g., 100) • Ts: specificity threshold (e.g., 0.5) • Ideal(D): set of classes for database D. Class C is in Ideal(D) if: • D has “enough” coverage and specificity (Tc, Ts) for C and all of C’s ancestors, and • D fails to have both “enough” coverage and specificity for each child of C • [Example hierarchy figure: Root → Sports (C=800, S=0.8), Health (C=200, S=0.2); Sports → Basketball (S=0.5), Baseball (S=0.5)]
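Stated as a single condition (again a direct transcription of the two bullets above):

$$
C \in \mathrm{Ideal}(D) \iff
\Big(\forall C' \in \{C\} \cup \mathrm{Ancestors}(C):\ \mathrm{Coverage}(D, C') \ge T_c \,\wedge\, \mathrm{Specificity}(D, C') \ge T_s\Big)
\;\wedge\;
\Big(\forall C'' \in \mathrm{Children}(C):\ \neg\big(\mathrm{Coverage}(D, C'') \ge T_c \,\wedge\, \mathrm{Specificity}(D, C'') \ge T_s\big)\Big)
$$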
From Document to Database Classification • If we know the categories of all documents inside the database, we are done! • We do not have direct access to the documents. • Databases do not export such data! How can we extract this information?
Our Approach: Query Probing • Train a rule-based document classifier. • Transform classifier rules into queries. • Adaptively send queries to databases. • Categorize the databases based on the adjusted number of query matches.
Training a Rule-based Document Classifier • Feature Selection: Zipf’s law pruning, followed by information-theoretic feature selection [Koller & Sahami’96] • Classifier Learning: AT&T’s RIPPER [Cohen 1995] • Input: A set of pre-classified, labeled documents • Output: A set of classification rules • IF linux THEN Computers • IF jordan AND bulls THEN Sports • IF lung AND cancer THEN Health
Constructing Query Probes • Transform each rule into a query: IF lung AND cancer THEN health → +lung +cancer; IF linux THEN computers → +linux • Send the queries to the database • Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule) • These documents would have been classified by the rule under its associated category!
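A minimal Python sketch of this rule-to-query transformation; the rule representation below is an assumption for illustration, not the paper's actual data format:

```python
# RIPPER-style conjunctive rules, written as (antecedent terms, category) pairs.
rules = [
    (["lung", "cancer"], "Health"),
    (["linux"], "Computers"),
    (["jordan", "bulls"], "Sports"),
]

def rule_to_probe(antecedent):
    """A rule 'IF t1 AND t2 ... THEN c' becomes the boolean query '+t1 +t2 ...'."""
    return " ".join(f"+{term}" for term in antecedent)

probes = {rule_to_probe(terms): category for terms, category in rules}
# {'+lung +cancer': 'Health', '+linux': 'Computers', '+jordan +bulls': 'Sports'}
```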
Adjusting Query Results • Classifiers are not perfect! • Queries do not “retrieve” all the documents in a category • Queries for one category “match” documents not in this category • From the classifier’s training phase we know its “confusion matrix”
Confusion Matrix • M(i,j): fraction of documents with correct class j that the classifier assigns to class i (e.g., 10% of “Sports” classified as “Computers”, i.e., 10% of the 5,000 “Sports” docs are counted toward “Computers”) • M · Coverage(D) ≈ ECoverage(D)
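Written out for a hypothetical three-category case (the matrix entries below are illustrative, chosen only to be consistent with the 10% Sports-to-Computers example; they are not the paper's measured values). Rows are the classes assigned by the classifier, columns are the correct classes, in the order Sports, Computers, Health:

$$
\underbrace{\begin{pmatrix} 0.85 & 0.10 & 0.05 \\ 0.10 & 0.80 & 0.05 \\ 0.05 & 0.10 & 0.90 \end{pmatrix}}_{M}
\cdot
\underbrace{\begin{pmatrix} 5000 \\ 1000 \\ 100 \end{pmatrix}}_{\mathrm{Coverage}(D)}
=
\underbrace{\begin{pmatrix} 4355 \\ 1305 \\ 440 \end{pmatrix}}_{\mathrm{ECoverage}(D)}
$$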
Confusion Matrix Adjustment: Compensating for Classifier’s Errors • M is diagonally dominant, hence invertible • Coverage(D) ≈ M⁻¹ · ECoverage(D) • Multiplying by M⁻¹ better approximates the correct result
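A minimal numpy sketch of the adjustment, reusing the hypothetical confusion matrix from the example above (values are illustrative, not the paper's):

```python
import numpy as np

# Hypothetical confusion matrix estimated during the classifier's training phase:
# rows = class assigned by the classifier, columns = correct class
# (order: Sports, Computers, Health).
M = np.array([
    [0.85, 0.10, 0.05],   # classified as Sports
    [0.10, 0.80, 0.05],   # classified as Computers
    [0.05, 0.10, 0.90],   # classified as Health
])

# Coverage estimated directly from the probe match counts (ECoverage).
ecoverage = np.array([4355.0, 1305.0, 440.0])

# M is diagonally dominant, hence invertible; multiplying by its inverse
# compensates for the classifier's errors.
coverage = np.linalg.inv(M) @ ecoverage
print(coverage.round())   # -> [5000. 1000.  100.]
```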
Classifying a Database • Send the query probes for the top-level categories • Get the number of matches for each probe • Calculate Specificity and Coverage for each category • “Push” the database to the qualifying categories (with Specificity>Ts and Coverage>Tc) • Repeat for each of the qualifying categories • Return the classes that satisfy the coverage/specificity conditions The result is the Approximation of the Ideal classification
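A Python sketch of this top-down procedure. The get_coverage helper is hypothetical: it stands for sending a node's query probes to the database and returning the adjusted coverage estimate for each child category, as described above.

```python
def classify(db, get_coverage, node="Root", t_c=100, t_s=0.5):
    """Approximate Ideal(db) by probing one level of the hierarchy at a time.

    get_coverage(db, node) returns {child_category: adjusted number of matches}
    for the children of `node`, based on that node's query probes.
    """
    coverage = get_coverage(db, node)
    total = sum(coverage.values()) or 1          # avoid division by zero
    result = set()
    for child, cov in coverage.items():
        specificity = cov / total
        if cov >= t_c and specificity >= t_s:    # "push" the database into this child
            deeper = classify(db, get_coverage, child, t_c, t_s)
            result |= deeper if deeper else {child}
    return result
```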
Experiments: Data • 72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles (April-May 2000): • Newsgroups assigned by hand to hierarchy nodes • RIPPER trained with 54,000 articles (1,000 articles per leaf) • 27,000 articles used to construct estimates of the confusion matrices • Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes
Comparison With Alternatives • DS: Random sampling of documents via query probes • Callan et al., SIGMOD’99 • Different task: Gather vocabulary statistics • We adapted it for database classification • TQ: Title-based Probing • Yu et al., WISE 2000 • Query probes are simply the category names
Experiments: Metrics • Accuracy of classification results: • Expanded(N) = N and all of its descendants • Correct = Expanded(Ideal(D)) • Classified = Expanded(Approximate(D)) • Precision = |Correct ∩ Classified| / |Classified| • Recall = |Correct ∩ Classified| / |Correct| • F-measure = 2 · Precision · Recall / (Precision + Recall) • Cost of classification: number of queries sent to the database
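An illustrative sketch of these metrics in Python; the small hierarchy and the example category sets are hypothetical:

```python
# Hypothetical category hierarchy (parent -> children).
CHILDREN = {"Root": ["Sports", "Health"], "Sports": ["Basketball", "Baseball"],
            "Health": [], "Basketball": [], "Baseball": []}

def expanded(categories):
    """Expanded(N): the categories in N plus all of their descendants."""
    result, stack = set(), list(categories)
    while stack:
        node = stack.pop()
        result.add(node)
        stack.extend(CHILDREN[node])
    return result

def f_measure(ideal, approximate):
    correct, classified = expanded(ideal), expanded(approximate)
    precision = len(correct & classified) / len(classified)
    recall = len(correct & classified) / len(correct)
    return 2 * precision * recall / (precision + recall)

# e.g. f_measure(ideal={"Sports"}, approximate={"Basketball"}) -> 0.5
```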
Average F-measure, Controlled Databases (PnC = Probe & Count, DS = Document Sampling, TQ = Title-based probing)
Experimental Results: Controlled Databases • Feature selection helps. • Confusion-matrix adjustment helps. • F-measure above 0.8 for most <Tc, Ts> combinations. • Results degrade gracefully with hierarchy depth. • Relatively small number of probes needed for most <Tc, Ts> combinations tried. • Also, probes are short: 1.5 words on average; 4 words maximum. • Both better performance and lower cost than DS [Callan et al. adaptation] and TQ [Yu et al.]
Web Databases • 130 real databases classified from InvisibleWeb™. • Used InvisibleWeb’s categorization as correct. • Simple “wrappers” for querying (only # of matches needed). • The Ts, Tc thresholds are not known (unlike with the Controlled databases), but are implicit in the InvisibleWeb categorization. • We can learn/validate the thresholds (tricky but easy!). • More details in the paper!
Experimental Results: Web Databases • 130 real Web databases. • F-measure above 0.7 for the best <Tc, Ts> combination learned. • 185 query probes per database needed on average for classification. • Also, probes are short: 1.5 words on average; 4 words maximum.
Conclusions • Accurate classification using only a small number of short queries • No need for document retrieval • Only need a result like: “X matches found” • No need for any cooperation or special metadata from databases • URL: http://qprober.cs.columbia.edu
Current and Future Work • Build “wrappers” automatically • Extend to non-topical categories • Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked) • Extend to other classifiers (e.g., SVMs or Bayesian models) • Integrate with searching (connection with database selection?)
Contributions • Easy, inexpensive method for database classification • Uses results from document classification • “Indirect” classification of the documents in a database • Does not inspect documents, only the number of matches • Adjustment of results according to the classifier’s performance • Easy wrapper construction • No need for any metadata from the database
Related Work • Callan et al., SIGMOD 1999 • Gauch et al., ProFusion • Dolin et al., Pharos • Yu et al., WISE 2000 • Raghavan and Garcia-Molina, VLDB 2001
Controlled Databases • 500 databases built using 419,000 newsgroup articles • One label per document • 350 databases with a single (not necessarily leaf) category • 150 databases with varying category mixes • Database sizes range from 25 to 25,000 articles • Indexed and queried using SMART
F-measure for Different Hierarchy Depths (PnC = Probe & Count, DS = Document Sampling, TQ = Title-based probing; Tc = 8, Ts = 0.3)