360 likes | 373 Views
This study explores the categorization of hidden web databases through query probing, covering classification methods, specificity, coverage, and training of rule-based document classifiers. Discover a practical approach to extract and classify information from databases without direct access to documents. 8 Relevant
E N D
Probe, Count, and Classify:Categorizing Hidden Web Databases Panagiotis G. Ipeirotis Luis Gravano Columbia University Mehran Sahami E.piphany Inc.
Surface Web Link structure Crawlable Hidden Web No link structure Documents “hidden” behind search forms Surface Web vs. Hidden Web
Do We Need the Hidden Web? Example: PubMed/MEDLINE • PubMed: (www.ncbi.nlm.nih.gov/PubMed) search: “cancer” 1,341,586 matches • AltaVista: “cancer site:www.ncbi.nlm.nih.gov” 21,830 matches
Interacting With Searchable Text Databases • Searching: Metasearchers • Browsing: Yahoo!-like web directories: • InvisibleWeb.com • SearchEngineGuide.com Example from InvisibleWeb.com Health > Publications > PubMED Created Manually!
Classifying Text Databases Automatically: Outline • Definition of classification • Classification through query probing • Experiments
Database Classification: Two Definitions • Coverage-based classification: • Database contains many documents about a category Coverage: #docs about this category • Specificity-based classification: • Database contains mainly documents about a category • Specificity: #docs/|DB|
Database Classification: An Example • Category: Basketball • Coverage-based classification • ESPN.com, NBA.com, not KnicksTerritory.com • Specificity-based classification • NBA.com, KnicksTerritory.com, not ESPN.com
Database Classification: More Details Tc, Ts “editorial” choices Thresholds for coverage and specificity • Tc: coverage threshold (e.g., 100) • Ts: specificity threshold (e.g., 0.5) Ideal(D) Root Ideal(D): set of classes for database D Class C is in Ideal(D) if: • D has “enough” coverage and specificity (Tc, Ts) for C and all of C’s ancestors and • D fails to have both “enough” coverage and specificity for each child of C SPORTS C=800 S=0.8 HEALTH C=200 S=0.2 BASKETBALL S=0.5 BASEBALL S=0.5
From Document to Database Classification • If we know the categories of all documents inside the database, we are done! • We do not have direct access to the documents. • Databases do not export such data! How can we extract this information?
Our Approach: Query Probing • Train a rule-based document classifier. • Transform classifier rules into queries. • Adaptively sendqueries to databases. • Categorize the databases based on adjusted number of query matches.
Training a Rule-based Document Classifier • Feature Selection: Zipf’s law pruning, followed by information-theoretic feature selection [Koller & Sahami’96] • Classifier Learning: AT&T’s RIPPER [Cohen 1995] • Input: A set of pre-classified, labeled documents • Output: A set of classification rules • IF linux THEN Computers • IF jordan AND bulls THEN Sports • IF lung AND cancer THEN Health
Constructing Query Probes • Transform each rule into a query IF lung AND cancer THEN health +lung +cancer IF linux THEN computers +linux • Send the queries to the database • Get number of matches for each query, NOT the documents (i.e., number of documents that match each rule) • These documents would have been classified by the rule under its associated category!
Adjusting Query Results • Classifiers are not perfect! • Queries do not “retrieve” all the documents in a category • Queries for one category “match” documents not in this category • From the classifier’s training phase we know its “confusion matrix”
Confusion Matrix Correct class 10% of “Sports” classified as “Computers” X = 10% of the 5000 “Sports” docs to “Computers” Classified into M . Coverage(D) ~ ECoverage(D)
Confusion Matrix Adjustment:Compensating for Classifier’s Errors -1 = X M is diagonally dominant, hence invertible Coverage(D) ~ M-1 . ECoverage(D) Multiplication better approximates the correct result
Classifying a Database • Send the query probes for the top-level categories • Get the number of matches for each probe • Calculate Specificity and Coverage for each category • “Push” the database to the qualifying categories (with Specificity>Ts and Coverage>Tc) • Repeat for each of the qualifying categories • Return the classes that satisfy the coverage/specificity conditions The result is the Approximation of the Ideal classification
Experiments: Data • 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) • 500,000 Usenet articles (April-May 2000): • Newsgroups assigned by hand to hierarchy nodes • RIPPER trained with 54,000 articles (1,000 articles per leaf) • 27,000 articles used to construct estimations of the confusion matrices • Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes, size
Comparison With Alternatives • DS: Random sampling of documents via query probes • Callan et al., SIGMOD’99 • Different task: Gather vocabulary statistics • We adapted it for database classification • TQ: Title-based Probing • Yu et al., WISE 2000 • Query probes are simply the category names
Experiments: Metrics Expanded(N) • Accuracy of classification results: • Expanded(N) = N and all descendants • Correct = Expanded(Ideal(D)) • Classified = Expanded(Approximate(D)) • Precision = |Correct /\ Classified|/|Classified| • Recall = |Correct /\ Classified|/|Correct| • F-measure = 2.Precision.Recall/(Precision + Recall) • Costof classification: Number of queries to database N
Average F-measure, Controlled Databases PnC =Probe & Count, DS=Document Sampling, TQ=Title-based probing
Experimental Results: Controlled Databases • Feature selection helps. • Confusion-matrix adjustment helps. • F-measure above 0.8 for most <Tc, Ts> combinations. • Results degrade gracefully with hierarchy depth. • Relatively small number of probes needed for most <Tc, Ts> combinations tried. • Also, probes are short: 1.5 words on average; 4 words maximum. • Both better performance and lower cost than DS [Callan et al. adaptation] and TQ [Yu et al.]
Web Databases • 130 real databases classified from InvisibleWeb™. • Used InvisibleWeb’s categorization as correct. • Simple “wrappers” for querying (only # of matches needed). • The Ts, Tc thresholds are not known (unlike with the Controlled databases)but implicit in the IWeb categorization. • We can learn/validate the thresholds (tricky but easy!). • More details in the paper!
Experimental Results:Web Databases • 130 RealWeb Databases. • F-measure above 0.7 for best <Tc, Ts> combination learned. • 185 query probes per database on average needed for classification. • Also, probes are short: 1.5 words on average; 4 words maximum.
Conclusions • Accurate classification using only a small number of short queries • No need for document retrieval • Only need a result like: “X matches found” • No need for any cooperation or special metadata from databases
Current and Future Work • Build “wrappers” automatically • Extend to non-topical categories • Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked) • Extend to other classifiers (e.g., SVMs or Bayesian models) • Integrate with searching (connection with database selection?)
Contributions Easy, inexpensive method for database classification Uses results from document classification “Indirect” classification of the documents in a database Does not inspect documents, only number of matches Adjustment of results according to classifier’s performance Easy wrapper construction No need for any metadata from the database
Related Work • Callan et al., SIGMOD 1999 • Gauch et al., Profusion • Dolin et al., Pharos • Yu et al., WISE 2000 • Raghavan and Garcia Molina, VLDB 2001
Controlled Databases 500 databases built using 419,000 newsgroup articles • One label per document • 350 databases with single (not necessarily leaf) category • 150 databases with varying category mixes • Database size ranges from 25 to 25,000 articles • Indexed and queries using SMART
F-measure for Different Hierarchy Depths PnC =Probe & Count, DS=Document Sampling, TQ=Title-based probing Tc=8, Ts=0.3