Combining Classifiers to Identify Online Databases

Luciano Barbosa and Juliana Freire, School of Computing, University of Utah, {lbarbosa,juliana}@cs.utah.edu

The Hidden Web • Web content hidden behind form interfaces • Search for books, airfare tickets • Not accessible from search engines • Millions of online databases (Hsieh et al. SIGMOD 2006) • High-quality content • How to leverage this information?

Making the Hidden Web More Accessible: Current Approaches • Database directories (NAR database compilation; Galperin NAR 2007) • Web integration systems (Google Base; Chang et al. CIDR 2005) • Hidden-Web crawling (Raghavan & Garcia-Molina VLDB 2001; Barbosa & Freire SBBD 2004)

The Hidden-Web Infrastructure • Applications: Database Directory, Web Integration Systems, Hidden Web Crawlers, … • Hidden-Web Infrastructure: Form Repository, Form Location, Form Clustering, Form Identification (citation on slide: Barbosa et al. ICDE 2007)

Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points

Problem Definition Given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, our goal is to select from F only the forms that are entry points to databases in D.

Challenges • Locate online databases (later!) • Online databases are very sparsely distributed on the Web • Select only “relevant databases”, i.e., filter out non-searchable forms and forms not in the domain • There is great variation in the way Web forms are designed, even within a well-defined domain • High structural variability, heterogeneous vocabulary, vocabulary overlap across domains

Form Variability • Searchable vs. non-searchable (figure panels: Searchable; Non-searchable)

Form Variability • Different domains with similar content (figure panels: Hotel; Airfare)

Form Variability • Heterogeneity in same domain

Solution Overview: Pruning the Search Space (figure labels: Web; Pages in the domain; Searchable forms; Relevant forms; Non-relevant forms)

HIerarchical Form Identification (HIFI) • Locating forms: Focused Crawler • Identifying relevant forms: Generic Form Classifier, then Domain-Specific Form Classifier • Data flow: Web pages, Forms, Searchable forms, Relevant forms • Evidence used: page textual content, form structure, form textual content

HIFI: Phase I (HIFI architecture diagram repeated: Focused Crawler, Generic Form Classifier, Domain-Specific Form Classifier)

Looking at Form Structure • Searchable vs. non-searchable (figure panels: Searchable; Non-searchable)

Looking at Form Structure • Searchable forms share similar structure • Statistics about form components • Structural features are good indicators of whether forms are searchable or not

Generic Form Classifier - GFC • 13 features • hidden tags; radios; file inputs; submit tags; image inputs; buttons; resets; password tags; textboxes; “search” inside form tags; items in selection lists; submission method (post or get); text sizes in textboxes
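The 13 structural features listed above could be collected with a small HTML parser. The following is only a sketch under stated assumptions, not the authors' extractor: the class name, the feature keys, the choice to sum textbox sizes, and the rule that "search" may appear in any attribute value inside the form are all illustrative guesses.

```python
from html.parser import HTMLParser

# Input types that map one-to-one onto a counter (key names are assumptions).
COUNTED_INPUT_TYPES = {"hidden", "radio", "file", "submit", "image",
                       "button", "reset", "password"}
EXTRA_FEATURES = ["textbox", "text_size_total", "select_items",
                  "has_search_string", "method_is_post"]


class FormFeatureCounter(HTMLParser):
    """Collects 13 structural counts for the first <form> in a snippet."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.seen_form = False
        self.features = dict.fromkeys(
            sorted(COUNTED_INPUT_TYPES) + EXTRA_FEATURES, 0)

    def handle_starttag(self, tag, attrs):
        attr = {name: (value or "") for name, value in attrs}
        if tag == "form" and not self.seen_form:
            self.in_form = self.seen_form = True
            method = (attr.get("method") or "get").lower()
            self.features["method_is_post"] = int(method == "post")
        if not self.in_form:
            return
        # Assumption: the string "search" in any attribute value counts.
        if any("search" in value.lower() for value in attr.values()):
            self.features["has_search_string"] = 1
        if tag == "input":
            kind = (attr.get("type") or "text").lower()
            if kind == "text":
                self.features["textbox"] += 1
                size = attr.get("size", "")
                if size.isdigit():
                    self.features["text_size_total"] += int(size)
            elif kind in COUNTED_INPUT_TYPES:
                self.features[kind] += 1
        elif tag == "option":
            self.features["select_items"] += 1
        elif tag == "button":
            self.features["button"] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def extract_structural_features(form_html):
    """Returns a dict of structural feature counts for one form."""
    parser = FormFeatureCounter()
    parser.feed(form_html)
    parser.close()
    return parser.features


SAMPLE_FORM = """
<form action="/search" method="get">
  <input type="text" name="q" size="30">
  <input type="hidden" name="lang" value="en">
  <select name="cat"><option>A</option><option>B</option></select>
  <input type="submit" value="Go">
</form>
"""
sample_features = extract_structural_features(SAMPLE_FORM)
```

The resulting dictionary can be turned into a fixed-order numeric vector and passed to any standard learner; which learner the GFC uses is not stated on this slide.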

Generic Form Classifier • Test error • GFC is domain independent • Previous classifiers for identifying searchable forms are domain dependent: they use the content inside tags

HIFI: Phase II (HIFI architecture diagram repeated: Focused Crawler, Generic Form Classifier, Domain-Specific Form Classifier)

Looking at Form Content • Problem of focused crawler + GFC • Co-occurrence of different searchable forms in the domain

Looking at the Form Content
<form id="search_form" name="search" action="http://us.rd.yahoo.com/hotjobs/search/home/* method="get">
  Search for Jobs Across the Web
  Job Category
  <select tabindex="4" name="industry1" id="industry1">
    <option value="FIN">Accounting/Finance</option>
    <option value="ADV">Advertising/Public Relations</option>
    <option value="ART">Arts/Entertainment/Publishing</option>
    <option value="BAM">Banking/Mortgage</option>
  </select>
  Keyword(s) <input name="keywords_all" id="keywords_all" type="text" value="">
  (e.g. Job title, company, occupation)
  City & State or Zip
  <input tabindex="3" type="checkbox" align="left" name="metro_area" id="metro_area" value="1" checked /> Include surrounding cities
  <input type="hidden" name="country1" id="country1" value="USA">
</form>

Domain-Specific Form Classifier (DSFC) • Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004] • Uses the textual content that can be automatically extracted from forms • Remove the HTML tags • Vector of the 500 most frequent stemmed words in the training set • Weight in the vector: term frequency
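The vector construction described above (strip tags, stem, keep the most frequent training terms, weight by term frequency) can be sketched as follows. This is an illustration only: the function names are hypothetical, and `crude_stem` is a deliberately simplistic stand-in for a real stemmer, which the slide does not name.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Keeps only the text between tags, dropping the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def crude_stem(token):
    # Assumption: toy suffix stripping, standing in for a proper stemmer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token


def form_tokens(form_html):
    """Removes HTML tags, lowercases, tokenizes, and stems the form text."""
    parser = _TextExtractor()
    parser.feed(form_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [crude_stem(token) for token in re.findall(r"[a-z]+", text)]


def build_vocabulary(training_forms, size=500):
    """Returns the `size` most frequent stemmed terms in the training set."""
    counts = Counter()
    for form_html in training_forms:
        counts.update(form_tokens(form_html))
    return [term for term, _ in counts.most_common(size)]


def tf_vector(form_html, vocabulary):
    """Maps one form to a term-frequency vector over the vocabulary."""
    counts = Counter(form_tokens(form_html))
    return [counts[term] for term in vocabulary]


training = [
    "<form>Search jobs <input name='q'> Job title</form>",
    "<form>Find jobs by city</form>",
]
vocabulary = build_vocabulary(training, size=3)
vector = tf_vector("<form>Jobs and more jobs</form>", vocabulary)
```

Vectors of this shape can then be fed to a text classifier; the next slide reports that an SVM gave the best results.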

Classifier Creation • Test error in the 8 domains • Best results: SVM

Hierarchical Classification • GFC: coarse classification (high recall); domain independent • DSFC: smaller search space (high precision); domain specific • Benefits • Simplify the search space • Allows the construction of simpler classifiers • Use appropriate learning techniques for each feature space • Deal with badly formed forms
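The two-level arrangement can be pictured as a simple filter chain: the generic classifier prunes non-searchable forms first, and the domain-specific classifier only sees what survives. The sketch below assumes both classifiers are available as boolean predicates; the function name and the toy form records are hypothetical.

```python
from typing import Callable, Iterable, List, TypeVar

Form = TypeVar("Form")


def hierarchical_filter(forms: Iterable[Form],
                        is_searchable: Callable[[Form], bool],
                        is_in_domain: Callable[[Form], bool]) -> List[Form]:
    """Applies the generic stage first, then the domain-specific stage."""
    searchable = [form for form in forms if is_searchable(form)]
    # The second classifier only sees forms that passed the first one.
    return [form for form in searchable if is_in_domain(form)]


# Toy records standing in for crawled forms; the fields are illustrative.
crawled = [
    {"id": 1, "searchable": True, "domain": "jobs"},
    {"id": 2, "searchable": False, "domain": "jobs"},   # e.g. a login form
    {"id": 3, "searchable": True, "domain": "hotels"},
]
relevant = hierarchical_filter(
    crawled,
    is_searchable=lambda form: form["searchable"],
    is_in_domain=lambda form: form["domain"] == "jobs",
)
```

Keeping the stages separate is what lets each one use its own feature space and learning technique, as the Benefits bullets note.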

Experiments • Assess the quality of HIFI • In 8 representative domains--variation in form structure, vocabulary, size (details in paper) • Over different inputs • Effectiveness of monolithic classifier vs. HIFI

Evaluation Metrics (confusion-matrix figure; cell labels: True Positive, False Positive, False Negative, True Negative)
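The formulas from this slide did not survive extraction. As a reminder, the standard definitions of precision, recall, and F-measure (the three metrics named on later slides) in terms of confusion-matrix counts are sketched below; the slide may have listed additional metrics, and the function name is hypothetical.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Standard precision, recall, and balanced F-measure from counts."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts for illustration only, not results from the talk.
p, r, f = precision_recall_f(true_pos=8, false_pos=2, false_neg=2)
```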

GFC Results • GFC removes a significant percentage of irrelevant forms • Misclassifies only a few relevant forms (high recall) (figure annotation: Exceptions)

HIFI Performance • HIFI = GFC + DSFC • High recall and precision

HIFI vs. Monolithic Classifier • Configuration 1: content • Configuration 2: structure + content • Chart annotations: high recall; low precision over non-searchable forms; more specific model • Combining classifiers gives the best tradeoff between precision and recall

Sensitivity to Input Quality • Classification accuracy depends on the input quality • Input from two focused crawlers • BFC (Chakrabarti et al., WWW1999)--less focused • FFC (Barbosa & Freire, WebDB 2005)-- more focused

Percentage of Relevant Forms

Sensitivity to Input Quality • Results: F-measure • HIFI is effective for ‘noisy’ input • HIFI performs better for the higher-quality input

Related Work • Identifying searchable forms • Pre-query (Hess and Kushmerick IIWeb 2003; Cope et al. ADC 2003): domain-dependent; manual extraction of form attributes • Post-query (Bergholz and Chidlovskii WISE 2003): requires forms to be automatically submitted • Hierarchical classifiers • Image classification (Heisele et al. Pattern Recognition 2003) • Part-of-speech tagging (Even-Zohar and Roth EMNLP 2001)

Conclusion • Effective and automatic approach to identify forms in a domain • Partition the search space • Construction of simpler and more effective classifiers • Future directions • Handle simple search forms • Use semi-supervised learning to build the DSFC

Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points

Problem Definition Given an online database domain, the goal is to automatically locate forms that serve as entry points to databases in this domain.

Challenge • Online databases are very sparsely distributed on the Web • A content-based focused crawler retrieves only 94 Movie search forms after crawling 100,000 pages • Requirements • Perform a broad search • Avoid visiting unproductive Web regions

Our Approach • Focused crawler • Restricted to a topic • Delayed benefit • Identifies the neighborhood of the forms • Suitable for sparse domains • Online learning • Learning from experience • Adaptive aspect • Removes possible bias in crawler policy

Outline • FFC (Barbosa and Freire, WebDB2005) • Components • Limitations • ACHE • Adaptive component • Automatic feature selection • Experimental Evaluation

FFC • Focuses on a broad topic based on the page content, similar to topic-focused crawlers • Prioritizes links to follow based on hyperlink path patterns, similar to reinforcement-learning-based crawlers • Effective for locating searchable forms (architecture diagram; components: Crawler, Page Classifier, Link Classifier, Frontier Manager, Searchable Form Classifier, Form Database; data flows: Page, Forms, Searchable Forms, Links, (Link, Relevance), Most relevant link)

Page Classifier • Focus on a specific topic based on the page content (figure labels: Web; Off-topic pages; On-topic pages)

Link Classifier • Gives relevance to pages close to form pages • Patterns in the link neighborhood: anchor, URL, text in the proximity of the URL (figure labels: On-topic pages; Form page; Link neighborhood at level 1; Link neighborhood at level 2)

Frontier Manager • Each non-visited link has the expected reward given by Link Classifier • Implements the crawler policy to maximize the expected reward
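A frontier that always hands back the unvisited link with the highest expected reward is naturally modelled as a priority queue. The sketch below is an assumption about how such a component might look, not FFC's actual implementation: the class and method names are hypothetical, and the reward values would come from the Link Classifier.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of unvisited links ordered by expected reward."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: first added wins

    def add(self, url, expected_reward):
        """Queues a link once; repeated URLs are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the reward is negated.
        heapq.heappush(self._heap,
                       (-expected_reward, next(self._order), url))

    def next_link(self):
        """Returns the most promising link, or None when empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url


frontier = Frontier()
frontier.add("http://example.com/a", 0.2)
frontier.add("http://example.com/b", 0.9)
frontier.add("http://example.com/c", 0.5)
frontier.add("http://example.com/b", 0.1)  # duplicate, ignored
visit_order = [frontier.next_link() for _ in range(4)]
```

A purely greedy policy like this one only exploits current estimates; it does not by itself cover the exploration behaviour discussed for the adaptive crawler.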

FFC: Limitations • Requires substantial manual tuning • Features selected manually for the LC • Efficiency is highly dependent on training examples used to build the Link Classifier • Retrieves a large percentage of irrelevant forms

ACHE: Overview (architecture diagram; components: Crawler, Page Classifier, Link Classifier, Frontier Manager, Form Identification, Searchable Form Classifier, Domain-Specific Form Classifier, Adaptive Link Learner, Automatic Feature Selection, Form Database; data flows: Page, Forms, Searchable Forms, Relevant Forms, Links, Form path, (Link, Relevance), Most relevant link)

Adaptive Crawler as a Learning Agent • Behavior generating element (BGE) • Maximize the expected reward (exploitation) • Problem generator (PG) • Suggesting actions that will lead to new experiences even if the benefit is not immediate (exploration) • Critic • Feedback on the success (or failure) of its actions • Online learning • Takes critic’s feedback into account to update the policy used by the BGE.

ACHE as a Learning Agent (ACHE architecture diagram repeated, annotated with the learning-agent roles: Critic; Online Learning Element; BGE + PG)

Adaptive Link Learner • Learns from the successful paths • Effectiveness depends on the accuracy of the HIFI

Automatic Feature Selection • Features from successful paths • Anchor, URL, and text around links • Select the stemmed terms with the highest document frequency (DF) in each feature space • DF is comparable to information gain (IG) and chi-square (Yang and Pedersen, 1997) • Aggressive feature selection • Naive Bayes gives better results with few features (Zheng et al., 2004)
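Selecting terms by document frequency, independently in each feature space, can be sketched in a few lines. The function name, the value of `k`, and the toy token lists are assumptions for illustration; ties are broken alphabetically here only to keep the output deterministic.

```python
from collections import Counter


def select_by_document_frequency(documents, k):
    """Returns the k terms that occur in the most documents."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    return ranked[:k]


# One list of stemmed-token lists per feature space (toy data).
feature_spaces = {
    "anchor": [["search", "job"], ["job", "search", "search"],
               ["job", "home"]],
    "url": [["job", "find"], ["career"], ["job"]],
    "around": [["find", "job", "now"], ["browse", "job"]],
}
selected = {space: select_by_document_frequency(docs, k=2)
            for space, docs in feature_spaces.items()}
```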

Experiments • Evaluating • Effectiveness in retrieving relevant forms • Quality of the features automatically selected by AFS • Online learning in the crawling process • Database domains

Experiments: Crawling strategies
