Combining Classifiers to Identify Online Databases. Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu. The Hidden Web. Web content hidden behind form interfaces Search for books, airfare tickets Not accessible from search engines
Combining Classifiers to Identify Online Databases Luciano Barbosa and Juliana Freire School of Computing University of Utah {lbarbosa,juliana}@cs.utah.edu
The Hidden Web • Web content hidden behind form interfaces • Search for books, airfare tickets • Not accessible from search engines • Millions of online databases - Hsieh et al. SIGMOD 2006 • High-quality content How to leverage this information?
Making the Hidden Web more Accessible: Current Approaches • Database directories (NAR database compilation - Galperin NAR 2007) • Web Integration Systems (Google Base; Chang et al. CIDR 2005) • Hidden-Web crawling (Raghavan & Garcia-Molina VLDB 2001; Barbosa & Freire SBBD 2004)
The Hidden-Web Infrastructure Applications Database Directory Web Integration Systems Hidden Web Crawlers … Hidden-Web Infrastructure Form Repository Barbosa et al. ICDE2007 Form Location Form Clustering Form Identification
Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points
Problem Definition Given a set F of Web forms automatically gathered by a focused crawler in an online database domain D, our goal is to select from F only the forms that are entry points to databases in D.
Challenges • Locate online databases (later!) • Online databases are very sparsely distributed on the Web • Select only “relevant databases”, i.e., filter out non-searchable forms and forms not in the domain • There is great variation in the way Web forms are designed, even within a well-defined domain • High structural variability, heterogeneous vocabulary, vocabulary overlap across domains
Form Variability • Searchable vs. non-searchable [Figure: example of a searchable form and a non-searchable form]
Form Variability • Different domains with similar content Hotel Airfare
Form Variability • Heterogeneity in same domain
Solution Overview: Pruning the Search Space Web Searchable Forms Relevant Forms Pages in the domain Relevant forms Non-relevant forms
HIerarchical Form Identification Locating Forms Identifying Relevant Forms Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Searchable forms Relevant forms Web pages Forms Form structure Form textual content Page textual content HIFI
HIFI: Phase I Locating Forms Identifying Relevant Forms Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Searchable forms Relevant forms Web pages Forms Form structure Form textual content Page textual content HIFI
Looking at Form Structure • Searchable vs. non-searchable [Figure: example of a searchable form and a non-searchable form]
Looking at Form Structure • Searchable forms share similar structure • Statistics about form components • Structural features are good indicators of whether forms are searchable or not
Generic Form Classifier - GFC • 13 features • hidden tags; radios; file inputs; submit tags; image inputs; buttons; resets; password tags; textboxes; “search” inside form tags; items in selection lists; submission method (post or get); text sizes in textboxes
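As a rough illustration of how such structural features could be collected, the sketch below counts form elements with Python's standard `html.parser`. It is not the authors' extractor: the class name, the feature names, and the choice to look for "search" in the attribute values of the `<form>` tag are assumptions, and the text-size feature is omitted.

```python
from html.parser import HTMLParser


class FormFeatureCounter(HTMLParser):
    """Counts structural elements of one HTML form (hypothetical subset of the GFC features)."""

    INPUT_TYPES = ("hidden", "radio", "file", "submit", "image",
                   "button", "reset", "password", "text")

    def __init__(self):
        super().__init__()
        self.features = {kind: 0 for kind in self.INPUT_TYPES}
        self.features.update({"select_items": 0, "has_search": 0, "method_post": 0})

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Submission method (GET is the HTML default when none is given).
            method = (attrs.get("method") or "get").lower()
            self.features["method_post"] = int(method == "post")
            # Assumption: "search" is looked up in the form tag's attribute values.
            joined = " ".join(value or "" for value in attrs.values()).lower()
            self.features["has_search"] = int("search" in joined)
        elif tag == "input":
            # An input without a type attribute behaves as a textbox.
            kind = (attrs.get("type") or "text").lower()
            if kind in self.INPUT_TYPES:
                self.features[kind] += 1
        elif tag == "option":
            self.features["select_items"] += 1


def extract_features(form_html):
    """Return a dict of structural counts for the given form markup."""
    parser = FormFeatureCounter()
    parser.feed(form_html)
    return parser.features
```

The resulting dictionary can be turned into a fixed-order numeric vector and passed to any off-the-shelf learner.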
Generic Form Classifier • Test error • GFC is domain independent • Previous classifiers for identifying searchable forms are domain dependent: they use the content inside tags
HIFI: Phase II Locating Forms Identifying Relevant Forms Focused Crawler Generic Form Classifier Domain-Specific Form Classifier Searchable forms Relevant forms Web pages Forms Form structure Form textual content Page textual content HIFI
Looking at Form Content • Problem of focused crawler + GFC • Co-occurrence of different searchable forms in the domain
Looking at the Form Content
<form id="search_form" name="search" action="http://us.rd.yahoo.com/hotjobs/search/home/*" method="get">
  Search for Jobs Across the Web
  Job Category
  <select tabindex="4" name="industry1" id="industry1">
    <option value="FIN">Accounting/Finance</option>
    <option value="ADV">Advertising/Public Relations</option>
    <option value="ART">Arts/Entertainment/Publishing</option>
    <option value="BAM">Banking/Mortgage</option>
  </select>
  Keyword(s) <input name="keywords_all" id="keywords_all" type="text" value=""> (e.g. Job title, company, occupation)
  City & State or Zip
  <input tabindex="3" type="checkbox" align="left" name="metro_area" id="metro_area" value="1" checked /> Include surrounding cities
  <input type="hidden" name="country1" id="country1" value="USA">
</form>
Domain-Specific Form Classifier - DSFC • Forms in a given domain contain a well-defined and restricted vocabulary [He et al., CIKM 2004] • Uses the textual content that can be automatically extracted from forms • Remove the HTML tags • Vector of the 500 most frequent stemmed words in the training set • Weight in the vector: term frequency
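The text representation described on this slide can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the authors' code: the function names are hypothetical, tags are stripped with a simple regular expression, and stemming is left out to keep the example dependency-free.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")   # crude tag stripper, adequate for a sketch
WORD_RE = re.compile(r"[a-z]+")


def form_terms(form_html):
    """Remove HTML tags and return lowercase word tokens (no stemming here)."""
    text = TAG_RE.sub(" ", form_html)
    return WORD_RE.findall(text.lower())


def build_vocabulary(training_forms, size=500):
    """Return the `size` most frequent terms across all training forms."""
    counts = Counter()
    for html in training_forms:
        counts.update(form_terms(html))
    return [term for term, _ in counts.most_common(size)]


def tf_vector(form_html, vocabulary):
    """Represent one form as raw term frequencies over the vocabulary."""
    counts = Counter(form_terms(form_html))
    return [counts[term] for term in vocabulary]
```

Each form then becomes a fixed-length vector, which is the input shape a learner such as an SVM expects.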
Classifier Creation • Test error in the 8 domains • Best results: SVM
Hierarchical Classification • GFC • Coarse classification: high recall • Domain independent • DSFC • Smaller search space: high precision • Domain specific • Benefits • Simplify the search space • Allows the construction of simpler classifiers • Use appropriate learning techniques for each feature space • Deal with badly formed forms
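The two-stage composition can be expressed as a simple filter chain. The sketch below is only illustrative: `hifi_filter` and the toy predicates are hypothetical stand-ins for trained GFC and DSFC models, and the form dictionaries are made up for the example.

```python
def hifi_filter(forms, gfc, dsfc):
    """Apply the generic classifier first, then the domain-specific one.

    `gfc` and `dsfc` are any callables returning True/False for a form.
    Only forms accepted by `gfc` are ever shown to `dsfc`.
    """
    searchable = [form for form in forms if gfc(form)]
    relevant = [form for form in searchable if dsfc(form)]
    return searchable, relevant


# Toy stand-ins for the trained classifiers (assumptions, not the real models).
def toy_gfc(form):
    return form["passwords"] == 0          # structure only


def toy_dsfc(form):
    return "job" in form["text"].split()   # text only


forms = [
    {"id": "login", "passwords": 1, "text": "user password"},
    {"id": "jobs", "passwords": 0, "text": "job category keyword"},
    {"id": "hotel", "passwords": 0, "text": "hotel check in city"},
]
searchable, relevant = hifi_filter(forms, toy_gfc, toy_dsfc)
print([f["id"] for f in searchable], [f["id"] for f in relevant])
```

Because each stage sees only one feature space, the two classifiers can be trained and replaced independently.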
Experiments • Assess the quality of HIFI • In 8 representative domains--variation in form structure, vocabulary, size (details in paper) • Over different inputs • Effectiveness of monolithic classifier vs. HIFI
Evaluation Metrics [Figure: confusion matrix with true positives, false positives, false negatives, and true negatives]
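The formulas on this slide did not survive extraction. As a reminder, the standard definitions of precision, recall, and F-measure in terms of these counts can be computed as follows; the function name is an assumption, and the deck may have used additional metrics not shown here.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Here a true positive is a relevant form that the classifier accepts, a false positive is an irrelevant form it accepts, and a false negative is a relevant form it rejects.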
GFC Results • GFC removes a significant percentage of irrelevant forms • Misclassifies only a few relevant forms (high recall) [Chart callout: Exceptions]
HIFI Performance • HIFI = GFC + DSFC • High recall and precision
HIFI vs. Monolithic Classifier • Configuration 1 • Content • Configuration 2 • Structure + content [Chart callouts: High recall; Low precision over non-searchable forms; More specific model] Combining classifiers gives the best tradeoff between precision and recall
Sensitivity to Input Quality • Classification accuracy depends on the input quality • Input from two focused crawlers • BFC (Chakrabarti et al., WWW1999)--less focused • FFC (Barbosa & Freire, WebDB 2005)-- more focused
Sensitivity to Input Quality • Results: F-Measure • HIFI is effective for ‘noisy’ input • HIFI performs better for the higher-quality input
Related Work • Identifying searchable forms • Pre-query (Hess and Kushmerick IIWeb 2003; Cope et al. ADC 2003) • Domain-dependent; manual extraction of form attributes • Post-query (Bergholz and Chidlovskii WISE 2003) • Require forms to be automatically submitted • Hierarchical classifiers • Image classification (Heisele et al. Pattern Recognition 2003) • Part-of-speech tagging (Even-Zohar and Roth EMNLP 2001)
Conclusion • Effective and automatic approach to identify forms in a domain • Partition the search space • Construction of simpler and more effective classifiers • Future directions • Handle simple search forms • Use semi-supervised learning to build the DSFC
Outline • Combining Classifiers to Identify Online Databases • An Adaptive Crawler for Locating Hidden-Web Entry Points
Problem Definition Given an online database domain, automatically locate the forms that serve as entry points to databases in this domain.
Challenge • Online databases are very sparsely distributed on the Web • A content-based focused crawler retrieves only 94 Movie search forms after crawling 100,000 pages • Requirements • Perform a broad search • Avoid visiting unproductive Web regions
Our Approach • Focused crawler • Restricted to a topic • Delayed benefit • Identifies the neighborhood of the forms • Suitable for sparse domains • Online learning • Learning from experience • Adaptive aspect • Removes possible bias in crawler policy
Outline • FFC (Barbosa and Freire, WebDB2005) • Components • Limitations • ACHE • Adaptive component • Automatic feature selection • Experimental Evaluation
FFC • Focuses on a broad topic based on the page content - similar to topic-focused crawlers • Prioritizes links to follow based on hyperlink path patterns - similar to reinforcement-learning-based crawlers • Effective for locating searchable forms Searchable Forms Form Database Searchable Form Classifier Page Forms Page Classifier Crawler Most relevant link Links (Link, Relevance) Link Classifier Frontier Manager
Page Classifier • Focus on a specific topic based on the page content Web Off-topic pages On-topic pages
Link Classifier • Gives relevance to pages close to form pages • Patterns in the link neighborhood: anchor, URL, text in the proximity of the URL [Figure: on-topic pages, a form page, and its link neighborhoods at level 1 and level 2]
Frontier Manager • Each non-visited link has the expected reward given by Link Classifier • Implements the crawler policy to maximize the expected reward
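One common way to realize such a policy is a priority queue keyed by expected reward. The sketch below uses Python's `heapq` and is an assumption about the data structure, not a description of the actual FFC implementation; the class and method names are hypothetical.

```python
import heapq
import itertools


class Frontier:
    """Keeps unvisited links ordered by the reward estimated by a link classifier."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: earlier links first

    def add(self, url, expected_reward):
        """Queue a link once; links that were already added are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the reward is negated to pop the best link first.
        heapq.heappush(self._heap, (-expected_reward, next(self._counter), url))

    def next_link(self):
        """Return the link with the highest expected reward, or None if empty."""
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url
```

A crawler loop would call `next_link()`, fetch the page, score its outgoing links, and feed them back with `add()`.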
FFC: Limitations • Requires substantial manual tuning • Features selected manually for the LC • Efficiency is highly dependent on training examples used to build the Link Classifier • Retrieves a large percentage of irrelevant forms
ACHE: Overview Searchable Forms Relevant Forms Form Database Searchable Form Classifier Domain-Specific Form Classifier Page Forms Page Classifier Crawler Form Identification Most relevant link Links Form path (Link, Relevance) Adaptive Link Learner Automatic Feature Selection Link Classifier Frontier Manager
Adaptive Crawler as a Learning Agent • Behavior generating element (BGE) • Maximize the expected reward (exploitation) • Problem generator (PG) • Suggesting actions that will lead to new experiences even if the benefit is not immediate (exploration) • Critic • Feedback on the success (or failure) of its actions • Online learning • Takes critic’s feedback into account to update the policy used by the BGE.
ACHE as a Learning Agent Searchable Forms Relevant Forms Form Database Searchable Form Classifier Domain-Specific Form Classifier Page Forms Page Classifier Crawler Form Identification Most relevant link Links Form path (Link, Relevance) Adaptive Link Learner Automatic Feature Selection Link Classifier Frontier Manager Critic Online Learning Element BGE + PG
Adaptive Link Learner • Learns from the successful paths • Effectiveness depends on the accuracy of HIFI
Automatic Feature Selection • Features from successful paths • anchor, URL, and text around links • Select the stemmed terms with the highest document frequency (DF) in each feature space • DF is comparable to information gain (IG) and Chi-square (Yang and Pedersen, 1997) • Aggressive feature selection • Naive Bayes gives better results with few features (Zheng et al., 2004)
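Document-frequency selection is straightforward to sketch: count in how many documents each term appears and keep the top k. The function name and the alphabetical tie-break below are assumptions, and stemming is assumed to have been applied to the tokens beforehand.

```python
from collections import Counter


def select_by_document_frequency(documents, k):
    """Return the k terms that occur in the largest number of documents.

    `documents` is a list of token lists, e.g. the anchor texts of links on
    successful paths. Ties are broken alphabetically for determinism.
    """
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # count each term at most once per document
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Run once per feature space (anchor, URL, text around links).
anchors = [["search", "jobs"], ["search", "search", "careers"], ["jobs", "search"]]
print(select_by_document_frequency(anchors, 2))
```

Running the selection separately on each feature space keeps a frequent URL token from crowding out informative anchor terms.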
Experiments • Evaluating • Effectiveness in retrieving relevant forms • Quality of the features automatically selected by AFS • Online learning in the crawling process • Database domains