Deep-Web Crawling and Related Work Matt Honeycutt CSC 6400
Outline • Basic background information • Google’s Deep-Web Crawl • Web Data Extraction Based on Partial Tree Alignment • Bootstrapping Information Extraction from Semi-structured Web Pages • Crawling Web Pages with Support for Client-Side Dynamism • DeepBot: A Focused Crawler for Accessing Hidden Web Content
Background • Publicly-Indexable Web (PIW) • Web pages exposed by standard search engines • Pages link to one another • Deep-web • Content behind HTML forms • Database records • Estimated to be much larger than PIW • Estimated to be of higher quality than PIW
Google’s Deep-Web Crawl J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, A. Halevy
Summary • Describes process implemented by Google • Goal is to ‘surface’ content for indexing • Contributions: • Informativeness test • Query selection techniques and algorithm for generating appropriate text inputs
About the Google Crawler • Estimates that there are ~10 million high-quality HTML forms • Index representative deep-web content across many forms, driving search traffic to the deep-web • Two problems: • Which inputs to fill in? • What values to use?
Query Templates • Correspond to SQL-like queries: select * from D where P • First problem is to select the best templates • Second problem is to select the best values for those templates • Want to ignore presentation-related fields
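As a concrete illustration (not Google’s code), a query template can be thought of as a choice of “binding” inputs whose values vary while every other input is held at a default; each combination of binding values yields one GET submission URL. The form URL and input names below are made up.

```python
# Illustrative sketch: enumerate the submission URLs of one query template.
from itertools import product
from urllib.parse import urlencode

def template_submissions(action_url, binding_inputs, defaults):
    """binding_inputs: dict mapping input name -> list of candidate values
    defaults:       dict of fixed values for the non-binding inputs"""
    names = list(binding_inputs)
    for combo in product(*(binding_inputs[n] for n in names)):
        params = dict(defaults)
        params.update(zip(names, combo))
        yield f"{action_url}?{urlencode(params)}"

# Hypothetical form: the template binds 'state' and 'category', leaving a
# presentation-related input (sort order) at its default value.
for url in template_submissions(
        "http://example.com/search",
        {"state": ["TX", "TN"], "category": ["parks", "museums"]},
        {"sort": "relevance"}):
    print(url)
```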
Incremental Search for Informative Query Templates • Classify templates as either informative or uninformative • A template is informative if it generates pages sufficiently distinct from those of other templates • Build more complex templates from simpler informative ones • Signatures are computed for each result page
Informativeness Test • T is informative if the distinct signatures of its result pages make up a sufficiently large fraction of its generated submissions • Heuristically limit to templates with 10,000 or fewer possible submissions and no more than 3 dimensions • Can estimate informativeness using a sample of the possible queries (e.g., 200)
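A minimal sketch of both pieces, the informativeness check on this slide and the incremental template search from the previous one. The helpers `fetch` (download a result page) and `signature` (compute a formatting-insensitive content fingerprint) are hypothetical stand-ins, and the 0.25 threshold is an assumed value, not the paper’s exact constant.

```python
import random

def is_informative(submission_urls, fetch, signature,
                   sample_size=200, threshold=0.25):
    """A template is informative if a sample of its submissions yields a
    sufficiently large fraction of distinct page signatures."""
    sample = random.sample(submission_urls,
                           min(sample_size, len(submission_urls)))
    sigs = {signature(fetch(url)) for url in sample}
    return len(sigs) / len(sample) >= threshold

def isit(inputs, make_submissions, fetch, signature, max_dim=3):
    """Incremental search (sketch): start with single-input templates and
    only extend templates that test as informative."""
    informative = []
    frontier = {frozenset([i]) for i in inputs}
    while frontier:
        next_frontier = set()
        for template in frontier:
            urls = list(make_submissions(template))
            if urls and is_informative(urls, fetch, signature):
                informative.append(template)
                if len(template) < max_dim:
                    next_frontier.update(template | {i}
                                         for i in inputs if i not in template)
        frontier = next_frontier
    return informative
```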
Observations • URLs generated from larger templates are not as useful • ISIT generates far fewer URLs than the full Cartesian product (CP) but still achieves high coverage • The most common reason an informative template could not be found: JavaScript • Ignoring JavaScript errors, informative templates were found for 80% of the forms tested
Generating Input Values • Text boxes may be typed or untyped • Special rules handle a small number of common typed inputs • Generic keyword lists don’t work well; the best keywords are site-specific • Select seed keywords from the form page, then iterate, selecting candidate keywords from the results using TF-IDF • Results are clustered and representative keywords are chosen for each cluster, ranked by page length • Once candidate keywords have been selected, treat text inputs as select inputs
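A rough sketch of the iterative keyword-selection loop under stated assumptions: `fetch_results` and `tokenize` are hypothetical helpers, and the paper’s clustering and page-length ranking steps are omitted; only the TF-IDF ranking of candidate keywords is shown.

```python
import math
from collections import Counter

def tf_idf_candidates(result_pages, tokenize, top_k=25):
    """Rank words that are frequent within a result page but rare across
    pages, using a standard TF-IDF weighting."""
    docs = [Counter(tokenize(page)) for page in result_pages]
    df = Counter(word for doc in docs for word in doc)
    n = len(docs)
    scores = Counter()
    for doc in docs:
        for word, tf in doc.items():
            scores[word] = max(scores[word], tf * math.log(n / df[word]))
    return [word for word, _ in scores.most_common(top_k)]

def iterate_keywords(seed_words, fetch_results, tokenize, rounds=2):
    """Submit the current keyword set, mine new candidates from the
    results, and repeat for a fixed number of rounds."""
    keywords = set(seed_words)
    for _ in range(rounds):
        pages = [p for word in keywords for p in fetch_results(word)]
        keywords.update(tf_idf_candidates(pages, tokenize))
    return keywords
```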
Conclusions • Describes the innovations of “the first large-scale deep-web surfacing system” • Results are already integrated into Google • The informativeness test is a useful building block • No need to cover individual sites completely • Heuristics for common input types are useful • Future work: support for JavaScript and handling dependencies between inputs • Limitation: only supports GET requests
Web Data Extraction Based on Partial Tree Alignment Yanhong Zhai Bing Liu
Summary • Novel technique for extracting data from record lists: DEPTA (Data Extraction based on Partial Tree Alignment) • Automatically identifies records and aligns their fields • Overcomes limitations of existing techniques
Approach • Step 1: Build tag tree • Step 2: Segment page to identify data regions • Step 3: Identify data records within the regions • Step 4: Align records to identify fields • Step 5: Extract fields into common table
Building the Tag Tree and Finding Data Regions • Computes bounding regions for each element • Associate items to parents based on containment to build tag tree • Next, compare tag strings with edit distance to find data regions • Finally, identify records within regions
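For illustration, the tag-string comparison can be done with ordinary edit distance over sequences of tag names; the similarity threshold below is an assumption for the sketch, not the value used in DEPTA.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance over sequences of tag names."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

def similar(tags_a, tags_b, threshold=0.7):
    """Two tag strings are 'similar' when their normalized edit distance
    is small enough to suggest they render records of the same kind."""
    longest = max(len(tags_a), len(tags_b), 1)
    return 1 - edit_distance(tags_a, tags_b) / longest >= threshold

print(similar(["tr", "td", "td", "td"], ["tr", "td", "td"]))  # True
```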
Partial Tree Alignment • Tree matching is expensive • Simple Tree Matching – faster, but not as accurate • The longest record tree becomes the seed • Fields that don’t match are inserted into the seed tree • Finally, field values are extracted and inserted into a table
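A compact sketch of the Simple Tree Matching recursion that partial tree alignment builds on: it counts the largest set of matching node pairs between two trees without crossing levels or reordering siblings. Trees are represented here as simple (tag, children) tuples for illustration.

```python
def simple_tree_match(a, b):
    """Return the size of the maximum matching between trees a and b,
    where each tree is a (tag, [children]) tuple."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w[i][j] = max(w[i - 1][j], w[i][j - 1],
                          w[i - 1][j - 1]
                          + simple_tree_match(kids_a[i - 1], kids_b[j - 1]))
    return w[m][n] + 1  # +1 for the matched roots

t1 = ("tr", [("td", []), ("td", []), ("td", [])])
t2 = ("tr", [("td", []), ("td", [])])
print(simple_tree_match(t1, t2))  # 3
```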
Conclusions • Surpasses previous work (MDR) • Capable of extracting data very accurately • Recall: 98.18% • Precision: 99.68%
Bootstrapping Information Extraction from Semi-structured Web Pages A. Carlson C. Schafer
Summary • Method for extracting structured records from web pages • Method requires very little training and achieves good results in two domains
Introduction • Extracting structured fields enables advanced information retrieval scenarios • Much previous work has been site-specific or required substantial manual labeling • Heuristic-based approaches have not had great success • Uses semi-supervised learning to extract fields from web pages • User only has to label 2-5 pages for each of 4-6 sites
Technical Approach • Human specifies domain schema • Labels training records from representative sites • Utilizes partial tree alignment to acquire additional records for each site • New records are automatically labeled • Learns regression model that predicts mappings from fields to schema columns
Mapping Fields to Columns • Calculate score between each field and column • Score based on field contexts and contexts observed in training • Most probable mapping above a threshold is accepted
Feature Types • Precontext 3-grams • Lowercase value tokens • Lowercase value 3-grams • Value token type categories
Scoring • Field-to-column mappings are scored by comparing feature distributions • One distribution is computed from the training contexts • The other is computed from the observed contexts • Completely dissimilar field/column pairs are maximally divergent • Identical field/column pairs have zero divergence • Per-feature similarities are combined using a “stacked” linear regression model • Weights for the model are learned in training
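A hedged sketch of this kind of scoring, not the paper’s exact formulas: per-feature-type distributions are compared with Jensen-Shannon divergence (an assumed choice of divergence), and the resulting similarities are combined with learned linear weights.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions given
    as {feature: probability} dicts (0 = identical, 1 = disjoint with
    log base 2)."""
    keys = set(p) | set(q)
    p_full = {k: p.get(k, 0.0) for k in keys}
    q_full = {k: q.get(k, 0.0) for k in keys}
    m = {k: 0.5 * (p_full[k] + q_full[k]) for k in keys}
    def kl(x):
        return sum(x[k] * math.log2(x[k] / m[k]) for k in keys if x[k] > 0)
    return 0.5 * kl(p_full) + 0.5 * kl(q_full)

def mapping_score(field_dists, column_dists, weights, bias=0.0):
    """Linear combination of per-feature-type similarities (1 - JSD),
    with weights assumed to come from the learned regression model."""
    return bias + sum(w * (1 - js_divergence(field_dists[t], column_dists[t]))
                      for t, w in weights.items())
```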
Crawling Web Pages with Support for Client-Side Dynamism Manuel Alvarez Alberto Pan Juan Raposo Justo Hidalgo
Summary • Advanced crawler based on browser automation • NSEQL – a language for specifying browser actions • Stores each URL along with the navigation route needed to reach it
Limitations of Typical Crawlers • Built on low-level HTTP APIs • Limited or no support for client-side scripts • Limited support for sessions • Can only see what’s in the HTML
Their Crawler’s Features • Built on “mini web browsers” – MSIE Browser Control • Handles client-side JavaScript • Routes fully support sessions • Limited form-handling capabilities
Identifying New Routes • Routes can come from links, forms, and JavaScript • ‘href’ attributes are extracted from normal anchor tags • Tags with JavaScript click events are identified and “clicked” • The resulting navigation actions are captured and inspected for new routes
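The crawler itself drives MSIE browser controls through NSEQL; purely as an illustrative stand-in (not the paper’s implementation), the sketch below uses Selenium to show the same idea of harvesting routes from both href attributes and elements carrying JavaScript click handlers. The example URL is hypothetical.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_routes(driver, page_url):
    """Gather candidate routes: plain hrefs plus the URLs reached by
    'clicking' elements that declare an onclick handler."""
    driver.get(page_url)
    routes = [("link", a.get_attribute("href"))
              for a in driver.find_elements(By.TAG_NAME, "a")
              if a.get_attribute("href")]
    clickable_count = len(driver.find_elements(By.CSS_SELECTOR, "[onclick]"))
    for index in range(clickable_count):
        # Re-locate after every navigation so the element reference is fresh.
        element = driver.find_elements(By.CSS_SELECTOR, "[onclick]")[index]
        element.click()
        routes.append(("script", driver.current_url))
        driver.get(page_url)  # return to the starting page before the next click
    return routes

# Usage (hypothetical URL):
# driver = webdriver.Chrome()
# print(collect_routes(driver, "http://example.com/catalog"))
# driver.quit()
```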
Results and Conclusions • Large-scale websites are generally crawler-friendly • Many medium-scale, deep-web sites aren’t • Crawlers should handle client-side scripting • The presented crawler has been applied to real-world applications
DeepBot: A Focused Crawler for Accessing Hidden Web Content Manuel Alvarez Juan Raposo Alberto Pan
Summary • Presents a focused deep-web crawler • Extension of previous work • Crawls links and handles search forms
Domain Definitions • Attributes a1…aN • Each attribute has name, aliases, specificity index • Queries q1…qN • Each query contains 1 or more (attribute,value) pairs • Relevance threshold
Evaluating Forms • Obtains bounding coordinates of all form fields and potential labels • Distances and angles computed between fields and labels
Evaluating Forms • If label l is within min-distance of field f, l is added to f’s list • Ties are broken using angle • Lists are pruned so that labels appear in only one list and all fields have at least one possible label
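A sketch of the geometric heuristic under stated assumptions: bounding boxes are compared by center-to-center distance, with the angle available for tie-breaking; the distance threshold below is invented for illustration.

```python
import math

def center(box):  # box = (left, top, right, bottom)
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def distance_and_angle(field_box, label_box):
    """Distance and angle (degrees) from the field's center to the label's."""
    fx, fy = center(field_box)
    lx, ly = center(label_box)
    dx, dy = lx - fx, ly - fy
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

def candidate_labels(field_box, label_boxes, min_distance=150):
    """Return labels within min_distance of the field, nearest first,
    with angle breaking distance ties. label_boxes maps text -> box."""
    scored = [(dist, angle, text)
              for text, box in label_boxes.items()
              for dist, angle in [distance_and_angle(field_box, box)]
              if dist <= min_distance]
    return sorted(scored)
```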
Evaluating Forms • Text similarity measures used to link domain attributes to fields • Computes relevance of form • If form score exceeds relevance threshold, DeepBot executes queries
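A plausible sketch of this step, not DeepBot’s exact formula: attribute names and aliases are matched against field labels with a simple token-overlap similarity, matched attributes contribute their specificity index, and the total is compared against the relevance threshold. The book attributes and threshold values are made up for illustration.

```python
def token_similarity(a, b):
    """Jaccard overlap between lowercased token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def form_relevance(field_labels, attributes, match_threshold=0.5):
    """attributes: list of dicts with 'name', 'aliases', 'specificity'."""
    score = 0.0
    for attr in attributes:
        names = [attr["name"]] + attr.get("aliases", [])
        best = max((token_similarity(n, label)
                    for n in names for label in field_labels), default=0.0)
        if best >= match_threshold:
            score += attr["specificity"]
    return score

book_attrs = [{"name": "title", "aliases": ["book title"], "specificity": 0.3},
              {"name": "author", "aliases": ["written by"], "specificity": 0.4},
              {"name": "isbn", "aliases": [], "specificity": 0.9}]
print(form_relevance(["Title", "Author name"], book_attrs))  # 0.7
```

If the returned score exceeds the domain’s relevance threshold, the form is considered relevant and the crawler executes the domain queries against it.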
Results and Conclusions • Evaluated on three domain tasks: book, music, and movie shopping • Achieves very high precision and recall • Errors due to: • Missing aliases • Forms with too few fields to achieve minimum support • Sources that did not label fields