Bootstrapping information extraction from semi-structured web pages. Andrew Carlson and Charles Schafer. Abstract. A system that requires no human supervision. Previous work: Required significant human effort. Their solution: Requires 2-5 annotated pages from 4-6 web sites to train the model
Bootstrapping information extraction from semi-structured web pages Andrew Carlson and Charles Schafer
Abstract • A system that requires no human supervision • Previous work: • Required significant human effort • Their solution: • Requires 2-5 annotated pages from 4-6 web sites to train the model • No human supervision for the target web site • Result: • 83.8% and 91.1% for different sites.
Introduction • Extracting structured records from detail pages of semi-structured web pages
Introduction • Why the semi-structured web? • Great source of information • Attribute/value structure: useful for downstream learning or querying systems
Related Work • Problems of Previous Work • No labeling of example pages, but manual labeling of the output • Irrelevant fields (20 data fields and 7 schema columns) • DeLa system (automatically labels extracted data) • Problems of labeling detected data fields • A data field may not have a label • Multiple fields of the same data type
Methods • Terms: • Domain schema: a set of attributes • Schema column: a single attribute • Detail page: a page that corresponds to a single data record • Data field: a location within a template for that site • Data value: an instance of that data field
Methods • Detecting Data Fields • Partial Tree Alignment Algorithm
Methods • Classifying Data Fields • Assign a score to each schema column • c: data values => data for the training schema column • f: data fields => contexts from the training data • Compute the score: • Use a classifier to map data fields to schema columns • Use a model • K different feature types
Methods • Feature Types • Precontext character 3-grams • Lowercase value tokens • Lowercase value character 3-grams • Value token types
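The four feature types above can be illustrated with a minimal Python sketch. The slide names the feature types but not how they are computed, so the tokenizer, the token-type categories (`NUM`, `ALLCAPS`, and so on), and the function names here are all assumptions made for illustration.

```python
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def token_type(token):
    """Map a token to a coarse type (hypothetical categories)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        if token.isupper():
            return "ALLCAPS"
        if token[0].isupper():
            return "Capitalized"
        return "lower"
    if token.isalnum():
        return "ALNUM"
    return "PUNCT"


def extract_features(precontext, value):
    """Count features of each of the four types for one data value.

    precontext: the text that precedes the value on the page.
    value: the text of the data value itself.
    """
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", value)
    return {
        "precontext_char_3grams": Counter(char_ngrams(precontext)),
        "lower_value_tokens": Counter(t.lower() for t in tokens),
        "lower_value_char_3grams": Counter(char_ngrams(value.lower())),
        "value_token_types": Counter(token_type(t) for t in tokens),
    }


feats = extract_features("Price: ", "$1,200 per Week")
```

Each feature type yields its own count distribution, which lets the per-type distributions of a data field be compared with those of a schema column.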
Methods • Comparing Distributions of Feature Values • Advantages • Similar data values • Avoids over-fitting • with high-dimensional feature spaces • with a small number of training examples
Methods • KL-Divergence • Smoothed version • Skew Similarity Score
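The slide lists these three quantities without their formulas. The sketch below assumes the standard definitions: KL(p || q) = sum over x of p(x) log(p(x)/q(x)), and the skew divergence KL(p || alpha*q + (1 - alpha)*p) as its smoothed version. The value alpha = 0.99, the direction of the divergence, and the use of the negated divergence as the similarity score are assumptions, not details taken from the slides.

```python
import math
from collections import Counter


def normalize(counts):
    """Turn a Counter of feature counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def kl_divergence(p, q):
    """KL(p || q); infinite when q gives zero mass to something p has seen."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


def skew_divergence(p, q, alpha=0.99):
    """Smoothed KL: mix a little of p into q so the result stays finite."""
    support = set(p) | set(q)
    mix = {x: alpha * q.get(x, 0.0) + (1 - alpha) * p.get(x, 0.0)
           for x in support}
    return kl_divergence(p, mix)


def skew_similarity(p, q, alpha=0.99):
    """Assumed similarity score: the negated skew divergence."""
    return -skew_divergence(p, q, alpha)


p = normalize(Counter({"a": 1, "b": 1}))
q = normalize(Counter({"a": 1, "c": 1}))
```

With these two toy distributions, plain KL divergence is infinite because `q` never saw `"b"`, while the skew divergence stays finite, which is the reason for smoothing when feature distributions are sparse.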
Methods • Combining Skew Similarity Scores • Combine skew similarity scores for the different feature types using a linear regression model • Stacked classifier model • Labeling the Target Site • For each schema column c, choose the data field with the highest score
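A rough sketch of the combination step, assuming each (data field, schema column) pair is described by K per-feature-type similarity scores and that training pairs carry a target of 1 for a correct mapping and 0 otherwise. The gradient-descent fit, the toy numbers, and the names `fit_linear` and `label_target_site` are illustrative assumptions, not the authors' implementation.

```python
def fit_linear(X, y, lr=0.1, epochs=2000):
    """Fit y ~ b + w.x by batch gradient descent on squared error."""
    k, n = len(X[0]), len(X)
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = b + sum(wj * xj for wj, xj in zip(w, xi)) - yi
            grad_b += err
            for j in range(k):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b


def combined_score(w, b, sims):
    """Combine the K similarity scores into a single score."""
    return b + sum(wj * s for wj, s in zip(w, sims))


def label_target_site(w, b, pair_sims):
    """For each schema column, pick the data field with the highest score.

    pair_sims maps (field, column) to its list of K similarity scores.
    """
    best = {}
    for (field, column), sims in pair_sims.items():
        score = combined_score(w, b, sims)
        if column not in best or score > best[column][1]:
            best[column] = (field, score)
    return {column: field for column, (field, _) in best.items()}


# Toy training pairs from annotated sites (K = 2, made-up values).
X = [[-0.1, -0.2], [-2.0, -1.5], [-0.3, -0.1], [-1.8, -2.2]]
y = [1, 0, 1, 0]
w, b = fit_linear(X, y)

# Hypothetical similarity scores for an unlabeled target site.
target = {
    ("f1", "price"): [-0.2, -0.1],
    ("f2", "price"): [-1.9, -2.0],
    ("f1", "title"): [-2.1, -1.7],
    ("f2", "title"): [-0.2, -0.3],
}
labels = label_target_site(w, b, target)
```

The regression only has to learn how much weight each feature type deserves, so a handful of annotated sites can be enough to fit it.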
Evaluation • Accuracy of automatically labeling new sites • How well it makes recommendations to human annotators • Input: a collection of annotated sites for a domain • Method: cross-validation
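The cross-validation setup can be sketched as follows, assuming each fold holds out one annotated site as the unlabeled target and trains on the rest. The slide does not state the fold structure or the exact accuracy definition, so both are assumptions, as are the function names.

```python
def leave_one_site_out(sites):
    """Yield (training_sites, target_site) pairs, holding out each site once."""
    for i, target in enumerate(sites):
        yield sites[:i] + sites[i + 1:], target


def labeling_accuracy(predicted, gold):
    """Fraction of gold schema columns mapped to the correct data field."""
    if not gold:
        return 0.0
    hits = sum(1 for column, field in gold.items()
               if predicted.get(column) == field)
    return hits / len(gold)


folds = list(leave_one_site_out(["site_a", "site_b", "site_c"]))
acc = labeling_accuracy({"price": "f1", "title": "f3"},
                        {"price": "f1", "title": "f2"})
```

Holding out a whole site, rather than individual pages, matches the goal of labeling a new site with no supervision on it.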
Identifying Missing Schema Columns • Vacation rentals: 80.0% • Job sites: 49.3%