Bootstrapping Information Extraction from Semi-Structured Web Pages

Andy Carlson (Machine Learning Department, Carnegie Mellon), Charles Schafer (Google Pittsburgh). ECML/PKDD 2008.

Semi-Structured Web Pages: Vacation Rentals

Semi-Structured Web Pages: Nobel Prize Winners

Semi-Structured Web Pages: Museum Collections

Structured Data

Structured data enables better search interfaces

Supervised Information Extraction
Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.

Bootstrapping IE from Semi-Structured Web Pages
Assume that we have wrappers for a number of sites in a domain, and thus many records from those sites. Can we use what we’ve learned to automatically wrap a new site in the same domain?

From unlabeled pages to DOM trees
[Figure: unlabeled pages from a new site, each parsed into a DOM tree with <html>, <body>, <h1>, <img>, <h4>, and <div> nodes and text leaves.]
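
One way to turn a fetched page into a DOM tree like those pictured is sketched below with Python's standard `html.parser` module. The slides do not say which parser the authors used; the `Node` class, the `VOID_TAGS` set, and the `text_paths` helper are illustrative names introduced here, not part of the original system.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: a tag name, optional text, and child nodes."""

    def __init__(self, tag, text=None):
        self.tag = tag
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_paths(node, prefix=""):
    """Return (path, text) pairs for every text leaf, in document order."""
    path = f"{prefix}/{node.tag}"
    if node.tag == "#text":
        return [(path, node.text)]
    pairs = []
    for child in node.children:
        pairs.extend(text_paths(child, path))
    return pairs


page = "<html><body><h1>Cabin</h1><img src='a.png'><div>Bedrooms: 3</div></body></html>"
for path, text in text_paths(parse_dom(page)):
    print(path, "->", text)
```

The text leaves and their tag paths are the raw material for the later steps: pages from the same site tend to share paths, which is what makes aligning their trees into a template plausible.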

From DOM trees to template tree
[Figure: tree alignment merges the DOM trees of the individual pages into a single template tree.]

Supervised setting: Labels from user annotations
[Figure: labels learned from user annotations turn the generalized template into a generalized extraction template.]

Bootstrapping setting: Labels from classifiers
[Figure: classifiers label the data fields of the generalized template, producing a generalized extraction template. Example field values shown include city names (Las Vegas, New York, Miami, Palm Springs, Boston) and bedroom counts preceded by the context "Bedrooms:".]

Framing the classification problem
[Figure: example field values from training Sites A, B, and C: city names (Houston, Las Vegas, San Jose, Atlanta, ...), amenities (Grill, DVD Player, Heated Pool, Deck, ...), dates, prices, phone-number fragments, and bedroom and bathroom counts, with contexts such as "Description:", "Amenities:", and "Bedrooms:". One group is labeled "Other".]

Comparing fields: Feature types
Content:
• Tokens: split on tokens because many data types have a characteristic vocabulary but token order is not important.
• Character 3-grams: useful for matching “fulltime” and “full-time”.
• Token types (all digits, all caps, etc.): helpful for addresses, unique IDs, and other fields with a mix of token types.
Context:
• Precontext character 3-grams: sites vary their wordings, but often use variants of the same words.
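
The four feature types can be sketched as counts over a field's values and the text preceding them. This is a minimal illustration under assumptions: the tokenizer regex, the token-type names, and the function names are choices made here, since the slides do not specify them.

```python
import re
from collections import Counter


def tokenize(value):
    """Split a value into word and punctuation tokens, keeping case."""
    return re.findall(r"\w+|[^\w\s]", value)


def char_trigrams(text):
    """All overlapping 3-character substrings of the lowercased text."""
    s = text.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def token_type(token):
    """Map a token to a coarse type (the type inventory is an assumption)."""
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha() and token.isupper():
        return "ALL_CAPS"
    if token.isalpha() and token[0].isupper():
        return "CAPITALIZED"
    if token.isalpha():
        return "LOWERCASE"
    if token.isalnum():
        return "MIXED_ALNUM"
    return "OTHER"


def field_features(values, precontexts):
    """Count each feature type over all values of one data field.

    values: the strings extracted for the field, one per record.
    precontexts: the text that precedes each value on its page.
    """
    feats = {
        "tokens": Counter(),
        "char_3grams": Counter(),
        "token_types": Counter(),
        "precontext_3grams": Counter(),
    }
    for value in values:
        toks = tokenize(value)
        feats["tokens"].update(t.lower() for t in toks)
        feats["char_3grams"].update(char_trigrams(value))
        feats["token_types"].update(token_type(t) for t in toks)
    for context in precontexts:
        feats["precontext_3grams"].update(char_trigrams(context))
    return feats


shared = set(char_trigrams("fulltime")) & set(char_trigrams("full-time"))
print(sorted(shared))
```

The last two lines show why character 3-grams help: “fulltime” and “full-time” are different tokens but share most of their 3-grams.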

Naïve classification attempt
Logistic Regression:
• Each data field from training sites is a labeled instance for each schema column.
• Use the features we just described.
Problems:
• Tens of training instances.
• Tens of thousands of features.
• Serious overfitting.

Coarser Features: Distributional similarity
• Treat each field as a distribution of values.
• Compute distributional similarity for each feature type.
• Smooth and normalize to Skew Similarity.
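
The formula on this slide did not survive in the transcript, so the following is only a sketch of one plausible reading. It assumes the skew divergence of Lee (1999), s_α(q, r) = KL(r ∥ αq + (1 − α)r), with additive smoothing; the mapping exp(−divergence) used to turn the divergence into a similarity in (0, 1] is a stand-in chosen here, not the authors' normalization. The function names and the values of `alpha` and `smoothing` are likewise assumptions.

```python
import math
from collections import Counter


def to_distribution(counts, vocab, smoothing=0.01):
    """Additively smoothed probability distribution over a shared vocabulary."""
    total = sum(counts.values()) + smoothing * len(vocab)
    return {k: (counts.get(k, 0) + smoothing) / total for k in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions with the same keys and q > 0."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1 - alpha)*p); finite even where q is tiny."""
    mix = {k: alpha * q[k] + (1 - alpha) * p[k] for k in p}
    return kl_divergence(p, mix)


def skew_similarity(counts_a, counts_b, alpha=0.99, smoothing=0.01):
    """Similarity in (0, 1]; exp(-divergence) is an assumed normalization."""
    vocab = set(counts_a) | set(counts_b)
    if not vocab:
        return 0.0
    p = to_distribution(counts_a, vocab, smoothing)
    q = to_distribution(counts_b, vocab, smoothing)
    return math.exp(-skew_divergence(p, q, alpha))


bedrooms_site_a = Counter({"3": 4, "2": 3, "4": 1})
bedrooms_site_b = Counter({"2": 5, "3": 2, "5": 1})
cities_site_b = Counter({"boston": 3, "miami": 2})

print(skew_similarity(bedrooms_site_a, bedrooms_site_b))
print(skew_similarity(bedrooms_site_a, cities_site_b))
```

Applied once per feature type, this reduces tens of thousands of sparse features to a handful of similarity scores per field and schema column.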

Smarter classification attempt
Stacked Skews model:
• Each field from each training site is a labeled instance.
• Features are the distributional similarity for each feature type.
• Train a linear regression model.
• Inspired by database schema matching by [Madhavan et al. 2005].
Now:
• Tens of training instances.
• One feature per feature type, just a handful.
• Appropriately sized learning problem.
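
A minimal sketch of the stacking step, assuming one regression per schema column whose inputs are the per-feature-type similarity scores and whose target is 1 when the field belongs to that column and 0 otherwise. The slides state only that a linear regression is trained on these features; the per-column setup, the small ridge term (added for numerical stability), the toy numbers, and all function names are assumptions made for illustration.

```python
def fit_linear_regression(X, y, ridge=1e-6):
    """Least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    # Build (X^T X + ridge * I) and X^T y.
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            factor = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * d
    for i in range(d - 1, -1, -1):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, d))) / A[i][i]
    return w


def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def label_field(models, sims_by_column):
    """Pick the schema column whose model scores the field highest."""
    scores = {col: predict(models[col], sims_by_column[col]) for col in models}
    return max(scores, key=scores.get), scores


# Toy training data (invented): one row per training field, with similarity
# scores for tokens, character 3-grams, token types, precontext 3-grams.
X = [
    [0.9, 0.8, 0.8, 0.7], [0.7, 0.9, 0.9, 0.7], [0.8, 0.7, 0.9, 0.8],
    [0.1, 0.2, 0.4, 0.1], [0.3, 0.1, 0.2, 0.2], [0.2, 0.3, 0.1, 0.2],
]
y = [1, 1, 1, 0, 0, 0]

weights = fit_linear_regression(X, y)
models = {"Bedrooms": weights, "Title": weights}  # shared only to keep the toy short
sims = {"Bedrooms": X[0], "Title": X[3]}
print(label_field(models, sims)[0])
```

With only a handful of inputs per instance, tens of training fields are enough to fit the weights, which is the point of the contrast with the logistic regression baseline.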

Related work
• Unsupervised wrapper induction typically doesn’t label data fields, e.g. [Chang & Kuo, 2004], [Zhai & Liu, 2005].
• DeLa system of [Wang & Lochovsky, 2003]: heuristic rule-based mapping of fields to labels; requires explicit prompts of extracted fields.
• [Golgher et al, 2001]: finds exact matches of data values and looks for consistent context.

Evaluation: Vacation rentals
Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address

Evaluation: Job listings
Schema: Title, Company, Location, Date Posted, Job Type, ID

Results
[Figure: accuracy by schema column.]
• Significantly outperforms the logistic regression baseline.
• With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.

Thank You

Results by Schema Column

Results by Web Site

Feature Type Ablation Study Results
