Bootstrapping Information Extraction from Semi-Structured Web Pages
Andy Carlson (Machine Learning Department, Carnegie Mellon), Charles Schafer (Google Pittsburgh)
ECML/PKDD 2008
Supervised Information Extraction
Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.
Bootstrapping IE from Semi-Structured Web Pages
Assume that we have wrappers for a number of sites in a domain, and thus many records from those sites. Can we use what we’ve learned to automatically wrap a new site in the same domain?
From unlabeled pages to DOM trees
[Figure: unlabeled pages from the new site, each shown as a DOM tree with <html>, <body>, <h1>, <img>, <h4>, and <div> elements and text leaves.]
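The page-to-DOM-tree step can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` module stands in for whatever parser the authors used; the `Node`, `DomBuilder`, `parse`, and `text_paths` names, the `VOID_TAGS` set, and the sample page are all hypothetical and only serve the illustration.

```python
from html.parser import HTMLParser

# Assumption: a small set of elements that never have a closing tag.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class Node:
    """One DOM node: an element, a text leaf, or the document root."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects while the page is parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(Node("#text", self.current, text))


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def text_paths(node, prefix=""):
    """Return (tag path, text) pairs for every text leaf below node."""
    if node.tag == "#text":
        return [(prefix, node.text)]
    path = prefix if node.tag == "#document" else f"{prefix}/{node.tag}"
    pairs = []
    for child in node.children:
        pairs.extend(text_paths(child, path))
    return pairs


# Illustrative page, not taken from the talk's data.
page = ("<html><body><h1>Beach House</h1><img src='a.jpg'>"
        "<h4>Bedrooms:</h4><div>3</div></body></html>")
for path, text in text_paths(parse(page)):
    print(path, "->", text)
```

Text leaves that share a tag path across pages are natural candidates for the same data field, which is roughly the information a template tree has to capture.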
From DOM trees to template tree
Tree alignment of the DOM trees yields a template tree.
[Figure: two DOM trees aligned into a single template tree with <html>, <body>, <h1>, <img>, <h4>, and <div> elements and text leaves.]
Supervised setting: Labels from user annotations
Learn labels from user annotations.
[Figure: a generalized template becomes a generalized extraction template once its text nodes are labeled.]
Bootstrapping setting: Labels from classifiers
Label data fields with classifiers.
[Figure: a generalized template becomes a generalized extraction template; its text nodes carry example values such as "Bedrooms:", city names (Las Vegas, New York, Miami, Palm Springs, Boston), and small integers.]
Framing the classification problem
[Figure: training sites A, B, and C, each shown with columns of example field values: city names (Houston, Las Vegas, Topeka), amenities (Grill, Heated Pool, Canoe), dates (6/9/08), hyphenated numbers (717-0474), prices ($36), and small numbers (3, 2.5), with precontexts such as "Description:", "Amenities:", and "Bedrooms:", plus an "Other" label.]
Comparing fields: Feature types
Content:
• Tokens: split on tokens because many data types share some vocabulary, but token order is not important.
• Character 3-grams: useful for matching “fulltime” and “full-time”.
• Token types (all digits, all caps, etc.): helpful for addresses, unique IDs, and other fields with a mix of token types.
Context:
• Precontext character 3-grams: sites vary their wordings, but often use variants of the same words.
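The four feature types can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the tokenizer regex, the token-type names, and the `char_trigrams`, `token_type`, and `field_features` helpers are assumptions made for the example.

```python
import re
from collections import Counter


def char_trigrams(text):
    """Count overlapping character 3-grams of the lowercased text."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def token_type(token):
    """Map a token to a coarse type (the type names are assumptions)."""
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha() and token.isupper():
        return "ALL_CAPS"
    if token.isalpha() and token.istitle():
        return "CAPITALIZED"
    if token.isalpha() and token.islower():
        return "ALL_LOWER"
    return "MIXED"


def field_features(value, precontext):
    """Return one bag of counts per feature type for a single field value."""
    tokens = re.findall(r"[A-Za-z0-9]+", value)
    return {
        "tokens": Counter(t.lower() for t in tokens),
        "char_3grams": char_trigrams(value),
        "token_types": Counter(token_type(t) for t in tokens),
        "precontext_3grams": char_trigrams(precontext),
    }


# "fulltime" and "full-time" share no token but do share character 3-grams.
shared = set(char_trigrams("fulltime")) & set(char_trigrams("full-time"))
print(sorted(shared))
print(field_features("3", "Bedrooms:")["token_types"])
```

Summing these counters over every value of a field gives one distribution per feature type for that field.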
Naïve classification attempt
Logistic regression:
• Each data field from the training sites is a labeled instance for each schema column.
• Use the features just described.
Problems:
• Tens of training instances.
• Tens of thousands of features.
• Serious overfitting.
Coarser features: Distributional similarity
• Treat each field as a distribution of values.
• Compute distributional similarity for each feature type.
• Smooth and normalize to Skew Similarity.
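The slide's formula did not survive extraction, so the following is only a sketch of one plausible reading. It assumes the standard skew divergence, KL(p || αq + (1 − α)p), as the smoothed comparison, and it assumes exp(−divergence) as the mapping to a similarity in (0, 1]. Both the α value and that mapping are illustrative choices, not taken from the talk.

```python
import math
from collections import Counter


def to_distribution(counts):
    """Normalize positive counts into a probability distribution."""
    total = sum(v for v in counts.values() if v > 0)
    if total == 0:
        return {}
    return {k: v / total for k, v in counts.items() if v > 0}


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1 - alpha)*p).

    Mixing a little of p into q keeps every term finite even when q
    assigns zero probability to something that p has seen.
    """
    d = 0.0
    for key, pk in p.items():
        mixed = alpha * q.get(key, 0.0) + (1 - alpha) * pk
        d += pk * math.log(pk / mixed)
    return d


def skew_similarity(p, q, alpha=0.99):
    """Assumed normalization: map the divergence into (0, 1]."""
    return math.exp(-skew_divergence(p, q, alpha))


# Illustrative token distributions for a field on a new site and two candidates.
new_field = to_distribution(Counter({"las": 2, "vegas": 2, "boston": 1}))
city_col = to_distribution(Counter({"boston": 3, "las": 1, "vegas": 1, "miami": 2}))
date_col = to_distribution(Counter({"6/9/08": 1, "7/13/08": 1}))
print(skew_similarity(new_field, city_col))
print(skew_similarity(new_field, date_col))
```

Computing one such score per feature type turns tens of thousands of sparse features into a handful of dense ones.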
Smarter classification attempt
Stacked Skews model:
• Each field from each training site is a labeled instance.
• Features are the distributional similarity for each feature type.
• Train a linear regression model.
• Inspired by database schema matching by [Madhavan et al. 2005].
Now:
• Tens of training instances.
• One feature per feature type, so just a handful.
• Appropriately sized learning problem.
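The shape of this learning problem can be sketched in a few lines. The example below is hypothetical throughout: the similarity values are made up, the `fit_linear` and `predict` helpers are illustrative names, and plain batch gradient descent stands in for whatever least-squares solver was actually used.

```python
def fit_linear(rows, targets, lr=0.1, epochs=2000):
    """Least-squares fit of weights and a bias by batch gradient descent."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(model, x):
    weights, bias = model
    return bias + sum(w * xi for w, xi in zip(weights, x))


# One row per (training-site field, schema column) pair. The columns are
# made-up similarities for: tokens, character 3-grams, token types, and
# precontext 3-grams. The target is 1 when the field really is that column.
rows = [
    [0.9, 0.8, 0.9, 0.7],
    [0.8, 0.9, 0.8, 0.9],
    [0.1, 0.2, 0.6, 0.1],
    [0.2, 0.1, 0.3, 0.2],
    [0.7, 0.7, 0.9, 0.8],
    [0.1, 0.3, 0.2, 0.1],
]
targets = [1, 1, 0, 0, 1, 0]
model = fit_linear(rows, targets)

# Score two candidate fields from a new site against the same schema column.
good_candidate = [0.85, 0.8, 0.9, 0.8]
poor_candidate = [0.15, 0.2, 0.3, 0.1]
print(predict(model, good_candidate), predict(model, poor_candidate))
```

With only four inputs and a bias, a few dozen labeled instances are enough to fit the model; on a new site, one natural use is to assign each schema column to the field with the highest predicted score.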
Related work
• Unsupervised wrapper induction typically doesn’t label data fields, e.g. [Chang & Kuo, 2004], [Zhai & Liu, 2005].
• DeLa system of [Wang & Lochovsky, 2003]: heuristic rule-based mapping of fields to labels; requires explicit prompts of extracted fields.
• [Golgher et al., 2001]: finds exact matches of data values and looks for consistent context.
Evaluation: Vacation rentals
Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address
Evaluation: Job listings
Schema: Title, Company, Location, Date Posted, Job Type, ID
Results
[Figure: accuracy by schema column.]
• Significantly outperforms logistic regression baseline.
• With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.