
Bootstrapping Information Extraction from Semi-Structured Web Pages

Andy Carlson (Machine Learning Department, Carnegie Mellon), Charles Schafer (Google Pittsburgh). ECML/PKDD 2008.


Presentation Transcript


  1. Bootstrapping Information Extraction from Semi-Structured Web Pages. Andy Carlson (Machine Learning Department, Carnegie Mellon), Charles Schafer (Google Pittsburgh). ECML/PKDD 2008.

  2. Semi-Structured Web Pages: Vacation Rentals

  3. Semi-Structured Web Pages: Nobel Prize Winners

  4. Semi-Structured Web Pages: Museum Collections

  5. Structured Data

  6. Structured data enables better search interfaces

  7. Supervised Information Extraction: Supervised IE allows a user to annotate pages and train a ‘wrapper’ for the site.

  8. Bootstrapping IE from Semi-Structured Web Pages: Assume that we have wrappers for a number of sites in a domain, and thus many records from those sites. Can we use what we’ve learned to automatically wrap a new site in the same domain?

  9. From unlabeled pages to DOM trees. [Figure: each unlabeled page from the new site is parsed into a DOM tree with <html>, <body>, <h1>, <img>, <h4>, <div>, and text nodes.]

  10. From DOM trees to template tree. [Figure: the DOM trees of the individual pages are combined by tree alignment into a single template tree with <html>, <body>, <h1>, <img>, <h4>, <div>, and text nodes.]
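The step from per-page DOM trees to one shared template can be illustrated with a much simpler stand-in for tree alignment. The sketch below is an assumption-laden simplification, not the alignment algorithm from the talk: it keys every text node by its tag path, keeps only the paths that occur on every page, and marks a path as a data field when its text differs across pages. The names `TextPathCollector`, `page_fields`, and `build_template` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathCollector(HTMLParser):
    """Maps each text node to a key such as 'html/body/div[2]'.

    The index in brackets is the n-th text node seen under that tag path.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.seen = defaultdict(int)
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.seen[path] += 1
            self.texts[f"{path}[{self.seen[path]}]"] = text


def page_fields(html):
    """Return {path key: text} for one page."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def build_template(pages):
    """Return {path key: 'data' or 'fixed'} for paths shared by all pages."""
    per_page = [page_fields(html) for html in pages]
    shared = set(per_page[0])
    for fields in per_page[1:]:
        shared &= set(fields)
    template = {}
    for path in shared:
        values = {fields[path] for fields in per_page}
        # Text that changes from page to page is treated as a data field.
        template[path] = "data" if len(values) > 1 else "fixed"
    return template
```

Unlike real tree alignment, this sketch cannot cope with optional or repeated nodes that shift positions between pages; it only shows how fixed template text and varying data fields can be separated once pages are lined up.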

  11. Supervised setting: Labels from user annotations. [Figure: labels for the data fields of the generalized template are learned from user annotations, giving a generalized extraction template.]

  12. Bootstrapping setting: Labels from classifiers. [Figure: the data fields of the generalized template are labeled with classifiers, giving a generalized extraction template. Example field contents: the context "Bedrooms:", bedroom counts (3, 5, 4, 2, 1, 2), and cities (Las Vegas, New York, Miami, Palm Springs, Boston).]
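Once a classifier has scored every template field against every schema column, the labeling step reduces to choosing a field per column. The following is a minimal sketch under assumptions of my own: the score layout, the `label_fields` name, and the `threshold` parameter are all hypothetical, and the talk does not say how ties or low-confidence columns are handled.

```python
def label_fields(scores, threshold=0.5):
    """Pick, for each schema column, the template field with the best score.

    scores: {field_id: {column_name: classifier_score}}
    threshold: hypothetical cutoff below which a column is left unlabeled.
    """
    columns = {column for per_field in scores.values() for column in per_field}
    labels = {}
    for column in sorted(columns):
        best_field = max(
            scores, key=lambda f: scores[f].get(column, float("-inf"))
        )
        if scores[best_field].get(column, float("-inf")) >= threshold:
            labels[column] = best_field
    return labels
```

Leaving a column unlabeled when no field scores well is one possible design choice; it trades recall for fewer wrong extractions on sites that simply lack that column.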

  13. Framing the classification problem. [Figure: columns of field values drawn from training Sites A, B, and C, plus an "Other" group. Example values include cities (Houston, Las Vegas, San Jose, Atlanta, Topeka, Seattle, New York, Boston), amenities (Grill, DVD Player, Heated Pool, Deck, Gas Grill, Canoe), dates (6/9/08, 7/13/08, 1/1/09), phone numbers (717-0474, 835-7694), prices ($36, $14, $99), and small numbers (3, 2.5, 1.5), with contexts such as "Description:", "Amenities:", and "Bedrooms:".]

  14. Comparing fields: Feature types. Content features: (a) Tokens: split on tokens, because many data types have some vocabulary but its order is not important. (b) Character 3-grams: useful for matching “fulltime” and “full-time”. (c) Token types (all digits, all caps, etc.): helpful for addresses, unique IDs, and other fields with a mix of token types. Context features: precontext character 3-grams, because sites vary their wordings but often use variants of the same words.
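The four feature types on this slide can be sketched as simple counting functions. This is an illustrative sketch rather than the authors' feature extractor: the tokenizer regex, the token-type names, and the `extract_features` function are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Split into word tokens and single punctuation marks (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_trigrams(text):
    """Lowercased character 3-grams; empty for strings shorter than 3."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def token_type(token):
    """Coarse token shape; the category names are hypothetical."""
    if token.isdigit():
        return "ALL_DIGITS"
    if token.isalpha():
        if token.isupper():
            return "ALL_CAPS"
        if token[0].isupper():
            return "CAPITALIZED"
        return "ALPHA"
    if token.isalnum():
        return "ALPHANUMERIC"
    return "OTHER"


def extract_features(value, precontext=""):
    """Return one bag of counts per feature type for a single field value."""
    tokens = tokenize(value)
    return {
        "tokens": Counter(t.lower() for t in tokens),
        "char_3grams": Counter(char_trigrams(value)),
        "token_types": Counter(token_type(t) for t in tokens),
        "precontext_3grams": Counter(char_trigrams(precontext)),
    }
```

With this tokenizer, "fulltime" and "full-time" share no tokens but do share the 3-grams "ful", "ull", "tim", and "ime", which is the kind of partial match the slide attributes to character 3-grams.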

  15. Naïve classification attempt. Logistic regression: each data field from the training sites is a labeled instance for each schema column, using the features just described. Problems: tens of training instances, tens of thousands of features, serious overfitting.

  16. Coarser features: Distributional similarity. Treat each field as a distribution of values. Compute distributional similarity for each feature type: smooth and normalize to Skew Similarity.
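The slide's formula is not in the transcript, so the sketch below makes its assumptions explicit. It uses the standard skew divergence, KL(p || alpha*q + (1 - alpha)*p), over add-epsilon smoothed and normalized count distributions, and then maps divergence to a similarity with exp(-d). The smoothing constant, the value of `alpha`, and the exp(-d) mapping are placeholders chosen for illustration, not values taken from the talk.

```python
import math
from collections import Counter


def normalize(counts, vocab, eps=1e-3):
    """Add-epsilon smoothing over a shared vocabulary, then normalize to sum 1."""
    total = sum(counts.get(v, 0) + eps for v in vocab)
    return {v: (counts.get(v, 0) + eps) / total for v in vocab}


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1-alpha)*p); p and q share the same keys."""
    return sum(
        pv * math.log(pv / (alpha * q[v] + (1 - alpha) * pv))
        for v, pv in p.items()
    )


def skew_similarity(counts_a, counts_b, alpha=0.99, eps=1e-3):
    """Similarity in (0, 1]; the exp(-d) mapping is an assumption."""
    vocab = set(counts_a) | set(counts_b)
    if not vocab:
        return 0.0
    p = normalize(counts_a, vocab, eps)
    q = normalize(counts_b, vocab, eps)
    return math.exp(-skew_divergence(p, q, alpha))
```

Each call compares two bags of counts for one feature type, for example the character 3-gram counts of a new site's field against those of a known "City" column, and returns a single number. That collapse from thousands of sparse features to one score per feature type is the point of the slide.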

  17. Smarter classification attempt. Stacked Skews model: each field from each training site is a labeled instance; the features are the distributional similarities, one per feature type; train a linear regression model. Inspired by database schema matching [Madhavan et al. 2005]. Now: tens of training instances and one feature per feature type (just a handful), so an appropriately sized learning problem.
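The learning problem on this slide is small enough to sketch in plain Python. The example below fits a least-squares linear model by batch gradient descent, with one similarity score per feature type as input and a 1.0 or 0.0 target saying whether the field belongs to the schema column. The function names, learning rate, and epoch count are assumptions, and the training rows used with it would be made-up illustrations rather than data from the paper.

```python
def fit_linear_regression(rows, targets, lr=0.1, epochs=2000):
    """Least-squares fit by batch gradient descent; returns (weights, bias)."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            err = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(weights, bias, x):
    """Score one field: higher means a better match to the schema column."""
    return bias + sum(w * xi for w, xi in zip(weights, x))
```

A row here would hold four similarities (tokens, character 3-grams, token types, precontext 3-grams) for one field from a training site. With only a handful of weights to learn, tens of labeled instances are a reasonable amount of data, which is the contrast the slide draws with the logistic regression attempt.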

  18. Related work. Unsupervised wrapper induction typically doesn’t label data fields, e.g. [Chang & Kuo, 2004], [Zhai & Liu, 2005]. DeLa system of [Wang & Lochovsky, 2003]: heuristic rule-based mapping of fields to labels; requires explicit prompts of extracted fields. [Golgher et al, 2001]: finds exact matches of data values and looks for consistent context.

  19. Evaluation: Vacation rentals. Schema: Title, Bedrooms, Bathrooms, Sleeps, Property Type, Description, Address

  20. Evaluation: Job listings. Schema: Title, Company, Location, Date Posted, Job Type, ID

  21. Results. [Chart: accuracy by schema column.] • Significantly outperforms the logistic regression baseline. • With a small, fixed investment of human effort, we can create wrappers for hundreds of sites in a domain.

  22. Thank You

  23. Results by Schema Column

  24. Results by Web Site

  25. Feature Type Ablation Study Results
