
Wrapper Learning for Web Data Extraction

Learn how to extract database records from a website with few examples by leveraging wrapper learning systems like WIEN and Boosted Wrapper Induction. Explore boosting rules and co-training algorithms for efficient learning.



Presentation Transcript


  1. Wrapper Learning: Cohen et al. 2002; Kushmerick 2000; Kushmerick & Freitag 2000. William Cohen, 1/26/03

  2. Goal: learn from a human teacher how to extract certain database records from a particular web site.

  3. Learner

  4. Why learning from few examples is important: At training time, only four examples are available, but one would like to generalize to future pages as well. Must generalize across time as well as across a single site.

  5. Now some details…

  6. Kushmerick’s WIEN system • Earliest wrapper-learning system (published IJCAI ’97) • Special things about WIEN: • Treats document as a string of characters • Learns to extract a relation directly, rather than extracting fields and then associating them together in some way • Example is a completely labeled page

  7. WIEN system: a sample wrapper

  8. WIEN system: a sample wrapper. Left delimiters L1=“<B>”, L2=“<I>”; right delimiters R1=“</B>”, R2=“</I>”

  9. WIEN system: a sample wrapper • Learning means finding L1,…,Lk and R1,…,Rk • Li must precede every instance of field i • Ri must follow every instance of field i • Li, Ri can’t contain data items • Limited number of possible candidates for Li, Ri
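As a rough illustration of how an LR wrapper like the one above might be executed once its delimiters are known, here is a minimal Python sketch. The function name `execute_lr` and the sample page are assumptions made for illustration; they are not taken from the slides or from WIEN's code.

```python
def execute_lr(page, delimiters):
    """Extract tuples from `page` with one (left, right) delimiter pair per field.

    Hypothetical helper: scans left to right, and for each field takes the text
    between the next occurrence of its left delimiter and the following right one.
    """
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples          # no further field: stop, dropping any partial row
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)     # continue scanning after this field
        tuples.append(tuple(row))


# Illustrative page (an assumption, not the page shown on the slide).
page = "<HTML><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR></HTML>"
wrapper = [("<B>", "</B>"), ("<I>", "</I>")]   # (L1, R1), (L2, R2)
rows = execute_lr(page, wrapper)
print(rows)
```

Learning then amounts to searching for delimiter strings that make this procedure return exactly the labeled tuples on every example page.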

  10. WIEN system: a more complex class of wrappers (HLRT). Extension: use Li, Ri delimiters only after a “head” (after the first occurrence of H) and before a “tail” (occurrence of T). Example: H = “<P>”, T = “<HR>”
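The head/tail idea can be sketched in Python as follows. This is a simplified reading, assuming the wrapper only looks at the text between the first H and the next T; the published HLRT procedure checks the tail inside its extraction loop, so treat this as an approximation. The function name `execute_hlrt` and the sample page are hypothetical.

```python
def execute_hlrt(page, head, tail, delimiters):
    """Apply (left, right) delimiter pairs only between `head` and `tail`."""
    start = page.find(head)
    if start == -1:
        return []
    start += len(head)
    end = page.find(tail, start)
    body = page[start:end] if end != -1 else page[start:]

    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            s = body.find(left, pos)
            if s == -1:
                return tuples
            s += len(left)
            e = body.find(right, s)
            if e == -1:
                return tuples
            row.append(body[s:e])
            pos = e + len(right)
        tuples.append(tuple(row))


# Illustrative page: the bold title and footer would confuse a plain LR wrapper.
page = ("<B>Country codes</B><P><B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR><HR><B>End</B>")
hlrt_rows = execute_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(hlrt_rows)
```

Without the head and tail, the title “Country codes” would be matched by L1 and R1 and extracted as if it were a data field.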

  11. Kushmerick: overview of various extensions to LR

  12. Kushmerick and Freitag: Boosted wrapper induction

  13. Review of boosting: generalized version of AdaBoost (Schapire & Singer, 99). Allows “real-valued” predictions for each “base hypothesis”, including a value of zero.

  14. Learning methods: boosting rules • Weak learner: to find weak hypothesis t: • Split data into Growing and Pruning sets • Let Rt be an empty conjunction • Greedily add conditions to Rt guided by Growing set: [formula not preserved] • Greedily remove conditions from Rt guided by Pruning set: [formula not preserved] • Convert to weak hypothesis: [formula not preserved] • Constraint: W+ > W-, where [definition not preserved] and the caret denotes smoothing
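The slide's formulas did not survive extraction, so the following Python sketch is only an assumption about what one boosting step with a single rule might look like. It uses the smoothed log-odds confidence commonly given for SLIPPER, with W+ and W- the total weight of positive and negative examples covered by the rule and 1/(2n) as the smoothing term, together with the standard confidence-rated weight update in which uncovered examples receive a prediction of zero. The function names are hypothetical.

```python
import math


def rule_confidence(w_plus, w_minus, n):
    """Smoothed confidence of a rule (assumed form: 0.5 * ln((W+ + eps) / (W- + eps)))."""
    eps = 1.0 / (2 * n)
    return 0.5 * math.log((w_plus + eps) / (w_minus + eps))


def boost_step(weights, labels, covered):
    """One boosting round for a single rule that abstains on uncovered examples."""
    n = len(weights)
    w_plus = sum(w for w, y, c in zip(weights, labels, covered) if c and y == +1)
    w_minus = sum(w for w, y, c in zip(weights, labels, covered) if c and y == -1)
    conf = rule_confidence(w_plus, w_minus, n)
    # Covered examples are reweighted by exp(-y * conf); uncovered ones keep their weight.
    new = [w * math.exp(-y * conf) if c else w
           for w, y, c in zip(weights, labels, covered)]
    total = sum(new)
    return conf, [w / total for w in new]


weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, +1, -1, -1]
covered = [True, True, True, False]   # the rule fires on the first three examples
conf, new_weights = boost_step(weights, labels, covered)
print(conf, new_weights)
```

Here W+ = 0.5 exceeds W- = 0.25, so the confidence is positive; the covered negative example gains weight and the covered positives lose weight, which steers the next rule toward the current mistakes.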

  15. Learning methods: boosting rules. SLIPPER also produces fairly compact rule sets.

  16. Learning methods: BWI • Boosted wrapper induction (BWI) learns to extract substrings from a document. • Learns three concepts: firstToken(x), lastToken(x), substringLength(k) • Conditions are tests on tokens before/after x • E.g., tok_{i-2} = ‘from’, isNumber(tok_{i+1}) • SLIPPER weak learner, no pruning. • Greedy search extends “window size” by at most L in each iteration, uses lookahead L, no fixed limit on window size. • Good results in (Kushmerick and Freitag, 2000)
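To make the three learned concepts concrete, here is a minimal Python sketch of how they might be combined at extraction time. It assumes each boundary detector is a weighted list of token tests and that a span is extracted when the product of its start score, end score, and length score exceeds a threshold. The rules, weights, names (`score`, `extract`, `tok`), and the example sentence are all made up for illustration.

```python
def tok(tokens, k):
    """Token at position k, or the empty string when k is out of range."""
    return tokens[k] if 0 <= k < len(tokens) else ""


def score(detector, tokens, i):
    """Sum the confidences of all rules in `detector` that fire at position i."""
    return sum(conf for conf, test in detector if test(tokens, i))


def extract(tokens, first_det, last_det, length_prob, tau=0.0):
    """Return (text, score) for spans whose combined score exceeds tau."""
    spans = []
    for i in range(len(tokens)):
        f = score(first_det, tokens, i)
        if f <= 0:
            continue
        for j in range(i, len(tokens)):
            s = f * score(last_det, tokens, j) * length_prob.get(j - i + 1, 0.0)
            if s > tau:
                spans.append((" ".join(tokens[i:j + 1]), s))
    return spans


# Hypothetical detectors for a "start time" field.
first_det = [(1.0, lambda t, i: tok(t, i - 1) == "from" and tok(t, i).isdigit())]
last_det = [(1.0, lambda t, j: tok(t, j) == "pm" and tok(t, j - 1).isdigit())]
length_prob = {2: 1.0}   # assumed: every training field was two tokens long

tokens = "Seminar from 3 pm to 4 pm in room 101".split()
spans = extract(tokens, first_det, last_det, length_prob)
print(spans)
```

The length score is what rules out the span “3 pm to 4 pm”, which also begins at a detected start and ends at a detected end.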

  17. BWI algorithm

  18. BWI algorithm (lookahead search here)

  19. BWI example rules

  20. Cohen et al

  21. Improving A Page Classifier with Anchor Extraction and Link Analysis. William W. Cohen, NIPS 2002

  22. Previous work in page classification using links: • Exploit hyperlinks (Slattery & Mitchell 2000; Cohn & Hofmann, 2001; Joachims 2001): Documents pointed to by the same “hub” should have the same class. • What’s new in this paper: • Use structure of hub pages (as well as structure of site graph) to find better “hubs” • Adapt an existing “wrapper learning” system to find structure, on the task of classifying “executive bio pages”.

  23. Intuition: links from this “hub page” are informative… …especially these links

  24. Task: train a page classifier, then use it to classify pages on a new, previously unseen web site as executiveBio or other. Question: can index pages for executive biographies be used to improve classification? Idea: use the wrapper-learner to learn to extract links to execBio pages, smoothing the “noisy” data produced by the initial page classifier.

  25. Background: “co-training” (Blum & Mitchell, ’98) • Suppose examples are of the form (x1,x2,y) where x1, x2 are independent (given y), and where each xi is sufficient for classification, and unlabeled examples are cheap. • (E.g., x1 = bag of words, x2 = bag of links). • Co-training algorithm: 1. Use x1’s (on labeled data D) to train f1(x)=y 2. Use f1 to label additional unlabeled examples U. 3. Use x2’s (on labeled part of U+D) to train f2(x)=y 4. Repeat . . .
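The co-training loop above can be sketched in Python as follows. This is a toy version under stated assumptions: the base learner `train` is a simple feature-vote classifier chosen only to keep the example self-contained (it is not the learner used in the papers), views are sets of features, and a prediction of 0 means "abstain". All names and data are hypothetical.

```python
from collections import Counter


def train(examples):
    """Toy learner: each feature votes with the labels it co-occurred with."""
    votes = Counter()
    for feats, y in examples:
        for f in feats:
            votes[f] += y

    def predict(feats):
        s = sum(votes[f] for f in feats)
        return 1 if s > 0 else -1 if s < 0 else 0   # 0 means abstain
    return predict


def absorb(labeled, pool, label_fn):
    """Move pool examples that label_fn can label into the labeled set."""
    still_unlabeled = []
    for x1, x2 in pool:
        y = label_fn(x1, x2)
        if y != 0:
            labeled = labeled + [(x1, x2, y)]
        else:
            still_unlabeled.append((x1, x2))
    return labeled, still_unlabeled


def co_train(labeled, unlabeled, rounds=2):
    labeled, pool = list(labeled), list(unlabeled)
    f1 = f2 = None
    for _ in range(rounds):
        f1 = train([(x1, y) for x1, _, y in labeled])                  # step 1
        labeled, pool = absorb(labeled, pool, lambda a, b: f1(a))      # step 2
        f2 = train([(x2, y) for _, x2, y in labeled])                  # step 3
        labeled, pool = absorb(labeled, pool, lambda a, b: f2(b))      # step 4
    return f1, f2


# x1 = bag of words, x2 = bag of links (illustrative data).
D = [({"ceo", "biography"}, {"hub:/team"}, +1),
     ({"press", "release"}, {"hub:/news"}, -1)]
U = [({"biography", "founder"}, {"hub:/team", "hub:/about"}),
     ({"release", "product"}, {"hub:/news"}),
     ({"joined", "board"}, {"hub:/about"})]
f1, f2 = co_train(D, U)
print(f1({"board"}), f2({"hub:/about"}), f2({"hub:/news"}))
```

The third unlabeled page shares no words with the labeled data, so f1 abstains on it at first; it gets labeled only through the link view, after which f1 can learn from its words.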

  26. Simple 1-step co-training for web pages. f1 is a bag-of-words page classifier, and S is a web site containing unlabeled pages. • Feature construction. Represent a page x in S as a bag of pages that link to x (“bag of hubs”). • Learning. Learn f2 from the bag-of-hubs examples, labeled with f1 • Labeling. Use f2(x) to label pages from S. Idea: use one round of co-training to bootstrap the bag-of-words classifier to one that uses site-specific features x2/f2
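A minimal Python sketch of the three steps above, assuming the site's link structure is given as a mapping from hub page to linked pages and that f1's output is a dictionary of +1/-1 labels. The learner for f2 here is a simple per-hub vote, which stands in for whatever learner was actually used; the URLs and names are invented.

```python
from collections import defaultdict


def bag_of_hubs(links):
    """Feature construction: map each page to the set of pages linking to it."""
    bags = defaultdict(set)
    for hub, targets in links.items():
        for target in targets:
            bags[target].add(hub)
    return bags


def one_step_cotrain(links, f1_labels):
    """Learn f2 from bag-of-hubs features labeled by f1, then relabel the pages."""
    bags = bag_of_hubs(links)
    # Learning: each hub accumulates the f1 labels of the pages it points to.
    hub_votes = defaultdict(int)
    for page, hubs in bags.items():
        for hub in hubs:
            hub_votes[hub] += f1_labels[page]
    # Labeling: a page is positive if the hubs pointing to it are mostly positive.
    return {page: 1 if sum(hub_votes[h] for h in hubs) > 0 else -1
            for page, hubs in bags.items()}


# Illustrative site: f1 wrongly labels /bio/cy as "other" (-1).
links = {"/team": ["/bio/ann", "/bio/bob", "/bio/cy"],
         "/news": ["/pr/1", "/pr/2"]}
f1_labels = {"/bio/ann": 1, "/bio/bob": 1, "/bio/cy": -1, "/pr/1": -1, "/pr/2": -1}
f2_labels = one_step_cotrain(links, f1_labels)
print(f2_labels)
```

Because /team points mostly to pages that f1 considers positive, the noisy label on /bio/cy is overturned. Pages with no inbound links receive no f2 label in this sketch.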

  27. Improved 1-step co-training for web pages. Feature construction. - Label an anchor a in S as positive iff it points to a positive page x (according to f1). Let D = {(x’,a): a is a positive anchor on x’}. - Generate many small training sets Di from D, by sliding small windows over D. - Let P be the set of all “structures” found by any builder from any subset Di - Say that p links to x if p extracts an anchor that points to x. Represent a page x as the bag of structures in P that link to x. Learning and Labeling. As before.

  28. builder extractor List1

  29. builder extractor List2

  30. builder extractor List3

  31. BOH representation: ({List1, List3, …}, PR); ({List1, List2, List3, …}, PR); ({List2, List3, …}, Other); ({List2, List3, …}, PR); … → Learner

  32. Experimental results (figure annotations: “No improvement”, “Co-training hurts”)

  33. Summary - “Builders” (from a wrapper learning system) let one discover and use structure of web sites and index pages to smooth page classification results. - Discovering good “hub structures” makes it possible to use 1-step co-training on small (50-200 example) unlabeled datasets. - Average error rate was reduced from 8.4% to 3.6%. - Difference is statistically significant with a 2-tailed paired sign test or t-test. - EM with probabilistic learners also works; see (Blei et al, UAI 2002)
