
Automatic Wrappers for Large Scale Web Extraction


Presentation Transcript


  1. Automatic Wrappers for Large Scale Web Extraction Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)

  2. Task: Learn rules to extract information (e.g. Directors) from structurally similar pages.

  3. [Figure: a simplified DOM tree: html → head and body; body → div class=‘head’ (containing the title “Godfather”) and div class=‘content’ (containing a table, width=80%, whose cells hold Director: Coppola, Runtime: 118min).] • We can use the following XPath rule to extract directors: W1 = /html/body/div[2]/table/td[2]/text()
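To make the rule concrete: any standard XPath engine can apply W1. Below is a minimal sketch using Python's lxml; the document string is a hypothetical reconstruction of the slide's simplified DOM (td elements sit directly under table, as in the figure), not the paper's actual page.

```python
# Applying the learned XPath wrapper with lxml (hypothetical page).
from lxml import etree

doc = etree.fromstring(
    "<html><body>"
    "<div class='head'><title>Godfather</title></div>"
    "<div class='content'><table width='80%'>"
    "<td>Director:</td><td>Coppola</td><td>Runtime: 118min</td>"
    "</table></div>"
    "</body></html>"
)

W1 = "/html/body/div[2]/table/td[2]/text()"
print(doc.xpath(W1))  # ['Coppola']
```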

  4. Wrappers • Can be learned with a small amount of supervision. • Very effective for site-level extraction. • Have been extensively studied in the literature.

  5. In This Work: • Objective: learn wrappers without site-level supervision.

  6. [Figure-only slide.]

  7. Idea • Obtain training data cheaply using dictionaries or automatic labelers. • Make wrapper induction tolerant to noise.

  8. [Figure-only slide.]

  9. Summary of Approach • A generic framework that can incorporate wrapper inductors with certain plausible properties. • Input: a wrapper inductor Φ and a set of labels L • Idea: apply Φ on all subsets of L and choose the wrapper that gives the best list.

  10. Summary of Approach • Two main problems: • Wrapper Enumeration: How to generate the space of all possible wrappers efficiently? • Wrapper Ranking: How to rank the enumerated wrappers based on quality?

  11. Example: TABLE wrapper system • Works on a table. • Generates wrappers from the following space: a single cell, a row, a column, or the entire table.

  12. Example: TABLE wrapper system • L = {n1, n2, n4, a4, z5} • 32 possible subsets • 8 unique wrappers: {n1, n2, n4, a4, z5, C1, R4, T}
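The wrapper space of slides 11-12 can be generated naively, per slide 9, by running the inductor on every subset. Here is a brute-force sketch (my own code; the (row, col) coordinates assigned to the labels are assumptions consistent with the wrappers C1 and R4):

```python
# Enumerate the TABLE wrapper space by applying the inductor to every
# non-empty subset of L (2^5 = 32 subsets; the slide's count includes
# the empty set) and keeping the distinct wrappers.
from itertools import combinations

def table_inductor(cells):
    """Smallest wrapper (cell, row, column, or table) covering the cells."""
    rows = {r for r, c in cells}
    cols = {c for r, c in cells}
    if len(cells) == 1:
        return ("cell",) + next(iter(cells))
    if len(rows) == 1:
        return ("row", next(iter(rows)))
    if len(cols) == 1:
        return ("col", next(iter(cols)))
    return ("table",)

# Assumed coordinates for L = {n1, n2, n4, a4, z5}:
L = [(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)]

wrappers = {table_inductor(set(S))
            for k in range(1, len(L) + 1)
            for S in combinations(L, k)}
print(len(wrappers), sorted(wrappers))  # 8: the 5 cells, C1 (col 1), R4 (row 4), T
```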

  13. Wrapper Enumeration Problem • Input: a wrapper inductor Φ and a set of labels L • The wrapper space of L is defined as W(L) = {Φ(S) | S ⊆ L} • Problem: enumerate the wrapper space of L in time polynomial in the size of the wrapper space and |L|.

  14. Wrapper Inductors • TABLE: The wrapper inductor as defined before • XPATH: Learn the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples • LR: Find the maximal pair of strings preceding and following all the training examples. The output of the wrapper is all strings delimited by the pair.
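To illustrate the LR inductor concretely, here is a rough self-contained sketch; the page string, labels, and function names are invented for illustration, and a production LR learner would handle repeated label occurrences and tie-breaking more carefully.

```python
# LR induction sketch: the wrapper is the longest left context shared by
# all labeled occurrences plus the longest shared right context; applying
# it extracts every string delimited by that pair.
import re

def lr_induct(page, labels):
    lefts  = [page[:page.index(l)] for l in labels]
    rights = [page[page.index(l) + len(l):] for l in labels]
    left = lefts[0]
    while not all(s.endswith(left) for s in lefts):      # longest common suffix
        left = left[1:]
    right = rights[0]
    while not all(s.startswith(right) for s in rights):  # longest common prefix
        right = right[:-1]
    return left, right

def lr_apply(page, wrapper):
    left, right = wrapper
    return re.findall(re.escape(left) + "(.*?)" + re.escape(right), page)

page = ("<tr><td>Jazz Cafe</td><td>271-9910</td></tr>"
        "<tr><td>Joe's Diner</td><td>866-0210</td></tr>"
        "<tr><td>Szechuan Garden</td><td>355-7718</td></tr>")
w = lr_induct(page, ["Jazz Cafe", "Joe's Diner"])
print(w)                  # ('<tr><td>', '</td><td>')
print(lr_apply(page, w))  # all three names, including the unlabeled one
```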

  15. Well-behaved Inductor • A wrapper inductor Φ is well-behaved if it has the following properties: • [Fidelity] L ⊆ Φ(L) • [Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l}) • [Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2) • Theorem: TABLE, LR and XPATH are well-behaved wrapper inductors.
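The theorem is proved in the paper; as a quick intuition aid, the three properties can be spot-checked exhaustively for TABLE on a small grid. This check is my own sketch, with Φ returning the set of cells the induced wrapper extracts:

```python
# Spot-check fidelity, closure, and monotonicity for the TABLE inductor
# on a 5x3 grid of cells.
from itertools import combinations

ROWS, COLS = 5, 3
GRID = frozenset((r, c) for r in range(ROWS) for c in range(COLS))

def phi(labels):
    rows = {r for r, c in labels}
    cols = {c for r, c in labels}
    if len(labels) == 1:
        return frozenset(labels)                                      # a cell
    if len(rows) == 1:
        return frozenset((next(iter(rows)), c) for c in range(COLS))  # a row
    if len(cols) == 1:
        return frozenset((r, next(iter(cols))) for r in range(ROWS))  # a column
    return GRID                                                       # the table

for k in (1, 2, 3):
    for S in combinations(sorted(GRID), k):
        L, W = frozenset(S), phi(frozenset(S))
        assert L <= W                                  # fidelity: L ⊆ Φ(L)
        assert all(phi(L | {l}) == W for l in W)       # closure
        # monotonicity, checked for one-label-smaller subsets
        # (longer chains follow by transitivity)
        assert all(phi(L - {l}) <= W for l in L if len(L) > 1)
print("fidelity, closure, monotonicity hold for all label sets of size <= 3")
```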

  16. Bottom-up Algorithm • Start with the singleton labels in L as candidate label sets • Learn wrappers by feeding candidate label sets to Φ • Incrementally apply one-label extensions to each candidate • Extend candidates with the closure of wrappers learned by Φ • Theorem: The bottom-up algorithm is sound and complete • Theorem: The bottom-up algorithm makes at most k·|L| calls to the wrapper inductor, where k is the size of the wrapper space.

  17. Can we do better? • A wrapper inductor is a feature-based inductor if: • Every label is associated with a set of features ((attribute, value) pairs) • Φ(L) = the intersection of all the features of L • The output of a wrapper w = the text nodes satisfying all the features of w • E.g. TABLE can be expressed as a feature-based inductor with two features, row and col. • Both LR and XPATH can be expressed as feature-based inductors.
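In code, a feature-based inductor is just a set intersection followed by a subset test. The sketch below (the frozenset-of-pairs representation is an assumption) expresses TABLE this way and reproduces the C1 and R4 wrappers from slide 12:

```python
# Feature-based induction: the wrapper is the intersection of the labels'
# (attribute, value) features; it extracts every node whose features
# contain all of the wrapper's features.
def induce(labels, features):
    return frozenset.intersection(*(features[l] for l in labels))

def apply_wrapper(wrapper, features):
    return {n for n, f in features.items() if wrapper <= f}

# TABLE as a feature-based inductor: two features per cell, row and col.
features = {
    "n1": frozenset({("row", 1), ("col", 1)}),
    "n2": frozenset({("row", 2), ("col", 1)}),
    "n4": frozenset({("row", 4), ("col", 1)}),
    "a4": frozenset({("row", 4), ("col", 2)}),
    "z5": frozenset({("row", 5), ("col", 3)}),
}

w = induce(["n1", "n2"], features)         # {("col", 1)}: the C1 wrapper
print(sorted(apply_wrapper(w, features)))  # ['n1', 'n2', 'n4']
w = induce(["n4", "a4"], features)         # {("row", 4)}: the R4 wrapper
print(sorted(apply_wrapper(w, features)))  # ['a4', 'n4']
w = induce(["n1", "a4"], features)         # empty feature set: the T wrapper
print(sorted(apply_wrapper(w, features)))  # all five nodes
```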

  18. Top-down Algorithm • We give a top-down algorithm for feature-based inductors that makes exactly k calls to the inductor, where k is the size of the wrapper space.

  19. Wrapper Ranking Problem • Given a set of wrappers, we want to output the one that gives the “best” list. • Let X be the list extracted by a wrapper w • Choose the wrapper that maximizes P[X | L] or, equivalently (by Bayes’ rule), P[L | X]·P[X]

  20. Example: Extracting names from business listings • Let us rank the following three lists as candidates for the set of names: • X1 = first column • X2 = entire table • X3 = first two columns

  21. Example: Extracting names from business listings • X1 = first column • P[L | X1]: 2 wrong labels, 3 correct labels • P[X1]: nice repeating structure, schema size = 4

  22. Example: Extracting names from business listings • X2 = entire table • P[L | X2]: 0 wrong labels, 5 correct labels • P[X2]: nice repeating structure, schema size = 1

  23. Example: Extracting names from business listings • X3 = first two columns • P[L | X3]: 1 wrong label, 4 correct labels • P[X3]: poor repeating structure, schema size = 1 or 3

  24. Ranking Model • P[L | X] • Assume a simple annotator with precision p and recall r that independently labels each node. • Each node in X is added to L with probability r • Each node not in X is added to L with probability 1 − p

  25. Ranking Model • P[X] • Define features of the grammar that describes X, e.g. schema size and repeating structure • Learn distributions on the values of the features, or take them as input as part of domain knowledge.
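Slides 19-25 combine into a single score: log P[L | X] under the independent-annotator model plus a log-prior on X. The sketch below is my own simplification; the precision/recall values and the toy schema-size prior are illustrative assumptions, not the paper's learned distributions. On the business-listing numbers from slides 21-22, the likelihood term outweighs the prior and the first column wins:

```python
# Ranking score sketch: log P[L|X] + log P[X] for candidate lists X.
from math import log

def log_likelihood(X, L, nodes, p=0.8, r=0.6):
    ll = 0.0
    for n in nodes:
        if n in X:
            ll += log(r) if n in L else log(1.0 - r)  # in X: labeled w.p. r
        else:
            ll += log(1.0 - p) if n in L else log(p)  # outside X: labeled w.p. 1-p
    return ll

def log_prior(schema_size):
    return -float(schema_size)  # toy prior favoring a small schema

# A 5x4 table of text nodes, numbered row-major (node = 4*row + col).
nodes = set(range(20))
L  = {0, 4, 8, 13, 19}    # noisy labels: 3 real names + 2 mistakes
X1 = {0, 4, 8, 12, 16}    # first column: 3 correct, 2 wrong labels
X2 = nodes                # entire table: 5 correct, 0 wrong labels
for name, X, schema in [("X1", X1, 4), ("X2", X2, 1)]:
    print(name, round(log_likelihood(X, L, nodes) + log_prior(schema), 2))
```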

  26. Experiments • Datasets: • DEALERS: Used automatic form-filling techniques to obtain dealer listings from 300 store-locator pages • DISCOGRAPHY: Crawled 14 music websites that contain track listings of albums. • Task: Automatically learn wrappers to extract business names/track titles for each website.

  27. [Figure-only slide.]

  28. [Figure-only slide.]

  29. Summary • A new framework for noise-tolerant wrapper induction • Two efficient wrapper enumeration algorithms • Probabilistic wrapper ranking model • Web-scale information extraction • No site-level supervision → no manual labeling • Tolerating noise in automatic labeling

  30. [Figure-only slide.]

  31. Bottom-up Algorithm • INPUT: Φ, L
      Z = all singleton subsets of L
      W = Z
      while Z is not empty:
          remove the smallest set S from Z
          for each possible single-label expansion S′ of S:
              add Φ(S′) to W
              add (Φ(S′) ∩ L) back to Z
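A runnable sketch of this loop for the TABLE inductor follows. Details the slide leaves implicit are filled in as assumptions: W starts with the singleton wrappers, a seen set prevents re-queuing closures, and sets are popped in arbitrary order rather than smallest-first (which does not change the result on this example).

```python
# Bottom-up enumeration over L = {n1, n2, n4, a4, z5}, TABLE inductor.
NROWS, NCOLS = 5, 3

def phi(S):
    """Cells extracted by the smallest wrapper covering label set S."""
    rows = {r for r, c in S}
    cols = {c for r, c in S}
    if len(S) == 1:
        return frozenset(S)
    if len(rows) == 1:
        return frozenset((next(iter(rows)), c) for c in range(1, NCOLS + 1))
    if len(cols) == 1:
        return frozenset((r, next(iter(cols))) for r in range(1, NROWS + 1))
    return frozenset((r, c) for r in range(1, NROWS + 1)
                            for c in range(1, NCOLS + 1))

L = frozenset({(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)})  # n1, n2, n4, a4, z5

Z = {frozenset({l}) for l in L}     # candidate label sets: the singletons
W = {phi(S) for S in Z}             # their wrappers: the five cells
seen = set(Z)
while Z:
    S = Z.pop()
    for l in L - S:                 # every one-label expansion S' of S
        w = phi(S | {l})
        W.add(w)                    # add Φ(S') to W
        closure = frozenset(w & L)  # add (Φ(S') ∩ L) back to Z, if new
        if closure not in seen:
            seen.add(closure)
            Z.add(closure)

print(len(W))                       # 8: five cells plus C1, R4, T
```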

  32. Bottom-up Algorithm • Z = {n1, n2, n4, a4, z5} [diagram: the five singleton candidates n1, n2, n4, a4, z5]

  33. Bottom-up Algorithm • Z = {n2, n4, a4, z5, {n1, n2, n4}} [diagram: expanding n1 learns wrapper C1, whose closure {n1, n2, n4} is added to Z]

  34. Bottom-up Algorithm • Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [diagram: a further expansion learns wrapper T, whose closure is all of L]

  35. Bottom-up Algorithm • Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [n2 processed; no new wrappers]

  36. Bottom-up Algorithm • Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [diagram: expanding n4 learns wrapper R4, whose closure {n4, a4} is added to Z]

  37. Bottom-up Algorithm • Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [a4 processed]

  38. Bottom-up Algorithm • Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}} [z5 processed]

  39. Bottom-up Algorithm • Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}} [{n4, a4} processed]

  40. Bottom-up Algorithm • Z = {{n1, n2, n4, a4, z5}} [{n1, n2, n4} processed]

  41. Bottom-up Algorithm • Z = {} [done; the diagram now shows all 8 wrappers: n1, n2, n4, a4, z5, C1, R4, T]

  42. Top-down Algorithm [diagram: a split tree over the labels: the root {n1, n2, n4, a4, z5} is split on column into {n1, n2, n4}, {a4}, {z5} and on row into {n4, a4} and singletons; the group {n1, n2, n4} is further split on row into {n1}, {n2}, {n4}]
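The trace can be reproduced with a short recursive sketch; this is my reconstruction of the idea rather than the paper's exact algorithm: partition the current label group on each feature attribute, recurse on strictly smaller groups, and call the inductor once per distinct group (here that is exactly k = 8 calls).

```python
# Top-down enumeration for a feature-based inductor (TABLE features).
def induce(group, features):
    return frozenset.intersection(*(features[l] for l in group))

def top_down(group, features, attrs, W, seen):
    g = frozenset(group)
    if g in seen:
        return
    seen.add(g)
    W.add(induce(g, features))          # one inductor call per distinct group
    for a in attrs:                     # partition the group on attribute a
        parts = {}
        for l in g:
            parts.setdefault(dict(features[l])[a], set()).add(l)
        for part in parts.values():
            if len(part) < len(g):      # recurse on strictly smaller groups
                top_down(part, features, attrs, W, seen)

features = {
    "n1": frozenset({("row", 1), ("col", 1)}),
    "n2": frozenset({("row", 2), ("col", 1)}),
    "n4": frozenset({("row", 4), ("col", 1)}),
    "a4": frozenset({("row", 4), ("col", 2)}),
    "z5": frozenset({("row", 5), ("col", 3)}),
}
W, seen = set(), set()
top_down(set(features), features, ["row", "col"], W, seen)
print(len(W), len(seen))  # 8 wrappers from 8 inductor calls
```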

  43. Wrapper Ranking • argmaxX P[L | X]·P[X]? • Possible values of X are the possible wrappers computed by Φ • P[L | X]: probability of observing L given that X is the right wrapper • The annotator has precision p and recall r (estimated from tested labelings) • Independent annotation process: decide on labeling each node independently; each node in X is added to L with probability r; each node not in X is added to L with probability 1 − p [diagram: a Venn diagram over the set H of all nodes, the extracted list X, and the labels L, separating labeled and non-labeled nodes inside and outside X]
