This research, presented at VLDB 2011 in Seattle, focuses on learning rules for extracting information from structurally similar web pages. It explores automatic wrappers for large-scale web extraction: what wrappers and wrapper inductors are, how to enumerate and rank candidate wrappers, and how bottom-up and top-down algorithms (the latter for feature-based inductors) enumerate the wrapper space efficiently.
Automatic Wrappers for Large Scale Web Extraction
Nilesh Dalvi (Yahoo!), Ravi Kumar (Yahoo!), Mohamed Soliman (EMC)

Task: Learn rules to extract information (e.g. Directors) from structurally similar pages. VLDB 2011, Seattle, USA
[Figure: DOM tree of a movie page, with nodes html, head, title ("Godfather"), body, div class='head', div class='content', table width=80%, and td cells holding "Title :", "Godfather", "Director :", "Coppola", "Runtime", "118min"]
• We can use the following XPath rule to extract directors:
  W1 = /html/body/div[2]/table/td[2]/text()

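As a rough illustration of applying a rule like W1, the sketch below runs an equivalent path over a toy page with Python's standard `xml.etree.ElementTree`. The page markup, the order of the two divs, and the helper name `extract_directors` are assumptions made for this example; ElementTree supports only a subset of XPath (no `text()` step), so the text is read from the matched `td` element instead.

```python
import xml.etree.ElementTree as ET

# Toy page (an assumption, not the page from the slide): the second div
# under body holds a table whose second td is the director's name.
PAGE = (
    "<html><head><title>Godfather</title></head><body>"
    "<div class='head'>Godfather</div>"
    "<div class='content'><table width='80%'>"
    "<td>Director :</td><td>Coppola</td>"
    "</table></div>"
    "</body></html>"
)


def extract_directors(page: str) -> list[str]:
    """Apply a path equivalent to W1 and return the matched text."""
    root = ET.fromstring(page)  # root is the <html> element
    # ElementTree paths are relative to the root and have no text() step,
    # so W1 becomes body/div[2]/table/td[2] plus a read of .text.
    return [td.text for td in root.findall("./body/div[2]/table/td[2]")]


print(extract_directors(PAGE))
```

Real web pages are rarely well-formed XML, so a production extractor would need an HTML-tolerant parser; the point here is only how the positional steps of the rule select a node.
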
Wrappers
• Can be learned with a small amount of supervision.
• Very effective for site-level extraction.
• Have been extensively studied in the literature.

In This Work:
• Objective: learn wrappers without site-level supervision.

Idea
• Obtain training data cheaply using dictionaries or automatic labelers.
• Make wrapper induction tolerant to noise.

Summary of Approach
• A generic framework that can incorporate any wrapper inductor with certain plausible properties.
• Input: a wrapper inductor Φ and a set of labels L
• Idea: apply Φ to all subsets of L and choose the wrapper that gives the best list.

Summary of Approach
• Two main problems:
  • Wrapper Enumeration: how to generate the space of all possible wrappers efficiently?
  • Wrapper Ranking: how to rank the enumerated wrappers by quality?

Example: TABLE wrapper system
• Works on a table.
• Generates wrappers from the following space: a single cell, a row, a column, or the entire table.

Example: TABLE wrapper system
• L = {n1, n2, n4, a4, z5}
• 32 possible subsets
• 8 unique wrappers: {n1, n2, n4, a4, z5, C1, R4, T}, i.e. the five single cells, a column (C1), a row (R4), and the entire table (T)

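The count of 8 unique wrappers can be reproduced by brute force. The sketch below is only an illustration: the (row, column) positions assigned to the labels are an assumption (the slide's table is not reproduced here), chosen so that n1, n2, n4 share a column and n4, a4 share a row. The names `LABELS`, `table_inductor`, and `wrapper_space` are hypothetical, and the empty subset is skipped.

```python
from itertools import combinations

# Hypothetical (row, column) positions for the five labels; this layout
# is an assumption made for the example.
LABELS = {"n1": (1, 1), "n2": (2, 1), "n4": (4, 1), "a4": (4, 2), "z5": (5, 3)}


def table_inductor(names):
    """Return the smallest TABLE wrapper covering the named labels."""
    cells = {LABELS[n] for n in names}
    rows = {r for r, _ in cells}
    cols = {c for _, c in cells}
    if len(cells) == 1:
        return ("cell",) + next(iter(cells))
    if len(rows) == 1:
        return ("row", rows.pop())
    if len(cols) == 1:
        return ("col", cols.pop())
    return ("table",)


def wrapper_space(names):
    """Brute force: apply the inductor to every non-empty subset."""
    space = set()
    for size in range(1, len(names) + 1):
        for subset in combinations(sorted(names), size):
            space.add(table_inductor(subset))
    return space


space = wrapper_space(LABELS)
print(len(space), sorted(space))
```

Here `("col", 1)`, `("row", 4)`, and `("table",)` play the roles of C1, R4, and T. The brute-force loop is exponential in |L|, which is exactly what the enumeration algorithms in the talk are meant to avoid.
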
Wrapper Enumeration Problem
• Input: a wrapper inductor Φ and a set of labels L
• The wrapper space of L is defined as W(L) = {Φ(S) | S ⊆ L}
• Problem: enumerate the wrapper space of L in time polynomial in the size of the wrapper space and the size of L.

Wrapper Inductors
• TABLE: the wrapper inductor as defined before
• XPATH: learn the minimal XPath rule, in a simple fragment of XPath, that covers all the training examples
• LR: find the maximal pair of strings preceding and following all the training examples. The output of the wrapper is all strings delimited by the pair.

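To make the LR description concrete, here is a minimal sketch that treats a page as a flat string and each training example as a (start, end) character span. The function names `learn_lr` and `apply_lr` and the toy page are assumptions for illustration; the sketch also assumes both learned delimiters come out non-empty.

```python
import os
import re


def learn_lr(page, spans):
    """Longest left/right context shared by every labeled (start, end) span."""
    befores = [page[:start][::-1] for start, _ in spans]  # reversed prefixes
    afters = [page[end:] for _, end in spans]
    left = os.path.commonprefix(befores)[::-1]  # common suffix of the prefixes
    right = os.path.commonprefix(afters)
    return left, right


def apply_lr(page, left, right):
    """Return every string delimited by (left, right)."""
    # The lookahead leaves the right delimiter unconsumed, so it may
    # overlap with the next left delimiter.
    pattern = re.escape(left) + "(.*?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, page)


page = "Name: Alice; Name: Bob; Name: Carol; "
spans = [(page.index("Alice"), page.index("Alice") + 5),
         (page.index("Carol"), page.index("Carol") + 5)]
left, right = learn_lr(page, spans)
print(repr(left), repr(right), apply_lr(page, left, right))
```

Labeling only the first and last names is enough here for the learned pair to also pick up the unlabeled middle entry.
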
Well-behaved Inductor
• A wrapper inductor Φ is well-behaved if it has the following properties:
  • [Fidelity] L ⊆ Φ(L)
  • [Closure] l ∈ Φ(L) ⇒ Φ(L) = Φ(L ∪ {l})
  • [Monotonicity] L1 ⊆ L2 ⇒ Φ(L1) ⊆ Φ(L2)
• Theorem: TABLE, LR and XPATH are well-behaved wrapper inductors.

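The three properties can be spot-checked mechanically on a small instance. The sketch below does this for a TABLE-style inductor on an assumed 5 x 3 table with assumed label positions, identifying each wrapper with the set of cells it extracts. It checks one instance only and is not a proof of the theorem; `phi` and `is_well_behaved` are hypothetical names.

```python
from itertools import combinations

ROWS, COLS = range(1, 6), range(1, 4)  # assumed 5 x 3 table
# Assumed positions for n1, n2, n4, a4, z5.
LABELS = frozenset({(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)})


def phi(cells):
    """TABLE inductor, returned as the set of cells the wrapper extracts."""
    cells = frozenset(cells)
    rows = {r for r, _ in cells}
    cols = {c for _, c in cells}
    if len(cells) <= 1:
        return cells
    if len(rows) == 1:
        return frozenset((r, c) for r in rows for c in COLS)
    if len(cols) == 1:
        return frozenset((r, c) for r in ROWS for c in cols)
    return frozenset((r, c) for r in ROWS for c in COLS)


def is_well_behaved(phi, labels):
    """Check fidelity, closure, and monotonicity over all subsets of labels."""
    subsets = [frozenset(s) for k in range(len(labels) + 1)
               for s in combinations(sorted(labels), k)]
    fidelity = all(s <= phi(s) for s in subsets)
    closure = all(phi(s | {l}) == phi(s) for s in subsets for l in phi(s))
    monotone = all(phi(a) <= phi(b)
                   for a in subsets for b in subsets if a <= b)
    return fidelity, closure, monotone


print(is_well_behaved(phi, LABELS))
```
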
Bottom-up Algorithm
• Start with singleton labels in L as candidate label sets
• Learn wrappers by feeding candidate label sets to Φ
• Incrementally apply one-label extensions to each candidate
• Extend candidates with the closure of wrappers learned by Φ
• Theorem: the bottom-up algorithm is sound and complete
• Theorem: the bottom-up algorithm makes at most k·|L| calls to the wrapper inductor, where k is the size of the wrapper space.

Can we do better?
• A wrapper inductor is a feature-based inductor if:
  • Every label is associated with a set of features, i.e. (attribute, value) pairs
  • Φ(L) = the intersection of the feature sets of all labels in L
  • The output of a wrapper w = the text nodes satisfying all the features of w
• E.g. TABLE can be expressed as a feature-based inductor with two features, row and col.
• Both LR and XPATH can be expressed as feature-based inductors.

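A small sketch of the feature-based view, using TABLE with the two features row and col. The 5 x 3 table and the names `NODES`, `induce`, and `apply_wrapper` are assumptions for illustration, and `induce` expects a non-empty label list.

```python
# Assumed feature table: each node (table cell) maps to its
# (attribute, value) pairs.
NODES = {(r, c): {("row", r), ("col", c)}
         for r in range(1, 6) for c in range(1, 4)}


def induce(labels):
    """Phi(L): the features shared by every label (labels must be non-empty)."""
    return frozenset(set.intersection(*(NODES[l] for l in labels)))


def apply_wrapper(features):
    """Output of wrapper w: all nodes that satisfy every feature of w."""
    return {node for node, feats in NODES.items() if features <= feats}


w = induce([(1, 1), (2, 1), (4, 1)])
print(sorted(w), sorted(apply_wrapper(w)))
```

Three labels in the same column share only the feature `("col", 1)`, so the wrapper extracts that whole column. Labels that share no feature yield the empty feature set, which every node satisfies, giving the entire table.
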
Top-down Algorithm
• We give a top-down algorithm for feature-based inductors that makes exactly k calls to the wrapper inductor, where k is the size of the wrapper space.

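The slides do not spell out the top-down procedure, so the following is only one plausible reading, not necessarily the authors' algorithm: start from the full label set, call the inductor once per distinct label set, and split each set by the values of every attribute its members do not already share. The feature values and the names `FEATURES`, `phi`, and `top_down` are assumptions for this example.

```python
from collections import defaultdict

# Assumed features for the running example (label -> {attribute: value}).
FEATURES = {
    "n1": {"row": 1, "col": 1}, "n2": {"row": 2, "col": 1},
    "n4": {"row": 4, "col": 1}, "a4": {"row": 4, "col": 2},
    "z5": {"row": 5, "col": 3},
}
ATTRS = {"row", "col"}


def phi(labels):
    """Feature-based inductor: (attribute, value) pairs shared by all labels."""
    sets = [set(FEATURES[l].items()) for l in labels]
    return frozenset(set.intersection(*sets))


def top_down(labels):
    """Return {label set: wrapper} and the number of inductor calls."""
    wrappers, calls = {}, 0
    stack = [frozenset(labels)]
    while stack:
        group = stack.pop()
        if group in wrappers:
            continue  # this label set was already handled
        wrappers[group] = phi(group)
        calls += 1
        shared = {attr for attr, _ in wrappers[group]}
        for attr in ATTRS - shared:
            # Partition the group by the value of an attribute it does not share.
            parts = defaultdict(set)
            for l in group:
                parts[FEATURES[l][attr]].add(l)
            stack.extend(frozenset(p) for p in parts.values())
    return wrappers, calls


wrappers, calls = top_down(FEATURES)
print(calls, len(set(wrappers.values())))
```

On this example the sketch visits 8 label sets and makes 8 inductor calls, one per wrapper in the wrapper space.
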
Wrapper Ranking Problem
• Given a set of wrappers, we want to output one that gives the "best" list.
• Let X be a list extracted by a wrapper w
• Choose the wrapper that maximizes P[X | L], or equivalently, P[L | X] P[X]

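The equivalence on this slide is Bayes' rule with the denominator dropped, since P[L] does not depend on the candidate list X:

```latex
\[
\arg\max_{X} P[X \mid L]
  = \arg\max_{X} \frac{P[L \mid X]\, P[X]}{P[L]}
  = \arg\max_{X} P[L \mid X]\, P[X]
\]
```
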
Example: Extracting names from business listings
• Let us rank the following three lists as candidates for the set of names:
  • X1 = first column
  • X2 = entire table
  • X3 = first two columns

Example: Extracting names from business listings
• X1 = first column
• P[L | X1]: 2 wrong labels, 3 correct labels
• P[X1]: nice repeating structure, schema size = 4

Example: Extracting names from business listings
• X2 = entire table
• P[L | X2]: 0 wrong labels, 5 correct labels
• P[X2]: nice repeating structure, schema size = 1

Example: Extracting names from business listings
• X3 = first two columns
• P[L | X3]: 1 wrong label, 4 correct labels
• P[X3]: poor repeating structure, schema size = 1 or 3

Ranking Model
• P[L | X]
  • Assume a simple annotator with precision p and recall r that independently labels each node.
  • Each node in X is added to L with probability r
  • Each node not in X is added to L with probability 1 - p

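Under the independence assumption above, the likelihood factorizes over nodes: a node in X is labeled with probability r and left unlabeled with probability 1 - r, and a node outside X is labeled with probability 1 - p and left unlabeled with probability p. The sketch below computes the resulting log-likelihood. The 5 x 3 table, the label positions, and the values p = 0.9 and r = 0.3 are assumptions for illustration, and the function requires 0 < p, r < 1.

```python
from math import log


def log_likelihood(X, L, H, p, r):
    """log P[L | X] when each node in H is labeled independently."""
    hit = len(L & X)        # nodes in X that were labeled (prob. r each)
    miss = len(X - L)       # nodes in X left unlabeled (prob. 1 - r each)
    noise = len(L - X)      # nodes outside X that were labeled (prob. 1 - p each)
    quiet = len(H - X - L)  # nodes outside X left unlabeled (prob. p each)
    return (hit * log(r) + miss * log(1 - r)
            + noise * log(1 - p) + quiet * log(p))


# Assumed 5 x 3 table with labels at the positions of n1, n2, n4, a4, z5.
H = {(row, col) for row in range(1, 6) for col in range(1, 4)}
L = {(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)}
X1 = {cell for cell in H if cell[1] == 1}   # first column
X2 = set(H)                                 # entire table
X3 = {cell for cell in H if cell[1] <= 2}   # first two columns

for name, X in (("X1", X1), ("X2", X2), ("X3", X3)):
    print(name, round(log_likelihood(X, L, H, p=0.9, r=0.3), 3))
```

These scores cover only the P[L | X] factor; the final ranking also multiplies in the prior P[X].
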
Ranking Model
• P[X]
  • Define features of the grammar that describes X, e.g. schema size and repeating structure
  • Learn distributions over the values of these features, or take them as input as part of domain knowledge.

Experiments
• Datasets:
  • DEALERS: used automatic form-filling techniques to obtain dealer listings from 300 store locator pages
  • DISCOGRAPHY: crawled 14 music websites that contain track listings of albums.
• Task: automatically learn wrappers to extract business names/track titles for each website.

Summary
• A new framework for noise-tolerant wrapper induction
  • Two efficient wrapper enumeration algorithms
  • Probabilistic wrapper ranking model
• Web-scale information extraction
  • No site-level supervision
  • No manual labeling
  • Tolerating noise in automatic labeling

Bottom-up Algorithm
• INPUT: Φ, L
• Z = all singleton subsets of L
• W = Z
• while (Z not empty)
    Remove the smallest set S from Z
    For each possible single-label expansion S’ of S
        Add Φ(S’) to W
        Add (Φ(S’) ∩ L) back to Z

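The pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the inductor is a TABLE-style `phi` on an assumed 5 x 3 table with assumed label positions, wrappers are identified with the set of cells they extract, and a `seen` set (not in the pseudocode) is added so that label sets already queued are not queued again.

```python
ROWS, COLS = range(1, 6), range(1, 4)  # assumed 5 x 3 table
# Assumed positions for n1, n2, n4, a4, z5.
L = frozenset({(1, 1), (2, 1), (4, 1), (4, 2), (5, 3)})


def phi(cells):
    """TABLE inductor, returned as the set of cells the wrapper extracts."""
    cells = frozenset(cells)
    rows = {r for r, _ in cells}
    cols = {c for _, c in cells}
    if len(cells) <= 1:
        return cells
    if len(rows) == 1:
        return frozenset((r, c) for r in rows for c in COLS)
    if len(cols) == 1:
        return frozenset((r, c) for r in ROWS for c in cols)
    return frozenset((r, c) for r in ROWS for c in COLS)


def bottom_up(phi, labels):
    """Enumerate the wrapper space; returns (wrappers, inductor calls)."""
    Z = {frozenset({l}) for l in labels}   # candidate label sets
    seen = set(Z)                          # assumption: avoid re-queuing
    W = {phi(s) for s in Z}                # wrappers of the singletons
    calls = len(Z)
    while Z:
        S = min(Z, key=lambda s: (len(s), sorted(s)))  # smallest set first
        Z.remove(S)
        for l in labels - S:               # single-label expansions of S
            w = phi(S | {l})
            calls += 1
            W.add(w)
            closed = w & labels            # labels covered by the wrapper
            if closed not in seen:
                seen.add(closed)
                Z.add(closed)
    return W, calls


W, calls = bottom_up(phi, L)
print(len(W), calls)
```

On this example the sketch finds all 8 wrappers, and the number of inductor calls stays within the k·|L| = 40 bound stated for the algorithm.
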
Bottom-up Algorithm (example run on L = {n1, n2, n4, a4, z5}; one line per slide, wrapper lattice diagram not reproduced)
Z = {n1, n2, n4, a4, z5}
Z = {n2, n4, a4, z5, {n1, n2, n4}}   (C1 added to the lattice)
Z = {n2, n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}   (T added)
Z = {n4, a4, z5, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {a4, z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}   (R4 added)
Z = {z5, {n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n4, a4}, {n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4}, {n1, n2, n4, a4, z5}}
Z = {{n1, n2, n4, a4, z5}}
Z = {}

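Top-down Algorithm
[Figure: tree rooted at the full label set {n1, n2, n4, a4, z5}, split by the row and column features into smaller label sets such as {n1, n2, n4} and {n4, a4}, down to the single labels n1, n2, n4, a4, z5]
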
Wrapper Ranking
[Figure: Venn diagram over all nodes H showing the wrapper output X and the label set L, with regions for labeled nodes in X, non-labeled nodes in X, labeled nodes outside X, and non-labeled nodes outside X (region markers X1, X2, A1, A2)]
• argmax_X P(L | X) P(X)?
• Possible values of X are the possible wrappers computed by Φ
• P(L | X): probability of observing L given that X is the right wrapper
• The annotator has precision p and recall r (estimated from tested labelings)
• Independent annotation process:
  • Decide on labeling nodes independently
  • Each node in X is added to L with probability r
  • Each node not in X is added to L with probability 1 - p