
Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model


Presentation Transcript


  1. Robust Web Extraction: An Approach Based on a Probabilistic Tree-Edit Model Nilesh Dalvi, Philip Bohannon, Fei Sha Presented by Vinay Rambhia

  2. Introduction • Script-generated websites share a common HTML tree structure • Wrappers are used to extract information from them • An XPath expression can extract the director information: w1 = /html/body/div[2]/table/td[2]/text() • The same wrapper works for similarly structured pages
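As a minimal illustration of how such a wrapper is applied (the page fragment below is invented, and the XPath adds the tr step that a well-formed HTML table requires), the sketch uses lxml to evaluate the expression:

    # Hedged sketch: applying a positional wrapper like w1 to an invented
    # IMDB-like fragment with lxml (the tr step is added for well-formed HTML).
    from lxml import html

    PAGE = """
    <html><body>
      <div>header</div>
      <div class="content">
        <table>
          <tr><td>Director</td><td>James Cameron</td></tr>
        </table>
      </div>
    </body></html>
    """

    tree = html.fromstring(PAGE)
    w1 = "/html/body/div[2]/table/tr/td[2]/text()"   # variant of the slide's w1
    print(tree.xpath(w1))                            # ['James Cameron'] on this sample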

  3. Introduction • Page evolution causes wrappers to break, which makes them expensive to maintain • Alternative wrappers for the same field: w2=//div[@class='content']/*/td[2]/text() w3=//table[@width='80%']/td[2]/text() w4=//text()[psib::*[1][text()='director']]
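To make the breakage concrete, here is a hedged sketch (the changed page is invented, and tr steps are added for well-formed HTML): after a new banner div is inserted, the positional wrapper fails while an attribute-based wrapper in the spirit of w2 still finds the director:

    # Hedged sketch: a simulated site change breaks the positional wrapper
    # but not the attribute-based one.
    from lxml import html

    CHANGED_PAGE = """
    <html><body>
      <div>header</div>
      <div id="banner">new banner added in a redesign</div>
      <div class="content">
        <table>
          <tr><td>Director</td><td>James Cameron</td></tr>
        </table>
      </div>
    </body></html>
    """

    tree = html.fromstring(CHANGED_PAGE)
    w1 = "/html/body/div[2]/table/tr/td[2]/text()"       # positional: now hits the banner div
    w2 = "//div[@class='content']/*/tr/td[2]/text()"     # attribute-based: unaffected
    print(tree.xpath(w1))   # []  -- broken
    print(tree.xpath(w2))   # ['James Cameron']  -- still works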

  4. Introduction • This paper: • uses temporal snapshots of web pages to learn a probabilistic tree-edit model of page evolution • uses the model to improve wrapper construction • The model can be estimated efficiently, in time quadratic in the size of the trees • When applied to IMDB, the resulting wrappers were 86% robust, whereas traditional wrappers were 40% robust

  5. Robust Extraction Framework

  6. Change Model • The change model is defined in terms of a conditional transducer, the π process • Given a forest S, the π process probabilistically transforms it into a new forest T • The π process is decomposed into two sub-processes, πins and πds

  7. Change Model • To summarize, the generative process π is characterized by the parameters θ = (pstop, {pdel(l)}, {pins(l)}, {psub(l1, l2)}) for l, l1, l2 ∈ Σ, subject to the following conditions: • 0 < pstop < 1 • 0 ≤ pdel(l) ≤ 1 • pins(l) ≥ 0, Σ_l pins(l) = 1 • psub(l1, l2) ≥ 0, Σ_l2 psub(l1, l2) = 1 ……(Eq. A)
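A minimal sketch of what such a parameter set might look like in code (the label alphabet and all probability values below are invented), together with checks for the conditions in Eq. (A):

    # Hedged sketch of theta = (p_stop, p_del, p_ins, p_sub) with Eq. (A) checks.
    import math

    LABELS = ["table", "tr", "td", "div", "span"]   # hypothetical label alphabet Sigma

    theta = {
        "p_stop": 0.7,
        "p_del":  {l: 0.05 for l in LABELS},                  # per-label deletion probability
        "p_ins":  {l: 1.0 / len(LABELS) for l in LABELS},     # distribution over inserted labels
        "p_sub":  {l1: {l2: (0.9 if l1 == l2 else 0.1 / (len(LABELS) - 1))
                        for l2 in LABELS} for l1 in LABELS},  # row-stochastic substitution matrix
    }

    def check_theta(theta):
        assert 0 < theta["p_stop"] < 1
        assert all(0 <= p <= 1 for p in theta["p_del"].values())
        assert math.isclose(sum(theta["p_ins"].values()), 1.0)
        for l1, row in theta["p_sub"].items():
            assert math.isclose(sum(row.values()), 1.0), l1

    check_theta(theta)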

  8. Model Learner • Archival data contains (S, T) pairs, where S is the old version of a page and T is the new version • The model is specified by the set of parameters θ • We want to find θ∗ = arg max_θ Π_{(S,T) ∈ ArchivalData} Pθ(T | S) • Pθ(T | S) is the transformation probability, whose computation is described next

  9. Computing Transformation Probabilities • The transducer π performs a sequence of edit operations, consisting of insertions, deletions and substitutions, to transform a tree S into another tree T • Dynamic programming is used to compute the probability, since many different edit sequences can produce the same result

  10. Computing Transformation Probabilities • Let DP1(Fs, Ft) denote the probability that π(Fs) = Ft, and let v be the rightmost root node of Ft • Two cases: • The node v was the result of an insertion by the πins operator. Let p be the probability that πins inserts the node v into Ft − v to form Ft. Then the probability of this case is DP1(Fs, Ft − v) ∗ p • The node v was the result of a substitution. The probability of this case is DP2(Fs, Ft) • Hence, we have DP1(Fs, Ft) = DP2(Fs, Ft) + p ∗ DP1(Fs, Ft − v) ……(Eq. 1)

  11. Computing Transformation Probabilities • Let DP2(Fs, Ft) denote the probability that π(Fs) = Ft with the node v produced by a substitution (not an insertion), and let u be the rightmost root node of Fs • Two cases: • v was substituted for u. In this case, Fs − [u] must transform to Ft − [v] and ⌊u⌋ must transform to ⌊v⌋. Denoting psub(label(u), label(v)) by p1, the total probability of this case is p1 ∗ DP1(Fs − [u], Ft − [v]) ∗ DP1(⌊u⌋, ⌊v⌋) • v was substituted for some node other than u, i.e. u was deleted. Denoting pdel(label(u)) by p2, the probability of this case is p2 ∗ DP2(Fs − u, Ft) • Hence, DP2(Fs, Ft) = p1 ∗ DP1(Fs − [u], Ft − [v]) ∗ DP1(⌊u⌋, ⌊v⌋) + p2 ∗ DP2(Fs − u, Ft) ……(Eq. 2)
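The two recurrences can be turned into a short memoized program. The sketch below is not the paper's implementation; it assumes trees are (label, children) tuples, takes u and v to be the rightmost roots of their forests, uses pins(label(v)) as the insertion probability p, and sets DP1(∅, ∅) = pstop (the insertion process stops). These assumptions reproduce the worked example on slide 13.

    # Hedged sketch of the DP1/DP2 recurrences in Eq. (1) and Eq. (2).
    # A forest is a tuple of trees; a tree is a (label, children-forest) pair.
    from functools import lru_cache

    def make_dp(theta):
        p_stop, p_del = theta["p_stop"], theta["p_del"]
        p_ins, p_sub = theta["p_ins"], theta["p_sub"]

        def delete_all(tree):
            # Probability that a whole subtree is deleted by the transducer.
            label, children = tree
            prob = p_del[label]
            for child in children:
                prob *= delete_all(child)
            return prob

        @lru_cache(maxsize=None)
        def dp1(Fs, Ft):
            """Probability that pi(Fs) = Ft."""
            if not Ft:
                # Every node of Fs is deleted and the insertion process stops.
                prob = p_stop
                for tree in Fs:
                    prob *= delete_all(tree)
                return prob
            v_label, v_children = Ft[-1]                 # rightmost root of Ft
            # Eq. (1): v was inserted, or v came from a substitution (dp2).
            return dp2(Fs, Ft) + p_ins[v_label] * dp1(Fs, Ft[:-1] + v_children)

        @lru_cache(maxsize=None)
        def dp2(Fs, Ft):
            """Probability that pi(Fs) = Ft and the rightmost root of Ft was not inserted."""
            if not Fs:
                return 0.0                               # no source node left to substitute
            u_label, u_children = Fs[-1]                 # rightmost root of Fs
            v_label, v_children = Ft[-1]                 # rightmost root of Ft
            p1, p2 = p_sub[u_label][v_label], p_del[u_label]
            return (p1 * dp1(Fs[:-1], Ft[:-1]) * dp1(u_children, v_children)  # v substituted for u
                    + p2 * dp2(Fs[:-1] + u_children, Ft))                     # u deleted

        return dp1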

  12. Computing Transformation Probabilities

  13. Computing Transformation Probabilities • Let T1 be the tree with root a and child b, and let T2 be the tree with the single node c • Let us compute the probability that π(T1) = T2, which is denoted by DP1(T1, T2). Applying Eq. (1) we get DP1(T1, T2) = DP2(T1, T2) + pins(c) ∗ DP1(T1, ∅) • Let T3 denote the tree with the single node b. Then DP2(T1, T2) = psub(a, c) ∗ DP1(∅, ∅) ∗ DP1(T3, ∅) + pdel(a) ∗ DP2(T3, T2) • Expanding DP2(T3, T2), we get DP2(T3, T2) = psub(b, c) ∗ DP1(∅, ∅) ∗ DP1(∅, ∅) + pdel(b) ∗ DP2(∅, T2) • Total probability: DP1(T1, T2) = psub(a, c) ∗ pdel(b) ∗ pstop² + psub(b, c) ∗ pdel(a) ∗ pstop² + pdel(a) ∗ pdel(b) ∗ pins(c) ∗ pstop
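Using make_dp from the sketch after slide 11 (with invented parameter values for the hypothetical labels a, b, c), the closed-form expression above can be checked numerically:

    # Hedged check of the slide-13 example against the DP sketch above.
    LBL = ["a", "b", "c"]
    theta_abc = {
        "p_stop": 0.6,
        "p_del":  {l: 0.2 for l in LBL},
        "p_ins":  {l: 1.0 / 3 for l in LBL},
        "p_sub":  {l1: {l2: 1.0 / 3 for l2 in LBL} for l1 in LBL},
    }

    T1 = (("a", (("b", ()),)),)    # forest with one tree: root a, child b
    T2 = (("c", ()),)              # forest with the single-node tree c

    dp1 = make_dp(theta_abc)
    ps, pd, pi, psb = (theta_abc["p_stop"], theta_abc["p_del"],
                       theta_abc["p_ins"], theta_abc["p_sub"])
    expected = (psb["a"]["c"] * pd["b"] * ps**2
                + psb["b"]["c"] * pd["a"] * ps**2
                + pd["a"] * pd["b"] * pi["c"] * ps)
    assert abs(dp1(T1, T2) - expected) < 1e-12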

  14. Parameter estimation • θ∗ = arg max_θ Σ_{n=1..N} log Pθ(Tn | Sn) • It is difficult to compute θ∗ in closed form, so we use gradient ascent: θ^{t+1} = θ^t + η g(θ^t) ……(Eq. 3), where g(θ) = ∂ log ℓ(θ)/∂θ = Σ_{n=1..N} ∂ log Pθ(Tn | Sn)/∂θ • θ has to satisfy Eq. (A), so we reparameterize each constrained distribution as θij = e^αij / Σ_j e^αij • Eq. (3) then becomes α^{t+1} = α^t + η g(α^t)
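A hedged numpy sketch of this update, with the softmax reparameterization keeping each row of θ a valid distribution; the toy objective below (fitting row-normalized counts C) merely stands in for the archival-data log-likelihood, which in the paper would be computed with the DP above:

    # Hedged sketch of Eq. (3) after the softmax reparameterization.
    import numpy as np

    def softmax(alpha):
        """theta_ij = exp(alpha_ij) / sum_j exp(alpha_ij)."""
        z = np.exp(alpha - alpha.max(axis=-1, keepdims=True))   # numerically stable
        return z / z.sum(axis=-1, keepdims=True)

    def gradient_ascent(grad_loglik, alpha0, eta=0.05, steps=200):
        """alpha^{t+1} = alpha^t + eta * g(alpha^t)."""
        alpha = alpha0.copy()
        for _ in range(steps):
            alpha = alpha + eta * grad_loglik(alpha)
        return softmax(alpha)        # return the constrained parameters theta

    # Toy objective standing in for the log-likelihood: fit theta to invented
    # substitution counts C; the gradient of sum_ij C_ij * log(theta_ij)
    # with respect to alpha is C - rowsum(C) * softmax(alpha).
    C = np.array([[8.0, 1.0, 1.0],
                  [1.0, 8.0, 1.0],
                  [1.0, 1.0, 8.0]])

    def grad_loglik(alpha):
        theta = softmax(alpha)
        return C - C.sum(axis=1, keepdims=True) * theta

    theta_hat = gradient_ascent(grad_loglik, np.zeros_like(C))
    print(np.round(theta_hat, 3))    # rows approach the normalized counts [0.8, 0.1, 0.1]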

  15. Generating Candidate Wrappers • We use a bottom-up algorithm that starts from a general XPath and specializes it until it matches only the target node • For example, w0 = //table/*/td/text() can be specialized to //table/tr/td/text(), //table[@bgcolor='red']/*/td/text(), or //table/*/td[2]/text() • The algorithm maintains a set P of partial wrappers, each with recall = 1 and precision < 1 • The algorithm applies specialization steps to the XPaths in P, producing new XPaths, until precision reaches 1
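A hedged sketch of this search (the sample page, the annotated target, and the specialization moves are all invented; this is not the paper's enumeration procedure):

    # Hedged sketch: keep a pool P of XPaths that still match the annotated
    # target (recall 1) and specialize until only the target matches (precision 1).
    from lxml import html

    PAGE = html.fromstring("""
    <html><body><div class="content">
      <table bgcolor="red">
        <tr><td>Director</td><td>James Cameron</td></tr>
        <tr><td>Year</td><td>2009</td></tr>
      </table>
    </div></body></html>""")
    TREE = PAGE.getroottree()
    TARGET = TREE.getpath(PAGE.xpath("//tr[1]/td[2]")[0])   # annotated node to extract

    def result_paths(xpath):
        return [TREE.getpath(n) for n in PAGE.xpath(xpath)]

    def specializations(w):
        """Invented specialization moves, each applicable at most once."""
        if "/*/" in w:
            yield w.replace("/*/", "/tr/", 1)                            # name the wildcard step
        if "td[" not in w:
            yield w.replace("/td", "/td[2]", 1)                          # fix the column position
        if "[@bgcolor" not in w:
            yield w.replace("//table", "//table[@bgcolor='red']", 1)     # add an attribute predicate
        if "/tr/" in w and "tr[" not in w:
            yield w.replace("/tr/", "/tr[td='Director']/", 1)            # anchor on the label cell

    P, wrappers, seen = ["//table/*/td"], [], set()
    while P:
        w = P.pop()
        if w in seen:
            continue
        seen.add(w)
        paths = result_paths(w)
        if TARGET not in paths:
            continue                      # recall dropped below 1: discard this branch
        if paths == [TARGET]:
            wrappers.append(w)            # precision reached 1: a candidate wrapper
            continue
        P.extend(specializations(w))

    print(wrappers)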

  16. Generating Candidate Wrappers

  17. Evaluating Robustness of Wrappers • Rob X,θ(ϕ) = Σ_{Y : (X,Y) |= ϕ} Pθ(Y | X), i.e. the total probability of future versions Y of page X on which the wrapper ϕ still works • Algorithm for calculating robustness
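The paper gives an exact algorithm for this quantity; as a rough intuition only, a Monte Carlo approximation could sample changed pages from the learned model and count how often the wrapper survives. Everything below (sample_change, extract, the gold value) is a hypothetical placeholder:

    # Hedged sketch: Monte Carlo estimate of Rob_{X,theta}(phi); not the
    # paper's exact algorithm. sample_change and extract are assumed helpers.
    def estimate_robustness(phi, X, theta, gold_value, sample_change, extract, n_samples=1000):
        """Fraction of sampled future versions Y on which phi still extracts the gold value."""
        hits = 0
        for _ in range(n_samples):
            Y = sample_change(X, theta)          # Y ~ P_theta(Y | X), from the change model
            if extract(phi, Y) == gold_value:    # does the wrapper still work on Y?
                hits += 1
        return hits / n_samples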

  18. Experimental Evaluation Change model

  19. Experimental Evaluation Generating Robust Wrappers

  20. Experimental Evaluation Evaluation of Model Learner

  21. Thank you! Any questions?
