Flint: exploiting redundant information to wring out value from Web data • Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti • Roma Tre University - Rome, Italy
Motivations • The opportunity: An increasing number of web sites with structured information • The problem: • Current technologies are limited in exploiting the data offered by these sources • Semantic Web technologies are too complex and costly • Challenges: • development of unsupervised, scalable techniques to extract and integrate data from fairly structured large corpora available on the Web [DB Claremont Report 2008]
Introduction • Notable approaches for massive extraction of Web data concentrate on information organized according to specific patterns that occur on the Web • WebTables [Cafarella et al VLDB2008] and ListExtract [Elmeleegy et al VLDB2009] focus on data published in HTML tables and lists • Information extraction systems (e.g. TextRunner [Banko-Etzioni ACL2008]) exploit lexical-syntactic patterns to extract collections of facts (e.g., x is the capital of y) • Even a small fraction of the Web implies an impressive amount of data • given a Web fragment of 14 billion pages, 1.1% of them are good tables -> 154 million [Cafarella et al VLDB2008]
Observation • Many sources publish data about one object of a real-world entity on each page • Collections of pages can be thought of as HTML encodings of a relation
Observation • Learned while looking for pages to evaluate RoadRunner • I was frustrated… RoadRunner was built to infer arbitrarily nested structures (lists of lists of lists…), but pages were much simpler • And pages with complex structures were usually designed to support the navigation to detail pages
Information redundancy on the Web • For many disparate entities (e.g. stock quotes, people, products, movies, books, etc.) many web sites follow this publishing strategy • These sites can be considered as sources that provide redundant information. The redundancy occurs: • at the schema level: the same attributes are published by more than one source (e.g. volume, min/max/avg price, market cap. for stock quotes) • at the extensional level: several objects are published by more than one source (e.g. many web sites publish data about the same stock quotes)
Abstract Generative Process • "Hidden Relation" R0 • Each source is generated by: • π projection • σ selection • e error (e.g. approximations, mistakes, formattings) • λ template encoding • SY = λY(eY(σY(πY(R0)))), SG = λG(eG(σG(πG(R0)))), SR = λR(eR(σR(πR(R0)))) • Example: SY = λY(eY(σticker LIKE "nasdaq:%"(πticker, price, volume(R0)))), with eY: round(volume, 1000) and price perturbed by noise N(0, σ)
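To make the generative process concrete, here is a minimal Python sketch, assuming a toy hidden relation R0 of stock quotes; the names (R0, make_source) and the specific noise/rounding choices are illustrative, not Flint's.

```python
# A minimal sketch of the abstract generative process, assuming a toy
# hidden relation R0 with (ticker, price, volume) tuples. All names and
# parameters below are illustrative.
import random

R0 = [
    ("nasdaq:AAPL", 9.17, 125_400),
    ("nasdaq:GOOG", 530.20, 88_100),
    ("nyse:IBM",    101.05, 44_900),
]

def make_source(selection, projection, error, template):
    """Compose the four operators: S = lambda(e(sigma(pi(R0))))."""
    tuples = [projection(t) for t in R0 if selection(t)]
    return [template(error(t)) for t in tuples]

# Source SY: NASDAQ quotes only, volume rounded to thousands,
# price perturbed by small Gaussian noise, rendered into an HTML-ish row.
S_Y = make_source(
    selection=lambda t: t[0].startswith("nasdaq:"),   # sigma: ticker LIKE "nasdaq:%"
    projection=lambda t: t,                           # pi: keep ticker, price, volume
    error=lambda t: (t[0],
                     round(t[1] + random.gauss(0, 0.01), 2),  # e: noise on price
                     round(t[2], -3)),                        # e: round volume to 1000s
    template=lambda t: f"<tr><td>{t[0]}</td><td>{t[1]}</td><td>{t[2]}</td></tr>",
)
```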
Problem: Invert the Process • Given the source views, reconstruct the "Hidden Relation" R0
The Flint approach • Exploit the redundancy of information • to discover the sources[Blanco et al WIDM08, WWW11] • to generate the wrappers, to match data from different sources, to infer labels for the extracted data[Blanco et al EDBT08,WebDB10, VLDS11] • to evaluate the quality of the data and the accuracy of the sources[Blanco et al Caise2010, Wicow11]
The Flint approach (intuition) • [figure: overlapping redundant views published by sources SG, SY, SR]
Integration and Extraction • the integration problem • the extraction problem • how they can be tackled jointly • We start by considering the web sources as relational views over the hidden relation
Integration Problem • Given a set of sources S1, S2, … Sk, each Si publishes a view of the hidden relation • Problem: create a set of mappings, where each mapping is a set of attributes with the same semantics
Integration Problem • [figure: attributes of the views SG, SY, SR grouped into mappings]
Integration Algorithm • Intuition: we match attributes from different sources to build aggregations of attributes with the same semantics • Assumption: alignment (record linkage) over a sample of tuples • To identify attributes with the same semantics, we rely on instance-based matching • noise implies possible discrepancies in the values! • e.g. between attributes of SA and SB: d(a1, b1) = 0.08
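As an illustration of instance-based matching, a minimal sketch, assuming record linkage has already aligned the tuples of the two sources; the particular distance (mean relative difference for numbers, mismatch rate for strings) is an illustrative choice, not Flint's exact measure.

```python
# Instance-based distance between two attributes, assuming a_values[i]
# and b_values[i] describe the same real-world object (record linkage).
def attribute_distance(a_values, b_values):
    diffs = []
    for a, b in zip(a_values, b_values):
        try:
            x, y = float(str(a).rstrip("%")), float(str(b).rstrip("%"))
            denom = max(abs(x), abs(y), 1e-9)
            diffs.append(abs(x - y) / denom)   # tolerate small numeric noise
        except ValueError:
            diffs.append(0.0 if str(a).strip() == str(b).strip() else 1.0)
    return sum(diffs) / len(diffs) if diffs else 1.0

# e.g. slightly discrepant prices still yield a small distance:
d = attribute_distance([9.17, 530.20], [9.15, 530.00])  # ~ 0.001
```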
Integration Algorithm • Every attribute (from sources SA, SB, SC) is a node of a graph
Integration Algorithm • Every attribute is matched against all other attributes
Integration Algorithm • Edges are ranked w.r.t. the distance (due to the discrepancies). We start with the best match
Integration Algorithm • We drop useless edges
Integration Algorithm • We take the next edge in the rank and drop useless edges
Integration Algorithm • Clustering algorithm to solve the problem • AbstractIntegration is O(n²) over the total number of attributes in the sources • But we are dealing with clean relational views… are these the relations we get from wrappers?
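A minimal sketch of this greedy, rank-and-merge clustering. It assumes that a "useless" edge is one that would put two attributes of the same source into one mapping; that constraint, the threshold, and the names (abstract_integration) are assumptions for illustration.

```python
# Greedy clustering over the attribute-match graph: rank all O(n^2)
# edges by distance, take them best-first, merge clusters, and drop
# edges that would merge two attributes of the same source.
from itertools import combinations

def abstract_integration(attributes, distance, threshold=0.2):
    clusters = {a: {a} for a in attributes}        # attributes are (source, attr) pairs
    edges = sorted(
        ((distance(a, b), a, b) for a, b in combinations(attributes, 2)),
        key=lambda e: e[0],
    )
    for d, a, b in edges:
        if d > threshold:
            break                                  # remaining edges are too weak
        ca, cb = clusters[a], clusters[b]
        if ca is cb:
            continue                               # already in the same mapping
        if {s for s, _ in ca} & {s for s, _ in cb}:
            continue                               # drop: same-source conflict
        merged = ca | cb                           # take the edge, merge clusters
        for attr in merged:
            clusters[attr] = merged
    # each distinct cluster is a mapping of same-semantics attributes
    return {frozenset(c) for c in clusters.values()}
```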
Extraction Problem • A source Si is a collection of pages Si = {p1, p2, …, pn} • each page publishes data about one object of a real-world entity • Two different types of values can appear in a page: • target values: data from the hidden relation • noise values: irrelevant data (e.g., advertising, template, layout, etc.)
Extraction Problem • A wrapper wi is a set of extraction rules wi = {erA1, …, erAn} • [figure: DOM trees of page 1 and page 2 with rules er1–er4] • Unsupervised wrapper inference limits: • Extraction of noise data (e.g. er3) • Some extraction rules may be imprecise (e.g. er4)
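Extraction rules are typically paths over the page DOM. The sketch below shows a wrapper as a set of XPath rules applied to every page of a source; the XPaths and page structure are made up for illustration (Flint infers rules automatically rather than hand-writing them), and the lxml library is assumed.

```python
# A wrapper as a set of XPath extraction rules; each rule extracts one
# value per page. Requires lxml (pip install lxml). Rule paths below are
# hypothetical examples.
from lxml import html

wrapper = {
    "er1": "//table[@id='quote']//td[1]/text()",   # may be a correct rule
    "er2": "//table[@id='quote']//td[2]/text()",
    "er3": "//div[@class='banner']/text()",        # may extract noise (ads)
}

def apply_wrapper(wrapper, pages):
    """Return, for each rule, the vector of values it extracts per page."""
    extracted = {name: [] for name in wrapper}
    for page_html in pages:
        tree = html.fromstring(page_html)
        for name, xpath in wrapper.items():
            values = tree.xpath(xpath)
            extracted[name].append(values[0].strip() if values else None)
    return extracted
```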
Extraction Problem • An extraction rule is: • correct if for every page it extracts a target value of the same conceptual attribute • weak if it mixes target values with different semantics, or target values with noise values
Extraction Problem • Problem: Given a set of sources S = {S1, S2, …, Sn}, produce a set of wrappers W* = {w1, w2, …, wn}, such that each wi contains all and only the correct rules for Si • We leverage the redundant information among different sources to identify and filter out the weak rules • In a redundant environment, extracted data do not match by chance!
Overlapping Rules • To increase the probability of getting the correct rules, we need a wrapper with more extraction rules • Over pages P1, P2, P3, …: er1 = ⟨2.1%, 42ML, 3.0%, …⟩, er2 = ⟨2.1%, 1.3%, 3.0%, …⟩, er3 = ⟨33ML, 42ML, 1ML, …⟩, er4 = ⟨5, 5, 6, …⟩ • [figure: DOM tree (html, body, div, div) with the values 2.1% and 33ML]
Overlapping Rules • Two extraction rules from the same wrapper overlap if they extract the same occurrence of the same string from one page • e.g. er1 = ⟨2.1%, 42ML, 3.0%, …⟩ and er2 = ⟨2.1%, 1.3%, 3.0%, …⟩ overlap: both extract the same occurrence of "2.1%" from page P1
Overlapping Rules • Given a set of overlapping rules, one is correct, the others are weak • Idea: match all of them against rules from other sources: (i) the correct rule is the one with the best matching score, (ii) drop the others • [figure: matching scores between rules of S1 and S2, e.g. 0.5, 0, 1, 0.5]
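A minimal sketch of this filtering step, assuming the overlap groups have already been computed and that match_score(rule) returns the rule's best matching score against the other sources' rules (e.g. 1 minus the instance-based distance above); both names are illustrative.

```python
# Keep, in each group of overlapping rules, only the rule that matches
# best against the other sources; the rest are considered weak.
def filter_overlapping(overlap_groups, match_score):
    kept = set()
    for group in overlap_groups:
        best = max(group, key=match_score)   # the correct rule matches best
        kept.add(best)                       # overlapping alternatives are dropped
    return kept
```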
Integration Algorithm • Lemma: AbstractIntegration is correct • AbstractIntegration is O(n²) over the total number of attributes in the sources
Extraction and Integration Alg. • Lemma: AbstractExtraction is correct • AbstractExtraction is O(n²) over the total number of extraction rules
Extraction and Integration Alg. • Greedy best-effort algorithm for integration and extraction [Blanco et al. WebDb2010, WWW2011] • Promising experimental results
Some Results • R = number of correct extraction rules over the number of sources containing the actual attribute
Adding Labels • Last step: assign a label to each mapping • Candidate labels: the textual template nodes that occur closest to the extracted values • poor performance on a single source • but effective on a large number of sources, because it exploits the redundancy of labels (observed also in [Cafarella et al SIGMOD Record 2008])
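A minimal sketch of how redundancy can pick a label, assuming each source proposes its closest template text node as a candidate for the mapping; the majority vote over normalized candidates is an illustrative choice, not Flint's exact procedure.

```python
# Pick a mapping's label as the most frequent candidate across sources.
from collections import Counter

def choose_label(candidate_labels):
    """candidate_labels: one candidate string per source, e.g.
    ['Volume', 'Vol.', 'Volume', 'Volume:'] -> 'volume'."""
    normalized = [c.strip().rstrip(":").lower() for c in candidate_labels]
    label, _ = Counter(normalized).most_common(1)[0]
    return label
```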
The Flint approach • Exploit the redundancy of information • to discover the sources[Blanco et al WIDM08, WWW11] • to generate the wrappers, to match data from different sources, to infer labels for the extracted data[Blanco et al EDBT08,WebDB10, VLDS11] • to evaluate the quality of the data and the accuracy of the sources[Blanco et al Caise2010, Wicow11]
Source Discovery • We developed crawling techniques to discover and collect the page collections that form our input sources [Blanco et al WIDM08, WWW11] • Input: a few sample pages • The crawler also associates an identifier with the objects described in the collected pages
Data Quality and Source Accuracy • Redundancy implies inconsistencies and conflicts, since sources can provide different values for the same attribute of a given object • This is modeled by the error function in the abstract generative process • A concrete example: • on April 21st, 2009, the opening trade value for the Sun Microsystems Inc. stock quote published by three distinct finance web sites was 9.17, 9.15 and 9.15 • Which one is correct? (probability distribution) • What is the accuracy of the sources? • … is there any source that's copying values?
Data Quality and Source Accuracy • Probabilistic models to evaluate the accuracy of web data • NAIVE (voting) • ACCU [Yin et al, TKDE08; Wu&Marian, WebDB07; Galland et al, WSDM10] (voting + source accuracy) • DEP [Dong et al, PVLDB09] (voting + source accuracy + copiers) • M-DEP [Blanco et al, Caise10; Dong et al, PVLDB10] (voting + source accuracy + copiers over more attributes)
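To contrast the first two models, a minimal sketch: NAIVE picks the majority value, while an ACCU-style fixpoint alternates between accuracy-weighted voting and re-estimating each source's accuracy (the copier detection of DEP/M-DEP is omitted). The function names, the uniform prior, and the fixed iteration count are assumptions.

```python
# NAIVE voting vs. an ACCU-style iterative model for conflicting values.
from collections import Counter, defaultdict

def naive_vote(claims):
    """claims: {source: value} for one attribute of one object."""
    value, _ = Counter(claims.values()).most_common(1)[0]
    return value

def accu_vote(observations, rounds=10):
    """observations: {object: {source: value}}; returns estimated truths."""
    sources = {s for claims in observations.values() for s in claims}
    accuracy = {s: 0.8 for s in sources}             # uniform prior (assumed)
    truths = {}
    for _ in range(rounds):
        for obj, claims in observations.items():     # accuracy-weighted vote
            scores = defaultdict(float)
            for src, val in claims.items():
                scores[val] += accuracy[src]
            truths[obj] = max(scores, key=scores.get)
        for src in sources:                          # re-estimate accuracies
            votes = [(obj, claims[src])
                     for obj, claims in observations.items() if src in claims]
            if votes:
                correct = sum(1 for obj, val in votes if truths[obj] == val)
                accuracy[src] = correct / len(votes)
    return truths

# e.g. the Sun Microsystems case: three sites report the opening value
truth = naive_vote({"site1": 9.17, "site2": 9.15, "site3": 9.15})  # -> 9.15
```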
Conclusion • Data do not match by chance • Unexpected attributes discovered • Tolerant to noise (financial data are challenging) • Other projects are exploiting data redundancy (e.g. Nguyen et al VLDB11, Rastogi et al VLDB10, Gupta-Sarawagi WIDM11) • We also plan to leverage schema knowledge • The approach applies to domains where instances are replicated over multiple sites (e.g. not suitable for real estate)