Characterizing the Uncertainty of Web Data: Models and Experiences
Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre, Dipartimento di Informatica ed Automazione
{blanco,crescenz,merialdo,papotti}@dia.uniroma3.it
The Web as a Source of Information
Opportunities:
• a huge amount of information is publicly available
• valuable data repositories can be built by aggregating information spread over many sources
• abundance of redundancy for data of many domains
[Blanco et al., WebDB 2010; demo @ WWW 2011]
Limitations:
• sources are inaccurate, uncertain, and unreliable
• some sources reproduce the content published by others
Data conflicts: what is HRBN's max price? 20.64, 20.49, 20.88, …
Popularity-Based Rankings
Several ranking methods exist for web sources (e.g., Google PageRank, Alexa Traffic Rank), mainly based on the popularity of the sources. However, several factors can compromise the quality of data even when it is extracted from authoritative sources:
• errors in the editorial process
• errors in the publishing process
• errors in the data extraction process
Problem Definition
A set of sources (possibly with copiers) provides values of several attributes for a common set of objects.
We want to compute automatically:
• a score of accuracy for each web source
• the probability distribution for each value
[Figure: four sources w1–w4, one of which is a copier, with errors in bold; what are score(w1), …, score(w4)?]
State of the Art
Probabilistic models to evaluate the accuracy of web data sources (i.e., algorithms to reconcile data from inaccurate sources):
• NAIVE (voting)
• ACCU [Yin et al., TKDE 2008; Wu & Marian, WebDB 2007; Galland et al., WSDM 2010]
• DEP [Dong et al., PVLDB 2009]
• M-DEP [Blanco et al., CAiSE 2010; Dong et al., PVLDB 2010]
Goals
The goal of our work is twofold:
• illustrate the state-of-the-art models
• compare the results of these models on the same real-world datasets
NAIVE
Assumes independent sources and considers a single attribute at a time: count the votes for each possible value and pick the most voted one. Sometimes it works, sometimes it does not!
[Example: three sources report a player's height; 381 gets 2 votes, 380 gets 1 vote.]
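A minimal sketch of the voting step, assuming the observations for each object are collected as a plain list of reported values (all names here are illustrative, not the authors' implementation):

```python
# NAIVE voting: for each object, take the value reported by the most sources.
from collections import Counter

def naive_vote(observations):
    # observations: {object_id: [value reported by each source]}
    return {obj: Counter(values).most_common(1)[0][0]
            for obj, values in observations.items()}

# The slide's example: three sources report a height of 381, 381, and 380.
print(naive_vote({"height": [381, 381, 380]}))  # -> {'height': 381}
```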
Limitations of the NAIVE Model
Real sources can exhibit different accuracies, yet every source is considered equivalent, independently of its authority and accuracy. More accurate sources should weigh more than inaccurate ones.
ACCU: a Model Considering the Accuracy of the Sources
The vote of a source is weighted according to its accuracy with respect to that attribute.
• Main intuition: it is difficult for sources to agree on errors! Consensus on (many) true values allows the algorithm to compute accuracy, so source accuracy discovery and truth discovery reinforce each other.
[Example: three sources with accuracies 3/3, 1/3, and 1/3 reporting the values 542 and 45.]
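A hedged sketch of the ACCU loop under this intuition: votes are weighted by the current accuracy estimates, and accuracies are then re-estimated against the consensus. The data layout and the uniform prior are assumptions of this sketch, not the published algorithm:

```python
from collections import defaultdict

def weighted_vote(reports, accuracy):
    # reports: {source: {object_id: value}}; accuracy: {source: float}
    scores = defaultdict(lambda: defaultdict(float))
    for s, values in reports.items():
        for obj, v in values.items():
            scores[obj][v] += accuracy[s]          # vote weighted by accuracy
    return {obj: max(cand, key=cand.get) for obj, cand in scores.items()}

def estimate_accuracy(reports, truth):
    # Fraction of a source's values that agree with the current consensus.
    return {s: sum(v == truth[obj] for obj, v in values.items()) / len(values)
            for s, values in reports.items()}

def accu(reports, iterations=10):
    accuracy = {s: 0.8 for s in reports}           # assumed uniform prior
    for _ in range(iterations):                    # alternate the two steps
        truth = weighted_vote(reports, accuracy)
        accuracy = estimate_accuracy(reports, truth)
    return truth, accuracy
```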
Limitations of the ACCU Model
Misleading majorities might be formed by copiers.
[Example: two independent sources with accuracies 3/3 and 2/3, plus a copier with accuracy 1/3 that copies 2/3 of its tuples: both values (380 and 381) get 3/3 as weighted vote.]
• Copiers have to be detected to neutralize the “copied” portion of their votes.
A Generative Model of Copiers
[Figure: Source 1 and Source 2 independently produce objects from the truth, introducing errors e1 and e2; Source 3 is a copier whose copied objects propagate both e1 and e2.]
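The generative model can be simulated in a few lines; the parameters (per-source accuracy and the copying probability c) are the knobs of the model, and the helper names are invented for this sketch:

```python
import random

def independent_source(truth, accuracy, domain):
    # Report the true value with probability `accuracy`, otherwise a wrong one.
    return {obj: v if random.random() < accuracy
                 else random.choice([x for x in domain if x != v])
            for obj, v in truth.items()}

def copier(original, truth, c, accuracy, domain):
    # Copy each object from `original` with probability c (propagating its
    # errors); otherwise produce the value independently.
    own = independent_source(truth, accuracy, domain)
    return {obj: original[obj] if random.random() < c else own[obj]
            for obj in truth}
```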
DEP: A Model to Consider Source Dependencies
• Main intuition: copiers can be detected as they propagate false values (i.e., errors). Only the “portion” of a copier's opinion that is independent is counted.
[Example: a source copies 2/3 of its tuples; with source accuracies 3/3, 2/3, and 1/3, value 380 gets 3/3 as independent weighted vote, while 381 gets 2/3 × 3/3 + 1/3 × 1/3 = 7/9.]
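A sketch of the resulting voting rule, where each vote is further discounted by the source's estimated independence (1.0 for non-copiers, 1/3 for the copier above). With accuracies 3/3, 2/3, and 1/3 it reproduces the slide's arithmetic: 381 scores 2/3 × 1 + 1/3 × 1/3 = 7/9. The names are illustrative:

```python
from collections import defaultdict

def dep_vote(reports, accuracy, independence):
    # reports: {source: {object_id: value}}
    scores = defaultdict(lambda: defaultdict(float))
    for s, values in reports.items():
        for obj, v in values.items():
            # Weight each vote by accuracy *and* by the portion of the
            # source's opinion that is believed to be independent.
            scores[obj][v] += accuracy[s] * independence[s]
    return {obj: max(cand, key=cand.get) for obj, cand in scores.items()}
```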
Contextual Analysis of Truth, Accuracies, and Dependencies
Truth discovery, source accuracy discovery, and dependence detection are not independent steps: each one feeds the others, so the three analyses are carried out together and iterated.
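A skeleton of the iteration, reusing dep_vote and estimate_accuracy from the sketches above and adding a toy dependence detector; the real detection is Bayesian [Dong et al., PVLDB 2009], so this heuristic is only a stand-in:

```python
def detect_dependencies(reports, truth, threshold=0.5):
    # Toy heuristic in the spirit of DEP: a pair of sources that shares many
    # *false* values is suspicious, and the suspected copier's independence
    # is reduced accordingly.
    sources = list(reports)
    independence = {s: 1.0 for s in sources}
    for i, s1 in enumerate(sources):
        for s2 in sources[i + 1:]:
            common = [o for o in reports[s1] if o in reports[s2]]
            shared_errors = [o for o in common
                             if reports[s1][o] == reports[s2][o] != truth[o]]
            if common and len(shared_errors) / len(common) > threshold:
                independence[s2] = 1.0 - len(shared_errors) / len(common)
    return independence

def contextual_analysis(reports, steps=10):
    independence = {s: 1.0 for s in reports}   # start assuming no copiers
    accuracy = {s: 0.8 for s in reports}       # assumed uniform prior
    for _ in range(steps):
        truth = dep_vote(reports, accuracy, independence)     # truth discovery
        accuracy = estimate_accuracy(reports, truth)          # accuracy discovery
        independence = detect_dependencies(reports, truth)    # dependence detection
    return truth, accuracy, independence
```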
M-DEP: Improved Evidence from MULTIATT Analysis
An analysis based only on the Volume attribute would fail in this example: it would recognize w2 as a copier of w1, but it would not detect w4 as a copier of w3. Actually, w1 and w2 are independent sources sharing a common format for volumes; MULTIATT(3) correctly detects the copier.
[Figure: sources w1–w4 against the truth, with errors in bold.]
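One way to read the multi-attribute idea in code, stated as an assumption rather than the authors' exact rule: run the (toy) dependence detector on each attribute separately and flag a copier only when the evidence agrees across several attributes, so that a shared format on a single attribute (as for w1 and w2 on Volume) is not mistaken for copying:

```python
def multiatt_independence(reports_by_attr, truth_by_attr, k=3):
    # reports_by_attr: {attribute: {source: {object_id: value}}}
    votes = {}
    for attr, reports in reports_by_attr.items():
        ind = detect_dependencies(reports, truth_by_attr[attr])
        for s, v in ind.items():
            votes.setdefault(s, []).append(v)
    # Keep full independence unless at least k attributes point to copying
    # (an illustrative rule inspired by MULTIATT(3)).
    return {s: min(vs) if sum(v < 1.0 for v in vs) >= k else 1.0
            for s, vs in votes.items()}
```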
Experiments with Web Data
• Soccer players. Truth: hand-crafted from official pages. Stats: 976 objects and 510 symbols (on average)
• Videogames. Truth: www.esrb.com. Stats: 227 objects and 40 symbols (on average)
• NASDAQ stock quotes. Truth: www.nasdaq.com. Stats: 819 objects and 2902 symbols (on average)
Sample Accuracies of the Sources
Sampled accuracy: the number of true values correctly reported over the number of objects.
The Pearson correlation coefficient between sampled accuracy and popularity shows that data quality and popularity do not overlap.
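The sampled accuracy of a source, as defined above, is straightforward to compute; the illustrative snippet also checks the (lack of) correlation with a popularity score using the standard library's Pearson coefficient:

```python
import statistics

def sampled_accuracy(source_values, truth):
    # Fraction of objects for which the source reports the true value.
    return sum(source_values.get(obj) == v for obj, v in truth.items()) / len(truth)

# Made-up numbers for three sources, only to show the correlation check:
acc = [0.95, 0.80, 0.60]    # sampled accuracies
pop = [0.30, 0.90, 0.70]    # popularity scores (e.g., from a traffic rank)
print(statistics.correlation(acc, pop))   # Pearson r, Python 3.10+
```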
Experiments with Models
Probability Concentration measures the performance in computing probability distributions for the observed objects.
• Low scores for Soccer: no authority on the Web
• Differences in Videogames: number of distinct symbols (5 vs. 75)
• High SA (sampled accuracy) scores in Finance for every model: large number of distinct symbols
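One plausible reading of Probability Concentration, stated here as an assumption rather than the paper's exact definition: the average probability mass a model assigns to the true value of each object, so that higher is better:

```python
def probability_concentration(distributions, truth):
    # distributions: {object_id: {candidate value: probability}}
    return sum(dist.get(truth[obj], 0.0)
               for obj, dist in distributions.items()) / len(distributions)
```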
Lessons Learned
Three dimensions to decide which technique to use:
• Characteristics of the domain: domains where authoritative sources exist are much easier to handle; a large number of distinct symbols helps a lot too.
• Requirements on the results: on average, more complex models return better results, especially for Probability Concentration.
• Execution times: they depend on the number of objects and the number of distinct symbols. NAIVE always scales well.