A “Quick and Dirty” Website Data Quality Indicator
Irit Askira Gelman, University of Arizona
Anthony L. Barletta, University of Arizona
Information quality on the web: DEBKAfile
• An Israeli, Jerusalem-based website (www.debka.co.il) with commentary and analyses on terrorism, intelligence, security, and military and political affairs in the Middle East
• According to DEBKAfile, over 1,000,000 viewers a week
• Forbes' Best of the Web award: “Debkafile has been ahead of the pack often enough to suggest that the reporting is good.”
• However, Forbes decries the fact that “most of the information is attributed to unidentified sources”
• Has been criticized as a fringe outfit catering to conspiracy theorists. Some claim that the site relies on sources with an agenda, and that Israeli intelligence does not consider even 10% of the content reliable
• The site's operators have claimed that 80% turns out to be true
WICOW 2008
Information quality on the web: DEBKAfile
• Content quality: highly current and up to date, but with deficiencies in:
  • Accuracy
  • Source reliability
  • Objectivity
• Representation quality:
  • Spelling errors and various typos
  • Very long sentences
  • Grammatical errors
  • ...
Critical observation
• Information quality deficiencies are often not isolated
• Poor information quality control?
Website information quality assessment: Our approach (I)
• Look for an easy-to-measure data quality facet
• Use it as an indicator of aggregate data quality
Website information quality assessment: Our approach (I)
• Focus on spelling errors as an indicator of aggregate data quality
• Hypothesis 1: The spelling error rate of a document set is negatively related to the aggregate data quality of the set (a higher error rate goes with lower aggregate quality)
Related questions (I)
• To what extent is a lower aggregate quality detected by the spelling error rate?
• To what extent does a higher spelling error rate indicate a lower aggregate quality?
• Are there significant variations across different settings?
Our approach (II): A “quick and dirty” indicator
• Instead of an exhaustive spelling error check, focus on a minimal set of spelling errors, carefully chosen to fit the target document population
• Use the hit count feature of a common search engine (e.g., Google) to assess the rate of the chosen spelling errors in the target population
A “quick and dirty” indicator: Initial implementation
• 10 common English spelling errors, selected from the autocorrect word list of MS Office
• Targets broad document populations
• Uses Google’s hit count
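To make the hit-count step concrete, here is a minimal sketch of how paired queries (misspelling vs. correct spelling) might be built for one target population. The word pairs are illustrative only and are not the authors' set of 10; restricting the population with the `site:` search operator and the helper names `build_query` and `build_queries` are assumptions made for this example.

```python
# Illustrative (misspelling, correct spelling) pairs.
# Assumption: these are NOT the 10 errors used in the study.
SPELLING_PAIRS = [
    ("recieve", "receive"),
    ("occured", "occurred"),
    ("seperate", "separate"),
]


def build_query(word: str, site: str) -> str:
    """Quoted-word query limited to one site or domain (hypothetical helper)."""
    return f'"{word}" site:{site}'


def build_queries(pairs, site):
    """Return one (error_query, correct_query) tuple per spelling pair."""
    return [(build_query(err, site), build_query(ok, site)) for err, ok in pairs]


if __name__ == "__main__":
    for error_query, correct_query in build_queries(SPELLING_PAIRS, "gov"):
        print(error_query, "|", correct_query)
```

Each query string would be submitted to the search engine and only its reported hit count recorded; how the counts are retrieved is left open here.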
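A “quick and dirty” indicator: Initial implementation
• Indicator defined by: [formula not recoverable from the source]
• Notation (symbols partly lost): one symbol, indexed by j = 1, ..., 10, denotes the jth spelling error; a second denotes the correct spelling that matches it; d denotes the document set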
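The indicator's formula did not survive in this transcript, so the following is only a sketch of one plausible form, not the authors' definition. It assumes that, for each pair j, the share of hits using the misspelling is computed from the two hit counts on document set d, and that the shares are averaged over all pairs. The function names and the sample counts are made up for illustration.

```python
from typing import Sequence, Tuple


def error_share(error_hits: int, correct_hits: int) -> float:
    """Share of hits that use the misspelling rather than the correct form."""
    total = error_hits + correct_hits
    if total == 0:
        # Assumption: with no hits at all, treat the share as zero.
        return 0.0
    return error_hits / total


def indicator(hit_pairs: Sequence[Tuple[int, int]]) -> float:
    """Mean error share over (error_hits, correct_hits) pairs for one document set."""
    if not hit_pairs:
        raise ValueError("need at least one spelling pair")
    return sum(error_share(e, c) for e, c in hit_pairs) / len(hit_pairs)


if __name__ == "__main__":
    # Made-up hit counts for three spelling pairs on one document set.
    made_up_counts = [(10, 990), (5, 995), (0, 1000)]
    print(indicator(made_up_counts))
```

Under this assumed form, a larger value means a larger share of misspelled occurrences in the document set.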
Website information quality assessment: Our approach (II)
• Hypothesis 2: The proposed indicator is negatively related to the aggregate data quality of the document set (a higher indicator value goes with lower aggregate quality)
Related questions (II)
• To what extent is a lower aggregate quality detected by this indicator?
• … (see Related questions (I))
• Spelling error set:
  • What spelling errors to include?
  • How many?
• Hit count:
  • Is it reliable?
  • How valid is it in measuring error rates?
Initial tests & results
• We have conducted initial tests of Hypothesis 1, Hypothesis 2, and the related questions
• Askira Gelman, I. and Barletta, A.L. Initial Study of a “Quick and Dirty” Website Data Quality Index, ICIQ 2008
Initial tests & results
• To what extent does a higher spelling error rate indicate a lower aggregate quality?
  • Positive initial results on large websites and web domains (.gov sites, university sites, Wikipedia, and more)
• Spelling error set: size can be increased; select carefully to avoid the lack of context sensitivity of the search engine
• Hit count: for higher reliability, conduct a series of measurements and remove outliers
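The advice on hit counts can be sketched as follows. The slides do not say how outliers are identified, so this example assumes a median-based rule: keep measurements within k median absolute deviations of the median, then average the rest. The function name `robust_hit_count`, the threshold `k`, and the sample values are all hypothetical.

```python
from statistics import mean, median


def robust_hit_count(samples, k: float = 3.0) -> float:
    """Average repeated hit-count measurements after dropping outliers.

    Assumption: an outlier is any value more than k median absolute
    deviations (MAD) away from the median of the series.
    """
    if not samples:
        raise ValueError("need at least one measurement")
    med = median(samples)
    mad = median([abs(x - med) for x in samples])
    if mad == 0:
        # At least half the values equal the median; use the median itself.
        return float(med)
    kept = [x for x in samples if abs(x - med) <= k * mad]
    return float(mean(kept))


if __name__ == "__main__":
    # Made-up series of hit counts for one query; 5000 is an obvious outlier.
    print(robust_hit_count([1000, 1010, 990, 1005, 5000]))
```

The cleaned count for each query would then replace the single raw hit count when the error rate is computed.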