
A “Quick and Dirty” Website Data Quality Indicator

Irit Askira Gelman, University of Arizona; Anthony L. Barletta, University of Arizona.


Presentation Transcript


  1. A “Quick and Dirty” Website Data Quality Indicator • Irit Askira Gelman, University of Arizona • Anthony L. Barletta, University of Arizona

  2. Information quality on the web: DEBKAfile • An Israeli, Jerusalem-based website (www.debka.co.il) with commentary and analyses on terrorism, intelligence, security, and military and political affairs in the Middle East • According to DEBKAfile, over 1,000,000 viewers a week • Forbes' Best of The Web award: “Debkafile has been ahead of the pack often enough to suggest that the reporting is good.” • However, Forbes decries the fact that "most of the information is attributed to unidentified sources" • Has been criticized as a fringe outfit catering to conspiracy theorists. Some claim that the site relies on sources with an agenda, and that Israeli intelligence does not consider even 10% of the content reliable • The site's operators have claimed that 80% turns out to be true WICOW 2008

  3. Information quality on the web: DEBKAfile (repeated)

  4. Information quality on the web: DEBKAfile • Content quality: highly current and up-to-date, but with deficiencies in accuracy, source reliability, and objectivity • Representation quality: spelling errors and various typos, very long sentences, grammatical errors, ...

  5. Critical observation • Information quality deficiencies are often not isolated • Poor information quality control?

  6. Website information quality assessment: Our approach (I) • Look for an easy-to-measure data quality facet • Use it as an indicator of aggregate data quality

  7. Website information quality assessment: Our approach (I) • Focus on spelling errors as an indicator of aggregate data quality • Hypothesis 1: The spelling error rate of a document set is positively related to the aggregate data quality of the set

  8. Related questions (I) • To what extent is a lower aggregate quality detected by the spelling error rate? • To what extent does a higher spelling error rate indicate a lower aggregate quality? • Are there significant variations across different settings?

  9. Our approach (II): A “quick and dirty” indicator • Instead of an exhaustive spelling error check, focus on a minimal set of spelling errors, carefully chosen to fit the target document population • Use the hit count feature of a common search engine (e.g., Google) to assess the rate of the chosen spelling errors in the target population

  10. A “quick and dirty” indicator: Initial implementation • 10 common English spelling errors selected from the autocorrect word list of MS Office • Target broad document populations • Use Google’s hit count

  11. A “quick and dirty” indicator: Initial implementation

  12. A “quick and dirty” indicator: Initial implementation • Indicator defined by: [formula not preserved in the transcript] • Notation (symbols partly lost): for j=1,..,10, one symbol denotes the jth spelling error and another denotes the correct spelling that matches it; d denotes the document set

  13. Website information quality assessment: Our approach (II) • Hypothesis 2: The proposed indicator is positively related to the aggregate data quality of the document set

  14. Related questions (II) • To what extent is a lower aggregate quality detected by this indicator? • … (see Questions I) • Spelling error set: which spelling errors to include? How many? • Hit count: is it reliable? How valid is it in measuring error rates?

  15. Initial tests & results • We have conducted initial tests of Hypothesis 1, Hypothesis 2, and the related questions • Askira Gelman, I. and Barletta, A.L. Initial Study of a “Quick and Dirty” Website Data Quality Index, ICIQ 2008

  16. Initial tests & results • To what extent does a higher spelling error rate indicate a lower aggregate quality? • Positive initial results on large websites and web domains (.gov sites, university sites, Wikipedia, and more) • Spelling error set: size can be increased; select errors carefully to work around the search engine's lack of context sensitivity • Hit count: for higher reliability, conduct a series of measurements and remove outliers
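
The hit-count indicator and the advice to repeat measurements and remove outliers can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the transcript does not preserve the indicator's formula, so the mean share of misspelled hits per word pair is an assumed stand-in; the MAD-based outlier rule, the threshold `k=3`, the function names, and all hit counts below are hypothetical. Hit counts are supplied by the caller, so no search engine API is assumed.

```python
from statistics import mean, median


def robust_hit_count(readings, k=3.0):
    """Average repeated hit-count readings after dropping outliers.

    A reading is dropped when it lies more than k median absolute
    deviations (MAD) from the median. The MAD rule and k=3 are
    assumptions; the slides only say to remove outliers.
    """
    if not readings:
        raise ValueError("need at least one reading")
    center = median(readings)
    mad = median(abs(r - center) for r in readings)
    if mad == 0:
        # At least half of the readings equal the median; use it directly.
        return float(center)
    kept = [r for r in readings if abs(r - center) <= k * mad]
    return mean(kept)


def misspelling_indicator(pairs):
    """Mean share of misspelled hits over (error_hits, correct_hits) pairs.

    ASSUMED formula: the slide's own definition is not recoverable from
    the transcript. Pairs with no hits at all are skipped.
    """
    rates = [err / (err + ok) for err, ok in pairs if err + ok > 0]
    if not rates:
        raise ValueError("no word pair has any hits")
    return mean(rates)


if __name__ == "__main__":
    # Made-up readings for one misspelling and its correct spelling,
    # as if the same query had been repeated five times.
    error_hits = robust_hit_count([1000, 1020, 980, 1010, 50000])
    correct_hits = robust_hit_count([99000, 101000, 100000, 100500, 99500])
    print(misspelling_indicator([(error_hits, correct_hits)]))
```

Averaging per-pair shares, rather than pooling all hits, keeps one very frequent word from dominating the indicator; whether that matches the original definition is not known from the slides.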
