21st Century Statistics The Case against Data Editing Jean-Pierre Kent Ljubljana, 9-11 May 2011
Data editing? An art of the past …
Why? • Competition • Loss of monopoly • Quantity • Exponential growth • Quality • Changing criteria
Competition: Google • GPI: Google Price Index • Automatic, real-time, internet-based, free • Protocol Buffers • An alternative to XML • DSPL • An alternative to SDMX • Google Chart Tools • Graphical and interactive presentation of data • Google Public Data • An alternative to NSI web sites
Competition: Google “Basically, our goal is to organize the world’s information and to make it universally accessible and useful.” Google
Another alternative price index: Numbeo.com Numbeo • provides prices to website readers for free • allows people to estimate their own expenses • uses the wisdom of the crowd to obtain data that are as reliable as possible • provides a system for systematic research into the cost of living and property markets • provides a system for other systematic economic research on a huge dataset with worldwide data In real time! And for free!
Competition: conclusion Because our competitors offer free and up-to-date statistics, we need to avoid time-consuming and labour-intensive activities as much as possible. For example: data editing.
Quantity: Size of the Web • In 2009: 500 petabytes • In 2011: 1800 exabytes • Expected in 2020: 50 zettabytes Source: http://www.lesk.com/mlesk/ksg97/ksg.html (in 1997… …and today)
Quantity • 1800 exabytes • If 99.9% is video, photo, audio, text and nonsense, “only” 0.1% is interesting. • This is 1800 petabytes of relevant data.
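The slide's arithmetic can be checked in a few lines of Python (a minimal sketch, assuming decimal prefixes: 1 exabyte = 1000 petabytes):

```python
# Back-of-the-envelope check of the slide's figures.
web_size_exabytes = 1800        # estimated size of the web in 2011 (from the slide)
relevant_fraction = 0.001       # the 0.1% assumed to be "interesting"

relevant_exabytes = web_size_exabytes * relevant_fraction
relevant_petabytes = relevant_exabytes * 1000   # 1 exabyte = 1000 petabytes

print(relevant_petabytes)       # 0.1% of 1800 exabytes is indeed 1800 petabytes
```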
Quantity: 1800 petabytes • Can we afford to ignore this mass? • Primary and register data: a few terabytes • Is this representative of available data? • Can we afford to edit such an amount of data? • Tip: it is growing exponentially!
Quality • Why do we edit data? • Quality • But: • What are the quality criteria? • Who specifies them? • How do we approach the quality of very large data sets?
Quality • This has already happened to: • Furniture (1950’s) • Watches (1970’s) • Publishing (1990’s) • Many others (1800-2000) • This will also happen to: • Statistics (2010’s) • From Best … • Quality specified by producer • High cost • Long time to market • … to Good Enough • Compromise between quality and cost • User in control of quality and cost
Quality: Impact of data editing How does data editing affect quality criteria? • Cost: negative • Time to market: negative • Authenticity: negative • Variance/reliability: negative
Quality of Internet data • These data are produced by relevant processes • These processes depend on the quality of these data • Therefore these data are relevant, representative and good enough… • … and don’t require editing.
Quality of very large data sets … … does not depend on the quality of individual records: • Quality of a photo ≠ quality of its individual pixels • Quality of music ≠ quality of individual sound samples • Knowledge of crowd behaviour ≠ knowledge of individual behaviour • Why should this be different for statistical data?
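The statistical version of this claim can be illustrated with a short simulation (a hypothetical sketch with made-up numbers, not data from the presentation): every individual record carries large measurement error, yet the aggregate estimate is close to the truth because the errors average out.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical setup: the true population value is 100, but each
# individual record is observed with heavy noise (std. dev. 50).
true_value = 100.0
n_records = 100_000
records = [true_value + random.gauss(0, 50) for _ in range(n_records)]

# Despite every single record being unreliable, the mean of a large
# data set is accurate: the standard error is 50 / sqrt(100000) ≈ 0.16.
estimate = sum(records) / n_records
print(round(estimate, 2))  # close to 100.0
```

This is the same effect as the photo and music analogies above: the quality of the aggregate comes from the size of the collection, not from editing individual records.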
Think about it! “The idea that you can take incremental steps in the media business is over. You have to take some big steps and you have to take some risks.” Dave Hunke, USA Today