
Misinterpretation of Data and the Importance of Metadata

This article discusses the importance of metadata in accurately interpreting data. It includes examples of crime rates, census data, and sample surveys, and highlights the need for good metadata to understand data sources and methodologies.


Presentation Transcript


  1. Misinterpretation of Data and the Importance of Metadata Bernie Gloyn ACCOLEDS - December 8, 2004

  2. Outline • Crime rates example from Wendy • Metadata • Some considerations by data types • Census • Sample Survey • Administrative • Comparisons • Crude vs standardized

  3. Crime Rates Example • Ebert & Roeper review of the Michael Wilson movie “Michael Moore Hates America” • Ebert doubted the claim that the Canadian crime rate is 2X the US rate • Moorelies.com | News: Whoa; Stuart Didn't See That One Coming • Ebert conceded to the writer that the stats supported the claim - figures on right • Comparison of stats from the STC and US Bureau of Justice Statistics websites

  4. Crime Rates Example • Debunked by Craig from Canada • Simplistic comparison • Similar category titles for violent and property crimes but different definitions • Concluded violent crime is 2-3 times higher in the US, property crime rates are close • Bureau of Justice Statistics Crime & Justice Data Online • Canadian Statistics - Crimes by type of offence

  5. Crime Rates Example

  6. Metadata • STC Policy on Informing Users of Data Quality • In place since 1978 • Tightened up in 2000 in response to the 1999 AG report • Looked at 4 surveys: LFS, CPI, MSM & UCRS • Recognised “All statistics are to some extent estimates” • To be used with awareness of strengths and weaknesses – “fitness for use” • Key tool is the Integrated Meta Database, where you can see definitions, data sources and methods • Repository of info on STC surveys and programs

  7. Metadata • Can’t over emphasize importance of good metadata, finding it and reading it • Definitions, Data Sources and Methods • Status and Description of survey • Information about the survey • Data sources and methodology • Data Accuracy • Subjects and keywords • Documentation • Statistics Canada: Canadian Community Health Survey

  8. Metadata • Online Catalogue (OLC) • Canadian Community Health Survey: public use microdata file: Product main page • DLI website • DLI - Canadian Community Health Survey Cycle 1.1 • DLI listserv • Ask and we will find out from the Division

  9. Metadata • With Public Use Microdata Files, the code book is very important • Gives the questions asked and the codes used for responses • “Missing values”, “refusals”, “don’t know” and “not applicable” are often assigned numeric codes • The numeric codes used are not consistent • To most software these numeric codes look like valid responses

  10. Metadata • In the 1990 Health Promotion Survey there was a series of questions about alcohol consumption. First they asked if the respondent EVER drank alcohol; if YES, whether they drank within the last 12 months; and if YES, the number of drinks on each day of the past 7 days. The code book showed number of drinks per day as:
      81 F4MON 2 0096-0097 HOW MANY DRINKS DID YOU HAVE ON: MONDAY ?
         00     NONE                   4651/ 7334907
         01:40  NUMBER OF DRINKS       1403/ 2585080
         41     MORE THAN 40 DRINKS       1/     106
         98     QUESTION NOT ASKED     7648/10567910
         99     NOT STATED               89/  155377
         NOTE: F4MON-SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2
      82 F4TUE 2 0098-0099 HOW MANY DRINKS DID YOU HAVE ON: TUESDAY ?
         00     NONE                   4608/ 7306101
         01:40  NUMBER OF DRINKS       1447/ 2613991
         98     QUESTION NOT ASKED     7648/10567910
         99     NOT STATED               89/  155377
         NOTE: F4MON-SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2
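The risk described in the code book slides can be sketched in a few lines of Python. This is only an illustration: the codes 98, 99 and 41 come from the F4MON/F4TUE entries above, but the function name `mean_drinks` and the sample responses are hypothetical and are not taken from the survey file.

```python
# Reserved codes from the F4MON/F4TUE code book entries.
NOT_ASKED = 98
NOT_STATED = 99
TOP_CODE = 41  # "MORE THAN 40 DRINKS": a lower bound, not an exact count


def mean_drinks(values):
    """Mean number of drinks using only exact counts (codes 00-40).

    Excluding the top code 41 is a choice made for this sketch;
    an analyst may prefer to handle it differently.
    """
    valid = [v for v in values if 0 <= v <= 40]
    return sum(valid) / len(valid) if valid else None


# Hypothetical responses for illustration only.
sample = [0, 2, NOT_ASKED, 1, NOT_STATED, NOT_ASKED, 3]

naive = sum(sample) / len(sample)  # wrongly treats 98 and 99 as drink counts
clean = mean_drinks(sample)        # drops the reserved codes first

print(f"naive mean: {naive:.1f}, mean of valid responses: {clean:.1f}")
```

With these made-up responses the naive mean is dominated by the 98 and 99 codes, which is exactly the mistake that reading the code book prevents.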

  11. Some Considerations by Data Type • Census • Short form - 9 questions are 100% • Long form – 20% sample • Sample Survey • Most data sets – LFS, GSS, NPHS, etc • Administrative • GST, Revenue Canada, Vital Stats, school enrollments, provincial health insurance, …

  12. Census • High quality but • Non sampling errors • coverage, measurement, non response, processing errors • Key documents are the Census Handbook, Census Dictionary and Census Technical reports • Communiqué for revisions • Population and dwelling count amendments • Don’t change the Census base

  13. Census • Conceptual/definition changes over time can be very important • Census family • Refers to a married couple (with or without children of either or both spouses), …. … A couple living common-law may be of opposite or same sex. “Children” in a census family include grandchildren living with their grandparent(s) but with no parents present • census family, 2001 census • Economic family • Refers to a group of two or more persons who live in the same dwelling and are related to each other by blood, marriage, common-law or adoption. • economic family, 2001 census

  14. Sample Surveys • Estimates • Estimate of the population characteristics based on a sample from a survey frame • Bigger sample gives better estimates • Issue of sample size • 30,000 sample • Want a subpopulation – retirees ~ 3000, males ~1400, immigrants ~ 200, BC ~ 40 • Estimates become unstable as you break down the sample • People often forget that an estimate has a confidence interval • 73% with a 10% CI is not significantly different from 80%
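A rough Python sketch of why estimates become unstable as the sample is broken down. It assumes simple random sampling and the normal approximation for a proportion, which real survey designs do not follow exactly, and it reads the slide's "10% CI" as plus or minus 10 percentage points. The function names are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from n respondents.

    Assumes simple random sampling; complex survey designs need design-based
    variance estimates instead.
    """
    return z * math.sqrt(p * (1 - p) / n)


def within_interval(estimate, half_width, benchmark):
    """True if the benchmark lies inside estimate +/- half_width."""
    return abs(benchmark - estimate) <= half_width


# Subsample sizes from the slide.
sizes = {"full sample": 30000, "retirees": 3000, "males": 1400,
         "immigrants": 200, "BC": 40}

for label, n in sizes.items():
    moe = margin_of_error(0.73, n)
    print(f"{label:12s} n={n:6d}  73% +/- {100 * moe:.1f} points")

# 80% falls inside 73% +/- 10 points, so the two are not distinguishable.
print(within_interval(73, 10, 80))
```

Under these assumptions the margin of error grows from about half a point for the full sample to well over ten points for the smallest subgroup.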

  15. Sample Surveys • Statistical measures of quality • Coefficient of Variation (CV) • Gives the Standard Deviation as a % of the Mean • Measure of the fitness for use • The smaller the CV, the more reliable the estimate • CVs < or = 15% are generally considered reliable for most uses • CVs > 15% but < 33% are reliable for some purposes with “caution” • CVs > 33% are unreliable and not published
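The CV thresholds on this slide can be expressed as a small Python helper. This is a sketch, not an official rule set: the function names are hypothetical, and the slide does not say how a CV of exactly 33% is treated, so placing it in the "caution" band is an assumption.

```python
def cv_percent(std_dev, mean):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return 100.0 * std_dev / mean


def reliability(cv):
    """Classify a CV (in percent) using the thresholds quoted on the slide."""
    if cv <= 15:
        return "reliable"
    if cv <= 33:  # exactly 33% is not covered by the slide; assumed "caution"
        return "use with caution"
    return "unreliable"


# Hypothetical estimate: mean 10 with a standard deviation of 2.
cv = cv_percent(2, 10)
print(cv, reliability(cv))
```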

  16. Data Quality Symbols

  17. Sample Surveys • Sample values are weighted up to represent the population • 20% sample for census • Simple weight is 5; more complex weights are adjusted for characteristics, response rates, etc • Example from Mike • Another Health survey • Analyst confused the body weight asked in the survey with the survey weight • Used body weight as the survey weight • Survey weight was around 400 • … number of obese Canadians!!
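A minimal Python sketch of weighting up, and of the mix-up in the example. The simple weight is the inverse of the sampling fraction, which gives 5 for a 20% sample. The records and field names below are hypothetical; only the survey weight of around 400 comes from the slide.

```python
def simple_weight(sampling_fraction):
    """Each sampled unit represents 1 / sampling_fraction units in the population."""
    return 1 / sampling_fraction


def weighted_count(rows, weight_field, flag_field):
    """Population estimate: sum of weights over rows where the flag is true."""
    return sum(r[weight_field] for r in rows if r[flag_field])


# Hypothetical respondents for illustration only.
records = [
    {"survey_weight": 400.0, "body_weight_kg": 95.0, "obese": True},
    {"survey_weight": 410.0, "body_weight_kg": 70.0, "obese": False},
    {"survey_weight": 390.0, "body_weight_kg": 102.0, "obese": True},
]

correct = weighted_count(records, "survey_weight", "obese")
wrong = weighted_count(records, "body_weight_kg", "obese")  # the analyst's mistake

print(simple_weight(0.2), correct, wrong)
```

Using the wrong field gives a plausible-looking number, so nothing in the software flags the error; only the documentation says which variable is the survey weight.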

  18. Sample Surveys • Changes in the frame used for the sample • Annual Survey of Manufacturers (ASM) moved to the Business Register (ref yr. 2000) • 25,000 incorporated firms were missing from survey coverage before • Accounts for 5 points (1/3) of the 15% increase from 1999 to 2000 • ASM also changed survey coverage • Included 35,000 incorporated firms below $30,000 annual sales • Accounts for 2 points of the 15% increase from 1999 to 2000 • Almost half of the 15% annual increase came from coverage improvements • Manufacturing industries of Canada, national and provincial areas: Product main page
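The arithmetic behind "almost half" can be checked directly. The sketch below reads the slide's 5% and 2% as percentage points of the 15% increase, which is what the "(1/3)" implies; the variable names are hypothetical.

```python
total_increase = 15.0   # reported 1999 to 2000 increase, in percent
frame_change = 5.0      # points attributed to the move to the Business Register
coverage_change = 2.0   # points attributed to including small firms

coverage_share = (frame_change + coverage_change) / total_increase
remaining = total_increase - frame_change - coverage_change

print(f"share from coverage improvements: {coverage_share:.0%}")
print(f"increase left after removing coverage effects: {remaining:.0f} points")
```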

  19. Administrative Data • Original purpose for which the data was collected • Provincial Health counts differ from Census • Definitions used aren’t the same • Success rate is higher for students at some universities (mostly in QC) • Students can deregister 4 weeks into a course; elsewhere it is 3 to 4 days • Coverage of the universe (total population) • Not everyone reports income tax • Administrative changes can affect data series

  20. Administrative Data • Provide small area estimates • Normally postal code geography • Postal code can be problematic • Highest income neighbourhood example

  21. Crude vs Standardised • Comparisons between countries

  22. Crude vs Standardised

  23. Crude vs Standardised

  24. Crude vs Standardised • Another mortality comparison but over time • 1951: 2.83 per 1000 from heart disease • 1993: 1.93 per 1000 from heart disease • Improvement from advances: -0.9? • Change due to progress: -2.19 • Change due to aging: +1.29
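A short Python sketch of the idea behind this slide. The first part checks that the slide's two components add up to the crude change. The second part shows direct standardisation on a two-group population; those age groups, rates and population counts are hypothetical and chosen only to illustrate the method, and the function name is an assumption.

```python
# Figures from the slide (deaths per 1000 from heart disease).
crude_1951, crude_1993 = 2.83, 1.93
due_to_progress, due_to_aging = -2.19, 1.29

crude_change = round(crude_1993 - crude_1951, 2)
decomposed = round(due_to_progress + due_to_aging, 2)
print(crude_change, decomposed)


def rate_per_1000(age_rates, population):
    """Overall rate: age-specific rates weighted by the given age structure."""
    total = sum(population.values())
    return sum(age_rates[g] * population[g] for g in age_rates) / total


# Hypothetical age-specific rates (per 1000) and populations (thousands).
rates_early = {"under 65": 1.0, "65+": 20.0}
rates_late = {"under 65": 0.5, "65+": 12.0}
pop_early = {"under 65": 900, "65+": 100}
pop_late = {"under 65": 850, "65+": 150}

crude_early = rate_per_1000(rates_early, pop_early)
crude_late = rate_per_1000(rates_late, pop_late)
# Standardised: later rates applied to the earlier age structure.
standardised_late = rate_per_1000(rates_late, pop_early)

print(crude_early, crude_late, standardised_late)
```

In the hypothetical data the standardised rate falls further than the crude rate, because the crude comparison mixes real progress with the effect of an older population.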

  25. Thank you!
