This article discusses the importance of metadata in accurately interpreting data. It includes examples of crime rates, census data, and sample surveys, and highlights the need for good metadata to understand data sources and methodologies.
Misinterpretation of Data and the Importance of Metadata Bernie Gloyn ACCOLEDS - December 8, 2004
Outline • Crime rates example from Wendy • Metadata • Some considerations by data types • Census • Sample Survey • Administrative • Comparisons • Crude vs standardized
Crime Rates Example • Ebert & Roeper review of Michael Wilson movie “Michael Moore Hates America” • Ebert doubted the claim that the Cdn crime rate is 2X the USA rate • Moorelies.com | News: Whoa; Stuart Didn't See That One Coming • Ebert conceded to the writer that the stats supported the claim - figures on right • Comparison of STC and US Bureau of Justice Statistics website stats
Crime Rates Example • Debunked by Craig from Canada • Simplistic comparison • Similar category titles on violent and property crimes but different definitions • Concluded violent crime is 2-3 times higher in the US, property crime rates are close • Bureau of Justice Statistics Crime & Justice Data Online • Canadian Statistics - Crimes by type of offence
Metadata • STC Policy on Informing Users of Data Quality • In place since 1978 • Tightened up in 2000 in response to the 1999 AG report • AG looked at 4 surveys: LFS, CPI, MSM & UCRS • Recognised “All statistics are to some extent estimates” • To be used with awareness of strengths and weaknesses – “fitness for use” • Key tool is the Integrated Meta Database, where you see definitions, data sources and methods • Repository of info on STC surveys and programs
Metadata • Can’t overemphasize the importance of good metadata, finding it and reading it • Definitions, Data Sources and Methods • Status and Description of survey • Information about the survey • Data sources and methodology • Data Accuracy • Subjects and keywords • Documentation • Statistics Canada: Canadian Community Health Survey
Metadata • Online Catalogue (OLC) • Canadian Community Health Survey: public use microdata file: Product main page • DLI website • DLI - Canadian Community Health Survey Cycle 1.1 • DLI listserv • Ask and we will find out from the Division
Metadata • With Public Use Microdata Files, the code book is very important • Gives questions asked and codes used for responses • “Missing values”, “refusals”, “don’t know” and “not applicable” are often assigned numeric codes • The numeric codes used are not consistent • To most software these numeric codes would seem to be valid responses
Metadata • In the 1990 Health Promotion Survey there was a series of questions about alcohol consumption. First they asked if the respondent EVER drank alcohol; if YES, they asked if the respondent drank within the last 12 months; and if YES again, they asked for the number of drinks on each of the past 7 days. The code book showed number of drinks per day as:

81 F4MON 2 0096-0097 HOW MANY DRINKS DID YOU HAVE ON: MONDAY ?
      00     NONE                   4651/ 7334907
      01:40  NUMBER OF DRINKS       1403/ 2585080
      41     MORE THAN 40 DRINKS       1/      106
      98     QUESTION NOT ASKED     7648/10567910
      99     NOT STATED               89/   155377
      NOTE: F4MON-SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2

82 F4TUE 2 0098-0099 HOW MANY DRINKS DID YOU HAVE ON: TUESDAY ?
      00     NONE                   4608/ 7306101
      01:40  NUMBER OF DRINKS       1447/ 2613991
      98     QUESTION NOT ASKED     7648/10567910
      99     NOT STATED               89/   155377
      NOTE: F4MON-SUN NOT ASKED IF F4=1 OR F1=2 OR F2=2
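The F4MON entry shows why the code book matters: 98 and 99 are reserved codes, not drink counts, yet software will happily average them. A minimal Python sketch of the recoding step follows. The code values 98 and 99 come from the code book entry; the function name `valid_drinks` and the sample responses are hypothetical and are not survey data.

```python
from statistics import mean

# Reserved codes from the F4MON code book entry.
NOT_ASKED = 98
NOT_STATED = 99
MISSING_CODES = {NOT_ASKED, NOT_STATED}


def valid_drinks(raw_values):
    """Keep only real answers (00 to 41); drop the 98/99 reserved codes."""
    return [v for v in raw_values if v not in MISSING_CODES]


# Hypothetical responses for illustration only (not survey data).
raw = [0, 2, 98, 1, 99, 98, 3]

naive_mean = mean(raw)                # treats 98 and 99 as drink counts
clean_mean = mean(valid_drinks(raw))  # uses real answers only

print("naive mean:", naive_mean)
print("clean mean:", clean_mean)
```

With these made-up values the naive mean is dominated by the reserved codes, which is the trap the slide warns about. Code 41 ("more than 40 drinks") is kept here as a valid but top-coded answer.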
Some Considerations by Data Type • Census • Short form - 9 questions asked of 100% • Long form – 20% sample • Sample Survey • Most data sets – LFS, GSS, NPHS, etc • Administrative • GST, Revenue Canada, Vital Stats, school enrollments, provincial health insurance, …
Census • High quality but • Non sampling errors • coverage, measurement, non response, processing errors • Key documents are the Census Handbook, Census Dictionary and Census Technical reports • Communiqué for revisions • Population and dwelling count amendments • Don’t change the Census base
Census • Conceptual/definition changes over time can be very important • Census family • Refers to a married couple (with or without children of either or both spouses), …. … A couple living common-law may be of opposite or same sex. “Children” in a census family include grandchildren living with their grandparent(s) but with no parents present • census family, 2001 census • Economic family • Refers to a group of two or more persons who live in the same dwelling and are related to each other by blood, marriage, common-law or adoption. • economic family, 2001 census
Sample Surveys • Estimates • Estimate of the population characteristics based on a sample from a survey frame • Bigger sample gives better estimates • Issue of sample size • 30,000 sample • Want sub population – retirees ~ 3000, males ~1400, immigrants ~ 200, BC ~ 40 • Estimates become unstable as you break down the sample • Often forget that an estimate has a confidence interval • 73% with a CI of 10% is not significantly different from 80%
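The effect of breaking down the sample can be sketched in Python using the sub-sample sizes on the slide. This is a rough illustration, not the survey's actual variance method: it assumes simple random sampling, a 95% confidence level (z = 1.96) and a worst-case proportion of 0.5, and it reads "CI 10%" as plus or minus 10 percentage points. The function names are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Half-width of the CI for a proportion p, assuming simple random sampling."""
    return z * math.sqrt(p * (1 - p) / n)


def within_interval(estimate, half_width, other):
    """True if `other` lies inside estimate +/- half_width."""
    return abs(other - estimate) <= half_width


# Sub-sample sizes from the slide.
sizes = {
    "full sample": 30000,
    "retirees": 3000,
    "males": 1400,
    "immigrants": 200,
    "BC": 40,
}

for label, n in sizes.items():
    moe = margin_of_error(0.5, n)
    print(f"{label:12s} n={n:6d}  margin of error = +/-{100 * moe:.1f} points")

# 73% with a half-width of 10 points covers 63% to 83%, so 80% is inside it.
print(within_interval(0.73, 0.10, 0.80))
```

Real surveys use complex designs, so published CVs or confidence intervals should take precedence over this approximation.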
Sample Surveys • Statistical measures of quality • Coefficient of Variation (CV) • Gives the standard error of an estimate as a % of the estimate • Measure of the fitness for use • The smaller the CV, the more reliable the estimate • CVs < or = 15% generally considered reliable for most uses • CVs > 15% but < 33% are reliable for some purposes with “caution” • CVs > 33% are unreliable and not published
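The CV bands on this slide translate into a simple lookup, sketched here in Python. The thresholds of 15% and 33% are the ones on the slide. The function names are hypothetical, and treating a CV of exactly 33% as "use with caution" is an assumption, since the slide does not say which band it falls in.

```python
def cv_percent(estimate, standard_error):
    """Coefficient of variation: standard error as a % of the estimate."""
    if estimate == 0:
        raise ValueError("CV is undefined for an estimate of zero")
    return 100.0 * standard_error / abs(estimate)


def reliability(cv):
    """Map a CV (in %) to the quality bands listed on the slide."""
    if cv <= 15:
        return "reliable"
    if cv <= 33:  # assumption: exactly 33% is treated as 'caution'
        return "use with caution"
    return "unreliable"


# Hypothetical estimate and standard error, for illustration only.
cv = cv_percent(estimate=200.0, standard_error=20.0)
print(cv, reliability(cv))
```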
Sample Surveys • Sample values are weighted up to represent the population • 20% sample for census • Simple weight is 5; actual weights are more complex, adjusted for characteristics, response rates, etc • Example from Mike • Another health survey • Analyst confused by the weight and height questions asked in the survey • Used body weight as the survey weight • Survey weight was around 400 • … number of obese Cdns!!
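A short Python sketch of the weighting idea and of the body-weight mix-up described above. The 20% sampling fraction and the survey weight of around 400 come from the slide. The field names, the function names and the three records are hypothetical, made up for illustration.

```python
def design_weight(sampling_fraction):
    """Simple weight: each sampled unit stands for 1 / fraction units."""
    return 1.0 / sampling_fraction


def weighted_count(records, flag, weight_field):
    """Sum the chosen weight over the records where `flag` is true."""
    return sum(r[weight_field] for r in records if r[flag])


# A 20% sample gives a simple weight of 5.
print(design_weight(0.20))

# Hypothetical respondents (not survey data).
records = [
    {"survey_weight": 400.0, "body_weight_kg": 95.0, "obese": True},
    {"survey_weight": 410.0, "body_weight_kg": 70.0, "obese": False},
    {"survey_weight": 390.0, "body_weight_kg": 110.0, "obese": True},
]

correct = weighted_count(records, "obese", "survey_weight")
wrong = weighted_count(records, "obese", "body_weight_kg")

print("estimated obese population (survey weight):", correct)
print("same estimate using body weight by mistake:", wrong)
```

Both columns are numeric and both are called "weight", so nothing in the software flags the error; only the code book tells the two apart.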
Sample Surveys • Changes in frame used for the sample • Annual Survey of Manufacturers moved to the Business Register (ref yr. 2000) • 25,000 incorporated firms were missing from survey coverage before • Accounts for 5 percentage points (1/3) of the 15% increase from 1999 – 2000 • ASM also changed survey coverage • Included 35,000 incorporated firms below $30,000 annual sales • Accounts for 2 percentage points of the 15% increase from 1999 – 2000 • Almost half of the 15% annual increase came from coverage improvements • Manufacturing industries of Canada, national and provincial areas: Product main page
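The "almost half" figure can be checked with a few lines of Python. This sketch reads the slide's 5% and 2% as percentage points of the 15% year-over-year increase, which is an interpretation supported by the "(1/3)" note. The variable names are hypothetical.

```python
# Figures from the slide, read as percentage points of growth.
total_increase = 15.0   # reported 1999 to 2000 increase
frame_change = 5.0      # from the move to the Business Register
coverage_change = 2.0   # from adding firms below $30,000 in annual sales

frame_share = frame_change / total_increase
coverage_share = (frame_change + coverage_change) / total_increase
remaining_growth = total_increase - frame_change - coverage_change

print(f"frame change share of increase: {frame_share:.0%}")
print(f"all coverage effects share:     {coverage_share:.0%}")
print(f"increase left after coverage:   {remaining_growth:.0f} points")
```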
Administrative Data • Original purpose for which the data were collected • Provincial Health counts differ from Census • Definitions used aren’t the same • Success rate higher for students at some universities (mostly in QC) • Students can deregister up to 4 weeks into a course; elsewhere the limit is 3 to 4 days • Coverage of the universe (total population) • Not everyone files an income tax return • Administrative changes can affect data series
Administrative Data • Provide small area estimates • Normally postal code geography • Postal code can be problematic • Highest income neighbourhood example
Crude vs Standardised • Comparisons between countries
Crude vs Standardised • Another mortality comparison but over time • 1951 - 2.83 per 1000 from heart disease • 1993 - 1.93 per 1000 from heart disease • Improvement from advances: 0.9? • Change due to progress: -2.19 • Change due to aging: +1.29 • Net change in the crude rate: -0.90
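The decomposition on this slide can be verified with a short Python sketch. All four figures come from the slide. The variable names are hypothetical, and the sketch only checks that the two components add up to the crude change; it does not reproduce the standardisation that produced them.

```python
# Crude heart disease death rates per 1000, from the slide.
crude_1951 = 2.83
crude_1993 = 1.93

# Components of the change, from the slide.
due_to_progress = -2.19  # change with the age structure held fixed
due_to_aging = 1.29      # change caused by an older population

crude_change = crude_1993 - crude_1951
component_sum = due_to_progress + due_to_aging

print(f"crude change:      {crude_change:+.2f}")
print(f"sum of components: {component_sum:+.2f}")
print(f"crude drop understates progress by {abs(due_to_progress) - abs(crude_change):.2f}")
```

The crude rates suggest an improvement of 0.9 per 1000, while the decline attributed to progress is more than twice that, because population aging pushed the crude rate up over the same period.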