1.
Canadian Community Health Survey
Cycle 1.1
Overview of methodological issues and more...
2. Presentation Outline Sample Design
Target population, sample allocation and frames
Sampling strategies, oversampling of sub-populations
Data collection, response rates
Imputation
Weighting
Sampling error
Sampling variability guidelines
Variance estimation: Bootstrap re-sampling technique
CV look-up tables
Analysis
Examples
How to use the Bootvar programs
3. CCHS - Cycle 1.1 Health Region-level survey Main objective
Produce timely cross-sectional estimates for 136 health regions
Target population
individuals living in private occupied dwellings aged 12 years old or over
Exclusions: those living on Indian Reserves and Crown Lands, residents of institutions, full-time members of the Canadian Armed Forces and residents of some remote areas
CCHS 1.1 covers ~98% of the Canadian population
4. CCHS - Sample Allocation to Provinces
Prov    Pop Size    # of HRs    1st Step (500/HR)    2nd Step (X-prop)    Total Sample
NFLD       551K         6            *2,780                1,230              4,010
PEI        135K         2             1,000                1,000              2,000
NS         909K         6             3,000                2,040              5,040
NB         738K         7             3,500                1,650              5,150
QUE      7,139K        16             8,000               16,280             24,280
ONT     10,714K        37            18,500               23,760             42,260
MAN      1,114K        11             5,500                2,500              8,000
SASK       990K        11            *5,400                2,320              7,720
ALB      2,697K        17            *8,150                6,050             14,200
BC       3,725K        20            10,000                8,090             18,090
CAN     29,000K       133            65,830               64,920            130,750
* The sampling fraction in some small HRs was capped at 1 in 20 households
5. CCHS - Sample Allocation to Health Regions
Size       Pop. Size Range      # of HRs    Mean Sample Size
Small      less than 75,000        41             525
Medium     75,000 - 240,000        60             900
Large      240,000 - 640,000       25           1,500
X-Large    640,000 and more         7           2,500
6. CCHS - Sample Allocation to Territories
Territory    Population    Sample
Yukon          25,000        850
NWT            36,000        900
Nunavut        22,000        800
7. CCHS - Sample Frame CCHS sample selected from three frames:
Area frame (Labour Force Survey structure)
RDD frame of telephone numbers (Random Digit Dialling)
List frame of telephone numbers
Three frames are needed for CCHS for the following reasons:
1. To yield the desired sample sizes in all health regions
2. Have a telephone data collection structure in place to quickly address provincial/regional requests for buy-in sample and/or content at any point in time
3. Optimize collection costs
8. Area frame - Sampling of households
83% of CCHS sampled households
Multistage stratified cluster sample design
9. RDD frame of telephone numbers - Sampling of households
Elimination of non-working banks method
7% of CCHS sampled households
Telephone bank: area code + first 5 digits of a 7-digit phone #
1- Keep the banks with at least one valid phone #
2- Group the banks to encompass as closely as possible the health region areas - RDD strata
3- Within each RDD stratum, first select one bank at random and then generate at random one number between 00 and 99
4- Repeat the process until the required number of telephone numbers within the RDD stratum is reached
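The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the bank values, the stratum labels and the function name `sample_rdd_numbers` are hypothetical, the banks passed in are assumed to have already been screened (step 1) and grouped into strata (step 2), and duplicate numbers are simply redrawn.

```python
import random

def sample_rdd_numbers(banks_by_stratum, n_required, seed=None):
    """Draw phone numbers within each RDD stratum.

    banks_by_stratum: stratum label -> list of working banks, each an
        8-digit string (area code + first 5 digits of the 7-digit number).
    n_required: stratum label -> number of phone numbers wanted.
    """
    rng = random.Random(seed)
    sample = {}
    for stratum, banks in banks_by_stratum.items():
        drawn = set()  # assumption: a duplicate number is redrawn
        while len(drawn) < n_required[stratum]:
            bank = rng.choice(banks)      # step 3: pick one bank at random
            suffix = rng.randint(0, 99)   # step 3: random number 00-99
            drawn.add(f"{bank}{suffix:02d}")
        sample[stratum] = sorted(drawn)   # step 4: loop ends at the target size
    return sample

# Hypothetical banks grouped into two RDD strata
banks = {"HR-A": ["61355512", "61355598"], "HR-B": ["41655501"]}
numbers = sample_rdd_numbers(banks, {"HR-A": 5, "HR-B": 3}, seed=1)
```

Each generated number is a full 10-digit string; whether it reaches a household is only known at collection time, which is why the telephone hit rate matters for the sample size.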
10. List frame of telephone numbers - Sampling of households
Simple random sample of telephone numbers
10% of CCHS sampled households
Telephone companies’ billing address files and Telephone Infobase (repository of phone directories)
1- Create a list of phone numbers
2- Stratify the phone numbers by health region using the residential postal codes
3- Select phone numbers at random within a health region
4- Repeat the process until the required number of telephone numbers is reached
11. CCHS - Sampling of persons Area frame
Simple random sample (SRS) of one person aged 12 or older (82% of households)
SRS of two persons aged 12 or older (18%)
RDD / List frames
SRS of one person aged 12 or older
12. CCHS - Sampling of persons
Age group    1996 Census    LFS sample (all persons)    CCHS simulated sample* (only 1 person)
12-19           13.2               13.7                           8.5
20-29           16.4               14.4                          14.3
30-44           30.8               28.7                          29.1
45-64           25.8               28.0                          27.9
65 +            13.8               15.2                          20.2
* averaged distribution over 100 repetitions using the May 99 LFS sample
13. CCHS - Representativity of sub-populations To address users’ needs, two sub-population groups needed larger effective sample sizes:
Youths (12-19 years old)
Decision > Oversample youths by selecting a second person (12-19) in some households based on their composition
Elderly (65 years old and over)
Decision > Do not oversample - let the general sample selection process address the issue by itself
14. Sampling strategy based on household composition
Number of persons    Number of persons aged 20 or over
aged 12-19           0    1    2    3    4    5+
0                    -    A    A    A    A    B
1                    A    A    C    C    C    B
2                    A    C    C    C    C    C
3+                   A    C    C    C    C    C
A: Simple random sample (SRS) of one person aged 12+
B: SRS of two persons aged 12+
C: SRS of one person in the age group 12-19 and SRS of one person 20+
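As a rough illustration, the household composition grid and the three strategies can be written as a lookup followed by a draw. The names `STRATEGY`, `strategy_for` and `select_persons` are hypothetical, and household members are represented only by their ages; persons under 12 are ignored because they are outside the target population.

```python
import random

# Rows: number of persons aged 12-19 (0, 1, 2, 3+)
# Columns: number of persons aged 20 or over (0, 1, 2, 3, 4, 5+)
STRATEGY = [
    [None, "A", "A", "A", "A", "B"],
    ["A",  "A", "C", "C", "C", "B"],
    ["A",  "C", "C", "C", "C", "C"],
    ["A",  "C", "C", "C", "C", "C"],
]

def strategy_for(n_youth, n_adult):
    """Return 'A', 'B' or 'C' (None when nobody is eligible)."""
    return STRATEGY[min(n_youth, 3)][min(n_adult, 5)]

def select_persons(ages, rng=random):
    """Select respondents from a household given the ages of its members."""
    youths = [a for a in ages if 12 <= a <= 19]
    adults = [a for a in ages if a >= 20]
    code = strategy_for(len(youths), len(adults))
    if code == "A":   # SRS of one person aged 12+
        return rng.sample(youths + adults, 1)
    if code == "B":   # SRS of two persons aged 12+
        return rng.sample(youths + adults, 2)
    if code == "C":   # one person aged 12-19 and one person aged 20+
        return [rng.choice(youths), rng.choice(adults)]
    return []         # no eligible person in the household

picked = select_persons([15, 40, 42, 8], random.Random(0))
```

In this example the household has one youth and two adults, so strategy C applies and the youth is selected together with one of the two adults.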
15. CCHS - Sample Distribution after Oversampling
Age group    1996 Census    CCHS simulated sample* (only 1 person)    CCHS simulated sample* (some 2 persons)
12-19           13.2                    8.5                                     14.9
20-29           16.4                   14.3                                     13.1
30-44           30.8                   29.1                                     28.1
45-64           25.8                   27.9                                     26.3
65 +            13.8                   20.2                                     17.6
* averaged distribution over 100 repetitions using the May 99 LFS sample
16. CCHS - Initial data collection plan
12 monthly samples
12 collection months + 1: (09 / 2000 - 08 / 2001) + 09 / 2001
Area frame
CAPI
STC field interviewers
targeted response rate: 90%
anticipated vacancy rate: 13%
RDD / List frames
CATI
STC call centres
targeted response rate: 85%
telephone hit rate: 15-60%
17. CCHS data collection - Observed situation Field interviewers
workload exceeded field staff capacity
Call centres
new collection infrastructure
unequal allocation of work among call centres
Descriptive paper:
« Preventing nonresponse in the Canadian Community Health Survey », Y. Béland, J. Dufour, and M. Hamel. 2001, Hull, Statistics Canada XVIIIth International Symposium.
18. CCHS - Final response rates
Field Call centres Total
NFLD 86.6 89.3 86.8
PEI 87.7 82.6 84.7
NS 88.8 89.3 88.8
NB 88.4 92.4 88.5
QUE 85.7 84.8 85.6
ONT 82.8 79.5 82.0
MAN 90.0 85.0 89.5
SASK 87.0 85.4 86.8
ALB 85.2 84.9 85.1
BC 83.9 86.7 84.7
YUK 79.3 95.6 82.7
NWT 89.6 85.4 89.2
NUN 66.3 34.6 62.5
CAN 85.1 83.1 84.7
20. Modules for proxy and non-proxy Alcohol
Chronic condition
Exposure to second hand smoke
Food insecurity
General health (Q1, Q2 and Q7)
Health care utilization
Health Utility Index (HUI)
Height / Weight (Q2 and Q3)
Injuries
Restriction of activities
Smoking
Tobacco alternatives
Two-week disability
Household composition & housing
Income
Labour force
Socio-demographic characteristics
Administration
Drug use (optional)
Home care (optional)
21. Modules for non-proxy only Alcohol dependence / abuse
Blood pressure check
Breastfeeding
Contacts with mental health professionals
Mammography
Fruit & vegetable consumption
General health (Q3-Q6, Q8-Q10)
Height / Weight (Q4 only)
PAP smear test
PSA test
Physical activities
Patient Satisfaction**
Breast examinations
Breast self examinations
Changes made to improve health
Depression
24. CCHS - Weighting and Estimation
Estimation relates sample back to population
MUST use weights in calculation of estimates to correctly draw conclusions about population of interest
Sampling weight is related to the probability of selecting a person in the sample
Persons are selected with unequal probabilities therefore have varying weights
25. CCHS - Weighting and Estimation Three separate weighting systems:
Area frame design
RDD frame design
List frame design
Several adjustments
non-response (household and person)
seasonal factor
etc...
Integration of the two weighting systems based on design effects and sample sizes ( n / deff )
Calibration using a one-dimensional poststratification adjustment of ten age/sex poststrata within each health region
Variance estimation : bootstrap re-sampling approach
set of 500 bootstrap weights for each individual
26. Weighting & Estimation
27. Weighting & Estimation Initial weight: Inverse of the probability of being selected
28. Weighting & Estimation Household nonresponse: Distribute weight of nonresponding households to responding ones
Using nonresponse classes (such as HR, collection period and rural/urban)
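A minimal sketch of this kind of adjustment, assuming each household record carries a weight, a response flag and a nonresponse class label (the dictionary keys `weight`, `responded` and `cls` are hypothetical): within each class, respondents' weights are inflated by the ratio of the total weight to the respondents' weight.

```python
from collections import defaultdict

def adjust_for_nonresponse(units):
    """Redistribute nonrespondents' weight to respondents within each class."""
    total = defaultdict(float)      # weight of all sampled units, by class
    responding = defaultdict(float) # weight of responding units, by class
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            responding[u["cls"]] += u["weight"]
    adjusted = []
    for u in units:
        if u["responded"]:
            factor = total[u["cls"]] / responding[u["cls"]]
            adjusted.append({**u, "weight": u["weight"] * factor})
    return adjusted  # nonrespondents are dropped

households = [
    {"weight": 100.0, "responded": True,  "cls": "HR1-urban"},
    {"weight": 100.0, "responded": True,  "cls": "HR1-urban"},
    {"weight": 50.0,  "responded": False, "cls": "HR1-urban"},
    {"weight": 80.0,  "responded": True,  "cls": "HR1-rural"},
]
result = adjust_for_nonresponse(households)
```

Here the two urban respondents each go from 100 to 125, so the class total of 250 is preserved; the rural class has no nonresponse and is unchanged.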
29. Weighting & Estimation No phone lines: No coverage of hhlds without a phone line. Weights are “boosted” by a certain rate (specific to each HR)
Rates of “no phone lines” calculated using area frame data
30. Weighting & Estimation # of people in hhld: Convert the hhld-level weight into a person-level weight (multiply by the number of people)
Depends on the # of people selected (1 or 2), and their age
31. Weighting & Estimation Person level nonresponse: Redistribute the weight of selected person who did not respond to the ones who responded
Using classes (age, sex, # person selected, collection period, etc)
32. Weighting & Estimation Multiple phone lines: More phone lines = higher probability of being selected
weight divided by the number of residential phone lines
33. Weighting & Estimation Final weight: Each frame’s final weights are, on their own, representative of the total population. To create a single set of weights, they are combined through “Integration”
34. Weighting & Estimation Integration: Combine the 2 sets of weights into one single set of weights
Based on sample size and design effect of each frame
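One common way to combine two sets of weights on the basis of n / deff is a composite factor, sketched below. This is an assumption for illustration, not necessarily the exact CCHS procedure: the factor name `alpha`, the function names and the numbers are all hypothetical.

```python
def integration_factor(n_area, deff_area, n_tel, deff_tel):
    """Share given to the area frame, based on effective sample sizes n / deff."""
    eff_area = n_area / deff_area
    eff_tel = n_tel / deff_tel
    return eff_area / (eff_area + eff_tel)

def integrate(area_weights, tel_weights, alpha):
    """Scale area-frame weights by alpha and telephone-frame weights by 1 - alpha."""
    return [w * alpha for w in area_weights] + [w * (1 - alpha) for w in tel_weights]

# Hypothetical health region: both frames separately sum to a population of 1,000
alpha = integration_factor(n_area=800, deff_area=2.0, n_tel=200, deff_tel=1.0)
combined = integrate([500.0, 500.0], [1000.0], alpha)
```

With these inputs the effective sample sizes are 400 and 200, so the area frame receives two thirds of the total and the combined weights still sum to 1,000.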
35. Weighting & Estimation Seasonal effect: Adjust weights so that each season contains 25% of the total population
Based on the collection period (Sept-Nov / Dec-Feb / Mar-May / June-Aug)
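The seasonal adjustment can be pictured as rescaling weights so that each of the four collection periods carries a quarter of the total. The sketch below assumes records with hypothetical keys `weight` and `season`, and that all four seasons are present in the data.

```python
def seasonal_adjust(records):
    """Rescale weights so each season sums to 25% of the overall total."""
    total = sum(r["weight"] for r in records)
    by_season = {}
    for r in records:
        by_season[r["season"]] = by_season.get(r["season"], 0.0) + r["weight"]
    target = total / 4  # assumption: exactly four seasons are present
    return [
        {**r, "weight": r["weight"] * target / by_season[r["season"]]}
        for r in records
    ]

sample = [
    {"weight": 100.0, "season": "sept-nov"},
    {"weight": 100.0, "season": "sept-nov"},
    {"weight": 50.0,  "season": "dec-feb"},
    {"weight": 150.0, "season": "mar-may"},
    {"weight": 100.0, "season": "june-aug"},
]
adjusted_sample = seasonal_adjust(sample)
```

The overall total (500) is unchanged, while each season ends up with 125.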
36. Weighting & Estimation Post-stratification: Ensure the sum of weights matches the estimated population projections in each HR, for 10 age-sex groups
12-19, 20-29, 30-44, 45-64 and 65+ crossed with two sexes
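A one-dimensional post-stratification adjustment can be sketched as a ratio adjustment within each post-stratum. The record keys and the population counts below are hypothetical; a post-stratum is identified here by a (health region, age group, sex) tuple.

```python
def poststratify(records, pop_counts):
    """Scale weights so they sum to the population projection in each post-stratum."""
    sums = {}
    for r in records:
        sums[r["post"]] = sums.get(r["post"], 0.0) + r["weight"]
    return [
        {**r, "weight": r["weight"] * pop_counts[r["post"]] / sums[r["post"]]}
        for r in records
    ]

respondents = [
    {"weight": 40.0, "post": ("HR1", "12-19", "M")},
    {"weight": 60.0, "post": ("HR1", "12-19", "M")},
    {"weight": 90.0, "post": ("HR1", "12-19", "F")},
]
projections = {("HR1", "12-19", "M"): 120.0, ("HR1", "12-19", "F"): 99.0}
calibrated = poststratify(respondents, projections)
```

After the adjustment the male post-stratum sums to 120 (weights 48 and 72) and the female post-stratum to 99.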
37. Weighting & Estimation Final CCHS weight: Final weight present on the CCHS master file
38. CCHS - Special Weights For various reasons, many other weights are produced
Quarter 4 special weight
PEI special weight
Share weights (master, Q4 and PEI special)
Link weights (master, Q4 and PEI special)
39. Sampling Error Difference in estimates obtained from a sample as compared to a census
The extent of this error depends on four factors:
sample size
variability of the characteristic of interest
sample design
estimation method
Generally, the sampling error decreases as the size of the sample increases
40. Sampling Error Measures of precision associated to an estimate
Variance
Standard deviation (square root of the variance)
95% confidence interval (estimate ± 1.96 x standard deviation)
Coefficient of variation
Standard deviation of estimate x 100% / estimate itself
CV allows comparison of precision of estimates with different scales
Examples:
24% of population are daily smokers, std dev. = 0.003
> CV=0.003/0.24 x 100%=1.25%
> 95% CI: 0.240 ± 1.96 x 0.003 : {0.234 ; 0.246 }
41. Sampling Variability Guidelines
Type of estimate    CV (%)         Guidelines
Acceptable          0.0 - 16.5     General unrestricted release.
Marginal            16.6 - 33.3    General unrestricted release, but with a warning cautioning users of the high sampling variability. Should be identified by the letter E.
Unacceptable        > 33.3         No release. Should be flagged with the letter F.
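The precision measures and the release guidelines can be combined in a short sketch, using the daily-smoker example from the Sampling Error slide (estimate 0.24, standard deviation 0.003). The function names are hypothetical, and the CV is assumed to be compared to the thresholds after rounding to one decimal.

```python
def cv_percent(estimate, sd):
    """Coefficient of variation, in percent."""
    return sd / estimate * 100

def ci95(estimate, sd):
    """95% confidence interval: estimate +/- 1.96 x standard deviation."""
    half_width = 1.96 * sd
    return estimate - half_width, estimate + half_width

def release_category(cv):
    """Map a CV (in percent) to the sampling variability guideline."""
    if cv <= 16.5:
        return "Acceptable"
    if cv <= 33.3:
        return "Marginal (E)"
    return "Unacceptable (F)"

cv = cv_percent(0.24, 0.003)
low, high = ci95(0.24, 0.003)
print(f"CV = {cv:.2f}%, 95% CI = ({low:.3f} ; {high:.3f}), {release_category(cv)}")
```

For the smoker example the CV is 1.25%, well inside the acceptable range.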
42. Sampling Error Measuring sampling error for complex sample designs:
Simple formulas not available
Most software packages do not incorporate design effect (and weights adjustments) appropriately for calculations
Solution for CCHS: the Bootstrap re-sampling method
43. Bootstrap method Principle:
You want to know how precise your estimate of the number of smokers in Canada is
You could draw 500 entirely new CCHS samples and compare the 500 estimates obtained from them. The variance of these 500 estimates would indicate the precision.
Problem: drawing 500 new CCHS samples is $$$
Solution: Assuming your sample is representative of the population, draw 500 subsamples from it and compute new sampling weights for each subsample.
44. Bootstrap method How CCHS Bootstrap weights are created (the secret is now revealed!!!)
45. Bootstrap Method How are Bootstrap replicates built?
The “real” recipe
1- Subsample clusters (SRS) within a design stratum
2- Apply (initial design) weight
3- Adjust (boost) weight for selection of n-1 among n
4- Apply all standard weight adjustments (nonresponse, integration, share, etc.)
5- Post-stratification to population counts
The bootstrap method intends to mimic the same approach used for the sampling and weighting processes
46. Bootstrap Method Sampling weight versus Bootstrap weights
Sampling weight used to compute the estimate of a parameter (e.g.: number of smokers)
Bootstrap weights used to compute the precision of the estimate (e.g.: the CV of the estimated number of smokers)
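The division of labour between the two kinds of weights can be illustrated as follows. This is a simplified sketch, not the Bootvar code: it assumes the variance is the average squared deviation of the replicate estimates from the full-sample estimate, and it uses three toy replicates instead of 500. All names and numbers are hypothetical.

```python
import math

def weighted_total(values, weights):
    """Weighted total, e.g. the estimated number of smokers."""
    return sum(v * w for v, w in zip(values, weights))

def bootstrap_precision(values, final_weights, bootstrap_weights):
    """Return (estimate, standard deviation, CV in percent).

    bootstrap_weights: one weight vector per bootstrap replicate.
    """
    estimate = weighted_total(values, final_weights)  # uses the sampling weight
    replicates = [weighted_total(values, bw) for bw in bootstrap_weights]
    # Assumption: deviations are taken around the full-sample estimate, divisor B
    variance = sum((r - estimate) ** 2 for r in replicates) / len(replicates)
    sd = math.sqrt(variance)
    return estimate, sd, sd / estimate * 100

smoker = [1, 0, 1]               # 1 = smoker, 0 = non-smoker
final_w = [50.0, 60.0, 70.0]     # sampling weights
boot_w = [[60.0, 50.0, 70.0], [40.0, 70.0, 70.0], [50.0, 60.0, 80.0]]
estimate, sd, cv = bootstrap_precision(smoker, final_w, boot_w)
```

The point estimate (120) comes from the sampling weights only; the bootstrap weights enter solely through the spread of the replicate estimates (130, 110, 130).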
47. Bootstrap Method The process of variance estimation is divided into two phases:
Calculation of bootstrap weights
Need to be produced only once
Done by Statistics Canada methodologists
48. Bootstrap Method Variance estimation using bootstrap weights
Done by anyone - internally or externally
Bootstrap weights files distributed with all CCHS files, except the Public-Use Microdata File (PUMF)
Bootstrap weights are in a separate file (match using IDs)
Not for PUMF because bootstrap weights reveal confidential info
PUMF users must proceed through remote access to get ‘exact’ variances or use the CV look-up tables
49. Bootstrap Method Variance estimation using bootstrap weights
SAS and SPSS (beta) macro programs provided to users (BOOTVAR)
Allow users to perform a few statistical analyses (totals, proportions, differences of proportions and regression analysis)
Fully documented with examples
Bootstrap hands-on workshop
50. How to use the Bootvar program STEP #1
Create your “analytical file”
51. How to use the Bootvar program Statistical analysis
Using the NPHS cycle 3 (1998) cross-sectional dummy data, estimate the number of Ontarians aged 12 or older, by gender, who perceive themselves as being:
- in poor or fair health,
- in good health,
- in very good health,
- in excellent health.
- Compute 95% confidence interval for each point estimate using the Bootvar program.
52. Necessary variables for the analysis
Self-perceived health (GHC8DHDI): 0 = poor, 1 = fair, 2 = good, 3 = very good, 4 = excellent, 9 = not stated
Age (DHC8_AGE): >= 12
Sex (DHC8_SEX): 1 = male, 2 = female
Province (PRC8_CUR): 35 = Ontario
Sampling weight (WT68)
Record identifier for the household (REALUKEY)
Number identifying the person in the household (PERSONID)
53. Basic theoretical notions for estimating a proportion Example of a data file
ID Weight Sex Asthma Asthma_id
A 50 M YES 1
B 60 M NO 0
C 50 M NO 0
D 70 M YES 1
E 50 M NO 0
Proportion with asthma = (WeightA + WeightD) / (WeightA + WeightB + WeightC + WeightD + WeightE) * 100
= (50 + 70) / (50 + 60 + 50 + 70 + 50) * 100 = 120 / 280 * 100 = 43%
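The same calculation can be reproduced in a few lines of Python, using the five records of the example data file. The tuple layout and the function name are choices made for this sketch only.

```python
# (ID, weight, asthma_id) rows copied from the example data file
records = [("A", 50, 1), ("B", 60, 0), ("C", 50, 0), ("D", 70, 1), ("E", 50, 0)]

def weighted_proportion(rows):
    """Weighted percentage of records whose 0/1 indicator equals 1."""
    numerator = sum(weight * flag for _, weight, flag in rows)
    denominator = sum(weight for _, weight, _ in rows)
    return numerator / denominator * 100

proportion = weighted_proportion(records)
print(f"{proportion:.0f}%")  # prints 43%
```

Multiplying each weight by the 0/1 indicator is what makes the numerator pick up only the weights of persons A and D.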
54. Little trick for the statistical analysis
Create your univariate dummy variables:
men = 1 (man), 0 (otherwise)
good = 1 (good health), 0 (otherwise)
Men in good health: mgood = men * good
men    good    mgood
1      0       0
1      1       1
0      0       0
0      1       0
55. Results of the statistical analysis
Self-perceived health of Ontarians aged 12 or older by gender in 1998
                 # (‘000)    95% CI             %       95% CI
Men
- Poor / fair       391      (330 ; 452)         8.4    (7.1 ; 9.8)
- Good            1,106      (1,007 ; 1,204)    23.9    (21.7 ; 26.0)
- Very good       1,764      (1,648 ; 1,880)    38.1    (35.6 ; 40.6)
- Excellent       1,373      (1,268 ; 1,479)    29.6    (27.4 ; 31.9)
Women
- Poor / fair       480      (409 ; 551)         9.9    (8.5 ; 11.4)
- Good            1,258      (1,151 ; 1,364)    26.1    (23.9 ; 28.3)
- Very good       1,846      (1,726 ; 1,965)    38.2    (35.8 ; 40.7)
- Excellent       1,243      (1,138 ; 1,348)    25.8    (23.6 ; 27.9)
56. Why use the Bootstrap method? Other techniques:
Taylor
Need to define a linear equation for each statistic examined
Jackknife
Number of replicates depends on the number of strata (large number of strata makes it impossible to disseminate)
57. Why use the Bootstrap method? BOOTSTRAP
more user-friendly when there is a large number of strata
sets of 500 bootstrap weights can be distributed to data users
Recommended (over the jackknife) for estimating the variance of nonsmooth functions like quantiles, LICO
Official reference:
“Bootstrap Variance Estimation for the National Population Health Survey”, D. Yeo, H. Mantel, and T.-P. Liu. 1999, Baltimore, ASA Conference.
58. CV Look-up Tables Alternative to bootstrap
Approximate
Can only be used for categorical variables, and for estimations of totals and proportions
Available for every health region, province and Canada
Provided with PUMF and Share file for some subpopulations
59. CV Look-up Tables—Example National Population Health Survey - 1996/1997
Approximate Sampling Variability Tables for Ontario Health Area:OTTAWA CARLETON - Selected members
Rows: NUMERATOR OF PERCENTAGE ('000) ; Columns: ESTIMATED PERCENTAGE
('000) 0.1% 1.0% 2.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 50.0% 70.0% 90.0%
1 ******** 48.6 48.4 47.6 46.4 45.0 43.7 42.3 40.9 39.4 37.8 34.5 26.8 15.5
2 ******** 34.4 34.2 33.7 32.8 31.9 30.9 29.9 28.9 27.9 26.8 24.4 18.9 10.9
3 ******** 28.1 27.9 27.5 26.8 26.0 25.2 24.4 23.6 22.7 21.9 19.9 15.5 8.9
4 ******** 24.3 24.2 23.8 23.2 22.5 21.9 21.2 20.4 19.7 18.9 17.3 13.4 7.7
5 ******** 21.7 21.6 21.3 20.7 20.1 19.5 18.9 18.3 17.6 16.9 15.5 12.0 6.9
6 ******** 19.8 19.7 19.4 18.9 18.4 17.8 17.3 16.7 16.1 15.5 14.1 10.9 6.3
7 ******** 18.4 18.3 18.0 17.5 17.0 16.5 16.0 15.5 14.9 14.3 13.1 10.1 5.8
8 **************** 17.1 16.8 16.4 15.9 15.5 15.0 14.5 13.9 13.4 12.2 9.5 5.5
9 **************** 16.1 15.9 15.5 15.0 14.6 14.1 13.6 13.1 12.6 11.5 8.9 5.2
10 **************** 15.3 15.1 14.7 14.2 13.8 13.4 12.9 12.5 12.0 10.9 8.5 4.9
...
...
300 **************************************************************************************** 2.0 1.5 0.9
350 **************************************************************************************** 1.8 1.4 0.8
400 ************************************************************************************************ 1.3 0.8
450 ************************************************************************************************ 1.3 0.7
500 ************************************************************************************************ 1.2 0.7
NOTE: FOR CORRECT USAGE OF THESE TABLES PLEASE REFER TO MICRODATA DOCUMENTATION
60. Another example using the Bootvar program Statistical analysis
Using the NPHS cycle 3 (1998) cross-sectional dummy data, determine whether or not the number of men aged 12 or older who perceive themselves as being in excellent health in Ontario is statistically different (at level α = 5%) from the number of women.
61. Basic theoretical notions for performing a Z-test
M_excel = estimated proportion of men in excellent health
F_excel = estimated proportion of women in excellent health
Hypothesis test: H0: M_excel = F_excel
H1: M_excel ≠ F_excel
At level α = 0.05, we do not reject H0 if | z | <= 1.96
We conclude H1 otherwise.
Z = ( M_excel - F_excel ) / sd (M_excel - F_excel)
We use the section “difference of proportions” of the BOOTVAR program to estimate the standard deviation of the difference between the two estimates.
62. Results
M_excel = 29.64% ; F_excel = 25.75% ; sd(M_excel - F_excel) = 1.62
Z = ( M_excel - F_excel ) / sd (M_excel - F_excel) = (29.64 - 25.75) / 1.62 = 3.89 / 1.62 = 2.40
At the α = 0.05 level, we conclude H1 because | z | = 2.40 > 1.96.
We can then conclude that, among Ontarians aged 12 or older, there is a statistically significant difference between men and women with regard to the characteristic “self-perceived health = excellent”.
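The arithmetic of this test is easy to reproduce. The sketch below takes the two estimated proportions and the standard deviation of their difference as given (in practice the latter would come from the “difference of proportions” section of BOOTVAR); the function name `z_test` is hypothetical.

```python
def z_test(p1, p2, sd_diff, critical=1.96):
    """Two-sided Z-test for a difference between two estimates.

    Returns the z statistic and True when H0 (no difference) is rejected.
    """
    z = (p1 - p2) / sd_diff
    return z, abs(z) > critical

# Values from the Results slide, in percentage points
z, reject_h0 = z_test(29.64, 25.75, 1.62)
print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

The critical value 1.96 corresponds to a two-sided test at the 5% level; a different level would need a different critical value.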
63. CCHS - Data Dissemination Strategy Wide range of users and capacity
136 health regions
13 provincial/territorial Ministries of Health
Health Canada and CIHI
Internal STC analysts
Academics
Others
Data products
Microdata
Analytical products (Health Reports, How Healthy are Canadians, etc…)
Tabular statistics (ePubs, Cansim II, community profiles, etc…)
Client support (head and regional offices, CCHS website, workshops, etc…)
64. CCHS - Access to microdata Master file
all records, all variables
Statistics Canada
university research data centres
remote access
Share / Link files
respondents who agreed to share / link
provincial/territorial Ministries of Health
health regions (through the STC third-party share agreement)
Public Use Microdata File (PUMF)
all records, subset of variables with collapsed response categories
free for 136 health regions
cost recovery for others
65. CCHS - Overview of Cycle 1.2
Produce provincial cross-sectional estimates from a sample of 30,000 respondents
Area frame sample only / one person per household
CAPI only
90-minute in-depth interviews on mental health and well-being based on the WMH2000 questionnaire
Scheduled to begin collection in May 2002
66. CCHS - Future Plans Same two-year cycle approach:
health region level survey starting in January 2003
provincial level survey starting in January 2004
New consultation process with provincial and regional authorities
Flexible sample designs (adaptable to regional needs)
Development of an in-depth nutrition focus content (Cycle 2.2)
67. CCHS Web site
www.statcan.ca/health_surveys
www.statcan.ca/enquetes_santé
68. Contacts in Methodology Yves Béland: yves.beland@statcan.ca
François Brisebois: francois.brisebois@statcan.ca