No CLT – No Problem? Enter the Bootstrap! John McGready Department of Biostatistics Johns Hopkins University http://www.biostat.jhsph.edu/~jmcgread
Goals of Inferential Statistics • Much of what we do in statistics involves trying to talk about true characteristics of a process, using an imperfect subset of information from the process • Population information (what we WANT) • Sample information (what we have)
Medical Expenditures • Suppose we want to study the FY 2005 medical expenditures for 13,000+ employees in a particular company • However, the benefits administrator will only give us one random sample of 200 employees
Medical Expenditures (thousand $) • Population: median = 0.59, mean = 2.3, sd = 5.0 • Sample (n = 200): median = 0.57, mean = 2.0, sd = 4.3
Medical Expenditures • Given the right skew, our first choice for estimating the center of the distribution is to work with the median • We can only estimate the true median using the sample median from our 200 observations
Medical Expenditures • We are interested in how “good a guess” the sample median is of the true median • We would also like to estimate a range of possibilities for the true median (i.e., a confidence interval)
Medical Expenditures • In order to understand how a sample median from 200 observations relates to the true median, let’s call our administrator and see if we can get 1,000 more random samples of size 200 • This way, we can compute 1,000 more sample medians and see how variable they are
The Response No Way!
What to Do Now?? • Well, it seems we are out of luck • Let’s just estimate the mean instead, and use the Central Limit Theorem to estimate a range of possible values for the true mean
Review: Sampling Behavior via the CLT • Standard error (spread) = σ/√n = 5.0/√200 ≈ 0.35 (thousand $)
Sampling Behavior via the CLT • Most (95%) of the sample means we could get from samples of 200 would fall between the 2.5th and 97.5th percentiles of this distribution • These percentiles correspond to the true mean +/- 1.96 standard errors
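The claim on this slide can be checked by simulation. The sketch below is an illustration only: it builds a hypothetical right-skewed population (lognormal, chosen by assumption, not the expenditure data from the slides), draws many samples of 200, and counts how many sample means land within 1.96 standard errors of the true mean.

```python
import math
import random
import statistics

rng = random.Random(0)

# Hypothetical stand-in for the 13,000 employees' expenditures (thousand $):
# a right-skewed lognormal population, NOT the data from the slides.
population = [rng.lognormvariate(0.0, 1.0) for _ in range(13_000)]
true_mean = statistics.fmean(population)
true_sd = statistics.pstdev(population)

n = 200
standard_error = true_sd / math.sqrt(n)

# What the administrator refused to provide: many random samples of size 200.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(2_000)]

lower = true_mean - 1.96 * standard_error
upper = true_mean + 1.96 * standard_error
share_inside = sum(lower <= m <= upper for m in sample_means) / len(sample_means)

print(f"true mean = {true_mean:.2f}, standard error = {standard_error:.3f}")
print(f"share of sample means within 1.96 SE: {share_inside:.3f}")
```

The printed share should come out close to 0.95. The exact figure depends on the assumed population and the seed.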
Sampling Behavior via the CLT • Rub #1 • If we knew the true mean, we wouldn’t care about possible mean values • However, taking this one step further implies that 95% of the sample means we could get will fall within a known range (1.96 standard errors) of the truth
Sampling Behavior via the CLT • Rub #2 • If we only have one sample, we don’t know the true sampling distribution • However, the CLT says it will be approximately normal • We estimate the spread from our sample data, and center the distribution at our sample mean
Sampling Behavior via the CLT • Our sample info • Sample mean: 2.0 (thousand $) • Sample standard deviation: 4.3 (thousand $) • Sample estimate of standard error (spread of sampling distribution): 4.3/√200 ≈ 0.30 (thousand $)
Sampling Behavior via the CLT • True 95% CI • Sample mean +/- 1.96*(true standard error) • (1.3, 2.7) • Estimated 95% CI • Sample mean +/- 1.97*(estimated standard error) • (1.4, 2.6)
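The two intervals on this slide can be reproduced from the numbers already given (sample mean 2.0, sample sd 4.3, true sd 5.0, n = 200). The helper name `normal_ci` is made up for this sketch. The multipliers 1.96 and 1.97 are taken from the slide as given; 1.97 is close to the t quantile for 199 degrees of freedom, which may be where it comes from.

```python
import math

def normal_ci(mean, sd, n, multiplier):
    """Return (lower, upper) for mean +/- multiplier * sd / sqrt(n)."""
    half_width = multiplier * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

n = 200
sample_mean, sample_sd, true_sd = 2.0, 4.3, 5.0  # thousand $, from the slides

true_se_ci = normal_ci(sample_mean, true_sd, n, 1.96)
estimated_se_ci = normal_ci(sample_mean, sample_sd, n, 1.97)

print([round(x, 1) for x in true_se_ci])       # [1.3, 2.7]
print([round(x, 1) for x in estimated_se_ci])  # [1.4, 2.6]
```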
Another Approach to Estimating the Sampling Distribution • Instead of relying on the CLT, how about we simulate the sampling distribution using just our sample of 200? • Treat our sample as “truth” • Resample multiple times (say 1,000), each time taking 200 random draws with replacement
Resampling With Replacement • Original sample (n=4): S1, S2, S3, S4 • Potential resample of same size: four draws from S1 to S4 with replacement, so an observation can appear more than once or not at all
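A minimal sketch of one such resample, using the labels S1 to S4 from the slide. The seed is arbitrary, and the particular resample printed is just one of many possible draws.

```python
import random

rng = random.Random(1)

original = ["S1", "S2", "S3", "S4"]

# Draw 4 labels with replacement: every draw picks from all four originals,
# so repeats and omissions are both possible.
resample = rng.choices(original, k=len(original))
print(resample)
```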
Bootstrap Estimate of Sampling Distribution • Take 1,000 resamples • Compute the mean of each resample • Plot the distribution of the means
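The three steps above can be sketched as follows. The sample is a hypothetical skewed one (lognormal, by assumption) standing in for the 200 expenditures, `bootstrap_distribution` is a name invented for this example, and a crude text histogram stands in for the plot.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical skewed sample of 200 (thousand $); NOT the slides' data.
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]

def bootstrap_distribution(data, stat, n_resamples, rng):
    """Recompute `stat` on n_resamples resamples drawn with replacement."""
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]

boot_means = bootstrap_distribution(sample, statistics.fmean, 1_000, rng)

print(f"sample mean             = {statistics.fmean(sample):.3f}")
print(f"mean of bootstrap means = {statistics.fmean(boot_means):.3f}")
print(f"sd of bootstrap means   = {statistics.stdev(boot_means):.3f}")

# Crude text histogram of the bootstrap means (10 equal-width bins).
lo, hi = min(boot_means), max(boot_means)
bins = [0] * 10
for m in boot_means:
    bins[min(int((m - lo) / (hi - lo) * 10), 9)] += 1
for i, count in enumerate(bins):
    print(f"{lo + i * (hi - lo) / 10:6.2f} | {'#' * (count // 10)}")
```

The sd of the bootstrap means plays the role of the estimated standard error from the CLT slides.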
Bootstrap 95% CIs • How to get a 95% CI from the bootstrap distribution • Assume normality, but estimate the standard error from the bootstrap distribution (normal bootstrap method) • Pick off the 2.5th and 97.5th percentiles (bootstrap percentile method) • Pick off “adjusted” percentiles (bias-corrected and accelerated, BCa, method)
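A rough sketch of the three methods side by side, using only the Python standard library. It makes several assumptions: the data are a hypothetical lognormal sample rather than the slides' data, percentiles use a simple nearest-rank rule, and the BCa part follows the usual textbook recipe (bias term from the share of bootstrap values below the estimate, acceleration from jackknife values). The slides do not say which software or variant produced their intervals.

```python
import random
import statistics
from statistics import NormalDist

def percentile(sorted_values, p):
    """Value at proportion p (0..1) of an ascending list, nearest rank."""
    idx = round(p * (len(sorted_values) - 1))
    return sorted_values[min(max(idx, 0), len(sorted_values) - 1)]

def bootstrap_cis(data, stat, n_resamples=1_000, level=0.95, seed=0):
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    z = NormalDist().inv_cdf(1 - alpha)

    # Normal method: normal shape, bootstrap sd as the standard error.
    se = statistics.stdev(boot)
    normal = (theta_hat - z * se, theta_hat + z * se)

    # Percentile method: read the interval off the bootstrap distribution.
    pct = (percentile(boot, alpha), percentile(boot, 1 - alpha))

    # BCa: shift the percentiles using a bias term z0 and an acceleration a.
    below = sum(b < theta_hat for b in boot) / n_resamples
    z0 = NormalDist().inv_cdf(min(max(below, 1e-6), 1 - 1e-6))
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]  # leave-one-out
    jack_mean = statistics.fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(zq):
        return NormalDist().cdf(z0 + (z0 + zq) / (1 - a * (z0 + zq)))

    bca = (percentile(boot, adjusted(-z)), percentile(boot, adjusted(z)))
    return {"normal": normal, "percentile": pct, "BCa": bca}

# Hypothetical skewed sample (thousand $); NOT the slides' data.
rng = random.Random(3)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
cis = bootstrap_cis(sample, statistics.fmean)
for name, (lo, hi) in cis.items():
    print(f"{name:10s} {lo:.2f} - {hi:.2f}")
```

Only the normal method forces a symmetric interval around the estimate. The percentile and BCa intervals can follow the skew of the bootstrap distribution.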
95% CIs • True Mean 2.3

Method                 95% CI
CLT Estimate           1.40 - 2.60
Bootstrap Normal       1.39 - 2.60
Bootstrap Percentile   1.41 - 2.58
BCa                    1.47 - 2.68
Bootstrap 95% CIs: Mean • Empirical Coverage Probabilities [1]

Method                 1K resamps   10K resamps
CLT Estimate [2]       93.4%
Bootstrap Normal [2]   93.2%        92.5%
Bootstrap Percentile   92.4%        91.6%
BCa                    92.3%        93.4%

[1] To be thorough, should also look at average width
[2] Some intervals could contain illegal (negative) values
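An empirical coverage probability is the share of repeated samples whose interval contains the true value. The sketch below estimates it for the percentile method only. The lognormal population is assumed, and the run is deliberately small (200 repeated samples, 300 resamples each) to keep it fast, so its result will not match the table.

```python
import random
import statistics

rng = random.Random(4)

# Hypothetical right-skewed population (thousand $); NOT the slides' data.
population = [rng.lognormvariate(0.0, 1.0) for _ in range(13_000)]
true_mean = statistics.fmean(population)

def percentile_ci(data, stat, n_resamples, rng, level=0.95):
    """Bootstrap percentile interval for stat(data), nearest-rank percentiles."""
    n = len(data)
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    alpha = (1 - level) / 2
    return (boot[round(alpha * (n_resamples - 1))],
            boot[round((1 - alpha) * (n_resamples - 1))])

n_experiments, n_resamples, n = 200, 300, 200
hits = 0
for _ in range(n_experiments):
    sample = rng.sample(population, n)  # a fresh "one sample of 200"
    lo, hi = percentile_ci(sample, statistics.fmean, n_resamples, rng)
    hits += lo <= true_mean <= hi       # did this interval catch the truth?

coverage = hits / n_experiments
print(f"empirical coverage of the 95% percentile interval: {coverage:.1%}")
```

Recording `hi - lo` in the same loop would give the average width that footnote [1] asks for.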
What’s The Big Deal? • Why not just use the CLT? • For many statistics, we do not have a CLT-based (or a good CLT-based) approach • Median • Ratio of mean to sd • Correlation coefficients
95% CIs For Median • True Median 0.59

Method                 95% CI (1,000 Reps)
CLT Estimate           NA
Bootstrap Normal       0.44 - 0.71
Bootstrap Percentile   0.39 - 0.68
BCa                    0.39 - 0.68
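The same resampling recipe carries over to the median by swapping the statistic. This sketch again uses a hypothetical lognormal sample rather than the slides' data, and computes the bootstrap normal and bootstrap percentile intervals with 1,000 resamples.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(5)

# Hypothetical skewed sample of 200 (thousand $); NOT the slides' data.
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
sample_median = statistics.median(sample)

n_resamples = 1_000
boot_medians = sorted(
    statistics.median(rng.choices(sample, k=len(sample)))
    for _ in range(n_resamples)
)

# Bootstrap normal: sample median +/- z * (sd of the bootstrap medians).
z = NormalDist().inv_cdf(0.975)
se = statistics.stdev(boot_medians)
normal_ci = (sample_median - z * se, sample_median + z * se)

# Bootstrap percentile: 2.5th and 97.5th percentiles of the bootstrap medians.
percentile_ci = (
    boot_medians[round(0.025 * (n_resamples - 1))],
    boot_medians[round(0.975 * (n_resamples - 1))],
)

print(f"sample median        {sample_median:.2f}")
print(f"bootstrap normal     {normal_ci[0]:.2f} - {normal_ci[1]:.2f}")
print(f"bootstrap percentile {percentile_ci[0]:.2f} - {percentile_ci[1]:.2f}")
```

The percentile endpoints are themselves medians of resamples, so they always lie within the range of the observed data. The normal interval has no such guarantee.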
Bootstrap 95% CIs: Median • Empirical Coverage Probabilities [1]

Method                 1K resamps   10K resamps
Bootstrap Normal [2]   94.1%        94.4%
Bootstrap Percentile   93.9%        95.0%
BCa                    94.0%        95.2%

[1] To be thorough, should also look at average width
[2] Some intervals could contain illegal (negative) values
Wrap Up • Pros/Cons of bootstrap • Theoretical Justification