No CLT – No Problem? Enter the Bootstrap!

No CLT – No Problem? Enter the Bootstrap! John McGready, Department of Biostatistics, Johns Hopkins University http://www.biostat.jhsph.edu/~jmcgread


Goals of Inferential Statistics • Much of what we do in statistics involves trying to talk about true characteristics of a process, using an imperfect subset of information from the process • Population information (what we WANT) • Sample information (what we have)

Medical Expenditures • Suppose we want to study the FY 2005 medical expenditures for the 13,000+ employees of a particular company • However, the benefits administrator will only give us one random sample of 200 employees

Medical Expenditures (thousand $) • Population (true): median = 0.59, mean = 2.3, sd = 5.0 • Sample (n = 200): median = 0.57, mean = 2.0, sd = 4.3

Medical Expenditures • Given the right skew, our first choice for estimating the center of the distribution is to work with the median • We can only estimate the true median using the sample median from our 200 observations

Medical Expenditures • We are interested in how “good a guess” the sample median is of the true median • We would also like to estimate a range of possibilities for the true median (i.e., a confidence interval)

Medical Expenditures • In order to understand how a sample median from 200 observations relates to the true median, let’s call our administrator and see if we can get 1,000 more random samples of size 200 • This way, we can compute 1,000 more sample medians and see how variable they are

Making the Call

The Response No Way!

What to Do Now?? • Well, it seems we are out of luck • Let’s just estimate the mean instead, and use the Central Limit Theorem to estimate a range of possible values for the true mean

Review: Sampling Behavior via the CLT • Standard error (spread) = $\sigma / \sqrt{n}$, the true sd divided by the square root of the sample size

Sampling Behavior via the CLT • Most (95%) of the sample means we could get from samples of 200 would fall between the 2.5th and 97.5th percentiles of this distribution • These percentiles correspond to the true mean +/- 1.96 standard errors

Sampling Behavior via the CLT

Sampling Behavior via the CLT • Rub #1 • If we knew the true mean, we wouldn’t care about possible mean values • However, taking this one step further implies that 95% of the sample means we could get will fall within a known range of the truth

Sampling Behavior via the CLT • Rub #2 • If we only have one sample, we don’t know the true sampling distribution • However, the CLT says it will be approximately normal • We estimate the spread from our sample data, and center the distribution at our sample mean

Sampling Behavior via the CLT • Our sample info • Sample mean: 2.0 (thousand $) • Sample standard deviation: 4.3 (thousand $) • Sample estimate of standard error (spread of sampling distribution): 4.3/√200 ≈ 0.30 (thousand $)

Sampling Behavior via the CLT • True 95% CI • Sample mean +/- 1.96*(true standard error) • (1.3,2.7) • Estimated 95% CI • Sample mean +/- 1.97*(estimated standard error) • (1.4, 2.6)
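The two intervals above can be reproduced from the numbers on the slides. This is a minimal sketch rather than the presenter's own code: the variable names are illustrative, and the 1.97 multiplier is taken as given from the slide (it is close to the t quantile with 199 degrees of freedom).

```python
import math

n = 200              # sample size
sample_mean = 2.0    # thousand $
sample_sd = 4.3      # thousand $
true_sd = 5.0        # thousand $ (known only because this is a teaching example)

def interval(center, se, multiplier):
    """Return center +/- multiplier * se as a (low, high) tuple."""
    half_width = multiplier * se
    return (center - half_width, center + half_width)

true_se = true_sd / math.sqrt(n)    # about 0.35
est_se = sample_sd / math.sqrt(n)   # about 0.30

true_ci = interval(sample_mean, true_se, 1.96)
est_ci = interval(sample_mean, est_se, 1.97)

print(f"True 95% CI:      ({true_ci[0]:.1f}, {true_ci[1]:.1f})")
print(f"Estimated 95% CI: ({est_ci[0]:.1f}, {est_ci[1]:.1f})")
```

Rounded to one decimal, these match the slide: (1.3, 2.7) and (1.4, 2.6).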

Another Approach to Estimating Sampling Distribution • Instead of relying on the CLT, how about we simulate the sampling distribution using just our sample of 200? • Treat our sample as “truth” • Resample multiple times (say 1,000), taking random draws of 200 with replacement

Resampling With Replacement Original sample (n=4): Potential resample of same size: S1 S1 S2 S2 S3 S3 S4
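Drawing with replacement means the same observation can appear more than once in a resample, while others may not appear at all. A minimal sketch using Python's standard library; the labels S1 to S4 follow the slide, and the helper name `resample` is illustrative only.

```python
import random

def resample(sample, rng):
    """Draw len(sample) items from sample, with replacement."""
    return rng.choices(sample, k=len(sample))

rng = random.Random(0)                 # fixed seed so the run is repeatable
original = ["S1", "S2", "S3", "S4"]

for i in range(3):
    print(f"Resample {i + 1}: {resample(original, rng)}")
```

Each resample has the same size as the original sample, which is what keeps the bootstrap statistic comparable to the original one.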

Re-Sampling

Bootstrap Estimate of Sampling Distribution • Take 1,000 resamples • Compute the mean of each re-sample • Plot a distribution of the means
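The three steps above can be sketched as follows. The expenditure data are not available here, so the sample is an assumed stand-in: 200 right-skewed lognormal draws. Only the procedure, not the numbers, mirrors the slides.

```python
import random
import statistics

rng = random.Random(42)

# Assumption: stand-in for the 200 observed expenditures (right-skewed).
sample = [rng.lognormvariate(0.0, 1.5) for _ in range(200)]

def bootstrap_means(data, n_resamples, rng):
    """Mean of each of n_resamples with-replacement resamples of data."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

boot_means = bootstrap_means(sample, 1000, rng)

# Crude text histogram of the bootstrap distribution (10 equal-width bins).
low, high = min(boot_means), max(boot_means)
n_bins = 10
width = (high - low) / n_bins
counts = [0] * n_bins
for m in boot_means:
    idx = min(int((m - low) / width), n_bins - 1)  # clamp the maximum into the last bin
    counts[idx] += 1

for i, c in enumerate(counts):
    print(f"{low + i * width:6.2f} | {'#' * (c // 10)}")
```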

Bootstrap Estimate of Sampling Distribution

Bootstrap 95% CIs • How to get a 95% CI from the bootstrap distribution • Assume normality, but estimate the standard error from the bootstrap distribution (normal bootstrap method) • Pick off the 2.5th and 97.5th percentiles (bootstrap percentile method) • Pick off “adjusted” percentiles (bias-corrected and accelerated, or BCa, method)
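The first two methods are simple enough to sketch with the standard library. This is an illustration under assumptions: the data are synthetic lognormal draws standing in for the real sample, and the function names are hypothetical. BCa is left out because its percentile adjustment needs extra bias and acceleration estimates.

```python
import random
import statistics

rng = random.Random(7)

# Assumption: synthetic right-skewed stand-in for the 200 observations.
sample = [rng.lognormvariate(0.0, 1.5) for _ in range(200)]

def bootstrap_distribution(data, stat, n_resamples, rng):
    """Statistic computed on each with-replacement resample."""
    n = len(data)
    return [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]

def normal_ci(estimate, boot, z=1.96):
    """Normal bootstrap: estimate +/- z * (sd of the bootstrap distribution)."""
    se = statistics.stdev(boot)
    return (estimate - z * se, estimate + z * se)

def percentile_ci(boot):
    """Percentile bootstrap: 2.5th and 97.5th percentiles of the distribution."""
    cuts = statistics.quantiles(boot, n=40)  # 39 cut points, spaced 2.5% apart
    return (cuts[0], cuts[-1])

boot = bootstrap_distribution(sample, statistics.fmean, 1000, rng)
estimate = statistics.fmean(sample)

norm_lo, norm_hi = normal_ci(estimate, boot)
pct_lo, pct_hi = percentile_ci(boot)
print(f"Bootstrap normal:     {norm_lo:.2f} - {norm_hi:.2f}")
print(f"Bootstrap percentile: {pct_lo:.2f} - {pct_hi:.2f}")
```

The normal interval is symmetric around the sample estimate by construction, while the percentile interval can be asymmetric when the bootstrap distribution is skewed.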

95% CIs • True Mean 2.3

Method                 95% CI
CLT Estimate           1.40 - 2.60
Bootstrap Normal       1.39 - 2.60
Bootstrap Percentile   1.41 - 2.58
BCa                    1.47 - 2.68

We Could Do with 10,000 Resamples

Bootstrap 95% CIs: Mean • Empirical Coverage Probabilities (1)

Method                 1K resamps   10K resamps
CLT Estimate (2)       93.4% (no resampling)
Bootstrap Normal (2)   93.2%        92.5%
Bootstrap Percentile   92.4%        91.6%
BCa                    92.3%        93.4%

(1) To be thorough, should also look at average width
(2) Some intervals could contain illegal (negative) values
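An empirical coverage probability is the share of repeated samples whose interval contains the true value. The sketch below shows the idea for the percentile method only. It assumes a synthetic lognormal population in place of the real expenditures, and uses smaller settings than the slides (samples of 100, 200 resamples, 100 trials) so that it runs quickly. Its output is not expected to match the table.

```python
import random
import statistics

rng = random.Random(1)

# Assumption: hypothetical right-skewed population of 13,000 values.
population = [rng.lognormvariate(0.0, 1.0) for _ in range(13000)]
true_mean = statistics.fmean(population)

def percentile_ci(data, n_resamples, rng):
    """Bootstrap percentile 95% CI for the mean of data."""
    n = len(data)
    boot = [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]
    cuts = statistics.quantiles(boot, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]

n_trials = 100
hits = 0
for _ in range(n_trials):
    sample = rng.sample(population, 100)      # one random sample, without replacement
    lo, hi = percentile_ci(sample, 200, rng)
    if lo <= true_mean <= hi:
        hits += 1

coverage = hits / n_trials
print(f"Empirical coverage of the percentile CI: {coverage:.1%}")
```

As the slide's footnote notes, coverage alone is not the whole story. Recording `hi - lo` in the same loop would give the average interval width.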

What’s The Big Deal? • Why not just use CLT? • For many statistics, we do not have a CLT (or good CLT) based approach • Median • Ratio of mean to sd • Correlation coefficients
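Because the bootstrap only needs a function that maps a sample to a number, the same resampling loop covers statistics that lack a convenient CLT-based formula. A minimal sketch for the median and the mean-to-sd ratio, again on assumed synthetic data with illustrative function names. A correlation coefficient would follow the same pattern but would resample (x, y) pairs together.

```python
import random
import statistics

def percentile_ci(data, stat, n_resamples=1000, seed=0):
    """Bootstrap percentile 95% CI for any statistic stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    boot = [stat(rng.choices(data, k=n)) for _ in range(n_resamples)]
    cuts = statistics.quantiles(boot, n=40)  # first and last cuts: 2.5th, 97.5th
    return cuts[0], cuts[-1]

def mean_to_sd(xs):
    """Ratio of the sample mean to the sample standard deviation."""
    return statistics.fmean(xs) / statistics.stdev(xs)

data_rng = random.Random(3)
# Assumption: synthetic right-skewed stand-in for the 200 observations.
sample = [data_rng.lognormvariate(0.0, 1.5) for _ in range(200)]

median_ci = percentile_ci(sample, statistics.median)
ratio_ci = percentile_ci(sample, mean_to_sd)

print(f"Median:     {statistics.median(sample):.2f}, 95% CI {median_ci[0]:.2f} - {median_ci[1]:.2f}")
print(f"Mean-to-sd: {mean_to_sd(sample):.2f}, 95% CI {ratio_ci[0]:.2f} - {ratio_ci[1]:.2f}")
```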

Getting a 95% CI for A Median

95% CIs For Median • True Median 0.59

Method                 95% CI (1,000 Reps)
CLT Estimate           NA
Bootstrap Normal       0.44 - 0.71
Bootstrap Percentile   0.39 - 0.68
BCa                    0.39 - 0.68

Bootstrap 95% CIs: Median • Empirical Coverage Probabilities (1)

Method                 1K resamps   10K resamps
Bootstrap Normal (2)   94.1%        94.4%
Bootstrap Percentile   93.9%        95.0%
BCa                    94.0%        95.2%

(1) To be thorough, should also look at average width
(2) Some intervals could contain illegal (negative) values

Wrap Up • Pros/Cons of bootstrap • Theoretical Justification
