Limits to Statistical Theory: Bootstrap analysis
ESM 206, 11 April 2006
Assumption of t-test
• The standardized sample mean (the t statistic) follows a t distribution
• Guaranteed if observations are normally distributed random variables; approximately true if sample size is very large
• In practice, OK if observations are not too skewed and sample size is reasonably large
• This assumption also applies when using the standard formula for the 95% CI of the mean
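For reference, the standard formula mentioned in the last bullet is usually written as below. The notation is an assumption and does not come from the slides: \bar{x} is the sample mean, s the sample standard deviation, n the sample size, and t_{0.975, n-1} the 97.5th percentile of the t distribution with n - 1 degrees of freedom.

```latex
\bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}
```

The interval is only as trustworthy as the t-distribution assumption behind the multiplier t_{0.975, n-1}.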
Resampling for a confidence interval of the mean
IN AN IDEAL WORLD
• Take sample
• Calculate sample mean
• Take new sample
• Calculate new mean
• Repeat many times
• Look at the distribution of sample means
• 95% CI ranges from 2.5th percentile to 97.5th percentile
IN THE REAL WORLD
• Find some way to simulate taking a sample
• Calculate the sample mean
• Repeat many times
• Look at the distribution of sample means
• 95% CI ranges from 2.5th percentile to 97.5th percentile
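The "ideal world" loop can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the population (a log-normal with log-mean 0 and log-SD 1), the sample size of 50, the seed, and the helper names `percentile_interval` and `ideal_world_ci` are all assumptions made up for the example.

```python
import math
import random
import statistics

def percentile_interval(values, lower=0.025, upper=0.975):
    """Return the sorted values sitting at the lower and upper fractions."""
    ordered = sorted(values)
    n = len(ordered)
    # With n = 999 these are the 25th and 975th sorted values.
    lo_idx = max(0, round(lower * (n + 1)) - 1)
    hi_idx = min(n - 1, round(upper * (n + 1)) - 1)
    return ordered[lo_idx], ordered[hi_idx]

def ideal_world_ci(draw_sample, n_repeats=999):
    """Take a fresh sample n_repeats times and summarize the sample means."""
    means = [statistics.mean(draw_sample()) for _ in range(n_repeats)]
    return percentile_interval(means)

rng = random.Random(206)  # arbitrary seed, for repeatability only

def draw_sample():
    # Hypothetical population: log-normal with log-mean 0 and log-SD 1.
    return [rng.lognormvariate(0.0, 1.0) for _ in range(50)]

low, high = ideal_world_ci(draw_sample)
print(f"95% interval of sample means: [{low:.3f}, {high:.3f}]")
```

In the real world `draw_sample` is not available, which is why the bootstrap replaces it with a simulated sampling step.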
Bootstrap resampling
PARAMETRIC BOOTSTRAP
• Assume data are random variables from a particular distribution
  • E.g., log-normal
• Use data to estimate parameters of the distribution
  • E.g., mean, variance
• Use random number generator to create sample
  • Same size as original
• Calculate sample mean
• Allows us to ask: What if data were a random sample from specified distribution with specified parameters?
NONPARAMETRIC BOOTSTRAP
• Assume underlying distribution from which data come is unknown
• Best estimate of this distribution is the data themselves: the empirical distribution function
• Create a new dataset by sampling with replacement from the data
  • Same size as original
• Calculate sample mean
WHICH IS BETTER?
• If underlying distribution is correctly chosen, parametric has more precision
• If underlying distribution is incorrectly chosen, parametric has more bias
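The nonparametric recipe (sample with replacement, same size as the original, take the mean, repeat) can be sketched as follows. The data values, the seed, and the function name `nonparametric_bootstrap_ci` are invented for illustration; only the resampling scheme comes from the slide.

```python
import random
import statistics

def nonparametric_bootstrap_ci(data, n_boot=999, seed=0):
    """Percentile 95% CI of the mean from resampling data with replacement."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate is a same-size sample drawn from the data themselves.
    means = sorted(
        statistics.mean(rng.choices(data, k=n)) for _ in range(n_boot)
    )
    lo_idx = round(0.025 * (n_boot + 1)) - 1  # 25th value when n_boot = 999
    hi_idx = round(0.975 * (n_boot + 1)) - 1  # 975th value when n_boot = 999
    return means[lo_idx], means[hi_idx]

# Made-up, right-skewed measurements (not the TcCB data).
data = [0.2, 0.3, 0.4, 0.5, 0.7, 0.9, 1.1, 1.3, 2.1, 2.8, 3.8, 7.5]
low, high = nonparametric_bootstrap_ci(data)
print(f"95% CI of the mean: [{low:.3f}, {high:.3f}]")
```

A parametric version would differ only in the generating step: instead of `rng.choices`, it would draw from a distribution fitted to the data.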
Parametric bootstrap
• If Y is log-normal, it is specified in terms of the mean and standard deviation of X = log(Y)
  • Mean = -0.547
  • SD = 1.360
• Use “Monte Carlo Simulation” to generate 999 replicate simulated datasets from the log-normal distribution
• Calculate mean of each replicate and sort means
  • 25th value is lower end of 95% CI
  • 975th value is upper end of 95% CI
TcCB in the cleanup site
95% CI: [-0.678, 8.458]
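A minimal sketch of this procedure, using the slide's parameter values and 999 replicates. Two things are assumptions rather than facts from the slides: that log means the natural logarithm, and the sample size `N_OBS`, which the slides do not state and which is only a placeholder here. The seed is arbitrary, so the numbers printed will not reproduce the slide's interval exactly.

```python
import random
import statistics

MU_LOG = -0.547      # from the slide: mean of X = log(Y)
SD_LOG = 1.360       # from the slide: SD of X = log(Y)
N_OBS = 77           # ASSUMPTION: placeholder, sample size not given in the slides
N_REPLICATES = 999   # from the slide

rng = random.Random(2006)  # arbitrary seed

# One mean per simulated dataset, sorted from smallest to largest.
replicate_means = sorted(
    statistics.mean(rng.lognormvariate(MU_LOG, SD_LOG) for _ in range(N_OBS))
    for _ in range(N_REPLICATES)
)

ci_low = replicate_means[24]    # 25th sorted value
ci_high = replicate_means[974]  # 975th sorted value
print(f"Parametric bootstrap 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Because every simulated value is positive, the lower limit cannot fall below zero, unlike an interval built from the symmetric t formula.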
Parametric bootstrap: results
95% CI: [0.917, 2.293]
Normal QQ Plot
• Sort data
• Index the values (i = 1,2,…,n)
• Calculate q = i/(n+1)
  • This is the quantile
• Plot quantiles against data values
  • This is the empirical cumulative distribution function (CDF)
• Construct CDF of standard normal using same quantiles
• Compare the distributions at the same quantiles
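The QQ-plot steps can be sketched as a function that returns the plotting coordinates. The function name `normal_qq_points` and the example values are hypothetical; the plotting position q = i/(n+1) is the one given on the slide, and `statistics.NormalDist().inv_cdf` supplies the standard normal value at each quantile.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Return (standard normal value, data value) pairs at matching quantiles."""
    ordered = sorted(data)                 # sort data
    n = len(ordered)
    std_normal = NormalDist()              # mean 0, SD 1
    points = []
    for i, value in enumerate(ordered, start=1):  # index i = 1, 2, ..., n
        q = i / (n + 1)                    # quantile of the i-th sorted value
        points.append((std_normal.inv_cdf(q), value))
    return points

# Made-up values; points near a straight line suggest approximate normality.
for z, y in normal_qq_points([4.1, 5.3, 3.8, 6.0, 4.9, 5.1, 4.4]):
    print(f"{z:7.3f}  {y:5.1f}")
```

Dividing by n + 1 rather than n keeps every q strictly between 0 and 1, so the normal quantile is always finite.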
Nonparametric bootstrap: results
95% CI: [0.851, 9.248]
Bootstrap and hypothesis tests
One-sample t-test
• Calculate bootstrap CI of mean
• Does it overlap test value?
Paired t-test
• Calculate differences: Di = xi - yi
• Find bootstrap CI of mean difference
• Does it overlap zero?
Two-sample t-test
• Want to create simulated data where H0 is true (same mean) but allow variance and shape of distribution to differ between populations
• Easiest with nonparametric:
  • Subtract mean from each sample. Now both samples have mean zero
  • Resample these residuals, creating simulated group A from residuals of group A and simulated group B from residuals of group B
  • Generate distribution of t values
  • P is fraction of simulated t’s that exceed t calculated from data
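The two-sample procedure can be sketched as below. The slides do not say which t statistic is used; this sketch assumes the unequal-variance (Welch) form, since the stated goal is to let variances differ between groups. The function names, the data, and the seed are invented for illustration, and the P value is one-sided, following the slide's "exceed" wording.

```python
import math
import random
import statistics

def welch_t(a, b):
    """t statistic for a difference in means, allowing unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def bootstrap_two_sample_p(a, b, n_boot=999, seed=1):
    """Return (observed t, fraction of simulated t values exceeding it)."""
    rng = random.Random(seed)
    t_obs = welch_t(a, b)
    # Subtract each group's own mean so both groups have mean zero (H0 true).
    res_a = [x - statistics.mean(a) for x in a]
    res_b = [x - statistics.mean(b) for x in b]
    exceed = 0
    for _ in range(n_boot):
        # Resample within each group, keeping each group's variance and shape.
        sim_a = rng.choices(res_a, k=len(a))
        sim_b = rng.choices(res_b, k=len(b))
        try:
            t_sim = welch_t(sim_a, sim_b)
        except ZeroDivisionError:
            continue  # both resamples constant; counted as not exceeding
        if t_sim > t_obs:
            exceed += 1
    return t_obs, exceed / n_boot

# Made-up groups (not the TcCB data).
group_a = [1.2, 0.8, 2.5, 3.1, 0.4, 5.6, 1.9, 2.2]
group_b = [0.5, 0.7, 0.6, 0.9, 0.4, 0.8, 0.55, 0.65]
t_obs, p_value = bootstrap_two_sample_p(group_a, group_b)
print(f"t = {t_obs:.2f}, one-sided bootstrap P = {p_value:.3f}")
```

Resampling residuals within each group, rather than pooling them, is what lets the simulated groups share a mean while keeping their own spread.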
TcCB: H0: cleanup mean = reference mean
• t = 1.45
• Bootstrapped ‘t’ values do not follow a t distribution!
• P = 0.02