Bootstraps and Jackknives Hal Whitehead BIOL4062/5062
Confidence in estimators
• Why use bootstraps or jackknives?
• The jackknife
• The parametric bootstrap
• The non-parametric bootstrap (“The bootstrap”)
Estimation without confidence (standard error, confidence interval) has little value
Confidence in estimates: Traditional approach
[Diagram labels: Data; Biological model; Estimator (Statistic); Statistical model; Confidence in estimator?]
Confidence in estimates: Traditional approach
e.g. What is the sex ratio of a vole population?
• Trap: 12 males, 15 females
• Estimated ratio (proportion male): 12/(12+15) = 0.444
• Using the binomial distribution: SE = √[0.444 × (1 - 0.444)/(12+15)] = 0.096
• So: the sex ratio is estimated to be 0.444 (SE 0.096)
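The vole arithmetic can be checked with a few lines of Python. This is only a sketch of the standard binomial standard error; the variable names are illustrative and not part of the original slides.

```python
import math

males, females = 12, 15
n = males + females

p = males / n                      # estimated proportion of males
se = math.sqrt(p * (1 - p) / n)    # binomial standard error (note the square root)

print(f"{p:.3f} (SE {se:.3f})")    # prints "0.444 (SE 0.096)"
```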
e.g. Asymmetry of size among nestlings in nests of 6
• Measure: difference between the size of each nestling and that of its most similar neighbour
• Sizes {1.2 4.3 4.7 3.2 6.1 1.3} => differences [0.1 0.4 0.4 1.1 1.4 0.1] => mean = 0.58
• But what confidence have we in this?
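A small sketch of how the asymmetry measure could be computed, assuming "most similar neighbour" means the nestling in the same nest with the closest size. The function name `nearest_diffs` is hypothetical.

```python
sizes = [1.2, 4.3, 4.7, 3.2, 6.1, 1.3]

def nearest_diffs(values):
    """For each value, the absolute difference to the closest other value."""
    return [
        min(abs(v - w) for j, w in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]

diffs = nearest_diffs(sizes)
asymmetry = sum(diffs) / len(diffs)   # mean nearest-neighbour difference

print([round(d, 1) for d in diffs], round(asymmetry, 2))
```

The point estimate is easy to obtain; the difficulty raised on the slide is that no standard formula gives its standard error.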
Confidence in estimator: Mean distance between animals
• In a small population: what is the expected distance between any two animals?
• Estimate is: mean of the distances between all pairs of animals
• What is the confidence in this estimate?
• No easy formula (the pairwise distances are not independent)
Use Bootstraps and Jackknives when:
• No clear biological model
• Deriving a statistical model is very difficult, impossible, or tedious
• Statistical model too complicated to be useful
• Model may not be quite valid
• Accurate measure of precision under the statistical model only possible with large n
The Jackknife
• Data D = {X_1, X_2, X_3, ..., X_n} => statistic s
• Jackknife replicates miss out units (or groups of units) in turn:
  • J_1 = {X_2, X_3, ..., X_n} => statistic s_{-1} (missing unit 1)
  • J_2 = {X_1, X_3, ..., X_n} => statistic s_{-2} (missing unit 2)
  • etc.
• Convert into pseudovalues:
  • φ_1 = n⋅s - (n-1)⋅s_{-1}
  • φ_2 = n⋅s - (n-1)⋅s_{-2}
  • etc.
The Jackknife
• The jackknifed estimate of s is then:
  • s_J = mean(φ_1, ..., φ_n)
  • SE(s_J) = SD(φ_1, ..., φ_n)/√n
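The leave-one-out recipe above can be sketched as a short generic function. This is a minimal illustration rather than a reference implementation; `jackknife` is a hypothetical helper, and the six nestling sizes from the earlier slide are reused only as convenient test data.

```python
import math
import statistics

def jackknife(data, statistic):
    """Return (jackknifed estimate, jackknife SE) for statistic(data)."""
    n = len(data)
    s = statistic(data)
    pseudovalues = []
    for i in range(n):
        leave_one_out = data[:i] + data[i + 1:]      # drop unit i
        s_minus_i = statistic(leave_one_out)
        pseudovalues.append(n * s - (n - 1) * s_minus_i)
    estimate = statistics.mean(pseudovalues)
    se = statistics.stdev(pseudovalues) / math.sqrt(n)
    return estimate, se

sizes = [1.2, 4.3, 4.7, 3.2, 6.1, 1.3]
est, se = jackknife(sizes, statistics.mean)
print(est, se)
```

For the mean, each pseudovalue equals the omitted observation, so the jackknife reproduces the ordinary mean and its usual standard error SD/√n. That makes the mean a handy sanity check before applying the function to a less tractable statistic.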
The Jackknife
• Jackknifed estimate reduces bias
• Jackknife SE is “rough and ready”
  • usually “conservative” (overestimates SE)
• Jackknife on blocks of units, if data not independent
• Assumes normality for confidence intervals
Correlation between gill weight and body weight in 12 crabs

Gill (mg)   Body (g)   r_{-i}   φ_i
159         14.40      0.888    0.607
179         15.20      0.884    0.656
100         11.30      0.892    0.570
45           2.50      0.830    1.249
384         22.70      0.811    1.452
230         14.90      0.863    0.879
100          1.41      0.875    0.751
320         15.81      0.872    0.779
80           4.19      0.845    1.078
220         15.39      0.867    0.843
320         17.25      0.858    0.940
210          9.52      0.877    0.725

Correlation from all 12 crabs: r = 0.865
Jackknife r = 0.878 [mean(φ_i)], SE 0.0768 [SD(φ_i)/√12]
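The crab example can be reproduced with a short script. This sketch assumes the gill weights are the twelve values listed in the median gill weight example (159, 179, 100, ... mg), paired in order with the body weights from the table; `pearson_r` is a hypothetical helper written out so the block needs only the standard library.

```python
import math
import statistics

# Assumed pairing: gill weights (mg) in the order given in the median example
gill_mg = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]
body_g = [14.40, 15.20, 11.30, 2.50, 22.70, 14.90,
          1.41, 15.81, 4.19, 15.39, 17.25, 9.52]

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

n = len(gill_mg)
r_all = pearson_r(gill_mg, body_g)

pseudovalues = []
for i in range(n):
    # Leave out crab i (both of its measurements)
    r_minus_i = pearson_r(gill_mg[:i] + gill_mg[i + 1:],
                          body_g[:i] + body_g[i + 1:])
    pseudovalues.append(n * r_all - (n - 1) * r_minus_i)

jack_r = statistics.mean(pseudovalues)
jack_se = statistics.stdev(pseudovalues) / math.sqrt(n)
print(f"r = {r_all:.3f}, jackknife r = {jack_r:.3f} (SE {jack_se:.4f})")
```

Under these assumptions the output should come out close to the slide's r = 0.865, jackknife r = 0.878 and SE 0.0768, with small differences possible from rounding.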
Parametric Bootstrap
• Assume data produced by a model with some parameters unknown, which need to be estimated:
  • Model => Data => Parameter estimates (s)
• The bootstrap process:
  • Model + Parameter estimates (s) => Random data => Bootstrap replicate estimates (s*)
• Distribution of the bootstrap replicate estimates (the s*’s) gives the distribution, confidence intervals and standard errors of s (plus an indicator of bias)
• Usually use 100-10,000 bootstrap replicates
Parametric Bootstrap: an example
Mark-Recapture Estimate
• Mark 25 animals
• Recapture 46, of which 12 are marked
• What is the population size?
• “Petersen” estimate is 25 × 46/12 = 95.8
• What is the confidence in this estimate, and its expected bias?
Parametric Bootstrap: an example
Mark-Recapture Estimate
• Mark 25 animals; recapture 46, of which 12 are marked
• “Petersen” estimate is 25 × 46/12 = 95.8
• What is the confidence, expected bias?
• Parametric bootstrap replicates:
  • 96 animals, mark 25, recapture 46
  • How many marked?
• From simulation (ms =):
  • 9 14 14 9 14 13 12 13 12 14 ...
• Calculate population estimates (ns = 25 × 46/ms):
  • 127.8 82.1 82.1 127.8 82.1 88.5 95.8 88.5 95.8 82.1 ...
Parametric Bootstrap: an example
Mark-Recapture Estimate
• “Petersen” estimate is 25 × 46/12 = 95.8
• Bootstrap population estimates (assuming n = 96):
  • 127.8 82.1 82.1 127.8 82.1 88.5 95.8 88.5 95.8 82.1 ...
• Expected bias:
  • mean(ns) - 96 = 99.7 - 96 = 3.7
• Estimated standard error:
  • SD(ns) = 20.4
• So the bias-corrected population estimate is: 95.8 - 3.7 = 92.1 (SE 20.4)
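The mark-recapture bootstrap can be sketched in Python as follows. This assumes the recapture is a random sample without replacement from a closed population of 96, of which 25 are marked; the slides do not state the sampling model, so this is an assumption. The seed and the 10,000 replicates are arbitrary choices, and the results will differ somewhat from the slide's 3.7 and 20.4.

```python
import random
import statistics

MARKED, RECAPTURED, MARKED_IN_RECAPTURE = 25, 46, 12

petersen = MARKED * RECAPTURED / MARKED_IN_RECAPTURE   # 95.8
n_model = round(petersen)                              # model population: 96

rng = random.Random(1)                                 # arbitrary seed
# True = marked animal, False = unmarked animal
population = [True] * MARKED + [False] * (n_model - MARKED)

replicates = []
for _ in range(10_000):
    # Assumed model: recapture without replacement, count the marked animals
    m_star = sum(rng.sample(population, RECAPTURED))
    if m_star > 0:                                     # estimate undefined if none marked
        replicates.append(MARKED * RECAPTURED / m_star)

bias = statistics.mean(replicates) - n_model           # expected bias
se = statistics.stdev(replicates)                      # bootstrap standard error
corrected = petersen - bias                            # bias-corrected estimate

print(f"bias {bias:.1f}, SE {se:.1f}, corrected estimate {corrected:.1f}")
```

Sampling with replacement (a binomial model) would be another defensible choice and would give a somewhat larger standard error. The choice of model is part of what makes this bootstrap "parametric".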
Non-Parametric Bootstrap (a.k.a. “The Bootstrap”)
• Data D = {X_1, X_2, X_3, ..., X_n} => statistic s
• Bootstrap replicates:
  • D*_1 = {X*_1, X*_2, X*_3, ..., X*_n} => statistic s*_1
  • D*_2 = {X*_1, X*_2, X*_3, ..., X*_n} => statistic s*_2
  • ...
• X*_1, X*_2, X*_3, ..., X*_n are randomly selected, with replacement, from X_1, X_2, X_3, ..., X_n
• Distribution, confidence interval and SE of s are estimated from the distribution, confidence interval and standard error of the s*’s
• Usually use 100-10,000 bootstrap replicates
Non-Parametric Bootstrap: an example
Median Gill Weight in Crabs
Gill weights (in mg): 159 179 100 45 384 230 100 320 80 220 320 210
Median = 195 mg

        Sample                                              Median
Real    159 179 100  45 384 230 100 320  80 220 320 210     195
Bootstrap replicates:
B1      320 159  45 320 100 320 100 320 100 230 100 210     185
B2      384 384  45 384  45 384 100  80  45 179 230 230     205
B3      159 320  80  45  45  80 220 210 230 320 230 220     215
B4      220 179 384 100  80 100 230 230 179 230 384  45     200
B5      320 220 210 100 159 320 220 210 100  80 100 210     210
B6       80 100 230 100 210 384 159 220 320  45  45 210     185
B7      179 210  80 320 100 230 159 320 100  45 384 320     195
B8      384 159 100 159 100 179 100 179 220 384 220 159     169
B9      320 210  45 320 179 159 100 210 159  45 210 100     169
...
Non-Parametric Bootstrap: an example
Median Gill Weight in Crabs
Gill weights (in mg): 159 179 100 45 384 230 100 320 80 220 320 210
Median = 195 mg
Mean of the bootstrap medians (1000 samples) = 188 mg
95% c.i. = 100-275 mg [b(25) to b(975), the 25th and 975th of the 1000 ordered bootstrap medians]
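A minimal sketch of the median example, using resampling with replacement and the percentile interval b(25) to b(975) described above. The seed is arbitrary, so the numbers will only approximate the slide's 188 mg and 100-275 mg.

```python
import random
import statistics

gill = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]

rng = random.Random(1)      # arbitrary seed
B = 1000                    # number of bootstrap replicates

# Each replicate: draw n values with replacement, then take the median
medians = sorted(
    statistics.median(rng.choices(gill, k=len(gill))) for _ in range(B)
)

boot_mean = statistics.mean(medians)            # mean of bootstrap medians
boot_se = statistics.stdev(medians)             # bootstrap standard error
ci_low, ci_high = medians[24], medians[974]     # 25th and 975th ordered values

print(f"median {statistics.median(gill)}, bootstrap mean {boot_mean:.0f}, "
      f"SE {boot_se:.0f}, 95% c.i. {ci_low}-{ci_high}")
```

The percentile interval needs no normality assumption, which is one reason the bootstrap gives confidence intervals directly while the jackknife gives only a standard error.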
Bootstraps in Molecular Genetics
• Calculate tree based on genetic data (e.g. 20 species and 300 loci)
• For each bootstrap replicate:
  • Resample loci with replacement (20 species with 300 loci, some repeats)
  • Calculate tree
• Look at agreement between original and bootstrap trees
Bootstrapped spanning tree Glazko & Nei Mol. Biol. Evol. 2003
Bootstraps
• “Better” estimate of confidence
• Variable n
• Self-comparisons a problem (e.g. mean of associations)
• Gives SEs, confidence intervals and profile of confidence

Jackknives
• “Worse” estimate of confidence
• Usually conservative (underestimates precision)
• Fixed n
• Self-comparisons not a problem
• Reduces bias
• Only directly gives SE
• Confidence intervals need assumption of normality
Bootstraps and Jackknives
• Give estimates of confidence (and bias) when:
  • distributions unknown, approximate, or intractable
• Parametric bootstrap
  • very useful if model known
  • needs programming
• Non-parametric bootstrap
  • widely applicable (except self-referencing situations)
  • few assumptions
• Jackknife
  • approximate
  • only standard error given directly
  • useful when bootstrap not applicable