Introduction to Bootstrapping James Guszcza, FCAS, MAAA CAS Predictive Modeling Seminar Chicago September, 2005
What’s it all about? • Actuaries compute point estimates of statistics all the time. • Loss ratio/claim frequency for a population • Outstanding losses • Correlation between variables • GLM parameter estimates … • A point estimate tells us what the data indicates. • But how can we measure our confidence in this indication?
More Concisely… • Point estimate says: “what do you think?” • Variability of the point estimate says: “how sure are you?” • Traditional approaches • Credibility theory • Use distributional assumptions to construct confidence intervals • Is there an easier – and more flexible – way?
Enter the Bootstrap • In the late 1970s the statistician Brad Efron made an ingenious suggestion. • Most (sometimes all) of what we know about the “true” probability distribution comes from the data. • So let’s treat the data as a proxy for the true distribution. • We draw multiple samples from this proxy… • This is called “resampling”. • …and compute the statistic of interest on each of the resulting pseudo-datasets.
Philosophy • “[Bootstrapping] requires very little in the way of modeling, assumptions, or analysis, and can be applied in an automatic way to any situation, no matter how complicated.” • “An important theme is the substitution of raw computing power for theoretical analysis.” --Efron and Gong 1983 • Bootstrapping fits very nicely into the “data mining” paradigm.
The Basic Idea: Theoretical Picture • Any actual sample of data was drawn from the unknown “true” distribution. • We use the actual data to make inferences about the true parameters (μ). • Each green oval is a sample that “might have been”. • [Diagram: the “true” distribution in the sky, with mean μ, generates Sample 1 (Y11,…,Y1k) through Sample N (YN1,…,YNk); each sample yields its own estimate Ȳ1,…,ȲN.] • The distribution of our estimator (Ȳ) depends on both the true distribution and the size (k) of our sample.
The Basic Idea: The Bootstrapping Process • Treat the actual sample as a proxy for the true distribution. • Sample with replacement from your actual data N times. • Compute the statistic of interest on each “re-sample”. • [Diagram: the actual sample (Y1,…,Yk), with statistic Ȳ, generates Re-sample 1 (Y*11,…,Y*1k) through Re-sample N (Y*N1,…,Y*Nk); each re-sample yields Ȳ*1,…,Ȳ*N.] • {Ȳ*} constitutes an estimate of the distribution of Ȳ.
Sampling With Replacement • In fact, if we resample 500 points with replacement from a dataset of 500, there is a chance of (1 − 1/500)^500 ≈ 1/e ≈ .368 that any one of the original data points won’t appear at all; equivalently, any given data point is included with probability ≈ .632. • Intuitively, we treat the original sample as the “true population in the sky”. • Each resample simulates the process of taking a sample from the “true” distribution.
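The ≈ .368 figure is easy to verify directly; a quick R sketch (not from the original slides):

```r
# Chance that a given original point never appears in one resample of size 500
p.miss <- (1 - 1/500)^500
round(p.miss, 3)       # ~0.368, essentially 1/e
round(1 - p.miss, 3)   # ~0.632: chance the point shows up at least once

# Simulation check: average fraction of distinct original points per resample
set.seed(1)
hits <- replicate(2000, length(unique(sample(1:500, 500, replace = TRUE))) / 500)
round(mean(hits), 3)   # close to 0.632
```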
Theoretical vs. Empirical • Graph on left: Ȳ calculated from an ∞ number of samples from the “true” distribution. • Graph on right: {Ȳ*} calculated in each of 1000 re-samples from the empirical distribution. • Analogy: μ : Ȳ :: Ȳ : Ȳ*
Summary • The empirical distribution – your data – serves as a proxy for the “true” distribution. • “Resampling” means (repeatedly) sampling with replacement. • Resampling the data is analogous to the process of drawing the data from the “true” distribution. • We can resample multiple times, computing the statistic of interest T on each re-sample. • The result is an estimate of the distribution of T.
Motivating Example • Let’s look at a simple case where we all know the answer in advance. • Pull 500 draws from the n(5000,100) distribution. • The sample mean ≈ 5000 • …is a point estimate of the “true” mean μ. • But how sure are we of this estimate? • From theory, we know that: s.d.(Ȳ) = σ/√k = 100/√500 ≈ 4.47
Visualizing the Raw Data • 500 draws from n(5000,100) • Look at summary statistics, histogram, probability density estimate, QQ-plot. • … looks pretty normal
Sampling With Replacement Now let’s use resampling to estimate the s.d. of the sample mean (≈4.47) • Draw a data point at random from the data set. • Then throw it back in • Draw a second data point. • Then throw it back in… • Keep going until we’ve got 500 data points. • You might call this a “pseudo” data set. • This is not merely re-sorting the data. • Some of the original data points will appear more than once; others won’t appear at all.
Resampling • Sample with replacement 500 data points from the original dataset S • Call this S*1 • Now do this 999 more times! • S*1, S*2,…, S*1000 • Compute X-bar on each of these 1000 samples.
R Code

norm.data <- rnorm(500, mean = 5000, sd = 100)

boots <- function(data, R) {
  # collect the bootstrapped means and s.d.'s in global vectors
  b.avg <<- c(); b.sd <<- c()
  for (b in 1:R) {
    # one re-sample: same size as the data, drawn with replacement
    ystar <- sample(data, length(data), replace = TRUE)
    b.avg <<- c(b.avg, mean(ystar))
    b.sd  <<- c(b.sd, sd(ystar))
  }
}

boots(norm.data, 1000)
Results • From theory we know that X-bar ~ n(5000, 4.47) • Bootstrapping estimates this pretty well! • And we get an estimate of the whole distribution, not just a confidence interval.
Two Ways of Looking at a Confidence Interval • Approximate normality assumption: X-bar ± 2 × (bootstrap dist s.d.) • Percentile method: just take the desired percentiles of the bootstrap histogram. • More reliable in cases of asymmetric bootstrap histograms.

> mean(norm.data) - 2 * sd(b.avg)
[1] 4986.926
> mean(norm.data) + 2 * sd(b.avg)
[1] 5004.661
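With the bootstrap means in hand, the two intervals are one line each. A self-contained sketch (the data is regenerated here; the seed is arbitrary):

```r
set.seed(42)   # arbitrary seed, for reproducibility
norm.data <- rnorm(500, mean = 5000, sd = 100)
b.avg <- replicate(1000, mean(sample(norm.data, 500, replace = TRUE)))

# 1. Approximate-normality interval: X-bar +/- 2 bootstrap s.d.'s
mean(norm.data) + c(-2, 2) * sd(b.avg)

# 2. Percentile interval: read the quantiles off the bootstrap histogram
quantile(b.avg, probs = c(.025, .975))
```

Both intervals should have width ≈ 4 × 4.47 ≈ 18 here; they diverge only when the bootstrap histogram is asymmetric.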
And a Bonus • Note that we can calculate both the mean and standard deviation of each pseudo-dataset. • This enables us to estimate the correlation between the mean and s.d. • The normal distribution is not skew ⇒ mean and s.d. are uncorrelated. • Our bootstrapping experiment confirms this.
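This pairing is easy to check: compute (mean, s.d.) on each re-sample and correlate. A sketch on simulated normal data:

```r
set.seed(1)
norm.data <- rnorm(500, mean = 5000, sd = 100)

# Each re-sample contributes one (mean, s.d.) pair
stats <- replicate(1000, {
  ystar <- sample(norm.data, 500, replace = TRUE)
  c(avg = mean(ystar), sd = sd(ystar))
})

# For symmetric (normal) data this correlation is near 0,
# up to resampling noise
cor(stats["avg", ], stats["sd", ])
```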
More Interesting Examples • We’ve seen that bootstrapping replicates a result we know to be true from theory. • Often in the real world we either don’t know the ‘true’ distributional properties of a random variable… • …or are too busy to find out. • This is when bootstrapping really comes in handy.
Severity Data • 2700 size-of-loss data points. • Mean = 3052, Median = 1136 • Let’s estimate the distributions of the sample mean & 75th %ile. • Gamma? Lognormal? Don’t need to know.
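The severity dataset itself isn't reproduced here, so the sketch below runs the same procedure on simulated lognormal losses (the parameters are illustrative, chosen only to give a heavily skewed sample of the same size):

```r
set.seed(7)
sev <- rlnorm(2700, meanlog = 7, sdlog = 1.4)   # illustrative skewed severities

# Bootstrap any statistic f: re-sample, recompute, repeat
boot.stat <- function(x, R, f)
  replicate(R, f(sample(x, length(x), replace = TRUE)))

b.mean <- boot.stat(sev, 1000, mean)
b.q75  <- boot.stat(sev, 1000, function(y) quantile(y, .75))

sd(b.mean)                        # bootstrap s.e. of average severity
quantile(b.q75, c(.025, .975))    # percentile interval for the 75th %ile
```

No gamma or lognormal assumption enters the procedure; the same two lines work for either.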
What about the 90th %ile? • So far so good – bootstrapping shows that many of our sample statistics – even average severity! – are approximately normally distributed. • But this breaks down if our statistic is not a “smooth” function of the data… • Often in loss reserving we want to focus our attention way out in the tail… • The 90th %ile is an example.
Variance Related to the Mean • As with the normal example, we can calculate both the sample average and s.d. on each pseudo-dataset. • This time (as one would expect) the variance is a function of the mean.
Bootstrapping a Correlation Coefficient #1 • About 700 data points • Credit on a scale of 1-100 • 1 is worst; 100 is best • Age and credit are linearly related • See plot • R² ≈ .08 ⇒ ρ ≈ .28 • Older people tend to have better credit • What is the confidence interval around ρ?
Bootstrapping a Correlation Coefficient #1 • The bootstrap distribution of ρ appears normal. • ρ ≈ .28 • s.d.(ρ) ≈ .028 • Both confidence interval calculations agree fairly well:

> quantile(boot.avg, probs = c(.025, .975))
     2.5%     97.5%
0.2247719 0.3334889
> rho - 2*sd(boot.avg); rho + 2*sd(boot.avg)
[1] 0.2250254
[1] 0.3354617
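The key detail is that each (age, credit) pair is resampled together, i.e., we resample rows. The original data isn't available here, so this sketch simulates an age/credit relationship with roughly the right ρ (the generating equation is purely illustrative):

```r
set.seed(3)
n <- 700
age    <- runif(n, 18, 80)
credit <- 50 + 0.3 * age + rnorm(n, sd = 15)   # illustrative; gives rho ~ .3

rho <- cor(age, credit)

boot.rho <- replicate(1000, {
  i <- sample(n, n, replace = TRUE)   # resample (age, credit) PAIRS
  cor(age[i], credit[i])
})

quantile(boot.rho, c(.025, .975))     # percentile interval
rho + c(-2, 2) * sd(boot.rho)         # normal-approximation interval
```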
Bootstrapping a Correlation Coefficient #2 • Let’s try a different example. • ≈1300 zip-code level data points • Variables: population density, median #vehicles/HH • R² ≈ .50; ρ ≈ −.70
Bootstrapping a Correlation Coefficient #2 • The bootstrap distribution of ρ is more skew. • ρ ≈ −.70 • 95% conf interval: (−.75, −.67) • Not symmetric around ρ • The effect becomes more pronounced the larger |ρ| is.
Bootstrapping Loss Ratio • Now for what we’ve all been waiting for… • Total loss ratio of a segment of business is our favorite point estimate. • Its variability depends on many things: • Size of book • Loss distribution • Accuracy of rating plan • Consistency of underwriting… • How could we hope to write down the true probability distribution? • Bootstrapping to the rescue…
Bootstrapping Loss Ratio & Frequency • ≈50,000 insurance policies • Severity dist from previous example • LR = .79 • Claim frequency = .08 • Let’s build confidence intervals around these two point estimates. • We will resample the data 500 times • Compute total LR and freq on each sample • Plot the histogram
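The policy-level data isn't reproduced here, but the mechanics can be sketched on a simulated portfolio. The flat premium and lognormal severity below are illustrative assumptions, tuned to give LR ≈ .79 and freq ≈ .08:

```r
set.seed(11)
n <- 50000                                   # illustrative portfolio
premium <- rep(1000, n)
claim   <- rbinom(n, 1, 0.08)                # claim indicator; freq ~ .08
loss    <- claim * rlnorm(n, meanlog = 8.35, sdlog = 1.3)

lr   <- sum(loss) / sum(premium)             # ~ .79 with these parameters
freq <- mean(claim)

boot.stats <- replicate(500, {
  i <- sample(n, n, replace = TRUE)          # resample whole policies
  c(lr = sum(loss[i]) / sum(premium[i]), freq = mean(claim[i]))
})

apply(boot.stats, 1, sd)                     # bootstrap s.e. of LR and freq
quantile(boot.stats["lr", ], c(.025, .975))  # percentile interval for LR
```

Note that each policy's premium and loss travel together through the resampling, so the rating plan, underwriting, and loss distribution are all reflected without writing down a probability model.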
Results: Distribution of total LR • A little skew, but somewhat close to normal • LR ≈ .79 • s.d.(LR) ≈ .05 ⇒ conf interval ≈ ±0.1 • Confidence interval calculations disagree a bit:

> quantile(boot.avg, probs = c(.025, .975))
     2.5%     97.5%
0.6974607 0.8829664
> lr - 2*sd(boot.avg); lr + 2*sd(boot.avg)
[1] 0.6897653
[1] 0.8888983
Dependence on Sample Size • Let’s take a sub-sample of 10,000 policies • How does this affect the variability of LR? • Again re-sample 500 times • Skewness and variance increase considerably • LR: .79 → .78 • s.d.(LR): .05 → .13
Distribution of Capped LR • Capped LR is analogous to the trimmed mean from robust statistics • Removes the leverage of a few large data points • Here we cap policy-level losses at $30,000 • Affects 50 out of 2700 claims • Closer to frequency: distribution less skew – close to normal • s.d. cut in half! .05 → .025
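Capping changes one line of the bootstrap: apply pmin() to the losses before resampling. A sketch on the same kind of simulated portfolio as above (parameters illustrative, not the slides' data):

```r
set.seed(11)
n <- 50000                                 # illustrative portfolio
premium <- rep(1000, n)
loss <- rbinom(n, 1, 0.08) * rlnorm(n, meanlog = 8.35, sdlog = 1.3)
cap.loss <- pmin(loss, 30000)              # cap policy-level losses at $30,000

boot.lr <- function(l, R = 500) replicate(R, {
  i <- sample(n, n, replace = TRUE)
  sum(l[i]) / sum(premium[i])
})

sd(boot.lr(loss))       # bootstrap s.e. of uncapped LR
sd(boot.lr(cap.loss))   # capped LR: distinctly smaller, less skew
```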
Results: Distribution of Frequency • Much less variance than LR; very close to normal • freq ≈ .08 • s.d.(freq) ≈ .0017 • Confidence interval calculations match very well:

> quantile(boot.avg, probs = c(.025, .975))
      2.5%      97.5%
0.07734336 0.08391072
> freq - 2*sd(boot.avg); freq + 2*sd(boot.avg)
[1] 0.07719618
[1] 0.08388898
When are LRs statistically different? • Example: divide our 50,000 policies into two sub-segments: {clean drivers, other} • LRtot = .79 • LRclean = .58 ⇒ LRRclean = −27% • LRother = .84 ⇒ LRRother = +6% • Clean drivers appear to have ≈ 30% lower LR than non-clean drivers • How sure are we of this indication? • Let’s use bootstrapping.
Bootstrapping the difference in LRs • Simultaneously re-sample the two segments 500 times. • At each iteration, calculate LRc*, LRo*, (LRc* − LRo*), (LRc*/LRo*). • Analyze the resulting empirical distributions. • What is the average difference in loss ratios? • What percent of the time is the difference in loss ratios greater than x%?
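A sketch of the simultaneous re-sampling, on illustrative simulated segments (the slides' actual book is not reproduced here; frequencies and severities below are made up):

```r
set.seed(5)
# Illustrative segment-level data generator: flat premium, lognormal severity
make.seg <- function(n, f, mlog) {
  claim <- rbinom(n, 1, f)
  list(prem = rep(1000, n), loss = claim * rlnorm(n, meanlog = mlog, sdlog = 1.3))
}
clean <- make.seg(20000, 0.07, 8.2)
other <- make.seg(30000, 0.09, 8.4)

# Re-sample each segment separately, but within the same iteration
boot.diff <- replicate(500, {
  ic <- sample(length(clean$loss), replace = TRUE)
  io <- sample(length(other$loss), replace = TRUE)
  sum(clean$loss[ic]) / sum(clean$prem[ic]) -
    sum(other$loss[io]) / sum(other$prem[io])
})

mean(boot.diff)            # average difference in loss ratios
mean(boot.diff < -0.10)    # how often clean beats other by > 10 points
```

The last line answers the "how sure are we?" question directly: it is an empirical probability, with no distributional assumption about the difference.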
Final Example: loss reserve variability • A major issue in the loss reserving community is reserve variability • i.e., the predictive variance of your estimate of outstanding losses. • Bootstrapping is a natural way to tackle this problem. • It is hard to find an analytic formula for the variability of these o/s losses. • Approach here: bootstrap cases, not residuals.
Bootstrapping Reserves • S = database of 5000 claims • Sample with replacement all claims in S • Call this S*1 • Same size as S • Now do this 499 more times! • S*1, S*2,…, S*500 • Estimate o/s reserves on each sample • Get a distribution of reserve estimates
Simulated Loss Data • Simulate database of 5000 claims • 500 claims/year; 10 years • Each of the 5000 claims was drawn from a lognormal distribution with parameters μ = 8; σ = 1.3 • Build in loss development patterns: L(i+1) = L(i) × (link + ε) • ε is a random error term • See CLRS presentation (2005) for more details.
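A minimal sketch of the development simulation: each claim grows by a link factor plus noise at every age. The link factors and error s.d. below are illustrative, not those of the CLRS presentation:

```r
set.seed(9)
links <- c(2.0, 1.5, 1.25, 1.1, 1.05, 1.03, 1.02, 1.01, 1.005)  # illustrative

# One claim's development path: repeatedly apply (link + noise)
develop <- function(L1, links, sd.eps = 0.05) {
  L <- numeric(length(links) + 1)
  L[1] <- L1
  for (i in seq_along(links))
    L[i + 1] <- L[i] * (links[i] + rnorm(1, sd = sd.eps))
  L
}

first.reports <- rlnorm(5, meanlog = 8, sdlog = 1.3)  # a few sample claims
paths <- t(sapply(first.reports, develop, links = links))
round(paths[1, ], 0)   # one simulated development path to ultimate
```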
Bootstrapping Reserves • Compute our reserve estimate on each S*k • These 500 reserve estimates constitute an estimate of the distribution of outstanding losses • Notice that we did this by resampling our original dataset S of claims. • Note: this bootstrapping method differs from analyses that bootstrap the residuals of a model. • Those methods rely on the assumption that your model is correct.
Distribution of Outstanding Losses • Blue bars: the bootstrapped distribution • Dotted line: kernel density estimate of the distribution • Pink line: superimposed normal
Distribution of Outstanding Losses • The simulated dist of outstanding losses appears ≈ normal. • Mean: $21.751M • Median: $21.746M • σ: $0.982M • σ/μ ≈ 4.5% • 95% confidence interval: ($19.8M, $23.7M) • Note: the 2.5 and 97.5 %iles of the bootstrapping distribution roughly agree with $21.75M ± 2σ.
Distribution of Outstanding Losses • We can examine a QQ plot to verify that the distribution of o/s losses is approximately normal. • However, the tails are somewhat heavier than normal. • Remember – this is just simulated data! • Real-life results have been consistent with these results.
References • Davison and Hinkley, Bootstrap Methods and their Application • Efron and Tibshirani, An Introduction to the Bootstrap • Efron and Gong, “A Leisurely Look at the Bootstrap,” American Statistician, 1983 • Efron and Tibshirani, “Bootstrap Methods for Standard Errors,” Statistical Science, 1986 • Derrig, Ostaszewski, and Rempala, “Applications of Resampling Methods in Actuarial Practice,” PCAS, 2000