530 likes | 856 Views
Don't Be Loopy: Re-Sampling and Simulation the SAS® Way. David L. Cassell Design Pathways Corvallis, OR. Introduction. Bootstrapping Jackknifing Cross-validation Simulations Monte Carlo … and on and on…. First, the BAD WAY. The typical bootstrap code – a huge macro loop Slow Awkward
E N D
Don't Be Loopy: Re-Sampling and Simulation the SAS® Way David L. Cassell Design Pathways Corvallis, OR
Introduction Bootstrapping Jackknifing Cross-validation Simulations Monte Carlo … and on and on… David L. Cassell, Design Pathways
First, the BAD WAY The typical bootstrap code – a huge macro loop Slow Awkward Very complex code Log-filling Output-clogging Did I mention ‘slow’? David L. Cassell, Design Pathways
The typical BAD bootstrap code – a huge macro loop %do i = 1 %to &REPS ; %* steps to generate one data set; %* the proc to do the analysis; %* some way of appending the new results; %end; %* a proc to compute the bootstrap estimates; %mend; David L. Cassell, Design Pathways
Interlude – What is a bootstrap? Types of Re-sampling: Random draws Designed subsets Exchange labels David L. Cassell, Design Pathways
Interlude – What is a bootstrap? Want to approximate sampling distribution Simple: SRS with replacement from original sample Non-parametric (mostly) Want: bias, std error, CI, or … Assumptions: exchangeability, … David L. Cassell, Design Pathways
Interlude – What is a bootstrap? We’ll start with the simple bootstrap Get a URS sample of size N Compute your statistic Repeat B=1000 or 10,000 or … times Look at the behavior of your B values David L. Cassell, Design Pathways
Interlude – What is a bootstrap? Warning: do not forget exchangeability! The simple / naïve bootstrap doesn’t work right on: Time series data Repeated measures data Survey sample data Data with analytic weights . . . . David L. Cassell, Design Pathways
Interlude – What is a bootstrap? A common approach is the bootstrap percentile interval: Take your B values from before Pull the 2.5th and 97.5th percentiles to get a 95% percentile interval as your CI David L. Cassell, Design Pathways
The typical BAD bootstrap code – a huge macro loop %macro bootie ( input=, reps= ); %do i = 1 %to &REPS ; %* steps to generate one data set; %* the proc to do the analysis; %* some way of appending the new results; %end; %* a proc to compute the bootstrap estimates; %mend; David L. Cassell, Design Pathways
A Better Bootstrap 1. Generate ALL of the bootstrap samples as one data set 2. Use the same proc as before, but use by-processing 3. Use the same computations to get the bootstrap estimates David L. Cassell, Design Pathways
A Better Bootstrap proc surveyselect data=YourData out=outboot seed=30459584 method=urs samprate=1 outhits rep=1000; run; David L. Cassell, Design Pathways
A Better Bootstrap proc univariate data=outboot; var x; by Replicate; output out=out1 q1=q1 median=med q3=q3; run; data out2; set out1; trimean = (q1 + 2*med + q3) / 4; run; David L. Cassell, Design Pathways
A Better Bootstrap proc univariate data=out2; var trimean; output out=final pctlpts=2.5, 97.5 pctlpre=ci; run; David L. Cassell, Design Pathways
A Better Bootstrap – More sasfile YourData load; proc surveyselect data=YourData out=outboot seed=30459584 method=urs samprate=1 outhits rep=1000; run; sasfile YourData close; David L. Cassell, Design Pathways
A Better Bootstrap – More ods listing close; proc univariate data=outboot; var x; by Replicate; output out=out1 q1=q1 median=med q3=q3; run; ods listing; David L. Cassell, Design Pathways
A Better Bootstrap – ODS OUTPUT ods output Modes=modal; proc univariate data=outboot modes; var YourVariable; by Replicate; run; ods output close; David L. Cassell, Design Pathways
Case Resampling Simple bootstrap, as before Apply to: PROC REG, PROC LOGISTIC, …. The approach can be criticized on several grounds David L. Cassell, Design Pathways
Case Resampling data test; x=1; y=45; output; do x = 2 to 29; y = 3*x + 6*rannor(1234); output; end; x=30; y=45; output; run; David L. Cassell, Design Pathways
y 90 80 70 60 50 40 30 20 10 0 0 10 20 30 x Case Resampling David L. Cassell, Design Pathways
Case Resampling ods listing close; proc surveyselect data=temp1 out=boot1 seed=38474 method=urs samprate=1 outhits rep=1000; run; proc reg data=boot1 outest=est1(drop=_:); model y=x; by replicate; run; ods listing; David L. Cassell, Design Pathways
Case Resampling proc univariate data=est1; var x; output out=final pctlpts=2.5, 97.5 pctlpre=ci; run; David L. Cassell, Design Pathways
Case Resampling proc robustreg data=temp1 method=MM; model y=x; run; David L. Cassell, Design Pathways
Case Resampling PROC REG (1.74, 2.80) bootstrap (case resampling) (1.65, 2.90) PROC ROBUSTREG (2.39, 3.13) David L. Cassell, Design Pathways
Resampling residuals Fit the model Bootstrap sample for the residuals Add the randomly resampled e to Y-hat Fit the model for each of the B reps Compute bootstrap estimates David L. Cassell, Design Pathways
Resampling residuals 1 perform the regression, get Y-hat and e 2 split the data 3 copy the FIT data set repeatedly 4 URS sample of residuals for each replicate 5 merge residuals with records 6 fit the model on each replicate 7 compute bootstrap estimates David L. Cassell, Design Pathways
Resampling residuals proc reg data=test; model y=x; output out=out1 p=yhat r=res; run; David L. Cassell, Design Pathways
Resampling residuals data fit(keep=yhat x order) resid(keep=res); set out1; order+1; run; David L. Cassell, Design Pathways
Resampling residuals proc surveyselect data=fit out=outfit method=srs samprate=1 rep=1000; run; David L. Cassell, Design Pathways
Resampling residuals data outres2; do replicate = 1 to 1000; do order = 1 to numrecs; p = ceil( numrecs * ranuni(394747373) ); set resid nobs=numrecs point=p; output; end; end; stop; run; David L. Cassell, Design Pathways
Resampling residuals data prepped; merge outfit outres2; by replicate order; new_y = yhat + res; run; David L. Cassell, Design Pathways
Resampling residuals proc reg data=prepped outest=est1( drop=_: ); model new_y = x; by replicate; run; David L. Cassell, Design Pathways
Resampling residuals proc univariate data=est1; var x; output out=final pctlpts=2.5, 97.5 pctlpre=ci; run; David L. Cassell, Design Pathways
?The? Bootstrap? Simple bootstrap Residual resampling Parametric bootstrap Smooth bootstrap Wild bootstrap Double bootstrap Various ‘adjusted’ bootstraps . . . . . David L. Cassell, Design Pathways
The Jackknife Non-parametric N systematic samples of size N-1 Less general than the bootstrap Easier to apply to complex sampling schemes David L. Cassell, Design Pathways
The Jackknife data outb; do replicate = 1 to numrecs; do rec = 1 to numrecs; set test nobs=numrecs point=rec; if replicate ^= rec then output; end; end; stop; run; David L. Cassell, Design Pathways
The Jackknife ods listing close; proc univariate data=outb; var y; by replicate; output out=outall kurtosis=curt; run; ods listing; David L. Cassell, Design Pathways
The Jackknife proc univariate data=outall; var curt; output out=final mean=jmean std=jstd; run; David L. Cassell, Design Pathways
Randomization Tests Resampling plan Re-label the data points randomly Compare against original Random subset of full permutation test David L. Cassell, Design Pathways
Cross-Validation Another type of resampling plan K replicate samples Each sample uses (K-1)/K to model and 1/K for testing David L. Cassell, Design Pathways
Cross-Validation LOOCV – Leave-One-Out Cross-Validation K-fold Cross-Validation Random K-fold Cross-Validation David L. Cassell, Design Pathways
Random K-Fold Cross-Validation %let K=10; %let rate= %sysevalf( (&K-1) / &K ); proc surveyselect data=temp1 out=xv seed=495857 samprate=&RATE outall rep=&K ; run; data xv; set xv; if selected then new_y=y; run; David L. Cassell, Design Pathways
Random K-Fold Cross-Validation proc reg data=xv; model new_y=x; by replicate; output out=out1(where=(new_y=.)) p=yhat; run; David L. Cassell, Design Pathways
Random K-Fold Cross-Validation data out2; set out1; d=y-yhat; absd=abs(d); run; proc summary data=out2; var d absd; output out=out3 std(d)=rmse mean(absd)=mae; run; David L. Cassell, Design Pathways
Monte Carlo Simulations Sample from theoretical distributions Sample from population of data points David L. Cassell, Design Pathways
Simulations proc surveyselect data=largefile out=process_set seed=45884743 method=srs sampsize=1000; run; data processor; array{5,5} a1-a25; set process_set; . . . . . run; David L. Cassell, Design Pathways
Simulations proc plan seed=4958584; factors replicate=100 ordered SiteNo = 30 of 200 / noprint; output out=plan9; run; David L. Cassell, Design Pathways
CONCLUSIONS Cassell’s “7 Habits of Highly Effective SAS-ers” • KNOW YOUR PROBLEM • USE THE RIGHT TOOL • FEWER STEPS GET YOU FARTHER • STAY TALL AND THIN • TOO MUCH OF A GOOD THING IS BAD • SKIP THE EXPENSIVE STUFF • SHARPEN THE SAW David L. Cassell, Design Pathways
CONCLUSIONS SAS is great at resampling and simulations. You just have to code it in SAS instead of something else! Don’t run 5003 steps when 3 steps will do it. Don’t assume everything is a macro problem. David L. Cassell, Design Pathways
CONCLUSIONS Resampling methods and simulations do not solve all your problems. Use your brain before you use your keyboard. David L. Cassell, Design Pathways