
Case Selection and Resampling

Learn how individual cases influence a fitted model, how regression diagnostics detect outliers, leverage, and influence, and how resampling procedures such as the bootstrap, jackknife, and cross-validation work. Explore methods for detecting influential cases and for evaluating predictive performance through case removal.

slancaster

Presentation Transcript


  1. Case Selection and Resampling Lucila Ohno-Machado HST951

  2. Topics • Case selection (influence detection) • Regression diagnostics • Sampling procedures • Bootstrap • Jackknife • Cross-validation

  3. Unusual Data • Outlier (discrepancy, unusual observation that may change parameters) • Leverage (far from mean or centroid of other observations, unusual combinations of independent variable values X that may change parameters) • Influence = discrepancy × leverage

  4. Detecting Outliers: Residuals • Measure of error • Studentized residuals can be calculated by removing one observation at a time • Note: High-leverage observations may have small residuals

  5. Assessing Leverage • Hat values measure the distance of an observation to the means (or centroid) of all observations • Dependent variables are not involved in determining leverage
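
The leverage idea can be sketched for the simplest case, a regression with one predictor and an intercept, where the hat value has the closed form h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2. The function name `hat_values` and the data are hypothetical; note that only X enters the calculation.

```python
from statistics import mean

def hat_values(x):
    """Leverage h_i for a simple linear regression with an intercept."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Each hat value grows with the squared distance of x_i from the mean of x.
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical predictor values; the last one lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = hat_values(x)
print([round(hi, 3) for hi in h])
```

The hat values sum to the number of fitted coefficients (here 2), so a value close to 1 marks a case that almost determines its own fitted value.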

  6. Measuring Influence • Impact on coefficient of deleting an observation • DFBETA • Cook’s D • DFFITS • Impact on standard error • COVRATIO
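
Two of these measures can be sketched by brute-force deletion for a one-predictor regression: the unscaled DFBETA for the slope (full-sample slope minus the slope without case i) and Cook's D (squared shift in all fitted values, divided by p times the residual variance). The helper names and the data are hypothetical, and no scaling by standard errors is attempted.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def deletion_diagnostics(x, y):
    """Return a list of (dfbeta_slope, cooks_d), one pair per case."""
    n, p = len(x), 2                      # p = number of coefficients
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    s2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    out = []
    for i in range(n):
        # Refit without case i.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        dfbeta_slope = b1 - a1
        shift = sum((fi - (a0 + a1 * xj)) ** 2 for fi, xj in zip(fitted, x))
        out.append((dfbeta_slope, shift / (p * s2)))
    return out

# Hypothetical data: the last case has high leverage and is off the trend.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 4.0]
results = deletion_diagnostics(x, y)
for i, (dfb, d) in enumerate(results):
    print(i, round(dfb, 3), round(d, 3))
```

In this toy data the last case dominates both measures, which is the "discrepancy × leverage" pattern described earlier.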

  7. Case selection • Not all cases are created equal • Some influential cases are good • Some are bad • “Outliers” • Some non-influential cases are redundant • It would be nice to keep a “minimal” set of good cases in training sets for fast on-line training

  8. Classical Diagnostics • Unicase selection is determined by removing one observation and inspecting results • Unicase influence on • Estimated parameters (coefficients) • Fitted value (Y-hat) • Residuals (error)

  9. When outcomes are binary • Residuals may not reflect discriminatory performance, but rather calibration • Remember that a model with good discriminatory performance may be recalibrated • Same rationale for coefficients

  10. Influence • Definition of influence is not fixed • If the main reason for building models is prediction • Then evaluating model performance given different subsets of original sample might point to good, redundant, and bad cases

  11. Qualifying a case • Bad cases, when removed, should result in models with better predictions • Redundant cases, when removed, should not affect predictions • Good cases, if removed, would result in models with worse predictions

  12. Defining prediction performance • Use, for example, areas under ROC curves (or mean square error or cross entropy error) • For each set of samples: • Evaluate performance on training and holdout sets • Determine which cases to remove • Determine performance on test or validation sets
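
A minimal sketch of qualifying cases by their effect on holdout performance. It swaps in a one-predictor linear model and mean square error in place of AUC, and it assumes a hypothetical tolerance `tol` for deciding that a change is negligible; the names and data are illustrative only.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def holdout_mse(train_x, train_y, hold_x, hold_y):
    b0, b1 = fit_line(train_x, train_y)
    return mean((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(hold_x, hold_y))

def qualify_cases(train_x, train_y, hold_x, hold_y, tol=0.05):
    """Label each training case by what its removal does to holdout MSE."""
    base = holdout_mse(train_x, train_y, hold_x, hold_y)
    labels = []
    for i in range(len(train_x)):
        without = holdout_mse(train_x[:i] + train_x[i + 1:],
                              train_y[:i] + train_y[i + 1:], hold_x, hold_y)
        if without < base - tol:
            labels.append("bad")        # removal improves predictions
        elif without > base + tol:
            labels.append("good")       # removal hurts predictions
        else:
            labels.append("redundant")  # removal changes little
    return labels

# Hypothetical data: the true relation is y = 2x; the last training case is corrupted.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0, 30.0]
hold_x = [1.5, 2.5, 3.5, 4.5]
hold_y = [3.0, 5.0, 7.0, 9.0]
labels = qualify_cases(train_x, train_y, hold_x, hold_y)
print(labels)
```

The labels of the uncorrupted cases depend on the tolerance and on the presence of the corrupted case, which is one reason to confirm any removal on a separate test or validation set.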

  13. Sequential Multicase Selection • Sequential procedure • remove most influential case • remove second-most influential case (conditioned on the first) • and so on… $\sum_i C(n, m)$, for all $i = 1$ to $m$, where $C(\cdot)$ represents the number of subsets of size $m$ that can be built from $n$ cases. • Problem: cases are not considered en bloc
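
The sequential procedure can be sketched as a greedy loop: at each step, drop the one remaining case whose removal most improves holdout error, conditioned on the cases already dropped. This is an illustration with a one-predictor linear model and mean square error, not the author's implementation; `sequential_removal` and the data are hypothetical.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least squares on a list of (x, y) pairs; returns (b0, b1)."""
    x_bar = mean(p[0] for p in points)
    y_bar = mean(p[1] for p in points)
    sxx = sum((px - x_bar) ** 2 for px, _ in points)
    b1 = sum((px - x_bar) * (py - y_bar) for px, py in points) / sxx
    return y_bar - b1 * x_bar, b1

def holdout_mse(train, holdout):
    b0, b1 = fit_line(train)
    return mean((py - (b0 + b1 * px)) ** 2 for px, py in holdout)

def sequential_removal(train, holdout, max_removed):
    kept, removed = list(train), []
    best = holdout_mse(kept, holdout)
    for _ in range(max_removed):
        # Score every single-case deletion from the current kept set.
        score, i = min((holdout_mse(kept[:j] + kept[j + 1:], holdout), j)
                       for j in range(len(kept)))
        if score >= best:
            break                       # no single deletion helps any more
        removed.append(kept.pop(i))
        best = score
    return removed, best

# Hypothetical data: the true relation is y = 2x; one training case is corrupted.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0), (6.0, 30.0)]
holdout = [(1.5, 3.0), (2.5, 5.0), (3.5, 7.0), (4.5, 9.0)]
removed, best = sequential_removal(train, holdout, max_removed=2)
print(removed, best)
```

Because each step looks at one case at a time, a pair of cases that only matters jointly can be missed, which is the "not considered en bloc" problem.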

  14. Alternatives • Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search) • Analogous to variable selection

  15. Genetic Algorithm • Given a training set $C$, and a selection of cases $v$, we construct a logistic regression model $l_C(v)$. We evaluate the model using the AUC, and represent this evaluation as $a(l_C(v))$. For a total number of cases $n$, and $m$ cases in selection $v$, we use the following fitness function: • $f(v, C) = a(l_C(v)) + r\,(n - m)/n$.
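
A sketch of the fitness computation only, not of the genetic algorithm itself. It assumes that r acts as a weight on the fraction of cases left out of the selection (the slide does not define r), and it computes the AUC directly from model scores rather than fitting a logistic regression; the function names and numbers are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve: the share of (positive, negative) pairs
    in which the positive case gets the higher score (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fitness(auc_value, n, m, r):
    """f(v, C) = a(l_C(v)) + r * (n - m) / n for a selection of m out of n cases."""
    return auc_value + r * (n - m) / n

# Hypothetical scores from a model built on some selection of cases.
a = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(a, fitness(a, n=10, m=4, r=0.1), fitness(a, n=10, m=8, r=0.1))
```

With equal AUC, the smaller selection gets the higher fitness, so under this reading the second term rewards compact training sets.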

  16. Resampling

  17. Bootstrap Motivation • Sometimes it is not possible to collect many samples from a population • Sometimes it is not correct to assume a certain distribution for the population • Goal: Assess sampling variation

  18. Bootstrap • Efron (Stanford biostats) late 1970s • “Pulling oneself up by one’s bootstraps” • Nonparametric approach to statistical inference • Uses computation instead of traditional distributional assumptions and asymptotic results • Can be used to derive standard errors and confidence intervals, and to test hypotheses

  19. Example • Adapted from Fox (1997) “Applied Regression Analysis” • Goal: Estimate mean difference between Male and Female finding X • Four pairs of observations are available; the four within-pair differences Y are 6, -3, 5, and 3

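  20. Mean Difference • Sample mean is $\bar{Y} = (6 - 3 + 5 + 3)/4 = 2.75$ • If Y were normally distributed with known $\sigma$, the 95% CI would be $\mu = \bar{Y} \pm 1.96\,\sigma/\sqrt{n}$ • But we do not know $\sigma$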

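  21. Estimates • Estimate of $\sigma$ is $S = \sqrt{\sum (Y_i - \bar{Y})^2 / (n - 1)} = 4.031$ • Estimate of standard error is $SE(\bar{Y}) = S/\sqrt{n} = 4.031/\sqrt{4} \approx 2.015$ • Assuming the population is normally distributed, we can use the t-distribution with $n - 1$ degrees of freedom: $\mu = \bar{Y} \pm t_{n-1,\,.025}\, S/\sqrt{n}$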
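
The two estimates can be checked numerically from the four differences given in the example (6, -3, 5, 3). This is a small verification sketch using only the standard library.

```python
from math import sqrt
from statistics import mean, stdev

diffs = [6, -3, 5, 3]          # the four within-pair differences
n = len(diffs)

y_bar = mean(diffs)            # sample mean
s = stdev(diffs)               # sample standard deviation, divisor n - 1
se = s / sqrt(n)               # estimated standard error of the mean

print(y_bar, s, se)
```

The standard deviation comes out near 4.03 and the standard error near 2.02, in line with the values used for the confidence interval.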

  22. Confidence Interval • With $n - 1 = 3$ degrees of freedom, $t_{3,\,.025} = 3.18$ • $\mu = 2.75 \pm 3.18\,(2.015) = 2.75 \pm 6.41$ • $-3.66 < \mu < 9.16$ • HUGE!!!

  23. Sample mean and variance • Use the distribution of Y* in the sample to estimate the distribution of Y in the population

        y*    p*(y*)
         6    .25
        -3    .25
         5    .25
         3    .25

      $E^*(Y^*) = \sum y^* p^*(y^*) = 2.75$
      $V^*(Y^*) = \sum [y^* - E^*(Y^*)]^2 p^*(y^*) = 12.187$

  24. Sample with Replacement

  25. Calculating the CI • The mean of the 256 (= 4^4) bootstrap means is 2.75, but the SE is $SE^*(\bar{Y}^*) = \sqrt{V^*(Y^*)/n} = \sqrt{12.187/4} \approx 1.75$ (no hat, since this SE is not estimated but known)
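
Because the sample has only four values, all 4^4 = 256 equally likely bootstrap samples can be enumerated rather than drawn at random. The sketch below does that for the differences 6, -3, 5, 3 and computes the mean and standard deviation of the 256 bootstrap means.

```python
from itertools import product
from math import sqrt

diffs = [6, -3, 5, 3]
n = len(diffs)

# Every ordered sample of size n drawn with replacement from the data.
boot_means = [sum(sample) / n for sample in product(diffs, repeat=n)]

grand_mean = sum(boot_means) / len(boot_means)
# Divisor is the number of bootstrap samples: this is a known quantity,
# not an estimate, so no n - 1 correction is applied.
se_star = sqrt(sum((m - grand_mean) ** 2 for m in boot_means) / len(boot_means))

print(len(boot_means), grand_mean, se_star)
```

The bootstrap standard error is smaller than the usual estimate S/sqrt(n) by the factor sqrt((n - 1)/n), because the variance of Y* divides by n instead of n - 1.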

  26. So what? • We already knew that! • But with bootstrap • Confidence intervals can be more accurate • Can be used for non-linear statistics without known standard error formulas

  27. The population is to the sample as the sample is to the bootstrap samples In practice (as opposed to previous example), not all bootstrap samples are selected

  28. Procedure • 1. Specify data-collection scheme that results in observed sample: Collect(population) -> sample • 2. Use sample as if it were population (with replacement): Collect(sample) -> bootstrap sample 1, bootstrap sample 2, etc.

  29. Cont. • 3. For each bootstrap sample, calculate the estimate you are looking for • 4. Use the distribution of the bootstrap estimates to estimate the sampling properties of the estimate (e.g., its standard error or a confidence interval)
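
The four steps can be sketched as a small generic routine that draws random bootstrap samples instead of enumerating them. The function name `bootstrap`, the number of replicates, and the fixed seed are assumptions for illustration; the data are the four differences from the earlier example.

```python
import random
from statistics import mean, stdev

def bootstrap(sample, statistic, n_replicates=2000, seed=0):
    """Return the statistic computed on n_replicates bootstrap samples."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    n = len(sample)
    # Step 2: resample n cases with replacement; step 3: compute the estimate.
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_replicates)]

diffs = [6, -3, 5, 3]
estimates = bootstrap(diffs, mean)

# Step 4: the spread of the bootstrap estimates approximates the standard error.
boot_se = stdev(estimates)
print(len(estimates), boot_se)
```

Passing another function, such as `statistics.median`, in place of `mean` gives a bootstrap standard error for a statistic that has no simple standard error formula.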

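  30. Bootstrap Confidence Intervals • Normal theory intervals • Percentile intervals • Example: a 95% percentile interval is read off the ordered bootstrap estimates • Lower bound = the estimate at position 0.025 × (number of bootstrap replicates) • Upper bound = the estimate at position 0.975 × (number of bootstrap replicates) • There are corrections for bootstrap intervals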
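
A minimal sketch of the percentile rule: sort the bootstrap estimates and pick the values at the 0.025 and 0.975 positions. The function name is hypothetical, positions are rounded to the nearest whole rank, and none of the corrections mentioned above are applied.

```python
def percentile_interval(estimates, level=0.95):
    """Percentile bootstrap interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    r = len(ordered)
    alpha = (1 - level) / 2
    # Ranks are 1-based in the description, list indices are 0-based.
    lower_rank = max(1, round(alpha * r))
    upper_rank = min(r, round((1 - alpha) * r))
    return ordered[lower_rank - 1], ordered[upper_rank - 1]

# Stand-in for 1000 bootstrap estimates, chosen so the ranks are easy to read.
fake_estimates = list(range(1, 1001))
lower, upper = percentile_interval(fake_estimates)
print(lower, upper)   # the 25th and 975th ordered values
```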

  31. Bootstrapping Linear Regression • Observed estimate is usually the coefficient(s) • There are (at least) 2 ways of doing this • Resample observations (usual) and re-regress (X will vary) • Resample residuals (X are fixed, $Y^* = \hat{Y} + E^*$ is the new dependent variable, re-regress with X fixed) • Assumes errors are identically distributed • High-leverage outlier impact may be lost
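
Both schemes can be sketched for a one-predictor regression. In the residual version the new response is the fitted value plus a resampled residual. The helper names, the data, the seed, and the number of replicates are all hypothetical.

```python
import random
from statistics import mean, stdev

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

def case_bootstrap(x, y, reps, rng):
    """Resample (x, y) pairs with replacement and refit; X varies."""
    n, slopes = len(x), []
    while len(slopes) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue                    # slope undefined if all x are equal
        slopes.append(fit_line(xs, [y[i] for i in idx])[1])
    return slopes

def residual_bootstrap(x, y, reps, rng):
    """Keep X fixed; add resampled residuals to the fitted values and refit."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return [fit_line(x, [fi + rng.choice(resid) for fi in fitted])[1]
            for _ in range(reps)]

# Hypothetical data scattered around y = 1 + 2x.
x = [float(i) for i in range(1, 11)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
y = [1 + 2 * xi + e for xi, e in zip(x, noise)]

rng = random.Random(1)
observed_slope = fit_line(x, y)[1]
case_slopes = case_bootstrap(x, y, 500, rng)
resid_slopes = residual_bootstrap(x, y, 500, rng)
print(observed_slope, stdev(case_slopes), stdev(resid_slopes))
```

The residual scheme detaches each residual from its own X value, which is why the effect of a high-leverage outlier can be lost there.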

  32. Bootstrap for other methods • Used in other classification methods (neural networks, classification trees, etc.) • Usually useful when sample size is small and no distribution assumptions can be made • Same principles apply

  33. Other resampling methods • Jackknife (take one out) is a special case of bootstrap • Resamples without one case and without replacement (samples have size n-1) • Cross-validation • Divides data into training and test • Generally used to estimate confidence intervals on predictions for “full” model (i.e., model that utilized all cases)
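
A short sketch of both methods: a leave-one-out jackknife standard error for a statistic, and a generator of k-fold train/test index splits. The function names are hypothetical, the fold assignment is a simple interleaving rather than a random shuffle, and the data are the four differences from the bootstrap example.

```python
from math import sqrt
from statistics import mean

def jackknife_se(sample, statistic):
    """Jackknife standard error: recompute the statistic on each
    sample of size n - 1 obtained by leaving one case out."""
    n = len(sample)
    loo = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = mean(loo)
    return sqrt((n - 1) / n * sum((t - center) ** 2 for t in loo))

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test_idx in folds:
        train_idx = [j for j in range(n) if j not in test_idx]
        yield train_idx, test_idx

diffs = [6, -3, 5, 3]
se_jack = jackknife_se(diffs, mean)
splits = list(k_fold_indices(6, 3))
print(se_jack, splits)
```

For the sample mean, the jackknife standard error coincides with the usual S/sqrt(n); its value lies in statistics without such a formula.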
