Learn about case selection and influence, regression diagnostics, and resampling procedures such as the bootstrap and jackknife. Detect and handle outliers, leverage, and influence in data analysis. Explore methods for detecting influential cases and for evaluating model predictions through case removal and prediction assessment.
Case Selection and Resampling Lucila Ohno-Machado HST951
Topics • Case selection (influence detection) • Regression diagnostics • Sampling procedures • Bootstrap • Jackknife • Cross-validation
Unusual Data • Outlier (discrepancy, unusual observation that may change parameters) • Leverage (far from mean or centroid of other observations, unusual combinations of independent variable values X that may change parameters) • Influence = discrepancy x leverage
Detecting Outliers: Residuals • Measure of error • Studentized residuals can be calculated by removing one observation at a time • Note: High-leverage observations may have small residuals
Assessing Leverage • Hat values measure the distance of an observation to the means (or centroid) of all observations • Dependent variables are not involved in determining leverage
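As an illustration of the slide above, here is a minimal sketch of hat values for the special case of a single predictor, where the standard formula is h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The function name `hat_values` and the data are hypothetical. Note that only X enters the calculation, never Y.

```python
def hat_values(x):
    """Leverage h_i for a regression with an intercept and one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    # Each hat value grows with the squared distance of x_i from the mean of X.
    return [1 / n + (xi - mean_x) ** 2 / ss_x for xi in x]

# Hypothetical predictor values; the last one lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
h = hat_values(x)
print([round(hi, 3) for hi in h])
```

The hat values sum to the number of fitted coefficients (here 2), so a value close to 1 marks a case that nearly determines its own fitted value.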
Measuring Influence • Impact on coefficient of deleting an observation • DFBETA • Cook’s D • DFFITS • Impact on standard error • COVRATIO
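The idea behind deletion-based measures can be sketched by refitting the model without each case in turn. The code below computes an unscaled DFBETA for the slope of a one-predictor regression (the usual DFBETAS additionally divides by a standard error, which is omitted here). The helper names and data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def dfbeta_slope(x, y):
    """Change in the slope caused by deleting each case (full fit minus deleted fit)."""
    _, b_full = fit_line(x, y)
    changes = []
    for i in range(len(x)):
        x_del = x[:i] + x[i + 1:]
        y_del = y[:i] + y[i + 1:]
        _, b_del = fit_line(x_del, y_del)
        changes.append(b_full - b_del)
    return changes

# Hypothetical data: the last case has high leverage and is off the trend.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 2.0]
d = dfbeta_slope(x, y)
print([round(di, 3) for di in d])
```

In this made-up data set, the last case combines discrepancy with leverage, so deleting it changes the slope far more than deleting any other case.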
Case selection • Not all cases are created equal • Some influential cases are good • Some are bad • “Outliers” • Some non-influential cases are redundant • It would be nice to keep a “minimal” set of good cases in training sets for fast on-line training
Classical Diagnostics • Unicase selection is determined by removing one observation and inspecting results • Unicase influence on • Estimated parameters (coefficients) • Fitted value (Y-hat) • Residuals (error)
When outcomes are binary • Residuals may not reflect discriminatory performance, but rather calibration • Remember that a model with good discriminatory performance may be recalibrated • Same rationale for coefficients
Influence • Definition of influence is not fixed • If the main reason for building models is prediction • Then evaluating model performance given different subsets of original sample might point to good, redundant, and bad cases
Qualifying a case • Bad cases, when removed, should result in models with better predictions • Redundant cases, when removed, should not affect predictions • Good cases, if removed, would result in models with worse predictions
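A minimal sketch of this classification, assuming a one-predictor linear model, mean square error on a holdout set as the performance measure, and a hypothetical tolerance `tol` for deciding when a change counts as "no effect". None of these choices come from the slides; they only make the three definitions concrete.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def holdout_mse(train_x, train_y, hold_x, hold_y):
    """Fit on the training cases, return mean square error on the holdout cases."""
    a, b = fit_line(train_x, train_y)
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(hold_x, hold_y)) / len(hold_x)

def qualify_cases(train_x, train_y, hold_x, hold_y, tol=0.01):
    """Label each training case by how holdout error changes when it is removed."""
    base = holdout_mse(train_x, train_y, hold_x, hold_y)
    labels = []
    for i in range(len(train_x)):
        xs = train_x[:i] + train_x[i + 1:]
        ys = train_y[:i] + train_y[i + 1:]
        change = holdout_mse(xs, ys, hold_x, hold_y) - base
        if change < -tol:
            labels.append("bad")        # removal improves predictions
        elif change > tol:
            labels.append("good")       # removal worsens predictions
        else:
            labels.append("redundant")  # removal leaves predictions unchanged
    return labels

# Hypothetical data: y = x except for the last training case.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
hold_x = [1.5, 2.5, 3.5, 4.5]
hold_y = [1.5, 2.5, 3.5, 4.5]
labels = qualify_cases(train_x, train_y, hold_x, hold_y)
print(labels)
```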
Defining prediction performance • Use, for example, areas under ROC curves (or mean square error or cross entropy error) • For each set of samples: • Evaluate performance on training and holdout sets • Determine which cases to remove • Determine performance on test or validation sets
Sequential Multicase Selection • Sequential procedure • remove most influential case • remove second-most influential case (conditioned on the first) • and so on… Σi C(n,m), for all i = 1 to m, where C(.) represents the number of subsets of size m that can be built from n cases. • Problem: cases are not considered en bloc
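One way to see why a sequential procedure is attractive despite not considering cases en bloc is to compare search sizes. The sketch below assumes that the sequential procedure refits once per remaining case at each step, and that an exhaustive search would examine every subset of size 1 up to m. Both counting rules are an interpretation, not something the slide states explicitly.

```python
from math import comb

def sequential_refits(n, m):
    """Refits needed to remove m cases one at a time from n cases."""
    return sum(n - i for i in range(m))

def exhaustive_subsets(n, m):
    """Number of candidate subsets of size 1..m drawn from n cases."""
    return sum(comb(n, i) for i in range(1, m + 1))

# Hypothetical problem size: 20 training cases, up to 3 removed.
print(sequential_refits(20, 3), exhaustive_subsets(20, 3))
```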
Alternatives • Multicase selection that is not sequential, yet not exhaustive (e.g., genetic algorithm search) • Analogous to variable selection
Genetic Algorithm • Given a training set C, and a selection of cases v, we construct a logistic regression model l_C(v). We evaluate the model using the AUC, and represent this evaluation as a(l_C(v)). For a total number of cases n, and m cases in selection v, we use the following fitness function: • f(v, C) = a(l_C(v)) + r (n − m)/n
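The fitness function can be sketched as follows. Fitting the logistic regression model is omitted: the scores below stand in for its predicted probabilities and are hypothetical, as are the function names and the value chosen for r, which the slide does not define. The AUC is computed as the fraction of positive/negative pairs that the scores rank correctly, with ties counted as one half.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def fitness(auc_value, n, m, r):
    """f(v, C) = a(l_C(v)) + r (n - m)/n for a selection of m out of n cases."""
    return auc_value + r * (n - m) / n

# Hypothetical model outputs and true binary outcomes.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
a = auc(scores, labels)

# Hypothetical setting: 10 cases in total, 4 selected, r = 0.1 (assumed value).
f = fitness(a, n=10, m=4, r=0.1)
print(a, round(f, 3))
```

Since (n − m)/n grows as m shrinks, the second term favors smaller selections when r is positive.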
Bootstrap Motivation • Sometimes it is not possible to collect many samples from a population • Sometimes it is not correct to assume a certain distribution for the population • Goal: Assess sampling variation
Bootstrap • Efron (Stanford biostats) late 70’s • “Pulling oneself up by one’s bootstraps” • Nonparametric approach to statistical inference • Uses computation instead of traditional distributional assumptions and asymptotic results • Can be used to derive standard errors, confidence intervals, and test hypotheses
Example • Adapted from Fox (1997) “Applied Regression Analysis” • Goal: Estimate mean difference between Male and Female finding X • Four pairs of observations are available, with differences Y = 6, −3, 5, 3
Mean Difference • Sample mean is (6 − 3 + 5 + 3)/4 = 2.75 • If Y were normally distributed, the 95% CI would be μ = Ȳ ± 1.96 σ/√n • But we do not know σ
Estimates • Estimate of σ is S = √[Σ(Yi − Ȳ)²/(n − 1)] = 4.031 • Estimate of standard error is SE = S/√n = 2.015 • Assuming population is normally distributed, we can use the t-distribution, as μ = Ȳ ± t(n−1, .025) × SE
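The estimates on this slide can be checked from the four differences 6, −3, 5, 3 given in the example. This is only a verification sketch using the Python standard library; `stdev` uses the n − 1 denominator.

```python
from math import sqrt
from statistics import mean, stdev

y = [6, -3, 5, 3]          # the four observed differences

y_bar = mean(y)            # sample mean
s = stdev(y)               # sample standard deviation (n - 1 denominator)
se = s / sqrt(len(y))      # estimated standard error of the mean

print(y_bar, s, se)
```

The printed values are 2.75 for the mean, and roughly 4.03 and 2.02 for the standard deviation and the standard error.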
Confidence Interval • With n − 1 = 3 degrees of freedom, t = 3.18 • μ = 2.75 ± 3.18 (2.015) = 2.75 ± 6.41 • −3.66 < μ < 9.16 • HUGE!!!
Sample mean and variance • Use distribution Y* of sample to estimate distribution Y in population
y*    p*(y*)
 6    .25
−3    .25
 5    .25
 3    .25
• E*(Y*) = Σ y* p*(y*) = 2.75 • V*(Y*) = Σ [y* − E*(Y*)]² p*(y*) = 12.1875
Calculating the CI • Mean of 256 bootstrap means is 2.75, but SE is SE*(Ȳ*) = √[Σ(Ȳ*b − Ȳ)²/256] = 1.745, summing over all 256 bootstrap samples b (no hat since SE is not estimated, but known)
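With only four observations, all 4⁴ = 256 bootstrap samples can be enumerated, which makes the numbers on this slide easy to check. The sketch below does exactly that with the standard library. `pstdev` divides by the number of values (256), matching a standard error that is known over the full set of bootstrap samples.

```python
from itertools import product
from statistics import mean, pstdev

y = [6, -3, 5, 3]          # the four observed differences

# Every possible sample of size 4 drawn from y with replacement.
boot_samples = list(product(y, repeat=len(y)))
boot_means = [mean(sample) for sample in boot_samples]

n_samples = len(boot_means)
mean_of_means = mean(boot_means)
boot_se = pstdev(boot_means)   # population form: divides by 256, not 255

print(n_samples, mean_of_means, boot_se)
```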
So what? • We already knew that! • But with bootstrap • Confidence intervals can be more accurate • Can be used for non-linear statistics without known standard error formulas
The population is to the sample as the sample is to the bootstrap samples. In practice (as opposed to the previous example), not all bootstrap samples are selected.
Procedure • 1. Specify the data-collection scheme that results in the observed sample: Collect(population) -> sample • 2. Use the sample as if it were the population, sampling with replacement: Collect(sample) -> bootstrap sample 1, bootstrap sample 2, etc.
Cont. • 3. For each bootstrap sample, calculate the estimate you are looking for • 4. Use the distribution of the bootstrap estimates to estimate the properties of the sampling distribution of the estimate (e.g., its standard error)
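The four steps can be sketched in a few lines. The function name `bootstrap_estimates`, the number of replicates, and the seed are assumptions made for illustration. The statistic used here is the median, chosen as an example of an estimate without a simple standard error formula.

```python
import random
from statistics import median, stdev

def bootstrap_estimates(sample, statistic, n_boot=1000, seed=0):
    """Draw n_boot samples with replacement and apply the statistic to each."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_boot)]

sample = [6, -3, 5, 3]                 # the observed sample plays the role of the population
estimates = bootstrap_estimates(sample, median)

# The spread of the bootstrap estimates approximates the standard error of the statistic.
boot_se = stdev(estimates)
print(round(boot_se, 3))
```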
Bootstrap Confidence Intervals • Normal Theory • Percentile Intervals Example • 95% percentile interval is calculated by sorting the bootstrap estimates and taking • Lower = the estimate at rank 0.025 × number of bootstrap replicates • Upper = the estimate at rank 0.975 × number of bootstrap replicates • There are corrections for bootstrap intervals
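A minimal sketch of the percentile interval, assuming the bootstrap estimates have already been computed. The function name and the rounding of ranks to the nearest integer are illustrative choices. The corrections mentioned on the slide are not implemented.

```python
def percentile_interval(estimates, level=0.95):
    """Uncorrected percentile interval from a list of bootstrap estimates."""
    ordered = sorted(estimates)
    b = len(ordered)
    alpha = 1 - level
    # Ranks are 1-based positions in the ordered list of replicates.
    lower_rank = max(round(alpha / 2 * b), 1)
    upper_rank = min(round((1 - alpha / 2) * b), b)
    return ordered[lower_rank - 1], ordered[upper_rank - 1]

# Stand-in for 1000 bootstrap estimates, so that ranks and values coincide.
estimates = list(range(1, 1001))
lower, upper = percentile_interval(estimates)
print(lower, upper)
```

With 1000 replicates, the bounds are the 25th and 975th ordered estimates.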
Bootstrapping Linear Regression • Observed estimate is usually the coefficient(s) • (at least) 2 ways of doing this • Resample observations (usual) and re-regress (X will vary) • Resample residuals (X are fixed, Y* = Ŷ + E* is the new dependent variable, re-regress with X fixed) • Assumes errors are identically distributed • High-leverage outlier impact may be lost
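Both approaches can be sketched for a one-predictor regression. The function names, data, number of replicates, and seed are hypothetical. In the case-resampling version, a resample in which all X values coincide is redrawn, because the slope is undefined there.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def boot_cases(x, y, n_boot=500, seed=0):
    """Resample (x, y) pairs with replacement and re-regress; X varies."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = rng.choices(range(n), k=n)
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:           # degenerate resample, draw again
            continue
        ys = [y[i] for i in idx]
        slopes.append(fit_line(xs, ys)[1])
    return slopes

def boot_residuals(x, y, n_boot=500, seed=0):
    """Keep X fixed, build Y* = fitted + resampled residual, and re-regress."""
    rng = random.Random(seed)
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(n_boot):
        e_star = rng.choices(resid, k=len(x))
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        slopes.append(fit_line(x, y_star)[1])
    return slopes

# Hypothetical data lying close to y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
slopes_cases = boot_cases(x, y)
slopes_resid = boot_residuals(x, y)
print(sum(slopes_cases) / len(slopes_cases), sum(slopes_resid) / len(slopes_resid))
```

Residual resampling attaches any residual to any X value, which is why it relies on identically distributed errors and can dilute the effect of a high-leverage outlier.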
Bootstrap for other methods • Used in other classification methods (neural networks, classification trees, etc.) • Usually useful when sample size is small and no distribution assumptions can be made • Same principles apply
Other resampling methods • Jackknife (take one out) is a special case of bootstrap • Resamples without one case and without replacement (samples have size n-1) • Cross-validation • Divides data into training and test • Generally used to estimate confidence intervals on predictions for “full” model (i.e., model that utilized all cases)
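A minimal jackknife sketch for the four differences used in the bootstrap example. The function name is hypothetical. The standard error uses the usual jackknife formula √[(n − 1)/n × Σ(θ(−i) − θ̄)²], where θ(−i) is the estimate computed without case i and θ̄ is the mean of those estimates.

```python
from math import sqrt
from statistics import mean

def jackknife_se(sample, statistic):
    """Jackknife standard error: recompute the statistic leaving out one case at a time."""
    n = len(sample)
    leave_one_out = [statistic(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = mean(leave_one_out)
    return sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))

y = [6, -3, 5, 3]
se_jack = jackknife_se(y, mean)
print(round(se_jack, 4))
```

For the sample mean, the jackknife standard error coincides with the classical estimate S/√n.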