A comparison of K-fold and leave-one-out cross-validation of empirical keys Alan D. Mead, IIT mead@iit.edu
What is “Keying”? • Many selection tests do not have demonstrably correct answers • Biodata, SJT, some simulations, etc. • Keying is the construction of a valid scoring key • What the “best” people answered is probably “correct” • Most approaches use a correlation, or something similar
Correlation approach • Create 1-0 indicator variables for each response • Correlate indicators with a criterion (e.g., job performance) • If r > .01, key = 1 • If r < -.01, key = -1 • Else, key = 0 • Little loss by using 1,0,-1 key
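The correlation approach above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names (`pearson`, `build_key`, `score`), the data layout (`responses[p][i]` is the 0-based option person `p` chose on item `i`), and the toy data are all assumptions made for the example. The ±.01 threshold is the one given on the slide.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def build_key(responses, criterion, n_options, threshold=0.01):
    """Return key[item][option] in {1, 0, -1} (assumed layout, see lead-in)."""
    key = []
    for i in range(len(responses[0])):
        item_key = []
        for option in range(n_options):
            # 1-0 indicator: did this person choose this option on item i?
            indicator = [1 if person[i] == option else 0 for person in responses]
            r = pearson(indicator, criterion)
            item_key.append(1 if r > threshold else -1 if r < -threshold else 0)
        key.append(item_key)
    return key


def score(person, key):
    """Sum the key values of the options this person chose."""
    return sum(key[i][choice] for i, choice in enumerate(person))


# Toy data (made up): 4 people, 2 items, 2 options; higher criterion = better.
responses = [[0, 1], [0, 0], [1, 1], [1, 0]]
criterion = [4, 3, 2, 1]
key = build_key(responses, criterion, n_options=2)
scores = [score(p, key) for p in responses]
print(key)     # [[1, -1], [-1, 1]]
print(scores)  # [2, 0, 0, -2]
```

In the toy data the better performers chose option 0 on the first item and option 1 on the second, so those options are keyed +1 and the alternatives -1.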
How valid is my key? • Now that I have a key, I want to compute a validity… • But I based my key on the responses of my “best” test-takers • Can/should I compute a validity in this sample? • No! Cureton (1967) showed that very high validities will result even for invalid keys • What shall I do?
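The point that an invalid key still yields a high same-sample validity can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of Cureton's demonstration: responses and criterion are generated independently (so the true validity is zero), and the sample size, test length, seed, and helper names are all chosen here for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def build_key(responses, criterion, n_options, threshold=0.01):
    """Correlation keying: key[item][option] in {1, 0, -1}."""
    key = []
    for i in range(len(responses[0])):
        row = []
        for option in range(n_options):
            indicator = [1 if p[i] == option else 0 for p in responses]
            r = pearson(indicator, criterion)
            row.append(1 if r > threshold else -1 if r < -threshold else 0)
        key.append(row)
    return key


def score(person, key):
    return sum(key[i][choice] for i, choice in enumerate(person))


def random_sample(rng, n, n_items, n_options):
    """Random responses and an unrelated criterion (true validity = 0)."""
    responses = [[rng.randrange(n_options) for _ in range(n_items)] for _ in range(n)]
    criterion = [rng.gauss(0, 1) for _ in range(n)]
    return responses, criterion


rng = random.Random(0)  # arbitrary seed
responses, criterion = random_sample(rng, n=50, n_items=35, n_options=4)
key = build_key(responses, criterion, n_options=4)

# Validity computed in the same sample that produced the key.
same_sample = pearson([score(p, key) for p in responses], criterion)

# Validity of the same key in a fresh, independent sample.
new_responses, new_criterion = random_sample(rng, n=50, n_items=35, n_options=4)
fresh = pearson([score(p, key) for p in new_responses], new_criterion)

print(f"same-sample validity: {same_sample:.2f}")
print(f"fresh-sample validity: {fresh:.2f}")
```

Because the key is built from chance correlations, the same-sample figure should come out large even though nothing real is being measured, while the fresh-sample figure should hover near zero.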
Validation Approaches • Charge ahead! • “Sure, .60 is an over-estimate; there will be shrinkage. But even half would still be substantial” • Split my sample into “calibration” and “cross-validation” samples • Fine if you have a large N… • Resample
LOOCV procedure • Leave-one-out cross-validation (LOOCV) resembles Tukey’s jackknife resampling procedure • Hold out person 1 • Compute a key on remaining N-1 • Score the held-out person • Repeat with person 2, 3, 4, … • Produces N scores that do not capitalize on chance • Correlate the N scores with the criterion • (But use the total-sample key for actual scoring)
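The LOOCV steps above can be sketched as follows. This is an illustrative implementation under assumptions, not the code used in the studies cited here: `loocv_validity` and the other helper names are hypothetical, the keying rule is the ±.01 correlation rule from the earlier slide, and the demo data are random noise so the true validity is zero.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def build_key(responses, criterion, n_options, threshold=0.01):
    """Correlation keying: key[item][option] in {1, 0, -1}."""
    key = []
    for i in range(len(responses[0])):
        row = []
        for option in range(n_options):
            indicator = [1 if p[i] == option else 0 for p in responses]
            r = pearson(indicator, criterion)
            row.append(1 if r > threshold else -1 if r < -threshold else 0)
        key.append(row)
    return key


def score(person, key):
    return sum(key[i][choice] for i, choice in enumerate(person))


def loocv_validity(responses, criterion, n_options):
    """Score each person with a key built from the other N-1 people,
    then correlate the N held-out scores with the criterion."""
    held_out_scores = []
    for p in range(len(responses)):
        rest_responses = responses[:p] + responses[p + 1:]
        rest_criterion = criterion[:p] + criterion[p + 1:]
        key = build_key(rest_responses, rest_criterion, n_options)
        held_out_scores.append(score(responses[p], key))
    return pearson(held_out_scores, criterion), held_out_scores


# Demo on pure-noise data (assumed sizes; true validity is zero).
rng = random.Random(1)
N, N_ITEMS, N_OPTIONS = 50, 35, 4
responses = [[rng.randrange(N_OPTIONS) for _ in range(N_ITEMS)] for _ in range(N)]
criterion = [rng.gauss(0, 1) for _ in range(N)]

full_key = build_key(responses, criterion, N_OPTIONS)
charge_ahead = pearson([score(p, full_key) for p in responses], criterion)
loocv, held_out_scores = loocv_validity(responses, criterion, N_OPTIONS)

print(f"charge-ahead validity: {charge_ahead:.2f}")
print(f"LOOCV validity:        {loocv:.2f}")
```

The N keys built inside the loop are used only to estimate validity; the key that would be kept for scoring is `full_key`, built on the whole sample.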
Mead & Drasgow, 2003 • Simulated test responses & criterion • Three approaches • Charge ahead • LOOCV • True cross-validation • Varying sample sizes: • N=50,100,200,500,1000
LOOCV Conclusions • LOOCV was much better than simply “charging ahead” • But consistently slightly worse than actual cross-validation • LOOCV has a large standard error • An elbow appeared at N=200
K-fold keying • LOOCV is like using cross-validation samples of N=1 • Break sample into K groups • E.g., N=200 and k=10 • Compute key 10 times • Each calibration sample N=180 • Each cross-validation sample N=20 • Does not capitalize on chance • Potentially much more stable results
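A K-fold version of the same idea might look like the sketch below. The helper names, the random assignment of people to folds, and the demo sizes (N=60, k=6) are assumptions made for illustration; the slides do not say how the groups were formed. Each fold holds out about N/k people, and setting k = N reduces the procedure to LOOCV.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def build_key(responses, criterion, n_options, threshold=0.01):
    """Correlation keying: key[item][option] in {1, 0, -1}."""
    key = []
    for i in range(len(responses[0])):
        row = []
        for option in range(n_options):
            indicator = [1 if p[i] == option else 0 for p in responses]
            r = pearson(indicator, criterion)
            row.append(1 if r > threshold else -1 if r < -threshold else 0)
        key.append(row)
    return key


def score(person, key):
    return sum(key[i][choice] for i, choice in enumerate(person))


def kfold_validity(responses, criterion, k, n_options, seed=0):
    """Split people into k random folds; score each fold with a key built
    on the other k-1 folds; correlate the pooled scores with the criterion."""
    n = len(responses)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]  # fold sizes differ by at most 1
    scores = [0] * n
    for fold in folds:
        held = set(fold)
        calib_r = [responses[p] for p in range(n) if p not in held]
        calib_c = [criterion[p] for p in range(n) if p not in held]
        key = build_key(calib_r, calib_c, n_options)
        for p in fold:
            scores[p] = score(responses[p], key)
    return pearson(scores, criterion), folds


# Demo on pure-noise data (assumed sizes): N=60, k=6 gives 6 hold-out groups of 10.
rng = random.Random(2)
N, K = 60, 6
responses = [[rng.randrange(4) for _ in range(35)] for _ in range(N)]
criterion = [rng.gauss(0, 1) for _ in range(N)]
validity, folds = kfold_validity(responses, criterion, k=K, n_options=4)

print([len(f) for f in folds])
print(f"{K}-fold validity: {validity:.2f}")
```

Every person is scored exactly once, always with a key that was built without that person's data, which is why the pooled scores do not capitalize on chance.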
Present study • Simulation study • Four levels of sample size • N=50, 100, 200, 500 • Several levels of K • K=2, 5, 10, 25, 50, 100, 200, 500 • K=2 is double cross-validation • True validity = 0.40 • 35-item test with four response options per item
Main Effect of Sample Size • [Results table not reproduced] • Note: cells report Mean (Standard Error)
Summary • N=50 is really too small a sample for empirical keying • Using a k that produces hold-out samples of 4-5 people seemed best • N=100, k=20 • N=200, k=50 • N=500, k=100 • Traditional double cross-validation was almost as good for N>100
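The hold-out sizes implied by the recommended (N, k) pairs are simple arithmetic, N / k, and can be checked directly. The helper `k_for_target` is a hypothetical convenience, not something from the study; it simply inverts the same relation for a chosen target hold-out size.

```python
def holdout_size(n, k):
    """Average number of people held out per fold."""
    return n / k


# N -> k pairs taken from the summary slide.
recommended = {100: 20, 200: 50, 500: 100}
sizes = {n: holdout_size(n, k) for n, k in recommended.items()}
print(sizes)  # {100: 5.0, 200: 4.0, 500: 5.0}


def k_for_target(n, target=5):
    """Hypothetical helper: k giving hold-out samples of about `target` people."""
    return max(2, round(n / target))


print(k_for_target(100), k_for_target(500))  # 20 100
```

With a target of 5, `k_for_target(200)` returns 40 rather than the slide's 50; the slide's value corresponds to hold-out samples of 4.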