Overview • Introduction • Some theoretical issues • The failings of human intuitions in prediction • Issues in formal prediction • Inference from class membership: The individual versus group problem (and its only solution) • Some predictive tests
Predictive Tests • Many tests are used to make predictions: of level of achievement or success, of likelihood of recidivism, or of diagnostic category • Two kinds of predictions: • Categorical: Predict which category this subject will fall into (diagnosis, occupation) • Numerical: Predict the value of a relevant numerical variable (GPA, economic return to company)
The failings of human intuition • We have already seen many ways in which humans succumb to errors in numerical reasoning • Kahneman & Tversky: Asked subjects three questions about areas of graduate specialization: a base rate estimate, an estimate (from a description) of similarity to students in each field, and a predictive estimate (also from a description)
Results • Similarity and prediction correlate at 0.97 • Similarity and base rates correlate at -0.65 • What does this result remind you of? • What do these subjects need to be taught?
6 Errors discussed by Kahneman & Tversky • Representativeness error: Predictions are treated as if they were no different from assessments of similarity • Insufficient regression error: People fail to regress their predictions toward the mean when predictive validity is less than perfect, i.e. when the correlation between predictors and performance is < 1 (a short worked sketch follows this list) • Central tendency error: Subjects making judgments tend to avoid extremes, and compress their judgments into a smaller range than the phenomenon being judged
6 Errors discussed by Kahneman & Tversky • Discounting of prior probabilities: Human predictors will throw out base rate info for almost any reason • Overweighting of coherence: There is greater confidence in predictions based on consistent input than on inconsistent input with the same average (e.g. two B's inspire more confidence than an A and a C when predicting a B average) • Overweighting of extremes: Confidence in judgment is over-weighted at extremes, especially positive extremes (= J-shaped confidence function)
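A minimal sketch of the insufficient-regression error, with made-up numbers (the r = 0.40 validity and the 2-SD predictor score are illustrative assumptions, not figures from Kahneman & Tversky): when a predictor correlates r with the criterion, the best linear prediction of the criterion in standard-score units is r times the predictor's z-score, so predictions should shrink toward the mean.

```python
# Regression toward the mean: with validity r, the best linear prediction
# of a z-scored criterion is r * z_predictor (hypothetical numbers).
r = 0.40           # assumed predictor-criterion correlation
z_predictor = 2.0  # a case two SDs above the mean on the predictor

z_regressed = r * z_predictor  # properly regressed prediction: 0.8 SDs above the mean
z_intuitive = z_predictor      # the "insufficiently regressed" intuition: 2.0 SDs

print(f"Regressed prediction:   {z_regressed:+.2f} SD")
print(f"Unregressed intuition:  {z_intuitive:+.2f} SD")
```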
What do we need to make good predictions? • We need three pieces of information: • 1.) Base rates • 2.) Relevant predictors in the individual case • 3.) Bounds on accuracy (cutting scores) • Kahneman & Tversky's experimental evidence (previous slides) shows that subjects usually fail to weight any of these three properly
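A minimal sketch of why base rates cannot be thrown out: combining a base rate with an imperfect predictor via Bayes' rule (the 10% base rate, 80% hit rate, and 20% false-alarm rate are made-up numbers for illustration only).

```python
# Bayes' rule: a "positive" signal from an imperfect predictor is far less
# diagnostic than it looks once the base rate is taken into account.
base_rate = 0.10    # assumed prior probability of the outcome
hit_rate = 0.80     # assumed P(predictor positive | outcome occurs)
false_alarm = 0.20  # assumed P(predictor positive | outcome does not occur)

p_positive = hit_rate * base_rate + false_alarm * (1 - base_rate)
posterior = hit_rate * base_rate / p_positive

print(f"P(outcome | predictor positive) = {posterior:.2f}")  # ~0.31, not 0.80
```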
Measuring validation error [Repeat slide] • Coefficient of alienation: k = √(1 − r²), where r is the correlation of test score with some predicted performance • k = your estimate's error expressed as a proportion of the error inherent in guessing • If k = 1.0, you have 100% of the error you'd have had if you just guessed (since this means your r was 0) • If k = 0, you have achieved perfection: your r was 1, and there was no error at all* • If k = 0.6, you have 60% of the error you'd have had if you guessed * N.B. This never happens.
Why should we care? [Repeat slide] • We care because k is useful in interpreting the accuracy of an individual's scores (see the sketch below) • r = 0.6 (good), k = 0.80 (not good) • r = 0.7 (great), k = 0.71 (not so great) • r = 0.95 (fantastic!), k = 0.31 (so-so) • Since even high values of r give us fairly large error margins, the prediction of any individual's criterion score is always accompanied by a wide margin of error • The moral: Predicting individual performance is really hard to do!
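A minimal sketch that reproduces the k values quoted above directly from k = √(1 − r²); Python is used purely for illustration.

```python
from math import sqrt

def coefficient_of_alienation(r: float) -> float:
    """k = sqrt(1 - r^2): remaining error as a proportion of the error of guessing."""
    return sqrt(1 - r ** 2)

for r in (0.6, 0.7, 0.95):
    print(f"r = {r:.2f}  ->  k = {coefficient_of_alienation(r):.2f}")
# r = 0.60  ->  k = 0.80
# r = 0.70  ->  k = 0.71
# r = 0.95  ->  k = 0.31
```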
What can we infer from class membership? • Some commentators have suggested that inference from class membership is inherently fallacious • e.g. 25% of first-degree relatives of those diagnosed with malignant melanoma (skin cancer) will also develop melanoma • I am a first-degree relative of a person diagnosed with melanoma, so I take my chance of developing the disease to be 25% • Critics of the inference say: No, it is either 0% (I don't develop the disease) or 100% (I do): i.e. group probabilities don't apply to individuals
Do group probabilities apply to individuals? • Meehl's response: "If nothing is rationally inferable from membership in a class, no empirical prediction is ever possible" • The argument is a re-statement of the necessity of inference: even when predicting individual behavior from that individual's own data, we need to consider the pattern over past data • Moreover, the claim of 'certainty' is philosophical, not real: in the absence of knowing which group you are in, there is only probability, not knowledge
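A small simulation sketch of Meehl's point under stated assumptions (a 25% base rate, squared error as the scoring rule, and simulated outcomes are all illustrative): although each individual's outcome ends up 0 or 1, the single prediction that minimizes expected error for a randomly chosen class member is the group probability itself.

```python
import random

random.seed(1)
base_rate = 0.25  # the group probability, as in the melanoma example
outcomes = [1 if random.random() < base_rate else 0 for _ in range(100_000)]

def mean_squared_error(prediction: float) -> float:
    return sum((o - prediction) ** 2 for o in outcomes) / len(outcomes)

for p in (0.0, 0.25, 0.5, 1.0):
    print(f"predict {p:.2f}: error = {mean_squared_error(p):.3f}")
# Error is smallest at the base rate (0.25), not at the "certain" 0 or 1.
```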
Some Predictive Tests • Scholastic Aptitude Tests (SAT, GREs) • Highly reliable tests (0.9) developed to painstaking psychometric standards • One strength is that they have specific regression equations by college: i.e. they can predict future performance at each particular college independently
Some Predictive Tests • Scholastic Aptitude Tests (SAT, GREs) • How well do they do? • SAT: r = 0.4 with university GPA • By comparison, high school grades: r = 0.48 • Together: r = 0.55 (see the sketch below) • GRE: correlations between various combinations of GRE scores and grad school performance are only between 0.25 and 0.35, and only marginally better (0.4) if you include undergraduate grades
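A minimal sketch of how two predictors can combine into the higher joint correlation quoted above. The standard two-predictor multiple-correlation formula also needs the correlation between SAT scores and high school grades, which the slide does not report; the 0.30 used below is a made-up assumption for illustration.

```python
from math import sqrt

def multiple_r(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of a criterion with two predictors, given each
    predictor's validity (r1, r2) and the predictors' intercorrelation (r12)."""
    return sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

r_sat, r_hs = 0.40, 0.48  # validities quoted on the slide
r_sat_hs = 0.30           # assumed SAT / high-school-grades correlation (illustrative)
print(f"Combined R = {multiple_r(r_sat, r_hs, r_sat_hs):.2f}")  # ~0.55
```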
Can you beat the standards? • Notwithstanding the huge industry waiting to take money from anxious high school students, studying for the SAT doesn't help much • SAT coaching increases scores by about 15 points, which is 0.15 SDs • Repeat testing increases them a little less, by about 12 points or 0.12 SDs
Some Predictive Tests • Professional school tests (MCAT, GRE subject tests, LSAT) • Highly reliable: MCAT reliability in the low .80s, LSAT reliability > 0.9 • There is relatively little evidence of validity • They predict performance about as well as undergraduate GPA alone: r = 0.25 - 0.3
Some Predictive Tests • The Strong (1927) Interest Inventory (Strong-Campbell, 1981): widely used test of interests as predictors of professional aptitude • Empirically constructed with concurrent validity, comparing each vocational group to the overall average • Has 325 items, 162 scales covering 85 occupations • Reliability is high • 0.9+ test/retest over weeks; 0.6-0.7 over years, unless respondents were older (= 25+ years!) at first test, in which case 0.8+ even after 20 years • Does not predict success or satisfaction in a profession • Does predict likelihood of entering and remaining in a profession: there is a 50% chance that a person will end up in the profession most strongly predicted (A score), and only a 12% chance that they will end up in one least predicted (C score)
Prediction in scientific psychology • Prediction & scientific explanation are related • We admire Newton's laws precisely because they are accurate in predicting real phenomena • Many cognitive models in psychology are purely descriptive: they make no effort to predict how a person will perform on unseen stimuli • There are many ways to make such predictions, if you have sufficient variation in predictors: multiple regression, neural networks, 'cheap' methods (i.e. the best single predictor)
Predicting lexical decision RTs • Lexical decision (deciding whether a string is a word or not) is a simple task to perform • Many well-specified variables can be calculated for words: frequency, similarity to other words, frequency of components • This allows for predictive testing: How well can we predict how long it will take (average reaction time = RT) to reach a decision about wordness? • We used 35 predictors, and a non-linear method of combining them (genetic programming), to predict average RTs
Some lessons about scientific prediction • Models can 'cheat' by using variance in the input data set that does not transfer to unseen data = you must test your predictions on unseen data (see the sketch below) • Some models that look very good may look very good precisely because they are very good at exploiting this 'within-set' variation • Very simple (3-variable) non-linear models may do as well as or better than much more complex models, especially linear models, and may exclude highly-correlated variables • Different measures of successful prediction may yield quite different results (i.e. test correlation versus 0.5 SD correlation)
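A minimal sketch of the 'test on unseen data' lesson, with simulated data and ordinary least squares (the sample size, the 35 predictors, the single truly useful predictor, and the 50/50 split are all illustrative assumptions, not the lexical-decision study itself): a model with many predictors fits its own training cases noticeably better than it predicts held-out ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_predictors = 200, 35               # sizes chosen for illustration only
X = rng.normal(size=(n, n_predictors))
y = 0.8 * X[:, 0] + rng.normal(size=n)  # only the first predictor truly matters

train, test = slice(0, 100), slice(100, 200)
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)  # fit on the training half

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"within-set correlation:  {corr(X[train] @ beta, y[train]):.2f}")
print(f"unseen-data correlation: {corr(X[test] @ beta, y[test]):.2f}")  # noticeably lower
```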