Assessing Personality 75 Years After Likert: Thurstone Was Right!

Assessing Personality 75 Years After Likert: Thurstone Was Right! (And some implications for I/O)

Colleagues • Sasha Chernyshenko • Steve Stark

Thurstone • In a series of papers in the late 1920s, Thurstone asserted “Attitudes Can Be Measured” and provided several methods for their measurement • He assumed that a conscientious person would endorse a statement that reflected his/her attitude…but • “as a result of imperfections, obscurities, or irrelevancies in the statement, and inaccuracy or carelessness of the subjects” not everyone will endorse a statement, even when it matches their attitude

Thurstone, Psych Review, 1929 • For N1 people with attitude S1, all should endorse a statement with scale value S1 if they were conscientious and the item was perfect; but only n1 actually endorse the item • These people will endorse another statement with scale value S2 with a probability p that is a function of |S1-S2| • Figure from Thurstone’s paper:

Thurstone 1929

Thurstone 1928 Attitudes Can Be Measured • Gave an example of an attitude variable, militarism-pacifism, with six statements representing a range of attitudes:

Thurstone 1928

Thurstone 1928 • A pacifist “would be willing to indorse all or most of the opinions in the range d to e and … he would reject as too extremely pacifistic most of the opinions to the left of d, and would also reject the whole range of militaristic opinions.” • “His attitude would then be indicated by the average or mean of the range that he indorses”

Implications • On Thurstone’s pacifism-militarism scale, three people might endorse two items each: • Person 1 endorses f and d, and is very pacifistic • Person 2 endorses e and b, and is neutral • Person 3 endorses c and a, and is very militaristic • Thus, it is crucial to know which items are endorsed, not just how many!
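The three-person example can be sketched in a few lines. This is only an illustration: the ordering f, d, e, b, c, a along the continuum is inferred from the example, and the numeric scale values (1 = most pacifistic, 6 = most militaristic) are assumptions, not Thurstone's published values.

```python
# Hypothetical scale values: the ordering f, d, e, b, c, a is inferred from
# the three-person example; the numbers themselves are assumed for illustration.
SCALE_VALUES = {"f": 1.0, "d": 2.0, "e": 3.0, "b": 4.0, "c": 5.0, "a": 6.0}


def thurstone_score(endorsed):
    """Attitude estimate: mean scale value of the endorsed statements."""
    values = [SCALE_VALUES[s] for s in endorsed]
    return sum(values) / len(values)


people = {
    "Person 1": ["f", "d"],
    "Person 2": ["e", "b"],
    "Person 3": ["c", "a"],
}

# Same number of endorsements for everyone, very different attitude estimates.
counts = {name: len(items) for name, items in people.items()}
scores = {name: thurstone_score(items) for name, items in people.items()}

for name in people:
    print(name, "endorsed", counts[name], "items, score", scores[name])
```

Counting endorsements gives all three people the same value, while the mean scale value separates them, which is the point of the slide.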

Likert 1932 • Proposed a much simpler approach: A five-point response scale with options “Strongly Approve”, “Approve”, “Neutral”, “Disapprove”, and “Strongly Disapprove”. • The numerical values 1 to 5 were assigned to the different response options • And an individual’s score was the sum or mean of the numerical scores

Likert 1932 • Likert evaluated his scales by • Split-half reliability • Item-total correlations • To make this work, he hit upon the idea of reverse scoring, e.g., statements like d and f from Thurstone needed to be scored in the opposite direction of statements like a and c.
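A minimal sketch of Likert-style summated scoring with reverse keying follows. The function name `likert_score`, the response coding (5 = Strongly Approve), and the example respondent are assumptions made for illustration.

```python
def likert_score(responses, reverse_keyed, n_options=5):
    """Sum item scores after reverse-scoring the negatively keyed items.

    responses: dict mapping item label -> integer response in 1..n_options
    reverse_keyed: set of item labels scored in the opposite direction
    """
    total = 0
    for item, r in responses.items():
        if not 1 <= r <= n_options:
            raise ValueError(f"response {r} for item {item!r} is out of range")
        # Reverse scoring maps 1<->5, 2<->4, and leaves 3 unchanged.
        total += (n_options + 1 - r) if item in reverse_keyed else r
    return total


# Hypothetical militaristic respondent: approves a and c, disapproves d and f.
responses = {"a": 5, "c": 4, "d": 2, "f": 1}
total = likert_score(responses, reverse_keyed={"d", "f"})
mean = total / len(responses)
print(total, mean)
```

After reverse scoring, disapproving of the pacifistic statements d and f contributes to a high militarism score in the same way as approving of a and c.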

Likert 1932 • When computing item-total correlations, “if a zero or very low correlation coefficient is obtained, it indicates that the statement fails to measure that which the rest of the statements measure.” (p. 48) • “Thus item analysis reveals the satisfactoriness of any statement so far as its inclusion in a given attitude scale is concerned”
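The screening step can be sketched as below. Note one swap: this uses the item-rest ("corrected" item-total) correlation, which removes the item from the total before correlating, rather than the plain item-total correlation Likert describes. The toy data are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_rest_correlations(data):
    """Correlate each item with the sum of the remaining items.

    data: list of respondents, each a list of (already reverse-scored) item scores.
    """
    n_items = len(data[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in data]
        rest = [sum(row) - row[j] for row in data]
        result.append(pearson(item, rest))
    return result


# Invented toy data: items 1-3 move together, item 4 does not.
data = [
    [5, 4, 5, 3],
    [4, 4, 4, 1],
    [3, 3, 2, 5],
    [2, 2, 2, 5],
    [1, 1, 1, 1],
]
r = item_rest_correlations(data)
print([round(v, 2) for v in r])
```

Under Likert's rule, the fourth item would be discarded because its correlation with the rest of the scale is close to zero.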

Likert 1932 • Likert discarded intermediate statements like “Compulsory military training in all countries should be reduced but not eliminated” • Such a statement is “double-barreled and of little value because it does not differentiate persons in terms of their attitudes” (p. 34)

Likert Scaling • Although Likert didn’t articulate a psychometric model for his procedure, his analysis implies what Coombs (1964) called a dominance response process. • Specifically, someone high on the trait or attitude measured by a scale is likely to “Strongly Agree” with a positively worded item and “Strongly Disagree” with a negatively worded item

Example of a Dominance Process (figure labels: Person, Item) • Person endorses item if her standing on the latent trait, theta, is more extreme than that of the item.

Thurstone Scaling • Thurstone assumed people endorse items reflecting attitudes close to their own feelings • Coombs (1964) called this an ideal point process • Sometimes called an unfolding model

Example of an Ideal Point Process (figure labels: Item, Too Introverted, Too Extraverted) • Person endorses item if his standing on the latent trait is near that of the item. • “I enjoy chatting quietly with a friend at a cafe.” • Disagree either because: too introverted (uncomfortable in public places) or too extraverted (chatting over coffee is boring)

Important Point: • The item-total correlation of intermediate ideal point items will be close to zero!
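A small simulation illustrates why this happens. It is a sketch under stated assumptions: a Gaussian-shaped ideal point response function (not the GGUM), standard normal trait values, three low items, one intermediate item, and three high items, with low items reverse-scored as in Likert scaling. All names and values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Assumed item locations on the trait: three low, one intermediate, three high.
LOCATIONS = [-2.0, -2.0, -2.0, 0.0, 2.0, 2.0, 2.0]


def simulate(n_people=5000, seed=0):
    """Simulate keyed 0/1 item scores under an ideal point process."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for delta in LOCATIONS:
            # Endorsement is most likely when the person is near the item.
            p = math.exp(-0.5 * (theta - delta) ** 2)
            x = 1 if rng.random() < p else 0
            # Likert-style keying: reverse-score the low-trait items.
            row.append(1 - x if delta < 0 else x)
        data.append(row)
    return data


def item_rest_correlation(data, j):
    """Correlation of item j with the sum of all other items."""
    item = [row[j] for row in data]
    rest = [sum(row) - row[j] for row in data]
    return pearson(item, rest)


data = simulate()
r_mid = item_rest_correlation(data, 3)   # intermediate item
r_high = item_rest_correlation(data, 6)  # high-trait item
print(round(r_mid, 3), round(r_high, 3))
```

Because endorsement of the intermediate item falls off on both sides of its location, it is unrelated to a total score that rises steadily with the trait, so an item-total screen would flag it for deletion even though it responds exactly as the ideal point process says it should.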

Which Process is Appropriate for Temperament Assessment? • In a series of studies, we’ve • Examined appropriateness of dominance process by fitting models of increasing complexity to data from two personality inventories • Compared fits of dominance and ideal point models of similar complexity to 16PF data • Compared fits of dominance and ideal point models to sets of items not preselected to fit dominance models

Fitting Traditional Dominance Models to Personality Data • Data • 16PF 5th Edition • 13,059 examinees completed 16 noncognitive scales • Goldberg’s Big Five factor markers • 1,594 examinees completed 5 noncognitive scales • Models examined • Parametric – 2PLM, 3PLM • Nonparametric – Levine’s Maximum Likelihood Formula Scoring (MFSM)

Three-Parameter Logistic Model

Two-Parameter Logistic Model
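The equations on these two slides did not survive extraction. As a hedged reminder, the standard textbook forms can be sketched as follows; the parameter names (`a` discrimination, `b` location, `c` lower asymptote) and the scaling constant `D = 1.702` are conventional assumptions, not values taken from the slides.

```python
import math


def irf_3pl(theta, a, b, c=0.0, D=1.702):
    """Three-parameter logistic item response function.

    P(endorse | theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def irf_2pl(theta, a, b, D=1.702):
    """Two-parameter logistic model: the 3PL with the lower asymptote fixed at 0."""
    return irf_3pl(theta, a, b, c=0.0, D=D)


# Both functions are monotonically increasing in theta: a dominance process.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(irf_2pl(theta, a=1.0, b=0.0), 3),
          round(irf_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```

Both curves only ever rise with theta, which is what makes them dominance models.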

Methods for Assessing Fit: Fit Plots

Methods for Assessing Fit: Chi-Squares • Chi-squares typically computed for single items • Very important to examine item pairs and triplets • May indicate violations of local independence or misspecified model

Methods for Assessing Fit: Chi-Squares • To aid interpretation of chi-squares: • Adjust to sample size of 3,000 • Compare groups of different size • The expected value of a non-central chi-square is equal to its df plus N times the noncentrality parameter d, where N is the sample size: $E(\chi^2) = df + N d$ • So an estimate of the noncentrality parameter is $\hat{d} = (\chi^2 - df)/N$

Adjusted Chi-square • To adjust to a sample size of, say, 250, use $\chi^2_{adj} = df + 250\,(\chi^2 - df)/N$, where N is the original sample size • For IRT, we usually adjust to N = 3000, and divide by the df to get an adjusted chi-square/df ratio • Less than 2 is great, less than 3 is OK

Adjusted Chi-square/df for an Ability Test (figure annotation: adjusted chi-square/df < 3)

Results for 16 PF Sensitivity Scale: Mean Chi-sq/df Ratios

What if Items Assessed Trait Values Along the Whole Continuum? • Items on existing personality scales have been pre-screened on item-total correlation • We speculate that items measuring intermediate trait values are systematically deleted • So, what happens if a scale includes some intermediate items?

TAPAS Well-being Scale • Tailored Adaptive Personality Assessment System • Assesses up to 22 facets of the Big Five • Well-being is a facet of emotional stability • We wrote items reflecting low, moderate, and high well-being

For example, TAPAS Well-Being Scale • WELL04, “I don’t have as many happy moments in my life as others have” • WELL17, “My life has had about an equal share of ups and downs” • WELL41, “Most days I feel extremely good about myself” • In total, 20 items: 5 negative, 9 positive, and 6 neutral

Traditional Analysis Results

Fit Plot for 2PL WELL17

An Ideal Point Model: The Generalized Graded Unfolding Model (GGUM) • Roberts, Donoghue, & Laughlin (2000). Applied Psychological Measurement. • The model assumes that the probability of endorsement is higher the closer the item to the person • GGUM software provides maximum likelihood estimates of item parameters

GGUM • The probability of disagree is: and the probability of agree is
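The equations on this slide were lost in extraction. The sketch below gives the dichotomous (agree/disagree) case of the GGUM as it is usually written, and should be checked against Roberts, Donoghue, & Laughlin (2000) before being relied on. The parameter names (`alpha` discrimination, `delta` item location, `tau` threshold) and the example values are assumptions.

```python
import math


def ggum_binary(theta, alpha, delta, tau):
    """Dichotomous GGUM: return (P(disagree), P(agree)).

    With d = theta - delta:
      disagree numerator: 1 + exp(3 * alpha * d)
      agree numerator:    exp(alpha * (d - tau)) + exp(alpha * (2 * d - tau))
    Each probability is its numerator divided by the sum of both numerators.
    """
    d = theta - delta
    disagree = 1.0 + math.exp(3.0 * alpha * d)
    agree = math.exp(alpha * (d - tau)) + math.exp(alpha * (2.0 * d - tau))
    total = disagree + agree
    return disagree / total, agree / total


# Assumed parameters for an intermediate item located at delta = 0.
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    p_disagree, p_agree = ggum_binary(theta, alpha=1.0, delta=0.0, tau=-1.0)
    print(theta, round(p_agree, 3))
```

The agree probability is single-peaked and symmetric around the item location, so respondents far below or far above the item both tend to disagree.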

GGUM Estimated IRF for Moderate Item IRF for Agree response to TAPAS Well-being item “My life has had about an equal share of ups and downs.”

TAPAS Well-being Scale 2PL Results: GGUM Results:

Summary of Findings • 2PLM and 3PLM fit scales developed by traditional methods OK, but when moderate items are included: • Chi-square doublets and triplets can be large • Discrimination parameter estimates are uniformly small for moderate items (and item-total correlations are near zero) • GGUM fits all items, including moderate items • Adj. chi-square to df ratios are small for doublets and triplets • GGUM discrimination parameter estimates are large for the moderate items!

So, for Well-Being • Fitting a dominance item response theory model (the 2-parameter logistic) produced an adjusted Chi-Square to df ratio of 2.955 for pairs • The ideal point model yielded an adjusted Chi-square/df ratio of 0.997 for pairs

Conclusion • Ideal point model seems more appropriate for temperament assessment • BUT there’s a “Fly in the ointment” for I/O • Correct specification of response process does not guarantee more accurate assessment, because … • Traditional items are easily FAKED

Examples of “Traditional” Items that are Easily Faked • In each case, the positively keyed response is obvious. • I get along well with others. (A+) • I try to be the best at everything I do. (C+) • I insult people. (A-) • My peers call me “absent minded.” (C-) • Because these items consist of individual statements, they are commonly referred to as “single stimulus” items.

Army Assessment of Individual Motivation (AIM) • Uses tetrads: • I get along well with others. (A+) • I set very high standards for myself. (C+) • I worry a lot. (ES-) • I like to sit on the couch and eat potato chips. (Physical condition-) • Respondent picks the statement that is Most Like Me and the statement that is Least Like Me • Army AIM has shown less score inflation • What psychometric model would describe this type of data????

So… • US Army researchers Len White and Mark Young (and others) found some fake resistance and criterion-related validity for the tetrad format • But modeling four-dimensional items was too hard for me! • How about two-dimensional items?

Multidimensional Pairwise Preference (MDPP) Format • Create items by pairing stimuli that are similar in desirability, but representing different dimensions • “Which is more like you?” • I get along well with others. (A+) • I always get my work done on time. (C+) • This led to my work on personality assessment over the past 10 years • And the result is:

Tailored Adaptive Personality Assessment System (TAPAS) • TAPAS is designed to overcome existing limitations of personality assessment for selection by incorporating recent advancements in: • Temperament/personality assessment • Item response theory (IRT) • Computerized adaptive testing (CAT) • Our goal is for TAPAS to be innovative in both how we assess (IRT, CAT) and what we assess (facets of personality)
