Applying Ideal Point IRT Models to Score Single Stimulus and Pairwise Preference Personality Items. Stephen Stark (USF), Oleksandr S. Chernyshenko (UC, NZ), Fritz Drasgow (UIUC)
Overview • “Problems” with current personality assessment procedures • The case for ideal point response process assumptions in personality • Ideal point IRT models for single statement and pairwise preference items • Score comparability study
Personality Scale Construction Today • Rooted in Classical Test Theory (CTT) and Common Factor Theory (CFT) • Uses single stimulus format, fixed length scales and total scores in all analyses and interpretations • Existing inventories • Are static • Contain a large number of relatively short scales
Problem # 1 • Current scales worked well for research purposes, where the interest is to “understand the relationship” between constructs • But, these measures are not well-suited for adaptive formats or feedback purposes • Item parameters are scale dependent • Item difficulties do not directly correspond to item content, because of reverse scoring • Scales are too short to have good precision • More flexible test construction technology is needed
Problem # 2 • CTT and CFT make a dominance response process assumption • This has been “adopted” from cognitive ability testing • To satisfy constraints of the dominance assumption • Reverse scoring of negative items is introduced • Neutral or extreme items are deleted from item pools because they have low item-total correlations (loadings) • This results in depleted item pools and scales with properties more suitable for scholarship exams
Dominance Response Process and Personality Items (MBR, 2001; JAP, 2006) • Person endorses item if her standing on the latent trait, theta, is more extreme than that of the item. • Only appropriate for moderately positive/negative items (e.g., “I like/dislike parties”) (Figure labels: Item, Person.)
Ideal Point Process: A More Flexible Alternative? • Person endorses item if her standing on the latent trait, theta, is near that of the item. • “My social skills are about average.” • Disagree either because too introverted (uncomfortable talking to people) or too extraverted (great skills) (Figure labels: Item, Too Introverted, Too Extraverted.)
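The contrast between the two response processes can be sketched numerically. The dominance curve below is the standard two-parameter logistic function; the ideal point curve is only an illustrative single-peaked stand-in (a Gaussian-shaped function of the person-item distance), not one of the models used in the talk. The function names, the `width` parameter, and all numeric values are assumptions made for the example.

```python
import math

def dominance_prob(theta, location, discrimination=1.0):
    """2PL logistic curve: endorsement rises monotonically with theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - location)))

def ideal_point_prob(theta, location, width=1.0):
    """Illustrative single-peaked curve (assumed form, not the GGUM):
    endorsement is highest when theta equals the item location and
    falls off symmetrically as the person moves away in either direction."""
    return math.exp(-((theta - location) ** 2) / (2.0 * width ** 2))

# A neutral item such as "My social skills are about average." sits near 0.
for theta in (-2.0, 0.0, 2.0):
    print(theta,
          round(dominance_prob(theta, 0.0), 3),
          round(ideal_point_prob(theta, 0.0), 3))
```

Under the dominance curve the most extraverted respondent is the most likely to endorse the neutral item; under the ideal point curve both the very introverted and the very extraverted respondent are unlikely to endorse it, which matches the "disagree from either side" pattern described above.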
Ideal Point Process and Personality (JAP, 2006; Psych Assessment, in press) • Ideal point IRT models provided better fit to a wider variety of personality items than dominance IRT models • Many nonmonotonic, but highly discriminating items have been found • 30% more items were retained in item pools • More items are available for scale construction
Conclusions and Further Basic Research • Ideal point process offers numerous advantages for improving current measures • More research is needed • Only a few ideal point models are available; more flexibility is needed • Item and person parameter estimation must be improved (APM, 2005) • Responses to adaptive scales may be more complicated than we think • Note that this research carries limited applied value, because traditional items are easily FAKED
Single Stimulus Response Format • Items consist of individual statements • I get along well with others. (A+) • I try to be the best at everything I do. (C+) • I insult people. (A-) • My peers call me “absent minded.” (C-) • Agree/Disagree or Likert type (SD,D,N,A,SA) response options are used • In each case, socially desirable response is obvious.
How to Deal With Faking? • Social Desirability (SD) scales often used to “detect” and “correct” for faking • Adjustments made to content scale scores • Little effect on validity • Correcting for faking using SD scores is problematic, because… • SD scales may function differently across testing situations (JAP, 2001) • Need to develop fake-resistant items
Search for Fake-Resistant Formats • Empirically keyed, nontransparent items • But problems with construct and face validity data • Biodata or situational judgments • Do not measure personality directly • Can be easily faked as soon as respondents are told personality is being assessed • Forced-choice (FC) items • Halo and other biases are reduced (Borman et al., 2001) • Intuitively, should reduce faking (Jackson et al., 2000)
Unidimensional Pairwise Preference Format • Create items by pairing stimuli that are on the same dimension, but representing different locations on the trait continuum • Sociability item: • I talk a lot. (+3) • My social skills are about average. (0) • Respondent chooses statement that is “More Like Me” • Navy Computer Adaptive Personality Scales (NCAPS) uses this format
Multidimensional Pairwise Preference Format • Create items by pairing stimuli that are similar in desirability, but representing different dimensions • Positive item: • I get along well with others. (A+) • I set very high standards for myself. (C+) • Negative item: • I insult people. (A-) • I work just enough to pass my classes. (C-) • Variation of this approach is the tetrad format (Army AIM or SHL’s OPQ-32-i)
Scoring Forced Choice Measures • Traditional scoring of FC items is problematic • Unidimensional FC scale scores have bi-modal distributions • Multidimensional FC scores are ipsative • Inter-individual comparisons not possible • Scale scores correlate negatively (even facets of Big 5) • Scoring lacks a formal psychometric model • Difficult to evaluate scoring accuracy • Does not provide insight about item construction • Not usable for adaptive testing
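Why traditional multidimensional FC scores are ipsative can be seen with a minimal sketch. It assumes the simplest traditional rule, one point to the dimension of whichever statement is chosen; the talk does not spell out the rule, so this rule, the function name, and the example responses are all hypothetical.

```python
from collections import Counter

def traditional_fc_scores(chosen_dimensions, dimensions):
    """Assumed traditional rule: one point for the dimension of each
    statement the respondent picked as 'More Like Me'."""
    counts = Counter(chosen_dimensions)
    return {d: counts.get(d, 0) for d in dimensions}

dims = ("A", "C")  # Agreeableness, Conscientiousness
# Hypothetical respondents answering the same six A-versus-C pairs.
respondent_1 = ["A", "A", "C", "A", "A", "C"]
respondent_2 = ["C", "C", "C", "A", "C", "C"]

for picks in (respondent_1, respondent_2):
    scores = traditional_fc_scores(picks, dims)
    # The total is fixed at the number of pairs for every respondent.
    print(scores, "total =", sum(scores.values()))
```

Because every respondent's scores sum to the same constant, a higher score on one dimension forces a lower score on another. That constraint is what produces the negative scale intercorrelations and rules out comparisons between individuals.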
Are Forced Choice Scores Equivalent to Traditional Scores? • FC measures are gaining popularity • But, direct comparisons of traditional FC and SS scores are not possible • “Score inflations” can only be evaluated within measures • Correlations between measures are low • Before evaluating FC measures in operational settings: • Scores must be normative • Under honest conditions, FC and SS scores should be the same
Response Format Study (in review) • Used advances in IRT to obtain normative scores for Order, Self Control and Sociability • 36-item Single Stimulus measure • 36-pair Unidimensional Pairwise Preference measure • 36-pair Multidimensional Pairwise Preference measure • All scores were estimated using IRT • All items administered under honest conditions (N=602 for self reports and N=110 for observers)
IRT Model for Single Stimulus Items • Generalized Graded Unfolding Model (GGUM; Roberts et al., 1998) • GGUM fit personality items well (Chernyshenko, 2002) • No reverse scoring needed
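A minimal sketch of the GGUM for the two-category (Agree/Disagree) case, written from the model as it is usually stated in the literature. The parameter names (`alpha` for discrimination, `delta` for item location, `tau` for the threshold) and the numeric values are assumptions for illustration, not estimates from the study.

```python
import math

def ggum_agree_prob(theta, alpha, delta, tau):
    """P(agree | theta) under the dichotomous GGUM.

    Each observed response pools two latent responses: agreeing or
    disagreeing 'from below' and 'from above' the item location."""
    d = theta - delta
    agree = math.exp(alpha * (d - tau)) + math.exp(alpha * (2.0 * d - tau))
    disagree = 1.0 + math.exp(alpha * 3.0 * d)
    return agree / (agree + disagree)

# Illustrative item located at delta = 0 with a negative threshold.
for theta in (-1.5, 0.0, 1.5):
    print(theta, round(ggum_agree_prob(theta, alpha=1.5, delta=0.0, tau=-1.0), 3))
```

The curve peaks where theta equals delta and is symmetric around that point, so an item keyed at any location, including a negative or neutral one, can be scored as written. This is why no reverse scoring is needed.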
IRT Model for Scoring Unidimensional Pairwise Preferences (Stark & Drasgow, 2002) • Zinnes and Griggs (1974) Probabilistic Unfolding Model (ZG model) • Idea: Respondent has ideal point representing his/her perception of typical behavior (trait level) • Task: On each trial, respondent chooses the statement that better describes him/her
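The ZG preference probability can be sketched as follows, using the closed form in which the model is commonly written: P(s preferred to t) = 1 - Φ(a) - Φ(b) + 2Φ(a)Φ(b), with a = (2θ - μ_s - μ_t)/√3 and b = μ_s - μ_t. The slides do not show the formula, so treat this form, the function names, and the example values as assumptions to check against the cited papers.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def zg_prefer_s(theta, mu_s, mu_t):
    """P(statement s is chosen over statement t | theta) under the
    assumed ZG closed form; mu_s and mu_t are statement locations."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    return (1.0 - normal_cdf(a) - normal_cdf(b)
            + 2.0 * normal_cdf(a) * normal_cdf(b))

# Sociability pair from the slides: "I talk a lot." (+3) versus
# "My social skills are about average." (0).
for theta in (0.0, 1.5, 2.5):
    print(theta, round(zg_prefer_s(theta, mu_s=3.0, mu_t=0.0), 3))
```

A respondent whose ideal point lies closer to a statement is more likely to choose it, and two statements at the same location are chosen with equal probability.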
IRT Model for Scoring Multidimensional Pairwise Preferences (Stark, 2002; Stark, Chernyshenko, & Drasgow, 2005) • Respondent evaluates each stimulus (personality statement) separately and makes independent decisions about endorsement (1 = Agree, 0 = Disagree). • Stimuli may be on different dimensions. • Single stimulus response probabilities P{0} and P{1} are computed using a unidimensional ideal point model for “traditional” items (GGUM). • The new pairwise preference model is referred to as MDPP.
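The independent-endorsement idea translates into a preference probability as sketched below: statement s is preferred when the respondent would agree with s and disagree with t, conditional on the two decisions differing. This follows the MDPP as it is usually stated; the dichotomous GGUM helper, the parameter tuples, and the example values are assumptions for illustration.

```python
import math

def ggum_agree_prob(theta, alpha, delta, tau):
    """P{1} for one statement under the dichotomous GGUM (assumed form)."""
    d = theta - delta
    agree = math.exp(alpha * (d - tau)) + math.exp(alpha * (2.0 * d - tau))
    disagree = 1.0 + math.exp(alpha * 3.0 * d)
    return agree / (agree + disagree)

def mdpp_prefer_s(theta_s, theta_t, item_s, item_t):
    """P(s preferred to t), where theta_s and theta_t are the respondent's
    standings on the dimensions measured by statements s and t.
    Each item is an (alpha, delta, tau) tuple."""
    ps1 = ggum_agree_prob(theta_s, *item_s)
    pt1 = ggum_agree_prob(theta_t, *item_t)
    s_only = ps1 * (1.0 - pt1)   # agree with s, disagree with t
    t_only = (1.0 - ps1) * pt1   # disagree with s, agree with t
    return s_only / (s_only + t_only)

# Hypothetical pair: an Agreeableness and a Conscientiousness statement.
item_a = (1.0, 1.0, -1.0)
item_c = (1.0, 1.0, -1.0)
print(round(mdpp_prefer_s(1.0, -2.0, item_a, item_c), 3))
```

Because each statement's endorsement probability depends only on the trait it measures, the pair carries information about both dimensions at once, which is what allows normative rather than ipsative scores to be estimated.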
Normative Score Recovery • Roberts et al. (2000) and Stark (1998, 2002) showed in simulation studies: • Accurate normative scores could be recovered for GGUM, ZG and MDPP models • 10 items or pairs per dimension are sufficient to obtain reasonable estimates • But, no empirical study has compared scores from these 3 formats, even under “honest” conditions
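How a normative trait score might be obtained from a set of pairwise preferences can be sketched with a grid-based expected a posteriori (EAP) estimate. The slides do not say which estimator was used, so EAP, the standard normal prior, the grid settings, the assumed ZG closed form, and the three example pairs are all assumptions made for illustration.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def zg_prefer_s(theta, mu_s, mu_t):
    """Assumed ZG closed form for P(s chosen over t | theta)."""
    a = (2.0 * theta - mu_s - mu_t) / math.sqrt(3.0)
    b = mu_s - mu_t
    return (1.0 - normal_cdf(a) - normal_cdf(b)
            + 2.0 * normal_cdf(a) * normal_cdf(b))

def eap_theta(pairs, choices, grid_points=161, lo=-4.0, hi=4.0):
    """Posterior mean of theta on an evenly spaced grid.
    pairs: list of (mu_s, mu_t); choices: 1 if s was chosen, else 0."""
    step = (hi - lo) / (grid_points - 1)
    numerator = denominator = 0.0
    for i in range(grid_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for (mu_s, mu_t), chose_s in zip(pairs, choices):
            p = zg_prefer_s(theta, mu_s, mu_t)
            weight *= p if chose_s else (1.0 - p)
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

pairs = [(1.0, -1.0), (2.0, 0.0), (0.0, -2.0)]  # hypothetical locations
print(round(eap_theta(pairs, [1, 1, 1]), 3))    # always picks the higher statement
print(round(eap_theta(pairs, [0, 0, 0]), 3))    # always picks the lower statement
```

The resulting score is expressed on the latent trait metric rather than as a count of choices, so respondents can be compared with one another. With only three pairs the estimate is pulled strongly toward the prior mean.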
Results for Conscientiousness Facets Positive correlation for MDPP facet scores. Correlations = reliability
Results for Order and Sociability Correlations = reliability
Criterion Validities Criterion validities are comparable
Conclusions • Under honest conditions, MDPP, ZG, and SS versions of the questionnaire provided equivalent measurement and can be viewed as alternate forms • Moving toward FC formats did not affect the validity of personality scores. • Observing a positive correlation between Order and Self Control MDPP scales provided empirical evidence for normative scoring
Current Research • Results of this study speak in favor of using ZG and MDPP IRT models for scoring FC scales • Having IRT models makes transition to adaptive testing easy • Adaptive format may offer additional benefit of fake resistance (see NCAPS presentations for recent IMTA talks) • Current studies: • How to best pair stimuli? • How many unidimensional pairings are needed? • Will increasing # of dimensions lead to more fake resistant scores? • Can we better detect faking using forced choice than traditional format?