
Applying Ideal Point IRT Models to Score Single Stimulus and Pairwise Preference Personality Items



Presentation Transcript


  1. Applying Ideal Point IRT Models to Score Single Stimulus and Pairwise Preference Personality Items Stephen Stark (USF) Oleksandr S. Chernyshenko (UC, NZ) Fritz Drasgow (UIUC)

  2. Overview • “Problems” with current personality assessment procedures • The case for ideal point response process assumptions in personality • Ideal point IRT models for single statement and pairwise preference items • Score comparability study

  3. Personality Scale Construction Today • Rooted in Classical Test Theory (CTT) and Common Factor Theory (CFT) • Uses single stimulus format, fixed-length scales, and total scores in all analyses and interpretations • Existing inventories • Are static • Contain a large number of relatively short scales

  4. Problem #1 • Current scales have worked well for research purposes, where the interest is to “understand the relationship” between constructs • But these measures are not well-suited for adaptive formats or feedback purposes • Item parameters are scale dependent • Item difficulties do not directly correspond to item content, because of reverse scoring • Scales are too short to have good precision • More flexible test construction technology is needed

  5. Problem #2 • CTT and CFT make a dominance response process assumption • This has been “adopted” from cognitive ability testing • To satisfy constraints of the dominance assumption • Reverse scoring of negative items is introduced • Neutral or extreme items are deleted from item pools because they have low item-total correlations (loadings) • This results in depleted item pools and scales with properties more suitable for scholarship exams

  6. Dominance Response Process and Personality Items (MBR, 2001; JAP, 2006) • Person endorses an item if her standing on the latent trait, theta, is more extreme than that of the item. • Only appropriate for moderately positive/negative items (e.g., “I like/dislike parties”) • [Figure: “Item” and “Person” locations on the trait continuum]

  7. Ideal Point Process: A More Flexible Alternative? • Person endorses an item if her standing on the latent trait, theta, is near that of the item. • “My social skills are about average.” • Disagree either because: too introverted (uncomfortable talking to people) or too extraverted (great skills) • [Figure: single-peaked IRF with “Too Introverted” and “Too Extraverted” regions flanking the item location]
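
A minimal Python sketch of the contrast, assuming a 2PL logistic curve as a stand-in for the dominance process and a simple squared-distance curve as a stand-in for the ideal point process (both functional forms and the function names are illustrative choices, not the models fitted in the studies below):

```python
import numpy as np

def dominance_irf(theta, b, a=1.0):
    # Dominance process (2PL stand-in): endorsement probability rises
    # monotonically as theta moves past the item location b.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def ideal_point_irf(theta, delta, a=1.0):
    # Ideal point process (squared-distance stand-in): endorsement
    # probability peaks at theta == delta and falls off in both directions,
    # so very introverted AND very extraverted people can both disagree.
    return np.exp(-a * (theta - delta) ** 2)

theta = np.linspace(-3, 3, 7)
print(dominance_irf(theta, b=0.0))        # monotone increasing in theta
print(ideal_point_irf(theta, delta=0.0))  # single-peaked at delta
```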

  8. Ideal Point Process and Personality (JAP, 2006; Psych Assessment, in press) • Ideal point IRT models provided better fit to a wider variety of personality items than dominance IRT models • Many nonmonotonic, but highly discriminating items have been found • 30% more items were retained in item pools • More items are available for scale construction

  9. Conclusions and Further Basic Research • Ideal point process offers numerous advantages for improving current measures • More research is needed • Only a few ideal point models are available; more flexibility is needed • Item and person parameter estimation must be improved (APM, 2005) • Responses to adaptive scales may be more complicated than we think • Note that this research carries limited applied value by itself, because traditional items are easily FAKED

  10. Single Stimulus Response Format • Items consist of individual statements • I get along well with others. (A+) • I try to be the best at everything I do. (C+) • I insult people. (A-) • My peers call me “absent minded.” (C-) • Agree/Disagree or Likert-type (SD, D, N, A, SA) response options are used • In each case, the socially desirable response is obvious.

  11. How to Deal With Faking? • Social Desirability (SD) scales often used to “detect” and “correct” for faking • Adjustments made to content scale scores • Little effect on validity • Correcting for faking using SD scores is problematic, because… • SD scales may function differently across testing situations (JAP, 2001) • Need to develop fake-resistant items

  12. Search for Fake-Resistant Formats • Empirically keyed, nontransparent items • But problems with construct and face validity evidence • Biodata or situational judgments • Do not measure personality directly • Can be easily faked as soon as respondents are told personality is being assessed • Forced-choice (FC) items • Halo and other biases are reduced (Borman et al., 2001) • Intuitively, should reduce faking (Jackson et al., 2000)

  13. Unidimensional Pairwise Preference Format • Create items by pairing stimuli that are on the same dimension but represent different locations on the trait continuum • Sociability item: • I talk a lot. (+3) • My social skills are about average. (0) • Respondent chooses the statement that is “More Like Me” • Navy Computer Adaptive Personality Scales (NCAPS) uses this format

  14. Multidimensional Pairwise Preference Format • Create items by pairing stimuli that are similar in desirability but represent different dimensions • Positive item: • I get along well with others. (A+) • I set very high standards for myself. (C+) • Negative item: • I insult people. (A-) • I work just enough to pass my classes. (C-) • A variation of this approach is the tetrad format (Army AIM or SHL’s OPQ-32-i)

  15. Scoring Forced Choice Measures • Traditional scoring of FC items is problematic • Unidimensional FC scale scores have bi-modal distributions • Multidimensional FC scores are ipsative • Inter-individual comparisons not possible • Scale scores correlate negatively (even facets of Big 5) • Scoring lacks a formal psychometric model • Difficult to evaluate scoring accuracy • Does not provide insight about item construction • Not usable for adaptive testing
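
To see why ipsative scoring breaks inter-individual comparisons, consider a hypothetical two-scale forced-choice measure in which each of 36 pairs awards a point to one scale or the other, so every respondent’s two scores sum to 36. A short simulation (all numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ipsative scoring: 36 forced-choice pairs, each point going
# to scale A or scale B, so A + B == 36 for every respondent.
scale_a = rng.integers(0, 37, size=500)
scale_b = 36 - scale_a
print(np.corrcoef(scale_a, scale_b)[0, 1])  # exactly -1.0 by construction
```

With only two scales the correlation is forced to -1; with more scales the sum constraint still pushes the average inter-scale correlation negative, which is why even facets of the same Big Five factor can correlate negatively under traditional FC scoring.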

  16. Are Forced Choice Scores Equivalent to Traditional Scores? • FC measures are gaining popularity • But direct comparisons of FC and traditional SS scores are not possible • “Score inflations” can only be evaluated within measures • Correlations between measures are low • Before evaluating FC measures in operational settings: • Scores must be normative • Under honest conditions, FC and SS scores should be the same

  17. Response Format Study (in review) • Used advances in IRT to obtain normative scores for Order, Self Control, and Sociability • 36-item Single Stimulus measure • 36-pair Unidimensional Pairwise Preference measure • 36-pair Multidimensional Pairwise Preference measure • All scores were estimated using IRT • All items administered under honest conditions (N = 602 for self-reports and N = 110 for observers)

  18. IRT Model for Single Stimulus Items • Generalized Graded Unfolding Model (GGUM; Roberts et al., 1998) • GGUM fit personality items well (Chernyshenko, 2002) • No reverse scoring needed
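
The GGUM response function was displayed as an image and is not in the transcript. Reconstructed from the published model (so treat the notation as a paraphrase): with $\alpha_i$ the item discrimination, $\delta_i$ the item location, $\tau_{ik}$ the subjective response category thresholds ($\tau_{i0} \equiv 0$), $z = 0, \ldots, C$ the observed response category, and $M = 2C + 1$,

$$
P(Z_i = z \mid \theta_j) =
\frac{\exp\{\alpha_i[z(\theta_j-\delta_i)-\sum_{k=0}^{z}\tau_{ik}]\}
      + \exp\{\alpha_i[(M-z)(\theta_j-\delta_i)-\sum_{k=0}^{z}\tau_{ik}]\}}
     {\sum_{w=0}^{C}\big(\exp\{\alpha_i[w(\theta_j-\delta_i)-\sum_{k=0}^{w}\tau_{ik}]\}
      + \exp\{\alpha_i[(M-w)(\theta_j-\delta_i)-\sum_{k=0}^{w}\tau_{ik}]\}\big)}.
$$

The symmetry between $z$ and $M - z$ in the exponents makes agreement most probable when $\theta_j$ is close to $\delta_i$ from either side, which is why no reverse scoring is needed.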

  19. Example: “Ideal Point IRT” Order Scale [figure omitted from transcript]

  20. IRT Model for Scoring Unidimensional Pairwise Preferences (Stark & Drasgow, 2002) • Zinnes and Griggs (1974) Probabilistic Unfolding Model (ZG model) • Idea: Respondent has an ideal point representing his/her perception of typical behavior (trait level) • Task: On each trial, respondent chooses the statement that better describes him/her

  21. Equation for ZG Item Response Functions
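
The equation itself was an image and is missing from the transcript. The following is a reconstruction derived from the ZG model’s assumptions (the ideal point and both stimulus perceptions are independent normal variables with unit variance): the probability that a respondent with ideal point $\theta$ prefers stimulus $j$, located at $\mu_j$, over stimulus $k$, located at $\mu_k$, is

$$
P_{jk}(\theta) = 1 - \Phi(a_{jk}) - \Phi(b_{jk}) + 2\,\Phi(a_{jk})\,\Phi(b_{jk}),
\qquad
a_{jk} = \frac{\mu_j - \mu_k}{\sqrt{2}},
\quad
b_{jk} = \frac{2\theta - \mu_j - \mu_k}{\sqrt{6}},
$$

where $\Phi$ is the standard normal CDF. Intuitively, stimulus $j$ tends to be preferred when $\theta$ lies closer to $\mu_j$ than to $\mu_k$, up to perceptual noise.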

  22. IRF for Stimulus Pair j = 17, k = 18 (μ_17 = 5.6, μ_18 = 3.8)
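
A short Python sketch of the reconstructed ZG IRF, evaluated for this slide’s pair (μ_17 = 5.6, μ_18 = 3.8); the function name and theta grid are illustrative:

```python
import numpy as np
from scipy.stats import norm

def zg_irf(theta, mu_j, mu_k):
    # Reconstructed Zinnes-Griggs IRF: probability that a respondent
    # with ideal point theta prefers stimulus j over stimulus k.
    a = (mu_j - mu_k) / np.sqrt(2.0)
    b = (2.0 * theta - mu_j - mu_k) / np.sqrt(6.0)
    return 1.0 - norm.cdf(a) - norm.cdf(b) + 2.0 * norm.cdf(a) * norm.cdf(b)

theta = np.linspace(-3.0, 7.0, 6)
# Because mu_j > mu_k, the curve rises with theta, approaching Phi(a) ~ .90.
print(zg_irf(theta, mu_j=5.6, mu_k=3.8))
```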

  23. IRT Model for Scoring Multidimensional Pairwise Preferences (Stark, 2002; Stark, Chernyshenko, & Drasgow, 2005) • Respondent evaluates each stimulus (personality statement) separately and makes independent decisions about endorsement (1 = Agree, 0 = Disagree). • Stimuli may be on different dimensions. • Single stimulus response probabilities P{0} and P{1} are computed using a unidimensional ideal point model for “traditional” items (GGUM). • Refer to the new pairwise preference model as MDPP.

  24. Model Notation
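
The notation slide was an image and is missing from the transcript. A reconstruction based on Stark, Chernyshenko, & Drasgow (2005): let $P_s\{1\}$ be the GGUM probability of endorsing statement $s$ given the trait level on that statement’s dimension, and $P_s\{0\} = 1 - P_s\{1\}$. The MDPP probability of preferring stimulus $s$ over stimulus $t$ is then

$$
P\{s \succ t\}(\theta_{d_s}, \theta_{d_t}) =
\frac{P_s\{1\}\,P_t\{0\}}{P_s\{1\}\,P_t\{0\} + P_s\{0\}\,P_t\{1\}},
$$

i.e., the probability of the (Agree, Disagree) outcome conditional on the respondent endorsing exactly one of the two statements. A minimal Python sketch of this composition (the function name is illustrative; the endorsement probabilities would come from a GGUM implementation):

```python
def mdpp_prob(p_s, p_t):
    # Preference probability for stimulus s over stimulus t, given each
    # statement's endorsement probability on its own dimension (e.g., GGUM).
    return p_s * (1.0 - p_t) / (p_s * (1.0 - p_t) + (1.0 - p_s) * p_t)

print(mdpp_prob(0.8, 0.3))  # ~0.903: the more endorsable statement wins
```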

  25. Normative Score Recovery • Roberts et al. (2000) and Stark (1998, 2002) showed in simulation studies: • Accurate normative scores could be recovered for the GGUM, ZG, and MDPP models • 10 items or pairs per dimension are sufficient to obtain reasonable estimates • But no empirical study has compared scores from these 3 formats, even under “honest” conditions

  26. Results for Conscientiousness Facets [table omitted from transcript] • Positive correlation for MDPP facet scores • Correlations = reliability

  27. Results for Order and Sociability [table omitted from transcript] • Correlations = reliability

  28. Criterion Validities [table omitted from transcript] • Criterion validities are comparable across formats

  29. Conclusions • Under honest conditions, the MDPP, ZG, and SS versions of the questionnaire provided equivalent measurement and can be viewed as alternate forms • Moving toward FC formats did not affect the validity of personality scores • Observing a positive correlation between the Order and Self Control MDPP scales provided empirical evidence for normative scoring (ipsative scores would correlate negatively)

  30. Current Research • Results of this study speak in favor of using the ZG and MDPP IRT models for scoring FC scales • Having IRT models makes the transition to adaptive testing easy • Adaptive format may offer the additional benefit of fake resistance (see NCAPS presentations from recent IMTA talks) • Current studies: • How to best pair stimuli? • How many unidimensional pairings are needed? • Will increasing the number of dimensions lead to more fake-resistant scores? • Can we better detect faking using forced choice than the traditional format?
