Crash Course in Psychometric Theory David B. Flora SP Area Brownbag February 8, 2010
Research in social and personality psychology is about abstract concepts of theoretical importance, called “constructs.” • Examples include “prejudice,” “self-esteem,” “introversion,” “forgiveness,” and on and on… • The success of a research study depends on how well constructs of interest are measured. • The field of “Test Theory” or “Psychometrics” is concerned with the theory and accompanying research methods for the measurement of psychological constructs.
Psychometric theory evolved from the tradition of intelligence, or “mental ability”, testing. • Spearman (1904) invented factor analysis to aid in the measurement of intelligence. • The psychophysics tradition is also foundational to psychometric theory, as per Thurstone’s (1928) law of comparative judgment for scaling of social stimuli. • A test question is a stimulus; the answer to the question is a behavioural response to the stimulus.
Classical True Score Model

x_i = t_i + e_i

• x_i is the observed value for person i from an operationalization of a construct (e.g., a test score).
• t_i is that person's true score on the construct.
• e_i is measurement error.
• The variable t is a latent variable: an unobservable variable that is measured by the observable variable x.
Lord & Novick's (1968) preferred definition of the true score (paraphrased): For a given person, there is a "propensity" distribution of possible outcomes of a measurement that reflects the operation of processes such as momentary fluctuations in memory and attention or in the strength of an attitude. The person's true score is the mean of this propensity distribution. Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores.
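A minimal simulation of this idea (a sketch in Python/NumPy; the true score and error spread are made-up values): repeatedly "measure" one person by drawing from their propensity distribution, and the mean of the observed scores recovers the true score.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 25.0   # hypothetical true score for one person
error_sd = 3.0      # spread of the propensity distribution

# Many independent measurement occasions: x = t + e
observed = true_score + rng.normal(0.0, error_sd, size=10_000)

# The mean of the propensity distribution is the true score
print(observed.mean())   # close to 25.0
```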
Validity

x_i = t_i + e_i, or t_i = x_i − e_i

• Validity denotes the scientific utility of the scores, x, obtained with a measuring instrument (i.e., a test).
• But there is more to it than just the size of e_i.
• Validity is mostly concerned with whether x measures the t that we want it to…
• Note that validity is a property of the scores obtained from a test, not the test itself.
Nunnally & Bernstein (1994), Psychometric Theory (3rd ed.), p. 84: “Validation always requires empirical investigations, with the nature of the measure and form of validity dictating the needed form of [empirical] evidence.” “Validation usually is a matter of degree rather than an all-or-none property, and validation is an unending process.” “Strictly speaking, one validates the use to which a measuring instrument is put rather than the instrument itself. Tests are often valid for one purpose but not another.”
You may have heard of • Internal validity • External validity • Face validity • Content validity • Construct validity • Criterion validity • Predictive validity • Postdictive validity • Concurrent validity • Factorial validity • Convergent validity • Discriminant validity • Incremental validity • Ecological validity
Standards • The Standards for Educational and Psychological Testing (1966; 1974; 1985; 1999) are developed jointly by AERA, APA, and NCME. • The Standards view validity as a unitary concept. • Rather than there being separate types of validity, there are three main types of validity evidence: 1. Content-related evidence 2. Construct-related evidence 3. Criterion-related evidence
Content-related validity evidence • Content validity refers to the extent to which a set of items (or stimuli) adequately reflects a content domain. • E.g., selection of vocabulary words for a Grade 6 vocabulary test from the domain of all words taught to 6th graders. • Evidence is based on theoretical judgment. • Same as face validity? (e.g., a self-report judgment of overall health)
Construct-related validity evidence • Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. • Mainly concerned with associations between test scores and other variables that are dictated by theory. • Multi-trait multi-method correlation matrix (Campbell & Fiske, 1959): Is the test strongly correlated with other measures of the same construct? (convergent validity) Is the test less strongly correlated with measures of different constructs than with measures of the same construct? (discriminant validity)
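The convergent/discriminant logic can be sketched with simulated data (Python/NumPy; the constructs, measure names, and error levels are all hypothetical): two methods measuring the same trait should correlate more strongly with each other than with a measure of a different trait.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical latent traits for n people
self_esteem = rng.normal(size=n)
anxiety = rng.normal(size=n)

# Two measures of self-esteem (same trait, different methods), one of anxiety
se_questionnaire = self_esteem + rng.normal(0, 0.5, n)
se_interview = self_esteem + rng.normal(0, 0.7, n)
anx_questionnaire = anxiety + rng.normal(0, 0.5, n)

convergent = np.corrcoef(se_questionnaire, se_interview)[0, 1]
discriminant = np.corrcoef(se_questionnaire, anx_questionnaire)[0, 1]

# Convergent correlations should exceed discriminant ones
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```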
Floyd & Widaman (1995), p. 287: • “Construct validity is supported if the factor structure of the [instrument] is consistent with the constructs the instrument purports to measure.” • “If the factor analysis fails to detect underlying constructs [i.e., factors] that explain sufficient variance in the [items] or if the constructs detected are inconsistent with expectations, the construct validity of the scale is compromised.” Floyd, F. J., & Widaman, K. F. (1995). Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment, 7, 286-299.
Criterion-related validity evidence • Evidence is based on empirical association with some important “gold standard” criterion. • Encompasses predictive and concurrent validity. • Difficult to distinguish from construct validity - Theoretical reason for association is critical for construct validity, less important for criterion validity. • E.g., relationship between a stress measure and physical health?
Do we really need your new scale? Does it have incremental validity? “Incremental validity is defined as the degree to which a measure explains or predicts a phenomenon of interest, relative to other measures. Incremental validity can be evaluated on several dimensions, such as sensitivity to change, diagnostic efficacy, content validity, treatment design and outcome, and convergent validity.” Haynes, S. N., & Lench, H. (2003). Incremental validity of new clinical assessment measures. Psychological Assessment, 15, 456-466.
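One common way to quantify incremental validity is hierarchical regression: does the new measure improve prediction of the criterion beyond existing measures? A minimal sketch (Python/NumPy; all variables simulated, names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

old_measure = rng.normal(size=n)
new_measure = 0.6 * old_measure + rng.normal(0, 0.8, n)  # overlaps with old
criterion = 0.5 * old_measure + 0.3 * new_measure + rng.normal(0, 1, n)

def r_squared(X, y):
    """R^2 from an OLS fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_old = r_squared(old_measure[:, None], criterion)
r2_both = r_squared(np.column_stack([old_measure, new_measure]), criterion)

# The R^2 increment indexes the new measure's incremental validity
print(f"R^2 change = {r2_both - r2_old:.3f}")
```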
Reliability

• Reliability is necessary, but not sufficient, for construct validity.
• Lack of reliability (i.e., measurement error) introduces bias in analyses and reduces statistical power.
• What exactly is reliability?

x_i = t_i + e_i

Reliability = Var(t_i) / Var(x_i)

Reliability is the proportion of true score variance to total observed variance.
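The definition can be checked directly in simulation (a Python/NumPy sketch; variances chosen arbitrarily so the theoretical reliability is .80):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

t = rng.normal(0, 2.0, n)   # true scores: Var(t) = 4
e = rng.normal(0, 1.0, n)   # errors:      Var(e) = 1
x = t + e                   # observed scores

# Proportion of true score variance in total variance; theoretically 4/5 = .80
print(t.var() / x.var())
```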
Since we can't directly observe Var(t_i), we must turn to other methods for estimating reliability… • Parallel-forms reliability • Split-half reliability • Internal consistency reliability (coefficient alpha) • Test-retest reliability • Inter-rater reliability Each is an estimate of the proportion of true score variability to total variability.
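As one concrete example from this list, a split-half estimate correlates scores from two halves of the test and projects the result to full test length with the Spearman-Brown formula. A minimal sketch (Python/NumPy; simulated item responses):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 500, 10

# Simulate k items that each reflect the same true score plus error
t = rng.normal(size=(n, 1))
items = t + rng.normal(0, 1.0, size=(n, k))

odd = items[:, 0::2].sum(axis=1)
even = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]

# Spearman-Brown step-up from half-length to full-length reliability
split_half_reliability = 2 * r_half / (1 + r_half)
print(split_half_reliability)
```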
Coefficient alpha (α) • Original formula actually given by Guttman (1945), not Cronbach (1951)! • An average of all inter-item correlations, weighted by the number of items, k:

α = k·r̄ / [1 + (k − 1)·r̄], where r̄ is the mean inter-item correlation

• The expected correlation of one test with an alternate form containing the same number of items.
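Alpha is easy to compute from an (n persons × k items) response matrix using the equivalent covariance-based formula α = (k / (k − 1)) · (1 − Σ σ²_item / σ²_total). A minimal sketch (Python/NumPy; simulated responses):

```python
import numpy as np

def cronbach_alpha(items):
    """Coefficient alpha for an (n persons x k items) response matrix."""
    items = np.asarray(items)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated example: 5 items reflecting one common factor
rng = np.random.default_rng(5)
t = rng.normal(size=(300, 1))
responses = t + rng.normal(0, 1.0, size=(300, 5))
print(cronbach_alpha(responses))
```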
Coefficient alpha (α) • The more items, the larger α. • A high α does NOT imply unidimensionality (i.e., that items all measure a single factor). • α is a lower-bound estimate of true reliability…
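The first point is easy to see from the standardized-alpha formula above, holding the mean inter-item correlation fixed (here at an arbitrary .25):

```python
# Standardized alpha as a function of test length k, with r-bar held at .25
r_bar = 0.25
for k in (5, 10, 20, 40):
    alpha = k * r_bar / (1 + (k - 1) * r_bar)
    print(f"k = {k:2d}: alpha = {alpha:.2f}")   # 0.62, 0.77, 0.87, 0.93
```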
How does factor analysis fit in?

"Common factor model" for a "congeneric" set of items measuring a single construct:

x_ij = λ_j f_i + u_ij

• x_ij is person i's response to the jth item on a multi-item test.
• f_i is the common factor score, i.e., person i's score on the latent variable.
• λ_j is the factor loading of test item j.
• u_ij is the unique factor score for person i on item j. It represents a mixture of systematic influence and random error influence on item x: u_ij = (s_ij + e_ij)
• If we define t_ij = λ_j f_i and assume that the systematic unique influence is negligible, so that u_ij ≈ (0 + e_ij)…
• …then the common factor model gives the Classical True Score model for scores on item j:

x_ij = λ_j f_i + u_ij becomes x_ij = t_ij + e_ij

• Coefficient α will be underestimated to the extent that the factor loadings, λ_j, vary across items.
• More accurate reliability estimates can be calculated using the factor loadings (see the sketch below).
- The perspective shifts from internal consistency to a latent variable relationship.
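One widely used loading-based estimate is coefficient omega (McDonald, 1999), which builds the reliability of the scale score from factor loadings and unique variances. A minimal sketch (Python/NumPy; the standardized loadings below are made-up values):

```python
import numpy as np

# Hypothetical standardized loadings for a 5-item congeneric scale
loadings = np.array([0.8, 0.7, 0.6, 0.5, 0.4])
uniquenesses = 1 - loadings**2

# Omega: true score variance of the sum score over total variance
omega = loadings.sum()**2 / (loadings.sum()**2 + uniquenesses.sum())
print(f"omega = {omega:.2f}")   # ~0.74 for these loadings
```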
Tangential things you should know… • Principal components analysis (PCA) is NOT factor analysis. When you run a PCA, you are NOT estimating the common factor model. • Situations where PCA is appropriate are quite rare in social and personality psychology. • The Pearson product-moment correlation is often NOT adequate for describing the relationships among item-level categorical variables! • When factor analyzing items, we should usually use something other than product-moment correlations. • One approach is to analyze polychoric correlations.
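The last point can be demonstrated with simulated data (Python/NumPy): dichotomizing two correlated continuous variables attenuates their Pearson correlation relative to the underlying latent correlation, which is what a polychoric correlation tries to recover.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
latent_r = 0.6

# Bivariate normal latent responses with correlation .6
cov = [[1.0, latent_r], [latent_r, 1.0]]
y1, y2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Dichotomize at thresholds, as with binary test items
x1 = (y1 > 0.0).astype(float)
x2 = (y2 > 1.0).astype(float)   # a more "extreme" item, higher threshold

# Pearson r for the binary items falls well below the latent r of .6
print(f"binary Pearson r = {np.corrcoef(x1, x2)[0, 1]:.2f}")
```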
Modern Psychometric Theory • Another approach that properly models item-level variables as categorical is Item Response Theory (IRT). • IRT represents a collection of models for relating individual items within a test or scale to the latent variable(s) they measure. • IRT leads to test scores with smaller measurement error than traditional item sums or means.
IRT • The properties of each item are summarized with an item characteristic curve (ICC). • The slope of the curve indicates item discrimination, i.e., the strength of relationship between the item and the latent construct. • The horizontal location of the curve indicates item difficulty or severity.
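For binary items, the standard two-parameter logistic (2PL) model writes the ICC as P(θ) = 1 / (1 + exp(−a(θ − b))), where a is discrimination and b is difficulty. A minimal sketch (Python/NumPy; the item parameters are hypothetical):

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(positive response | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# A highly discriminating, moderately difficult item (a = 2.0, b = 0.5)
theta = np.linspace(-3, 3, 7)
print(np.round(icc_2pl(theta, a=2.0, b=0.5), 2))
```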
Item characteristic curves (ICCs) for four binary items with equal discrimination but varying “difficulty.” • X-axis, “theta,” represents latent trait or construct. • Y-axis represents probability of a positive item response.
Item characteristic curves (ICCs) for four binary items with varying discrimination and varying difficulty. • Items 1 and 2 have stronger discrimination than 3 and 4. • Item 1 has the lowest difficulty, item 4 the highest.
A “test information function” • Shows precision of measurement as a function of latent trait level
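Under the 2PL model, each item contributes information I_j(θ) = a_j² · P_j(θ) · (1 − P_j(θ)), and the test information function is the sum over items. A minimal sketch reusing icc_2pl from above (hypothetical item parameters):

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def test_information(theta, a, b):
    """Sum of 2PL item information a^2 * P * (1 - P) over items."""
    p = icc_2pl(theta[:, None], a[None, :], b[None, :])
    return (a**2 * p * (1 - p)).sum(axis=1)

a = np.array([2.0, 1.5, 1.0, 0.8])    # discriminations
b = np.array([-1.0, 0.0, 0.5, 1.5])   # difficulties
theta = np.linspace(-3, 3, 7)

# Information peaks where the items' difficulties cluster
print(np.round(test_information(theta, a, b), 2))
```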
IRT scores • Scale scores constructed using IRT - take into account item discrimination, whereas simple sum (or mean) scores assume all items measure the construct equally well - have a proper interval scale of measurement, whereas simple sum scores are typically ordinal, strictly speaking - have measurement error that varies across the range of the construct, whereas simple sum scores assume a single reliability value for the whole range
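One common IRT scoring method is the expected a posteriori (EAP) score: the posterior mean of θ given the response pattern, computed by quadrature over a standard normal prior. A minimal sketch (Python/NumPy; the 2PL item parameters and response pattern are hypothetical):

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical 2PL parameters for a 4-item test, and one person's responses
a = np.array([2.0, 1.5, 1.0, 0.8])
b = np.array([-1.0, 0.0, 0.5, 1.5])
responses = np.array([1, 1, 0, 0])

# Quadrature grid over theta with a standard normal prior
theta = np.linspace(-4, 4, 161)
prior = np.exp(-0.5 * theta**2)

p = icc_2pl(theta[:, None], a[None, :], b[None, :])
likelihood = np.prod(p**responses * (1 - p)**(1 - responses), axis=1)

posterior = prior * likelihood
eap = (theta * posterior).sum() / posterior.sum()
print(f"EAP theta estimate = {eap:.2f}")
```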
The big picture • IRT was often presented as an alternative approach to test theory at odds with classical test theory (CTT). • Current perspective is that CTT and IRT complement and enhance each other. -For example, the mathematical link between IRT and factor analysis is now well understood. • A well validated test will still produce scores with measurement error. • Ideas from CTT, IRT, and structural equation modeling can be implemented to produce powerful results that account for measurement error, thus modeling relationships among the constructs themselves rather than the operational variables.