Composite scores

Paul K. Crane, MD MPH, and Dan M. Mungas, PhD

Disclaimer • Funding for this conference was made possible, in part by Grant R13 AG030995 from the National Institute on Aging. • The views expressed do not necessarily reflect the official policies of the Department of Health and Human Services; nor does mention by trade names, commercial practices, or organizations imply endorsement by the U.S. Government. • Drs. Harvey and Crane have no conflicts of interest to report.

Outline • Neuropsychological practice and the utility of z scores • Why composite scores? Drinking from the fire hose • Z scores head to head with IRT scores • Conclusions

Neuropsychological practice • Often focused on patterns of cognitive deficits across different domains • Useful in differential diagnosis of the cognitively impaired individual • May emphasize a premorbid estimate of ability • Multiple determinants, including occupation, educational attainment, military rank (for vets) • Vocabulary preserved in early AD, so it may be used as well

Scores and communication • Neuropsychological batteries contain many tests, each with a different scoring metric • Familiarity permits experts to understand what a Mattis score of 130, 4 errors on the Clock, and a Trails B time of 142 seconds imply about the examinee’s cognitive functioning • It is difficult to communicate these scores to less experienced colleagues

Clinical use of z scores • Z scores facilitate shorthand communication • Relatively easy to calculate • Calculation requires a reference mean and a reference standard deviation; should the norms be age-specific, race-specific, or education-specific? • The choice of norms may not matter much within an individual, unless the tests differ markedly in how strongly demographics affect them • Z scores make it much easier for individuals with less experience with the tests to identify domains with deficits
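The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a clinical tool: the normative means and SDs in `NORMS` are hypothetical placeholders (not published norms), the test names are made up for the example, and flipping the sign for timed tests so that negative always means worse is an assumed convention. The raw scores are the Mattis and Trails B values mentioned earlier.

```python
from dataclasses import dataclass


@dataclass
class Norm:
    mean: float
    sd: float
    higher_is_better: bool = True  # False for timed tests such as Trails B


# Hypothetical normative values, for illustration only (NOT published norms).
NORMS = {
    "mattis_total": Norm(mean=137.0, sd=5.0),
    "trails_b_seconds": Norm(mean=90.0, sd=35.0, higher_is_better=False),
}


def z_score(test: str, raw: float) -> float:
    """Distance from the normative mean in SD units; negative means worse."""
    norm = NORMS[test]
    z = (raw - norm.mean) / norm.sd
    return z if norm.higher_is_better else -z


# Raw scores on two very different metrics become directly comparable.
raw_scores = {"mattis_total": 130, "trails_b_seconds": 142}
z_profile = {test: round(z_score(test, raw), 2) for test, raw in raw_scores.items()}
print(z_profile)
```

Swapping in age-specific or education-specific norms would only change the entries of `NORMS`, which is why the choice of reference group is the main decision in this approach.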

Rationale for composite scores • Summary scores are very helpful for analyses • Better measurement properties together than any instrument on its own • Avoid problems from multiple hypotheses • True signal scenario with multiple bad tests: a few show p<0.05, a few don’t, looks like the work of chance • True signal scenario with one good test: may work • Depends on measurement properties, and whether the items pooled together measure the thing intended (dimensionality and validity issues)

Logical next step in the z score story • Very simple extension to average the z scores of the tests within a domain and use that average z score in analyses • Commonly done, even considered relatively sophisticated by study sections in 2008 • But: may not be the best thing to do from a psychometrics perspective
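A minimal sketch of the averaging step described above. The domain groupings, test names, and z scores are hypothetical values chosen only to show the mechanics.

```python
from statistics import mean

# Hypothetical z scores for one examinee, grouped by cognitive domain.
z_by_domain = {
    "executive": {"trails_b": -1.5, "letter_fluency": -0.5, "digits_backwards": -1.0},
    "memory": {"list_learning": -2.0, "delayed_recall": -2.5},
}


def domain_composites(z_scores):
    """Average the z scores of the tests within each domain (equal weights)."""
    return {domain: mean(tests.values()) for domain, tests in z_scores.items()}


composites = domain_composites(z_by_domain)
print(composites)
```

Every test enters the average with the same weight, which is exactly the assumption the following slides question.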

Assumptions of z scores • Each item / scale / test has equal weight on the overall score • Is letter fluency with 3 letters 1 test or 3 tests? This matters (1/n influence on total score, or 3/n influence on total score) • The scale is determined by the standard deviation • Highly variable items / scales / tests are weighted less • Less variable items / scales / tests are weighted more • Is this what we would want? • Wouldn’t we want to incorporate information about the relative difficulty of different tests?
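The two weighting points above can be made concrete with a short sketch. The SDs and the number of other tests are hypothetical; the functions only restate the arithmetic implied by an equally weighted average of z scores.

```python
def raw_point_weights(sds):
    """Change in an average-z composite per one raw-score point on each test.

    With n tests, one raw point on a test with standard deviation sd moves
    the composite by 1 / (n * sd), so more variable tests count for less.
    """
    n = len(sds)
    return {test: 1.0 / (n * sd) for test, sd in sds.items()}


def fluency_share(n_other_tests, fluency_counted_as):
    """Share of the composite carried by letter fluency."""
    return fluency_counted_as / (n_other_tests + fluency_counted_as)


# Hypothetical SDs: test_b is five times as variable as test_a.
weights = raw_point_weights({"test_a": 2.0, "test_b": 10.0})
print(weights)  # one raw point on test_a counts five times as much

# Fluency alongside 4 other tests, counted as 1 test versus 3 tests.
print(fluency_share(4, 1), fluency_share(4, 3))
```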

Linearity • Hidden in z scores is an assumption that a 1 SD difference in scores has the same meaning, in terms of the underlying domain measured by the test (usually “ability” for neuropsychological tests), in all regions of the score range • A z score is a transformed sum score • Tests constructed without modern psychometrics tend to have a common structure: most of the items are in the middle

Global cognitive tests

Curvilinear scaling

Curvilinearity in a longitudinal study • Where you start on the curve matters a great deal in how much change there appears to be
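One way to see why the starting point matters is to compute the expected sum score as a function of ability. The sketch below assumes a Rasch (one-parameter logistic) model and a hypothetical 20-item test with most items of middling difficulty; neither the model nor the item difficulties come from the slides.

```python
import math


def expected_score(theta, difficulties):
    """Expected sum score at ability theta under an assumed Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


# Hypothetical 20-item test: most items sit in the middle of the ability range.
difficulties = [-2.0] * 2 + [-0.5] * 4 + [0.0] * 8 + [0.5] * 4 + [2.0] * 2


def raw_change(start_theta, decline=1.0):
    """Expected raw-score drop for the same true decline in ability."""
    return (expected_score(start_theta, difficulties)
            - expected_score(start_theta - decline, difficulties))


mid_start = raw_change(0.5)   # starts where the items are dense
high_start = raw_change(3.5)  # starts near the ceiling of the test
print(round(mid_start, 2), round(high_start, 2))
```

Both hypothetical examinees decline by the same amount in ability, but the one who starts in the middle of the test loses several times as many raw points, so sum-score change (and any z score derived from it) looks very different.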

Linear scaling 1 • [Figure: ability axis running from low ability to high ability]

Linear scaling 2 • [Figure: items (X) placed along the ability axis, easy items toward low ability and difficult items toward high ability; item locations: XXXXXXXXX XXXX XXXXX XX]

Linear scaling 3 • [Figure: same item layout, with B0 and A0 marked on the ability axis]

Linear scaling 4 • [Figure: same item layout, with B0, A1, B1, and A0 marked on the ability axis]

Linear scaling 5 • [Figure: same item layout, with B0, A1, B1, and A0 marked; labels: 11 “at risk” points, 1 “at risk” point]

Linear scaling 6 • [Figure: same item layout, with z score tick marks from -2 to +3; Mean=7, SD=5]

Same example with a different population • [Figure: same item layout, with z score tick marks from -6 to +3; Mean=13, SD=2]
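The two sets of population values quoted on these slides (Mean=7, SD=5 and Mean=13, SD=2) can be plugged into the z score formula to see how the same raw score is relabeled. The raw scores chosen below are illustrative assumptions; the population labels are made up for the example.

```python
def z(raw, mean, sd):
    """Standard z score: distance from the population mean in SD units."""
    return (raw - mean) / sd


# Means and SDs as quoted on the two slides; labels are hypothetical.
populations = {"population_1": (7, 5), "population_2": (13, 2)}

for raw in (1, 7, 13, 19):  # illustrative raw scores
    row = {name: z(raw, m, s) for name, (m, s) in populations.items()}
    print(raw, row)
```

The test and the examinee's performance are unchanged, yet both the zero point and the size of a unit shift with the reference population.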

Same issue with Fluency • Let’s say that in a population the mean for /F/ is 12 words in 1 minute, SD 3 • Using a z score implies that the difference between 3 and 6 words means the same thing as the difference between 33 and 36 words (a 1 SD unit difference in each case) • 3: really awful. 6: pretty bad. • 33 and 36: both really good. Certainly 33 and 36 are not as qualitatively different as 3 and 6 are • Similarly, the difference between 3 and 12 (awful and average) is treated as the same as the difference between 36 and 45 (really good and superb) (3 SD units in each case)

Bias in the rate of change

Zero in z scores • The average score is 0 for each test • Weights for scores different from 0 are determined by the variability (in the form of the SD), not the relative difficulty of the test • Is this what we would want?

Summary: issues with z scores • Dimensionality: Should we lump these items / scales / tests together? • Equal weighting: Should each item / scale / test receive equal weight in the overall composite score? • Scales based on variability: Is it appropriate to base the scale on the observed SD in the population? • Equal difficulty: Are all of the items / scales / tests equally difficult? • Linearity: Is the relationship with the underlying construct measured by the test the same across the entire spectrum?

Z scores vs. IRT scores • IRT scores offer more flexibility; linear scaling • Weighting based on relative difficulties of different tests • (Different handling of demographic heterogeneity) • Facilitates specific attention to measurement error / precision
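To illustrate how an IRT score can take test difficulty into account, here is a minimal sketch of an expected a posteriori (EAP) ability estimate. It assumes a two-parameter logistic model, a standard normal prior, and hypothetical item parameters; the slides do not say which IRT model or estimator the authors used.

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def eap_score(responses, items, n_points=81):
    """EAP ability estimate and posterior SD on a grid from -4 to 4."""
    grid = [-4.0 + 8.0 * k / (n_points - 1) for k in range(n_points)]
    posterior = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for x, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            weight *= p if x == 1 else 1.0 - p
        posterior.append(weight)
    total = sum(posterior)
    est = sum(t * w for t, w in zip(grid, posterior)) / total
    var = sum((t - est) ** 2 * w for t, w in zip(grid, posterior)) / total
    return est, math.sqrt(var)


# Hypothetical (discrimination, difficulty) pairs for an easy and a hard test.
easy_test = [(1.0, b) for b in (-2.0, -1.5, -1.0, -0.5, 0.0)]
hard_test = [(1.0, b) for b in (0.0, 0.5, 1.0, 1.5, 2.0)]

# The same 3-out-of-5 result on each test.
theta_easy, se_easy = eap_score([1, 1, 1, 0, 0], easy_test)
theta_hard, se_hard = eap_score([1, 1, 1, 0, 0], hard_test)
print(round(theta_easy, 2), round(theta_hard, 2))
```

Three of five correct on the harder test yields a higher ability estimate than three of five on the easier test, and the posterior SD gives an explicit statement of precision, which a z score does not provide.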

2 head to head studies: Study 1 • FH 2005, in press at JINS • Executive functioning battery added to SENAS • Subset had MRI evaluations • Compared IRT to z scores head to head in terms of strength of relationship with neuroimaging parameters • Demographic heterogeneity

Conceptual model • [Figure: path diagram relating Demographics, Ability, items i_1 … i_n and i_{n+1} … i_{n+m} (labeled as items with DIF and items without DIF), the Composite Score, and MRI]

Findings • The strength of the relationship between the executive functioning composite and MRI was similar for IRT scores and for the composite z score • Accounting for heterogeneity in ages using adjusted z scores decreased the strength of the relationship • Accounting for ethnicity / language, education, and gender did not impair the strength of the relationship • Accounting for heterogeneity using IRT and DIF did not impact the strength of the relationship

Study 2 • Convenience sample with three known groups: AD, impaired cognition with no dementia, and normal cognition • Neuropsychological battery administered, including several measures of executive functioning

Digits backwards from the CASI • I think it’s items in the CASI • How to score these items? (not clear) • Does it matter? (absolutely)

Digits backwards • Score 1: more credit for 4 digits than 3 digits • Score 2: equal credit for 4 digits and 3 digits • Score 3: more credit for 4 digits than 3 digits, lots of points for both • Score 4: more credit for 4 digits than 3 digits, LOTS of points for both

Strength of relationship with cognitive impairment • IRT scores were at least as good as z scores • Demographic adjustment in the z score framework was a bad idea • Demographic adjustment in the IRT framework was not as bad an idea

Conclusions • There are many theoretical reasons why latent trait scores (such as IRT scores) would be preferred to classical test theory scores (such as z scores) • Presented here: 2 specific examples of the relative validity of executive functioning composites • IRT offers a theoretically and practically better approach to demographic heterogeneity
