Reliability & Validity
• More is Better
• Properties of a “good measure”: Standardization, Reliability, Validity
• Reliability: Inter-rater, Internal, External
• Validity: Criterion-related, Face, Content, Construct
Another bit of jargon…
In sampling…
• We are trying to represent a population of individuals
• We select participants
• The resulting sample of participants is intended to represent the population
In measurement…
• We are trying to represent a domain of behaviors
• We select items
• The resulting scale/test of items is intended to represent the domain
For both, “more is better”: more gives greater representation.
Whenever we’ve considered research designs and statistical conclusions, we’ve always been concerned with “sample size” • We know that larger samples (more participants) lead to ... • more reliable estimates of the mean, standard deviation, r, F & χ² • more reliable statistical conclusions • quantified as fewer Type I and II errors • The same principle applies to scale construction - “more is better” • but now it applies to the number of items comprising the scale • more (good) items lead to a better scale… • they more adequately represent the content/construct domain • they provide a more consistent total score (a respondent can change more items before the total changes much) -- see the sketch below
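The “more is better” point can be seen in a quick simulation. This is a minimal sketch, not from the slides: it assumes each item is the respondent’s true trait plus independent noise, and shows that totals built from more items track the trait more closely.

```python
# A minimal sketch (not from the slides): each item = true trait + noise;
# totals built from more items correlate more strongly with the true trait.
import numpy as np

rng = np.random.default_rng(0)
n_people = 500
trait = rng.normal(size=n_people)            # each person's "true" standing

for n_items in (3, 10, 30):
    items = trait[:, None] + rng.normal(scale=2.0, size=(n_people, n_items))
    total = items.sum(axis=1)
    r = np.corrcoef(trait, total)[0, 1]
    print(f"{n_items:2d} items: r(total, true trait) = {r:.2f}")
# the correlation climbs toward 1.0 as (good) items are added
```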
Desirable Properties of Psychological Measures
• Interpretability of Individual and Group Scores
• Population Norms
• Validity
• Reliability
• Standardization
Reliability (Agreement or Consistency) • Inter-rater or inter-observer reliability • do multiple observers/coders score an item the same way? • critical whenever using subjective measures • dependent upon standardization • Internal reliability -- do the items measure a central “thing”? • Cronbach’s alpha ranges from α = .00 to 1.00; higher is better • more strongly correlated items & more items yield a higher α (see the sketch below) • External reliability -- stability of scale/test scores over time • test-retest reliability -- correlate scores from the same test given 3-18 weeks apart • alternate-forms reliability -- correlate scores from two “versions” of the test
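Cronbach’s alpha is straightforward to compute from a respondents-by-items data matrix using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total). The sketch below is illustrative; the data matrix is simulated, not taken from these slides.

```python
# A minimal sketch of Cronbach's alpha; the data matrix here is simulated.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
trait = rng.normal(size=200)
data = trait[:, None] + rng.normal(scale=1.5, size=(200, 7))   # 7 related items
print(f"alpha = {cronbach_alpha(data):.2f}")
```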
Assessing internal reliability

Item   corrected item-total r   alpha if deleted
i1          .1454                     .65
i2          .2002                     .58
i3         -.2133                     .71
i4          .1882                     .59
i5          .1332                     .68
i6          .2112                     .56
i7          .1221                     .60
Coefficient Alpha = .58

• “Corrected item-total r” is the correlation between each item and a total comprised of all the other items • negative item-total correlations indicate either... • a very “poor” item • a reverse-keying problem • “Alpha if deleted” is what the alpha would be if that item were dropped • drop items whose alpha-if-deleted is larger than the scale’s alpha • Coefficient Alpha tells the internal reliability for this set of items • Usually do several “passes” rather than dropping several items at once.
Assessing internal reliability

Item   corrected item-total r   alpha if deleted
i1          .0854                     .65
i2          .2002                     .58
i3         -.2133                     .71
i4          .1882                     .59
i5          .0832                     .68
i6          .0712                     .56
i7          .0621                     .60
Coefficient Alpha = .58

• Pass #1 • All items with “-” item-total correlations are “bad” • check to see that they have been keyed correctly • if they have been correctly keyed -- drop them • i3 would likely be dropped
Assessing internal reliability

Item   corrected item-total r   alpha if deleted
i1          .0812                     .73
i2          .2202                     .68
i4          .1822                     .70
i5          .0877                     .74
i6          .2343                     .64
i7          .0621                     .78
Coefficient Alpha = .71

• Pass #2, etc. • Look for items with alpha-if-deleted values that are substantially higher than the scale’s alpha value • don’t drop too many at a time • probably i7 • probably not drop i1 & i5 • recheck on next “pass” • it is better to drop 1-2 items on each of several “passes”
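A rough code sketch of one such pass is below. The item names and data are hypothetical, and cronbach_alpha is the same helper as in the earlier sketch (re-defined compactly so this runs on its own); the idea is simply to print corrected item-total correlations and alpha-if-item-deleted, then flag candidates to drop.

```python
# A rough sketch of one item-analysis "pass": corrected item-total
# correlations and alpha-if-item-deleted, mirroring the tables above.
import numpy as np

def cronbach_alpha(items):
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                                / items.sum(axis=1).var(ddof=1))

def item_analysis(items):
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    for i in range(items.shape[1]):
        rest = total - items[:, i]                       # total without item i
        r_it = np.corrcoef(items[:, i], rest)[0, 1]      # corrected item-total r
        a_del = cronbach_alpha(np.delete(items, i, axis=1))
        print(f"i{i+1}: item-total r = {r_it:+.3f}, alpha if deleted = {a_del:.2f}")
    print(f"scale alpha = {cronbach_alpha(items):.2f}")

# flag items with a negative item-total r (check the keying!) or with an
# alpha-if-deleted clearly above the scale alpha, then re-run the pass
rng = np.random.default_rng(1)
trait = rng.normal(size=200)
good = trait[:, None] + rng.normal(scale=1.5, size=(200, 6))
bad = -trait[:, None] + rng.normal(scale=1.5, size=(200, 1))   # reverse-keyed item
item_analysis(np.hstack([good, bad]))
```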
Validity (Consistent Accuracy) • Criterion-related Validity -- does the test correlate with a “criterion”? • statistical -- requires a criterion that you “believe in” • predictive, concurrent, postdictive validity • Face Validity -- do the items “look like” they come from the “domain of interest”? • non-statistical -- judgment of the “target population” • Content Validity -- do the items come from the “domain of interest”? • non-statistical -- judgment of “experts in the field” • Construct Validity -- does the test relate to other measures as it should? • non-statistical -- does the measure match the theory of the construct? • statistical -- Discriminant validity • convergent validity -- correlates (+ or -) with selected tests as it should? • divergent validity -- r ≈ 0 with other, different tests as it should?
“Is the test valid?” • Jum Nunnally (one of the founders of modern psychometrics) claimed this was a “silly question”! The point wasn’t that tests shouldn’t be “valid,” but that a test’s validity must be assessed relative to… • the construct it is intended to measure • the population for which it is intended (e.g., age, level) • the application for which it is intended (e.g., for classifying folks into categories vs. assigning them quantitative values) • So, the real question is, “Is this test a valid measure of this construct for this population in this application?” • That question can be answered!
Criterion-related Validity Do the test scores correlate with criterion behavior scores? • concurrent -- test taken now “replaces” a criterion measured now • often the goal is to substitute a “shorter” or “cheaper” test • e.g., the written drivers test replaces the road test • predictive -- test taken now predicts a criterion measured later • we want to estimate what will happen before it does • e.g., your GRE score (taken now) predicts grad school performance (later) • postdictive -- test taken now captures behavior & affect from before • most of the behavior we study “has already happened” • e.g., adult memories of childhood feelings or medical history • Timeline: the test is taken now; when the criterion behavior occurs Before → postdictive, Now → concurrent, Later → predictive
Conducting a Predictive Validity Study • example -- a test designed to identify qualified “front desk personnel” for a major hotel chain -- 200 applicants and 20 position openings • A “proper” predictive validity study… • give each applicant the test (and “seal” the results) • give each applicant a job working at a front desk • assess work performance after 6 months (the criterion) • correlate the test (predictor) and work performance (criterion) • Anybody see why the chain might not be willing to apply this design?
Substituting concurrent validity for predictive validity • assess work performance of all folks currently doing the job • give them each the test • correlate the test (predictor) and work performance (criterion) • Problems? • Not working with the population of interest (applicants) • Range restriction -- work performance and test score variability are “restricted” by this approach • current hiring practice probably not “random” • good workers “move up” -- poor ones “move out” • Range restriction will artificially lower the validity coefficient (r)
What happens to the sample ... Applicant pool -- target population • Selected (hired) folks • assuming selection basis is somewhat reasonable/functional • Sample used in concurrent validity study • worst of those hired have been “released” • best of those hired have “changed jobs”
What happens to the validity coefficient (r)? [Scatterplot of the predictor (interview/measure) against the criterion (job performance): in the full applicant pool r = .75; in the restricted sample of hired folks actually used in the validity study r = .20]
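A toy simulation of that shrinkage is sketched below (all numbers invented, assuming selection on the predictor plus attrition on the criterion); it shows the direction of the effect, not the specific r = .75 / .20 values from the slide.

```python
# A toy simulation of range restriction: select on the predictor, then lose
# the worst (released) and best (moved up) performers, and the observed
# predictor-criterion correlation shrinks well below its value in the pool.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
predictor = rng.normal(size=n)
criterion = 0.75 * predictor + np.sqrt(1 - 0.75**2) * rng.normal(size=n)
r_pool = np.corrcoef(predictor, criterion)[0, 1]

hired = predictor > np.quantile(predictor, 0.90)            # top 10% get hired
q_lo, q_hi = np.quantile(criterion[hired], [0.20, 0.80])    # attrition at both ends
stayed = hired & (criterion > q_lo) & (criterion < q_hi)
r_restricted = np.corrcoef(predictor[stayed], criterion[stayed])[0, 1]

print(f"applicant pool r = {r_pool:.2f}, restricted sample r = {r_restricted:.2f}")
```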
Face Validity • Does the test “look like” a measure of the construct of interest? • “looks like” a measure of the desired construct to a member of the target population • will someone recognize the type of information they are responding to? • Possible advantage of face validity .. • If the respondent knows what information we are looking for, they can use that “context” to help interpret the questions and provide more useful, accurate answers • Possible limitation of face validity … • if the respondent knows what information we are looking for, they might try to “bend & shape” their answers to what they think we want -- “fake good” or “fake bad”
“Continuum of content expertise”: Target population members … Researchers … Content experts • Target population members assess Face Validity • Content experts assess Content Validity • Researchers should evaluate the validity evidence provided about the scale, rather than the scale items themselves -- unless they are truly content experts!
Content Validity • Does the test contain items from the desired “content domain”? • Based on assessment by “subject matter experts” (SMEs) in that content domain • Is especially important when a test is designed to have low face validity • e.g., tests of “honesty” used for hiring decisions • Is generally simpler for “achievement tests” than for “psychological constructs” (or other “less concrete” ideas) • e.g., it is a lot easier for “math experts” to agree whether or not an item should be on an algebra test than it is for “psychological experts” to agree whether or not an item should be on a measure of depression. • Content validity is not “tested for.” Rather, it is “assured” by the informed item selections made by experts in the domain.
Construct Validity • Does the test correspond with the theory of the construct & does it interrelate with other tests as a measure of this construct should? • We use the term construct to remind ourselves that many of the terms we use do not have an objective, concrete reality. • Rather, they are “made up” or “constructed” by us in our attempts to organize and make sense of behavior and other psychological processes • Attention to construct validity reminds us that our defense of the constructs we create is really based on the “whole package” of how the measures of different constructs relate to theory and to each other • So, construct validity “begins” with content validity (are these the right types of items?) and then adds the question, “Does this test relate as it should to other tests of similar and different constructs?”
The statistical assessment of Construct Validity … • Discriminant Validity • Does the test show the “right” pattern of interrelationships with other variables? -- has two parts • Convergent Validity-- test correlates with other measures of similar constructs • Divergent Validity -- test isn’t correlated with measures of “other, different constructs” • e.g., a new measure of depression should … • have “strong” correlations with other measures of “depression” • have negative correlations with measures of “happiness” • have “substantial” correlation with measures of “anxiety” • have “minimal” correlations with tests of “physical health”, “faking bad”, “self-evaluation”, etc.
Evaluate this measure of depression….

           NewDep  OldDep1  OldDep2   Anx   Happy  PhyHlth  FakBad
NewDep       --
OldDep1     .61       --
OldDep2     .49      .76       --
Anx         .43      .30      .28      --
Happy      -.59     -.61     -.56    -.75     --
PhyHlth     .60      .18      .22     .45   -.35      --
FakBad      .55      .14      .26     .10   -.21     .31      --

Identify the elements of discriminant validity being tested and state the “conclusion.”
Evaluate this measure of depression…. (the same correlation matrix, annotated)
• r(NewDep, OldDep1) = .61 and r(NewDep, OldDep2) = .49: convergent validity, though a bit lower than r(Dep1, Dep2) = .76
• NewDep is more correlated with Anx (.43) than OldDep1 or OldDep2 are (.30, .28)
• NewDep’s correlation with Happy (-.59) is about the same as OldDep1’s & OldDep2’s (-.61, -.56)
• NewDep is “too” correlated with PhyHlth (.60)
• NewDep is “too” correlated with FakBad (.55)
This pattern of results does not show strong discriminant validity!!
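In code, the discriminant-validity check is just a set of targeted correlations. The sketch below is illustrative only: the variables are simulated so that the pattern comes out the way a well-behaved new measure should, unlike the problematic pattern in the table above.

```python
# A hedged sketch of the discriminant-validity check: inspect how a new
# depression score correlates with convergent and divergent marker scales.
# Variable names mirror the table above; the data are simulated placeholders.
import numpy as np

rng = np.random.default_rng(3)
n = 300
dep = rng.normal(size=n)                          # latent depression
new_dep  = dep + rng.normal(scale=0.8, size=n)
old_dep1 = dep + rng.normal(scale=0.7, size=n)
happy    = -dep + rng.normal(scale=0.9, size=n)
phy_hlth = rng.normal(size=n)                     # should be nearly unrelated

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

print(f"convergent: r(new_dep, old_dep1) = {r(new_dep, old_dep1):+.2f}  (want large +)")
print(f"divergent:  r(new_dep, happy)    = {r(new_dep, happy):+.2f}  (want clearly -)")
print(f"divergent:  r(new_dep, phy_hlth) = {r(new_dep, phy_hlth):+.2f}  (want near 0)")
```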
Population Norms • In order to interpret a score from an individual or group, you must know what scores are typical for that population • Requires a large, representative sample of the target population • preferably random, researcher-selected & stratified • Requires solid standardization of both administration & scoring • Requires strong inter-rater reliability (if subjective items) • The result?? • A scoring distribution for the population • lets us identify “normal,” “high” and “low” scores • lets us identify “cutoff scores” to define important populations and subpopulations (e.g., 70 for MMPI & 80 for WISC)
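A small sketch of how norms get used in practice: convert a raw score to a standard (T) score against the norm distribution, get its percentile, and compare it to a cutoff. The norm values and the raw score are invented; the T ≥ 70 cutoff simply echoes the MMPI convention mentioned above.

```python
# A minimal sketch of using population norms (norm data simulated here):
# re-express a raw score as a T score and a percentile, then apply a cutoff.
import numpy as np

norm_sample = np.random.default_rng(4).normal(loc=50, scale=10, size=10_000)
norm_mean, norm_sd = norm_sample.mean(), norm_sample.std(ddof=1)

def t_score(raw):
    """T scale: mean 50, SD 10, relative to the norm sample."""
    return 50 + 10 * (raw - norm_mean) / norm_sd

def percentile(raw):
    return 100 * np.mean(norm_sample <= raw)

raw = 72
print(f"T = {t_score(raw):.0f}, percentile = {percentile(raw):.0f}, "
      f"above the T >= 70 cutoff: {t_score(raw) >= 70}")
```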
Desirable Properties of Psychological Measures
• Interpretability of Individual and Group Scores
• Population Norms: Scoring Distribution & Cutoffs
• Validity: Face, Content, Criterion-Related, Construct
• Reliability: Inter-rater, Internal Consistency, Test-Retest & Alternate Forms
• Standardization: Administration & Scoring