Chapter 3 Reliability and Objectivity
Chapter 3 Outline • Selecting a Criterion Score • Types of Reliability • Reliability Theory • Estimating Reliability – Intraclass R • Spearman-Brown Prophecy Formula • Standard Error of Measurement • Objectivity • Reliability of Criterion-referenced Tests • Reliability of Difference Scores
Objectivity • Interrater Reliability • Agreement of competent judges about the value of a measure.
Reliability • Dependability of scores • Consistency • Degree to which a test is free from measurement error.
Selecting a Criterion Score • Criterion score – the measure used to indicate a person’s ability. • Can be based on the mean score or the best score. • Mean Score – average of all trials. • Usually a more reliable estimate of a person’s true ability. • Best Score – optimal score a person achieves on any one trial. • May be used when the criterion score is to be used as an indicator of maximum possible performance.
Potential Methods to Select a Criterion Score • Mean of all trials. • Best score of all trials. • Mean of selected trials based on trials on which group scored best. • Mean of selected trials based on trials on which individual scored best (i.e., omit outliers). Appropriate method to use depends on the situation.
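A minimal Python sketch (trial scores invented, not from the chapter) showing two of the criterion-score methods listed above:

# Minimal sketch: criterion score from several trials, using two of the methods above.
scores = [12.1, 12.5, 11.8, 12.9]   # one person's scores on four trials (invented)

mean_criterion = sum(scores) / len(scores)   # mean of all trials
best_criterion = max(scores)                 # best (optimal) score of any one trial

print(mean_criterion)   # 12.325
print(best_criterion)   # 12.9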
Norm-referenced Test • Designed to reflect individual differences.
In Norm-referenced Framework • Reliability - ability to detect reliable differences between subjects.
Types of Reliability • Stability • Internal Consistency
Stability (Test-retest) Reliability • Each subject is measured with same instrument on two or more different days. • Scores are then correlated. • An intraclass correlation should be used.
Internal Consistency Reliability • Consistent rate of scoring throughout a test or from trial to trial. • All trials are administered in a single day. • Trial scores are then correlated. • An intraclass correlation should be used.
Sources of Measurement Error • Lack of agreement among raters (i.e., objectivity). • Lack of consistent performance by person. • Failure of instrument to measure consistently. • Failure of tester to follow standardized procedures.
Reliability Theory
X = T + E
Observed score = True score + Error
σ²X = σ²T + σ²E
Observed score variance = True score variance + Error variance
Reliability = σ²T ÷ σ²X
Reliability = (σ²X − σ²E) ÷ σ²X
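A minimal numeric sketch (variances invented) of the decomposition above:

# Minimal sketch: reliability as the ratio of true score variance to observed score variance.
true_score_variance = 36.0
error_variance = 4.0
observed_variance = true_score_variance + error_variance              # 40.0

reliability = true_score_variance / observed_variance                 # 0.90
same_value = (observed_variance - error_variance) / observed_variance # 0.90
print(reliability, same_value)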
Reliability depends on: • Decreasing measurement error • Detecting individual differences among people • ability to discriminate among different ability levels
Reliability • Ranges from 0 to 1.00 • When R = 0, there is no reliability. • When R = 1.00, there is maximum reliability.
Reliability from Intraclass R • ANOVA is used to partition the variance of a set of scores. • Parts of the variance are used to calculate the intraclass R.
Estimating Reliability • Intraclass correlation from one-way ANOVA: • R = (MSA – MSW) ÷ MSA • MSA = Mean square among subjects (also called between subjects) • MSW = Mean square within subjects • Mean square = variance estimate • This represents reliability of the mean test score for each person.
Estimating Reliability • Intraclass correlation from two-way ANOVA: • R = (MSA – MSR) ÷ MSA • MSA = Mean square among subjects (also called between subjects) • MSR = Mean square residual • Mean square = variance estimate • Used when trial-to-trial variance is not considered measurement error (e.g., Likert-type scale).
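A minimal Python sketch (data invented) computing the ANOVA mean squares and both intraclass R formulas above for n people measured on k trials:

# Minimal sketch: one-way and two-way intraclass R from ANOVA mean squares.
data = [
    [10, 11, 10],   # person 1, trials 1-3 (invented scores)
    [14, 15, 15],   # person 2
    [12, 12, 13],   # person 3
    [ 9, 10,  9],   # person 4
]
n, k = len(data), len(data[0])
grand_mean = sum(sum(row) for row in data) / (n * k)

person_means = [sum(row) / k for row in data]
trial_means = [sum(row[j] for row in data) / n for j in range(k)]

ss_among = k * sum((m - grand_mean) ** 2 for m in person_means)
ss_within = sum((x - person_means[i]) ** 2 for i, row in enumerate(data) for x in row)
ss_trials = n * sum((m - grand_mean) ** 2 for m in trial_means)
ss_residual = ss_within - ss_trials

ms_among = ss_among / (n - 1)
ms_within = ss_within / (n * (k - 1))
ms_residual = ss_residual / ((n - 1) * (k - 1))

r_one_way = (ms_among - ms_within) / ms_among     # trial-to-trial variance treated as error
r_two_way = (ms_among - ms_residual) / ms_among   # trial-to-trial variance removed from error
print(round(r_one_way, 3), round(r_two_way, 3))   # 0.98 and 0.985 for these invented data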
What is acceptable reliability? • Depends on: • age • gender • experience of people tested • size of reliability coefficients others have obtained • number of days or trials • stability vs. internal consistency coefficient
What is acceptable reliability? • Most physical measures are stable from day to day. • Expect test-retest Rxx between .80 and .95. • Expect lower Rxx for tests with an accuracy component (e.g., .70). • For written tests, want Rxx > .70. • For psychological instruments, want Rxx > .70. • Critical issue: the time interval between the two test sessions for stability reliability estimates; 1 to 3 days apart for physical measures is usually appropriate.
Factors Affecting Reliability • Type of test. • Maximum effort test – expect Rxx ≥ .80 • Accuracy-type test – expect Rxx ≥ .70 • Psychological inventories – expect Rxx ≥ .70 • Range of ability. • Rxx is higher for heterogeneous groups than for homogeneous groups. • Test length. • Longer test, higher Rxx.
Factors Affecting Reliability • Scoring accuracy. • Person administering the test must be competent. • Test difficulty. • Test must discriminate among ability levels. • Test environment, organization, and instructions. • Conditions should be favorable to good performance; people should be motivated to do well, ready to be tested, and know what to expect.
Factors Affecting Reliability • Fatigue • decreases Rxx • Practice trials • increase Rxx
Coefficient Alpha • AKA Cronbach’s alpha • Most widely used with attitude instruments • Same as two-way intraclass R through ANOVA • An estimate of Rxx of a criterion score that is the sum of trial scores in one day
Coefficient Alpha Ralpha = [K ÷ (K − 1)] × [(S²x − ΣS²trials) ÷ S²x] • K = number of trials or items • S²x = variance of the criterion score (sum of all trials) • ΣS²trials = sum of the variances for all trials
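A minimal Python sketch (scores invented) applying the coefficient alpha formula above to k trial/item scores per person:

# Minimal sketch: coefficient alpha from the formula above, using population variances.
data = [
    [4, 5, 4],   # person 1, items/trials 1-3 (invented scores)
    [2, 3, 2],   # person 2
    [5, 5, 4],   # person 3
    [3, 3, 3],   # person 4
]
n, k = len(data), len(data[0])

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

totals = [sum(row) for row in data]                                    # criterion score = sum of trials
s2_x = variance(totals)                                                # variance of criterion scores
sum_s2_trials = sum(variance([row[j] for row in data]) for j in range(k))

alpha = (k / (k - 1)) * ((s2_x - sum_s2_trials) / s2_x)
print(round(alpha, 3))   # about 0.96 for these invented data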
Kuder-Richardson (KR) • Estimate of internal consistency reliability by determining how all items on a test relate to the total test. • KR formulas 20 and 21 are typically used to estimate Rxx of knowledge tests. • Used with dichotomous items (scored as right or wrong). • KR20 is equivalent to coefficient alpha for dichotomous items.
KR20 • KR20 = [K ÷ (K − 1)] × [(S²x − Σpq) ÷ S²x] • K = number of items • S²x = variance of test scores • p = proportion answering the item right • q = proportion answering the item wrong • Σpq = sum of the pq products for all K items
KR20 Example

Item    p      q      pq
1       .50    .50    .25
2       .25    .75    .1875
3       .80    .20    .16
4       .90    .10    .09
                      Σpq = .6875

If Mean = 2.45 and SD = 1.2, what is KR20?
KR20 = (4/3) × [(1.44 − .6875) ÷ 1.44]
KR20 = .70
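A minimal Python sketch checking the KR20 worked example above:

# Minimal sketch: KR20 for the four-item example above.
p = [0.50, 0.25, 0.80, 0.90]          # proportion answering each item right
q = [1 - pi for pi in p]              # proportion answering each item wrong
k = len(p)
s2_x = 1.2 ** 2                       # SD = 1.2, so variance = 1.44

sum_pq = sum(pi * qi for pi, qi in zip(p, q))        # 0.6875
kr20 = (k / (k - 1)) * ((s2_x - sum_pq) / s2_x)
print(round(kr20, 2))                                # 0.70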
KR21 • If we assume all test items are equally difficult, KR20 can be simplified to KR21: KR21 = [(K × S²) − (Mean × (K − Mean))] ÷ [(K − 1) × S²] • K = number of items • S² = variance of the test • Mean = mean of the test
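A minimal Python sketch applying the KR21 formula to the same hypothetical statistics (K = 4, Mean = 2.45, SD = 1.2); because KR21 assumes equally difficult items, which does not hold for that example, the value differs from the KR20 of .70:

# Minimal sketch: KR21 with the hypothetical test statistics from the KR20 example.
k, mean, s2 = 4, 2.45, 1.2 ** 2
kr21 = ((k * s2) - (mean * (k - mean))) / ((k - 1) * s2)
print(round(kr21, 2))   # about 0.45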
Equivalence Reliability (Parallel Forms) • Two equivalent forms of a test are administered to same subjects. • Scores on the two forms are then correlated.
Spearman-Brown Prophecy formula • Used to estimate rxx of a test that is changed in length. • rkk = (k x r11) ÷ [1 + (k - 1)(r11)] • k = number of times test is changed in length. • k = (# trials want) ÷ (# trials have) • r11 = reliability of test you’re starting with • Spearman-Brown formula will give an estimate of maximum reliability that can be expected (upper bound estimate).
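A minimal Python sketch (values invented) applying the Spearman-Brown formula above, e.g., a 10-trial test with r11 = .70 lengthened to 20 trials:

# Minimal sketch: Spearman-Brown estimate for a test doubled in length.
r11 = 0.70
k = 20 / 10                                # (# trials wanted) / (# trials have) = 2
r_kk = (k * r11) / (1 + (k - 1) * r11)
print(round(r_kk, 2))                      # 0.82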
Standard Error of Measurement (SEM) • Degree you expect a test score to vary due to measurement error. • The standard deviation of the measurement error in a test score. • SEM = Sx × √(1 − Rxx) • Sx = standard deviation of the group • Rxx = reliability coefficient • Small SEM indicates high reliability.
SEM • Example: written test with Sx = 5 and Rxx = .88 • SEM = 5 × √(1 − .88) = 1.73 • Confidence intervals: 68% = X ± 1.00 (SEM); 95% = X ± 1.96 (SEM) • If X = 23: 23 + 1.73 = 24.73 and 23 − 1.73 = 21.27 • 68% confident the true score is between 21.27 and 24.73.
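A minimal Python sketch reproducing the SEM example above, with the 95% band added for comparison:

# Minimal sketch: SEM and confidence bands for the written-test example.
from math import sqrt

s_x, r_xx, observed = 5.0, 0.88, 23.0
sem = s_x * sqrt(1 - r_xx)                                   # about 1.73
band_68 = (observed - 1.00 * sem, observed + 1.00 * sem)     # about (21.27, 24.73)
band_95 = (observed - 1.96 * sem, observed + 1.96 * sem)     # about (19.61, 26.39)
print(round(sem, 2), band_68, band_95)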
Objectivity (Rater Reliability) • Degree of agreement between raters. • Depends on: • clarity of scoring system. • degree to which judge can assign scores accurately. • If test is highly objective, objectivity is obvious and rarely calculated. • As subjectivity increases, test developer should report estimate of objectivity.
Two Types of Objectivity: • Intrajudge objectivity • consistency in scoring when a test user scores the same test two or more times. • Interjudge objectivity • consistency between two or more independent judgments of the same performance. • Calculate objectivity like reliability, but substitute judges’ scores for trial scores.
Criterion-referenced Test • A test used to classify a person as proficient or nonproficient (pass or fail).
In Criterion-referenced Framework: • Reliability - defined as consistency of classification.
Reliability of Criterion-referenced Test Scores • To estimate reliability, a double-classification or contingency table is formed.
Contingency Table (Double-classification Table)

              Day 2
              Pass   Fail
Day 1  Pass   A      B
       Fail   C      D
Proportion of Agreement (Pa) • Most popular way to estimate Rxx of CRT. • Pa = (A + D) ÷ (A + B + C + D) • Pa does not take into account that some consistent classifications could happen by chance.
Example for calculating Pa

              Day 2
              Pass   Fail
Day 1  Pass   45     12
       Fail   8      35

Pa = (A + D) ÷ (A + B + C + D)
Pa = (45 + 35) ÷ (45 + 12 + 8 + 35)
Pa = 80 ÷ 100 = .80
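A minimal Python sketch computing Pa for the 2 × 2 table above:

# Minimal sketch: proportion of agreement from the contingency table cells.
a, b, c, d = 45, 12, 8, 35        # pass/pass, pass/fail, fail/pass, fail/fail
pa = (a + d) / (a + b + c + d)
print(pa)                          # 0.8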
Kappa Coefficient (K) • Estimate of CRT Rxx with correction for chance agreements. • K = (Pa − Pc) ÷ (1 − Pc) • Pa = proportion of agreement • Pc = proportion of agreement expected by chance • Pc = [(A + B)(A + C) + (C + D)(B + D)] ÷ (A + B + C + D)²
Example for calculating K

              Day 2
              Pass   Fail
Day 1  Pass   45     12
       Fail   8      35

K = (Pa − Pc) ÷ (1 − Pc)
Pa = .80
Pc = [(A + B)(A + C) + (C + D)(B + D)] ÷ (A + B + C + D)²
Pc = [(45 + 12)(45 + 8) + (8 + 35)(12 + 35)] ÷ (100)²
Pc = [(57)(53) + (43)(47)] ÷ 10,000 = 5,042 ÷ 10,000
Pc = .5042
K = (.80 − .5042) ÷ (1 − .5042)
K = .597
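A minimal Python sketch reproducing the kappa calculation above:

# Minimal sketch: kappa for the same 2 x 2 table, matching the hand calculation.
a, b, c, d = 45, 12, 8, 35
n = a + b + c + d
pa = (a + d) / n                                            # 0.80
pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2       # 0.5042
kappa = (pa - pc) / (1 - pc)
print(round(kappa, 3))                                      # 0.597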
Modified Kappa (Kq) • Kq may be more appropriate than K when proportion of people passing a criterion-referenced test is not predetermined. • Most situations in exercise science do not predetermine the number of people who will pass.