
Chapter 3 Reliability and Objectivity



  1. Chapter 3 Reliability and Objectivity

  2. Chapter 3 Outline • Types of Reliability • Reliability Theory • Estimating Reliability – Intraclass R • Spearman-Brown Prophecy Formula • Standard Error of Measurement • Objectivity • Reliability of Criterion-referenced Tests

  3. Objectivity • Interrater Reliability • Agreement of competent judges about the value of a measure.

  4. Reliability • Dependability of scores • Consistency • Degree to which a test is free from measurement error.

  5. Norm-referenced Test • Designed to reflect individual differences.

  6. In a Norm-referenced Framework • Reliability: the ability to detect reliable differences between subjects.

  7. Types of Reliability • Stability • Internal Consistency

  8. Stability (Test-retest) Reliability • Each subject is measured with same instrument on two or more different days. • Scores are then correlated.
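A minimal sketch of the test-retest idea, using invented day-1 and day-2 scores (not data from the chapter): the stability coefficient is simply the correlation between the two days.

```python
# Test-retest (stability) reliability sketch: correlate each subject's
# day-1 score with the same subject's day-2 score. Scores are invented
# for illustration only.
import numpy as np

day1 = np.array([12, 15, 9, 20, 14, 17, 11, 16], dtype=float)
day2 = np.array([13, 14, 10, 19, 15, 18, 10, 17], dtype=float)

r_stability = np.corrcoef(day1, day2)[0, 1]
print(f"Test-retest r = {r_stability:.2f}")
```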

  9. Internal Consistency Reliability • Consistent rate of scoring throughout a test. • All trials are administered in a single day. • Trial scores are then correlated.

  10. Reliability Theory • X = T + E (Observed score = True score + Error) • σ²X = σ²t + σ²e (Observed score variance = True score variance + Error variance) • Reliability = σ²t ÷ σ²X • Reliability = (σ²X − σ²e) ÷ σ²X
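A toy simulation of the X = T + E model, with arbitrary variance choices (σ²t = 64, σ²e = 16), illustrates that reliability works out to true-score variance divided by observed-score variance.

```python
# Toy simulation of X = T + E: observed scores are true scores plus
# independent error, so reliability is about sigma_t^2 / sigma_X^2.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_scores = rng.normal(loc=50, scale=8, size=n)   # sigma_t^2 = 64
errors = rng.normal(loc=0, scale=4, size=n)         # sigma_e^2 = 16
observed = true_scores + errors                     # sigma_X^2 is about 80

reliability = true_scores.var() / observed.var()
print(f"Simulated reliability = {reliability:.2f} (theory: 64/80 = .80)")
```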

  11. Sources of Measurement Error • Lack of agreement among raters (i.e., objectivity). • Lack of consistent performance by person. • Failure of instrument to measure consistently. • Failure of tester to follow standardized procedures.

  12. Reliability depends on: • Decreasing measurement error • Detecting individual differences among people • ability to discriminate among different ability levels

  13. Reliability from Intraclass R • ANOVA is used to partition the variance of a set of scores. • Parts of the variance are used to calculate the intraclass R.

  14. Estimating Reliability • Intraclass correlation from one-way ANOVA: • R = (MSA − MSW) ÷ MSA • MSA = mean square among subjects (also called between subjects) • MSW = mean square within subjects • A mean square is a variance estimate. • R represents the reliability of the mean test score for each person.
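A small sketch of this calculation with invented trial scores (rows = subjects, columns = trials): the variance is partitioned with a one-way ANOVA and R = (MSA − MSW) ÷ MSA.

```python
# One-way intraclass R sketch: partition variance into among-subjects
# and within-subjects mean squares, then R = (MS_A - MS_W) / MS_A.
# The scores are invented for illustration.
import numpy as np

scores = np.array([
    [10.0, 11.0],
    [14.0, 13.0],
    [ 8.0,  9.0],
    [12.0, 12.0],
    [15.0, 16.0],
])
n, k = scores.shape

grand_mean = scores.mean()
subject_means = scores.mean(axis=1)

ss_among = k * np.sum((subject_means - grand_mean) ** 2)    # among (between) subjects
ss_within = np.sum((scores - subject_means[:, None]) ** 2)  # within subjects

ms_among = ss_among / (n - 1)
ms_within = ss_within / (n * (k - 1))

R = (ms_among - ms_within) / ms_among
print(f"Intraclass R (reliability of the mean of {k} trials) = {R:.3f}")
```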

  15. Sample SPSS One-way Reliability Analysis

                       Mean     Std Dev   Cases
  1. MODPU1            6.8710   4.8128    62.0
  2. MODPU2            7.2581   4.8652    62.0

  Analysis of Variance
  Source of Variation     Sum of Sq.    DF   Mean Square        F    Prob.
  Between People           2818.4839    61       46.2047
  Within People              43.0000    62        .6935
    Between Measures          4.6452     1        4.6452    7.3877   .0085
    Residual                 38.3548    61        .6288
  Total                    2861.4839   123       23.2641
  Grand Mean = 7.0645

  Intraclass Correlation Coefficient
  One-way random effect model: People Effect Random
  Single Measure Intraclass Correlation  = .9704
    95.00% C.I.: Lower = .9515  Upper = .9820
    F = 66.6207  DF = (61, 62.0)  Sig. = .0000 (Test Value = .0000)
  Average Measure Intraclass Correlation = .9850
    95.00% C.I.: Lower = .9752  Upper = .9909
    F = 66.6207  DF = (61, 62.0)  Sig. = .0000 (Test Value = .0000)

  16. Sample SPSS Two-way Reliability Analysis

                       Mean     Std Dev   Cases
  1. V1                5.5552   2.0231    571.0
  2. V2                5.2907   2.0051    571.0
  3. V3                4.7636   2.0499    571.0
  4. V4                5.1384   2.0614    571.0
  5. V5                4.9124   2.1241    571.0
  6. V6                4.2084   2.1692    571.0
  7. V7                4.3660   2.2667    571.0

  Analysis of Variance
  Source of Variation     Sum of Sq.    DF   Mean Square        F    Prob.
  Between People          12325.7933   570       21.6242
  Within People            6109.4286  3426        1.7833
    Between Measures         810.9942     6      135.1657  87.2459   .0000
    Residual                5298.4343  3420        1.5492
  Total                   18435.2219  3996        4.6134
  Grand Mean = 4.8907

  Two-Way Random Effect Model (Consistency Definition): People and Measure Effect Random
  Single Measure Intraclass Correlation  = .6493
    95.00% C.I.: Lower = .6185  Upper = .6799
    F = 13.9579  DF = (570, 3420.0)  Sig. = .0000 (Test Value = .0000)
  Average Measure Intraclass Correlation = .9284
    95.00% C.I.: Lower = .9190  Upper = .9370
    F = 13.9579  DF = (570, 3420.0)  Sig. = .0000 (Test Value = .0000)
  Alpha = .9284

  17. What is acceptable reliability? • Depends on: • age • gender • experience of people tested • size of reliability coefficients others have obtained • number of days or trials • stability vs. internal consistency coefficient

  18. What is acceptable reliability? • Most physical measures are stable from day to day. • Expect test-retest Rxx between .80 and .95. • Expect lower Rxx for tests with an accuracy component (e.g., .70). • For a written test, want Rxx > .70. • For psychological instruments, want Rxx > .70. • Critical issue: the time interval between the two test sessions (1 to 3 days apart for physical measures).

  19. Factors Affecting Reliability • Type of test. • Maximum-effort test: expect Rxx of .80 or higher • Accuracy-type test: expect Rxx of about .70 • Psychological inventories: expect Rxx of .70 or higher • Range of ability. • Rxx is higher for heterogeneous groups than for homogeneous groups. • Test length. • Longer test, higher Rxx.

  20. Factors Affecting Reliability • Scoring accuracy. • The person administering the test must be competent. • Test difficulty. • The test must discriminate among ability levels. • Test environment, organization, and instructions. • The setting should be favorable to good performance, and people should be motivated to do well, ready to be tested, and know what to expect.

  21. Factors Affecting Reliability • Fatigue • decreases Rxx • Practice trials • increase Rxx

  22. Coefficient Alpha • AKA Cronbach’s alpha • Most widely used with attitude instruments • Same as two-way intraclass R through ANOVA • An estimate of Rxx of a criterion score that is the sum of trial scores in one day

  23. Coefficient Alpha Rα = [K / (K − 1)] × [(S²x − S²trials) / S²x] • K = number of trials or items • S²x = variance of the criterion score (sum of all trials) • S²trials = sum of the variances of the individual trials
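A short sketch of this formula with invented trial scores (rows = people, columns = trials); the same value would come out of the two-way ANOVA route mentioned on the previous slide.

```python
# Coefficient alpha sketch: alpha = [K/(K-1)] * [(S2_x - sum of trial
# variances) / S2_x], where S2_x is the variance of the summed score.
# Trial scores are invented for illustration.
import numpy as np

trials = np.array([     # rows = people, columns = trials/items
    [4, 5, 4],
    [2, 3, 3],
    [5, 5, 4],
    [3, 2, 2],
    [4, 4, 5],
], dtype=float)

k = trials.shape[1]
criterion = trials.sum(axis=1)                  # criterion score = sum of trials
s2_x = criterion.var(ddof=1)                    # variance of the criterion score
s2_trials = trials.var(axis=0, ddof=1).sum()    # sum of the individual trial variances

alpha = (k / (k - 1)) * ((s2_x - s2_trials) / s2_x)
print(f"Coefficient alpha = {alpha:.3f}")
```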

  24. Kuder-Richardson (KR) • Estimate of internal consistency reliability by determining how all items on a test relate to the total test. • KR formulas 20 and 21 are typically used to estimate Rxx of knowledge tests. • Used with dichotomous items (scored as right or wrong). • KR20 = coefficient alpha

  25. KR20 • KR20 = [K / (K − 1)] × [(S²x − Σpq) / S²x] • K = number of trials or items • S²x = variance of test scores • p = proportion answering the item correctly • q = proportion answering the item incorrectly • Σpq = sum of the pq products for all K items

  26. KR20 Example

  Item    p     q     pq
    1    .50   .50   .25
    2    .25   .75   .1875
    3    .80   .20   .16
    4    .90   .10   .09
                     Σpq = 0.6875

  If Mean = 2.45 and SD = 1.2 (so S²x = 1.44), what is KR20?
  KR20 = (4/3) × (1.44 − 0.6875) / 1.44
  KR20 = .70
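The worked example can be reproduced in a few lines; the p values, mean, and SD are taken from the slide above.

```python
# KR20 for the 4-item example above: p values from the slide, SD = 1.2.
p = [0.50, 0.25, 0.80, 0.90]
q = [1 - pi for pi in p]
sum_pq = sum(pi * qi for pi, qi in zip(p, q))   # 0.6875

k = len(p)
s2_x = 1.2 ** 2                                 # variance = SD squared = 1.44

kr20 = (k / (k - 1)) * ((s2_x - sum_pq) / s2_x)
print(f"sum(pq) = {sum_pq:.4f}, KR20 = {kr20:.2f}")   # KR20 is about .70
```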

  27. KR21 • If we assume all test items are equally difficult, KR20 simplifies to KR21: KR21 = [(K × S²) − (Mean × (K − Mean))] ÷ [(K − 1) × S²] • K = number of trials or items • S² = variance of the test • Mean = mean of the test
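Applying KR21 to the same totals as the KR20 example (K = 4, Mean = 2.45, S² = 1.44) is a one-liner; because that example's items are not equally difficult, KR21 comes out lower than the KR20 of .70, as expected.

```python
# KR21 sketch using the totals from the KR20 example above.
k, mean, s2 = 4, 2.45, 1.44

kr21 = (k * s2 - mean * (k - mean)) / ((k - 1) * s2)
print(f"KR21 = {kr21:.2f}")   # lower than KR20 because item difficulties differ
```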

  28. Equivalence Reliability (Parallel Forms) • Two equivalent forms of a test are administered to same subjects. • Scores on the two forms are then correlated.

  29. Spearman-Brown Prophecy Formula • Used to estimate the rxx of a test that is changed in length. • rkk = (k × r11) ÷ [1 + (k − 1) × r11] • k = factor by which the test is changed in length: k = (# trials wanted) ÷ (# trials on hand) • r11 = reliability of the test you are starting with • The Spearman-Brown formula gives an estimate of the maximum reliability that can be expected (an upper-bound estimate).
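A small sketch of the prophecy formula; the starting reliability and the lengthening factor below are invented numbers for illustration.

```python
# Spearman-Brown prophecy formula: projected reliability when a test is
# changed to k times its current length. Example values are invented.
def spearman_brown(r11: float, k: float) -> float:
    """Estimated reliability of a test lengthened (or shortened) by factor k."""
    return (k * r11) / (1 + (k - 1) * r11)

# e.g., a 2-trial test with r11 = .70 lengthened to 6 trials (k = 6/2 = 3)
print(f"r_kk = {spearman_brown(0.70, 3):.2f}")   # upper-bound estimate, about .88
```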

  30. Standard Error of Measurement (SEM) • The degree to which you expect a test score to vary because of measurement error. • The standard deviation of measurement errors around a person's true score. • SEM = Sx × √(1 − Rxx) • Sx = standard deviation of the group • Rxx = reliability coefficient • A small SEM indicates high reliability.

  31. SEM • Example (written test): Sx = 5, Rxx = .88 • SEM = 5 × √(1 − .88) = 5 × √.12 = 1.73 • Confidence intervals: 68% = X ± 1.00 (SEM); 95% = X ± 1.96 (SEM) • If X = 23: 23 + 1.73 = 24.73 and 23 − 1.73 = 21.27 • We are 68% confident the true score is between 21.27 and 24.73.
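The slide's numbers can be checked directly:

```python
# SEM example from the slide: Sx = 5, Rxx = .88, observed score X = 23.
import math

s_x, r_xx, x = 5.0, 0.88, 23.0
sem = s_x * math.sqrt(1 - r_xx)                   # about 1.73

lo68, hi68 = x - 1.00 * sem, x + 1.00 * sem       # 68% band
lo95, hi95 = x - 1.96 * sem, x + 1.96 * sem       # 95% band
print(f"SEM = {sem:.2f}")
print(f"68% interval: {lo68:.2f} to {hi68:.2f}")
print(f"95% interval: {lo95:.2f} to {hi95:.2f}")
```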

  32. Objectivity (Rater Reliability) • Degree of agreement between raters. • Depends on: • clarity of scoring system. • degree to which judge can assign scores accurately. • If test is highly objective, objectivity is obvious and rarely calculated. • As subjectivity increases, test developer should report estimate of objectivity.

  33. Two Types of Objectivity • Intrajudge objectivity • consistency in scoring when the same test user scores the same test two or more times. • Interjudge objectivity • consistency between two or more independent judgments of the same performance. • Calculate objectivity like reliability, but substitute judges' scores for trials.

  34. Reliability of Criterion-referenced Test Scores • Person is classified as proficient or nonproficient (pass or fail). • Reliability is defined as consistency of classification. • To estimate reliability, a double-classification or contingency table is formed.

  35. Contingency Table (Double-classification Table)

                       Day 2
                    Pass   Fail
  Day 1   Pass        A      B
          Fail        C      D

  36. Proportion of Agreement (Pa) • Most popular way to estimate Rxx of CRT. • Pa = (A + D) ÷ (A + B + C + D) • Pa does not take into account that some consistent classifications could happen by chance.

  37. Example for Calculating Pa

                       Day 2
                    Pass   Fail
  Day 1   Pass       45     12
          Fail        8     35

  38.
                       Day 2
                    Pass   Fail
  Day 1   Pass       45     12
          Fail        8     35

  Pa = (A + D) ÷ (A + B + C + D)
  Pa = (45 + 35) ÷ (45 + 12 + 8 + 35)
  Pa = 80 ÷ 100 = .80
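The same arithmetic in code, using the cell counts from the table above:

```python
# Proportion of agreement for the table above: A = 45, B = 12, C = 8, D = 35.
a, b, c, d = 45, 12, 8, 35

pa = (a + d) / (a + b + c + d)
print(f"Pa = {pa:.2f}")   # .80
```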

  39. Kappa Coefficient (K) • Estimate of CRT Rxx with correction for chance agreements. • K = (Pa − Pc) ÷ (1 − Pc) • Pa = Proportion of Agreement • Pc = Proportion of Agreement expected by chance • Pc = [(A+B)(A+C) + (C+D)(B+D)] ÷ (A+B+C+D)²

  40. Example for Calculating K

                       Day 2
                    Pass   Fail
  Day 1   Pass       45     12
          Fail        8     35

  41.
                       Day 2
                    Pass   Fail
  Day 1   Pass       45     12
          Fail        8     35

  K = (Pa − Pc) ÷ (1 − Pc)
  Pa = .80

  42.
                       Day 2
                    Pass   Fail
  Day 1   Pass       45     12
          Fail        8     35

  Pc = [(A+B)(A+C) + (C+D)(B+D)] ÷ (A+B+C+D)²
  Pc = [(45+12)(45+8) + (8+35)(12+35)] ÷ (100)²
  Pc = [(57)(53) + (43)(47)] ÷ 10,000 = 5,042 ÷ 10,000
  Pc = .5042

  43. Kappa (K) • K = (Pa - Pc) ÷ (1 - Pc) • K = (.80 - .5042) ÷ (1 - .5042) • K = .597
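The kappa calculation for this example, end to end:

```python
# Kappa for the example table: A = 45, B = 12, C = 8, D = 35.
a, b, c, d = 45, 12, 8, 35
n = a + b + c + d

pa = (a + d) / n
pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement

kappa = (pa - pc) / (1 - pc)
print(f"Pc = {pc:.4f}, kappa = {kappa:.3f}")   # Pc = .5042, kappa is about .597
```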

  44. Modified Kappa (Kq) • Kq may be more appropriate than K when proportion of people passing a criterion-referenced test is not predetermined. • Most situations in exercise science do not predetermine the number of people who will pass.

  45. Modified Kappa • Interpreted same as K. • When proportion of masters = .50, Kq = K. • Otherwise, Kq > K.

  46. Interpretation of Rxx for CRT • Pa (Proportion of Agreement) • Affected by chance classifications • Values of Pa below .50 are unacceptable • Pa should be above .80 in most situations. • K and Kq (Kappa and Modified Kappa) • Interpretable range: 0.0 to 1.0 • Minimum acceptable value = .60

  47. When reporting results: • Report both indices of Rxx (Pa along with K or Kq).
