1. Measurement: Reliability
2. Classroom Assessment Reliability
Reliability = Assessment Consistency:
Consistency within tests across examinees.
Consistency within tests over multiple administrations to the same examinees.
Consistency across alternative forms of the same test for the same examinees.
3. Three Types of Reliability
Stability reliability.
Alternate form reliability.
Internal consistency reliability.
“Reliability refers to the degree to which test scores are free from errors of measurement” (Standards for Educational and Psychological Testing).
Some synonyms include:
Dependability
Stability
Consistency (main connotation)
THREE types of reliability
Test-Retest Reliability (Stability):
Consistency of results over different testings.
Assumption: no significant (i.e., performance-enhancing) events occur between tests.
Typically requires computing a correlation.
Alternate-Form Reliability: Consistency of results over different forms
Requires two or more equivalent forms of the test.
Equivalent forms are most typically found in high-stakes testing situations (e.g., EOG and EOC tests).
Also requires a correlation coefficient.
Internal Consistency Reliability: Consistency of results over items within the test
Concerned with inter-item consistency (homogeneity), or with unidimensionality.
Requires only one administration of a test.
For dichotomous items, use the Kuder-Richardson formulas.
For polytomous items, use Cronbach’s coefficient alpha.
The three types of reliability are truly different from each other.
4. Stability Reliability
Concerned with the question:
Are assessment results consistent over time (over occasions)?
Think of some examples where stability reliability might be important.
Why might test results NOT be consistent over time?
Reliability coefficients are given as correlations.
A reliability coefficient ranges from 0 to 1.
Reliability is defined as the correlation between two simultaneous administrations of a test to the same group of respondents (a theoretical notion).
Reliability coefficients close to 0 imply weak reliability.
Reliability coefficients close to 1 imply strong reliability.
Classroom (teacher-made) tests commonly have reliabilities in the .50s or lower.
Commercial achievement tests have reliabilities in the .90s (or at least the high .80s).
Aptitude tests have reliabilities in the .70s and .80s.
Incidentally, achievement tests and aptitude tests are called cognitive tests.
In general, tests (especially cognitive tests) are less reliable for younger children.
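To make the correlation computation concrete, here is a minimal sketch in Python; the score lists are hypothetical, standing in for two administrations of the same test to the same students.

# Test-retest (stability) reliability: Pearson correlation between
# two administrations of the same test to the same examinees.
time1 = [72, 85, 90, 64, 78, 88, 70, 95]  # hypothetical first administration
time2 = [75, 83, 92, 61, 80, 85, 68, 93]  # hypothetical retest

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(round(pearson_r(time1, time2), 3))  # near 1 implies strong stability

A coefficient near 1 here would support stability reliability; a value near 0 would suggest the scores are not consistent over time.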
5. Evaluating Stability Reliability
Test-Retest Reliability:
Compute the correlation between a first and a later administration of the same test.
Classification consistency:
Compute the percentage of consistent student classifications over time.
Main concern is with the stability of the assessment over time.
6. Example of Classification Consistency
7. Example of Classification Consistency (Good Reliability)
8. Example of Classification Consistency (Poor Reliability)
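Because the classification tables from these slides are not reproduced here, a small Python sketch can stand in for them; it reuses the hypothetical scores above with an assumed cut score of 70 and reports the percentage of students classified the same way (pass/fail) on both occasions.

# Classification consistency: percent of students receiving the same
# classification (pass/fail against a cut score) on both occasions.
CUT = 70  # hypothetical passing score

time1 = [72, 85, 90, 64, 78, 88, 70, 95]
time2 = [75, 83, 92, 61, 80, 85, 68, 93]

same = sum((a >= CUT) == (b >= CUT) for a, b in zip(time1, time2))
print(f"Consistent classifications: {100 * same / len(time1):.1f}%")  # 87.5%

Good reliability would put nearly all students in the consistent (pass/pass or fail/fail) cells; poor reliability would show many students switching cells between occasions.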
9. Alternate-form Reliability
Are two supposedly equivalent forms of an assessment actually equivalent?
The two forms do not have to yield identical scores.
The correlation between two or more forms of the assessment should be reasonably substantial.
10. Evaluating Alternate-form Reliability
Administer two forms of the assessment to the same individuals and correlate the results.
Determine the extent to which the same students are classified the same way by the two forms.
Alternate-form reliability is established by evidence, not by proclamation.
11. Example of Using a Classification Table to Assess Alternate-Form Reliability
12. Example of Using a Classification Table to Assess Alternate-Form Reliability
13. Internal Consistency Reliability
Concerned with the extent to which the items (or components) of an assessment function consistently.
To what extent do the items in an assessment measure a single attribute?
For example, consider a math problem-solving test. To what extent does reading comprehension play a role? What is being measured?
14. Evaluating Internal Consistency Reliability
Split-Half Correlations.
Kuder-Richardson Formula (KR20).
Used with binary-scored (dichotomous) items.
Average of all possible split-half correlations.
Cronbach’s Coefficient Alpha.
Similar to KR20, except used with non-binary scored (polytomous) items (e.g., items that measure attitude).
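A minimal sketch of both statistics in Python, using small hypothetical item-response matrices (rows are students, columns are items); for 0/1 items, coefficient alpha computed this way reduces to KR20.

# Cronbach's coefficient alpha; with dichotomous (0/1) items it
# reduces to KR20, since each item variance is then p * (1 - p).
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def coefficient_alpha(matrix):
    k = len(matrix[0])                                   # number of items
    item_vars = [variance([row[i] for row in matrix]) for i in range(k)]
    total_var = variance([sum(row) for row in matrix])   # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

binary = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]  # hypothetical right/wrong data
likert = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3], [1, 2, 2]]                 # hypothetical attitude ratings

print(round(coefficient_alpha(binary), 3))  # KR20 for the dichotomous items
print(round(coefficient_alpha(likert), 3))  # alpha for the polytomous items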
15. Reliability: Components of an Observation
O = T + E
Observation = True Status + Measurement Error.
Reliability is a function of the E(rror) component.
The T(rue) score is assumed fixed (stable, invariant).
Hence the greater the variability in O(bservation), the greater the variability in E.
The more variability in E, the more unreliable the test.
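The O = T + E decomposition can be made concrete with a short simulation; in classical test theory, reliability equals var(T) / (var(T) + var(E)), the proportion of observed-score variance that is true-score variance. A minimal sketch, with arbitrary distribution choices for illustration:

# Simulate O = T + E and recover reliability = var(T) / var(O).
import random

random.seed(1)
true_scores = [random.gauss(75, 10) for _ in range(10000)]  # T: fixed, stable status
errors = [random.gauss(0, 5) for _ in range(10000)]         # E: measurement noise
observed = [t + e for t, e in zip(true_scores, errors)]     # O = T + E

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Expected: 10^2 / (10^2 + 5^2) = 100 / 125 = 0.80.
print(round(variance(true_scores) / variance(observed), 3))

Doubling the error standard deviation in this sketch would drive the ratio down, which is the slide's point: more variability in E means a less reliable test.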
16. Standard Error of Measurement
Provides an index of the reliability of an individual’s score.
The standard deviation of the theoretical distribution of errors (i.e., the E’s).
The more reliable a test, the smaller the SEM.
The SEM is smallest near the average score on a test.
The SEM gives the margin of error in a test score, similar to the margin of error in a survey response.
Typically we multiply the SEM by 1.96 to get a 95% confidence band around the true score.
If SEM = 2.5, then 1.96 x 2.5 = 4.9.
Given some score, X, for an individual, we would state, with a 95% level of confidence, that the individual’s true score lies somewhere between X - 4.9 and X + 4.9.
If X = 79, then 74.1 <= T <= 83.9.
Generally, the larger the reliability of a test, the smaller the SEM.
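The reliability-SEM link can be sketched with the standard formula SEM = SD * sqrt(1 - reliability); the SD and reliability values below are hypothetical, chosen to reproduce the SEM = 2.5 example above.

# SEM = SD * sqrt(1 - reliability); 95% band = X +/- 1.96 * SEM.
import math

sd, reliability = 10.0, 0.9375      # hypothetical values; yield SEM = 2.5
sem = sd * math.sqrt(1 - reliability)
x = 79                              # an individual's observed score
half_width = 1.96 * sem             # 4.9
print(f"SEM = {sem:.2f}")
print(f"95% band: {x - half_width:.1f} <= T <= {x + half_width:.1f}")  # 74.1 <= T <= 83.9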
17. Things to Do to Improve Reliability
Use more items or tasks (see the sketch below).
Use items or tasks that differentiate among students.
Use items or tasks that measure within a single content domain.
Keep scoring objective.
Eliminate (or reduce) extraneous influences.
Use shorter assessments more frequently.
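The first suggestion, using more items, can be quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k; the starting reliability below is a hypothetical classroom-test value.

# Spearman-Brown prophecy: predicted reliability of a test whose
# length is multiplied by k (k = 2 doubles the number of items).
def spearman_brown(reliability, k):
    return (k * reliability) / (1 + (k - 1) * reliability)

print(round(spearman_brown(0.50, 2), 3))  # doubling: 0.50 -> 0.667
print(round(spearman_brown(0.50, 3), 3))  # tripling: 0.50 -> 0.75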
18. End