Issues of Reliability, Validity and Item Analysis in Classroom Assessment
by Professor Stafford A. Griffith
Jamaica Teachers Association Education Conference: Assessment in Education
Ritz Carlton Resort & Spa, Montego Bay
April 2-4, 2013
Concept of a Test
• Some of the earliest forms of assessment or testing may be noted in biblical references. Adam and Eve, for example, were subjected to a simple test in the Garden of Eden based on a test item presented in a negative form.
• Another account is taken from Judges 12:4-6. It was an oral examination (shibboleth) devised by the Gileadite army to identify members of the defeated Ephraimite army who were attempting to escape under cover of a false identity.
• Outside of the biblical accounts, historians generally agree that the Chinese were the first to use large-scale testing.
• These tests were introduced as early as 2000 B.C. to measure the proficiency of candidates for public office and to reduce patronage.
• Today, we think of a test as an item/question, problem or task, or a mix of these, administered under prescribed conditions.
• It is designed to elicit responses that provide information to make judgements about a candidate.
• It is a systematic procedure for measuring a sample of a candidate’s behaviour that can give an accurate and truthful account of a candidate’s skills, knowledge or ability, or other characteristics, at the time the test was administered.
Reliability of Test Scores
• Two essential requirements for a technically sound test are reliability and validity.
• Reliability is the extent to which test scores are consistent or dependable.
• Only to the extent that scores are reliable can they be useful in conveying information about a student’s performance.
• From a more technical standpoint, reliability is the extent to which scores are free from errors of measurement.
• Classical Test Theory (CTT) defines reliability as a property that is based on three considerations:
  • observed scores,
  • true scores, and
  • measurement errors.
• In Classical Test Theory, a person’s observed score is a function of that person’s true score, plus error. This may be represented simply as:

Xo = Xt + Xe

where Xo represents the observed score, Xt represents the true score, and Xe represents the error.
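As a rough illustration of this decomposition, here is a minimal Python sketch with invented numbers (in practice the true score and the error are not directly observable). Random errors around a fixed true score tend to average out over repeated administrations:

```python
import random

random.seed(0)

true_score = 72.0  # the (unobservable) true score; an invented value

# Simulate repeated administrations: each observed score is the true
# score plus a random error term, i.e. Xo = Xt + Xe.
observed = [true_score + random.gauss(0, 4) for _ in range(1000)]

# With many administrations the errors tend to cancel out, so the mean
# of the observed scores approaches the true score.
print(round(sum(observed) / len(observed), 2))
```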
• The level of confidence we can have in test scores hinges on how much error we have in the observed scores of students.
• Reliability, or the level of confidence we can have in test scores, is expressed as an index ranging from 0 to 1. It may therefore be .99 (high) or .10 (low).
The reliability coefficients commonly used to determine and report on the consistency with which a test measures are derived from various approaches:
• test-retest,
• alternative form,
• internal consistency,
• split-half, and
• inter-rater (a special form of reliability).
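As one concrete example, a minimal split-half sketch in Python (the student scores below are invented): the two halves of the test are correlated, and the half-test correlation is then stepped up to full-test length with the Spearman-Brown formula.

```python
from statistics import correlation  # Pearson correlation, Python 3.10+

# Invented scores for 8 students on the odd- and even-numbered halves of a test.
odd_half  = [10, 14,  9, 16, 12, 15,  8, 13]
even_half = [11, 13, 10, 15, 11, 16,  9, 12]

# Correlation between the two halves of the test...
r_half = correlation(odd_half, even_half)

# ...stepped up to the full test length using the Spearman-Brown formula.
split_half_reliability = (2 * r_half) / (1 + r_half)
print(round(split_half_reliability, 2))
```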
Validity of Test Scores
• Validity is the extent to which a test does the job for which it is intended.
• Essentially, validity is about what inference can be made from the scores obtained on an instrument.
The most widely encountered discussions refer to three lines of validity evidence:
• content validity (representativeness of the domain);
• criterion-related validity (correlation with/prediction of scores from another instrument);
• construct validity (association with some theoretical construct).
• Validity is the most important technical quality of a test.
• An important way of assuring, or assessing, validity is to use a subject matter by behaviour grid called a specifications table or a table of specifications.
• It helps to define the weighting to be given to various subject matter and behaviours (or objectives or skills).
• It helps to avoid the testing of extraneous material.
Example of a Table of Specifications
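Purely as a hypothetical illustration (the subject areas, skill categories and item counts below are invented), a specifications grid for a 40-item test might be sketched as follows:

```python
# Hypothetical table of specifications: subject matter (rows) crossed
# with behaviours/skills (columns); cell values are numbers of items.
table_of_specifications = {
    "Fractions":   {"Knowledge": 4, "Comprehension": 6, "Application": 6},
    "Decimals":    {"Knowledge": 3, "Comprehension": 5, "Application": 4},
    "Percentages": {"Knowledge": 3, "Comprehension": 4, "Application": 5},
}

total_items = sum(sum(row.values()) for row in table_of_specifications.values())
print(total_items)  # 40

# Weighting given to each subject area (share of the whole test)
for topic, row in table_of_specifications.items():
    print(topic, f"{sum(row.values()) / total_items:.0%}")
```

Laying the grid out this way makes the intended weighting explicit and makes it easy to check that every item written for the test falls inside a planned cell rather than testing extraneous material.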
• It is important to work out the types of items/questions, their psychometric characteristics, the number of items and questions, and how these will be scored.
• The specifications for test construction should be so clear that two test constructors would produce tests that are comparable and interchangeable.
Item Analysis
• In writing and analysing test tasks, two critical indicators of the goodness of the tasks should be considered: the facility (or difficulty) and the discrimination.
• The facility level for a task is the proportion of candidates responding correctly or satisfactorily to it.
It is expressed as an index:
• an f-value, or
• a p-value (which is really the probability of a person in a particular group responding correctly or satisfactorily).
• The formula for calculating p is very simple: p = R/T, that is, the number of students responding correctly to an item (R) divided by the number of students responding to the item (T).
• Its value ranges from 0 to 1.00.
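A minimal sketch of this calculation in Python (the counts are invented for illustration):

```python
def facility_index(num_correct: int, num_responding: int) -> float:
    """Facility (p-value) of an item: p = R / T."""
    return num_correct / num_responding

# Hypothetical item: 18 of the 30 students who attempted it answered correctly.
print(facility_index(18, 30))  # 0.6
```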
• The discrimination level for a task is the extent to which performance on the task separates the better candidates from the poorer ones.
• The calculation of this d-index is generally more complex than the calculation of the facility index and is often represented by a biserial or a point-biserial correlation index (r).
• It ranges from -1.00 to +1.00.
Easier and relatively accurate estimates of the extent of discrimination of a dichotomously scored task are, however, obtained by comparing the way the top-performing students perform on the task with the way the bottom-performing students perform on that task.
The discrimination index for an item is calculated by:
• ranking students according to performance on the test;
• separating the top performing students and the bottom performing students;
• finding the p value of the item for the top performing students and the p value for the bottom performing students;
• subtracting the p value for the low performing students from the p value of the high performing students.
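These steps translate directly into a short Python sketch (the student data are invented, and the half-and-half split used here is an assumption; some authors use the top and bottom 27% of candidates instead):

```python
def discrimination_index(students, fraction=0.5):
    """d = p(top group) - p(bottom group) for a dichotomously scored item.

    `students` is a list of (total_test_score, item_score) pairs, with the
    item scored 0 or 1. `fraction` is the share of candidates placed in
    each of the top and bottom groups.
    """
    ranked = sorted(students, key=lambda s: s[0], reverse=True)  # rank by test score
    n = max(1, round(len(ranked) * fraction))
    top, bottom = ranked[:n], ranked[-n:]                        # separate the groups
    p_top = sum(item for _, item in top) / n                     # p value, top group
    p_bottom = sum(item for _, item in bottom) / n               # p value, bottom group
    return p_top - p_bottom                                      # subtract the two

# Invented data: (total test score, score on the item of interest).
students = [(55, 1), (52, 1), (48, 1), (40, 0), (38, 1),
            (35, 0), (30, 0), (28, 1), (25, 0), (20, 0)]
print(round(discrimination_index(students), 2))  # 0.6
```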
The table indicates how students performed on an item with four possible responses (A, B, C and D). The correct response is C.

Response      A    B    C    D
Upper Group   -    2    8    -
Lower Group   4    3    2    1

• The facility index of the item is (a) 1.00 (b) .10 (c) .05 (d) .50
• The discrimination index of the item is (a) 6 (b) .60 (c) .06 (d) .66
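The two indices can be checked directly from the frequency table (upper and lower groups of 10 students each, correct response C):

```python
# Response frequencies for the item above; the correct response is C.
upper = {"A": 0, "B": 2, "C": 8, "D": 0}   # 10 students in the upper group
lower = {"A": 4, "B": 3, "C": 2, "D": 1}   # 10 students in the lower group

total_students = sum(upper.values()) + sum(lower.values())   # 20
total_correct = upper["C"] + lower["C"]                      # 10

facility = total_correct / total_students                    # p = R / T
discrimination = upper["C"] / sum(upper.values()) - lower["C"] / sum(lower.values())

print(facility)                   # 0.5
print(round(discrimination, 2))   # 0.6
```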
Summary
Based on our discussions, I trust that in developing and using tests for assessment in the classroom, you will consider the need to:
• provide scores that are reliable;
• provide scores that are valid;
• develop and use items/tasks that are at the right difficulty level;
• develop and use items/tasks that can discriminate between those who have the desired competences and those who do not.
Thank you.