LG675 Session 5: Reliability II

Sophia Skoufaki (sskouf@essex.ac.uk), 15/2/2012

• What is item analysis?
• How can we conduct item analysis for
  a) norm-referenced data-collection instruments? (Only the statistical analyses provided through SPSS.)
  b) criterion-referenced data-collection measures?
• How can we examine the reliability of criterion-referenced data-collection instruments?
• Work with some typical scenarios

Item analysis: definition
• The kind of reliability analysis used to identify items in a data-collection instrument (e.g., questions in a questionnaire, tasks/questions in a language test) which do not measure the same thing as the other items.
• It is conducted on data from the pilot study. The aim is to improve our data-collection instrument by removing any irrelevant items.

NB: The item analysis discussed here is different from the item analysis (also called ‘analysis by items’) that is part of data analysis in experiments. That analysis is done to ensure that the findings of an experiment are generalisable not only to people with similar characteristics to those who participated in the experiment but also to items similar to those in the experiment (Clark 1973).
• If you plan to conduct an experiment, see Phil’s discussion of this term and SPSS how-to: http://privatewww.essex.ac.uk/~scholp/statsquibs.htm#item

Reminder: Classification of data-collection instruments according to the basis of grading

Item analysis for norm-referenced measures
• According to the traditional approach to item analysis, items are examined in terms of:
• Item facility: a measure of how easy an item is. High facility means an easy item.
• An easy way to assess it is to look at the proportion of people who answer each item correctly (0 = nobody, 1 = everybody). The data-collection instrument as a whole should have a facility of about 0.5, and most items should have a facility close to that level.
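As a rough illustration of the calculation described above, here is a minimal Python sketch. The response matrix is invented for illustration (it is not from the course data files), and the function name `item_facility` is a hypothetical helper, not an SPSS procedure.

```python
# Hypothetical pilot data: one row per test taker, one column per item
# (1 = correct answer, 0 = incorrect answer).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
]


def item_facility(rows):
    """Return the proportion of test takers who answered each item correctly."""
    n_people = len(rows)
    # zip(*rows) turns the rows (people) into columns (items).
    return [sum(column) / n_people for column in zip(*rows)]


facilities = item_facility(responses)
# Facility of the instrument as a whole: the mean of the item facilities.
test_facility = sum(facilities) / len(facilities)

for number, value in enumerate(facilities, start=1):
    print(f"Item {number}: facility = {value:.2f}")
print(f"Whole test: facility = {test_facility:.2f}")
```

In this made-up data set, item 1 is answered correctly by everyone and item 3 by only one person, so both sit far from the 0.5 target and would deserve a closer look.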

Understanding item facility
• This is an activity from http://www.caacentre.ac.uk/dldocs/BP2final.pdf
• Input the file ‘three_tests_IF.sav’ into SPSS.
• This file shows the item facility for each question in three tests.
• Examine the item facilities in each test and try to spot problematic item facilities.
• Which test seems to be the best in that it contains items which will be able to distinguish among students of various proficiency levels?

Item analysis for norm-referenced measures (cont.)
• Item discrimination: a measure of how different performance on an item is from performance on the other items.
• It can be assessed via a correlation between the item’s score and the score of the whole measure.
• It can also be assessed via Cronbach’s α (alpha) if item deleted.

SPSS: Item analysis for norm-referenced measures
• Do the activity described in the box on pages 26-27 of Phil’s ‘Simple statistical approaches to reliability and item analysis’ handout.
• Then do the activity described in the box on pages 29-30.
• Also calculate item facility as a percentage of correct answers.

Item analysis for criterion-referenced measures (Brown 2003)
• Difference Index (DI) = item facility in the post-test − item facility in the pre-test
• B-Index (B-I) = item facility for students who passed the test − item facility for those who failed it
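The two indices can be sketched in Python as below. The pre-test and post-test data are invented, not Brown's (2003) data sets, and the helper names are hypothetical. The sketch also assumes that "passing" means a total score at or above a cut score, which is an assumption made here for illustration.

```python
def facility(rows):
    """Proportion of correct answers per item (rows = people, columns = items)."""
    n_people = len(rows)
    return [sum(column) / n_people for column in zip(*rows)]


def difference_index(pre_rows, post_rows):
    """DI per item: post-test facility minus pre-test facility."""
    return [post - pre
            for pre, post in zip(facility(pre_rows), facility(post_rows))]


def b_index(rows, cut_score):
    """B-index per item: facility among passers minus facility among failers.

    Assumption: a student passes when their total score >= cut_score.
    """
    passed = [row for row in rows if sum(row) >= cut_score]
    failed = [row for row in rows if sum(row) < cut_score]
    if not passed or not failed:
        raise ValueError("B-index needs at least one passer and one failer")
    return [p - f for p, f in zip(facility(passed), facility(failed))]


# Hypothetical data: the same four students on a three-item test,
# before and after instruction (1 = correct, 0 = incorrect).
pre_test = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
post_test = [[1, 1, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]

di = difference_index(pre_test, post_test)
bi = b_index(post_test, cut_score=3)
print("Difference index per item:", di)
print("B-index per item:", bi)
```

In this invented example, item 3 is answered correctly by everyone on both occasions, so its DI and B-index are both 0: it tells us nothing about what was learned or about who passes.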

SPSS: Item analysis for criterion-referenced measures
• This is an activity from Brown (2003). He used Excel to calculate the DI and the B-index on two data sets.
• Download this article as a pdf file from http://jalt.org/test/bro_18.htm
• Input the data from page 20 into SPSS.
• Calculate DI via Transform…Compute.

Reliability of criterion-referenced measures
• There are two basic approaches:
• Threshold loss agreement: This approach examines the proportion of people who consistently scored above the cut-off point (‘masters’) and the proportion of those who consistently scored below it (‘non-masters’). It uses a test-retest method. Example statistic: Cohen’s Kappa (AKA ‘kappa coefficient’)

The structure of Cohen’s kappa table in this scenario (figure from Brown and Hudson 2002: 171)
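To make the threshold loss idea concrete, here is a minimal Python sketch of Cohen's kappa for master/non-master decisions on two administrations of the same test. The scores and the cut-off of 14 are invented for illustration, and `cohen_kappa` is a hypothetical helper rather than the SPSS procedure.

```python
def cohen_kappa(first, second):
    """Cohen's kappa for two lists of categorical decisions on the same people."""
    n = len(first)
    categories = set(first) | set(second)
    # Observed agreement: proportion of people classified the same way twice.
    p_observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    p_chance = sum((first.count(c) / n) * (second.count(c) / n)
                   for c in categories)
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical scores of ten people on two administrations of a 20-item test.
scores_time1 = [18, 17, 16, 15, 15, 14, 12, 11, 10, 9]
scores_time2 = [17, 18, 15, 16, 14, 13, 14, 12, 9, 10]
CUT_OFF = 14  # assumed cut-off: a score of 14 or more counts as 'master'

masters_time1 = [score >= CUT_OFF for score in scores_time1]
masters_time2 = [score >= CUT_OFF for score in scores_time2]

kappa = cohen_kappa(masters_time1, masters_time2)
print(f"Cohen's kappa = {kappa:.3f}")
```

Here eight of the ten people receive the same decision on both occasions (observed agreement 0.80), but 0.52 agreement would be expected by chance alone, which is why kappa is noticeably lower than the raw agreement.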

Reliability of criterion-referenced measures (cont.)
• Squared error loss agreement: These statistics are like the previous ones, but they also assess how consistent the degree of mastery/non-mastery is. Example: phi(lambda) dependability index (not available in SPSS, see Brown 2005: 206-207)
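Since phi(lambda) is not available in SPSS, a small Python sketch may help. It assumes the short-cut formula that is commonly given for this index, Φ(λ) = 1 − [1/(k − 1)] × [M(1 − M) − S²] / [(M − λ)² + S²], where k is the number of items, M and S² are the mean and the (population) variance of the proportion-correct scores, and λ is the cut-point expressed as a proportion. Both the formula and the use of the population variance are assumptions here and should be checked against Brown (2005: 206-207) before being relied on; the scores are invented.

```python
from statistics import mean, pvariance


def phi_lambda(total_scores, n_items, cut_proportion):
    """Short-cut estimate of the phi(lambda) dependability index.

    Assumed formula (verify against Brown 2005: 206-207):
    1 - 1/(k - 1) * (M(1 - M) - S^2) / ((M - lambda)^2 + S^2)
    """
    # Convert raw totals to proportion-correct scores.
    proportions = [score / n_items for score in total_scores]
    m = mean(proportions)
    s2 = pvariance(proportions)  # population variance (assumption)
    numerator = m * (1 - m) - s2
    denominator = (m - cut_proportion) ** 2 + s2
    return 1 - (1 / (n_items - 1)) * (numerator / denominator)


# Hypothetical total scores of ten people on a 20-item test,
# with an assumed cut-point of 70% correct.
scores = [18, 17, 16, 15, 15, 14, 12, 11, 10, 9]
phi = phi_lambda(scores, n_items=20, cut_proportion=0.70)
print(f"phi(lambda) = {phi:.3f}")
```

Under this formula the index rises as the mean score moves further from the cut-point, because decisions about people who are far from the cut-off are easier to make consistently.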

SPSS: Assessing reliability of a criterion-referenced measure through Cohen’s Kappa
• Go to page 172 at http://books.google.co.uk/books?id=brDfGghl3qIC&pg=PA169&source=gbs_toc_r&cad=3#v=onepage&q&f=false.
• Input the data into SPSS.
• Conduct the Kappa test.

Next week
• Statistics for validity assessment
• ANOVA with one independent variable

References
• Brown, J.D. 2003. Criterion-referenced item analysis (The difference index and B-index). Shiken: JALT Testing & Evaluation SIG Newsletter 7 (3), 18-24.
• Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill.
• Clark, H.H. 1973. The language-as-fixed-effect fallacy. Journal of Verbal Learning and Verbal Behavior 12, 335-359.
• Scholfield, P. 2011. Simple statistical approaches to reliability and item analysis. LG675 Handout. University of Essex.

Suggested readings
On the statistics used for item analysis
• Brown, J.D. 2003. Criterion-referenced item analysis (The difference index and B-index). Shiken: JALT Testing & Evaluation SIG Newsletter 7 (3), 18-24.
• Scholfield, P. 2011. Simple statistical approaches to reliability and item analysis. LG675 Handout. University of Essex. (pp. 24-33)
On the statistics used to assess the reliability of criterion-referenced measures
• Brown, J.D. 2005. Testing in language programs: a comprehensive guide to English language assessment. New York: McGraw Hill. (chapter 9)
• Brown, J.D. and Hudson, T. 2002. Criterion-referenced language testing. Cambridge: Cambridge University Press. (chapter 5)
