Evaluating and Restructuring Science Assessments: An Example Measuring Students' Conceptual Understanding of Heat
Kelly D. Bradley, Jessica D. Cunningham, & Shannon O. Sampson
UNIVERSITY OF KENTUCKY, Department of Educational Policy Studies & Evaluation
All authors contributed equally to this manuscript. Please address all inquiries to Kelly D. Bradley, 131 Taylor Education Building, Lexington, KY 40506.
Newton's Universe is supported by the National Science Foundation under Grant No. 0437768. For more information, see http://www.as.uky.edu/NewtonsUniverse/.

• Objectives
• Apply the dichotomous Rasch model to evaluate the quality of an assessment measuring students' conceptual understanding of heat
• Determine the fit of the data to the Rasch model
• Restructure the assessment based on the results, coupled with theory

• Background
• Although many measurement and testing textbooks present classical test theory as the only way to determine the quality of an assessment (Embretson & Hershberger, 1999), Item Response Theory offers a sound alternative to the classical test theory approach.
• Reliability and various aspects of validity can be examined when applying the Rasch model (Smith, 2004).
• To examine reliability, Rasch measurement places person ability and item difficulty along a common linear scale and produces a standard error (SE) for each person and item, specifying the range within which each person's 'true' ability and each item's 'true' difficulty fall.
• Rasch fit statistics, which are "derived from a comparison of expected response patterns and the observed patterns" (Smith, 2004, p. 103), can be examined to assess the content validity of the assessment.
• Bradley and Sampson (2006) applied a one-parameter Item Response Theory model, commonly known as the Rasch model, to investigate the quality of a middle school science teacher assessment and advised appropriate improvements to the instrument in an effort to ensure consistent and meaningful measures.

• Method
• Response Frame
• The target population was middle school science students in the rural Appalachian regions of Kentucky and Virginia.
• Instrumentation
• A student assessment was constructed by the Newton's Universe research team to measure students' conceptual understanding of heat.
• The pilot assessment contained forty-one multiple-choice items.
• Data Collection
• The student assessment was piloted with a group of middle school students participating in a science camp during summer 2006.
• Data Analysis
• The dichotomous Rasch model was applied to the data (a minimal computational sketch of this analysis appears after this section).
• ZSTD fit statistics were considered acceptable between -2 and 2, indicating values within two standard deviations of the expected mean of zero (Wright & Masters, 1982).
• Items with negative point measure correlations were flagged for review.
• The spread of items and persons along the ability continuum was examined for gaps.
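Under the dichotomous Rasch model, the probability that person n answers item i correctly is exp(B_n - D_i) / (1 + exp(B_n - D_i)), where B_n is person ability and D_i is item difficulty (Wright & Masters, 1982). The study's analysis was run with dedicated Rasch software (the references cite WINSTEPS; Linacre, 2005); the sketch below is only a minimal Python illustration of the same ideas, not the authors' code. It estimates person and item measures by joint maximum likelihood and computes the point-measure correlations and unweighted fit mean-squares used to flag items. Function and variable names are ours, and records with extreme (all-correct or all-incorrect) scores are assumed to have been removed.

```python
import numpy as np

def rasch_jmle(X, n_iter=500, tol=1e-5):
    """Joint maximum likelihood estimates for the dichotomous Rasch model.

    X is a persons-by-items matrix of 0/1 responses with no missing data and
    no extreme (all-0 or all-1) rows or columns, which have no finite estimate.
    Returns person measures (theta) and item difficulties (b), both in logits.
    """
    n_persons, n_items = X.shape
    theta = np.zeros(n_persons)
    b = np.zeros(n_items)
    for _ in range(n_iter):
        # model-expected probability of a correct response for every cell
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        info = p * (1.0 - p)                      # Fisher information per cell
        # one Newton-Raphson step for item difficulties (persons held fixed)
        b_new = b + (p.sum(axis=0) - X.sum(axis=0)) / info.sum(axis=0)
        b_new -= b_new.mean()                     # centre difficulties at 0 logits
        # one Newton-Raphson step for person measures (items held fixed)
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b_new[None, :])))
        info = p * (1.0 - p)
        theta_new = theta + (X.sum(axis=1) - p.sum(axis=1)) / info.sum(axis=1)
        done = max(np.abs(b_new - b).max(), np.abs(theta_new - theta).max()) < tol
        theta, b = theta_new, b_new
        if done:
            break
    return theta, b

def point_measure_correlations(X, theta):
    """Correlation of each item's 0/1 responses with the person measures.

    Negative values flag items that may be miskeyed or confusing, as with
    item 13 in the pilot results.
    """
    return np.array([np.corrcoef(X[:, i], theta)[0, 1] for i in range(X.shape[1])])

def outfit_mean_squares(X, theta, b):
    """Unweighted (outfit) mean-square fit statistic for each item.

    Rasch software further transforms these into ZSTD values; the raw
    mean-squares are shown here only to illustrate the residual logic.
    """
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    z2 = (X - p) ** 2 / (p * (1.0 - p))           # squared standardized residuals
    return z2.mean(axis=0)
```

For example, `theta, b = rasch_jmle(X)` followed by `point_measure_correlations(X, theta)` reproduces the kind of item flags described in the Data Analysis and Results sections.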
• Results
• Person separation and reliability were 2.31 and 0.84, respectively; item separation and reliability were 1.56 and 0.71 (the relationship between separation and reliability is illustrated in the note following the Conclusion).
• Item 13 resulted in a negative point measure correlation.
• The first item was empirically estimated as more difficult than the theoretical item hierarchy predicted.
• Potential gaps existed between item 28 and items 9 and 12, as well as between item 11 and items 17, 18, 21, 23, 24, 39, and 8.
• Four energy transfer items (18, 21, 23, 24) were located at the same difficulty level.
• Unexpected functioning of distracters occurred for items 4, 13, 14, 30, 32, 38, and 40.
• Items with unused distracters included 2, 3, 6, 12, 29, 31, 35, 36, 37, and 39.

• Discussion
• The first item on the pilot student assessment was relocated to the fourth position in an effort to place an easier item first on the student assessment.
• The item flagged for a high outfit ZSTD statistic was reworded because test developers felt students were overanalyzing the question.
• The item with the negative point measure correlation (item 13) was deleted because the committee found the item confusing overall.
• Item 19 was revised to replace item 18 on the student assessment since it tested the same concept.
• Item 23 was removed from the student assessment because the course does not adequately cover the concept tested.
• A more difficult foundations item was added to increase the span of foundation items along the ability continuum.
• To fill one potential gap in the item spread, item 24 was changed to make the question clearer and, in turn, less difficult.
• The answer choices of temperature points were changed to increase the difficulty of items 12 and 36.
• For items 3 and 5, the answer options were revised because they were not functioning as intended distracters.
• Items 4 and 40 were determined to be confusing for many higher-ability students, so adjustments were made.

• Conclusion
• Following the reconstruction process, the committee was asked to develop a new theoretical hierarchy of item difficulty based on the pilot results and any revisions made. Using the baseline assessment given in September 2006, the theoretical and empirical item hierarchies will be compared again.
• A strength of this study is the partnership of science educators with researchers in educational measurement to construct a quality assessment.
• This study provides a model for assessing knowledge transferred to students through teacher training.
• Findings will support other researchers' attempts to link student performance outcomes to teacher training, classroom teachers' construction of their own assessments, and the continued growth of collaborative efforts between the measurement and science education communities.
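As a brief check on the reported figures (assuming the standard Rasch definitions rather than anything specific to this study), separation G and reliability R are related by R = G^2 / (1 + G^2), so the reported separations are consistent with the reported reliabilities:

```latex
R = \frac{G^{2}}{1 + G^{2}}, \qquad
G_{\text{person}} = 2.31 \;\Rightarrow\; R_{\text{person}} = \frac{2.31^{2}}{1 + 2.31^{2}} \approx 0.84, \qquad
G_{\text{item}} = 1.56 \;\Rightarrow\; R_{\text{item}} = \frac{1.56^{2}}{1 + 1.56^{2}} \approx 0.71
```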
References
Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Bradley, K. D., & Sampson, S. O. (2006). Utilizing the Rasch model in the construction of science assessments: The process of measurement, feedback, reflection and change. In X. Liu & W. Boone (Eds.), Applications of Rasch measurement in science education (pp. 23-44). Maple Grove, MN: JAM Press.
Embretson, S., & Hershberger, S. (1999). The new rules of measurement. Mahwah, NJ: Lawrence Erlbaum Associates.
Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Needham Heights, MA: Allyn & Bacon.
Linacre, J. M. (1999). A user's guide to Facets Rasch measurement computer program. Chicago, IL: MESA Press.
Linacre, J. M. (2005). WINSTEPS Rasch measurement computer program. Chicago, IL: Winsteps.com.
Smith, E. (2004). Evidence for the reliability of measures and validity of measure interpretation: A Rasch measurement perspective. In E. Smith & R. Smith (Eds.), Introduction to Rasch measurement (pp. 93-122). Maple Grove, MN: JAM Press.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago, IL: MESA Press.
Wright, B. D., & Stone, M. H. (2004). Making measures. Chicago, IL: The Phaneron.

A special thanks to the Newton's Universe committee members who were integral to the assessment development: Kimberly Lott, Rebecca McNall, Jeffrey L. Osborn, Sally Shafer, and Joseph Straley.
Submit requests to: kdbrad2@uky.edu