Kinge Mbella Liz Burton Rob Keller Nambury Raju Psychometric Internship Measured Progress

The Accuracy of Small Sample Equating: An Investigative/Comparative Study of Small Sample Equating Methods. Kinge Mbella Liz Burton Rob Keller Nambury Raju Psychometric Internship Measured Progress July 24, 2009

Presentation Outline • Introduction • Background to Study • Research Hypothesis • Small Sample Equating • Identity Equating • Chained linear • Synthetic Linking Function • Chained Log linear Pre Smoothing • Circle-arc • Methodology • Research Design • Procedure • Results • Discussion and Conclusion.

Introduction • The primary motivation is the 2007 paper by Livingston and Kim, “Small Sample Equating by the Circle-arc Method.” • Empirical research findings confirm that this method produces smaller random and systematic errors when equating with samples smaller than 50 per form (Darby & Mbella, NCME 2009). • Technological innovation is increasing the flexibility of test administration and reporting. Most tests have multiple forms taken by smaller samples of students on different test dates. The need to provide accurate equated scores in a timely manner is pressing. • Practical circumstances in most certification programs dictate the use of small samples.

Research Objectives • This research used empirical data to compare the random and systematic errors associated with small sample equating methods. • The ultimate goal is to provide practitioners with objective and valid results to effectively examine the small sample equating dilemma. • It is my intention that these results will provide scientific and logical evidence that “Yes, we may be able to equate accurately with smaller samples.”

Background Into Equating • Mislevy (1992) • “Test construction and equating are inseparable, when they are applied in concert, equated scores from parallel test forms provide virtually exchangeable evidence about students’ behavior on some domain…” • Kolen and Brennan (2004, p. 269)

Research and Equating Jargon • Form X: The test form administered to the 2007/08 examinees (New Form). • Form Y: The test form administered to the 2006/07 examinees (Old Form). • Population: Scored responses for all students on a test form for a particular year. • Small samples: The sample sizes selected for this research are 22, 35, 44, and 70. Example: SE_22_22. • Experimental test forms Y and X: Test forms assembled from an operational test form and response matrix. • CING: The common item non equivalent group design. • Criterion estimate: The Equipercentile equating results of the Form X observed scores equated onto the Form Y observed scale scores for that particular grade level and subject area.

Equating Methods • Linear Methods: Identity Equating, Chained Linear, Chained Log linear, Synthetic Linking Function • Non Linear: Circle-arc, Equipercentile

Chained Linear Function • µ: Sample mean • σ: Sample standard deviation • yv = Anchor Old Form (Y) • xv = Anchor New Form (X)
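
The symbols on this slide can be put to work in a short sketch. It assumes the usual two-step chain: Form X is linked to the anchor within the new-form group, and the anchor is then linked to Form Y within the old-form group. The function and argument names are hypothetical and are not taken from the original slide.

```python
import numpy as np

def chained_linear(x, new_total, new_anchor, old_total, old_anchor):
    """Sketch: chained linear equivalent of Form X score x on the Form Y scale."""
    new_total = np.asarray(new_total, dtype=float)    # Form X total scores
    new_anchor = np.asarray(new_anchor, dtype=float)  # xv: anchor scores, new-form group
    old_total = np.asarray(old_total, dtype=float)    # Form Y total scores
    old_anchor = np.asarray(old_anchor, dtype=float)  # yv: anchor scores, old-form group

    # Step 1: link X to the anchor using the new-form group's means and SDs.
    v = new_anchor.mean() + (new_anchor.std(ddof=1) / new_total.std(ddof=1)) * (
        x - new_total.mean()
    )
    # Step 2: link the anchor to Y using the old-form group's means and SDs.
    return old_total.mean() + (old_total.std(ddof=1) / old_anchor.std(ddof=1)) * (
        v - old_anchor.mean()
    )
```

When the two groups have identical total and anchor distributions, the two steps cancel and the function returns x unchanged, which is a quick sanity check on any implementation.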

Identity Function • “Identity equating function” is a technical term for saying that no equating is done. The equated score equals the observed score.

Synthetic Linking Function • W = 0.5 • The synthetic linking function is a weighted average of an equating function (in this case Chained Linear) and the Identity function.
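
As a minimal sketch, the weighted average can be written in a few lines. Which of the two functions W multiplies is an assumption here (the slide does not say); with W = 0.5 the choice makes no difference. The names `identity`, `synthetic`, and `equate` are hypothetical.

```python
def identity(x):
    """Identity equating: the equated score equals the observed score."""
    return x

def synthetic(x, equate, w=0.5):
    """Weighted average of an equating function and the identity function.

    Assumption: w is the weight on the equating function, 1 - w on the identity.
    """
    return w * equate(x) + (1 - w) * identity(x)
```

For example, if a chained linear function maps a Form X score of 10 to 14, the synthetic function with W = 0.5 returns 12, halfway between the equated and the observed score.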

Chained Log Linear • Using an adaptation of the log-linear function developed by Rosenbaum and Thayer (1987), the first two univariate moments of the observed score distribution are pre-smoothed before equating. • In chained equipercentile, the linking is done through the common items. The percentile rank of a score on the common items for form X is linked to the equivalent percentile on the form Y common-item scale. The corresponding form Y score at that percentile is then the chained equipercentile equivalent for that particular form X observed score.
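
The chaining through percentile ranks can be illustrated with a rough sketch. It leaves out the log-linear pre-smoothing step entirely, uses the midpoint percentile-rank convention with linear interpolation, and assumes integer scores starting at 0, the same maximum score on both forms, and at least one examinee at every score point (otherwise the percentile ranks contain ties and the interpolation is not well defined). All names are hypothetical.

```python
import numpy as np

def percentile_ranks(scores, max_score):
    """Percentile rank (midpoint convention) at each integer score 0..max_score."""
    counts = np.bincount(np.asarray(scores, dtype=int), minlength=max_score + 1)
    rel = counts / counts.sum()
    return 100.0 * (np.cumsum(rel) - rel / 2.0)

def chained_equipercentile(x, new_total, new_anchor, old_total, old_anchor,
                           max_total, max_anchor):
    """Sketch: chain Form X -> anchor (new group) -> Form Y (old group)."""
    t_grid = np.arange(max_total + 1)
    a_grid = np.arange(max_anchor + 1)
    # Percentile rank of x among Form X totals in the new-form group.
    pr_x = np.interp(x, t_grid, percentile_ranks(new_total, max_total))
    # Anchor score with the same percentile rank in the new-form group.
    v = np.interp(pr_x, percentile_ranks(new_anchor, max_anchor), a_grid)
    # Percentile rank of that anchor score in the old-form group.
    pr_v = np.interp(v, a_grid, percentile_ranks(old_anchor, max_anchor))
    # Form Y total with that percentile rank in the old-form group.
    return float(np.interp(pr_v, percentile_ranks(old_total, max_total), t_grid))
```

Small samples usually leave some score points with zero frequency, which is one reason the score distributions are smoothed before this kind of equating is applied.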

Circle-arc • Livingston and Kim (2007) proposed an innovative method with the potential to considerably reduce the sampling error of equating in small samples while introducing very little systematic error. • Their rationale is that the relationship between test forms is always curvilinear when the forms differ in difficulty. • Empirical research has shown that the circle-arc method is the most accurate method for modeling the equipercentile relationship in small samples.

Circle-arc • Circle-arc is a very simple model. It relies entirely on the characteristics of the observed scores. • The main properties are: • The minimum and maximum possible observed scores are fixed for both test forms. • A middle point is empirically determined by carrying out any of the linear equating transformations based on the data collection method. • A combination of mathematical formulae that forces an arc of a circle to pass through these three points is used to produce the Circle-arc equating function.
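
The three-point construction can be sketched as a direct circle fit. This is only an illustration of the geometry and may differ from the exact formulation in Livingston and Kim; the function name is hypothetical, and the caller is assumed to supply the middle point from whatever linear equating is used (for example, a mid-scale Form X score and its equated value).

```python
import math

def circle_arc(low, mid, high):
    """Return f(x) for the circle arc through three (x, y) points (sketch)."""
    (x1, y1), (x2, y2), (x3, y3) = low, mid, high
    # The centre (a, b) is equidistant from all three points, which gives
    # two linear equations in a and b.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    c1 = x2**2 - x1**2 + y2**2 - y1**2
    c2 = x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        # Collinear points: the arc degenerates to the line through the end points.
        slope = (y3 - y1) / (x3 - x1)
        return lambda x: y1 + slope * (x - x1)
    a = (c1 * a22 - c2 * a12) / det
    b = (a11 * c2 - a21 * c1) / det
    r2 = (x1 - a) ** 2 + (y1 - b) ** 2
    # Use the half of the circle on which the middle point lies.
    sign = 1.0 if y2 > b else -1.0
    return lambda x: b + sign * math.sqrt(max(r2 - (x - a) ** 2, 0.0))
```

With end points (0, 0) and (40, 40) and a middle point of (20, 22), the resulting function passes through all three points and bows slightly above the identity line in between.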

The Circle-Arc Method

Empirical Equating Curves

Research Questions • How similar are the various small sample equating methods in terms of equating errors? • How do differences in test form difficulty affect the accuracy and consistency of the various equating methods? • What is the minimum sample size at which the standard error of equating becomes unacceptable?

Research Methodology Using real examinees’ responses on a Math and Reading Standardized test, two experimental test forms were created for each subject area and grade level. The Common Item Non Equivalent Group (CING) design with an internal anchor was used as the basis for collecting data for equating purposes.

Data Specification

Descriptive Statistics for Reading Grade 7

Procedure • Large Sample Equating • An equipercentile equating was done on the full population of Form Y and X for each subject and grade level. The unsmoothed equipercentile conversion was used as the base equating for comparison. • Small Sample Equating • Using a bootstrap sampling method without replacement, small samples were drawn from each population and concurrently equated using all 5 equating methods. The sampling and equating were repeated 250 times, and the average equated score at each score point by method was used as the estimated equated score of form X on the form Y observed scale.
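
The small sample procedure can be sketched as a resampling loop. This is an illustration under stated assumptions rather than the code used in the study: `replicate_equating` and `mean_equate` are hypothetical names, and `mean_equate` is only a stand-in for the five equating methods so that the loop can run.

```python
import numpy as np

def replicate_equating(pop_x, pop_y, n_x, n_y, equate, score_points,
                       reps=250, seed=0):
    """Draw small samples without replacement and equate them, reps times."""
    rng = np.random.default_rng(seed)
    pop_x, pop_y = np.asarray(pop_x), np.asarray(pop_y)
    results = np.empty((reps, len(score_points)))
    for r in range(reps):
        # Sample examinee indices without replacement from each population.
        ix = rng.choice(len(pop_x), size=n_x, replace=False)
        iy = rng.choice(len(pop_y), size=n_y, replace=False)
        results[r] = equate(pop_x[ix], pop_y[iy], score_points)
    return results  # one row per replication, one column per score point

def mean_equate(sample_x, sample_y, score_points):
    """Stand-in method: shift scores by the difference in sample means."""
    return np.asarray(score_points, dtype=float) + sample_y.mean() - sample_x.mean()
```

Averaging the returned array over its first axis, `results.mean(axis=0)`, gives the average equated score at each score point, which corresponds to the estimated equated score described on this slide.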

Procedure: Result Analysis • Standard Error (SE): error due to sampling variability • Conditional bias: error due to method effect • Conditional RMSE
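
The formulas for these three indices did not survive on this slide. The sketch below uses the definitions commonly applied in resampling studies, which is an assumption: the conditional SE is the standard deviation of the equated scores across replications, the conditional bias is the mean equated score minus the criterion, and the conditional RMSE combines the two. The function name is hypothetical.

```python
import numpy as np

def error_indices(replications, criterion):
    """Conditional SE, bias, and RMSE at each score point (assumed definitions).

    replications: array of shape (reps, score points) of equated scores.
    criterion:    criterion equated score at each score point.
    """
    reps = np.asarray(replications, dtype=float)
    crit = np.asarray(criterion, dtype=float)
    se = reps.std(axis=0, ddof=0)        # random error across replications
    bias = reps.mean(axis=0) - crit      # systematic error against the criterion
    rmse = np.sqrt(se**2 + bias**2)      # total error
    return se, bias, rmse
```

With the population form of the standard deviation (`ddof=0`), the RMSE computed this way equals the root of the mean squared difference between each replication and the criterion.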

Research Design Matrix

Bootstrap Mean Distribution

Standard Error Results (SE)

Preliminary Results • How similar are the various equating methods in terms of equating error? • The following conclusions have been reached based on these preliminary analyses: • On average, the Circle-arc method appears to have the smallest random error across the entire scale. • The Synthetic linking function has the smallest random error variance for scores between -1 and 1 standard deviation around the mean. • For all methods, the general trend is that the overall random error variance tends to decrease as sample size increases.

Bias Summary for selected conditions

RMSE Summary (Reading Grade 7)

RMSE Summary (Math Grade 7)

Exploratory MANOVA

Graphical MANOVA Summary • Reading Grade 7 • Preliminary results suggest that the within error variance due to sample variability is not significant. • There appears to be a significant mean difference between the various equating methods in terms of the RMSE index. The mean RMSE for Circle-arc appears to be significantly different from that of the other methods. • Math Grade 7 • The exploratory MANOVA results from Math Grade 7 lead to a slightly different conclusion. Both the within and between error variances appear not to be significantly different for all methods and sample conditions.

Results Summary _ Reading Grade 7

Conclusion • From this first phase of analyses, the Circle-arc method appears to produce on the average the smallest amount of systematic and random error. • However, the interpretation of which method produces the least amount of error depends on where the cut scores are set on the scale. • An important recommendation from this study is that if the cut score is set around the mean, then any of these methods will produce similar equating errors proportional to the difference in form difficulty.

Future Directions • I would like to look at the effects of differences in test form difficulty on the various methods. • I also intend to explore even smaller samples to estimate the minimum sample sizes for each method where equating becomes unrealistic. • My ultimate goal is to explore new ways to build test forms to meet predefined statistical and content characteristics in small sample situations.

Questions and Comments • I would like to thank everyone in the Psychometrics Department and Measured Progress for making the whole experience very enjoyable and the actual research as painless as possible. • Thank you • Kinge Mbella • Doctoral Student • UNC Greensboro
