
Measurement 102


Presentation Transcript


  1. Measurement 102 Steven Viger Lead Psychometrician Michigan Department of Education Office of Educational Assessment and Accountability

  2. Student Performance Measurement • The previous session discussed some basic mechanics involved in psychometric analysis. • Graphical and statistical methods • The focus of this session is on the interpretation of those data in light of the often-used terms reliability and validity. • Some attention will also be paid to the higher-level psychometrics that go on behind the scenes. • How the scale scores are REALLY made!

  3. Making inferences from measurements • The inferences one can make based solely on educational measurement are limited. • The extent of the limitation is largely a function of whether or not evidence of the valid use of scores is accumulated. • At times, the terms validity and reliability are confused with one another; in fact, they describe very different concepts.

  4. Some basic validity definitions • Validity • The degree to which the assessment measures the intended construct(s) • Answers the question, “are you measuring what you think you are?” • More contemporary definitions focus on the accumulation of evidence for the validity of the inferences and interpretations made from the scores produced.

  5. Some basic reliability definitions • Reliability • Consistency • The degree to which students would be rank-ordered the same if they were administered the same assessment numerous times. • Strictly speaking, the assumption is based on an ‘infinite number’ of retests with no memory of the previous administrations…an unrealistic scenario.

  6. More about reliability • Reliability is one of the most fundamental requirements for measurement—if the measures are not reliable, then it is difficult to support claims that the measures can be valid for any particular decision. • Reliability refers to the degree to which instrument scores for a group of participants are consistent over repeated applications of a measurement procedure and are, therefore, dependable and repeatable.

  7. Reliability and Classical Test Theory • X = T + E • True Score (T): A theoretical score for a person on an instrument, equal to the average score for that person over an infinitely large number of ‘retakes’. • Error (E): The degree to which an observed score (X) varies from the person’s theoretical true score (T). • In this context, reliability refers to the degree to which scores are free of measurement error for a particular group, if we assume the relationship between observed and true scores is as depicted above.
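
A minimal simulation can make the X = T + E decomposition concrete. The sketch below (Python, with made-up score and error variances chosen purely for illustration) generates true scores and errors, then recovers reliability as the ratio of true-score variance to observed-score variance.

    import numpy as np

    rng = np.random.default_rng(0)
    n_students = 10_000

    # Hypothetical variances, chosen for illustration only.
    true_scores = rng.normal(loc=50, scale=8, size=n_students)   # T
    errors = rng.normal(loc=0, scale=4, size=n_students)         # E
    observed = true_scores + errors                              # X = T + E

    # Under classical test theory, reliability = var(T) / var(X).
    reliability = true_scores.var() / observed.var()
    print(f"Simulated reliability: {reliability:.3f}")           # approximately 0.80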

  8. ‘Unreliability’ AKA the standard error of measurement • The standard error of measurement (SEM) is an estimate of the amount of error present in a student’s score. • If X= T + E, the SEM serves as a general estimate of the ‘E’ portion of the equation. • There is an inverse relationship between the SEM and reliability. Tests with higher reliability have smaller SEMs.   • Reliability coefficients are indicators that reflect the degree to which scores are free of measurement error.
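
The inverse relationship noted above follows from the standard formula SEM = SD_X * sqrt(1 - reliability). A quick calculation with a hypothetical observed-score standard deviation shows how the SEM shrinks as reliability rises.

    import math

    sd_x = 10.0  # hypothetical standard deviation of observed scores

    # SEM = SD of observed scores times sqrt(1 - reliability).
    for reliability in (0.70, 0.80, 0.90, 0.95):
        sem = sd_x * math.sqrt(1 - reliability)
        print(f"reliability = {reliability:.2f}  ->  SEM = {sem:.2f}")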

  9. More on the Standard Error of Measurement • The smaller the SEM for a test (and, therefore, the higher the reliability), the greater one can depend on the ordering of scores to represent stable differences between students. • The higher the reliability, the more likely it is that the rank ordering of students by score is due to differences in true ability rather than random error. • The higher the reliability, the more confident you can be in the observed score, X, being an accurate estimate of the student’s true score, T.

  10. Standards for Reliability • There are no mathematical ‘rules’ to determine what constitutes an acceptable reliability coefficient. • Some advice: • Decisions about individuals should rest on scores produced by highly precise instruments. • The higher the stakes, the higher you will want your reliability to be. • Group-based decisions in a research setting typically allow lower reliability. • If you are making high-stakes decisions about individuals, you need reliabilities above .80 and preferably in the .90s.

  11. Establishing validity • Past practice has been to treat validity as if there were a criterion amount of evidence necessary to deem an instrument valid. • That practice is outdated and inappropriate. • It does not acknowledge that numerous pieces of information need to come together to facilitate valid inferences. • It tends to discount some pieces of evidence and overemphasize others. • It leads to a narrowing of scope and can encourage a limited approach to gathering evidence.

  12. Process vs. Product • Rather than speak of validity as a thing, we need to start approaching it as an ongoing process, fed by all aspects of a testing program: validation. • The current AERA and APA standards for validity treat the validation process much like a civil court proceeding. • A preponderance of the evidence is sought, with the evidence coming from multiple sources.

  13. Validation from item evidence • Focus is on elimination of “construct-irrelevant variance” • Some ways this is accomplished: • Well-established item development/review procedures • Demonstrating alignment of individual items to standards • Showing the items/assessments are free of bias, both quantitatively and qualitatively • Simple item analyses: eliminating items with questionable statistics (e.g., p-values that are too high, low point-biserial correlations, etc.), as in the sketch below
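
As an illustration of those simple item analyses, the sketch below computes each item's p-value (proportion correct) and a corrected point-biserial correlation from a small, made-up scored response matrix. The data and any flagging thresholds are hypothetical.

    import numpy as np

    # Rows = students, columns = items; 1 = correct, 0 = incorrect (made-up data).
    scores = np.array([
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 1],
    ])

    total = scores.sum(axis=1)

    for item in range(scores.shape[1]):
        p_value = scores[:, item].mean()              # item difficulty (proportion correct)
        rest = total - scores[:, item]                # total score excluding this item
        point_biserial = np.corrcoef(scores[:, item], rest)[0, 1]
        print(f"Item {item + 1}: p = {p_value:.2f}, point-biserial = {point_biserial:.2f}")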

  14. Validation from scaled scores • Scale score level validity evidence includes but is not limited to: • Input from item-level validity evidence (the validity of the score scale depends upon the validity of the items that contribute to that score scale) • Convergent and divergent relationships with appropriate external criteria. • Reliability evidence • Appropriate use of a ‘strong’ measurement model for the production of student scores.

  15. Is it valid, reliable, or both? [Figure: four target diagrams illustrating the combinations low reliability/low validity, low reliability/high validity, high reliability/high validity, and high reliability/low validity]

  16. Measurement models • The measurement models used by MDE fall under the general category of Item Response Theory (IRT) models. • IRT models depict the statistical relationship that results from person/item interactions. • Specifically, statistical information regarding the persons and the items is used to predict the probability of correctly responding to a particular item; for a constructed-response item, it is the probability of a person receiving a specific score point from the rubric. • Like all statistically based models, IRT models carry with them some assumptions; some are theoretical whereas others are numerical.

  17. IRT assumptions • Unidimensionality: there is a single underlying construct being measured by the assessment (e.g., mathematics achievement, writing achievement, etc.) • As a result of the single-construct assumption, the model dictates that we treat all sub-components (strand level, domain, subscales in general) as contributing to the single construct • Assumes that there is a high correlation between sub-components • It would probably be better to measure the sub-components separately, but that would require significantly more assessment items to attain decent reliability

  18. IRT assumptions • Assumes that a more able person has a higher probability of responding correctly to an item than a less able person • Specifically, when a person’s ability is greater than the item difficulty, that person has a better than 50% chance of getting the item correct. • Local independence: the response to one item does not influence the probability of responding correctly to another item. • The data fit the model! • The item and person parameter estimates are reasonable representations of reality, and the data collected meet the IRT model assumptions.

  19. The Rasch Model (MEAP and ELPA)
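
The transcript does not reproduce the equation shown on this slide; for reference, the standard dichotomous Rasch model is

    P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}

where θ_j is person j’s ability and b_i is the difficulty of item i, so the probability of a correct response depends only on the difference θ - b.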

  20. The Rasch Model (1 parameter logistic model) • An item characteristic curve for a sample MEAP item

  21. The 3 Parameter Logistic Model (MME and MEAP Writing)
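
Again, the equation itself is not in the transcript; the standard three-parameter logistic (3PL) model is

    P(X_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{\exp\big(a_i(\theta_j - b_i)\big)}{1 + \exp\big(a_i(\theta_j - b_i)\big)}

where a_i is the discrimination, b_i the difficulty, and c_i the lower asymptote (‘guessability’) of item i. Some presentations include an additional scaling constant D ≈ 1.7 inside the exponent.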

  22. The 3 Parameter Logistic Model • An item characteristic curve for a sample MME item.

  23. Before I show you what a string of items looks like using IRT, I’d like to first point out some differences between the models that will lead to some major differences in the way the items look graphically. • In particular, we need to pay attention to the differences in the formulas. • Are there features of the 3PL model that do not appear in the 1PL model?

  24. In both models, the quantity driving the solution to the equation is the difference between person ability and item difficulty: θ - b. • However, in the 3PL model, that relationship is altered, and we cannot rely on the difference between ability and difficulty alone to determine the probability of a correct response to an item.

  25. 1PL vs. 3PL • In the 1 parameter model, the item difficulty parameter is the only estimated item quantity; its difference from student ability (assumed to be a known and fixed quantity) drives the probability of a correct response. All other elements of the equation are constants. • Hence the name, 1 parameter model. • Therefore, when you see the plots of multiple items, they should differ only in their location on the scale, shifted from one another by a constant.

  26. 1PL vs. 3PL • In the 3 parameter model, there are still constants and the difference between ability and difficulty is still the critical piece. However, a, the discrimination parameter, has a multiplicative effect on the difference between ability and difficulty. Furthermore, the minimum possible result of the equation is governed by the ‘c’ parameter. • If c > 0.00, the probability of a correct response can never fall below c, no matter how low the ability. • Item characteristic curves will vary by location on the scale as well as by lower asymptote (c parameter) and slope (a parameter). • Knowing how difficult an item is compared to another is still relevant, but it is not the only piece of information that leads to differences between items.
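
To see these differences numerically, the short sketch below evaluates both models at a few ability levels for a single hypothetical item; the parameter values (b = 0, a = 1.5, c = 0.20) are made up for illustration.

    import math

    def p_1pl(theta, b):
        """Rasch / 1PL probability of a correct response."""
        return 1 / (1 + math.exp(-(theta - b)))

    def p_3pl(theta, a, b, c):
        """3PL probability: slope scaled by a, lower asymptote c."""
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    b, a, c = 0.0, 1.5, 0.20  # hypothetical item parameters

    for theta in (-3, -1, 0, 1, 3):
        print(f"theta = {theta:+d}:  1PL = {p_1pl(theta, b):.2f}   3PL = {p_3pl(theta, a, b, c):.2f}")

Note that the 3PL probabilities never drop below c = 0.20, and the a parameter makes the curve rise more steeply around the item's difficulty.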

  27. MEAP example (10 items scaled using Rasch)

  28. MME example (10 items scaled using the 3-PL model)

  29. How do we get there? (MME Science sample scored response strings: 010100101111000111101, 110101111000101100110, 110011011000011101001, 011010111011001010011) • Although the graphics and equations on the previous screens may make conceptual sense, you may have noticed that the solution to the equations depends on knowledge of the values of some of the variables. • We are psychometricians…not psychomagicians, so the numbers come from somewhere. • The item and person parameters have to be estimated. • We need a person-by-item matrix to begin the process.

  30. IRT Estimation • The person by item matrix is fed into an IRT program to produce estimates of item parameters and person parameters. • An estimation algorithm is used, which is essentially a predefined process with ‘stop and go’ rules. The end products are best estimates of the item parameters and person ability estimates. • Item parameters are the ‘guessability’, discrimination and difficulty parameters • Person parameters are the ability estimates we use to create a student’s scale score.

  31. Parameter Estimation • For single parameter (item difficulty) models, WINSTEPS is the industry standard. • More complex models like the 3 parameter model used in the MME require more specialized software such as PARSCALE. • The estimation process is iterative but happens very quickly; most programs converge in less than 10 seconds. • Typically, item parameters are estimated followed by person ability parameters.

  32. Estimating Ability • Once item parameters are known, we can use the item responses for the individuals to estimate their ability (theta). • For the 3PL model, people who share the same response string (pattern of correct and incorrect responses) will have the same estimate of theta. • In the 1PL model, the raw score alone is used to derive theta. • Under the 3PL model, the same raw score can therefore generate different estimates of theta depending on the response pattern, though the estimates tend to be close. The program will create a table that relates raw scores to theta to scale scores based on maximum likelihood estimation.
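
As a minimal sketch of the underlying idea (not the actual WINSTEPS or PARSCALE algorithms), the code below finds a person's maximum likelihood theta under the Rasch model using Newton-Raphson, assuming the item difficulties are already known. The difficulties and the response string are made up.

    import math

    def estimate_theta(responses, difficulties, n_iter=20):
        """Maximum likelihood ability estimate under the Rasch model,
        given a scored response string and known item difficulties."""
        theta = 0.0
        for _ in range(n_iter):
            probs = [1 / (1 + math.exp(-(theta - b))) for b in difficulties]
            # First derivative of the log-likelihood (observed minus expected score).
            first = sum(x - p for x, p in zip(responses, probs))
            # Second derivative (negative of the test information at theta).
            second = -sum(p * (1 - p) for p in probs)
            theta -= first / second  # Newton-Raphson update
        return theta

    # Hypothetical item difficulties and one person's scored responses.
    difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
    responses = [1, 1, 1, 0, 0]

    print(f"Estimated theta: {estimate_theta(responses, difficulties):.2f}")

One caveat: the maximum likelihood estimate does not exist for all-correct or all-incorrect response strings, which is one reason operational programs use more elaborate procedures.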

  33. From theta to scale score • Remember the following formula? • y = mx + b • That is an example of a linear equation. • MDE uses linear equations to transform thetas to scale scores. • There is a different transformation for each grade and content area. • Performance levels are determined by the student’s scale score. • Cut scores are produced by standard setting panelists.
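
A sketch of the theta-to-scale-score step, with a hypothetical slope and intercept (the actual MDE values differ by grade and content area and are not given here):

    def theta_to_scale_score(theta, slope=25.0, intercept=400.0):
        """Linear transformation y = m*x + b from theta to a reporting scale.
        The slope and intercept are hypothetical placeholders, not MDE values."""
        return round(slope * theta + intercept)

    print(theta_to_scale_score(-1.0))  # 375
    print(theta_to_scale_score(0.0))   # 400
    print(theta_to_scale_score(1.2))   # 430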

  34. Summary • In this session you found out a bit about reliability and validity. • Two important pieces of information for any assessment. • Remember, it is the validity of the inferences we make that is important. • The evidence is accumulated and the process is ongoing. • There are no ‘types’ of validity. • You were also introduced to item response theory models and how they are used to produce MDE scale scores. • The hope is that you leave with a greater understanding of how MDE assessments are scored, scaled, and interpreted. • In addition, you now have some ‘tools’ that can assist you in your own analyses.

  35. Contact Information Steve Viger Michigan Department of Education, 608 W. Allegan St., Lansing, MI 48909, (517) 241-2334, VigerS@Michigan.gov
