Goals of this session

The measurement model: what does it mean, and what can you do with it?
Presented by Michael Nering, Ph.D.

Goals of this session • Present commonly used measurement models • Item response theory (IRT) • Show how these models form the backbone of any large-scale assessment program • Equating • Scaling • Discuss the meaning of the measurement models • Ability estimation • Item characteristics

Now, really… • This session is an introduction to the world of psychometrics • My goal is for you to understand that psychometrics: • Is not a black box • Is really just a set of procedures

A little about me • Yes, I’m a psychometrician • B.A. in psychology at Kent State • Ph.D. in psychology at the University of Minnesota • Working at Measured Progress since 1999 • Research areas of interest: IRT, equating, scaling, person fit, adaptive testing

Why Psychology? • Psychometricians typically come from: • Educational measurement programs • Psychometric programs • Industrial/organizational (I/O) psychology programs • Ultimately, we are all pursuing an understanding of people by way of “quantification”

Psychometrics Defined • Psycho + metrics (psychological + measurement) • The business of measuring psychological “things”

What are psychological things? • Any “latent” trait • Any “characteristic” that is not directly “observable” • Examples: • Depression, bipolar disorder, personality disorders • Math, reading, writing, and science abilities • We don’t care which: let’s just call it “θ”

Counterparts to Psychometrics • Econometrics • Measurement of economic things • Sociometrics • Measurement of social things

All “metrics” are ultimately a blend of things

Quantification in Psychology • Deep roots that came originally from philosophy • Philosophy in the 1500s branched into several disciplines because of the need to quantify certain things to better understand human beings

Philosophy’s Many Branches • This desire to better understand humans led to two primary areas of study • Physiology • 1543: A Belgian physiologist practices the dissection of cadavers • Psychology • 1524: Marco Marulik publishes The Psychology of Human Thought

Yes, I did just use the word “cadaver” … but trust me it’s okay

The last 100 years of psychometrics • Classical test theory and Spearman’s 1904 contribution • True score theory • Reliability theory, p-values, point-biserial coefficients • Item response theory

Let’s talk about IRT • When I say “measurement model” I really do mean some sort of IRT model • Lots of historical developments • Lord & Novick text of 1968 • Many advantages over CTT

So, what is IRT? • A family of mathematical models that describe the interaction between examinees and test items • Examinee performance can be predicted in terms of the underlying trait • Provides a means for estimating scores for people and characteristics of items • Common framework for describing people and items

The ogive • A naturally occurring form that describes something about people • Used throughout science, engineering, and the social sciences • Also used in architecture, carpentry, photography, art, and so forth

The ogive

A little jargon • The item characteristic curve (ICC) • Also called: • Item response function • Trace line • Etc. • Stochastic: 1) involving a random variable, or 2) involving chance or probability

The ICC • Does this one little function really do everything? • Scale items & people onto a common metric? • Help in standard setting? • Foundation of equating? • Some meaning in terms of student ability?

Does this one little function really do everything? Absolutely! Let’s talk more about the ICC

The ICC • Any curve in a Cartesian system can be defined by a formula • The simplest formula for the ogive is the logistic function:
$$P_{ij} = \frac{\exp(\beta_j - \delta_i)}{1 + \exp(\beta_j - \delta_i)}$$

The ICC • Where $\delta_i$ is the item parameter and $\beta_j$ is the person parameter • The function represents the probability of responding correctly to item i given the ability of person j

$\delta_i$ is the inflection point, the ability at which p = 0.5 • Item i: $\delta_i = 0.125$

We can now use the item parameter to calculate p • Let’s assume we have a student with $\beta_j = 1.0$, and our item has $\delta_i = 0.125$ • Then we simply plug these numbers into our formula

Using the item parameter to calculate p • For a student with $\beta_j = 1.00$: $p = \frac{\exp(1.00 - 0.125)}{1 + \exp(1.00 - 0.125)} \approx 0.706$
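The slide’s arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes the logistic form exp(β - δ) / (1 + exp(β - δ)) with no extra scaling constant; with the slide’s values it gives p of about 0.7058, matching the slide up to rounding. The function name `rasch_probability` is our own hypothetical label.

```python
import math


def rasch_probability(beta_j: float, delta_i: float) -> float:
    """Probability that person j answers item i correctly, using the
    logistic form exp(beta - delta) / (1 + exp(beta - delta))."""
    z = beta_j - delta_i
    return math.exp(z) / (1.0 + math.exp(z))


# Values from the slides: item parameter 0.125, student ability 1.00
p = rasch_probability(beta_j=1.00, delta_i=0.125)
print(f"p = {p:.4f}")  # p = 0.7058

# When ability equals the item parameter (the inflection point), p is 0.5
print(rasch_probability(beta_j=0.125, delta_i=0.125))  # 0.5
```

The second call shows why the item parameter marks the inflection point: at that ability the examinee is equally likely to succeed or fail.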

Wait a minute • What do you mean, a student with an ability of 1.0?? • Does an ability of 0.0 mean that a student has NO ability? • What if my student has a reading ability estimate of -1.2? What in the world does that mean?

The ability scale • Ability is on an arbitrary scale that just so happens to be centered around 0.0 • We use arbitrary scales all the time: • Fahrenheit • Celsius • Decibels • DJIA

Scaled Scores • Although ability estimates are centered around zero, reported scores are not • Scaled scores are typically a linear transformation of ability estimates • Example of a linear transformation: • Scaled score = (Ability × Slope) + Intercept
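The linear transformation is simple enough to sketch directly. The slope of 10 and intercept of 50 below are hypothetical values chosen only for illustration; an actual testing program picks its own constants when it defines its reporting scale.

```python
def to_scaled_score(ability: float, slope: float, intercept: float) -> float:
    """Linear transformation of an ability estimate: (ability x slope) + intercept."""
    return ability * slope + intercept


# HYPOTHETICAL constants for illustration only; a real program
# chooses its own slope and intercept for its reporting scale.
SLOPE = 10.0
INTERCEPT = 50.0

for ability in (-1.2, 0.0, 1.0):
    scaled = to_scaled_score(ability, SLOPE, INTERCEPT)
    print(f"ability {ability:+.1f} -> scaled score {scaled:.1f}")
```

Because the transformation is linear with a positive slope, it keeps students in the same order and only relabels the scale, so an ability estimate of -1.2 is reported as a positive number.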

The need for scaled scores • Roughly half of the kids will have negative ability estimates

Scaled Scores

Use of scaled scores • Student/parent level report • School/district report • Cross year comparisons • Performance level categorization

There’s a lot here • Scaled scores are surface-level information • Behind the scenes: • We use fancy formulas to depict the interaction between students and test items • There’s a “probabilistic” relationship between students and test items

Unfortunately, life can get a lot worse • Items vary from one another in a variety of ways: • Difficulty • Discrimination • Guessing • Item type (multiple choice [MC] vs. constructed response [CR])

Items can vary in terms of difficulty • [Figure: ICCs of an easier item and a harder item, with the ability of a student marked on the ability axis]
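The same student has different chances of success on items of different difficulty. This minimal sketch assumes the logistic ICC form and two made-up item parameters, -1.0 for the easier item and +1.0 for the harder one; moving the item parameter to the right shifts the whole curve right and lowers p for a fixed ability.

```python
import math


def rasch_probability(beta_j: float, delta_i: float) -> float:
    """Logistic ICC; 1 / (1 + exp(-(beta - delta))) is algebraically the same
    as exp(beta - delta) / (1 + exp(beta - delta))."""
    return 1.0 / (1.0 + math.exp(-(beta_j - delta_i)))


# HYPOTHETICAL values for illustration
student_ability = 0.0
easier_item = -1.0   # item parameter to the left of the student on the scale
harder_item = 1.0    # item parameter to the right of the student

p_easier = rasch_probability(student_ability, easier_item)
p_harder = rasch_probability(student_ability, harder_item)
print(f"easier item: p = {p_easier:.3f}; harder item: p = {p_harder:.3f}")
```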

Items can vary in terms of discrimination • Discrimination is reflected by the “pitch” in the ICC • Thus, we allow the ICCs to vary in terms of their slope

Good item discrimination • Noticeable difference in p between 2 close ability levels

Poor item discrimination • Smaller difference in p between the same 2 ability levels
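Letting the slope vary requires an extra item parameter. The slides do not show that formula, so this sketch assumes the common two-parameter logistic (2PL) form 1 / (1 + exp(-a(β - δ))), where `a` is a discrimination (slope) parameter; every numeric value is hypothetical. It compares the gap in p between the same two close ability levels for a steep item and a flat item.

```python
import math


def two_pl_probability(beta_j: float, delta_i: float, a_i: float) -> float:
    """2PL ICC: a_i controls how steep ("pitched") the curve is around delta_i."""
    return 1.0 / (1.0 + math.exp(-a_i * (beta_j - delta_i)))


ITEM_DELTA = 0.0           # hypothetical item location
LOWER, UPPER = 0.0, 0.5    # two close ability levels (hypothetical)


def probability_gap(a_i: float) -> float:
    """Difference in p between the two ability levels for an item with slope a_i."""
    return (two_pl_probability(UPPER, ITEM_DELTA, a_i)
            - two_pl_probability(LOWER, ITEM_DELTA, a_i))


good_gap = probability_gap(a_i=2.0)   # steep ICC: good discrimination
poor_gap = probability_gap(a_i=0.5)   # flat ICC: poor discrimination
print(f"good item gap = {good_gap:.3f}, poor item gap = {poor_gap:.3f}")
```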

Guessing • This item’s ICC approaches 0.25 asymptotically as ability decreases
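A lower asymptote like 0.25 is usually modeled with a guessing parameter. This sketch assumes the three-parameter logistic (3PL) form c + (1 - c) / (1 + exp(-a(β - δ))), which the slides do not spell out; the item values are hypothetical, with c = 0.25 chosen because that is what blind guessing on a four-option item would give (an assumption, since the slide does not say how many options the item has).

```python
import math


def three_pl_probability(beta_j: float, delta_i: float, a_i: float, c_i: float) -> float:
    """3PL ICC: c_i is the lower asymptote, the chance of success at very low ability."""
    return c_i + (1.0 - c_i) / (1.0 + math.exp(-a_i * (beta_j - delta_i)))


# Hypothetical item: location 0.0, slope 1.0, guessing parameter 0.25
for ability in (-6.0, -3.0, 0.0, 3.0):
    p = three_pl_probability(ability, delta_i=0.0, a_i=1.0, c_i=0.25)
    print(f"ability {ability:+.1f}: p = {p:.3f}")
```

With guessing in the model, p at the item location is c + (1 - c)/2 = 0.625 rather than 0.5, and very low-ability examinees still succeed about a quarter of the time.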

Polytomous Items

I’m sure by now you might be having a couple of thoughts: • “I’m stuck in a ‘psycho’metric prison … help me!” • “How can I get up, open the door, and walk out without anybody noticing?”

But, trust me … I’m really trying to make a simple point

Items and people • Interact in a variety of ways • We can use IRT to show that there exists a nice little S-shaped curve that describes this interaction • As ability increases, the probability of a correct response increases

Advantages of IRT • Because IRT is stochastic, we can take advantage of many statistical principles • A test is the sum of its parts

The test characteristic curve • A test is made up of many items • The test characteristic curve (TCC) can be used to summarize across all of our items • The TCC is simply the sum of the ICCs along our ability continuum • For any ability level, we can use the TCC to estimate an examinee’s expected overall test score
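Summing ICCs is easy to sketch. This example assumes the logistic ICC form for every item and a hypothetical five-item test whose item parameters are made up; the function name `expected_total_score` is our own.

```python
import math


def rasch_probability(beta_j: float, delta_i: float) -> float:
    """Logistic ICC for one item."""
    return 1.0 / (1.0 + math.exp(-(beta_j - delta_i)))


def expected_total_score(beta: float, deltas: list[float]) -> float:
    """TCC: expected number-correct score at ability beta, i.e., the sum of
    every item's ICC evaluated at that ability."""
    return sum(rasch_probability(beta, d) for d in deltas)


# HYPOTHETICAL five-item test; item parameters are made up
DELTAS = [-1.5, -0.5, 0.0, 0.125, 1.2]

for beta in (-2.0, 0.0, 1.0, 2.0):
    expected = expected_total_score(beta, DELTAS)
    print(f"ability {beta:+.1f}: expected score {expected:.2f} out of {len(DELTAS)}")
```

Because each ICC rises with ability, their sum does too, so the TCC climbs from near 0 toward the number of items.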

A bunch of ICCs are on a test

The test characteristic curve

The test characteristic curve • From an observed test score (i.e., a student’s total test score) we can estimate ability • The TCC is used in standard setting to establish performance levels • The TCC can also be used to equate tests from one year to the next
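Going from an observed total score back to ability means reading the TCC in reverse. This minimal sketch assumes the logistic ICC form and hypothetical item parameters, and uses bisection to find the ability at which the expected score equals the observed score; under this one-parameter logistic form, that point is also the maximum likelihood ability estimate for the raw score. The function names are our own, and operational scoring, standard setting, and equating involve more than this.

```python
import math


def rasch_probability(beta_j: float, delta_i: float) -> float:
    """Logistic ICC for one item."""
    return 1.0 / (1.0 + math.exp(-(beta_j - delta_i)))


def expected_total_score(beta: float, deltas: list[float]) -> float:
    """TCC: sum of the item ICCs at ability beta."""
    return sum(rasch_probability(beta, d) for d in deltas)


def ability_from_total_score(score: int, deltas: list[float], tol: float = 1e-6) -> float:
    """Invert the TCC: find the ability whose expected total score equals `score`.
    The TCC is strictly increasing, so bisection finds the unique solution.
    Zero and perfect scores have no finite solution and are rejected."""
    if not 0 < score < len(deltas):
        raise ValueError("score must be strictly between 0 and the number of items")
    lo, hi = -1.0, 1.0
    # Widen the interval until it brackets the observed score
    while expected_total_score(lo, deltas) > score:
        lo -= 1.0
    while expected_total_score(hi, deltas) < score:
        hi += 1.0
    # Bisection: keep expected(lo) <= score <= expected(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_total_score(mid, deltas) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


DELTAS = [-1.5, -0.5, 0.0, 0.125, 1.2]  # HYPOTHETICAL item parameters
for total in range(1, len(DELTAS)):
    print(f"total score {total}: estimated ability {ability_from_total_score(total, DELTAS):+.3f}")
```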
