Interpretation: How to Use Psychometrics
A Different Format • Previous talks were generally about one topic • Today’s presentation: Where does this stuff come up at MP, outside of the psychos? • A little bit of info on several different things
The goals • Understand various psychometric analyses as they arise in day-to-day work • See which stats are used in different applications • Answer questions
Topics Covered • Things you’d find in a key verification file • Classical stats (p-values, point-biserials) • Things you’d find at a form pulling • IRT stats (TCC’s, TIF’s) • Things you’d find in a technical manual • All sorts of info • A question you’d hear at a standard setting • IRT
1. Key Verification Files • Purpose: To check the correctness of answer keys (MC items) • A list of items whose stats are unusual or merit further investigation • Items identified based on their p-values and/or point-biserials
P-value: The proportion of students answering an item correctly • “How easy is the item?” • Point-biserial: The correlation between item score and total score • “If you do well on the item, do you tend to do well on the test?”
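To make these two definitions concrete, here is a minimal sketch (not MP's production code) of how both statistics could be computed from a matrix of 0/1 item scores. The toy data, the variable names, and the use of the uncorrected total score are assumptions for illustration.

```python
import numpy as np

# Rows = students, columns = items; 1 = correct, 0 = incorrect (toy data).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

total = scores.sum(axis=1)  # each student's total raw score

def p_value(item_col):
    """Proportion of students answering the item correctly ("how easy is it?")."""
    return item_col.mean()

def point_biserial(item_col, total_score):
    """Correlation between item score (0/1) and total test score."""
    return np.corrcoef(item_col, total_score)[0, 1]

for j in range(scores.shape[1]):
    item = scores[:, j]
    print(f"Item {j + 1}: p = {p_value(item):.2f}, "
          f"point-biserial = {point_biserial(item, total):.2f}")
```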
When might we be alarmed? • Not many kids are picking the right answer • The p-value is low (less than .25) • Low-performing kids are doing better on the item than high-performing kids • The point-biserial is low (less than .15) • And/or an incorrect answer choice has strange stats
Distractor Stats • Distractor p-value: The proportion of students picking the distractor (say, choice C when the correct answer is B) • “How popular is choice C?” • Flag item if distractor p-value is higher than .3 • Distractor point-biserial: The correlation between picking the distractor and total test score • “If you picked C, how well did you tend to do on the test?” • Flag item if distractor PBS is positive
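As an illustration of the flagging rules on the last two slides, here is a small hypothetical helper. The thresholds are the ones quoted above; the function name, its inputs, and the example stats are made up.

```python
def flag_item(p_value, pbs, distractor_stats,
              p_min=0.25, pbs_min=0.15, distractor_p_max=0.30):
    """Return the reasons an item would land in the key verification file.

    distractor_stats: dict mapping choice letter -> (p_value, point_biserial).
    Thresholds follow the rules of thumb on the slides above.
    """
    flags = []
    if p_value < p_min:
        flags.append(f"low p-value ({p_value:.2f})")
    if pbs < pbs_min:
        flags.append(f"low point-biserial ({pbs:.2f})")
    for choice, (d_p, d_pbs) in distractor_stats.items():
        if d_p > distractor_p_max:
            flags.append(f"distractor {choice} too popular (p = {d_p:.2f})")
        if d_pbs > 0:
            flags.append(f"distractor {choice} has positive PBS ({d_pbs:.2f})")
    return flags

# Hypothetical item: key stats are weak and choice C looks suspicious.
print(flag_item(0.20, 0.05, {"C": (0.45, 0.12)}))
```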
An Operational Example • A recent item had the following stats: • Key = D • P-value = 0.10 • Point-biserial = -0.02 • P-value for “C” = 0.60 • Point-biserial for “C” = 0.20 • So the key was wrong? Nope
How Can That Happen? • An example: What is the definition of the word travesty? A: Mockery B: Injustice C: Bellybutton D: Some even stupider answer than “bellybutton” • Actual definition: “Any grotesque or debased likeness or imitation” • The correct answer is “A”, but “travesty of justice” threw off the high-performing students
To sum up… • Psychometrics can help us identify items whose keys need to be checked • Stats used: • P-values • Point-biserials • Distractor p-values and point-biserials • P-values & point-biserials should be relatively high, distractor values should be relatively low • The key usually turns out to be right, but that’s OK
2. Form Pulling • Context: We are choosing items for next year’s exam • Clients like to look at psychometric info when picking items (e.g., MCAS) • We know the stats ahead of time because items were field-tested • Relevant stats: Test Characteristic Curves (TCC’s), raw score cut points, Test Information Functions (TIF’s)
This stuff relates to Item Response Theory (IRT) • TCC is a plot that tells you the expected raw score for each value of ability (denoted theta) • As ability increases, expected raw score increases
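A sketch of how a TCC could be computed, assuming a 3PL IRT model with made-up item parameters; the model and parameters used operationally may differ. The expected raw score at a given theta is just the sum of the item probabilities at that theta.

```python
import numpy as np

# Hypothetical 3PL item parameters: discrimination a, difficulty b, guessing c.
items = [
    # (a,    b,    c)
    (1.0, -1.0, 0.20),
    (1.2,  0.0, 0.25),
    (0.8,  1.0, 0.20),
    (1.5,  0.5, 0.15),
]

def p_correct(theta, a, b, c, D=1.7):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def tcc(theta):
    """Test characteristic curve: expected raw score = sum of item probabilities."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: expected raw score = {tcc(theta):.2f}")
```

Note how the expected raw score increases monotonically with theta, which is what makes the cut point lookup on the next slides possible.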
Raw Score Cut Points • Suppose test has 4 performance levels: Below Basic, Basic, Proficient, Advanced • How many points do you need in order to reach the Basic level? Proficient? Advanced? • Example: Test goes from 0 to 72. Need 35 to reach Basic; 51 to reach Proficient; 63 to reach Advanced • Standard Setting often tells us theta cut points; clients want to know raw score cuts
Using the TCC to find a cut point • Suppose theta cut is 0.4 • Find expected raw score at 0.4 using the TCC. It is 3.3 • Cut is placed between 3 and 4
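A toy illustration of that lookup. The stand-in TCC below is invented and chosen only so that the expected raw score at theta = 0.4 comes out near 3.3, matching the example above.

```python
import math

def expected_raw_score(theta):
    """Stand-in TCC; in practice this comes from the fitted IRT model."""
    # A toy monotone curve with a maximum of 4 raw score points.
    return 4.0 / (1.0 + math.exp(-1.7 * (theta + 0.5)))

theta_cut = 0.4                      # theta cut from standard setting
e = expected_raw_score(theta_cut)    # expected raw score at the cut
lo, hi = math.floor(e), math.ceil(e)
print(f"TCC({theta_cut}) = {e:.1f}: raw score cut falls between {lo} and {hi}")
```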
Test Information Functions • TIF’s tell us the test precision at each level of ability • The higher the curve, the more precision • Easy items give us precision for low values of theta. Similarly: • Hard items give precision at high values • Medium items give precision at medium values
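A sketch of a TIF, assuming a 2PL model so that each item's information peaks near its own difficulty; the parameters are made up. The test information at a given theta is the sum of the item informations.

```python
import numpy as np

# Hypothetical 2PL item parameters (discrimination a, difficulty b).
items = [(1.0, -1.5), (1.3, -0.5), (1.1, 0.5), (0.9, 1.5)]

def item_information(theta, a, b, D=1.7):
    """2PL item information: highest near the item's difficulty b."""
    p = 1.0 / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def tif(theta):
    """Test information function: sum of the item informations."""
    return sum(item_information(theta, a, b) for a, b in items)

for theta in (-2, -1, 0, 1, 2):
    print(f"theta = {theta:+d}: test information = {tif(theta):.2f}")
```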
Why does the client care? • It is often desired that next year’s forms are similar to this year’s forms • Make sure tests have the correct difficulty (TCC, raw score cut points) and precision (TIF) • Match TCC’s, cut points, TIF’s of the two years
Why should the forms be similar? • Theoretically, we should be able to account for differences through equating (Liz) • However, want the student experience to be similar from year to year • Don’t want to give easy test to Class of ’07, hard test to Class of ’08 • Don’t want to make this year’s test less precise than last year’s
Example: 2007 MCAS, Grade 10 Math • Proposed 2007 TCC was lower than last year’s • Solution: Replace some hard items with easy items
Example, Continued • Proposed 2007 TIF had less info at low abilities, more info at high abilities • Solution: • Replace some hard items with easy items • Use hard items with lower PBS, easy items with higher PBS
Example, Continued • Proposed 2007 raw score cuts lower than 2006 raw score cuts • Solution: Replace some hard items with easy items
Guide to making changes • Some rules of thumb for different problems (summary table not reproduced here)
To sum up… • Item Response Theory is useful in form pulling • TCC’s, raw score cuts, TIF’s are often examined • Proposed values should be similar to current year’s • Tests shouldn’t be too easy or hard • Tests should be informative but not too informative • It’s helpful to know how we can change these things based on item stats
3. Technical Manuals • Things in Technical Manuals vary from program to program • Often see some of the following: • P-values and point-biserials (thanks Louis!) • Test reliabilities (thanks Louis!) • TCC’s and TIF’s (thanks Mike!) • DIF (thanks Won!) • Standard Setting (thanks Liz and Abdullah!) • Equating (thanks in advance Liz!) • Inter-rater reliability (thanks for nothing!) • Decision consistency and accuracy (ditto)
Technical Manuals: P-Values & Point-Biserials • You’ll often see a table like this:
Technical Manuals: Reliabilities (and other stats) • Louis said: Reliability is the correlation between scores on parallel forms • Higher reliability → greater consistency • You’ll often see a table like this:
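The table itself isn't reproduced here, but as an illustration of how a reliability estimate can be computed from item scores, here is a sketch of Cronbach's alpha, one commonly reported internal-consistency estimate. Whether a given program reports alpha or some other coefficient is an assumption; the data are toy data.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: a common internal-consistency estimate of reliability.

    scores: 2-D array, rows = students, columns = items.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

data = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
print(f"alpha = {cronbach_alpha(data):.2f}")
```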
Technical Manuals: TCC’s and TIF’s Give TCC, TIF of each grade / content area
Technical Manuals: DIF • Won said: An item has DIF if the probability of getting the item right is dependent on group membership (e.g., gender, ethnic group) • Measured Progress uses a method called the Standardized P-Difference • Comparing groups • Male-Female • White-Black • White-Hispanic • Minimum 200 examinees in each group
DIF, Continued • A: within [-0.05, 0.05] (negligible) • B: within [-0.10, -0.05) or (0.05, 0.10] (low) • C: outside [-0.10, 0.10] (high)
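A sketch of the standardized p-difference and the A/B/C classification above. Matching examinees on total-score level and weighting the per-level differences by the focal group's counts follow the usual Dorans-Kulick approach, but the exact conventions used operationally are an assumption, and the summary data are invented.

```python
def standardized_p_difference(reference, focal):
    """Standardized p-difference between a reference and a focal group.

    reference, focal: dicts mapping total-score level -> (n_examinees, n_correct).
    At each score level, the difference in proportion correct is weighted by
    the focal group's count at that level.
    """
    num = den = 0.0
    for level, (n_f, correct_f) in focal.items():
        if level not in reference:
            continue
        n_r, correct_r = reference[level]
        num += n_f * (correct_f / n_f - correct_r / n_r)
        den += n_f
    return num / den

def dif_category(d):
    """Classify DIF using the A/B/C ranges from the slide above."""
    if abs(d) <= 0.05:
        return "A (negligible)"
    if abs(d) <= 0.10:
        return "B (low)"
    return "C (high)"

# Hypothetical summaries: score level -> (number of examinees, number correct).
reference = {10: (120, 90), 20: (150, 120), 30: (130, 117)}
focal     = {10: (80, 52),  20: (90, 63),   30: (60, 48)}
d = standardized_p_difference(reference, focal)
print(f"standardized p-difference = {d:.3f} -> {dif_category(d)}")
```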
DIF, Continued • You may see a table like this:
Technical Manuals: Standard Setting & Equating • Liz and Abdullah discussed Standard Setting • In technical manuals, you’ll often see: • Report / summary of standard setting process • Info about panelists (how many, who they are) • What method was used (e.g., bookmark / Body of Work) • Cut points • Info about panelist evaluations • Equating: Come next week and find out!
Inter-rater reliability • When constructed-response items are rated by multiple scorers, how well do raters agree? • The more agreement, the better • Exact agreement: What % of the time do they give the same score? • Adjacent agreement: What % of the time are they off by 1?
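A minimal sketch of both agreement rates for two raters scoring the same set of responses; the scores are hypothetical.

```python
def rater_agreement(scores_1, scores_2):
    """Exact and adjacent agreement rates between two raters' scores."""
    pairs = list(zip(scores_1, scores_2))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) == 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical 0-4 scores given by two raters to the same ten responses.
rater_1 = [4, 3, 2, 4, 1, 0, 3, 2, 4, 3]
rater_2 = [4, 2, 2, 4, 1, 1, 3, 3, 4, 2]
exact, adjacent = rater_agreement(rater_1, rater_2)
print(f"exact agreement = {exact:.0%}, adjacent agreement = {adjacent:.0%}")
```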
Decision Accuracy and Consistency: Introduction • For most programs, four achievement levels, e.g., Below Basic, Basic, Proficient, Advanced • Decision accuracy: degree to which observed categorizations match true categorizations • Decision consistency: degree to which observed categorizations match those of a parallel form
Intuitive examples of accuracy • TRUE LEVEL: Proficient • OBSERVED LEVEL: Proficient • DIAGNOSIS: ACCURATE (GOOD) • TRUE LEVEL: Proficient • OBSERVED LEVEL: Below Basic • DIAGNOSIS: INACCURATE (BAD). False negative • TRUE LEVEL: Basic • OBSERVED LEVEL: Advanced • DIAGNOSIS: INACCURATE (BAD). False positive
Intuitive examples of consistency • OBSERVED LEVEL, Form 1: Basic • OBSERVED LEVEL, Form 2: Basic • DIAGNOSIS: CONSISTENT (GOOD) • OBSERVED LEVEL, Form 1: Basic • OBSERVED LEVEL, Form 2: Advanced • DIAGNOSIS: INCONSISTENT (BAD)
Decision Accuracy and Consistency: Introduction • Livingston and Lewis (1995) proposed a method of estimating decision accuracy/consistency • For most programs, many stats are computed. We will give an example of each • The stats are all based on joint distributions • A joint distribution gives the proportion of times that 2 things both happen • Example: What proportion of students are truly Basic and are observed as Below Basic?
Joint Distribution: True/Observed Achievement Levels • (Table of proportions, true status by observed status) • Overall accuracy: 0.7484
Joint Distribution: Observed/Observed Achievement Levels • (Table of proportions, observed status on Form 1 by observed status on Form 2) • Overall consistency: 0.6574
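A sketch of how the overall indices fall out of joint distributions like the two summarized above. The 4x4 table of proportions below is invented (rows = true level, columns = observed level, ordered Below Basic through Advanced); only the logic of summing the diagonal comes from the definitions above.

```python
import numpy as np

levels = ["Below Basic", "Basic", "Proficient", "Advanced"]

# Hypothetical joint distribution of (true level, observed level); entries sum to 1.
true_vs_observed = np.array([
    [0.14, 0.04, 0.01, 0.00],
    [0.05, 0.22, 0.05, 0.01],
    [0.01, 0.06, 0.25, 0.04],
    [0.00, 0.01, 0.04, 0.07],
])

# Overall accuracy: proportion whose observed level matches their true level,
# i.e. the sum of the diagonal of the joint distribution.
overall_accuracy = np.trace(true_vs_observed)
print(f"overall accuracy = {overall_accuracy:.4f}")

# Overall consistency is computed the same way, but from the joint distribution
# of observed levels on two parallel forms instead of true vs. observed.
```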
Indices Conditional upon Level • Proportion of students correctly classified, given true level • Proportion of students consistently classified by parallel form, given observed level
Indices at Cut Points • Accuracy & consistency at specified cut point • Accuracy: What is the chance that a student is classified on the “correct side” of a cut point? • Consistency: What is the chance that a student is classified on the same side of a cut point twice?
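Continuing the sketch above with the same invented joint distribution: accuracy at a specific cut point counts every student who lands on the correct side of that cut, so off-diagonal cells on the same side of the cut still count as accurate.

```python
import numpy as np

# Same hypothetical true-by-observed joint distribution as in the sketch above.
joint = np.array([
    [0.14, 0.04, 0.01, 0.00],
    [0.05, 0.22, 0.05, 0.01],
    [0.01, 0.06, 0.25, 0.04],
    [0.00, 0.01, 0.04, 0.07],
])

def accuracy_at_cut(joint, cut):
    """Chance a student is classified on the correct side of a cut point.

    cut = k means the cut separates levels 0..k-1 from levels k..end
    (e.g. cut = 2 is the Basic/Proficient cut with four levels).
    """
    below = joint[:cut, :cut].sum()   # truly below the cut, observed below
    above = joint[cut:, cut:].sum()   # truly at/above the cut, observed at/above
    return below + above

print(f"accuracy at the Proficient cut = {accuracy_at_cut(joint, 2):.4f}")
```

Consistency at a cut point works the same way, applied to the observed/observed joint distribution for the two parallel forms.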
To sum up… • Lots of stuff in technical manuals • Both classical test theory material (p-values, point-biserials, reliabilities) & IRT material (TCC’s, TIF’s, equating) are important to understand • Hopefully, these seminars have helped familiarize you with their contents
4. Standard Setting • Comes up all the time outside Psychoville • Should be a perfect topic for this talk, but… • Liz and Abdullah alreadydid a wonderful job
4. Standard Setting • Standard Setting is the process of recommending cut scores between achievement levels • Advanced (A) • Proficient (P) • Below Proficient (BP) • Failing (F) • (Diagram: cut points 1–3 separate the four levels) • Focus on one FAQ in bookmark: How do we determine the arrangement of items in the ordered item booklets?
Brief Review of Bookmark • Each panelist makes use of the ordered item booklet (OIB) • Items in the OIB are presented from easiest to hardest, one page per MC item • Panelists’ job is to place a bookmark in the OIB for each cut • For a given cut, where do panelists place the bookmark? • Where they think borderline students would no longer have a 2/3 chance (or better) of a correct answer • Abdullah said: cut points are derived from bookmark placements
A Very Frequently-Asked Question • First, an FMC: “You messed up the order of the items!” • Then, the FAQ: “Well, how did you determine the order?” • Important: the order is based on actual student performance • We use IRT to determine it
Two MC items: Which is easier? • (Figure comparing an easier item and a harder item)
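One way to make the ordering concrete: under an IRT model, each item can be located at the ability where the modeled chance of a correct answer first reaches 2/3 (the same 2/3 criterion panelists use), and the OIB is sorted by that location. The 2PL model and the item parameters below are assumptions for illustration, not MP's actual procedure.

```python
import math

# Hypothetical 2PL parameters (discrimination a, difficulty b) for five MC items.
items = {"Item 1": (1.2, -0.8), "Item 2": (0.9, 0.3), "Item 3": (1.5, 0.1),
         "Item 4": (1.0, 1.2), "Item 5": (0.7, -0.2)}

def rp67_location(a, b, D=1.7):
    """Ability at which the 2PL probability of success reaches 2/3."""
    # Solve 1 / (1 + exp(-D a (theta - b))) = 2/3 for theta.
    return b + math.log(2.0) / (D * a)

# Order the item booklet from easiest (lowest location) to hardest.
ordered = sorted(items, key=lambda name: rp67_location(*items[name]))
for name in ordered:
    print(f"{name}: 2/3-chance location = {rp67_location(*items[name]):+.2f}")
```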