Introduction to Classical Test Theory (CTT)
X = T + e
• Meaning of X, T, and e
• Basic assumptions
• Parallel tests
• Reliability
• Standard Error of Measurement
• p-value & point-biserial
X = T + e
• X = observed score (this is obvious)
• T = “true” score
• e = error
T and e require some explanation.
T, “true” score
• If you take two forms of equal difficulty, you get two different scores.
• Suppose you take many such tests:
• T is your mean score on all these tests.
• T is an unobservable theoretical concept.
e, error
• Does NOT refer to “error” as in baseball, NOR to mistakes in testing or scoring.
• e is the difference between X and T.
• e is thus related to “standard error.”
• If we have many samples:
• X is the sample statistic
• T is the average of X over the samples
• the standard error of X is the SD of X over the samples, i.e., approximately the average size of e
e, error
• Why does a student get different scores on two different tests of the same difficulty level?
• Short answer: Luck!
• Example: Spelling test, 1000-word pool. Suppose you know 90%.
• Imagine two tests, 10 words each, assembled to have the same average score for all students.
• On one test, by luck, you know all 10 words.
• On the other, by really bad luck, you know only 7!
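The spelling example above can be sketched as a small simulation. This is a hedged illustration, not part of the original deck: each 10-word test is modeled as 10 independent draws from a pool of which the student knows 90%, and T emerges as the long-run mean over many such tests.

```python
# Hypothetical simulation of the spelling example: a student knows 90%
# of a 1000-word pool; each test is 10 randomly drawn words.
import random

random.seed(1)
KNOWN = 0.90  # proportion of the word pool the student knows

def test_score(n_words=10):
    """Score on one randomly assembled test: number of known words."""
    return sum(random.random() < KNOWN for _ in range(n_words))

scores = [test_score() for _ in range(10_000)]
mean_score = sum(scores) / len(scores)
# mean_score (an estimate of T) lands near 9 of 10 words,
# while individual test scores bounce around it by luck.
```

Individual scores of 7 and 10 both occur, even though the "true" score is 9 — which is exactly the point of the slide.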
Basic Assumptions of CTT
• No cheating; no copying between examinees.
• Luck is completely random:
• Difficulty level of the form does not affect luck.
• How you did on another test does not affect luck.
• If you take two forms, no “learning” occurred while taking the first.
Parallel Tests According to CTT, two tests are parallel if: • All students have same “true” score on both tests. • SD of observed scores is the same on both tests. First condition relates to test difficulty. Second condition relates to reliability. So, we could simplify and say two tests are parallel if: • They have the same difficulty • They have the same reliability
Parallel Tests: Beyond CTT Of course, Parallel Tests must be much more than just statistically parallel: • Types of questions • Content • Time limit • Test-taking and administration directions • Legibility • Art work
Reliability
A test is a measurement. Two parallel tests are two independent measurements. A student’s scores on two parallel tests are likely to differ. Roughly speaking, the degree to which such differences are minimized is reliability: the greater the consistency of scores between a test and its parallel form, the greater the reliability of the test. Definition: Reliability = correlation between scores on parallel forms.
Correlation Analysis of Data that come in pairs • Examples: • {Sodium/serving | Sugar/serving} in sample of 10 cereals • {Height | Weight} in a sample of 25 individuals • {Score on one item | sum score on rest of items} in a sample of 400 students (point-biserial) • {Score on test form | Score on parallel form} in a sample of 1000 students (reliability)
Interpreting Correlations
• Correlations measure the degree to which one variable has a linear relationship with another.
• 0 ≤ Magnitude of Correlation ≤ 1
• 0 means no linear relationship
• 1 means a perfect linear relationship
• Sign of correlation:
• Positive: an increase in one goes with an increase in the other
• Negative: an increase in one goes with a decrease in the other
Interpreting Reliability as a Correlation • The degree to which scores on a test are linearly related to scores on a parallel test. • 0 < Reliability < 1 • Reliability is typically about 0.9 for standardized tests.
Methods for Estimating Reliability
• Parallel forms (we almost never have true parallel forms)
• Test-Retest (same test twice? Forget it!)
• Split-half
• Cronbach’s alpha
Estimating Reliability: Split-half • Here’s a great idea: • Split the test in half (two parallel halves) • Correlate scores on the two halves • Scale up to get the correlation that corresponds to two full-length tests • Hard to get two parallel halves.
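The "scale up" step is conventionally done with the Spearman–Brown formula, which projects the half-test correlation to the reliability of a full-length test. A sketch (the function name is mine):

```python
# Spearman-Brown step-up: given the correlation r between two half-tests,
# estimate the reliability of the full-length test.
def spearman_brown(r_half):
    """Project a half-length correlation to full-length reliability."""
    return 2 * r_half / (1 + r_half)

full_length = spearman_brown(0.6)  # 0.75
```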
Estimating Reliability: Cronbach’s α
• Another great idea:
• Do all possible split halves
• Take the average of all the scaled-up correlations
• That’s Cronbach’s alpha!
• Sounds computationally intensive
Estimating Reliability: Cronbach’s α
For dichotomously scored items, a shortcut formula (the KR-20 form) avoids the computation:
α = [N / (N − 1)] × [1 − Σ pi(1 − pi) / SD²]
• N = number of items on the test
• SD = SD of scores on the test
• pi = p-value for item i
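The N, SD, and p-value quantities named on this slide can be computed directly from a 0/1 score matrix. A minimal sketch, using made-up responses (6 examinees × 4 items) and the KR-20 form of alpha:

```python
# Cronbach's alpha (KR-20 form) for dichotomously scored items:
# alpha = [N/(N-1)] * [1 - sum(p_i * (1 - p_i)) / var(total scores)]
import statistics

responses = [              # hypothetical data: rows = examinees, 1 = correct
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1],
    [0, 0, 0, 1], [1, 1, 1, 0], [1, 0, 1, 1],
]
N = len(responses[0])                         # number of items
totals = [sum(row) for row in responses]      # each examinee's test score
var_total = statistics.pvariance(totals)      # SD^2 of test scores
p = [sum(row[i] for row in responses) / len(responses) for i in range(N)]

alpha = (N / (N - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / var_total)
```

With real tests of many items, alpha for a well-built form would land near the 0.9 mentioned earlier; this tiny 4-item example just shows the arithmetic.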
Standard Error of Measurement
SEM is an estimate of the average size of e in a population. Beginning with X = T + e, we can derive the following formula for SEM:
SEM = SD × √(1 − Reliability)
CTT assumes every examinee has the same SEM: if everyone took many parallel forms, the SD of their scores would be the same for everyone.
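The SEM formula above is a one-liner; a sketch with made-up numbers:

```python
# SEM = SD * sqrt(1 - reliability), per the CTT formula above.
import math

def sem(sd, reliability):
    """Standard Error of Measurement for a test."""
    return sd * math.sqrt(1 - reliability)

# e.g., score SD of 10 and reliability of 0.91 give SEM = 3.0
example = sem(10, 0.91)
```

Note how a highly reliable test still carries measurement error: even at reliability 0.91, an examinee's observed scores would wander about ±3 points around T.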
CTT Item Parameters
• p-value: the mean of the scored responses for an item. Used as an indicator of item difficulty.
• Point-biserial: the correlation between the score on an item and the sum-score on all the other items on the test. Used as an indicator of item discrimination power.
• Simple, yet informative, statistics: accurately measured with sample sizes as small as 400.
Limitations of CTT
• Item parameters change (even their order of difficulty!) with the student population, making them hard to interpret.
• True scores change across test forms, so it is hard to compare students who took different forms.
• Test-level model, not an item-level model.
• SEM is the same for all examinees.
• Reliability changes with the student population.
Summary: Intro to CTT
• Basic equation: X = T + e
• Parallel test forms:
• Examinees have the same T on both forms.
• Observed score SD is the same for both forms.
• Reliability: correlation of X across parallel tests; estimated by Cronbach’s α (a modified split-half approach).
Summary: Intro to CTT
• SEM = Standard Error of Measurement
• SD of X for an examinee over many parallel forms
• Related to reliability by a simple formula
• CTT statistics have severe limitations:
• Item statistics change with the student population.
• Student statistics change with test forms.
• SEM is the same for all student scores.
• Reliability changes with the student population.
Introduction to CTT Thanks again for coming! And for the nice comments on my Basic Stats presentation. Hope to see you next time when Liz will unravel the complexities of setting standards and show how CTT statistics and human judgment are used to yield a logical step-by-step process that makes sense of this complex enterprise.
Famous Two-Number Data Summary: Mean & Standard Deviation
• Mean: the ordinary average of all the data.
• If you had to pick one number to typify your data.
• A p-value is a mean.
• Standard Deviation (SD): average deviation from the mean.
• Obviously, not all the data equal the mean.
• SD tells, on average, how spread out the data are from the mean.
• Can be used to identify extreme values.
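A minimal sketch of the two-number summary for a small, made-up set of item p-values:

```python
# Mean and (population) standard deviation of some hypothetical p-values.
import statistics

p_values = [0.45, 0.55, 0.63, 0.70, 0.81]
mean = statistics.mean(p_values)    # the one number that typifies the data
sd = statistics.pstdev(p_values)    # average spread around that mean
```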
Mean & SD: Example using p-values
[Histogram of item p-values from 0 to 1: Mean = 0.63, SD = 0.18; bands at 1 and 2 SD’s from the mean]
Mean & SD: Example using Heights
[Histogram of heights from 55 to 80: Mean = 67, SD = 3.48; bands at 1 and 2 SD’s from the mean]
The “Standard Error” • Data are sampled from a population. • Sample Mean is calculated. • A 2nd sample would have a different Mean.
The “Standard Error” • How much would the Sample Mean vary, on average, over many samples? • What would the SD of the Sample Mean be over many samples? • That’s the Standard Error (SE)! • It tells you how reliable your Sample Mean is.
Standard Error
• OK, this is very nice. But in real life we can’t take 10,000 samples!
• In real life we get ONE sample! How can we possibly figure out the SE of the sample mean from a single real data sample?
• Magic of Statistics:
SE = (Sample SD) / √(sample size)
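The "magic" formula takes one line. A sketch with a made-up sample:

```python
# Standard error of the sample mean from ONE sample:
# SE = sample SD / sqrt(sample size).
import math
import statistics

sample = [1, 2, 3, 4, 5]
se = statistics.stdev(sample) / math.sqrt(len(sample))
# stdev = sqrt(2.5), n = 5, so SE = sqrt(0.5), about 0.71
```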