Role of Statistics in Developing Standardized Examinations in the US by Mohammad Hafidz Omar, Ph.D. April 19, 2005
Map of Talk • What is a standardized test? • Why standardize tests? • Who builds standardized tests in the United States? • Steps to building a standardized test • Test questions & some statistics used to describe them • Statistics used for describing exam scores • Research studies in educational testing that use advanced statistical procedures
What is a “standardized Examination”? • A standardized test: a test for which the conditions of administration and the scoring procedures are designed to be the same in all uses of the test • Conditions of administration: • 1) physical test setting • 2) directions for examinees • 3) test materials • 4) administration time • Scoring procedures: • 1) derivation of scores • 2) transformation of raw scores
Why standardize tests? • Statistical reason: • Reduction of unwanted variations in • Administration conditions • Scoring practices • Practical reason: • Appeal to many test users • Same treatment and conditions for all students taking the tests (fairness)
Who builds standardized tests in the United States? • Testing Organizations • Educational Testing Service (ETS) • American College Testing (ACT) • National Board of Medical Examiners (NBME) • Iowa Testing Programs (ITP) • Center for Educational Testing and Evaluation (CETE) • State Departments of Education • New Mexico State Department of Education • Build tests themselves or • Contract out the job to testing organizations • Large School Districts • Wichita Public School Districts
a) Administration Conditions • Design-of-experiments concept: control for extraneous factors • Apply the same treatment conditions to all test takers • 1) physical test setting (group vs individual testing, etc.) • 2) directions for examinees • 3) test materials • 4) administration time
b) Scoring Procedures • Same scoring process • Scoring rubric for open-ended items • Same score units and same measurement scale for everybody • Raw test scores (X) • Scale scores • Same transformation of raw scores • Raw scores (X) → Equating process → Scale scores h(X)
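The raw-to-scale conversion h(X) can be illustrated with a simple linear transformation. The sketch below is a minimal Python example; the anchor mean, standard deviation, and target scale are invented numbers for illustration, not any testing program's actual scaling.

```python
import numpy as np

def linear_scale(raw, raw_mean, raw_sd, scale_mean, scale_sd):
    """Map raw scores X to scale scores h(X) with a linear transformation
    that matches a chosen scale mean and standard deviation."""
    z = (np.asarray(raw, dtype=float) - raw_mean) / raw_sd
    return scale_mean + scale_sd * z

# Hypothetical numbers: raw scores from a 60-item test mapped to a 200-800 style scale.
raw_scores = np.array([25, 38, 47, 52])
print(linear_scale(raw_scores, raw_mean=40.0, raw_sd=8.0,
                   scale_mean=500.0, scale_sd=100.0))
```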
Overview of the Typical Standardized Examination Building Process • Costly process • Important quality control procedures at each phase • Process takes time (months to years) • 1) Creating test specifications • 2) Fresh item development • 3) Field-test development • 4) Operational (live) test development
1) Creating Test Specifications • Purpose: • To operationalize the intended purpose of testing • A team of content experts and stakeholders • discusses the specifications vs the intended purpose • Serves as a guideline for building examinations • How many items should be written in each content/skill category? • Which content/skill areas are more important than others? • A 2-way table of specifications typically contains • content areas (domains) versus • learning objectives • with a % of importance associated with each cell (see the sketch below)
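As a rough illustration of such a two-way table of specifications, the sketch below encodes hypothetical content areas, learning objectives, and percentage weights (none taken from the talk) and converts the weights into item counts for an assumed 60-item test.

```python
# A minimal sketch of a two-way table of specifications.
# Content areas, objectives, and percentages are hypothetical and illustrative only.
spec = {
    ("Algebra",  "Recall"):      10, ("Algebra",  "Application"): 20,
    ("Geometry", "Recall"):      10, ("Geometry", "Application"): 25,
    ("Data",     "Recall"):       5, ("Data",     "Application"): 30,
}
assert sum(spec.values()) == 100  # weights must cover the whole blueprint

test_length = 60  # planned number of operational items (assumed)
items_needed = {cell: round(pct / 100 * test_length) for cell, pct in spec.items()}
for (domain, objective), n in items_needed.items():
    print(f"{domain:8s} x {objective:11s}: {n} items")
```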
2) Fresh Item Development • Purpose: • Building quality items to meet the test specifications • Writing items to meet test specifications • Q: Minimum # of items to write? • Which cells will need more items? • Item review (content & bias review) • Design of experiment stage • Design of test (easy items first, then a mixture – increases motivation) • Design of testing event (what time of year, sample, etc.) • Data collection stage: • Pilot-testing of items • Scoring of items & pilot-test (PT) exams • Analyses stage: • Analyzing test items • Data interpretation & decision-making stage: • Item review with the aid of item statistics • Content review • Bias review • Quality control step: (1) keep good-quality items, (2) revise items with minor problems & re-pilot, or (3) scrap bad items
3) Field-Test Development • Purpose: • Building quality exam scales that measure the construct (structure) of the test as intended by the test specifications • Design of experiment stage • Designing field-test booklets to meet specifications • Use only good items from the previous stage (items with known descriptive statistics) • Design of testing event • Data collection: • Field-testing of test booklets • Scoring of items and FT exams • Analyses: • Analyzing examination booklets (for scale reliability and validity) • Interpreting results: item & test review • Do tests meet the minimum statistical requirements (e.g. rxx’ > 0.90)? • If not, what can be done differently?
4) Operational (Live) Test Development • Purpose: • To measure student abilities as intended by the purpose of the test • Design of experiment stage • Design of operational test • Use only good FT items and FT item sets • Assembling operational exam booklets • Design of pilot tests (e.g. some state-mandated programs) • New & some of the revised items • Design of field test (e.g. GRE experimental section) • Good items that have been piloted before • How many sections? How many students per section? • Design of additional research studies • e.g. different forms of the test (paper-and-pencil vs computer version) • Design of testing events • Data collection: • First operational testing of students with the final version of the examinations • Scoring of items and exams • Analyses of operational examinations • Research studies to establish reporting scales
Different Types of Exam Item Format • Machine-scorable formats • Multiple-choice questions • True-false • Multiple true-false • Multiple-mark questions (Pomplun & Omar, 1997) – aka multiple-answer multiple-choice questions • Likert-type items (agree/disagree continuum) • Manual (human) scoring formats • Short answers • Open-ended test items • Require a scoring rubric to score papers
Statistical considerations in Examination construction • Overall design of tests • to achieve reliable (consistent) and valid results • Designing testing events • to collect reliable and valid data (correct pilot sample, correct time of the year, etc) • e.g. SAT: Spring/Summer student population difference • Appropriate & Correct Statistical analyses of examination data • Quality Control of test items and exams
Analyses & Interpretation: Descriptive Statistics for Distractors (Distractor Analysis) • Applies to multiple-choice, true-false, and multiple true-false formats only • Statistics: • Proportion endorsing each distractor • Informs the exam authors which distractor(s) • are not functioning, or • are counter-intuitively more attractive than the intended right answer (high-ability examinees choosing a wrong answer) • (a computational sketch follows below)
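A minimal sketch of a distractor analysis, assuming responses are stored as an examinees-by-items array of chosen options and using the common 27% high/low grouping convention; the flag at the end marks the pattern described above, a distractor that high-scoring examinees choose more often than the keyed answer.

```python
import numpy as np

def distractor_analysis(responses, key, group_frac=0.27):
    """Proportion endorsing each option, overall and within high/low total-score
    groups (top/bottom `group_frac` of examinees). Assumed data layout:
    `responses` is an (examinees x items) array of chosen options, e.g. 'A'-'D';
    `key` holds the correct option for each item."""
    responses = np.asarray(responses)
    key = np.asarray(key)
    scores = (responses == key).sum(axis=1)          # number-correct total score
    cut = max(1, int(round(group_frac * len(scores))))
    order = np.argsort(scores)
    low, high = order[:cut], order[-cut:]
    for j, correct in enumerate(key):
        col = responses[:, j]
        p_key_high = np.mean(col[high] == correct)
        print(f"Item {j + 1} (key = {correct})")
        for opt in np.unique(responses):
            p_all = np.mean(col == opt)
            p_high = np.mean(col[high] == opt)
            p_low = np.mean(col[low] == opt)
            flag = "  <- distractor outdraws the key in the high group" \
                if opt != correct and p_high > p_key_high else ""
            print(f"  {opt}: all={p_all:.2f} high={p_high:.2f} low={p_low:.2f}{flag}")

# Tiny made-up data set: 6 examinees x 2 four-option items.
resp = np.array([["A", "C"], ["B", "C"], ["A", "D"],
                 ["A", "C"], ["C", "B"], ["A", "C"]])
distractor_analysis(resp, key=["A", "C"])
```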
Analyses and Interpretation: Item-Level Statistics • Difficulty of items • Statistics: • Proportion correct {p-value} – MC, T/F, M-T/F, MM, short answer • Item mean – MM, open-ended items • Describes how difficult an item is • Discrimination • Statistics: • Discrimination index: high- vs low-scoring examinee difference in p-value • An index describing sensitivity to instruction • Item-total correlations: correlation of the item (dichotomously or polychotomously scored) with the total score • Point-biserials: correlation between the total score & the dichotomous (right/wrong) item being examined • Biserials: same as point-biserials except that the dichotomous item response is assumed to arise from a normal distribution of student ability in responding to the item • Polyserials: same as biserials except that the item is polychotomously scored • Describes how an item relates (and thus contributes) to the total score • (a computational sketch of p-values and point-biserials follows below)
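A minimal sketch of the classical item statistics named above (p-values and point-biserial correlations), assuming a 0/1 scored examinee-by-item matrix; the data are toy values, and the point-biserial is computed against the total score with the item itself removed (corrected item-total correlation).

```python
import numpy as np

def item_statistics(scored):
    """Classical item statistics for a 0/1 scored matrix (examinees x items).
    Returns p-values and corrected point-biserials (item vs. total minus item)."""
    scored = np.asarray(scored, dtype=float)
    total = scored.sum(axis=1)
    p_values = scored.mean(axis=0)
    pt_biserials = []
    for j in range(scored.shape[1]):
        rest = total - scored[:, j]          # corrected total excludes the item itself
        pt_biserials.append(np.corrcoef(scored[:, j], rest)[0, 1])
    return p_values, np.array(pt_biserials)

# Tiny illustrative data set (5 examinees x 4 items), not real exam data.
data = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 0, 0, 1],
                 [1, 1, 1, 0]])
p, r = item_statistics(data)
print("p-values:", p)
print("corrected point-biserials:", r)
```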
Examination-Level Statistics • Overall difficulty of exams/scales • Statistics: test mean, average item difficulty • Overall dispersion of exam/scale scores • Statistics: test variability – standard deviation, variance, range, etc. • Test speededness • Statistics: 1) percent of students attempting the last few questions, 2) percentage of examinees finishing the test within the allotted time period • A test is generally considered not speeded if this percentage exceeds 95% • Consistency of the scale/exam scores • Statistics: • Scale reliability indices • KR-20: for dichotomously scored items • Coefficient alpha: for dichotomously and polychotomously scored items • Standard error of measurement indices • (a computational sketch follows below) • Validity measures of scale/exam scores • Intercorrelation matrix • High correlation with similar measures • Low correlation with dissimilar measures • Structural analyses (factor analyses, etc.)
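A minimal computational sketch of coefficient alpha (which reduces to KR-20 for 0/1 items) and the standard error of measurement, again assuming a small toy 0/1 response matrix.

```python
import numpy as np

def coefficient_alpha(item_scores):
    """Cronbach's coefficient alpha; equals KR-20 when items are scored 0/1."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def sem(item_scores):
    """Standard error of measurement: SD of total scores times sqrt(1 - reliability)."""
    total = np.asarray(item_scores, dtype=float).sum(axis=1)
    return total.std(ddof=1) * np.sqrt(1 - coefficient_alpha(item_scores))

# Toy 0/1 response matrix; values are illustrative only.
data = np.array([[1, 1, 0, 1],
                 [1, 0, 0, 0],
                 [1, 1, 1, 1],
                 [0, 0, 0, 1],
                 [1, 1, 1, 0]])
print("alpha:", round(coefficient_alpha(data), 3))
print("SEM:  ", round(sem(data), 3))
```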
Statistical procedures describing the Validity of Examination scores for their intended use • Does the exam, as experienced by students, match the authors' exam specifications? • Construct validity: analyses of exam structure (intercorrelation matrix, factor analyses, etc.) • Can the exam measure the intended learning factors (constructs)? • Answer: with factor analysis (a data-reduction method) • Predictive validity: predictive power of exam scores for explaining important variables • e.g. Can exam scores explain (or predict) success in college? • Regression analyses • Differential Item Functioning: statistical bias in test items • Are test items fair for all subgroups (female, Hispanic, Black, etc.) of examinees taking the test? • Mantel-Haenszel chi-square statistic (a sketch follows below)
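A minimal sketch of the Mantel-Haenszel chi-square for DIF on a single dichotomous item, written from the standard formula with a continuity correction; the reference/focal group coding and the use of the total score as the matching variable are assumptions about the data layout.

```python
import numpy as np

def mantel_haenszel_chi2(item, group, total):
    """Mantel-Haenszel chi-square for DIF on one 0/1 item.
    `group`: 0 = reference, 1 = focal; `total`: matching (total) score that
    defines the strata. Standard MH formula with continuity correction."""
    item, group, total = map(np.asarray, (item, group, total))
    A = E = V = 0.0
    for t in np.unique(total):
        m = total == t
        n_ref, n_foc = np.sum(m & (group == 0)), np.sum(m & (group == 1))
        n1 = np.sum(item[m])                      # correct responses in the stratum
        n = n_ref + n_foc
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue                              # stratum carries no information
        A += np.sum(item[m & (group == 0)])       # observed correct, reference group
        E += n_ref * n1 / n                       # expected count under no DIF
        V += n_ref * n_foc * n1 * (n - n1) / (n ** 2 * (n - 1))
    return 0.0 if V == 0 else (abs(A - E) - 0.5) ** 2 / V

# Hypothetical data: 200 examinees, one studied item, total score as the stratifier.
rng = np.random.default_rng(0)
total = rng.integers(0, 11, size=200)
group = rng.integers(0, 2, size=200)
item = (rng.random(200) < total / 10).astype(int)   # ability-related responses
print(round(mantel_haenszel_chi2(item, group, total), 2))
```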
Some research areas in Educational Testing that involve further statistical analyses • Reliability Theory • How consistent is a set of examination scores? The signal-to-(signal + noise) ratio in educational measurement: reliability = σ²_T / (σ²_T + σ²_E) • Generalizability Theory • Describing & controlling for more than one source of error variance • Differential Item Functioning • Pair-wise differences (e.g. female vs male, Black vs White) in student performance on items • Type I error rate control issue (many items & comparisons inflate false detection rates)
Some research areas in Educational Testing that involve further statistical analyses (continued) • Test Equating • Two or more forms of the exam: are they interchangeable? • If scores on form X are regressed on scores from form Y, will scores from either test edition be interchangeable? No – regressing X on Y and Y on X give different regression functions, so a symmetric equating function (not regression) is needed • Item Response Theory • Theory relating students' unobserved ability to their responses to items • Probability of responding correctly to test items at each level of ability (item characteristic curves – see the sketch below) • Can put items (not just tests) on the same common scale • Vertical Scaling • How does student performance from different school grade groups compare? • Are their means increasing rapidly, slowly, etc.? • Are their variances constant, increasing, or decreasing?
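A minimal sketch of a two-parameter logistic item characteristic curve; the discrimination and difficulty values are hypothetical, and the 1.7 scaling constant follows the common normal-ogive approximation.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve:
    P(correct | theta) = 1 / (1 + exp(-1.7 * a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Hypothetical item: discrimination a = 1.2, difficulty b = 0.5.
theta = np.linspace(-3, 3, 7)
print(np.round(icc_2pl(theta, a=1.2, b=0.5), 3))
```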
Some research areas in Educational Testing that involve further statistical analyses (continued) • Item Banking • Are the same items from different administrations significantly different in their statistical properties? • Need Item Response Theory to calibrate all items so that there is one common scale • Advantage: can easily build test forms with similar test difficulty • Computerized Testing • Are score results from tests taken on computers interchangeable with those from paper-and-pencil editions? (e.g. http://ftp.ets.org/pub/gre/002.pdf) • Are measures of student performance free from, or tainted by, their level of computer anxiety? • Computer Adaptive Testing • Increases measurement precision (test information function) by allowing students to take only items that are near their own ability level (see the selection sketch below)
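A minimal sketch of maximum-information item selection, the core idea behind computer adaptive testing; the item pool, parameter values, and current ability estimate are all hypothetical.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: I = (1.7*a)^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * p * (1 - p)

def select_next_item(theta_hat, a, b, administered):
    """Pick the unadministered item with maximum information at the current
    ability estimate -- the core of adaptive item selection."""
    info = item_information_2pl(theta_hat, np.asarray(a, float), np.asarray(b, float))
    info[list(administered)] = -np.inf          # exclude items already given
    return int(np.argmax(info))

# Hypothetical 5-item pool; parameters are made up for illustration.
a = [0.8, 1.2, 1.0, 1.5, 0.9]
b = [-1.0, 0.0, 0.5, 1.0, 2.0]
print(select_next_item(theta_hat=0.3, a=a, b=b, administered={1}))
```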