
Role of Statistics in Developing Standardized Examinations in the US


Presentation Transcript


  1. Role of Statistics in Developing Standardized Examinations in the US by Mohammad Hafidz Omar, Ph.D. April 19, 2005

  2. Map of Talk • What is a standardized test? • Why standardize tests? • Who builds standardized tests in the United States? • Steps to building a standardized test • Test questions & some statistics used to describe them • Statistics used for describing exam scores • Research studies in educational testing that use advanced statistical procedures

  3. What is a “Standardized Examination”? • A standardized test: a test for which the conditions of administration and the scoring procedures are designed to be the same in all uses of the test • Conditions of administration: • 1) physical test setting • 2) directions for examinees • 3) test materials • 4) administration time • Scoring procedures: • 1) derivation of scores • 2) transformation of raw scores

  4. Why standardize tests? • Statistical reason: • Reduction of unwanted variations in • Administration conditions • Scoring practices • Practical reason: • Appeal to many test users • Same treatment and conditions for all students taking the tests (fairness)

  5. Who builds standardized tests in the United States? • Testing Organizations • Educational Testing Service (ETS) • American College Testing (ACT) • National Board of Medical Examiners (NBME) • Iowa Testing Programs (ITP) • Center for Educational Testing and Evaluation (CETE) • State Departments of Education • New Mexico State Department of Education • Build tests themselves or • Contract out the job to testing organizations • Large School Districts • Wichita Public School Districts

  6. a) Administration conditions • Design-of-experiments concept: control for extraneous (unwanted) factors • Apply the same treatment conditions to all test takers • 1) physical test setting (group vs individual testing, etc) • 2) directions for examinees • 3) test materials • 4) administration time

  7. b) Scoring Procedures • Same scoring process • Scoring rubric for open-ended items • Same score units and same measurements for everybody • Raw test scores (X) • Scale Scores • Same Transformation of Raw Scores • Raw scores (X) → Equating process → Scale scores h(X) (one equating method is sketched below)
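A minimal sketch of how a raw-to-scale transformation h(X) might be established through equating, using the linear (mean-sigma) method; the form summary statistics below are hypothetical, not from the talk:

```python
# Linear (mean-sigma) equating sketch: places raw scores from a new form X
# onto the scale of a reference form Y by matching means and standard deviations.
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Map a raw score x on form X to the score scale of form Y."""
    return (sd_y / sd_x) * (x - mean_x) + mean_y

# Hypothetical form summary statistics
mean_x, sd_x = 32.4, 6.1   # new form X
mean_y, sd_y = 30.8, 5.7   # reference form Y

print(linear_equate(35, mean_x, sd_x, mean_y, sd_y))  # equated score h(35)
```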

  8. Overview of Typical Standardized Examination Building Process • Costly process • Important quality-control procedures at each phase • Process takes time (months to years) • Creating Test Specifications • Fresh Item Development • Field-Test Development • Operational (Live) Test Development

  9. 1) Creating Test specifications • Purpose: • To operationalize the intended purpose of testing • A team of content experts and stakeholders • discuss the specifications vs the intended purpose • Serves as a guideline to building examinations • How many items should be written in each content/skill category? • Which Content/skill area is more important than others? • 2-way table of specifications typically contains • content areas (domains) versus • learning objectives • with % of importance associated in each cell
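A 2-way blueprint like the one described above can be written down directly; the content areas, objectives, and weights here are hypothetical, only to show how per-cell item counts follow from the percentages of importance:

```python
# Hypothetical 2-way table of specifications: content area x learning objective,
# with the % of importance in each cell; items are allocated proportionally.
blueprint = {
    ("Algebra", "Recall"): 0.10,        ("Algebra", "Application"): 0.20,
    ("Geometry", "Recall"): 0.15,       ("Geometry", "Application"): 0.25,
    ("Data analysis", "Recall"): 0.10,  ("Data analysis", "Application"): 0.20,
}
total_items = 60  # intended test length

for (content, objective), weight in blueprint.items():
    print(f"{content} / {objective}: {round(total_items * weight)} items")
```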

  10. 2) Fresh Item Development • Purpose: • building quality items to meet test specifications • Writing Items to meet Test Specifications • Q: Minimum # of items to write? • Which cells will need more items? • Item Review (Content & Bias Review) • Design of Experiment stage • Design of Test (easy items first, then a mixture – increases motivation) • Design of Testing event (what time of year, sample, etc) • Data Collection stage: • Pilot-testing of Items • Scoring of items & pilot-test exams • Analyses Stage: • analyzing Test Items • Data Interpretation & decision-making stage: • Item Review with the aid of item statistics • Content Review • Bias Review • Quality-control step: (1) keep good-quality items, (2) revise items with minor problems & re-pilot, or (3) scrap bad items (a possible decision rule is sketched below)
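The keep/revise/scrap step above is typically guided by thresholds on the pilot item statistics; the cut-offs in this sketch are common classical rules of thumb, not the author's, and the helper name is made up:

```python
# Illustrative quality-control decision rule for a piloted item, based on
# classical item statistics (proportion correct and item-total point-biserial).
def review_item(p_value, pt_biserial):
    """Return a suggested action for a pilot-tested item (thresholds are illustrative)."""
    if 0.30 <= p_value <= 0.90 and pt_biserial >= 0.30:
        return "keep"                 # reasonable difficulty, good discrimination
    if pt_biserial >= 0.10:
        return "revise and re-pilot"  # salvageable but weak
    return "scrap"                    # near-zero or negative discrimination

print(review_item(0.55, 0.42))   # keep
print(review_item(0.95, 0.15))   # revise and re-pilot
print(review_item(0.50, -0.05))  # scrap
```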

  11. 3) Field-Test Development • Purpose: • building quality exam scales to measure the construct (structure) of the test as intended by the test specifications • Design of Experiment stage • Designing Field-Test Booklets to meet Specifications • Use only good items from the previous stage (items with known descriptive statistics) • Design of Testing event • Data collection: • Field-Testing of Test booklets • Scoring of items and FT Exams • Analyses • analyzing Examination Booklets (for scale reliability and validity) • Interpreting results: Item & Test Review • Do tests meet the minimum statistical requirements (e.g., reliability rxx’ > 0.90)? • If not, what can be done differently? (one option is sketched below)
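One standard answer to "what can be done differently" when a field-test scale falls short of the reliability target is to lengthen the test; the Spearman-Brown prophecy formula estimates by how much. A sketch, assuming a hypothetical observed reliability of 0.84:

```python
# Spearman-Brown prophecy formula: reliability of a test lengthened by factor k,
# and the lengthening factor needed to reach a target reliability.
def spearman_brown(r, k):
    return k * r / (1.0 + (k - 1.0) * r)

def length_factor_needed(r, target):
    return target * (1.0 - r) / (r * (1.0 - target))

r_observed = 0.84  # hypothetical field-test reliability
k = length_factor_needed(r_observed, 0.90)
print(round(k, 2))                              # about 1.71x as many items
print(round(spearman_brown(r_observed, k), 2))  # back to 0.90, as a check
```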

  12. 4) Operational (Live) Test Development • Purpose: • To measure student abilities as intended by the purpose of the test • Design of Experiment stage • Design of Operational Test • Use only good FT items and FT item sets • Assembling Operational Exam Booklets • Design of Pilot Tests (e.g. some state-mandated programs) • New & some of the revised items • Design of Field Test (e.g. GRE experimental section) • Good items that have been piloted before • How many sections? How many students per section? • Design of additional research studies • e.g. Different forms of the test (paper-and-pencil vs computer version) • Design of Testing events • Data Collection: • First Operational Testing of Students with the final version of the examinations • Scoring of items and Exams • Analyses of Operational Examinations • Research studies to establish Reporting scales

  13. Different types of Exam item format • Machine-scorable formats • Multiple-choice questions • True-false • Multiple true-false • Multiple-mark questions (Pomplun & Omar, 1997) – aka multiple-answer multiple-choice questions • Likert-type items (agree/disagree continuum) • Manual (human) scoring formats • Short answers • Open-ended test items • Require a scoring rubric to score papers

  14. Statistical considerations in Examination construction • Overall design of tests • to achieve reliable (consistent) and valid results • Designing testing events • to collect reliable and valid data (correct pilot sample, correct time of the year, etc) • e.g. SAT: Spring/Summer student population difference • Appropriate & Correct Statistical analyses of examination data • Quality Control of test items and exams

  15. Analyses & Interpretation: Descriptive statistics for distractors (Distractor Analysis) • Applies to multiple-choice, true-false, and multiple true-false formats only • Statistics: • Proportion endorsing each distractor • Informs the exam authors which distractor(s) • are not functioning or • are counter-intuitively more attractive than the intended right answer (high-ability examinees choosing a wrong answer); a computational sketch follows below
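A minimal distractor-analysis sketch; the responses, total scores, and answer key are made up. It computes the proportion endorsing each option overall and separately for high- and low-scoring examinees:

```python
from collections import Counter

# Hypothetical data for one multiple-choice item (keyed answer = "B"):
# each examinee's chosen option and total test score.
responses = list("ABCBADBBCB" "BDBBABCBBA")
totals = [12, 5, 7, 14, 6, 4, 15, 13, 8, 16,
          11, 3, 13, 12, 5, 6, 9, 14, 10, 7]

n = len(responses)
overall = {opt: round(c / n, 2) for opt, c in sorted(Counter(responses).items())}
print("Proportion endorsing each option:", overall)

# Compare endorsement rates for high vs low scorers (median split).
cut = sorted(totals)[n // 2]
for label, keep in [("high", lambda t: t >= cut), ("low", lambda t: t < cut)]:
    grp = [r for r, t in zip(responses, totals) if keep(t)]
    frac = {opt: round(c / len(grp), 2) for opt, c in sorted(Counter(grp).items())}
    print(label, "group:", frac)
# A distractor endorsed mainly by high scorers flags a possibly miskeyed or
# ambiguous item; one endorsed by almost no one is not functioning.
```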

  16. Analyses and Interpretation: Item-Level Statistics • Difficulty of Items • Statistics: • Proportion correct {p-value} – mc, t/f, m-t/f, mm, short answer • Item mean – mm, open-ended items • Describes how difficult an item is • Discrimination • Statistics: • Discrimination index: high- vs low-scoring examinee difference in p-value • An index describing sensitivity to instruction • item-total correlations: correlation of an item (dichotomously or polychotomously scored) with the total score • pt-biserials: correlation between the total score & the dichotomous (right/wrong) item being examined • Biserials: same as pt-biserials except that the dichotomous item is now assumed to reflect a normal distribution of student ability in responding to the item • Polyserials: same as biserials except that the item is polychotomously scored • Describes how an item relates (and thus contributes) to the total score (a computational sketch follows below)
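A sketch of the item-level statistics listed above (p-value, high/low discrimination index, and point-biserial correlation), computed on a small simulated 0/1 item-score matrix rather than real exam data:

```python
import numpy as np

# Simulated dichotomous (0/1) item scores: rows = examinees, columns = items.
rng = np.random.default_rng(0)
X = (rng.random((200, 20)) < np.linspace(0.35, 0.85, 20)).astype(int)
total = X.sum(axis=1)

# Item difficulty: proportion correct (p-value) for each item.
p_values = X.mean(axis=0)

# Discrimination index: p-value difference between top and bottom 27% by total score.
order = np.argsort(total)
k = int(0.27 * len(total))
disc_index = X[order[-k:]].mean(axis=0) - X[order[:k]].mean(axis=0)

# Point-biserial: correlation of each 0/1 item with the total score
# (often recomputed with the item removed from the total; omitted for brevity).
pt_biserial = np.array([np.corrcoef(X[:, j], total)[0, 1] for j in range(X.shape[1])])

print(p_values.round(2), disc_index.round(2), pt_biserial.round(2), sep="\n")
```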

  17. Examination-Level Statistics • Overall Difficulty of Exams/Scale • Statistics: Test mean, Average item difficulty • Overall Dispersion of Exam/Scale scores • Statistics: Test variability – standard deviation, variance, range, etc • Test Speededness • Statistics: 1) Percent of students attempting the last few questions • 2) Percentage of examinees finishing the test within the allotted time period • A test is considered not speeded if this percentage is more than 95% • Consistency of the Scale/Exam Scores • Statistics: • Scale Reliability Indices • KR-20: for dichotomously scored items • Coefficient alpha: for dichotomously and polychotomously scored items (computed in the sketch below) • Standard Error of Measurement indices • Validity Measures of Scale/Exam Scores • Intercorrelation matrix • High correlation with similar measures • Low correlation with dissimilar measures • Structural analyses (Factor analyses, etc)
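Coefficient alpha (which reduces to KR-20 when all items are scored 0/1) and the standard error of measurement can be computed straight from an item-score matrix; the simulated data here stand in for a real exam file:

```python
import numpy as np

def coefficient_alpha(X):
    """Cronbach's coefficient alpha; equals KR-20 for dichotomous (0/1) items."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    return (k / (k - 1)) * (1.0 - X.var(axis=0, ddof=1).sum() / X.sum(axis=1).var(ddof=1))

def sem(X):
    """Standard error of measurement: SD of total scores * sqrt(1 - reliability)."""
    return np.asarray(X).sum(axis=1).std(ddof=1) * np.sqrt(1.0 - coefficient_alpha(X))

# Simulated 0/1 item-score matrix with a common ability factor (illustrative only).
rng = np.random.default_rng(1)
ability = rng.normal(size=(300, 1))
difficulty = rng.normal(size=25)
X = (rng.random((300, 25)) < 1.0 / (1.0 + np.exp(-(ability - difficulty)))).astype(int)

print("alpha:", round(coefficient_alpha(X), 3), " SEM:", round(sem(X), 2))
```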

  18. Statistical procedures describing Validity of Examination scores for their intended use • Is the reality of the exam for the students the same as the authors’ exam specifications? • Construct validity: Analyses of exam structures (intercorrelation matrix, factor analyses, etc) • Can the exam measure the intended learning factors (constructs)? • Answer: with factor analyses (a data-reduction method) • Predictive validity: predictive power of exam scores for explaining important variables • e.g. Can exam scores explain (or predict) success in college? • Regression analyses • Differential Item Functioning: statistical bias in test items • Are test items fair for all subgroups (female, Hispanic, Black, etc) of examinees taking the test? • Mantel-Haenszel chi-square statistic (sketched below)
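A self-contained sketch of the Mantel-Haenszel chi-square computation for DIF, stratifying hypothetical examinees by total-score level; the counts are invented for illustration:

```python
# Each stratum is a total-score level with a 2x2 table of counts:
# (a, b, c, d) = (reference correct, reference incorrect, focal correct, focal incorrect).
strata = [
    (30, 10, 25, 15),
    (45, 15, 38, 22),
    (60, 20, 50, 30),
]

sum_a = sum_expected = sum_var = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    sum_a        += a
    sum_expected += (a + b) * (a + c) / n
    sum_var      += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))

# Continuity-corrected Mantel-Haenszel chi-square (1 degree of freedom);
# a large value suggests the item functions differently across the two groups.
mh_chi2 = (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
print(round(mh_chi2, 3))
```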

  19. Some research areas in Educational Testing that involve further statistical analyses • Reliability Theory • How consistent is a set of examination scores? The signal-to-(signal + noise) ratio, σ²_T / (σ²_T + σ²_E), in educational measurement (illustrated below) • Generalizability Theory • Describing & controlling for more than 1 source of error variance • Differential Item Functioning • Pair-wise differences (e.g. female vs male, Black vs White) in student performance on items • Type I error rate control issue (many items & comparisons → inflated false-detection rates)
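A quick numerical illustration of the signal/(signal + noise) reliability ratio from the first bullet, using simulated true and error components (the variances are arbitrary):

```python
import numpy as np

# Classical test theory: observed = true + error, and reliability is the
# ratio of true-score variance to observed-score variance.
rng = np.random.default_rng(2)
true_score = rng.normal(50, 8, size=5000)   # sigma_T = 8  -> sigma_T^2 = 64
error = rng.normal(0, 4, size=5000)         # sigma_E = 4  -> sigma_E^2 = 16
observed = true_score + error

reliability = true_score.var() / observed.var()
print(round(reliability, 3))  # close to 64 / (64 + 16) = 0.80
```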

  20. Some research areas in Educational Testing that involve further statistical analyses (continued) • Test Equating • Two or more forms of the exam: Are they interchangeable? • If scores on form X are regressed on scores from form Y, will the scores from either test edition be interchangeable? (Regression in each direction gives different functions) • Item Response Theory • Theory relating students’ unobserved ability to their responses to items • Probability of responding correctly to test items at each level of ability (item characteristic curves; sketched below) • Can put items (not just whole tests) on the same common scale • Vertical Scaling • How does student performance from different school grade groups compare? • Are their means increasing rapidly, slowly, etc? • Are their variances constant, increasing, or decreasing?
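The item characteristic curve mentioned in the IRT bullet is a logistic function of ability; a sketch of the 3-parameter logistic (3PL) model with made-up item parameters:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: probability of a correct response at
    ability theta, with discrimination a, difficulty b, and guessing floor c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(icc_3pl(theta, a=1.2, b=0.0, c=0.2).round(2))
# Because a, b, c and theta live on one common scale, items calibrated from
# different forms or administrations can be compared on that scale.
```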

  21. Some research areas in Educational Testing that involve further statistical analyses (continued) • Item Banking • Are the same items from different administrations significantly different in their statistical properties? • Needs Item Response Theory to calibrate all items so that there’s one common scale • Advantage: Can easily build test forms with similar test difficulty • Computerized Testing • Are score results taken on computers interchangeable with those on paper-and-pencil editions? (e.g. http://ftp.ets.org/pub/gre/002.pdf) • Is the measure of student performance free from, or tainted by, their level of computer anxiety? • Computer Adaptive Testing • Increases measurement precision (via the test information function) by allowing students to take only items that are at their own ability level (an item-selection sketch follows below)
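A minimal computer-adaptive item-selection sketch: choose the unused item with the largest Fisher information at the current ability estimate, using the 2PL information formula; the item bank and parameters are hypothetical:

```python
import numpy as np

# Hypothetical 2PL item bank: (discrimination a, difficulty b) per item.
bank = [(1.4, -1.0), (0.9, -0.3), (1.8, 0.2), (1.1, 0.8), (2.0, 1.5)]

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta, administered):
    """Pick the most informative item not yet administered."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: info_2pl(theta, *bank[i]))

print(next_item(theta=0.0, administered={2}))  # index of the best remaining item
```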
