
Test Development and Analysis

Presentation Transcript


  1. Test Development and Analysis • Session Two - 21 May, 2014 • Test writing and scoring • Margaret Wu

  2. Item Development Process • Development of a Framework and Test Blueprint • Draft items • Item panelling (shredding!) • Iterative process: • Draft items to illustrate, clarify and sharpen the framework; the framework in turn guides item development.

  3. Framework and Test Blueprint - 1 • Clearly identify • ‘Why’ you are assessing (Purpose) • ‘Whom’ to assess (Population) • ‘What’ to assess (Construct domain) • Define parameters for the test, e.g.: • Duration of the test and test administration procedures • Scoring/marking constraints; item formats • Other issues: security, feedback.

  4. Specifying the Purpose • How will the results be used? • Determine pass/fail, satisfactory/unsatisfactory • Award prizes • Provide diagnostic information • Compare students • Set standards • Provide information to policy makers • Who will use the information? • Teachers, parents, students, managers, politicians

  5. Test Blueprint • Sufficiently detailed so that test developers can work from these specifications. • Range of difficulty • Target reliability • Item format. • Weights of sub-domains • Test administration procedures • Timing, equipment, resources • Marking requirements

  6. Test Blueprint – example (PISA Reading)

  7. Uses of Frameworks & Blueprints • To guide item development • Don’t ignore specifications. Cross-check with specs constantly. • To ensure that there is a clear and well-defined construct that can be stable from one testing occasion to another. • Different item writing team • Parallel tests

  8. Item Writing • Science or Art? • Creativity following scientific principles • Established procedures to guide good item development (as covered in this course) • Inspiration, imagination and originality (difficult to teach, but can be gained through experience) • The most important prerequisite is subject-area expertise • Teacher’s craft

  9. Item Writers • Best done by a team • A 24-hour job! • Ideas emerge, not necessarily in item writing sessions, or even during office hours. • Ideas appear as a rough notion, like an uncut stone, that needs shaping, polishing and many reworks! • Keep a notebook for item ideas. • Have a camera ready!

  10. Make items interesting!

  11. [Image: lattice]

  12. But, no tricks • Keep materials interesting, but don’t try to “trick” students • i.e. no trickery (as in trying to mislead) • but items can be tricky (as in difficult) • Don’t dwell on trivial points. No room to waste test space. • Think of the bigger picture of the meaning of “ability” in the domain of testing. • Every item should contribute one good piece of information about the overall standing of a student in the domain being tested. • Collectively, all items need to provide one measure on a single “construct”

  13. Item Types • Multiple choice • Easiest to score • Lower face validity • Research has shown that MC items have good concurrent validity and reliability, despite the guessing factor • Constructed response • High face validity • Difficult to score • Marker reliability is an issue

  14. Multiple-choice items • Test-wise strategies for MC items: • Pick the longest answer. • Pick “b” or “c”; they are more likely than “a” or “d”. • Pick the scientific-sounding answer. • Pick a word related to the topic.

  15. Item format can make a difference to cognitive processes -1 • Make sure that we are testing what we think we are testing • The following is a sequence: 3, 7, 11, 15, 19, 23, … What is the 10th term in this sequence? A 27 B 31 C 35 D 39 • 67% correct (answer D). 24% chose A. That is, about a quarter of students worked out the pattern of the sequence but missed the phrase “10th term”.

  16. Item format can make a difference to cognitive processes -2 • The following is a sequence: 2, 9, 16, 23, 30, 37, … What is the 10th term in this sequence? A 57 B 58 C 63 D 65 • 85% correct, even though this item is arguably more difficult than the previous one (counting by 7 instead of by 4). The next number in the sequence (“44”) is not offered as a distractor.
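
To make the arithmetic behind these two items explicit, here is a minimal Python sketch (not part of the original slides) that computes the nth term of an arithmetic sequence and shows where the common "next term" error lands in each case.

```python
def nth_term(first, diff, n):
    """nth term of an arithmetic sequence: first + (n - 1) * diff."""
    return first + (n - 1) * diff

# Slide 15 item: 3, 7, 11, 15, ...  The 10th term is 39 (key D);
# stopping at the next term gives 27, which is offered as option A.
print(nth_term(3, 4, 10))  # 39
print(nth_term(3, 4, 7))   # 27, the number after 23

# Slide 16 item: 2, 9, 16, 23, ...  The 10th term is 65 (key D);
# the next term is 44, which is not offered as an option.
print(nth_term(2, 7, 10))  # 65
print(nth_term(2, 7, 7))   # 44
```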

  17. Item format can make a difference to cognitive processes -3 • 16x - 7 = 73. Solve for x. • A. 5 • B. 6 • C. 7 • D. 8 • Substitution is one strategy: substitute 5, 6, 7 and 8 for x and see which gives 73.
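
A minimal sketch of the substitution strategy just described, for illustration only:

```python
# Try each offered option and keep the one that satisfies 16x - 7 = 73.
options = {"A": 5, "B": 6, "C": 7, "D": 8}

for label, x in options.items():
    if 16 * x - 7 == 73:
        print(f"Option {label} (x = {x}) works")  # Option A (x = 5) works
```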

  18. Item format can make a difference to cognitive processes -4 • The fact that the answer is present in a list can alter the process of solving a problem. • Students look for clues in the options. That can interfere with the cognitive processes the test setter has in mind.

  19. MC options - 1 • Terminology: “key” and “distractors” • Don’t use “All of the above” • Use “None of the above” with caution. • Keep the length of options similar. Students like to pick the longest, often more scientific-sounding ones. • Use each alternative (a, b, c, d) as the key roughly the same number of times.
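
The last point can be checked mechanically. A rough Python sketch, with an invented list of keys for a ten-item form:

```python
from collections import Counter

# Hypothetical keys, one per item on the form.
keys = ["A", "C", "B", "D", "B", "A", "C", "D", "B", "C"]

print(Counter(keys))
# Counter({'C': 3, 'B': 3, 'A': 2, 'D': 2}) -- reasonably balanced;
# a heavily lopsided count would suggest re-keying some items.
```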

  20. MC options - 2 • Avoid having an odd one out. • Which word means the same as amiable in this sentence? Because Leon was an amiable person, he was nice to everyone. A. friendly B. strict C. moody D. mean • Here the key, “friendly”, is the only positive word among the options, so test-wise students can pick it without knowing what amiable means.

  21. Testing higher-order thinking with MC • Close the textbook when you write items. If you can’t remember it, don’t ask the students. • Lower-order thinking item: • What is the perimeter of the following shape? • [Figure: a shape with sides labelled 9 m and 15 m]

  22. A better item for testing higher-order thinking skills • Which two shapes have the same perimeter? • [Figure: four shapes labelled A, B, C and D]

  23. MC can be useful • To avoid vague answers to open questions, e.g.: • How often do you watch sport on TV? • Typical open-ended answers: • When there is nothing else to watch on TV. • Once in a while • A few times a year

  24. Summary about MC items • Don’t be afraid to use MC items • Check the cognitive processes required, as the answer is given among the options. • Make sure the distractors do not distract in unintended ways. • Make sure the key is not attractive for unintended reasons.

  25. Non-multiple-choice formats • Examples: • Constructed response • Performance • Motivation: • Face validity; suited to testing higher-order thinking

  26. Caution about Performance format • Check validity carefully • E.g., an evaluation of the Vermont statewide portfolio assessment (1991) concluded that the assessments had low reliability and validity. • Problems with rater judgement and scoring reliability. • E.g., quality of handwriting; presentation • 3-10 times more expensive • Bennett & Ward (1993); Osterlind (1998); Haladyna (1997)

  27. Computer assisted scoring • Formulate firm scoring rules AFTER examining the data • Other examples: • Household spending • Hours spent on homework • The idea is to capture the maximum amount of information at the lowest cost. • Capture all the different responses; categories can always be collapsed later
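
A minimal sketch of the "capture everything first, score later" idea; the responses and cut-points below are hypothetical:

```python
# Record exact responses (e.g. hours spent on homework per week) ...
raw_hours = [0, 1, 1, 2, 3, 5, 8, 10]

def provisional_score(hours):
    # ... apply a fine-grained provisional rubric (0-3) once the data are in ...
    if hours == 0:
        return 0
    if hours <= 2:
        return 1
    if hours <= 5:
        return 2
    return 3

def collapse(score):
    # ... and collapse categories later if the extra levels add no information.
    return 0 if score <= 1 else 1

fine = [provisional_score(h) for h in raw_hours]
print(fine)                         # [0, 1, 1, 1, 2, 2, 3, 3]
print([collapse(s) for s in fine])  # [0, 0, 0, 0, 1, 1, 1, 1]
```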

  28. Scoring – an example • Where in Taiwan is Nangang District? (南港區在台灣什麼地方?) • Consider these responses: • Taipei City (台北市); Taipei County (台北縣); Taipei (台北); Tainan (台南); Keelung (基隆); Kaohsiung (高雄); no response • How to score these? • Where are the levels of latent ability corresponding to these responses? • Ideally, we need scoring that satisfies both technical soundness and psychometric requirements.

  29. Scoring – PISA example • [Stimulus table: Greenwich 12 midnight; Berlin 1:00 AM; Sydney 10:00 AM] • At 7:00 PM in Sydney, what time is it in Berlin? • Score 1: 10 am or 10; • Score 0: other responses • Overlook strict technical correctness • Scores should match latent abilities
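
A hedged sketch of this scoring rule (1 for "10 am" or "10", 0 otherwise); the response normalisation is an assumption for illustration, not taken from PISA coding guides:

```python
def score_berlin_time(response: str) -> int:
    # Normalise case, spaces and a trailing ":00" so that "10", "10am"
    # and "10:00 AM" all count as correct (assumed rule).
    normalised = response.strip().lower().replace(" ", "").replace(":00", "")
    return 1 if normalised in {"10", "10am"} else 0

for r in ["10 am", "10", "10:00 AM", "4 am", ""]:
    print(repr(r), score_berlin_time(r))  # 1, 1, 1, 0, 0
```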

  30. Provide Partial Credit to Solve the Problem?? • E.g., • Score 2: Taipei City (台北市) • Score 1: Taipei County (台北縣); Taipei (台北) • Score 0: Tainan (台南); Keelung (基隆); Kaohsiung (高雄); other • The problem is that the item will then have twice the weight of items with a maximum score of 1.
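
The weighting problem is simple arithmetic; the four-item test below is hypothetical:

```python
# Three dichotomous items (max 1) and one partial-credit item (max 2).
max_scores = [1, 1, 1, 2]

total = sum(max_scores)                   # 5
weights = [m / total for m in max_scores]
print(weights)  # [0.2, 0.2, 0.2, 0.4] -- the partial-credit item counts double
```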

  31. Maximum Score of an Item • Should be related to the discrimination power of an item • If an item can divide respondents into three ability groups, just score 0, 1 and 2 • If an item can divide respondents into two ability groups only, score 0 and 1. • The maximum score is not related to item difficulty, but related to item discrimination.

  32. Validate Scoring • Use empirical data to check the proposed scoring • Change scoring rubrics if possible • Consequence of incorrect scoring • Add noise to the measures • Reduce reliability
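
One simple empirical check, sketched here with invented data, is to compare the mean score on the remaining items across the proposed response categories; if the rubric is sound, the means should increase with the category:

```python
from collections import defaultdict
from statistics import mean

# (proposed item score, total score on the other items) per respondent -- hypothetical
data = [(0, 12), (0, 15), (1, 18), (1, 22), (1, 17), (2, 27), (2, 30)]

rest_by_category = defaultdict(list)
for category, rest_score in data:
    rest_by_category[category].append(rest_score)

for category in sorted(rest_by_category):
    print(category, round(mean(rest_by_category[category]), 1))
# 0 13.5, 1 19.0, 2 28.5 -- ordered as expected, supporting the proposed scoring
```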

  33. Item Scores and Item Difficulties - 1 • Typically, a multiple-choice test may have 40 items, of different difficulties. • Typically, each item will be marked 0 (incorrect) or 1 (correct). • That is, items with different difficulties can have the same (maximum) score. • I think people generally accept this kind of scoring.

  34. Item Scores and Item Difficulties - 2 • Now we have partial credit scoring, where the maximum score for an item is more than 1. • Items with a score of 2 may have different difficulties. • The difference in difficulty between a score of 1 and a score of 2 for one item may be different from the difference between a score of 1 and a score of 2 for another item.

  35. Item Scores and Item Difficulties - 3 • It is conceivable that getting a score of 2 on one item may be easier than getting a 1 on another item. • Example: Questionnaire on IT skills • For each question, rate yourself as • Beginner (0); Proficient (1); Advanced (2) • Q1. Use of MS Word • Q2. Use of MS Excel • Q3. Use of email • Q4. Use of database • Q5. Use of PowerPoint • Q6. Manage files and folders

  36. Item Scores and Item Difficulties - 4 • However, within an item, achieving a score of 2 should be more difficult than achieving a 1. • In the same way, achieving a score of 1 should be more difficult than achieving a 0. • That is, students with higher abilities should have a higher chance of scoring a higher score within an item.
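
The point of these last four slides can be made concrete with invented category proportions for two of the questionnaire items:

```python
# Hypothetical proportions of respondents in each category (0/1/2).
proportions = {
    "Use of email":    {0: 0.10, 1: 0.30, 2: 0.60},
    "Use of database": {0: 0.55, 1: 0.35, 2: 0.10},
}

for item, dist in proportions.items():
    p_ge_1 = round(dist[1] + dist[2], 2)  # proportion scoring 1 or more
    p_ge_2 = round(dist[2], 2)            # proportion scoring 2
    print(f"{item}: P(score >= 1) = {p_ge_1}, P(score >= 2) = {p_ge_2}")
# Use of email: P(score >= 1) = 0.9, P(score >= 2) = 0.6
# Use of database: P(score >= 1) = 0.45, P(score >= 2) = 0.1
# Within each item the higher category is harder to reach, yet a 2 on
# "Use of email" (0.6) is more common than a 1 or better on "Use of
# database" (0.45), i.e. easier across items.
```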
