1. Measuring Measuring: Developing a scale of proficiency for the CM framework Presented at the 13th Biennial
International Objective Measurement Workshop
April 7, 2006
Brent Duckor, Ph.D. candidate, UC Berkeley
2. 2 Item 106: “Practice Test Five: Curriculum, Instruction, and Assessment” (Kaplan, Praxis Edition, 2006) A new principal in an open-minded school approaches the teachers about including at least one objective test in each subject each quarter because of what he terms “the need for accountability.” The requirement for this accountability has probably come about because of the school board’s concern about:
The depth of content assessed in performance tasks
The instructional planning time lost to grading essays and projects
The possibility of teacher bias in evaluating students
The teachers’ skill in creating performance-type assessments
3. 3 Answers and Explanations (Kaplan, Praxis Edition, p. 391, 2006) 106(D)
(A), (B), and (C) are all possibilities but probably not the driving force. (D) is the correct answer because the school board is probably concerned that the teachers’ assessments are not rigorous enough, and they want to make sure there is a “professional” tool involved. The received wisdom and subtle message is that teachers’ knowledge of educational measurement is deficient. It is assumed that professionals who make “professional tools” have the knowledge to make assessments (instruments) “rigorous”. In other words, experts are expert.
But how do they (and we as teachers) learn to be rigorous, and what is the learning progression on the path to acquiring knowledge of educational measurement and assessment?
My research explicitly addresses these questions: What does learning educational measurement look like? What are the steps in the learning progression? Which variables constitute the core of educational measurement knowledge?
4. Background on the study Defining “universe” of measurement knowledge
Listening to the authorities and “professionals”
Domains beyond reliability and validity
Limitations to “content-mining” and “item hunting”
Cognition-based approaches (“How we think”)
Knowledge-types (Shavelson et al., 2004)
Construct modeling
Building blocks framework (Wilson, 2005)
Evidentiary approach (Mislevy et al., 2003)
Assessment triangle (NRC, 2000)
What is MK and who defines it?
How do we think about MK?
What types of latent traits or constructs might constitute the MK universe?
5. 5 Wilson’s Constructing Measures (2005) framework for understanding educational measurement Wilson’s CM framework provides both a definition of MK and a method for measuring it
6. Study Posits the existence of multi-dimensional proficiencies for “constructing measures” (CM) framework
Develops 5 construct maps, pools of items & strategy for scoring item responses
Persons sampled from diverse but construct-proximal populations
Fits a partial credit Rasch measurement model to the empirical data to test hypotheses about the structure of proficiencies (see the model sketch after this list)
Examines evidence for reliability and validity of inferences drawn from scores on CM instrument
Explores relations of “CM scale” to other variables that may explain variations in individual proficiencies
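For reference, the partial credit model referred to above is, in its standard Masters parameterization (a sketch using conventional symbols, not notation taken from the slides; the study’s ConQuest specification may differ in details such as constraints):

\[
P(X_{ni}=k \mid \theta_n) \;=\; \frac{\exp\!\left[\sum_{j=0}^{k}(\theta_n-\delta_{ij})\right]}{\sum_{h=0}^{m_i}\exp\!\left[\sum_{j=0}^{h}(\theta_n-\delta_{ij})\right]}, \qquad k = 0, 1, \dots, m_i,
\]

where \(\theta_n\) is respondent n’s location on the latent CM proficiency, \(\delta_{ij}\) is the j-th step parameter of item i, \(m_i\) is the item’s maximum score, and the j = 0 term in the sums is defined to be zero.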
7. 5 constructs/dimensions under investigation in this study Understanding Construct Maps (UCM)
Understanding the Items Design (UID)
Understanding the Outcome Space (UOS)
Understanding Wright Maps (UWM)
Understanding Quality Control (UQC)
Evidence for Validity
Evidence for Reliability
8. Research Questions Quality of the CM instrument: Evidence for Validity
R1: What validity evidence is there for the content of the CM instrument?
R2: What validity evidence is there based on the response processes of the CM instrument?
R3: What validity evidence is there based on the internal structure of the CM instrument?
R4: What validity evidence is there based on relations to external variables of the CM instrument?
9. Research Questions Quality of the CM instrument: Reliability
R5: Is there evidence that the CM instrument has sufficient internal consistency?
R6: Is there evidence that the CM instrument has sufficient inter-rater consistency?
Factors associated with proficiency on the CM instrument
R7: What is the relationship, if any, between performance on the CM instrument and other factors such as research, professional and course experience?
10. Methods Instruments
CM instrument (n=72)
18 open-ended items
8 fixed choice items
4 exit interview items
53 demographic items
Embedded assessments (n=8)
Construct map homework
Items design homework
Data collection homework
Final report
Semi-structured interviews (n=5)
Subjects
Three sample pools
EDU274A alum
CAESL
Work Circle
Prof. development
IOMW 2004
72 participants
Characteristics
Female (58%)
Under 40 years (69.4%)
Graduate students (48.6%)
Have Master’s degree (59.7%)
11. 11 Procedures
12. Results (RQ1): Validity evidence for CM instrument content The construct maps
Theoretical descriptions of locations of respondents and responses to items for each sub-dimension
The items design
Mixed item format
Task analysis
The outcome space
General rubrics
Item specific rubrics
The measurement model
Technically calibrated Wright Map employed to test hypotheses about respondent and item locations (illustrated in the sketch below). More than just a test blueprint.
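As an illustration of what such a map shows, a Wright map places respondent proficiency estimates and item threshold locations on the same logit scale. A minimal matplotlib sketch with synthetic values (the abilities, item labels, and thresholds below are placeholders, not the study’s calibrated estimates):

# Minimal Wright-map-style plot: persons and item thresholds on one logit scale.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
person_abilities = rng.normal(loc=0.0, scale=1.0, size=72)  # e.g., EAP/MLE estimates (synthetic)
item_thresholds = {                                          # hypothetical item threshold locations
    "OE1": [-1.2, 0.1, 1.4],
    "OE2": [-0.8, 0.5, 1.9],
    "MC1": [0.3],
}

fig, (ax_people, ax_items) = plt.subplots(1, 2, sharey=True, figsize=(7, 5))

# Left panel: distribution of respondents along the logit scale.
ax_people.hist(person_abilities, bins=15, orientation="horizontal", color="grey")
ax_people.set_ylabel("Logits")
ax_people.set_title("Respondents")
ax_people.invert_xaxis()

# Right panel: each item's thresholds plotted at their calibrated locations.
for x, (label, thresholds) in enumerate(item_thresholds.items()):
    ax_items.scatter([x] * len(thresholds), thresholds, marker="_", s=200)
    ax_items.annotate(label, (x, max(thresholds) + 0.2), ha="center")
ax_items.set_xticks([])
ax_items.set_title("Item thresholds")

plt.tight_layout()
plt.show()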
13. 13 Construct Map (UCM)
14. 14 Items design Mixed format
Open ended
Visual and/or verbal prompt
Extended response
Fixed choice
Stem
Partially ordered distractors
Task analysis
Task demands
Cognitive demands
Item openness and complexity
15. 15 Open ended item (UCM1) OE1 Stem:
“An educational consultant is asked to develop an instrument to measure understanding of a “Living the Civil War” after-school program. The consultant proposes to measure the following:”
Figure:
Textbox titled: “Participants’ level of historical knowledge”
Two columns titled: “Respondents” and “Responses to items”
Each column contains descriptions for a given level
Two prompts:
“Is this a good example of a construct map? Please explain.”
“What advice, if any, would you give to improve this construct map?”
16. 16 UCM1: Item analysis
17. 17 General Scoring Guide (UCM)
19. 19 Wright Map (UCM)
20. Results (RQ2): Validity evidence for response processes Of the 84.7% reporting, four out of five respondents did not find the CM instrument confusing
Of the 91.7% reporting, respondents did identify several factors that they believed affected their ability to give their best response to the CM instrument:
Content domain and/or prior knowledge (41%)
Time and Length (38%)
Memory (13%)
Administration and/or format (8%)
21. 21 Results (RQ2): Validity evidence for response processes Of the 88.9% reporting, two out of three respondents did not want to go back and change any of their responses, although some reported using test-taking strategies and “guessing” on the fixed choice items
Of the 76.4% reporting, three out of four respondents believed the CM instrument could be improved while the other respondents did not believe (18%) or were not sure (6%) if it required improvement
Respondents suggested the following areas for improvement:
Shorten time and length (43%)
Item format and wording e.g. fixed choice distractors (33%)
Terminology and content coverage e.g. reliability scenarios (19%)
Standardize administration conditions (5%)
22. Results (RQ3): Validity evidence for internal structure Did the evidence support the constructs?
Wright Map (“CM Scale”) suggests structure predicted by construct map(s)
Yet there is evidence for multidimensionality, given low correlations between separately calibrated maps (.263 ≤ r ≤ .538)
Did the evidence support the items design?
Item analysis
For each item, the mean location of the item thresholds increases as the score increases
Respondents higher on the construct are, in fact, also scoring higher on each item.
Differential Item Functioning (DIF)
Female respondents scored 0.088 logits lower than male respondents, but this parameter estimate is not statistically significant (chi-square test of parameter equality = 0.46, df = 1, p = 0.499)
While overall no statistically significant evidence of DIF was found, one item (MC6) did display “largish” DIF (.766 > .638), which is likely due to sampling effects (see the model sketch below)
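One common way to formalize such a DIF check (a hedged sketch written for a dichotomous item for simplicity; the ConQuest parameter-equality test reported above may be specified differently) is to add a group main effect and an item-by-group interaction to the Rasch model:

\[
\operatorname{logit} P(X_{ni}=1) \;=\; \theta_n - \delta_i - \gamma G_n - \eta_i G_n,
\]

where \(G_n\) is a group indicator (e.g., gender), \(\gamma\) captures the overall group difference (here about 0.088 logits, not significant), and \(\eta_i\) is the item-specific DIF term that was noticeably larger for MC6.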
23. CM Scale
24. CM Instrument Partial Credit Model fit Did the Rasch measurement model fit the item data?
Overall, weighted mean square statistics indicated good item fit (.75<MNSQ<1.33)
Only two generalized item thresholds (OE12.0 and OE14.0) showed evidence of misfit (.58), but neither was statistically significant (-0.8)
Did the Rasch measurement model fit the person data?
Overall, weighted mean square fit statistics (.75<MNSQ<1.33) indicated relatively good person fit with some exceptions
6 out of 72 respondents did show evidence of statistically significant person misfit
Two cases (MNSQ=.49) indicated better than expected model fit
Four cases (MNSQ=1.89, 2.05, 2.44, 3.14) showed worse than expected fit, indicating that the expected order may be wrong, i.e. the model did not account for much of the variability in these individuals’ scores. For these four cases, the response-process evidence from the exit interviews confirmed that at least two of the respondents found the instrument confusing and/or difficult to engage with. (The weighted MNSQ fit statistic is sketched below.)
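A hedged sketch of the weighted (infit) mean square in its standard Rasch form:

\[
\text{MNSQ}_i \;=\; \frac{\sum_n \left(x_{ni} - E_{ni}\right)^2}{\sum_n W_{ni}},
\]

where \(E_{ni}\) is the model-expected score and \(W_{ni}\) the model variance of respondent n’s response to item i. Values near 1 indicate good fit; values well below 1 (e.g., .49) suggest responses more deterministic than the model expects, and values well above 1 (e.g., 3.14) suggest unexpected response patterns.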
25. 25 Results (RQ4): Validity evidence for relations to other variables Correlations between 274A course grades and EAP, MLE and raw scores were all low (*)
Ranking of post-interview (SSI) responses and 274A final reports (EA) corresponded with patterns of CM proficiency. (*) This may be due to “restriction of range” or attenuation effects, since grades were only available for 35 of the 72 respondents (see the note below).
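For context (a standard classical test theory result, not a computation reported in the study), unreliability in either score attenuates an observed correlation, which is one reason the low grade correlations are hard to interpret:

\[
r_{T_X T_Y} \;=\; \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}},
\]

where \(r_{XX}\) and \(r_{YY}\) are the reliabilities of the two measures; restriction of range similarly lowers the observed \(r_{XY}\) relative to its value in the full population.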
26. 26 Results (RQ5): Evidence for reliability
27. Results (RQ5): Reliability evidence for CM instrument’s internal consistency
28. Results (RQ5): Reliability evidence for alternate forms
29. Results (RQ6): Reliability evidence for rater agreement
30. 30 Results (RQ6): Reliability evidence for rater consistency
31. Results (RQ7): Factors associated with proficiency on CM scale These four independent variables “explain” about 35% of the variation in proficiency; that is, they seem to affect proficiency on the CM scale.
It may be the case that other factors also affect proficiency, or there may be unaccounted-for random measurement error in the CM instrument scores.
Fitting a unidimensional latent regression model with ConQuest might address the latter concern (sketched below).
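The latent regression mentioned above treats proficiency itself as the outcome inside the measurement model (a hedged sketch of the usual unidimensional formulation; the exact predictor set would mirror the dummy variables reported on the next slide):

\[
\theta_n \;=\; \beta_0 + \sum_{k=1}^{K} \beta_k x_{nk} + \varepsilon_n, \qquad \varepsilon_n \sim N(0, \sigma^2),
\]

estimated jointly with the item response model, so that measurement error in the proficiency estimates is not absorbed into the residual as it is when regressing EAP or MLE scores directly.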
32. 32 Results (RQ7): Factors associated with proficiency on CM scale 1. Those respondents who have taken 274A and who have research, paid professional, and/or consulting experience in the field do seem to score higher on the CM scale (a worked example follows the coefficients below)
The coefficient of .477 for the dummy variable (274A experience) indicates that, on average, individuals who took the course score .477 logits higher on the CM scale compared to those who did not
The coefficient of .275 for the dummy variable (research experience) indicates that, on average, individuals who have research experience score .275 logits higher on the CM scale compared to those who did not
The coefficient of .232 for the dummy variable (paid professional/consulting experience) indicates that, on average, individuals who have professional experience score .232 logits higher on the CM scale compared to those who did not
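Read additively (and holding the other predictors in the model constant), these estimates imply, for example, that a respondent who took 274A and has both research and paid professional/consulting experience is predicted to score roughly .477 + .275 + .232 ≈ 0.98 logits higher on the CM scale than an otherwise similar respondent with none of these experiences.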
33. 33 Results (RQ7): Factors associated with proficiency on CM scale (cont.) All variance inflation factors (VIF) are low. Allison (1999) suggests values less than 2.50 indicate little or no evidence of multicollinearity (see the sketch below).
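As a sketch of how such a check can be run (toy data and hypothetical column names standing in for the study’s four dummy variables; statsmodels’ variance_inflation_factor regresses each predictor on the others):

# Variance inflation factors for the regression predictors (illustrative data only).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.DataFrame({
    "took_274A":        [1, 0, 1, 1, 0, 1, 0, 0],
    "research_exp":     [1, 1, 0, 1, 0, 0, 0, 1],
    "professional_exp": [0, 1, 1, 0, 0, 1, 0, 0],
    "course_exp":       [1, 0, 0, 1, 1, 0, 0, 1],
})

X = sm.add_constant(df)  # include an intercept so VIFs are computed against the full design
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}

# Allison's rule of thumb cited on the slide: VIF < 2.50 suggests little multicollinearity.
for name, vif in vifs.items():
    print(f"{name}: VIF = {vif:.2f}")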
34. 34 Discussion/Next steps Construct design
Improve theory of overall CM proficiency
Items design/outcome space
Revise or remove several fixed choice items
Change stems to use terms (e.g. “score interpretation”) consistently
Clarify language in distractors
Develop more open ended items targeted on UOS and UWM dimensions
Augment task analysis and “think aloud” protocols on open-ended items to ensure better understanding of cognitive processes and the possible role of construct-irrelevant “noise”
Measurement model
Fit unidimensional latent regression model
Fit multi-dimensional model to (n>72) data set