1. Benchmark Assessments: Promises and Perils (Are we wasting our resources?) James H. McMillan
Virginia Commonwealth University
2. Roadmap A few questions for the audience
Data-driven instruction
Benchmark test characteristics
Formative assessment
Research on benchmark testing
MERC study
Recommendations
5. Commercial Influence The formative assessment market is considered one of the fastest-growing segments of test publishing, predicted to generate revenues of $323 million for vendors (Olson, 2006).
7. This is what we are doing:
Scores → Conclusion / Claim (the inference) → Consequence (the use)
8. More accurately:
Score + Error → Conclusion / Claim (the inference) → Consequence (the use)
9. Why Add Error? Error in testing (e.g., bad items, administrative differences)
Limited coverage of important objectives due to sampling
Unethical test preparation
Single indicator
Inflation (not your $)
10. Test Score Inflation Score inflation refers to increases in scores that do not signal a commensurate increase in proficiency in the domain of interest... gains on scores of high-stakes tests are often far larger than true gains in students' learning... [this] leads to an illusion of progress and to erroneous judgments about the performance of schools... it cheats students who deserve better and more effective schooling.
Dan Koretz, Measuring Up: What Educational Testing Really Tells Us, Harvard University Press, 2008.
11. How Much and What Kind of Error for Individual Scores?
Test Score + Error → Conclusion
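A minimal sketch of one way to attach error to an individual score: compute the standard error of measurement (SEM) from the test's reliability and score spread, then report a band rather than a point. The reliability value is assumed for illustration; the scale-score figures borrow the mean (394) and SD (42) that appear later in the Acuity class report.

```python
import math

# Assumed reliability for illustration; a real benchmark test would
# report its own internal-consistency estimate (e.g., Cronbach's alpha).
reliability = 0.85
score_sd = 42          # scale-score SD, borrowed from the class report example

# Standard error of measurement: SD * sqrt(1 - reliability)
sem = score_sd * math.sqrt(1 - reliability)

observed_score = 394   # an observed scale score
# Roughly 68% of the time the "true" score falls within +/- 1 SEM,
# and about 95% of the time within +/- 2 SEM.
low, high = observed_score - 2 * sem, observed_score + 2 * sem
print(f"SEM = {sem:.1f}; 95% band around {observed_score}: {low:.0f}-{high:.0f}")
```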
12. Inferences from Low Scores Instruction not aligned
Poor teaching
Poor item(s)
Student weakness
Individual students
Whole class
Remediation
False negative
13. Inferences from High Scores Instruction well aligned
Good teaching
Student competence
Instructional enhancements
False positives
14. Table Time to Talk How can we account for the error that is a part of every benchmark assessment we give, whether 1) commercially prepared or 2) developed at the local level?
15. Features of Formative Assessment A process of several components, not simply a test
Used by both teachers and students
Takes place during instruction
Provides feedback to students
Provides instructional adjustments or correctives
16. Formative Assessment Cycle
17. Formative Assessment Characteristics
20. Table Time to Talk
In your division/school, is benchmark = formative?
To what extent is benchmark testing in your division/school very or barely formative?
21. Research: 2007 IES Study Purpose: Predictive validity of four commercial benchmark tests used in Delaware, Maryland, New Jersey, Pennsylvania, and Washington DC
Method: Analysis of predictive correlations of district and individual student scores with state assessments
Result: Tests were well matched, but evidence of predictive validity with respect to state assessment tests is generally lacking.
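As a rough sketch of the kind of check the IES study ran, a predictive-validity analysis boils down to correlating benchmark scores with later state-test scores for the same students or districts. The score pairs below are made up purely to show the computation.

```python
from statistics import correlation  # Python 3.10+

# Made-up benchmark and state-test scores for the same eight students;
# a real study would use matched district or student records.
benchmark_scores = [412, 388, 450, 395, 430, 370, 405, 441]
state_test_scores = [420, 379, 455, 401, 418, 365, 399, 447]

# Pearson correlation as a simple predictive-validity index
r = correlation(benchmark_scores, state_test_scores)
print(f"Predictive correlation r = {r:.2f}")
```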
22. 2008 IES Study Purpose: Study effects of using benchmark assessments for grade 8 math in Massachusetts
Method: Quasi-experimental comparison of 22 intervention and 44 comparison schools
Result: No significant differences in student achievement on state tests
23. CTB/McGraw-Hill, 2009 AERA Presentation Research-based assessments with sound technical foundations
Rapid or immediate turnaround of data and reports to support and modify instruction
Component flexibility. In response to the market need and customer requests, we developed a system built on a sound technical base, with the flexibility to support modifying instruction by providing data in a useful way.
24. Flexibility Publisher developed components can be chosen as desired
Types of tests
Predictive tests reflecting state tests
Diagnostic tests reflecting pacing, scope, sequence
Numbers of tests/frequency of administration
Item types (MC, GR, CR)
A flexible system is required to meet user needs. Users can select various types of publisher-developed tests and administer them as desired. If the full battery of predictive and diagnostic assessments is selected, testing occurs approximately monthly. Diagnostic tests may replace less reliable teacher-made tests that take teachers' time to develop and score.
25. Flexibility Teachers can
Create custom tests using a pre-populated item bank aligned to state standards
Share custom developed forms with other teachers on the system
Write new items (or instructional activities) using item authoring software
Share the new items or instructional exercises on their local system
26. Flexibility Administration and data capture modes
Paper and Pencil
Scan and score answer sheets
Online
Student response devices or clickers
Assign instructional activities
Directly from online reports
Automatically or manually
27. Empirical Data Studies: Did Forms Achieve Appropriate Reliability and Desired Difficulty? Reliability is monitored for publisher-developed tests. Test difficulty is calibrated to be developmentally appropriate. Tests administered later in the year are developed to be more difficult than tests administered earlier. Test difficulty is intended to match student growth, so average proportion correct should be approximately constant over the year, as illustrated by the Colorado Language Arts assessments (.60, .61, .61 for forms administered in Sept., Dec., and Feb., respectively). Other forms indicated here are within .05 in difficulty over the year.
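A minimal sketch of the two form statistics the slide cites, average proportion correct (difficulty) and an internal-consistency reliability estimate (Cronbach's alpha), computed from a small made-up matrix of scored item responses:

```python
# Rows = students, columns = items; 1 = correct, 0 = incorrect (made-up data).
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
]
n_students = len(responses)
n_items = len(responses[0])

# Average proportion correct across items (the .60-.61 figures on the slide)
item_p = [sum(row[j] for row in responses) / n_students for j in range(n_items)]
avg_p = sum(item_p) / n_items

# Cronbach's alpha: (k / (k-1)) * (1 - sum(item variances) / total-score variance)
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

item_vars = [variance([row[j] for row in responses]) for j in range(n_items)]
totals = [sum(row) for row in responses]
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / variance(totals))

print(f"Average proportion correct = {avg_p:.2f}, alpha = {alpha:.2f}")
```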
29. Different numbers of items appear under each standard and benchmark. Training and caveats are provided to use the results responsibly.
Note: Only the portion of the Grade Level Expectations (GLE) covered by the assessed curriculum is measured on this form of the Diagnostic Assessment. Thus, inferences from students' performance should not be made to the GLE as a whole, but only to the assessed portion of the GLE. A specific GLE is measured by items on this form only when the GLE comprised at least five percent of the assessed content as indicated by the pacing guide; GLEs that did not comprise at least five percent of the curriculum are not measured by this form. Also, the reported results for GLEs measured with fewer items are less reliable than for GLEs measured with more items. Thus, when small numbers of items are used to measure a GLE, other measures (e.g., observations, homework, etc.) should be used to confirm the results reported here.
30. Class Report The Predictive Class Assessment Report shows the average scale score for a class and how the class is predicted to perform on CRCT in the spring of each year.
A: The percentage of students in a particular class who are expected to fall into each proficiency category on CRCT. For example, 4% of students in this class are expected to fall into the Exceeds category on CRCT. The categories listed will include the three CRCT proficiency levels: Does Not Meet, Meets, Exceeds.
B: The average scale score of the class, 394. The number in parentheses is the Standard Deviation (SD); in this case, the SD is 42.
C: The minimum/maximum scale score range for the Acuity assessments. The scale scores range from 230 to 590.
31. Longitudinal Reports The Student Longitudinal Report demonstrates the progress by scale score of an individual student on each Acuity assessment. The Longitudinal Report will not be displayed until two Acuity assessments have been taken. The report will display information over a three-year period beginning in the 2008-09 school year.
A: A specific scale score range for the Acuity assessments.
B: A line graph that shows a series of scale scores from Acuity assessments. The scale score from each assessment is indicated by a symbol, and the band around this symbol is the Standard Error of Measurement (SEM).
C: Acuity assessment forms included in this longitudinal report. On this report, the scale scores represent data points from the three predictive assessments taken by a student in third, fourth, and fifth grades.
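Reading the SEM bands on a longitudinal report is essentially an arithmetic judgment: a change between two assessments is only clearly real when it exceeds the standard error of that difference. A minimal sketch, with assumed scale scores and SEM values (not Acuity's published figures):

```python
import math

# Assumed scale scores and SEMs for two successive predictive assessments.
score_1, sem_1 = 405, 16
score_2, sem_2 = 432, 16

# Standard error of the difference between two independently measured scores
se_diff = math.sqrt(sem_1 ** 2 + sem_2 ** 2)
change = score_2 - score_1

# Rough rule of thumb: treat growth as notable only if it exceeds
# about two standard errors of the difference.
notable = abs(change) > 2 * se_diff
print(f"Change = {change}, SE of difference = {se_diff:.1f}, "
      f"clearly beyond error: {notable}")
```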
32. MERC Study Purpose: Explore the extent to which benchmark test results are used by teachers in formative ways to support student learning
What is the policy context and nature of benchmark testing?
How do teachers use benchmark testing data in formative ways?
What factors support and/or mitigate teachers' formative use of benchmark testing data?
33. Methods Role of MERC study team
Qualitative double-layer category focus-group study design (Krueger & Casey, 2009)
Layers: school type (elementary or middle) and district (N=4)
Protocol was developed and piloted to address:
the general nature of benchmark testing policies and the type of data teachers receive
expectations for using benchmark test results
instructional uses of benchmark test results
general views on benchmark testing policies, practices and procedures
Focus groups lasted 1-1.5 hours with 4-5 participants and were digitally recorded
34. Participants A two-stage convenience sampling process was used to select and recruit focus group participants
District → School Principal → Teachers
Spring 2009: 9 focus groups w/40 core-content area teachers across 4 districts
The majority were white (85%) and female (90%) with an average of 12.5 years of teaching exp. (range of 1-32 yrs.)
25% were beginning teachers with 1-3 years of teaching experience and 25% had been teaching for over 20 years.
The majority (80%) taught at the elementary level in grades 4 and 5 and the remaining were middle school teachers in the areas of civics, science, mathematics and language arts.
35. Preliminary Findings: Informing Instruction 1. Teachers make a variety of instructional adjustments based on the results of benchmark assessments, especially when there is an expectation or culture established for using data.
If I see a large number of my students missing in this area, I am going to try to re-teach it to the whole class using a different method. If it is only a couple of [students], I will pull them aside and instruct one-on-one.
We are asked to be accountable for each and every one of those students and we sit face-to-face with an administrator who says to you, how are you going to address their needs? And we have to be able to say, well, I am pulling them for remediation during this time, or I am working with a small group, or I've put them in additional enrichment
we have got to be able to explain how we are addressing those weaknesses.
36. Preliminary Findings: Learning Time v. Testing Time 2. Teachers have significant concerns about the amount of instructional time that is devoted to testing in general and the implications of this for the quality of instruction they can provide.
I think it has definitely made us change the way we teach
I do feel like sometimes I don't teach things as well as I used to because of the time constraints.
Just the time it takes to give all these assessments. As important as these assessments are, it does take instructional time
we don't just do those [benchmark assessments], because we do a lot of pre- and post-assessments so this is just one more thing on top of a lot of other testing we do.
You are sacrificing learning time for testing time
we leave very little time to actually teach.
37. Preliminary Findings: Value of Test Results 3. The value teachers place on benchmark testing data is associated with their views on the quality of the test items, the integrity of the scoring process, and the alignment of the test with the curriculum.
We really need to focus on the tests being valid. It is hard to take it seriously when you don't feel like it is valid. When you look at it and you see mistakes or passages you know your students aren't going to be able to read because it is way above their reading level. The people writing the tests need some kind of training.
Sometimes, I think
you have to use your professional judgment
sometimes the questions that are on the test are just simply bad questions
was it because they didn't understand cause and effect or is it because that was really a poorly written question?
38.
Many times the 9-week assessments are so all-encompassing that it is difficult for the students
...you may only have one question that addresses a specific objective. And so that is not really a true representation of what the child knows about that objective.
39. Preliminary Findings: Instructional Benefits Teachers' views on benchmark testing policies are somewhat positive; they recognize the benefits to their instruction and students' learning.
It [benchmark test results] helps me analyze my instruction as a whole. I didn't present the questions in my classroom the way the questions are presented on the test. So maybe I need to backtrack or maybe I didn't spend a lot of time in this area, and I need to go back and [re-teach].
They are one more piece you can have and when you compare it across the county you can use it to see how you are doing
but again, it just shows that it isn't the only thing we base our instructional decisions on by all means.
It tells you what you need to place more emphasis on. It really alerts you to the weaknesses of your class and how much more practice you need to provide. I think they [the benchmark tests] are excellent, I think they are great because of the correlation with the SOL tests and this is one way of getting the children prepared and familiar with the formatting of the SOL test so that they will be successful.
40. Conclusions Preliminary findings to date seem consistent with the literature.
School culture and expectations for the use of test results are a key factor in how results are used.
The results so far indicate that, under the right conditions, benchmark testing may serve a meaningful formative purpose.
Additional focus groups are planned for fall 2009 to test preliminary findings.
41. Recommendations for Effective Use of Benchmark Assessments Clarify Purpose: Instructional
To adapt instruction and curriculum to meet student needs: content, pace, and strategies; whole class or small groups; re-teach
To identify low performing students
To implement teacher strategies for using the information?
Program Evaluation
To compare different instructional programs
To determine instructional effectiveness
To modify curriculum and instruction in the future
Predicting State Test Results
42. Recommendations for Effective Use of Benchmark Assessments Use About 30 High Quality Items
3-5 items for diagnostic information per topic or trait
Pilot?
Item discrimination, difficulty, fairness, and clarity (see the item-analysis sketch below)
Establish Alignment Evidence
Content matched to SOLs
Number of items match Reporting Category percentages
Content matched to instruction and opportunity to learn
Cognitive demands of items
Provide Clear Guidelines for Using Results
Standardize Administration Procedures
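If items are piloted, the item statistics named above (difficulty and discrimination) can be computed directly from scored responses. A minimal sketch with made-up pilot data; the corrected item-total correlation stands in for a fuller discrimination analysis:

```python
# Minimal item-analysis sketch for piloted benchmark items (made-up data).
# Rows = students, columns = items; 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
n_students = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

for j in range(n_items):
    item = [row[j] for row in responses]
    difficulty = sum(item) / n_students            # proportion correct
    rest = [t - i for t, i in zip(totals, item)]   # total score minus this item
    discrimination = pearson(item, rest)           # corrected item-total correlation
    print(f"Item {j + 1}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")
```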
43. Recommendations for Effective Use of Benchmark Assessments Verify Results with Other Evidence
Classroom tests
Contrasted groups
Include Estimates of Error
Monitor Unintended Consequences
Ensure Fairness
Equitable treatment
Opportunity to learn
Document Costs
financial
student time
teacher time
44. Recommendations for Effective Use of Benchmark Assessments Evaluate Use of Results What evidence exists that teachers are using results to modify instruction and students are learning more?
Use Teams of Teachers for Review and Analysis
Provide Adequate Professional Development
Commercial or Locally Prepared?
45. Commercial or Locally Developed?
46. Benchmark Assessments: Promises and Perils (Are we wasting our resources?) James H. McMillan
Virginia Commonwealth University