Race to the Top Assessment Program: A New Generation of Comparable State Assessments
Gary W. Phillips, American Institutes for Research
United States Department of Education Public Hearings, December 1, 2009, Denver, Colorado
Fewer, Clearer, Higher Standards
• The goals of the next-generation assessment system envisioned by the Race to the Top cannot be reached with our existing testing paradigm.
• Our existing system of state assessments is
  • uncoordinated
  • non-comparable
  • non-aggregatable
  • non-scalable
  • too expensive
  • too slow
Three Pillars of the New Assessment System
• Common standards
• Computer-adaptive tests
• Better measures of growth
Three Pillars of the New Assessment System: 1. Common Content Standards
• Common content standards in each state consortium that are internationally competitive and lead to high school graduates who are ready for well-paying careers and postsecondary schooling.
• A common item bank (developed by teachers across the consortium) and common test blueprints; each state would administer comparable tests that are equated to the consortium's common scale.
• At least 85% of each state test would cover all of the consortium's common content standards (the other 15% would be state supplements to the common content standards).
Three Pillars of the New Assessment System: 1. Common Performance Standards
• Common, internationally benchmarked proficient standards for each grade (comparable across all consortia) would be vertically articulated across grades and on a trajectory that leads to high school career-ready and college-ready proficiency. (The difficulty of the proficient standard would be comparable across all consortia and across all states.)
• Conventional standard-setting methodology would be re-engineered. Current standard setting (e.g., the bookmark procedure) is based primarily on content judgments (state impact data are an afterthought, and national or international impact data are typically not used). In the new design, the common proficient performance standard would be established first through empirical benchmarking. Performance level descriptors (PLDs) would subsequently be written to describe the proficient standard, and then the PLDs for the other standards would be written.
• Adequate yearly progress (AYP) would be based on proficient performance standards that are comparable across all consortia and across all states, and would therefore yield fair state, district, and school comparisons.
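One common form of empirical benchmarking is to express a state proficiency cut score on an international scale by matching percentile ranks across the two score distributions. The sketch below illustrates that idea under a simplifying normality assumption; the function name and all means, standard deviations, and cut scores are invented for illustration and are not the actual benchmarking procedure used in any state.

```python
# Illustrative sketch of empirical benchmarking: mapping a state
# "proficient" cut score onto an international scale at the same
# percentile rank, assuming roughly normal score distributions.
# All numeric values are hypothetical.
from statistics import NormalDist


def benchmark_cut(state_cut, state_mean, state_sd, intl_mean, intl_sd):
    """Return the international-scale score at the same percentile
    rank as the state cut score (normal approximation)."""
    pct = NormalDist(state_mean, state_sd).cdf(state_cut)
    return NormalDist(intl_mean, intl_sd).inv_cdf(pct)
```

For example, a state cut 1.5 standard deviations above its own mean maps to the score 1.5 standard deviations above the international mean. Real benchmarking studies would use equipercentile methods on the observed distributions rather than a normal approximation.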
Illustration of International Benchmarking in West Virginia (figure)
Three Pillars of the New Assessment System: 2. Computer-adaptive Tests
• The current one-size-fits-all model (the same paper-pencil test given to all students) provides poor measurement for large portions of the student population: the tests are too easy for high-achieving students and too hard for low-achieving students, students with disabilities, and English language learners.
• Computer-adaptive tests should be encouraged in each consortium. (They already exist in various stages of development in many states, including Delaware, Georgia, Hawaii, Idaho, Maryland, North Carolina, Oregon, South Dakota, Utah, and Virginia.)
• Cost savings, multiple testing opportunities, immediate feedback, shorter tests.
• Formative and interim assessments (intended to improve instruction) would be developed that are aligned with the summative assessment and the common standards.
• Constructed-response items (where possible) would also be administered and scored by computer (but validated by teacher hand scoring). Constructed-response items and performance tasks that could not be scored by computer would be scored by teachers.
• Accommodations would be provided, and universal design would be part of the assessment.
• Better reliability and more accurate measurement for high- and low-achieving students, and better measurement for students with disabilities and English language learners.
• Better validity, because the item-selection algorithm can be adaptive as well as standards-based.
  • At the student level, the test can meet the blueprint (e.g., if the blueprint calls for 20% algebra, then 20% of the items in the CAT will be algebra).
  • At the classroom level, the test can cover the deeper levels of the content standards (e.g., across the classroom it might cover all sub-objectives). This forces teachers to teach all levels of the content standards for which they will be held accountable.
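The blueprint-constrained, standards-based item selection described above can be sketched in a few lines. This is a minimal illustration, not the actual algorithm of any state program: it assumes a 2PL IRT model, picks the unused item with the greatest Fisher information at the student's current ability estimate, and honors a content-area quota (e.g., 20% algebra). The item pool, content areas, and parameters below are invented.

```python
# Minimal sketch of computer-adaptive item selection under a 2PL IRT
# model with a blueprint (content-area) constraint. Hypothetical data.
import math


def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def information(theta, a, b):
    """Fisher information contributed by an item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def next_item(theta, pool, administered, counts, blueprint, test_length):
    """Pick the most informative unused item whose content area is still
    under its blueprint share (e.g., 20% algebra on a 50-item test)."""
    def under_quota(item):
        cap = math.ceil(blueprint[item["area"]] * test_length)
        return counts.get(item["area"], 0) < cap

    candidates = [i for i in pool
                  if i["id"] not in administered and under_quota(i)]
    if not candidates:  # fall back if every remaining area hit its quota
        candidates = [i for i in pool if i["id"] not in administered]
    return max(candidates, key=lambda i: information(theta, i["a"], i["b"]))
```

In a full CAT, the ability estimate theta would be updated (e.g., by maximum likelihood) after each response, and exposure control would keep popular items from being over-used; both are omitted here for brevity.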
Three Pillars of the New Assessment System: 3. Better Measures of Growth
• With current growth models we frequently see negative growth for the top students and find that our lowest-achieving students are the fastest learners. Both of these patterns are usually artifacts of the ceiling and floor effects of our current testing paradigm. These artifacts would be ameliorated by computer-adaptive testing.
• A common vertical scale would be needed to measure growth across grades (within each consortium), which would facilitate the measurement of student grade-to-grade growth and the application of student growth models.
• Value-added indices and teacher effectiveness measures would be comparable and more accurate.
• A statewide longitudinal data system would be required that uses a unique statewide student identifier, with student data that are transferable, linked to teachers and schools, and maintained throughout K-12.
• More reliable measures of growth. Growth measures are inherently less reliable than status measures. However, because computer-adaptive testing provides more reliable measures of status, it therefore provides more reliable measures of growth.
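Why a common vertical scale matters can be shown with a toy calculation: once grade 4 and grade 5 scores sit on one scale, a student's gain is a simple difference, and gains can be aggregated into a crude teacher-level growth measure. The student records, scale values, and teacher labels below are invented for illustration; real value-added models condition on far more than a mean gain.

```python
# Hypothetical sketch: grade-to-grade gains on a common vertical scale,
# aggregated to a simple mean-gain measure per teacher. Invented data.
from collections import defaultdict

records = [
    # (student_id, teacher, grade4_score, grade5_score) on one vertical scale
    ("s1", "t_a", 210.0, 232.0),
    ("s2", "t_a", 195.0, 214.0),
    ("s3", "t_b", 250.0, 259.0),
    ("s4", "t_b", 240.0, 255.0),
]


def mean_gain_by_teacher(recs):
    """Average scale-score gain per teacher; the difference is only
    meaningful because both scores are on the same vertical scale."""
    gains = defaultdict(list)
    for _sid, teacher, pre, post in recs:
        gains[teacher].append(post - pre)
    return {t: sum(g) / len(g) for t, g in gains.items()}
```

Without a vertical scale, the subtraction `post - pre` compares numbers from two unrelated scales and the "gain" is meaningless, which is the core point of this pillar.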
Benefits of the New Assessment Design
• Implements the vision of Race to the Top with high-quality assessments based on fewer, clearer, higher standards.
• Improves NCLB by correcting two of its fundamental problems (too many content standards and too many performance standards).
• Scalable to a large number of states by taking advantage of innovation and technology.
• Better measurement for a wider range of students in the general population; can be implemented in alternate assessments for the 1% population; and eliminates the need for a modified assessment for the 2% population.
• Feasible, and meets all professional and technical standards of AERA, NCME, and APA.
• Affordable; in the long run it would cost about half as much as paper-pencil tests.
• Benefits the federal government (comparable data for states, districts, schools).
• Benefits the states (cheaper, faster, better assessments with some local flexibility).
Technical question 1 - What is the best technical approach for ensuring the vertical alignment of the entire assessment system across grades (e.g., grades 3 through 8 and high school)?
• The entire assessment system within each consortium would be placed on a vertical scale (e.g., from grade 3 through high school). The vertical scale would reflect the incrementally increasing difficulty of the content standards as the student moves up the grades, and would be used to improve the accuracy of student growth models and provide better measures of teacher and principal effectiveness.
• In addition to a vertical scale, the performance standards would be vertically articulated. For example, the proficient standard would be established in such a way that it reflects an orderly progression of increasingly higher expectations as the student moves up the grades. The standards would be on an upward trajectory leading to an internationally benchmarked, career-ready and college-ready proficiency standard in high school.
Technical question 2 - What would be the best technical approach for ensuring external validity of such an assessment system, particularly as it relates to postsecondary readiness and high-quality internationally benchmarked content standards?
• Each consortium of states would need to fund empirical research on how well the high school test predicts college and career success. Recent work by the National Assessment Governing Board (related to validating the 12th-grade NAEP) would inform this process.
• The predictive validity studies and an evaluation of the validity of the international benchmarking should be done by an independent group (e.g., the National Academy of Sciences).
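At its simplest, the predictive-validity research described above asks how strongly high school test scores correlate with a later postsecondary outcome. The sketch below computes a Pearson correlation by hand on invented data; the scores, GPAs, and the use of first-year GPA as the criterion are all hypothetical, and real studies would also correct for range restriction and measure career outcomes.

```python
# Hypothetical sketch of a predictive-validity check: the Pearson
# correlation between high school test scores and first-year college
# GPA. All data points are invented for illustration.
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


test_scores = [520, 480, 610, 550, 590, 460]
college_gpa = [2.9, 2.5, 3.6, 3.1, 3.4, 2.3]
r = pearson_r(test_scores, college_gpa)
```

A correlation near zero would signal that the high school test is not measuring career- and college-readiness, which is why the slide calls for independent evaluation of these studies.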
Technical question 3 - What proportion of assessment questions do you recommend releasing each testing cycle in order to ensure public access to the assessment while minimizing linking risk? What are the implications of this proportion for the costs of developing new assessment questions and for the costs and design of linking studies across time?
• Each state consortium should release enough items each year to thoroughly represent the content standards (around 75-100 items). Over time, more and more items would be released.
• This design depends on a major item development effort. A substantial pool of items would be needed to
  • adequately cover the content standards,
  • equate new forms to the common scale with each successive administration, and
  • release enough items to help teachers use the items for teaching and diagnostic purposes.
• However, since items would be shared across states within a consortium, the cost should be manageable.
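Equating each new form to the common scale is what makes released items safe to retire: unreleased anchor items that appear on both the old and new forms tie the two calibrations together. One standard approach is mean/sigma linking of the anchor items' IRT difficulty parameters, sketched below; the function name and the anchor values in the test are illustrative, not a consortium's actual linking design.

```python
# Hedged sketch of mean/sigma IRT linking: placing a new form's
# calibration onto the common scale using anchor items estimated on
# both forms. Anchor difficulty values are hypothetical.
import statistics


def mean_sigma_link(anchor_b_old, anchor_b_new):
    """Return (A, B) such that theta_common = A * theta_new + B,
    chosen so the anchor items' difficulties match in mean and SD."""
    A = statistics.pstdev(anchor_b_old) / statistics.pstdev(anchor_b_new)
    B = statistics.mean(anchor_b_old) - A * statistics.mean(anchor_b_new)
    return A, B
```

The linking risk mentioned in the question is the flip side of this design: if anchor items leak through over-generous release, the transformation (A, B) is estimated on compromised items and every score on the common scale drifts, which is why the release proportion must be set jointly with the equating design.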