Robert L. Linn CRESST, University of Colorado at Boulder

The Concept of Validity in the Context of NCLB Robert L. Linn CRESST, University of Colorado at Boulder Presentation at the Ninth Annual Maryland Assessment Conference: The Concept of Validity: Revisions, New Directions and Application. College Park, MD: University of Maryland. Sponsored by the Maryland State Department of Education and the Maryland Assessment Research Center for Education Success, October 9 and 10, 2008

Validity Points of Broad Consensus • Validity is the most fundamental consideration in the evaluation of the appropriateness of claims about, and uses and interpretations of assessment results. • Validity is a matter of degree rather than all or none.

Validity (continued) Broad, but not universal agreement (for an exception, see Lissitz & Samuelsen, 2007) • It is the uses and interpretations of tests rather than the test itself that are validated. • Validity may be relatively high for one use or interpretation of assessment results but quite low for another use or interpretation.

Validity (continued) • A comprehensive validation program for state tests used for purposes of NCLB requires systematic analysis of the myriad uses, interpretations, and claims that are made. • Evidence relative to particular uses, interpretations and claims needs to be accumulated and organized into relevant validity arguments (Kane, 2006).

1999 Test Standards • “Validity is the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests.” • “Validation logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use.” (AERA, APA, & NCME, 1999, p. 9).

Foundation for position in the Test Standards Concept of validity in the Test Standards builds on the work of major validity theorists • Cronbach (1971, 1980, 1988, 1989) • Kane (1993) • Messick (1975, 1989) • Shepard (1993)

Kane (2006) Argument-Based Approach • Interpretive Argument: specification of proposed interpretations and uses of • Validation Argument: evaluation of the interpretive argument • Builds on earlier work by Cronbach (1989), Kane (1992), Messick (1989), and Shepard (1993)

Validity Argument (Cronbach, 1988) • Functional perspective • Political perspective • Operationalist perspective • Economic perspective • Explanatory perspective

Guiding Questions Shepard (1993) • “What does the testing practice claim to do? • What are the arguments for and against the intended aims of the test? • What does the test do in the system other than in claims?” (p. 429)

NCLB Accountability • States required to administer tests of mathematics and reading or English language arts to all students in grades 3 through 8 • Science tests required for one grade in each of three levels: elementary, middle, and high school

NCLB Accountability (continued) • States had to adopt academic achievement standards defining proficient performance and two other levels (usually called basic and advanced) • States had to establish targets, known as annual measurable objectives (AMOs), set on trajectories that would lead to all students being at the proficient level or above by 2014

NCLB Targets • Current Status: AMO is the percent proficient required each year, set to be on a trajectory to 100% proficient or above by 2014 • Change: Safe harbor allows a school to make AYP if the percentage of students scoring below proficient is reduced by at least 10% compared to the previous year
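The two routes to meeting an achievement target (status and safe harbor) come down to simple arithmetic. The following is a minimal sketch, not any state's actual rule: the straight-line trajectory in `linear_amo` is an assumption (states chose a variety of trajectory shapes), the function names are hypothetical, and the sketch ignores any additional conditions attached to safe harbor.

```python
def linear_amo(baseline_pct, start_year, year, end_year=2014):
    """Hypothetical straight-line AMO rising from baseline_pct to 100% by end_year."""
    if year >= end_year:
        return 100.0
    fraction_elapsed = (year - start_year) / (end_year - start_year)
    return baseline_pct + (100.0 - baseline_pct) * fraction_elapsed


def meets_target(pct_proficient_now, pct_proficient_prev, amo):
    """Return 'status', 'safe harbor', or None for one group in one subject."""
    # Status route: percent proficient meets or exceeds the AMO.
    if pct_proficient_now >= amo:
        return "status"
    # Safe harbor route: percent below proficient falls by at least 10%
    # relative to the previous year's percent below proficient.
    below_prev = 100.0 - pct_proficient_prev
    below_now = 100.0 - pct_proficient_now
    if below_prev > 0 and below_now <= 0.9 * below_prev:
        return "safe harbor"
    return None


amo_2008 = linear_amo(baseline_pct=40.0, start_year=2002, year=2008)
print(amo_2008)                            # 70.0
print(meets_target(50.0, 40.0, amo_2008))  # safe harbor (60% below drops to 50%)
print(meets_target(42.0, 40.0, amo_2008))  # None
```

Note that the 10% reduction is relative, not ten percentage points: a group with 60% below proficient needs to reach 54% or lower.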

NCLB Targets (continued) Disaggregated reporting for subgroups • Economically disadvantaged students • Major racial and ethnic groups • Students with disabilities • Students with limited English proficiency

NCLB Targets (continued) Subgroup reporting • Critical for monitoring the closing of gaps in achievement • No real relevance for small schools with homogeneous student bodies • However, it leads to many hurdles that large, diverse schools must meet

Multiple-Hurdle Approach • NCLB uses multiple-hurdle approach • Schools must meet multiple targets each year – participation and achievement separately for reading and mathematics for the total student body and for subgroups of sufficient size

Multiple-Hurdle Approach (continued) • Many ways to fail to make AYP (miss any target), but only one way to make AYP (meet or exceed every target) • Large schools with diverse student bodies at a relative disadvantage in comparison to small schools or schools with relatively homogeneous student bodies
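The disadvantage for large, diverse schools follows from the conjunctive rule itself. The sketch below counts hurdles and applies the all-or-nothing decision. The function names are hypothetical, and the independence assumption in `chance_all_met` is purely illustrative (real targets are correlated), so the probabilities indicate direction only.

```python
def count_hurdles(subgroups_of_sufficient_size,
                  subjects=("reading", "mathematics"),
                  indicators=("participation", "achievement")):
    """Number of separate targets: total student body plus each reportable subgroup."""
    groups = 1 + len(subgroups_of_sufficient_size)
    return groups * len(subjects) * len(indicators)


def makes_ayp(targets_met):
    """Conjunctive rule: AYP is made only if every single target is met."""
    return all(targets_met)


def chance_all_met(p_each, n_hurdles):
    """Illustrative only: assumes each hurdle is cleared independently with probability p_each."""
    return p_each ** n_hurdles


small_school = count_hurdles([])  # 4 hurdles
diverse_school = count_hurdles(
    ["economically disadvantaged", "racial/ethnic group A", "racial/ethnic group B",
     "students with disabilities", "limited English proficiency"]
)  # 24 hurdles

print(small_school, diverse_school)
print(chance_all_met(0.95, small_school))    # about 0.81
print(chance_all_met(0.95, diverse_school))  # about 0.29
print(makes_ayp([True] * 23 + [False]))      # False: one missed target is enough
```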

Growth Models • Growth Pilot Program: Percentage of students who are either proficient or on a growth trajectory toward proficient within three years • Because growth counts toward AYP only for students on a rapid trajectory to proficiency, few schools that would not make AYP under the status approach do so because of the growth approach
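The "proficient or on track within three years" count can be sketched as follows. This is a hedged illustration only: the straight-line projection from a single year's gain, the score scale, and the function names are all assumptions, and approved state growth models differed in how they projected student trajectories.

```python
def on_track(previous_score, current_score, proficient_cut, years=3):
    """Assumed projection: the most recent one-year gain continues for `years` more years."""
    gain = current_score - previous_score
    return current_score + years * gain >= proficient_cut


def pct_proficient_or_on_track(students, proficient_cut, years=3):
    """students: list of (previous_score, current_score) pairs for one group."""
    counted = sum(
        1
        for previous, current in students
        if current >= proficient_cut or on_track(previous, current, proficient_cut, years)
    )
    return 100.0 * counted / len(students)


# Hypothetical scores on a scale where 350 is the proficient cut.
students = [(300, 360), (300, 320), (300, 305), (350, 340)]
print(pct_proficient_or_on_track(students, proficient_cut=350))  # 50.0
```

Under the status approach only the first student counts (25%). The growth approach adds the second student, whose projected score reaches the cut within three years, but not the third, whose gain is too slow.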

Primary Use and Interpretation of Test Results for NCLB • Use: Identification of schools as making or failing to make AYP • Schools that fail to make AYP two or more years in a row placed in “needs improvement” category • Interpretation: Schools that make AYP are better or more effective than schools that fail to make AYP

Multi-level Interpretations • Validity of interpretations of individual student scores not equivalent to validity of interpretations of aggregate results (Zumbo & Forer, in press) • Need to think in terms of validation at aggregate level (e.g., school or school district) as well as individual student level

Validation of School Quality Inference • Validating the claim that if school A makes AYP it is of higher quality or more effective than school B that fails to make AYP requires elimination of plausible rival hypotheses for the difference in AYP status • AYP differences due to higher achievement at school A than at school B in earlier years, e.g., when children enter school • AYP differences due to differences in demographics • Differences due to differences in parental support

Inferences from Growth Models • Growth models rule out the alternate explanation of differences in prior achievement • Nonetheless, causal inferences about school effectiveness are not justified by the growth approach to test-based accountability (Raudenbush, 2004; Rubin, Stuart, & Zanutto, 2004)

Growth Model Results • Many rival explanations to between-school differences in growth besides differences in school quality or effectiveness • Results better thought of as descriptive for generating hypotheses about school quality that need to be evaluated

School Characteristics and Instructional Practice • School differences in achievement and in growth describe outcomes and can be the source of hypotheses about school effectiveness • Accountability systems need to be informed by direct information about school characteristics and instructional practices

NCLB Peer Review • Peer Review Purposes • Inform states about what would be Useful Evidence • Guide review teams who advise the Department

Validity Evidence for Peer Review • Related to test content • Based on relationships to other variables • Based on student response processes • Based on internal structure • Alignment of assessments to content standards • Based on consequences of assessments

Consequences and Validity • “Perhaps the most contentious topic in validity is the role of consequences” (Brennan, 2006, p. 8). • Although investigations of the consequences of test uses are commonly referred to as “consequential validity”, Messick did not use that designation.

Messick’s Facets of Validity

Controversy • Many experts (e.g., Popham, Mehrens, Green, Ebel, and, most recently, Lissitz and Samuelsen) have argued that consequences should not be considered part of validity, while others (e.g., Lane, Linn, Moss, Shepard, Brennan, and Kane) have argued that they should be considered as part of validity.

Controversy (continued) • Fairly broad agreement that it is important to look at positive and negative effects of test use as part of overall evaluation, even if such an evaluation is considered beyond the scope of validation, per se.

Peer Review Guidance on Consequences “In validating an assessment, the State must also consider the consequences of its interpretation and use. Messick (1989) points out that these are different functions and that the impact of an assessment can be traced either to an interpretation or to how it is used. Furthermore, as in all evaluative endeavors, States must attend not only to the intended outcomes, but also to unintended effects” (U.S. Department of Education, 2004, p. 33).

Test Standards • Narrow view of consequences and validity • Consequences that are directly due to the way in which the construct is measured • Degree to which intended benefits are realized • Excludes “evidence that may inform decisions about social policy but falls outside the realm of validity”

Test Standards • 1.24 “When unintended consequences result from test use an attempt should be made to investigate whether such consequences arise from the test’s sensitivity to characteristics other than those it is intended to assess or to the test’s failure fully to represent the intended construct” (1999, p. 23).

Michael Kane • “Consequences have always been a part of our conception of validity… Traditional definitions of validity in terms of how well a testing program achieves its goals… necessarily raise questions about consequences, positive and negative” (Kane, 2006, p. 54).

Consequences of Uses of NCLB Assessments • Controversy regarding consequences as a component of validity, but not about the importance of evaluating consequences • Frameworks • Bill Mehrens • Suzanne Lane and her colleagues

Mehrens Framework • Curricular and instructional reform • Teacher motivation and stress • Student motivation and self concept • Changes in student achievement • Public awareness of student achievement

Lane et al. Framework • Identification of a set of propositions about consequences that are central to an interpretive argument • (e.g., school administrators and teachers are motivated to adapt instruction and curriculum to the content standards) • (e.g., students are motivated to learn as well as to perform their best on the assessment) • Teacher and student questionnaires and interviews regarding motivation and instructional practices • Collection of multiple indicators of student achievement

Frameworks of Lane and Mehrens • Applicable to the status approach to AYP as well as to growth model approach to AYP, and/or other types of accountability uses of growth models, e.g., value-added models. • With growth models the emphasis on student learning may be greater than in a status approach to accountability.

Curricular and instructional reform • Questionnaire studies are most common • Teachers • Principals • Interviews • Teachers • Principals • Qualitative studies • Collection of instructional artifacts

Teacher motivation and stress - Student motivation and self concept • Questionnaire studies are most common • Teachers • Students • Interviews • Teachers • Students • Qualitative studies

Student achievement • Center on Education Policy • Tracked trends on state tests before and after enactment of NCLB • Tracked size of achievement gaps • Compared trends in achievement and gaps on state tests to NAEP • Generally modest increases in achievement and modest reductions in size of gaps • Doesn’t prove effect of NCLB tests but generally consistent with intention

Alternate Assessments • Inclusion of students with severe cognitive disabilities in alternate assessments intended to improve learning for those students • Inclusion judged to be having positive effects on students participating in alternate assessments • Need more evidence of influence on instruction for included students and effects on their learning

End-of-Course Tests • Use of questionnaires, interviews, and collection of instructional artifacts to document changes in • Rigor of courses and instruction • Uniformity of instruction across schools • Student course taking patterns • Student dropout rates

Conclusion Two major validity issues yet to be addressed by states regarding their NCLB testing programs • Validity of inferences about school quality based on test-based AYP determinations for schools • Consequences of state testing programs used for purposes of NCLB. Neither issue is easy to address, but both are important to the justification of state testing programs used for NCLB

“Validation is doing your damnedest with your mind – no holds barred. Eddington, as you know said that about science” (Cronbach, 1988, p. 14).
