Using State Tests to Measure Student Achievement in Large-Scale Randomized Experiments: An Empirical Assessment Based on Four Recent Evaluations. IES Research Conference, June 28th, 2010. Marie-Andrée Somers (Presenter), Pei Zhu, Edmond Wong. MDRC.
Two key concerns with using state tests in an evaluation… • They may not be suitable for the evaluation • Validity concerns: They may not be aligned with outcomes of interest (do not provide a valid inference about program impacts) • Reliability concerns: They may be too difficult for low-performing students (unreliable) • Variation in scale/content of state tests also complicates the task of combining impact findings across states and grades
About This Study • Funded by the Institute of Education Sciences (IES) • Purpose is to “bring data to bear” on several topics covered in the May et al. discussion paper: • Are state tests suitable for evaluation purposes? • As a measure of the outcome(s) of interest? • As a measure of student achievement at baseline? • How should impacts on state tests be pooled? • Are impact findings sensitive to methods of rescaling and aggregating test scores across states and/or grades?
Overview of Analytical Approach • We identified 4 large-scale randomized experiments where achievement was measured using both (i) state tests AND (ii) a study test • The study test provides a benchmark for gauging the suitability of state tests • Two types of analyses: • Impact analyses: We compared estimated impacts on state tests and on the “benchmark” study test • Descriptive analyses: We also examined published information on the characteristics/content of tests
Data and Samples • Studies represent diversity with respect to grade levels and outcomes • Analysis sample includes students with a state test score and a study test score
Approach for Estimating Impacts • Impact on state tests: • Rescaling: Scores are z-scored by state and grade using the sample mean and standard deviation • Pooling approach: Impacts by state and grade are aggregated using precision weighting • Impact on the study test: • Rescaled/pooled using the same approach for comparability
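The rescaling and pooling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual code. It assumes that "precision weighting" means weighting each state-by-grade impact estimate by the inverse of its squared standard error, and that the sample standard deviation uses the n-1 denominator. The function names and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_by_block(records):
    """Z-score test scores within each (state, grade) block.

    `records` is a list of dicts with keys 'state', 'grade', 'score'
    (a hypothetical layout). Uses the sample mean and the sample
    standard deviation (n-1 denominator, an assumption) of each block.
    Each block needs at least two scores.
    """
    scores_by_block = defaultdict(list)
    for r in records:
        scores_by_block[(r["state"], r["grade"])].append(r["score"])

    stats = {k: (mean(v), stdev(v)) for k, v in scores_by_block.items()}

    rescaled = []
    for r in records:
        m, s = stats[(r["state"], r["grade"])]
        rescaled.append({**r, "z": (r["score"] - m) / s})
    return rescaled


def precision_weighted(estimates):
    """Pool block-level impacts using inverse-variance weights.

    `estimates` is a list of (impact, standard_error) pairs, one per
    state-by-grade block. Returns the pooled impact and its standard
    error, assuming the block estimates are independent.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = (1.0 / total) ** 0.5
    return pooled, pooled_se
```

With made-up inputs, `precision_weighted([(0.10, 0.05), (0.20, 0.10)])` gives the more precise block four times the weight of the other, so the pooled impact (0.12) sits closer to 0.10 than to 0.20.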
Criteria for Assessing “Suitability” • Two dimensions of suitability: • Validity: Whether the content of state tests is aligned with the outcomes of interest in the evaluation • Reliability: Whether state tests provide a reliable measure of achievement for the target population (in this case, low-performing students) • A key concern: State tests may have low reliability and may not yield valid inferences about program effectiveness
Criteria for Assessing “Suitability” • Implications for the impact findings: • Poor Validity: • Could fail to detect impacts on the outcome of interest (invalid inference about program effectiveness) • Affects the magnitude of the estimated impact on state tests • Low Reliability: • Student achievement is estimated with greater error • Affects the standard error of the estimated impact on state tests
Criteria for Assessing “Suitability” • Reliability: Compare the standard error of the estimated impact on state tests vs. the study test • Smaller standard error is better (more precision) • Validity: Compare the magnitude of the impact estimates, in light of estimation error… • Compare the statistical significance of the impact findings (i.e., conclusions about program effectiveness based on p-value) • If both estimates are statistically significant, then also compare their magnitudes
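The comparison rules above can be written as a small decision helper. This is only a sketch under stated assumptions: the 0.05 significance threshold is assumed (the slides do not state one), and the `ImpactEstimate` class, the function name, and the numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ImpactEstimate:
    impact: float     # estimated impact, in standard deviation units
    std_error: float  # standard error of the estimated impact
    p_value: float    # p-value for the test of zero impact


def compare_to_benchmark(state, study, alpha=0.05):
    """Compare a state-test impact estimate with the study-test benchmark.

    Reliability check: ratio of standard errors (state / study); a ratio
    near or below 1 means the state test is about as precise as the
    study test.
    Validity check: do both tests lead to the same conclusion about
    statistical significance? Magnitudes are compared only when both
    estimates are significant. `alpha` is an assumed threshold.
    """
    state_sig = state.p_value < alpha
    study_sig = study.p_value < alpha
    return {
        "se_ratio": state.std_error / study.std_error,
        "same_conclusion": state_sig == study_sig,
        "compare_magnitudes": state_sig and study_sig,
    }
```

For example, with illustrative values `ImpactEstimate(0.08, 0.06, 0.18)` for the state test and `ImpactEstimate(0.10, 0.05, 0.045)` for the study test, the helper reports a standard error ratio of about 1.2 and flags that the two tests lead to different conclusions, so magnitudes are not compared.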
Criteria for Assessing Validity • The extent to which the magnitudes of the impact estimates are expected to differ depends on the outcome that state tests are intended to measure • Two types of intervention: • Targeted outcome is general achievement (Studies A and B) • The outcome of interest is “general achievement” in math or reading • Both state tests and the study test measure the targeted outcome (general achievement) • If state tests are valid, then the impacts on the study test and on state tests should be similar
Criteria for Assessing Validity • Two types of intervention (cont.) • Targeted outcome is a specific skill (Studies C and D) • There are two outcomes of interest: • Targeted skill (short-term) and • General achievement (longer-term) • Study test is used to measure the short-term outcome (specific skill), while state tests are used to measure the longer-term outcome (general achievement) • If state tests are valid, then the impact on state tests should be smaller than the impact on the study test
Benchmark: Impact on the Study Test
P-Value & Magnitude (Validity) • Targeted Outcome is General Achievement • [Chart of impact estimates; p-values shown on the slide: 0.119, 0.189, 0.055, 0.229]
P-Value & Magnitude (Validity) • Targeted Outcome is a Specific Skill • [Chart of impact estimates; p-values shown on the slide: 0.002, 0.007, 0.578, 0.219]
Standard Errors (Reliability) • State-Study Ratio: 1.20, 1.07, 1.04, 1.03
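One way to read these ratios is in terms of sample size. Under the simplifying assumption that a standard error shrinks in proportion to one over the square root of the sample size, a state-to-study ratio of r means a state-test analysis would need roughly r squared times as many students to match the study test's precision. This interpretation is an illustration added here, not a calculation reported in the slides, and the function name is hypothetical.

```python
# State-to-study standard error ratios as listed on the slide.
se_ratios = [1.20, 1.07, 1.04, 1.03]


def relative_sample_size(se_ratio):
    """Sample-size multiplier needed to offset a standard error ratio.

    Assumes SE is proportional to 1 / sqrt(n), so variance scales
    with the square of the ratio.
    """
    return se_ratio ** 2


for r in se_ratios:
    print(f"SE ratio {r:.2f} -> sample size multiplier {relative_sample_size(r):.2f}")
```

On this reading, the largest ratio (1.20) corresponds to a sample roughly 44 percent larger, while the other three correspond to increases of about 6 to 14 percent.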
Conclusion • Findings suggest that state tests can be used as a complement to a study-administered test • State tests are suitable (valid and reliable) in 3 of 4 studies • Whether state tests can be used as a substitute for a study test is an open question • Limited availability in some grades and subjects • Available for all states/grades in only 1 of 4 studies • May not be able to use them to measure a specific targeted skill • Possibly less reliable • Findings from descriptive analysis lead to the same conclusions as the impact analysis…
Questions? • Marie-Andrée Somers • marie-andree.somers@mdrc.org • Pei Zhu • pei.zhu@mdrc.org • Edmond Wong • edmond.wong@mdrc.org