Bayesian Inference for Some Value-added Productivity Indicators. Yeow Meng Thum. Measurement & Quantitative Methods Counseling, Educational Psychology, & Special Education College of Education Michigan State University. Conference on Longitudinal Modeling of Student Achievement
Bayesian Inference for Some Value-added Productivity Indicators
Yeow Meng Thum
Measurement & Quantitative Methods; Counseling, Educational Psychology, & Special Education; College of Education, Michigan State University
Conference on Longitudinal Modeling of Student Achievement, University of Maryland, November 2005
Overview and Conclusions
• Recent thinking has raised doubts about the validity of the so-called “teacher effects” or “school effects” captured in applications.
• We may have the models, but we do not have suitable data to claim causal agency (a question about design).
• This places in doubt whether the empirical evidence for “accountability,” on the basis of which a teacher, a program, or a school may be identified as responsible for improvement or for failure, is so directly accessible.
• Purpose: To suggest that descriptive measures of productivity and improvement for accounting units (teachers or schools) are still valid given the accountability data. This is where we begin, so:
• Focus: Measurement, leaving aside structural relationships (until we have better data).
• Employ a well-defined database (evidence base).
• Build Productivity Indicators that address value-added hypotheses about growth and change.
• Design procedures for their inference: Bayesian.
Start by Defining & Measuring Value-added Performance (Thum, 2003a)
• Make the Accountability Data Block Explicit
• The value-added notion is keyed on our ability to measure change. It begins with a model for the learning change in the student: Multivariate Multi-Cohort Growth Modeling (Thum, 2003b)
• To Measure Change, Estimate Gains
• Multiple Outcomes Help
• Employ the standard error of measurement (sem) of the score
• Metric Matters for Measuring Change
• Require Model-based Aggregation & Inference
• Keep the “Black-box” Open
Longitudinal Student Data Is the Key Evidence Base
[Figure: grade (3 to 8) by year (01 to 08) grids contrasting longitudinal student cohorts with the quasi-longitudinal design, which is longitudinal at the school-grade level.]
Definition of the Data Block must be integral to any Accountability Criteria/System.
Point? A “constant ballast”: standardize the evidence base to stabilize comparisons.
Bayesian Multivariate Meta-Analysis Between Schools: Multivariate Multi-cohort Mixed-Effects Model Within Each School j (example):
Why Focus on Gains?
• Thum (2003) offered a summary of some reasons. The gain score: is not inherently unreliable (Rogosa, others); is not always predictable from knowledge of initial status; is conceptually congruent with, and is an unbiased estimator of, the true gain score; does not sum to zero by construction for the group; places the pre-test AND post-test on equal footing as outcomes, and thus generalizes directly to growth modeling.
• In contrast, the residual gain score: ranks on only “relative progress,” allowing for “adjusted comparisons,” but is by no means “corrected” for anything in particular; makes an individual’s gain dependent on who else is included in, or excluded from, the regression, and as such makes gain measurements subject to manipulation; sums to zero for a group and thus severely limits its utility for representing overall change; violates the regression requirement that pre-test scores be error-free; does not generalize easily to longer time series.
• Additionally, expanding on the conceptual congruence of the gain score with true gain, note that the gain score is ALSO the ideal for supporting causal claims under the widely considered Rubin-Holland counterfactual framework. In the gain score, we do not need to guess the result in the unobserved “counterfactual” condition!
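The contrast between the two scores can be checked numerically. The following is only a sketch with made-up scale scores; the function names and the five-student data set are assumptions for illustration and do not come from the presentation. It computes simple gains and residual gains (residuals from an ordinary least-squares regression of post-test on pre-test), then adds one extra student to show how the residual gains of the original students move.

```python
from statistics import mean

def simple_gains(pre, post):
    """Observed gain score for each student: post-test minus pre-test."""
    return [y - x for x, y in zip(pre, post)]

def residual_gains(pre, post):
    """Residuals from the least-squares regression of post-test on pre-test."""
    mx, my = mean(pre), mean(post)
    sxx = sum((x - mx) ** 2 for x in pre)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]

# Hypothetical scale scores for five students (illustrative only).
pre = [400.0, 420.0, 450.0, 480.0, 500.0]
post = [430.0, 445.0, 470.0, 515.0, 520.0]

gains = simple_gains(pre, post)
resid = residual_gains(pre, post)

mean_gain = mean(gains)      # the group's average change is preserved
resid_total = sum(resid)     # zero by construction, up to rounding error

# Adding one more student to the regression shifts the residual gain of
# every original student; their simple gains are unaffected.
resid_with_extra = residual_gains(pre + [520.0], post + [600.0])
first_student_shift = resid_with_extra[0] - resid[0]
```

In this toy data the simple gains retain the group's overall change, while the residual gains carry no information about it and depend on who is in the regression.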
Overall Strategy: Obtain a good-fitting (measurement) model for each school (surface for math), then construct and evaluate relevant value-added hypotheses for the school.
Note: The surface need not be “flat.”
[Figure: fitted surface of Outcome over Grade and Year.]
So, there are ONLY Value-added Hypotheses, NOT Value-added Models!
• It is up to us to define: What Progress Are We Talking About?
• How-to: Get the best data available, smooth it for irregularities with the most reasonable model, and construct from the “signal” statistics that address your hypotheses directly.
Basic “Progress Hypotheses”
[Figure: grade (1 to 5) by year (1998 to 2002) grid marking three comparisons: (Q1) Cohorts, (Q2) Grade-level Means, (Q3) Grade-level PACs.]
Criterion AND Norm Referencing: dual reporting formats for two questions about your achievement
[Figure: Grade 3 analysis and reporting. Criterion referencing uses scale scores, with cut-scores C1 and C2 separating L1 Basic, L2 Proficient, and L3 Advanced; norm referencing uses NRT (NCE) with cuts C’1 and C’2.]
Some Value-Added Hypotheses
* An example is the 100% Proficient standard for NCLB. Other examples may compare schools with each other, with “similar” schools determined by ranking on a selected covariate set (à la California), etc.
** Thum & Chinen (in preparation)
Standards of Progress
• To fully judge progress, we rely on standards, or benchmarks, absolute and contextualized, whenever these are available.
• Within EACH school (over time), we might consider the progress of
  • Different subjects, or their composites
  • Different grades, or their aggregates (lower primary, etc.)
  • Different student cohorts, or their comparisons
  • Different sub-groups
• All the above may be compared, individually or in groups of school-grades, with
  • District average, schools-like-mine, etc.
  • Fixed district goals.
Comparing Cohort Slopes: Improvement (Q1)
[Figure: two panels, “Decreasing Productivity” and “Increasing Productivity,” each plotting Score against Year (98 to 02).]
Is School 201 getting more effective? (Q1)
This compares present with past performance. We can also compare School 201’s latest growth rate with the district average, or with the average of schools “similar” to School 201.
What is Adequate Yearly Progress? Example, via an Empirical Definition (Q2)
• AYP must take into account
  • Where you start and where you should end up (mandated)
  • The time between the present, t, and the mandated time frame to reach proficiency (T = 12)
• Thus, AYP may be defined as the growth rate that will place you on target given where you are presently, such as (Y_T - Y_t) / (T - t), or some more refined version, where Y_T is the cut-score for “proficiency” and Y_t is the present score.
THIS DOES NOT MEAN THE ANALYSIS NEEDS TO BE PERFORMED ON CATEGORICAL DATA.
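As a worked instance of the growth-rate definition on this slide, the sketch below evaluates (Y_T - Y_t) / (T - t) for one school. The function name and the numbers (a school mean of 620 at t = 4 and a proficiency cut-score of 700) are hypothetical, chosen only to make the arithmetic visible; T = 12 follows the slide.

```python
def required_growth_rate(current_score, cut_score, current_time, target_time):
    """Growth per unit of time needed to reach the cut-score by target_time."""
    remaining = target_time - current_time
    if remaining <= 0:
        raise ValueError("target_time must be later than current_time")
    return (cut_score - current_score) / remaining

# Hypothetical school: mean score 620 at t = 4, cut-score 700, T = 12.
rate = required_growth_rate(620.0, 700.0, 4, 12)
```

Here the school must close 80 points over the 8 remaining periods, so the required rate is 10 points per period. The computation works on the continuous score scale; no categorical data are involved.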
Predicted Grade-Year Means (Q2)
[Figure: predicted mean Score (500 to 800) against Year (1 to 5), one line per Grade (1 to 6).]
Based on a model for the information contained in the data-block … Is the Grade 4 predicted average increasing? Is the Grade 1 predicted average increasing?
Assessing AYP-NCLB
[Figure: school mean (Y) against time (X) up to the target time T, with cut-scores CL and CU separating L1 Basic, L2 Proficient, and L3 Advanced, and the lower and upper bounds of the school’s AYP for Time = 4.]
Object of Inference:
Defining AYP-NCLB
Question: Given where you are at this point in time, are you improving at a pace that will put you on the specified target in the remaining time frame?
Answer: If you are growing at [...] at time t, your minimum growth rate to reach the target is [...], and so you make AYP-NCLB if [...], with probability [...].
Implication: AYP depends on the performance of the school, so it changes over time. Classification errors are directly assessed.
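One way to read this question-and-answer pair in Bayesian terms is as a posterior probability that the school's current growth rate meets the required rate. The sketch below is an illustration under stated assumptions, not the presentation's procedure: the simulated normal draws stand in for MCMC output, and the names `prob_meets_rate`, `threshold`, and the 0.80 confidence level are hypothetical.

```python
import random

def prob_meets_rate(growth_draws, required_rate):
    """Share of posterior draws at or above the required growth rate."""
    hits = sum(1 for g in growth_draws if g >= required_rate)
    return hits / len(growth_draws)

random.seed(2005)
# Stand-in for MCMC output: draws of the school's current growth rate.
# A normal with mean 12 and sd 2 is an assumption made only for this sketch.
draws = [random.gauss(12.0, 2.0) for _ in range(20000)]

required_rate = 10.0     # e.g. (Y_T - Y_t) / (T - t) for this school
p_ayp = prob_meets_rate(draws, required_rate)

threshold = 0.80         # hypothetical confidence level chosen by the user
makes_ayp = p_ayp >= threshold

# Posterior probability that the growth rate falls short of the requirement,
# i.e. the risk of error if the school is classified as making AYP.
shortfall_prob = 1.0 - p_ayp
```

Because the decision is attached to a probability, the chance of misclassifying the school is stated explicitly rather than hidden behind a yes/no label.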
Trend in Percent Proficient (Q2)
[Figure: score (Y) against time (X) up to the target time T, with cut-score CL separating Proficient from Not Proficient.]
SAFE HARBOR: School makes AYP if % proficient increased by 10%.
Object of Inference:
Value-Added over Projected Status (Q2)
[Figure: school mean (Y) against time (X) up to the target time T, with cut-scores CL and CU separating L1 Basic, L2 Proficient, and L3 Advanced.]
Object of Inference:
Some standards: for school, district, schools-like-mine.
Total Output: School Excellence as a Value-Added Hypothesis (Q2)
Comparing 4th grade growth for schools j = A, B, C, and D in a way that combines growth AND final status: areas under predicted curves, f(x)!
[Figure: predicted curves for schools A, B, C, and D, plotted as Y against time (X) from 1 to T.]
Object of Inference:
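The area-under-the-curve idea can be sketched with the trapezoidal rule applied to predicted means at discrete time points. The function name, the two schools' predicted values, and the use of the trapezoidal rule are assumptions for illustration; the presentation does not say how the areas under f(x) are computed.

```python
def area_under_curve(times, predicted):
    """Trapezoidal-rule area under a predicted growth curve."""
    if len(times) != len(predicted) or len(times) < 2:
        raise ValueError("need matching sequences with at least two points")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * width * (predicted[i] + predicted[i - 1])
    return area

years = [1, 2, 3, 4, 5]
# Hypothetical predicted 4th-grade means for two schools.
school_a = [600.0, 620.0, 640.0, 660.0, 680.0]   # higher start, steady growth
school_c = [560.0, 590.0, 620.0, 650.0, 680.0]   # lower start, same final status

area_a = area_under_curve(years, school_a)
area_c = area_under_curve(years, school_c)
```

Both hypothetical schools end at the same status, but School A accumulates the larger area because it performed at a higher level throughout, which is how a single indicator can reflect growth and final status together.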
Why Bayesian Inference
• Basic Components
  • See O’Hagan (1994) for a summary. Basically formulated here as an enhancement of likelihood inference.
• Highlighted Advantages
  • Conceptual: credibility intervals, as the likely range of the true parameters, are more natural than Neyman-Pearson confidence intervals.
  • Analytically less demanding: using statistics to do statistics via Markov chain Monte Carlo (MCMC), inference for ratios is straightforward.
• Disadvantages
  • Where do we get our priors? Not a problem (for long, anyway) with longitudinal data.
  • Computationally intensive; in the typically large accountability applications we need to proceed carefully.
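The remark that inference for ratios is straightforward with MCMC can be illustrated as follows: once posterior draws of a numerator and a denominator are available, the ratio is formed draw by draw and summarized directly. This is a minimal sketch; the simulated normal draws stand in for real MCMC output, and `credible_interval`, the 90% level, and all numbers are assumptions.

```python
import random

def credible_interval(draws, level=0.90):
    """Equal-tailed interval read off the sorted posterior draws."""
    ordered = sorted(draws)
    n = len(ordered)
    lo_idx = int((1.0 - level) / 2.0 * n)
    hi_idx = min(n - 1, int((1.0 + level) / 2.0 * n))
    return ordered[lo_idx], ordered[hi_idx]

random.seed(1994)
# Stand-ins for MCMC draws of an achieved gain and a target gain.
gain = [random.gauss(8.0, 1.0) for _ in range(10000)]
target = [random.gauss(10.0, 0.5) for _ in range(10000)]

# The ratio is computed for each draw; its posterior needs no
# analytic approximation for the variance of a quotient.
ratio = [g / t for g, t in zip(gain, target)]
lower, upper = credible_interval(ratio, 0.90)
```

The interval (`lower`, `upper`) is read as the range that contains the true ratio with 90% posterior probability, which is the interpretation the slide describes as the more natural one.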
Ratios & Productivity Profiles (Thum, 2003b)
[Figure: posterior distributions of a value-added indicator for 3 schools, with the corresponding productivity profiles.]
Result: A measure of how much was achieved (a percent) and at what level of precision (a probability), and so the comparison is (relatively) scale-free.
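A productivity profile of this kind can be sketched as a curve that gives, for each proportion of the standard, the posterior probability that the indicator reaches at least that proportion. The code below is one possible reading, not the construction in Thum (2003b): the function names, the grid of proportions, and the simulated draws are all hypothetical.

```python
import random

def productivity_profile(indicator_draws, proportions):
    """For each proportion a, the posterior probability that the indicator is at least a."""
    n = len(indicator_draws)
    return [sum(1 for d in indicator_draws if d >= a) / n for a in proportions]

def largest_proportion_met(indicator_draws, proportions, confidence):
    """Largest listed proportion met with at least the given confidence, or None."""
    profile = productivity_profile(indicator_draws, proportions)
    met = [a for a, p in zip(proportions, profile) if p >= confidence]
    return max(met) if met else None

random.seed(2003)
# Hypothetical posterior draws of a ratio-type indicator
# (achieved gain expressed as a share of a standard).
draws = [random.gauss(0.80, 0.10) for _ in range(20000)]
grid = [0.50, 0.60, 0.70, 0.80, 0.90, 1.00]

profile = productivity_profile(draws, grid)
best_at_70 = largest_proportion_met(draws, grid, 0.70)
```

Reading the profile at a fixed confidence level gives the "how much, and how sure" pair the slide describes: a percent of the standard together with the probability of having reached it.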
Sample Teacher Productivity Profiles 1
[Figure: confidence in meeting the standard plotted against proportion of standard (%), with reference lines at 70% and 80% confidence.]
We are only confident (at the 70% level) that 3 teachers reached 4%.
Sample Teacher Productivity Profiles 2
[Figure: teacher productivity profiles under several models, with reference lines at 70% and 80% confidence.]
Models differ in terms of adjustments for different classroom characteristics.
Sample Teacher Productivity Profiles 3
[Figure: teacher productivity profiles under Model 0, Model 1, and Model 4, each with reference lines at 70% and 80% confidence.]
Different models produce different conclusions (Thum, 2003b).
Standing Issues re Inputs: Validity & Quality of Outcome Measures
• We assume that we have an outcome of student learning which the user believes to be a valid/useful measure of the intended construct.
• The outcome measure possesses the necessary psychometric (scale) properties supporting its use.
• To the degree that either or both of the construct validity of the measure and its scale-type (interval) are approximate in practice, we submit that the validity of the interpretation using this outcome needs to be tempered accordingly.
• Faced with this complex of nearly unsolvable issues, I find myself resting some of my choices on the “satisficing principle” (Simon, 1956).
Selected References
Thum, Y. M. (2002). Measuring Student and School Progress with the California API. CSE Technical Report 578. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, UCLA.
Thum, Y. M. (2003a). No Child Left Behind: Methodological Challenges and Recommendations for Measuring Adequate Yearly Progress. CSE Technical Report 590. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, UCLA.
Thum, Y. M. (2003b). Measuring Progress towards a Goal: Estimating Teacher Productivity using a Multivariate Multilevel Model for Value-Added Analysis. Sociological Methods & Research, 32(2), 153-207.
Acknowledgements
The analyses presented here are drawn from a larger comparative analysis study organized and supported by the New American Schools. Additional illustrations concerning the API draw support from CRESST and the Los Angeles Unified School District. Many of the ideas were first tested in an evaluation sponsored by the Milken Family Foundation. Portions of this presentation were part of an invited presentation at AERA 2005, Montreal.
Y. M. Thum, thum@msu.edu
“Too much trouble”, “too expensive”, or “who will know the difference” are death knells to good food. Julia Child (1961)
Final Caveat: In this work, the procedures are complex only to the degree that they meet the demands of the task at hand: nothing more, nothing less. We have clearly come a long way from naively comparing cross-sectional means.