State and District Evaluation Tools. SWPBS Forum, October 2008. Claudia Vincent (clavin@uoregon.edu) and Scott Spaulding (sspauldi@uoregon.edu), University of Oregon
Goals: • Provide information about desirable features of SWPBS evaluation tools • Provide an overview of the extent to which SWPBS evaluation tools meet these desirable features
Roles of evaluation data • Drive implementation decisions • Provide evidence for SWPBS impact on student outcomes [Diagram: implementation cycle in which the PBS Self-Assessment and Action Plan guide teams to implement systems that support practices, implement the practices, collect fidelity measures and student outcome measures as evaluation data, interpret those data, and use them for decision-making, leading to improved student outcomes]
What is a “good” measure? • A measure that drives implementation decisions should be: • socially valid • contextually appropriate • sufficiently reliable (reliable enough to make defensible decisions) • easy to use • A measure that builds the evidence base for SWPBS should: • have known reliability • have known validity • clearly link implementation status to student outcomes
Using Classical Test Theory to define desirable qualities of evaluation tools • Measurement scores have two components: • True score, e.g. a school’s true performance on “teaching behavioral expectations” • Error, e.g. features of the measurement process itself • Our goal is to use tools that • maximize true score and minimize measurement error, and therefore • yield precise and interpretable data, and therefore • lead to sound implementation decisions and defensible evidence. [Diagram: observed score split into true score (relevant to construct) and error (noise)]
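For readers who want the formal statement behind this slide, here is a minimal sketch in standard classical test theory notation (X, T, and E are conventional symbols, not terms from the slides):

```latex
% Observed score = true score + error; variances add when T and E are uncorrelated
X = T + E, \qquad \operatorname{Var}(X) = \operatorname{Var}(T) + \operatorname{Var}(E)

% Reliability: the share of observed-score variance that reflects true scores
\rho_{XX'} = \frac{\operatorname{Var}(T)}{\operatorname{Var}(X)}
           = 1 - \frac{\operatorname{Var}(E)}{\operatorname{Var}(X)}
```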
Desirable features of evaluation tools • True score is maximized and error minimized if the evaluation tool is technically adequate, i.e. • can be applied consistently (has good reliability) • measures the construct of interest (has good validity) • Sound implementation decisions are made if the evaluation tool is practical, i.e. data • are cost efficient to collect (low impact) • are easy to aggregate across units of analysis (e.g. students, classrooms, schools, districts, states) • are consistently used to make meaningful decisions (have high utility)
Reliability indicators • Consistency across • Items/subscales/total scales (“internal consistency”) • Data collectors (“inter-rater reliability” or “inter-observer agreement”) • Time (“test-retest reliability”)
Internal consistency • Definition: • Extent to which the items on an instrument adequately and randomly sample a cohesive construct, e.g. “SWPBS implementation” • Assessment: • If the instrument adequately and randomly samples one construct, and if it were divided into two equal parts, both parts should correlate strongly • Metric: • coefficient alpha (the average split-half correlation based on all possible divisions of an instrument into two parts) • Interpretation: • α ≥ .70 (adequate for measures under development) • α ≥ .80 (adequate for basic research) • α ≥ .90 (adequate for measures on which consequential decisions are based)
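As a concrete illustration of the metric (not part of the original slides), a minimal Python sketch of coefficient alpha computed from an item-by-school score matrix; the function name and toy ratings are hypothetical:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a matrix with one row per school and one column per item."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item across schools
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of schools' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Toy example: 5 schools rated on 4 items of a hypothetical implementation subscale (0-2 scale)
ratings = np.array([[2, 2, 1, 2],
                    [1, 1, 1, 1],
                    [2, 1, 2, 2],
                    [0, 1, 0, 1],
                    [2, 2, 2, 1]])
print(f"alpha = {cronbach_alpha(ratings):.2f}")  # compare against the .70 / .80 / .90 benchmarks
```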
Inter-rater reliability/Inter-observer agreement • Definition: • Extent to which the instrument measures the same construct regardless of who collects the data • Assessment: • If the same construct were observed by two data collectors, their ratings should be almost identical • Metric: • Expressed as percentage of agreement between two data collectors • Interpretation: • ≥ 90% good • ≥ 80% acceptable • < 80% problematic
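A minimal Python sketch of the percent-agreement metric for two data collectors scoring the same items; the observer data are invented for illustration:

```python
import numpy as np

def percent_agreement(rater_a, rater_b) -> float:
    """Percentage of items on which two data collectors gave identical scores."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    return 100.0 * np.mean(rater_a == rater_b)

# Two observers scoring the same 10 items during one school visit (0 = not in place, 1 = in place)
observer_1 = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
observer_2 = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
print(f"IOA = {percent_agreement(observer_1, observer_2):.0f}%")  # 90%, "good" by the guideline above
```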
Test-retest reliability • Definition: • Extent to which the instrument yields consistent results at two points in time • Assessment: • The measure is administered at two points in time. The time interval is set so that no improvement is expected to occur between first and second administration. • Metric: • Expressed as correlation between pairs of scores from the same schools obtained at the two measurement administrations • Interpretation: • r ≥ .6 acceptable
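A minimal Python sketch of a test-retest correlation for the same schools measured twice; the scores are invented, and scipy is assumed to be available:

```python
from scipy.stats import pearsonr

# Total scores for the same six schools at administration 1 and administration 2
time_1 = [72, 85, 60, 90, 78, 66]
time_2 = [70, 88, 63, 87, 80, 62]

r, p_value = pearsonr(time_1, time_2)
print(f"test-retest r = {r:.2f} (p = {p_value:.3f})")  # compare against the r >= .6 guideline
```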
How does reliability impact data precision and decision-making? How can we interpret this graph?
How does reliability impact data precision and decision-making? • Interpretability of data! • Did these schools truly differ in the extent to which they taught behavioral expectations? • Or… did these schools obtain different scores because • the tool’s items captured only some schools’ approach to teaching expectations? (tool lacked internal consistency) • they had different data collectors? (tool lacked inter-rater agreement) • some collected data in week 1 and some in week 2 of the same month? (tool lacked test-retest reliability)
Types of validity • Content validity • Criterion-related validity • Concurrent validity • Predictive validity • Construct validity
Content validity • Definition: • Extent to which the items on an instrument relate to the construct of interest, e.g. “student behavior” • Assessment: • Expert judgment if items measure content theoretically or empirically linked to the construct • Metric: • Expressed as percentage of expert agreement • Interpretation: • ≥ 80% agreement desirable
Criterion-related validity • Definition: • Extent to which the instrument correlates with another instrument measuring a similar aspect of the construct of interest and administered concurrently or subsequently • Assessment: • Concurrent validity: compare data from concurrently administered measures for agreement • Predictive validity: compare data from subsequently administered measures for predictive accuracy • Metric: • Expressed as a correlation between two measures • Interpretation: • Moderate to high correlations are desirable • Concurrent validity: Very high correlations might indicate redundancy of measures
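A minimal Python sketch of how concurrent and predictive validity coefficients might be computed for an implementation measure; all variable names and values are hypothetical:

```python
import numpy as np

# Hypothetical scores for eight schools
new_measure_spring = [55, 70, 82, 64, 90, 48, 76, 68]   # instrument under evaluation
established_spring = [50, 72, 80, 60, 93, 45, 70, 65]   # established measure, administered concurrently
odr_rate_next_fall = [3.1, 1.8, 1.2, 2.6, 0.9, 3.8, 1.5, 2.2]  # later outcome: ODRs per 100 students per day

concurrent_r = np.corrcoef(new_measure_spring, established_spring)[0, 1]
predictive_r = np.corrcoef(new_measure_spring, odr_rate_next_fall)[0, 1]
print(f"concurrent validity r = {concurrent_r:.2f}")  # moderate to high, but not so high as to be redundant
print(f"predictive validity r = {predictive_r:.2f}")  # expected negative: better implementation, fewer ODRs
```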
Construct validity • Definition: • Extent to which the instrument measures what it is supposed to measure (e.g. the theorized construct “student behavior”) • Assessment: • factor analyses yielding information about the instrument’s dimensions (e.g. aspects of “student behavior”) • correlations between constructs hypothesized to impact each other (e.g. “student behavior” and “student reading achievement”) • Metric: • statistical model fit indices (e.g. Chi-Square) • Interpretation: • acceptable model fit (for factor analyses) and statistically significant correlations in the hypothesized direction (for relations between constructs)
How does validity impact data precision and decision-making? How can we interpret this graph?
How does validity impact data precision and decision-making? • Interpretability of data! • Can we truly conclude that student behavior is better in school F than in school J? • Does the tool truly measure well-defined behaviors? (content validity) • Do student behaviors measured with this tool have any relevance for the school’s overall climate? For the students’ long-term success? (concurrent, predictive validity) • Does the tool actually measure “student behavior”, or does it measure “teacher behavior”, “administrator behavior”, or “parent behavior”? (construct validity)
Interpreting published reliability and validity information • Consider sample size • Psychometric data derived from large samples are better than psychometric data derived from small samples. • Consider sample characteristics • Psychometric data derived from specific samples (e.g. elementary schools) do not automatically generalize to all contexts (e.g. middle schools, high schools).
Utility of evaluation data • Making implementation decisions based on evaluation data • When has a school reached “full” implementation? • “Criterion” scores on implementation measures should be calibrated based on student outcomes [Graph: student outcome goals for social and academic achievement plotted against implementation scores from 10 to 100, with the implementation criterion marked where outcome goals are met]
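One way to make this calibration concrete is sketched below: for a set of schools, find the lowest implementation cutoff at which every school scoring at or above it met its student outcome goal. The decision rule, function name, and data are illustrative assumptions, not a prescribed procedure:

```python
import numpy as np

def calibrate_criterion(implementation_scores, outcome_goal_met, candidate_cutoffs=range(10, 101, 10)):
    """Return the lowest cutoff at which every school scoring at or above it met the outcome goal."""
    scores = np.asarray(implementation_scores)
    met = np.asarray(outcome_goal_met, dtype=bool)
    for cutoff in candidate_cutoffs:
        above = scores >= cutoff
        if above.any() and met[above].all():
            return cutoff
    return None  # no cutoff separates schools meeting the goal from those that do not

# Hypothetical district data: implementation scores (0-100) and whether each school met its outcome goal
scores = [45, 55, 62, 70, 78, 83, 91, 95]
goal_met = [False, False, False, True, False, True, True, True]
print(calibrate_criterion(scores, goal_met))  # -> 80 with these data
```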
Utility of evaluation data • Evaluation data lead to consequential decisions, e.g. • Additional trainings when data indicate insufficient implementation • Emphasis on specific supports where data indicate greatest student needs • To make sure we arrive at defensible decisions, we need to collect evaluation data with tools that • have documented reliability and validity • clearly link implementation to student outcomes
Take-home messages • Collect evaluation data regularly • Collect evaluation data with tools that have good reliability and validity • Guide implementation decisions with evaluation data clearly linked to student outcomes
Goals: • Provide information about desirable features of SWPBS evaluation tools • Provide an overview of the extent to which SWPBS evaluation tools meet these desirable features
SWPBS evaluation questions • How is my school doing? • My school is “80/80”. Now what? • My school is just beginning SWPBS. Where do I start? • How do we handle the kids still on support plans? • I’ve heard about school climate. What is that? • What about the classroom problems we still have?
SWPBS evaluation tools • Measurement within SWPBS • Research or evaluation? • What tools do we have? • What evidence exists for use of these tools? • Guidelines for using the measures
Measurement within SWPBS • Focus on the whole school • School-wide PBS began with a focus on multiple systems • Evaluation of a process • Evaluation of an outcome • Growth beyond initial implementation [Diagram: four interrelated SWPBS systems: school-wide, classroom, non-classroom, and individual student (Sugai & Horner, 2002)]
Continuum of School-wide Positive Behavior Support [Triangle diagram] • Tertiary prevention: specialized, individualized systems for students with high-risk behavior (~5% of students) • Secondary prevention: specialized group systems for students with at-risk behavior (~15%) • Primary prevention: school-/classroom-wide systems for all students, staff, and settings (~80%)
Dimensions and units of measurement [Matrix slide] • Rows: level of prevention and intervention (primary, secondary, tertiary) • Columns: dimension of measurement (process, outcomes) • Outcome dimensions: academics and behavior at the secondary and tertiary levels; academic achievement and social behavior at the primary level • Each cell is marked “?” (which measures address each combination?) • Units of measurement and analysis: student, classroom, nonclassroom, school
SWPBS evaluation tools • Measurement within SWPBS • Research or evaluation? • What tools do we have? • What evidence exists for use of these tools? • Guidelines for using the measures
Research or Evaluation? • Drive implementation decisions • Provide evidence for SWPBS impact on student outcomes • Measures have been developed to support research-quality assessment of SWPBS • Measures have been developed to assist teams in monitoring their progress
SWPBS evaluation tools • Measurement within SWPBS • Research or evaluation? • What tools do we have? • What evidence exists for use of these tools? • Guidelines for using the measures
Review of SWPBS tools Some commonly used measures: • Effective Behavior Supports Survey • Team Implementation Checklist • Benchmarks of Quality • School-wide Evaluation Tool • Implementation Phases Inventory
Review of SWPBS tools Newer measures: • Individual Student Schoolwide Evaluation Tool • Checklist for Individual Student Systems • Self-assessment and Program Review
SWPBS evaluation tools • Measurement within SWPBS • Research or evaluation? • What tools do we have? • What evidence exists for use of these tools? • Guidelines for using the measures
What makes a “good” measure? • Is it important, acceptable, and meaningful? • Can we use it in our school? • Is it consistent? • Is it easy to use? • Is it “expensive”? • Does it measure what it’s supposed to? • Does it link implementation to outcome?
Evidence for use of measures • Effective Behavior Supports Survey (EBS) • School-wide Evaluation Tool (SET) • Benchmarks of Quality (BoQ)
Evidence for use: EBS Survey • Effective Behavior Supports Survey • Sugai, Horner, & Todd (2003) • Hagan-Burke et al. (2005) • Safran (2006)
EBS Survey: Overview • 46-item, support team self-assessment • Facilitates initial and annual action planning • Current status and priority for improvement across four systems: • School-wide • Specific Setting • Classroom • Individual Student • Summary by domain, action planning activities • 20-30 minutes, conducted at initial assessment and then at quarterly and annual intervals
EBS Survey: Reliability • Internal consistency • Sample of 3 schools • current status: α = .85 • improvement priority: α = .94 • Subscale α from .60 to .75 for “current status” and .81 to .92 for “improvement priority” • Internal consistency for the School-wide subscale • Sample of 37 schools • α = .88 for “current status” • α = .94 for “improvement priority”
Evidence for use: SET • School-wide Evaluation Tool • Sugai, Horner & Todd (2000) • Horner et al. (2004)
SET: Overview • 28-item, research evaluation of universal implementation • Total implementation score and 7 subscale scores: • school-wide behavioral expectations defined • school-wide behavioral expectations taught • acknowledgement system • consequences for problem behavior • system for monitoring problem behavior • administrative support • district support • 2-3 hours, external evaluation, annual
SET: Reliability • Internal consistency • Sample of 45 middle and elementary schools • α = .96 for total score • α from .71 (district-level support) to .91 (administrative support) • Test-retest analysis • Sample of 17 schools • Total score, IOA = 97.3% • Individual subscales, IOA = 89.8% (acknowledgement of appropriate behaviors) to 100% (district-level support)
SET: Validity • Content validity • Collaboration with teachers, staff, and administrators at 150 middle and elementary schools over a 3-year period
SET: Validity • Construct validity • Sample of 31 schools • SET correlated with EBS Survey • Pearson r = .75, p < .01 • Sensitivity to differences in implementation across schools • Sample of 13 schools • Comparison of average scores before and after implementation • t = 7.63, df = 12, p < .001
Evidence for use: BoQ • Schoolwide Benchmarks of Quality • Kincaid, Childs, & George (2005) • Cohen, Kincaid, & Childs (2007)
BoQ: Overview • Used to identify areas of success and areas for improvement • Self-assessment completed by all team members • 53 items rating level of implementation • Team coaches create a summary form, noting discrepancies in ratings • Areas of strength, areas needing development, and areas of discrepancy are noted for discussion and planning • 1-1.5 hours (1 team member plus coach) • Completed annually in spring
BoQ: Overview • Items grouped into 10 subscales: • PBS team • faculty commitment • effective discipline procedures • data entry • expectations and rules • reward system • lesson plans for teaching behavioral expectations • implementation plans • crisis plans • evaluation
BoQ: Reliability • Internal consistency • Sample of 105 schools • Florida and Maryland • 44 elementary, 35 middle, 10 high, and 16 center schools • overall α of .96 • subscale α values from .43 (“PBS team”) to .87 (“lesson plans for teaching expectations”)
BoQ: Reliability • Test-retest reliability • Sample of 28 schools • Coaches’ scores only • Total score: r = .94, p < .01 • subscale r values from .63 (“implementation plan”) to .93 (“evaluation”) • acceptable test-retest reliability • Inter-observer agreement (IOA) • Sample of 32 schools • IOA = 89%