Determining the Validity and Reliability of Key Assessments
Julia M. Lee – Presenting the work of the faculty of the Dewar College of Education at Valdosta State University
GaPSC Assessment Workshop, May 14, 2012
Tips for Developing Key Assessments • “Begin with the end in mind” • Have as many Education faculty members as possible involved in the development • Make sure you have P-12 and A&S partners involved • Look at “big picture” – ultimate outcome(s) rather than isolated knowledge and skills • Explicit, explicit, explicit
Tips for Meeting Standard Two • Involve as many faculty as possible in the assessment system • Development • Implementation • Evaluation • Analysis • Revision • Implementation • (the cycle continues) • Have faculty complete a “self-study” or “self-evaluation” of each program and its assessment components • Provide training to faculty, candidates, and other “users” on the instruments developed • Develop users’ guides, data calendars, and data collection documents (information to be collected, timeline, source, responsibility, etc.)
Related concepts: Aligned with Purpose • Impartiality • Consistency • Soundness • Fairness • Legitimacy • Comprehensiveness • Objectivity • Reliability • Validity • Stability
Assessment System and Unit Evaluation (2a) “4. The professional education unit has taken effective steps to eliminate bias in assessments and is working to establish the fairness, accuracy, and consistency of its assessment procedures and professional education unit operations.”
Impact of 2a4 • “2b1. The professional education unit maintains an assessment system that provides regular and comprehensive information on applicant qualifications, candidate proficiencies, competence of graduates, professional education unit operations, and preparation program quality.” • “2b3. Candidate assessment data are regularly and systematically collected, compiled, aggregated, summarized, and analyzed to improve candidate performance, preparation program quality, and professional education unit operations.”
Impact of 2a4 • “2c1. The professional education unit regularly and systematically uses data, including candidate and graduate performance information, to evaluate the efficacy of its courses, preparation programs, and clinical experiences.” • “2c2. The professional education unit analyzes preparation programs’ evaluation and performance assessment data to initiate changes in preparation programs and professional education unit operations.”
Establishing fairness, accuracy, and consistency of assessment procedures and instruments: Steps taken by the COE • Use of multiple assessments (multiple sources) • Primary use of analytic rather than holistic rubrics • Use of multiple raters • Provision of training on assessment instruments • Completion of inter-rater reliability studies and/or consensus agreement
Processes Used to Determine Reliability and Validity of Two Key Assessments • College of Education Observation Instrument • College of Education Disposition Survey
College of Education Observation Instrument • Part of “determining” reliability and validity involves building instruments and supporting documents in such a way that these issues are considered from the very beginning • How the COE OI was developed • Development and implementation of an instructional manual and training sessions • Completion of inter-rater reliability studies
Development of the COE Observation Instrument • Aligned to professional education standards (Danielson, INTASC, Georgia Framework) • Georgia Framework indicators that were observable formed the foundation of the instrument • P-12 Teachers, P-12 Administrators, and University faculty participated in the development
Development and Implementation of an Instruction Guide and Training Sessions for the COE OI • Training manual and training developed by a group of P-12 educators and university faculty members • Training manual provides explicit guidance for decision-making regarding the rubric • Training sessions are provided for first-time users (2-hour session) as well as for ongoing users (1-hour “refresher” training)
Completion of Inter-rater Reliability Studies • Provided training on the instrument to 17 triads (student teachers, their P-12 mentors, and their university supervisors) • These triads independently rated one teaching episode for each candidate • Computed inter-rater agreement between P-12 mentors and university supervisors: [Agreements / (Agreements + Disagreements)] × 100 • Agreement was computed and results reported for two criteria: adjacent values and Standard Met / Not Met (a computational sketch follows)
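As an illustration of the agreement computation described on this slide, the short Python sketch below computes percent agreement both ways (adjacent values and Standard Met / Not Met). The rating lists, the 1-4 scale, and the cutoff of 3 for “met” are hypothetical assumptions for illustration only, not the COE’s actual data or analysis code.

```python
# Minimal sketch of the inter-rater agreement computation described above.
# All ratings are hypothetical; the 1-4 scale and the "met" cutoff of 3
# are assumptions made only for illustration.

def percent_agreement(ratings_a, ratings_b, agree):
    """(Agreements / (Agreements + Disagreements)) * 100 for paired ratings."""
    pairs = list(zip(ratings_a, ratings_b))
    agreements = sum(1 for a, b in pairs if agree(a, b))
    return 100.0 * agreements / len(pairs)

# Hypothetical item-level ratings from a P-12 mentor and a university supervisor.
mentor     = [4, 3, 2, 4, 3, 1, 4, 3]
supervisor = [3, 3, 2, 4, 4, 3, 4, 2]

# "Adjacent values": ratings agree if they differ by no more than one point.
adjacent = percent_agreement(mentor, supervisor, lambda a, b: abs(a - b) <= 1)

# "Standard Met / Not Met": ratings agree if both fall on the same side of the cutoff.
met_not_met = percent_agreement(mentor, supervisor, lambda a, b: (a >= 3) == (b >= 3))

print(f"Adjacent-value agreement: {adjacent:.1f}%")
print(f"Met / Not Met agreement:  {met_not_met:.1f}%")
```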
Inter-rater Agreement Results (% of agreement)
What did these data tell us? • All items on this instrument were reliable and valid. • There was a high level of agreement on all items on this instrument. • The independent raters did not agree with each other regarding whether or not candidates met the standard for most items. • In general, with the exception of one item, the independent raters had similar ratings for both types of reliability evaluated.
Decisions Made • Require all faculty who supervise to complete the training session • Provided training to mentors who frequently supervise student teachers • Provided training to several cohorts of Ed.S. students, many of whom served as public school mentors • Asked COE Assessment Committee to review data and make recommendations for changes based on reliability data • Modified this item on the instrument
Modification of the Item • Learning Environments • Original Item III-G: Communication • Rating of 1-2: Errors in spoken/written language; ineffective nonverbal communication; unclear directions; does not use effective questioning skills • Rating of 3-4: Error-free spoken/written language; effective nonverbal communication; directions are clear or quickly clarified after initial student confusion; effective questioning and discussion strategies
Modification of the Item, continued • Learning Environments • New Item III-Ga: Communication • Rating of 1-2: Errors in spoken/written language • Rating of 3-4: Error-free spoken/written language • New Item III-Gb: Communication • Rating of 1-2: Ineffective nonverbal communication; unclear directions; does not use effective questioning skills • Rating of 3-4: Effective nonverbal communication; directions are clear or quickly clarified after initial student confusion; effective questioning and discussion strategies
College of Education Disposition Survey • Again, at the initial development stage, there was a focus on reliability and validity issues • How the Unit-adopted dispositions were chosen • How the COE Disposition Survey was developed
Adoption of Dispositions • Looked at all the disposition statements in the INTASC standards • Data collected from P-12 educators and candidates regarding importance of specific dispositions • Surveyed unit faculty regarding the relative importance of each disposition statement • Conceptual framework committee reviewed results and provided input into selection • Three primary dispositions emerged from this process
Development of the COE Disposition Survey and Advanced Disposition Survey • Initially designed and field tested in the summer of 2005 • Original survey consisted of 12 items • Of those 12 items, four were targeted to specifically address two of the unit-adopted dispositions (fairness and the belief that all students can learn) • Alternate forms of survey questions were written to address reliability • Candidates were asked to indicate, on a Likert scale, whether they “strongly agree,” “agree,” “n/a or neutral,” “disagree,” or “strongly disagree” with statements addressing these dispositions
Original Statements on Surveys • Statement 2: I believe that schools today need to get back to basics--teachers should present lessons for everyone in the same structured way for students to learn the content. • Statement 3: I believe that it is important to adapt instruction to students' different learning styles, and help students achieve in ways they find easy to learn. • Statement 11: The impact of my performance as a teacher is primarily dependent upon the students' family backgrounds and the students' personal motivation. • Statement 12: I believe all students can learn.
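A minimal sketch of how consistency between reverse-worded item pairs might be examined, assuming (purely for illustration) that Statement 2 pairs with Statement 3 and Statement 11 with Statement 12, and that responses are coded 1 (strongly disagree) to 5 (strongly agree). The response data are invented; this is not the COE’s actual analysis.

```python
# Sketch of checking consistency between alternate (reverse-worded) item pairs
# on a 5-point Likert survey. The pairings of Statements 2/3 and 11/12 and all
# response data below are hypothetical assumptions for illustration only.

def reverse(score):
    """Reverse-code a 5-point score so a negatively worded item points the same way."""
    return 6 - score

def pair_consistency(responses, neg_item, pos_item, tolerance=1):
    """Percent of candidates whose reverse-coded negative item falls within
    `tolerance` points of the positively worded item."""
    consistent = sum(1 for r in responses
                     if abs(reverse(r[neg_item]) - r[pos_item]) <= tolerance)
    return 100.0 * consistent / len(responses)

# Hypothetical candidate responses keyed by statement number (coded 1-5).
responses = [
    {2: 2, 3: 4, 11: 2, 12: 5},
    {2: 1, 3: 5, 11: 4, 12: 5},
    {2: 2, 3: 5, 11: 3, 12: 4},
]

print(f"Statements 2 vs. 3:   {pair_consistency(responses, 2, 3):.0f}% consistent")
print(f"Statements 11 vs. 12: {pair_consistency(responses, 11, 12):.0f}% consistent")
```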
What did these data tell us? • The four items on this instrument appeared to be reliable and valid. • Candidates’ responses appeared to be fairly consistent on these items. • There appeared to be little if any consistency in candidates’ responses on these items. • In general, candidates’ responses to the two items addressing “fairness” appeared to be consistent; this did not appear to be the case with the two items addressing “the belief that all students can learn.”
Decisions Made • Looked more in-depth at these items (at the individual candidate level) to determine agreement for the two items addressing the belief that all students can learn • Asked the COE Assessment Committee to review data and make recommendations for changes based on these (and other) data • The Assessment Committee recommended re-wording the item and using separate statements rather than a combined statement • Faculty across the unit had multiple conversations about the role of the teacher in influencing student achievement as well as motivation.
Some Common Errors Found in Key Assessments • Items included that are not appropriately aligned to the standard(s) or to what is supposed to be measured • Not adequately measuring the standard (only certain aspects) • Not setting clear performance expectations (e.g., what is “passing” or “acceptable”?), or setting inappropriate performance expectations • Not matching the type of rubric to the assessment need (e.g., use of holistic vs. analytic rubrics) • Performance descriptors on rubrics that are not sufficiently differentiated across levels • Use of non-specific terms in performance descriptors (“some,” “effectively,” “adequately”) without explicit guidance for how those terms are to be defined • Use of broad terms – outcomes not well defined • Lack of appropriate balance of “brevity and detail” – either not efficient or not effective • Lack of well-defined criteria to guide ratings – may lead to biased ratings (e.g., leniency bias) • Not using multiple measures to assess outcomes
Example outcome and assessment: “Candidate will integrate research findings in his/her practice” – Research proposal
References and Resources • Carey, J. (2011). Outcomes assessment: Linking learning, assessment, and program improvement. PowerPoint presentation from the ALA Annual Meeting, June 27, 2011. • Darling-Hammond, L. (2006). Assessing teacher education: The usefulness of multiple measures for assessing program outcomes. Journal of Teacher Education, 57(2), 120-138. • Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., & Rothstein, J. (2012). Evaluating teacher evaluation. Phi Delta Kappan (March 2012 Supplement), 5-6. • Gonsalvez, C.J., & Freestone, J. (2007). Field supervisors’ assessments of trainee performance: Are they reliable and valid? Australian Psychologist, 42(1), 23-32. • Johnson, L.E. (2008). Teacher candidate disposition: Moral judgment or regurgitation? Journal of Moral Education, 37, 429-444.
References and Resources, continued • Magin, D., & Helmore, P. (2001). Peer and teacher assessments of oral presentation skills: How reliable are they? Studies in Higher Education, 26, 287-298. • McAllister, S., Lincoln, M., Ferguson, A., & McAllister, L. (2010). Issues in developing valid assessments of speech pathology students’ performance in the workplace. International Journal of Language and Communication Disorders, 45(1), 1-14. • Oláh, L.N., Lawrence, N.R., & Riggan, M. (2010). Learning to learn from benchmark assessment data: How teachers analyze results. Peabody Journal of Education, 85, 226-245. • Sandholtz, J.H., & Shea, L.M. (2012). Predicting performance: A comparison of university supervisors’ predictions and teacher candidates’ scores on a teaching performance assessment. Journal of Teacher Education, 63(1), 39-50. • VSU Dewar College of Education Institutional Report (2006).