
The National Student Survey: A Multilevel Analysis of Discipline Effects


Presentation Transcript


  1. The National Student Survey: A Multilevel Analysis of Discipline Effects The NSS Conference, 8th May 2008, Nottingham, UK Professor Herbert W. Marsh; Jacqueline Cheng, University of Oxford

  2. Overview Background • Use of students’ evaluations to evaluate teachers in individual classes • Australian research using student ratings to benchmark educational experience in Australian universities and discipline-within-university groups NSS Results • Factor structure of NSS responses • Ability to differentiate between institutions and discipline-within-university groups

  3. Students’ Evaluations of University Teaching: Dimensionality, Reliability, Validity, Potential Biases and Usefulness

  4. Introduction Although there is limited research on the use of university student ratings to evaluate universities as a whole, there is a large research literature on the use of students’ evaluations of teaching effectiveness to evaluate the effectiveness of individual teachers. Students' evaluations of teaching effectiveness (SETs) are of considerable interest. They have generated a great deal of research in North America and, increasingly, universities all over the world. The research literature contains many thousands of publications. (My research programme has continued for 30 years). SET research has been motivated by the traditional importance of teaching in universities. Interest also comes from an increasing emphasis on monitoring the quality of university teaching and Quality Assurance exercises in Higher Education.

  5. The Purposes of SETs • Diagnostic feedback to faculty about the effectiveness of their teaching that will be useful for the improvement of teaching; • A measure of teaching effectiveness to be used in personnel decisions; • Information for students to use in the selection of courses and instructors; and • An outcome or a process description for research on teaching. The first purpose is nearly universal, but the next three are not.

  6. Summary Conclusions. My research has led me to conclude that SETs are: • Multidimensional; • Reliable and stable; • Primarily a function of the instructor who teaches a course rather than the course that is taught; • Valid in relation to a variety of indicators of effective teaching; • Relatively unaffected by a variety of variables hypothesized as potential biases; and • Seen to be useful by students for use in course selection, by administrators for use in personnel decisions, by faculty as feedback about teaching.

  7. Student Rating Dimensions

  8. SEEQ Dimensionality • Effective teaching is a multidimensional construct and so it is not surprising that SETs are also multidimensional. For example, a teacher may be organized but lack enthusiasm, and this should be accurately reflected in the SETs. • Information from SETs depends upon the content of the items. Poorly worded or inappropriate items will not provide useful information. • If a survey instrument contains an ill-defined hodgepodge of different items and SETs are summarized by an average of these items, then there is no basis for knowing what is being measured.

  9. Dimensionality: The SEEQ Factors Learning/Value: You found course intellectually challenging/stimulating; Instructor Enthusiasm: Instructor dynamic/energetic in conducting course; Organisation: Course materials were well prepared/carefully explained; Individual Rapport: Instructor was friendly towards individual students; Group Interaction: Students encouraged to participate in class discussions; Breadth of Coverage: Presented background/origin of ideas/concepts; Examinations/Grading: Feedback valuable from exams/graded materials; Assignments/Readings: Readings, homework, etc. contributed to appreciation and understanding of subject; Workload/Difficulty: Relative course difficulty (very easy...medium…very hard).

  10. SEEQ Dimensionality: Summary The SEEQ factor structure: • Has been replicated in dozens of published factor analyses; • Generalizes well across discipline and level (undergraduate/graduate) • Is also supported by teacher self-evaluations of their own teaching. The debate about which components should be measured has not been resolved, but factors like the SEEQ factors have been identified with carefully designed instruments. "Home-made" surveys are rarely developed in relation to rigorous psychometric considerations. They fail to provide a comprehensive evaluation of SET dimensions, undermining their usefulness, particularly for diagnostic feedback.

  11. Reliability and Stability

  12. SEEQ Reliability The reliability of SETs is most appropriately determined from studies of interrater agreement (i.e., generalizability of ratings over students in the same class). The reliability of the class-average response depends upon the number of students rating the class; it is about • .95 for the average response from 50 students, • .90 from 25 students, • .74 from 10 students, and • .60 from five students. However, even for teachers with small classes, it is possible to get high reliability by averaging results from several small classes. Given a sufficient number of students, SET reliability compares favourably with the best objective tests.
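
These class-size figures follow the usual logic of aggregating ratings over students. Below is a minimal sketch of that relationship using the Spearman-Brown formula, assuming a single-rater reliability of about .25 (a hypothetical value chosen only because it roughly reproduces the figures above; it is not taken from the presentation):

```python
def class_average_reliability(n_raters: int, single_rater_rel: float) -> float:
    """Spearman-Brown formula: reliability of the mean of n_raters ratings,
    given the reliability (interrater agreement) of a single rating."""
    return (n_raters * single_rater_rel) / (1 + (n_raters - 1) * single_rater_rel)

r1 = 0.25  # hypothetical single-rater reliability
for n in (50, 25, 10, 5):
    print(f"{n:>2} students: class-average reliability ≈ {class_average_reliability(n, r1):.2f}")
```

Averaging results over several small classes raises the effective number of raters in the same way, which is why pooling classes helps teachers with small enrolments.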

  13. What is the Relative Importance of the Teacher vs. Course Effects? How highly correlated are SETs in: • two different courses taught by the same instructor • the same course taught by different teachers on two different occasions? For Overall Instructor Ratings of: • same instructor teaching same course on two occasions (r = .72) [teacher & course effect], • same instructor teaching two different courses (r = .61) [teacher effect], • same course taught by two different instructors (r = -.05) [course effect]. SETs primarily reflect the teacher who is doing the teaching, not the course that is being taught.
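
One hedged way to read these correlations is as a simple crossed decomposition in which each class-average rating is a teacher effect plus a course effect plus error; under that reading, the "same teacher, different course" correlation approximates the teacher share of variance and the "same course, different teacher" correlation approximates the course share. A sketch (the decomposition itself is an illustrative assumption, not the analysis reported in the presentation):

```python
# Illustrative assumption: class-average rating = teacher effect + course effect + error.
r_same_teacher_diff_course = 0.61   # approximates the teacher share of variance
r_same_course_diff_teacher = -0.05  # approximates the course share (indistinguishable from zero)

teacher_share = r_same_teacher_diff_course
course_share = max(r_same_course_diff_teacher, 0.0)  # truncate the small negative estimate
error_share = 1.0 - teacher_share - course_share

print(f"teacher ≈ {teacher_share:.0%}, course ≈ {course_share:.0%}, "
      f"occasion/error ≈ {error_share:.0%} of the variance in class-average ratings")
```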

  14. Longitudinal Stability over 13 Years Cross-sectional studies at different levels of education suggest that teaching effectiveness declines with experience/age. In a true longitudinal study I considered 195 teachers evaluated continuously over 13 years (average of 30.9 classes/teacher). • I evaluated the linear and nonlinear effects of year, course level (graduate vs. undergraduate), and their interaction. • Changes in ratings over time were all close to zero for the 9 SEEQ factors and the two overall rating items. • Most teachers had highly stable mean ratings over 13 years: 84% showed no significant effects over time; a few improved, a few declined. In summary, mean ratings of the same teachers are remarkably stable over a 13-year period.

  15. [Figure] Each of the 195 grey horizontal lines represents ratings by one teacher over 13 years; one highlighted teacher is consistently 1 SD above the grand mean over all 195 teachers, another consistently 1 SD below. For most teachers there is no systematic increase or decrease in ratings over the 13 years.

  16. Validity

  17. In Support of the Validity of SETs Effective teaching is a hypothetical construct for which there is no adequate single indicator. Hence, the validity of SETs and other indicators of effective teaching must be demonstrated through a construct validation approach. SETs are positively related to many criteria of teaching effectiveness, including: • the ratings of former students; • student achievement in multisection validity studies; • faculty self-evaluations of their own teaching effectiveness; and • observations of trained observers on specific processes (e.g., teacher clarity).

  18. In the Multisection Validity Study (relating SETs to Student Achievement): • There are many sections of the same course; • Each section is taught by a separate teacher but has the same course outline, textbooks, objectives, and final exam; • Students are randomly assigned to sections; • Final exam reflects the common objectives; • Students evaluate teaching effectiveness on a well-standardized instrument before the final course grade; and • Section-average SETs are related to section-average exam performance, controlling for pretest measures. Do the sections that evaluate teaching most favourably also perform best on the final exam (when plausible counter-explanations are not viable)?
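
A minimal sketch of the section-level analysis this design implies: correlate section-average SETs with section-average exam performance after partialling out a section-average pretest. The data, effect sizes, and variable names below are simulated for illustration only and are not taken from any of the studies discussed:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sections = 40

# Simulated section-level aggregates (illustrative only).
pretest = rng.normal(size=n_sections)             # section-average pretest ability
teaching = rng.normal(size=n_sections)            # unobserved teaching effectiveness
set_rating = 0.6 * teaching + rng.normal(scale=0.5, size=n_sections)
exam = 0.5 * pretest + 0.5 * teaching + rng.normal(scale=0.5, size=n_sections)

def residualise(y, x):
    """Residuals of y after removing its linear relation with x."""
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

# Partial correlation of section-average SETs and exam scores, controlling for pretest.
r_partial = np.corrcoef(residualise(set_rating, pretest),
                        residualise(exam, pretest))[0, 1]
print(f"Section-level partial correlation (SET, exam | pretest): {r_partial:.2f}")
```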

  19. Meta-Analysis Cohen conducted the classic meta-analysis of multisection validity studies. Student achievement consistently correlated with SETs. For a subset of 41 "well-designed" studies, correlations between achievement and SETs were substantial: Structure (.55), Interaction (.52), Skill (.50), Overall Course (.49), Overall Instructor (.45), Learning (.39). Cohen (1987, p. 12) concluded that "I am confident that global ratings of the instructor and course, and certain rating dimensions such as skill, rapport, structure, interaction, evaluation, and student's self-rating of their learning can be used effectively as an integral component of a teaching evaluation system."

  20. Instructor Self-Evaluations In two studies, teachers evaluated their own teaching using SEEQ and were evaluated by their students: • Separate factor analyses of teacher and student responses identified the SEEQ factors; • Student-teacher agreement on all SEEQ factors significant (median rs=.49 & .45), supporting convergent validity; • Multitrait-multimethod analyses indicated student/teacher agreement was specific to each SEEQ factor, supporting discriminant validity; • Mean differences small (i.e., student ratings were not systematically higher or lower than teacher self-evaluations). Good student/teacher agreement supports the validity of student ratings.

  21. Improving Teaching Effectiveness

  22. Improving Teaching Effectiveness Logically, the introduction of a broad, institution-based, carefully planned program of SETs is likely to improve teaching effectiveness. However, empirical demonstration based on SETs comes from SET Feedback studies in which: • Teachers are randomly assigned to experimental (feedback) and control (no feedback) groups; • SETs are collected in the middle of the term; • Midterm ratings are quickly returned to feedback instructors, augmented, perhaps, with personal consultation; and • Groups are compared on end-of-term ratings and, perhaps, other variables.

  23. Improving Teaching Effectiveness In his classic meta-analysis, Cohen found that instructors who received midterm feedback were subsequently rated about one-third of a standard deviation higher than controls on the Total SET Rating. He reported even larger differences for ratings of Instructor Skill, Attitude Toward Subject, and Feedback to Students. Studies that augmented feedback with consultation produced substantially larger differences, but other methodological variations had little effect. These results demonstrate that SET feedback, particularly when augmented by consultation, led to improved teaching effectiveness.

  24. Prototype SEEQ Feedback/Consultation Intervention Teachers randomly assigned to groups. At T1 (middle of semester 1), T2 (end of semester 1) and T3 (end of semester 2) all teachers: • evaluated themselves & rated importance of each SEEQ factor; • were evaluated by students on SEEQ; At T2 Feedback Teachers selected target SEEQ factors that: • were important to the teacher (teacher self-evaluations); • had low ratings (needed improvement); • were “appropriate areas to target improvement efforts.” Teaching idea packets given to teachers for each targeted SEEQ factor. Each packet contained up to 40 strategies (based on interviews with outstanding teachers). Teacher (with consultant) selected a few strategies for each target SEEQ factor for implementation. Control teachers received no feedback until end of the study.

  25. Results For Feedback Teachers • Because teachers only targeted one or a few scales, interpretations of overall ratings are most straightforward; effects were significant for all 4 overall ratings (effect sizes .4 to .5). • The feedback group had higher ratings for all 12 SEEQ scores; 8 were statistically significant. • Now, let's see how the intervention worked with the target scales (that teachers chose for the intervention) compared to non-target scales.

  26. [Figure annotations] At pretest, ratings of target scales were much lower than non-target scales for all groups – part of the reason that they were selected. Intervention group: post-test ratings were all much higher, and target scales were now similar to non-target scales; the intervention improved target scales much more than non-target scales. Control group: post-test ratings of target scales were still much lower than non-target scales. Consistent with the rationale for the study, ratings of targeted scales improved substantially relative to nontargeted areas for experimental groups, but not for the control group.

  27. Discussion The most important results of the investigation were to provide varying degrees of support for a priori predictions that: • SEEQ feedback and feedback/consultation provided an effective means of improving university teaching; • Effects were stronger for the initially less effective teachers; • In support of the multidimensional SEEQ perspective, improvement was largest for targeted SEEQ scales; • It is important for teachers to specifically target particular scales; • Teaching packets: even if teachers are motivated to improve their teaching, they apparently do not know how to do so, and need concrete strategies to facilitate teaching improvement efforts. However, few universities implement teaching improvement programmes as part of the collection of SETs, even though there is clear evidence that they work. We invite collaboration to pursue this teaching improvement research in the UK.

  28. Overall Summary Conclusions In conclusion, let me return to my original conclusion that SETs based on the teacher as the unit of analysis are: • Multidimensional; • Reliable and stable; • Primarily a function of the teacher who teaches a course rather than the course that is taught; • Valid in relation to a variety of indicators of effective teaching; • Relatively unaffected by a variety of variables hypothesized as potential biases; and • Seen to be useful by students for use in course selection, by administrators for use in personnel decisions, by faculty as feedback about teaching

  29. Using Student Ratings To Benchmark Universities: The Australian Experience

  30. Benchmarking Australian Universities • Australian government & universities cooperate to collect standardized data of many kinds that are used to compare universities – a benchmarking exercise. • In Australia, the Course Experience Questionnaire (CEQ) is used to compare undergraduate teaching in different universities – to benchmark teaching effectiveness. • Following from the CEQ, the PREQ (Postgraduate Research Experience Questionnaire) was designed to compare universities and disciplines within universities on the quality of research supervision and the PhD experience – to benchmark Australian universities. Here I summarise results from two PREQ studies, one based on the 1996-98 cohort and a second based on the 1999 cohort.

  31. Unit of Analysis Problem The unit of analysis should be the individual supervisors (as in student evaluation and school effectiveness research). However, using the supervisor as the unit of analysis would introduce confidentiality and reliability issues, due to the small number of graduating PhD students per supervisor. The issue is moot because supervisors are not identified on the PREQ. In order to be useful for benchmarking universities, PREQ responses must be able to differentiate between universities (or discipline-within-university groups). • Support for reliability requires that there are relatively small differences in ratings within the same university and relatively large differences between different universities.

  32. The Present Investigation Data from PhD students who graduated from 32 universities (Study 1) and from 35 universities (Study 2). PREQ contains six multi-item scales (Supervision, Climate, Clarity, Infrastructure, Skills Development, and Thesis Examination Process) and an Overall Rating. At the level of the individual student, responses had reasonable psychometric properties. Multilevel analyses evaluated differences between universities (with and without adjustment for background characteristics). In neither study were there significant differences between universities. The reliability of PREQ responses for purposes of differentiating between universities does not differ significantly from zero.

  33. [Figure] Mean Overall Rating and range of probable error for each university (Study 1), plotted against the mean rating across all 32 universities.

  34. Differences Among 35 Universities: Study 2 (0.4% of Var Explained)

  35. Do PREQ Responses Differentiate Between Disciplines? YES, but discipline differences explain only about 1-2% of the variance. The lowest Overall Ratings were for Humanities (ratings were significantly higher in Agriculture, Business, Physical Sciences, Health Sciences, and Education). However, the critical question is whether responses differentiate between discipline-within-university groups (because the PREQ is used to benchmark the same discipline across different universities). For Overall Ratings, the answer is NO. The small discipline differences tend to be similar across different universities (e.g., Humanities tends to get lower ratings in all universities).

  36. Validity/Usefulness of PREQ Responses PREQ responses are completely unreliable for differentiating between universities, so they must also be invalid for that purpose. PREQ responses were unrelated to: • Research Productivity (publications & grants); • Number of Australian PhD Student Scholarships; • Attrition Rates. We concluded that PREQ responses were unlikely to be useful for any of the purposes for which they were designed (including benchmarking and improving PhD programmes).

  37. Using the NSS to Benchmark Universities: The UK Experience

  38. NSS Overview In 2001, HEFCE proposed a revised method for quality assurance of teaching and learning in higher education, with three aims: • To help inform the choices of prospective students; • To contribute to public accountability; and • To provide useful data to institutions to use in their enhancement activities. The first data collections took place in 2005 and 2006. These form the basis of the present investigation.

  39. NSS 22-item Instrument (6 specific factors & overall rating item) • Teaching: “staff are good in explaining things”; • Assessment/Feedback: “Assessment arrangements and marking have been fair”; • Support: “I have received sufficient advice and support with my studies”; • Organisation/Management: “The timetable works efficiently as far as my activities are concerned”; • Resources: “The library resources and services are good enough for my needs”; • Personal development: “The course has helped me to present myself with confidence”; • Overall satisfaction: “Overall, I am satisfied with the quality of the course”

  40. NSS Overview

  41. Factor Analysis • How many factors do NSS responses measure? • Can NSS responses be explained by a single global score? • How similar is the factor structure in 2005 and 2006?

  42. Results: Factor Analysis Exploratory Factor Analysis: • A combination of “interpretability” and various guidelines (eigenvalue > 1; scree; significance tests) suggested either 6 (consistent with a priori design) or 7 factors Confirmatory Factor Analysis: • Role of Q22 (Overall Satisfaction Rating) • Goodness of fit & interpretability support 8-factor solution. • Factor structure was highly consistent across 2005 and 2006 (based on tests of factorial invariance) • One higher-order factor model fit the data well, but not as well as the 8-factor first-order model. However, because the contribution of each factor differs, an unweighted average may not be appropriate.
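
As an illustration of the "eigenvalue > 1" guideline mentioned above, the sketch below counts candidate factors from the correlation matrix of simulated item responses; the data, loadings, and factor count are invented and are not the NSS dataset or the reported analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_items, n_true_factors = 1000, 22, 6

# Simulate 22 items, each loading mainly on one of 6 factors (illustrative only).
loadings = np.zeros((n_items, n_true_factors))
for j in range(n_items):
    loadings[j, j % n_true_factors] = rng.uniform(0.5, 0.8)
factor_scores = rng.normal(size=(n_students, n_true_factors))
items = factor_scores @ loadings.T + rng.normal(scale=0.7, size=(n_students, n_items))

# Eigenvalues of the item correlation matrix; the "eigenvalue > 1" rule counts factors.
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
print("Largest eigenvalues:", np.round(eigenvalues[:8], 2))
print("Factors suggested by the eigenvalue > 1 rule:", int((eigenvalues > 1).sum()))
```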

  43. [Figure] Results: Higher-order Factor Analysis. First-order factor structure relating the 22 items to 8 factors.

  44. [Figure] Results: Higher-order Factor Analysis. Second-order structure relating the 8 first-order factors to 1 second-order factor. Fit indices: higher-order model TLI = .983, CFI = .985, RMSEA = .048; first-order model TLI = .988, CFI = .991, RMSEA = .035.
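
For reference, the TLI, CFI, and RMSEA values quoted above are computed from model and null-model chi-square statistics along the lines of the sketch below; the chi-square values, degrees of freedom, and sample size here are hypothetical inputs chosen only to show the calculation, since the presentation reports only the resulting indices:

```python
from math import sqrt

def fit_indices(chi2_model, df_model, chi2_null, df_null, n):
    """Standard incremental (TLI, CFI) and absolute (RMSEA) fit indices."""
    ratio_null, ratio_model = chi2_null / df_null, chi2_model / df_model
    tli = (ratio_null - ratio_model) / (ratio_null - 1)
    cfi = 1 - max(chi2_model - df_model, 0) / max(chi2_null - df_null,
                                                  chi2_model - df_model, 0)
    rmsea = sqrt(max(chi2_model - df_model, 0) / (df_model * (n - 1)))
    return tli, cfi, rmsea

# Hypothetical inputs (not the values underlying the NSS models).
tli, cfi, rmsea = fit_indices(chi2_model=3060.0, df_model=180,
                              chi2_null=200000.0, df_null=231, n=10000)
print(f"TLI = {tli:.3f}, CFI = {cfi:.3f}, RMSEA = {rmsea:.3f}")
```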

  45. Multilevel Analysis • How much variance explained by university & discipline-within-university groups? • Can differences be explained by student characteristics and disciplines? • Are differences significant and meaningful? • How reliable are NSS responses? How is this related to sample size at different levels?

  46. Design: Variance Components 3-Level Model • L1 = students, • L2 = discipline-within-university groups, • L3 = university. Variance Component Models • How much variance is explained by each level? • How much is this changed by controlling fixed effects (student characteristics & discipline)? Means & Probable Error (Error Bars) • For each group (university or discipline-within-university) there is a mean level of satisfaction and a range of probable error (an error bar around the mean).
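
A rough sketch of what a three-level variance decomposition looks like on simulated data, using a naive between/within breakdown rather than the proper multilevel (random-effects) estimation used in the actual analyses; the group sizes and variance shares below are invented, and the naive breakdown slightly overstates the higher-level shares:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_uni, n_disc, n_students = 50, 10, 60   # invented sizes, kept small for speed

# Simulate satisfaction = university effect + discipline-within-university effect + student residual.
uni_eff = rng.normal(scale=np.sqrt(0.025), size=n_uni)          # ~2.5% of total variance
grp_eff = rng.normal(scale=np.sqrt(0.05), size=(n_uni, n_disc)) # ~5% of total variance
rows = []
for u in range(n_uni):
    for d in range(n_disc):
        y = uni_eff[u] + grp_eff[u, d] + rng.normal(scale=np.sqrt(0.925), size=n_students)
        rows.append(pd.DataFrame({"university": u, "group": f"{u}-{d}", "satisfaction": y}))
data = pd.concat(rows, ignore_index=True)

# Naive decomposition: between universities, between groups within universities, residual.
uni_mean = data.groupby("university")["satisfaction"].transform("mean")
grp_mean = data.groupby("group")["satisfaction"].transform("mean")
components = {
    "university": uni_mean.var(),
    "discipline-within-university": (grp_mean - uni_mean).var(),
    "student (residual)": (data["satisfaction"] - grp_mean).var(),
}
total = sum(components.values())
for name, value in components.items():
    print(f"{name:<30} {100 * value / total:5.1f}% of variance")
```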

  47. Disciplines/Disciplines-within-university Groups Number of Discipline Categories: • The actual discipline structure is different for every university. Number of Discipline-within-university Groups: • The number of discipline-within-university groups is MUCH larger than the number of disciplines. • With 19 disciplines in each of 140 universities, there are potentially 19 x 140 = 2,660 discipline-within-university groups, and the number of students in each group can be very small. Discipline: Fixed vs. Random Effect • Fixed effect: the “main” effect of discipline averaged across all universities (i.e., are Psychology students more satisfied than Economics students, averaged across all universities?). • Random effect: variance explained by discipline-within-university groups (i.e., comparison of Psychology in different universities, or Psychology and Economics in the same university).

  48. Variance Components (variance explained) University: Not much variance explained (about 2.5%), but highly significant; controlling for student/discipline characteristics reduced the variance components somewhat. Discipline-within-university Groups: More variance explained, but this depends on the number of discipline categories (up to 5%); controlling for the fixed effects of discipline reduces this variance substantially.

  49. Differences Between Universities as a Function of Probable Error and Reliability

  50. Differences Between Universities: Caterpillar Plots [Figure] Each triangle is the mean satisfaction rating for one university, with universities ranked from lowest to highest. The vertical line that goes above and below each mean is an error bar (range of probable error); longer bars represent more error. A horizontal reference line marks the mean satisfaction across all universities. Annotated example: one university has a mean almost 0.4 SD below the average across all universities, with an error bar ranging from about -.2 to about -.7 (+/- .25 SD).
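
A sketch of how a caterpillar plot like this can be drawn from group means and error-bar half-widths; the means and errors below are simulated, not the NSS results:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n_uni = 140

# Simulated university means (deviations from the overall mean, in SD units)
# and probable-error half-widths (illustrative values only).
means = rng.normal(scale=0.16, size=n_uni)
errors = rng.uniform(0.1, 0.3, size=n_uni)

order = np.argsort(means)            # rank universities from lowest to highest mean
ranks = np.arange(1, n_uni + 1)

fig, ax = plt.subplots(figsize=(10, 4))
ax.errorbar(ranks, means[order], yerr=errors[order],
            fmt="^", markersize=4, linestyle="none", capsize=2)
ax.axhline(0.0, linewidth=1)         # mean satisfaction across all universities
ax.set_xlabel("Universities ranked from lowest to highest")
ax.set_ylabel("Mean satisfaction (SD units from overall mean)")
ax.set_title("Caterpillar plot: university means with probable-error bars")
plt.show()
```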
