Reliability and Comparability of Peer Review Results
Nadine Rons, Coordinator of Research Evaluations & Policy Studies, Research & Development Department, Vrije Universiteit Brussel
Eric Spruyt, Head of the Research Administration Department, Universiteit Antwerpen
“Three cheers for peers”
• ‘Three cheers for peers’, Editorial, Nature 439, 118 (12 January 2006).
• "Thanks are due to researchers who act as referees, as editors resolve their often contradictory advice."
• "Only in a minority of cases does every referee agree ..."
Presentation plan
• Validation of results: reliability & comparability
• Material investigated: 'ex post' peer review + citation analysis of teams
• Investigation of results
  Reliability: inter-peer agreement & different rating habits
  Comparability: related concepts & intrinsic characteristics
• Conclusions: aimed at improved results, a better understanding, choosing the right method
I. Validation of results
1. Reliability
Peer review: principal method to evaluate research quality.
BUT: various kinds of bias & different rating habits, and it is not always feasible to use measures limiting their influence.
• Is it possible to measure reliability?
2. Comparability
• H F Moed (2005), 'Citation Analysis in Research Evaluation', chapter 18: 'Peer Review and the Validity of Citation Analysis', Springer.
Do more reliable results give better correlations with other outcomes?
Correlations are often relatively weak & depend on the discipline.
• Can this be explained? (crucial for further acceptance!)
II. Material investigated (Peer review)
1. Peer review
• Shared principles for the panel-evaluations of teams per discipline:
  • Expertise-based
  • International level
  • Uniform treatment
  • Coherence of results
  • Multi-criteria approach
  • Pertinent advice
• Exceptions:
  • Different experts for each team (1 discipline at VUB).
  • Specific methodology using different indicators (1 discipline at UA).
II. Material investigated (Peer review @ VUB)
• VUB-indicators:
  Standard procedure 'VUB-Richtstramien voor de Disciplinegewijze Onderzoeksevaluaties', VUB Research Council (2001).
  • Scientific merit of the research / uniqueness of the research
  • Research approach / plan / focus / coordination
  • Innovation
  • Quality of the research team
  • Probability that the research objectives will be achieved
  • Research productivity
  • Potential impact on further research and on the development of applications
  • Potential impact for transition to or utility for the community
  • Dominant character of the research (fundamental / applied / policy oriented)
  • Overall research evaluation
II. Material investigated (Peer review @ UA)
• UA-indicators:
  'Protocol 1998' for the Assessment of Research Quality, Association of Universities of the Netherlands (VSNU, 1998).
  • Academic quality
  • Academic productivity
  • Scientific relevance
  • Academic perspective
• Exception (1 discipline, "partial" indicators):
  • Publications
  • Projects
  • Conference participations
  • Other
  • Globally
II. Material investigated (Citation analysis)
2. Citation analysis
• 'New Bibliometric Tools for the Assessment of National Research Performance: Database Description, Overview of Indicators and First Applications', H F Moed et al., Scientometrics 33 (1995).
• Centre for Science and Technology Studies (CWTS), Leiden University.
• Thomson ISI citation indexes, corresponding period, same teams.
• Indicators include:
  • CPP/JCSm: citations per publication with respect to expectations for the journals
  • CPP/FCSm: citations per publication with respect to expectations for the field
  • JCSm/FCSm: journal citation score with respect to expectations for the field
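As an illustration of how the three ratios relate to each other, here is a minimal Python sketch. It assumes the usual reading of the CWTS indicators: CPP is the team's mean number of citations per publication, JCSm the mean expected citation rate of the journals it published in, and FCSm the mean expected citation rate of its fields. The `Paper` class, the function name and the numbers are hypothetical and not taken from the evaluations discussed here.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    citations: int        # citations received by this paper
    journal_mean: float   # expected citations for comparable papers in the same journal
    field_mean: float     # expected citations for comparable papers in the same field


def citation_indicators(papers):
    """Return the three normalised ratios for one team (ratio of means)."""
    n = len(papers)
    cpp = sum(p.citations for p in papers) / n       # citations per publication
    jcsm = sum(p.journal_mean for p in papers) / n   # mean journal citation score
    fcsm = sum(p.field_mean for p in papers) / n     # mean field citation score
    return {
        "CPP/JCSm": cpp / jcsm,
        "CPP/FCSm": cpp / fcsm,
        "JCSm/FCSm": jcsm / fcsm,
    }


# Hypothetical team with two papers (made-up numbers).
team = [Paper(10, 5.0, 4.0), Paper(2, 3.0, 4.0)]
result = citation_indicators(team)
print(result)
```

A value above 1 means the team is cited more than expected for its journals (CPP/JCSm) or its field (CPP/FCSm), or publishes in journals cited above the field average (JCSm/FCSm).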
III. Investigation of results (Overview)
1. Reliability
  a. Inter-peer agreement: three groups of evaluations according to measured level of agreement.
  b. Rating habits: panel-procedures vs. exception with different experts for each team.
  • Influence on results & on correlations between peer review indicators investigated.
2. Comparability
  a. Related concepts: 'global' vs. 'partial' indicators & variation with discipline.
  b. Intrinsic characteristics of methods: contributions to ratings counted differently & scale effects.
  • Influence on comparability investigated.
III. Investigation of results (1. Reliability, a. Inter-peer agreement)
1. Reliability
1.a. Inter-peer agreement
In panels: different opinions lead to different positions of teams.
• Level of inter-peer agreement measured by correlations between the ratings from different peers.
• 3 groups compared: panels with high, intermediate and low inter-peer agreement.
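One way to turn "correlations between the ratings from different peers" into a single number per panel is the mean pairwise correlation. The sketch below is only an illustration under stated assumptions: the slides do not say which correlation coefficient or aggregation was used, so Pearson's r and a plain average are assumptions, and the peer names and scores are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long score lists (non-constant input assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def inter_peer_agreement(ratings):
    """ratings: dict peer -> list of scores, one per team, in the same team order.

    Returns the mean Pearson correlation over all pairs of peers.
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical panel of three peers rating the same four teams.
panel = {
    "peer_A": [1, 2, 3, 4],
    "peer_B": [2, 4, 6, 8],   # ranks the teams exactly like peer_A
    "peer_C": [4, 3, 2, 1],   # ranks the teams in the opposite order
}
agreement = inter_peer_agreement(panel)
print(round(agreement, 3))
```

Panels could then be sorted by this value and split into high, intermediate and low agreement groups; where to put the thresholds is a separate choice that this sketch does not make.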
III. Investigation of results (1. Reliability, a. Inter-peer agreement)
• Influence on results:
  Results compared to citation analysis:
  • Better inter-peer agreement = higher number of significant correlations, BUT: only at the higher aggregation level of the 3 groups.
  • Other mechanisms have a stronger impact on correlations.
• Influence on correlations between peer review indicators:
  Significant correlations for each pair of peer review indicators, for each of the 3 groups (also for individual disciplines).
  • Correlations between peer review indicators are relatively robust to variations in inter-peer agreement.
III. Investigation of results (1. Reliability, b. Rating habits)
1.b. Rating habits
Opinions are translated into ratings according to own habits, reference levels in other evaluations, scores given to other files, known use of scores, ...
Two cases compared:
• Exception with different experts for each team: scores not necessarily in line with opinions.
• Standard panel-evaluations: uniform reference level.
III. Investigation of results (1. Reliability, b. Rating habits)
• Influence on results:
  Results compared to citation analysis:
  • Panel-evaluations: significant correlations for all peer review indicators with some or all citation analysis indicators (& vice versa).
  • Different experts: significant correlation for only 1 pair of indicators.
  • Rating habits can influence results significantly.
• Influence on correlations between peer review indicators:
  • Panel-evaluations: significant correlations for all pairs of indicators.
  • Different experts: significant correlations for only 8% of the pairs.
  • Low observed correlations between indicators (expected to be correlated) can indicate diverging rating habits.
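The share of significantly correlated indicator pairs can serve as a quick internal check of an evaluation. The following sketch shows one possible way to compute it. The slides do not state which significance test was applied, so the two-sided permutation test on Pearson's r, the 5% threshold, the fixed seed and all ratings below are assumptions made purely for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often a shuffled pairing correlates at least as strongly."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def share_of_significant_pairs(scores, alpha=0.05):
    """scores: dict indicator -> list of ratings, one per team (same team order)."""
    pairs = list(combinations(scores, 2))
    significant = [
        (a, b) for a, b in pairs
        if permutation_p_value(scores[a], scores[b]) < alpha
    ]
    return len(significant) / len(pairs)


# Hypothetical ratings for 7 teams on three indicators (made-up numbers).
scores = {
    "merit":        [1, 2, 3, 4, 5, 6, 7],
    "productivity": [2, 3, 4, 5, 6, 7, 8],    # moves together with merit
    "relevance":    [10, 5, 2, 1, 2, 5, 10],  # no linear relation with the other two
}
share = share_of_significant_pairs(scores)
print(round(share, 2))
```

A low share for indicators that are expected to move together would be a reason to look more closely at the rating habits of the experts involved, not proof of a problem by itself.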
III. Investigation of results (2. Comparability, a. Related concepts)
2. Comparability
2.a. Related concepts
• Partial indicators (publications, projects, conferences, ...): no significant correlations between peer review indicators, in contrast to global indicators (scientific merit, productivity, relevance, ...).
  • Performances in different activities are not necessarily correlated.
• Correlations of peer review with citation analysis indicators: the best-correlating pairs vary strongly with discipline.
  • An indicator may not represent the same concept for all subject areas.
• Always use more than one indicator!
III. Investigation of results (2. Comparability, b. Intrinsic characteristics)
2.b. Intrinsic characteristics
• Contributions to ratings: different in the minds of peers (pro & contra) and in citation analysis (positive counts).
• Scale effects: minimum & maximum limits & their position with respect to the mean value.
III. Investigation of results (2. Comparability, b. Intrinsic characteristics)
• Peer rating frequency distribution:
  • Peer ratings: pro & contra, also elements counted 'negatively'.
  • Scale: minimum & maximum limit.
[Figure: Relative frequency distribution of peer results. x-axis: peer results, from LOW (1) to HIGH (10); y-axis: percentage of the number of teams (58), 0% to 50%; one series per peer review indicator, from "Scientific merit of the research" to "Overall research evaluation".]
III. Investigation of results (2. Comparability, b. Intrinsic characteristics)
• Citation impact frequency distribution:
  • Citation impact: only positive counts, strong influence of highly cited articles.
  • Scale: minimum limit closer to mean & no maximum limit.
[Figure: Relative frequency distribution of citation impact, all teams in the pure ISI analysis. x-axis: indicator value, 0,1 to 3,1; y-axis: percentage of the number of teams (60), 0% to 40%; series: CPP/JCSm and CPP/FCSm.]
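The influence of a single highly cited article on a citations-per-publication average can be seen with a few lines of Python. The citation counts below are invented for illustration only; they are not data from the teams analysed.

```python
from statistics import mean, median

# Hypothetical citation counts for one team's seven papers (made-up numbers).
citations = [0, 1, 1, 2, 2, 3, 40]

cpp_all = mean(citations)                       # average including the highly cited paper
cpp_without_top = mean(sorted(citations)[:-1])  # average after dropping the top paper
typical = median(citations)                     # middle value, insensitive to the outlier

print(cpp_all, cpp_without_top, typical)
```

Because counts cannot go below zero and have no upper limit, one paper can pull the average far above what is typical for the team. A peer score on a closed 1 to 10 scale cannot move in the same way, which is one reason why the two kinds of results do not line up one to one.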
III. Investigation of results (2. Comparability, b. Intrinsic characteristics)
Scientific relevance vs. Field citation impact
• Good correlations only when effects of intrinsic characteristics can be filtered out.
[Figure: High & intermediate inter-peer agreement group. x-axis: peer review "Scientific relevance" score, 4 to 10; y-axis: field citation impact (CPP/FCSm), 0,0 to 3,0.]
IV. Conclusions
• Reliability
  • Peer review results can be influenced considerably by rating habits.
  • It is recommended to create a uniform reference level (e.g. using panel procedures) or check for signs of low reliability by analysing the outcomes of the peer evaluation itself.
• Comparability
  • Besides reliability, comparability of results depends on the nature of the indicators, on the subject area, on intrinsic characteristics of the methods, ...
  • Different methods describe different aspects. The most suitable method should be carefully chosen or developed.
  • Evaluations should always be based on a series of indicators, never on one single indicator.