Quality criteria for data aggregation used in academic rankings
IREG FORUM on University Rankings: Methodologies under Scrutiny
16-17 May 2013, Warsaw, Poland
Michaela Saisana (michaela.saisana@jrc.ec.europa.eu)
European Commission, Joint Research Centre, Econometrics and Applied Statistics Unit
Outline
• Global rankings at the forefront of the policy debate
• Overview of two global university rankings (ARWU, THES)
• Statistical coherence tests
• Uncertainty analysis
• Policy implications
• Conclusions
Global rankings at the forefront of the policy debate
The definition of the university is broad: a university, as the name suggests, tends to encompass a broad range of purposes, dimensions, foci and missions that are difficult to condense into a compact measure. Still, for reasons of governance, accountability and transparency, there is increasing interest among policymakers as well as practitioners in measuring and benchmarking "excellence" across universities. The growing mobility of students and researchers has also created a market for these measures among prospective students and their families.
Global rankings at the forefront of the policy debate
Global rankings have prompted debates and policy responses at EU and national level:
• to improve the positioning of a country within the existing measures,
• to create new measures,
• to discuss regional performance (e.g. to show that the USA is well ahead of Europe in terms of cutting-edge university research).
Global rankings at the forefront of the policy debate
[Figure: number of documents containing the words "THES ranking" or "ARWU ranking": a 10-fold increase in the last 10 years.]
Global rankings at the forefront of the policy debate
• Academic Ranking of World Universities (ARWU) (Shanghai Jiao Tong University), 2003
• Webometrics (Spanish National Research Council), 2003
• World University Ranking (Times Higher Education/Quacquarelli Symonds), 2004–09
• Performance Ranking of Scientific Papers for Research Universities (HEEACT), 2007
• Leiden Ranking (Centre for Science & Technology Studies, University of Leiden), 2008
• World's Best Colleges and Universities (US News and World Report), 2008
• SCImago Institutional Rankings, 2009
• Global University Rankings (RatER) (Rating of Educational Resources, Russia), 2009
• Top University Rankings (Quacquarelli Symonds), 2010
• World University Ranking (Times Higher Education/Thomson Reuters, THE-TR), 2010
• U-Multirank (European Commission), 2011
Over 60 countries have introduced national rankings, and there are numerous regional, specialist and professional rankings.
Global rankings at the forefront of the policy debate
University rankings are used to judge the performance of university systems … whether or not that was intended by their proponents.
Global rankings at the forefront of the policy debate
• France: creation of 10 centres of HE excellence; the Minister of Education set a target of placing at least 10 French universities among the top 100 in ARWU by 2012; the President has put French standing in these international rankings at the forefront of the policy debate (Le Monde, 2008).
• Italy: 0 universities in the top 100 of the ARWU ranking is seen as a failure of the national educational system.
• Spain: 1 university in the top 200 of the ARWU is hailed as a great national achievement.
Global rankings at the forefront of the policy debate
An OECD study shows that university leaders worldwide are concerned about ranking systems, with consequences for the strategic and operational decisions they take to improve their research performance (Hazelkorn, 2007). There are over 16,000 HEIs, yet some of the global rankings capture merely the top 100 universities, less than 1% (Hazelkorn, 2013).
Global rankings at the forefront of the policy debate
An extreme impact of global rankings
What: in 2005 THES created a major controversy in Malaysia, with the country's top two universities slipping by almost 100 places compared to 2004.
Why: a change in the ranking methodology (a fact not widely known, and of limited comfort).
Impact: a Royal Commission of Inquiry was set up to investigate the matter. A few weeks later, the Vice-Chancellor of the University of Malaya stepped down.
Global rankings at the forefront of the policy debate Overview of two global university rankings (ARWU, THES) Statistical Coherence Tests Uncertainty analysis Policy Implications Conclusions
Overview – 2007 ARWU ranking
METHODOLOGY
• 6 indicators
• Best performing institution = 100; scores of other institutions calculated as a percentage of it
• Weighting scheme chosen by the rankers
• Linear aggregation of the 6 indicators
PROS AND CONS
• 6 "objective" indicators
• Focus on research performance; overlooks other university missions
• Biased towards hard-science institutions
• Favours large institutions
Overview – 2007 THES ranking
METHODOLOGY
• 6 indicators
• z-score calculated for each indicator; best performing institution = 100; scores of other institutions calculated as a percentage of it
• Weighting scheme chosen by the rankers
• Linear aggregation of the 6 indicators
PROS AND CONS
• Attempts to take teaching quality into account
• Two expert-based indicators account for 50% of the total (subjective indicators, lack of transparency)
• Yearly changes in methodology
• Measures research quantity
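To make the two normalization-and-aggregation recipes concrete, here is a minimal sketch in Python. The indicator values and the weighting scheme are hypothetical, purely for illustration; only the two normalization rules (ARWU's "best = 100" percentage, THES's z-scores rescaled so the best = 100) and the weighted linear aggregation mirror the methodologies above.

```python
import numpy as np

# Hypothetical raw scores of 5 institutions on 3 indicators (illustrative only).
X = np.array([
    [100.0, 62.0, 55.0],
    [ 73.0, 80.0, 41.0],
    [ 58.0, 45.0, 90.0],
    [ 91.0, 30.0, 66.0],
    [ 40.0, 95.0, 20.0],
])
w = np.array([0.4, 0.4, 0.2])  # hypothetical weighting scheme "chosen by the rankers"

# ARWU-style normalization: best performer = 100, others as a percentage of it.
arwu_norm = 100.0 * X / X.max(axis=0)

# THES-style normalization: z-score per indicator, then the best = 100 and
# the others expressed as a percentage of the best z-score.
z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
thes_norm = 100.0 * z / z.max(axis=0)

# Linear aggregation of the normalized indicators into a composite score.
for name, Xn in [("ARWU-style", arwu_norm), ("THES-style", thes_norm)]:
    score = Xn @ w
    rank = (-score).argsort().argsort() + 1  # 1 = best
    print(name, "scores:", score.round(1), "ranks:", rank)
```

The two recipes need not produce the same ordering of institutions, which is part of what the comparison below examines.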
Overview – Comparison (2007)
1. Same top 10, including Harvard, Cambridge, Princeton, Caltech, MIT and Columbia.
2. Greater variation in the middle to lower end of the rankings.
3. Europe lags behind in both the ARWU (also known as the SJTU ranking) and the THES rankings.
4. THES favours UK universities: all UK universities sit below the line (shown in red).
University rankings, published yearly:
+ Very appealing for capturing a university's multiple missions in a single number
+ Allow one to situate a given university in the worldwide context
- Can lead to misleading and/or simplistic policy conclusions
Question: Can we say something about the quality of the university rankings and the reliability of the results?
Global rankings at the forefront of the policy debate Overview of two global university rankings (ARWU, THES) Statistical Coherence Tests Uncertainty analysis Policy Implications Conclusions
Statistical coherence The Stiglitz report (p.65): […] a general criticism that is frequently addressed at composite indicators, i.e. the arbitrary character of the procedures used to weight their various components. […] The problem is not that these weighting procedures are hidden, non-transparent or non-replicable – they are often very explicitly presented by the authors of the indices, and this is one of the strengths of this literature. The problem is rather that their normative implications are seldom made explicit or justified.
Question: Can we say something about the quality of the university rankings and the reliability of the results?
Statistical coherence - Dean's example
x1: hours of teaching; x2: number of publications
y = 0.5·x1 + 0.5·x2
Estimated: R1² = 0.0759, R2² = 0.826, corr(x1, x2) = −0.151, V(x1) = 116, V(x2) = 614, V(y) = 162
Statistical coherence - Dean's example
x1: hours of teaching; x2: number of publications
With y = 0.5·x1 + 0.5·x2, teaching hardly influences the score (R1² = 0.076). To obviate this, the dean substitutes the model with y = 0.7·x1 + 0.3·x2. A professor comes by, looks at the last formula, and complains that publishing is disregarded in the department …
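A quick simulation reproduces the dean's numbers and shows why the professor's complaint, read off the nominal weights, is misleading. This is a sketch: the means and the normality of the simulated data are our assumptions; only the variances and the correlation are taken from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate (teaching hours, publications) to match the quoted moments:
# V(x1) = 116, V(x2) = 614, corr(x1, x2) = -0.151. Means are arbitrary.
n = 200_000
cov12 = -0.151 * np.sqrt(116 * 614)
cov = [[116.0, cov12], [cov12, 614.0]]
x1, x2 = rng.multivariate_normal([30.0, 20.0], cov, size=n).T

for w1, w2 in [(0.5, 0.5), (0.7, 0.3)]:
    y = w1 * x1 + w2 * x2
    r1 = np.corrcoef(y, x1)[0, 1] ** 2  # variance of y 'explained' by teaching
    r2 = np.corrcoef(y, x2)[0, 1] ** 2  # variance of y 'explained' by publications
    print(f"weights ({w1}, {w2}): R1^2 = {r1:.3f}, R2^2 = {r2:.3f}, V(y) = {y.var():.0f}")
```

With equal weights, publications dominate the score (R2² ≈ 0.83 vs R1² ≈ 0.08) because V(x2) is far larger than V(x1); with the 0.7/0.3 weights the two indicators end up roughly equally influential. Nominal weights are not a reliable guide to importance.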
Statistical coherence
Si: a ruler for 'importance'. Plotting the ARWU score against a single indicator, we can compute a statistic that tells us, for example: Si = 0.88 means that we could reduce the variance of the ARWU scores by 88% by fixing 'Papers in Nature & Science'.
Statistical coherence
Our suggestion: assess the quality of a composite indicator using, instead of Ri² (the squared Pearson product-moment correlation coefficient of the regression of y on xi), its non-parametric equivalent, Pearson's correlation ratio:
Si = V[ E(y | xi) ] / V(y)
where E(y | xi) is the smoothed curve of y against xi, V(y) is the unconditional variance, and Si is the first-order sensitivity index.
Statistical coherence
Pearson's correlation ratio = first-order effect = top marginal variance = main effect …
Source: Paruolo, Saisana, Saltelli, 2013, J. Royal Statistical Society A
Features:
• it offers a precise definition of importance, that is, 'the expected reduction in variance of the CI that would be obtained if a variable could be fixed';
• it can be used regardless of the degree of correlation between variables;
• it is model-free, in that it can be applied also to non-linear aggregations;
• it is not invasive, in that no changes are made to the CI or to the correlation structure of the indicators (unlike what we will see next on uncertainty analysis).
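A minimal sketch of estimating Si non-parametrically: bin xi into quantile classes, take the within-bin means of y as a crude smoother of E(y | xi), and compute the variance of those conditional means over the variance of y. The function name and the binning smoother are our choices for illustration; the cited paper's actual estimator may differ.

```python
import numpy as np

def first_order_si(x, y, n_bins=20):
    """Estimate S_i = V[E(y|x_i)] / V(y), i.e. Pearson's correlation ratio."""
    # Quantile bin edges give roughly equal-sized classes of x.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    # Within-bin means of y approximate the smoothed curve E(y | x_i).
    means = np.array([y[idx == b].mean() for b in range(n_bins)])
    probs = np.bincount(idx, minlength=n_bins) / len(y)
    grand = probs @ means
    return (probs @ (means - grand) ** 2) / y.var()

# Demo on synthetic data mimicking the dean's example (independent case,
# where S_i coincides with R_i^2); variances 116 and 614 as on the slide.
rng = np.random.default_rng(1)
x1 = rng.normal(0.0, np.sqrt(116.0), 100_000)
x2 = rng.normal(0.0, np.sqrt(614.0), 100_000)
y = 0.5 * x1 + 0.5 * x2
print(first_order_si(x1, y), first_order_si(x2, y))  # ~0.16 and ~0.84
```

For a linear index with independent indicators Si equals Ri²; the non-parametric estimator keeps working when the aggregation is non-linear or the indicators are correlated.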
Statistical coherence
One can hence compare the importance of an indicator as given by its nominal weight (assigned by the developers) with its importance as measured by the first-order effect (Si), to test the index for coherence.
Statistical coherence - ARWU
The Si's are more similar to each other than the nominal weights: they range between 0.14 and 0.19 (Si's normalized to unit sum; CV estimates), whereas the weights are either 0.10 or 0.20.
Source: Paruolo, Saisana, Saltelli, 2013, J. Royal Statistical Society A
Statistical coherence - THES
• The combined importance of the peer-review variables (recruiters and academics) appears larger than stipulated by the developers, indirectly supporting the accusation of linguistic bias at times levelled at THES.
• The teacher/student ratio, a key variable aimed at capturing the teaching dimension, is much less important than it should be (normalized Si is 0.09; nominal weight is 0.20).
Source: Paruolo, Saisana, Saltelli, 2013, J. Royal Statistical Society A
Global rankings at the forefront of the policy debate Overview of two global university rankings (ARWU, THES) Statistical Coherence Tests Uncertainty analysis Policy Implications Conclusions
Uncertainty analysis - Why?
Notwithstanding recent attempts to establish good practice in composite indicator construction (OECD, 2008), "there is no recipe for building composite indicators that is at the same time universally applicable and sufficiently detailed" (Cherchye et al., 2007). Booysen (2002, p. 131) summarises the debate on composite indicators by noting that "not one single element of the methodology of composite indexing is above criticism". Andrews et al. (2004) argue that "many indices rarely have adequate scientific foundations to support precise rankings: […] typical practice is to acknowledge uncertainty in the text of the report and then to present a table with unambiguous rankings".
Uncertainty analysis - How?
Space of alternatives: weights, missing data (imputation), aggregation, normalisation, including/excluding variables.
Model averaging: whenever a choice in the composite set-up may not be strongly supported, or if you may not trust one single model, we recommend using more models.
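As a sketch, the space of alternatives can be enumerated as a Cartesian product of the methodological choices. The specific options listed are hypothetical placeholders, not the exact set used in the studies cited here.

```python
from itertools import product

# Hypothetical methodological choices; each combination is one plausible model.
normalisations = ["best=100", "z-score", "rank"]
aggregations = ["linear", "geometric"]
weightings = ["nominal", "equal", "perturbed"]
imputations = ["listwise deletion", "mean substitution"]

scenarios = list(product(normalisations, aggregations, weightings, imputations))
print(len(scenarios), "scenarios; e.g.", scenarios[0])
# -> 36 scenarios; e.g. ('best=100', 'linear', 'nominal', 'listwise deletion')
```

Each scenario yields one set of scores and ranks; model averaging then reports what is common, and what is not, across all of them.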
Uncertainty analysis - How?
[Figure: 'How coupled stairs are shaken in most of the available literature' vs 'How to shake coupled stairs': assumptions should be varied simultaneously, not one at a time.]
Uncertainty analysis – ARWU & THES
Question: Can we say something about the quality of the university rankings and the reliability of the results?
Objective of the UA: NOT to verify whether the two global university rankings are legitimate models to measure university performance, but to test whether the rankings and/or their associated inferences are robust or volatile with respect to changes in the methodological assumptions, within a plausible and legitimate range.
Source: Saisana, D'Hombres, Saltelli, 2011, Research Policy 40, 165–177
Uncertainty analysis – ARWU & THES
Activate simultaneously different sources of uncertainty (number of indicators, aggregation, normalisation, imputation, weighting), covering a wide spectrum of methodological assumptions: 70 scenarios in total. Then estimate the FREQUENCY of the university ranks obtained in the different simulations.
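A simplified sketch of the procedure follows. The data, the number of universities and the random sampling of choices are all hypothetical; the actual study combined 70 specified scenarios rather than random draws.

```python
import numpy as np

rng = np.random.default_rng(2)
n_uni, n_ind, n_scen = 50, 6, 70
X = rng.gamma(2.0, 10.0, size=(n_uni, n_ind))  # hypothetical raw indicators

rank_counts = np.zeros((n_uni, n_uni), dtype=int)  # rows: universities, cols: ranks
for _ in range(n_scen):
    # One scenario = one combination of methodological choices.
    w = rng.dirichlet(np.ones(n_ind))                 # perturbed weights
    if rng.random() < 0.5:                            # normalisation choice
        Xn = 100.0 * X / X.max(axis=0)                # best = 100
    else:
        Xn = X.argsort(axis=0).argsort(axis=0) + 1.0  # rank-based
    if rng.random() < 0.5:                            # aggregation choice
        score = Xn @ w                                # linear
    else:
        score = np.exp(np.log(Xn) @ w)                # geometric
    ranks = (-score).argsort().argsort()              # 0 = best
    rank_counts[np.arange(n_uni), ranks] += 1

# Frequency of each rank and the plausible rank bracket per university.
freq = rank_counts / n_scen
for u in range(3):
    held = np.flatnonzero(rank_counts[u])
    print(f"university {u}: modal rank {freq[u].argmax() + 1}, "
          f"bracket [{held[0] + 1}, {held[-1] + 1}]")
```

The output per university is a distribution over ranks rather than a single number, which is exactly the kind of evidence summarised in the next two slides.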
Uncertainty analysis – ARWU
• Harvard, Stanford, Berkeley, Cambridge, MIT: top 5 in more than 75% of our simulations.
• Univ California: original rank 18th, but could be ranked anywhere between the 6th and 100th position.
• The impact of the assumptions is much stronger for the middle-ranked universities.
Uncertainty analysis – THES
• The impact of the uncertainties on the university ranks is even more apparent.
• MIT: ranked 9th, but this rank is confirmed in only 13% of the simulations (plausible range [4, 35]).
• Very high volatility also for universities ranked in the 10th-20th positions, e.g. Duke Univ, Johns Hopkins Univ, Cornell Univ.
Policy implications
HEIs provide an array of services and positive externalities to society (universal education, innovation and growth, active citizens, capable entrepreneurs and administrators, etc.), which call for multi-dimensional measures of effectiveness and/or efficiency. A clear statement of the purpose of any such measure is also needed: measuring scientific excellence is not the same as measuring, e.g., employability or innovation potential, or deciding where to study, or how to reform the university system so as to increase the visibility of national universities.
Policy implications
Indicators and league tables are enough to start a discussion on higher education issues, BUT not sufficient to conclude it. The assigned university rank depends largely on the methodological assumptions made in compiling the rankings: 9 in 10 universities shift more than 10 positions in the 2008 SJTU ranking, e.g.
• Spain: 92 positions (Univ Autonoma Madrid) and 277 positions (Univ Zaragoza),
• Italy: 71 positions (Univ Milan) and 321 positions (Polytechnic Inst Milan),
• France: 22 positions (Univ Paris 06) and 386 positions (Univ Nancy 1).
Policy implications
A multi-modeling approach can offer a representative picture of the classification of universities by ranking institutions within a bracket, as opposed to assigning a specific rank that is not representative of the plurality of opinions on how to assess university performance. The compilation of university rankings should always be accompanied by coherence tests and robustness analysis.
Conclusions
• 'rankings are here to stay, and it is therefore worth the time and effort to get them right' (Alan Gilbert, Nature News, 2007)
• 'because they define what "world-class" is to the broadest audience, these measures cannot be ignored by anyone interested in measuring the performance of tertiary education institutions' (Jamil Salmi, 2009)
Conclusions
• 'rankings are here to stay' (Sanoff, 1998)
• 'ranking systems are clearly here to stay' (Merisotis, 2002)
• 'tables: they may be flawed but they are here to stay' (Leach, 2004)
• 'they are here to stay' (Hazelkorn, 2007)
• 'like them or not, rankings are here to stay' (Olds, 2010)
• 'whether or not colleges and universities agree with the various ranking systems and league table findings is insignificant, rankings are here to stay' (UNESCO, 2010)
• 'educationalists are well able to find fault with rankings on numerous grounds and may reject them outright. However, given that they are here to stay…' (Tofallis, 2012)
• 'while many institutions had reservations about the methodologies used by the rankings compilers, there was a growing recognition that rankings and classifications were here to stay' (Osborne, 2013)
More at: http://composite-indicators.jrc.ec.europa.eu (or simply Google “composite indicators” – 1st hit)
References and Related Reading
• Paruolo P., Saisana M., Saltelli A., 2013, Ratings and rankings: voodoo or science? J. Royal Statistical Society A 176(2).
• Saisana M., D'Hombres B., Saltelli A., 2011, Rickety numbers: Volatility of university rankings and policy implications. Research Policy 40, 165–177.
• Saisana M., D'Hombres B., 2008, Higher Education Rankings: Robustness Issues and Critical Assessment, EUR 23487, Joint Research Centre, Publications Office of the European Union, Italy.
• Saisana M., Saltelli A., Tarantola S., 2005, Uncertainty and sensitivity analysis techniques as tools for the analysis and validation of composite indicators. J. Royal Statistical Society A 168(2), 307–323.
• OECD/JRC, 2008, Handbook on Constructing Composite Indicators: Methodology and User Guide, OECD Publishing, ISBN 978-92-64-04345-9.