This presentation discusses the challenging issue of assessment criteria for qualitative research and explores the need, possibility, and desirability of qualitative criteria. It also examines the recent emphasis on evidence in qualitative sociology and the reasons why qualitative criteria are believed to be necessary. The presentation provides an overview of different notions of criteria and argues against the idea of determinate indicators as the sole basis for research assessment. Instead, it emphasizes the importance of interpretative judgments and transparent accountability in evaluating qualitative research.
'Articulating quality criteria for qualitative, empirical educational research: Possible? Desirable?'
Martyn Hammersley
Danish School of Education, Aarhus University, Copenhagen, April 2019
[These slides can be found at: http://martynhammersley.wordpress.com/]

The challenging issue of assessment criteria for qualitative research
Concern with qualitative ‘criteria’
• In the 1970s and 80s, there was much discussion of the criteria by which qualitative research should be judged. According to Denzin and Lincoln (1994: 11), this reflected a ‘legitimation crisis’ (see Seale 1999: ch1).
• Subsequently, many qualitative researchers moved away from a concern with epistemic issues, focusing instead on the politics and ethics of research. This shaped ideas about criteria: in particular, political, aesthetic, and ethical ones were proposed.
• But it was also asked whether qualitative criteria are possible or desirable. For example: ‘The demand for criteria reflects the desire to contain freedom, limit possibilities, and resist change’ (Bochner 2000: 266).
• The recent background to this has been the increasing take-up of the label ‘arts-based research’ (Leavy 2019) and declarations about the ‘post-qualitative’ (Lather and St Pierre 2013).
However…
• By no means all qualitative researchers have gone down this route, and for some the issue of legitimation remains important (see Weil, Eberle, and Flick 2008; Hammersley 2009; Flick 2018: ch29).
• In the context of research for policy and practice, the issue of criteria for assessing qualitative research continues to be a pressing practical one (for instance, Spencer et al 2003; Schou et al 2011).
• Also relevant has been the growing inclusion of qualitative work in ‘systematic’ reviews, and the emergence of qualitative synthesis. These, too, raise the question of how to assess the quality of such work, as a practical matter.
The recent ‘evidence turn’ in qualitative sociology
• In sociology over the past ten years, there has been a growing concern with the issue of ‘evidence’ in qualitative research: Small 2009; Duneier 2011; Jerolmack and Khan 2014; Becker 2017; Lubet 2018.
• In some respects this builds on a considerable body of recent work concerned with case study in political science (see, for instance, Gerring 2007).
• Interestingly, this has not, as far as I know, yet involved a return to the issue of ‘criteria’.
• It also does not seem to have had much impact outside the US.
Why do we need qualitative criteria?
There are at least four reasons why criteria of assessment are believed to be necessary:
• To legitimate qualitative research, given that (it is claimed) quantitative work has widely accepted criteria of assessment;
• To enable sound funding decisions to be made;
• To provide lay users and non-qualitative researchers with a basis for judging the quality of qualitative findings;
• To clarify, coordinate, and improve the assessments made by qualitative researchers themselves, whether of others’ work or of their own.
A step back: What are criteria?
We must distinguish between:
• The standards or dimensions in terms of which studies are being judged;
• The benchmarks on those dimensions used to produce the evaluation: for example, a threshold dividing good from bad, sound from unsound;
• The signs we use to decide where what we are evaluating lies on each relevant dimension: for example, whether it is above or below a threshold.
A common assumption seems to be that research assessment criteria should be signs that take the form of determinate indicators, perhaps even of necessary and jointly sufficient conditions. Frequently proposed examples of such indicators in quantitative research include random assignment, random sampling, and reliability testing.
Transparent accountability
There are two related reasons why this interpretation of ‘criteria’, as determinate indicators, has been popular:
• Lay people judging research findings (for example, policymakers or teachers ‘using’ research), and non-qualitative researchers (such as journal editors reviewing articles), feel a need for them.
• There is an influential conception of objectivity according to which there should be ‘transparency’: it ought to be possible to view how findings were produced and to apply a scheme that immediately indicates whether or not the research was sound. It is argued that lay users ought not to have to rely on researchers’ own ‘subjective’ judgments about the quality of their work. This parallels similar attitudes towards the work of other professions.
Evaluation of quantitative work
• If we look at how quantitative researchers actually evaluate one another’s work, they do not do this by applying a set of transparent indicators that can immediately tell them its quality. Nor is any such set of indicators available (Hammersley 2008).
• Rather, they engage in interpretative judgments about relevant aspects of a study.
• For instance, as regards random allocation, they will try to assess how effectively this was implemented. They will also take account of other features, such as whether there was double blinding, and how likely it was that this operated successfully.
• As regards random sampling, there will be questions about the level of non-response and its implications for the representativeness of the sample. And other aspects of the research will also be interpreted and assessed.
There are no self-applying criteria
• Transparent assessment of research, relying upon signs that amount to immediate indicators of quality, is not feasible, any more than are similar attempts to apply schemes of transparent accountability to other professional activities, such as teaching.
• Any list of criteria can only be a partial and abstract representation of how researchers actually go about evaluating research.
• Such evaluation necessarily relies heavily upon the deployment of a learned capacity to make sound judgments.
The important role of ‘criteria’
• However, this does not mean that criteria play no role in assessment: clarification of standards, benchmarks, and signs can serve as guidance, not least as reminders of what ought to be taken into account, and why.
• Indeed, in my view the learned capacity for assessing research does not come about ‘naturally’, just through doing it, but also requires reflection on practice.
• And I don’t believe that there can be stable agreement about the quality of particular studies within a research community without collective reflection, with criteria serving as a guide for this.
Quantitative standards?
• Which signs are appropriate indicators of likely quality depends upon the standards and benchmarks adopted.
• Standards often referred to in relation to quantitative research are internal and external validity, construct validity, and reliability.
• But, in my view, there are no ‘types’ of validity; and validity is a standard that applies to research findings across the quantitative/qualitative divide.
• By contrast, reliability is a standard that applies to research instruments, not to findings.
• This last point highlights the fact that we not only have to clarify what the term ‘criterion’ means but also what the focus of assessment is.
Different criteria for different types of assessment
What is being assessed?
• Findings/Conclusions: judged in terms of such standards as validity and relevance.
• The research process: how well designed it was to serve the aims of the research, the threats to validity involved, etc.
• The researcher: competence, objectivity, ingenuity, creativity, etc.
• The research report: clarity of structure and formulation, completeness of coverage, etc.
While these four kinds of assessment involve overlapping considerations, they are not the same; yet they are often conflated in lists of qualitative criteria. In the rest of my talk I will focus on assessing the validity of findings/conclusions. I suggest this is done through judgments of their plausibility and their credibility.
Assessing the validity of findings: Plausibility
Plausibility = the relationship of a claim to what is already taken to be reliable knowledge in research terms, though this necessarily depends upon some commonsense assumptions.
Possible plausibility relations include: confirmation; compatibility; tension; incompatibility/contradiction.
That a claim is implausible does not mean it should be rejected, only that convincing evidence is required before it can be accepted.
Assessing the validity of findings: Credibility
Credibility = degree of evidential support
• How reliable is the evidence?
• Does the evidence presented strongly support the conclusions?
• Are there alternative inferences from the evidence that ought to be taken into account?
• Was the study well designed? Were all relevant available data sources used? Were appropriate techniques of analysis employed?
• Of course, what kind of evidence is required depends upon the sort of knowledge claim involved.
Types of knowledge claim
The evidence required varies with the type of claim:
• Descriptions: e.g. ‘bullying by girls takes a different (specified) form from bullying by boys (in the cases studied or, say, in UK secondary schools)’. How well were instances of ‘bullying’ identified, and their features documented? How well was the population/sample covered?
• Explanations: some feature/event occurred as a result of specified factor(s), e.g. ‘the new sex education policy was abandoned because of a campaign by the tabloid press’. How well are the policy, its abandonment, and the campaign described? What evidence is provided about the causal process?
• Theories: a particular type of feature/event is generally brought about by the occurrence of some prior type of feature/event, under specified conditions or ‘other things being equal’, e.g. ‘smaller classes increase children’s learning’. Is there effective comparison of contrasting cases?
Are there distinctive qualitative criteria?
• In my view, both qualitative and quantitative findings should be judged in terms of the same standards: validity (via plausibility and credibility) and relevance. And in relation to both these standards a relatively high threshold will have to be met.
• There are, of course, differences between quantitative and qualitative approaches as regards the signs used to determine the likely validity of knowledge claims, specifically as regards credibility.
• However, these differences are at the level of particular methods: for example, between experiment and survey; participant observation and unstructured interviewing; theme analysis and discourse analysis.
• Triangulation of data can provide valuable but not absolutely conclusive signs of likely validity; the same is true of ‘member checking’ and audit trails.
The state of educational research
You will perhaps have recognised that my account of standards of assessment, and of the signs that can be used to evaluate the validity of findings, depends upon some fairly traditional assumptions about the nature of academic educational research, ones that many colleagues would dismiss as positivist (or perhaps as ‘merely academic’).
A fundamental problem: the diverse goals of qualitative research
What the appropriate standards of assessment for qualitative research are depends upon what its goals are taken to be:
• to produce educationally relevant knowledge?
• to challenge some feature of the status quo?
• to improve policymaking and/or practice?
• to exemplify ethical or political ideals?
The diverse criteria that have been put forward for qualitative inquiry reflect these differences in view about what the goal or intended product of such research is. There are also ontological and epistemological differences.
Ontological and epistemological differences
For instance:
• What is the nature of educational practices and processes?
• Are they subjective phenomena that require ‘thick description’?
• Are they public forms of behaviour that can be objectively described and explained in qualitative terms?
• Are they discursively constituted, so that the task of research is to document the process of constitution?
The answers to these (and related) questions have divergent implications for the nature of the data and evidence required in qualitative educational research, and for how findings should be assessed.
Conclusion
• It is certainly true that there are problems surrounding criteria for assessing qualitative research.
• The meaning of ‘criteria’ is problematic.
• Verbal criteria cannot provide transparent means of assessment for lay users of research.
• At the same time, clarity about standards, thresholds, and signs is necessary, and is sadly neglected amongst both quantitative and qualitative researchers.
• However, the lack of agreement about research goals among qualitative researchers is a fundamental obstacle.
• So, in that sense, the legitimation problem persists.
• And this is true not just in pragmatic but also in theoretical terms: how can we justify an enterprise in which there are conflicting stated criteria, and little agreement in judgments of particular studies amongst its practitioners?
How is this problem to be resolved?
As the saying goes: please send your answers on a postcard!
Bibliography
Becker, H. S. (2017) Evidence, Chicago, University of Chicago Press.
Bochner, A. (2000) ‘Criteria against ourselves’, Qualitative Inquiry, 6, 2, pp266-272.
Denzin, N. and Lincoln, Y. (eds) (1994) Handbook of Qualitative Research, Thousand Oaks CA, Sage.
Duneier, M. (2011) ‘How not to lie with ethnography’, Sociological Methodology, 41, pp1-11.
Flick, U. (2018) An Introduction to Qualitative Research, Eighth edition, London, Sage.
Gerring, J. (2007) Case Study Research, Cambridge, Cambridge University Press.
Hammersley, M. (2008) ‘Assessing validity in social research’, in Alasuutari, P., Bickman, L. and Brannen, J. (eds) The Sage Handbook of Social Research Methods, London, Sage.
Hammersley, M. (2009) ‘Challenging relativism: the problem of assessment criteria’, Qualitative Inquiry, 15, 1, pp3-29.
Hammersley, M. (2011) Methodology, Who Needs It?, London, Sage.
Jerolmack, C. and Khan, S. (2014) ‘Talk is cheap: Ethnography and the attitudinal fallacy’, Sociological Methods and Research, 43, 2, pp178-209.
Lather, P. and St. Pierre, E. (2013) ‘Post-qualitative research’, International Journal of Qualitative Studies in Education, 26, 6, pp629-633.
Leavy, P. (2015) Handbook of Arts-Based Research, New York, The Guilford Press.
Lincoln, Y. and Guba, E. (1985) Naturalistic Inquiry, Beverly Hills CA, Sage.
Lubet, S. (2018) Interrogating Ethnography: Why evidence matters, New York, Oxford University Press.
Reicher, S. (2000) ‘Against methodolatry: Some comments on Elliott, Fischer, and Rennie’, British Journal of Clinical Psychology, 39, 1, pp1-6.
Schou, L. et al (2011) ‘Validation of a new assessment tool for qualitative research articles’, Journal of Advanced Nursing, 68, 9, pp2086-2094.
Seale, C. (1999) The Quality of Qualitative Research, London, Sage.
Small, M. (2009) ‘“How many cases do I need?” On science and the logic of case selection in field-based research’, Ethnography, 10, 1, pp5-38.
Spencer, L. et al (2003) Quality in Qualitative Evaluation, London, Cabinet Office. Available at: https://www.academia.edu/2218457/Quality_in_Qualitative_Evaluation_A_framework_for_assessing_research_evidence
Weil, S., Eberle, T. and Flick, U. (2008) ‘Between reflexivity and consolidation: Qualitative research in the mirror of handbooks. Book review symposium’, Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 9, 3, Art. 28, http://nbn-resolving.de/urn:nbn:de:0114-fqs0803280.