The role of subjective probability in scientific expert knowledge, structured expert elicitation protocols, and the cognitive biases that careful elicitation guards against.
Elicitation: Subjective but Scientific
Tony O'Hagan
Expert Elicitation Symposium, Liverpool
Science and subjectivity
• Elicitation is necessarily subjective
  • We ask experts for their personal judgements
  • Their knowledge and uncertainty will be expressed through probabilities
  • And these are necessarily subjective probabilities
• "Surely this is totally unscientific?"
  • A common reaction when experts are first introduced to subjective probability
  • They often need education …
"You want to use subjective probability judgements? Isn't that totally unscientific? Science is supposed to be objective."
Yes, objectivity is the goal of science, but scientists still have to make judgements. These judgements include theories, insights and interpretations of data. Science progresses by other scientists debating and testing those judgements. Making good judgements of this kind is what distinguishes a top scientist.
"But subjective judgements are open to bias, prejudice, sloppy thinking …"
Subjective probabilities are judgements, but they should be careful, honest, informed judgements. In science we must always be as objective as possible. Probability judgements are like all the other judgements that a scientist necessarily makes, and should be argued for in the same careful, honest, informed way.
Outline
• Elicitation
  • The type of elicitation I'll be talking about
• Subjective
  • This is how I do elicitation
• Scientific
  • And this is why I do it that way
• Practical
  • But we need to be realistic
Elicitation
The type of elicitation I'll be talking about
My talk will be about
• Eliciting expert knowledge about a single uncertain quantity of interest (QoI)
  • Multivariate elicitation is another big topic
  • And nobody knows what works and what doesn't
• Eliciting in a way that captures uncertainty about that quantity
  • Estimates alone are useless
• Eliciting from multiple experts
  • Much of my talk will apply also for a single expert
  • But multiple experts is the most important context
• Eliciting on behalf of a client
  • Who knows much less than the experts
Subjective
This is how I do elicitation
Elicitation protocols
• There are many pitfalls for the unwary
  • Well-researched cognitive biases
  • Experts with vested interests
  • Personalities …
• But people with expertise in elicitation have developed several carefully designed protocols
  • Following one of these is part of being as scientific and objective as possible
• I use the SHELF protocol
  • Developed by myself and Jeremy Oakley
Outline and key features of SHELF
• Experts (4 to 8) come together in an elicitation workshop
  • Led by an experienced facilitator
• Two rounds of judgement
  • 1. They make individual judgements without discussion
    • Carefully structured sequence of questions
    • Guided by research in psychology
  • Then they discuss their differences
  • 2. They make consensus judgements
    • Behavioural aggregation
    • Maximising use of evidence
    • The Rational Impartial Observer (RIO)
• SHELF package
  • Templates for recording the elicitation
  • Guidance notes for the facilitator
  • Software for fitting distributions
  • PowerPoint slides to help experts make the required judgements
• Oakley, J. E. and O'Hagan, A. (2016). SHELF: the Sheffield Elicitation Framework (version 3.0). University of Sheffield, UK. http://tonyohagan.co.uk/shelf
Timescale flowchart
There is much to do before the elicitation workshop itself! Planning needs to start early because experts often have very full diaries.
Pre-elicitation:
• Appoint facilitator
• Prepare evidence dossier
• Identify experts; allocate experts and parameters to workshops
• Invite experts, get their commitment and agree workshop dates
• Brief experts; get additional evidence
• Locate and prepare venues
• Update evidence dossier
Training:
• Experts take e-learning
Workshops:
• Conduct workshops
• Complete documentation
Workshop flowchart
Introductions:
• Roles, review purpose of workshop, etc.
• Completion of SHELF1 form
Training:
• Review principal ideas from online e-learning course
• Run a practice elicitation
  • Partly for the experts to practise the skills learnt online
  • Partly so that they see how the Sheffield process works, particularly group discussion and consensus judgements
Then, for each parameter in turn (Parameter 1, Parameter 2, …):
Review evidence
Individual judgements:
• E.g. plausible range, median and tertiles
• Made privately, without discussion
Discussion:
• Individual judgements are revealed and discussed
• Particularly high, low, wide or narrow distributions are challenged and justified
• General discussion of reasons for individuals' judgements
Group judgements:
• Group 'consensus' judgements are made
  • The rational, impartial observer
• A probability distribution is fitted
• Feedback and opportunity to revisit judgements
• Confirmation of final elicited distribution
• Completion of SHELF2 form
The SHELF1 form
This form is completed at the beginning of a workshop. It records basic information.
The SHELF2 form
This form is completed for each elicited parameter. It provides a record of the elicitation.
Note the two judgement phases:
• Individual judgements
• Group consensus judgements
Scientific
And this is why I do it that way
The evidence dossier
• A document summarising the evidence regarding each QoI to be elicited
  • Based on the research of the project team
  • Possibly supplemented by additional evidence from experts
  • For use by the experts when making judgements
• Not too long
  • Otherwise it's hard for experts to assimilate all the evidence when making their judgements
• Point out weaknesses
  • Sample size, sampling/experimental technique, different region/species/duration/age, etc.
  • The "law of small numbers" heuristic: experts rely too much on poor data when it's all there is
Importance of the dossier
• We need to assemble the evidence
  • Elicited distributions should be evidence based, as far as practicable
• Evidence should be shared
  • Experts' judgements should differ only because of their expertise and interpretation of the evidence, not from having different data
  • Aggregation is otherwise much less reliable/effective
• It should be available to experts during the elicitation
  • The availability heuristic makes it important to review all the evidence together
Probabilistic judgements
• We don't ask just for vague things like an "estimate" or a "likely range"
  • Different people interpret such requests differently
  • So we can't compare the judgements of different experts
  • And we don't know what any expert's judgements mean
• Instead we ask for well-defined and interpretable things like a median
  • Experts are given training in making these judgements
• It's a probabilistic judgement, but we're not asking for a probability
  • We don't ask an expert for their probability that the QoI lies in some range
  • Because that would lead to anchoring bias
Example – how many Muslims?
• I run this experiment in all my training courses
• Participants are asked to make judgements about M = the number of Muslims (in millions) in England and Wales, according to the 2011 census
• They are asked for two probabilities: P(M > 2) and P(M > 8)
• They don't see the second question until they've answered the first
Anchoring can have a big effect
• Half of the respondents are given the 2 million question first, and half the 8 million question first
• Average probabilities from 100+ respondents in each group (over 13 classes):
  • Anchoring on 8 rather than 2 gives higher probabilities
  • The same effect is observed in almost every individual class
• Don't put numbers in the experts' heads
Sequence of judgements
• Avoidance of anchoring effects also guides the sequence of judgements
• First, we ask for a plausible range (L, U)
  • Not intended to represent any particular probability
    • People can't properly differentiate between 95%, 99% or 99.9%
  • The idea is to get the experts to think about the full range of possibility
    • Counteracting over-confidence bias
  • And also to set counter-balancing anchors
• We then ask for the expert's median M
  • Anchored on both sides by the plausible range
  • Asking for a simple judgement of equal probability
Then we ask for tertiles or quartiles
• Also anchored on both sides (T1 and T2 sit between L, M and U)
• And they require only equal-probability judgements
• The median and the tertiles/quartiles are the principal probabilistic judgements
  • They have meaningful quantitative interpretations
• The plausible range is used for plotting
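The SHELF package mentioned earlier includes its own software for fitting distributions to these judgements; the sketch below is not that software, just an illustration of the general idea. It fits a gamma distribution to a hypothetical elicited median and tertiles by least squares on the quantiles; the choice of gamma family, the example judgement values and the optimiser are all illustrative assumptions.

```python
# A minimal sketch (not the SHELF software) of fitting a parametric distribution
# to an elicited median and tertiles by least squares on the quantiles.
# The elicited judgements below are hypothetical.
import numpy as np
from scipy import stats, optimize

probs  = np.array([1/3, 1/2, 2/3])      # tertiles and median
values = np.array([24.0, 30.0, 38.0])   # hypothetical T1, M, T2 from the experts

def quantile_loss(params):
    """Squared distance between elicited and fitted quantiles of a gamma distribution."""
    shape, scale = params
    if shape <= 0 or scale <= 0:
        return np.inf
    fitted = stats.gamma.ppf(probs, a=shape, scale=scale)
    return np.sum((fitted - values) ** 2)

result = optimize.minimize(quantile_loss, x0=[4.0, 8.0], method="Nelder-Mead")
shape, scale = result.x
print(f"Fitted gamma: shape={shape:.2f}, scale={scale:.2f}")
print("Fitted quantiles:", np.round(stats.gamma.ppf(probs, a=shape, scale=scale), 1))
```

The fitted quantiles can then be fed back to the experts, who confirm or revise their judgements, as in the workshop flowchart above.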
Behavioural aggregation
• SHELF uses behavioural aggregation
  • We get the experts together and elicit from them 'consensus' probabilistic judgements
  • We want to share expertise and judgement
• Two rounds of judgement
  • The individual judgements establish the experts' initial positions
    • And provide natural starting points for discussion
  • We then have discussion, often extensive
    • With a view to understanding (if not resolving) differences of opinion
  • Then they are asked for group judgements
RIO
• Group judgements are made from the perspective of a rational impartial observer (RIO)
• Behavioural aggregation seeks to obtain an agreed, consensus distribution from the experts
  • But even after the discussion, they will have differing opinions
• RIO is the key to obtaining agreement
  • Experts are asked to agree on what a rational impartial observer might reasonably believe after seeing the experts' judgements and hearing their discussions
• The facilitator can compare the group judgements with the original individual judgements
  • Is the degree of compromise in keeping with the intervening discussion?
• A very similar concept is used in the seismic assessment of nuclear reactors: https://www.nrc.gov/docs/ML1211/ML12118A445.pdf
Not mathematical aggregation
• Mathematical aggregation requires the arbitrary choice of a pooling rule
  • Often an equal-weighted average
• Cooke's "classical" method puts a lot of effort into assessing experts on "seed" quantities
  • And then applies an arbitrary rule to get weights
  • Which often gives several experts negligible weight
  • The SHELF discussion lets poor initial judgements be corrected, and doesn't kick experts out
• And whose distribution do we end up with?
  • A subjective probability distribution should represent somebody's beliefs
  • With SHELF, it's (at least notionally) RIO's
    • Which is really what the client wants
  • With mathematical aggregation, it's nobody's
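For contrast with the behavioural approach, here is a minimal sketch of the simplest mathematical aggregation rule mentioned above: an equal-weighted linear opinion pool of the experts' densities. The three expert distributions and the exceedance threshold are hypothetical, purely for illustration.

```python
# Equal-weighted linear opinion pool: a mixture of the experts' elicited densities.
# The expert distributions below are hypothetical.
import numpy as np
from scipy import stats

experts = [                      # hypothetical individual distributions for the same QoI
    stats.norm(loc=30, scale=5),
    stats.norm(loc=35, scale=8),
    stats.norm(loc=26, scale=4),
]
weights = np.full(len(experts), 1.0 / len(experts))   # equal weights: an arbitrary choice

# The pooled mean and any pooled probability are weighted averages of the experts'
# values, because both are linear in the density.
pooled_mean = sum(w * e.mean() for w, e in zip(weights, experts))
pooled_p_gt_40 = sum(w * e.sf(40.0) for w, e in zip(weights, experts))

print(f"Pooled mean: {pooled_mean:.1f}")
print(f"Pooled P(QoI > 40): {pooled_p_gt_40:.3f}")
```

The pooled mixture typically belongs to no individual expert and may be multimodal, which is exactly the "whose distribution do we end up with?" objection on this slide.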
Templates
• SHELF templates serve two purposes
• They guide the facilitator through the SHELF process
  • Ensuring that all the steps are followed in the right sequence
• Completed templates form a record of the elicitation
  • The client can see that the process has been followed
  • They can see initial judgements, how the discussion unfolded, and the final group judgements
  • They can take their own view of whether to adopt the final distribution as their own
The facilitator
• The facilitator has a very important role
  • Guides the experts in making accurate judgements
  • Manages the discussion, watching out for various group interaction biases
    • Dominant/reticent experts, factions, experts extrapolating beyond their expertise, group-think
• Resources to train and support facilitators
  • Training courses (3 days, next one in April)
  • Every template has an annotated form with guidance on each step
  • More than 100 pages of guidance documents
  • PowerPoint slide sets to use in the workshop
  • Software for fitting and displaying distributions
Practical
But we need to be realistic
It's not practical!
• Doing elicitation as carefully, rigorously and scientifically as possible takes serious time and effort
  • It only makes sense when it's sufficiently important to do it as well as possible
  • E.g. to inform a significant decision
• It may still not be practical, for various reasons
  • Not practical to get experts together face to face
  • Insufficient time or money
    • Particularly when there are many uncertain quantities
• The aim should always be to do the best within practical constraints
  • Probabilistic judgements with well-defined meaning
  • Meaningful aggregation, attention to cognitive biases, documentation
Can't get all the experts together?
• The next best option is video-conference
  • Preferably with most of the experts in one room
  • Teleconference is a poor alternative
• The fall-back position is probabilistic Delphi
  • As set out in the EFSA Guidance, so also known as EFSA Delphi
  • Experts separately make SHELF judgements, then receive feedback and another round
  • Some kind of mathematical aggregation is required at the end
  • Training is absolutely crucial
• European Food Safety Authority (2014). Guidance on Expert Knowledge Elicitation in Food and Feed Safety Risk Assessment. EFSA Journal 2014;12(6):3734, 278 pp. doi:10.2903/j.efsa.2014.3734
Too many quantities?
• Minimal assessment
  • Also introduced in the EFSA Guidance
  • The team makes quick assessments of all quantities
    • An estimate and an uncertainty measure (yes, rather vague!)
  • These are used to assess the sensitivity of the decision to each quantity
  • Used to prioritise quantities for full SHELF workshops
  • Minimal assessment is retained for the others
• Hierarchical approach
  • Full SHELF workshop for selected (representative/critical) quantities
  • The same experts (now trained) do probabilistic Delphi for the others
  • If still too many, use simple Delphi for medians and import distributions from higher levels
Imprecise probabilities?
• It's often said that some probability judgements are less confident, less precise than others
• This has led some people to say that probabilities can only be specified as lying within some range
• Some have even suggested that probability judgements cannot be made in some situations
  • "Deep uncertainty", "Knightian uncertainty", …
• Let's see what's going on
Tossing coins and drawing pins
• H = 'Head' with a single coin toss ('Head' or 'Tail')
  • P(H) = 0.5
  • A confident judgement: the coin looks normal, no reason to suspect bias
• U = 'Up' with a single drawing-pin toss ('Up' or 'Down')
  • P(U) = 0.5
  • Not so confident: I give 0.5 because I don't see which of 'up' or 'down' would be more common
Complete description of uncertainty
• For a single event, like H or U, there are just two outcomes
  • It happens or it doesn't
• Uncertainty about the single outcome is completely described by the single probability that it happens
  • And the probability that it doesn't is one minus the probability that it does
• Differences between tossing a coin and tossing a drawing-pin only emerge when we consider more than one toss
If I have previously seen 3 tosses
• All 3 coin tosses were 'Head'
  • The coin still looks normal; three 'Head's is just chance
  • Stick with P(H) = 0.5
• All 3 drawing-pin tosses were 'Up'
  • It begins to look like 'up' is more common than 'down'
  • Revise my probability to, say, P(U) = 0.6
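The slides do not spell out a formal model behind these revisions, but one way to make the thought experiment concrete is a Beta-Binomial sketch: a prior on the long-run 'Up' (or 'Head') frequency that is centred on 0.5 but fairly weak behaves like the drawing-pin, while a very strong prior behaves like the coin. The prior strengths below are illustrative assumptions, chosen so the pin revision lands near the 0.6 quoted on the slide.

```python
# Sketch of the coin vs drawing-pin thought experiment via Beta-Binomial updating.
# Both priors on the long-run frequency have mean 0.5; the coin prior is much stronger.
# Prior strengths are illustrative assumptions, not figures from the talk.
def predictive_prob(prior_a, prior_b, successes, trials):
    """Probability that the next toss succeeds, after observing the data."""
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    return post_a / (post_a + post_b)

coin_prior = (500, 500)   # strong belief that a normal-looking coin is close to fair
pin_prior = (6, 6)        # weak belief centred on 0.5 for the drawing-pin

print("P(next Head | 3 Heads):", round(predictive_prob(*coin_prior, 3, 3), 3))  # ~0.501
print("P(next Up   | 3 Ups):  ", round(predictive_prob(*pin_prior, 3, 3), 3))   # 0.600
```

Note that before any tosses are seen, both priors give probability 0.5 for the next toss, which is exactly the point of the next slide: a single probability completely describes uncertainty about a single toss.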
Is this "uncertainty about uncertainty"?
• No!
• For a single toss, the single probability still completely describes uncertainty
• My probabilities on the basis of my current information (when I haven't seen any previous tosses) are simply P(H) = 0.5 and P(U) = 0.5
• But the thought experiment of the 3 previous tosses gives a way to understand why I feel less confident about P(U) = 0.5 than P(H) = 0.5
• And this explanation does not lead us to "imprecise probabilities"
But elicitation is not an exact science!
• Every judgement is approximate
• We fitted an arbitrary distribution to just 3 judgements
• Experts would have made different judgements on a different day
  • Or with a different group
  • Or a different facilitator
  • …
So we need to do some sensitivity analysis
• Vary the elicited distribution a bit
• See if the changes lead to materially different conclusions/decisions in the client's motivating problem
• Now do we have imprecise probabilities?
  • Yes!
  • But …
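As one way of doing such a check, here is a minimal sketch: perturb the fitted distribution's parameters by modest amounts and see whether a decision-relevant quantity (here a hypothetical exceedance probability) changes materially. The fitted parameters, perturbation sizes and threshold are all illustrative assumptions rather than values from the talk.

```python
# Sensitivity sketch: nudge the elicited distribution and recompute a decision-relevant
# quantity. All numbers below (fitted parameters, perturbations, threshold) are hypothetical.
import itertools
from scipy import stats

fitted_mean, fitted_sd = 30.0, 6.0     # hypothetical elicited/fitted distribution
threshold = 40.0                       # decision assumed to hinge on P(QoI > 40)

base = stats.norm(fitted_mean, fitted_sd).sf(threshold)
print(f"Baseline P(QoI > {threshold}): {base:.3f}")

# Vary the mean and sd by modest amounts and look at the range of the exceedance probability
for dm, ds in itertools.product([-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]):
    p = stats.norm(fitted_mean + dm, fitted_sd + ds).sf(threshold)
    print(f"mean {fitted_mean + dm:5.1f}, sd {fitted_sd + ds:4.1f} -> P = {p:.3f}")
```

If the conclusion is stable across these modest perturbations, the imprecision in the elicited distribution does not matter for the decision, which leads into the question on the next slides of how much imprecision to allow.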
How much imprecision?
• How far should we vary the judgements?
• If I thought harder about the drawing-pin, I might give a slightly different P(U)
  • A short pin -> higher P(U); a long pin -> lower P(U)
  • But it still feels like P(U) is about 0.5!
• How much imprecision? Not much!
How much imprecision?
• How far should we vary the distribution?
  • Not very far
  • And this is where I differ from much of what I read about "imprecise probabilities"
• Given all the evidence, and all the experts' expertise, and the discussions we have seen, how far might RIO's probability distribution differ from the elicited distribution?
  • It's impossible to say
  • But not very far!
  • And it's certainly impossible to set precise bounds on it, as formally required for "imprecise probability" theory
From XKCD: https://xkcd.com/2110/