An Investigation of Evaluation Metrics for Analytic Question Answering ARDA Metrics Challenge: PI Meeting Emile Morse October 7, 2004
Outline • Motivation & Goals • Hypothesis-driven development of metrics • Design – collection, subjects, scenarios • Data Collection & Results • Summary and issues
Motivation • Much of the progress in IR has been attributed to community evaluations using precision and recall metrics with common tasks and common data. • There is no corresponding set of evaluation criteria for the interaction of users with these systems. • While system performance is crucial, the utility of the system to the user is equally critical. • The lack of such evaluation criteria prevents systems from being compared on the basis of utility. Acquisition of new systems is therefore based on system performance alone – and frequently does NOT reflect how systems will work in the user's actual process.
Goals of Workshop • To develop metrics for process and products that will reflect the interaction of users and information systems. • To develop the metrics based on: • Cognitive task analyses of intelligence analysts • Previous experience in AQUAINT and NIMD evaluations • Expert consultation • To deliver an evaluation package consisting of: • Process and product metrics • An evaluation methodology • A data set to use in the evaluation
Hypothesis-driven development of metrics • Hypotheses – QA systems should … • Candidate metrics – What could we measure that would provide evidence to support or refute this hypothesis? • Collection methods – Questionnaires, Mood meter, System logs, Report evaluation method, System surveillance tool, Cognitive workload instrument, … • Measures – implementation of a metric; depends on the specific collection method (see the sketch below)
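To make this chain concrete, here is a minimal sketch (Python, with hypothetical class and field names that are not taken from the workshop materials) of how a hypothesis can be traced through candidate metrics down to the measures and collection methods that implement them:

```python
# Illustrative sketch only: hypothetical classes showing how a hypothesis maps
# to candidate metrics and to the collection methods that implement them.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measure:
    name: str                # e.g. "query count per scenario"
    collection_method: str   # e.g. "system log", "questionnaire", "Glass Box"


@dataclass
class CandidateMetric:
    description: str
    measures: List[Measure] = field(default_factory=list)


@dataclass
class Hypothesis:
    statement: str           # "QA systems should ..."
    metrics: List[CandidateMetric] = field(default_factory=list)


h1 = Hypothesis(
    statement="Support gathering the same information with lower cognitive workload",
    metrics=[
        CandidateMetric("number of queries/questions issued",
                        [Measure("query count per scenario", "system log")]),
        CandidateMetric("cognitive workload",
                        [Measure("workload instrument score", "cognitive workload instrument")]),
    ],
)

for metric in h1.metrics:
    for measure in metric.measures:
        print(f"{metric.description} -> {measure.name} ({measure.collection_method})")
```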
Examples of candidate metrics • H1: Support gathering the same type of information with a lower cognitive workload • # queries/questions • % interactions where analyst takes initiative • Number of non-content interactions with system (clarifications) • Cognitive workload measurement • H7: Enable analysts to collect more data in less time • Growth of shoebox over time • Subjective assessment • H12: QA systems should provide context and continuity for the user – coherence of dialogue! • Similarity between queries – calculate shifts in dialog trails • Redundancy of documents – count how often a snippet is found more than once • Subjective assessment
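As an illustration of how two of these candidate metrics could be computed from raw session data, the following hedged sketch (Python, with a made-up log format) measures snippet redundancy and query-to-query shift via simple term overlap; the workshop's actual dialog-trail analysis may have used different formulas:

```python
# Sketch only: assumes a simple per-session log of queries and returned snippets.
from collections import Counter


def snippet_redundancy(snippets):
    """Fraction of snippet occurrences that repeat an earlier snippet."""
    if not snippets:
        return 0.0
    counts = Counter(snippets)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(snippets)


def query_shift(prev_query, next_query):
    """1 - Jaccard term overlap; higher values suggest a larger topical shift."""
    a, b = set(prev_query.lower().split()), set(next_query.lower().split())
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0


queries = [
    "biological weapons programs in country X",
    "country X anthrax production facilities",
    "export controls on dual-use equipment",
]
print([round(query_shift(q1, q2), 2) for q1, q2 in zip(queries, queries[1:])])
print(snippet_redundancy(["s1", "s2", "s1", "s3", "s1"]))  # 2 of 5 occurrences repeat -> 0.4
```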
Top-level Design of the Workshop • Systems • Domain • Scenarios • Collection • Subjects • On-site team • Block design and on-site plan
Design – Systems • HITIQA – Tomek Strzalkowski • Ferret – Sanda Harabagiu • GINKO – Stefano Bertolo • GNIST
Design – Subjects • 8 reservists (7 Navy; 1 Army) • Age: 30-54 yrs (M=40.8) • Educational background: • 1 PhD; 4 Masters; 2 Bachelors; 1 HS • Military service: 2.5-31 yrs (M=18.3) • Analysis Experience: 0-23 yrs (M=10.8)
Data Collection Instruments • Questionnaires • Post-scenario (SCE) • Post-session (SES) • Post-system (SYS) • Cross-evaluation of reports • Cognitive workload • Glass Box • System Logs • Mood indicator • Status reports • Debriefing • Observer notes • Scenario difficulty assessment
Questionnaires • Coverage for 14/15 hypotheses • Other question types: • All SCE questions relate to scenario content • 3 SYS questions on Readiness • 3 SYS questions on Training
Questions for Hypothesis 7 Enable analysts to collect more data in less time • SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more] • SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot] • SYS Q23: Having the system at work would help me find information faster than I can currently find it. • SYS Q6: The system slows down my process of finding information.
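Note that SYS Q6 (and, under the coding assumed here, SES Q2) is worded so that agreement indicates a slower process, so those items would need to be reverse-coded before being combined with the positively worded ones. A minimal sketch, assuming a 1–7 Likert coding (an assumption, not stated in the slides):

```python
# Sketch only: assumes a 1-7 Likert coding and that SES Q2 ("more time") and
# SYS Q6 ("slows down my process") are the negatively keyed items.
def reverse(score, scale_max=7):
    """Flip a negatively keyed item so that higher always means 'better'."""
    return (scale_max + 1) - score


def h7_score(ses_q2, ses_q13, sys_q23, sys_q6):
    items = [reverse(ses_q2), ses_q13, sys_q23, reverse(sys_q6)]
    return sum(items) / len(items)


print(h7_score(ses_q2=3, ses_q13=6, sys_q23=6, sys_q6=2))  # -> 5.75
```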
Additional Analysis for Questionnaire Data – Factor Analysis • Four factors emerged • Factor 1: most questions • Factor 2: time, navigation, training • Factor 3: novel information, new way of searching • Factor 4: skill in using the system improved • These factors distinguished between the four systems: each system stood out from the others (positively or negatively) on one factor • GNIST was associated with Factor 2: positive for navigation and training; negative for time.
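For readers unfamiliar with the technique, here is a hedged sketch of this kind of analysis using scikit-learn's FactorAnalysis on a synthetic response matrix; the workshop's actual software, rotation, and item set are not specified here, so everything below is illustrative:

```python
# Sketch only: extracts four factors from a synthetic questionnaire matrix.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Hypothetical response matrix: one row per (subject, system) pair,
# one column per questionnaire item, coded on a 1-7 Likert scale.
responses = rng.integers(1, 8, size=(32, 20)).astype(float)

fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
scores = fa.fit_transform(responses)   # factor scores per (subject, system) row
loadings = fa.components_              # shape: (4 factors, 20 items)

# The items that load most heavily on each factor suggest its interpretation,
# e.g. a "time / navigation / training" factor as reported above.
for i, factor in enumerate(loadings, start=1):
    top_items = np.argsort(-np.abs(factor))[:3]
    print(f"Factor {i}: top-loading items {top_items.tolist()}")
```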
Cross Evaluation Criteria Subjects rated the reports (including their own) on seven characteristics • Covers the important ground • Avoids the irrelevant materials • Avoids redundant information • Includes selective information • Is well organized • Reads clearly and easily • Overall rating
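A small sketch (hypothetical data layout and 1–7 scale) of how such ratings can be aggregated into a per-report, per-criterion score; whether self-ratings were treated differently in the workshop is not stated, so they are simply included here:

```python
# Sketch only: hypothetical ratings keyed by (rater, report_author, criterion),
# each on an assumed 1-7 scale; self-ratings are included, as in the design.
from collections import defaultdict

ratings = [
    ("A1", "A2", "covers the important ground", 6),
    ("A2", "A2", "covers the important ground", 7),   # self-rating
    ("A3", "A2", "covers the important ground", 5),
    ("A1", "A3", "reads clearly and easily", 4),
]

by_report = defaultdict(list)
for rater, author, criterion, score in ratings:
    by_report[(author, criterion)].append(score)

report_scores = {key: sum(v) / len(v) for key, v in by_report.items()}
print(report_scores)
# {('A2', 'covers the important ground'): 6.0, ('A3', 'reads clearly and easily'): 4.0}
```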
Glass Box Data • Types of data captured: • Keystrokes • Mouse moves • Session start/stop times • Task times • Application focus time • Copy/paste events • Screen capture & audio track
System log data • # queries/questions • ‘Good’ queries/questions • Total documents delivered • # unique documents delivered • % unique documents delivered • # documents copied from • # copies
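Most of these counts fall directly out of the delivery log; a minimal sketch, assuming a simple per-query record of delivered document IDs (the real log format is not shown in the slides):

```python
# Sketch only: assumes each log entry records the query issued and the
# document IDs the system delivered in response.
log = [
    {"query": "country X bioweapons program", "delivered": ["d1", "d2", "d3"]},
    {"query": "country X anthrax facilities", "delivered": ["d2", "d4"]},
]

num_queries = len(log)
delivered = [doc for entry in log for doc in entry["delivered"]]
total_delivered = len(delivered)
unique_delivered = len(set(delivered))
pct_unique = 100.0 * unique_delivered / total_delivered if total_delivered else 0.0

print(num_queries, total_delivered, unique_delivered, f"{pct_unique:.0f}%")
# -> 2 5 4 80%
```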
What Next? • Query trails are being worked on by LCC, Rutgers and others; they will be available as part of the deliverable. • Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT. • We are considering an alternative implementation of the mood indicator. • Possible futures for the methodology: AQUAINT sponsors large-scale group evaluations using the metrics and methodology; each project team employs the metrics and methodology itself; or something in between.
Issues to be Addressed • What constitutes a replication of the method? the whole thing? a few hypotheses with all data methods? all hypotheses with a few data methods? • Costs associated with data collection methods • Is a comparison needed? • Baseline – if so, is Google the right one? Maybe the ‘best so far’ to keep the bar high. • Past results – can measure progress over time, but requires iterative application • ‘Currency’ of data and scenarios • Analysts are sensitive to staleness • What is the effect of updating on repeatability?
Summary of Findings (table; only the relative cost ratings, $ to $$$, for the data collection methods are recoverable here)