An Investigation of Evaluation Metrics for Analytic Question Answering ARDA Metrics Challenge: PI Meeting Emile Morse October 7, 2004
Outline • Motivation & Goals • Hypothesis-driven development of metrics • Design – collection, subjects, scenarios • Data Collection & Results • Summary and issues
Motivation • Much of the progress in IR has been attributed to community evaluations using precision and recall metrics with common tasks and common data. • There is no corresponding set of evaluation criteria for the interaction of users with these systems. • While system performance is crucial, the utility of the system to the user is equally critical. • The lack of such evaluation criteria prevents systems from being compared on the basis of utility. Acquisition of new systems is therefore based on system performance alone – and frequently does NOT reflect how systems will work in the user's actual process.
Goals of Workshop • To develop metrics for process and products that will reflect the interaction of users and information systems. • To develop the metrics based on: • Cognitive task analyses of intelligence analysts • Previous experience in AQUAINT and NIMD evaluations • Expert consultation • To deliver an evaluation package consisting of: • Process and product metrics • An evaluation methodology • A data set to use in the evaluation
Hypothesis-driven development of metrics • Hypotheses – QA systems should … • Candidate metrics – What could we measure that would provide evidence to support or refute this hypothesis? • Collection methods – Questionnaires, Mood meter, System logs, Report evaluation method, System surveillance tool, Cognitive workload instrument, … • Measures – implementation of a metric; depends on the specific collection method (see the sketch below)
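To make this chain concrete, here is a minimal sketch (Python, with hypothetical class and field names that are not taken from the workshop materials) of how a hypothesis can be traced through candidate metrics down to the measures and collection methods that implement them:

```python
# Illustrative sketch only: hypothetical classes showing how a hypothesis maps
# to candidate metrics and to the collection methods that implement them.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Measure:
    name: str                # e.g. "query count per scenario"
    collection_method: str   # e.g. "system log", "questionnaire", "Glass Box"


@dataclass
class CandidateMetric:
    description: str
    measures: List[Measure] = field(default_factory=list)


@dataclass
class Hypothesis:
    statement: str           # "QA systems should ..."
    metrics: List[CandidateMetric] = field(default_factory=list)


h1 = Hypothesis(
    statement="Support gathering the same information with lower cognitive workload",
    metrics=[
        CandidateMetric("number of queries/questions issued",
                        [Measure("query count per scenario", "system log")]),
        CandidateMetric("cognitive workload",
                        [Measure("workload instrument score", "cognitive workload instrument")]),
    ],
)

for metric in h1.metrics:
    for measure in metric.measures:
        print(f"{metric.description} -> {measure.name} ({measure.collection_method})")
```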
Examples of candidate metrics • H1: Support gathering the same type of information with a lower cognitive workload • # queries/questions • % interactions where analyst takes initiative • Number of non-content interactions with system (clarifications) • Cognitive workload measurement • H7: Enable analysts to collect more data in less time • Growth of shoebox over time • Subjective assessment • H12: QA systems should provide context and continuity for the user – coherence of dialogue! • Similarity between queries – calculate shifts in dialog trails • Redundancy of documents – count how often a snippet is found more than once • Subjective assessment
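As an illustration of how two of these candidate metrics could be computed from raw session data, the following hedged sketch (Python, with a made-up log format) measures snippet redundancy and query-to-query shift via simple term overlap; the workshop's actual dialog-trail analysis may have used different formulas:

```python
# Sketch only: assumes a simple per-session log of queries and returned snippets.
from collections import Counter


def snippet_redundancy(snippets):
    """Fraction of snippet occurrences that repeat an earlier snippet."""
    if not snippets:
        return 0.0
    counts = Counter(snippets)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(snippets)


def query_shift(prev_query, next_query):
    """1 - Jaccard term overlap; higher values suggest a larger topical shift."""
    a, b = set(prev_query.lower().split()), set(next_query.lower().split())
    return 1 - len(a & b) / len(a | b) if (a | b) else 0.0


queries = [
    "biological weapons programs in country X",
    "country X anthrax production facilities",
    "export controls on dual-use equipment",
]
print([round(query_shift(q1, q2), 2) for q1, q2 in zip(queries, queries[1:])])
print(snippet_redundancy(["s1", "s2", "s1", "s3", "s1"]))  # 2 of 5 occurrences repeat -> 0.4
```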
Top-level Design of the Workshop • Systems • Domain • Scenarios • Collection • Subjects • On-site team • Block design and on-site plan
Design – Systems • HITIQA – Tomek Strzalkowski • Ferret – Sanda Harabagiu • GINKO – Stefano Bertolo • GNIST
Design – Subjects • 8 reservists (7 Navy; 1 Army) • Age: 30-54 yrs (M=40.8) • Educational background: • 1 PhD; 4 Masters; 2 Bachelors; 1 HS • Military service: 2.5-31 yrs (M=18.3) • Analysis Experience: 0-23 yrs (M=10.8)
Data Collection Instruments • Questionnaires • Post-scenario (SCE) • Post-session (SES) • Post-system (SYS) • Cross-evaluation of reports • Cognitive workload • Glass Box • System Logs • Mood indicator • Status reports • Debriefing • Observer notes • Scenario difficulty assessment
Questionnaires • Coverage for 14/15 hypotheses • Other question types: • All SCE questions relate to scenario content • 3 SYS questions on Readiness • 3 SYS questions on Training
Questions for Hypothesis 7 Enable analysts to collect more data in less time • SES Q2: In comparison to other systems that you normally use for work tasks, how would you assess the length of time that it took to perform this task using the [X] system? [less … same … more] • SES Q13: If you had to perform a task like the one described in the scenario at work, do you think that having access to the [X] system would help increase the speed with which you find information? [not at all … a lot] • SYS Q23: Having the system at work would help me find information faster than I can currently find it. • SYS Q6: The system slows down my process of finding information.
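Note that SYS Q6 (and, under the coding assumed here, SES Q2) is worded so that agreement indicates a slower process, so those items would need to be reverse-coded before being combined with the positively worded ones. A minimal sketch, assuming a 1–7 Likert coding (an assumption, not stated in the slides):

```python
# Sketch only: assumes a 1-7 Likert coding and that SES Q2 ("more time") and
# SYS Q6 ("slows down my process") are the negatively keyed items.
def reverse(score, scale_max=7):
    """Flip a negatively keyed item so that higher always means 'better'."""
    return (scale_max + 1) - score


def h7_score(ses_q2, ses_q13, sys_q23, sys_q6):
    items = [reverse(ses_q2), ses_q13, sys_q23, reverse(sys_q6)]
    return sum(items) / len(items)


print(h7_score(ses_q2=3, ses_q13=6, sys_q23=6, sys_q6=2))  # -> 5.75
```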
Additional Analysis for Questionnaire Data – Factor Analysis • Four factors emerged • Factor 1: most questions • Factor 2: time, navigation, training • Factor 3: novel information, new way of searching • Factor 4: skill in using the system improved • These factors distinguished between the four systems: each system stood out from the others (positively or negatively) on one factor • GNIST was associated with Factor 2: positive for navigation and training; negative for time.
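For readers unfamiliar with the technique, here is a hedged sketch of this kind of analysis using scikit-learn's FactorAnalysis on a synthetic response matrix; the workshop's actual software, rotation, and item set are not specified here, so everything below is illustrative:

```python
# Sketch only: extracts four factors from a synthetic questionnaire matrix.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Hypothetical response matrix: one row per (subject, system) pair,
# one column per questionnaire item, coded on a 1-7 Likert scale.
responses = rng.integers(1, 8, size=(32, 20)).astype(float)

fa = FactorAnalysis(n_components=4, rotation="varimax", random_state=0)
scores = fa.fit_transform(responses)   # factor scores per (subject, system) row
loadings = fa.components_              # shape: (4 factors, 20 items)

# The items that load most heavily on each factor suggest its interpretation,
# e.g. a "time / navigation / training" factor as reported above.
for i, factor in enumerate(loadings, start=1):
    top_items = np.argsort(-np.abs(factor))[:3]
    print(f"Factor {i}: top-loading items {top_items.tolist()}")
```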
Cross Evaluation Criteria Subjects rated the reports (including their own) on seven characteristics • Covers the important ground • Avoids the irrelevant materials • Avoids redundant information • Includes selective information • Is well organized • Reads clearly and easily • Overall rating
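A small sketch (hypothetical data layout and 1–7 scale) of how such ratings can be aggregated into a per-report, per-criterion score; whether self-ratings were treated differently in the workshop is not stated, so they are simply included here:

```python
# Sketch only: hypothetical ratings keyed by (rater, report_author, criterion),
# each on an assumed 1-7 scale; self-ratings are included, as in the design.
from collections import defaultdict

ratings = [
    ("A1", "A2", "covers the important ground", 6),
    ("A2", "A2", "covers the important ground", 7),   # self-rating
    ("A3", "A2", "covers the important ground", 5),
    ("A1", "A3", "reads clearly and easily", 4),
]

by_report = defaultdict(list)
for rater, author, criterion, score in ratings:
    by_report[(author, criterion)].append(score)

report_scores = {key: sum(v) / len(v) for key, v in by_report.items()}
print(report_scores)
# {('A2', 'covers the important ground'): 6.0, ('A3', 'reads clearly and easily'): 4.0}
```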
Glass Box Data • Types of data captured: • Keystrokes • Mouse moves • Session start/stop times • Task times • Application focus time • Copy/paste events • Screen capture & audio track
System log data • # queries/questions • ‘Good’ queries/questions • Total documents delivered • # unique documents delivered • % unique documents delivered • # documents copied from • # copies
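Most of these counts fall directly out of the delivery log; a minimal sketch, assuming a simple per-query record of delivered document IDs (the real log format is not shown in the slides):

```python
# Sketch only: assumes each log entry records the query issued and the
# document IDs the system delivered in response.
log = [
    {"query": "country X bioweapons program", "delivered": ["d1", "d2", "d3"]},
    {"query": "country X anthrax facilities", "delivered": ["d2", "d4"]},
]

num_queries = len(log)
delivered = [doc for entry in log for doc in entry["delivered"]]
total_delivered = len(delivered)
unique_delivered = len(set(delivered))
pct_unique = 100.0 * unique_delivered / total_delivered if total_delivered else 0.0

print(num_queries, total_delivered, unique_delivered, f"{pct_unique:.0f}%")
# -> 2 5 4 80%
```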
What Next? • Query trails are being worked on by LCC, Rutgers and others; they will be available as part of the deliverable. • Scenario difficulty has become an independent effort with potential impact on both NIMD and AQUAINT. • We are considering an alternative implementation of the mood indicator. • Possible futures for the methodology: AQUAINT sponsors large-scale group evaluations using the metrics and methodology; each project team employs the metrics and methodology itself; or something in between.
Issues to be Addressed • What constitutes a replication of the method? the whole thing? a few hypotheses with all data methods? all hypotheses with a few data methods? • Costs associated with data collection methods • Is a comparison needed? • Baseline – if so, is Google the right one? Maybe the ‘best so far’ to keep the bar high. • Past results – can measure progress over time, but requires iterative application • ‘Currency’ of data and scenarios • Analysts are sensitive to staleness • What is the effect of updating on repeatability?
Summary of Findings (table; only the relative cost ratings, $ to $$$, for the data collection methods are recoverable here)