2004 ARDA Challenge Workshop: An Investigation of Evaluation Metrics for Analytic Question Answering (Overview) • Antonio Sanfilippo, PNNL/NWRRC • AQUAINT Phase 2 Fall Workshop, Tampa, FL
Northwestern Regional Research Center • Hosted by Pacific Northwest National Laboratory • Located in Richland, WA
Problem • The adoption of new QA technologies in the IC is hindered by the gap between the development and usage environments • There is no systematic way of ensuring that QA systems conform to the working practices of analysts • Systems may perform well in terms of accuracy yet still fail to address the needs of analysts
Solution • Develop evaluation metrics that reflect the interaction of users and QA systems to determine how and to what extent these systems meet user requirements • Determine the utility of features and functionalities • Establish and corroborate user requirements • Perform a user-centric comparison of different systems
Experimental Focus • The development of the evaluation metrics is based on empirical studies of analysts using • 3 Question Answering systems: Cycorp, LCC, SUNY@Albany • the Google search engine as the baseline system
Stakeholders • Government Champions: John Prange (ARDA), Kelcy Allwein (DIA), Mike Blair (NAVY) • Team Leaders: Emile Morse & Jean Scholtz (NIST) • Team Participants: Tomek Strzalkowski, Sharon Small, Sean Ryan, Hilda Hardy (SUNY@Albany); Sanda Harabagiu, Andy Hickl, John Williams (LCC); Stefano Bertolo (Cycorp); Paul Kantor (Rutgers University); Diane Kelly (University of North Carolina); Peter LaMonica, Chuck Messenger (AFRL); Joe Konczal (NIST); Katherine Johnson, Frank Greitzer (PNNL) • Analysts: 7 from NAVY, 1 from ARMY • Graduate Students: Robert Rittman, Aleksandra Sarcevic, Ying Sun (Rutgers University) • PNNL Oversight: Rich Quadrel (NWRRC Director), Troy Juntunen (System Installation and Connectivity), Ben Barnett, Trina Pitcher, John Calhoun, Eileen Boiling (Admin), Antonio Sanfilippo (Project Manager)
Roadmap • Feb 23: Project planning meeting (NIST) • March-April: Preparation (contracts, purchases, data collection, initial scenario development) • April 15-16: Kickoff meeting (NIST) • April-May: Finalize scenarios, metric hypotheses, and evaluation methods & materials; work with NWRRC to set up facilities for data collection at PNNL • June 7-25: Install systems at PNNL, carry out user studies with analysts, collect data • July: First version of data analysis; internal progress report and agenda for the remaining work • August: Final version of data analysis and final exam • September: Final report
Technical Approach • Construct evaluation metric hypotheses about the utility of QA systems and test these in experimental user studies • Collect data relevant to the evaluation hypotheses for 8 analysts working on 8 task assignment scenarios with 4 QA systems • Analyze the collected data to verify the utility of the evaluation metric hypotheses
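To make the 8 analysts × 8 scenarios × 4 systems design concrete, the sketch below shows one hypothetical way to rotate scenario and system assignments so that every analyst works every scenario and each system is exercised roughly equally often. The workshop's actual counterbalancing scheme is not detailed in these slides, so the rotation, identifiers, and session structure here are illustrative assumptions only.

```python
# Hypothetical sketch of a counterbalanced study schedule; the workshop's
# real assignment of analysts to scenarios and systems may differ.
from itertools import cycle

ANALYSTS = [f"analyst_{i}" for i in range(1, 9)]      # 8 analysts (placeholder IDs)
SCENARIOS = [f"scenario_{i}" for i in range(1, 9)]    # 8 task assignment scenarios
SYSTEMS = ["Cycorp", "LCC", "SUNY@Albany", "Google"]  # 3 QA systems + Google baseline


def build_schedule(analysts, scenarios, systems):
    """Rotate scenario order and system order per analyst so every analyst
    sees every scenario and the four systems are used roughly equally often
    (a simple Latin-square-style rotation)."""
    schedule = {}
    for a_idx, analyst in enumerate(analysts):
        offset = a_idx % len(systems)
        sys_cycle = cycle(systems[offset:] + systems[:offset])
        sessions = []
        for s_idx in range(len(scenarios)):
            scenario = scenarios[(a_idx + s_idx) % len(scenarios)]
            sessions.append((scenario, next(sys_cycle)))
        schedule[analyst] = sessions
    return schedule


if __name__ == "__main__":
    for analyst, sessions in build_schedule(ANALYSTS, SCENARIOS, SYSTEMS).items():
        print(analyst, sessions)
```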
Accomplishments • Results to-date from the analysis of the data collected during the user studies at PNNL indicate that • Most of the evaluation hypotheses initially set by the team proved to be useful for the user-centered assessment of QA systems • The methodology developed by the team during the course of the user studies is effective for applying these evaluation metrics • On average, the Cycorp, Albany, and LCC Question Answering systems were deemed more useful by users than the baseline system (Google)
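As an illustration of how the "more useful on average than the baseline" comparison could be computed, the sketch below aggregates made-up Likert-style usefulness ratings into a per-system mean. The actual measures, scales, and values from the PNNL user studies are reported in the NIST data analysis; everything here is a placeholder.

```python
# Illustrative aggregation of hypothetical per-session usefulness ratings.
from collections import defaultdict
from statistics import mean

# Made-up (system, usefulness rating) pairs standing in for questionnaire data.
ratings = [
    ("Cycorp", 5), ("LCC", 6), ("SUNY@Albany", 5), ("Google", 4),
    ("Cycorp", 6), ("LCC", 5), ("SUNY@Albany", 6), ("Google", 3),
]

by_system = defaultdict(list)
for system, score in ratings:
    by_system[system].append(score)

for system, scores in sorted(by_system.items()):
    print(f"{system}: mean usefulness {mean(scores):.2f} over {len(scores)} sessions")
```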
Results & Benefits • The workshop delivered a set of tested user-centric evaluation criteria and a methodology for applying these criteria to gain knowledge about how QA systems meet the needs of analysts • The availability of user-centric evaluation metrics enables a systematic methodology for tailoring the utility of QA systems to the specific needs of the Intelligence Community • Target the features and functionalities that are most impactful • Facilitate technology insertion
Assessment • The work has been carried out on schedule and with great precision, attention to detail, and high technical standards • Results indicate that the Workshop will be impactful in establishing a user-centered evaluation framework for interactive information systems • Results will be presented in the next talk by Emile Morse • A version of the methodology developed will be demonstrated in today's exercise
Parting Shots • Views from the June Challenge problem in Richland