  1. 2004 ARDA Challenge Workshop: An Investigation of Evaluation Metrics for Analytic Question Answering. Overview. Antonio Sanfilippo, PNNL/NWRRC. AQUAINT Phase 2 Fall Workshop, Tampa, FL

  2. Northwestern Regional Research Center • Hosted by Pacific Northwest National Laboratory • Located in Richland, WA

  3. Problem • The adoption of new QA technologies in the IC is hindered by the gap between the development and usage environments • There is no systematic way of ensuring that QA systems conform to the working practices of analysts • Systems may perform well in terms of accuracy yet still fail to address the needs of analysts

  4. Solution • Develop evaluation metrics that reflect the interaction of users and QA systems to determine how and to what extent these systems meet user requirements • Determine the utility of features and functionalities • Establish and corroborate user requirements • Perform a user-centric comparison of different systems (see the illustrative scoring sketch below)
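
As a purely illustrative sketch of how a user-centric utility measure could be computed from such interaction data, the Python snippet below averages per-task analyst ratings by system. The record layout, Likert scale, system labels, and simple averaging scheme are assumptions made for this example, not the workshop's actual metrics.

from collections import defaultdict
from statistics import mean

# Each record: (system, analyst, scenario, rating on a 1-5 Likert scale).
# All names and values are made-up placeholders.
ratings = [
    ("SystemA", "analyst1", "scenario1", 4),
    ("SystemA", "analyst2", "scenario2", 5),
    ("Baseline", "analyst1", "scenario1", 3),
    ("Baseline", "analyst2", "scenario2", 2),
]

def utility_by_system(records):
    """Average rating per system across analysts and scenarios."""
    by_system = defaultdict(list)
    for system, _analyst, _scenario, rating in records:
        by_system[system].append(rating)
    return {system: mean(vals) for system, vals in by_system.items()}

print(utility_by_system(ratings))  # e.g. {'SystemA': 4.5, 'Baseline': 2.5}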

  5. Experimental Focus • The development of the evaluation metrics is based on empirical studies of analysts using • 3 Question Answering systems: Cycorp, LCC, SUNY@Albany • the Google search engine as the baseline system

  6. Stakeholders • Government Champions • John Prange (ARDA) • Kelcy Allwein (DIA) • Mike Blair (NAVY) • Team Leaders • Emile Morse & Jean Scholtz (NIST) • Team Participants • Tomek Strzalkowski, Sharon Small, Sean Ryan, Hilda Hardy (SUNY@Albany) • Sanda Harabagiu, Andy Hickl, John Williams (LCC) • Stefano Bertolo (Cycorp) • Paul Kantor (Rutgers University) • Diane Kelly (University of North Carolina) • Peter LaMonica, Chuck Messenger (AFRL) • Joe Konczal (NIST) • Katherine Johnson, Frank Greitzer (PNNL) • Analysts: 7 from NAVY, 1 from ARMY • Graduate Students: Robert Rittman, Aleksandra Sarcevic, Ying Sun (Rutgers University) • PNNL Oversight • Rich Quadrel (NWRRC Director) • Troy Juntunen (System Installation and Connectivity) • Ben Barnett, Trina Pitcher, John Calhoun, Eileen Boiling (Admin) • Antonio Sanfilippo (Project Manager)

  7. Roadmap • Feb 23: Project planning meeting (NIST) • March-April: Preparation (contracts, purchases, data collection, initial scenario development) • April 15-16: Kickoff meeting (NIST) • April-May: Finalize scenarios, metric hypotheses, and evaluation methods & materials; work with NWRRC to set up facilities for data collection at PNNL • June 7-25: Install systems at PNNL; carry out user studies with analysts; collect data • July: First version of data analysis; internal progress report and agenda for the remaining work • August: Final version of data analysis and final exam • September: Final report

  8. Technical Approach • Construct evaluation metric hypotheses about the utility of QA systems and test these in experimental user studies • Collect data relevant to the evaluation hypotheses for 8 analysts working on 8 task assignment scenarios with 4 QA systems (see the counterbalancing sketch below) • Analyze the collected data to verify the utility of the evaluation metric hypotheses
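
To make the 8-analyst, 8-scenario, 4-system setup concrete, here is a minimal Python sketch of one way such a study could be counterbalanced. The rotation scheme, labels, and schedule structure are assumptions for illustration only, not the study's actual assignment plan.

# A minimal counterbalancing sketch, assuming a rotated (Latin-square-style)
# assignment of scenarios and systems to analysts; the workshop's actual
# experimental design may differ.
ANALYSTS = [f"analyst{i + 1}" for i in range(8)]
SCENARIOS = [f"scenario{i + 1}" for i in range(8)]
SYSTEMS = ["Cycorp", "LCC", "SUNY@Albany", "Google (baseline)"]

def build_schedule(analysts, scenarios, systems):
    """Rotate scenario and system order per analyst so that each analyst
    works every scenario exactly once and uses each system equally often."""
    schedule = []
    for a_idx, analyst in enumerate(analysts):
        for step in range(len(scenarios)):
            scenario = scenarios[(a_idx + step) % len(scenarios)]
            system = systems[(a_idx + step) % len(systems)]
            schedule.append((analyst, scenario, system))
    return schedule

if __name__ == "__main__":
    for row in build_schedule(ANALYSTS, SCENARIOS, SYSTEMS)[:4]:
        print(row)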

  9. Methodology

  10. Accomplishments • Results to date from the analysis of the data collected during the user studies at PNNL indicate that • Most of the evaluation hypotheses initially set by the team proved useful for the user-centered assessment of QA systems • The methodology developed by the team during the course of the user studies is effective for applying these evaluation metrics • On average, users deemed the Cycorp, Albany and LCC Question Answering systems more useful than the baseline system (Google)

  11. Results & Benefits • The workshop delivered a set of tested user-centric evaluation criteria and a methodology for applying these evaluation criteria to gain knowledge about how QA systems meet the needs of analysts • The availability of user-centric evaluation metrics enables a systematic methodology for tailoring the utility of QA systems to the specific needs of the Intelligence Community • Target feature and functionalities that are most impactful • Facilitate technology insertion

  12. Assessment • The work has been carried out on schedule, with great precision, attention to detail, and high technical standards • Results indicate that the Workshop will be impactful in establishing a user-centered evaluation framework for interactive information systems • Results will be presented in the next talk by Emile Morse • A version of the methodology developed will be demonstrated in today’s exercise

  13. Parting Shots • Views from the June Challenge problem in Richland

  14. Thank You!
