Evaluating SME-Elicited Knowledge
Julie Fitzgerald, Mike Pool, Bob Schrag
Information Extraction and Transport, Inc., Arlington, Virginia
August 9, 2003
Background Information
• Evaluations were conducted for DARPA's Rapid Knowledge Formation (RKF) program
• The goal was the development of tools for eliciting formalized knowledge from KR-naïve subject matter experts (SMEs)
• Two system integrators
  • Cycorp developed the KRAKEN system
    • Large knowledge-based system, primarily NL-driven
  • SRI developed the SHAKEN system
    • Smaller, more modular knowledge-based system, primarily graphical
• Program evaluation was driven by Challenge Problems
  • Y1 (summer 2001, January 2002): Cell Biology
    • Asked: "Can SMEs teach RNA transcription to a knowledge base?"
  • Y2 (September 2002): COA Critiquing
    • Asked: "Can SMEs teach systems to critique a rudimentary military course of action (COA) with respect to a number of critiquing criteria?"
RKF Evaluation Objectives
• Can subject matter experts (SMEs) author KBs using RKF tools and processes?
• How good is the authored knowledge?
• How well does it work w.r.t. a performance task (e.g., textbook question answering, military course of action (COA) critiquing)?
• How general / robust is the system?
• How much of the knowledge was reused / can be reused?
• How do KBs built by SMEs compare to KBs built by KEs?
• Who (between KEs and SMEs) did what?
• What kinds of knowledge did SMEs build well?
• How long did it take?
• What was enjoyable for SMEs?
Overview
• Y1 Cell Biology CP Evaluation
• Y2 COA Critiquing CP Evaluation
• General RKF Evaluation Considerations
  • Meaningfulness of results
  • Evaluation methods
  • Target audience
  • Effects of human users
  • Types of users
  • Evaluation duration
  • User interactions and metrics
  • Challenge Problems
Evaluation Methods
• What was evaluated?
  • Functional Performance
    • Subjective metrics
    • Test questions based on a section from a cell biology textbook (Y1)
    • COA Diagnostic, COA Critiquing (Y2)
  • Economics
    • Objective metrics
    • Volume, rates, reuse (see the sketch below)
  • Intrinsic KB Quality
    • Subjective and non-metric assessment of KBs
    • Quality Review Panels
  • Intrinsic Tool Quality
    • Subjective and non-metric feedback from SMEs
    • Expert Knowledge Challenge Problem work
    • Questionnaires
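The economics measures reduce to simple ratios over assertion counts and authoring time. The following is a minimal sketch of such a roll-up, assuming a hypothetical per-session log of new and reused assertions; the field names and session format are illustrative, not the actual RKF instrumentation.

```python
from dataclasses import dataclass

@dataclass
class Session:
    """Hypothetical log of one SME authoring session."""
    hours: float            # time spent authoring
    new_assertions: int     # assertions entered for the first time
    reused_assertions: int  # assertions drawn from prior / library knowledge

def economics_metrics(sessions: list[Session]) -> dict[str, float]:
    """Compute illustrative volume, rate, and reuse figures."""
    total_new = sum(s.new_assertions for s in sessions)
    total_reused = sum(s.reused_assertions for s in sessions)
    total_hours = sum(s.hours for s in sessions)
    total = total_new + total_reused
    return {
        "volume": total,                                        # total assertions authored
        "rate_per_hour": total / total_hours if total_hours else 0.0,
        "reuse_fraction": total_reused / total if total else 0.0,
    }

if __name__ == "__main__":
    log = [Session(3.0, 120, 40), Session(2.5, 80, 65)]
    print(economics_metrics(log))
```

As noted on the next slide, even these "objective" numbers are only comparable within a single KR system, since what counts as one assertion differs across ontologies.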
Problems with Evaluation Methods
• The mix of methods makes it difficult to know what conclusions to draw
• Each evaluation evolved to improve evaluation mechanics
• Subjective methods are only as good as the evaluators and the specification of the subjective measures
• Objective evaluations can give a false sense of confidence
  • Across teams, even objective measures such as counts of assertions are not clear-cut
  • Different KR systems and ontologies make it difficult to compare across systems regarding, e.g.,
    • Number of assertions
    • Reuse statistics
    • Quality of knowledge entered, answers generated
• Users of mixed skills and abilities
Evaluations Served Two Masters
• Report card for the funding agency that reveals progress and significance
  • Knowledge acquisition rates, reuse levels
  • SME vs. standard
    • SME KB vs. textbook / canonical SME answers
    • SME KB vs. KE KBs
    • SME KB vs. SME's own answers
Evaluations Served Two Masters (cont.)
• Technologists want evaluations too
  • A good evaluation should also be a service to the evaluees and help to focus or refocus development
• Identify and characterize:
  • Accomplishments
  • Shortcomings
  • Limitations
• Diagnose performance (see the sketch below)
  • Characterize question types
  • Detailed scoring criteria
  • Quality Review Panel (QRP) reviews
  • SME questionnaires
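To make an evaluation diagnostic rather than just a grade, per-question scores can be rolled up by question type so teams can see where their systems fall short. Below is a minimal sketch of that roll-up under assumed inputs: the question categories, the 0–3 score scale, and the record format are made up for illustration, not taken from the actual RKF scoring data.

```python
from collections import defaultdict

# Hypothetical grader records: (question_type, score) on an assumed 0-3 scale.
graded = [
    ("definition", 3), ("definition", 2),
    ("process", 1), ("process", 0), ("process", 2),
    ("comparison", 3),
]

def breakdown_by_type(records):
    """Average score and question count per question type."""
    buckets = defaultdict(list)
    for qtype, score in records:
        buckets[qtype].append(score)
    return {
        qtype: {"n": len(scores), "mean": sum(scores) / len(scores)}
        for qtype, scores in buckets.items()
    }

for qtype, stats in sorted(breakdown_by_type(graded).items()):
    print(f"{qtype:12s} n={stats['n']}  mean={stats['mean']:.2f}")
```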
Evaluation Usefulness Tension
• Conflict of interests between contracting agencies and researchers
  • Agencies want to see progress
  • Researchers want to do work
• Evaluations take time and resources
  • They may not show progress
  • They take time away from work
Importance of User Characterization
• Across the two years, the RKF evaluation involved both AI-naïve users and trained KEs
• In evaluating how well a system enables users to do something, the nature of the user is important
  • Systems that made sense to KEs were often baffling to SMEs
• RKF did not invest resources in analyzing how the different types of users interacted with the systems
  • These evaluations focused very much on the systems and on what was produced using them
  • They ignored the interactions as a target for evaluation, except in SME questionnaires
Evaluation Duration
• RKF evaluated experimental systems
  • As such, they were works in progress
  • Human users were… only human
• In RKF Year 1, the evaluation period was quite long (over four weeks)
  • As a result, the systems received a good workout, but so did the users
  • Frustration resulted from bugs and from evaluation mechanisms that kept users isolated
  • Productivity dropped off as the summer progressed
  • Patience was non-existent by the end of the summer
• In RKF Year 2, the evaluation was shorter but the task was more complex
  • Users felt there was more they wanted to teach the systems
  • The evaluation was not long enough for the task at hand
What Is Being Tested?
• In RKF, the systems were the focus of the evaluation
  • The systems were supposed to be designed to aid users in developing knowledge
• In Year 1, users were kept apart from technology developers to make the experiment purer
• In Year 2, interaction was allowed and encouraged
  • User interaction had to be characterized so that its effects could be taken into account when evaluating results
  • Sufficient metrics for this purpose have not been stated
• KR is a very creative process
  • The process would need to be teased apart further to trace the contributions of KEs versus SMEs
Scientific Validity
• We wanted to isolate variables, e.g., determine which factor(s) led to performance differences
• For this, we needed:
  • a) Quantity of data
    • Longer evaluations and/or more users
    • Enough data to establish that differences are statistically significant (see the sketch below)
  • b) To avoid ceiling and floor effects
    • Diversity in the kinds and difficulty of performance tasks
    • A sufficient, but not misleading, amount of granularity in scoring
  • c) An effort to isolate variables appropriately
    • Identify controls
    • Characterize users systematically
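As an illustration of the kind of analysis more data would support, the sketch below runs Welch's two-sample t-test on per-question scores from two hypothetical user groups (e.g., KBs built by SMEs vs. by KEs). The scores and group sizes are invented; with only a handful of users per condition, as in RKF, such a test would rarely reach significance.

```python
from scipy import stats

# Hypothetical per-question scores (0-3 scale) for KBs built by two user groups.
sme_scores = [2, 3, 1, 2, 2, 3, 1, 2]
ke_scores  = [3, 3, 2, 2, 3, 3, 2, 3]

# Welch's t-test: does the mean score differ between the groups?
t_stat, p_value = stats.ttest_ind(sme_scores, ke_scores, equal_var=False)

print(f"SME mean = {sum(sme_scores)/len(sme_scores):.2f}, "
      f"KE mean = {sum(ke_scores)/len(ke_scores):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# With samples this small, p usually stays well above 0.05 -- which is
# exactly why longer evaluations and more users were wanted.
```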
Challenge Problems
• CP objectives
  • Test technology
    • Feedback for DARPA: we're doing what we should be doing
    • Feedback for teams: here's where you can improve
  • Focus development
    • CPs provide a theme and make collaboration more targeted
    • Possibility of developers (just) "teaching to the test"
  • Reflect development
    • Show off what can be shown off
Challenge Problem Lessons Learned
• Importance of a strong evaluation focus
  • Gives technology a communal focus
• Competitions are not always necessary
  • They can promote unhealthy levels of competition
  • They shift the focus to grades rather than results
• There can be several Challenge Problem foci
  • But you do sacrifice comparability
• Evaluation is a collaborative sport
  • Evaluators need to listen to the tech providers
  • Tech providers must accept a bar set slightly higher than their comfort level
Challenge Problem Lessons Learned (cont.)
• The evaluation methodology needs to be well known
  • Get specs out early and hammer out a consensus
  • Dry runs iron out the wrinkles
• Mini-evaluations keep the data coming and allow teams to continually test and improve their systems
• Targeted testing can really focus on particular system components
• Subjectivity can be managed with good criteria
For More Information…
• IET's RKF page: www.iet.com/Projects/RKF/
• Y1 Spec: http://www.iet.com/Projects/RKF/TKCP-spec--v2.1.doc
• Y2 Spec: http://www.iet.com/Projects/RKF/COA-CP-spec--v1.2.doc
• Y1 Evaluation Paper: http://www.iet.com/Projects/RKF/PerMIS02.doc
  • Schrag, R., et al., "Experimental Evaluation of Subject Matter Expert-oriented Knowledge Base Authoring Tools," Measuring the Performance and Intelligence of Systems: Proceedings of the 2002 PerMIS Workshop, August 13-15, 2002, NIST Special Publication 990, pp. 272-279.
• Y2 Evaluation Paper: http://www.iet.com/Projects/RKF/QRP02/KCAP-03-COACritiquing.pdf
  • Pool, M., Murray, K., Fitzgerald, J., Mehrotra, M., Schrag, R., Blythe, J., Kim, J., Chalupsky, H., Miraglia, P., Russ, T., and Schneider, D., "Evaluating SME Authored COA Critiquing Knowledge," submitted to K-CAP, 2003.