
Evaluating SME-Elicited Knowledge


Presentation Transcript


  1. Evaluating SME-Elicited Knowledge
  Julie Fitzgerald, Mike Pool, Bob Schrag
  Information Extraction and Transport, Inc.
  Arlington, Virginia
  August 9, 2003

  2. Background Information
  • Evaluations conducted for DARPA's Rapid Knowledge Formation (RKF) program
  • Goal was development of tools for eliciting formalized knowledge from KR-naïve subject matter experts (SMEs)
  • Two system integrators
    • Cycorp developed the KRAKEN system
      • Large knowledge-based system, primarily NL-driven
    • SRI developed the SHAKEN system
      • Smaller, more modular knowledge-based system, primarily graphical
  • Program evaluation was driven by Challenge Problems
    • Y1 (summer 2001, January 2002): Cell Biology
      • Asked: "Can SMEs teach RNA transcription to a knowledge base?"
    • Y2 (September 2002): COA Critiquing
      • Asked: "Can SMEs teach systems to critique a rudimentary military course of action (COA) w.r.t. a number of critiquing criteria?"

  3. RKF Evaluation Objectives
  • Can subject matter experts (SMEs) author KBs using RKF tools / processes?
  • How good is the authored knowledge?
    • How well does it work w.r.t. a performance task (e.g., textbook question answering, military course of action (COA) critiquing)?
    • How general / robust is the system?
    • How much of the knowledge was reused / can be reused?
  • How do KBs built by SMEs compare to KBs built by knowledge engineers (KEs)?
  • Who (between KEs and SMEs) did what?
  • What kinds of knowledge did SMEs build well?
  • How long did it take?
  • What was enjoyable for SMEs?

  4. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Evaluation methods
    • Target audience
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  5. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Evaluation methods
    • Target audience
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  6. Evaluation Methods
  • What was evaluated?
    • Functional Performance
      • Subjective metrics
      • Test questions based on a section from a cell biology textbook (Y1)
      • COA Diagnostic, COA Critiquing (Y2)
    • Economics
      • Objective metrics
      • Volume, rates, reuse
    • Intrinsic KB Quality
      • Subjective, non-metric assessment of KBs
      • Quality Review Panels
    • Intrinsic Tool Quality
      • Subjective, non-metric feedback from SMEs
      • Expert Knowledge Challenge Problem work
      • Questionnaires
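To make the "Economics" measures above concrete, here is a minimal sketch of computing volume, entry rate, and reuse from a per-assertion log. The record format, field names, and reuse criterion are assumptions for illustration only; this is not the actual RKF scoring code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Hypothetical record of a single KB assertion authored during the evaluation.
# Field names and the reuse criterion are illustrative assumptions only.
@dataclass
class AssertionRecord:
    author: str            # e.g., "SME-3" or "KE-1"
    timestamp: datetime    # when the assertion was entered
    reused_concept: bool   # True if it reuses a pre-existing library concept

def economics_metrics(records: List[AssertionRecord]) -> dict:
    """Compute simple volume, rate, and reuse statistics for one KB."""
    volume = len(records)
    if volume == 0:
        return {"volume": 0, "rate_per_hour": 0.0, "reuse_fraction": 0.0}
    hours = (max(r.timestamp for r in records) -
             min(r.timestamp for r in records)).total_seconds() / 3600 or 1.0
    reused = sum(r.reused_concept for r in records)
    return {
        "volume": volume,                   # total assertions entered
        "rate_per_hour": volume / hours,    # knowledge-entry rate
        "reuse_fraction": reused / volume,  # share of assertions reusing prior knowledge
    }
```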

  7. Problems with Evaluation Methods
  • Mix of methods makes it difficult to know what conclusions to draw
  • Each evaluation evolved to improve evaluation mechanics
  • Subjective methods are only as good as the evaluators and the specification of the subjective measures
  • Objective evaluations can give a false sense of confidence
    • Across teams, even objective measures such as counts of assertions are not clear cut
  • Different KR systems and ontologies make it difficult to compare across systems regarding, e.g.,
    • Number of assertions
    • Reuse statistics
    • Quality of knowledge entered, answers generated
  • Users of mixed skills and abilities
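To see why raw assertion counts are not comparable across KR systems, consider the following toy illustration; the encodings are invented stand-ins, not actual KRAKEN or SHAKEN representations. One frame-style assertion can carry the same content as several triple-style assertions, so the count reflects representation granularity as much as knowledge volume.

```python
# Illustrative only: two invented encodings of the same fact,
# "RNA polymerase binds the promoter during transcription initiation."

# Frame-style system: one nested assertion.
frame_style_kb = [
    {"event": "TranscriptionInitiation",
     "agent": "RNA-Polymerase",
     "object": "Promoter",
     "relation": "binds"},
]

# Triple-style system: the same content as three atomic assertions.
triple_style_kb = [
    ("TranscriptionInitiation", "hasAgent", "RNA-Polymerase"),
    ("TranscriptionInitiation", "hasObject", "Promoter"),
    ("RNA-Polymerase", "binds", "Promoter"),
]

# Identical knowledge content, very different "number of assertions".
print(len(frame_style_kb), "vs", len(triple_style_kb))   # 1 vs 3
```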

  8. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Evaluation methods
    • Target audience
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  9. Evaluations served two masters
  • Report card for the funding agency that reveals progress and significance
    • Knowledge acquisition rates, reuse levels
    • SME vs. standard
      • SME KB vs. textbook / canonical SME answers
      • SME KB vs. KE KBs
      • SME KB vs. SME's own answers

  10. Evaluations served two masters
  • Technologists want evaluations too
    • A good evaluation should also be a service to the evaluees and help to focus / refocus development
  • Identify and characterize:
    • Accomplishments
    • Shortcomings
    • Limitations
  • Diagnose performance
    • Characterize question types
    • Detailed scoring criteria
    • Quality Review Panel (QRP) reviews
    • SME questionnaires
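In RKF, the "SME KB vs. canonical answers" comparisons listed on the previous slide were graded by human reviewers against detailed scoring criteria; purely as a hypothetical illustration of that kind of comparison, the sketch below scores a system answer by its overlap with canonical key points. The key-point measure and all data are invented.

```python
# Hypothetical illustration of scoring a system answer against a canonical answer.
# In RKF, scoring was done by human reviewers against detailed criteria; the
# key-point overlap measure below is an invented simplification.

canonical_key_points = {
    "rna polymerase binds the promoter",
    "the dna strands separate locally",
    "rna is synthesized 5' to 3'",
}

system_answer_points = {
    "rna polymerase binds the promoter",
    "rna is synthesized 5' to 3'",
    "ribosomes translate the mrna",     # not part of the canonical answer
}

matched = canonical_key_points & system_answer_points
recall = len(matched) / len(canonical_key_points)        # coverage of the canonical answer
precision = len(matched) / len(system_answer_points)     # how much of the answer was on target

print(f"recall={recall:.2f}, precision={precision:.2f}")  # recall=0.67, precision=0.67
```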

  11. Evaluation Usefulness Tension
  • Conflict of interest between contracting agencies and researchers
    • Agencies want to see progress
    • Researchers want to do work
  • Evaluations take time and resources
    • They may not show progress
    • They take time away from work

  12. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Target audience
    • Evaluation methods
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  13. Importance of User Characterization
  • Across the two years, RKF evaluation involved both AI-naïve users and trained KEs
  • In evaluating how well a system enables a user to do something, the nature of the user is important
    • Systems that made sense to KEs were often baffling to SMEs
  • RKF did not invest resources in analyzing how the different types of users interacted with the systems
    • These evaluations were very much focused on the systems and on what was produced using them
    • They ignored the interactions as a target for evaluation, except in SME questionnaires

  14. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Target audience
    • Evaluation methods
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  15. Evaluation Duration
  • RKF evaluated experimental systems
    • As such, they were works in progress
  • Human users were… only human
  • In RKF Year 1, the evaluation period was quite long (over four weeks)
    • As a result, the systems received a good workout, but so did the users
    • Frustration resulted from bugs and from evaluation mechanisms that kept users isolated
    • Productivity dropped off as the summer progressed
    • Patience was non-existent by the end of the summer
  • In RKF Year 2, the evaluation was shorter but the task was more complex
    • Users felt there was more they wanted to teach the systems
    • The evaluation was not long enough for the task at hand

  16. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Target audience
    • Evaluation methods
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  17. What is being tested?
  • In RKF, the systems were the focus of the evaluation
    • Systems were supposed to be designed to aid users in developing knowledge
  • In Year 1, users were kept apart from technology developers to make the experiment purer
  • In Year 2, interaction was allowed and encouraged
    • User interaction had to be characterized so that its effects could be taken into account when evaluating results
    • Sufficient metrics for this purpose have not been stated
  • KR is a very creative process
    • The process would need to be teased apart further to trace the contributions of KEs versus SMEs (see the sketch after this slide)
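One way to make the missing interaction metrics concrete would be a per-assertion provenance log from which KE vs. SME contribution shares could be computed. The sketch below is only an assumed design; the roles, field names, and metric are not taken from the actual KRAKEN or SHAKEN instrumentation.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical provenance record for one authored assertion; the roles and
# field names are assumptions, not part of the actual RKF instrumentation.
@dataclass
class ProvenanceRecord:
    assertion_id: str
    entered_by: str       # "SME" or "KE"
    assisted_by_ke: bool  # True if a KE reformulated or repaired the entry

def contribution_shares(log):
    """Summarize who contributed what, as a first cut at characterizing interaction."""
    by_role = Counter(r.entered_by for r in log)
    assisted = sum(1 for r in log if r.entered_by == "SME" and r.assisted_by_ke)
    total = len(log) or 1
    return {
        "sme_share": by_role["SME"] / total,
        "ke_share": by_role["KE"] / total,
        "sme_entries_needing_ke_help": assisted / max(by_role["SME"], 1),
    }

# Toy usage with invented data.
log = [
    ProvenanceRecord("a1", "SME", False),
    ProvenanceRecord("a2", "SME", True),
    ProvenanceRecord("a3", "KE", False),
]
print(contribution_shares(log))
```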

  18. Scientific Validity
  • We wanted to isolate variables, e.g., determine which factor(s) led to performance differences
  • For this, we needed:
    • a) Quantity of data
      • Longer evaluations and/or more users
      • Enough data to help establish that differences are statistically significant (illustrated below)
    • b) To avoid ceiling and floor effects
      • Diversity in kinds and difficulty of performance tasks
      • Sufficient, but not misleading, granularity in scoring
    • c) An effort to isolate variables appropriately
      • Identify controls
      • Characterize users systematically
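As an illustration of the statistical-significance and ceiling-effect points above, here is a minimal sketch assuming per-question scores for an SME-built and a KE-built KB and using Welch's t-test from SciPy; the data are invented and the choice of test is not taken from the RKF evaluation.

```python
# Minimal sketch of testing whether an SME-vs-KE score difference is significant.
# Scores are invented; Welch's t-test is one reasonable choice, not the RKF method.
from scipy import stats

sme_scores = [0.62, 0.71, 0.55, 0.68, 0.60, 0.74, 0.58]  # per-question scores, SME-built KB
ke_scores  = [0.70, 0.78, 0.66, 0.81, 0.69, 0.75, 0.72]  # per-question scores, KE-built KB

t_stat, p_value = stats.ttest_ind(sme_scores, ke_scores, equal_var=False)
print(f"t={t_stat:.2f}, p={p_value:.3f}")

# Quick check for a ceiling effect: scores bunched near the maximum leave
# little room to observe differences between conditions.
for name, scores in [("SME", sme_scores), ("KE", ke_scores)]:
    near_max = sum(s > 0.95 for s in scores) / len(scores)
    print(f"{name}: fraction of near-ceiling scores = {near_max:.2f}")
```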

  19. Overview
  • Y1 Cell Biology CP Evaluation
  • Y2 COA Critiquing CP Evaluation
  • General RKF Evaluation Considerations
    • Meaningfulness of results
    • Target audience
    • Evaluation methods
    • Effects of human users
    • Types of users
    • Evaluation duration
    • User interactions and metrics
    • Challenge Problems

  20. Challenge Problems
  • CP Objectives
    • Test technology
      • Feedback for DARPA: we're doing what we should be doing
      • Feedback for teams: here's where you can improve
    • Focus development
      • CPs provide a theme and make collaboration more targeted
      • Possibility of developers (just) "teaching to the test"
    • Reflect development
      • Show off what can be shown off

  21. Challenge Problem Lessons Learned
  • Importance of a strong evaluation focus
    • Gives the technology a communal focus
  • Competitions are not always necessary
    • They can promote unhealthy levels of competition
    • They put the focus on grades rather than results
  • There can be several Challenge Problem foci
    • But you do sacrifice comparability
  • Evaluation is a collaborative sport
    • Evaluators need to listen to the tech providers
    • Tech providers must accept a bar set slightly higher than their comfort level

  22. Challenge Problem Lessons Learned
  • The evaluation methodology needs to be well known
    • Get specs out early and hammer out a consensus
  • Dry runs iron out the wrinkles
  • Mini-evaluations keep the data coming and allow teams to continually test and improve their systems
  • Targeted testing can really focus on particular system components
  • Subjectivity can be managed with good criteria

  23. For more information…
  • IET's RKF page: www.iet.com/Projects/RKF/
  • Y1 Spec: http://www.iet.com/Projects/RKF/TKCP-spec--v2.1.doc
  • Y2 Spec: http://www.iet.com/Projects/RKF/COA-CP-spec--v1.2.doc
  • Y1 Evaluation Paper: http://www.iet.com/Projects/RKF/PerMIS02.doc
    • Schrag, B., et al., "Experimental Evaluation of Subject Matter Expert-oriented Knowledge Base Authoring Tools," Measuring the Performance and Intelligence of Systems: Proceedings of the 2002 PerMIS Workshop, August 13-15, 2002, NIST Special Publication 990, pp. 272-279.
  • Y2 Evaluation Paper: http://www.iet.com/Projects/RKF/QRP02/KCAP-03-COACritiquing.pdf
    • Pool, M., Murray, K., Fitzgerald, J., Mehrotra, M., Schrag, R., Blythe, J., Kim, J., Chalupsky, H., Miraglia, P., Russ, T., Schneider, D., "Evaluating SME Authored COA Critiquing Knowledge," submitted to K-CAP, 2003.
