Empirical Evaluations of Organizational Memory Information Systems Felix-Robinson Aschoff & Ludger van Elst
Empirical Evaluations of OMIS 1. Evaluation: Definition and general approaches 2. Contributions from related fields 3. Implications for FRODO
What is Empirical Evaluation? Empirical evaluation refers to the appraisal of a theory by observation in experiments. Chin, 2001
Empirical Evaluations of OMIS 1. Evaluation: Definition and general approaches 2. Contributions from related fields 3. Implications for FRODO
Contributions from related fields
1. Knowledge Engineering
1.1 General Approaches
- The Sisyphus Initiative
- High Performance Knowledge Bases
- Essential Theory Approach
- Critical Success Metrics
1.2 Knowledge Acquisition
1.3 Ontologies
2. Human Computer Interaction
3. Information Retrieval
4. Software Engineering (Goal-Question-Metric Technique)
The Sisyphus Initiative
A series of challenge problems for the development of KBS by different research groups, with a focus on problem-solving methods (PSM):
• Sisyphus-I: Room allocation
• Sisyphus-II: Elevator configuration
• Sisyphus-III: Lunar igneous rock classification
• Sisyphus-IV: Integration over the web
• Sisyphus-V: High quality knowledge base initiative (hQkb) (Menzies, '99)
Problems of the Sisyphus Initiative
• Sisyphus I + II:
- No "higher referees"
- No common metrics
- Focus on modelling of knowledge. The effort to build a model of the domain knowledge was usually not recorded.
- Important aspects such as the accumulation of knowledge and cost-effectiveness calculations received no attention.
• Sisyphus III:
- Funding
- Willingness of researchers to participate
"...none of the Sisyphus experiments have yielded much evaluation information (though at the time of this writing Sisyphus-III is not complete)" (Shadbolt et al. '99)
High Performance Knowledge Bases
• run by the Defense Advanced Research Projects Agency (DARPA) in the USA
• goal: to increase the rate at which knowledge can be modified in a KBS
• three groups of researchers:
1) challenge problem developers
2) technology developers
3) integration teams
HPKB Challenge Problem
• International crisis scenario in the Persian Gulf:
- Hostilities between Saudi Arabia and Iran
- Iran closes the Strait of Hormuz to international shipping
• Integration of the following KBs:
- the HPKB upper-level ontology (Cycorp)
- the World Fact Book knowledge base (Central Intelligence Agency)
- the Units and Measures Ontology (Stanford)
• Example questions the system should be able to answer:
- With what weapons is Iran capable of firing upon tankers in the Strait of Hormuz?
- What risk would Iran face in closing the strait to shipping?
• The answer key to the second question contains, for example:
- Economic sanctions from {Saudi Arabia, GCC, U.S., UN}, because Iran violates an international norm promoting freedom of the seas.
- Source: The Convention on the Law of the Sea
HPKB Evaluation
• The system's answers were rated by challenge problem developers and subject matter experts on four official criteria (scale: 0 to 3):
- The correctness of the answer
- The quality of the explanation of the answer
- The completeness and quality of the cited sources
- The quality of the representation of the question
• Two-phase, test-retest schedule
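As a rough illustration of how such ratings might be recorded, the sketch below validates four 0 to 3 ratings for one answer and combines them. The slides do not say how HPKB aggregated the four criteria, so the unweighted mean, the criterion keys, and the example ratings are all assumptions made for illustration.

```python
from statistics import mean

# Short keys for the four criteria named on the slide (the keys are hypothetical).
CRITERIA = ("correctness", "explanation", "sources", "representation")


def score_answer(ratings):
    """Check that each criterion is rated on the 0-3 scale, then average them.

    The unweighted mean is an assumption; the source does not state
    how the four ratings were combined.
    """
    for name in CRITERIA:
        value = ratings[name]
        if not 0 <= value <= 3:
            raise ValueError(f"{name} rating {value} is outside the 0-3 scale")
    return mean(ratings[name] for name in CRITERIA)


# Hypothetical ratings for a single answer.
example = {"correctness": 3, "explanation": 2, "sources": 1, "representation": 2}
overall = score_answer(example)
```

Keeping the per-criterion ratings alongside any aggregate makes it possible to compare the two phases of a test-retest schedule criterion by criterion.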
Essential Theory Approach Menzies & van Harmelen, 1999 Different schools of knowledge engineering
Technical evaluation of ontologies
Gómez-Pérez, 1999
1) Consistency 2) Completeness 3) Conciseness 4) Expandability 5) Sensitiveness
• Errors in developing taxonomies:
- Circularity errors
- Partition errors
- Redundancy errors
- Grammatical errors
- Semantic errors
- Incompleteness errors
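Circularity errors, the first taxonomy error listed above, lend themselves to a mechanical check. The following is a minimal sketch, assuming the taxonomy is available as a mapping from each class to its direct superclasses; the function name `find_cycle` and the toy taxonomies are hypothetical and not taken from Gómez-Pérez.

```python
def find_cycle(subclass_of):
    """Return one cycle of class names as a list, or None if there is none.

    subclass_of maps each class name to a list of its direct superclasses.
    Uses a depth-first search that tracks the classes on the current path.
    """
    UNVISITED, ON_PATH, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = ON_PATH
        path.append(node)
        for parent in subclass_of.get(node, []):
            if state.get(parent, UNVISITED) == ON_PATH:
                # The parent is already on the current path: a circularity error.
                return path[path.index(parent):] + [parent]
            if state.get(parent, UNVISITED) == UNVISITED:
                cycle = visit(parent)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for cls in list(subclass_of):
        if state.get(cls, UNVISITED) == UNVISITED:
            cycle = visit(cls)
            if cycle:
                return cycle
    return None


# Toy taxonomies (invented): the second one declares Animal a subclass of Dog.
ok_taxonomy = {"Dog": ["Mammal"], "Mammal": ["Animal"], "Animal": []}
bad_taxonomy = {"Dog": ["Mammal"], "Mammal": ["Animal"], "Animal": ["Dog"]}
```

The other error classes (partition, redundancy, and so on) would need their own checks; this sketch covers circularity only.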
Related Fields
• Knowledge Acquisition
- Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions. International Journal of Human-Computer Studies, 51, 729-755.
• Human Computer Interaction
- "HCI is the study of how people design, implement, and use interactive computer systems, and how computers affect individuals and society." (Myers et al. 1996)
- facilitate interaction between users and computer systems
- make computers useful to a wider population
• Information Retrieval
- Recall and Precision
- e.g. keyword-based IR vs. ontology-enhanced IR (Aitken & Reid, 2000)
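Recall and precision can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the document identifiers and the two result lists are invented for illustration and do not come from Aitken & Reid.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four relevant documents and two competing result lists.
relevant_docs = {"d1", "d2", "d3", "d4"}
keyword_result = ["d1", "d2", "d7", "d8", "d9"]
ontology_result = ["d1", "d2", "d3", "d7"]

keyword_scores = precision_recall(keyword_result, relevant_docs)
ontology_scores = precision_recall(ontology_result, relevant_docs)
```

Both measures require a relevance judgement for every document, which is itself an evaluation cost to plan for.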
Empirical Evaluations of OMIS 1. Evaluation: Definition and general approaches 2. Contributions from related fields 3. Implications for FRODO
Guideline for Evaluation
• Formulate the main purposes of your framework or application.
• Formulate precise hypotheses.
• Define clear performance metrics.
• Standardize the measurement of your performance metrics.
• Be thorough in designing your (experimental) research design.
• Consider the use of inferential statistics. (Cohen, 1995)
• Meet common standards for reporting your results.
Evaluation of Frameworks
Frameworks are general in scope and designed to cover a wide range of tasks and problems. The systematic control of influencing variables therefore becomes very difficult.
"Only a whole series of experiments across a number of different tasks and a number of different domains could control for all the factors that would be essential to take into account." (Shadbolt et al. 1999)
• Approaches:
- Sisyphus Initiative
- Essential Theory Approach (Menzies & van Harmelen, 1999)
Problems with the Evaluation of FRODO
• Influencing variables are difficult to control when evaluating entire frameworks
• FRODO is not a running system (yet)
• Only a few prototypical implementations are based on FRODO
• FRODO is probably underspecified for evaluation in many areas
Goal-Question-Metric Technique
[Diagram: goals (Goal 1, Goal 2) are refined into questions, and each question is linked to one or more metrics.]
Basili, Caldiera & Rombach 1994
Informal FRODO Project Goals
• FRODO will provide a flexible, scalable framework for evolutionary growth for distributed OMs
• FRODO will provide a comprehensive toolkit for the automatic or semi-automatic construction and maintenance of domain ontologies
• FRODO will improve information delivery by the OM by developing more integrated and more easily adaptable DAU techniques
• FRODO will develop a methodology and tool for business-process oriented knowledge management relying on the notion of weakly-structured workflows
• FRODO is based on the assumption that a hybrid solution, where the system supports humans in the decision-making process, is more appropriate for OMIS than mind-imitating AI systems (IA>AI)
Task Type and Workflows
Task types, from KiTs to classical workflow processes: negotiation, co-decision making, projects, workflow processes
• KiT end of the spectrum: unique, low volume, communication intensive (FRODO wf > classical wf)
• Workflow-process end of the spectrum: repetitive, high volume, heads down (FRODO wf >= classical wf)
FRODO GQM – Goal concerning workflows Conceptual level (goals) • GQM-Goals should specify: • a Purpose • a quality Issue • a measurement Object • a Viewpoint • Object of Measurement can be: • Products • Processes • Resources
GQM Questions and Metrics
Questions:
• What is the efficiency of task completion using FRODO weakly-structured flexible workflows for KiTs?
• What is the efficiency of task completion using a-priori strictly-structured workflows for KiTs?
• What is the efficiency of task completion using FRODO weakly-structured flexible workflows for classical workflow processes?
Metrics:
• Efficiency of task completion: quality of result [expert judgement] divided by the time needed for completion of the task.
• User-friendliness, judged by users
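The efficiency metric defined above is a simple ratio, so it can be sketched in a few lines. The rating scale (0 to 10), the time unit (minutes), and every observation below are hypothetical; the slides fix only the form "quality of result divided by time needed".

```python
from statistics import mean


def efficiency(quality, minutes):
    """Efficiency of task completion: expert quality rating divided by time."""
    if minutes <= 0:
        raise ValueError("completion time must be positive")
    return quality / minutes


def mean_efficiency(observations):
    """Average efficiency over (quality, minutes) pairs for one condition."""
    return mean(efficiency(q, t) for q, t in observations)


# Invented observations: (expert rating on an assumed 0-10 scale, minutes).
weak_wf = [(8, 40), (6, 30), (9, 60)]
strict_wf = [(6, 60), (5, 50), (8, 80)]

weak_mean = mean_efficiency(weak_wf)
strict_mean = mean_efficiency(strict_wf)
```

Because the metric is a ratio, a fast but mediocre result and a slow but excellent one can score the same; reporting quality and time separately alongside the ratio avoids hiding that trade-off.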
Hypotheses
H1: For KiTs, weakly-structured flexible workflows as proposed by FRODO will yield higher efficiency of task completion than a-priori strictly-structured workflows.
H2: For classical workflow processes, FRODO weakly-structured flexible workflows will be as good as a-priori strictly-structured workflows or better.
Experimental Design
• 2 x 2 factorial experiment
• Independent variables: workflows, task type
• Dependent variable: efficiency of task completion
• Within Subject Design vs. Between Subject Design
• Randomized Groups (15-20 for statistical inference)
• Possibilities: Degradation Studies, Benchmarking
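For the between-subject variant of this design, participants have to be assigned at random to the four cells formed by the two independent variables. A minimal sketch follows, assuming 60 participants and reading "15-20" as the target size per group; both the participant labels and that reading are assumptions.

```python
import random
from itertools import product

# The two levels of each independent variable, as named on the slides.
WORKFLOWS = ("FRODO weakly-structured", "a-priori strictly-structured")
TASK_TYPES = ("KiT", "classical workflow process")


def assign_between_subjects(participants, seed=0):
    """Shuffle participants and deal them round-robin into the 2 x 2 cells.

    A fixed seed keeps the assignment reproducible for the report.
    """
    cells = list(product(WORKFLOWS, TASK_TYPES))
    shuffled = list(participants)
    random.Random(seed).shuffle(shuffled)
    groups = {cell: [] for cell in cells}
    for index, person in enumerate(shuffled):
        groups[cells[index % len(cells)]].append(person)
    return groups


# Hypothetical participant labels P01 to P60, giving 15 people per cell.
participants = [f"P{i:02d}" for i in range(1, 61)]
groups = assign_between_subjects(participants)
```

A within-subject design would instead give every participant all four conditions, which needs fewer people but requires counterbalancing the order of conditions.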
Empirical Evaluation of Organizational Memory Information Systems
Felix-Robinson Aschoff & Ludger van Elst
1 Introduction
2 Contributions from Related Fields
2.1 Knowledge Engineering
2.1.1 General Methods and Guidelines (Essential Theories, Critical Success Metrics, Sisyphus, HPKB)
2.1.2 Knowledge Acquisition
2.1.3 Ontologies
2.2 Human Computer Interaction
2.3 Information Retrieval
2.4 Software Engineering (Goal-Question-Metric Technique)
3 Implications for Organizational Memory Information Systems
3.1 Implications for the evaluation of OMIS
3.2 Relevant aspects of OMs for evaluations and rules of thumb for conducting evaluative research
3.3 Preliminary sketch of an evaluation of FRODO
References
Appendix A: Technical evaluation of Ontologies
References
Aitken, S. & Reid, S. (2000). Evaluation of an ontology-based information retrieval tool. Proceedings of 14th European Conference on Artificial Intelligence. http://delicias.dia.fi.upm.es/WORKSHOP/ECAI00/accepted-papers.html
Basili, V.R., Caldiera, G. & Rombach, H.D. (1994). Goal question metric paradigm. In John J. Marciniak, editor, Encyclopedia of Software Engineering, volume 1, 528-532. John Wiley & Sons.
Berger, B., Burton, A.M., Christiansen, T., Corbridge, C., Reichelt, H. & Shadbolt, N.R. (1989). Evaluation criteria for knowledge acquisition, ACKnowledge project deliverable ACK-UoN-T4.1-DL-001B. University of Nottingham, Nottingham.
Chin, D. N. (2001). Empirical evaluation of user models and user-adapted systems. User Modeling and User-Adapted Interaction, 11: 181-194.
Cohen, P. (1995). Empirical Methods for Artificial Intelligence. Cambridge: MIT Press.
Cohen, P.R., Schrag, R., Jones, E., Pease, A., Lin, A., Starr, B., Easter, D., Gunning, D. & Burke, M. (1998). The DARPA high performance knowledge bases project. Artificial Intelligence Magazine. Vol. 19, No. 4, pp. 25-49.
Gómez-Pérez, A. (1999). Evaluation of taxonomic knowledge in ontologies and knowledge bases. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Grüninger, M. & Fox, M.S. (1995). Methodology for the design and evaluation of ontologies, Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95, Montreal.
Hays, W. L. (1994). Statistics. Orlando: Harcourt Brace.
Kagolovsky, Y. & Moehr, J.R. (2000). Evaluation of Information Retrieval: Old problems and new perspectives. Proceedings of 8th International Congress on Medical Librarianship. http://www.icml.org/tuesday/ir/kagalovosy.htm
Martin, D.W. (1995). Doing Psychological Experiments. Pacific Grove: Brooks/Cole.
Menzies, T. (1999a). Critical success metrics: evaluation at the business level. International Journal of Human-Computer Studies, 51, 783-799.
Menzies, T. (1999b). hQkb - The high quality knowledge base initiative (Sisyphus V: learning design assessment knowledge). Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Menzies, T. & van Harmelen, F. (1999). Editorial: Evaluating knowledge engineering techniques. International Journal of Human-Computer Studies, 51, 715-727.
Myers, B., Hollan, J. & Cruz, I. (Ed.) (1996). Strategic directions in human computer interaction. ACM Computing Surveys, 28, 4.
Nick, M., Althoff, K. & Tautz, C. (1999). Facilitating the practical evaluation of knowledge-based systems and organizational memories using the goal-question-metric technique. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Shadbolt, N., O'Hara, K. & Crow, L. (1999). The experimental evaluation of knowledge acquisition techniques and methods: history, problems and new directions. International Journal of Human-Computer Studies, 51, 729-755.
Tallis, M., Kim, J. & Gil, Y. (1999). User studies of knowledge acquisition tools: methodology and lessons learned. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers.html
Tennison, J., O'Hara, K. & Shadbolt, N. (1999). Evaluating KA tools: Lessons from an experimental evaluation of APECKS. Proceedings of KAW'99. http://sern.ucalgary.ca/KSI/KAW/KAW99/papers/Tennison1/
Tasks for Workflow Evaluation
Possible tasks for the workflow evaluation experiment:
KiT: Please write a report about your greatest personal learning achievements during the last semester. Find sources related to these scientific areas on the Internet. Prepare a PowerPoint presentation. To help you with this task, you will be provided with FRODO weakly-structured wf / classical workflow.
Simple structured task: Please install Netscape on your computer and use the Internet to find all universities in Iowa that offer computer science. Use e-mail to ask for further information. To help you with this task, you will be provided with FRODO weakly-structured wf / classical workflow.
GQM – Goals for CBR-PEB Conceptual level (goals) • GQM-Goals should specify: • a Purpose • a quality Issue • a measurement Object • a Viewpoint • Object of Measurement can be: • Products • Processes • Resources
GQM – Abstraction Sheet for CBR-PEB Goal 2 "Economic Utility" for CBR-PEB
GQM – Questions and Metrics
GQM plan for CBR-PEB Goal 2 "Economic Utility"
Q-9 What is the impact of the case origin on the degree of maturity?
Q-9.1 What is the case origin?
M-9.1.1 per retrieval attempt: for each chosen case: case origin [university, industrial research, industry]
Q-9.2 What is the degree of maturity of the system?
M-9.2.1 per retrieval attempt: for each chosen case: case attribute "status" ["prototype", "being developed", "pilot system", "application in practical use"; "unknown"]
FRODO GQM-Goal concerning Ontologies
For the circumstances FRODO is designed for, hybrid solutions are more successful than AI solutions.
Purpose: Compare
Issue: the efficiency of
Object (process): ontology construction and use with respect to: Stability, Sharing Scope, Formality of Information
Viewpoint: from the user's viewpoint
GQM Questions and Metrics
Questions:
• What is the efficiency of the ontology construction and use process using FRODO for a situation with high sharing scope, medium stability and low formality?
• What is the efficiency of the ontology construction and use process using FRODO for a situation with low sharing scope, high stability and high formality?
• What is the efficiency of the ontology construction and use process using AI systems for these situations?
Metrics:
• efficiency of ontology construction: number of definitions / time
• efficiency of ontology use: Information Retrieval (Recall and Precision)
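The ontology goal, its questions, and its metrics form a small goal-question-metric tree. The sketch below shows one possible way to hold such a plan in code; the class names, field names, and the shortened question texts are hypothetical, while the goal facets and the two metrics are taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)


@dataclass
class Goal:
    # The four facets a GQM goal should specify, per the slides.
    purpose: str
    issue: str
    measured_object: str
    viewpoint: str
    questions: list = field(default_factory=list)

    def metrics(self):
        """Return every distinct metric reachable from this goal, in order."""
        found = []
        for question in self.questions:
            for metric in question.metrics:
                if metric not in found:
                    found.append(metric)
        return found


CONSTRUCTION = "number of definitions / time"
USE = "recall and precision"

# Question texts are abbreviated paraphrases of the three questions above.
goal = Goal(
    purpose="compare",
    issue="efficiency",
    measured_object="ontology construction and use",
    viewpoint="user",
    questions=[
        Question("Efficiency with FRODO in situation 1?", [CONSTRUCTION, USE]),
        Question("Efficiency with FRODO in situation 2?", [CONSTRUCTION, USE]),
        Question("Efficiency with AI systems in both situations?", [CONSTRUCTION, USE]),
    ],
)
```

Walking the tree from goal to metrics gives the list of measurements to collect, and walking it in reverse shows which question each measurement is meant to answer.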
Hypotheses
H1: For Situation 1 (high sharing scope, medium stability, low formality), FRODO will yield a higher efficiency of ontology construction and use.
H2: For Situation 2 (low sharing scope, high stability, high formality), an AI system will yield a higher efficiency of ontology construction and use.
Experimental Design
• 2 x 2 factorial experiment
• Independent variables: Situation (1/2), Systems (FRODO/AI)
• Dependent variable: efficiency of ontology construction and use
• Within Subject Design vs. Between Subject Design
• Randomized Groups (15-20 for statistical inference)
Big evaluation versus small evaluation Van Harmelen, ‘98 • Distinguish different types of evaluation: • Big evaluation = evaluation of KA/KE methodologies • Small evaluation = evaluation of KA/KE components (e.g. a particular PSM) • Micro evaluation = evaluation of KA/KE product (e.g. a single system) • Some are more interesting than others: • Big evaluation is impossible to control • Micro evaluation is impossible to generalize • Small evaluation might just be the only option
Knowledge Acquisition
• Problems with the Evaluation of the KA process (Shadbolt et al., 1999):
1) the availability of human experts
2) the need for a "gold standard" of knowledge
3) the question of how many different domains and tasks should be included
4) the difficulty of isolating the value-added of a single technique or tool
5) how to quantify knowledge and knowledge engineering effort
Knowledge Acquisition
4) the difficulty of isolating the value-added of a single technique or tool
• Conduct a series of experiments
• Test different implementations of the same technique against each other or against a paper-and-pencil version
• Test groups of tools in complementary pairings or different orderings of the same set of tools
• Test the value of single sessions against multiple sessions and the effect of feedback in multiple sessions
• Exploit techniques from the evaluation of standard software to control for effects from interface, implementation etc.
• Problem: Scale-up of the experimental programme
Essential Theory Approach
• Identify a process of interest.
• Create an essential theory T for that process.
• Identify some competing process description, ¬T.
• Design a study that explores core pathways in both T and ¬T.
• Acknowledge that your study may not be definitive.
Advantage: Broad conceptual approach; results are of interest to the entire community
Problem: Interpretation of results is difficult (are differences due to the KE school or to concrete technology such as implementation, interface etc.?)
Three Aspects of Ontology Evaluation • Three aspects of evaluating ontologies: • the process of constructing the ontology • the technical evaluation • end user assessment and ontology-user interaction
Assessment and Ontology-User Interaction
"Assessment is focused on judging the understanding, usability, usefulness, abstraction, quality and portability of the definitions from the user's point of view." (Gómez-Pérez, 1999)
• Ontology-user interaction in OMIS:
- more dynamic
- success of OMIS relies on active use
- users with heterogeneous skills, backgrounds and tasks