Evaluating Ontology-Mapping Tools: Requirements and Experience
Natalya F. Noy, Mark A. Musen
Stanford Medical Informatics, Stanford University
Types Of Ontology Tools
There is not just ONE class of ontology tools:
• Development Tools: Protégé-2000, OntoEdit, OilEd, WebODE, Ontolingua
• Mapping Tools: PROMPT, ONION, OBSERVER, Chimaera, FCA-Merge, GLUE
Evaluation Parameters for Ontology-Development Tools • Interoperability with other tools • Ability to import ontologies from other languages • Ability to export ontologies to other languages • Expressiveness of the knowledge model • Scalability • Extensibility • Availability and capabilities of inference services • Usability of tools
Evaluation Parameters For Ontology-Mapping Tools
• We can try to reuse the evaluation parameters for development tools, but:
  – Development tools have similar tasks, inputs, and outputs
  – Mapping tools have different tasks, inputs, and outputs
Development Tools
• Input: domain knowledge, ontologies to reuse, requirements
• Task: create an ontology
• Output: a domain ontology
Mapping Tools: Tasks
• Merging two ontologies A and B into a single ontology C = Merge(A, B): iPROMPT, Chimaera, FCA-Merge
• Creating an articulation ontology between A and B: ONION
• Finding a mapping Map(A, B) between A and B: Anchor-PROMPT, GLUE
Mapping Tools: Inputs
• Chimaera: uses classes, slots and facets
• iPROMPT: uses classes, slots and facets
• GLUE: uses classes; requires instance data
• FCA-Merge: uses classes; requires shared instances
• OBSERVER: uses classes; requires DL definitions
Mapping Tools: Outputs and User Interaction
• GUI for interactive merging: iPROMPT, Chimaera
• Lists of pairs of related terms: Anchor-PROMPT, GLUE, FCA-Merge
• List of articulation rules: ONION
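To make the contrast between these output types concrete, here is a minimal Python sketch of the three kinds of results as data structures; the class and field names are illustrative assumptions, not the tools' actual formats:

```python
from dataclasses import dataclass, field

# NOTE: the class and field names below are illustrative assumptions,
# not the actual data formats of any of the tools.

@dataclass
class TermPair:
    """A pair of related terms, the kind of result that Anchor-PROMPT, GLUE,
    or FCA-Merge report, optionally with a confidence score."""
    term_a: str
    term_b: str
    confidence: float = 1.0

@dataclass
class ArticulationRule:
    """An articulation rule in the spirit of ONION's output: a bridge
    between expressions over the two source ontologies."""
    expression_a: str
    expression_b: str
    relation: str = "equivalent"  # e.g. "equivalent", "subclass-of"

@dataclass
class MergedOntology:
    """A single merged ontology, the end result of interactive merging
    with tools such as iPROMPT or Chimaera."""
    frames: dict = field(default_factory=dict)      # frame name -> definition
    provenance: dict = field(default_factory=dict)  # frame name -> source ontology
```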
Can We Compare Mapping Tools? • Yes, we can! • We can compare tools in the same group • How do we define a group?
Architectural Comparison Criteria
• Input requirements
  – Ontology elements: used for analysis vs. required for analysis
  – Modeling paradigm: frame-based or Description Logic
• Level of user interaction: batch mode or interactive
• User feedback: required? used?
Architectural Criteria (cont'd)
• Type of output: set of rules, ontology of mappings, list of suggestions, set of pairs of related terms
• Content of output: matching classes, matching instances, matching slots
One way to encode these criteria as tool profiles is sketched below.
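To see how the architectural criteria could carve the space of tools into comparable groups, here is a minimal Python sketch; the Profile fields and the per-tool values are illustrative assumptions, not classifications from the paper:

```python
from dataclasses import dataclass

# One illustrative way to encode the architectural criteria as a tool profile.
# Field names and the example values below are our own assumptions.
@dataclass(frozen=True)
class Profile:
    paradigm: str             # "frame-based" or "description-logic"
    requires_instances: bool  # does analysis require instance data?
    interaction: str          # "batch" or "interactive"
    output: str               # "merged ontology", "term pairs", "rules", ...

tools = {
    "iPROMPT":   Profile("frame-based", False, "interactive", "merged ontology"),
    "Chimaera":  Profile("frame-based", False, "interactive", "merged ontology"),
    "GLUE":      Profile("frame-based", True,  "batch",       "term pairs"),
    "FCA-Merge": Profile("frame-based", True,  "batch",       "term pairs"),
    "ONION":     Profile("frame-based", False, "batch",       "rules"),
}

def group_tools(tools: dict) -> list:
    """Tools with identical profiles land in the same group and can be
    compared head-to-head on performance."""
    groups = {}
    for name, profile in tools.items():
        groups.setdefault(profile, []).append(name)
    return list(groups.values())

print(group_tools(tools))
# -> [['iPROMPT', 'Chimaera'], ['GLUE', 'FCA-Merge'], ['ONION']] for these profiles
```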
From Large Pool To Small Groups
• Architectural criteria partition the space of mapping tools into small groups
• A performance criterion then compares tools within a single group
Resources Required For Comparison Experiments • Source ontologies • Pairs of ontologies covering similar domains • Ontologies of different size, complexity, level of overlap • “Gold standard” results • Human-generated correspondences between terms • Pairs of terms, rules, explicit mappings
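A "gold standard" of this kind can be as simple as a file of term pairs. A minimal loading sketch, where the file name, CSV layout, and example term names are our own assumptions:

```python
import csv

# Load a hypothetical gold-standard file of human-generated correspondences
# between terms of the two source ontologies. The file name, the term names,
# and the CSV layout are illustrative assumptions, not a prescribed format.
#
# gold_standard.csv might contain lines such as:
#   cmu:Professor,umd:Faculty
#   cmu:TechnicalReport,umd:Publication
def load_gold_standard(path="gold_standard.csv"):
    """Return a set of (term_a, term_b) pairs to score a tool's suggestions against."""
    with open(path, newline="") as f:
        return {(row[0].strip(), row[1].strip())
                for row in csv.reader(f) if len(row) >= 2}
```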
Resources Required (cont'd)
• Metrics for comparing performance
  – Precision: how many of the tool's suggestions are correct (suggestions that the user followed / suggestions that the tool produced)
  – Recall: how many of the correct matches the tool found (suggestions that the user followed / operations that the user performed)
  – Distance between ontologies
• Use of inference techniques
  – Analysis of taxonomic relationships (à la OntoClean)
• Experiment controls
  – Design
  – Protocol
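Given the counts collected during an experiment, the two metrics reduce to simple ratios. A minimal sketch; the function name and the example counts are ours, chosen only to land near the figures reported on the results slide:

```python
def precision_recall(followed: int, produced: int, performed: int):
    """Precision and recall of a tool's suggestions, per the definitions above.

    followed  -- number of suggestions that the user followed
    produced  -- number of suggestions that the tool produced
    performed -- number of merging operations that the user performed
    """
    precision = followed / produced if produced else 0.0
    recall = followed / performed if performed else 0.0
    return precision, recall

# Made-up counts: 31 of 35 suggestions followed, 32 operations performed.
print(precision_recall(31, 35, 32))  # ~ (0.886, 0.969)
```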
Where Will The Resources Come From? • Ideally, from researchers who do not belong to any of the evaluated projects • Realistically, as a by-product of stand-alone evaluation experiments
Evaluation Experiment: iPROMPT • iPROMPT is • A plug-in to Protégé-2000 • An interactive ontology-merging tool • iPROMPT uses for analysis • Class hierarchy • Slots and facet values • iPROMPT matches • Classes • Slots • Instances
Evaluation Experiment • 4 users merged the same 2 source ontologies • We measured • Acceptability of iPROMPT's suggestions • Differences in the resulting ontologies
Sources • Input: two ontologies from the DAML ontology library • CMU ontology: • Employees of an academic organization • Publications • Relationships among research groups • UMD ontology: • Individuals • CS departments • Activities
Experimental Design • Users' expertise: • Familiar with Protégé-2000 • Not familiar with PROMPT • Experiment materials: • The iPROMPT software • A detailed tutorial • A tutorial example • Evaluation files • Users performed the experiment on their own, with no questions or interaction with the developers.
Experiment Results • Quality of iPROMPT suggestions: • Recall: 96.9% • Precision: 88.6% • Resulting ontologies • Difference measure: fraction of frames that have different name and type • Ontologies differ by ~30%
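The difference measure lends itself to a simple computation. A sketch under one plausible reading (frames are matched by name and count as different when their types disagree or when a frame is missing from one result); all names and the toy data are ours, not the paper's exact algorithm:

```python
def ontology_difference(frames_a: dict, frames_b: dict) -> float:
    """Fraction of frames lacking a same-name, same-type counterpart.

    frames_a, frames_b -- mappings from frame name to frame type
    (e.g., "class", "slot", "instance") for the two resulting ontologies.
    One plausible reading of the difference measure described above.
    """
    all_names = set(frames_a) | set(frames_b)
    differing = sum(1 for name in all_names
                    if frames_a.get(name) != frames_b.get(name))
    return differing / len(all_names) if all_names else 0.0

# Toy example: 1 of 4 frames differs -> 0.25
a = {"Professor": "class", "Student": "class", "advisor": "slot", "Dept": "class"}
b = {"Professor": "class", "Student": "class", "advisor": "slot", "Dept": "instance"}
print(ontology_difference(a, b))  # 0.25
```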
Limitations In The Experiment • Only 4 participants • Variability in Protégé expertise • Recall and precision figures without comparison to other tools are not very meaningful • Need better distance metrics
Research Questions • Which pragmatic criteria are most helpful in finding the best tool for a task? • How do we develop a "gold standard" merged ontology? Does such an ontology exist? • How do we define a good distance metric to compare results to the gold standard? • Can we reuse tools and metrics developed for evaluating ontologies themselves?