Justifying Semantic-Web-based Resource Registration and Discovery

Eric Lee Peterson, Chief Ontologist, McDonald Bradley

McDonald Bradley Web Ontology Team Members • Eric Peterson – Chief Ontologist • Jon Pastor – Senior Ontologist • Phuc Nguyen – Senior Rapid Development Engineer • Maurita Soltis – Ontologist/Domain Lead • Darren Govoni – Principal Architect • Eric Monk – Senior Architect • Joseph Rajkumar – Mapping Ontologist • Chuan Shi – Mapping Ontologist • Mark Joncas – Functional Requirements • Gary Gomez – Terrorism SME • Jay Hess – Weapons of Mass Destruction SME • John Kamenelis – Tactical Ballistic Missile SME

Resource Registration and Discovery • Registration: The process by which agents (human or automated) store references to web resources for ease of subsequent discovery • Discovery: The process by which agents search for and find previously registered web resources • Web Resources: Usually common web-deployed data such as XML or MS Word files, represented and identified by uniform resource identifiers (URIs)

Key Success Criteria • Precision: The ratio of relevant discovered resources to all discovered resources • Recall: The ratio of relevant discovered resources to all relevant discoverable resources • Ease of Use: The lack of difficulty and complexity in the discovery tool
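The two ratios above can be written out as a minimal Python sketch. The resource URIs are hypothetical and serve only to illustrate the arithmetic.

```python
def precision(discovered: set, relevant: set) -> float:
    """Relevant discovered resources over all discovered resources."""
    if not discovered:
        return 0.0
    return len(discovered & relevant) / len(discovered)


def recall(discovered: set, relevant: set) -> float:
    """Relevant discovered resources over all relevant discoverable resources."""
    if not relevant:
        return 0.0
    return len(discovered & relevant) / len(relevant)


# Hypothetical resource identifiers, for illustration only.
discovered = {"doc:a", "doc:b", "doc:c", "doc:d"}
relevant = {"doc:a", "doc:b", "doc:c", "doc:e", "doc:f"}

p = precision(discovered, relevant)  # 3 of the 4 discovered resources are relevant
r = recall(discovered, relevant)     # 3 of the 5 relevant resources were discovered
print(p, r)
```

A search that returns everything has perfect recall and poor precision; a search that returns one correct resource has perfect precision and poor recall. Both criteria have to be judged together.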

Present Generation Capability • Autonomy – shotgun approach • Investment is in training every concept against its own document list • …then easy to use • Precision is in the low 80% range • Great tool when lower search precision and recall are acceptable: • When missed results are acceptable • When analysts have time to wade through irrelevant search results

The Technological Approach • Register resources against the nodes of an ontology • Classes in a hierarchy • Instances in a data model • Registration and discovery indistinguishable... • Entry • String/pattern search • Query expression search • Bookmark lookup • Subsequent Navigation • Hierarchical • Other relations
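As a rough sketch of the approach on this slide, registration and discovery can share one structure: resources are stored against ontology nodes, and a lookup on a class also walks down its subclass hierarchy. The `Registry` class, the class names, and the example.org URIs below are assumptions made for illustration; they are not the tool shown in the presentation.

```python
from collections import defaultdict


class Registry:
    """Toy registry: resources are registered against nodes of a class hierarchy."""

    def __init__(self):
        self.subclasses = defaultdict(set)  # class -> its direct subclasses
        self.resources = defaultdict(set)   # node -> URIs registered against it

    def add_subclass(self, parent, child):
        self.subclasses[parent].add(child)

    def register(self, node, uri):
        self.resources[node].add(uri)

    def discover(self, node):
        """Return resources registered against `node` or any of its descendants."""
        found, seen, stack = set(), set(), [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # guards against multiple inheritance paths
            seen.add(current)
            found |= self.resources.get(current, set())
            stack.extend(self.subclasses.get(current, ()))
        return found


# Hypothetical hierarchy and resources.
registry = Registry()
registry.add_subclass("Organization", "TerroristOrganization")
registry.register("Organization", "http://example.org/org-chart.xml")
registry.register("TerroristOrganization", "http://example.org/report-1.doc")

print(sorted(registry.discover("Organization")))           # broad search
print(sorted(registry.discover("TerroristOrganization")))  # narrowed search
```

Searching from a higher class broadens the result set, and moving to a subclass narrows it, which corresponds to the hierarchical navigation listed above.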

Hierarchical view of Al Qaida showing overall structure

Cluster view of properties of Al Qaida

Inspecting Al Qaida

Inspecting Mohamed Atta

Registered Documents

Why Ontology-based Resource Discovery? • Precision in un/semi-structured resource discovery (no limit to precision and recall) • More buckets - formal deep ontologies (higher resolution in search) • Formal unambiguous concept definitions (fewer mistakes in registration) • Ease of resource discovery • Additional paths/relations for navigating to resources • Federation of legacy code sets/taxonomies (users can use familiar taxonomies) • Power in resource discovery • Search broadening and narrowing • Inclusion of related terms in search

Surpassing the Global-Index-based Search Engine • Complete precision within terms of class hierarchy • Guaranteed by • unique definitions • sound subclass relations • See inductive proof in paper • Made useful by • depth • breadth • exhaustive partitions • Watch for multiple exhaustive partitions (overlapping sibling classes) • Ultimate precision extendable by • more salience • registration against instances • querying against attributes (attributes can be expressed as classes)

When the Cost is Justified • Cost and use comparison • Google registration is automatic and cheap • Accurate registration against an ontology requires painstaking effort • Must know ontology • Multiple registration • The subjective comparison: When the precision and recall of the discovery process are important enough to justify the cost • Saved analyst time • Lives lost due to missed data • Being forced to register • DoD Discovery Metadata Standard (DDMS) requires some characterization of content

Cost Mitigation • Human Memory: Registration and extraction tool users will naturally remember regularly used ontology nodes • Favorite Lists: Favorite lists can record such commonly used ontology nodes • Knowledge Extraction: Extraction engines can reduce the effort of registration (at the cost of precision and recall) • Simple Knowledge-Based Work Environments: Current task knowledge • Knowledge-Based Resource Authoring Tools: An authoring tool could provide a post-editing process • NLP Research: Naturally, full NLP capability would allow full automation of the registration process

The Fundamental Mismatch and its Eventual Departure • Semantic web (SW) registration inherently mixes generations of technology • Class hierarchy is small part of SW • SW documents can replace most/all other documents • SW documents will allow “Google-style” automatic registration • No loss of precision or recall • All document instances are already classified

Justifying Why Taxonomic Registration Does Not Suffice • Taxonomies are specialized views • Views are not unique • Models can be unique • Taxonomies can be tied together with a hub: • Peterson, E., Customized Resource Discovery: Linking Formalized Web Taxonomies to a Web Ontology Hub, AAAI Workshop on Semantic Web Personalization, San Diego, CA, 2004. • http://semweb.mcdonaldbradley.com/Papers/AAAITaxonomyPaper.doc

Justifying Web-Deployment of a Registration/Search Ontology • Inherently a web-deployed activity • SW languages automatically create URIs • URIs are guaranteed unique • Leverages existing XML tools and technology • Yet not strictly required • Java programs do not create URIs for classes • Java programs work without guaranteed uniqueness

Justifying Use of a Single Registration Ontology • Allows a single search on an ontology term to search the whole SW • A single Google search searches the entire available web • Multiple conflicting ontologies must be mapped • The mapping ontology becomes one ontology in time

Choosing the Most Appropriate Ontology • Rigor (formal foundations) • Relevance • Size • Depth • Bushiness • Exhaustive Partitioning • Clarity (mappability) • Robustness • Popularity • Support

Cycorp Incorporated • Designed for complete high level breadth • They started with *many* random topics • Largest ontology by far (3,000,000 triple facts) • Deep on DoD/intel/terrorism (DARPA funded) • 20 years in the ontology business • Staff of professional ontologists

HPKB/RKF Cyc Translation into OWL • Three varieties of Cyc • Full Cyc • Commercial product • $200K plus • Free-to-government Cyc • Most of the military/intel of Full Cyc • 12,000 Classes • 900 Relations • 23,000 Instances • OpenCyc • Free • 3000 Classes

Cyc Translation Leverage – Search • Total Classes Translated: 12,000 • vs. ~600 in HF • 20-fold increase in search resolution • Total Instances Translated: 23,200 • vs. 0 in HF • Instances increase search resolution • What if we could register documents against just “Terrorism” and not “Osama bin Laden”? • Total Relations Translated: 900 • vs. 4 in HF • Relations provide “suggested” pathways to a possibly more appropriate mapping target for • Increased resolution • Simpler target selection in registration and mapping

Conclusion • Much more accurate search with large salient ontology • Registration labor intensive • This labor may be required • Better results with better ontologies

Web Presence • http://semweb.mcdonaldbradley.com • OWL code • http://semweb.mcdonaldbradley.com/OWL • Protégé-loadable OpenCyc stripped translation (3 minutes to load) • Jena2-loadable FreeToGov Cyc • Taxonomy mapping framework • Semantic Web Tools • http://semweb.mcdonaldbradley.com/tools • Semantic Surf Board • Taxonomy Editor

OWL Wish List • Functions (First Order Logical) • Do not want to reify every class into a topic • A-political OWL subset • OWL dialects split on borders of academic compromise • OWL FullSansBaroqueness OWL version

Backup Slides

Why Taxonomies for Resource Discovery? • Top-down navigation of formal ontologies can be onerous – even for the ontologist • Not semantic indices for the masses • Taxonomies are still first-class indices • Many legacy taxonomies in existence • Intuitive to understand • Easy to build • Typically customized to a particular domain • Customizable to a particular domain

A Taxonomic Definition • A hierarchy of things • Taxonomic nodes are all instances of type thing • The taxonomic relation is of type mostGeneralRelation • It is the most general relation that subsumes all others

Downside to Taxonomies • Resources registered in just one taxonomy do not appear in overlapping taxonomies • Resources must be registered against all relevant taxonomies or… • Searchers must look in all appropriate taxonomies

A Solution • Map each taxonomy to a single hub ontology • Every node in each taxonomy maps to at least one ontological class, instance, or topic in the hub • Relations are not mapped (the semantics do not exist) • Mapping to a more general class loses search precision
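A minimal sketch of the hub idea follows, assuming every mapping is an exact (sameAs-style) link and ignoring subclass mappings. The taxonomy node names, hub class names, and URIs are hypothetical.

```python
# Each taxonomy node maps to at least one node in the single hub ontology.
node_to_hub = {
    "taxA:Terrorist_Groups": {"hub:TerroristOrganization"},
    "taxB:Terror_Orgs": {"hub:TerroristOrganization"},
    "taxB:Weapons": {"hub:Weapon"},
}

# Resources are registered only once, against whichever taxonomy the registrar knows.
registrations = {
    "taxA:Terrorist_Groups": {"http://example.org/report-1.doc"},
    "taxB:Weapons": {"http://example.org/catalog.xml"},
}


def discover_via_hub(node):
    """Find resources registered against any taxonomy node sharing a hub target."""
    hub_targets = node_to_hub.get(node, set())
    found = set()
    for other_node, other_targets in node_to_hub.items():
        if hub_targets & other_targets:
            found |= registrations.get(other_node, set())
    return found


# A searcher using taxonomy B finds a document registered only in taxonomy A.
print(discover_via_hub("taxB:Terror_Orgs"))
```

With the hub in place, a resource no longer has to be registered against every overlapping taxonomy, and a searcher no longer has to look in all of them.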

General Taxonomic Representation • All nodes are instances of type Thing • Classes (under OWL-Full-style) • Topics • Instances of Thing subtypes • All relations are of type mostGeneralRelation • Top of relation lattice • Relations are directional (pointing down by fiat)

Model Subsets as Taxonomies: Specified Taxonomies • A taxonomy T is a triple {ω, σ, φ} • ω is an ontology expressible in OWL-Full • σ is T’s starting node in ω • φ is a function that • takes one argument - a taxonomic node of type Thing and • returns a subset of that node’s referenced nodes • Allows any contrived model subset (any DAG) • Enables creation of personal taxonomies • Subset designation by elimination • Subset designation by addition • Allows taxonomy software to work on ontologies • Allows ontology software to work on taxonomies
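The triple definition above can be sketched in Python as follows. This is an illustration under simplifying assumptions: the ontology ω is reduced to a plain dictionary from each node to the nodes it references, relation types are ignored, and all node names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass
class Taxonomy:
    ontology: Dict[str, Set[str]]     # ω: node -> nodes it references
    start: str                        # σ: starting node in ω
    phi: Callable[[str], Set[str]]    # φ: node -> chosen subset of its referenced nodes

    def nodes(self) -> Set[str]:
        """Collect every node reachable from the start node through φ."""
        seen, stack = set(), [self.start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            selected = self.phi(node)
            if not selected <= self.ontology.get(node, set()):
                raise ValueError(f"phi returned nodes not referenced by {node}")
            stack.extend(selected)
        return seen


# Hypothetical ontology fragment.
ontology = {
    "Thing": {"Agent", "Event", "Place"},
    "Agent": {"Person", "Organization"},
    "Event": {"Attack"},
}

# Subset designation by elimination: a personal taxonomy that drops two nodes.
excluded = {"Place", "Person"}
tax = Taxonomy(ontology, "Thing", lambda n: ontology.get(n, set()) - excluded)
print(sorted(tax.nodes()))
```

Subset designation by addition would use a φ that returns only explicitly listed nodes instead of removing excluded ones.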

Model Subsets as Taxonomies: Natural Taxonomies • φ reduces to a simple function that returns all nodes linked by a particular relation • Use transitive closures as taxonomies • A taxonomy is given by a root node and a relation type (directional) • Natural examples • subClassOf • partOf • subordinateOf • Allows easy taxonomy mining
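A natural taxonomy can be mined with a few lines of code: pick a root node and a relation, then take the transitive closure. The sketch below assumes the model is available as (subject, relation, object) triples in which the subject sits below the object; the triples themselves are hypothetical.

```python
from collections import defaultdict

# Hypothetical model fragment as (subject, relation, object) triples.
triples = [
    ("Battalion", "partOf", "Brigade"),
    ("Company", "partOf", "Battalion"),
    ("Platoon", "partOf", "Company"),
    ("Brigade", "subClassOf", "MilitaryUnit"),
]


def natural_taxonomy(triples, root, relation):
    """Return the root plus every node reachable downward via `relation`."""
    children = defaultdict(set)
    for subject, rel, obj in triples:
        if rel == relation:
            children[obj].add(subject)  # walk from object down to subject

    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, ()))
    return seen


print(sorted(natural_taxonomy(triples, "Brigade", "partOf")))
```

Changing only the relation argument yields a different taxonomy over the same model, for example the subClassOf hierarchy rooted at "MilitaryUnit".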

Simple Taxonomies • φ reduces to a simple function that returns all nodes linked by the most general relation • Simple to encode • One node type • One relation type • Can be specialized over time • Thing nodes can be replaced with class, instance, and topic nodes • Relations can be replaced with more specific relations • Class hierarchy taxonomies should start vague until proven • Underspecified yet real knowledge artifacts • They are the basis for sound inference • Encoding weak artifacts in semantically powerful environments makes sense

Key Mapping Relations • [Figure labels: Web Taxonomy, Hub Ontology, sameAs, instanceOf, subclassOf, Node representing Instance]

Mapping Relations: sameAs • Ideal • Lossless • Use when a precise target exists • Hard to truly determine • Allow no contradictions • Allow consistent differences • Vague or missing node documentation

Mapping Relations: subClassOf • The likely choice • Lossy • A cry for help • Such taxonomic nodes must be upgraded and moved to hub • Clarify and standardize node name • Clarify and standardize node documentation • Driven by economics (expensive) • Similarly hard to truly determine

Mapping Dirty Hierarchical Code Sets • Government code sets such as catcodes, equipment codes, IFC codes • Basic three relations did not suffice • keyTopicSubject maps topics to classes or instances • Simple one-step mappings did not suffice • Topic subjects did not exist in ontology hub • Had to be created • Had to be linked to superclass(es) in hub • Disjoined complex nodes missing subclasses in hub • Complex nodes were always disjoined • "and" usually means logical or (ATF is not a logical and) • Had to create a subclass for each disjoined term • E.g.: Nuclear_biological_chemical_warfare_antidotes_and_medical_treatment

Disjoined Complex Nodes • [Figure nodes:] NBC_wrfr_trtmt_and_ant • NBC_wrfr_trtmt • NBC_wrfr_ant • Nuclear_warefare_antedote • Nuclear_warefare_treatment • Biological_warefare_antedote • Biological_warefare_treatment • Chemical_warfare_antedote • Chemical_warfare_treatment
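Creating one subclass per disjoined term amounts to taking every combination of the terms joined by "and" in the complex node name. The following is a small sketch of that expansion; the naming pattern and the input lists are assumptions for illustration and do not reproduce the node identifiers used in the hub.

```python
from itertools import product


def expand_disjoined(hazards, responses):
    """Build one candidate subclass name per (hazard, response) combination."""
    return [f"{hazard}_warfare_{response}" for hazard, response in product(hazards, responses)]


# Hypothetical split of a complex node such as
# Nuclear_biological_chemical_warfare_antidotes_and_medical_treatment.
names = expand_disjoined(
    ["Nuclear", "Biological", "Chemical"],
    ["antidote", "treatment"],
)
for name in names:
    print(name)
```

Three hazards and two responses give six leaf subclasses, each of which can then be linked to its superclass(es) in the hub.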

Clean Taxonomy Mapping • The DDMS Taxonomy Focus Group has created a 420-node core taxonomy • Terms (URIs) will be used to characterize subject metadata of registered resources • Community of interest (COI) taxonomy terms also used for registration • Preliminary results • Many sameAs mappings • Mostly subClassOf mappings

Challenges and Mitigations • Labor intensive process • The DoD is already committed to registration of document metadata • Yet they lack a means to characterize document content • Map key taxonomies first • Register high payoff data first • Continue to use standard search techniques when they suffice • Google search is easy and registration is automatic • Non-trivial process • Non-ontologists can be trained to map • Taxonomies can be simply characterized
