
Justifying Semantic-Web-based Resource Registration and Discovery

Eric Lee Peterson, Chief Ontologist, McDonald Bradley. McDonald Bradley Web Ontology Team Members: Eric Peterson – Chief Ontologist; Jon Pastor – Senior Ontologist; Phuc Nguyen – Senior Rapid Development Engineer


Presentation Transcript


  1. Justifying Semantic-Web-based Resource Registration and Discovery Eric Lee Peterson Chief Ontologist McDonald Bradley

  2. McDonald Bradley Web Ontology Team Members • Eric Peterson – Chief Ontologist • Jon Pastor – Senior Ontologist • Phuc Nguyen – Senior Rapid Development Engineer • Maurita Soltis – Ontologist/Domain Lead • Darren Govoni – Principal Architect • Eric Monk – Senior Architect • Joseph Rajkumar – Mapping Ontologist • Chuan Shi – Mapping Ontologist • Mark Joncas – Functional Requirements • Gary Gomez – Terrorism SME • Jay Hess – Weapons of Mass Destruction SME • John Kamenelis – Tactical Ballistic Missile SME

  3. Resource Registration and Discovery • Registration: The process by which agents (human or automated) store references to web resources for ease of subsequent discovery • Discovery: The process by which agents search for and find previously registered web resources • Web Resources: Usually common web-deployed data, such as XML or MS Word files, represented and identified by uniform resource identifiers (URIs)

  4. Key Success Criteria • Precision: The ratio of relevant discovered resources to all discovered resources • Recall: The ratio of relevant discovered resources to all relevant discoverable resources • Ease of Use: The lack of difficulty and complexity in the discovery tool
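The two ratios defined on this slide can be written out as a short worked example. This is a minimal sketch, not part of the original presentation; the function name and the sample URIs are hypothetical.

```python
def precision_recall(discovered, relevant):
    """Return (precision, recall) for one discovery query."""
    discovered = set(discovered)
    relevant = set(relevant)
    # Relevant resources that the search actually returned.
    hits = discovered & relevant
    precision = len(hits) / len(discovered) if discovered else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four resources returned, three are truly relevant,
# and two of the relevant ones were found.
discovered = {"doc1.xml", "doc2.doc", "doc3.xml", "doc4.xml"}
relevant = {"doc1.xml", "doc2.doc", "doc5.xml"}
precision, recall = precision_recall(discovered, relevant)
print(precision, recall)
```

Here precision is 2 of 4 discovered resources and recall is 2 of 3 relevant resources, so a search can score well on one measure and poorly on the other.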

  5. Present Generation Capability • Autonomy – shotgun approach • Investment is in training every concept against its own document list • …then easy to use • Precision is in the low 80% range • Great tool when lower search precision and recall are acceptable: • When missed results are acceptable • When analysts have time to wade through irrelevant search results

  6. The Technological Approach • Register resources against the nodes of an ontology • Classes in a hierarchy • Instances in a data model • Registration and discovery indistinguishable... • Entry • String/pattern search • Query expression search • Bookmark lookup • Subsequent Navigation • Hierarchical • Other relations
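As a rough illustration of registering resources against ontology nodes and then entering the ontology by string search, the sketch below keeps an in-memory index from node names to URIs. The `Registry` class, its method names, the node names, and the example.org URIs are all assumptions made for illustration; the presentation does not describe its implementation.

```python
from collections import defaultdict


class Registry:
    """Toy registry: ontology node name -> set of registered resource URIs."""

    def __init__(self):
        self._by_node = defaultdict(set)

    def register(self, node, uri):
        # Registration: store a reference to a web resource under a node.
        self._by_node[node].add(uri)

    def find_nodes(self, pattern):
        # Entry by string search: nodes whose name contains the pattern.
        pattern = pattern.lower()
        return sorted(n for n in self._by_node if pattern in n.lower())

    def discover(self, node):
        # Discovery: list the resources registered against one node.
        return sorted(self._by_node.get(node, set()))


registry = Registry()
registry.register("AlQaida", "http://example.org/reports/structure.xml")
registry.register("AlQaida", "http://example.org/reports/finance.doc")
registry.register("MohamedAtta", "http://example.org/reports/profile.xml")

nodes = registry.find_nodes("qaida")
resources = registry.discover(nodes[0])
print(nodes, resources)
```

Because both activities start by locating a node, the same entry and navigation steps serve registration and discovery alike, which is one reading of the slide's remark that the two are indistinguishable.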

  7. Hierarchical view of Al Qaida showing overall structure

  8. Cluster view of properties of Al Qaida

  9. Inspecting Al Qaida

  10. Inspecting Mohamed Atta

  11. Registered Documents

  12. Why Ontology-based Resource Discovery? • Precision in un/semi-structured resource discovery (no limit to precision and recall) • More buckets - formal deep ontologies (higher resolution in search) • Formal unambiguous concept definitions (fewer mistakes in registration) • Ease of resource discovery • Additional paths/relations for navigating to resources • Federation of legacy code sets/taxonomies (users can use familiar taxonomies) • Power in resource discovery • Search broadening and narrowing • Inclusion of related terms in search
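Search broadening and narrowing can be sketched over a class hierarchy: a search on a class returns everything registered at or below it, and broadening simply moves the search up one level. The hierarchy, the registrations, and the function names below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical class hierarchy: child class -> parent class.
SUBCLASS_OF = {
    "TerroristOrganization": "Organization",
    "TerroristCell": "TerroristOrganization",
    "Company": "Organization",
}

# Hypothetical registrations: class -> URIs registered against it.
REGISTERED = {
    "Organization": {"http://example.org/org-overview.xml"},
    "TerroristOrganization": {"http://example.org/groups.doc"},
    "TerroristCell": {"http://example.org/cell-report.xml"},
    "Company": {"http://example.org/annual-report.doc"},
}


def classes_at_or_below(cls):
    """Return cls together with all of its direct and indirect subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, parent in SUBCLASS_OF.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def discover(cls):
    """Resources registered against cls or any of its subclasses."""
    uris = set()
    for c in classes_at_or_below(cls):
        uris |= REGISTERED.get(c, set())
    return sorted(uris)


def broaden(cls):
    """Move the search one level up; the root class stays where it is."""
    return SUBCLASS_OF.get(cls, cls)


narrow_results = discover("TerroristCell")
broad_results = discover(broaden("TerroristCell"))
print(narrow_results, broad_results)
```

A deeper hierarchy gives more intermediate stopping points between the narrow and the broad search, which is the sense in which more "buckets" raise search resolution.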

  13. Surpassing the Global-Index-based Search Engine • Complete precision within terms of class hierarchy • Guaranteed by • unique definitions • sound subclass relations • See inductive proof in paper • Made useful by • depth • breadth • exhaustive partitions • Watch for multiple exhaustive partitions (overlapping sibling classes) • Ultimate precision extendable by • more salience • registration against instances • querying against attributes (attributes can be expressed as classes)

  14. When the Cost is Justified • Cost and use comparison • Google registration is automatic and cheap • Accurate registration against ontology requires painstaking registration • Must know ontology • Multiple registration • The subjective comparison: When the precision and recall of the discovery process are important enough to justify the cost • Saved analyst time • Lives lost due to missed data • Being forced to register • DoD Discovery Metadata Standard (DDMS) requires some characterization of content

  15. Cost Mitigation • Human Memory: Registration and extraction tool users will naturally remember regularly used ontology nodes • Favorite Lists: Favorite lists can record such commonly used ontology nodes • Knowledge Extraction: Extraction engines can reduce the effort of registration (at the cost of precision and recall) • Simple Knowledge-Based Work Environments: Current task knowledge • Knowledge-Based Resource Authoring Tools: An authoring tool could provide a post-editing process • NLP Research: Naturally, full NLP capability would allow full automation of the registration process

  16. The Fundamental Mismatch and its Eventual Departure • Semantic web (SW) registration inherently mixes generations of technology • Class hierarchy is small part of SW • SW documents can replace most/all other documents • SW documents will allow “Google-style” automatic registration • No loss of precision or recall • All document instances are already classified

  17. Justifying Why Taxonomic Registration Does Not Suffice • Taxonomies are specialized views • Views are not unique • Models can be unique • Taxonomies can be tied together with a hub: • Peterson, E., Customized Resource Discovery: Linking Formalized Web Taxonomies to a Web Ontology Hub, AAAI Workshop on Semantic Web Personalization, San Diego, CA, 2004. • http://semweb.mcdonaldbradley.com/Papers/AAAITaxonomyPaper.doc

  18. Justifying Web-Deployment of a Registration/Search Ontology • Inherently a web-deployed activity • SW languages automatically create URIs • URIs are guaranteed unique • Leverages existing XML tools and technology • Yet not strictly required • Java programs do not create URIs for classes • Java programs work without guaranteed uniqueness

  19. Justifying Use of a Single Registration Ontology • Allows a single search on an ontology term to search the whole SW • A single Google search searches the entire available web • Multiple conflicting ontologies must be mapped • The mapping ontology becomes one ontology in time

  20. Choosing the Most Appropriate Ontology • Rigor (formal foundations) • Relevance • Size • Depth • Bushiness • Exhaustive Partitioning • Clarity (mappability) • Robustness • Popularity • Support

  21. Cycorp Incorporated • Designed for complete high level breadth • They started with *many* random topics • Largest ontology by far (3,000,000 triple facts) • Deep on DoD/intel/terrorism (DARPA funded) • 20 years in the ontology business • Staff of professional ontologists

  22. HPKB/RKF Cyc Translation into OWL • Three varieties of Cyc • Full Cyc • Commercial product • $200K plus • Free-to-government Cyc • Most of the military/intel of Full Cyc • 12,000 Classes • 900 Relations • 23,000 Instances • OpenCyc • Free • 3000 Classes

  23. Cyc Translation Leverage – Search • Total Classes Translated: 12,000 • vs. ~600 in HF • 20-fold increase in search resolution • Total Instances Translated: 23,200 • vs. 0 in HF • Instances increase search resolution • What if we could register documents against just “Terrorism” and not “Osama bin Laden”? • Total Relations Translated: 900 • vs. 4 in HF • Relations provide “suggested” pathways to a possibly more appropriate mapping target for • Increased resolution • Simpler target selection in registration and mapping

  24. Conclusion • Much more accurate search with large salient ontology • Registration labor intensive • This labor may be required • Better results with better ontologies

  25. Web Presence • http://semweb.mcdonaldbradley.com • OWL code • http://semweb.mcdonaldbradley.com/OWL • Protégé-loadable Opencyc stripped translation (3 minutes to load) • Jena2-loadable FreeToGov Cyc • Taxonomy mapping framework • Semantic Web Tools • http://semweb.mcdonaldbradley.com/tools • Semantic Surf Board • Taxonomy Editor

  26. OWL Wish List • Functions (First Order Logical) • Do not want to reify every class into a topic • A-political OWL subset • OWL dialects split on borders of academic compromise • OWL FullSansBaroqueness OWL version

  27. Backup Slides

  28. Why Taxonomies for Resource Discovery? • Top-down navigation of formal ontologies can be onerous – even for the ontologist • Not semantic indices for the masses • Taxonomies are still first-class indices • Many legacy taxonomies in existence • Intuitive to understand • Easy to build • Typically customized to a particular domain • Customizable to a particular domain

  29. A Taxonomic Definition • A hierarchy of things • Taxonomic nodes are all instances of type thing • The taxonomic relation is of type mostGeneralRelation • It is the most general relation that subsumes all others

  30. Downside to Taxonomies • Resources registered in just one taxonomy do not appear in overlapping taxonomies • Resources must be registered against all relevant taxonomies or… • Searchers must look in all appropriate taxonomies

  31. A Solution • Map each taxonomy to a single hub ontology • Every node in each taxonomy maps to at least one ontological class instance or topic in the hub • Relations are not mapped (the semantics do not exist) • Mapping to a more general class loses search precision
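The hub idea can be sketched as a lookup that goes from a taxonomy node to its hub class and back out to every taxonomy node mapped to that class. The taxonomy names, node labels, hub classes, and URIs are hypothetical, and the sketch simplifies the slide by giving each taxonomy node exactly one hub target, whereas the slide allows more than one.

```python
# Hypothetical mappings: (taxonomy, node) -> hub ontology class.
TO_HUB = {
    ("taxA", "Terror Groups"): "TerroristOrganization",
    ("taxB", "Terrorist Orgs"): "TerroristOrganization",
    ("taxB", "Weapons"): "Weapon",
}

# Hypothetical registrations made by users of each taxonomy.
REGISTERED = {
    ("taxA", "Terror Groups"): {"http://example.org/r1"},
    ("taxB", "Weapons"): {"http://example.org/r2"},
}


def discover_via_hub(taxonomy, node):
    """Find resources registered under any taxonomy node sharing the hub class."""
    hub_class = TO_HUB[(taxonomy, node)]
    results = set()
    for tax_node, mapped_class in TO_HUB.items():
        if mapped_class == hub_class:
            results |= REGISTERED.get(tax_node, set())
    return sorted(results)


# A taxB user finds a resource that was registered only in taxA.
found = discover_via_hub("taxB", "Terrorist Orgs")
print(found)
```

With the hub in place, a resource needs to be registered only once, and searchers can stay in whichever taxonomy they already know.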

  32. General Taxonomic Representation • All nodes are instances of type Thing • Classes (under OWL-Full-style) • Topics • Instances of Thing subtypes • All relations are of type mostGeneralRelation • Top of relation lattice • Relations are directional (pointing down by fiat)

  33. Model Subsets as Taxonomies: Specified Taxonomies • A taxonomy Τ is a triple {ω, σ, φ} • ω is an ontology expressible in OWL-Full • σ is Τ’s starting node in ω • φ is a function that • takes one argument - a taxonomic node of type Thing and • returns a subset of that node’s referenced nodes • Allows any contrived model subset (any DAG) • Enables creation of personal taxonomies • Subset designation by elimination • Subset designation by addition • Allows taxonomy software to work on ontologies • Allows ontology software to work on taxonomies
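A minimal sketch of the triple {ω, σ, φ}, assuming the ontology is held as a plain dictionary rather than OWL: φ takes one node and returns a subset of the nodes it references, and the taxonomy is whatever can be reached from σ by applying φ repeatedly. The toy ontology, the relation name, and all function names are hypothetical. The φ shown designates its subset by elimination.

```python
# omega: a toy ontology, node -> {relation name: [referenced nodes]}.
ONTOLOGY = {
    "Organization": {"superClassOf": ["TerroristOrganization", "Company"]},
    "TerroristOrganization": {"superClassOf": ["TerroristCell"]},
    "Company": {},
    "TerroristCell": {},
}


def referenced(node):
    """All nodes that `node` references, over every relation."""
    return [t for targets in ONTOLOGY.get(node, {}).values() for t in targets]


def phi_by_elimination(excluded):
    """Build a phi that keeps every referenced node except the excluded ones."""
    def phi(node):
        return [n for n in referenced(node) if n not in excluded]
    return phi


def build_taxonomy(start, phi):
    """Walk from sigma (start), applying phi; return node -> child list."""
    taxonomy = {}
    stack = [start]
    while stack:
        node = stack.pop()
        if node in taxonomy:
            continue  # already visited; a DAG may reach a node twice
        children = phi(node)
        taxonomy[node] = children
        stack.extend(children)
    return taxonomy


personal = build_taxonomy("Organization", phi_by_elimination({"Company"}))
print(personal)
```

Swapping in a different φ yields a different personal taxonomy over the same ontology, without copying or editing the ontology itself.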

  34. Model Subsets as Taxonomies: Natural Taxonomies • φ reduces to simply a function that returns all nodes linked by a particular relation • Use transitive closures as taxonomies • A taxonomy is given by a root node and a relation type (directional) • Natural examples • subClassOf • partOf • subordinateOf • Allows easy taxonomy mining
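Taxonomy mining from a root node and a relation type can be sketched as a transitive closure over triples. The triples and the function name below are hypothetical; each triple reads (subject, relation, object), so the members of a taxonomy are the subjects that reach the root through the chosen relation.

```python
# Hypothetical ontology facts as (subject, relation, object) triples.
TRIPLES = [
    ("TerroristOrganization", "subClassOf", "Organization"),
    ("TerroristCell", "subClassOf", "TerroristOrganization"),
    ("Wing", "partOf", "Aircraft"),
    ("Aileron", "partOf", "Wing"),
]


def natural_taxonomy(root, relation):
    """Nodes linked to root, directly or indirectly, by one relation type."""
    members = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for subj, rel, obj in TRIPLES:
            if rel == relation and obj == node and subj not in members:
                members.add(subj)
                frontier.append(subj)
    return members


class_taxonomy = natural_taxonomy("Organization", "subClassOf")
part_taxonomy = natural_taxonomy("Aircraft", "partOf")
print(class_taxonomy, part_taxonomy)
```

The same triples yield a different taxonomy for each relation type, so one ontology can be mined for several natural taxonomies.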

  35. Simple Taxonomies • φ reduces to simply a function that returns all nodes linked by the most general relation • Simple to encode • One node type • One relation type • Can be specialized over time • Thing nodes can be replaced with class, instance, and topic nodes • Relations can be replaced with more specific relations • Class hierarchy taxonomies should start vague until proven • Underspecified yet real knowledge artifacts • They are the basis for sound inference • Encoding weak artifacts in semantically powerful environments makes sense

  36. Key Mapping Relations [Figure: nodes of a Web Taxonomy linked to the Hub Ontology by sameAs, instanceOf, and subclassOf relations; legend: node representing instance]

  37. Mapping Relations: sameAs • Ideal • Lossless • Use when a precise target exists • Hard to truly determine • Allow no contradictions • Allow consistent differences • Vague or missing node documentation

  38. Mapping Relations: subClassOf • The likely choice • Lossy • A cry for help • Such taxonomic nodes must be upgraded and moved to hub • Clarify and standardize node name • Clarify and standardize node documentation • Driven by economics (expensive) • Similarly hard to truly determine

  39. Mapping Dirty Hierarchical Code Sets • Government code sets such as catcodes, equipment codes, IFC codes • Basic three relations did not suffice • keyTopicSubject maps topics to classes or instances • Simple one-step mappings did not suffice • Topic subjects did not exist in ontology hub • Had to be created • Had to be linked to superclass(es) in hub • Disjoined complex nodes missing subclasses in hub • Complex nodes were always disjoined • “and” usually means logical or (ATF is not logical and) • Had to create a subclass for each disjoined term • E.g., Nuclear_biological_chemical_warfare_antidotes_and_medical_treatment

  40. Disjoined Complex Nodes • NBC_wrfr_trtmt_and_ant • NBC_wrfr_trtmt • NBC_wrfr_ant • Nuclear_warefare_antedote • Nuclear_warefare_treatment • Biological_warefare_antedote • Biological_warefare_treatment • Chemical_warfare_antedote • Chemical_warfare_treatment
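Creating one subclass per disjoined term can be sketched as a cross product of the disjuncts in a complex node. The function name and the naming scheme are assumptions for illustration, and the spellings in the sketch are normalized rather than copied from the node labels on the slide.

```python
from itertools import product


def expand_disjoined(agents, topic, aspects):
    """One subclass name per combination of disjoined terms."""
    return [f"{agent}_{topic}_{aspect}" for agent, aspect in product(agents, aspects)]


# The complex node covers (nuclear OR biological OR chemical) warfare
# and (antidote OR treatment), so it expands to 3 x 2 = 6 subclasses.
subclasses = expand_disjoined(
    ["Nuclear", "Biological", "Chemical"],
    "warfare",
    ["antidote", "treatment"],
)
for name in subclasses:
    print(name)
```

Each generated subclass would still need to be created in the hub and linked to its superclasses by hand or by a mapping tool; the sketch covers only the naming step.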

  41. Clean Taxonomy Mapping • The DDMS Taxonomy Focus Group has created a 420-node core taxonomy • Terms (URIs) will be used to characterize subject metadata of registered resources • Community of interest (COI) taxonomy terms also used for registration • Preliminary results • Many sameAs mappings • Mostly subClassOf mappings

  42. Challenges and Mitigations • Labor intensive process • The DoD is already committed to registration of document metadata • Yet they lack a means to characterize document content • Map key taxonomies first • Register high payoff data first • Continue to use standard search techniques when they suffice • Google search is easy and registration is automatic • Non-trivial process • Non-ontologists can be trained to map • Taxonomies can be simply characterized
