Justifying Semantic-Web-based Resource Registration and Discovery Eric Lee Peterson Chief Ontologist McDonald Bradley
McDonald Bradley Web Ontology Team Members • Eric Peterson – Chief Ontologist • Jon Pastor – Senior Ontologist • Phuc Nguyen – Senior Rapid Development Engineer • Maurita Soltis – Ontologist/Domain Lead • Darren Govoni – Principal Architect • Eric Monk – Senior Architect • Joseph Rajkumar – Mapping Ontologist • Chuan Shi – Mapping Ontologist • Mark Joncas – Functional Requirements • Gary Gomez – Terrorism SME • Jay Hess – Weapons of Mass Destruction SME • John Kamenelis – Tactical Ballistic Missile SME
Resource Registration and Discovery • Registration: The process by which agents (human or automated) store references to web resources for ease of subsequent discovery • Discovery: The process by which agents search for and find previously registered web resources • Web Resources: Usually common web-deployed data, such as XML or MS Word files, represented and identified by uniform resource identifiers (URIs)
Key Success Criteria • Precision: The ratio of relevant discovered resources over all discovered resources • Recall: The ratio of relevant discovered resources over the number of relevant discoverable resources • Ease of Use: The lack of difficulty and complexity in using the discovery tool
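The two ratios above can be sketched in a few lines of Python. This is only an illustration of the definitions on the slide; the function name and the example URIs are hypothetical and do not come from the presentation.

```python
def precision_recall(discovered, relevant):
    """Return (precision, recall) for one discovery result.

    discovered: set of resource URIs returned by a search
    relevant:   set of relevant resource URIs that were discoverable
    """
    hits = discovered & relevant  # relevant resources that were actually found
    precision = len(hits) / len(discovered) if discovered else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical URIs, for illustration only.
discovered = {"urn:doc:1", "urn:doc:2", "urn:doc:3", "urn:doc:4"}
relevant = {"urn:doc:1", "urn:doc:2", "urn:doc:5"}

p, r = precision_recall(discovered, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four discovered resources are relevant (precision 0.5), and two of the three relevant resources were found (recall about 0.67).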
Present Generation Capability • Autonomy – shotgun approach • Investment is in training every concept against its own document list • …then easy to use • Precision is in the low 80% range • Great tool when less search precision and recall are acceptable: • When missed results are acceptable • When analysts have time to wade through irrelevant search results
The Technological Approach • Register resources against the nodes of an ontology • Classes in a hierarchy • Instances in a data model • Registration and discovery indistinguishable... • Entry • String/pattern search • Query expression search • Bookmark lookup • Subsequent Navigation • Hierarchical • Other relations
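A minimal sketch of the approach on this slide, assuming a toy in-memory registry: resources are registered against class nodes, and discovery enters at a node and then navigates down the class hierarchy. The `Registry` class, its method names, the class names, and the example URIs are all hypothetical; a real deployment would sit on an OWL ontology rather than Python dictionaries.

```python
from collections import defaultdict


class Registry:
    """Toy registry of resources against nodes of a class hierarchy."""

    def __init__(self):
        self.subclasses = defaultdict(set)  # class -> direct subclasses
        self.resources = defaultdict(set)   # class -> registered resource URIs

    def add_subclass(self, child, parent):
        self.subclasses[parent].add(child)

    def register(self, uri, node):
        """Registration: store a resource reference against an ontology node."""
        self.resources[node].add(uri)

    def discover(self, node):
        """Discovery: resources registered at node or at any descendant class."""
        found, stack, seen = set(), [node], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            found |= self.resources.get(current, set())
            stack.extend(self.subclasses.get(current, ()))
        return found


reg = Registry()
reg.add_subclass("Missile", "Weapon")
reg.add_subclass("TacticalBallisticMissile", "Missile")
reg.register("http://example.org/missiles.xml", "Missile")
reg.register("http://example.org/tbm.doc", "TacticalBallisticMissile")

print(sorted(reg.discover("Missile")))                   # broader entry point
print(sorted(reg.discover("TacticalBallisticMissile")))  # narrower entry point
```

Entering at a more general class broadens the search, and entering at a more specific class narrows it, which is the hierarchical navigation the slide refers to.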
Why Ontology-based Resource Discovery? • Precision in un/semi-structured resource discovery (no limit to precision and recall) • More buckets - formal deep ontologies (higher resolution in search) • Formal unambiguous concept definitions (fewer mistakes in registration) • Ease of resource discovery • Additional paths/relations for navigating to resources • Federation of legacy code/sets taxonomies (users can use familiar taxonomies) • Power in resource discovery • Search broadening and narrowing • Inclusion of related terms in search
Surpassing the Global-Index-based Search Engine • Complete precision within terms of class hierarchy • Guaranteed by • unique definitions • sound subclass relations • See inductive proof in paper • Made useful by • depth • breadth • exhaustive partitions • Watch for multiple exhaustive partitions (overlapping sibling classes) • Ultimate precision extendable by • more salience • registration against instances • querying against attributes (attributes can be expressed as classes)
When the Cost is Justified • Cost and use comparison • Google registration is automatic and cheap • Accurate registration against an ontology requires painstaking effort • Must know ontology • Multiple registration • The subjective comparison: When the precision and recall of the discovery process are important enough to justify the cost • Saved analyst time • Lives lost due to missed data • Being forced to register • DoD Discovery Metadata Standard (DDMS) requires some characterization of content
Cost Mitigation • Human Memory: Registration and extraction tool users will naturally remember regularly used ontology nodes • Favorite Lists: Favorite lists can record such commonly used ontology nodes • Knowledge Extraction: Extraction engines can reduce the effort of registration (at the cost of precision and recall) • Simple Knowledge-Based Work Environments: Current task knowledge • Knowledge-Based Resource Authoring Tools: An authoring tool could provide a post-editing process • NLP Research: Naturally, full NLP capability would allow full automation of the registration process
The Fundamental Mismatch and its Eventual Departure • Semantic web (SW) registration inherently mixes generations of technology • Class hierarchy is a small part of SW • SW documents can replace most/all other documents • SW documents will allow “Google-style” automatic registration • No loss of precision or recall • All document instances are already classified
Justifying Why Taxonomic Registration Does Not Suffice • Taxonomies are specialized views • Views are not unique • Models can be unique • Taxonomies can be tied together with a hub: • Peterson, E., Customized Resource Discovery: Linking Formalized Web Taxonomies to a Web Ontology Hub, AAAI Workshop on Semantic Web Personalization, San Diego, CA, 2004. • http://semweb.mcdonaldbradley.com/Papers/AAAITaxonomyPaper.doc
Justifying Web-Deployment of a Registration/Search Ontology • Inherently a web-deployed activity • SW languages automatically create URIs • URIs are guaranteed unique • Leverages existing XML tools and technology • Yet not strictly required • Java programs do not create URIs for classes • Java programs work without guaranteed uniqueness
Justifying Use of a Single Registration Ontology • Allows a single search on an ontology term to search the whole SW • Single Google search searches entire available web • Multiple conflicting ontologies must be mapped • The mapping ontology becomes one ontology in time
Choosing the Most Appropriate Ontology • Rigor (formal foundations) • Relevance • Size • Depth • Bushiness • Exhaustive Partitioning • Clarity (mappability) • Robustness • Popularity • Support
Cycorp Incorporated • Designed for complete high level breadth • They started with *many* random topics • Largest ontology by far (3,000,000 triple facts) • Deep on DoD/intel/terrorism (DARPA funded) • 20 years in the ontology business • Staff of professional ontologists
HPKB/RKF Cyc Translation into OWL • Three varieties of Cyc • Full Cyc • Commercial product • $200K plus • Free-to-government Cyc • Most of the military/intel content of Full Cyc • 12,000 Classes • 900 Relations • 23,000 Instances • OpenCyc • Free • 3000 Classes
Cyc Translation Leverage – Search • Total Classes Translated: 12,000 • vs. ~600 in HF • 20-fold increase in search resolution • Total Instances Translated: 23,200 • vs. 0 in HF • Instances increase search resolution • What if we could register documents against just “Terrorism” and not “Osama bin Laden”? • Total Relations Translated: 900 • vs. 4 in HF • Relations provide “suggested” pathways to a possibly more appropriate mapping target for • Increased resolution • Simpler target selection in registration and mapping
Conclusion • Much more accurate search with large salient ontology • Registration labor intensive • This labor may be required • Better results with better ontologies
Web Presence • http://semweb.mcdonaldbradley.com • OWL code • http://semweb.mcdonaldbradley.com/OWL • Protégé-loadable OpenCyc stripped translation (3 minutes to load) • Jena2-loadable FreeToGov Cyc • Taxonomy mapping framework • Semantic Web Tools • http://semweb.mcdonaldbradley.com/tools • Semantic Surf Board • Taxonomy Editor
OWL Wish List • Functions (First Order Logical) • Do not want to reify every class into a topic • A-political OWL subset • OWL dialects split on borders of academic compromise • OWL FullSansBaroqueness OWL version
Why Taxonomies for Resource Discovery? • Top-down navigation of formal ontologies can be onerous – even for the ontologist • Not semantic indices for the masses • Taxonomies are still first-class indices • Many legacy taxonomies in existence • Intuitive to understand • Easy to build • Typically customized to a particular domain • Customizable to a particular domain
A Taxonomic Definition • A hierarchy of things • Taxonomic nodes are all instances of type thing • The taxonomic relation is of type mostGeneralRelation • It is the most general relation that subsumes all others
Downside to Taxonomies • Resources registered in just one taxonomy do not appear in overlapping taxonomies • Resources must be registered against all relevant taxonomies or… • Searchers must look in all appropriate taxonomies
A Solution • Map each taxonomy to a single hub ontology • Every node in each taxonomy maps to at least one ontological class, instance, or topic in the hub • Relations are not mapped (the semantics do not exist) • Mapping to a more general class loses search precision
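The hub idea can be sketched as follows, assuming a small in-memory mapping table: a resource registered against a node in one taxonomy becomes discoverable from any node in another taxonomy that maps to the same hub class. The taxonomy names, node names, hub class names, and URI are hypothetical, and for brevity each node maps to exactly one hub class here (the slide allows more than one).

```python
# (taxonomy, node) -> hub ontology class. All names are hypothetical.
mapping = {
    ("taxA", "WMD"): "hub:WeaponOfMassDestruction",
    ("taxB", "NBC weapons"): "hub:WeaponOfMassDestruction",
    ("taxB", "Missiles"): "hub:Missile",
}

# Resources registered against taxonomy nodes (registered only once, in taxA).
registrations = {
    ("taxA", "WMD"): {"http://example.org/report.doc"},
}


def discover_via_hub(node):
    """Find resources registered at any taxonomy node sharing node's hub class."""
    hub_class = mapping[node]
    found = set()
    for other_node, target in mapping.items():
        if target == hub_class:
            found |= registrations.get(other_node, set())
    return found


# A searcher browsing taxB finds the resource registered in taxA.
print(discover_via_hub(("taxB", "NBC weapons")))
print(discover_via_hub(("taxB", "Missiles")))
```

Because only nodes are mapped, not relations, the sketch keeps no edges between taxonomy nodes at all; the hub class is the only link between the two taxonomies.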
General Taxonomic Representation • All nodes are instances of type Thing • Classes (under OWL-Full-style) • Topics • Instances of Thing subtypes • All relations are of type mostGeneralRelation • Top of relation lattice • Relations are directional (pointing down by fiat)
Model Subsets as Taxonomies: Specified Taxonomies • A taxonomy Τ is a triple {ω, σ, φ} • ω is an ontology expressible in OWL-Full • σ is Τ’s starting node in ω • φ is a function that • takes one argument - a taxonomic node of type Thing and • returns a subset of that node’s referenced nodes • Allows any contrived model subset (any DAG) • Enables creation of personal taxonomies • Subset designation by elimination • Subset designation by addition • Allows taxonomy software to work on ontologies • Allows ontology software to work on taxonomies
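One way to picture the triple {ω, σ, φ} is the sketch below. It assumes ω is a plain dictionary from each node to its (relation, referenced node) pairs, with relations pointing down; the `Taxonomy` class, the `by_elimination` helper, and the node names are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

# Assumed encoding of an ontology: node -> [(relation, referenced node)].
Ontology = Dict[str, List[Tuple[str, str]]]


@dataclass
class Taxonomy:
    omega: Ontology                  # the ontology
    sigma: str                       # the starting node in omega
    phi: Callable[[str], Set[str]]   # node -> subset of its referenced nodes

    def nodes(self) -> Set[str]:
        """All nodes reachable from sigma by repeatedly applying phi."""
        seen: Set[str] = set()
        stack = [self.sigma]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.phi(node))
        return seen


def by_elimination(omega: Ontology, excluded: Set[str]) -> Callable[[str], Set[str]]:
    """Build a phi that keeps every referenced node except the excluded ones."""
    return lambda node: {t for _, t in omega.get(node, []) if t not in excluded}


omega: Ontology = {
    "Weapon": [("hasSubclass", "Missile"), ("hasSubclass", "Firearm")],
    "Missile": [("hasSubclass", "TacticalBallisticMissile")],
    "Firearm": [],
    "TacticalBallisticMissile": [],
}

personal = Taxonomy(omega, "Weapon", by_elimination(omega, {"Firearm"}))
print(sorted(personal.nodes()))
```

Subset designation by addition would be the mirror image: a φ that returns only referenced nodes drawn from an explicit allow-list.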
Model Subsets as Taxonomies: Natural Taxonomies • φ reduces to a simple function that returns all nodes linked by a particular relation • Use transitive closures as taxonomies • A taxonomy is given by a root node and a relation type (directional) • Natural examples • subClassOf • partOf • subordinateOf • Allows easy taxonomy mining
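Mining a natural taxonomy from a root node and a relation type might look like the sketch below. It assumes the ontology is available as (subject, relation, object) triples and that relations such as subClassOf and partOf point from the more specific node to the more general one, so the closure is followed in the inverse direction from the root. The function name and the example triples are hypothetical.

```python
def natural_taxonomy(triples, root, relation):
    """Return all nodes below root in the transitive closure of relation.

    triples: iterable of (subject, relation, object), where the relation is
             assumed to point upward (e.g. "Wheel partOf Car").
    """
    # Index the chosen relation downward: general node -> more specific nodes.
    children = {}
    for subject, rel, obj in triples:
        if rel == relation:
            children.setdefault(obj, set()).add(subject)

    closure, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in closure:
                closure.add(child)
                stack.append(child)
    return closure


# Hypothetical triples, for illustration only.
triples = [
    ("Wheel", "partOf", "Car"),
    ("Tire", "partOf", "Wheel"),
    ("Sedan", "subClassOf", "Car"),
]

print(sorted(natural_taxonomy(triples, "Car", "partOf")))      # part-of taxonomy
print(sorted(natural_taxonomy(triples, "Car", "subClassOf")))  # class taxonomy
```

The same root yields a different taxonomy for each relation type, which is why a root node and a relation together are enough to specify one.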
Simple Taxonomies • φ reduces to a simple function that returns all nodes linked by the most general relation • Simple to encode • One node type • One relation type • Can be specialized over time • Thing nodes can be replaced with class, instance, and topic nodes • Relations can be replaced with more specific relations • Class hierarchy taxonomies should start vague until proven • Underspecified yet real knowledge artifacts • They are the basis for sound inference • Encoding weak artifacts in semantically powerful environments makes sense
Key Mapping Relations • [Diagram: nodes of a Web Taxonomy mapped to the Hub Ontology via sameAs, instanceOf, and subclassOf; includes a node representing an instance]
Mapping Relations: sameAs • Ideal • Lossless • Use when a precise target exists • Hard to truly determine • Allow no contradictions • Allow consistent differences • Vague or missing node documentation
Mapping Relations: subClassOf • The likely choice • Lossy • A cry for help • Such taxonomic nodes must be upgraded and moved to hub • Clarify and standardize node name • Clarify and standardize node documentation • Driven by economics (expensive) • Similarly hard to truly determine
Mapping Dirty Hierarchical Code Sets • Government code sets such as catcodes, equipment codes, IFC codes • Basic three relations did not suffice • keyTopicSubject maps topics to classes or instances • Simple one-step mappings did not suffice • Topic subjects did not exist in ontology hub • Had to be created • Had to be linked to superclass(es) in hub • Disjoined complex nodes missing subclasses in hub • Complex nodes were always disjoined • “and” usually means logical or (ATF is not a logical “and”) • Had to create a subclass for each disjoined term • E.g.: Nuclear_biological_chemical_warfare_antidotes_and_medical_treatment
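Creating one subclass per disjoined term can be sketched as a simple cross product. This assumes the complex node has already been split by hand into its disjoined parts; the function name, the naming pattern, and the normalized spellings are hypothetical and are not the team's actual tooling or node names.

```python
from itertools import product


def expand_disjoined(agents, topic, aspects):
    """Generate one subclass name per combination of disjoined terms."""
    return [f"{agent}_{topic}_{aspect}" for agent, aspect in product(agents, aspects)]


# Disjoined parts of the example node, split manually (an assumption).
subclasses = expand_disjoined(
    agents=["Nuclear", "Biological", "Chemical"],
    topic="warfare",
    aspects=["antidote", "treatment"],
)

for name in subclasses:
    print(name)
```

Three agents and two aspects give six subclasses, each of which would then be linked to its superclass(es) in the hub.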
Disjoined Complex Nodes • [Diagram node labels] • NBC_wrfr_trtmt_and_ant • NBC_wrfr_trtmt • NBC_wrfr_ant • Nuclear_warefare_antedote • Nuclear_warefare_treatment • Biological_warefare_antedote • Biological_warefare_treatment • Chemical_warfare_antedote • Chemical_warfare_treatment
Clean Taxonomy Mapping • The DDMS Taxonomy Focus Group has created a 420-node core taxonomy • Terms (URIs) will be used to characterize subject metadata of registered resources • Community of interest (COI) taxonomy terms also used for registration • Preliminary results • Many sameAs mappings • Mostly subClassOf mappings
Challenges and Mitigations • Labor intensive process • The DoD is already committed to registration of document metadata • Yet they lack a means to characterize document content • Map key taxonomies first • Register high payoff data first • Continue to use standard search techniques when they suffice • Google search is easy and registration is automatic • Non-trivial process • Non-ontologists can be trained to map • Taxonomies can be simply characterized