This article explores the key principles and requirements for effective data integration, including the need for a common syntax, central set of semantics, and a scalable hub. It also discusses how XML is well-positioned as a data integration language and how different structured data types can be integrated using ontology mapping. Additionally, the article highlights the importance of expressivity and normalization in data integration, and provides examples of OWL-based schema implementations for the DRM abstract model.
An OWL-based Schema as an Implementation for the DRM Abstract Model: A Data Integration Perspective
The Premises • Data integration is helped by a common syntax • Data integration requires one central set of semantics • Lossless integration requires expressivity • Scalable integration needs a hub • XML is well positioned to be a data integration language • All structured data types can be cut from the same cloth • Higher normalization is better
Data Integration is Helped by a Common Syntax • Ontology integration not a new topic • All data can be expressed in first order logic • Most legacy data can be expressed in OWL – a subset • All data integration can reduce to ontology2ontology integration
Data Integration Requires One Central Set of Semantics • Integration requires mappings • Mappings must losslessly relate sources to targets • Even hub-less mapping requires one set of centrally understood semantics
The Query Extension Leverage of the subClass Relation COI Taxonomy 1 Hub Ontology COI Taxonomy 2
Lossless Integration Requires Hub Expressivity • Hubs must be expressive • Must be able to express anything in the hub’s wheel • Aboriginal languages: 1, 2, 3, ∞ • Must accommodate future data sets
Scalable Integration Needs a Hub • Hub-based integration makes sense: Airline Hubs • O(n) vs. O(n²) • 3 node breaks even • 4 node is already 50% improvement
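The O(n) vs. O(n²) comparison can be checked with a short calculation. This is a minimal sketch that assumes one mapping per pair of sources in the hub-less case and one mapping per source in the hub case; the function names are illustrative and not part of the framework.

```python
def pairwise_mappings(n: int) -> int:
    """Mappings needed when every source is mapped directly to every other source."""
    return n * (n - 1) // 2


def hub_mappings(n: int) -> int:
    """Mappings needed when every source is mapped once to a shared hub."""
    return n


if __name__ == "__main__":
    # Print the two counts side by side for a few network sizes.
    for n in (3, 4, 10):
        print(n, pairwise_mappings(n), hub_mappings(n))
```

Under these assumptions three sources need 3 mappings either way, and four sources need 6 pairwise mappings against 4 hub mappings, so the hub-less design already costs 50% more.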
XML as a Data Integration Language • Not a language/syntax • A language for writing tag-based languages and markup languages: XMLSchema, RDF, many others • De facto metaformat for knowledge sharing • W3C standard for standards • Tools abound: parsers, editors, validators
All Structured Data Types from Same Cloth • Relational tables, Linnaean taxonomic nodes, OO classes, and XMLSchema quasi-entities all have the same FOL representation (OWL classes) • Relational tuples (rows), taxonomic instance nodes, OO instances, and XMLSchema data instances all have the same FOL representation (OWL instances) • Relational binary relations, OO bean properties, and XMLSchema quasi-relations all have the same FOL representation (OWL object properties)
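As a rough illustration of the "same cloth" claim, the sketch below reduces a relational row and an OO instance to the same set of (instance, property, value) statements. The `Person` class, the column names, and the values are hypothetical, and plain Python tuples stand in for OWL instances and properties.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    """Hypothetical OO class standing in for a PERSON entity."""
    unique_id: str
    ssn: str


def row_to_triples(table: str, key: str, row: dict) -> set:
    """Relational row -> class-membership triple plus one triple per column."""
    triples = {(key, "type", table)}
    triples |= {(key, column, value) for column, value in row.items()}
    return triples


def object_to_triples(name: str, obj) -> set:
    """OO instance -> class-membership triple plus one triple per field."""
    triples = {(name, "type", type(obj).__name__)}
    triples |= {(name, f.name, getattr(obj, f.name)) for f in fields(obj)}
    return triples


row_view = row_to_triples("Person", "Alice", {"unique_id": "U-1", "ssn": "S-1"})
object_view = object_to_triples("Alice", Person(unique_id="U-1", ssn="S-1"))
```

Both views produce the same statements, which is the sense in which a table row and an object can be treated as the same kind of instance once a shared representation is chosen.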
Relational Normalization: Higher is Better • Myth: Normalized relational data access is slower • Oracle speed literature advises 3rd or higher • Redesign/reoptimization of schema is slow • Attribute semantics entombed in attribute description or the mind of the designer • No ability to easily query over all SSNs • birthDate attribute does not know about birthPlace attribute • Will not support sophisticated temporal/spatial analysis • Attributes cannot be mapped without the precise characterization of inherent relation and class of data • Mainstream content ontologies are 5th or close • MBI OWL encoding encourages/enforces normalization
Disjoined Complex Nodes • NBC_warfare_treatment_and_antidote • NBC_warfare_treatment • NBC_warfare_antidote • Nuclear_warfare_antidote • Nuclear_warfare_treatment • Biological_warfare_antidote • Biological_warfare_treatment • Chemical_warfare_antidote • Chemical_warfare_treatment
OWL DRM Implementation • DataReferenceModel • Categorization • Description • Relational • PersonSampleData.owl • PersonSampleSchema.owl • PersonSampleSchemaMappings.owl • RelationalMapping.owl • RelationalSchemaReferenceModel.owl • Core-Taxonomy-Version0_73.owl • SampleTaxonomyToHubMappings.owl • TaxonomyMapping.owl • TaxonomyReferenceModel.owl • Sharing • DataReferenceModel.owl • IntegrationHubOntology.owl • OWLExtensions.owl • Topics.owl
Core-Taxonomy-Version0_73.owl
<Taxonomy tax:categoryIdentifier="CoreCESMWGTaxonomy">
  <displayName>Core Enterprise Services Metadata Working Group Taxonomy</displayName>
  <description></description>
  <versionInfo>Version: 0.75a; Date: 7/14/2005</versionInfo>
  <taxonomicDefinitions tax:resource="&root;DataReferenceModel/Categorization/TaxonomyReferenceModel.owl"/>
  <topCategory tax:referencedItem="Core_taxonomy_top_category"/>
</Taxonomy>
<Category tax:categoryIdentifier="Core_taxonomy_top_category" tax:displayName="core taxonomy top category">
  <parentCategoryOf tax:referencedItem="Account"/>
  <parentCategoryOf tax:referencedItem="Action"/>
  <parentCategoryOf tax:referencedItem="Agreement"/>
  …
</Category>
<Category tax:categoryIdentifier="Account" tax:displayName="account">
  <definition tax:value="A separate financial reporting unit for budget, management, and or accounting purposes." tax:source="GAO/AFMD2.1.1"/>
  <parentCategoryOf tax:referencedItem="Accounting_account"/>
  <parentCategoryOf tax:referencedItem="Federal_fund_account"/>
  <parentCategoryOf tax:referencedItem="Nonfederal__account"/>
  <parentCategoryOf tax:referencedItem="Settlement_account"/>
  <subCategoryOf tax:referencedItem="Core_taxonomy_top_category"/>
  <source tax:value="Content_Team"/>
  <wordnet_id_synonyms tax:value="WN-ID 564 account, business relationship"/>
</Category>
SampleTaxonomyToHubMappings.owl
<TaxonomicMappingSet taxmap:identifier="CoreTaxonomySampleMappings">
  <description></description>
  <taxonomyMappingMetaDefinitions taxmap:referencedItem="&root;/DataReferenceModel/Categorization/TaxonomyMapping.owl"/>
  <taxonomyToBeMapped taxmap:referencedItem="&root;/DataReferenceModel/Categorization/Core-Taxonomy-Version0_73.owl"/>
  <taxonomyMappingTarget taxmap:referencedItem="&root;/DataReferenceModel/IntegrationHubOntology.owl"/>
</TaxonomicMappingSet>
<MapCategoryToClass taxmap:categoryToBeMapped="Person">
  <!--This is a simple example of category mapping where there exists a class exactly matching the entity.-->
  <identicalTo taxmap:mappingTargetInHubOntology="&hub;Person"/>
</MapCategoryToClass>
<MapCategoryToClass taxmap:categoryToBeMapped="Federal_fund_account">
  <!--This is a simple example of category mapping where there does not exist a class exactly matching the entity.-->
  <isAParticularKindOf taxmap:mappingTargetInHubOntology="&hub;Account"/>
</MapCategoryToClass>
PersonSampleSchema.owl
<RelationalSchema rel:relationalIdentifier="PersonSampleSchema">
  <description>A simple schema for a person which is intended to demonstrate proper encoding of schema data within this system.</description>
  <rel:schemaDefinitionsUsed rel:referencedItem="&root;/DataReferenceModel/Description/Relational/RelationalSchemaReferenceModel.owl"/>
</RelationalSchema>
<Entity rel:relationalIdentifier="PERSON">
  <description>A person table</description>
</Entity>
<AttributeRelation rel:relationalIdentifier="SSN">
  <attributeOf rel:referencedItem="PERSON"/>
  <attributeType rel:referencedItem="SSN__entity"/>
</AttributeRelation>
<AttributeEntity rel:relationalIdentifier="SSN__entity">
  <entityDataType rel:referencedItem="&rel;String"/>
</AttributeEntity>
<Relation rel:relationalIdentifier="CHILD_MOTHER">
  <entity1 rel:referencedItem="PERSON" rel:entity1Name="CHILD"/>
  <entity2 rel:referencedItem="WOMAN" rel:entity1Name="MOTHER"/>
</Relation>
…
</rdf:RDF>
PersonSampleData.owl
<rel:RelationalData rel:relationalIdentifier="PersonSampleData">
  <rel:description>Actual Person data.</rel:description>
  <rel:schemaDefinitionsUsed rel:referencedItem="&root;/DataReferenceModel/Description/Relational/PersonSampleSchema.owl"/>
  <rel:schemaUsed rel:referencedItem="&root;/DataReferenceModel/Description/Relational/PersonSampleSchema.owl"/>
</rel:RelationalData>
<per:PERSON rel:relationalIdentifier="Alice">
  <per:UNIQUE_ID>
    <rel:String>
      <rel:value rdf:datatype="&xsd;string">537-A3</rel:value>
    </rel:String>
  </per:UNIQUE_ID>
  <per:SSN>
    <rel:SSN_entity>
      <rel:value rdf:datatype="&xsd;string">537-A3</rel:value>
    </rel:SSN_entity>
  </per:SSN>
</per:PERSON>
<per:PERSON rel:relationalIdentifier="Jim">
  <per:UNIQUE_ID>
    <rel:UNIQUE_ID_entity>
      <rel:value rdf:datatype="&xsd;string">109-A3</rel:value>
    </rel:UNIQUE_ID_entity>
  </per:UNIQUE_ID>
  <per:CHILD_MOTHER rel:referencedItem="Alice"/>
</per:PERSON>
PersonSampleSchemaMappings.owl
…
<MapEntity relmap:relationalItemToBeMapped="PERSON">
  <identicalTo relmap:mappingTargetInHubOntology="&hub;Person"/>
</MapEntity>
<MapAttributeRelation relmap:relationalItemToBeMapped="UNIQUE_ID">
  <isAMoreParticularRelationFor relmap:mappingTargetInHubOntology="&hub;identificationString"/>
</MapAttributeRelation>
<MapEntity relmap:relationalItemToBeMapped="UNIQUE_ID__entity">
  <isAParticularKindOf relmap:mappingTargetInHubOntology="&hub;UNIQUE_ID"/>
</MapEntity>
<MapAttributeRelation relmap:relationalItemToBeMapped="SSN">
  <isAMoreParticularRelationFor relmap:mappingTargetInHubOntology="&hub;identificationString"/>
</MapAttributeRelation>
<MapEntity relmap:relationalItemToBeMapped="SSN__entity">
  <isAParticularKindOf relmap:mappingTargetInHubOntology="&hub;UNIQUE_ID"/>
</MapEntity>
<MapRelation relmap:relationalItemToBeMapped="CHILD_MOTHER">
  <isIdenticalTo relmap:mappingTargetInHubOntology="&hub;mother"/>
</MapRelation>
Remaining OWL DRM Work • Handle all data types • Formalize external references • Model general XMLSchema • Encode DRM Sharing • Pre-alpha – needs more use • Incorporate feedback • Editing tools • Full integration
OWL DRM Value • OWL provides a common syntax for data integration • OWL DRM Framework fosters the use of one central set of integration semantics • OWL supports lossless integration • Its expressivity exceeds relational, OO, and XMLSchema models • XML is well positioned to be a data integration language • OWL represents all structured data • The framework gives a unified theory of data • High normalization supports complete integration and better analysis • OWL DRM Framework mandates high normalization
Web Presence • http://semweb.mcdonaldbradley.com • OWL code • http://semweb.mcdonaldbradley.com/OWL • Protégé-loadable Opencyc stripped translation (3 minutes to load) • Jena2-loadable FreeToGov Cyc • Taxonomy mapping framework • Semantic Web Tools • http://semweb.mcdonaldbradley.com/tools • Semantic Surf Board • Taxonomy Editor
What is an Ontology? • Stuffy definition: “A specification of a conceptualization” • A more practical definition: A high fidelity data model • Most powerful definition: A First-Order Logic-based data model
What are Ontologies Good For? • High fidelity data modeling for • Providing large quasi-standard Java runtime data structures - hierarchies and accessor methods • Providing large quasi-standard relational schema • Data integration • Lossless schema integration • Lossless taxonomy integration • Knowledge Sharing • Agent Communication • Knowledge discovery • No limit eventually
Ontology-based Data Integration • Hub-based integration makes sense • O(n) vs. O(n²) • 3 node breaks even • 4 node is already 50% improvement • Hubs must be expressive • Must be able to express anything in the hub’s wheel • Aboriginal languages: 1, 2, 3, ∞ • Must accommodate future data sets
How Expressive is Your Hub? • The relational calculus? • The object-oriented model? • The First Order Predicate Calculus (logic)?
What is Logic? • Invented by Gottlob Frege • Components: • Constants • Predicates • Functions • Logical Operators: • AND • OR • NOT • Quantifiers: ∀, ∃ • (∀X) (∃Y) (AND (Person X) (Person Y) (loves X Y)) • Basis for symbolic artificial intelligence
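The components listed on this slide can be exercised over a small finite domain. The sketch below assumes the quantified formula is read as "every person loves some person", with the universal quantifier over X, the existential quantifier over Y, and both variables restricted to persons; the constants and the contents of the `loves` relation are made up for illustration.

```python
def everyone_loves_someone(persons: set, loves: set) -> bool:
    """Evaluate: for all X there exists Y such that loves(X, Y), with X and Y ranging over persons.

    `persons` holds the constants satisfying the Person predicate;
    `loves` is the binary predicate, stored as a set of (lover, loved) pairs.
    """
    return all(any((x, y) in loves for y in persons) for x in persons)


# Hypothetical constants and facts.
persons = {"alice", "jim"}
mutual = {("alice", "jim"), ("jim", "alice")}
one_sided = {("alice", "jim")}

holds_for_mutual = everyone_loves_someone(persons, mutual)
holds_for_one_sided = everyone_loves_someone(persons, one_sided)
```

Here `all` plays the role of the universal quantifier and `any` the existential one. The formula holds for the mutual facts and fails for the one-sided facts, because in the latter "jim" loves no one.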
Why Should We Care? • All of the field of mathematics can be expressed in logic • Bertrand Russell and Alfred North Whitehead • “Non-mathematical” artifacts like love, religion, and opinion are expressible • Charles Sanders Peirce • The ULTIMATE INTEGRATION GLUE
Our Present Generation Capability • COTS trainable concept filtering – shotgun approach • Investment is in training every concept against its own document list • …then easy to use • Precision is in the low 80% range • Great tool when less search precision and recall are acceptable: • When missed results are acceptable • When analysts have time to wade through irrelevant search results
Why Ontology-based Resource Discovery? • Precision in resource discovery (no limit to precision and recall) • More buckets - formal deep ontologies (higher resolution in search) • Formal unambiguous concept definitions (fewer mistakes in registration) • Ease of resource discovery • Additional paths/relations for navigating to resources • Federation of legacy code sets/taxonomies (users can use familiar taxonomies) • Power in resource discovery • Search broadening and narrowing • Inclusion of related terms in search
Key Metrics For Ontology Selection • Breadth • Depth • Relevance • Robustness • Rigor (formal foundations) • Support • Designers’ Credentials
Cycorp Incorporated • Designed for complete high level breadth • They started with *many* random topics • Largest ontology by far (3,000,000 triple facts) • Deep on DoD/intel/terrorism (DARPA funded) • 20 years in the ontology business • Staff of professional ontologists
HPKB/RKF Cyc Translation into OWL • Three varieties of Cyc • Full Cyc • Over 3,000,000 Axioms (pieces of knowledge) • Commercial product • $200K plus • Free-to-government Cyc • Most of the military/intel of Full Cyc • 12,000 Classes • 900 Relations • 23,000 Instances • Opencyc • Free • 3000 Classes • 1500 Relations
Cyc Translation Leverage – Search • Total Classes Translated: 12,000 • vs. ~600 in HF • 20-fold increase in search resolution • Total Instances Translated: 23,200 • vs. 0 in HF • Instances increase search resolution • What if we could register documents against just “Terrorism” and not “Osama bin Laden”? • Total Relations Translated: 900 • vs. 4 in HF • Relations provide “suggested” pathways to a possibly more appropriate mapping target for • Increased resolution • Simpler target selection in registration and mapping
Why Taxonomies for Resource Discovery? • Top-down navigation of formal ontologies can be onerous – even for the ontologist • Not semantic indices for the masses • Taxonomies are still first-class indices • Many legacy taxonomies in existence • Intuitive to understand • Easy to build • Typically customized to a particular domain • Customizable to a particular domain
Downside to Taxonomies • Often idiosyncratic and not generally useful • Resources registered in just one taxonomy do not appear in overlapping taxonomies • Resources must be registered against all relevant taxonomies or… • Searchers must look in all appropriate taxonomies
A Taxonomic Definition • A hierarchy of things • Taxonomic nodes are all instances of type thing • The taxonomic relation is of type relatedTo • It is the most general relation that subsumes all others
A Solution • Map each taxonomy to a single hub ontology • Every node in each taxonomy maps to at least one ontological class, instance, or topic in the hub • Relations between nodes are not mapped (the semantics do not exist) • Mapping to a more general class loses search precision for some searches
General Taxonomic Representation • All nodes are instances of type Thing • Classes (under OWL-Full-style) • Topics • Instances of Thing subtypes • All relations are of type relatedTo • Top of relation lattice • Relations are directional (pointing down by fiat)
Key Mapping Relations • [Diagram: Web Taxonomy nodes linked to the Hub Ontology by sameAs, instanceOf, and subclassOf relations, including a node representing an instance]
sameAs and the Ideal World • In the ideal world: • Every node of every COI taxonomy would be mapped to the core • Every mapping would be sameAs • All node-based queries would be lossless for both precision and recall • Sometimes hard to truly determine • Allow no contradictions • Allow consistent differences • Vague or missing node documentation • [Diagram: COI Taxonomy, Hub, COI Taxonomy]
subClassOf as a Backup • The likely choice • Lossy • A cry for help • Such taxonomic nodes must be upgraded and moved to hub • Clarify and standardize node name • Clarify and standardize node documentation • Driven by economics (expensive) • Similarly hard to truly determine • [Diagram: COI Taxonomy, Hub, COI Taxonomy]
The Query Extension Leverage of the subClass Relation COI Taxonomy 1 Hub Ontology COI Taxonomy 2
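The query extension idea can be sketched as a lookup that routes a taxonomy node through its hub class and back out to every node mapped to the same class. This is only an illustration: the taxonomy labels `COI1` and `COI2`, the resource names, and the flat tuple layout are assumptions, while `Federal_fund_account`, `Account`, and `Person` are borrowed from the sample mappings.

```python
# (taxonomy, node, mapping relation, hub class); all entries are illustrative.
MAPPINGS = [
    ("COI1", "Federal_fund_account", "subClassOf", "Account"),
    ("COI2", "Account", "sameAs", "Account"),
    ("COI2", "Person", "sameAs", "Person"),
]

# Hypothetical resources, each registered against a single taxonomy node.
REGISTRATIONS = {
    "budget_report.doc": ("COI1", "Federal_fund_account"),
    "ledger.xls": ("COI2", "Account"),
    "staff_list.doc": ("COI2", "Person"),
}


def hub_classes(taxonomy: str, node: str) -> set:
    """Hub classes that a taxonomy node is mapped to, by any mapping relation."""
    return {hub for tax, n, _rel, hub in MAPPINGS if (tax, n) == (taxonomy, node)}


def extended_nodes(taxonomy: str, node: str) -> set:
    """All taxonomy nodes, in any taxonomy, mapped to the same hub classes."""
    targets = hub_classes(taxonomy, node)
    return {(tax, n) for tax, n, _rel, hub in MAPPINGS if hub in targets}


def discover(taxonomy: str, node: str) -> list:
    """Resources registered against the node or against any node reached via the hub."""
    nodes = extended_nodes(taxonomy, node)
    return sorted(name for name, reg in REGISTRATIONS.items() if reg in nodes)
```

A search on `Account` in `COI2` also returns the report registered under `Federal_fund_account` in `COI1`. The reverse search from `Federal_fund_account` returns the general ledger as well, which illustrates why a subClassOf mapping is lossy for precision.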
Clean Taxonomy Mapping • The DDMS Taxonomy Focus Group has created a 420-node core taxonomy • Terms (URIs) will be used to characterize subject metadata of registered resources • Community of interest (COI) taxonomy terms also used for registration • Preliminary results • Many sameAs mappings • Mostly subClassOf mappings
Mapping Dirty Hierarchical Code Sets • Government code sets such as catcodes, equipment codes, IFC codes • Simple one-step mappings did not suffice • Disjoined complex nodes missing subclasses in hub • Complex nodes were always disjoined • “and” usually means logical or (ATF is not logical and) • Had to create a subclass for each disjoined term • E.g.: Nuclear_biological_chemical_warfare_antidotes_and_medical_treatment • Node names contradicted node documentation • Basic three relations did not suffice • Added relatedTo • Topic subjects did not exist in ontology hub • Had to be created • Had to be linked to superclass(es) in hub
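Creating one subclass per disjoined term can be sketched as a cross product over the parts of a complex node name. The naming template, the parent class name, and the way the name is split into "agents" and "kinds" are assumptions made for illustration; the real code sets may need case-by-case handling.

```python
from itertools import product


def split_disjoined(parent: str, agents: list, kinds: list) -> list:
    """Return (subclass, superclass) pairs, one per combination of disjoined terms."""
    return [(f"{agent}_warfare_{kind}", parent) for agent, kind in product(agents, kinds)]


# Hypothetical decomposition of a complex NBC warfare node.
pairs = split_disjoined(
    "NBC_warfare_treatment_and_antidote",
    ["Nuclear", "Biological", "Chemical"],
    ["antidote", "treatment"],
)
```

Each resulting pair names a new hub subclass and the complex node it was split from, so every disjoined term gets its own mapping target.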