Semantic Interoperability – Yes! Presentation to the CIO Council, June 18th, 2007. Lucian Russell, Ph.D.
Semantic Interoperability • What is it? • The Data Reference Model Version 2.0 states: • 3.2. Introduction • 3.2.1. What is Data Description and Why is it Important … • Semantic Interoperability: Implementing information sharing infrastructures between discrete content owners (even with using service-oriented architectures or business process modeling approaches) still has to contend with problems with different contexts and their associated meanings. Semantic interoperability is a capability that enables enhanced automated discovery and usage of data due to the enhanced meaning (semantics) that are provided for data. • Semantic Interoperability is a condition that is created with respect to a Data Resource that is under the control of an Agency. • Associated with each Data Resource is another resource that allows a “reasoning service” to identify its “semantics” and determine its value with respect to a query. • Left on the table: how are “reasoning services” created, and what additional data resources are needed?
In 2005 there was no direct answer, only a template • See DRM Version 2.0 Chapter 2, Figure 2.5 • Digital Data Resources can be Structured, Semi-Structured or Unstructured and can be contained within a document. These can describe a Data Asset. • On the other hand, a Data Asset can provide a management context for a Digital Data Resource. • Topics in a language can categorize either (i.e. they are instances of a class designated by the topic word). • To support “enhanced automated discovery”, though, we need to combine some collection of instances of these three entities (see the sketch below). • Interoperability would then depend on the adequacy of the combination. • In 2005 the way was unclear, but there was a template. • On Page 18: “…Data Description artifacts are an output of the process of providing data syntax and semantics and a meaningful identification for a data resource so as to make it visible and usable by a COI.” • The most effective government COI is the Global Change Master Directory • The GCMD indexes 18 petabytes of multi-agency data: it was the template
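The DRM template above names three entity types – Digital Data Resources, Data Assets, and Topics – and the relationships among them. A minimal sketch of how instances of the three could be linked for discovery follows; the class layout and field names are assumptions made for illustration, not part of the DRM itself.

```python
# A minimal sketch of the DRM Figure 2.5 vocabulary: Digital Data Resources
# (structured, semi-structured, or unstructured), Data Assets that provide a
# management context, and Topics that categorize either. The class layout is
# an illustrative assumption, not part of the DRM itself.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Topic:
    word: str                      # the topic word designating a class

@dataclass
class DigitalDataResource:
    name: str
    structure: str                 # "structured" | "semi-structured" | "unstructured"
    topics: List[Topic] = field(default_factory=list)

@dataclass
class DataAsset:
    name: str                      # the management context, e.g. a repository or program
    resources: List[DigitalDataResource] = field(default_factory=list)
    topics: List[Topic] = field(default_factory=list)

climate = Topic("climate")
gcmd = DataAsset("Global Change Master Directory",
                 [DigitalDataResource("ocean buoy readings", "structured", [climate])],
                 [climate])
```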
In 2006 there were several breakthroughs • The unclassified R&D sponsored by the Intelligence Community produced several important breakthroughs that impact “enhanced discovery services” • AQUAINT – Advanced Question Answering for Intelligence • WordNet was enhanced to create a disambiguated description of the most common words in the English language, some 115,000 words and their meanings. • A markup language for time, TimeML • An extraction technique to parse English-language text and create logical relations • NIMD – Novel Intelligence from Massive Data (FOUO) • Publicly released (non-FOUO) slides announced a breakthrough from the IKRIS project • Interoperable Knowledge Representation for Intelligence Support (IKRIS) • The IKRIS Project’s Challenge: • “How to enable interoperability of knowledge representation and reasoning (KR&R) technology developed by multiple organizations in multiple DTO programs and designed to perform different tasks” • The Results: • A new language, IKL, that translates among knowledge representation languages • An extension of logic to 2nd-order and non-monotonic expressions • A proof of equivalence among process specifications
These results open the way to SI using - English! • The implications of these results are staggering – English descriptions in documents can be used to enable enhanced automated discovery. • There were limitations on the concepts that could be represented: • Prior “semantic” technology (e.g. OWL-DL) only allowed for precise descriptions of concepts represented by nouns, i.e. taxonomies. “Ontologies” were defined as overlapping taxonomies. • WordNet now allows nouns to be unambiguously described. • WordNet has clearly demonstrated that nouns have single-subtype taxonomies but verbs do not: because there is a time element in every verb’s meaning, verbs fall into four sub-classes (verbs describe 4-D motions or state changes). • Consequently nouns and verbs cannot be intermixed meaningfully (without inconsistency) in OWL-DL ontologies. • Representing concepts using verbs entails describing processes, which are multiple verbs in “Part-of” (meronymic/holonymic) relationships. • English descriptions of processes were imprecise because relative time concepts were heretofore too poorly understood to support automation. • With WordNet and TimeML we can now precisely describe the processes that create and change data as well as the nouns used for the real world (see the WordNet sketch below).
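The contrast between noun and verb structure can be seen directly in WordNet. A minimal sketch, assuming NLTK with its WordNet corpus installed; the example words (“train”, “board”) are illustrative.

```python
# A minimal sketch, assuming NLTK and its WordNet corpus are installed
# (pip install nltk; then nltk.download('wordnet') once).
from nltk.corpus import wordnet as wn

# Nouns: each sense sits in a single hypernym (IS-A) taxonomy.
train = wn.synsets('train', pos=wn.NOUN)[0]
print(train.definition())
for path in train.hypernym_paths():
    print(' -> '.join(s.name() for s in path))

# Verbs: senses carry a time element and do not form one clean subtype
# tree, so their hypernym chains are shallow or absent.
board = wn.synsets('board', pos=wn.VERB)[0]
print(board.definition())
print(board.hypernyms())
```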
TimeML • Markup Language for Temporal and Event Expressions • TimeML is a robust specification language for events and temporal expressions in natural language. It is designed to address four problems in event and temporal expression markup: • (1) Time stamping of events (identifying an event and anchoring it in time); • (2) Ordering events with respect to one another (lexical versus discourse properties of ordering); • (3) Reasoning with contextually underspecified temporal expressions (temporal functions such as 'last week' and 'two weeks before'); • (4) Reasoning about the persistence of events (how long does an event or the outcome of an event last). • The rules that identify temporal dependencies can be used to insert tags into text; these tags can then be processed automatically (a markup sketch follows below). • Processes that entail other sub-processes can also be processed logically, e.g. inferring from “A filed an application” the fact that “A filled out an application”. • Language Computer Corporation (AQUAINT) finds logical relations in text
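A minimal sketch of what TimeML-style markup looks like, using the standard EVENT, TIMEX3, and TLINK tags; the sentence and attribute values are illustrative, not taken from the presentation.

```python
# A minimal sketch of TimeML-style markup; the sentence and the attribute
# values are illustrative. The TLINK elements carry the temporal
# dependencies that a reasoner can process.
import xml.etree.ElementTree as ET

annotated = """
<TimeML>
  The applicant <EVENT eid="e1" class="OCCURRENCE">filed</EVENT> the form on
  <TIMEX3 tid="t1" type="DATE" value="2007-06-18">June 18th, 2007</TIMEX3>,
  two weeks before the interview was
  <EVENT eid="e2" class="OCCURRENCE">scheduled</EVENT>.
  <TLINK lid="l1" eventInstanceID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
  <TLINK lid="l2" eventInstanceID="e1" relatedToEventInstance="e2" relType="BEFORE"/>
</TimeML>
"""

root = ET.fromstring(annotated)
for link in root.iter('TLINK'):
    print(link.attrib)   # the temporal dependencies extracted from the text
```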
LCC’s Jaguar: Knowledge Extraction • LCC’s Jaguar product can automatically generate ontologies and structured knowledge bases from text • Ontologies form the framework or “skeleton” of the knowledge base • A rich set of semantic relations forms the “muscle” that connects concepts in the knowledge base • [Diagram: an extracted fragment centered on “train”, with IS-A links to “freight train” and “passenger train” and AGENT/THEME/MEANS relations to events such as board, carry, ship, conduct, transport, arrive, run, and stop]
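The relations in the diagram can be thought of as typed triples feeding a knowledge base. A minimal sketch follows; the relation names mirror the diagram, but the data structure is an assumption, not Jaguar’s actual output format.

```python
# A minimal sketch: extracted semantic relations held as typed triples and
# indexed by concept. Relation names mirror the diagram; the structure is
# an illustrative assumption, not Jaguar's output format.
from collections import defaultdict

relations = [
    ("freight train",   "IS-A",  "train"),
    ("passenger train", "IS-A",  "train"),
    ("board",           "THEME", "train"),
    ("carry",           "MEANS", "train"),
    ("transport",       "AGENT", "train"),
]

by_concept = defaultdict(list)
for head, rel, tail in relations:
    by_concept[tail].append((rel, head))

print(by_concept["train"])   # everything a discovery service knows about "train"
```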
It is now Cost Effective to “Document” Databases! • Previously, documentation of databases was a black hole for budget $$ • Only people (not machines) could read the documentation • It was never kept up to date • Rules within it “evolved” over time • Hence people never read the documentation anyway, and the data was inconsistent • ETL techniques, data warehouses and data marts were used to get uniformity, but substituting computer-generated data for stored data is no guarantee of accuracy. • Now text descriptions of databases can be processed automatically • The correct WordNet sense of each word can be used. A correct description of the relationships among data attributes, and of the processes that describe how they were created, can now be used for semantic processing (a sketch of a sense-tagged data dictionary entry follows below). • The text can be extracted and used to create knowledge repositories! • AQUAINT and NIMD also enhanced the CYC Knowledge Base • CYCORP has the world’s largest general ontology and knowledge base describing the real world. It can be extended and used for Interoperability.
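One way to make such documentation machine-processable is to record, for each content word in a column description, the WordNet sense that is intended. A minimal sketch, assuming NLTK’s WordNet interface; the table, column, and chosen sense numbers are illustrative, not drawn from any real agency data dictionary.

```python
# A minimal sketch of a sense-tagged data dictionary entry. The table,
# column, and chosen sense numbers are illustrative assumptions.
from nltk.corpus import wordnet as wn

entry = {
    "table": "arrivals",
    "column": "port_of_entry",
    "description": "Port where the applicant entered the country",
    # The documenter records which WordNet sense each content word carries,
    # so later semantic processing is unambiguous.
    "senses": {
        "port": wn.synset("port.n.01").name(),
        "enter": wn.synset("enter.v.01").name(),
        "country": wn.synset("country.n.02").name(),
    },
}

for word, sense in entry["senses"].items():
    print(word, "->", sense, ":", wn.synset(sense).definition())
```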
How can this be done? Carefully! • Old-fashioned 1970s data modeling destroys distinctions: the lost gold becomes a gray mass of sameness • Real World Data: data are samples (mathematical patterns) • Social World Data: data are state changes • Data about Individuals: data are both
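A minimal sketch of the distinction: real-world data are samples of a continuous process, social-world data record discrete state changes, and data about individuals combine the two. All class and field names here are illustrative assumptions.

```python
# A minimal sketch of the distinction; all names are illustrative.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Sample:                  # Real World Data: one point from a continuous process
    sensor_id: str
    observed_at: datetime
    value: float               # e.g. sea-surface temperature in degrees C

@dataclass
class StateChange:             # Social World Data: a discrete, recorded state change
    subject_id: str
    effective_at: datetime
    attribute: str
    old_value: str
    new_value: str

reading = Sample("buoy-17", datetime(2007, 6, 18, 12, 0), 18.4)
event = StateChange("A-123", datetime(2007, 6, 18), "marital_status", "single", "married")
```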
Look at each type of data and how it comes into being! • Example: A USCIS form has 10 object types • 1: Data Elements: Name & Country of Citizenship • 2: Data Elements: Identification Numbers • 3: Data Elements: Residence History • 4: Data Elements: Education History • 5: Data Elements: Employment History • 6: Data Elements: Arrivals & Departures • 7: Data Elements: Arrests & Citations • 8: Data Elements: Marital Information • 9: Data Elements: Children’s Names • 10: Data Elements: Parents’ Country of Citizenship • Plus: Photograph, Signature, Fingerprints
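A minimal sketch of keeping those object types distinct when modeling the form, rather than flattening them into one generic record; the class and field names are illustrative, not the USCIS schema.

```python
# A minimal sketch; class and field names are illustrative, not USCIS's.
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class ResidenceEntry:            # object type 3: a reported residence interval
    address: str
    moved_in: date
    moved_out: Optional[date]

@dataclass
class ArrivalDeparture:          # object type 6: an event recorded at a port of entry
    port: str
    arrived: Optional[date]
    departed: Optional[date]

@dataclass
class Applicant:                 # object types 1-2 plus collections of the rest
    name: str
    country_of_citizenship: str
    identification_numbers: List[str] = field(default_factory=list)
    residence_history: List[ResidenceEntry] = field(default_factory=list)
    arrivals_departures: List[ArrivalDeparture] = field(default_factory=list)
```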
Structured Data and Schema Mismatch • Syntactic Schema Mismatch: • IEEE Computer, December 1991, showed that a large number of syntactic mismatches among representations of data were a barrier to data integration and sharing. • Entities = Attributes = Data Values – Nonsense or Computer Science? • Computer Science: Semantic Schema Mismatches • In 1986, Computing Surveys published the observation that, when integrating databases, one database’s Entity can be another database’s Attribute • In 1991, a research result showed that an Attribute in one database could be a Data Value in another database • So, with the potential for this degree of mismatch, sending XML schemas to a repository is not necessarily a help to semantic interoperability (a sketch of the three representations follows below). • The field of database integration essentially went dead in 1991 • HOWEVER, another side effect of IKRIS is that it is now possible to detect semantic similarities among databases even when the data are represented differently as entities, attributes and data values – it won’t be perfect, but it will be a lot better than what we have. • Additional work is starting on using ANSI Data Dictionary structures and populating them automatically.
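A minimal sketch of the entity / attribute / data-value mismatch: the same fact, “person P-1 is a citizen of Canada”, modeled three ways. All table and column names are illustrative.

```python
# A minimal sketch of entity/attribute/data-value mismatch; the same fact
# is stored three different ways. All names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")

# Database A: "Country" is an entity (its own table).
db.executescript("""
CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE person_a (person_id TEXT, citizenship_country_id INTEGER);
INSERT INTO country VALUES (1, 'Canada');
INSERT INTO person_a VALUES ('P-1', 1);
""")

# Database B: citizenship is an attribute (a column on the person table).
db.executescript("""
CREATE TABLE person_b (person_id TEXT, citizenship TEXT);
INSERT INTO person_b VALUES ('P-1', 'Canada');
""")

# Database C: "citizenship" itself appears only as a data value in a
# generic entity-attribute-value table.
db.executescript("""
CREATE TABLE person_eav (person_id TEXT, attribute TEXT, value TEXT);
INSERT INTO person_eav VALUES ('P-1', 'citizenship', 'Canada');
""")

# All three encode the same fact, but comparing schemas alone will not
# reveal it; semantic descriptions of each schema are needed.
```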
In Conclusion • It is possible to increase Data Sharing in the government • To enable enhanced automated discovery • Start with the Global Change Master Directory as a template and expand • Create new data descriptions • Use the English language correctly • Build process descriptions that show how and when data was generated • Use advanced linguistic tools to extract data relationships • Integrate with a general knowledge base • To overcome Schema Mismatch • Revisit old data models and carefully expand existing definitions to show the full semantics of the data schema • Keep in mind that in the Real World one collects data samples of continuous processes, whereas the Social World records state changes. Individuals’ data combines both. • There is no easy solution, but advanced tools ensure that any effort spent today is re-usable tomorrow, and so there is no loss of value for investments in improving data descriptions.