Explore advancements in Data Reference Model (DRM) and Service-Oriented Architectures (SOAs) for improved context management and data sharing. Learn about challenges, potential solutions, and the collaboration driving progress in this field.
Building The DRM 3.0 – and SOAs and the Web 3.0 Too! Can We Start Now?
Lucian Russell, PhD
SICoP Special Conference, February 6th, 2007
The Thesis and the Speakers
• Thesis: We can manage context across multiple documents and organizations because of new developments from Computer Science R&D
• Lucian Russell & Bryan Aucoin: we agree that the DRM 2.0 was a good start, but we can and should go further
• Bryan Aucoin: An updated perspective on Data Sharing (Ch 5) and what these services need from Data Description (Ch 3) and Data Context (Ch 4)
• Lucian Russell: Overcoming limitations of DRM Data Description and Data Context
  • A more exact use of language is possible
  • We can build upon an existing mixed knowledge structure that has been in use in the government for over 15 years
  • It is now possible to mine text documents as well as schemas to build a unified structure to support Data Sharing
  • The same structures enable a new SOA capability and the Web 3.0
• Prof. Christiane Fellbaum: The new capabilities of WordNet 2.1
• Lola Olsen: Success of the Global Change Master Directory program, the DRM 2.0 template for blending Data Description and Data Context
• Dr. John Prange: Knowledge discovery using natural language processing
• Dr. Michael Witbrock: CYC and its ability to realize the promise of IKRIS
• Panel Discussion: for SOAs, the DRM and the Web 3.0 – can we start now?
Background: Federal Enterprise Architecture Models
• The Clinger-Cohen Act of 1996 mandates that each agency
  • Have a Chief Information Officer (CIO)
  • Create an Enterprise Architecture (EA)
• The President’s Management Agenda required compliance with the Act
• EA Reference Models were created starting in 2002
  • The last was the Data Reference Model (DRM)
  • The DRM “took longer”
    • Version 1.0 was finished in December 2003
    • Version 2.0 was finished in November 2005
    • Version 2.0 was approved in December 2005
• The DRM had three components
  • Data Description – Chapter 3
  • Data Context – Chapter 4
  • Data Sharing – Chapter 5
• The final writing team took the penultimate document’s Abstract Model and added new Introductions and Guidance. Data Sharing became primary!
The Data Reference Model 2.0: Writing Assignments
[Figure: the Introduction & Guidance sub-sections, divided between Bryan Aucoin and Lucian Russell]
Current Challenges
• First one must understand the challenges and the areas where there is a need for specific change
• Current approaches to making data available (1) among agencies and (2) among computer systems on the Web are just pouring the old vinegar into a new wine bottle
  • They still require human intervention to work
  • They at most match words (i.e. XML tags) rather than reasoning about them
  • They isolate concepts from content
  • They are too weak to manage the infoglut; we need to move towards a better means of managing context across multiple documents and organizations
• Bryan Aucoin and Lucian Russell wanted to continue a collaboration on how to improve the DRM 2.0, but there was no OMB funding at the time
• This funding is no longer needed, so the collaboration can resume, and the Semantic Interoperability Community of Practice will enable a continuing community-wide collaboration
Data Context: The DRM 2.0 Abstract Model
[Figure: a Taxonomy of Topics (words) categorizing Data Assets such as a directory, a file of documents, a document database, or a file of data]
Data Description: The DRM 2.0 Abstract Model
[Figure: a Data Schema contains Entities; an Entity contains Attributes, participates in Relationships, and relates to other Entities; an Attribute is constrained by a Data Type; words in the schema refer to Documents]
In Implementing Data Sharing there are Barriers
• The distinction between Entity and Attribute is not clear: one can be the other!
  • An Entity in one database can be an Attribute in different databases (1986)
• The distinction between Attribute and Data Value is not clear: one can be the other!
  • Data Values in one database can become Attributes in another database via pivoting (1991) (see the sketch below)
• This interchangeability, plus syntactic differences (enumerated in the Dec. 1991 IEEE Computer), creates “schema mismatch”, the major barrier to data integration!
• There is too much of a distinction between Data Description and Data Context: there is no necessary relationship between text fields in any Data Description and any words in any taxonomy in Data Context
• There are also no provisions for dynamic schemas
  • Files with embedded schemas or schemas inside application programs (e.g. JPG) are mislabeled “unstructured”
  • In the DRM 2.0 these schemas are not made visible for sharing!
  • How would they be queried: what service would be requested and what would be returned?
• SOA: the Enterprise Service Bus needs better knowledge representations: it accesses services that have a service description, but this entails a data description!
  • “Semantic Interoperability” means being able to translate among different schemas for the same data
  • There needs to be a service that reasons about whether databases are the same, and about schema mismatches
  • There’s no SOA without being able to specify the data that is affected by the service!
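To make the pivoting barrier concrete, here is a minimal Python sketch. The station/measurement data and field names are invented for illustration; the point is only that a value stored in one schema's rows becomes a column (an Attribute) in another schema, which is exactly the "schema mismatch" a data sharing service would have to reason about.

```python
# Schema A: the measurement name is a Data Value in a generic "readings" table
schema_a_rows = [
    {"station": "S1", "measurement": "temperature", "value": 21.5},
    {"station": "S1", "measurement": "humidity",    "value": 0.43},
    {"station": "S2", "measurement": "temperature", "value": 19.8},
]

def pivot(rows):
    """Turn the measurement *values* of schema A into the *attributes* of schema B."""
    out = {}
    for r in rows:
        record = out.setdefault(r["station"], {"station": r["station"]})
        record[r["measurement"]] = r["value"]   # a data value becomes a column name
    return list(out.values())

# Schema B: one row per station, one column per measurement
schema_b_rows = pivot(schema_a_rows)
print(schema_b_rows)
# [{'station': 'S1', 'temperature': 21.5, 'humidity': 0.43},
#  {'station': 'S2', 'temperature': 19.8}]
```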
The Background Battle of Context: Search Engines
• The context model looks innocuous, but there is a serious cost-issue debate
• The policy of the administration is to share data – the question is how?
• “‘Metadata’ originated because we could not access the actual data, so the next best thing was to make a description (card catalogue) of the data available in a ‘separate place’ (registry, clearinghouse, etc.). But those ‘separate places’ have taken on a life of their own even after the actual data became accessible. Recent advances in the DRM 2.0, Semantic Interoperability, Web 2.0, ISO Data Standards, etc., have led to bringing the data and metadata back together and to the use of a much more powerful form of metadata, called ‘executable metadata’.” – Dr. Brand Niemann, EPA
• However, in 2006 an open question, “how good is search/clustering?”, was closed
• The Context
  • Search engines have two types of errors, false positives and misses (relevant documents not retrieved), measured by Precision and Recall; both need to be reduced in large data sets (a worked example follows this list)
  • The size of the document collection impacts the chance of finding information: the more documents there are, the higher the likelihood that what you want is missing from the first N results
• The Hope: clustering (e.g. Vivísimo at usa.gov) would somehow improve enough
• The Reality: there is a known gold standard, and retrievals using clustering don’t meet it
  • The gold standard for retrieval: 2-pass queries where 1st-pass terms are added to the query. This is the current limit of word-based Precision/Recall
  • Using LDA we get 95% of the standard, and LDA beats topics. We won’t get better any time soon, so another means of reducing errors is needed
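As a worked example of the two error types above, here is a minimal Python sketch computing Precision and Recall; the document IDs and relevance judgments are made up.

```python
# Precision: of what was retrieved, how much was relevant.
# Recall:    of what was relevant, how much was retrieved.

relevant  = {"d1", "d2", "d3", "d4"}   # documents that actually answer the query
retrieved = {"d1", "d2", "d7", "d9"}   # documents the search engine returned

true_positives  = retrieved & relevant   # correctly retrieved
false_positives = retrieved - relevant   # retrieved but irrelevant (hurt Precision)
false_negatives = relevant - retrieved   # relevant but missed (hurt Recall)

precision = len(true_positives) / len(retrieved)   # 2 / 4 = 0.50
recall    = len(true_positives) / len(relevant)    # 2 / 4 = 0.50

print(f"precision={precision:.2f} recall={recall:.2f}")
```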
Hidden in the Fine Print of Data Context
"4.2.2. Purpose of the Data Context Section of the DRM Abstract Model: The Data Context section of the DRM abstract model exists to identify the structures used for Data Context artifacts. Context often takes the form of a set of terms, i.e. words or phrases, that are themselves organized in lists, hierarchies, or trees; they may be referred to as “context items”. Collectively, Data Context can also be called “categorization” or “classification”. In this case the groupings of the context items can be called “categorization schemes” or “classification schemes.” More complex Data Context artifacts may also be generated, e.g. networks of classification terms or schemes. Classification schemes can include simple lists of terms (or terms and phrases) that are arranged using some form of relationship, such as
• sets of equivalent terms,
• a hierarchy or a tree relationship structure, or
• a general network.”
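The three structures named at the end of that passage (equivalent-term sets, hierarchies or trees, and general networks) are easy to picture as data structures. The following Python sketch uses invented, Earth-science-flavored terms purely for illustration; it is not drawn from any official classification scheme.

```python
# 1. A set of equivalent terms ("synonym ring")
equivalent_terms = {"automobile", "car", "motorcar"}

# 2. A hierarchy / tree: each term maps to its narrower terms
hierarchy = {
    "Earth Science": ["Atmosphere", "Oceans"],
    "Atmosphere":    ["Precipitation", "Clouds"],
    "Oceans":        ["Sea Surface Temperature"],
}

# 3. A general network: arbitrary labeled relations between context items
network = [
    ("Precipitation", "related-to", "Flooding"),
    ("Flooding",      "related-to", "Emergency Response"),
]

def narrower(term, tree=hierarchy):
    """Walk the hierarchy and return every term below `term`."""
    result = []
    for child in tree.get(term, []):
        result.append(child)
        result.extend(narrower(child, tree))
    return result

print(narrower("Earth Science"))
# ['Atmosphere', 'Precipitation', 'Clouds', 'Oceans', 'Sea Surface Temperature']
```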
What Does this Mean?
• If you know what WordNet is, this looks suspiciously like WordNet
• Your suspicions are right, BUT
  • You probably haven’t read the book (ISBN 0262561166)
  • You probably don’t know WordNet 2.1
• WordNet 2.1 is now a logically coherent collection of distinct classes, instances and many relationships:
  • It is Standard-Ready
  • It can be extended to encompass the terminology of the Federal metadata working groups
  • Prof. Fellbaum will be explaining its current status (a lookup sketch follows below)
• The promise: using only topics that are registered in an extended WordNet would make the government far better at managing context across multiple documents and organizations
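As a taste of what a machine-readable WordNet offers, here is a minimal sketch using the NLTK interface to WordNet. Two assumptions: the slides do not prescribe a toolkit, and current NLTK ships a later WordNet release than the 2.1 version discussed here; the sketch also assumes the corpus has been fetched with nltk.download('wordnet').

```python
from nltk.corpus import wordnet as wn

# Each synset is one distinct sense of the ambiguous word "bank"
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())

# Follow the hypernym (is-a) relation upward from one sense
sense = wn.synsets("bank")[0]
print([hypernym.name() for hypernym in sense.hypernyms()])
```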
What else is hidden?
• On the surface the DRM 2.0 Data Context’s “topic” looks like a keyword assigned to a document
• There’s more: the topics can point to Data Catalogs, which can contain data descriptions
• This has already been done successfully!
  • A large part of the total Federal holdings of data is ALREADY available in a Data Sharing system, the Global Change Master Directory (GCMD)
  • The GCMD mixes Topics and Data Descriptions (see the illustrative record below)
  • It is wildly successful – 18 years of operation and petabytes of data
  • It is also used by many civilian government agencies (i.e. ones without large budgets for data management)
  • Lola Olsen will explain the GCMD
• Use it as a template now, BUT also look at how the new technology described here today will allow the government to create a more powerful version of the template, which can become the government standard.
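The record below is a purely illustrative Python sketch of the blending the GCMD does: the field names only approximate the spirit of a Directory Interchange Format (DIF) entry, not its actual schema, and the dataset, keywords, and URL are invented.

```python
# One record that carries both Data Context (topic terms) and Data Description
# (schema-level facts about the data asset it points to). Illustrative only.
dif_like_record = {
    "entry_id": "EXAMPLE_PRECIP_DAILY",                    # hypothetical identifier
    "entry_title": "Daily Precipitation Observations",
    "context_keywords": [                                   # Data Context: topic terms
        ("Earth Science", "Atmosphere", "Precipitation"),
    ],
    "data_description": {                                    # Data Description: the asset's schema
        "entities": ["Observation"],
        "attributes": ["station_id", "date", "precip_mm"],
        "temporal_coverage": ("1990-01-01", "2006-12-31"),
    },
    "access": {"url": "https://example.gov/data/precip"},    # hypothetical endpoint
}
print(dif_like_record["entry_title"], dif_like_record["context_keywords"])
```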
The Data Reference Model 3.0: A Look Ahead
[Figure 3-1, DRM Standardization Areas: a Data & Information & Knowledge Repository and a Data Resource Awareness Agent, with static and dynamic elements]
Actually, we CAN Start Now!
• The world has changed radically in one year; there have been 3 major advances:
  • On April 19th 2006 the success of the IKRIS project was announced
  • In May 2006 the AQUAINT project announced the success of several linguistic projects
  • In August 2006 Latent Dirichlet Allocation (LDA) was shown to find the principal vectors for collections of text documents; this sets an effectiveness limit for document retrieval over non-semantic text collections (a toy LDA sketch follows below)
• (1) We now have an accurate disambiguation of word senses in WordNet, which is machine readable and can be extended
• (2) We have in the GCMD a DRM 2.0 version of a joint Data Context & Data Description artifact that enables massive Data Sharing
• (3) We can extract Meaning and Relationships from data (LCC)
• (4) We can store knowledge (CYCORP)
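For the LDA advance mentioned in the list above, here is a toy topic-extraction sketch using the gensim library (an assumption: the slide names no toolkit). The four tiny "documents" are invented; a real run would use a large collection, real tokenization, and many more topics and passes.

```python
from gensim import corpora, models

docs = [
    ["precipitation", "forecast", "atmosphere", "model"],
    ["budget", "agency", "architecture", "compliance"],
    ["ocean", "temperature", "satellite", "model"],
    ["enterprise", "architecture", "agency", "data"],
]

dictionary = corpora.Dictionary(docs)               # term <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]      # bag-of-words vectors

# Fit a 2-topic LDA model on the toy corpus
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      random_state=0, passes=10)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```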
What’s needed (and we now have): language & logic
• What is needed is an effective approach to unify logic and language so that both can be utilized by automated processes: the DRM 2.0 currently separates the two. Also:
  • SOAs have human-readable “Service Descriptions” that separate the two
  • The Web 2.0 cannot read the content of postings and reason about them. Web 3.0 will only come into existence when computers can read and write content based on what was read
• The Intelligence Community created the Advanced Research and Development Activity (ARDA), which started several new information exploitation initiatives (perhaps soon to be IARPA)
  • The Advanced Question Answering for Intelligence (AQUAINT) project looked at how to understand documents: Dr. John Prange was the AQUAINT project manager
  • The NIMD project looked at how to reason about what was found, and the IKRIS project produced a new language, an extension to Common Logic that greatly expands knowledge representation
  • It allows for the counterfactual conditional, a.k.a. the laws of science
Background Reading: The 1st Attempt
[Book cover: edited by Frederick Suppe]
The Data Reference Model 3.0, Web 3.0 & SOAs
[Figure 3-1, DRM Standardization Areas, revisited: the Data & Information & Knowledge Repository and Data Resource Awareness Agent, now with Language and Logic added]
Panel Discussion: Why not build the DRM 3.0 and Web 3.0 now?
Panelists
• Lucian Russell
• Bryan Aucoin
• Christiane Fellbaum
• Lola Olsen
• John Prange
• Michael Witbrock
Why Not Build the DRM 3.0 Today with Web 3.0?
• The Discussion Topic
  • The Data Reference Model exists to give Guidance to Federal Agencies on how to share data of all kinds. Currently it provides separate abstract structures for Data Description and Data Context. Prior to 2006 this separation made sense, but it would appear that it no longer does:
  • There is now a lexical Ontology that disambiguates English. It is well developed and consistent, and can be extended with additional terms that are discipline-specific. If that precise vocabulary were used, then the effectiveness of search engines would increase and topics could be inferred.
  • There is now a more powerful linguistic extraction technology than was heretofore used, one that can understand the content of text documents well enough to know which ones may contain the answer to questions. It can also extract logical relationships. This means we can capture facts about concepts and processes that occur over time.
  • There is now a Standard for logic which extends to real-world logical reasoning principles. This makes it possible to effectively reason about the real-world consequences of facts that have been captured. All the knowledge is interoperable.
  • Therefore it is possible to unify Data Description and Data Context by creating an intelligent Directory Interchange Format (iDIF) type structure which will be used to build a knowledge base (a hypothetical sketch follows below). This would be the model in the DRM 3.0
  • This is Web 3.0 Technology because it reasons about content and adds new content based on that reasoning
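To make the "intelligent DIF" idea tangible, here is a hypothetical Python sketch of a single structure holding Data Description, Data Context, and extracted assertions together. Every field name and the logic-flavored assertion string are invented for illustration; the DRM 3.0 itself would define the real structure.

```python
from dataclasses import dataclass, field

@dataclass
class IDif:
    """Hypothetical iDIF record: description + context + machine-usable knowledge."""
    entry_id: str
    description: dict          # schema-level facts (entities, attributes, types)
    context_terms: list        # topic terms anchored in an extended WordNet
    assertions: list = field(default_factory=list)  # logical statements extracted from documents

record = IDif(
    entry_id="EXAMPLE_HYDRO_01",
    description={"entities": ["GaugeReading"],
                 "attributes": ["site", "time", "flow_cfs"]},
    context_terms=["hydrology", "streamflow"],
    assertions=["(implies (exceeds flow_cfs flood_stage) (flood-risk site))"],  # illustrative only
)
print(record.entry_id, record.context_terms)
```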
Can we start today to build a Federal Knowledge Base?
• The Knowledge Base could be started now
  • It would be possible to start with the agencies where data is least ambiguous, i.e. the science collections. Additional literature could be “read” to build a repository of scientific knowledge. The Data collections would then be instances of this knowledge. The components would be the new iDIFs specified in the DRM 3.0.
• The Knowledge Base would support Data Sharing Services
  • The Services would be described as processes using IKL or another equivalent language – which could be translated into the OWL-S Services Ontology. They would return logical descriptions of data structures
  • New services could be added, including ones that reasoned about Schema Mismatches to increase interoperability
• The Knowledge Base would support automatic content reasoning (Web 3.0)
  • Because “what-if” scenarios can be supported using the CONTEXT feature in IKL, intelligent agents could infer new knowledge about the real world
• We can start now, or we can wait and waste our resources on Data Contexts and Data Descriptions that will just have to be redone later
• Will anything get so much better in the near future that it pays to wait?