DRM & Semantic Advances Lucian Russell, PhD Expert Reasoning & Decisions LLC Feb 17th, 2009
DRM 2.0 Semantic Baseline • DRM 2.0’s purpose was to advance Data Sharing among agencies • The writing team wanted to avoid a document that created unnecessary work for Federal agencies • It took the “do no harm” approach • The DRM audience was Enterprise Architects and Data Architects • It emphasized Communities of Interest
The Environment Was Hostile • Data Sharing was of great interest, but was hard to do because • There was preparation work with no budget • The technology was decades old • Search Engine vendors were claiming that they could do everything and Political Appointees were hearing “slashed budgets” • Metadata specialists were under attack
The Approach Was to Build • There was a new technology based on web interfaces, Service Oriented Architecture (SOA), which could overcome the “plumbing” issues preventing sharing • Specialists who knew their data collections were motivated to apply that knowledge • DRM 2.0 organized concepts that represented best practice and suggested that the specialists be allowed to use them
What Changed? • On April 19th, 2006, the DNI DTO (now DNI IARPA) announced the success of the IKRIS project (it was briefed at a SICoP workshop in October 2006) • Taken in concert with other IARPA advances, it changed what could be expected from Semantic technology • The advances were briefed Feb 6th, 2007
Key Elements from Feb 6th ‘07 • A vision of DRM 2.0 moving forward was presented by Chapter 3 & 4 specialist Lucian Russell and Chapter 5 specialist Bryan Aucoin (the “how to” chapters) • A vision of DRM 3.0 with a combined Data Context and Data Description was proposed • Speakers described enabling technology
The Specialist Speakers • From the DNI DTO (IARPA) work • Dr. Christiane Fellbaum – Princeton’s WordNet program • Lola Olson – NASA Goddard Master Directory program • Dr. John Prange – Language Computer Corporation • Dr. Michael Witbrock – CYCORP
English as an Exact Language • The first key advance was from WordNet – Dr. Christiane Fellbaum • Heretofore English was too ambiguous • Now the 115K major words of English had been analyzed carefully and all meanings were distinguished, and more words could be added as needed • Nouns are not verbs! (sorry OWL specialists) • The result: unambiguous English could be generated for use by computers • Note: this is not ambiguity resolution!
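As a rough illustration of the idea on this slide, the sketch below shows how text can be made unambiguous by attaching an explicit sense identifier to each word when it is written, rather than guessing the sense afterwards. The three-entry inventory, the glosses, and the `word.pos.number` key format are assumptions made up for this example; they are not taken from WordNet itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    word: str
    pos: str      # part of speech, kept separate: nouns are not verbs
    number: int   # sense number within (word, pos)
    gloss: str


# Hypothetical miniature sense inventory, for illustration only.
INVENTORY = [
    Sense("bank", "noun", 1, "a financial institution that accepts deposits"),
    Sense("bank", "noun", 2, "sloping land beside a body of water"),
    Sense("bank", "verb", 1, "to deposit money in a financial institution"),
]


def sense_key(sense):
    """Build an explicit identifier such as 'bank.noun.2'."""
    return f"{sense.word}.{sense.pos}.{sense.number}"


def senses_of(word, pos=None):
    """List every recorded sense of a word, optionally for one part of speech."""
    return [s for s in INVENTORY
            if s.word == word and (pos is None or s.pos == pos)]


def lookup(key):
    """Return the single sense named by an explicit key."""
    for s in INVENTORY:
        if sense_key(s) == key:
            return s
    raise KeyError(key)


# The bare word "bank" has three candidate senses; the key names exactly one.
tagged = lookup("bank.noun.2")
print(len(senses_of("bank")), "senses;", sense_key(tagged), "=", tagged.gloss)
```

Because the writer chooses the key, a program reading `bank.noun.2` never has to resolve an ambiguity, which is the distinction the slide draws.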
Weaving Words and Directories • The NASA Goddard Master Directory program describes how to access 18 petabytes of data. • It is already a massive, multi-agency data sharing effort and has worked for 10+ years • It combines topic words with data collection descriptions • The DRM 2.0 did not directly address massive data collections, but the wording of its guidance indirectly allowed this approach
Parsing Meaning from Documents • English-language documents can already be parsed to detect meaning. • Language Computer Corporation had scored the best in the DTO (IARPA) program as measured through NIST challenge competitions (TREC) • The team showed that there are some 40 major relationship classes in English
Putting Meanings Together • CYCORP has a unique Ontology CYC of millions of assertions about facts in the real world and in others (e.g. mythology) • The concepts that could be uncovered by the prior work could be woven together in the CYC Ontology • CYC has logic constructs that exceed First Order Logic, and because IKRIS showed CYC interoperable with other Knowledge Constructs the CYC Ontology is the most powerful known
DRM 2.0 Could Be Improved! • The final Writing team found a DRM draft that covered only 3 of the 8 major categories of government information and offered little technical guidance • New technology existed that could advance the State of the Art for Data – that was the takeaway from Feb 6th • It was put in a White Paper June 18th ‘07
What Was Next? Feb 5th 2008 • Improvements on the February 6th, 2007 briefing • Further elucidation of the role of Ontology in data sharing • Further elucidation of the role of DRM artifacts in Data Sharing and a suggestion of their content • A new approach, via Sorted Logic, on the challenge of Schema Mismatch: a barrier to Data Sharing for fixed field databases
Data Sharing: Alpha & Beta • In the context of Relational Databases, current practice is to develop data models that are sparse in semantics – a practice that dates back to the 1970s when storage was expensive • This leads to two types of errors that can disable Data Sharing • Type Alpha Errors: A and B are the same but we miss the equivalence • Type Beta Errors: A and B look the same but are different
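The two error types can be made concrete with a small sketch. It assumes each field carries a hypothetical `meaning` annotation in addition to its name; a matcher that looks only at names then commits a Type Alpha Error when names differ but meanings agree, and a Type Beta Error when names agree but meanings differ. The field names and annotations are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    meaning: str  # hypothetical semantic annotation, absent from sparse models


def name_match(a, b):
    """What a semantics-sparse comparison can see: the column names only."""
    return a.name.lower() == b.name.lower()


def classify(a, b):
    """Report which error a name-only comparison would commit, if any."""
    same_name = name_match(a, b)
    same_meaning = a.meaning == b.meaning
    if same_meaning and not same_name:
        return "alpha"  # the same thing, but the equivalence is missed
    if same_name and not same_meaning:
        return "beta"   # they look the same, but are different
    return "none"


# Invented example fields.
dob = Field("dob", "date of birth of the applicant")
birth_date = Field("birth_date", "date of birth of the applicant")
date_received = Field("date", "date the application was received")
date_approved = Field("date", "date the application was approved")

print(classify(dob, birth_date))               # alpha
print(classify(date_received, date_approved))  # beta
```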
Type Beta Errors – Data Context • The DRM 2.0 explained that Data Context Artifacts should be used for Data Assets, such as relational databases, so as to provide more information. The goal is to be able to distinguish different instances of data that appear the same • In relational databases this would mean adding semantic content to distinguish schemas – metadata about the data
How Can This Be Done? • Data models themselves provide too little information – deliberately so • What is needed is a new semantic artifact, a “Data Descriptor” among the Data Description (Chapter 3) artifacts that explains the processes (using verbs) behind the collection of the data. • The context can be abstracted from the Data Descriptor
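One way to picture a “Data Descriptor” is as a small record kept alongside the schema. The sketch below is only an assumption about what such a record might hold: the field names (`asset`, `collected_by`, `process_verbs`, `description`) and the two example assets are hypothetical and do not come from the DRM. It shows the context being abstracted from the descriptor and then used to tell apart two assets whose columns look identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataDescriptor:
    asset: str            # name of the data asset (hypothetical field)
    collected_by: str     # who or what produced the data (hypothetical field)
    process_verbs: tuple  # verbs describing how the data came to exist
    description: str      # free-text explanation of the collection process


def context_of(descriptor):
    """Abstract a compact context from the fuller descriptor."""
    return {
        "asset": descriptor.asset,
        "collected_by": descriptor.collected_by,
        "processes": sorted(set(descriptor.process_verbs)),
    }


def distinguishable(a, b):
    """True when the contexts differ even if the schemas look the same."""
    ca, cb = context_of(a), context_of(b)
    return (ca["collected_by"], ca["processes"]) != (cb["collected_by"], cb["processes"])


# Two invented assets, each with the columns (station, date, temperature).
measured = DataDescriptor(
    "station_temps", "ground sensors", ("measure", "record"),
    "Temperatures measured hourly by sensors and recorded on site.")
modeled = DataDescriptor(
    "grid_temps", "forecast model", ("estimate", "interpolate"),
    "Temperatures estimated by a model and interpolated to station sites.")

print(distinguishable(measured, modeled))  # True
```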
The Logic Transformation! • The biggest technical challenge in Data Sharing for Relational Databases is schema mismatch – Type Alpha Errors • Dr. Selmer Bringsjord, a DTO and especially IKRIS investigator, showed how transformations of database schemas can be used to detect two identical databases that have different schemas
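The general idea of detecting identical content behind different schemas can be illustrated with a deliberately simple sketch. It is not the logic-based method presented in the talk: it assumes the only difference between the two schemas is column naming and ordering, and that a renaming map (hypothetical here) is already known. The tables and column names are invented.

```python
def transform(rows, mapping):
    """Rename the columns of every row according to a renaming map."""
    return [{mapping.get(col, col): val for col, val in row.items()}
            for row in rows]


def canonical(rows):
    """Order-independent form: a set of rows, each a set of (column, value).

    Values must be hashable; duplicate rows collapse, as in set semantics.
    """
    return frozenset(frozenset(row.items()) for row in rows)


def same_content(rows_a, rows_b, mapping_b_to_a):
    """True when B, rewritten into A's schema, holds exactly A's rows."""
    return canonical(rows_a) == canonical(transform(rows_b, mapping_b_to_a))


# Invented tables: same facts, different column names and row order.
db_a = [{"emp_id": 1, "surname": "Lee"}, {"emp_id": 2, "surname": "Ortiz"}]
db_b = [{"last_name": "Ortiz", "id": 2}, {"last_name": "Lee", "id": 1}]
b_to_a = {"id": "emp_id", "last_name": "surname"}

print(same_content(db_a, db_b, b_to_a))  # True
```

Without the transformation the two tables share no column names, so a direct comparison would report them as different, which is the Type Alpha Error the slide describes.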
The Conclusion • With more careful semantics, disambiguated words, and an Ontology that supports process descriptions and counterfactuals: • It is possible to build more advanced mechanisms to support data sharing in the Federal Government • These must be the basis for DRM 3.0 • Because they involve just using English better, they are far less expensive than was feared!