SC 32/WG 2 Tutorial: Metadata Registry Standards
July 16, 2007
JTC1 SC32 N1649
Bruce Bargmeyer
University of California, Berkeley and Lawrence Berkeley National Laboratory
Tel: +1 510-495-2905
bebargmeyer@lbl.gov
Topics
• Standards development: OMG, ISO (TC 37 & JTC 1/SC 32), W3C, OASIS
• Align, Coordinate, Integrate: Standards, Recommendations, Specifications
• Semantics Challenges and Future Directions
Align, Coordinate, Integrate Standards
WG 2 doing OK internally: 24707, 11179 E3, 19763, 20944
Align, Coordinate, Integrate Standards
SC 32? WG 1, WG 2, WG 3, WG 4
Clearwater meeting a step forward
Align, Coordinate, Integrate Standards/Recommendations/Specifications for Semantic Computing
[Diagram: the semiotic triangle at the center. A CONCEPT refers to a Referent; a symbol such as “Rose” or “ClipArt Rose” symbolizes the concept and stands for the referent. Around it are the technologies that users draw on and the bodies that standardize them:]
• Semantic Web: Graph, RDF (Subject node, Predicate edge, Object node); W3C
• Object Management: MOF, ODM, CWM, IMM; OMG
• Metadata Registries: ISO/IEC 11179 Metadata Registry, Data Standards, Structured Metadata; ISO/IEC JTC 1/SC 32
• Terminology: Thesaurus, Taxonomy, Ontology; ISO TC 37
Standards Development
Semantics Management and Semantics Services: Semantic Computing
Align, Co-develop, Fast Track, PAS Submission …
OMG, W3C, ISO/IEC JTC 1 SC 32, ISO TC 37
Standards Development
Semantics Management and Semantics Services: Semantic Computing
Align, integrate, co-develop, Fast Track, PAS Submission …
Can we coordinate content?
OMG, ISO/IEC JTC 1 SC 32, W3C
A Success
Some text and figures are identical in the two standards.
• ISO/IEC JTC 1 SC 32: ISO/IEC 24707, Common Logic
• OMG: Ontology Definition Metamodel (ODM)
Standards Development
Semantics Management and Semantics Services: Semantic Computing
Ongoing effort
ISO/IEC JTC 1 SC 32: ISO/IEC 11179 (Edition 3)
Standards Development
Semantics Management and Semantics Services: Semantic Computing
Possible effort
OMG: RFP - MOF? IMM
11179 E3 proposals
Standards Development
Semantics Management and Semantics Services: Semantic Computing
Hopeful?
OMG IMM & ISO/IEC JTC 1 SC 32 ISO/IEC 11179 (Edition 3)
Other Possibilities
• OASIS ebXML Registry
• W3C Semantic Web Deployment WG
• TC 37
The Ageless Information Problem
cf: Data, Information, Knowledge, Wisdom
Getting the information that we need, when we need it, without afflicting the excellent minds of humans with toil and drudgery.
The litany:
• Too much or too little, irrelevant, not authoritative, out of date
• Unknown quality, not trustable, lacks provenance, no certainty measures
• Difficult to find, difficult to access, difficult to use
• Meaning not clear, relationship to other information not clear
• Data creators do not have the same understanding of the data as end users
• Recorded data loses much real world meaning, context, relationships
• Much of the meaning of data is buried in the processes used to manipulate the data (e.g., in computer code)
• Need improvements in efficiency and effectiveness
Every time we solve it, we re-create it.
New Semantics Capabilities Proposed for ISO/IEC 11179 MDR (Edition 3)
• Improve traditional data management/data administration
  • Use stronger semantics management and semantics services capabilities
• Enable something new
  • Semantic computing
Semantic Computing: The Nub of It
• Processing that takes “meaning” into account
  • Makes use of concept systems, e.g., thesauri and/or ontologies
  • Moves some of the “meaning” of data from computer code to managed semantics
• Processing that uses (e.g., reasons across) the relations between things, not just computing about the things themselves
• Processing that helps to take people out of the computation, reducing the human toil
  • Semantics “grounding” for data, data discovery, extraction, mapping, translation, formatting, validation, inferencing, …
• Delivering higher-level results that are more helpful for the user’s thought and action
In The Epic Information Struggle We Have Made Heroic Progress
[Diagram: from machine processing to computer processing. Media shown: cards, tape, disk, files.]
In The Epic Information Struggle We Have Made Heroic Progress
In structuring data and text:
• Structured Data
  • Columns on cards & tape (possibly comma separated)
  • Hierarchical (DBMS)
  • Network
  • Table (relational DBMS)
  • Hierarchy (XML)
  • Graph (RDF)
• Semi-structured text
  • nroff, troff, LaTeX …
  • SGML
  • HTML
  • XML
In The Epic Information Struggle We Have Made Heroic Progress
In documenting data and text (e.g., semantics management):
• Data Standards
  • Code sets
• (Meta)Data Standards
  • Data element definitions, valid values, value meanings
  • Metadata registries (MDR, ISO/IEC 11179)
  • Other standards as presented at this conference
• Concept systems (or KOS)
  • Glossaries
  • Dictionaries
  • Thesauri
  • Taxonomies
  • Ontologies
  • Graphs
Semantic Management Proposals for 11179 Edition 3
• Improve data management through use of stronger semantics management
  • Databases
  • XML data
  • Other “traditional” data
• Enable new wave of semantic computing
  • Take meaning of data into account
  • Process across relations as well as properties
  • May use reasoning engines, e.g., to draw inferences
Semantic Computing Application: Find and process non-explicit data
For example: patient data on drugs contains brand names (e.g., Tylenol, Anacin-3, Datril, …); however, we want to study patients taking analgesic agents.
[Diagram: concept hierarchy from the general class down to the brand names]
Analgesic Agent
  Non-Narcotic Analgesic
    Analgesic and Antipyretic
      Acetaminophen
        Datril
        Tylenol
        Anacin-3
    Nonsteroidal Antiinflammatory Drug
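A minimal Python sketch of this idea: the brand names in the patient data are matched to "Analgesic Agent" by walking up a concept hierarchy instead of by string comparison. The parent links follow one reading of the slide's diagram, and the patient records, the function names, and the non-analgesic drug are hypothetical, added only for illustration.

```python
# Child -> parent ("is a") links. The layout is an assumed reading of the slide.
PARENT = {
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Tylenol": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
}


def ancestors(concept):
    """Yield every broader concept reachable from `concept`."""
    while concept in PARENT:
        concept = PARENT[concept]
        yield concept


def is_a(concept, broader):
    """True if `concept` equals `broader` or sits anywhere below it."""
    return concept == broader or broader in ancestors(concept)


# Hypothetical patient records that only carry brand names.
records = [
    {"patient": "P1", "drug": "Tylenol"},
    {"patient": "P2", "drug": "Amoxicillin"},  # not in the concept system
    {"patient": "P3", "drug": "Datril"},
]


def patients_taking(records, broader):
    """Select patients whose recorded drug falls under the broader concept."""
    return [r["patient"] for r in records if is_a(r["drug"], broader)]


print(patients_taking(records, "Analgesic Agent"))  # ['P1', 'P3']
```

The query never mentions a brand name; the relation between brand and class lives in the managed concept system rather than in application code.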
A Semantics Application: Specify and compute across Relations, e.g., within a food web in an Arctic ecosystem
[Figure: food web] An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.
Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)
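Computing across relations can be sketched as graph reachability. In the Python example below, an edge points in the direction of biomass transfer, as in the figure; the species and links are hypothetical placeholders, not data from the SPIRE food web.

```python
from collections import deque

# Hypothetical food web: an edge A -> B means biomass flows from A to B
# (B feeds on A). These species and links are illustrative only.
EATEN_BY = {
    "phytoplankton": ["zooplankton"],
    "zooplankton": ["arctic cod"],
    "arctic cod": ["ringed seal", "seabird"],
    "ringed seal": ["polar bear"],
}


def downstream(web, start):
    """Return every organism that receives biomass, directly or
    indirectly, from `start` (breadth-first traversal)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for consumer in web.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen


print(sorted(downstream(EATEN_BY, "zooplankton")))
```

A question such as "which organisms are affected if zooplankton decline?" is answered from the relations alone, without any property of the individual organisms.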
Semantics Application: Combine Data, Metadata & Concept Systems
Inference Search Query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”
Concept system:
Contamination
  Biological
  Radioactive
  Chemical
    mercury
    lead
    cadmium
[Diagram: the query draws on three inputs labeled Concept system, Data, and Metadata.]
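One way to picture how such a query combines the three inputs is the Python sketch below. Only the contaminant hierarchy and Fletcher Creek come from the slide; the other water bodies, the measurements, the unit table, and all function names are hypothetical.

```python
from datetime import date

# Concept system (from the slide): contaminant -> broader category.
BROADER = {"mercury": "Chemical", "lead": "Chemical", "cadmium": "Chemical"}

# Hypothetical hydrology: water body -> the body it flows into.
FLOWS_INTO = {"Fletcher Creek": "Mill River", "Mill River": "Lake Alder"}

# Hypothetical metadata: conversion factor from each recorded unit to ug/L.
TO_UG_PER_L = {"ug/L": 1.0, "mg/L": 1000.0}

# Hypothetical data: (water body, substance, value, unit, sample date).
MEASUREMENTS = [
    ("Mill River", "lead", 0.02, "mg/L", date(2002, 6, 1)),
    ("Lake Alder", "mercury", 4.0, "ug/L", date(2002, 1, 15)),
    ("Lake Alder", "cadmium", 12.0, "ug/L", date(2004, 5, 1)),
    ("Pine Pond", "lead", 50.0, "ug/L", date(2002, 3, 3)),
]


def downstream_of(body):
    """Follow the flow relation until it ends."""
    result = []
    while body in FLOWS_INTO:
        body = FLOWS_INTO[body]
        result.append(body)
    return result


def contaminated(start, category, threshold_ug_l, first, last):
    """Water bodies downstream of `start` with a measurement in
    `category` above the threshold within the date range."""
    bodies = set(downstream_of(start))
    hits = set()
    for body, substance, value, unit, when in MEASUREMENTS:
        if body not in bodies:
            continue  # relation: must be downstream
        if BROADER.get(substance) != category:
            continue  # concept system: must be a chemical contaminant
        if not (first <= when <= last):
            continue
        if value * TO_UG_PER_L[unit] > threshold_ug_l:  # metadata: units
            hits.add(body)
    return sorted(hits)


print(contaminated("Fletcher Creek", "Chemical", 10.0,
                   date(2001, 12, 1), date(2003, 3, 31)))  # ['Mill River']
```

The lead reading is recorded in mg/L and is found only because the unit metadata makes it comparable with a threshold stated in micrograms per liter.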
Challenge: Use data from systems that record the same facts with different terms
[Diagram: “Country Identifier” at the center, surrounded by registry types that each hold the same Common Content under their own term.]
Registry types: Dublin Core Registries, Software Component Registries, Database Catalogs, ISO 11179 Registries, UDDI Registries, OASIS/ebXML Registries, CASE Tool Repositories, Ontological Registries
Terms shown: Table Column, Data Element, Business Specification, XML Tag, Attribute, Business Object, Coverage, Term Hierarchy
Same Fact, Different Terms

Data Element Concept
  Name: Country Identifiers
  Context:
  Definition:
  Unique ID: 5769
  Conceptual Domain:
  Maintenance Org.:
  Steward:
  Classification:
  Registration Authority:
  Others

Data Elements
  Name:
  Context:
  Definition:
  Unique ID: 4572
  Value Domain:
  Maintenance Org.:
  Steward:
  Classification:
  Registration Authority:
  Others

ISO 3166       ISO 3166      ISO 3166       ISO 3166       ISO 3166
English Name   French Name   2-Alpha Code   3-Alpha Code   3-Numeric Code
Algeria        L'Algérie     DZ             DZA            012
Belgium        Belgique      BE             BEL            056
China          Chine         CN             CHN            156
Denmark        Danemark      DK             DNK            208
Egypt          Egypte        EG             EGY            818
France         La France     FR             FRA            250
. . .          . . .         . . .          . . .          . . .
Zimbabwe       Zimbabwe      ZW             ZWE            716
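Because every data element here shares one set of value meanings (the countries), a value in one representation can be translated into any other. The Python sketch below uses the rows shown on the slide; the field names and the `translate` helper are hypothetical and are not an ISO/IEC 11179 API.

```python
# Rows copied from the slide: one row per value meaning (a country),
# one column per value domain.
FIELDS = ("english_name", "french_name", "alpha2", "alpha3", "numeric")
COUNTRIES = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]


def translate(value, source, target):
    """Map `value` from the `source` value domain to the `target` one
    through the shared value meaning (the row)."""
    s, t = FIELDS.index(source), FIELDS.index(target)
    for row in COUNTRIES:
        if row[s] == value:
            return row[t]
    raise KeyError(f"{value!r} is not a permitted value of {source}")


print(translate("DZ", "alpha2", "numeric"))        # 012
print(translate("FRA", "alpha3", "french_name"))   # La France
```

Numeric codes are kept as strings so that leading zeros such as "012" survive.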
Challenge: Draw information together from a broad range of studies, databases, reports, etc.
A semantics application: Information Extraction and Use
[Diagram: pipeline supported by 11179-3 (E3) / XMDR]
• Extraction Engine: Segment, Classify, Associate, Normalize, Deduplicate
• Discover patterns, Select models, Fit parameters, Inference, Report results
• Actionable Information, Decision Support
Metadata Registries are Useful
Registered semantics
• For “training” extraction engines
• The “Normalize” function can make use of standard code sets that have mappings between representation forms.
• The “Classify” function can interact with pre-established concept systems.
Provenance
• High precision for proper nouns, less precision (e.g., 70%) for other concepts. This impacts downstream processing, so precision needs to be tracked.
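As a rough illustration of a “Normalize” step that leans on a registered code set while carrying provenance forward, consider the Python sketch below. The surface forms reuse ISO 3166 values; the `Mention` record, its fields, the source labels, and the precision figures are hypothetical.

```python
from dataclasses import dataclass

# Surface forms mapped to one canonical representation (ISO 3166 alpha-2).
# In practice this mapping would come from a registered code set.
SURFACE_TO_CODE = {
    "algeria": "DZ", "l'algérie": "DZ", "dza": "DZ",
    "france": "FR", "la france": "FR", "fra": "FR",
}


@dataclass(frozen=True)
class Mention:
    text: str         # string found by the extraction engine
    source: str       # provenance: where it was extracted from
    precision: float  # estimated precision of the extractor for this kind of item


def normalize(mentions, min_precision=0.0):
    """Map each mention to its canonical code, keeping provenance.
    Unknown forms and mentions below `min_precision` are dropped."""
    out = []
    for m in mentions:
        code = SURFACE_TO_CODE.get(m.text.strip().lower())
        if code is None or m.precision < min_precision:
            continue
        out.append((code, m.source, m.precision))
    return out


found = [
    Mention("La France", "report-1", 0.95),
    Mention("DZA", "database-7", 0.70),
    Mention("Atlantis", "report-2", 0.90),  # not in the code set
]
print(normalize(found, min_precision=0.8))  # [('FR', 'report-1', 0.95)]
```

Keeping the precision estimate next to each normalized value lets downstream steps decide how much weight a fact deserves.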
Challenge: Gain Common Understanding of meaning between Data Creators and Data Users
[Diagram: data creators (EEA, USGS, DoD, EPA, others) each pass text and data through information systems to users. Each source labels the same subject areas in its own terms, for example in English (environ, agriculture, climate, human health, industry, tourism, soil, water, air) or in Spanish (ambiente, agricultura, tiempo, salud humana, industria, turismo, tierra, agua, aire), and stores the values as bare numeric codes (123, 345, 445, …; 3268, 0825, 1348, …).]
Goal: a common interpretation of what the data represents
Practical Vocabulary Management
• Vocabulary Management is essential for use of semantic technologies
  • Define concepts and relationships
  • Harmonize terminology, resolve conflicts
  • Collaborate with stakeholders
• An approach
  • Select a domain of interest
  • Enter core concepts and relationships
  • Engage community in vocabulary review
  • Harmonize, validate and vet the vocabulary
  • Enter metadata describing enterprise data
  • Link concept system to metadata
Use eXtended MDR Capabilities
• For vocabulary repository
  • Register, harmonize, validate, and vet definitions and relations
• To register mappings between multiple vocabularies
• To register mappings of concepts to data
• To provide semantics services
• To register and manage the provenance of data
11179-3 (E3) is part of the infrastructure for semantics and data management.
These capabilities are proposed for ISO/IEC 11179 Edition 3.
11179 (E3) Use
• Upside
  • Collaborative
    • Supports interaction with community of interest
    • Shared evolution and dissemination
    • Enables Review Cycle
  • Standards-based: don’t lock semantics into proprietary technology
  • Foundation for strategic data centric applications
  • Lays the foundation for Ontology-based Information Management
  • Content is reusable for many purposes
• Downside
  • Managing semantics is HARD WORK, no matter how friendly the tools
  • Needs integration with other components
Some Challenges
• Data management and metadata management must evolve to address more complex data structures (relational, object, hierarchies, graphs)
• Query capabilities
  • More than SQL, XQuery, SPARQL
• Discovery mechanisms
  • More than Google
• Access, mining, extraction
We need stronger semantics management.
Metadata Registry Support for
• Registering and mapping ontologies
• Ontology Evolution
• Registering Process Ontologies
Thank You
Bruce Bargmeyer
Lawrence Berkeley National Laboratory & Berkeley Water Center
University of California, Berkeley
Tel: +1 510-495-2905
bebargmeyer@lbl.gov
Acknowledgements
• Karlo Berket, LBNL
• Kevin Keck, LBNL
• John McCarthy, LBNL
• Harold Solbrig, Apelon
This material is based upon work supported by the National Science Foundation under Grant No. 0637122, USEPA and USDOD. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, USEPA or USDOD.