300 likes | 306 Views
Peter Fox (RPI) discusses the challenges of accessing and integrating scientific data from various sources, as well as the importance of metadata in facilitating its use.
E N D
In Search of What Some of It Means RDA Semantics and Metadata Workshop Feb 23, 2015 Peter Fox (RPI) pfox@cs.rpi.edu Tetherless World Constellation
What I wanted ~ 1994-6 Scientists should be able to access a global, distributed knowledge base of scientific data that: • appears to be integrated • appears to be locally available But… data is obtained by multiple means (instruments, models, analysis) using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) metadata. It may be inconsistent, incomplete, evolving, and distributed. And, it is almost always created in a manner to facilitate its generation not its use. And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology…
What I was doing… tmp_id = ncdf_dimid(ncid, "comment_dim") ncdf_diminq,ncid, tmp_id, dummy, comment_dim tmp_id=ncdf_dimid(ncid, "mu_dim") ncdf_diminq,ncid, tmp_id, dummy, mu_dim tmp_id=ncdf_dimid(ncid, "wave_dim") ncdf_diminq,ncid, tmp_id, dummy, wave_dim tmp_id=ncdf_dimid(ncid, "model_dim") ncdf_diminq,ncid, tmp_id, dummy, model_dim tmp_id=ncdf_dimid(ncid, "smodel_dim") ncdf_diminq,ncid, tmp_id, dummy, smodel_dim tmp_id=ncdf_dimid(ncid, "item_dim") ncdf_diminq,ncid, tmp_id, dummy, item_dim pro read_spec, spectra_name, description, auxiliary_info, model_size, mu_size, wave_size, model, smodel, mu, wave0, wavelength, intensity, brightness_temperature, index1, index2, percent ncopts = 0; description_start=0 description_edges=80 i=0 j=0 k=0 ; Construct the DB filename ncid=ncdf_open(string(getenv("SPECTRA"))) inq_struct=ncdf_inquire(ncid) ; /* get dimension info */
What I was doing… etc. start=intarr(1) edges=intarr(1) start(0)=0 edges(0)=model_size tmp_id=ncdf_varid(ncid, "mu_size") ncdf_varget,ncid, tmp_id, mu_size, OFFSET=start, COUNT=edges tmp_id=ncdf_varid(ncid, "model") ncdf_varget,ncid, tmp_id, model, OFFSET=start, COUNT=edges start=intarr(2) edges=intarr(2) start(0)=0 edges(0)=smodel_dim start(1)=0 edges(1)=model_size tmp_id=ncdf_varid(ncid, "smodel") ncdf_varget,ncid, tmp_id, smodel, OFFSET=start, COUNT=edges tmp_id = ncdf_varid (ncid, "description") ncdf_varget,ncid, tmp_id, OFFSET=0,COUNT=comment_dim, description ; Id's for variables tmp_id=ncdf_varid(ncid, "spectra_name") ncdf_varget,ncid, tmp_id, OFFSET=0,COUNT=comment_dim, spectra_name tmp_id=ncdf_varid(ncid, "auxiliary_info") ncdf_varget,ncid, tmp_id, OFFSET=0,COUNT=comment_dim, auxiliary_info tmp_id=ncdf_varid(ncid, "model_size") ncdf_varget,ncid, tmp_id, OFFSET=0,COUNT=item_dim, model_size
Some version of this… ~Metadata? Experience Data Information Knowledge Creation Gathering Presentation Organization Integration Conversation Context
It and Meaning • It = things that matter • Context • Meaning = duh -> semantics • Relations!! Real ones! • But it was more than that, though that often comes later… • Syntax (structure/form) • Semantics (meaning) • Pragmatics (use)
Metadata-Information-Knowledge Ecosystem Experience Metadata Information Knowledge Creation Gathering Formalization Organization Integration Shared Conceptualization Context
Provenance • Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility • Provenance: metadata in a given context! Swallow that. • Knowledge provenance; meaning and relations in multiple contexts!
Origins … • In 2000-2001 the need for capturing and preserving knowledge in science data became very clear but the barriers were high • In 2004 we started a virtual observatory project based on semantic technologies • Use case driven – in solar and solar-terrestrial physics with an emphasis on instrument-based measurements and real data pipelines; we needed implementations • We knew we also needed integration and provenance (but that came later) • We aimed to push semantics into our systems to build new ‘prototypes’ but we ‘failed’ ;-) Tetherless World Constellation
In 2004 • 2004 – OWL was a W3 recommendation!! • Protégé 2.x and the Protégé-Java-OWL API • SWOOP was a viable editor • Jena and the Jena API were in good shape • Pellet worked • SPARQL was still a twinkle in the RDF working group’s eye • Semantics were still the realm of computer scientists Tetherless World Constellation
Design and Development • We made a conscious decision only to develop ontologies that were required to answer specific use cases and migrate metadata • Both Classes AND Properties (uh-oh…) • We made a conscious effort to use whatever ontologies were available (cf. trends in metadata… nuff said) • We were pretty sure that rules would be needed (complex logic or late semantic binding) • We ignored query (see implementation) Tetherless World Constellation
Use Case example • Plot the neutral temperature from the Millstone-Hill Fabry Perot, operating in the non-vertical mode during January 2000 as a time series. • Plot the neutral temperaturefrom the Millstone-HillFabry Perot, operatingin thenon-vertical modeduringJanuary 2000as atime series. • Meanings and relations • Objects=Things! • Neutral temperature is a (temperature is a) parameter • Millstone Hill is a (ground-based observatory is a) observatory • Fabry-Perot is a interferometer is a optical instrument is a instrument • Non-vertical mode is a instrument operating mode • January 2000 is a date-time range • Time is a independent variable/ coordinate • Time series is a data plot is a data product • Metadata just appeared everywhere…
Semantics - Modern informatics enables a new scale-free** framework approach • Use cases • Stakeholders • Distributed authority • Access control • Ontologies • Maintaining Identity
Semantics between 2004 and 2009 • Ontologies were needed for data integration and provenance and mediation for data mining • Protégé 3.x and then 4.0 came out • SWOOP development was interrupted • Cmap added OWL predicate support* • SPARQL became a recommendation • Triple stores exploded in use and capability • Linked Open Data started to take off • Pellet 2.0 came out • I used the “M” word less frequently! Tetherless World Constellation
Working with knowledge Expressivity Implementability Maintainability/ Extensibility
Working with semantics Query Inference Rule execution
Semantics between 2009 and now • Semantic data framework (SeSF) • Substantial knowledge provenance work • Data quality, uncertainty and bias representations and applications (oh, these are in production at NASA) • Multi-sensor data synergy advisor • Applications: • Sea Ice, Carbon Observatory, Integrated Ecosystem Assessments, globalchange.gov, ocean.data.gov, energy.data.gov …. Tetherless World Constellation
NCA links to GCIS entities http://data.globalchange.gov
Information model Ontology
Core and Framework Semantics - Multi-tiered interoperability used by
Closing thoughts • Go ahead, create all the metadata you want, we’ll “materialize” some of it into triples based on semantics for use! • Go ahead, create all the schema and encodings you want but remember – semantics now lives in an open-world (some of it). You are not the only source of metadata. Not all formal. Link over map. • Semantics make metadata useful but we do not need all of your metadata Tetherless World Constellation
Contact • pfox@cs.rpi.edu • http://tw.rpi.edu • @taswegian