Shifting the Burden from the User to the Data Provider Peter Fox High Altitude Observatory, NCAR (***) With thanks to eGY and various NSF, DoE and NASA projects
Outline
• Background, definitions
• Informatics -> e-Science
• Data has lots of uses
• Virtual Observatories: use cases
• Data Framework: Examples
• Data ingest, integration, mining and …
• Discussion
Fox HDF: Semantic Data Burden Shift Oct 15, 2008
Background
Scientists should be able to access a global, distributed knowledge base of scientific data that:
• appears to be integrated
• appears to be locally available
But… data is obtained by multiple instruments, using various protocols, in differing vocabularies, using (sometimes unstated) assumptions, with inconsistent (or non-existent) metadata. It may be inconsistent, incomplete, evolving, and distributed.
And… there exist(ed) significant levels of semantic heterogeneity, large-scale data, complex data types, legacy systems, inflexible and unsustainable implementation technology…
Information products have Lots of Audiences. But data has Lots of Audiences: SCIENTISTS TOO
[Figure: audiences arranged from More Strategic to Less Strategic. From “Why EPO?”, a NASA internal report on science education, 2005]
The Information Era: Interoperability
Modern information and communications technologies are creating an “interoperable” information era in which ready access to data and information can be truly universal. Open access to data and services enables us to meet the new challenges of understanding the Earth and its space environment as a complex system:
• managing and accessing large data sets
• higher space/time resolution capabilities
• rapid response requirements
• data assimilation into models
• crossing disciplinary boundaries
Shifting the Burden from the User to the Provider
Modern capabilities
Mind the Gap!
• As a result of finding out who is doing what, sharing experience/expertise, and substantial coordination:
• There is/was still a gap between science and the underlying infrastructure and technology that is available
• Informatics - information science includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies. (Wikipedia)
• Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.
Informatics Progression after progression
Virtual Observatories
• Conceptual examples:
• In-situ: Virtual measurements
• Related measurements
• Remote sensing: Virtual, integrative measurements
• Data integration
• Managing virtual data products/sets
Virtual Observatories
Make data and tools quickly and easily accessible to a wide audience.
Operationally, virtual observatories need to find the right balance of data/model holdings, portals and client software that researchers can use without effort or interference, as if all the materials were available on their local computer in their preferred language: i.e. appear to be local and integrated.
They are likely to provide controlled vocabularies that may be used for interoperation in appropriate domains, along with database interfaces for access and storage and “smart” tools for evolution and maintenance.
Early days of discipline-specific VOs
[Diagram: separate virtual observatories VO1, VO2, VO3 and databases DB1, DB2, DB3, …, DBn]
The Astronomy approach; data-types as a service
• VOTable
• Simple Image Access Protocol
• Simple Spectrum Access Protocol
• Simple Time Access Protocol
Callouts: Lightweight semantics; Limited meaning, hard coded; Limited extensibility; Limited interoperability; Under review
Open Geospatial Consortium: Web {Feature, Coverage, Mapping} Service and Sensor Web Enablement: Sensor {Observation, Planning, Analysis} Service use the same approach.
[Diagram: VO App1, VO App2, VO App3; VO layer; databases DB1, DB2, DB3, …, DBn]
Semantic interoperability
[Diagram labels: Added value (education, clearinghouses, other services, disciplines, etc.); Semantic query, hypothesis and inference; Semantic mediation layer - mid-upper-level; VO API, Web Serv., VO Portal; Query, access and use of data; Semantic mediation layer - VSTO - low level; Metadata, schema, data; databases DB1, DB2, DB3, …, DBn]
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.
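The mediation-layer bullets (mapping queries to underlying data and generating access requests) can be illustrated with a minimal sketch. It is not the VSTO implementation: the source names `db1`/`db2`, the field names and the `mediate` function are all hypothetical, and a plain dictionary stands in for the ontology.

```python
from typing import Dict

# Hypothetical mapping from ontology-level concepts to the field names
# used by each underlying source. Real mappings would come from an ontology.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "db1": {"Parameter": "param_code", "Instrument": "kinst", "DateTime": "obs_time"},
    "db2": {"Parameter": "variable", "Instrument": "instr_name"},
}


def mediate(query: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Translate a concept-level query into one access request per source.

    A source is included only if it has a field for every concept in the query.
    """
    requests: Dict[str, Dict[str, str]] = {}
    for source, fields in MAPPINGS.items():
        if all(concept in fields for concept in query):
            # Rewrite each concept name into the source's own field name.
            requests[source] = {fields[c]: value for c, value in query.items()}
    return requests


by_parameter = mediate({"Parameter": "temperature"})
by_time = mediate({"DateTime": "2005-01-26"})
```

In this sketch the user never sees `param_code` or `variable`; only the mediation step knows the per-source vocabulary, which is the sense in which the burden moves to the provider side.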
Content: Coupling Energetics and Dynamics of Atmospheric Regions WEB
Community data archive for observations and models of Earth's upper atmosphere and geophysical indices and parameters needed to interpret them. Includes browsing capabilities by periods, > 310 instruments, models, > 820 parameters…
Content: Mauna Loa Solar Observatory
Near real-time data products from Hawaii from a variety of solar instruments. Source for space weather, solar variability, and basic solar physics.
Other content used too - Center for Integrated Space Weather Modeling
Semantic Web Methodology and Technology Development Process
• Establish and improve a well-defined methodology vision for Semantic Technology based application development
• Leverage controlled vocabularies, etc.
[Diagram: iterative development cycle with stages labeled Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Use Tools; Science/Expert Review & Iteration; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Open World: Evolve, Iterate, Redesign, Redeploy]
Science and technical use cases
Find data which represents the state of the neutral atmosphere anywhere above 100 km and toward the Arctic Circle (above 45N) at any time of high geomagnetic activity.
• Extract information from the use case - encode knowledge
• Translate this into a complete query for data - inference and integration of data from instruments, indices and models
Provide semantically-enabled, smart data query services via a SOAP web service for the Virtual Ionosphere-Thermosphere-Mesosphere Observatory that retrieve data, filtered by constraints on Instrument, Date-Time, and Parameter in any order and with constraints included in any combination.
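The requirement that constraints on Instrument, Date-Time and Parameter may be supplied in any order and any combination can be sketched with optional keyword arguments. This is a minimal illustration, not the SOAP service itself: the `Record` type, the `query` function and the catalogue entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """One hypothetical catalogue entry; the fields are illustrative only."""
    instrument: str
    parameter: str
    start: datetime
    end: datetime


def query(
    records: Iterable[Record],
    instrument: Optional[str] = None,
    parameter: Optional[str] = None,
    interval: Optional[Tuple[datetime, datetime]] = None,
) -> List[Record]:
    """Return the records that satisfy every constraint that was supplied."""
    matches = []
    for rec in records:
        if instrument is not None and rec.instrument != instrument:
            continue
        if parameter is not None and rec.parameter != parameter:
            continue
        if interval is not None:
            lo, hi = interval
            # Keep the record only if its time span overlaps the interval.
            if rec.end < lo or rec.start > hi:
                continue
        matches.append(rec)
    return matches


# Placeholder catalogue; instrument and parameter labels are made up.
CATALOGUE = [
    Record("fabry_perot", "neutral_temperature", datetime(2005, 1, 1), datetime(2005, 1, 31)),
    Record("fabry_perot", "neutral_wind", datetime(2005, 2, 1), datetime(2005, 2, 28)),
    Record("radar", "electron_density", datetime(2005, 1, 10), datetime(2005, 1, 20)),
]

by_instrument = query(CATALOGUE, instrument="fabry_perot")
by_time_and_parameter = query(
    CATALOGUE,
    parameter="neutral_temperature",
    interval=(datetime(2005, 1, 15), datetime(2005, 1, 16)),
)
```

Because every constraint is optional and named, callers can start from whichever of the three they know first, which mirrors the "any order, any combination" wording of the use case.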
Web Service VSTO - semantics and ontologies in an operational environment: vsto.hao.ucar.edu, www.vsto.org Fox RPI: Semantic Data Frameworks May 14, 2008
Semantic filtering by domain or instrument hierarchy Partial exposure of Instrument class hierarchy - users seem to LIKE THIS
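Filtering by an instrument class hierarchy amounts to a transitive subclass test. The sketch below assumes a tiny, made-up hierarchy; the class names, instance names and helper functions are hypothetical and are not taken from the VSTO ontology.

```python
from typing import Dict, List, Optional

# Hypothetical fragment of an instrument class hierarchy (child -> parent).
PARENT: Dict[str, str] = {
    "OpticalInstrument": "Instrument",
    "Radar": "Instrument",
    "Interferometer": "OpticalInstrument",
    "Photometer": "OpticalInstrument",
    "FabryPerotInterferometer": "Interferometer",
}

# Hypothetical instrument instances and the class each one belongs to.
INSTANCES: Dict[str, str] = {
    "fpi_a": "FabryPerotInterferometer",
    "photometer_b": "Photometer",
    "radar_c": "Radar",
}


def is_a(cls: str, target: str) -> bool:
    """True if cls equals target or is a transitive subclass of it."""
    current: Optional[str] = cls
    while current is not None:
        if current == target:
            return True
        current = PARENT.get(current)  # None once the root is passed
    return False


def filter_by_class(target: str) -> List[str]:
    """Names of all instruments whose class falls under target."""
    return sorted(name for name, cls in INSTANCES.items() if is_a(cls, target))


optical = filter_by_class("OpticalInstrument")
```

Selecting a node high in the hierarchy returns everything beneath it, so a user can ask for "optical instruments" without knowing individual instrument names.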
Inferred plot type and return formats for data products
Inferred plot type and return required axes data
Semantic Web Benefits
• Unified/abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of selections from eight to three
• Generates only syntactically correct queries, which could not always be ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our application can expose only coherent queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain code for) numerous different ways to combine and plot the data, whereas now semantic mediation provides the level of sensible data integration required, exposed as smart web services
• Understanding of coordinate systems, relationships, data synthesis, transformations, etc.
• Returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)
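The "expose only coherent queries" benefit can be sketched as narrowing the remaining choices after each selection. A lookup table stands in for the ontology and reasoner here, and the instrument and parameter names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical knowledge of which parameters each instrument measures.
MEASURES: Dict[str, Set[str]] = {
    "instrument_a": {"neutral_temperature", "neutral_wind"},
    "instrument_b": {"electron_density"},
    "instrument_c": {"neutral_temperature"},
}


def coherent_instruments(parameter: Optional[str] = None) -> List[str]:
    """Instruments still worth offering once a parameter has been chosen."""
    return sorted(
        name
        for name, params in MEASURES.items()
        if parameter is None or parameter in params
    )


def coherent_parameters(instrument: Optional[str] = None) -> List[str]:
    """Parameters still worth offering once an instrument has been chosen."""
    if instrument is not None:
        return sorted(MEASURES[instrument])
    # No instrument chosen yet: offer every parameter anyone measures.
    return sorted(set().union(*MEASURES.values()))


after_parameter = coherent_instruments("neutral_temperature")
after_instrument = coherent_parameters("instrument_b")
```

Since the interface only ever offers combinations that exist in the table, an incoherent instrument/parameter pair cannot be submitted in the first place.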
What is a Non-Specialist Use Case?
Someone should be able to query a virtual observatory without having specialist knowledge.
A teacher accesses the internet, goes to an educational virtual observatory, and enters a search for “Aurora”.
What should the User Receive?
Teacher receives four groupings of search results:
1) Educational materials: http://www.meted.ucar.edu/topics_spacewx.php and http://www.meted.ucar.edu/hao/aurora/
2) Research, data and tools: via VSTO, VSPO and VITMO, knows to search for brightness, or green/red line emission
3) Did you know?: Aurora is a phenomenon of the upper terrestrial atmosphere (ionosphere), also known as Northern Lights
4) Did you mean?: Aurora Borealis or Aurora Australis, etc.
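The four groupings can be sketched as a search that expands the user's term with related science terms and then buckets the hits. Everything in the block is an assumption made for illustration: the index entries, the `RELATED` table, the `KNOWN_TERMS` list and the `search` function are hypothetical, and simple substring matching stands in for semantic search.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical index entries: (grouping, title). Titles are placeholders.
INDEX: List[Tuple[str, str]] = [
    ("education", "Aurora lesson materials"),
    ("research", "Green line emission brightness data"),
    ("research", "Aurora brightness data"),
    ("did_you_know", "Aurora: a phenomenon of the upper terrestrial atmosphere"),
]

# Hypothetical related terms a specialist would know to search for.
RELATED: Dict[str, List[str]] = {
    "aurora": ["green line emission", "red line emission"],
}

# Hypothetical vocabulary used for "Did you mean?" suggestions.
KNOWN_TERMS = ["Aurora Borealis", "Aurora Australis", "Airglow"]


def search(term: str) -> Dict[str, List[str]]:
    """Group index hits for term, expanded with its related terms."""
    needle = term.lower()
    needles = [needle] + RELATED.get(needle, [])
    groups: Dict[str, List[str]] = defaultdict(list)
    for grouping, title in INDEX:
        if any(n in title.lower() for n in needles):
            groups[grouping].append(title)
    # Suggest longer known terms that begin with what the user typed.
    groups["did_you_mean"] = [
        t for t in KNOWN_TERMS if t.lower().startswith(needle) and t.lower() != needle
    ]
    return dict(groups)


results = search("Aurora")
```

The expansion step is where the non-specialist benefits: the teacher types "Aurora" and still reaches the green-line data without knowing that term.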
Semantic Information Integration: Concept map for educational use of science data in a lesson plan
Issues for Virtual Observatories
• Scaling to large numbers of data providers and redefining the role(s)/relations with them
• Crossing discipline boundaries
• Security, access to resources, policies
• Branding and attribution (where did this data come from and who gets the credit, is it the correct version, is this an authoritative source?)
• Provenance/derivation (propagating key information as it passes through a variety of services, copies of processing algorithms, …)
• Data quality, preservation, stewardship
These are currently burden areas for users
Problem definition
• Data is coming in faster and in greater volumes, outstripping our ability to perform adequate quality control
• Data is being used in new ways and we frequently do not have sufficient information on what happened to the data along the processing stages to determine if it is suitable for a use we did not envision
• We often fail to capture, represent and propagate manually generated information that needs to go with the data flows
• Each time we develop a new instrument, we develop a new data ingest procedure, collect different metadata and organize it differently. It is then hard to use with previous projects
• The task of event determination and feature classification is onerous and we don't do it until after we get the data
Use cases
• Determine which flat field calibration was applied to the image taken on January 26, 2005 around 2100UT by the ACOS Mark IV polarimeter.
• Which flat-field algorithm was applied to the set of images taken during the period November 1, 2004 to February 28, 2005?
• How many different data product types can be generated from the ACOS CHIP instrument?
• What images comprised the flat field calibration image used on January 26, 2007 for all ACOS CHIP images?
• What processing steps were completed to obtain the ACOS PICS limb image of the day for January 26, 2005?
• Who (person or program) added the comments to the science data file for the best vignetted, rectangular polarization brightness image from January 26, 2005 1849:09UT taken by the ACOS Mark IV polarimeter?
• What were the cloud cover and atmospheric seeing conditions during the local morning of January 26, 2005 at MLSO?
• Find all good images on March 21, 2008.
• Why are the quick look images from March 21, 2008, 1900UT missing?
• Why does this image look bad?
Provenance • Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
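Use cases such as "what processing steps were completed to obtain this image" reduce to walking derivation links back to the raw data. The sketch below assumes each product records a single source; the `Step` type, the product and process names, and the `lineage` function are hypothetical and do not describe the actual MLSO pipeline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Step:
    """One hypothetical provenance record for a data product."""
    output: str            # product created by this step
    process: str           # what was done
    agent: str             # person or program that did it
    source: Optional[str]  # product it was derived from; None for raw data


# Placeholder provenance records, listed in no particular order.
STEPS = [
    Step("image_of_the_day", "selection", "observer", "calibrated_image"),
    Step("raw_image", "acquisition", "instrument_control_program", None),
    Step("calibrated_image", "flat_field_correction", "calibration_program", "raw_image"),
]


def lineage(product: str, steps: List[Step]) -> List[Step]:
    """Return the steps that led to product, oldest first."""
    by_output: Dict[str, Step] = {s.output: s for s in steps}
    chain: List[Step] = []
    current: Optional[str] = product
    while current is not None:
        step = by_output[current]  # KeyError signals a gap in the record
        chain.append(step)
        current = step.source
    chain.reverse()
    return chain


history = lineage("image_of_the_day", STEPS)
processes = [s.process for s in history]
agents = [s.agent for s in history]
```

The same chain also answers the "who (person or program)" style of question through the `agent` field, provided that information was captured at ingest time.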
Discussion (1)
• Taken together, an emerging set of collected experience manifests an emerging informatics core capability that is starting to take data intensive science into a new realm of realizability and, potentially, sustainability
• Use cases (i.e. real users)
• X-informatics
• Core Informatics
• Cyber Informatics
• There are implications for data models
Informatics Progression after progression
• Example:
• CI = OPeNDAP server running over HTTP/HTTPS
• Cyberinformatics = Data (product) and service ontologies, triple store
• Core informatics = Reasoning engine (Pellet), OWL
• Science (X) informatics = Use cases, science domain terms, concepts in an ontology
Discussion (2)
• Data and information science is becoming the ‘fourth’ column (along with theory, experiment and computation)
• Semantics (of the data) are a key ingredient -> may imply richer data models
Summary
• Informatics is playing a key role in filling the gap between science use and generation (including the spectrum of non-expert use) and the underlying cyberinfrastructure, i.e. in shifting the burden
• This is evident from the emergence of X-informatics (world-wide)
• Our experience is in implementing informatics as semantics in Virtual Observatories (as a working paradigm) and Grid environments
• VSTO is only one example of success
• Data mining, data integration, smart search, provenance are close behind
• Informatics is a profession and a community activity; it requires synergistic efforts in all 3 sub-areas (science, core, cyber)
More Information
• Virtual Solar Terrestrial Observatory (VSTO): http://vsto.hao.ucar.edu, http://www.vsto.org
• Semantically-Enabled Science Data Integration (SESDI): http://sesdi.hao.ucar.edu
• Semantic Provenance Capture in Data Ingest Systems (SPCDIS): http://spcdis.hao.ucar.edu
• Semantic Knowledge Integration Framework (SKIF/SAM): http://skif.hao.ucar.edu
• Semantic Web for Earth and Environmental Terminology (SWEET): http://sweet.jpl.nasa.gov
• Conferences: AGU 2008, EGU 2009, ISWC 2008, CIKM 2008, …
• Peter Fox pfox@ucar.edu