The Semantics of Quality, Uncertainty and Bias Representations of NASA Atmospheric Remote Sensing Data and Information Products ON THE WEB. Peter Fox, and … Gregory Leptoukh (2), Stephan Zednik (1), Chris Lynnes (2). Tetherless World Constellation, Rensselaer Polytechnic Inst.
The Semantics of Quality, Uncertainty and Bias Representations of NASA Atmospheric Remote Sensing Data and Information Products ON THE WEB • Peter Fox, and … Gregory Leptoukh (2), Stephan Zednik (1), Chris Lynnes (2) • (1) Tetherless World Constellation, Rensselaer Polytechnic Inst. • (2) NASA Goddard Space Flight Center, Greenbelt, MD, United States
Webs of data • Early Web - Web of pages • http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html • Semantic web started as a way to facilitate “machine accessible content” • Initially it was available only to those familiar with the languages and tools, e.g. your parents could not use it • Webs of data grew out of this • One specific example is W3C’s Linked Open Data
Semantic Web • http://www.w3.org/2001/sw/ • “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF). See also the separate FAQ for further information.”
Linked open data • http://linkeddata.org/guides-and-tutorials • http://tomheath.com/slides/2009-02-austin-linkeddata-tutorial.pdf • And of course: • http://logd.tw.rpi.edu/
September 2011 “Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/”
Deep web • Data behind web services • Data behind query interfaces (databases or files)
Data on the internet • http://www.dataspaceweb.org/ • http://mp-datamatters.blogspot.com/ • Data files on other protocols • FTP • RFTP • GridFTP • SABUL • XMPP/AMQP • Others…
Acronyms • AOD: Aerosol Optical Depth • MDSA: Multi-sensor Data Synergy Advisor • MISR: Multi-angle Imaging Spectro-Radiometer • MODIS: Moderate Resolution Imaging Spectro-radiometer • OWL: Web Ontology Language • PML: Proof Markup Language • REST: Representational State Transfer • UTC: Coordinated Universal Time • XML: eXtensible Markup Language • XSL: eXtensible Stylesheet Language • XSLT: XSL Transformation
Where are we with respect to the data challenge? “The user cannot find the data; if he can find it, cannot access it; if he can access it, he doesn't know how good they are; if he finds them good, he can not merge them with other data” The Users View of IT, NAS 1989
Giovanni Earth Science Data Visualization & Analysis Tool • Developed and hosted by NASA/ Goddard Space Flight Center (GSFC) • Multi-sensor and model data analysis and visualization online tool • Supports dozens of visualization types • Generate dataset comparisons • ~1500 Parameters • Used by modelers, researchers, policy makers, students, teachers, etc.
Giovanni Allows Scientists to Concentrate on the Science • The Old Way (timeline Jan to Oct): Pre-Science: find data; retrieve high volume data; learn formats and develop readers; extract parameters; perform spatial and other subsetting; identify quality and other flags and constraints; perform filtering/masking; develop analysis and visualization; accept/discard/get more data (sat, model, ground-based). Then DO SCIENCE: exploration; initial analysis; use the best data for the final analysis; derive conclusions; write the paper; submit the paper • The Giovanni Way (web-based services, Mirador and Giovanni): find data (minutes); read data; extract parameter; filter quality; subset spatially; reformat; reproject; visualize; explore; analyze (days for exploration). Then DO SCIENCE: use the best data for the final analysis; derive conclusions; write the paper; submit the paper • Web-based tools like Giovanni allow scientists to compress the time needed for pre-science preliminary tasks: data discovery, access, manipulation, visualization, and basic statistical analysis. • Scientists have more time to do science!
Data Usage Workflow • Subset / Constrain* • Reformat* • Filtering* • Re-project* • Integration* (*Giovanni helps streamline / automate)
Data Usage Workflow • Intended Use • Precision Requirements • Quality Assessment Requirements • Integration Planning • Subset / Constrain* • Reformat* • Filtering* • Re-project* • Integration* (*Giovanni helps streamline / automate)
Challenge • Giovanni streamlines data processing, performing required actions on behalf of the user • but automation amplifies the potential for users to generate and use results they do not fully understand • The assessment stage is integral for the user to understand fitness-for-use of the result • but Giovanni did not assist in assessment • We were challenged to instrument the system to help users understand results
Producers vs. Consumers • Quality Control vs. Quality Assessment • Fitness for Purpose vs. Fitness for Use • Trustor vs. Trustee
Definitions – for an atmospheric scientist • Quality • Is in the eyes of the beholder – worst case scenario… or a good challenge • Uncertainty • has aspects of accuracy (how closely the real-world situation is captured; this includes bias) and precision (down to how many digits)
Quality Control vs. Quality Assessment • Quality Control (QC) flags in the data (assigned by the algorithm) reflect the “happiness” of the retrieval algorithm, e.g., all the necessary channels indeed had data, not too many clouds, the algorithm has converged to a solution, etc. • Quality assessment is done by analyzing the data “after the fact” through validation, intercomparison with other measurements, self-consistency, etc. It is presented as bias and uncertainty. It is documented rather inconsistently and is scattered across papers and validation reports.
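To make "presented as bias and uncertainty" concrete, here is a minimal sketch of a validation-style comparison between retrieved values and reference values. The function name, the variable names, and the numbers are illustrative assumptions, not part of the original slides or of any NASA product.

```python
import math

def validation_stats(retrieved, reference):
    """Return (bias, spread, rmse) of retrieved values against reference values.

    bias   : mean error (systematic part)
    spread : standard deviation of the errors around the bias (random part)
    rmse   : root-mean-square error (combines both)
    """
    if len(retrieved) != len(reference) or not retrieved:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [r - t for r, t in zip(retrieved, reference)]
    n = len(errors)
    bias = sum(errors) / n
    spread = math.sqrt(sum((e - bias) ** 2 for e in errors) / n)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return bias, spread, rmse

# Hypothetical AOD values, chosen only to keep the arithmetic easy to follow.
satellite_aod = [0.5, 0.25, 1.0, 0.75]
ground_aod = [0.25, 0.25, 0.75, 0.75]

bias, spread, rmse = validation_stats(satellite_aod, ground_aod)
print(f"bias={bias:.3f} spread={spread:.3f} rmse={rmse:.3f}")
```

In this sketch the squared RMSE equals the squared bias plus the squared spread, which is one way to see why a single error number hides whether a product is systematically off or merely noisy.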
Definitions – for an atmospheric scientist • Bias has two aspects: • Systematic error resulting in the distortion of measurement data caused by prejudice or faulty measurement technique • A vested interest, or strongly held paradigm or condition that may skew the results of sampling, measuring, or reporting the findings of a quality assessment: • Psychological: for example, when data providers audit their own data, they usually have a bias to overstate its quality. • Sampling: Sampling procedures that result in a sample that is not truly representative of the population sampled. (Larry English)
Data quality needs: fitness for use • Measuring Climate Change: • Model validation: gridded contiguous data with uncertainties • Long-term time series: bias assessment is a must, especially for sensor degradation and for orbit and spatial sampling changes • Studying phenomena using multi-sensor data: • Cross-sensor bias is needed • Realizing Societal Benefits through Applications: • Near-Real Time for transport/event monitoring - in some cases, coverage and timeliness might be more important than accuracy • Pollution monitoring (e.g., air quality exceedance levels) – accuracy • Educational (users generally not well-versed in the intricacies of quality; just taking all the data as usable can impair educational lessons) – only the best products
Level 2 data • Swath for MISR, orbit 192 (2001)
MODIS vs. MERIS • Same parameter, same space & time • Different results: why? • A threshold used in MERIS processing effectively excludes high aerosol values. • Note: MERIS was designed primarily as an ocean-color instrument, so aerosols are “obstacles”, not signal.
Spatial and temporal sampling – how to quantify to make it useful for modelers? • MODIS Aqua AOD July 2009 • MISR Terra AOD July 2009 • Completeness: MODIS dark target algorithm does not work for deserts • Representativeness: monthly aggregation is not enough for MISR and even MODIS • Spatial sampling patterns are different for MODIS Aqua and MISR Terra: “pulsating” areas over ocean are oriented differently due to different orbital direction during day-time measurement Cognitive bias
Three projects with data quality flavor • Multi-sensor Data Synergy Advisor • Product-level Quality: how closely the data represent the actual geophysical state • Data Quality Screening Service • Pixel-level Quality: algorithmic guess at usability of data point • Granule-level Quality: statistical roll-up of Pixel-level Quality • Aerosol Statistics • Record-level Quality: how consistent and reliable the data record is across generations of measurements
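The idea that granule-level quality is a statistical roll-up of pixel-level quality can be sketched as follows. The flag scale (integers where higher means better), the threshold, and the function name are assumptions for illustration only; they do not describe the actual Data Quality Screening Service.

```python
from collections import Counter

def granule_quality_summary(pixel_flags, min_acceptable):
    """Roll pixel-level quality flags up into a granule-level summary.

    pixel_flags    : iterable of integer flags (assumed: higher is better)
    min_acceptable : lowest flag value still treated as usable
    """
    flags = list(pixel_flags)
    counts = Counter(flags)
    usable = sum(n for flag, n in counts.items() if flag >= min_acceptable)
    fraction = usable / len(flags) if flags else 0.0
    return {"counts": dict(counts), "usable_fraction": fraction}

# Hypothetical flags for eight pixels in one granule.
flags = [3, 3, 2, 1, 0, 3, 2, 0]
summary = granule_quality_summary(flags, min_acceptable=2)
print(summary["usable_fraction"])
```

A screening step could then keep or reject individual pixels by flag, while the summary gives a quick per-granule indication of how much usable data remains.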
Multi-Sensor Data Synergy Advisor (MDSA) • Goal: Provide science users with clear, cogent information on salient differences between data candidates for fusion, merging and intercomparison • Enable scientifically and statistically valid conclusions • Develop MDSA on current missions: • NASA - Terra, Aqua, (maybe Aura) • Define implications for future missions
How does MDSA work? MDSA is a service designed to characterize the differences between two datasets and advise a user (human or machine) on the advisability of combining them. • Provides the Giovanni online analysis tool • Describes parameters and products • Documents steps leading to the final data product • Enables better interpretation and utilization of parameter difference and correlation visualizations. • Provides clear and cogent information on salient differences between data candidates for intercomparison and fusion. • Provides information on data quality • Provides advice on available options for further data processing and analysis.
Correlation: same instrument, different satellites • Anomaly: the MODIS Level 3 data day definition leads to an artifact in the correlation
Effect of the Data Day definition on Ocean Color data correlation with Aerosol data • Only half of the Data Day artifact is present because the Ocean Group uses the better Data Day definition! • Figure panels: correlation between MODIS Aqua AOD (Ocean group product) and MODIS-Aqua AOD (Atmosphere group product); pixel count distribution
Research approach • Systematizing quality aspects • Working through literature • Identifying aspects of quality and their dependence on measurement and environmental conditions • Developing Data Quality ontologies • Understanding and collecting internal and external provenance • Developing rulesets that allow us to infer which pieces of knowledge to extract and assemble • Presenting the data quality knowledge with good visuals, statements and references • Web Science!
Semantic Web Basics • The triple: {subject-predicate-object} • Interferometer is-a optical instrument • Optical instrument has focal length • W3C is the primary (but not sole) governing org. for the languages • RDF: programming environments for 14+ languages, including C, C++, Python, Java, Javascript, Ruby, PHP, ... (no Cobol or Ada yet ;-( ) • OWL 1.0 and 2.0 (Web Ontology Language): programming for Java • Query, rules, inference… • Closed World: where complete knowledge is known (encoded); AI relied on this • Open World: where knowledge is incomplete/evolving; SW promotes this
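The triple model and the closed-world versus open-world distinction can be illustrated with plain Python tuples. This is only a sketch: it does not use an RDF library, and the helper names (`match`, `closed_world`, `open_world`) are hypothetical.

```python
# A tiny in-memory "triple store": a set of (subject, predicate, object) tuples.
triples = {
    ("Interferometer", "is-a", "Optical instrument"),
    ("Optical instrument", "has", "focal length"),
}

def match(pattern, store):
    """Return triples matching a pattern; None in the pattern is a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )

def closed_world(triple, store):
    """Closed world: anything not stated is taken to be false."""
    return triple in store

def open_world(triple, store):
    """Open world: anything not stated is unknown (None), not false."""
    return True if triple in store else None

is_a = match((None, "is-a", None), triples)
question = ("Interferometer", "has", "focal length")
print(is_a)
print(closed_world(question, triples), open_world(question, triples))
```

The question triple is not stated explicitly, so the closed-world check answers "false" while the open-world check answers "unknown"; an inference step over the is-a relation would be needed to conclude that it holds.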
Ontology Spectrum (from least to most expressive) • Catalog/ID • Terms/glossary • Thesauri “narrower term” relation • Informal is-a • Formal is-a • Formal instance • Frames (properties) • Value Restrs. • Selected Logical Constraints (disjointness, inverse, …) • General Logical constraints • Originally from AAAI 1999 Ontologies Panel by Gruninger, Lehmann, McGuinness, Uschold, Welty; updated by McGuinness. Description in: www.ksl.stanford.edu/people/dlm/papers/ontologies-come-of-age-abstract.html
Semantic Web Layers http://www.w3.org/2003/Talks/1023-iswc-tbl/slide26-0.html, http://flickr.com/photos/pshab/291147522/
Working with knowledge • Expressivity • Implementability • Maintainability/Extensibility • Query • Inference • Rule execution
Data Quality Ontology Development (Quality flag) Working together with Chris Lynnes’s DQSS project, started from the pixel-level quality view.
Data Quality Ontology Development (Bias) http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1286316097170_183793435_22228&partName=htmltext
Modeling quality (Uncertainty) Link to other cmap presentations of quality ontology: http://cmapspublic3.ihmc.us:80/servlet/SBReadResourceServlet?rid=1299017667444_1897825847_19570&partName=htmltext
MDSA Aerosol Data Ontology Example Ontology of Aerosol Data made with cmap ontology editor http://tw.rpi.edu/web/project/MDSA/DQ-ISO_mapping
RuleSet Development
[DiffNEQCT:
  (?s rdf:type gio:RequestedService),
  (?s gio:input ?a), (?a rdf:type gio:DataSelection),
  (?s gio:input ?b), (?b rdf:type gio:DataSelection),
  (?a gio:sourceDataset ?a.ds), (?b gio:sourceDataset ?b.ds),
  (?a.ds gio:fromDeployment ?a.dply), (?b.ds gio:fromDeployment ?b.dply),
  (?a.dply rdf:type gio:SunSynchronousOrbitalDeployment),
  (?b.dply rdf:type gio:SunSynchronousOrbitalDeployment),
  (?a.dply gio:hasNominalEquatorialCrossingTime ?a.neqct),
  (?b.dply gio:hasNominalEquatorialCrossingTime ?b.neqct),
  notEqual(?a.neqct, ?b.neqct)
  ->
  (?s gio:issueAdvisory giodata:DifferentNEQCTAdvisory)]
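Read in plain terms, the DiffNEQCT rule says: if a requested service takes two data selections whose source datasets both come from sun-synchronous orbital deployments with different nominal equatorial crossing times, issue an advisory. The same logic is sketched procedurally below. The class, field, and function names are hypothetical stand-ins for the `gio:` terms, and the crossing times are illustrative values only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Deployment:
    name: str
    sun_synchronous: bool
    neqct: str  # nominal equatorial crossing time, local time as "HH:MM"

def diff_neqct_advisory(a: Deployment, b: Deployment) -> Optional[str]:
    """Return an advisory label when both deployments are sun-synchronous
    and their nominal equatorial crossing times differ; otherwise None."""
    if a.sun_synchronous and b.sun_synchronous and a.neqct != b.neqct:
        return "DifferentNEQCTAdvisory"
    return None

# Illustrative deployments for two inputs of one requested service.
terra = Deployment("Terra", sun_synchronous=True, neqct="10:30")
aqua = Deployment("Aqua", sun_synchronous=True, neqct="13:30")

print(diff_neqct_advisory(terra, aqua))   # advisory expected: times differ
print(diff_neqct_advisory(terra, terra))  # no advisory: same crossing time
```

The declarative rule has the advantage that new advisories can be added to the knowledge base without changing service code; the procedural version is shown only to make the condition easy to read.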
Advisor Knowledge Base • Advisor Rules test for potential anomalies and create associations between service metadata and anomaly metadata in the Advisor KB
Assisting in Assessment • Intended Use • Precision Requirements • Quality Assessment Requirements • Integration Planning • Provenance & Lineage Visualization • MDSA Advisory Report • Subset / Constrain • Reformat • Filtering • Re-project • Integration
Thus - Multi-Sensor Data Synergy Advisor • Assemble semantic knowledge base • Giovanni Service Selections • Data Source Provenance (external provenance - low detail) • Giovanni Planned Operations (what service intends to do) • Analyze service plan • Are we integrating/comparing/synthesizing? • Are similar dimensions in data sources semantically comparable? (semantic diff) • How comparable? (semantic distance) • What data usage caveats exist for data sources? • Advise regarding general fitness-for-use and data-usage caveats
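One possible reading of "semantic diff" and "semantic distance" is sketched below: describe each data source by a few attributes, list the attributes whose values differ, and take the fraction of differing attributes as a crude distance. This is an assumption made for illustration. The slides do not specify how MDSA computes either quantity, and the attribute names and values are hypothetical.

```python
def semantic_diff(a, b):
    """Return {attribute: (value_in_a, value_in_b)} for attributes that differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

def semantic_distance(a, b):
    """Fraction of attributes that differ: 0.0 is identical, 1.0 is fully different."""
    keys = set(a) | set(b)
    return len(semantic_diff(a, b)) / len(keys) if keys else 0.0

# Hypothetical descriptions of two data sources selected for comparison.
source_a = {
    "parameter": "AOD",
    "temporal_resolution": "daily",
    "spatial_resolution": "1 degree",
    "data_day_definition": "UTC-based",
}
source_b = {
    "parameter": "AOD",
    "temporal_resolution": "daily",
    "spatial_resolution": "1 degree",
    "data_day_definition": "local-time-based",
}

diff = semantic_diff(source_a, source_b)
distance = semantic_distance(source_a, source_b)
print(diff, distance)
```

An ontology-backed version would weight attributes and compare concepts by their position in the class hierarchy rather than by string equality; the flat comparison here only shows the shape of the question being asked.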
Presenting data quality to users • Global or product level quality information, e.g. consistency, completeness, etc., that can be presented in a tabular form. • Regional/seasonal. This is where we've tried various approaches: • maps with outlined regions, one map per sensor/parameter/season • scatter plots with error estimates, one per combination of Aeronet station, parameter, and season; with different colors representing different wavelengths, etc.