Discovering Communicable Scientific Knowledge from Spatio-Temporal Data

Mark Schwabacher
NASA Ames Research Center, Computational Sciences Division
mark.schwabacher@arc.nasa.gov
http://ic-www.arc.nasa.gov/people/schwabacher/
Joint work with Pat Langley and Jeff Shrager (ISLE) and Chris Potter, Steve Klooster, Lisy Torregrosa, and Vanessa Brooks (NASA Earth Science)

Outline
• Description of Earth science problem
• Choice of representation and algorithm
• Results
• Visualizations
• Discovery of an error in the data
• Future Work

Earth Science Problem
• The Normalized Difference Vegetation Index (NDVI) is a measure of vegetation across the globe derived from satellite data
• NDVI is used in various Earth-science models
• Unfortunately, NDVI is only available for the years since 1983, when a satellite with these sensors was launched
• We would like to predict NDVI at a point on the globe from ground-based climate variables representing temperature, precipitation, and moisture

Choice of Representation
For scientific applications, the learned models should be
• Understandable
• Communicable

Representation used by scientists
Our Earth Science collaborators had built the following model with an “if” statement to select between two linear models, one for warmer locations and one for cooler locations:
if GDD < 3000 then ln(NDVI) = 0.715 ln(GDD) + 0.377 ln(PPT) – 0.448
if GDD >= 3000 then NDVI = 189.89 AMI + 44.02 ln(PPT) + 227.99
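The two-branch model above can be written out as a small function. This is only a sketch: the function and argument names are made up for illustration, the slide does not state the units of GDD, PPT, or AMI, and the code assumes GDD and PPT are positive so the logarithms are defined. Because the first branch is given for ln(NDVI), the sketch exponentiates it so that both branches return NDVI.

```python
import math

def ndvi_scientists_model(gdd: float, ppt: float, ami: float) -> float:
    """Hypothetical transcription of the scientists' piecewise model."""
    if gdd < 3000:
        # The slide gives ln(NDVI) for this branch, so exponentiate
        # to return NDVI itself.
        ln_ndvi = 0.715 * math.log(gdd) + 0.377 * math.log(ppt) - 0.448
        return math.exp(ln_ndvi)
    # GDD >= 3000: linear in AMI and ln(PPT).
    return 189.89 * ami + 44.02 * math.log(ppt) + 227.99
```

Note that AMI is only used by the second branch; it is passed to both so that callers do not need to know which branch applies.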

Choice of Algorithm
• We selected regression rules as a generalization of the Earth scientists’ representation
• We selected Cubist to learn them (http://www.rulequest.com)
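To make the idea of regression rules concrete, here is a minimal sketch of one possible data structure: each rule pairs a condition with a linear model. The class and function names are hypothetical, and averaging the predictions of all matching rules is a choice made for this sketch, not a statement about how Cubist combines rules internally. The toy rules at the end are invented only to exercise the code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Case = Dict[str, float]

@dataclass
class RegressionRule:
    condition: Callable[[Case], bool]   # when the rule applies
    intercept: float
    coefficients: Dict[str, float]      # linear model over named variables

    def predict(self, case: Case) -> float:
        return self.intercept + sum(
            coef * case[name] for name, coef in self.coefficients.items())

def predict(rules: List[RegressionRule], case: Case) -> float:
    """Average the predictions of every rule whose condition holds."""
    active = [rule.predict(case) for rule in rules if rule.condition(case)]
    if not active:
        raise ValueError("no rule covers this case")
    return sum(active) / len(active)

# Toy rules over a made-up variable x (not from the talk).
toy_rules = [
    RegressionRule(lambda c: c["x"] < 10, 1.0, {"x": 2.0}),
    RegressionRule(lambda c: c["x"] >= 10, 21.0, {}),
]
```

With two rules whose conditions split on a single threshold, this structure reduces to the scientists' "if" statement, which is the sense in which regression rules generalize it.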

First Results
Cubist produced better accuracy, but the model was hard to understand.

Varying the Cubist minimum rule cover parameter

2-rule Cubist model
if PPT <= 25.457 then NDVI = -3.225 + 7.07 PPT + 0.0521 CDD - 84 AMI + 0.4 ln(PPT) + 0.0001 GDD
if PPT > 25.457 then NDVI = 386.3 + 316 AMI + 0.0294 GDD - 0.99 PPT + 0.2 ln(PPT)
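For comparison with the scientists' model, the two Cubist rules can be transcribed the same way. Again this is a sketch with hypothetical names; the coefficients are copied from the slide, the units of the inputs are not stated there, and PPT is assumed to be positive so that ln(PPT) is defined.

```python
import math

def ndvi_cubist_two_rule(ppt: float, gdd: float, cdd: float, ami: float) -> float:
    """Hypothetical transcription of the 2-rule Cubist model."""
    if ppt <= 25.457:
        return (-3.225 + 7.07 * ppt + 0.0521 * cdd - 84 * ami
                + 0.4 * math.log(ppt) + 0.0001 * gdd)
    # PPT > 25.457: CDD does not appear in this rule.
    return (386.3 + 316 * ami + 0.0294 * gdd - 0.99 * ppt
            + 0.2 * math.log(ppt))
```

Unlike the scientists' model, which splits on GDD, this model splits on PPT, and both rules predict NDVI directly rather than ln(NDVI).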

Visualization #1: Cubist model in one variable

Visualization #2: Activity of Cubist Rules

Visualization #2: Error of Cubist model

Testing the model across years
• We trained Cubist using one year’s data
• We tested the resulting model on other years’ data
• If it transfers, it’s useful for Earth scientists
• If it sometimes doesn’t transfer, that could point to a scientific discovery
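The train-on-one-year, test-on-others procedure can be sketched as follows. This is not the authors' code: a one-variable least-squares line stands in for Cubist, all function names are hypothetical, and the numbers are made up purely to exercise the code (they are not NDVI data).

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for the real learner)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def mean_abs_error(model, xs, ys):
    a, b = model
    return mean(abs((a + b * x) - y) for x, y in zip(xs, ys))

def transfer_errors(data_by_year, train_year):
    """Train on one year, then report the error on every other year."""
    model = fit_line(*data_by_year[train_year])
    return {year: mean_abs_error(model, xs, ys)
            for year, (xs, ys) in data_by_year.items() if year != train_year}

# Toy, made-up numbers; year_c is shifted by +1 relative to the others.
toy = {
    "year_a": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "year_b": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "year_c": ([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]),
}
errors = transfer_errors(toy, "year_a")
```

In this toy setup the model transfers cleanly to `year_b` but shows a uniform offset on `year_c`, which is the kind of year-specific discrepancy the slide suggests investigating.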

Discovery of an error in the data
[Figure panels: “Cross-validate 1985” and “Train 1984, test 1985”]

Related Work
• Regression trees: Breiman et al.’s CART (1984)
• Classification applied to Earth science: Brodley & Friedl (1999); Ester, Kriegel, & Xu (1996)
• Visualizing classes on a map: Brodley & Friedl (1999); Smyth, Ghil, & Ide (1999)
• Detecting and correcting faulty class labels in data: John (1995); Brodley & Friedl (1999)
• Detecting and correcting calibration problems in remote-sensing systems using a predefined model: Chen (1997)

Future Work
• Cubist/NDVI work
  • Incorporate time explicitly
  • Include other variables (e.g. elevation)
  • Test understandability
• Other work
  • Improve CASA model (next talk)
  • Implement an interactive system that lets scientists direct high-level search for improved ecosystem models

Lessons Learned
We’ve identified three problems that arise in scientific applications of ML, and proposed initial solutions:
• Communicability: Use the same representation as the scientists.
• Understandability: When using spatial data, spatially visualize the model’s errors and the activity of its components.
• Quantitative errors: When using time-series data, quantitative errors can be identified by testing a model trained on one time period against data from other time periods.
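The understandability lesson (map the model's errors and the activity of its components) can be sketched for gridded data as follows. Everything here is an illustrative assumption: the function names, the representation of a rule as a (condition, model) pair, the first-matching-rule policy, and the tiny 2x2 grid of made-up values. A real application would render the two grids as map images rather than text.

```python
def activity_and_error_maps(cells, rules, observed):
    """cells: {(row, col): case dict}; rules: list of (condition, model) pairs;
    observed: {(row, col): measured value}.
    Returns, per grid cell, the index of the first rule that fires and the
    signed error (prediction minus observation)."""
    activity, error = {}, {}
    for cell, case in cells.items():
        for index, (applies, model) in enumerate(rules):
            if applies(case):
                activity[cell] = index
                error[cell] = model(case) - observed[cell]
                break
    return activity, error

def render(grid, n_rows, n_cols, fmt=str):
    """Text rendering of a grid; '.' marks cells with no value."""
    return "\n".join(
        " ".join(fmt(grid[(r, c)]) if (r, c) in grid else "."
                 for c in range(n_cols))
        for r in range(n_rows))

# Toy 2x2 grid with a made-up variable x and made-up observations.
toy_rules = [
    (lambda c: c["x"] < 5, lambda c: 2.0 * c["x"]),
    (lambda c: c["x"] >= 5, lambda c: 10.0),
]
toy_cells = {(0, 0): {"x": 1.0}, (0, 1): {"x": 6.0},
             (1, 0): {"x": 2.0}, (1, 1): {"x": 9.0}}
toy_observed = {(0, 0): 2.0, (0, 1): 10.0, (1, 0): 5.0, (1, 1): 10.0}
activity, error = activity_and_error_maps(toy_cells, toy_rules, toy_observed)
```

Calling `render(activity, 2, 2)` shows which rule is active where, and `render(error, 2, 2)` shows where the model is off; viewing the two side by side is what lets a reader connect errors to specific rules and regions.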
