Discovering Communicable Scientific Knowledge from Spatio-Temporal Data. Mark Schwabacher NASA Ames Research Center Computational Sciences Division mark.schwabacher@arc.nasa.gov http://ic-www.arc.nasa.gov/people/schwabacher/
Joint work with Pat Langley and Jeff Shrager (ISLE) and Chris Potter, Steve Klooster, Lisy Torregrosa, and Vanessa Brooks (NASA Earth Science)
Outline
• Description of Earth science problem
• Choice of representation and algorithm
• Results
• Visualizations
• Discovery of an error in the data
• Future Work
Earth Science Problem
• The Normalized Difference Vegetation Index (NDVI) is a measure of vegetation across the globe derived from satellite data
• NDVI is used in various Earth-science models
• Unfortunately, NDVI is only available for the years since 1983, when a satellite with these sensors was launched
• We would like to predict NDVI at a point on the globe from ground-based climate variables representing temperature, precipitation, and moisture
Choice of Representation
For scientific applications, the learned models should be
• Understandable
• Communicable
Representation used by scientists
Our Earth Science collaborators had built the following model, which uses an “if” statement to select between two linear models, one for warmer locations and one for cooler locations:
if GDD < 3000 then ln(NDVI) = 0.715 ln(GDD) + 0.377 ln(PPT) - 0.448
if GDD >= 3000 then NDVI = 189.89 AMI + 44.02 ln(PPT) + 227.99
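As a minimal sketch, the collaborators' two-branch model can be written as a small Python function. The function and argument names are illustrative and not from the original work; the coefficients are copied from the slide, and ln is assumed to mean the natural logarithm. Because the first branch predicts ln(NDVI), the sketch exponentiates it so that both branches return NDVI.

```python
import math


def scientists_ndvi(gdd: float, ppt: float, ami: float) -> float:
    """Evaluate the two-branch model from the slide (illustrative sketch).

    gdd, ppt, ami are the slide's GDD, PPT, and AMI variables.
    Assumes gdd > 0 and ppt > 0 so that the logarithms are defined.
    """
    if gdd < 3000:
        # This branch gives ln(NDVI), so exponentiate to recover NDVI.
        # AMI does not appear in this branch.
        ln_ndvi = 0.715 * math.log(gdd) + 0.377 * math.log(ppt) - 0.448
        return math.exp(ln_ndvi)
    # This branch predicts NDVI directly.
    return 189.89 * ami + 44.02 * math.log(ppt) + 227.99
```

The single threshold test on GDD is what makes this a piecewise model: each location is handled by exactly one of the two equations.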
Choice of Algorithm
• We selected regression rules as a generalization of the Earth scientists’ representation
• We selected Cubist to learn them (http://www.rulequest.com)
First Results
Cubist produced better accuracy than the scientists’ model, but the learned model was hard to understand.
2-rule Cubist model
if PPT <= 25.457 then NDVI = -3.225 + 7.07 PPT + 0.0521 CDD - 84 AMI + 0.4 ln(PPT) + 0.0001 GDD
if PPT > 25.457 then NDVI = 386.3 + 316 AMI + 0.0294 GDD - 0.99 PPT + 0.2 ln(PPT)
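One way to see regression rules as a generalization of the scientists' representation is to store each rule as data: a condition plus a linear model. The sketch below encodes the two rules above in that form. The class, helper, and key names (`RegressionRule`, `predict_ndvi`, `ln_PPT`) are hypothetical, this is not Cubist's own file format or API, and ln is assumed to be the natural logarithm.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class RegressionRule:
    """A condition paired with a linear model (illustrative structure)."""
    condition: Callable[[Features], bool]
    intercept: float
    coefficients: Dict[str, float]  # term name -> coefficient

    def predict(self, x: Features) -> float:
        return self.intercept + sum(
            coef * x[name] for name, coef in self.coefficients.items()
        )


# The two rules from the slide, with coefficients copied as shown.
CUBIST_RULES: List[RegressionRule] = [
    RegressionRule(
        condition=lambda x: x["PPT"] <= 25.457,
        intercept=-3.225,
        coefficients={"PPT": 7.07, "CDD": 0.0521, "AMI": -84.0,
                      "ln_PPT": 0.4, "GDD": 0.0001},
    ),
    RegressionRule(
        condition=lambda x: x["PPT"] > 25.457,
        intercept=386.3,
        coefficients={"AMI": 316.0, "GDD": 0.0294, "PPT": -0.99,
                      "ln_PPT": 0.2},
    ),
]


def predict_ndvi(x: Features) -> float:
    """Apply the first rule whose condition holds (assumes PPT > 0)."""
    terms = dict(x)
    terms["ln_PPT"] = math.log(x["PPT"])  # derived term used by both rules
    for rule in CUBIST_RULES:
        if rule.condition(terms):
            return rule.predict(terms)
    raise ValueError("no rule matched the input")
```

The two conditions here are mutually exclusive, so taking the first match is enough; a rule set with overlapping conditions would need an explicit policy for combining the matching rules.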
Testing the model across years
• We trained Cubist using one year’s data
• We tested the resulting model on other years’ data
• If it transfers, it’s useful for Earth scientists
• If it sometimes doesn’t transfer, that could point to a scientific discovery
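The train-on-one-year, test-on-the-others procedure can be sketched as follows. This is only an illustration under loud assumptions: a one-variable least-squares line stands in for Cubist, all data are synthetic, and the year labels and the offset added to one year are invented to show what a failure to transfer looks like. None of the numbers come from the NDVI data.

```python
import math
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def rmse(model, xs, ys):
    """Root-mean-square error of a fitted line on (xs, ys)."""
    slope, intercept = model
    return math.sqrt(
        fmean((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    )


def cross_year_rmse(data_by_year):
    """Train on each year, test on every other year."""
    errors = {}
    for train_year, (x_tr, y_tr) in data_by_year.items():
        model = fit_line(x_tr, y_tr)
        for test_year, (x_te, y_te) in data_by_year.items():
            if test_year != train_year:
                errors[(train_year, test_year)] = rmse(model, x_te, y_te)
    return errors


def make_synthetic_year(rng, offset=0.0, n=200):
    """Synthetic (predictor, target) pairs; offset shifts every target."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [100.0 + 50.0 * x + offset + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)
data = {
    1984: make_synthetic_year(rng, offset=15.0),  # invented shift, illustration only
    1985: make_synthetic_year(rng),
    1986: make_synthetic_year(rng),
}
errors = cross_year_rmse(data)
for (train_year, test_year), err in sorted(errors.items()):
    print(f"train {train_year}, test {test_year}: RMSE = {err:.2f}")
```

In this toy setup, year pairs that share the same underlying relationship give errors near the noise level, while any pair involving the shifted year gives a much larger error. A gap of that kind is what would prompt a closer look at either the model or the data.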
Discovery of an error in the data
[Figure panels: cross-validation within 1985; model trained on 1984 and tested on 1985]
Related Work
• Regression trees: Breiman et al.’s CART (1984)
• Classification applied to Earth science: Brodley & Friedl (1999); Ester, Kriegel, & Xu (1996)
• Visualizing classes on a map: Brodley & Friedl (1999); Smyth, Ghil, & Ide (1999)
• Detecting and correcting faulty class labels in data: John (1995); Brodley & Friedl (1999)
• Detecting and correcting calibration problems in remote-sensing systems using a predefined model: Chen (1997)
Future Work
• Cubist/NDVI work
    • Incorporate time explicitly
    • Include other variables (e.g., elevation)
    • Test understandability
• Other work
    • Improve CASA model (next talk)
    • Implement an interactive system that lets scientists direct high-level search for improved ecosystem models
Lessons Learned
We’ve identified three problems that arise in scientific applications of ML, and proposed initial solutions:
• Communicability: Use the same representation as the scientists.
• Understandability: When using spatial data, spatially visualize the model’s errors and the activity of its components.
• Quantitative errors: When using time-series data, quantitative errors can be identified by testing a model trained on one time period against data from other time periods.
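The understandability lesson can be illustrated with a small sketch that builds two maps over a grid of locations: which rule is active in each cell, and where the model over- or under-predicts. The grids below are made up, the function names and the tolerance of 10 are hypothetical, and text characters stand in for a real map rendering; only the PPT threshold is taken from the 2-rule Cubist model.

```python
PPT_THRESHOLD = 25.457  # split point of the 2-rule Cubist model


def rule_activity_map(ppt_grid):
    """Return 1 where the PPT <= threshold rule applies, 2 elsewhere."""
    return [[1 if ppt <= PPT_THRESHOLD else 2 for ppt in row]
            for row in ppt_grid]


def error_map(observed, predicted):
    """Cell-by-cell residuals (observed minus predicted)."""
    return [[obs - pred for obs, pred in zip(obs_row, pred_row)]
            for obs_row, pred_row in zip(observed, predicted)]


def error_char(residual, tolerance=10.0):
    """'+' for under-prediction, '-' for over-prediction, '.' otherwise."""
    if residual > tolerance:
        return "+"
    if residual < -tolerance:
        return "-"
    return "."


def render(grid, to_char):
    """Turn a 2-D grid into one text row per grid row."""
    return "\n".join("".join(to_char(v) for v in row) for row in grid)


# Made-up 2 x 3 grids, for illustration only.
ppt = [[10.0, 20.0, 40.0], [30.0, 50.0, 5.0]]
observed = [[120.0, 150.0, 400.0], [380.0, 420.0, 90.0]]
predicted = [[118.0, 175.0, 395.0], [360.0, 421.0, 91.0]]

activity_text = render(rule_activity_map(ppt), str)
error_text = render(error_map(observed, predicted), error_char)
print(activity_text)
print(error_text)
```

Placing the two maps side by side lets a reader check whether large errors cluster in the region governed by one particular rule, which is harder to see from a single aggregate error figure.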