Collect. Store. Present. Analyze. Search. Retrieve. The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation. Jeff Dozier James E. Frew. Snow spectral reflectance and absorption coefficient of ice. Landsat Thematic Mapper (TM) band combinations.
Collect. Store. Present. Analyze. Search. Retrieve.
The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation
Jeff Dozier, James E. Frew
Landsat Thematic Mapper (TM) band combinations [Figure panels: Bands 4,3,2 (R,G,B); Bands 5,4,2 (R,G,B)]
Daily MODIS acquisition, processing for Sierra Nevada snow cover and albedo
Examples of fractional snow cover, January through April 2004 [Figure panels: Jan 01 2004; Jan 17 2004; Mar 26 2004; Apr 08 2004]
Examples of grain size, January through April 2004 [Figure panels: Jan 01 2004; Jan 17 2004; Mar 26 2004; Apr 08 2004]
Effect of vegetation [Figure panels, 2004: March 3 vs March 4; March 4 vs March 5; March 5 vs March 7; March 7 vs March 8]
Applications: snowmelt modeling, Marble Fork of the Kaweah River (Molotch et al., GRL, 2004)
[Equation not recovered from the slide; surviving annotations: snow-covered area; net radiation > 0; degree days > 0]
where: m_q = energy-to-water-depth conversion, 0.026 cm W^-1 m^2 day^-1
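The conversion factor quoted on this slide can be checked with simple arithmetic. The sketch below assumes standard values for the latent heat of fusion of ice and the density of water (neither is given on the slide) and computes the depth of meltwater produced by 1 W m^-2 sustained for one day. The helper `radiative_melt_cm` is hypothetical and covers only a radiation term; it is not the full model from the cited paper.

```python
# Physical constants (standard textbook values, assumed; not taken from the slide).
LATENT_HEAT_OF_FUSION_J_PER_KG = 3.34e5   # energy needed to melt 1 kg of ice at 0 °C
WATER_DENSITY_KG_PER_M3 = 1000.0
SECONDS_PER_DAY = 86_400.0


def energy_to_depth_conversion() -> float:
    """Melt depth (cm of water) per W m^-2 of net energy sustained for one day."""
    joules_per_m2 = 1.0 * SECONDS_PER_DAY                     # 1 W m^-2 over one day
    melt_mass_kg_per_m2 = joules_per_m2 / LATENT_HEAT_OF_FUSION_J_PER_KG
    melt_depth_m = melt_mass_kg_per_m2 / WATER_DENSITY_KG_PER_M3
    return melt_depth_m * 100.0                               # metres to centimetres


def radiative_melt_cm(net_radiation_w_m2: float, days: float = 1.0) -> float:
    """Hypothetical helper: melt from radiation only; negative input gives no melt."""
    return energy_to_depth_conversion() * max(net_radiation_w_m2, 0.0) * days


if __name__ == "__main__":
    print(f"m_q is about {energy_to_depth_conversion():.3f} cm W^-1 m^2 day^-1")
```

Under these assumptions the result rounds to the 0.026 quoted on the slide, which suggests m_q is simply the latent heat of fusion expressed in depth units.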
Magnitude of snowmelt: modeled minus observed snow water equivalence, Tokopah basin, Sierra Nevada [Figure: SWE difference, cm; panels: AVIRIS albedo; assumed w/ update; assumed albedo]
The data author’s perspective on drivers and constraints
• The science information user:
  • I want reliable, timely, usable science information products
  • Accessibility
  • Accountability
• The funding agencies and the science community:
  • We want this to be done by a distributed federation of providers, not just by data centers
  • Scalability
• The science information provider:
  • I’m doing just fine, thanks.
  • Transparency
Research vs. production computing
Research computing is …
• Heterogeneous: multiple platforms, applications, languages
• Idiosyncratic: researchers typically have highly customized computing environments
• Problem-driven: focus on results, not processes
Production computing is …
• Robust: reliable, not just correct
• Standardized: can easily substitute components for repair, upgrade, etc.
• Scalable: accommodates steady or increasing demand for product
Principles
• Goal
  • Help scientists become information providers in a federated data system
• Prime Directive
  • Minimal disruption of a working scientist’s computational environment
• Ultimate product
  • Software, system architecture, and procedures for turning science projects into a federation of providers
ESSW: Our Earth System Science Workbench
Producer and consumer issues can both be addressed by a laboratory metaphor
• Experiment
  • Network of models
  • … ingesting / synthesizing data
  • … generating products
• Laboratory
  • Experiment execution environment
  • Computing + storage = accessibility + scalability
• Lab Notebook
  • Persistent storage that can be queried
  • Keeps track of all experiments
  • Documentation + lineage = accountability
Use existing science applications
• No “standard” Earth science computing environment
  • commercial packages (ArcGIS, ENVI, MATLAB, …)
  • public packages/models (MM5, MODTRAN, …)
  • locally-developed codes
• Example: snow cover from AVHRR
  • commercial + standalone programs
  • parameters highly customized for UCSB
• How do we get these programs to
  • communicate
  • cooperate with the Earth System Science Workbench (ESSW),
  without rewriting?
[Diagram stages: Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Snow-Covered Area; Rectify; Snow Maps]
Wrap Your App
• No changes, just additions
• Wrapper scripts
  • Make program (groups) look like ESSW experiments
• ESSW daemon
  • Converts wrapper output to database input
• ESSW database
  • Stores converted wrapper output
[Diagram labels: Scripts talk to ESSW; XML + SQL; Perl API; ESSW daemon; ESSW Database; MySQL; Java; JDBC; Perl. Pipeline stages: Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Snow-Covered Area; Rectify; Snow Maps]
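The wrapping idea can be sketched in a few lines. ESSW itself used Perl wrappers; the Python sketch below is only an illustration of the pattern, not ESSW code. The function `run_wrapped`, the XML element names, and the file names are all hypothetical. The wrapper runs an unmodified program and emits an XML report that a daemon could later convert into database rows.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def run_wrapped(experiment, command, inputs, outputs):
    """Run an unmodified program and return an XML report describing the run."""
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    finished = datetime.now(timezone.utc).isoformat()

    # Hypothetical report schema; the real ESSW format is not shown on the slides.
    report = ET.Element("experiment", name=experiment)
    ET.SubElement(report, "command").text = " ".join(command)
    ET.SubElement(report, "started").text = started
    ET.SubElement(report, "finished").text = finished
    ET.SubElement(report, "exit_status").text = str(result.returncode)
    for path in inputs:
        ET.SubElement(report, "input", path=path)
    for path in outputs:
        ET.SubElement(report, "output", path=path)
    return ET.tostring(report, encoding="unicode")


if __name__ == "__main__":
    # Stand-in for a real science program: a Python one-liner that does nothing.
    print(run_wrapped(
        "avhrr_ingest",
        [sys.executable, "-c", "pass"],
        inputs=["pass.telemetry"],   # hypothetical file name
        outputs=["scene.l1b"],       # hypothetical file name
    ))
```

Note that the wrapper records whatever the script author declares as inputs and outputs, not what the program actually touched.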
Detailed example [lineage diagram; node names and labels as recovered]
• avhrr_L0: AVHRR Level 0 product
• avhrr_ingest: AVHRR telemetry ingest
• avhrr_l1b: AVHRR Level 1B product
• Hand navigation details
• avhrr_handNav: Hand navigation procedure
• avhrr_navd_l1b: AVHRR Level 1B: navigated
• avhrr_snowModel: Multi-channel snow-covered area algorithm
• avhrr_sca: Snow-covered area
• avhrr_copyNav: Copy navigated image
• avhrr_navd_sca: SCA: navigated
ESSW Lessons
• Providers are customers
  • Federations aren’t much good unless scientists are happy to put information in them
• A light touch is the right touch
  • Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
• Scientists do write scripts, but not necessarily Perl
  • Scripting (gluing stuff together) comes naturally to scientists
• Scientists don’t write DTDs
• Nobody calls metadata APIs
ESSW was automatic, but not automatic enough…
ES3: Earth System Science Server
[Diagram labels: data lineage tracking; MODster; OpenDAP; Watershed-scale snow product; MODIS; Microsoft TerraServer; AVHRR; Global-scale snow product; Alexandria Digital Library; Corona; BUB data storage; ROCKS processing clusters]
From ESSW to ES3: Summary
• Perl wrappers → probulators
• Perl API → web services + RDF messages
• SQL → XML database(s)
From wrappers to probulators
Wrappers: active lineage
• Good
  • Complete control over what gets recorded
  • Single language/API for all wrapped events
  • Not tied to execution
    • You can even lie about what happened
• Bad
  • Must explicitly script everything
  • Scripts can drift from reality
    • You can even lie about what happened
From wrappers to probulators
Probulators: passive lineage
• Good
  • Record what actually happened
    • Not just what you think happened
    • Not what didn’t happen
  • Automatic: don’t have to write new scripts for everything
• Bad
  • Different flavors for different environments
  • Can’t just do everything in Perl…
Probulator flavors
• Instrumentation
  • Insert lineage capture instructions directly into science codes
  • e.g. “I just created file ‘foo’”
  • Typical implementation: preprocessor/precompiler
• Overriding
  • Replace standard routines/libraries with lineage-capturing versions
  • e.g. open(…) → snoopy_open(…)
  • Typical implementation: modify execution environment
    • environment variables
    • configuration files
• Passive monitoring
  • Trace program execution
  • e.g. “called open() with args foo, bar, …”
  • Typical implementation: strace’d shell
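The "overriding" flavor can be illustrated with a minimal sketch that replaces Python's built-in `open` with a lineage-capturing version, in the spirit of the open(…) → snoopy_open(…) example above. This is not ES3 code: `lineage_log`, `lineage_capturing_open`, and `science_code` are hypothetical names, and a real probulator would write to a logfile rather than an in-memory list.

```python
import builtins
import os
import tempfile
from contextlib import contextmanager

lineage_log = []  # hypothetical in-memory stand-in for a probulator logfile


@contextmanager
def lineage_capturing_open():
    """Temporarily replace builtins.open with a version that records each call."""
    original_open = builtins.open

    def snoopy_open(file, mode="r", *args, **kwargs):
        # Record the event, then defer to the real open() so behavior is unchanged.
        action = "read" if mode.startswith("r") and "+" not in mode else "write"
        lineage_log.append((action, str(file)))
        return original_open(file, mode, *args, **kwargs)

    builtins.open = snoopy_open
    try:
        yield lineage_log
    finally:
        builtins.open = original_open  # always restore the standard routine


def science_code(path):
    """Unmodified 'science code': writes a file, then reads it back."""
    with open(path, "w") as handle:
        handle.write("snow")
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "foo")
        with lineage_capturing_open() as log:
            science_code(target)
        print(log)
```

The science code is not edited at all; only the execution environment changes, which is what distinguishes this flavor from instrumentation.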
ES3 lineage architecture
[Diagram components: probulator 1 … probulator n; logfiles; logger; transmitter; ES3 core]
Now What?
• Probulator reports are not universally unique
  • Q: How do we hook separate reports together?
  • A: Logger assigns UUIDs to
    • Data streams
    • Processes
    • Jobs (workflows)
• Lineage is not explicit
  • Q: How do we publish lineage?
  • A: ES3 Core builds a serialized graph
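A minimal sketch of the two answers above: assign a UUID to each data stream, process, and job so that separate reports can be joined, then serialize the resulting graph. The class `LineageLogger`, the relation names, and the job name are hypothetical, and JSON is swapped in for brevity where the slides mention RDF messages and XML databases. The stream and process names are borrowed from the AVHRR example earlier in the talk.

```python
import json
import uuid


class LineageLogger:
    """Assigns UUIDs to reported entities and collects lineage edges."""

    def __init__(self):
        self.nodes = {}   # uuid string -> {"kind": ..., "label": ...}
        self.edges = []   # (source uuid, relation, target uuid)
        self._ids = {}    # (kind, local name) -> uuid string

    def identify(self, kind, local_name):
        """Return a stable UUID for a locally named stream, process, or job."""
        key = (kind, local_name)
        if key not in self._ids:
            new_id = str(uuid.uuid4())
            self._ids[key] = new_id
            self.nodes[new_id] = {"kind": kind, "label": local_name}
        return self._ids[key]

    def report(self, job, process, inputs, outputs):
        """Link one probulator report into the graph."""
        job_id = self.identify("job", job)
        proc_id = self.identify("process", process)
        self.edges.append((proc_id, "partOf", job_id))
        for name in inputs:
            self.edges.append((self.identify("stream", name), "inputTo", proc_id))
        for name in outputs:
            self.edges.append((proc_id, "generated", self.identify("stream", name)))

    def serialize(self):
        """Serialize the graph; JSON is used here only for brevity."""
        return json.dumps({"nodes": self.nodes, "edges": self.edges}, indent=2)


if __name__ == "__main__":
    logger = LineageLogger()
    # Two separate reports that share the stream "avhrr_l1b".
    logger.report("snow_maps", "avhrr_ingest", ["avhrr_L0"], ["avhrr_l1b"])
    logger.report("snow_maps", "avhrr_handNav", ["avhrr_l1b"], ["avhrr_navd_l1b"])
    print(logger.serialize())
```

Because both reports resolve "avhrr_l1b" to the same UUID, the output of the first process becomes the input of the second in the serialized graph.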
Products available from http://www.snow.ucsb.edu (forthcoming)
• Fractional snow-covered area, grain size (and contaminants) from daily MODIS images
  • Quality flags for cloud cover, highly oblique viewing
  • Fractional coverage of other endmembers
• Best estimate of snow-covered area and broadband albedo on that date
  • Extrapolating from previous values to that date and smoothing
• End-of-season reanalysis of daily snow-covered area and broadband albedo
  • Interpolation, smoothing, comparison with in situ snow pillow data