
The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation

Collect. Store. Present. Analyze. Search. Retrieve. Jeff Dozier, James E. Frew. Snow spectral reflectance and absorption coefficient of ice. Landsat Thematic Mapper (TM) band combinations.


Presentation Transcript


  1. The Data Author’s Perspective: Lessons Learned From Data Creation to Data Curation. Jeff Dozier, James E. Frew. Collect, Store, Present, Analyze, Search, Retrieve.

  2. Snow spectral reflectance and absorption coefficient of ice

  3. Landsat Thematic Mapper (TM) band combinations: Bands 4, 3, 2 (R, G, B) and Bands 5, 4, 2 (R, G, B)

  4. What you see, through Earth’s atmosphere

  5. Spatial, spectral characteristics of Landsat and MODIS

  6. What a multispectral sensor sees

  7. Set of equations for each pixel

  8. Fractional snow cover, Sierra Nevada, March 7, 2004

  9. Sierra Nevada topography

  10. Daily MODIS acquisition, processing for Sierra Nevada snow cover and albedo

  11. Examples of fractional snow cover, January through April 2004 (panels: Jan 01, Jan 17, Mar 26, Apr 08)

  12. Examples of grain size, January through April 2004 (panels: Jan 01, Jan 17, Mar 26, Apr 08)

  13. Effect of vegetation (2004 panels: March 3 vs March 4; March 4 vs March 5; March 5 vs March 7; March 7 vs March 8)

  14. Applications: snowmelt modeling, Marble Fork of the Kaweah River (Molotch et al., GRL, 2004). [Equation not recovered; its annotations read: snow-covered area; net radiation > 0; degree days > 0.] Where: m_q = energy to water depth conversion, 0.026 cm W⁻¹ m² day⁻¹

  15. Magnitude of snowmelt: modeled minus observed snow water equivalence (SWE difference, cm), Tokopah basin, Sierra Nevada. Panels: assumed albedo; assumed albedo with update; AVIRIS albedo.

  16. The data author’s perspective on drivers and constraints
     • The science information user: “I want reliable, timely, usable science information products”
       • Accessibility
       • Accountability
     • The funding agencies and the science community: “We want this to be done by a distributed federation of providers, not just by data centers”
       • Scalability
     • The science information provider: “I’m doing just fine, thanks.”
       • Transparency

  17. Research vs. production computing
     • Research computing is …
       • Heterogeneous: multiple platforms, applications, languages
       • Idiosyncratic: researchers typically have highly customized computing environments
       • Problem-driven: focus on results, not processes
     • Production computing is …
       • Robust: reliable, not just correct
       • Standardized: can easily substitute components for repair, upgrade, etc.
       • Scalable: accommodates steady or increasing demand for product

  18. Principles
     • Goal
       • Help scientists become information providers in a federated data system
     • Prime Directive
       • Minimal disruption of a working scientist’s computational environment
     • Ultimate product
       • Software, system architecture, and procedures for turning science projects into a federation of providers

  19. Model structure for MODIS snow-covered area and albedo

  20. Lineage: current best practice

  21. ESSW: Our Earth System Science Workbench. Producer and consumer issues can both be addressed by a laboratory metaphor
     • Experiment
       • Network of models
       • … ingesting / synthesizing data
       • … generating products
     • Laboratory
       • Experiment execution environment
       • Computing + storage = accessibility + scalability
     • Lab Notebook
       • Persistent storage that can be queried
       • Keeps track of all experiments
       • Documentation + lineage = accountability

  22. Use existing science applications
     • No “standard” Earth science computing environment
       • commercial packages (ArcGIS, ENVI, MATLAB, …)
       • public packages/models (MM5, MODTRAN, …)
       • locally-developed codes
     • Example: Snow cover from AVHRR
       • commercial + standalone programs
       • parameters highly customized for UCSB
     • How do we get these programs to communicate and cooperate with the Earth System Science Workbench (ESSW), without rewriting?
     • Diagram labels (AVHRR processing steps): Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Snow-Covered Area; Rectify; Snow Maps

  23. Wrap Your App: Scripts talk to ESSW
     • No changes, just additions
     • Wrapper scripts
       • Make program (groups) look like ESSW experiments
     • ESSW daemon
       • Converts wrapper output to database input
     • ESSW database
       • Stores converted wrapper output
     • Diagram labels: Receive; Ingest and Calibrate; Navigate (Manual/Automatic); Snow-Covered Area; Rectify; Snow Maps; Perl API; XML + SQL; ESSW daemon; ESSW Database; MySQL; Java; JDBC; Perl
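
The wrapper idea on this slide can be illustrated with a minimal sketch. It is written in Python rather than the Perl the slides mention, and it is not the ESSW API: the function name `run_wrapped` and every XML element and attribute name are hypothetical, chosen only for illustration. The point it shows is "no changes, just additions": the science program runs untouched, and the wrapper emits a record describing the run that a daemon could later convert to database input.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def run_wrapped(name, command, inputs, outputs):
    """Run an unmodified program and describe the run as an XML record.

    Element and attribute names are hypothetical, not the ESSW schema.
    """
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    finished = datetime.now(timezone.utc).isoformat()

    record = ET.Element("experiment", name=name, started=started, finished=finished)
    ET.SubElement(record, "command").text = " ".join(command)
    ET.SubElement(record, "exit_status").text = str(result.returncode)
    # The wrapper author declares inputs and outputs; nothing verifies them.
    for path in inputs:
        ET.SubElement(record, "input", path=path)
    for path in outputs:
        ET.SubElement(record, "output", path=path)
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Stand-in for a science program; a real wrapper would call the existing executable.
    print(run_wrapped(
        "avhrr_ingest",
        [sys.executable, "-c", "print('ingest done')"],
        inputs=["avhrr_L0"],
        outputs=["avhrr_l1b"],
    ))
```

Because the inputs and outputs are declared by the script author rather than observed, a record like this can drift from what the program actually did.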

  24. Detailed example (diagram labels)
     • avhrr_L0: AVHRR Level 0 product
     • avhrr_ingest: AVHRR telemetry ingest
     • avhrr_l1b: AVHRR Level 1B product
     • avhrr_handNav: Hand navigation procedure (with hand navigation details)
     • avhrr_navd_l1b: AVHRR Level 1B: navigated
     • avhrr_snowModel: Multi-channel snow-covered area algorithm
     • avhrr_sca: Snow-covered area
     • avhrr_copyNav: Copy navigated image
     • avhrr_navd_sca: SCA: navigated

  25. ESSW Lessons
     • Providers are customers
       • Federations aren’t much good unless scientists are happy to put information in them
     • A light touch is the right touch
       • Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
     • Scientists do write scripts, but not necessarily Perl
       • Scripting (gluing stuff together) comes naturally to scientists
     • Scientists don’t write DTDs
       • Nobody calls metadata APIs
     • ESSW was automatic, but not automatic enough…

  26. ES3: Earth System Science Server. Diagram labels: data lineage tracking; MODster; OpenDAP; Watershed-scale snow product; MODIS; Microsoft TerraServer; AVHRR; Global-scale snow product; Alexandria Digital Library; Corona; BUB data storage; ROCKS processing clusters

  27. From ESSW to ES3: Summary
     • Perl wrappers → Probulators
     • Perl API → web services + RDF messages
     • SQL → XML database(s)

  28. From wrappers to probulators. Wrappers: active lineage
     • Good
       • Complete control over what gets recorded
       • Single language/API for all wrapped events
       • Not tied to execution
       • You can even lie about what happened
     • Bad
       • Must explicitly script everything
       • Scripts can drift from reality
       • You can even lie about what happened

  29. From wrappers to probulators. Probulators: passive lineage
     • Good
       • Record what actually happened
       • Not just what you think happened
       • Not what didn’t happen
       • Automatic: don’t have to write new scripts for everything
     • Bad
       • Different flavors for different environments
       • Can’t just do everything in Perl…

  30. Probulator flavors
     • Instrumentation
       • Insert lineage capture instructions directly into science codes
       • e.g. “I just created file ‘foo’”
       • Typical implementation: preprocessor/precompiler
     • Overriding
       • Replace standard routines/libraries with lineage-capturing versions
       • e.g. open(…) → snoopy_open(…)
       • Typical implementation: modify execution environment (environment variables, configuration files)
     • Passive monitoring
       • Trace program execution
       • e.g. “called open() with args foo, bar, …”
       • Typical implementation: strace’d shell
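
The "overriding" flavor can be sketched in a few lines of Python. This is an illustration only, not the ES3 implementation: the slide names `snoopy_open` as an example, while the context manager `snoop_opens`, the stand-in `science_code`, and the event format are assumptions made here. The sketch swaps Python's built-in `open` for a lineage-capturing version, so unmodified code reports the files it actually touched.

```python
import builtins
import contextlib
import os
import tempfile


@contextlib.contextmanager
def snoop_opens(events):
    """Temporarily replace builtins.open with a version that records each call."""
    original_open = builtins.open

    def snoopy_open(file, mode="r", *args, **kwargs):
        handle = original_open(file, mode, *args, **kwargs)
        # Record only after the open succeeds, so the log reflects what happened.
        kind = "write" if set(mode) & set("wax+") else "read"
        events.append((str(file), kind))
        return handle

    builtins.open = snoopy_open
    try:
        yield events
    finally:
        builtins.open = original_open  # always restore the standard routine


def science_code(src, dst):
    """Stand-in for unmodified science code: reads one file, writes another."""
    with open(src) as f:
        text = f.read()
    with open(dst, "w") as f:
        f.write(text.upper())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = os.path.join(tmp, "foo"), os.path.join(tmp, "bar")
        with open(src, "w") as f:
            f.write("snow")
        with snoop_opens([]) as events:
            science_code(src, dst)
        print(events)
```

Note that `science_code` contains no lineage calls at all, which is the contrast with wrappers. The cost is that a separate override is needed for each environment; this one only sees Python-level `open` calls.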

  31. ES3 lineage architecture. Diagram labels: probulator 1 … probulator n; logger; logfiles; transmitter; ES3 core

  32. Now What?
     • Probulator reports not universally unique
       • Q: How to hook separate reports together?
       • A: Logger assigns UUIDs to data streams, processes, and jobs (workflows)
     • Lineage not explicit
       • Q: How to publish lineage?
       • A: ES3 Core builds serialized graph
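
The two answers on this slide can be sketched together. This is a minimal illustration under stated assumptions, not the ES3 design: the `Logger` class, the report fields (`job`, `process`, `reads`, `writes`), and the JSON graph layout are all hypothetical, and treating each process name as a single process is a simplification. The logger maps each locally named data stream, process, and job to a UUID, so reports from separate probulators join on shared identifiers; a second step serializes the resulting lineage graph.

```python
import json
import uuid


class Logger:
    """Assigns one UUID per distinct local name so separate reports can be joined."""

    def __init__(self):
        self._ids = {}

    def uid(self, kind, local_name):
        key = (kind, local_name)
        if key not in self._ids:
            self._ids[key] = str(uuid.uuid4())
        return self._ids[key]

    def tag(self, report):
        """Return a copy of a report with UUIDs in place of local names."""
        return {
            "job": self.uid("job", report["job"]),
            "process": self.uid("process", report["process"]),
            "reads": [self.uid("data", d) for d in report["reads"]],
            "writes": [self.uid("data", d) for d in report["writes"]],
        }


def build_graph(tagged_reports):
    """Serialize lineage: data -> process for reads, process -> data for writes."""
    nodes, edges = set(), []
    for r in tagged_reports:
        nodes.add(r["process"])
        for d in r["reads"]:
            nodes.add(d)
            edges.append({"from": d, "to": r["process"], "job": r["job"]})
        for d in r["writes"]:
            nodes.add(d)
            edges.append({"from": r["process"], "to": d, "job": r["job"]})
    return json.dumps({"nodes": sorted(nodes), "edges": edges}, indent=2)


if __name__ == "__main__":
    logger = Logger()
    reports = [  # two reports that share the intermediate product avhrr_l1b
        {"job": "j1", "process": "avhrr_ingest",
         "reads": ["avhrr_L0"], "writes": ["avhrr_l1b"]},
        {"job": "j1", "process": "avhrr_snowModel",
         "reads": ["avhrr_l1b"], "writes": ["avhrr_sca"]},
    ]
    print(build_graph([logger.tag(r) for r in reports]))
```

The shared UUID for `avhrr_l1b` is what links the two reports into one chain from Level 0 product to snow-covered area.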

  33. Products available from http://www.snow.ucsb.edu (forthcoming)
     • Fractional snow-covered area, grain size (and contaminants) from daily MODIS images
       • Quality flags for cloud cover, highly oblique viewing
       • Fractional coverage of other endmembers
     • Best estimate of snow-covered area and broadband albedo on that date
       • Extrapolating from previous values to that date and smoothing
     • End-of-season reanalysis of daily snow-covered area and broadband albedo
       • Interpolation, smoothing, comparison with in situ snow pillow data
