
Adventures in Web Services for Large Geophysical Datasets

Joe Sirott, PMEL/NOAA


Presentation Transcript


  1. Adventures in Web Services for Large Geophysical Datasets • Joe Sirott, PMEL/NOAA

  2. Motivation • Zonal averages of precipitation trends. From Zhang et al., Nature 448, 461-465 (26 July 2007)

  3. Seasonal zonal averages of Arctic temperature trends. From Graversen et al., Nature 451, 53-56 (3 Jan 2008)

  4. Use case • Calculate zonally averaged seasonal temperature trends from 30N to 90N, using the 20th-century climate experiment from four climate models (NASA GISS, NCAR PCM and CCSM, GFDL CM2.1, and Hadley CM3) in the CMIP3 archives • Total of 81 files, 36GB • Time period of interest: 1979-2000

  5. Recipe is… • Regrid all model data to a common grid • Calculate seasonal ensemble means for all models for 30N-90N, 1979-2000 • Calculate zonal means from the seasonal ensemble means • Calculate seasonal trends from the zonal means • Plot/download results
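The recipe above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the workflow used in the talk: it assumes the model fields have already been regridded to a common grid and restricted to one season, it uses tiny synthetic grids instead of CMIP3 files, and all function names (`ensemble_mean`, `zonal_mean`, `linear_trend`, `zonal_trends`) are hypothetical.

```python
from statistics import mean


def ensemble_mean(fields):
    """Average several [lat][lon] grids that already share a common grid."""
    n_lat, n_lon = len(fields[0]), len(fields[0][0])
    return [[mean(f[i][j] for f in fields) for j in range(n_lon)]
            for i in range(n_lat)]


def zonal_mean(field):
    """Average a [lat][lon] grid over longitude: one value per latitude."""
    return [mean(row) for row in field]


def linear_trend(years, values):
    """Ordinary least-squares slope of values against years (units per year)."""
    y_bar, v_bar = mean(years), mean(values)
    num = sum((y - y_bar) * (v - v_bar) for y, v in zip(years, values))
    den = sum((y - y_bar) ** 2 for y in years)
    return num / den


def zonal_trends(years, fields_by_year):
    """fields_by_year[k] holds one grid per model for years[k]."""
    zonal = [zonal_mean(ensemble_mean(models)) for models in fields_by_year]
    return [linear_trend(years, [z[i] for z in zonal])
            for i in range(len(zonal[0]))]


# Synthetic example: 2 models, 2 latitudes x 3 longitudes, 4 years.
# Latitude 0 warms by 0.5 per year, latitude 1 by 0.25 per year.
def synthetic_grid(year, offset):
    dt = year - 1979
    return [[offset + 0.5 * dt] * 3, [offset + 0.25 * dt] * 3]


years = [1979, 1980, 1981, 1982]
fields_by_year = [[synthetic_grid(y, 0.0), synthetic_grid(y, 1.0)] for y in years]
trends = zonal_trends(years, fields_by_year)
```

The order of operations follows the slide: ensemble mean first, then zonal mean, then a trend per latitude. A real analysis would also weight by grid-cell area and handle missing values, which this sketch ignores.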

  6. Traditional approach • Find datasets/variables of interest • Download individual data files or subset with OPeNDAP • Analyze data locally
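As a rough illustration of the "subset with OPeNDAP" step, the sketch below builds a DAP2-style request URL with an array hyperslab constraint (`[start:stop]` or `[start:stride:stop]`, inclusive index ranges). The dataset URL, the variable name `tas`, and the index ranges are hypothetical placeholders, and the helper names are made up for this example.

```python
def hyperslab(start, stop, stride=1):
    """Format one dimension of a DAP2 hyperslab (inclusive index range)."""
    if stride == 1:
        return f"[{start}:{stop}]"
    return f"[{start}:{stride}:{stop}]"


def subset_url(dataset_url, variable, ranges, response="dods"):
    """Build a request URL for a subset of one variable.

    ranges is a list of (start, stop) or (start, stop, stride) tuples,
    one per dimension, in the variable's own dimension order.
    response is the suffix selecting the response type, e.g. "dods"
    (binary data) or "ascii".
    """
    constraint = variable + "".join(hyperslab(*r) for r in ranges)
    return f"{dataset_url}.{response}?{constraint}"


# Hypothetical example: 264 monthly time steps, a band of latitude
# indices, and all 144 longitudes of a variable named "tas".
url = subset_url(
    "http://example.org/opendap/model.nc",
    "tas",
    [(0, 263), (48, 71), (0, 143)],
    response="ascii",
)
```

Index ranges are positions along each axis, not coordinate values, so a client normally reads the coordinate variables first to translate "30N to 90N" into indices.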

  7. Problems with traditional approach • Awkward user interface(s) • Obscure UI naming conventions make it difficult to find variables of interest • Datasets often aren’t aggregated • Subsetting and/or aggregation services often fail with large datasets (e.g. out of memory errors) • Requires download of 36GB of data (file download) or ~2.5GB (OPeNDAP) for a final product of ~5KB
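To put the last point in proportion, the arithmetic below compares the download sizes quoted on the slide with the size of the final product. It assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and treats the approximate figures as exact, so the ratios are order-of-magnitude estimates only.

```python
KB = 10**3
GB = 10**9

full_download = 36 * GB          # all 81 files
opendap_subset = 25 * GB // 10   # ~2.5 GB after OPeNDAP subsetting
final_product = 5 * KB           # ~5 KB of zonal trends

# Bytes transferred per byte of final result.
file_ratio = full_download // final_product
opendap_ratio = opendap_subset // final_product

print(f"file download: {file_ratio:,} bytes moved per result byte")
print(f"OPeNDAP subset: {opendap_ratio:,} bytes moved per result byte")
```

Even with server-side subsetting, several hundred thousand bytes cross the network for every byte of the answer.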

  8. More modern approach • Aggregated data • Spatial or temporal subsetting • Meaningful variable and dataset names • Modern Web UI

  9. Mandatory product plug

  10. Dapper (dapper.pmel.noaa.gov/dapper) • Web server that provides distributed access to in-situ or gridded data via the OPeNDAP protocol • Aggregates local files, or remote datasets via HTTP or OPeNDAP • Streams data (no more “out of memory” errors)
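The general idea behind streaming, as opposed to loading a whole aggregation into memory, can be sketched as follows. This is not Dapper's implementation; it is a generic illustration with hypothetical names (`read_chunks`, `streaming_mean`) in which a Python list stands in for data arriving from files or remote servers.

```python
def read_chunks(values, chunk_size):
    """Yield successive slices of values, one chunk at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def streaming_mean(chunks):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in chunks:
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


# Ten values processed three at a time.
result = streaming_mean(read_chunks(list(range(10)), 3))
```

Memory use depends on the chunk size rather than on the total size of the dataset, which is what lets a server handle requests larger than its available memory.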

  11. DChart (dapper.pmel.noaa.gov/dchart) • Browser-based tool for visualizing or downloading in-situ or gridded ocean or atmospheric data • Also aggregates data • AJAX-based user interface • Access to ~3.5 TB of gridded data • Configurable UI

  12. What’s missing? • Still requires download of ~2.5GB for final product ~5KB • Lots of clicking to download multiple datasets • BIG problem for AR5 data needs (>1PB)

  13. Missing piece

  14. Ideal analysis environment (scientist perspective) • Highly interactive (i.e. command line) • Scripting in familiar language of choice (bash, Python, Ruby, Matlab) • Access to multiple tools (Matlab, nco, cdo, GrADS, Ferret, gdal, … ) • Access to custom home-grown tools • Storage of intermediate products (anomalies, statistics, etc.)

  15. Limitations of Web services • Users locked in to backend analysis software • Difficult to debug • Steep learning curve • How to handle long-lived operations? • Security problems • No (or limited) scripting capabilities • Not interactive

  16. A cloud computing alternative • Upload data to cloud • Move computation to data • Boot VM preloaded with common analysis tools • Users can customize (and share) VM images and data • Users have full ssh access to Xen VM(s) running Linux with local access to data stored in cloud

  17. Amazon AWS • Amazon EC2: uses customizable Linux Xen image; start 1-100 hosts in parallel; $0.10/instance-hour • Amazon S3: data storage service; $0.15/GB-month for storage; data transfer in $0.10/GB; data transfer out $0.18/GB
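Using the prices quoted on this slide, a back-of-the-envelope cost for the use case can be estimated as below. The rates are the ones on the slide and are not current pricing; the 10 instance-hours of analysis and the helper name `estimate_cost` are assumptions made for this example, and the ~5KB result is treated as a negligible outbound transfer.

```python
from decimal import Decimal

# Rates as quoted on the slide (US dollars).
EC2_PER_INSTANCE_HOUR = Decimal("0.10")
S3_PER_GB_MONTH = Decimal("0.15")
TRANSFER_IN_PER_GB = Decimal("0.10")
TRANSFER_OUT_PER_GB = Decimal("0.18")


def estimate_cost(storage_gb, months, gb_in, gb_out, instance_hours):
    """Return a cost breakdown in dollars for one analysis scenario."""
    costs = {
        "storage": Decimal(storage_gb) * Decimal(months) * S3_PER_GB_MONTH,
        "transfer_in": Decimal(gb_in) * TRANSFER_IN_PER_GB,
        "transfer_out": Decimal(gb_out) * TRANSFER_OUT_PER_GB,
        "compute": Decimal(instance_hours) * EC2_PER_INSTANCE_HOUR,
    }
    costs["total"] = sum(costs.values(), Decimal("0"))
    return costs


# Upload the 36 GB once, keep it for one month, run 10 instance-hours
# (an assumed figure), and download only the tiny final product.
costs = estimate_cost(storage_gb=36, months=1, gb_in=36, gb_out=0, instance_hours=10)
```

Under these assumptions the one-time upload and a month of storage dominate, and the cost of moving the final product out of the cloud is effectively zero.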

  18. Cloud analysis architecture

  19. Sample workflow (free version) • User authenticated via Web UI • EC2 instance booted with OPeNDAP access to datasets (stored on S3 or EC2 volumes) • User rpms installed (optional) • ssh access to instance using ssh keypair (generated when account issued) • User analyzes, downloads, visualizes, ... • Instance restored to pool after user done (or after period of inactivity)
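The last step of the workflow, returning an instance to the pool when the user is done or has been inactive, can be sketched with a small bookkeeping class. Everything here is hypothetical (the class name `InstancePool`, the instance IDs, the timeout value); it makes no calls to EC2 and only models which instance is assigned to which user.

```python
class InstancePool:
    """Track which pre-booted instances are free and which are in use."""

    def __init__(self, instance_ids):
        self.free = list(instance_ids)
        self.in_use = {}  # instance id -> (user, time of last activity, seconds)

    def acquire(self, user, now):
        """Hand a free instance to an authenticated user."""
        if not self.free:
            raise RuntimeError("no free instances")
        instance = self.free.pop(0)
        self.in_use[instance] = (user, now)
        return instance

    def touch(self, instance, now):
        """Record user activity on an instance."""
        user, _ = self.in_use[instance]
        self.in_use[instance] = (user, now)

    def release(self, instance):
        """Return an instance to the pool when the user is done."""
        del self.in_use[instance]
        self.free.append(instance)

    def reap_idle(self, now, timeout):
        """Release every instance idle for at least timeout seconds."""
        idle = [i for i, (_, last) in self.in_use.items() if now - last >= timeout]
        for instance in idle:
            self.release(instance)
        return idle


pool = InstancePool(["i-1", "i-2"])
a = pool.acquire("alice", now=0)
b = pool.acquire("bob", now=100)
pool.touch(b, now=500)                       # bob is still working
reaped = pool.reap_idle(now=700, timeout=600)
```

A real service would also need to wipe or re-image an instance before handing it to the next user, which is one of the details the talk leaves open.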

  20. Analysis cloud advantages • Scalable • Data lives in same network as software • No user software lock-in • Users can work in familiar environment • Security problems reduced • Interactive • Access to debugging tools BUT • Lots of details to work out!

  21. Questions?

  22. More info • PMEL Dapper Server: http://dapper.pmel.noaa.gov/dapper • PMEL DChart: http://dapper.pmel.noaa.gov/dchart • Downloads, propaganda: http://www.epic.noaa.gov/epic/software/dapper/ and http://www.epic.noaa.gov/epic/software/dchart/ • Joe.Sirott@noaa.gov
