Assembling Large, Multi-Sensor Climate Datasets Using SciFlo. Water Climatology & CVO Projects: Brian Wilson, Gerald Manipon, Zhangfan Xing, Eric Fetzer, and Tom Yunck Jet Propulsion Laboratory.
Do multi-instrument science by authoring a dataflow document for a reusable operator tree. Access scientific data by naming it.
Outline
• Use Cases for Decade-Scale Climate Science
• Review of SciFlo: Scientific DataFlow Engine
• Multi-Sensor Matchup Services
• Space/Time Query Services (GeoRegionQuery)
• Lessons Learned
  • Uniform interfaces needed (i.e., OpenSearch protocol)
  • Can use KML as space/time metadata standard
  • Will inevitably repeat the data merge & fusion workflows
  • “Scaling up” to years of data is hard
“Take the A-Train”
[Figure labels: CloudSat classes; water from AIRS, AMSR-E, MODIS; water from MLS]
Large-Scale Data Fusion
• Find Level-2 datasets: space/time granule query for multiple EOS (“A-Train”) instruments (AIRS, AMSR-E, AMSU, MODIS, CloudSat, GPS)
• Co-locate retrievals using space/time metadata: instantaneous “matchups” in space & time
• Read the data: temperature, water vapor, quality flags, cloud properties (HDF)
• Understand the data: units, quality control (non-trivial!), etc.
• Publish merged products: water vapor climatology, stratified by CloudSat cloud classes
• Publish multi-sensor “fused” products: determine instrument biases and understand them by stratifying; fuse L2 data on a common grid
SciFlo Applications
• AMAPS = Aerosol Modeling And Processing System
  • Amy Braverman, ACCESS PI; Joyce Penner, U. Michigan
  • Compare Aerosol Optical Depth (AOD) from MODIS, MISR, & AERONET to IMPACT model
• AQUA = Automated Query & Access
  • One-year ACCESS ECHO grant (Brian Wilson, PI)
  • Automated, repeatable access to 5-year EOS datasets for large-scale data mining
• MEASUREs Project – Eric Fetzer, PI
  • Publish a temperature & water vapor climatology stratified by cloud scene (CloudSat classes) using A-Train data (AIRS, AMSR-E, AMSU, MODIS, MLS)
• Climate Virtual Observatory
  • Examine the biases of temperature retrievals from AIRS, AMSU, MLS by comparisons to GPS occultations
  • Stratify biases by geophysical conditions, cloud scene, etc.; study decade-scale trends.
Carbon Cycle
GENESIS SciFlo Engine
• Automate large-scale, multi-instrument science processing by authoring a dataflow document that specifies a tree of executable operators/services.
• VizFlow Visual Authoring Tool (AJAX GUI in browser)
• Distributed Dataflow Execution Engine (in Python)
• Data Grid: Move data “granules” to the operators using FTP, HTTP, or OpenDAP URLs.
• Compute Grid: Move operators (executables) to the data.
• Built-in reusable operators provided for many tasks such as subsetting, co-registration, regridding, data fusion, etc.
• Custom operators easily plugged in by scientists.
• Leverage convergence of Web Services (SOAP) with Grid Services (Globus Toolkit v4).
Design Themes
• Name: algorithmically, permanently
  • us:gov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd
• Resolve: translate URI to one or more URLs
• Recognize: known data products (URL → URI)
  • http://sciflo/data/AIRS/L2/AIRS.2003.01.02.004.L2.RetStd.hdf
• Discover / Harvest: crawl data centers to gather URLs
• Cache / Localize: in known location on each node
• Teach the computer: provide rich metadata
  • Computers are stupid because we don’t tell them anything.
  • Metadata, metadata, metadata!
• Declare intentions (don’t write code)
  • Author XML workflow document by visual layout
  • No such thing as good code; all code rots.
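The "Resolve" and "Recognize" themes can be sketched as a pair of string translations between the example URI and the example URL on this slide. This is only an illustration: the assumption that every granule follows the same `instrument/level/granule.hdf` layout, and the function names `parse_uri`, `resolve`, and `recognize`, are hypothetical and not taken from SciFlo itself.

```python
from typing import NamedTuple

URI_PREFIX = "us:gov:nasa:eos"    # naming-authority prefix seen in the slide's URI
BASE_URL = "http://sciflo/data"   # host/path seen in the slide's URL


class GranuleName(NamedTuple):
    instrument: str   # e.g. "AIRS"
    level: str        # e.g. "L2"
    granule_id: str   # e.g. "AIRS.2003.01.02.004.L2.RetStd"


def parse_uri(uri: str) -> GranuleName:
    """Split a permanent name into parts (layout is an assumption)."""
    prefix, instrument, granule_id = uri.rsplit(":", 2)
    if prefix != URI_PREFIX:
        raise ValueError(f"unknown naming authority: {prefix}")
    # Assumption: the processing level is the 6th dot-separated field.
    level = granule_id.split(".")[5]
    return GranuleName(instrument, level, granule_id)


def resolve(uri: str) -> str:
    """Resolve: translate a URI into one candidate URL."""
    name = parse_uri(uri)
    return f"{BASE_URL}/{name.instrument}/{name.level}/{name.granule_id}.hdf"


def recognize(url: str) -> str:
    """Recognize: map a known data-product URL back to its URI."""
    filename = url.rsplit("/", 1)[-1]
    granule_id = filename[:-4] if filename.endswith(".hdf") else filename
    instrument = granule_id.split(".")[0]
    return f"{URI_PREFIX}:{instrument}:{granule_id}"


uri = "us:gov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd"
print(resolve(uri))
print(recognize(resolve(uri)) == uri)
```

A real resolver would return several replica URLs per name; the single-URL form here just keeps the round trip easy to follow.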
Design Themes (2)
• Publish algorithms as remotely-callable services
  • Easy software reuse
  • Choreograph remote Web Services: REST or SOAP
• Design “small” metadata objects in XML
  • XML Microformats: as “views” into (binary) data
  • Programming language-independent
• Path into data: use XPath, XQuery, OpenDAP
  • Generalized URL with query attached: http://host/dir/queryResponse.xml?xpath:.//Items/Item[1]
  • OpenDAP URL to fetch named variables from binary files: http://host/dap/AIRSdata.hdf?TAirStd
• Adapt interfaces, formats or XML: semantic mediation
  • Declare types, register conversion operators
  • Author mediators in XQuery or Python
  • Auto type conversions: e.g. between XML, Python, JSON
• Human-oriented programming, not object-oriented
VizFlow Flowchart GPS-AIRS Matchup & Temp. Profile Comparison • Connect a series of services and operators into a dataflow • Drag services/operators from menu, and drop onto the canvas • Lay out the flowchart by moving nodes • Connect the input/output ports by drawing lines • User guided by matching up port names and types
Service/Operator Choreography
Each SciFlo processing step is one of:
• Template for XML (or string) generation
• REST (HTTP GET) call: e.g. WMS/WCS, DAP URLs
• SOAP service call: “have WSDL, will call”
• XPath 2.0 transformation for XML mediation
• XQuery 1.0 query/transformation
• Command-line script or executable
• Python method call
• Scientist’s custom IDL or MATLAB script
• Other (What do you need?)
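To make the idea of choreographing steps concrete, here is a minimal sketch of a dataflow executor in which each step is a Python callable wired to the outputs of earlier steps. It is not the SciFlo engine or its XML document format; the `run_dataflow` function, the step names, and the `http://host/dap/` URL are assumptions made up for illustration. It relies on `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dataflow(steps, inputs):
    """Run steps in dependency order.

    steps:  {step_name: (callable, [names of inputs or upstream steps])}
    inputs: {input_name: value} supplied by the caller
    """
    # Only step-to-step edges matter for ordering; plain inputs are already available.
    graph = {name: [d for d in deps if d in steps]
             for name, (_, deps) in steps.items()}
    results = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
    return results


# Hypothetical three-step flow: template -> URL construction -> report.
steps = {
    "template": (lambda day: f"AIRS.{day}.L2.RetStd", ["day"]),
    "url": (lambda name: f"http://host/dap/{name}.hdf?TAirStd", ["template"]),
    "report": (lambda name, url: {"granule": name, "url": url},
               ["template", "url"]),
}

out = run_dataflow(steps, {"day": "2003.01.13.171"})
print(out["report"])
```

In a fuller version each callable could wrap a REST call, a SOAP call, or a command-line executable, which is the variety of step types the slide lists.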
Open Data Access Protocol (www.opendap.org)
• Drill down into a “deep web” of science data.
• Use a one-line query URL to retrieve a slice of a variable grid from a netCDF or HDF file anywhere in the world
• Binary wire protocol for fine-grained data transfer
• OpenDAP URL
  • http://gen-dev.jpl.nasa.gov/genesis/cgi-bin/dods/nph-dods/genesis/data/airs/L2/20030113/airx2ret/AIRS.2003.01.13.171.L2.RetStd.hdf?TAirStd(1:3, 3:6, 4:17)
• OpenDAP servers
  • netCDF, HDF, GRIB, FreeForm, JGOFS, other file formats
  • Easy to implement another server
• OpenDAP clients
  • MATLAB and IDL, any web browser
  • Python (pydap or SciFlo)
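The one-line query URL can be assembled programmatically. The sketch below builds a DAP2-style constraint with one bracketed index range per dimension, which is the form DAP2 servers generally expect; the slide writes the same subset informally in parentheses. The helper name `dap_slice_url` and the `http://host/dap/AIRSdata.hdf` base URL are placeholders, and the response suffix (`dods` for binary, `ascii` for text) may differ between server implementations.

```python
def dap_slice_url(base_url, variable, slices, response="dods"):
    """Build a DAP2-style URL requesting a hyperslab of one variable.

    slices: one (start, stop) pair of inclusive, zero-based indices
            per dimension of the variable.
    """
    for start, stop in slices:
        if start < 0 or stop < start:
            raise ValueError(f"bad index range: {start}:{stop}")
    hyperslab = "".join(f"[{start}:{stop}]" for start, stop in slices)
    return f"{base_url}.{response}?{variable}{hyperslab}"


# Same index ranges as the slide's TAirStd example, on a placeholder host.
url = dap_slice_url("http://host/dap/AIRSdata.hdf", "TAirStd",
                    [(1, 3), (3, 6), (4, 17)], response="ascii")
print(url)
```

Only the requested slice crosses the network, which is the "fine-grained data transfer" the slide refers to.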
AIRS/AMSU Scanning Geometry. Color fill = CloudSat Class (Sassen and Wang, 2008)
AIRS / CloudSat Overlaps
[Figure labels: CloudSat retrieval; 1 AMSU & 9 AIRS footprints]
Compute overlaps:
• Look up geometry
• Intersect CloudSat strip with ellipses (nearest neighbor)
• Save matchup indices
• Use indices later to subset temperature, water, and cloud data
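The overlap steps above can be sketched as a nearest-neighbor search that saves index pairs for later subsetting. This is a simplification, not the project's algorithm: it treats each footprint as a circle of radius `max_km` around its center instead of intersecting true scan ellipses, and the function names and sample coordinates are invented for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def match_strip_to_footprints(strip, footprints, max_km):
    """Return (strip_index, footprint_index) pairs.

    Each strip point is paired with its nearest footprint center,
    and the pair is kept only if the distance is within max_km.
    """
    if not footprints:
        return []
    matches = []
    for i, (slat, slon) in enumerate(strip):
        dists = [great_circle_km(slat, slon, flat, flon) for flat, flon in footprints]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] <= max_km:
            matches.append((i, j))
    return matches


# Invented (lat, lon) samples: three footprint centers and three strip points.
footprints = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
strip = [(0.0, 0.1), (0.0, 1.9), (10.0, 10.0)]
matches = match_strip_to_footprints(strip, footprints, max_km=20.0)

# Use the saved indices later to subset a per-footprint variable.
temperature = [290.0, 291.5, 289.2]
matched_temps = [temperature[j] for _, j in matches]
print(matches, matched_temps)
```

The brute-force search is quadratic in the number of points; for years of data a spatial index or a geometry lookup per scan line would be needed.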
AIRS / CloudSat Matchups. Color fill = CloudSat Class (Sassen and Wang, 2008). Central Equatorial Pacific. Black lines: AIRS ‘best’ retrieval altitude. X: no AIRS tropospheric profiling.
Multi-Instrument Atmospheric Science
AIRS/GPS Co-Registration: Point to Swath Matchup
[Figure: AIRS Level-2 swaths over the Pacific, with GPS Level-2 profile locations]
AIRS / GPS Matchups AIRS/GPS Temperature & Water Vapor Comparison Plots
Service Layers
• GeoRegionQuery(dataset, timeRange, latLonRectangle)
  • Return lists of (ftp, http, or dap) URLs
  • Time segment large queries and return results in batches
• AQUA GeoRegionQuery and OrderEntry
  • Query ECHO repository, order granules that are not on-line
  • Event callback when order is ready
• MatchupServices
  • Find strip to swath footprint overlaps: CloudSat & AIRS (AMSU, MLS)
  • Find point to swath overlaps: GPS and AIRS
  • Find satellite overflights of ground sites (AERONET, AIRNOW)
• Parameter Subsetting
  • Extract specified variables with geo-location from HDF
  • Bundle into simple netCDF file, only move custom subset
• Custom Analysis Workflows
  • SciFlo engine takes care of sequencing & distributed computing
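"Time segment large queries and return results in batches" can be illustrated with a small generator that splits a long time range into windows and issues one query per window. The real GeoRegionQuery service is not reproduced here; `fake_query` is a stand-in, and the one-day default window is an arbitrary choice.

```python
from datetime import datetime, timedelta


def segment_time_range(start, end, step):
    """Yield half-open (t0, t1) windows that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    t0 = start
    while t0 < end:
        t1 = min(t0 + step, end)
        yield t0, t1
        t0 = t1


def geo_region_query_batched(query, dataset, time_range, bbox,
                             step=timedelta(days=1)):
    """Call query() once per time window and yield each batch of results."""
    start, end = time_range
    for t0, t1 in segment_time_range(start, end, step):
        yield query(dataset, (t0, t1), bbox)


def fake_query(dataset, time_range, bbox):
    """Stand-in for a real granule query; returns one made-up URL per window."""
    t0, _ = time_range
    return [f"http://host/data/{dataset}/{t0:%Y%m%d}.hdf"]


batches = list(geo_region_query_batched(
    fake_query, "AIRS",
    (datetime(2006, 1, 1), datetime(2006, 1, 3, 12)),
    bbox=(-10, -180, 10, -120)))
print(batches)
```

Because each window is independent, the batches could also be dispatched asynchronously and collected by callback or polling.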
Scalability: Lessons Learned
• Provide both REST & SOAP interfaces
  • Also guarantees that SOAP interface has simple argument list
• Machine-to-machine interfaces vital, keep human out of loop
• Asynchronous & job segmentation capabilities needed for scalability to years of data
  • Event callbacks or poll URL for results
• Multiple layers of services
  • Query, order, matchup, parameter subsetting, custom workflows
  • Custom workflow docs. published as new REST/SOAP services
• Scalable space/time granule query
  • Publish space/time bounding boxes for granules as KML metadata
  • Crawl KML and provide space/time search
Data Access in SciFlo
A SciFlo Dataset is:
• Specified as a space/time query over collections of data products (or retrieved physical variables)
  • GeoRegionQuery(DataProduct, TimeRange, LatLonRegion)
  • GeoRegionQuery(PhysicalVariable, TimeRange, LatLonRegion)
• Realized as a list of object IDs or URIs (permanent names)
  • GeoRegionQuery returns unique objectIds along with geolocation metadata
• Accessed using a list of URLs pointing to on-line replicas of the data objects (files)
  • FindDataById(objectIds) → URLs (ftp, http, or OpenDAP)
  • Translate unique object IDs into list of on-line locations in DataPools or any SciFlo node
  • DataPools & SciFlo P2P network are “crawled” to update distributed translation tables
  • Or query ECHO metadata repository
• SciFlo network is a distributed cache for scientific datasets
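The translation from object IDs to replica URLs can be pictured as a lookup in a table that a crawler keeps up to date. The sketch below assumes a plain in-memory dictionary and an ordering that prefers DAP replicas; the table contents, host names, and the Python signature of `find_data_by_id` are illustrative assumptions, not the actual FindDataById service.

```python
def find_data_by_id(object_ids, table, prefer=("dap", "http", "ftp")):
    """Map each object ID to its replica URLs, preferred protocols first.

    table: {object_id: [(protocol, url), ...]}, as a crawler might build it.
    IDs with no known replica map to an empty list (a candidate for ordering).
    """
    rank = {proto: i for i, proto in enumerate(prefer)}
    resolved = {}
    for oid in object_ids:
        replicas = table.get(oid, [])
        ordered = sorted(replicas, key=lambda r: rank.get(r[0], len(rank)))
        resolved[oid] = [url for _, url in ordered]
    return resolved


# Made-up translation table with two replicas of one granule.
table = {
    "AIRS.2003.01.02.004.L2.RetStd": [
        ("ftp", "ftp://datapool/airs/AIRS.2003.01.02.004.L2.RetStd.hdf"),
        ("dap", "http://node1/dap/AIRS.2003.01.02.004.L2.RetStd.hdf"),
    ],
}

urls = find_data_by_id(
    ["AIRS.2003.01.02.004.L2.RetStd", "AIRS.2003.01.02.005.L2.RetStd"], table)
print(urls)
```

An ID that comes back with no URLs is the case where the slide's fallback applies: query the ECHO metadata repository instead.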
AQUA Client Architecture
[Diagram labels: AQUA GUI clients and machine clients; AQUA REST & SOAP services (list collections, time-segmented granule query, query & order, etc.), plus 2nd and 3rd AQUA instances; ECHO SOAP services (collection discovery, granule query, order items); data providers LARC, NSIDC_ECS, GSFCS4PA, ORNL_DAAC; clients fetch data or browse files using URLs]
Client GUI for ECHO Queries. Live AJAX GUI: Time Slider, Google Maps for space selection & bbox display, Query & Order, Live-Updating Collection & Granule Metadata
AQUA Open Search Interface
• Discover a Collection
  • http://server/aqua//collections?q=water+vapor&startIndex=1&count=200&format=atom
  • Returns Atom feed listing collections satisfying query keywords
  • Metadata included in XML feed
• Space/Time Query for Granules
  • http://server/aqua/-/granules/providerId/datasetName?time=2006-01-01T00:00:00,2006-02-01T00:00:00&bbox=south,west,north,east&responseGroups=Large&startIndex=1&count=200&format=kml
  • Returns Atom feed (or KML document!)
  • Granule metadata (time, georss:box) and URLs included
• OpenSearch (GData) Features
  • Google Data (GData) standard uses and extends OpenSearch
  • Search aggregators auto-handle Atom feed, traverse result sets
  • Ingest data feed into Google Earth or Google Apps
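A machine client could assemble the granule query URL shown above from its parts. The sketch below mirrors the parameter names and path layout on the slide (`time`, `bbox`, `responseGroups`, `startIndex`, `count`, `format`); the helper name `granule_query_url`, the placeholder `server` host, and the sample bounding box are assumptions, and no request is actually sent.

```python
from urllib.parse import urlencode


def granule_query_url(server, provider_id, dataset, start, end, bbox,
                      start_index=1, count=200, fmt="atom"):
    """Build a space/time granule query URL in the slide's layout.

    bbox is (south, west, north, east); start and end are ISO 8601 strings.
    """
    params = {
        "time": f"{start},{end}",
        "bbox": ",".join(str(v) for v in bbox),
        "responseGroups": "Large",
        "startIndex": start_index,
        "count": count,
        "format": fmt,
    }
    # Keep ':' and ',' unescaped so the URL reads like the slide's example.
    query = urlencode(params, safe=":,")
    return f"http://{server}/aqua/-/granules/{provider_id}/{dataset}?{query}"


url = granule_query_url("server", "providerId", "datasetName",
                        "2006-01-01T00:00:00", "2006-02-01T00:00:00",
                        bbox=(-10, -180, 10, -120), fmt="kml")
print(url)
```

Paging through a large result set then amounts to repeating the call with `start_index` advanced by `count` each time.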
Global-Scale Data Discovery
• Publish metadata in KML or some standard format
  • Metadata file published alongside data file
  • Granule metadata includes space/time box, domain & dataset keywords or ontology terms, variable list, etc.
  • Permanent name for data object
  • New metadata advertised via datacast (RSS feed)
• Crawl and Index
  • Google or someone crawls and indexes metadata
  • Search by time, space, metadata fields, keywords
  • Also ontology-enhanced search: term broadening, etc.
  • Results are lists of permanent names (URI, XRI)
• Delegated Name Resolution
  • Resolution delegated to proper naming authority
  • Translate each URI/XRI to one or more URLs (preferably DAP).
• Data Access
  • Browse & viz. using KML
  • Slice data using DAP URLs
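As an illustration of publishing a granule's space/time box as KML, the sketch below writes one Placemark with a TimeSpan and a rectangular Polygon using only the standard library. The choice of elements, the function name `granule_placemark`, and the sample times and bounds are assumptions for the example, not the project's actual metadata schema, and boxes that cross the antimeridian are not handled.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def granule_placemark(name, begin, end, south, west, north, east):
    """Return a KML document string describing one granule's space/time box."""
    ET.register_namespace("", KML_NS)   # serialize KML as the default namespace

    def q(tag):
        return f"{{{KML_NS}}}{tag}"

    kml = ET.Element(q("kml"))
    placemark = ET.SubElement(kml, q("Placemark"))
    ET.SubElement(placemark, q("name")).text = name

    span = ET.SubElement(placemark, q("TimeSpan"))
    ET.SubElement(span, q("begin")).text = begin
    ET.SubElement(span, q("end")).text = end

    polygon = ET.SubElement(placemark, q("Polygon"))
    boundary = ET.SubElement(polygon, q("outerBoundaryIs"))
    ring = ET.SubElement(boundary, q("LinearRing"))
    # KML coordinates are lon,lat,alt tuples; the ring repeats its first corner.
    corners = [(west, south), (east, south), (east, north),
               (west, north), (west, south)]
    ET.SubElement(ring, q("coordinates")).text = " ".join(
        f"{lon},{lat},0" for lon, lat in corners)

    return ET.tostring(kml, encoding="unicode")


doc = granule_placemark(
    "us:gov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd",
    "2003-01-02T00:23:00Z", "2003-01-02T00:29:00Z",
    south=-10, west=-180, north=10, east=-120)
print(doc)
```

Using the permanent name as the Placemark name ties the search hit back to name resolution, which is the chain this slide describes.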
Steps Toward Massive Scaling
• Publish KML → Index → Query → Permanent Names → URLs → Data Slice → Visualization
• Transfer of data from tape to disk could be hidden inside an asynchronous Name Resolution service
KML / GEarth Platform
• Scalable publishing & commodity visualization
  • The innovation is this powerful combination.
  • Tell a science story with an accompanying visualization
  • Google crawls and indexes KML files
  • GEarth provides spatial Placemark search, and visual animations over time
• Embed more metadata tags in KML (or Atom) files
  • GeoRSS tags based on GML, different from KML spatial
  • Micro XML formats: georss:box (spatial bounding box)
• GEarth provides Placemark (KML) search
  • Viewport defines implicit lat/lon rectangle
  • Then do keyword search into indexed KML files
  • But what about time spans or other metadata?
• Opportunity
  • Exploit scalable publishing & search
  • Add more metadata to the combination