NPP Atmosphere PEATE: Climate Data Processing Made Easy
Scott Mindock, Atmosphere PEATE Team
Space Science and Engineering Center, University of Wisconsin-Madison
10 July 2008
The NPP Atmosphere PEATE is implemented within the framework and facilities of the Space Science and Engineering Center (SSEC) at the University of Wisconsin-Madison. SSEC has successfully supported operational, satellite-based remote-sensing missions since 1967, and its capabilities continue to evolve and expand to meet the demands and challenges of future missions.
Space Science and Engineering Center (SSEC) • Employs ~250 scientists, engineers, programmers, administrators and IT support staff • Satellite missions currently supported: GEO: GOES 10/11/12/R; Meteosat 7/9; MTSAT-1R; FY 2C/2D; Kalpana — LEO: NOAA 15/16/17/18; Terra; Aqua; NPP; NPOESS; FY 3; MetOp
Funding and Related Work • Atmosphere PEATE is funded under NASA Grant NNG05GN47A • Award Date: 10/07/2005 • Grant Period: 08/15/2005 to 8/14/2008 (renewal in progress) • Related Work at SSEC: • CrIS SDR Cal/Val and Characterization (Revercomb, IPO) • VIIRS SDR and Cloud Cal/Val (Menzel, IPO) • VIIRS Algorithm Assessment (Heidinger, IPO) • International Polar Orbiter Processing Package (Huang, IPO) • VIIRS Instrument Characterization (Moeller, NASA)
Creating Climate Data Records (CDRs) is hard! • Products track global trends • Calibration must be accurate (no calibration artifacts) • Algorithms must be fully verified with global data (no regional artifacts) • Data sets are large and hard to manage • Developing CDRs is an iterative process • Large processing clusters are required • Programming requires a different skill set • Distributed systems are hard to test • It is an ongoing process • Requirements change • Technology changes • Staff changes
The process requires multiple computing systems. A single machine can be used for initial development, but cluster computing is needed to verify performance over the full globe.
CDR development is an iterative process • Initial development occurs on a single machine • Product verification requires data sets of increasing size • Increasing the data set size increases computation time
Strategies for simplifying processing • Reduce or remove the “Move to Cluster” step • Make execution environments similar • Make data access patterns similar — Results in faster iterations
Strategies for managing the processing system • Use well-defined interfaces between subsystems • Decouples subsystems, which reduces the learning curve • Allows evolution of subsystems • Simplifies test and verification of software • Create configuration-driven subsystems • Simplifies deployment of subsystems • Allows operations to modify system behavior • Leverage automated testing technologies • Reduces the learning curve • Provides continuous test coverage • Captures requirements in executable form
The system: Atmosphere PEATE • Ingest: ING • Brings data into the Atmosphere PEATE • Supports FTP, HTTP and RSYNC • Data Management System: DMS • Stores data in the form of files • Provides a web service to locate, store and retrieve files • Computational Resource Grid: CRG • Provides a web service to locate, store and retrieve jobs • Algorithm: ALG • Consumes jobs • Runs algorithms in the form of binaries • Algorithm Rule Manager: ARM • Combines data with algorithms to produce jobs • Provides a web-service interface to locate, store and retrieve rules
ING: Ingest, brings data into the system • Configuration file • Allows operations to add new sites • Allows operations to maintain existing sites • Customization allowed in the form of scripts (Bash, Python) • QC • Quick look • Metadata extraction • Notices missing or late data
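As a hedged illustration of what such a configuration-driven ingest might look like, here is a minimal Python sketch. The site fields, names and the `late_files` QC helper are assumptions for illustration only, not the Atmosphere PEATE's actual configuration schema.

```python
# Hypothetical ingest-site configuration; all field names and values
# are illustrative, not the Atmosphere PEATE's actual schema.
SITES = {
    "example_site": {
        "protocol": "ftp",                    # FTP, HTTP or RSYNC
        "host": "data.example.gov",           # placeholder host
        "remote_dir": "/archive/level1b",
        "poll_minutes": 60,                   # how often to check for new files
        "post_ingest": "extract_metadata.py", # per-site customization script
    },
}

def late_files(expected, received):
    """QC helper: flag expected granules that are missing or late."""
    return sorted(set(expected) - set(received))
```

For example, `late_files(["g1", "g2", "g3"], ["g1", "g3"])` returns `["g2"]`, the granule that never arrived. Operations would add a new site by adding one entry to the configuration, with no code change.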
DMS: Stores data and products • Relieves scientists of having to manage data • Simple put and get functionality • Configuration file • Specifies fileservers and directories • Operations can add/remove fileservers • File system: holds files • Database: holds file information • Public Access: DMS interface • Worker: manages the file system
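A minimal sketch of what a put/get client for such a file-store web service might look like; the URL, endpoints and JSON fields here are assumptions, not the DMS's actual API. The functions build the HTTP requests without sending them, to keep the sketch self-contained.

```python
import json
import urllib.request

DMS_URL = "http://dms.example.edu/api"  # hypothetical service root

def put_request(path, metadata):
    """Build the HTTP request that stores (registers) a file with the DMS."""
    body = json.dumps({"path": path, "meta": metadata}).encode()
    return urllib.request.Request(
        DMS_URL + "/files", data=body,
        headers={"Content-Type": "application/json"}, method="POST")

def find_request(**query):
    """Build the HTTP request that locates files by metadata."""
    qs = "&".join(f"{k}={v}" for k, v in sorted(query.items()))
    return urllib.request.Request(f"{DMS_URL}/files?{qs}")
```

In use, `urllib.request.urlopen(find_request(instrument="VIIRS"))` would perform the lookup; a scientist never has to know which fileserver actually holds the data.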
CRG: Provides nodes with jobs • Provides a well-defined interface deployed as a web service • Accepts job requests • Provides job status • Monitors job state • Allows processing nodes to be added or removed from the system
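To make that job lifecycle concrete, here is a toy in-memory model of a CRG-style interface; the class, method and state names are invented for illustration and are not the CRG's actual API.

```python
class JobGrid:
    """Toy model of a CRG-style job interface: accept jobs, report
    status, and dispatch to whatever nodes are currently registered."""

    def __init__(self):
        self.jobs = {}      # job_id -> "queued" | "running"
        self.nodes = set()  # processing nodes may join or leave at any time

    def submit(self, job_id):
        self.jobs[job_id] = "queued"

    def status(self, job_id):
        return self.jobs[job_id]

    def add_node(self, name):
        self.nodes.add(name)

    def remove_node(self, name):
        self.nodes.discard(name)

    def dispatch(self):
        """Hand each registered node at most one queued job."""
        queued = [j for j, s in sorted(self.jobs.items()) if s == "queued"]
        for node, job in zip(sorted(self.nodes), queued):
            self.jobs[job] = "running"
```

Because nodes are just entries in a set, adding or removing cluster capacity never touches the job bookkeeping, which is the decoupling the slide describes.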
AlgHost: Runs the software that produces products • Recreates the development environment • Retrieves data from the DMS • Retrieves and runs software packages • Saves results to the DMS, including products, stdout and stderr
Algorithm Script Structure • The cluster executes a bash script • The script is passed arguments • Software package directory • Working/output directory • Static ancillary directory • Dynamic ancillary directory • Input files • Output files • The software package is called from the script • Results are stored by the process that started the script
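A sketch (in Python, for the host-side runner rather than the bash script itself) of how an AlgHost-style process might assemble that argument list and invoke the script. The argument order follows the list above, but the exact convention, separators and function names are assumptions.

```python
import subprocess

def build_argv(script, pkg_dir, work_dir, static_anc, dyn_anc, inputs, outputs):
    """Assemble the bash invocation described above; argument order is assumed."""
    return ["bash", script, pkg_dir, work_dir, static_anc, dyn_anc,
            ",".join(inputs), ",".join(outputs)]

def run_algorithm(*args):
    """Run the script, capturing stdout/stderr so the caller can store
    them in the DMS alongside the products."""
    return subprocess.run(build_argv(*args), capture_output=True, text=True)
```

Because the same argument layout is used on a developer's workstation and on the cluster, the "Move to Cluster" step shrinks, matching the simplification strategy above.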
ARM: Binds data to software packages • Provides a well-defined interface deployed as a web service • Assigns jobs to the CRG • Monitors data in the DMS • Monitors the status of jobs in the CRG • Production rules can be added or removed dynamically by operations • Volatile logic lives here
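A toy sketch of what such a production rule might look like: when files matching a pattern appear in the data store, bind them to a software package and emit a job. The rule fields, pattern and package names are illustrative only, not the ARM's actual rule schema.

```python
import fnmatch

# Hypothetical production rules; operations could add or remove entries
# at run time without touching the rest of the system.
RULES = [
    {"name": "cloud_product",     # made-up rule name
     "pattern": "VIIRS_L1B*.h5",  # trigger on new Level 1B granules
     "package": "cloudalg-1.0"},  # software package to run
]

def jobs_for(new_files, rules=RULES):
    """Return one (package, input) job spec per rule a new file triggers."""
    return [{"package": r["package"], "input": f}
            for r in rules
            for f in new_files
            if fnmatch.fnmatch(f, r["pattern"])]
```

Keeping this matching logic in one rule table is what lets the "volatile logic live here": reprocessing with a new algorithm version means swapping the package name in a rule, not editing the DMS or CRG.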
Strategies for managing the processing system (revisited) • Use well-defined interfaces between subsystems • Decouples subsystems, which reduces the learning curve • Allows evolution of subsystems • Simplifies test and verification of software • Create configuration-driven subsystems • Simplifies deployment of subsystems • Allows operations to modify system behavior • Leverage automated testing technologies • Reduces the learning curve • Provides continuous test coverage • Captures requirements in executable form
Development Process: Spiral method — Design → Implement → Test → Deploy (Build = Deploy to Operations)
Testing Strategy • Employ standard software-industry practices • Automate with ANT (a Make-like, XML-based build tool) • Test with JUnit, the Java unit-testing framework • Increases system quality • Tests are reproducible • Tests are run more often than they would be if they were manual • Tests are improved over time • Tests are configurable • We don’t just build: the process includes testing and verification
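The PEATE's own tests are ANT/JUnit based; purely as an analogous illustration in Python's unittest, here is the same idea: a defect, once fixed, gets a specific test that is then repeated automatically. The function under test and its defect are made up.

```python
import unittest

def granule_date(filename):
    """Toy function under test: pull the date field out of a granule name."""
    return filename.split(".")[1]

class GranuleDateRegression(unittest.TestCase):
    # Hypothetical regression test added after a (made-up) parsing defect
    # was fixed; it now runs in every nightly build.
    def test_date_field(self):
        self.assertEqual(granule_date("MOD35.A2008190.hdf"), "A2008190")
```

In a nightly build this suite would be discovered and run automatically (e.g. via `python -m unittest`), which is what keeps the coverage continuous rather than one-off.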
Nightly Build • Builds the system • Tests subsystems • Tests scenarios • Updates repositories • Logs results • Scenarios demonstrate requirements
Unit and Regression Testing • May use internal knowledge of interfaces for testing • Tests and exercises public interfaces • Stress-tests interfaces • Tests evolve to verify bug fixes • Fixed defects have specific tests added • Tests run in the nightly build • Tests verify each release • Layered approach to testing • Everything tested, every night
Test Scenarios (1 of 2) • Test the ingest function • Test the forward and redo functions • Reflect the CDR development process
Test Scenarios (2 of 2) • Documents • 3600-0003.080402.doc - Level 4 requirements • 3600-0004.060911.doc - Operations Concepts • Test plans are implemented as scenario tests • Tests correspond to Use Cases outlined in OpsCon • At least one test for each requirement set • Successful completion of test verifies requirements by demonstration • Factors that determine success • Generation of expected products • Ability to track product heritage • Ability to reproduce results • Ability to uniquely identify products
Conclusion: Climate Data Processing Is Easy • The ingest system makes it easy to add and manage data sources • Operators can control the system • Operators can monitor the system • The DMS makes it easy to maintain large data sets • Scientists can find data • Operators can add and remove servers • Operators can add and remove sites • The CRG and AlgHost make it easy to transfer CDR production from the development environment to the cluster environment • You still have to get the product correct!