Overview of the Data Processing and Error Analysis System (DPEAS)

Learn about the features, capabilities, and benefits of the Data Processing and Error Analysis System (DPEAS) for large data analysis tasks. Implemented on Windows NT/2000 OS, the system has been in operational use for over 2 years at CIRA and offers global merge capabilities for numerous data sets. With a simplified and easily scalable approach, DPEAS supports various data types and allows users to write parallel code using a language they are familiar with. Discover how DPEAS can enhance your data processing and error analysis tasks.

Presentation Transcript


  1. Overview of the Data Processing and Error Analysis System (DPEAS)
     Andrew S. Jones
     Colorado State University (CSU)
     Cooperative Institute for Research in the Atmosphere (CIRA)
     DOD Center for Geosciences / Atmospheric Research (CG/AR)
     Fort Collins, CO

  2. What is it?
     • Data processing system for “large” data analysis tasks using common PCs
     • Features:
       • 2nd generation system (replaces an earlier system called PORTAL (Jones et al., 1995))
       • Parallel implementation
       • Web-based documentation and monitoring
       • Incorporates a Fortran interpreter for input tasks
       • Virtualized I/O subsystem (only memory-resident data structures are needed; data algorithms now function like a model)
       • Able to fail over to redundant hardware
       • Extensible User Module
       • Error Analysis code is still under development
       • Implemented on Windows NT/2000 OS

  3. What Does it Do?
     • Global merge capabilities for numerous data sets
     • Current system in operational use for 2+ years at CIRA
       • Current average operational throughput rate using 15 processors on 8 PCs is 17 TB/yr (47 GB/day)
       • Measured max. throughput rate is 2.5 PB/yr (7.1 TB/day)
     • Simplifies
       • Powerful abstraction layers allow anyone to write parallel code
       • Virtual I/O subsystem reduces end-user code complexities
       • Users interact using a language most already know
     • Easily Scales
       • Limited process “cross-talk” improves scaling behavior
       • Tests have shown that a 2000-machine cluster is physically feasible
       • Basically… just add hardware.
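The average throughput figure on this slide can be cross-checked with a short conversion. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB) and a 365-day year; the per-processor figure is derived here for illustration and is not quoted in the presentation.

```python
# Cross-check of the quoted average operational throughput.
# Assumptions (not stated in the slides): 1 TB = 1000 GB, 365 days per year.
TB_PER_YEAR = 17
PROCESSORS = 15
DAYS_PER_YEAR = 365
GB_PER_TB = 1000

gb_per_day = TB_PER_YEAR * GB_PER_TB / DAYS_PER_YEAR
gb_per_day_per_processor = gb_per_day / PROCESSORS

if __name__ == "__main__":
    print(f"{gb_per_day:.0f} GB/day in total")
    print(f"{gb_per_day_per_processor:.1f} GB/day per processor")
```

Under these assumptions the total rounds to the 47 GB/day quoted on the slide.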

  4. 10 Data Types Are Currently Supported
     • Reads and writes HDF-EOS natively
     • GOES IMAGER (McIDAS)
     • NOAA AVHRR GAC and LAC (McIDAS)
     • NOAA AMSU-A and B (HDF-EOS)
     • DMSP SSM/I (Byte Stream)
     • DMSP SSM/T-2 (NGDC OIS)
     • DMSP OLS (NGDC OIS)
     • TRMM TMI and VIRS (HDF)
     • User extensible… (your format here)

  5. The Hardware

  6. Failover Mode

  7. Module Context GUIs
     This is DPEAS

  8. An example of a DPEAS input script file

  9. How DPEAS Starts
     • Program Start
     • DPEAS Initialization
     • Interpreting DPEAS script declarations
     • Interpreting DPEAS script executable statements

  10. How DPEAS Ends
     • Interpreting DPEAS script executable statements
     • DPEAS Summary
     • Program End

  11. How Are Spawned Input Scripts and Jobs Created?
     • All spawned DPEAS jobs run machine-generated DPEAS input scripts, which are generated by the data processing engine from the Master DPEAS input script (the examples shown previously were examples of DPEAS machine-generated code)
     • This is automated within DPEAS, and the user code goes along for the free ride since it is part of the DPEAS executable (it’s like meeting a friendly virus which helps to spread your code along with it)

  12. What Does DPEAS Parallelism Look Like?
     • Do loop contents are sent to other resources in parallel
     • The new jobs run the same “DPEAS.exe”, but execute only the subtask operations
     • Completed jobs allow additional jobs to start
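The dispatch pattern described on this slide can be illustrated with a small Python analogy. This is only a sketch under stated assumptions: DPEAS spawns separate `DPEAS.exe` jobs on other machines, whereas this example uses a local thread pool, and the names `run_subtask` and `run_parallel_loop` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_subtask(index):
    """Hypothetical stand-in for one spawned job running a single loop iteration."""
    return index, index * index  # placeholder for the real subtask work


def run_parallel_loop(indices, max_jobs=4):
    """Send each loop iteration to a worker; at most max_jobs run at once.

    When a worker finishes one iteration, the pool starts the next queued one,
    which mirrors "completed jobs allow additional jobs to start".
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        futures = [pool.submit(run_subtask, i) for i in indices]
        for future in as_completed(futures):
            index, value = future.result()
            results[index] = value
    return results


if __name__ == "__main__":
    print(run_parallel_loop(range(8), max_jobs=3))
```

Because iterations do not communicate with each other in this sketch, adding workers only changes `max_jobs`, which is in the spirit of the limited process “cross-talk” mentioned earlier.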

  13. The 3 Programming Steps to Add a User Routine to DPEAS
     • Insert a program “hook”
       The program hook makes the main DPEAS program aware of the existence of your wrapper routine.
     • Create a wrapper routine
       The wrapper routine tells the DPEAS Fortran interpreter how to parse and interact with your application subroutine arguments.
     • Create an application routine
       The application routine performs the “real” work. You can do anything you want within the application routine.
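The division of labor between the three steps can be sketched as follows. This is a Python analogy, not DPEAS code: DPEAS user routines are written in Fortran, and every name here (`ROUTINES`, `register`, `scale_channel`, `interpret`) as well as the `CALL NAME(args)` statement syntax is an assumption made for illustration.

```python
# Hypothetical analogy of hook / wrapper / application; not the DPEAS API.
ROUTINES = {}


def register(name, wrapper):
    """Step 1, the program hook: make the interpreter aware of the wrapper."""
    ROUTINES[name] = wrapper


def scale_channel(values, factor):
    """Step 3, the application routine: performs the 'real' work."""
    return [v * factor for v in values]


def scale_channel_wrapper(raw_args, data_store):
    """Step 2, the wrapper: parse script arguments, then call the application."""
    array_name, factor_text = raw_args
    data_store[array_name] = scale_channel(data_store[array_name], float(factor_text))


register("SCALE_CHANNEL", scale_channel_wrapper)


def interpret(statement, data_store):
    """Toy interpreter for statements such as 'CALL SCALE_CHANNEL(tb, 2.0)'."""
    call = statement.strip()
    name, _, arg_text = call[len("CALL "):].partition("(")
    raw_args = [arg.strip() for arg in arg_text.rstrip(")").split(",")]
    ROUTINES[name.strip()](raw_args, data_store)


if __name__ == "__main__":
    store = {"tb": [1.0, 2.0, 3.0]}  # illustrative memory-resident data
    interpret("CALL SCALE_CHANNEL(tb, 2.0)", store)
    print(store["tb"])
```

The application routine never sees script text or file formats; it only receives data that is already in memory, which is the separation the slide describes.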

  14. How does the “User_Module.f90” relate to my DPEAS Input Scripts?

  15. User Example: The user’s application routine, using the virtual I/O data via pointers
     1. Find each MW channel
     2. Allocate a new output array data structure
     Your science code looks like this

  16. User Example: The results: Complete integration
     The new user routine is now fully integrated into DPEAS

  17. User Example: The output HDF-EOS file

  18. User Example: The output image representation
     150 GHz Effective Emissivity
     Calculated from:
     • GOES-08 IMAGER
     • NOAA-15 AMSU-B

  19. User Example: Summary
     • Creates 2 new routines:
       • Wrapper routine
       • Application routine
     • Requires 25 lines of executable code:
       • 2 – Program hook
       • 4 – Wrapper routine
       • 19 – Application routine
         • 2 – Variable assignments
         • 3 – Science algorithm
         • 14 – Virtual I/O library calls (using only 2 Virtual I/O library routines)
     Small overhead for gaining massive parallelism capabilities!

  20. User Example: How complex would the user routine be, if written without the Virtual I/O library?
     • Creates 2 new routines:
       • Wrapper routine
       • Application routine
     • Requires 59 lines of executable code:
       • 2 – Program hook
       • 4 – Wrapper routine
       • 53 – Application routine
         • 2 – Variable assignments
         • 3 – Science algorithm
         • 48 – HDF-EOS library calls (using 26 HDF-EOS library routines)
     • Answer: Without the DPEAS Virtual I/O library there would be:
       • 24 additional I/O routines called by the user (+1200%)
       • 34 additional lines of user code (+136%; 59 lines is 236% of 25)
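The comparison can be checked with a few lines of arithmetic. This sketch uses only the line and routine counts quoted on slides 19 and 20; the helper name `percent_increase` is illustrative. Going from 25 to 59 lines is a 136% increase, which is the same as saying that 59 lines is 236% of 25.

```python
def percent_increase(baseline, new):
    """Relative increase of new over baseline, in percent."""
    return 100.0 * (new - baseline) / baseline


# Counts taken from slides 19 and 20: hook + wrapper + application routine,
# where the application routine is assignments + science + I/O calls.
with_virtual_io = {"lines": 2 + 4 + (2 + 3 + 14), "io_routines": 2}
without_virtual_io = {"lines": 2 + 4 + (2 + 3 + 48), "io_routines": 26}

extra_lines = without_virtual_io["lines"] - with_virtual_io["lines"]
extra_routines = without_virtual_io["io_routines"] - with_virtual_io["io_routines"]

if __name__ == "__main__":
    print(extra_routines, percent_increase(with_virtual_io["io_routines"],
                                           without_virtual_io["io_routines"]))
    print(extra_lines, percent_increase(with_virtual_io["lines"],
                                        without_virtual_io["lines"]))
```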

  21. User Example: Conclusions
     • Implementation Insights
       • Minimal amount of end-user code is required
       • The effort and resources involved are small (the DPEAS program recompiled in < 30 s on the user’s desktop)
     • Virtual I/O Insights
       • The DPEAS virtual I/O access method is less complex than traditional HDF-EOS file access methods
     • End user’s perspective
       • End users are protected from technical data format issues
       • End users can develop higher quality code by leveraging shared robust common modules
       • Scalability is greatly enhanced with little end user effort

  22. Summary
     • DPEAS can process large data sets in an efficient manner while maintaining centralized management controls and error handling behaviors
     • Parallelism of the code is automatic and runs on “cheap hardware”
     • Failover capabilities make the system more robust
     • User code is shielded from complexities of the system using software abstraction layers
     • Little training is needed since user interfaces are in a known scientific language
     • User modules directly access data from memory, which makes traditional file access methods obsolete while maintaining the needed file compatibility

  23. What did I learn about HDF-EOS in the process?
     • HDF-EOS is an excellent “universal” data format
       It works for all satellite sensor types I have encountered to date (10+)
     • HDF-EOS requires serious software design before the implementation stage
     • It is my experience that “Time” information as a geo/time field for sectorizing is overrated and is likely to cause future software design headaches with the more complex sensors if encouraged to be the “norm”

  24. My 2 cents: How HDF-EOS could be made even better
     (Hopefully someone has already thought of these things, and this short list will be a reaffirmation)
     • Given that GOES data, for example, and other multi-detector sensors can have multiple times for each channel for the same geolocation position, and that, in addition, they can and do interrupt their sensor scans at any time…
     • Treat “Time” as a data attribute
       • Currently I associate “Time” and other associated arrays with their principal data array by nomenclature
       • It would be better to use data array attribute “groups”. Then “Time”, “Calibration”, and other associated arrays could be grouped with the data array through the data format.

  25. Why Data Attributes?
     • Many data channels have “associated” information
       • For example, it might be very meaningful to associate the min. and max. of a grid location with its mean value
     • It would be better if there were a standard way of showing that group association, so we don’t have to understand each other’s unique nomenclatures or “intent”, or resort to the use of unusual “mixed” HDF/HDF-EOS data files
     • Data attributes should not be arbitrarily limited in scope, but have full data type ranges
     • Units could also be incorporated through data attributes
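The difference between association by nomenclature and association through attribute groups can be sketched with a small data structure. This is an illustration of the idea only, not an HDF-EOS API: the `DataArray` class, the `attach` method, the `_Time` naming convention, and all sample values are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DataArray:
    """Hypothetical data array with a units attribute and an attribute group."""
    name: str
    values: list
    units: str = ""
    associated: dict = field(default_factory=dict)  # role -> DataArray

    def attach(self, role, array):
        """Group an associated array (e.g. Time, Calibration) with this one."""
        self.associated[role] = array


def lookup_by_nomenclature(arrays, base_name, role):
    """Association by naming convention: the reader must know the convention."""
    return arrays[f"{base_name}_{role}"]


if __name__ == "__main__":
    tb = DataArray("TB_150GHz", [250.1, 251.3], units="K")  # illustrative values
    scan_time = DataArray("TB_150GHz_Time", [0.0, 0.5], units="s")

    # Nomenclature: works only if everyone agrees on the "_Time" suffix.
    flat = {tb.name: tb, scan_time.name: scan_time}
    print(lookup_by_nomenclature(flat, "TB_150GHz", "Time").units)

    # Attribute group: the association is carried by the structure itself.
    tb.attach("Time", scan_time)
    print(tb.associated["Time"].units)
```

With the grouped form, a reader of the data needs no knowledge of the writer's naming scheme, and units travel with each array as an ordinary attribute.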

  26. The End
     jones@cira.colostate.edu

  27. Appendix
     The following series of slides shows how a user can easily modify DPEAS
     • The user’s program hook
     • … wrapper routine
     • … application routine (using the virtual I/O data via pointers)
     • Usage of the new user routine in a DPEAS input script file
     • The Results: Complete Integration

  28. User Example: The user’s program hook
     2 lines of code

  29. User Example: The user’s wrapper routine
     4 lines of executable code

  30. User Example: The user’s application routine, using the virtual I/O data via pointers
     1. Find each MW channel
     2. Allocate a new output array data structure
     Your science code looks like this

  31. User Example: Usage of the new user routine in a DPEAS input script file

  32. User Example: The results: Complete integration
     The new user routine is now fully integrated into DPEAS

  33. Where Do I Find DPEAS?
     DPEAS Home Page: http://luna.cira.colostate.edu/DPEAS/DPEAS_frame.htm
     Please direct questions to jones@cira.colostate.edu
