Taking Advantages of Collective Operation Semantics for Loosely Coupled Simulations
Shang-Chieh Joe Wu* and Alan Sussman
Department of Computer Science, University of Maryland, USA
*graduating soon
Roadmap
• Motivation
• Approximate Matching [Grid 2004]
• Collective Semantics
• Dissection of Execution Time
• Smart Buffering
• Future Work
What is the overall problem?
• Obtain more accurate results by coupling existing (parallel) physical simulation components
• Different time and space scales for data produced in shared or overlapped regions
• Runtime decisions about which time-stamped data objects should be exchanged
• Performance becomes a concern
Coupling, is it important?
• Special issue of IEEE/AIP Computing in Science & Engineering (CSE), May/Jun 2004: "It's then possible to couple several existing calculations together through an interface and obtain accurate answers."
• Multi-scale, multi-resolution simulations and models – multiphysics (May/Jun 2005 CSE)
  • adaptive small-scale noise capture (hydrodynamics)
  • complex fluid and dense suspension (fluid dynamics)
  • patch dynamics (material science)
• Earth System Modeling Framework – several US federal agencies and universities (http://www.esmf.ucar.edu)
Matching is OUTSIDE components
• Separate matching (coupling) information from the participating components
  • Maintainability – components can be developed/upgraded individually
  • Flexibility – change participants/components easily
  • Functionality – support variable-sized time interval numerical algorithms or visualizations
• Matching information is specified separately by the application integrator
• Runtime match via simulation timestamps
• POSIX thread-based implementation
Separate codes from matching
• Configuration file specifies the exporter/importer region pairings:
  App0.R1  App1.R0
  App0.R4  App2.R0
  App0.R5  App4.R0
• [Figure: SPMD exporter App0 coupled to importers App1, App2, App4 through the configuration file]
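The region pairings above suggest a configuration file roughly along these lines; this layout is a guess for illustration only, as the slide does not show the actual on-disk syntax used by the framework:

```text
# exporter.region    importer.region
App0.R1              App1.R0
App0.R4              App2.R0
App0.R5              App4.R0
```

Because this file lives outside the component source code, the application integrator can re-pair components without recompiling them.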
Basic Operation
• Arrays are distributed among multiple processes
• [Figure: the exporter component produces exported distributed arrays at timestamps T1–T4; the importer component issues an approximate match request for Array@T3.1; the Approximate Match Library resolves it to the matched Array@T3, which the Distributed Array Transfer Library moves into the imported distributed array]
Collective Semantics
• Collective operations: all processes in the same component must perform the same operation, but not necessarily at the same time
• Approximate match is a collective operation
  • All processes in the same exporter component asynchronously generate distributed data with the same timestamps (T1 T2 T3 T4)
  • All processes in the same importer component asynchronously make requests with the same timestamps (T3.1)
  • All processes in the same exporter component must reply to the requests with the same timestamps (T3 matched to T3.1)
• Consistent decisions must be made about which copy of the data (Array@T3) should be transferred for shared or overlapped regions
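The consistency requirement above can be met without extra communication if every process applies the same deterministic match function to the same timestamp sequence. A minimal Python sketch of that idea (the function name and the simple nearest-timestamp rule are illustrative, not the framework's actual API):

```python
def approx_match(request, exported):
    """Deterministic nearest-timestamp match (REG-style, no precision bound)."""
    return min(exported, key=lambda t: abs(t - request))

# The same exported timestamp sequence is seen by every exporter process.
exported = [1.0, 2.0, 3.0, 4.0]

# Each of the four "processes" evaluates the match independently...
decisions = [approx_match(3.1, exported) for _ in range(4)]
# ...and all arrive at the same answer: Array@T3 for request T3.1
```

Since the function is pure and the inputs are identical on every process, the decisions agree even though the processes run asynchronously.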
Performance Concerns
• Approximate match is a runtime-based approach, so source code-based optimizations help little
• Different components execute at different speeds, and export/import data at their own rates
  • Not all exported data are required by importer components
  • Exported data, whose size might be very large, may be buffered when matching decisions cannot yet be made
• Not all processes in the same component execute at the same speed
  • Some complex components can be very hard to load balance perfectly across all processes
Dissection of Execution Time / Smart Buffering
• Execution time is composed of:
  • Computation Time
  • Local Copy Time (might be unnecessary)
  • Runtime Match Time + Remote Data Transfer Time
• The same match decisions, for each request, are made repeatedly by all exporter processes in exporter components
• Smart buffering: faster processes help slower processes in the same exporter component
Smart Buffering
• Exported data are buffered in the framework
• A slow exporting process may be able to avoid memory copies, based on:
  • its own responses to previously received import requests (self-help)
  • the responses to previous requests satisfied by the fastest process in the same component (buddy-help)
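One way to sketch the self-help/buddy-help skip rule in Python. This assumes import requests arrive with monotonically non-decreasing timestamps, which the slide does not state explicitly, and the function name is invented for illustration:

```python
def should_buffer(ts, resolved):
    """Decide whether a slow exporter process must copy data at timestamp ts.

    resolved: (request, match) pairs already answered, either by this process
    (self-help) or by the fastest process in the component (buddy-help).
    Assumes import request timestamps are monotonically non-decreasing.
    """
    if not resolved:
        return True                              # no information yet: must buffer
    latest_match = max(match for _, match in resolved)
    # Data older than the latest matched timestamp can no longer satisfy any
    # future request, so the memory copy can be skipped.
    return ts >= latest_match
```

For example, once any process has answered request 3.1 with match 3.0, a straggler reaching timestamp 1.0 learns it need not buffer that data at all.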
Smart Buffering Example
• [Figure: the match]
Load Balance
• No assumptions about load balance inside each component
• Smart buffering will help with load imbalance at runtime
  • Slower processes can avoid some unnecessary work (memory copies)
  • The component tunes itself at runtime when some processes fall behind
• Framework-level approach – no restrictions on algorithms/applications
Micro-Benchmark Experiment
• Solve the 2-d diffusion equation u_tt = u_xx + u_yy + f(t,x,y) by the finite element method
• A 1024x1024 distributed array is evenly distributed over the participating processes
• Importer component U: 4/8/16/32 P4 2.8GHz processors, connected by Myrinet
• Exporter component F: 4 PIII-650 processors, connected by channel-bonded Fast Ethernet
• The two clusters are connected by Gigabit Ethernet
• 1001 data objects exported, and 50 data objects transferred (20:1)
• One process (fs) in the exporter component F performs extra computation – measuring its data exporting time
• Smart buffering can be observed when fs falls (far) behind the other processes
Smart Buffering Results – 8 Importer Processes
• Exporter component does NOT run slower
• [Figure: data exporting time for the slowest process]
Smart Buffering Results – 32 Importer Processes
• Exporter component runs more slowly from the beginning
• [Figure: data exporting time for the slowest process]
Smart Buffering Results – 16 Importer Processes
• [Figure: data exporting time for the slowest process; annotated regions: nearly no skips, some skips, enter optimal state]
Related Work
• Parallel data redistribution: shared data among coupled parallel models
  • InterComm (Meta-Chaos), PAWS, MCT, CUMULVS, Roccom, etc.
  • MxN Working Group in the Common Component Architecture (CCA) Forum
• Coordination languages: creating and coordinating execution threads in a distributed computing environment
  • Linda (tuple space model + directives); Delirium, Strand (new languages); C-Linda, Fortran-M (extending old languages); plus many others
Conclusion
• Described a runtime-based approach to speed up slower processes in the same exporter component in (loosely) coupled simulations
• Tries to minimize unnecessary buffering for exported data that ends up not being transferred during component execution
• No post-processing in the simulation components, or other tools, is needed
• Perfect synchronization across participating components is not required – can especially benefit "hard-to-load-balance" components
Future Work
• Investigate buffering issues between processes, such as non-blocking transfers or RDMA over InfiniBand
• Performance optimizations for slow importers (pattern-based semantic cache)
• Applying the framework to a set of large-scale coupled scientific applications from the space weather domain (in progress)
Supported matching policies
<importer request, exporter matched, desired precision> = <x, f(x), p>
• LUB:   minimum f(x) with f(x) ≥ x
• GLB:   maximum f(x) with f(x) ≤ x
• REG:   f(x) minimizing |f(x)−x|, with |f(x)−x| ≤ p
• REGU:  f(x) minimizing f(x)−x, with 0 ≤ f(x)−x ≤ p
• REGL:  f(x) minimizing x−f(x), with 0 ≤ x−f(x) ≤ p
• FASTR: any f(x) with |f(x)−x| ≤ p
• FASTU: any f(x) with 0 ≤ f(x)−x ≤ p
• FASTL: any f(x) with 0 ≤ x−f(x) ≤ p
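The policy table above can be sketched directly in Python. The policy names follow the slide, but the function signature and the list-of-floats timestamp representation are assumptions for illustration, not the framework's actual interface:

```python
def approx_match(x, stamps, policy, p=None):
    """Return the exporter timestamp f(x) matching importer request x, or None."""
    if policy == "LUB":                       # minimum f(x) with f(x) >= x
        c = [f for f in stamps if f >= x]
        return min(c) if c else None
    if policy == "GLB":                       # maximum f(x) with f(x) <= x
        c = [f for f in stamps if f <= x]
        return max(c) if c else None
    # Precision-bounded policies share their acceptance windows:
    if policy in ("REG", "FASTR"):            # |f(x) - x| <= p
        c = [f for f in stamps if abs(f - x) <= p]
    elif policy in ("REGU", "FASTU"):         # 0 <= f(x) - x <= p
        c = [f for f in stamps if 0 <= f - x <= p]
    else:                                     # REGL / FASTL: 0 <= x - f(x) <= p
        c = [f for f in stamps if 0 <= x - f <= p]
    if not c:
        return None
    if policy.startswith("FAST"):             # any acceptable candidate
        return c[0]
    return min(c, key=lambda f: abs(f - x))   # REG*: closest acceptable candidate
```

For example, with exported timestamps [1.0, 2.0, 3.0, 4.0] and request x = 3.1: LUB picks 4.0, GLB picks 3.0, REG with p = 0.5 picks 3.0, and REGU with p = 1.0 picks 4.0 (the only candidate above x within the bound).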