MITgcm History
MITgcm Algorithm and applications
MITgcm UV Ultra-Versatile Implementation
Target Compute Environments • SMP, and also clustered SMP • T3E (SGI/Cray) • Single- and multi-processor vector: NEC SX-4, Cray C90 (SGI) • IBM, SGI, Sun, Intel et al., Digital, HP.
Goals • Useful today: good performance on current-generation machines; TAMC “compatible” • With a future: a practical route to a teraflop/s; a practical route to a desktop gigaflop/s
Challenges • Cache blocking v. long vectors. • Isolating and minimizing communication/synchronization primitives. • OS idiosyncrasies. • Varying degrees of compiler capability.
Technologies • Vector processing. • Caches, deep memory hierarchy. • MPI. • HPF. • Multi-threading. • Network interfaces: SCI, Memory Channel, GigaRing, Arctic.
Cache and vector • “Voodoo numbers”: sNx, sNy (interior points per tile), OLx, OLy (overlap widths), nSx, nSy (tiles per process). [Diagram: an nSx × nSy array of tiles per process, each tile sNx × sNy points plus overlap regions of width OLx, OLy.]
Vector mode • Strips, or one whole domain (i.e. a four-processor example) • Tiles span the full x extent: sNx = Nx.
Cache, deep-memory mode • Block the domain into small sNx × sNy tiles (see the parameter sketch below).
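A minimal sketch of how these “voodoo numbers” might be collected, in the style of an MITgcm SIZE.h-type parameter header. The header layout, the specific values, and the Nx/Ny/nPx/nPy names are assumptions for this sketch, not production settings; the point is that the cache-blocked and vector layouts above differ only in these compile-time numbers.

C     Illustrative tile-sizing parameters, in the style of a
C     SIZE.h-type header (names and values here are assumptions
C     for this sketch, not the production settings).
C       sNx, sNy : interior points per tile in x and y
C       OLx, OLy : overlap (halo) width in x and y
C       nSx, nSy : tiles per process in x and y
C       nPx, nPy : processes in x and y
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy, nPx, nPy, Nx, Ny
C     Cache, deep-memory mode: many small tiles per process.
      PARAMETER ( sNx = 32, sNy = 32, OLx = 3, OLy = 3 )
      PARAMETER ( nSx = 2,  nSy = 2,  nPx = 2, nPy = 2 )
C     Vector mode would instead set sNx = Nx and nSx = 1, so that
C     inner loops sweep one long row of the whole domain.
      PARAMETER ( Nx = sNx*nSx*nPx, Ny = sNy*nSy*nPy )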
What about the algorithm? • We know it vectorizes • Can it be blocked? Let's hope so!
MITgcm UV structure • Loops over the range 1:sNx+1, ... can be done “block” by “block” • Fill overlaps • Don’t need any “long vector sweeps” • Loops over the range 1-OLx:sNx+OLx+1, ... depend on alg. and problem!
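A minimal sketch, under the tile layout assumed above, of what “block by block” computation looks like. The routine name CALC_G, the arrays gU and U, and the averaging stencil are invented for illustration and are not MITgcm UV source; the interior range 1:sNx+1 is the case the slide says can always be tiled, while sweeps over 1-OLx:sNx+OLx+1 are the algorithm-dependent case.

      SUBROUTINE CALC_G( gU, U )
C     Sketch only: CALC_G, gU, U and the averaging stencil are
C     invented for illustration; they are not MITgcm UV routines.
C     Each (bi,bj) tile is computed on its own over the interior
C     range 1:sNx+1, 1:sNy+1.
      IMPLICIT NONE
      INTEGER sNx, sNy, OLx, OLy, nSx, nSy
      PARAMETER ( sNx=32, sNy=32, OLx=3, OLy=3, nSx=2, nSy=2 )
      REAL gU( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx, nSy )
      REAL U ( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx, nSy )
      INTEGER bi, bj, i, j
      DO bj = 1, nSy
       DO bi = 1, nSx
C       Overlap points of U are assumed already filled, so this
C       tile can be updated without touching any other tile.
        DO j = 1, sNy+1
         DO i = 1, sNx+1
          gU(i,j,bi,bj) = 0.5*( U(i-1,j,bi,bj) + U(i,j,bi,bj) )
         ENDDO
        ENDDO
       ENDDO
      ENDDO
      RETURN
      END

Because each (bi,bj) slab is small and contiguous, the inner loops can stay in cache on a deep-memory machine, while a vector configuration (nSx = 1, sNx = Nx) turns the same inner loop into one long sweep.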
Communication • Minimize comm. points • Keep at high level, not in compute primitives. • Overlap with computation (needs hardware and OS support to have an effect!). • Multi-threaded and/or multi-process (MPI).
MITgcm UV communication • Send G’s • Update overlaps • Receive G’s • Send and receive p’s • Depends on alg. and problem!
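A minimal sketch of the kind of high-level exchange this implies, assuming non-blocking MPI so the transfer could in principle be overlapped with computation. EXCH_X, the packing layout, and the message tags are assumptions for illustration, not the MITgcm UV exchange routines; boundary handling is omitted.

      SUBROUTINE EXCH_X( phi, pWest, pEast )
C     Sketch only, not the MITgcm UV exchange code: a high-level
C     east-west overlap update using non-blocking MPI.  Tags and
C     packing layout are assumptions; boundary cases (e.g.
C     MPI_PROC_NULL neighbours) are not handled.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER pWest, pEast
      INTEGER sNx, sNy, OLx, OLy
      PARAMETER ( sNx=32, sNy=32, OLx=3, OLy=3 )
      REAL*8 phi( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy )
      REAL*8 sendW(OLx,sNy), sendE(OLx,sNy)
      REAL*8 recvW(OLx,sNy), recvE(OLx,sNy)
      INTEGER req(4), stat(MPI_STATUS_SIZE,4), ierr, i, j, n
C     Pack the interior edge columns that the neighbours need.
      DO j = 1, sNy
       DO i = 1, OLx
        sendW(i,j) = phi(i,j)
        sendE(i,j) = phi(sNx-OLx+i,j)
       ENDDO
      ENDDO
C     Post non-blocking receives and sends.  In a real overlapped
C     scheme the WAITALL would be deferred so interior computation
C     proceeds while data is in flight; here we wait at once to
C     keep the sketch short.
      n = OLx*sNy
      CALL MPI_IRECV( recvW, n, MPI_DOUBLE_PRECISION, pWest, 1,
     &                MPI_COMM_WORLD, req(1), ierr )
      CALL MPI_IRECV( recvE, n, MPI_DOUBLE_PRECISION, pEast, 2,
     &                MPI_COMM_WORLD, req(2), ierr )
      CALL MPI_ISEND( sendW, n, MPI_DOUBLE_PRECISION, pWest, 2,
     &                MPI_COMM_WORLD, req(3), ierr )
      CALL MPI_ISEND( sendE, n, MPI_DOUBLE_PRECISION, pEast, 1,
     &                MPI_COMM_WORLD, req(4), ierr )
      CALL MPI_WAITALL( 4, req, stat, ierr )
C     Unpack the received columns into the overlap regions.
      DO j = 1, sNy
       DO i = 1, OLx
        phi(i-OLx,j) = recvW(i,j)
        phi(sNx+i,j) = recvE(i,j)
       ENDDO
      ENDDO
      RETURN
      END

Keeping the exchange in one high-level routine like this, rather than scattering sends and receives through the compute primitives, is what the “keep at high level” point above argues for.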
MPI and shared memory • Repeat the domain decomposition in each process • Shared-mem copies -> messaging calls, etc. (see the sketch below).
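A hedged sketch of the “shared-mem copies -> messaging calls” idea: the same overlap update either reduces to a plain array copy when the neighbouring tile lives in this process (the multi-threaded SMP case), or falls back to the EXCH_X messaging sketch above (the MPI case). DO_EXCH_X, sameProc and the tile-index arguments are invented for this illustration.

      SUBROUTINE DO_EXCH_X( phi, sameProc, biWest, biEast,
     &                      pWest, pEast )
C     Sketch only: illustrates "shared-mem copies -> messaging
C     calls".  When the neighbouring tile lives in this process
C     (multi-threaded SMP case) the overlap update is a plain
C     copy; otherwise it falls back to the MPI exchange above.
C     DO_EXCH_X, sameProc and the tile indices are invented here.
      IMPLICIT NONE
      INTEGER sNx, sNy, OLx, OLy, nSx
      PARAMETER ( sNx=32, sNy=32, OLx=3, OLy=3, nSx=2 )
      LOGICAL sameProc
      INTEGER biWest, biEast, pWest, pEast
      REAL*8 phi( 1-OLx:sNx+OLx, 1-OLy:sNy+OLy, nSx )
      INTEGER i, j
      IF ( sameProc ) THEN
C      Shared memory: copy interior edge columns between the two
C      tiles directly.
       DO j = 1, sNy
        DO i = 1, OLx
         phi(sNx+i,j,biWest) = phi(i,j,biEast)
         phi(i-OLx,j,biEast) = phi(sNx-OLx+i,j,biWest)
        ENDDO
       ENDDO
      ELSE
C      Distributed memory: the same update becomes messaging.
       CALL EXCH_X( phi(1-OLx,1-OLy,biWest), pWest, pEast )
      ENDIF
      RETURN
      END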
Exploiting NI innovation • Ongoing collaborations • T3E production hardware • HP, Sun, Digital - semi-production • Intel, IBM experimental • Rapidly evolving field • MITgcm UV can exploit it
Compiler and OS maturity • F77 v. F90: F77 is universally OK • On SMP, predictable performance needs batch execution and a private environment; not always configured that way • Virtual memory makes “cache speedup” hard to predict
Is UV really “Ugly Version”? • Example code.
Some HP Numbers.1 Forward Code • Per proc. grid size 64x32x20 • 100 Mflop/s per proc. • Number of procs 16 • Total problem size 244x132x20 • Total performance 1.6 Gflop/s • Time per block per time step 0.2 secs
Some HP Numbers.2 Inverter • Per proc. grid size 64x32 • 200 Mflop/s per proc. • Number of procs 16 • Total problem size 244x132 • Total performance 3.2 Gflop/s • Time per block per time step 0.2 secs
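As a cross-check on the aggregate figures: 16 procs × 100 Mflop/s ≈ 1.6 Gflop/s for the forward code, and 16 procs × 200 Mflop/s ≈ 3.2 Gflop/s for the inverter.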
Outstanding Issues 1 • Base code debugging and testing • Parameterizations: mixed layer, eddy mixing • I/O, pre- and post-processing • Diagnostics • SPP customization: communication primitives, solver • TAMC compilation
Outstanding Issues 2 • Per platform customizations • Pipelined slices! • T3E TAMC tape • Scientific libraries for solver
Conclusion • “Parallel computing has historically been a field whose promise has been characterized by hyperbole, but whose development has been defined by pragmatism.” • HYPERBOLE: combine the best of MITgcm Classic and MITgcm UV. • PRAGMATISM: we want performance today; and a UV-style implementation is the most likely teraflop/s model, and the most likely desktop gigaflop/s model.
MITgcm UV • An Ultra-Versatile redesign and implementation of the MITgcm algorithm • An implementation that can exploit Wildfire shared memory • Cache-friendly, not vector-dominated • Toward a coupled ocean-atmosphere model.