The Case for Architectural Diversity

Burton Smith, Cray Inc.

We’ve had some diversity in the past
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, Denelcor, Elxsi, ETA Systems, Evans and Sutherland Computer Division, Floating Point Systems, Galaxy, Goodyear Aerospace, Gould, Guiltech, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Saxpy, Scientific Computer Systems, Supertek, Supercomputer Systems Inc, Thinking Machines, Vitesse

Today, there is less of it
• Cray
• NEC
• Hitachi
• Cluster suppliers
• Do-it-yourself cluster builders
• Do-it-yourself grid builders

Two basic types of supercomputers
• Cluster and grid systems (Type T)
  • Prices based on Transistor cost
  • Performance characterized by Linpack
  • Low bandwidth interconnection networks
  • Off-the-shelf processors
• Tightly coupled systems (Type C)
  • Prices based on Connection cost
  • Performance characterized by sparse MV multiply
  • High bandwidth interconnection networks
  • Custom processors
• Each type is adapted to its ecological niche
  • What are these niches?
  • What are these adaptations?

Supercomputer niches

Type T                      Type C
Local data access           Global data access
Well-balanced workloads     Poorly balanced workloads
Dense linear algebra        Sparse linear algebra
Explicit methods            Implicit methods
Domain decomposition        Operator decomposition
Non-adaptive meshes         Adaptive meshes
Regular meshes              Irregular meshes
Slowly varying data bases   Rapidly varying data bases

• Many disciplines span both columns
• They may want to employ both types of system

Supercomputer adaptations
• Adaptation is visible in several areas, including
  • Latency tolerance
  • Cooling and packaging
  • Message passing styles
• But first, a few words about a few words

o overhead space sizeg network transport time L overhead Bandwidth, overhead, and latency • In the LogP model, well-known in computer science: • L is the network transport latency • o is the processor overhead • g is the reciprocal bandwidth (the “gap”) • P is the number of processors • Time(size) = sizeg + 2o + L

“Latency” has several meanings
• It means 2o + L for some, L for others
  • Each is a legitimate latency, but for different subsystems
• Some want it to mean size·g + 2o + L
  • This is not so useful
• We should at least try to get our names straight
  • I will use the LogP definitions

Latency tolerance (latency hiding)
[Figure: space-time diagram of several overlapped messages, each with overhead, network transport, and overhead segments]
• Latency can be tolerated by using parallelism
• A new transmission can start after waiting max(size·g, o)
• LTTime(n, size) = (n - 1)·max(size·g, o) + size·g + 2o + L
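
As a rough sketch, the pipelined formula can be written next to the single-message one. The names `message_time` and `lt_time` are illustrative choices, not part of the talk.

```python
def message_time(size, g, o, L):
    """Time for one message: size*g + 2*o + L."""
    return size * g + 2 * o + L


def lt_time(n, size, g, o, L):
    """Time for n overlapped messages, per the slide's LTTime formula.

    Each of the first n - 1 messages delays its successor by
    max(size*g, o); the last message then takes a full message_time.
    """
    return (n - 1) * max(size * g, o) + message_time(size, g, o, L)


if __name__ == "__main__":
    # With n = 1 there is nothing to overlap, so both formulas agree.
    print(lt_time(1, 100, 2, 3, 5), message_time(100, 2, 3, 5))
```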

What latency tolerance buys us
• It depends on the relative magnitudes of size·g, o, and L
  • n·Time(size) = n·(size·g + 2o + L)
  • LTTime(n, size) = (n - 1)·max(size·g, o) + size·g + 2o + L
• If size·g >> 2o + L we are “bandwidth bound”
  • n-fold latency tolerance saves a mere (n - 1)(2o + L)
  • This only gets significant for large n
• If o >> size·g + L we are “overhead bound”
  • n-fold latency tolerance saves about (n - 1)·o
  • This will roughly halve the time
  • Unequal overheads at sender and receiver make it worse
• If L >> size·g + 2o we are “latency bound”
  • n-fold latency tolerance saves approximately (n - 1)·L
  • This is roughly an n-fold time improvement
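
The three regimes can be checked numerically with a small sketch. The parameter sets below are made-up values chosen only so that one term dominates in each case. The helper names (`serial_time`, `lt_time`, `speedup`) are likewise assumptions.

```python
def serial_time(n, size, g, o, L):
    """n messages sent one after another: n * (size*g + 2*o + L)."""
    return n * (size * g + 2 * o + L)


def lt_time(n, size, g, o, L):
    """n overlapped messages: (n - 1)*max(size*g, o) + size*g + 2*o + L."""
    return (n - 1) * max(size * g, o) + size * g + 2 * o + L


def speedup(n, size, g, o, L):
    """Ratio of serial time to latency-tolerant time."""
    return serial_time(n, size, g, o, L) / lt_time(n, size, g, o, L)


# Hypothetical parameters: one dominant term per regime.
regimes = {
    "bandwidth bound": dict(size=1000, g=1.0, o=1.0, L=1.0),
    "overhead bound": dict(size=1, g=1.0, o=1000.0, L=1.0),
    "latency bound": dict(size=1, g=1.0, o=1.0, L=1000.0),
}

if __name__ == "__main__":
    n = 16
    for name, params in regimes.items():
        print(f"{name}: {speedup(n, **params):.2f}x")
```

With these assumed numbers the bandwidth-bound case gains almost nothing, the overhead-bound case approaches a factor of two, and the latency-bound case approaches a factor of n, in line with the bullets above.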

Aside: does message size vary with P?
• Let’s take PDEs as an example, and assume:
  • We have three space dimensions and one time dimension
  • We need 16 times the processors to double the resolution
  • Each processor gets half as many spatial mesh points
    • If the processors are also faster, maybe somewhat more
• For nearest-neighbor communication, the size shrinks
  • Perhaps to 0.5^(2/3) = 0.63 or 0.5^(1/3) = 0.79
• For all-to-all communication, e.g. in a spectral method, the size shrinks to 1/32 of its former value
  • There are half as many points per processor and sixteen times as many processors to distribute it among
• Your mileage will vary, and it will probably get worse
  • Supercomputer users usually spend P for time, not space
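
The arithmetic on this slide can be reproduced with a short sketch. The variable names are illustrative, and the two nearest-neighbor exponents are simply the ones quoted on the slide.

```python
space_dims = 3   # three space dimensions
refinement = 2   # doubling the resolution

# Work grows with space and time resolution: 2**(3 + 1) = 16x the processors.
processor_factor = refinement ** (space_dims + 1)

# Spatial mesh points grow 2**3 = 8x, spread over 16x the processors.
points_per_processor = refinement ** space_dims / processor_factor

# Nearest-neighbor message size, using the two exponents quoted on the slide.
nearest_two_thirds = points_per_processor ** (2 / 3)
nearest_one_third = points_per_processor ** (1 / 3)

# All-to-all: each processor's points are split among 16x as many peers.
all_to_all = points_per_processor / processor_factor

if __name__ == "__main__":
    print(processor_factor, points_per_processor)
    print(round(nearest_two_thirds, 2), round(nearest_one_third, 2))
    print(all_to_all)
```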

Latency tolerance in summary
• It uses parallelism to reduce total transmission time
  • It is basically just pipelined data transport
• It is most needed when size·g is relatively small
  • either because of small size or small g (high bandwidth)
  • When size·g is large, it tolerates latency without help
• It is not particularly effective when overhead is high
• When both o and size·g are small, it works well
  • Vector memory references
  • Multithreaded memory references
  • Long-haul, high speed ATM packet transmission
  • Highway traffic (without toll booths, customs, etc.)
• The bottom line: latency tolerance is a type C thing
  • It doesn’t matter so much for type T systems

Cooling and packaging
• All type T supercomputers are air cooled
  • This makes them voluminous
  • Access for service is simple
  • Unfortunately, the interconnecting cables are long
• Most type C supercomputers are liquid cooled
  • This lets them be compact
  • Interconnecting cables are shorter
  • At high bandwidth, cable volume varies as length^3
  • Unfortunately, access for service is more complex
• Each type is pretty well adapted
• Environmental forces that might cause re-adaptation:
  • Higher power in future off-the-shelf chips
  • Low-cost optical interconnect

Message passing styles
• In the usual type T system, software builds and unbuilds the messages and hardware transports them
  • o is several microseconds, typically much greater than L
  • A small g is futile unless size is pretty large
  • The user program is involved at both ends
  • The software can adapt to pretty much any old hardware
• In most type C systems, hardware can build and unbuild the messages as well as transport them
  • o is small, typically less than L
  • A small g is therefore worthwhile
  • The user program need only be involved at one end
  • The hardware must be suited to the messaging interface
• There are a few single-sided messaging interfaces
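
One way to see why a small g is futile when o is large is to ask what fraction of Time(size) = size·g + 2o + L is spent in the size·g term. The sketch below does that. The helper names and the two parameter sets are hypothetical and are not measurements of any real system.

```python
def bandwidth_share(size, g, o, L):
    """Fraction of one-message time spent in the size*g term."""
    transfer = size * g
    return transfer / (transfer + 2 * o + L)


def break_even_size(g, o, L):
    """Message size at which size*g equals the fixed cost 2*o + L."""
    return (2 * o + L) / g


# Hypothetical values in microseconds (g in microseconds per byte).
software_messaging = dict(g=0.001, o=5.0, L=1.0)   # large o, as for type T
hardware_messaging = dict(g=0.001, o=0.1, L=1.0)   # small o, as for type C

if __name__ == "__main__":
    for name, p in [("software", software_messaging),
                    ("hardware", hardware_messaging)]:
        print(name, "break-even size:", break_even_size(**p), "bytes")
        print(name, "bandwidth share at 1 KB:", bandwidth_share(1024, **p))
```

Below the break-even size the fixed cost 2o + L dominates, so improving g changes little. With the assumed numbers, that threshold is roughly ten times larger for the high-overhead case.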

MPI-2
• PUT and GET to remote WINDOWs in a process group
  • The WINDOW is typically atop an array or common block
  • Each WINDOW instance can have a different origin and size
  • Window handle, processor number, and offset are arguments
• WIN_FENCE is the barrier operation
• Stride and gather/scatter are controlled by MPI types
  • e.g. CONTIGUOUS, VECTOR, INDEXED
  • The type must be set up and made known beforehand
• Types can be represented differently in heterogeneous systems and MPI will (hopefully) take care of it
• There are several atomic memory accumulate functions
• There are many collective communication functions

Shmem
• Remote data must be SYMMETRIC, i.e. the virtual address must be the same in all nodes
  • (TASK)COMMON and C statics are OK
  • Stack variables can be forced SYMMETRIC
• There are BARRIER and BARRIER_ALL operations
• Types or explicit widths (8-128) specify transfer quanta
• Vector transfers may be unit stride or constant stride
  • Gather/scatter is only available on UNICOS/mk
• There are some incompatibilities among UNICOS, UNICOS/mk, and IRIX
• There are several atomic memory accumulate functions
• There are a few collective communication functions

Co-array Fortran 95 (and UPC)
• Roughly, UPC is to C as co-array Fortran is to Fortran
• These languages have two kinds of subscripts
  • A(I)[J] roughly means A(I) on image J
  • If J exceeds P, the locations are distributed cyclically
• There may be any number of threads per image
• There are SYNC_ operations for images
• There are nameless critical sections
• Fortran 95 has reductions and other collective ops
• Fortran 95 already has a forall, and UPC added one

Single-sided implementations
• Several builders of type T systems are getting on board
  • IBM for shmem, several for UPC
  • DOE’s Office of Science is funding open-source versions
• Why is this adaptive type T behavior?
  • Off-the-shelf network hardware now has some support
  • Reducing overhead saves time in type T systems

Conclusions
• There are two principal types of supercomputer
  • Should there be more? Will there be?
• These two types are adapted to different niches
  • And the niches are important
• Picking the wrong type of supercomputer wastes money
  • by paying for unused transistors or connectivity
• The great supercomputer debate of the 90’s is over
  • so let’s move on
