High Performance Computing – CISC 811
Dr Rob Thacker, Dept of Physics (308A), thacker@physics
Today’s Lecture: The Productivity Crisis
• Part 1: The Programmability Crisis
  • New languages for easier programming
• Part 2: The Hardware Crisis
  • Designing balanced systems
• Part 3: The future of HPC
  • Machine architectures – what to expect
Part 1: Programming Crisis
• Abstraction issues
  • Low level versus high level
• Implicit message passing models
  • Co-Array Fortran
  • Unified Parallel C
  • Titanium
A large part of this lecture is derived from a presentation by Hans Zima to the IFIP Working Group: “Towards Future Programming Languages and Models for High-Productivity Computing”. Co-Array Fortran and UPC information is derived from lectures by Bob Numrich (Minnesota) and Kathy Yelick (UCB).
Coming soon to a Peta-scale facility near you
• Truly massive parallelism
  • 100,000s of processors
  • Will require parallelism within the algorithm at the level of 1,000,000s
  • Component failures will occur at relatively short intervals
• Highly non-uniform data access
  • Deep memory hierarchies
  • Severe differences in latency: 1000+ cycles for accessing data from local memory; larger still for remote memory
  • Severe differences in bandwidths
MPI – A Successful Failure?
• Community-wide standard for parallel programming
• A proven “natural” model for distributed fragmented-memory class systems
• User responsible for locality management
• User responsible for minimizing overhead
• User responsible for resource allocation
• User responsible for exposing parallelism
• Users have relied on ILP and OpenMP for more parallelism
• Mediocre scaling: demands problem-size expansion for greater performance
• We are now constrained by legacy MPI codes
Programming practices circa 2000
• Programmers are forced into a local-versus-remote viewpoint
• Communication has become fundamental to algorithm implementations
• The art of programming has been reduced to understanding the most effective domain decompositions and communication strategies
• Efficient programming of large machines has become a “heroic” endeavour
• Application programming can take months to years
Productivity – from first concept to results – is at an all-time low on HPC systems. Increasing complexity in modelling means this problem must be solved if we are to attempt the coming challenges.
The Emergence of High-Level Sequential Languages
The designers of the very first high-level programming language were aware that their success depended on acceptable performance of the generated target programs.
John Backus (1957): “… It was our belief that if FORTRAN … were to translate any reasonable scientific source program into an object program only half as fast as its hand-coded counterpart, then acceptance of our system would be in serious danger …”
High-level algorithmic languages became generally accepted standards for sequential programming, since their advantages outweighed any performance drawbacks.
For parallel programming, no similar development has taken place.
The Programming Crisis
• Current HPC software:
  • Application efficiency sometimes in single digits
  • Low-productivity programming models dominate
  • Focus on programming in the small
  • Inadequate programming environments and tools
• Resemblance to the pre-Fortran era in sequential programming:
  • Assembly language programming = MPI
  • Wide gap between the domain of the scientist and the programming language
Why have alternatives to MPI not been more successful?
• Functionality!
  • Data-parallel languages like High Performance Fortran (HPF) place strong limitations on algorithms
  • They do not support irregular algorithms
  • They are strongly focused on “array type” structures and SPMD parallelism
• Efficiency
  • Massaging an algorithm into a data-parallel form for HPF frequently produced awful performance
• Can we learn from this?
Abstraction: A necessary evil
Programming models and languages are the bridge between raw algorithms and execution. Different levels of abstraction are possible:
• assembly languages
• general-purpose procedural languages
• functional languages
• very high-level domain-specific languages
Abstraction implies loss of machine-related detail – a gain in simplicity, clarity, verifiability and portability, versus potential performance degradation.
What must a new language achieve?
• Make scientists and engineers more productive!
• Provide a higher level of abstraction, without sacrificing performance
• Increase usability
• Enable portability of algorithms from one architecture to another
• Ensure robustness on larger systems
Where must improvements be made?
• New languages will need to be supported by significant progress in many fields:
  • algorithm determination
  • compiler technology
  • runtime systems
  • intelligent programming environments
  • better hardware support
Contrarian View
• Algorithms are an expression of the mathematics
  • Need new algorithms
  • Need better ways to express those algorithms – they must match hardware realities
  • Parallelism is only one of the easier problems
  • Algorithms must match what the hardware can do well – this is where languages may need to change
• Are new languages really necessary?
  • If so, how should they be evaluated? They must address the hard problems, not just the easy ones
  • If not, how do we solve the problems we face?
More arguments against…
• New languages almost always fail (very few successes)
• Overly abstracted languages usually do not match the needs of system hardware
• Compilers take forever to bring to maturity
• People, quite reasonably, like what they do; they don’t want to change
• People feel threatened by others who want to impose silly, naive, expensive, impractical, unilateral ideas
• Acceptance is a big issue
• And then there’s the legacy problem
Global Address Space Languages
• A hybrid of the message passing and shared memory models
• Message passing becomes implicit via writes to shared variables
• Global execution consists of a collection of processes
  • The number of processes is set at start-up (à la MPI)
• Local and shared data, as in the shared memory model
  • Shared data is distributed across processes
  • Remote data stays remote on distributed memory machines
• Examples are Co-Array Fortran (CAF), Unified Parallel C (UPC), Titanium and Split-C
The Guiding Principles behind GAS Languages
• We don’t know what the next generation of high-productivity languages will look like – so why not just take “baby steps”?
• What is the smallest change (or set of changes) required to make C/Fortran/Java effective parallel languages?
• How can this change be expressed so that it is intuitive and natural for programmers?
• How can it be expressed so that existing compiler technology can implement it easily and efficiently?
Co-Array Fortran
• Co-Array Fortran is one of three simple language extensions to support explicit parallel programming (alongside UPC and Titanium, above)
• www.co-array.org
• See also www.pmodels.org
Programming Model
• Single-Program-Multiple-Data (SPMD)
• Fixed number of processes/threads/images
• Explicit data decomposition
• All data is local
• All computation is local
• One-sided communication through co-dimensions
• Explicit synchronization
• Is this model still adequate?
Co-Array Fortran Execution Model
• The number of images is fixed and each image has its own index, retrievable at run-time:
  num_images() ≥ 1
  1 ≤ this_image() ≤ num_images()
• Each image executes the same program independently of the others
• The programmer inserts explicit synchronization and branching as needed
• An “object” has the same name in each image
• Each image works on its own local data
• An image moves remote data to local data through, and only through, explicit co-array syntax
Declaration and memory model
Declare the co-arrays:
  real :: x(n)[*]
Can have multiple co-array dimensions – creates domain decompositions of differing dimensions.
[Diagram: every image holds its own local copy x(1:n); an image reaches another image’s copy only through remote syntax such as x(n)[p] or x(1)[q].]
Co-dimensions and domain decomposition
  real :: x(n)[p,q,*]
• Replicates an array of length n, one on each image
• Builds a map so each image knows how to find the array on any other image
• Organizes images in a logical (not physical) three-dimensional grid
• The last co-dimension acts like an assumed-size array: its extent is num_images()/(p×q). For example, with p = q = 2 on 16 images the final co-dimension runs from 1 to 4.
Communication in CAF Syntax
  y(:) = x(:)[p]              ! get: copy x from image p into local y
  x(index(:)) = y[index(:)]   ! gather y from the images listed in index(:)
  x(:)[q] = x(:) + x(:)[p]    ! put: combine local x with image p’s x, store on image q
An absent co-dimension defaults to the local object.
2D Jacobi boundary exchange
• Communication is one-sided as well as being implicit in the copies
• The boundary transfer is a few simple lines of code!
  REAL ANS(0:NN+1,0:MM+1)[0:P-1,0:*]
  ME_QP = MOD(ME_Q+1+Q,Q) ! north
  ME_QM = MOD(ME_Q-1+Q,Q) ! south
  ME_PP = MOD(ME_P+1+P,P) ! east
  ME_PM = MOD(ME_P-1+P,P) ! west
  ANS(1:NN,MM+1) = ANS(1:NN,1   )[ME_P, ME_QP] ! north
  ANS(1:NN,  0 ) = ANS(1:NN, MM )[ME_P, ME_QM] ! south
  ANS(NN+1,1:MM) = ANS(1,   1:MM)[ME_PP,ME_Q ] ! east
  ANS(  0 ,1:MM) = ANS( NN, 1:MM)[ME_PM,ME_Q ] ! west
Synchronization Intrinsics
• sync_all() – full barrier; wait for all images before continuing
• sync_all(wait(:)) – partial barrier; wait only for those images in the wait(:) list
• sync_team(list(:)) – team barrier; only images in list(:) are involved
• sync_team(list(:),wait(:)) – team barrier; wait only for those images in the wait(:) list
• sync_team(myPartner) – synchronize with one other image
Co-Array Fortran Compilers
• Available on all Cray systems
• Rice University is developing an open-source compiling system for CAF
  • Runs on the HP-Alpha systems at PSC and GSFC
  • Runs on SGI platforms
• IBM may put CAF on the BlueGene/L machine at LLNL
• The DARPA High Productivity Computing Systems (HPCS) project wants CAF
  • IBM, Cray, Sun
Unified Parallel C
• (Roughly speaking) the C equivalent of Co-Array Fortran
• Two compilation models:
  • THREADS may be fixed at compile time, or
  • set dynamically at program start-up
• MYTHREAD specifies the thread index (0..THREADS-1)
• Basic synchronization mechanisms: barriers, locks
• What UPC does not do automatically:
  • Determine data layout
  • Load balance – move computations
  • Caching – move data
• These decisions are left to the programmer (intentionally)
Shared Arrays in UPC
Shared array elements are spread across the threads:
  shared int x[THREADS];     /* one element per thread */
  shared int y[3][THREADS];  /* 3 elements per thread */
  shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
With THREADS = 4: x places one element on each thread; y (of course, really a 2D array) gives thread t the column y[0][t], y[1][t], y[2][t], which appears blocked; z is distributed cyclically, so thread 0 owns z[0], z[4], z[8].
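To make the model concrete, here is a minimal UPC vector-add sketch (not from the original slides; it assumes a UPC compiler such as Berkeley upcc). The affinity expression in upc_forall makes each thread execute only the iterations that touch its own elements; with the default cyclic layout, the matching reads of a[i] and b[i] are local too.
  #include <upc.h>
  #include <stdio.h>

  #define N (8 * THREADS)          /* 8 elements per thread */

  shared int a[N], b[N], c[N];     /* default cyclic layout across threads */

  int main(void) {
      int i;
      /* each thread initializes only the elements in its own memory */
      upc_forall (i = 0; i < N; i++; &a[i]) {
          a[i] = i;
          b[i] = 2 * i;
      }
      upc_barrier;                 /* make initialization visible everywhere */
      upc_forall (i = 0; i < N; i++; &c[i]) {
          c[i] = a[i] + b[i];      /* purely local work: same layout for a, b, c */
      }
      upc_barrier;
      if (MYTHREAD == 0) printf("c[1] = %d\n", c[1]);  /* implicit remote read */
      return 0;
  }
Note there is no explicit send or receive anywhere: the read of c[1] by thread 0 is the implicit communication the previous slides describe.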
Performance of GAS languages
• Reasons why they may be slower than MPI:
  • Shared array indexing is expensive
  • The model encourages small messages
• Reasons why they may be faster than MPI:
  • MPI encourages synchrony
  • Buffering is required for many MPI calls
  • A remote read/write of a single word may incur very little overhead
• Assuming communication is overlapped, the real issue is overhead: how much time does it take to issue a remote read/write?
UPC Compilers
• (GNU) GCC UPC: http://www.intrepid.com/upc/
• Berkeley UPC compiler: http://upc.lbl.gov/
  • Supports a number of different platforms
• Michigan Tech MuPC: http://www.upc.mtu.edu/
• HP UPC
• Cray UPC
The Fundamental Problem
• The push to improve productivity glosses over one important issue
• Productivity problems still need to be solved by someone: improving language support moves the heroic effort from the application programmer to the language/compiler developer
  • If successful, it means the programming problems are solved once, and by one group
• Will this be effective in the long run?
• Can we provide an effective environment for all types of parallel problems?
Summary Part 1
• Today’s programming languages, models, and tools are not adequate to deal with the challenges of 2010 architectures and application requirements
• Nonetheless, support for legacy applications will be a strong driver
• Peta-scale architectures will pose additional architectural problems
  • But(!) they should provide better hardware support for high-level languages and compilers
• Research will have to be pursued along both hardware and software routes
Part 2: Hardware Crisis
• Why benchmarking is unscientific
• How benchmarks influence machine purchases and consequently machine designs
• High Performance LINPACK and its predictive power for other applications
• Balancing bandwidth and computational power
From a notable Sun VP
• “Current HPC benchmarks are grossly inadequate for guiding vendors to make the right choices”
• “Complaining about ‘efficiency’ as a fraction of peak FLOPS is a throwback to times when FLOPS dominated cost. Get real. Over 95% of cost is the memory system!”
• “Anything below one memory op per flop puts too much pressure on programmers to be clever about data placement and timing of that placement”
The Problem with Benchmarking
• There is no scientific theory of benchmarking
  • Machine architectures are radically “un”-SISD
  • Predicting performance from the result of one benchmark has become impossible
• Is it even possible, in principle, for a suite of benchmarks to predict the future performance of much larger systems on our current applications?
  • This is fundamental, since it guides design
  • Most people would say it is impossible without incorporating the benchmark results into a larger framework for performance modelling
  • Argues for machine simulation being the key to future progress, rather than just making what we have bigger
Top 500 is becoming too important
• Position on the Top 500 is becoming a key factor in machine purchases
  • Biases researchers towards buying the most flops possible for a given amount of dollars – so Gigabit Ethernet gets used as the interconnect
  • Funding agencies like to see that their dollars “are being well spent” (spot the irony)
  • Top 500 position is frequently used as a lure for faculty recruitment in academia
  • Some institutions even connect distinct systems together for one day to get on the list
• Most vendors in turn design for the highest possible LINPACK result and ignore the harder parts of machine design
  • Such as providing low latency and high bandwidth
Benchmarking isn’t just about chest banging
• Large machine procurement is a very sensitive exercise
  • The installed base of users needs to supply important codes for testing
  • Frequently, external audits may be necessary to ensure money is properly spent
• In the absence of your own application codes, how can you judge one machine against another?
  • Single benchmarks do not provide enough detail
How good a predictor is LINPACK of general application performance? Terrible!!!
[Scatter plot, source: John McCalpin – LINPACK rating versus measured application performance for a number of high-end servers in 1999 shows essentially no correlation. This trend has continued…]
Slightly better alternative
[Same data with LINPACK replaced by Specfp2000 as the predictor – note that the correlation is still comparatively poor, but at least there is a visible trend!]
Disappearance of Scaling Plots
• When quoting performance figures for applications it is common to hear: “XYZ code achieves 15% of the maximal system performance”
• Maximal system performance (maximal single-node performance × number of nodes) is almost impossible to achieve
• It is possible that the application scales almost perfectly but doesn’t have good performance on a single node
• This has caused confusion – where is the performance problem? On the node or in the communication system? Both!!!
For example, a code reaching 25% of peak on one node with 60% parallel efficiency reports 0.25 × 0.6 = 15% of “maximal system performance” – the quoted figure alone cannot tell you which factor is to blame.
LINPACK in more detail (see “The LINPACK Benchmark: Past, Present and Future” by Dongarra, Luszczek and Petitet)
• LINPACK is a nicely constrained benchmark
  • Precisely 2n³/3 + 2n² + O(n) operations are required
  • Operations are split almost evenly between multiplications and additions
• The memory requirement is of order n², so one might be inclined to think there is a good computation-to-communication ratio (and indeed there is)
• LINPACK has both good locality in memory and good temporal locality
• The best performance on HPL tends to come from running the biggest problem you can fit on your machine
  • Reduces the effect of communication
• You can tell how good communication is on a machine by examining the N½ value – if you achieve half the rated flops on a comparatively small problem, you are running on a low-latency system
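A quick worked ratio (an illustration using the counts above, assuming 8-byte double precision): the benchmark performs roughly 2n³/3 flops on 8n² bytes of matrix data, i.e. about (2n³/3)/(8n²) = n/12 flops per byte of resident data. At n = 120,000 that is 10,000 flops per byte touched, which is why large problem sizes amortize communication so effectively.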
What are benchmarks really measuring?
• Benchmarks actually measure two things:
  • Properties of the application
  • Properties of the architecture it is being run on
• Consider applications A & B, run on systems 1 & 2
  • System 1 has a lot of bandwidth but lower raw speed; system 2 has high raw speed but low bandwidth
  • Similarly, application A requires high bandwidth and B does not
[Table: derived flop rates for each application/system combination]
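To make the point concrete, here is an illustrative set of derived flop rates (hypothetical numbers; the original slide’s table is not reproduced here):
• Application A (bandwidth-hungry): 1.2 Gflop/s on system 1, 0.5 Gflop/s on system 2
• Application B (compute-bound): 0.7 Gflop/s on system 1, 1.4 Gflop/s on system 2
Each machine “wins” on one of the two applications, so a single derived flop rate ranks the systems differently depending on which application you happened to measure.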
HPC Challenge benchmark suite
• Rather than relying upon one metric, this suite introduces additional (tougher) benchmarks
• Sponsored by DARPA, DoE, NSF
  • All recognize that LINPACK is not enough
• Vendors are beginning to come onboard
  • With 7 parts to the suite you can always claim you do one part better than anyone else
• Unclear how an overall ranking will come about
  • A “Benchmarking Olympiad”?
Components of the new suite
• HPL (LINPACK) – uses MPI over the whole system
• DGEMM – double precision matrix multiply from the BLAS
• STREAM – on a single CPU, plus aggregate STREAM across the entire system (embarrassingly parallel)
• PTRANS – matrix transpose
• RandomAccess – rate of update of random memory locations
• FFTE – FFT benchmark
• General bandwidth and latency tests
STREAM
• Four different loops – Copy, Scale, Add and Triad – shown in the sketch below
• Measures the relative speed of each loop
• The motivation for this benchmark comes from grid-based codes
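The four kernels, in the shape of McCalpin’s reference stream.c (a condensed sketch; the real benchmark times each loop separately over multiple trials and converts the best time to MB/s):
  #include <stdio.h>

  #define N 20000000            /* arrays much larger than any cache */
  static double a[N], b[N], c[N];

  int main(void) {
      double scalar = 3.0;
      long j;
      for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }
      for (j = 0; j < N; j++) c[j] = a[j];                  /* Copy  */
      for (j = 0; j < N; j++) b[j] = scalar * c[j];         /* Scale */
      for (j = 0; j < N; j++) c[j] = a[j] + b[j];           /* Add   */
      for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];  /* Triad */
      printf("a[0] = %f\n", a[0]);   /* keep the compiler from eliding the loops */
      return 0;
  }
Copy and Scale move two words per iteration, Add and Triad move three; dividing bytes moved by loop time gives the sustained memory bandwidth the slide is after.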
PTRANS
• Calculates A = A + Bᵀ (a sketch follows below)
• Matrix transposition is communication intensive: pairs of processors must communicate simultaneously
• Two-dimensional block-cyclic storage is used
• Results are provided for both a single processor and the entire system
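A serial sketch of the kernel (illustrative only; the HPCC version distributes A and B block-cyclically over a 2-D process grid, so the strided accesses below become pairwise message exchanges):
  /* A = A + B^T for n x n row-major matrices */
  void ptrans(int n, double *A, const double *B) {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              A[i*n + j] += B[j*n + i];   /* strided read of B stresses memory (or network) */
  }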
Random Access
• Random memory addresses are calculated and then updated (see the sketch below)
• Each processor carries its own table, and the tables must consume at least half of the aggregate system memory
• The access stream has no spatial locality in memory, and no temporal locality
• The global problem factors in communication latency and reduces the number of updates by 2 orders of magnitude
• Very cache-unfriendly – you won’t reuse any of the other words in a retrieved cache line
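A condensed single-process sketch of the update loop, following the shape of the HPCC reference code (the table size here is a toy value):
  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>

  #define POLY 0x0000000000000007ULL    /* polynomial from the HPCC reference code */

  int main(void) {
      uint64_t size = 1ULL << 24;       /* 2^24 words = 128 MB table */
      uint64_t *table = malloc(size * sizeof *table);
      for (uint64_t i = 0; i < size; i++) table[i] = i;

      uint64_t ran = 1;
      uint64_t n_updates = 4 * size;    /* HPCC performs 4 updates per table word */
      for (uint64_t i = 0; i < n_updates; i++) {
          /* 64-bit LFSR-style pseudo-random stream */
          ran = (ran << 1) ^ ((int64_t) ran < 0 ? POLY : 0);
          table[ran & (size - 1)] ^= ran;   /* random read-modify-write */
      }
      printf("table[0] = %llu\n", (unsigned long long) table[0]);
      free(table);
      return 0;
  }
Every update touches one essentially random word, so each iteration is a full memory (or network) round trip – exactly the behaviour the benchmark is designed to expose.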
FFTE
• http://www.ffte.jp
• Tests the speed of a one-dimensional FFT
• Quite strongly bandwidth dependent
• A very useful benchmark for the broad range of scientific codes that use FFTs
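FFTE itself is a Fortran library; as a rough illustration of the kind of kernel being timed, here is a sketch using the separate FFTW library (an assumption for illustration, not part of the HPCC code; FFT benchmarks conventionally credit 5 n log₂ n flops per transform):
  #include <complex.h>
  #include <fftw3.h>

  int main(void) {
      int n = 1 << 20;                        /* 2^20-point 1-D transform */
      fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
      fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
      for (int i = 0; i < n; i++) in[i] = 1.0;  /* trivial input data */
      fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
      fftw_execute(p);                        /* the step a benchmark would time */
      fftw_destroy_plan(p);
      fftw_free(in); fftw_free(out);
      return 0;
  }
(Link with -lfftw3. The butterfly access pattern strides through the whole array, which is why the slide calls the benchmark strongly bandwidth dependent.)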
General communication tests
• The main benchmark is “random ring”
• Nodes are selected at random within the MPI communicator, and both latency and bandwidth are tested
  • Latency is tested by making the messages as short as possible
  • Bandwidth is tested by increasing the message size through a list of prescribed lengths
• Together these compare how well a system is optimized for short or long messages (a sketch of the idea follows)
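A sketch in the spirit of the random-ring test (a simplified stand-in, not the actual HPCC code): ranks are arranged in one random ring and exchange fixed-size messages with their neighbours; 8-byte messages probe latency, and rerunning with large messages probes bandwidth.
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* rank 0 builds one random ring ordering and broadcasts it */
      int *perm = malloc(size * sizeof(int));
      if (rank == 0) {
          for (int i = 0; i < size; i++) perm[i] = i;
          for (int i = size - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
              int j = rand() % (i + 1);
              int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
          }
      }
      MPI_Bcast(perm, size, MPI_INT, 0, MPI_COMM_WORLD);

      /* find my position in the ring and my two neighbours */
      int me = 0;
      while (perm[me] != rank) me++;
      int right = perm[(me + 1) % size];
      int left  = perm[(me - 1 + size) % size];

      enum { NBYTES = 8 };          /* 8 B probes latency; ~2 MB would probe bandwidth */
      char sbuf[NBYTES] = {0}, rbuf[NBYTES];
      double t0 = MPI_Wtime();
      for (int iter = 0; iter < 1000; iter++)
          MPI_Sendrecv(sbuf, NBYTES, MPI_BYTE, right, 0,
                       rbuf, NBYTES, MPI_BYTE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      double dt = (MPI_Wtime() - t0) / 1000;
      if (rank == 0) printf("per-hop time: %g s\n", dt);
      free(perm);
      MPI_Finalize();
      return 0;
  }
Because the ring ordering is random rather than topology-aware, most neighbour pairs land on distant nodes, so the result reflects typical rather than best-case interconnect performance.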
So how do things look on the HPC Challenge right now?
• http://icl.cs.utk.edu/hpcc
• Benchmarks have been submitted for 131 systems
• Vendors like NEC, Cray and SGI are keen to show their wares
• The suite actually comprises more than 7 tests, since some tests can be conducted both locally and globally
  • It is pretty confusing already…
• Category winners tend to be machines optimized for a specific area, although some platforms stand out as being good general-purpose machines