The Coming Multicore Revolution: What Does it Mean for Programming? Kathy Yelick, NERSC Director, LBNL, and EECS Professor, U.C. Berkeley
Software Issues at Scale • Power concerns will dominate all others; concurrency is the most significant knob we have: lower the clock rate, increase parallelism • Power density and facility energy • Summary issues for software: • A 1 EF system: billion-way concurrency, O(1K) cores per chip • A 1 PF system: millions of threads and O(1K) cores per chip • The memory capacity/core ratio may drop significantly • Faults will become more prevalent • Flops are cheap relative to data movement • "Core" is a retro term: think "sea of functional units" • 1 thread per functional unit • Many functional units per thread (SIMD, vectors) • Many threads per functional unit (Niagara)
DRAM component density is only doubling every 3 years (Source: IBM)
Two Parallel Language Questions • What is the parallel control model? Options include data parallel (a single thread of control), dynamic threads, and single program multiple data (SPMD). • What is the model for sharing/communication? Shared memory (load/store) or message passing (send/receive); synchronization is implied by message passing, but not by shared memory.
What's Wrong with MPI Everywhere? • We can run 1 MPI process per core; this works now (for CMPs) and will work for a while, even though bandwidth within a chip is ~100x higher and latency far lower • How long will it continue working? 4-8 cores? Probably. 128-1024 cores? Probably not; it depends on bandwidth and memory scaling • What is the problem? • Latency: some copying is required by the semantics • Memory utilization: partitioning data into separate address spaces requires some replication. How big is your per-core subgrid? At 10x10x10, nearly half of the points are surface points, whose halo data is replicated (see the arithmetic sketch below) • Memory bandwidth: extra state means extra bandwidth • Weak scaling: the success model of the "cluster era" will not carry into the manycore era; there is not enough memory per core • Heterogeneity: MPI per CUDA thread-block? • Advantage: no new application work, and only modest infrastructure work (a multicore-optimized MPI)
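To make the subgrid arithmetic concrete, a back-of-the-envelope check (assuming a one-cell-deep ghost layer around each 10×10×10 subgrid):

```latex
\[
\underbrace{10^3 - 8^3}_{\text{surface points}} = 1000 - 512 = 488 \approx 49\%\ \text{of owned points},
\qquad
\underbrace{12^3 - 10^3}_{\text{ghost points replicated}} = 1728 - 1000 = 728 \approx 73\%\ \text{extra storage per subgrid.}
\]
```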
What about Mixed MPI and Threads? • Threads: OpenMP, PThreads, … (a minimal hybrid sketch follows below) • Problems: • Will OpenMP performance scale with the number of cores per chip? • More investment in infrastructure than MPI, but it can leverage existing technology • Do people want two programming models? • It doesn't go far enough: thread context is a large amount of state compared to vectors/streams, i.e., one op per instruction (threads) vs. 64 ops per instruction (vectors)
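For reference, a minimal hybrid MPI + OpenMP sketch of the "two programming models" approach discussed above: one MPI rank per chip or socket, OpenMP threads inside it, and message passing only between ranks. The program, the problem size n, and the choice of MPI_THREAD_FUNNELED are illustrative assumptions, not something taken from the slide.

```c
/* hybrid.c: compile with e.g. `mpicc -fopenmp hybrid.c` */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double local_sum = 0.0, global_sum = 0.0;

    /* Threads share the rank's address space: no halo replication inside a chip. */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < n; i++)
        local_sum += 1.0 / ((double)rank * n + i + 1);

    /* Message passing only between ranks. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f (threads per rank = %d)\n", global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```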
PGAS Languages: Why Use 2 Programming Models when 1 Will Do? • Global address space: any thread may directly read/write remote data • Partitioned: data is designated as local or global [Figure: a global address space spanning processors p0 … pn, with private variables (l:) and globally visible variables (g:, x, y) partitioned among them] • Remote put and get: you never have to say "receive" • Remote function invocation? See the HPCS languages • No less scalable than MPI (see previous talks) • Permits sharing, whereas MPI rules it out • One model rather than two; but if you insist on two, UPC can be called from MPI and vice versa (tested and used). A minimal UPC sketch follows below.
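A minimal UPC sketch of the PGAS idea: a shared array laid out cyclically across threads, read and written directly with no explicit receive. The array name x, its size, and the cyclic layout are illustrative choices, not part of the original slide.

```c
/* pgas_sketch.upc: compile with a UPC compiler, e.g. Berkeley UPC's upcc */
#include <upc.h>
#include <stdio.h>

#define N 256

/* Default (cyclic) layout: element i has affinity to thread i % THREADS. */
shared double x[N * THREADS];

int main(void) {
    int i;

    /* Affinity-driven loop: each thread touches only the elements it owns. */
    upc_forall (i = 0; i < N * THREADS; i++; &x[i])
        x[i] = (double) MYTHREAD;

    upc_barrier;

    /* Thread 0 reads a remote element directly: a one-sided "get". */
    if (MYTHREAD == 0)
        printf("x[last] lives on thread %d and holds %g\n",
               (int) upc_threadof(&x[N * THREADS - 1]),
               x[N * THREADS - 1]);
    return 0;
}
```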
Sharing and Communication Models: PGAS vs. MPI • A one-sided put/get message carries the target memory address along with the data payload, so it can be handled directly by a network interface with RDMA support, avoiding interrupting the host CPU or staging data through it (preposted buffers) • A two-sided message carries only a message id and the data payload; it must be matched with a receive on the target to identify the memory address where the data should land • Matching can be offloaded to the network interface in networks like Quadrics, but the match tables must be downloaded to the interface from the host. Joint work with Dan Bonachea
Performance Advantage of One-Sided Communication • The put/get operations in PGAS languages (remote read/write) are one-sided: no interaction is required from the remote processor • This is faster for pure data transfers than two-sided send/receive (an MPI RMA illustration of one-sided semantics follows below)
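The same one-sided semantics can be illustrated with MPI's RMA interface, used here as a stand-in for the PGAS put described on the slide: the origin rank writes directly into a window exposed by the target, and the target never posts a receive. A rough sketch, with fence synchronization chosen purely for simplicity:

```c
/* rma_put.c: one-sided put via MPI RMA */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double buf = -1.0;                       /* memory exposed to remote puts */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0 && nranks > 1) {
        double val = 3.14;
        /* Rank 0 writes directly into rank 1's window: no matching receive. */
        MPI_Put(&val, 1, MPI_DOUBLE, 1 /*target*/, 0 /*displacement*/,
                1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 holds %g without ever posting a receive\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```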
PGAS Languages for Multicore • PGAS languages are a good fit to shared memory machines, including multicore: the global address space is implemented as ordinary loads and stores • They can also be exploited on processors with an explicit local store rather than a cache (e.g., Cell, GPUs), using DMA to move data between partitions [Figure: private on-chip memory (m:, l:), shared partitioned on-chip memory (x, y), and shared off-chip DRAM] • Open question in architecture: cache-coherent shared memory vs. software-controlled local memory (or a hybrid)
To Virtualize or Not • The fundamental question facing parallel programming models is: what should be virtualized? • Hardware has finite resources: the processor count is finite, the register count is finite, fast local memory (cache and DRAM) is finite, and the network topology generally provides far fewer than n^2 links • Does the programming model (language + libraries) expose this or hide it? E.g., one thread per core, or many? • Many threads may have advantages for load balancing, fault tolerance, and latency hiding, but one thread per core is better for deep memory hierarchies • How do you get the most out of your machine?
Avoid Global Synchronization • Bulk-synchronous programming has too much synchronization, which is bad for performance • Linpack performance on multicore/SMP (left, Dongarra et al.) and on distributed memory (right, UPC) illustrates the cost • Heavy synchronization is also a bad direction for fault tolerance • Dynamic load balancing: use it when the problem demands it, not gratuitously
Lessons Learned • One-sided communication is faster than two-sided; it is all about overlap and overhead (FFT example) • A global address space can ease programming • Dynamic scheduling can tolerate latencies, and is more adaptive and easier to use (fewer knobs) • Principled scheduling should take into account the critical path, memory use, cache behavior, etc. • Combining dynamic load balancing with locality control raises new challenges: previous work solved load balancing (Cilk) or locality (MPI), but not both together
Autotuning: Extreme Performance Programming • Automatic performance tuning: use machine time in place of human time for tuning • Search over possible implementations, using performance models to restrict the search space • Programmers should write programs that generate code, not the code itself • The autotuner finds a well-performing solution by heuristics or exhaustive search (a toy search driver is sketched below): a Perl script generates many versions, including SIMD-optimized kernels; the autotuner analyzes/runs the kernels, using search and heuristics • The PERI SciDAC project is incorporating some of these ideas into compilers and into domain-specific code generators and libraries, e.g., OSKI for sparse matrices • Multicore optimizations are not yet in the OSKI release
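A toy sketch of the search step of an autotuner in the spirit described above: time several candidate kernel variants and keep the fastest. In a real autotuner the variants (parameterized here by a hypothetical unroll factor) would be generated by a script as fixed-trip-count code, and timing would use a more careful harness than clock(); everything below is an illustrative stand-in.

```c
/* autotune_sketch.c */
#include <stdio.h>
#include <time.h>

#define N (1 << 20)
static double a[N], b[N];

/* Stand-in for a generated kernel variant; a real autotuner would emit one
   fully unrolled function per candidate rather than looping over `unroll`. */
static void kernel_variant(int unroll) {
    for (int i = 0; i + unroll <= N; i += unroll)
        for (int u = 0; u < unroll; u++)
            b[i + u] = 2.0 * a[i + u] + 1.0;
}

int main(void) {
    const int candidates[] = {1, 2, 4, 8};
    int best = candidates[0];
    double best_time = 1e30;

    for (int c = 0; c < 4; c++) {
        clock_t t0 = clock();
        kernel_variant(candidates[c]);                      /* run the variant */
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("unroll=%d  %.4f s\n", candidates[c], t);
        if (t < best_time) { best_time = t; best = candidates[c]; }
    }
    printf("selected unroll factor: %d\n", best);            /* keep the fastest */
    return 0;
}
```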
Stencil Code Overview

void stencil3d(const double A[], double B[], int nx, int ny, int nz,
               double S0, double S1) {
  /* S0, S1: stencil coefficients (passed in here so the sketch compiles).
     Boundary planes are left untouched. */
  for (int i = 1; i < nx - 1; i++)
    for (int j = 1; j < ny - 1; j++)
      for (int k = 1; k < nz - 1; k++) {
        int c = (i * ny + j) * nz + k;                  /* center index */
        B[c] = S0 * A[c]
             + S1 * (A[c - ny*nz] + A[c + ny*nz]        /* x neighbors */
                   + A[c - nz]    + A[c + nz]           /* y neighbors */
                   + A[c - 1]     + A[c + 1]);          /* z neighbors */
      }
}

• 3D, 7-point Jacobi iteration on a 256^3 grid • Flop:byte ratio: 0.33 (write allocate), 0.5 (ideal); see the arithmetic below
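The flop:byte numbers above follow from counting 8 flops per grid point (6 adds and 2 multiplies) against the DRAM traffic per point:

```latex
\[
\mathrm{AI}_{\text{write-allocate}}
  = \frac{8 \text{ flops}}
         {\underbrace{8}_{\text{read } A}
          + \underbrace{8}_{\text{write-allocate read of } B}
          + \underbrace{8}_{\text{write } B} \text{ bytes}}
  = \frac{8}{24} \approx 0.33,
\qquad
\mathrm{AI}_{\text{ideal}} = \frac{8}{16} = 0.5 .
\]
```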
Naïve Stencil Performance [Figure: naïve stencil performance on Intel Xeon (Clovertown), AMD Opteron (Barcelona), and Sun Niagara2 (Victoria Falls)] • For this simple code, all cache-based platforms show poor efficiency and scalability • This could lead a programmer to believe that they are approaching a resource limit
Fully-Tuned Performance [Figure: tuned stencil performance on Intel Xeon (Clovertown), AMD Opteron (Barcelona), and Sun Niagara2 (Victoria Falls), with speedups of 1.9x, 5.4x, and 12.5x over the naïve code across the three platforms] • Optimizations include: NUMA-aware placement, padding, unroll/reordering, thread/cache blocking, prefetching, SIMDization, and cache bypass • Different optimizations have dramatic effects on different architectures • The largest optimization benefit is seen at the largest core count
Stencil Results [Figure: single- and double-precision performance and power efficiency across the platforms]
Autotuning gives better performance…but is it good performance?
The Roofline Performance Model [Figure: attainable GFlop/s vs. flop:DRAM byte ratio, both axes on log scales]
The Roofline Performance Model (generic machine) • The top of the roof is determined by the peak computation rate (double-precision floating point, DP, for these algorithms) • The instruction mix, lack of SIMD operations, lack of ILP, or failure to use other features needed for peak will lower the attainable ceiling [Figure: in-core ceilings below peak DP: mul/add imbalance, w/out SIMD, w/out ILP]
The Roofline Performance Model (generic machine) • The sloped part of the roof is determined by peak DRAM bandwidth (STREAM) • Lack of proper prefetching, ignoring NUMA, and similar issues will reduce the attainable bandwidth [Figure: bandwidth ceilings below peak BW: w/out SW prefetch, w/out NUMA]
The Roofline Performance Model (generic machine) • Where an algorithm sits along the horizontal axis (the "posts" holding up the roof) is determined by its arithmetic intensity • This varies across algorithms and with bandwidth-reducing optimizations, such as better cache re-use (tiling) and compression techniques [Figure: the same roofline with vertical posts marking different flop:byte ratios]
Naïve Roofline Model • Attainable GFlop/s(AI) = min(peak GFlop/s, StreamBW × AI) • This is an unrealistically optimistic model • It uses a hand-optimized Stream BW benchmark for the bandwidth roof (a small calculator sketch follows below) [Figure: naïve rooflines for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, each showing peak DP and the hand-optimized Stream BW]
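A few lines of C make the min(peak, StreamBW × AI) ceiling concrete; the peak and bandwidth values below are made-up placeholders, not measurements from the slide.

```c
/* roofline.c: tabulate the attainable-performance ceiling */
#include <stdio.h>

static double roofline(double peak_gflops, double stream_gbs, double ai) {
    double bw_bound = stream_gbs * ai;                      /* sloped part of the roof */
    return bw_bound < peak_gflops ? bw_bound : peak_gflops; /* flat part */
}

int main(void) {
    const double peak = 74.0;   /* GFlop/s, illustrative only */
    const double bw   = 21.0;   /* GB/s sustained (STREAM-like), illustrative only */
    for (double ai = 0.0625; ai <= 8.0; ai *= 2.0)
        printf("AI = %6.4f  attainable = %6.2f GFlop/s\n", ai, roofline(peak, bw, ai));
    return 0;
}
```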
Roofline Model for Stencil (out-of-the-box code) • Large datasets, 2 unit-stride streams, no NUMA awareness, little ILP, no DLP, and far more adds than multiplies (imbalance) • Ideal flop:byte ratio of 1/3 • High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio • There is no naïve SPE implementation for the Cell blade [Figure: rooflines for Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, with ceilings for mul/add imbalance, w/out SIMD, w/out ILP, w/out SW prefetch, w/out NUMA, w/out FMA, "dataset fits in snoop filter," and FP bank conflicts; the naïve code lands well below the roof (annotated at 25% and 12% FP on Victoria Falls)]
Roofline Model for Stencil (NUMA, cache blocking, unrolling, prefetch, …) • Cache blocking helps ensure the flop:byte ratio stays as close as possible to 1/3 • Clovertown has huge caches but is pinned to a lower bandwidth ceiling • Cache management is essential when capacity per thread is low [Figure: the same four rooflines, with the tuned code moving up toward the bandwidth ceiling on each machine]
Roofline Model for Stencil (SIMDization + cache bypass) • Make SIMDization explicit • Use the cache-bypass store instruction movntpd • This increases the flop:byte ratio to ~0.5 on x86/Cell (an intrinsics sketch follows below) [Figure: the same four rooflines, with the fully tuned code approaching the bandwidth-limited roof]
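A rough sketch of what "explicit SIMDization plus cache bypass" looks like with SSE2 intrinsics; _mm_stream_pd compiles to the movntpd store mentioned above. This 1D streaming fragment is illustrative only, not the tuned stencil kernel itself, and it assumes b is 16-byte aligned.

```c
/* stream_sse2.c: SIMD loads/multiplies plus non-temporal (cache-bypass) stores */
#include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */

void scale_bypass(const double *restrict a, double *restrict b, long n, double s) {
    __m128d vs = _mm_set1_pd(s);
    long i;
    for (i = 0; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);   /* two doubles per SIMD operation */
        __m128d vb = _mm_mul_pd(va, vs);
        /* Non-temporal store: skip the write-allocate read of b's cache line,
           the same traffic the stencil avoids to move from ~0.33 toward 0.5. */
        _mm_stream_pd(&b[i], vb);           /* requires 16-byte aligned b */
    }
    for (; i < n; i++)                      /* scalar remainder */
        b[i] = s * a[i];
    _mm_sfence();                           /* order the streaming stores */
}
```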
Autotuning: Extreme Performance Programming (continued) [Figure: performance of dense matrix-matrix multiply as a function of block size (n0 × m0), illustrating the implementation space an autotuner searches]
Latency- and Bandwidth-Avoiding Algorithms • Communication is the limiting factor for scalability: movement of data within the memory hierarchy and movement of data between cores/nodes • Two aspects of communication: latency (the number of discrete events, such as cache misses or messages) and bandwidth (the volume of data moved) • More efficient algorithms tend to run less efficiently: algorithmic efficiency (e.g., O(n) ops on O(n) data) is at odds with efficiency of the computation on the machine (% of peak) • At scale we cannot afford to be algorithmically inefficient, so we need to run low-compute-intensity algorithms well, or flops are wasted. Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden
Avoiding Communication in Sparse Iterative Solvers • Consider sparse iterative methods for Ax=b, using Krylov subspace methods such as GMRES and CG • Can we lower the communication costs? • Latency: reduce the number of messages by computing [x, Ax, …, A^k x] with a single read of the remote parts of x • Bandwidth to the memory hierarchy: compute the same kernel with a single read of A • Example: the inner loop is a sparse matrix-vector multiply, Ax (a nearest-neighbor computation on a graph); partition into cache blocks or by processor, and take multiple steps per read • For simplicity of presentation only, A is the matrix of a 1D mesh (a "3-point stencil"). Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden
Locally Dependent Entries for [x, Ax], A tridiagonal, 2 processors [Figure: the entries of x, Ax, A^2x, …, A^8x owned by Proc 1 and Proc 2, highlighting those that can be computed without communication]. Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden
Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal, 2 processors [Figure: as above, with the region computable without communication shrinking by one entry per additional power] (a sketch of the local computation follows below). Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyudden
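A sketch of the local piece of this computation for the tridiagonal ("3-point stencil") case: once a process has fetched k ghost values from each neighbor, it can form its portion of [x, Ax, …, A^k x] with no further communication, because each application of A shrinks the region of valid entries by one. The row-per-power layout and the coefficients (-1, 2, -1) are illustrative choices, not taken from the slide.

```c
/* akx_tridiag.c: local part of the matrix-powers kernel */

/* y[i] = (A x)[i] for A = tridiag(-1, 2, -1), on a ghost-extended vector. */
static void spmv_3pt(const double *x, double *y, int len) {
    for (int i = 1; i < len - 1; i++)
        y[i] = -x[i - 1] + 2.0 * x[i] - x[i + 1];
}

/* V is (k+1) rows of length nloc + 2k; row j holds the local entries of A^j x.
   Row 0 (x plus k ghost values per side) is assumed already filled by one
   communication step. Entries near the ends of later rows are never valid,
   but the nloc owned entries of every row are. */
void matrix_powers_local(double *V, int nloc, int k) {
    int len = nloc + 2 * k;
    for (int j = 1; j <= k; j++)
        spmv_3pt(&V[(j - 1) * len], &V[j * len], len);
}
```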
Avoiding Communication in Sparse Linear Algebra: Summary • Take k steps of a Krylov subspace method (GMRES, CG, Lanczos, Arnoldi) • Assume the matrix is "well-partitioned," with a modest surface-to-volume ratio • Parallel implementation: conventional O(k log p) messages vs. "new" O(log p) messages (optimal) • Serial implementation: conventional O(k) moves of data from slow to fast memory vs. "new" O(1) moves of data (optimal) • Can incorporate some preconditioners (hierarchical, semiseparable matrices, …) • Lots of speedup is possible (modeled and measured) • Price: some redundant computation
Why All Our Problems Are Solved for Dense Linear Algebra, in Theory • Theorem (Demmel, Dumitriu, Holtz, Kleinberg; Numer. Math. 2007): given any matmul algorithm running in O(n^ω) ops for some ω > 2, it can be made stable and still run in O(n^(ω+ε)) ops, for any ε > 0 • Current record: ω ≈ 2.38 • Theorem (Demmel, Dumitriu, Holtz; Numer. Math. 2008): given any stable matmul running in O(n^(ω+ε)) ops, one can do backward-stable dense linear algebra in O(n^(ω+ε)) ops: GEPP, QR (recursive), rank-revealing QR (randomized), (generalized) Schur decomposition, SVD (randomized) • This also reduces communication to O(n^(ω+ε)) • But what about the constants?
How to Waste an Exascale Machine • Ignore Little's Law and waste bandwidth (see the worked example below) • Over-synchronize (unnecessary barriers) • Over-synchronize communication (two-sided instead of one-sided) • Waste more bandwidth by ignoring locality • Use algorithms that minimize flops rather than data movement • Add a high-overhead runtime system when you don't need it
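For the first bullet, Little's Law says the concurrency that must be kept in flight to sustain a given bandwidth is latency times bandwidth. With illustrative numbers (100 ns memory latency and 50 GB/s per socket, both invented for the example):

```latex
\[
\text{concurrency} = \text{latency} \times \text{bandwidth}
  = 100\,\text{ns} \times 50\,\text{GB/s}
  = 5000\ \text{bytes} \approx 625\ \text{doubles in flight.}
\]
```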