The Roofline Model
Samuel Williams
Lawrence Berkeley National Laboratory
SWWilliams@lbl.gov
Outline
• Challenges / Goals
• Fundamentals
• Roofline Performance Model
• Example: Heat Equation
• Example: SpMV
• Alternate Roofline Formulations
• Summary
Four Architectures
[Block diagrams of four systems: AMD Barcelona (dual socket, quad-core, 512KB victim caches, 2MB shared quasi-victim L3, 2x64b DDR2 controllers at 10.66 GB/s per socket, 4GB/s HyperTransport links); Sun Victoria Falls (dual socket, 8 MT SPARC cores per socket, crossbar, 4MB shared L2 (16 way), 2x128b FBDIMM controllers at 21.33 GB/s read / 10.66 GB/s write per socket); IBM Cell Blade (VMT PPE with 512K L2 plus 8 SPEs with 256K local stores and MFCs per socket, EIB ring network, XDR memory controllers at 25.6 GB/s per socket, <20GB/s BIF each direction); NVIDIA G80 (8 thread clusters over an interconnect, 192KB L2 (textures only), 24 ROPs, 6 x 64b memory controllers at 86.4 GB/s to 768MB 900MHz GDDR3 device DRAM)]
Challenges / Goals
We have extremely varied architectures. Moreover, the characteristics of numerical methods can vary dramatically. The result is that performance and the benefit of optimization can vary significantly from one architecture × kernel combination to the next.
We wish to understand whether we have attained good performance (a high fraction of theoretical peak), to identify performance bottlenecks, and to enumerate potential remediation strategies.
Little’s Law
• Little’s Law: Concurrency = Latency × Bandwidth, or equivalently, Effective Throughput = Expressed Concurrency / Latency
• Bandwidth: conventional memory bandwidth; number of floating-point units
• Latency: memory latency; functional-unit latency
• Concurrency: bytes expressed to the memory subsystem; concurrent (parallel) memory operations
• For example, consider a CPU with 2 FPUs, each with a 4-cycle latency. Little’s Law states that we must express 8-way ILP to fully utilize the machine.
Little’s Law Examples
Applied to Memory:
• Consider a CPU with 20 GB/s of bandwidth and 100 ns memory latency. Little’s Law states that we must express 2 KB of concurrency (independent memory operations) to the memory subsystem to attain peak performance.
• On today’s superscalar processors, hardware stream prefetchers speculatively load consecutive elements.
• Solution: express the memory access pattern in a streaming fashion in order to engage the prefetchers.
Applied to FPUs:
• Consider a CPU with 2 FPUs, each with a 4-cycle latency. Little’s Law states that we must express 8-way ILP to fully utilize the machine.
• Solution: unroll/jam the code to express 8 independent FP operations, as in the sketch below.
• Note: simply unrolling dependent operations (e.g., a reduction) does not increase ILP; it merely amortizes loop overhead.
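As an illustrative sketch of the unroll/jam solution (code written for this writeup, not from the original slides), the dependent reduction below serializes on a single accumulator, while the unrolled version keeps 8 independent accumulators in flight, matching 2 FPUs × 4-cycle latency:

```c
/* Dependent reduction: each add must wait on the previous one. */
double sum_dependent(const double *x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unroll & jam: 8 independent dependence chains expose 8-way ILP. */
double sum_unrolled(const double *x, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    int i;
    for (i = 0; i + 8 <= n; i += 8) {
        s0 += x[i];   s1 += x[i+1]; s2 += x[i+2]; s3 += x[i+3];
        s4 += x[i+4]; s5 += x[i+5]; s6 += x[i+6]; s7 += x[i+7];
    }
    for (; i < n; i++) s0 += x[i];          /* scalar remainder */
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```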
Three Classes of Locality
• Temporal Locality
  • reusing data (in registers or cache lines) multiple times amortizes the impact of limited bandwidth.
  • transform loops or algorithms to maximize reuse.
• Spatial Locality
  • data is transferred from cache to registers in words, but data is transferred to the cache in 64-128 byte lines.
  • using every word in a line maximizes spatial locality.
  • transform data structures into a structure-of-arrays (SoA) layout; see the sketch below.
• Sequential Locality
  • many memory address patterns access cache lines sequentially.
  • CPUs’ hardware stream prefetchers exploit this observation to speculatively load data and thereby hide memory latency.
  • transform loops to generate (a few) long, unit-stride accesses.
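A minimal sketch of the SoA transformation (the Pixel types and sizes here are hypothetical, chosen for illustration): with an array-of-structures layout, a traversal of one field wastes most of every cache line, while a structure-of-arrays layout makes the same traversal fully unit-stride:

```c
#define N 1000000

struct PixelAoS { float red, green, blue, alpha; };
struct PixelAoS image_aos[N];        /* AoS: red values 16 bytes apart */

struct ImageSoA { float red[N], green[N], blue[N], alpha[N]; };
struct ImageSoA image_soa;           /* SoA: red values contiguous */

float sum_red_aos(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += image_aos[i].red;       /* only 1/4 of each line is useful */
    return s;
}

float sum_red_soa(void) {
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += image_soa.red[i];       /* every byte of every line is used */
    return s;
}
```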
Arithmetic Intensity
• True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
• Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); others have constant intensity.
• Arithmetic intensity is ultimately limited by compulsory traffic and is diminished by conflict or capacity misses.
[Arithmetic intensity spectrum: O(1) — SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N) — FFTs; O(N) — dense linear algebra (BLAS3), particle methods]
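As a concrete worked example (not on the original slide): a double-precision dot product performs 2N flops while compulsorily moving 2 × 8N bytes of vector data from DRAM, so AI = 2N / 16N = 0.125 regardless of N. By contrast, a well-blocked dense matrix-matrix multiply performs O(N³) flops on O(N²) data, so its intensity grows with problem size.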
NUMA
Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA): the bandwidth to read a given address varies dramatically among cores.
Exploit NUMA (affinity + first touch) when you malloc/initialize data, as in the sketch below. The concept is similar to data decomposition for distributed memory.
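A minimal sketch of affinity + first touch (this assumes a first-touch page-placement OS policy and OpenMP threads pinned across sockets; compile with OpenMP support, e.g., -fopenmp):

```c
#include <stdlib.h>

/* malloc reserves virtual pages only; physical pages are placed on the
 * NUMA node of the thread that first writes them. Initializing with the
 * same parallel decomposition used later for computation puts each page
 * near the core that will consume it. */
double *alloc_and_touch(size_t n) {
    double *a = malloc(n * sizeof(double));
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;   /* first touch: page lands near the touching thread */
    return a;
}
```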
Overlap of Communication
Consider a simple example in which an FP kernel maintains a working set in DRAM. We assume we can perfectly overlap computation with communication (or vice versa), whether through prefetching/DMA and/or pipelining (decoupling of communication and computation).
Thus:

  Time = max( Bytes / STREAM Bandwidth , Flops / Peak Flop/s )
Roofline Model: Basic Concept
• Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound-and-bottleneck analysis:

  Attainable Performance_{i,j} = min( FLOP/s with Optimizations 1..i , AI × Bandwidth with Optimizations 1..j )

• where optimization i can be SIMDize, unroll, SW prefetch, ...
• Given a kernel’s arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
• Moreover, it provides insight into which optimizations will potentially be beneficial.
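A minimal sketch of that bound as code (the peak and bandwidth numbers below are illustrative placeholders, not measurements from the slides):

```c
#include <stdio.h>

/* Roofline bound: the lesser of the compute ceiling and the
 * bandwidth-limited diagonal at a given arithmetic intensity. */
double roofline(double peak_gflops, double stream_bw_gbs, double ai) {
    double mem_bound = stream_bw_gbs * ai;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    double peak = 73.6;   /* GFlop/s: e.g., dual-socket Barcelona-class peak */
    double bw   = 16.6;   /* GB/s: e.g., a measured STREAM bandwidth */
    for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
        printf("AI = %6.3f  ->  attainable %6.1f GFlop/s\n",
               ai, roofline(peak, bw, ai));
    return 0;
}
```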
Example
• Consider the Opteron 2356:
  • dual socket (NUMA)
  • limited HW stream prefetchers
  • quad-core (8 cores total)
  • 2.3 GHz
  • 2-way SIMD (DP)
  • separate FP multiply and FP add datapaths
  • 4-cycle FP latency
• Assuming expression of parallelism is the challenge on this architecture, what would the Roofline model look like?
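Working through the peak implied by these specs (a derivation consistent with the slides, though not spelled out on them): 2 sockets × 4 cores × 2.3 GHz × 2-wide SIMD × (1 add + 1 multiply per cycle) = 73.6 GFlop/s double-precision peak. Each in-core ceiling introduced below strips one factor from this product.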
Roofline Model: Basic Concept
• Plot on a log-log scale.
• Given AI, we can easily bound performance.
• But architectures are much more complicated; we will bound performance as we eliminate specific forms of in-core parallelism.
[Roofline plot for the Opteron 2356 (Barcelona): attainable GFLOP/s (0.5-256, log scale) vs. actual FLOP:byte ratio (1/8-16, log scale); horizontal "peak DP" ceiling meets the diagonal "Stream Bandwidth" line]
Roofline Model: Computational Ceilings
• Opterons have dedicated multipliers and adders; if the code is dominated by adds, attainable performance is half of peak.
• We call these ceilings; they act like constraints on performance.
[Roofline plot, Opteron 2356 (Barcelona): "mul/add imbalance" ceiling added below "peak DP"]
Roofline Model: Computational Ceilings
• Opterons have 128-bit datapaths; if instructions aren’t SIMDized, attainable performance will be halved.
[Roofline plot, Opteron 2356 (Barcelona): "w/out SIMD" ceiling added below "mul/add imbalance"]
Roofline Model: Computational Ceilings
• On Opterons, floating-point instructions have a 4-cycle latency; if we don’t express 4-way ILP, performance will drop by as much as 4x.
[Roofline plot, Opteron 2356 (Barcelona): "w/out ILP" ceiling added below "w/out SIMD"]
Roofline Model: Communication Ceilings
• We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Roofline plot, Opteron 2356 (Barcelona): "peak DP" ceiling and "Stream Bandwidth" diagonal]
Roofline Model: Communication Ceilings
• Explicit software prefetch instructions are required to achieve peak bandwidth.
[Roofline plot, Opteron 2356 (Barcelona): "w/out SW prefetch" diagonal added below "Stream Bandwidth"]
Roofline Model: Communication Ceilings
• Opterons are NUMA; as such, memory traffic must be correctly balanced between the two sockets to achieve good STREAM bandwidth.
• We could continue this exercise by examining strided or random memory access patterns.
[Roofline plot, Opteron 2356 (Barcelona): "w/out NUMA" diagonal added below "w/out SW prefetch"]
Roofline Model: Computation + Communication Ceilings
• We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Roofline plot, Opteron 2356 (Barcelona): all ceilings combined — peak DP, mul/add imbalance, w/out SIMD, w/out ILP; Stream Bandwidth, w/out SW prefetch, w/out NUMA]
Roofline Model: Locality Walls
• Remember, memory traffic includes more than just compulsory misses, so actual arithmetic intensity may be substantially lower.
• Walls are unique to the architecture-kernel combination.

  AI = FLOPs / Compulsory Misses

[Roofline plot, Opteron 2356 (Barcelona): vertical "only compulsory miss traffic" wall added at this AI]
Cache Behavior
• Knowledge of the underlying cache operation can be critical.
• For example, caches are organized into lines, and lines are organized into sets and ways (associativity). Thus, we must mimic the effect of Mark Hill’s 3 C’s of caches:
  • the impacts of conflict, compulsory, and capacity misses are both architecture- and application-dependent.
  • ultimately, they reduce the actual flop:byte ratio.
• Moreover, many caches are write-allocate:
  • a write-allocate cache reads in an entire cache line upon a write miss.
  • if the application ultimately overwrites that line, the read was superfluous (further reducing the flop:byte ratio); see the streaming-store sketch below.
• Because programs access data in words, but hardware transfers it in 64- or 128-byte cache lines, spatial locality is key:
  • array-of-structures data layouts can lead to dramatically lower flop:byte ratios.
  • e.g., if a program only operates on the "red" field of a pixel, bandwidth is wasted.
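A minimal sketch of avoiding write-allocate traffic with SSE2 streaming stores (the movntpd instruction, emitted via the _mm_stream_pd intrinsic): the destination lines are written directly to DRAM without first being read into the cache. This assumes 16-byte-aligned pointers and is an illustration, not code from the slides:

```c
#include <emmintrin.h>

void copy_streaming(double *restrict dst, const double *restrict src, long n) {
    long i;
    for (i = 0; i + 2 <= n; i += 2) {
        __m128d v = _mm_load_pd(&src[i]);   /* normal (cached) load */
        _mm_stream_pd(&dst[i], v);          /* non-temporal store: no
                                               write-allocate read of dst */
    }
    for (; i < n; i++) dst[i] = src[i];     /* scalar remainder */
    _mm_sfence();                           /* order the streaming stores */
}
```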
Roofline Model: Locality Walls
• Write-allocate traffic adds to the compulsory misses, pushing the wall further left:

  AI = FLOPs / (Allocations + Compulsory Misses)

[Roofline plot, Opteron 2356 (Barcelona): "+write allocation traffic" wall added left of "only compulsory miss traffic"]
Roofline Model: Locality Walls
• Capacity misses lower the intensity further:

  AI = FLOPs / (Capacity + Allocations + Compulsory)

[Roofline plot, Opteron 2356 (Barcelona): "+capacity miss traffic" wall added]
Roofline Model: Locality Walls
• Conflict misses lower it still further:

  AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Roofline plot, Opteron 2356 (Barcelona): "+conflict miss traffic" wall added]
Roofline Model: Locality Walls
• Optimizations remove these walls and ceilings, which act to constrain performance.
[Roofline plot, Opteron 2356 (Barcelona): all ceilings and walls shown together]
Instruction Issue Bandwidth
On a superscalar processor, there is usually ample instruction issue bandwidth, allowing loads, integer, and FP instructions to be issued simultaneously. As such, we assumed that expression of parallelism was the underlying in-core challenge. However, on some architectures, finite instruction issue bandwidth can become a major impediment to performance.
Roofline Model: Instruction Mix
• As the instruction mix shifts away from floating point, finite issue bandwidth begins to limit in-core performance.
• On Niagara2, with dual issue units but only one FPU, FP instructions must constitute ≥50% of the mix to attain peak performance.
• A similar approach should be used on GPUs, where proper use of CUDA solves the parallelism challenges.
[Roofline plot, UltraSparc T2+ T5140 (Niagara2): ceilings "peak DP, ≥50% FP", "25% FP", "12% FP"; diagonals Stream Bandwidth, w/out SW prefetch, w/out NUMA]
Optimization Categorization
• Maximizing (attained) in-core performance
  • exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance
  • techniques: reorder, unroll & jam, eliminate branches, explicit SIMD
• Maximizing (attained) memory bandwidth
  • exploit NUMA; hide memory latency; satisfy Little’s Law
  • techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking
• Minimizing (total) memory traffic
  • eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior
  • techniques: cache blocking (see the sketch below), array padding, streaming stores, compress data
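As an illustrative sketch of cache blocking (loop tiling) from the traffic-minimizing category — the transpose kernel and tile size B are assumptions chosen for this writeup, not from the slides:

```c
#define B 64   /* tile edge, chosen so two B x B tiles of doubles fit in cache */

/* A naive 2D transpose touches dst in large strides, generating capacity
 * misses; tiling keeps both the src rows and dst columns of one tile
 * resident while it is processed. */
void transpose_blocked(double *dst, const double *src, int n) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```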
Multicore SMPs Used
[Block diagrams of the four evaluation platforms: Intel Xeon E5345 (Clovertown) — 2 sockets x 4 cores, 4MB shared L2 per core pair, FSBs at 10.66 GB/s to a chipset with 4x64b FBDIMM controllers (21.33 GB/s read, 10.66 GB/s write); AMD Opteron 2356 (Barcelona) — 2 sockets x 4 cores, 512KB victim caches, 2MB shared quasi-victim L3 (32 way), 2x64b DDR2 controllers at 10.66 GB/s per socket, 4GB/s HyperTransport links; Sun T2+ T5140 (Victoria Falls) — 2 sockets x 8 MT SPARC cores, crossbar, 4MB shared L2 (16 way), 2x128b FBDIMM controllers at 21.33 GB/s per socket; IBM QS20 Cell Blade — 2 sockets x (VMT PPE with 512K L2 plus 8 SPEs with 256K local stores and MFCs), EIB ring network, XDR memory controllers at 25.6 GB/s per socket]
7-point Stencil
[Figure: 3D PDE grid and the 7-point stencil for the heat equation — center point (x,y,z) and neighbors at x±1, y±1, z±1]
• The simplest derivation of the Laplacian operator results in a constant-coefficient 7-point stencil for all x, y, z:

  u(x,y,z,t+1) = alpha*u(x,y,z,t) +
                 beta*( u(x,y,z-1,t) + u(x,y-1,z,t) + u(x-1,y,z,t) +
                        u(x+1,y,z,t) + u(x,y+1,z,t) + u(x,y,z+1,t) )

• Clearly each stencil performs:
  • 8 floating-point operations
  • 8 memory references, all but 2 of which should be filtered by an ideal cache
  • 6 memory streams, all but 2 of which should be filtered (fewer than the number of HW prefetchers)
• Ideally, AI = 0.5; however, write allocation bounds it to 0.33. A naive sweep is sketched below.
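A minimal naive sketch of the sweep defined by that update (the grid dimensions and one-cell ghost boundary are assumptions for illustration) — note the 8 flops per point and the two logical arrays, u0 read and u1 written:

```c
void stencil_sweep(int nx, int ny, int nz,
                   double alpha, double beta,
                   const double *u0, double *u1) {
    #define IDX(x, y, z) ((x) + (size_t)nx * ((y) + (size_t)ny * (z)))
    for (int z = 1; z < nz - 1; z++)
        for (int y = 1; y < ny - 1; y++)
            for (int x = 1; x < nx - 1; x++)
                u1[IDX(x,y,z)] = alpha * u0[IDX(x,y,z)]
                    + beta * ( u0[IDX(x,y,z-1)] + u0[IDX(x,y-1,z)]
                             + u0[IDX(x-1,y,z)] + u0[IDX(x+1,y,z)]
                             + u0[IDX(x,y+1,z)] + u0[IDX(x,y,z+1)] );
    #undef IDX
}
```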
Roofline Model for Stencil (out-of-the-box code)
• Large datasets
• 2 unit-stride streams
• No NUMA
• Little ILP
• No DLP
• Far more adds than multiplies (imbalance)
• Ideal flop:byte ratio of 1/3
• High locality requirements; capacity and conflict misses will severely impair the flop:byte ratio
• No naive Cell implementation
[Four roofline panels of attainable GFlop/s vs. flop:DRAM byte ratio — Clovertown (peak DP, mul/add imbalance, w/out SIMD, w/out ILP; "fits within snoop filter"); Barcelona (same ceilings; diagonals w/out SW prefetch, w/out NUMA); Victoria Falls (peak DP, 25% FP, 12% FP; w/out SW prefetch, w/out NUMA); Cell Blade (peak DP, w/out FMA, w/out SIMD, w/out ILP; unaligned DMA, w/out NUMA)]
Roofline Model for Stencil (NUMA, cache blocking, unrolling, prefetch, ...)
• Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3.
• Clovertown has huge caches but is pinned to a lower bandwidth ceiling.
• Cache management is essential when capacity per thread is low.
• No naive Cell implementation
[Four roofline panels as on the previous slide — Clovertown, Barcelona, Victoria Falls, Cell Blade]
Roofline Model for Stencil (SIMDization + cache bypass)
• Make SIMDization explicit.
• Technically, this swaps the ILP and SIMD ceilings.
• Use the cache bypass instruction movntpd (see the streaming-store sketch earlier).
• Increases the flop:byte ratio to ~0.5 on x86/Cell.
[Four roofline panels as on the previous slides — Clovertown, Barcelona, Victoria Falls, Cell Blade]
Sparse Matrix-Vector Multiplication
• What’s a sparse matrix?
  • most entries are 0.0
  • performance advantage in only storing/operating on the nonzeros
  • requires significant metadata to reconstruct the matrix structure
• What’s SpMV?
  • evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors
• Challenges
  • very low arithmetic intensity (often < 0.166 flops/byte)
  • difficult to exploit ILP (bad for pipelined or superscalar cores)
  • difficult to exploit DLP (bad for SIMD)
[Figure: (a) algebra conceptualization of y = Ax; (b) CSR data structure — A.val[], A.col[], A.rowStart[]; (c) CSR reference code:]

  for (int r = 0; r < A.rows; r++) {
      double y0 = 0.0;
      for (int i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
          y0 += A.val[i] * x[A.col[i]];   /* indirect, irregular access to x */
      }
      y[r] = y0;
  }
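To make the < 0.166 figure concrete (a derivation consistent with the slide, though not shown on it): each nonzero costs one multiply and one add (2 flops) against at least 8 bytes of A.val plus 4 bytes of A.col, so AI ≤ 2/12 ≈ 0.166 flops/byte even before counting rowStart, x, and y traffic.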
Roofline Model for SpMV

  GFlops_{i,j}(AI) = min( InCoreGFlops_i , StreamBW_j × AI )

• Double-precision roofline models
• In-core optimizations 1..i; DRAM optimizations 1..j
• FMA is inherent in SpMV (placed at the bottom of the ceiling stack)
[Four roofline panels of attainable GFlop/s vs. flop:DRAM byte ratio — Intel Xeon E5345 (Clovertown): peak DP, w/out SIMD, w/out ILP, mul/add imbalance; "dataset fits in snoop filter"; Opteron 2356 (Barcelona): peak DP, w/out SIMD, w/out ILP, mul/add imbalance; w/out SW prefetch, w/out NUMA; Sun T2+ T5140 (Victoria Falls): peak DP, 25% FP, 12% FP; w/out SW prefetch, w/out NUMA; IBM QS20 Cell Blade: peak DP, w/out SIMD, w/out ILP, w/out FMA; bank conflicts, w/out NUMA]
Roofline Model for SpMV (overlay arithmetic intensity)
• Two unit-stride streams
• Inherent FMA
• No ILP
• No DLP
• FP is 12-25% of the instruction mix
• Naive compulsory flop:byte < 0.166
• No naive SPE implementation
[Four roofline panels as on the previous slide — Clovertown, Barcelona, Victoria Falls, Cell Blade — with the naive arithmetic intensity overlaid]