This joint venture between UC Berkeley and LBNL applies performance analysis and modeling techniques to understand and mitigate the memory wall in high-performance computing. Adaptable probes help isolate performance limitations and provide feedback to application developers and hardware designers. The study examines several commercial and research architectures and explores the impact of architectural features on performance.
Performance Analysis, Modeling, and Optimization: Understanding the Memory Wall
Leonid Oliker (LBNL) and Katherine Yelick (UCB and LBNL)
Berkeley Institute for Performance Studies
• Joint venture between U.C. Berkeley (Demmel & Yelick) and LBNL (Oliker, Strohmaier, Bailey, and others)
• Three performance techniques:
  • Analysis (benchmarking)
  • Modeling (prediction)
  • Optimization (tuning)
Investigating Architectural Balance using Adaptable Probes
Kaushik Datta, Parry Husbands, Paul Hargrove, Shoaib Kamil, Leonid Oliker, John Shalf, Katherine Yelick
Overview
• Gap between peak and sustained performance is a well-known problem in HPC
• Generally attributed to the memory system, but the bottleneck is difficult to identify
• Application benchmarks are too complex to isolate specific architectural features
• Microbenchmarks are too narrow to predict actual code performance
• We use adaptable probes to isolate performance limitations:
  • Give application developers possible optimizations
  • Give hardware designers feedback on current and proposed architectures
• Single-processor probes:
  • Sqmat captures regular and irregular memory access patterns (such as dense and sparse linear algebra)
  • Stencil captures nearest-neighbor computation (work in progress)
• Architectures examined:
  • Commercial: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4, G5
  • Research: Imagine, VIRAM, DIVA
Sqmat overview
• Sqmat is based on matrix multiplication and linear solvers
• A Java program generates optimally unrolled C code
• Square a set of matrices M times (enough matrices are used to exceed cache)
• M controls computational intensity (CI): the ratio of flops to memory accesses
• Each matrix is of size NxN
• N controls working-set size: 2N² registers required per matrix
• Direct storage: Sqmat's matrix entries are stored contiguously in memory
• Indirect: entries are accessed indirectly through a pointer
  • Parameter S controls the degree of indirection: S matrix entries are stored contiguously, then a random jump in memory
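The sketch below shows the kind of kernel Sqmat measures for direct storage. It is only illustrative: the real probe emits fully unrolled C from the Java generator, and the name sqmat_direct is ours, but it makes the roles of N and M concrete.

    /* Minimal illustrative sketch of a Sqmat direct-storage kernel.
       The real probe emits fully unrolled C from a Java generator;
       this loop version only shows the roles of N and M.  Each NxN
       matrix is squared M times; with enough matrices the working
       set exceeds cache, so raising M raises computational intensity. */
    void sqmat_direct(double *mats, int num_mats, int N, int M)
    {
        double tmp[N * N];                       /* C99 VLA scratch matrix */
        for (int k = 0; k < num_mats; k++) {
            double *A = mats + (long)k * N * N;
            for (int m = 0; m < M; m++) {        /* square the matrix M times */
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++) {
                        double s = 0.0;
                        for (int l = 0; l < N; l++)
                            s += A[i * N + l] * A[l * N + j];
                        tmp[i * N + j] = s;
                    }
                for (int t = 0; t < N * N; t++)  /* write the square back */
                    A[t] = tmp[t];
            }
        }
    }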
[Figure: percent of algorithmic peak vs. computational intensity (CI), unit stride, for Itanium 2, Opteron, Power3, and Power4]
Unit Stride Algorithmic Peak
• Curve increases until the memory system is fully utilized, then plateaus once the FPU units saturate
• Itanium2 requires longer to reach its plateau due to register-spill penalties
• The SIMD nature of Opteron's SSE2 inhibits a high algorithmic peak
• Power3 effectively hides the latency of cache accesses
• Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
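As a rough guide to the CI axis, a back-of-the-envelope estimate (assuming each matrix is loaded and stored once while being squared M times in registers) is:

    flops per matrix  ≈ M · N² · (2N − 1)
    memory traffic    ≈ 2N² words (one load and one store per entry)
    CI                ≈ M · (2N − 1) / 2   (e.g., N = 4, M = 8 gives CI ≈ 28)

CI therefore grows linearly with M, which is why sweeping M traces out the curves above.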
[Figure: slowdown vs. M under indirection (S=1) for Itanium 2, Opteron, Power3, and Power4]
Slowdown due to Indirection
• Unit-stride access via indirection (S=1)
• Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
• Itanium2 shows a high penalty for indirection; this issue is currently under investigation
[Figure: slowdown for irregular access vs. M on Itanium2 and Opteron (N=4), for fractions of random accesses from 100% (S=1) down to 0.78% (S=128)]
Cost of Irregularity (1)
• Itanium2 and Opteron perform well for irregular accesses due to:
  • Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
  • Opteron's low memory latency from its on-chip memory controller
[Figure: slowdown for irregular access vs. M on Power3 and Power4 (N=4), for fractions of random accesses from 100% (S=1) down to 0.39% (S=256)]
Cost of Irregularity (2)
• Power3 and Power4 perform poorly for irregular accesses due to:
  • Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
  • Power4 requiring 4 consecutive cache-line hits to activate prefetching
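The irregular access pattern behind these S values can be sketched as follows. This is an illustrative construction (build_indirect_index is our name, not the probe's code) of an index array that stays contiguous for S entries and then jumps to a random location.

    /* Illustrative construction of Sqmat's indirect access pattern:
       the index array stays contiguous for S entries, then jumps to
       a random position.  S=1 makes every access random; large S
       approaches indirect unit stride. */
    #include <stdlib.h>

    void build_indirect_index(int *idx, int n_entries, int S)
    {
        int pos = 0;
        for (int i = 0; i < n_entries; i++) {
            if (i % S == 0)                  /* random jump every S entries */
                pos = rand() % n_entries;
            idx[i] = pos;
            pos = (pos + 1) % n_entries;     /* contiguous within a run of S */
        }
    }

    /* The kernel then touches matrix entries through the index array,
       e.g. sum += data[idx[i]], instead of data[i] directly. */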
Tolerating Irregularity
• S50
  • Start with some M at S=∞ (indirect unit stride)
  • For a given M, how large must S be to achieve at least 50% of the original performance?
• M50
  • Start with M=1, S=∞
  • At S=1 (every access random), how large must M be to achieve 50% of the original performance?
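A hypothetical driver for measuring a metric like S50 is sketched below; run_sqmat is an assumed wrapper (not part of the original probe) that runs Sqmat with the given N, M, and S and returns performance in MFlop/s.

    /* Hypothetical driver for measuring S50: the smallest S that
       recovers at least 50% of near-unit-stride performance for a
       fixed M.  run_sqmat() is an assumed wrapper, not part of the
       original probe. */
    double run_sqmat(int N, int M, int S);      /* assumed probe entry point */

    int find_S50(int N, int M, int S_max)
    {
        double base = run_sqmat(N, M, S_max);   /* S_max stands in for S = infinity */
        for (int S = 1; S <= S_max; S *= 2)
            if (run_sqmat(N, M, S) >= 0.5 * base)
                return S;                       /* smallest S reaching 50% */
        return -1;                              /* never reaches 50% */
    }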
Tolerating Irregularity
• Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
Emerging Architectures
• General-purpose processors badly suited for data-intensive ops:
  • Large caches not useful if reuse is low
  • Low memory bandwidth, especially for irregular patterns
  • Superscalar methods of increasing ILP inefficient
  • Power consumption
• Application-specific ASICs: good, but expensive/slow to design
• Solution: general-purpose "memory aware" processors
  • Large number of ALUs: to exploit data parallelism
  • Huge memory bandwidth: to keep ALUs busy
  • Concurrency: overlap memory with computation
VIRAM Overview
• MIPS core (200 MHz)
• Main memory system:
  • 8 banks with 13 MB of on-chip DRAM
  • Large 6.4 GB/s on-chip peak bandwidth
• Cache-less vector unit:
  • Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  • Single-issue, in-order
• Low power consumption: 2.0 W
• Peak vector performance:
  • 1.6/3.2/6.4 Gops
  • 1.6 Gflops (single precision)
• Fabricated by IBM
• Deep pipelines mask DRAM latency
• Cray's vcc compiler adapted to VIRAM
• Simulator used for results
VIRAM Power Efficiency
• Comparable performance with a lower clock rate
• Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Imagine Overview
• "Vector VLIW" processor
• Coprocessor to an off-chip host processor
• 8 arithmetic clusters controlled in SIMD with VLIW instructions
• Central 128 KB Stream Register File (SRF) @ 32 GB/s
  • SRF can overlap computation with memory (double buffering)
  • SRF can reuse intermediate results (producer-consumer locality)
• Stream-aware memory system with 2.7 GB/s off-chip bandwidth
• 544 GB/s intercluster communication
• Host sends instructions to the stream controller; the SC issues commands to on-chip modules
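To illustrate the double-buffering idea behind the SRF, here is a generic C sketch of overlapping block transfers with computation. It is not Imagine's actual stream programming model; memcpy merely stands in for the asynchronous loads the stream controller would issue while the clusters compute.

    /* Generic double-buffering sketch of overlapping memory movement
       with computation, the idea behind the SRF.  This is NOT
       Imagine's actual stream programming model; memcpy() stands in
       for asynchronous stream loads. */
    #include <string.h>

    enum { BLOCK = 1024 };

    static void compute_block(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * src[i];           /* stand-in kernel */
    }

    void process_stream(const float *in, float *out, int n_blocks)
    {
        static float buf[2][BLOCK];
        if (n_blocks > 0)
            memcpy(buf[0], in, sizeof buf[0]);  /* "load" block 0 */
        for (int b = 0; b < n_blocks; b++) {
            int cur = b & 1, nxt = cur ^ 1;
            if (b + 1 < n_blocks)               /* stage block b+1 ...           */
                memcpy(buf[nxt], in + (b + 1) * BLOCK, sizeof buf[nxt]);
            compute_block(out + b * BLOCK, buf[cur], BLOCK);  /* ... while block b is processed */
        }
    }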
VIRAM and Imagine
• Imagine has an order of magnitude higher peak performance
• VIRAM has twice the memory bandwidth and lower power consumption
• Notice the peak Flop/Word ratios
What Does This Have to Do with PIMs?
• Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)
• Imagine much faster for long streams, slower for short ones
SQMAT: Performance Crossover
• Large number of ops per word (3x3 matrices squared 10 times)
• Crossover point: L=64 (cycles), L=256 (MFlop)
• Imagine's power becomes apparent at L=1024: almost 4x VIRAM
• Codes at this end of the spectrum greatly benefit from the Imagine architecture
Stencil Probe
• Stencil computations are at the core of a wide range of scientific applications
  • Applications include Jacobi solvers, complex multigrid, and block-structured AMR
• We are developing an adaptable stencil probe to model this range of computations (see the sketch below)
• Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
• Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors:
  • Small blocks inhibit automatic prefetching performance
  • Modern large on-chip L2/L3 caches have bandwidth similar to L1
• Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
• Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
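For reference, the sketch below shows the kind of kernel the stencil probe models: a 2D 5-point Jacobi sweep (illustrative only; the probe's actual parameters and generated code are not shown). The long unit-stride inner loop is what engages hardware prefetch, and small L1 tiles cut exactly these streams short.

    /* Illustrative 2D 5-point Jacobi sweep of the kind the stencil
       probe models.  The long unit-stride inner loop over j is what
       engages hardware prefetch engines; small L1 tiles break these
       streams, which is why L1 tiling helps little here. */
    void jacobi_sweep(const double *in, double *out, int nx, int ny)
    {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)    /* contiguous in memory */
                out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] +
                                          in[(i + 1) * ny + j] +
                                          in[i * ny + (j - 1)] +
                                          in[i * ny + (j + 1)]);
    }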