This joint venture between UC Berkeley and LBNL applies performance analysis and modeling techniques to understand and mitigate the memory wall in high-performance computing. Adaptable probes help isolate performance limitations and provide feedback to application developers and hardware designers. The study examines several commercial and research architectures and explores the impact of architectural features on performance.
Performance Analysis, Modeling, and Optimization: Understanding the Memory Wall
Leonid Oliker (LBNL) and Katherine Yelick (UCB and LBNL)
Berkeley Institute for Performance Studies
• Joint venture between U.C. Berkeley (Demmel & Yelick) and LBNL (Oliker, Strohmaier, Bailey, and others)
• Three performance techniques:
  • Analysis (benchmarking)
  • Modeling (prediction)
  • Optimization (tuning)
Investigating Architectural Balance using Adaptable Probes
Kaushik Datta, Parry Husbands, Paul Hargrove, Shoaib Kamil, Leonid Oliker, John Shalf, Katherine Yelick
Overview
• Gap between peak and sustained performance is a well-known problem in HPC
• Generally attributed to the memory system, but the bottleneck is difficult to identify
• Application benchmarks are too complex to isolate specific architectural features
• Microbenchmarks are too narrow to predict actual code performance
• We use adaptable probes to isolate performance limitations:
  • Give application developers possible optimizations
  • Give hardware designers feedback on current and proposed architectures
• Single-processor probes:
  • Sqmat captures regular and irregular memory access patterns (such as dense and sparse linear algebra)
  • Stencil captures nearest-neighbor computation (work in progress)
• Architectures examined:
  • Commercial: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4, G5
  • Research: Imagine, VIRAM, DIVA
Sqmat overview
• Sqmat is based on matrix multiplication and linear solvers
• A Java program generates optimally unrolled C code
• Square a set of matrices M times (enough matrices are used to exceed cache)
• M controls computational intensity (CI): the ratio of flops to memory accesses
• Each matrix is of size NxN
• N controls working-set size: 2N² registers required per matrix
• Direct storage: Sqmat's matrix entries are stored contiguously in memory
• Indirect: entries are accessed indirectly through a pointer
  • Parameter S controls the degree of indirection: S matrix entries are stored contiguously, then a random jump in memory
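The sketch below shows the kind of kernel Sqmat measures for direct storage. It is only illustrative: the real probe emits fully unrolled C from the Java generator, and the name sqmat_direct is ours, but it makes the roles of N and M concrete.

    /* Minimal illustrative sketch of a Sqmat direct-storage kernel.
       The real probe emits fully unrolled C from a Java generator;
       this loop version only shows the roles of N and M.  Each NxN
       matrix is squared M times; with enough matrices the working
       set exceeds cache, so raising M raises computational intensity. */
    void sqmat_direct(double *mats, int num_mats, int N, int M)
    {
        double tmp[N * N];                       /* C99 VLA scratch matrix */
        for (int k = 0; k < num_mats; k++) {
            double *A = mats + (long)k * N * N;
            for (int m = 0; m < M; m++) {        /* square the matrix M times */
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++) {
                        double s = 0.0;
                        for (int l = 0; l < N; l++)
                            s += A[i * N + l] * A[l * N + j];
                        tmp[i * N + j] = s;
                    }
                for (int t = 0; t < N * N; t++)  /* write the square back */
                    A[t] = tmp[t];
            }
        }
    }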
[Figure: percent of algorithmic peak vs. computational intensity (CI), unit stride, for Itanium 2, Opteron, Power3, and Power4]
Unit Stride Algorithmic Peak
• Curve increases until the memory system is fully utilized, then plateaus once the FPU units saturate
• Itanium2 requires longer to reach its plateau due to register-spill penalties
• The SIMD nature of Opteron's SSE2 inhibits a high algorithmic peak
• Power3 effectively hides the latency of cache accesses
• Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs
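As a rough guide to the CI axis, a back-of-the-envelope estimate (assuming each matrix is loaded and stored once while being squared M times in registers) is:

    flops per matrix  ≈ M · N² · (2N − 1)
    memory traffic    ≈ 2N² words (one load and one store per entry)
    CI                ≈ M · (2N − 1) / 2   (e.g., N = 4, M = 8 gives CI ≈ 28)

CI therefore grows linearly with M, which is why sweeping M traces out the curves above.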
[Figure: slowdown vs. M under indirection (S=1) for Itanium 2, Opteron, Power3, and Power4]
Slowdown due to Indirection
• Unit-stride access via indirection (S=1)
• Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
• Itanium2 shows a high penalty for indirection; this issue is currently under investigation
[Figure: slowdown for irregular access vs. M on Itanium2 and Opteron (N=4), for fractions of random accesses from 100% (S=1) down to 0.78% (S=128)]
Cost of Irregularity (1)
• Itanium2 and Opteron perform well for irregular accesses due to:
  • Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
  • Opteron's low memory latency from its on-chip memory controller
[Figure: slowdown for irregular access vs. M on Power3 and Power4 (N=4), for fractions of random accesses from 100% (S=1) down to 0.39% (S=256)]
Cost of Irregularity (2)
• Power3 and Power4 perform poorly for irregular accesses due to:
  • Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
  • Power4 requiring 4 consecutive cache-line hits to activate prefetching
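The irregular access pattern behind these S values can be sketched as follows. This is an illustrative construction (build_indirect_index is our name, not the probe's code) of an index array that stays contiguous for S entries and then jumps to a random location.

    /* Illustrative construction of Sqmat's indirect access pattern:
       the index array stays contiguous for S entries, then jumps to
       a random position.  S=1 makes every access random; large S
       approaches indirect unit stride. */
    #include <stdlib.h>

    void build_indirect_index(int *idx, int n_entries, int S)
    {
        int pos = 0;
        for (int i = 0; i < n_entries; i++) {
            if (i % S == 0)                  /* random jump every S entries */
                pos = rand() % n_entries;
            idx[i] = pos;
            pos = (pos + 1) % n_entries;     /* contiguous within a run of S */
        }
    }

    /* The kernel then touches matrix entries through the index array,
       e.g. sum += data[idx[i]], instead of data[i] directly. */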
Tolerating Irregularity
• S50
  • Start with some M at S=∞ (indirect unit stride)
  • For a given M, how large must S be to achieve at least 50% of the original performance?
• M50
  • Start with M=1, S=∞
  • At S=1 (every access random), how large must M be to achieve 50% of the original performance?
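A hypothetical driver for measuring a metric like S50 is sketched below; run_sqmat is an assumed wrapper (not part of the original probe) that runs Sqmat with the given N, M, and S and returns performance in MFlop/s.

    /* Hypothetical driver for measuring S50: the smallest S that
       recovers at least 50% of near-unit-stride performance for a
       fixed M.  run_sqmat() is an assumed wrapper, not part of the
       original probe. */
    double run_sqmat(int N, int M, int S);      /* assumed probe entry point */

    int find_S50(int N, int M, int S_max)
    {
        double base = run_sqmat(N, M, S_max);   /* S_max stands in for S = infinity */
        for (int S = 1; S <= S_max; S *= 2)
            if (run_sqmat(N, M, S) >= 0.5 * base)
                return S;                       /* smallest S reaching 50% */
        return -1;                              /* never reaches 50% */
    }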
Tolerating Irregularity
• Interested in developing application-driven architectural probes for the evaluation of emerging petascale systems
Emerging Architectures
• General-purpose processors badly suited for data-intensive ops:
  • Large caches not useful if reuse is low
  • Low memory bandwidth, especially for irregular patterns
  • Superscalar methods of increasing ILP inefficient
  • Power consumption
• Application-specific ASICs: good, but expensive/slow to design
• Solution: general-purpose "memory aware" processors
  • Large number of ALUs: to exploit data parallelism
  • Huge memory bandwidth: to keep ALUs busy
  • Concurrency: overlap memory with computation
VIRAM Overview
• MIPS core (200 MHz)
• Main memory system:
  • 8 banks with 13 MB of on-chip DRAM
  • Large 6.4 GB/s on-chip peak bandwidth
• Cache-less vector unit:
  • Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  • Single-issue, in-order
• Low power consumption: 2.0 W
• Peak vector performance:
  • 1.6/3.2/6.4 Gops
  • 1.6 Gflops (single precision)
• Fabricated by IBM
• Deep pipelines mask DRAM latency
• Cray's vcc compiler adapted to VIRAM
• Simulator used for results
VIRAM Power Efficiency
• Comparable performance with a lower clock rate
• Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Imagine Overview
• "Vector VLIW" processor
• Coprocessor to an off-chip host processor
• 8 arithmetic clusters controlled in SIMD with VLIW instructions
• Central 128 KB Stream Register File (SRF) @ 32 GB/s
  • SRF can overlap computation with memory (double buffering)
  • SRF can reuse intermediate results (producer-consumer locality)
• Stream-aware memory system with 2.7 GB/s off-chip bandwidth
• 544 GB/s intercluster communication
• Host sends instructions to the stream controller; the SC issues commands to on-chip modules
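To illustrate the double-buffering idea behind the SRF, here is a generic C sketch of overlapping block transfers with computation. It is not Imagine's actual stream programming model; memcpy merely stands in for the asynchronous loads the stream controller would issue while the clusters compute.

    /* Generic double-buffering sketch of overlapping memory movement
       with computation, the idea behind the SRF.  This is NOT
       Imagine's actual stream programming model; memcpy() stands in
       for asynchronous stream loads. */
    #include <string.h>

    enum { BLOCK = 1024 };

    static void compute_block(float *dst, const float *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * src[i];           /* stand-in kernel */
    }

    void process_stream(const float *in, float *out, int n_blocks)
    {
        static float buf[2][BLOCK];
        if (n_blocks > 0)
            memcpy(buf[0], in, sizeof buf[0]);  /* "load" block 0 */
        for (int b = 0; b < n_blocks; b++) {
            int cur = b & 1, nxt = cur ^ 1;
            if (b + 1 < n_blocks)               /* stage block b+1 ...           */
                memcpy(buf[nxt], in + (b + 1) * BLOCK, sizeof buf[nxt]);
            compute_block(out + b * BLOCK, buf[cur], BLOCK);  /* ... while block b is processed */
        }
    }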
VIRAM and Imagine
• Imagine has an order of magnitude higher peak performance
• VIRAM has twice the memory bandwidth and lower power consumption
• Notice the peak Flop/Word ratios
What Does This Have to Do with PIMs?
• Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)
• Imagine much faster for long streams, slower for short ones
SQMAT: Performance Crossover
• Large number of ops per word (3x3 matrices squared 10 times)
• Crossover point: L=64 (cycles), L=256 (MFlop)
• Imagine's power becomes apparent at L=1024: almost 4x VIRAM
• Codes at this end of the spectrum greatly benefit from the Imagine architecture
Stencil Probe
• Stencil computations are at the core of a wide range of scientific applications
  • Applications include Jacobi solvers, complex multigrid, and block-structured AMR
• We are developing an adaptable stencil probe to model this range of computations (see the sketch below)
• Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
• Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors:
  • Small blocks inhibit automatic prefetching performance
  • Modern large on-chip L2/L3 caches have bandwidth similar to L1
• Currently investigating tradeoffs between blocking and prefetching (paper in preparation)
• Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
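For reference, the sketch below shows the kind of kernel the stencil probe models: a 2D 5-point Jacobi sweep (illustrative only; the probe's actual parameters and generated code are not shown). The long unit-stride inner loop is what engages hardware prefetch, and small L1 tiles cut exactly these streams short.

    /* Illustrative 2D 5-point Jacobi sweep of the kind the stencil
       probe models.  The long unit-stride inner loop over j is what
       engages hardware prefetch engines; small L1 tiles break these
       streams, which is why L1 tiling helps little here. */
    void jacobi_sweep(const double *in, double *out, int nx, int ny)
    {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)    /* contiguous in memory */
                out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] +
                                          in[(i + 1) * ny + j] +
                                          in[i * ny + (j - 1)] +
                                          in[i * ny + (j + 1)]);
    }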