Performance Understanding, Prediction, and Tuning at the Berkeley Institute for Performance Studies (BIPS) Katherine Yelick, BIPS Director Lawrence Berkeley National Laboratory and U. C. Berkeley, EECS Dept. National Science Foundation
Challenges to Performance Two trends in High End Computing • Increasingly complicated systems • Multiple forms of parallelism • Many levels of memory hierarchy • Complex systems software in between • Increasingly sophisticated algorithms • Unstructured meshes and sparse matrices • Adaptivity in time and space • Multi-physics models lead to hybrid approaches • Conclusion: Deep understanding of performance at all levels is important
BIPS Institute Goals • Bring together researchers on all aspects of performance engineering • Use performance understanding to: • Improve application performance • Compare architectures for application suitability • Influence the design of processors, networks and compilers • Identify algorithmic needs
BIPS Approaches • Benchmarking and Analysis • Measure performance • Identify opportunities for improvements in software, hardware, and algorithms • Modeling • Predict performance on future machines • Understand performance limits • Tuning • Improve performance • By hand or with automatic self-tuning tools
Multi-Level Analysis [Diagram: benchmark levels ordered by system size and complexity: Micro-Benchmarks, Compact Apps, Full Apps, Next Gen Apps] • Full Applications • What users want • Do not reveal impact of features • Compact Applications • Can be ported with modest effort • Easily match phases of full applications • Microbenchmarks • Isolate architectural features • Hard to tie to real applications
Projects Within BIPS • Application evaluation on vector processors • APEX: Application Performance Characterization Benchmarking • BeBOP: Berkeley Benchmarking and Optimization Group • Architectural probes for alternative architectures • LAPACK: Linear Algebra Package • PERC: Performance Engineering Research Center • Top500 • ViVA: Virtual Vector Architectures
Application Evaluation of Vector Systems • Two vector architectures: • The Japanese Earth Simulator • The Cray X1 • Comparison to “commodity”-based systems • IBM SP, Power4 • SGI Altix • Ongoing study of DOE applications • CACTUS (Astrophysics): 100,000 lines, grid based • PARATEC (Material Science): 50,000 lines, Fourier space • LBMHD (Plasma Physics): 1,500 lines, grid based • GTC (Magnetic Fusion): 5,000 lines, particle based • MADCAP (Cosmology): 5,000 lines, dense lin. alg. • Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, H. Shan
APEX-MAP Benchmark • Goal: Quantify the effects of temporal and spatial locality • Focus on memory system and network performance • Graphs over temporal and spatial locality axes • Show performance valleys/cliffs
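As an illustration of what such a probe varies, the sketch below generates an APEX-MAP-style access stream in C: a spatial-locality knob L (length of each contiguous run) and a temporal-locality knob alpha (skew of a power-law distribution over run start addresses). The parameter names, the power-law form, and the function apex_like_sweep are illustrative assumptions, not the benchmark's actual interface.

```c
/* Sketch of an APEX-MAP-style memory probe (illustrative only).
 * L sets spatial locality (contiguous run length); alpha sets temporal
 * locality (skew of a power-law over run start addresses). Requires n > L. */
#include <stdlib.h>
#include <math.h>

double apex_like_sweep(const double *data, size_t n, size_t L,
                       double alpha, long accesses)
{
    double sum = 0.0;
    for (long a = 0; a < accesses; a++) {
        /* Power-law-skewed start index: small alpha is close to uniform
         * (little reuse); large alpha concentrates reuse near index 0. */
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        size_t start = (size_t)(pow(u, alpha) * (double)(n - L));
        for (size_t i = 0; i < L; i++)      /* contiguous run of length L */
            sum += data[start + i];
    }
    return sum;   /* returned so the compiler cannot discard the loads */
}
```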
Application Kernel Benchmarks • Microbenchmarks are good for: • Identifying architecture/compiler bottlenecks • Optimization opportunities • Application benchmarks are good for: • Machine selection for specific apps • In between: Benchmarks to capture important behavior in real applications • Sparse matrices: SPMV benchmark • Stencil operations: Stencil probe • Possible future: sorting, narrow datatype ops,…
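For the stencil side, a minimal sketch of the kind of kernel such a probe times is shown below, assuming a 7-point Jacobi sweep on a 3D grid; the real stencil probe's kernel, boundary handling, and options may differ.

```c
/* Sketch of a stencil-probe-style kernel: one Jacobi sweep of a 7-point
 * 3D stencil on an nx*ny*nz grid (boundary cells left untouched). */
#define IDX(i, j, k, ny, nz) ((((long)(i) * (ny) + (j)) * (nz)) + (k))

void stencil_sweep(const double *in, double *out,
                   int nx, int ny, int nz, double alpha, double beta)
{
    for (int i = 1; i < nx - 1; i++)
        for (int j = 1; j < ny - 1; j++)
            for (int k = 1; k < nz - 1; k++)
                out[IDX(i, j, k, ny, nz)] =
                    alpha * in[IDX(i, j, k, ny, nz)] +
                    beta  * (in[IDX(i - 1, j, k, ny, nz)] +
                             in[IDX(i + 1, j, k, ny, nz)] +
                             in[IDX(i, j - 1, k, ny, nz)] +
                             in[IDX(i, j + 1, k, ny, nz)] +
                             in[IDX(i, j, k - 1, ny, nz)] +
                             in[IDX(i, j, k + 1, ny, nz)]);
}
```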
Sparse Matrix Vector Multiply (SPMV) • Sparse matrix algorithms • Increasingly important in applications • Challenge memory systems: poor locality • Many matrices have structure, e.g., dense sub-blocks, that can be exploited • Benchmarking SPMV • NAS CG and SciMark use a random matrix • Not reflective of most real problems • Benchmark challenge: • Ship real matrices: cumbersome & inflexible • Build “realistic” synthetic matrices
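For reference, the baseline kernel an SpMV benchmark exercises looks like the compressed sparse row (CSR) sketch below; the indirect access x[colind[k]] is what makes locality poor. This is a generic textbook kernel, not the benchmark's exact code.

```c
/* Reference CSR sparse matrix-vector multiply, y = A*x. */
void spmv_csr(int nrows, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* indirect load of x */
        y[i] = sum;
    }
}
```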
Importance of Using Blocked Matrices [Chart: speedup of best-case blocked matrix vs. unblocked]
Generating Blocked Matrices • Our approach: uniformly distributed random block structure, each block r×c • Collect data for r and c from 1 to 12 • Validation: Can our random matrices simulate “typical” matrices? • 44 matrices from various applications • 1: Dense matrix in sparse format • 2-17: Finite Element Method (FEM) matrices • 2-9: single block size, 10-17: multiple block sizes • 18-44: non-FEM • Summarization: Weighted by occurrence in test suite (ongoing)
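A minimal sketch of the "uniform random block structure" idea follows: scatter dense r×c blocks at uniformly random block-aligned positions. Handling of duplicate positions, conversion to a blocked sparse format, and the weighting against the 44-matrix suite are omitted; the struct and function names are made up for illustration.

```c
/* Sketch: place nnzb dense r x c blocks at uniform random block-aligned
 * positions in an n x n matrix, in block-coordinate form. Assumes n >= r, c. */
#include <stdlib.h>

typedef struct { int brow, bcol; double *vals; } block_t;   /* r*c values */

block_t *random_blocked_matrix(int n, int r, int c, int nnzb)
{
    int brows = n / r, bcols = n / c;
    block_t *blocks = malloc((size_t)nnzb * sizeof *blocks);
    for (int b = 0; b < nnzb; b++) {
        blocks[b].brow = rand() % brows;          /* uniform block row */
        blocks[b].bcol = rand() % bcols;          /* uniform block col */
        blocks[b].vals = malloc((size_t)r * c * sizeof(double));
        for (int k = 0; k < r * c; k++)           /* dense r x c block */
            blocks[b].vals[k] = (double)rand() / RAND_MAX;
    }
    return blocks;
}
```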
Itanium 2 prediction
UltraSparc III prediction
Sample summary results (Apple G5, 1.8 GHz)
Selected SpMV benchmark results • Raw results • Which machine is fastest • Results scaled by machine’s peak floating-point rate • Mitigates chip technology factors • Influenced by compiler issues • Fraction of peak memory bandwidth • Use the STREAM benchmark for “attainable peak” • How close to this bound is SpMV running?
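The "fraction of attainable peak" view can be summarized with a back-of-envelope bound like the one below: CSR SpMV performs 2 flops per nonzero and must stream at least an 8-byte value plus a 4-byte column index per nonzero, so STREAM bandwidth caps the achievable flop rate. The 12-bytes-per-nonzero figure ignores vector and row-pointer traffic and is an assumption of this sketch, not BIPS's exact model.

```c
/* Rough bandwidth bound for CSR SpMV: 2 flops and >= 12 bytes of matrix
 * traffic (8-byte value + 4-byte index) per nonzero. */
double spmv_mflops_upper_bound(double stream_mb_per_s)
{
    const double flops_per_nnz = 2.0;
    const double bytes_per_nnz = 12.0;   /* matrix data only */
    return stream_mb_per_s * flops_per_nnz / bytes_per_nnz;  /* MFLOP/s */
}
/* Example: 2000 MB/s of attainable bandwidth bounds SpMV near 333 MFLOP/s. */
```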
Automatic Performance Tuning • Performance depends on machine, kernel, matrix • Matrix known at run-time • Best data structure + implementation can be surprising • Filling in explicit zeros can • Reduce storage • Improve performance • PIII example: 50% more nonzeros, 50% faster • BeBOP approach: empirical modeling and search • Up to 4x speedups and 31% of peak for SpMV • Many optimization techniques for SpMV • Several other kernels: triangular solve, AᵀA*x, Aᵏ*x • Proof-of-concept: Integrate with Omega3P • Release OSKI Library, integrate into PETSc
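Register blocking, the workhorse optimization here, can be sketched as below for a fixed 2×2 block size (BCSR): each stored block is a dense tile, padded with explicit zeros where needed, so the inner loop keeps pieces of x in registers and loads one column index per block instead of per nonzero. OSKI-style tuning generates many r×c variants and selects one by empirical search; this is a single hand-written variant, not OSKI's generated code.

```c
/* 2x2 register-blocked (BCSR) SpMV, y = A*x.
 * Assumes the matrix has 2*nbrows rows and block-aligned columns;
 * partially filled tiles carry explicit zeros. */
void spmv_bcsr_2x2(int nbrows, const int *browptr, const int *bcolind,
                   const double *bval, const double *x, double *y)
{
    for (int ib = 0; ib < nbrows; ib++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = browptr[ib]; k < browptr[ib + 1]; k++) {
            const double *b = &bval[4 * k];     /* 2x2 tile, row-major */
            double x0 = x[2 * bcolind[k]];      /* reused across the tile */
            double x1 = x[2 * bcolind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}
```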
Summary of Optimizations • Optimizations for SpMV (numbers shown are maximums) • Register blocking (RB): up to 4x • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x • Reordering to create dense structure + splitting: 2x • Symmetry: 2.8x • Cache blocking: 6x • Multiple vectors (SpMM): 7x • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x • Higher-level kernels • AAᵀ*x, AᵀA*x: 4x • A²*x: 2x over CSR, 1.5x • Future: automatic tuning for vectors
Architectural Probes • Understanding memory system performance • Interaction with processor architecture: • Number of registers • Arithmetic units (parallelism) • Prefetching • Cache size, structure, policies • APEX-MAP: memory and network system • Sqmat: processor features included
Impact of Indirection • Opteron and Power3/4 see less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values • Itanium 2 shows a high penalty for indirection • Results from the Sqmat “probe” • Unit-stride access via indirection (S=1)
Tolerating Irregularity • S50 (Penalty for random access) • S is the length of each unit-stride run • Start with S=∞ (indirect unit stride) • How large must S be to achieve at least 50% of this performance? • All done for a fixed computational intensity • CI50 (Hide random access penalty using high computational intensity) • CI is computational intensity, controlled by the number of squarings (M) per matrix • Start with M=1, S=∞ • At S=1 (every access random), how large must M be to achieve 50% of this performance? • For both, lower numbers are better
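The S50 metric can be read off measured data as in the sketch below: take the performance at the largest measured stanza length as a stand-in for S=∞ and report the smallest S that reaches half of it (CI50 is analogous with M in place of S). This only illustrates the metric's definition; it is not the actual Sqmat harness.

```c
/* Given performance perf[i] at increasing stanza lengths S[i], with the
 * last point standing in for S = infinity, return the smallest S that
 * reaches 50% of that reference (lower is better). */
long s50_from_measurements(const long *S, const double *perf, int npts)
{
    double target = 0.5 * perf[npts - 1];   /* 50% of ~unit-stride perf */
    for (int i = 0; i < npts; i++)
        if (perf[i] >= target)
            return S[i];
    return -1;                              /* never reached */
}
```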
Tolerating Irregularity
Memory System Observations • Caches are important • Important gap has moved: • between L3/memory, not L1/L2 • Prefetching increasingly important • Limited and finicky • Effect may overwhelm cache optimizations if blocking increases non-unit stride access • Sparse codes: matrix volume is key factor • Not the indirect loads
Ongoing Vector Investigation • How much hardware support is needed for vector-like performance? • Can small changes to a conventional processor get this effect? • Role of compilers/software • Related to Power5 effort • Latency hiding in software • Prefetch engines easily confused • Sparse matrix (random) and grid-based (strided) applications are the targets • Currently investigating simulator tools and any emerging hardware
Summary • High level goals: • Understand future HPC architecture options that are commercially viable • Can minimal hardware extensions improve effectiveness for scientific applications? • Various technologies • Current, future, academic • Various performance analysis techniques • Application level benchmarks • Application kernel benchmarks (SpMV, stencil) • Architectural probes • Performance modeling and prediction
People within BIPS • Jonathan Carter, Kaushik Datta, James Demmel, Joe Gebis, Paul Hargrove, Parry Husbands, Shoaib Kamil, Bill Kramer, Rajesh Nishtala, Leonid Oliker, John Shalf, Hongzhang Shan, Horst Simon, David Skinner, Erich Strohmaier, Rich Vuduc, Mike Welcome, Sam Williams, Katherine Yelick • And many collaborators outside Berkeley Lab/Campus
Sqmat overview • Java code generator produces unrolled C code • Stream of matrices • Square each matrix M times • M controls computational intensity (CI) - the ratio between flops and memory accesses • Each matrix is size NxN • N controls working set size: 2N² registers required per matrix. N is varied to cover the observable register set size. • Two storage formats: • Direct storage: Sqmat’s matrix entries stored contiguously in memory • Indirect: entries accessed through an indirection vector. “Stanza length” S controls the degree of indirection [Diagram: stream of NxN matrices, accessed S entries in a row]
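A plain-C sketch of the operation Sqmat times is given below; the real benchmark emits fully unrolled C from its Java generator and controls the stanza length S through the layout of the indirection vector, which is only hinted at here.

```c
/* Sketch of the Sqmat kernel: gather each N x N matrix through an
 * indirection vector, square it M times in registers/cache (M sets the
 * computational intensity), then scatter it back. Assumes N <= 16. */
void sqmat_indirect(double *entries, const int *idx,
                    int nmat, int N, int M)
{
    double tmp[16 * 16], sq[16 * 16];
    for (int m = 0; m < nmat; m++) {
        const int *ix = &idx[(long)m * N * N];
        for (int i = 0; i < N * N; i++)          /* gather via indirection */
            tmp[i] = entries[ix[i]];
        for (int s = 0; s < M; s++) {            /* square M times */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++)
                        sum += tmp[i * N + k] * tmp[k * N + j];
                    sq[i * N + j] = sum;
                }
            for (int i = 0; i < N * N; i++) tmp[i] = sq[i];
        }
        for (int i = 0; i < N * N; i++)          /* scatter result back */
            entries[ix[i]] = tmp[i];
    }
}
```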
Slowdown due to Indirection [Chart: slowdown vs. M (1 to 512) for Itanium 2, Opteron, Power3, and Power4] • Opteron and Power3/4 see less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values • Itanium 2 shows a high penalty for indirection • Unit-stride access via indirection (S=1)
Potential Impact on Applications: T3P • Source: SLAC [Ko] • 80% of time spent in SpMV • Relevant optimization techniques • Symmetric storage • Register blocking • On a single-processor Itanium 2 • 1.68x speedup • 532 Mflops, or 15% of 3.6 Gflop/s peak • 4.4x speedup with multiple vectors (8) • 1380 Mflops, or 38% of peak
Potential Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques • Symmetric storage • Register blocking • Reordering • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks • Nodes = columns of A • Weights(u, v) = no. of nonzeros u and v have in common • Tour = ordering of columns • Choose maximum-weight tour • See [Pinar & Heath ’97] • 2x speedup on Itanium 2, but SpMV not dominant
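The TSP-based ordering can be illustrated as below: accumulate weight(u, v) as the number of rows in which columns u and v both have a nonzero, then walk a high-weight tour so similar columns become adjacent and dense blocks form. The dense weight matrix and the greedy nearest-neighbor tour are simplifications for the sketch; Pinar & Heath's formulation and the production reordering are more sophisticated.

```c
/* Illustration only: column-similarity weights plus a greedy "tour".
 * Matrix given in CSR (rowptr/colind); order[] receives a column permutation. */
#include <stdlib.h>

void greedy_block_order(int nrows, int ncols, const int *rowptr,
                        const int *colind, int *order /* out: ncols */)
{
    int *w = calloc((size_t)ncols * ncols, sizeof(int));
    char *used = calloc((size_t)ncols, 1);
    for (int i = 0; i < nrows; i++)                  /* accumulate weights */
        for (int a = rowptr[i]; a < rowptr[i + 1]; a++)
            for (int b = a + 1; b < rowptr[i + 1]; b++) {
                w[colind[a] * ncols + colind[b]]++;
                w[colind[b] * ncols + colind[a]]++;
            }
    int cur = 0;                                     /* start tour at column 0 */
    for (int k = 0; k < ncols; k++) {
        order[k] = cur;
        used[cur] = 1;
        int best = -1, bestw = -1;
        for (int v = 0; v < ncols; v++)              /* heaviest unused neighbor */
            if (!used[v] && w[cur * ncols + v] > bestw) {
                bestw = w[cur * ncols + v];
                best = v;
            }
        if (best < 0) break;
        cur = best;
    }
    free(w); free(used);
}
```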
Emerging Architectures • General purpose processors badly suited for data intensive ops • Large caches not useful if re-use is low • Low memory bandwidth, especially for irregular patterns • Superscalar methods of increasing ILP inefficient • Power consumption • Research architectures • Berkeley IRAM: Vector and PIM chip • Stanford Imagine: Stream processor • ISI Diva: PIM with conventional processor
Sqmat on PIM Systems • Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!) • Imagine much faster for long streams, slower for short ones
Comparison to HPCC “Four Corners” [Diagram: temporal-locality vs. spatial-locality plane with a benchmark at each corner] • Corner mapping: LINPACK ↔ Sqmat S=0 M=8 N=8; STREAM ↔ Sqmat S=0 M=1 N=1; RandomAccess ↔ Sqmat S=1 M=1 N=1; FFT ↔ (future) • Opteron: LINPACK 2000 MFLOPS @1.4 GHz vs. Sqmat 2145 MFLOPS @1.6 GHz; STREAM 1969 MB/s vs. Sqmat 2047 MB/s; RandomAccess 0.00442 GUPs vs. Sqmat 0.00440 GUPs • Itanium 2: LINPACK 4.65 GFLOPs vs. Sqmat 4.47 GFLOPs; STREAM 3895 MB/s vs. Sqmat 4055 MB/s; RandomAccess 0.00484 GUPs vs. Sqmat 0.0141 GUPs