
Performance Engineering Research Institute (DOE SciDAC)

Katherine Yelick, LBNL and UC Berkeley. Performance Engineering: Enabling Petascale Science. Petascale computing is about delivering performance to scientists, and maximizing performance is getting harder as systems and scientific software grow more complicated.




Presentation Transcript


  1. Performance Engineering Research Institute (DOE SciDAC) Katherine Yelick LBNL and UC Berkeley

  2. Performance Engineering: Enabling Petascale Science • Petascale computing is about delivering performance to scientists • Maximizing performance is getting harder: • Systems are more complicated • O(100K) processors • Multi-core with SIMD extensions • Scientific software is more complicated • Multi-disciplinary and multi-scale • The Performance Engineering Research Institute (PERI) addresses this challenge in three ways: • Model and predict application performance • Assist SciDAC scientific code projects with performance analysis and tuning • Investigate novel strategies for automatic performance tuning [Images: IBM BlueGene at LLNL; Cray XT3 at ORNL; POP model of El Niño; Beam3D accelerator modeling]

  3. Engaging SciDAC Software Developers • Application Engagement • Work directly with DOE computational scientists • Ensure successful performance porting of scientific software • Focus PERI research on real problems • Application Liaisons • Build long-term personal relationships with PERI researchers and scientific code teams • Tiger Teams • Focus on DOE’s highest priorities • SciDAC-2 • INCITE [Images: Optimizing arithmetic kernels; Maximizing scientific throughput]

  4. Automatic Performance Tuning of Scientific Code Long-term goals of PERI • Automate the process of tuning software to maximize its performance • Reduce the performance portability challenge facing computational scientists • Address the problem that performance experts are in short supply • Build upon forty years of human experience and recent success with linear algebra libraries [Image: PERI automatic tuning framework]

  5. Participating Institutions Lead PI: Bob Lucas Institutions: Argonne National Laboratory Lawrence Berkeley National Laboratory Lawrence Livermore National Laboratory Oak Ridge National Laboratory Rice University University of California at San Diego University of Maryland University of North Carolina University of Southern California University of Tennessee

  6. Major Tuning Activities in PERI • Triage: discover tuning targets • HPC Toolkit • PAPI • Library-based tuning • Dense linear algebra • Sparse linear algebra • Join the PERI community with your favorite kernels • Application-based tuning • With user support • Automatic source-based tuning • PERI Portal

  7. Triage Tools HPC Toolkit: Tool to identify tuning opportunities (Mellor-Crummey, Rice) • Ease of use • no manual code instrumentation • handle large multi-lingual codes with 3rd party libraries • Perform detailed measurements • both communication and computation • many granularities: node, core, thread, procedure, loop, and statement levels • Collect performance data in a scalable way • data size not linear in execution time: sample-based rather than trace-based • Avoid perturbing execution • user selectable overhead at an arbitrarily low level • Identify inefficiencies in code: • Parallel inefficiencies: load imbalance, serialization, communication overhead • Computation inefficiencies: pipeline stalls, memory bottlenecks, etc.

  8. On-line Hardware Monitoring PAPI: machine-independent Performance API (Shirley Moore & Jack Dongarra, UTK) • Multi-substrate support recently added to PAPI enables simultaneous monitoring of • On-processor counters • Off-processor counters (e.g., network counters) • Temperature sensors • Heterogeneous multi-core hybrid systems • Online monitoring will help enable runtime adaptation

  9. Major Tuning Activities in PERI • Triage: discover tuning targets • HPC Toolkit (Rice) • PAPI (UTK) • Library-based tuning • Dense linear algebra • Sparse linear algebra • Join the PERI community with your favorite kernels • Application-based tuning • User-identified tuning parameters • Automatic source-based tuning • PERI Portal

  10. Dense Linear Algebra ATLAS: Auto-tuned library for dense linear algebra (UTK) • Performance portability across processors • New: massively multi-threaded and multi-core architectures, which require • Asynchrony (e.g., lookahead) • Modern vectorization (SIMD extensions) • Hiding of memory latency • Overlap of communication with computation • Hand techniques being automated • Better search algorithms (POET) and parallel search • See later talks today

  11. Sparse Linear Algebra • OSKI: Optimized Sparse Kernel Interface (Yelick, Demmel, Vuduc) • Extra work can improve performance • Cannot make decisions offline: need matrix structure • Example: • Pad 3x3 blocks with zeros • “Fill ratio” = 1.5 • PIII speedup: 1.5x Joint work with Bebop group
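The 3x3 padding example above can be made concrete. Below is a small sketch (not OSKI code) that computes the fill ratio register blocking would pay for a given r x c block size; the toy matrix is chosen so the ratio comes out to the slide's 1.5.

```python
# Sketch: the "fill ratio" OSKI-style register blocking pays for a given
# r x c block size -- stored entries (blocks padded with explicit zeros)
# divided by true nonzeros. Illustrative only, not the OSKI implementation.

def fill_ratio(nonzeros, r, c):
    """nonzeros: set of (row, col) index pairs of a sparse matrix."""
    blocks = {(i // r, j // c) for (i, j) in nonzeros}
    stored = len(blocks) * r * c   # each touched block is stored full, zero-padded
    return stored / len(nonzeros)

# Toy matrix whose nonzeros leave 1/3 of each 3x3 block empty:
nnz = {(i, j) for i in range(6) for j in range(6) if (i + j) % 3 != 0}
print(fill_ratio(nnz, 3, 3))   # -> 1.5, matching the slide's example
```

Whether the extra FLOPs on the padded zeros pay off depends on the speedup of the blocked kernel, which is why OSKI decides at run time once it sees the matrix structure.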

  12. Optimizations Available in OSKI • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 3x over CSR • Multiple vectors (SpMM): 7x over CSR • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels (focus for new work) • AAᵀx, AᵀAx: 4x over CSR, 1.8x over RB • Aᵏx (matrix powers): 2x over CSR, 1.5x over RB • New: vector and multicore support, better code generation Joint work with Bebop group, see R. Vuduc PhD thesis

  13. OSKI-PETSc Proof-of-Concept Results • Recent work by Rich Vuduc • Integration of OSKI into PETSc • Example matrix: Accelerator cavity design • N ~ 1 M, ~40 M non-zeros (SLAC matrix) • 2x2 dense block substructure • Uses register blocking and symmetry • Improves performance of local computation • Preliminary speedup numbers: • 8 node Xeon cluster • Speedup: 1.6x Joint work with Bebop group, see R. Vuduc PhD thesis

  14. Stencil Computations • Stencils have simple inner loops • Typically ~1 FLOP per load • Runs at small fraction of peak (<15%)! • Strategies: minimize cache misses • Cache blocked within 1 sweep (aka iteration or timestep) • Time skewed (and cache blocked): merge across iterations • Cache oblivious: use recursive decomposition across iterations • Observations: • Iteration merging only works in some algorithms! • Reducing misses does not always minimize time • Prefetch is at least as important as caching (unit stride runs) • Big difference between 1D, 2D, and 3D results in practice Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
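To make the first strategy concrete, here is a minimal sketch of cache blocking a single sweep of a 2D 5-point stencil. Block sizes bi and bj are the tuning parameters; a Python version only illustrates the loop structure, not the cache behavior the slide is about.

```python
# Sketch: cache blocking one Jacobi sweep of a 2D 5-point stencil
# ("cache blocked within 1 sweep"). Blocking reorders the traversal
# but must not change the answer -- a useful correctness check when tuning.

def stencil_blocked(a, n, bi, bj):
    """One sweep over an n x n grid (a is a flat row-major list);
    interior points only, boundary copied through."""
    out = a[:]
    for ii in range(1, n - 1, bi):              # loop over cache blocks
        for jj in range(1, n - 1, bj):
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, n - 1)):
                    out[i * n + j] = 0.25 * (a[(i - 1) * n + j] + a[(i + 1) * n + j]
                                             + a[i * n + j - 1] + a[i * n + j + 1])
    return out
```

Setting bi = bj = n recovers the unblocked sweep, so the blocked and unblocked versions can be diffed directly while searching the (bi, bj) space.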

  15. Time Skewing and Blocking

  16. Cache Oblivious

  17. Time Skewed/Blocked vs. Cache Oblivious

  18. Tuning for the Cell Architecture • Cell will be used in the PS3 → high volume • Current system problems: • Off-chip bandwidth and power • Double precision floating point interface • Only a problem for algorithms with high computational efficiency • Small fix to floating point hardware interface (Cell+) • Memory system • Software controlled memory improves bandwidth and power usage • Allows finely tuned deep prefetching, efficient cache utilization • Predictable performance and less architectural complexity • But increases programming complexity Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  19. Scientific Kernels on Cell (double precision) Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  20. Power Efficiency of Cell Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  21. Major Tuning Activities in PERI • Triage: discover tuning targets • HPC Toolkit • PAPI • Library-based tuning • Dense linear algebra • Sparse linear algebra • Join the PERI community with your favorite kernels • Application-based tuning • User-identified tuning parameters • Automatic source-based tuning (see talks by Mary Hall and Dan Quinlan later today) • PERI Portal

  22. User-Assisted Runtime Performance Optimization • Active Harmony: Runtime optimization (Hollingsworth, UMD) • Automatic library selection (code) • Monitor library performance • Switch library if necessary • Automatic performance tuning (parameter) • Monitor system performance • Adjust runtime parameters • Results • Cluster-based web service – up to 16% improvement • POP – up to 17% improvement • GS2 – up to 3.4x faster • New: improved search algorithms • New: Tuning of component-based software (Norris & Hovland, ANL)

  23. Active Harmony Example: Parallel Ocean Program (POP) • Parameterized over block dimension • Problem size – 3600x2400 on 480 processors (NERSC IBM SP - seaborg) • Up to 15% improvement in execution time
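A toy version of what Active Harmony does for POP's block dimension: treat it as a runtime-tunable parameter and search for the value minimizing measured execution time. The cost function below is an invented stand-in for a real timed run, and real Active Harmony uses smarter online search algorithms rather than exhaustive evaluation.

```python
# Sketch of parameter tuning in the Active Harmony style. "measure" is a
# hypothetical cost curve, not POP data: too small a block -> loop overhead,
# too large -> cache misses.

def tune(measure, candidates):
    """Pick the candidate with the lowest measured cost (exhaustive search)."""
    return min(candidates, key=measure)

def measure(block_dim, sweet_spot=48):
    return abs(block_dim - sweet_spot) + 1.0 / block_dim

best = tune(measure, [8, 16, 24, 32, 48, 64, 96])
print(best)   # -> 48
```

In the real system the measurement is a monitored production or training run, so the search converges while the application executes.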

  24. Summary • Many solved and open problems in automatic tuning • Berkeley-specific activities • OSKI: extra floating point work can save time • Stencil tuning: beware of prefetch • New architectures: vectors, Cell, integration with PETSc for clusters • PERI • Basic auto-tuning framework • Library and application level tuning; online and offline • Source transformations and domain specific generators • Many forms of “guidance” to control optimizations • Performance modeling and application engagement too • Opportunities to collaborate

  25. PERI Automatic Tuning Framework (diagram; www.peri-scidac.org) • Inputs: external software, source code, and guidance (models, hardware information, annotations, assertions) • PERI automatic tuning tools: triage, analysis, domain-specific code generation, transformations • Code generation and code selection feed application assembly • Training runs and production execution produce runtime performance data, stored in a persistent database and used for runtime adaptation

  26. Runtime Tuning with Components Tuning of component-based software (Norris & Hovland, ANL) • Initial implementation of intra-component performance analysis for CQoS (FY08, Q1) • Intra-component analysis for generating performance models of single components (FY08, Q4) • Define specification for CQoS support for component SciDAC apps (FY09, Q1)

  27. Source-Based Empirical Optimization Source-based optimization (Quinlan/LLNL, Hall/ISI) • Combine Model-guided and empirical optimization • compiler models prune unprofitable solutions • empirical data provide accurate measure of optimization impact • Supporting framework • kernel extraction tools (code isolator) • Prototypes for C/C++ and F90 • experience base to maintain previous results (later years) • More talks on these projects later today

  28. FY07 Plan for Source Tuning (USC) • 1. From proposal: • “Develop optimizations for imperfectly nested loops” • STATUS: New transformation framework underway, uses Omega • 2. Nearer term milestone for out-year deliverable • Frontend to kernel extraction tool in Open64 • PLAN: Instrument original application code to collect loop bounds, control flow and input data • 3. New! • Targeting locality + multimedia extension architectures (AltiVec and SSE3) • STATUS: Preliminary MM results on AltiVec, working on SSE3 • 4. Need help for out-year milestone! • Apply to “selected loops in SciDAC applications” • Plan for identifying these?

  29. 1. Extending our framework EXAMPLE: LU DECOMPOSITION
do k = 1, n-1
  do i = k+1, n
    a(i, k) = a(i, k) / a(k, k)
  end do
  do i = k+1, n
    do j = k+1, n
      a(i, j) = a(i, j) – a(i, k)*a(k, j)
    end do
  end do
end do
• Scope • imperfect loop nests • triangular loop bounds • multiple loop nests • larger set of transformations • Requirements • uniform representation of transformations • facilitates composing transformations • representation of unbound optimization parameters • constraints for optimization parameters • Kelly et al., Cohen et al.
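For reference, the LU loop nest above transcribes directly into the following (no pivoting). The triangular bounds starting at k+1 and the two sibling inner nests under the k loop are exactly the imperfect-nest features the extended transformation framework must represent.

```python
# Right-looking LU factorization without pivoting, a direct transcription
# of the Fortran nest on the slide. L (unit diagonal) and U are stored
# in place over the input matrix a.

def lu_in_place(a):
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                    # column of L
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]      # trailing-submatrix update
    return a
```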

  30. 2. Kernel Extraction (Code Isolator) [LCPC ‘04] • Original program: main() calls OutlineFunc(<InputParameters>) around the code fragment to be executed • Isolated program: OutlineFunc(<InputParameters>) body, with <InputParameters> = SetInitialDataValues, plus StoreInitialDataValues, CaptureMachineState, SetMachineState • Outlining: automated in SUIF • State capture: manual, to be automated in Open64

  31. Source-Based Tuning (LLNL)

  32. Tuning (UNC)

  33. Pop Quiz • What are: • HPCToolkit • Rose • BeBOP • Active Harmony • PAPI • Atlas • Eco • OSKI • Should we have an index for PERI portal? • 1-sentence description of each tool and relationship to PERI (if any) • Is google good enough?

  34. Challenges • Technical challenges • Multicore, etc.: • This is under control (modulo inability to control SMPs) • Would do well to target key apps kernels • Scaling, communication, load imbalance: • Less experience here, but some results for communication tuning • Load imbalance is likely to be an app-level problem • Management challenges • Tuning core community is as described • Minor: Mary and Dan need to work closely • Lots of “outer circle” tuning activities • Relationship to modeling • Identify specific opportunities

  35. PERI Tuning • Motivation: • Hand-tuning is too time-consuming, and is not robust… • Especially as we move towards Petascale • Topology may matter, multi-core memory systems are complicated, memory and network latency are not getting better • Solution: automatic performance tuning • Use tools to identify tuning opportunities • Build apps to be auto-tunable by parameters + tool • Use auto-tuned libraries in applications • Tune full applications using source-to-source transforms

  36. How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) Joint work with Bebop group, see R. Vuduc PhD thesis

  37. How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) 1. Build for Target Arch. 2. Benchmark Generated code variants Benchmark data Joint work with Bebop group, see R. Vuduc PhD thesis

  38. How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) 1. Build for Target Arch. 2. Benchmark Workload from program monitoring History Matrix Generated code variants Benchmark data 1. Evaluate Models Heuristic models Joint work with Bebop group, see R. Vuduc PhD thesis

  39. How OSKI Tunes (Overview) • Install-time (offline), in the library: 1. Build for target arch. 2. Benchmark generated code variants, producing benchmark data • Run-time, in the application: workload from program monitoring, history, and the matrix feed 1. Evaluate heuristic models 2. Select data struct. & code • To user: matrix handle for kernel calls • Extensibility: advanced users may write & dynamically add “code variants” and “heuristic models” to the system. Joint work with Bebop group, see R. Vuduc PhD thesis
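The install-time / run-time split sketched in the OSKI overview above can be caricatured in a few lines. All names and numbers here are invented for illustration; OSKI's real benchmark data and heuristic models are far richer.

```python
# Toy sketch of offline-benchmark + run-time-heuristic variant selection.
# BENCH plays the role of install-time benchmark data (invented speedups);
# the heuristic uses run-time matrix structure (the fill ratio) to decide
# whether register blocking pays off.

BENCH = {"csr": 1.0, "bcsr_3x3": 2.2, "diag": 1.9}   # offline speedups vs CSR

def select_variant(fill_ratio_3x3):
    # Blocking does fill_ratio times more work; worth it only if the
    # benchmarked speedup exceeds that overhead.
    if fill_ratio_3x3 < BENCH["bcsr_3x3"]:
        return "bcsr_3x3"
    return "csr"

print(select_variant(1.5))   # -> bcsr_3x3: 1.5x extra work < 2.2x speedup
```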

  40. OSKI-PETSc Performance: Accel. Cavity

  41. Stanza Triad • Even smaller benchmark for prefetching • Derived from STREAM Triad • Stanza (L) is the length of a unit-stride run:
while i < arraylength
  for each L-element stanza
    A[i] = scalar * X[i] + Y[i]
  skip k elements
• Pattern: 1) do L triads 2) skip k elements 3) do L triads … Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
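A direct transcription of the pseudocode above (illustrative only; the real benchmark is timed C code, where the prefetch effects actually show up):

```python
# Stanza Triad access pattern: L-element unit-stride triads separated by
# k-element skips. This version just demonstrates which indices each
# stanza touches; it does not measure bandwidth.

def stanza_triad(A, X, Y, scalar, L, k):
    i, n = 0, len(A)
    while i < n:
        for j in range(i, min(i + L, n)):   # one unit-stride stanza
            A[j] = scalar * X[j] + Y[j]
        i += L + k                          # skip k elements
    return A
```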

  42. Stanza Triad Results • Without prefetching: • performance would be independent of stanza length; flat line at STREAM peak • our results show performance depends on stanza length Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  43. Cost Model for Stanza Triad • First cache line in every L-element stanza is not prefetched • assign cost Cnon-prefetched • get value from Stanza Triad with L=cache line size • The rest of the cache lines are prefetched • assign cost Cprefetched • value from Stanza Triad with large L • Total Cost: Cost = #non-prefetched * Cnon-prefetched + #prefetched * Cprefetched Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
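The cost formula above reduces to a few lines of code. The cache-line size and the two per-line costs below are illustrative numbers, not measurements from the slides.

```python
# The Stanza Triad prefetch cost model: each stanza pays one expensive
# non-prefetched cache-line cost, and the remaining lines the cheaper
# prefetched cost. Longer stanzas amortize the expensive first line.

def stanza_cost(L, line, c_nonpref, c_pref):
    lines = (L + line - 1) // line          # cache lines per L-element stanza
    return 1 * c_nonpref + (lines - 1) * c_pref

# Per-element cost falls as the stanza length grows (invented costs):
per_elem_short = stanza_cost(16, 8, 100.0, 20.0) / 16
per_elem_long = stanza_cost(1024, 8, 100.0, 20.0) / 1024
print(per_elem_short > per_elem_long)   # -> True
```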

  44. Stanza Triad Model • Works well, except on Itanium2 Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  45. Stanza Triad Memory Model 2 • Instead of a 2-point piecewise function, use 3 points • Models all 3 architectures Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  46. Stencil Cost Model for Cache Blocking Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  47. Stencil Probe Cost Model Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  48. Stencil Probe Cost Model Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  49. Stencil Cache Blocking Summary Speedups only with • large grid sizes • unblocked unit-stride dimension • Currently applying to cross-iteration optimizations Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
