
Performance Engineering Research Institute (DOE SciDAC)

Katherine Yelick, LBNL and UC Berkeley. Performance Engineering: Enabling Petascale Science. Petascale computing is about delivering performance to scientists, and maximizing performance is getting harder as systems and scientific software grow more complicated.




Presentation Transcript


  1. Performance Engineering Research Institute (DOE SciDAC) Katherine Yelick LBNL and UC Berkeley

  2. Performance Engineering: Enabling Petascale Science • Petascale computing is about delivering performance to scientists • Maximizing performance is getting harder: • Systems are more complicated • O(100K) processors • Multi-core with SIMD extensions • Scientific software is more complicated • Multi-disciplinary and multi-scale • The Performance Engineering Research Institute (PERI) addresses this challenge in three ways: • Model and predict application performance • Assist SciDAC scientific code projects with performance analysis and tuning • Investigate novel strategies for automatic performance tuning [Images: IBM BlueGene at LLNL; Cray XT3 at ORNL; POP model of El Niño; Beam3D accelerator modeling]

  3. Engaging SciDAC Software Developers • Application Engagement • Work directly with DOE computational scientists • Ensure successful performance porting of scientific software • Focus PERI research on real problems • Application Liaisons • Build long-term personal relationships with PERI researchers and scientific code teams • Tiger Teams • Focus on DOE’s highest priorities • SciDAC-2 • INCITE [Images: Optimizing arithmetic kernels; Maximizing scientific throughput]

  4. Automatic Performance Tuning of Scientific Code Long-term goals of PERI • Automate the process of tuning software to maximize its performance • Reduce the performance portability challenge facing computational scientists • Address the problem that performance experts are in short supply • Build upon forty years of human experience and recent success with linear algebra libraries [Image: PERI automatic tuning framework]

  5. Participating Institutions Lead PI: Bob Lucas Institutions: Argonne National Laboratory Lawrence Berkeley National Laboratory Lawrence Livermore National Laboratory Oak Ridge National Laboratory Rice University University of California at San Diego University of Maryland University of North Carolina University of Southern California University of Tennessee

  6. Major Tuning Activities in PERI • Triage: discover tuning targets • HPC Toolkit • PAPI • Library-based tuning • Dense linear algebra • Sparse linear algebra • Join the PERI community with your favorite kernels • Application-based tuning • With user support • Automatic source-based tuning • PERI Portal

  7. Triage Tools HPC Toolkit: Tool to identify tuning opportunities (Mellor-Crummey, Rice) • Ease of use • no manual code instrumentation • handle large multi-lingual codes with 3rd party libraries • Perform detailed measurements • both communication and computation • many granularities: node, core, thread, procedure, loop, and statement levels • Collect performance data in a scalable way • data size not linear in execution time: sample-based rather than trace-based • Avoid perturbing execution • user selectable overhead at an arbitrarily low level • Identify inefficiencies in code: • Parallel inefficiencies: load imbalance, serialization, communication overhead • Computation inefficiencies: pipeline stalls, memory bottlenecks, etc.

  8. On-line Hardware Monitoring PAPI: machine-independent Performance API (Shirley Moore & Jack Dongarra, UTK) • Multi-substrate support recently added to PAPI enables simultaneous monitoring of • On-processor counters • Off-processor counters (e.g., network counters) • Temperature sensors • Heterogeneous multi-core hybrid systems • Online monitoring will help enable runtime adaptation

  9. Major Tuning Activities in PERI • Triage: discover tuning targets • HPC Toolkit (Rice) • PAPI (UTK) • Library-based tuning • Dense linear algebra • Sparse linear algebra • Join the PERI community with your favorite kernels • Application-based tuning • User-identified tuning parameters • Automatic source-based tuning • PERI Portal

  10. Dense Linear Algebra ATLAS: Auto-tuned library for dense linear algebra (UTK) • Performance portability across processors • New: massively multi-threaded and multi-core architectures, which require • Asynchrony (e.g., lookahead) • Modern vectorization (SIMD extensions) • Hiding of memory latency • Overlap of communication with computation • Hand techniques being automated • Better search algorithms (POET) and parallel search • See later talks today

  11. Sparse Linear Algebra • OSKI: Optimized Sparse Kernel Interface (Yelick, Demmel, Vuduc) • Extra work can improve performance • Cannot make decisions offline: need matrix structure • Example: • Pad 3x3 blocks with zeros • “Fill ratio” = 1.5 • PIII speedup: 1.5x Joint work with Bebop group
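The 3x3 padding example above can be made concrete. Below is a small sketch (not OSKI code) that computes the fill ratio register blocking would pay for a given r x c block size; the toy matrix is chosen so the ratio comes out to the slide's 1.5.

```python
# Sketch: the "fill ratio" OSKI-style register blocking pays for a given
# r x c block size -- stored entries (blocks padded with explicit zeros)
# divided by true nonzeros. Illustrative only, not the OSKI implementation.

def fill_ratio(nonzeros, r, c):
    """nonzeros: set of (row, col) index pairs of a sparse matrix."""
    blocks = {(i // r, j // c) for (i, j) in nonzeros}
    stored = len(blocks) * r * c   # each touched block is stored full, zero-padded
    return stored / len(nonzeros)

# Toy matrix whose nonzeros leave 1/3 of each 3x3 block empty:
nnz = {(i, j) for i in range(6) for j in range(6) if (i + j) % 3 != 0}
print(fill_ratio(nnz, 3, 3))   # -> 1.5, matching the slide's example
```

Whether the extra FLOPs on the padded zeros pay off depends on the speedup of the blocked kernel, which is why OSKI decides at run time once it sees the matrix structure.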

  12. Optimizations Available in OSKI • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 3x over CSR • Multiple vectors (SpMM): 7x over CSR • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels (focus for new work) • AAᵀx, AᵀAx: 4x over CSR, 1.8x over RB • Aᵏx (matrix powers): 2x over CSR, 1.5x over RB • New: vector and multicore support, better code generation Joint work with Bebop group, see R. Vuduc PhD thesis

  13. OSKI-PETSc Proof-of-Concept Results • Recent work by Rich Vuduc • Integration of OSKI into PETSc • Example matrix: Accelerator cavity design • N ~ 1 M, ~40 M non-zeros (SLAC matrix) • 2x2 dense block substructure • Uses register blocking and symmetry • Improves performance of local computation • Preliminary speedup numbers: • 8 node Xeon cluster • Speedup: 1.6x Joint work with Bebop group, see R. Vuduc PhD thesis

  14. Stencil Computations • Stencils have simple inner loops • Typically ~1 FLOP per load • Runs at small fraction of peak (<15%)! • Strategies: minimize cache misses • Cache blocked within 1 sweep (aka iteration or timestep) • Time skewed (and cache blocked): merge across iterations • Cache oblivious: use recursive decomposition across iterations • Observations: • Iteration merging only works in some algorithms! • Reducing misses does not always minimize time • Prefetch is at least as important as caching (unit stride runs) • Big difference between 1D, 2D, and 3D results in practice Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
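To make the first strategy concrete, here is a minimal sketch of cache blocking a single sweep of a 2D 5-point stencil. Block sizes bi and bj are the tuning parameters; a Python version only illustrates the loop structure, not the cache behavior the slide is about.

```python
# Sketch: cache blocking one Jacobi sweep of a 2D 5-point stencil
# ("cache blocked within 1 sweep"). Blocking reorders the traversal
# but must not change the answer -- a useful correctness check when tuning.

def stencil_blocked(a, n, bi, bj):
    """One sweep over an n x n grid (a is a flat row-major list);
    interior points only, boundary copied through."""
    out = a[:]
    for ii in range(1, n - 1, bi):              # loop over cache blocks
        for jj in range(1, n - 1, bj):
            for i in range(ii, min(ii + bi, n - 1)):
                for j in range(jj, min(jj + bj, n - 1)):
                    out[i * n + j] = 0.25 * (a[(i - 1) * n + j] + a[(i + 1) * n + j]
                                             + a[i * n + j - 1] + a[i * n + j + 1])
    return out
```

Setting bi = bj = n recovers the unblocked sweep, so the blocked and unblocked versions can be diffed directly while searching the (bi, bj) space.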

  15. Time Skewing and Blocking

  16. Cache Oblivious

  17. Time Skewed/Blocked vs. Cache Oblivious

  18. Tuning for the Cell Architecture • Cell will be used in the PS3 → high volume • Current system problems: • Off-chip bandwidth and power • Double precision floating point interface • Only a problem for algorithms with high computational efficiency • Small fix to floating point hardware interface (Cell+) • Memory system • Software controlled memory improves bandwidth and power usage • Allows finely tuned deep prefetching, efficient cache utilization • Predictable performance and less architectural complexity • But increases programming complexity Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  19. Scientific Kernels on Cell (double precision) Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  20. Power Efficiency of Cell Joint work with S. Williams, J. Shalf, L. Oliker, P. Husbands, S. Kamil

  21. Major Tuning Activities in PERI • Triage: discover tuning targets • HPC Toolkit • PAPI • Library-based tuning • Dense linear algebra • Sparse linear algebra • Join the PERI community with your favorite kernels • Application-based tuning • User-identified tuning parameters • Automatic source-based tuning (see talks by Mary Hall and Dan Quinlan later today) • PERI Portal

  22. User-Assisted Runtime Performance Optimization • Active Harmony: Runtime optimization (Hollingsworth, UMD) • Automatic library selection (code) • Monitor library performance • Switch library if necessary • Automatic performance tuning (parameter) • Monitor system performance • Adjust runtime parameters • Results • Cluster-based web service – up to 16% improvement • POP – up to 17% improvement • GS2 – up to 3.4x faster • New: improved search algorithms • New: Tuning of component-based software (Norris & Hovland, ANL)

  23. Active Harmony Example: Parallel Ocean Program (POP) • Parameterized over block dimension • Problem size – 3600x2400 on 480 processors (NERSC IBM SP - seaborg) • Up to 15% improvement in execution time
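A toy version of what Active Harmony does for POP's block dimension: treat it as a runtime-tunable parameter and search for the value minimizing measured execution time. The cost function below is an invented stand-in for a real timed run, and real Active Harmony uses smarter online search algorithms rather than exhaustive evaluation.

```python
# Sketch of parameter tuning in the Active Harmony style. "measure" is a
# hypothetical cost curve, not POP data: too small a block -> loop overhead,
# too large -> cache misses.

def tune(measure, candidates):
    """Pick the candidate with the lowest measured cost (exhaustive search)."""
    return min(candidates, key=measure)

def measure(block_dim, sweet_spot=48):
    return abs(block_dim - sweet_spot) + 1.0 / block_dim

best = tune(measure, [8, 16, 24, 32, 48, 64, 96])
print(best)   # -> 48
```

In the real system the measurement is a monitored production or training run, so the search converges while the application executes.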

  24. Summary • Many solved and open problems in automatic tuning • Berkeley-specific activities • OSKI: extra floating point work can save time • Stencil tuning: beware of prefetch • New architectures: vectors, Cell, integration with PETSc for clusters • PERI • Basic auto-tuning framework • Library and application level tuning; online and offline • Source transformations and domain specific generators • Many forms of “guidance” to control optimizations • Performance modeling and application engagement too • Opportunities to collaborate

  25. PERI Automatic Tuning Framework (diagram; www.peri-scidac.org) • Inputs: external software, source code, and guidance (models, hardware information, annotations, assertions) • PERI automatic tuning tools: triage, analysis, domain-specific code generation, transformations • Code generation and code selection feed application assembly • Training runs and production execution produce runtime performance data, stored in a persistent database and used for runtime adaptation

  26. Runtime Tuning with Components Tuning of component-based software (Norris & Hovland, ANL) • Initial implementation of intra-component performance analysis for CQoS (FY08, Q1) • Intra-component analysis for generating performance models of single components (FY08, Q4) • Define specification for CQoS support for component SciDAC apps (FY09, Q1)

  27. Source-Based Empirical Optimization Source-based optimization (Quinlan/LLNL, Hall/ISI) • Combine Model-guided and empirical optimization • compiler models prune unprofitable solutions • empirical data provide accurate measure of optimization impact • Supporting framework • kernel extraction tools (code isolator) • Prototypes for C/C++ and F90 • experience base to maintain previous results (later years) • More talks on these projects later today

  28. FY07 Plan for Source Tuning (USC) • 1. From proposal: • “Develop optimizations for imperfectly nested loops” • STATUS: New transformation framework underway, uses Omega • 2. Nearer term milestone for out-year deliverable • Frontend to kernel extraction tool in Open64 • PLAN: Instrument original application code to collect loop bounds, control flow and input data • 3. New! • Targeting locality + multimedia extension architectures (AltiVec and SSE3) • STATUS: Preliminary MM results on AltiVec, working on SSE3 • 4. Need help for out-year milestone! • Apply to “selected loops in SciDAC applications” • Plan for identifying these?

  29. 1. Extending our framework EXAMPLE: LU DECOMPOSITION
do k = 1, n-1
  do i = k+1, n
    a(i, k) = a(i, k) / a(k, k)
  end do
  do i = k+1, n
    do j = k+1, n
      a(i, j) = a(i, j) – a(i, k)*a(k, j)
    end do
  end do
end do
• Scope • imperfect loop nests • triangular loop bounds • multiple loop nests • larger set of transformations • Requirements • uniform representation of transformations • facilitates composing transformations • representation of unbound optimization parameters • constraints for optimization parameters • Kelly et al., Cohen et al.
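For reference, the LU loop nest above transcribes directly into the following (no pivoting). The triangular bounds starting at k+1 and the two sibling inner nests under the k loop are exactly the imperfect-nest features the extended transformation framework must represent.

```python
# Right-looking LU factorization without pivoting, a direct transcription
# of the Fortran nest on the slide. L (unit diagonal) and U are stored
# in place over the input matrix a.

def lu_in_place(a):
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                    # column of L
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]      # trailing-submatrix update
    return a
```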

  30. 2. Kernel Extraction (Code Isolator) [LCPC ‘04] • Original program: main() calls OutlineFunc(<InputParameters>) around the code fragment to be executed • Isolated program: OutlineFunc(<InputParameters>) body, with <InputParameters> = SetInitialDataValues, plus StoreInitialDataValues, CaptureMachineState, SetMachineState • Outlining: automated in SUIF • State capture: manual, to be automated in Open64

  31. Source-Based Tuning (LLNL)

  32. Tuning (UNC)

  33. Pop Quiz • What are: • HPCToolkit • Rose • BeBOP • Active Harmony • PAPI • Atlas • Eco • OSKI • Should we have an index for PERI portal? • 1-sentence description of each tool and relationship to PERI (if any) • Is google good enough?

  34. Challenges • Technical challenges • Multicore, etc.: • This is under control (modulo inability to control SMPs) • Would do well to target key apps kernels • Scaling, communication, load imbalance: • Less experience here, but some results for communication tuning • Load imbalance is likely to be an app-level problem • Management challenges • Tuning core community is as described • Minor: Mary and Dan need to work closely • Lots of “outer circle” tuning activities • Relationship to modeling • Identify specific opportunities

  35. PERI Tuning • Motivation: • Hand-tuning is too time-consuming, and is not robust… • Especially as we move towards Petascale • Topology may matter, multi-core memory systems are complicated, memory and network latency are not getting better • Solution: automatic performance tuning • Use tools to identify tuning opportunities • Build apps to be auto-tunable by parameters + tool • Use auto-tuned libraries in applications • Tune full applications using source-to-source transforms

  36. How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) Joint work with Bebop group, see R. Vuduc PhD thesis

  37. How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) 1. Build for Target Arch. 2. Benchmark Generated code variants Benchmark data Joint work with Bebop group, see R. Vuduc PhD thesis

  38. How OSKI Tunes (Overview) Application Run-Time Library Install-Time (offline) 1. Build for Target Arch. 2. Benchmark Workload from program monitoring History Matrix Generated code variants Benchmark data 1. Evaluate Models Heuristic models Joint work with Bebop group, see R. Vuduc PhD thesis

  39. How OSKI Tunes (Overview) • Install-time (offline), in the library: 1. Build for target arch. 2. Benchmark generated code variants, producing benchmark data • Run-time, in the application: workload from program monitoring, history, and the matrix feed 1. Evaluate heuristic models 2. Select data struct. & code • To user: matrix handle for kernel calls • Extensibility: advanced users may write & dynamically add “code variants” and “heuristic models” to the system. Joint work with Bebop group, see R. Vuduc PhD thesis
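The install-time / run-time split sketched in the OSKI overview above can be caricatured in a few lines. All names and numbers here are invented for illustration; OSKI's real benchmark data and heuristic models are far richer.

```python
# Toy sketch of offline-benchmark + run-time-heuristic variant selection.
# BENCH plays the role of install-time benchmark data (invented speedups);
# the heuristic uses run-time matrix structure (the fill ratio) to decide
# whether register blocking pays off.

BENCH = {"csr": 1.0, "bcsr_3x3": 2.2, "diag": 1.9}   # offline speedups vs CSR

def select_variant(fill_ratio_3x3):
    # Blocking does fill_ratio times more work; worth it only if the
    # benchmarked speedup exceeds that overhead.
    if fill_ratio_3x3 < BENCH["bcsr_3x3"]:
        return "bcsr_3x3"
    return "csr"

print(select_variant(1.5))   # -> bcsr_3x3: 1.5x extra work < 2.2x speedup
```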

  40. OSKI-PETSc Performance: Accel. Cavity

  41. Stanza Triad • Even smaller benchmark for prefetching • Derived from STREAM Triad • Stanza (L) is the length of a unit-stride run:
while i < arraylength
  for each L-element stanza
    A[i] = scalar * X[i] + Y[i]
  skip k elements
• Pattern: 1) do L triads 2) skip k elements 3) do L triads … Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
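A direct transcription of the pseudocode above (illustrative only; the real benchmark is timed C code, where the prefetch effects actually show up):

```python
# Stanza Triad access pattern: L-element unit-stride triads separated by
# k-element skips. This version just demonstrates which indices each
# stanza touches; it does not measure bandwidth.

def stanza_triad(A, X, Y, scalar, L, k):
    i, n = 0, len(A)
    while i < n:
        for j in range(i, min(i + L, n)):   # one unit-stride stanza
            A[j] = scalar * X[j] + Y[j]
        i += L + k                          # skip k elements
    return A
```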

  42. Stanza Triad Results • Without prefetching: • performance would be independent of stanza length; flat line at STREAM peak • our results show performance depends on stanza length Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  43. Cost Model for Stanza Triad • First cache line in every L-element stanza is not prefetched • assign cost Cnon-prefetched • get value from Stanza Triad with L=cache line size • The rest of the cache lines are prefetched • assign cost Cprefetched • value from Stanza Triad with large L • Total Cost: Cost = #non-prefetched * Cnon-prefetched + #prefetched * Cprefetched Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
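The cost formula above reduces to a few lines of code. The cache-line size and the two per-line costs below are illustrative numbers, not measurements from the slides.

```python
# The Stanza Triad prefetch cost model: each stanza pays one expensive
# non-prefetched cache-line cost, and the remaining lines the cheaper
# prefetched cost. Longer stanzas amortize the expensive first line.

def stanza_cost(L, line, c_nonpref, c_pref):
    lines = (L + line - 1) // line          # cache lines per L-element stanza
    return 1 * c_nonpref + (lines - 1) * c_pref

# Per-element cost falls as the stanza length grows (invented costs):
per_elem_short = stanza_cost(16, 8, 100.0, 20.0) / 16
per_elem_long = stanza_cost(1024, 8, 100.0, 20.0) / 1024
print(per_elem_short > per_elem_long)   # -> True
```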

  44. Stanza Triad Model • Works well, except on Itanium2 Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  45. Stanza Triad Memory Model 2 • Instead of a 2-point piecewise function, use 3 points • Models all 3 architectures Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  46. Stencil Cost Model for Cache Blocking Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  47. Stencil Probe Cost Model Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  48. Stencil Probe Cost Model Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams

  49. Stencil Cache Blocking Summary Speedups only with • large grid sizes • unblocked unit-stride dimension • Currently applying to cross-iteration optimizations Joint work with S. Kamil, J. Shalf, K. Datta, L. Oliker, S. Williams
