
pOSKI: A Library to Parallelize OSKI

Ankit Jain, Berkeley Benchmarking and OPtimization (BeBOP) Project, bebop.cs.berkeley.edu, EECS Department, University of California, Berkeley. April 28, 2008.

Presentation Transcript


  1. pOSKI: A Library to Parallelize OSKI • Ankit Jain • Berkeley Benchmarking and OPtimization (BeBOP) Project • bebop.cs.berkeley.edu • EECS Department, University of California, Berkeley • April 28, 2008

  2. Outline • pOSKI Goals • OSKI Overview • (Slides adopted from Rich Vuduc’s SIAM CSE 2005 Talk) • pOSKI Design • Parallel Benchmark • MPI-SpMV

  3. pOSKI Goals • Provide a simple serial interface to exploit the parallelism in sparse kernels (focus on SpMV for now) • Target multicore architectures • Hide the complex process of parallel tuning while exposing its cost • Use heuristics, where possible, to limit the search space • Design it to be extensible so it can be used in conjunction with other parallel libraries (e.g. ParMETIS) • Take Sam’s work and present it in a distributable, easy-to-use format.

  4. Outline • pOSKI Goals • OSKI Overview • (Slides adopted from Rich Vuduc’s SIAM CSE 2005 Talk) • pOSKI Design • Parallel Benchmark • MPI-SpMV

  5. OSKI: Optimized Sparse Kernel Interface • Sparse kernels tuned for user’s matrix & machine • Hides complexity of run-time tuning • Low-level BLAS-style functionality • Sparse matrix-vector multiply (SpMV), triangular solve (TrSV), … • Includes fast locality-aware kernels: A^T*A*x, … • Target: cache-based superscalar uniprocessors • Faster than standard implementations • Up to 4x faster SpMV, 1.8x TrSV, 4x A^T*A*x • Written in C (can call from Fortran) • Note: All speedups listed are from sequential platforms in 2005
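As a point of reference for the kernel this slide is about, here is a minimal sketch of an untuned SpMV over compressed sparse row (CSR) storage. It is written in plain Python for readability and is not OSKI code; the function name `spmv_csr` and the small example matrix are assumptions made for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A*x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column positions) and values (numerical entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access to x through col_idx is what makes SpMV
            # memory-bound and sensitive to the data structure.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix (illustrative):
# [[2, 0, 1],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
```

The speedups quoted on the slide are relative to a baseline of this general shape.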

  6. How OSKI Tunes (Overview) • Library Install-Time (offline) • 1. Build for Target Arch. • 2. Benchmark • Outputs: generated code variants, benchmark data • Application Run-Time • Inputs: matrix, workload from program monitoring, history • 1. Evaluate Models (heuristic models) • 2. Select Data Struct. & Code • To user: matrix handle for kernel calls • Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.

  7. Cost of Tuning • Non-trivial run-time tuning cost: up to ~40 mat-vecs • Dominated by conversion time • Design point: user calls “tune” routine explicitly • Exposes cost • Tuning time limited using estimated workload • Provided by user or inferred by library • User may save tuning results • To apply on future runs with similar matrix • Stored in “human-readable” format
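To make the "exposes cost" point concrete, the sketch below estimates how many SpMV calls are needed before tuning pays for itself. It assumes the tuning cost is expressed in units of one untuned mat-vec (the slide's "up to ~40") and that the tuned kernel is faster by a constant factor; the function name and the sample speedups are illustrative, not measurements.

```python
def break_even_calls(tuning_cost_in_matvecs, speedup):
    """Number of SpMV calls n at which tuned and untuned total time are equal.

    Untuned total time: n
    Tuned total time:   tuning_cost + n / speedup
    Setting them equal gives n = tuning_cost * speedup / (speedup - 1).
    """
    if speedup <= 1.0:
        # Tuning never pays off if the tuned kernel is not faster.
        return float("inf")
    return tuning_cost_in_matvecs * speedup / (speedup - 1.0)


for s in (1.25, 2.0, 4.0):
    print(f"speedup {s}x: break-even after {break_even_calls(40, s):.0f} calls")
```

Under these assumptions a modest 1.25x speedup needs a few hundred calls to amortize a 40 mat-vec tuning cost, which is why limiting tuning time by estimated workload matters.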

  8. Optimizations Available in OSKI • Optimizations for SpMV (bold heuristics) • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 3x over CSR • Multiple vectors (SpMM): 7x over CSR • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB • A*x: 2x over CSR, 1.5x over RB • Note: All speedups listed are from sequential platforms in 2005
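Register blocking stores the matrix as small dense r x c blocks, which means padding partially filled blocks with explicit zeros. A common way to quantify that overhead is the fill ratio (stored entries divided by true nonzeros). The sketch below computes it from a nonzero pattern; the function name and example pattern are assumptions for illustration, not OSKI internals.

```python
def fill_ratio(coords, r, c):
    """Stored entries (including explicit zeros) divided by true nonzeros
    when the pattern `coords` is covered by aligned r x c blocks."""
    coords = set(coords)
    # Each nonzero (i, j) falls into block (i // r, j // c); every touched
    # block is stored in full, i.e. as r * c values.
    blocks = {(i // r, j // c) for i, j in coords}
    return len(blocks) * r * c / len(coords)


# Illustrative pattern: a dense 2x2 block plus one isolated entry.
pattern = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]
print(fill_ratio(pattern, 1, 1))  # no padding
print(fill_ratio(pattern, 2, 2))  # the isolated entry drags in 3 zeros
```

A fill ratio close to 1 suggests the block size matches the matrix structure; a large one means the extra flops and memory traffic may outweigh the benefit of blocking.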

  9. Outline • pOSKI Goals • OSKI Overview • (Slides adopted from Rich Vuduc’s SIAM CSE 2005 Talk) • pOSKI Design • Parallel Benchmark • MPI-SpMV

  10. How pOSKI Tunes (Overview) • [Diagram; labels recovered from the slide] • Library Install-Time (offline): Build for Target Arch., Parallel Benchmark, Parallel Heuristic models, Parallel Benchmark data • Application Run-Time (online), P-OSKI: Matrix, Load Balance, Submatrix, Submatrix, Evaluate Parallel Model, Accumulate Handles • To user: pOSKI matrix handle for kernel calls • OSKI: Build for Target Arch., Benchmark, Generated code variants, Benchmark data, History, Evaluate Models, Heuristic models, Select Data Struct. & Code • OSKI_Matrix_Handle for kernel calls

  11. Where the Optimizations Occur

  12. Current Implementation • The Serial Interface • Represents SP composition of ParLab Proposal. The parallelism is hidden under the covers • Each serial-looking function call triggers a set of parallel events • Manages its own thread pool • Supports up to the number of threads supported by underlying hardware • Manages thread and data affinity
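The idea of a serial-looking call that fans out to a private thread pool can be sketched as follows. This is a structural illustration only: the class `ParallelCsrMatrix`, the nonzero-balanced row partitioning, and the use of Python's `ThreadPoolExecutor` are assumptions, not the pOSKI implementation, and Python threads neither run this loop in parallel (because of the interpreter lock) nor control thread or data affinity.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows_by_nnz(row_ptr, n_parts):
    """Split rows into n_parts contiguous ranges with roughly equal nonzeros."""
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, n_parts):
        target = total_nnz * p / n_parts
        i = bounds[-1]
        # Advance until the nonzeros before row i reach this part's share.
        while i < n_rows and row_ptr[i] < target:
            i += 1
        bounds.append(i)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


def _multiply_rows(row_ptr, col_idx, values, x, y, lo, hi):
    """Compute y[lo:hi] for a CSR matrix; row ranges are disjoint per worker."""
    for i in range(lo, hi):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc


class ParallelCsrMatrix:
    """Hypothetical handle: a serial-looking matvec backed by a private pool."""

    def __init__(self, row_ptr, col_idx, values, n_threads):
        self._csr = (row_ptr, col_idx, values)
        self._ranges = partition_rows_by_nnz(row_ptr, n_threads)
        self._pool = ThreadPoolExecutor(max_workers=n_threads)

    def matvec(self, x):
        row_ptr, col_idx, values = self._csr
        y = [0.0] * (len(row_ptr) - 1)
        futures = [
            self._pool.submit(_multiply_rows, row_ptr, col_idx, values, x, y, lo, hi)
            for lo, hi in self._ranges
        ]
        for f in futures:
            f.result()  # wait, and re-raise any worker exception
        return y

    def close(self):
        self._pool.shutdown()
```

The caller only sees `matvec`; the partitioning and the pool stay hidden behind the handle, which mirrors the "parallelism is hidden under the covers" point on the slide.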

  13. Additional Future Interface • The Parallel Interface • Represents PP composition of ParLab Proposal • Meant for expert programmers • Can be used to share threads with other parallel libraries • No guarantees of thread or data affinity management • Example Use: y ← A^T*A*x codes • Alternate between SpMV and preconditioning step. • Share threads between P-OSKI (for SpMV) and some parallel preconditioning library • Example Use: UPC Code • Explicitly Parallel Execution Model • User partitions matrix based on some information P-OSKI would not be able to infer

  14. Thread and Data Affinity (1/3) • Cache Coherent Non Uniform Memory Access (ccNUMA) times on Modern MultiSocket, MultiCore architectures • Modern OS’ ‘first touch’ policy in allocating memory • Thread Migration between Locality Domains is expensive • In ccNUMA, a locality domain is a set of processor cores together with locally connected memory which can be accessed without resorting to a network of any kind. • For now, we have to deal with these OS policies ourselves. The ParLab OS Group is trying to solve these problems in order to hide such issues from the programmer.

  15. Thread and Data Affinity (2/3) • The Problem with malloc() and free() • malloc() first looks for free pages on heap and then requests OS to allocate new pages. • If available free pages reside on a different locality domain, malloc() still allocates them • Autotuning codes are malloc() and free() intensive so this is a huge problem

  16. Thread and Data Affinity (3/3) • The solution: Managing our own memory • One large chunk (heap) allocated at the beginning of tuning per locality domain • Size of this heap controlled by user input through environment variable [P_OSKI_HEAP_IN_GB=2] • Rare case: allocated space is not big enough • Stop all threads • Free all allocated memory • Grow the amount of space significantly across all threads and locality domains • Print a strong warning to the user
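The "one large chunk per locality domain" idea amounts to a simple arena (bump) allocator. The sketch below shows the bookkeeping only, under stated assumptions: the class `DomainHeap`, the 64-byte alignment, and the helper `heap_size_from_env` are hypothetical names, and a Python `bytearray` does not model NUMA placement, which in a real implementation would depend on which thread first touches the pages.

```python
import os


class DomainHeap:
    """Bump allocator over one preallocated buffer (one per locality domain)."""

    def __init__(self, size_bytes):
        self.buffer = bytearray(size_bytes)
        self.offset = 0

    def alloc(self, n_bytes, align=64):
        # Round the current offset up to the next multiple of `align`.
        start = (self.offset + align - 1) // align * align
        if start + n_bytes > len(self.buffer):
            # The slide's "rare case": the caller would stop all threads,
            # free everything, grow the heaps, and warn the user.
            raise MemoryError("domain heap exhausted")
        self.offset = start + n_bytes
        return memoryview(self.buffer)[start:start + n_bytes]

    def reset(self):
        """Release every allocation at once, e.g. between tuning trials."""
        self.offset = 0


def heap_size_from_env(env=os.environ, default_gb=2.0):
    """Read the heap size in bytes from P_OSKI_HEAP_IN_GB (default 2 GB)."""
    return int(float(env.get("P_OSKI_HEAP_IN_GB", default_gb)) * 2**30)
```

Because tuning allocates and frees many temporary data structures, resetting a single offset is also far cheaper than repeated `malloc()` and `free()` calls, in addition to keeping each thread's data in its own domain.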

  17. Outline • pOSKI Goals • OSKI Overview • (Slides adopted from Rich Vuduc’s SIAM CSE 2005 Talk) • pOSKI Design • Parallel Benchmark • MPI-SpMV

  18. Justification • OSKI’s Benchmarking • Single threaded • All the memory bandwidth is given to this one thread • pOSKI’s Benchmarking • Benchmarks 1, 2, 4, … threads (based on hardware limit) in parallel • Each thread uses up memory bandwidth, which resembles run-time more accurately • When each instance of OSKI chooses appropriate data structures and algorithms, it uses the data from this parallel benchmark
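The structure of such a parallel benchmark can be sketched as below: run the same kernel on 1, 2, 4, … threads at once, releasing them together so they compete for memory bandwidth while being timed. The function names are assumptions, and this is only a structural sketch: in CPython the interpreter lock serializes pure-Python kernels, so the timings it produces would not reflect real bandwidth contention.

```python
import threading
import time


def thread_counts(hardware_limit):
    """Powers of two up to the hardware thread limit: 1, 2, 4, ..."""
    counts, k = [], 1
    while k <= hardware_limit:
        counts.append(k)
        k *= 2
    return counts


def benchmark_concurrently(kernel, n_threads):
    """Run `kernel` on n_threads at once; return per-thread elapsed seconds."""
    barrier = threading.Barrier(n_threads)
    elapsed = [0.0] * n_threads

    def worker(tid):
        barrier.wait()  # start all threads together so they contend
        t0 = time.perf_counter()
        kernel()
        elapsed[tid] = time.perf_counter() - t0

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return elapsed
```

A driver would call `benchmark_concurrently` once per entry of `thread_counts(limit)` and record the results so that each per-thread tuner can consult the data matching the thread count actually in use.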

  19. Results (1/2) • Takeaways: • Parallel Benchmark performs at worst 2% worse than Regular but can perform as much as 13% better. • Incorporating a NUMA_MALLOC interface within OSKI is of utmost importance because without that performance is unpredictable. STATUS: In Progress • Superscalar speedups of > 4X, why?

  20. Results (2/2) • Justifies Need for Search • Need Heuristics to reduce this since the multicore search space is expanding exponentially

  21. Outline • pOSKI Goals • OSKI Overview • (Slides adopted from Rich Vuduc’s SIAM CSE 2005 Talk) • pOSKI Design • Parallel Benchmark • MPI-SpMV

  22. Goals • Target: MultiNode, MultiCore architectures • Design: Build an MPI-layer on top of pOSKI • MPI is a starting point • Tuning Parameters: • Balance of Pthreads and MPI tasks • Rajesh has found for collectives, the balance is not always clear • Identifying if there are potential performance gains by assigning some of the threads (or cores) to only handle sending/receiving of messages • Status: • Just started, should have initial version in next few weeks • Future Work: • Explore UPC for communication • Distributed Load Balancing, Workload Generation

  23. Questions? • pOSKI Goals • OSKI Overview • pOSKI Design • Parallel Benchmark • MPI-SpMV

  24. Extra Slides Motivation for Tuning

  25. Motivation: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • [Figure annotation: 8x8 dense substructure]

  26. Speedups on Itanium 2: The Need for Search • [Figure: performance in Mflop/s; labeled points: Reference, Best: 4x2]

  27. Extra Slides Some Current Multicore Machines

  28. Rad Lab Opteron

  29. Niagara 2 (Victoria Falls)

  30. Nersc Power5 [Bassi]

  31. Cell Processor

  32. Extra Slides SpBLAS and OSKI Interfaces

  33. SpBLAS Interface • Create a matrix handle • Assert matrix properties • Insert matrix entries • Signal the end of matrix creation • Call operations on the handle • Destroy the handle • [Slide annotation, arrow lost: Tune here]

  34. OSKI Interface • The basic OSKI interface has a subset of the matrix creation interface of the Sparse BLAS, exposes the tuning step explicitly, and supports a few extra kernels (e.g., A^(T)*A*x). • The OSKI interface was designed with the intent of implementing the Sparse BLAS using OSKI under-the-hood.

  35. Extra Slides Other Ideas for pOSKI

  36. Challenges of a Parallel Automatic Tuner • Search space increases exponentially with number of parameters • Parallelization across Architectural Parameters • Across Multiple Threads • Across Multiple Cores • Across Multiple Sockets • Parallelizing the data of a given problem • Across Rows, Across Columns, or Checkerboard • Based on User Input in v1 • Future Versions can integrate ParMETIS or other graph partitioners
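The three data partitioning schemes named on this slide differ only in how the row and column index ranges are cut. A minimal sketch, assuming equal-sized cuts and hypothetical function names: rows-only uses a p x 1 grid, columns-only a 1 x p grid, and checkerboard a p_rows x p_cols grid.

```python
def split_range(n, parts):
    """Cut range(n) into `parts` contiguous pieces whose sizes differ by <= 1."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def partition(n_rows, n_cols, p_rows, p_cols):
    """Return ((row_lo, row_hi), (col_lo, col_hi)) for each submatrix.

    p_cols == 1 gives a row partitioning, p_rows == 1 a column
    partitioning, and both > 1 a checkerboard.
    """
    return [
        (rows, cols)
        for rows in split_range(n_rows, p_rows)
        for cols in split_range(n_cols, p_cols)
    ]


print(partition(4, 4, 2, 2))  # 2 x 2 checkerboard of a 4 x 4 matrix
```

Equal-sized cuts ignore where the nonzeros are; balancing by nonzero count or using a graph partitioner such as ParMETIS, as the slide suggests for future versions, would replace `split_range` with something structure-aware.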

  37. A Memory Footprint Minimization Heuristic • The Problem: Search space is too large → auto-tuning takes too long • The rate of increase in aggregate memory bandwidth over time is not as fast as the rate of increase in processing power per machine. • Our Two Step Tuning Process: • 1. Calculate the top 20% memory efficient configurations on Thread 0 • 2. Each thread finds its optimal block size for its sub-matrix from the list in Step 1
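The two-step process can be sketched as follows, under several assumptions that are not stated on the slide: configurations are register block sizes (r, c), memory efficiency is estimated from a blocked-CSR footprint with 8-byte values and 4-byte indices, Step 1 looks at the whole matrix pattern, and `measure` is a hypothetical callback that returns a per-thread performance score (higher is better).

```python
import math


def bcsr_footprint_bytes(coords, n_rows, r, c, value_bytes=8, index_bytes=4):
    """Estimated size of an r x c blocked CSR copy of the pattern `coords`."""
    blocks = {(i // r, j // c) for i, j in coords}
    block_rows = -(-n_rows // r)  # ceiling division
    per_block = r * c * value_bytes + index_bytes  # values + one column index
    return len(blocks) * per_block + (block_rows + 1) * index_bytes


def shortlist(coords, n_rows, candidates, keep_fraction=0.2):
    """Step 1: keep the most memory-efficient fraction of the candidates."""
    ranked = sorted(candidates, key=lambda rc: bcsr_footprint_bytes(coords, n_rows, *rc))
    keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:keep]


def pick_per_thread(submatrices, short, measure):
    """Step 2: each thread picks its best block size from the shortlist."""
    return [max(short, key=lambda rc: measure(sub, *rc)) for sub in submatrices]
```

With a 20% shortlist, each thread benchmarks only one fifth of the block sizes, which is the point of the heuristic: since SpMV is limited by memory bandwidth, the smallest footprints are assumed to be the most promising candidates.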
