High Performance Computing: The GotoBLAS Library
HPC: numerical libraries
• Many numerically intensive applications make use of specialty libraries to perform common operations:
  • Linear algebra operators (e.g., dot products, matrix-vector multiplies)
  • Fast Fourier transforms
  • Linear solvers
• To maximize application performance (and throughput), we want these libraries to be highly optimized for each computer architecture
• One commonly used numerical library is BLAS (Basic Linear Algebra Subprograms):
  • Contains routines that provide standard building blocks for performing basic vector and matrix operations
  • Commonly used in scientific and engineering software and graphics processing
  • “High-profile” because it is used by the Linpack benchmark, which ranks the fastest supercomputers in the world (Top 500 list)
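To make the "building blocks" idea concrete, here is a minimal, unoptimized sketch of two such operations in plain Python. The function names echo the BLAS routines DDOT and DGEMV, but the signatures are simplified for illustration and are an assumption of this sketch, not the BLAS interface itself.

```python
# Reference (unoptimized) versions of two BLAS-style building blocks.
# Names echo the BLAS routines DDOT and DGEMV; the argument lists are
# simplified for illustration and do not match the real BLAS interface.

def ddot(x, y):
    """Dot product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))


def dgemv(alpha, a, x, beta, y):
    """Return alpha * A x + beta * y, with A stored as a list of rows."""
    if len(a) != len(y) or any(len(row) != len(x) for row in a):
        raise ValueError("dimension mismatch")
    return [alpha * ddot(row, x) + beta * yi for row, yi in zip(a, y)]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    print(ddot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print(dgemv(2.0, a, [1.0, 1.0], 1.0, [10.0, 20.0]))
```

An optimized library computes the same results; the difference lies in how the loops are arranged to suit a given processor's caches and floating-point units.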
HPC: GotoBLAS
• GotoBLAS is an implementation of the BLAS library developed by TACC researcher Kazushige Goto.
• Kazushige has been called “the Michael Jordan of high-performance linear algebra kernels.”
• Software is designed for all common processor architectures, including:
  • Power 4, Power 5
  • Opteron
  • Blue Gene/L
  • Pentium 4/Xeon (32-bit and 64-bit)
  • Itanium 2
HPC: GotoBLAS
• Most vendors provide their own BLAS implementation:
  • Significant development overhead incurred for new architectures
  • Large code base with many switching branches based on input sizing
• Kazushige’s approach uses a simplified model:
  • No major context switching
  • Functions separated based on performance impact
    • Non-performance-critical parts written in C
    • Crucial performance kernels written in assembly
• GotoBLAS tries to minimize the amount of assembly code:
  • Actual assembly code is very small
  • Easy to improve and debug
• Benefit: It takes only 3 to 7 days to develop a tuned BLAS for a new architecture
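The separation described above can be illustrated with a small sketch: portable driver code handles checks, allocation, and block loops, while one compact inner kernel does all the arithmetic. This is not GotoBLAS source code. The function names `inner_kernel` and `blocked_matmul` and the block size are hypothetical, and in a tuned library the kernel would be hand-written assembly, not Python.

```python
# Sketch only: portable driver loops wrapped around one small inner kernel.
# In a tuned library the kernel would be architecture-specific assembly;
# here it is plain Python so the structure can be run and checked.

def inner_kernel(a, b, c, i0, i1, j0, j1, k0, k1):
    """Performance-critical part: accumulate one block product into C in place."""
    for i in range(i0, i1):
        row_a, row_c = a[i], c[i]
        for k in range(k0, k1):
            aik = row_a[k]
            row_b = b[k]
            for j in range(j0, j1):
                row_c[j] += aik * row_b[j]


def blocked_matmul(a, b, block=2):
    """Non-critical part: argument checks, allocation, and the block loops."""
    m, kdim = len(a), len(a[0])
    if len(b) != kdim:
        raise ValueError("inner dimensions must match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for k0 in range(0, kdim, block):
            for j0 in range(0, n, block):
                inner_kernel(a, b, c,
                             i0, min(i0 + block, m),
                             j0, min(j0 + block, n),
                             k0, min(k0 + block, kdim))
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(blocked_matmul(a, b))
```

With this layout, porting to a new architecture would mean rewriting only `inner_kernel`, which is the kind of saving the bullets above describe.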
GotoBLAS DGEMM performance
DGEMM is one of the most widely used BLAS functions; it performs double-precision matrix-matrix multiplication.
Efficiency is the ratio of observed performance to the theoretical maximum.
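For reference, DGEMM computes C ← αAB + βC. The following is a minimal, unoptimized sketch of that operation in plain Python. It omits the transpose options and leading-dimension arguments of the real routine, so the signature here is a simplification assumed for illustration. The helper `dgemm_flops` is likewise hypothetical; it uses the conventional count of one multiply and one add per inner-product term.

```python
# Reference sketch of the DGEMM operation: C <- alpha * A * B + beta * C.
# Simplified signature: no transpose flags or leading dimensions.

def dgemm(alpha, a, b, beta, c):
    """Return alpha * A B + beta * C for matrices stored as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if (any(len(row) != k for row in a)
            or len(c) != m
            or any(len(row) != n for row in c)):
        raise ValueError("dimension mismatch")
    return [[alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
             for j in range(n)]
            for i in range(m)]


def dgemm_flops(m, n, k):
    """Conventional operation count: one multiply and one add per term."""
    return 2 * m * n * k


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(dgemm(2.0, a, b, 1.0, c))
    print(dgemm_flops(1000, 1000, 1000))
```

Dividing the operation count by the measured run time gives the observed floating-point rate that benchmark plots typically report.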
Example GotoBLAS comparisons
[Figure: DGEMM on POWER5 1.9 GHz, comparing GOTO, ESSL, and ATLAS. Vertical axis: MFlops (0 to 7600); horizontal axis: Size (0 to 2000).]
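The efficiency ratio can be worked through as a short calculation. This sketch assumes a peak of 4 floating-point operations per cycle for the 1.9 GHz POWER5 (two fused multiply-add units), which gives 7600 MFlops and matches the top of the chart's axis. The observed value used below is a hypothetical number chosen for illustration, not a measurement from the slides.

```python
# Sketch of the efficiency calculation. The flops-per-cycle figure is an
# assumption, and the observed rate is hypothetical, not measured data.

def peak_mflops(clock_mhz, flops_per_cycle):
    """Theoretical peak rate in MFlops for one processor core."""
    return clock_mhz * flops_per_cycle


def efficiency(observed_mflops, peak):
    """Ratio of observed performance to the theoretical maximum."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return observed_mflops / peak


if __name__ == "__main__":
    peak = peak_mflops(1900, 4)      # assumed: 4 flops/cycle at 1.9 GHz
    observed = 6840.0                # hypothetical observed rate in MFlops
    print(peak)
    print(f"{efficiency(observed, peak):.0%}")
```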
HPC: GotoBLAS
• In April 2006, TACC released the latest version of GotoBLAS:
  • Free to use for academic and research purposes
  • Supports a wide range of Fortran compiler interfaces
  • Available to commercial users through UT’s Office of Technology Commercialization
• Source code for the library is now available.
• Redistribution rights are also available.
Thanks for your time!
Karl W. Schulz, karl@tacc.utexas.edu
Kazushige Goto, kgoto@tacc.utexas.edu