Scientific Kernels on VIRAM and Imagine
Leonid Oliker, Future Technologies Group, NERSC/LBNL, www.nersc.gov/~oliker
Xiaoye Li, Parry Husbands, Adam Janin, Manikandan Narayanan, Kathy Yelick
Motivation • Observation: current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones) • E.g., optimized sparse matrix-vector multiplication (SpMV) runs at ~20% of peak on a 1.5 GHz P4 (see the sketch below) • Even worse when parallel efficiency is considered: overall ~10% across application benchmarks • Is memory bandwidth the problem? • Performance is directly related to how well the memory system performs • But the "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)
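To make the bottleneck concrete, here is a minimal CSR sparse matrix-vector multiply. This is a textbook formulation for illustration, not the tuned kernel benchmarked above, and the names and layout are generic; the indirect gather x[col[j]] is what ties performance to the memory system rather than the ALUs.

```c
/* Minimal CSR SpMV reference (illustrative; not the optimized kernel
   cited above). The indexed load x[col[j]] produces irregular memory
   traffic that cache hierarchies handle poorly. */
void spmv_csr(int nrows, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* gather through column indices */
        y[i] = sum;
    }
}
```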
Solutions? • Better software: ATLAS, FFTW, Sparsity, PHiPAC • Power and packaging are important too! New buildings and infrastructure are needed for many recent/planned installations • Alternative architectures • One idea: tighter integration of processor and memory • BlueGene/L (~25 cycles to main memory) • VIRAM: uses PIM technology in an attempt to exploit the large on-chip bandwidth available in DRAM • Imagine: stream-aware memory hierarchy supporting SIMD-controlled VLIW clusters
VIRAM Overview (die: 14.5 mm x 20.0 mm) • MIPS core (200 MHz) • Main memory system: 13 MB of on-chip DRAM; large on-chip bandwidth, 6.4 GB/s peak to the vector unit • Vector unit: an energy-efficient way to express fine-grained parallelism and exploit bandwidth • Typical power consumption: 2.0 W • Peak vector performance: 1.6/3.2/6.4 Gops; 1.6 GFlop/s (single precision) • Fabrication by IBM; tape-out in O(1 month) • Our results use a simulator with Cray's vcc compiler
Our Task • Evaluate the use of processor-in-memory (PIM) chips as a building block for high-performance machines • For now, focus on serial performance • Benchmark VIRAM on scientific computing kernels (the chip was originally designed for multimedia applications) • Can on-chip DRAM be used for vector processing instead of conventional SRAM? (DRAM is denser) • Isolate the performance-limiting features of the architectures: more than just memory bandwidth
Benchmarks Considered • Transitive closure (small & large data sets) • NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of "random" addresses • Sparse matrix-vector product: order 10000, 177820 nonzeros • Computing a histogram, with several algorithms investigated: a 64-element sorting kernel, privatization, and retry • 2D unstructured mesh adaptation
The Results • Comparable performance with a lower clock rate
Power Efficiency • Large power/performance advantage for VIRAM, from both PIM technology and the data-parallel execution model
GUPS • 1 op, 2 loads, 1 store per step • Mix of indexed and unit-stride operations • Address generation is the key here (only 4 addresses generated per cycle on VIRAM)
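A minimal C sketch of the GUPS update loop, under the assumption (ours, for illustration) that indices within a vector are distinct so the loop vectorizes safely:

```c
#include <stdint.h>

/* GUPS-style update loop (illustrative sketch). Per step: one unit-stride
   load (idx[i]), one indexed load (table[idx[i]]), one add, and one
   indexed store -- 1 op, 2 loads, 1 store, dominated by address
   generation and indexed memory traffic. */
void gups(uint64_t *table, const uint64_t *idx, int n)
{
    for (int i = 0; i < n; i++)
        table[idx[i]] += 1;
}
```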
Histogram • 1 op, 2 loads, 1 store per step • Like GUPS, but duplicate indices restrict the available parallelism and make the kernel harder to vectorize • The sort method performs best on VIRAM on real data • Competitive when the histogram doesn't fit in cache
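As one example, a hedged sketch of the privatization strategy: each of a small number of lanes accumulates into its own copy of the histogram, so duplicates within a group of updates never collide, at the cost of a final reduction. LANES and NBINS are illustrative parameters, not figures from the talk.

```c
#define LANES 8     /* illustrative: number of private copies */
#define NBINS 256   /* illustrative: 8-bit data */

/* Privatized histogram (hedged sketch): consecutive elements go to
   different private copies, so a group of LANES updates never collides
   even when data values repeat; the copies are summed at the end. */
void histogram(const unsigned char *data, int n, int hist[NBINS])
{
    int priv[LANES][NBINS] = {{0}};
    for (int i = 0; i < n; i++)
        priv[i % LANES][data[i]]++;       /* lanes touch disjoint copies */
    for (int b = 0; b < NBINS; b++) {
        int sum = 0;
        for (int l = 0; l < LANES; l++)   /* reduce across the copies */
            sum += priv[l][b];
        hist[b] = sum;
    }
}
```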
Which Problems Are Limited by Bandwidth? • What is the bottleneck in each case? • Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak) • SPMV and Mesh are limited by address generation, bank conflicts, and parallelism • For Histogram, the limit is lack of parallelism, not memory bandwidth
VIRAM Summary and Future Directions • Performance advantage: large on applications limited only by bandwidth; more address generators/sub-banks would help irregular performance • Performance/power advantage over both low-power and high-performance processors; both PIM and data parallelism are key • The size of VIRAM's performance advantage depends on the application: fine-grained parallelism is needed to utilize the on-chip bandwidth • Future steps: validate our work on the real chip! • Extend to multi-PIM systems • Explore system balance issues: other memory organizations (banks, bandwidth vs. size of memory), number of vector units, network performance vs. on-chip memory
IMAGINE: The Need for Stream Processors • General-purpose processors are badly suited: large caches are not useful, memory bandwidth is low, superscalar methods of increasing ILP are inefficient, and power consumption is high • Application-specific ASICs are good, but expensive and slow to design • Solution: general-purpose "stream processors" • Exploit producer-consumer locality • Target high arithmetic requirements and homogeneous computation (SIMD-controlled VLIW clusters) • Unique (but limited) control logic
Bandwidth Hierarchy
[Diagram: four SDRAM channels feed a Stream Register File, which feeds eight SIMD/VLIW-controlled ALU clusters. Peak bandwidth: 4 GB/s off-chip SDRAM, 32 GB/s stream register file, 544 GB/s cluster-local registers]
Imagine Arithmetic Clusters • Each cluster contains 3 ADD units, 2 MULT units, 1 DIV/SQRT unit, 1 scratch-pad register unit, and 1 cluster communication unit • 32-bit operations; subword operations support 16- and 8-bit data • Local registers on the functional units hold 16 words each (1.5 KB of local registers per cluster in total) • Clusters receive VLIW-style instructions broadcast from the microcontroller
Sqmat Microbenchmark: A Scalable Synthetic Probe • Used to gain insight into the architectures and to capture the performance crossover point • Sqmat contains abundant fine-grained parallelism, no data dependencies, and "multi-word" records • Sqmat squares a set of L matrices of size NxN repeatedly, M times each • Varying N and M controls the size of the computational kernel and the ops/word ratio; varying L controls the vector/stream length • Start at the low end of the performance spectrum and work up to high efficiency
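A scalar reference version of the probe as defined above (illustrative; not the vectorized VIRAM or Imagine kernels). Each squaring costs about 2N^3 flops over N^2 matrix words, so ops/word grows roughly as 2NM, while L sets the stream length:

```c
/* Sqmat scalar reference (illustrative): square each of the L N x N
   matrices M times in place. N and M set the kernel size and ops/word
   (roughly 2*N*M); L sets the vector/stream length. */
void sqmat(double *a, int L, int N, int M)
{
    double tmp[N * N];                       /* C99 VLA scratch matrix */
    for (int l = 0; l < L; l++) {
        double *m = a + (long)l * N * N;
        for (int s = 0; s < M; s++) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    double sum = 0.0;
                    for (int k = 0; k < N; k++)
                        sum += m[i * N + k] * m[k * N + j];
                    tmp[i * N + j] = sum;
                }
            for (int t = 0; t < N * N; t++)  /* commit the square */
                m[t] = tmp[t];
        }
    }
}
```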
Sqmat: Low ops/word
[Figure: % of algorithmic peak for varying N, with M=1 and L=16]
[Figure: % of algorithmic peak for varying L, with M=1 and N=3]
Sqmat: High ops/word
[Figure: % of algorithmic peak for varying M, with N=3 and L=1024]
• Imagine achieves less than 50% efficiency even though there are 30 multiplications for each memory access
• Achieving high efficiency requires long streams and heavy computation: reaching 90% efficiency on Imagine takes a large (N=5) computational kernel and many ops per word
Sqmat: Performance Crossover
[Figure: performance crossover for N=3 and M=10]
• VIRAM performance flattens while Imagine's continues to grow; at L=1024, Imagine's raw power becomes apparent, requiring 33% fewer cycles and delivering a 4x improvement in MFlop/s
• Depending on its computational characteristics, a code is better suited to one architecture or the other
Low Computational Intensity Example: SPMV • Performance is low using the original matrix due to lack of parallelism (only 8 and 18 nonzeros per row), but VIRAM achieves a much higher fraction of peak • The Ellpack (filled) version performs better on both architectures, with VIRAM still showing the better performance characteristics • Note that padding matrices to create equal row lengths can make the fraction of useful operations arbitrarily low (but equally so across both architectures)
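A hedged sketch of the Ellpack idea (the layout and names here are illustrative): rows are padded to a common length K and stored column-major, so the inner loop runs at unit stride across rows and exposes long vectors; the padded slots hold zeros, which is where the wasted operations come from.

```c
/* ELLPACK SpMV (hedged sketch). Rows are padded to a common length K;
   val/col are nrows x K, stored column-major so the inner loop is unit
   stride across rows. Padded entries hold value 0 and column index 0,
   so they waste work without changing the result. */
void spmv_ellpack(int nrows, int K,
                  const double *val, const int *col,
                  const double *x, double *y)
{
    for (int i = 0; i < nrows; i++)
        y[i] = 0.0;
    for (int k = 0; k < K; k++)            /* over the padded row length */
        for (int i = 0; i < nrows; i++)    /* long, vectorizable loop    */
            y[i] += val[k * nrows + i] * x[col[k * nrows + i]];
}
```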
High Computational Intensity Example: Complex QR Decomposition • Both versions use block variants of Householder QR, rich in BLAS3 operations • The use of complex elements increases the computational intensity (ops/word) • The VIRAM version is a port of CLAPACK; it involves inserting vectorization directives into the BLAS routines and minimizing strided access • The Imagine version uses blocks of 8 columns and requires complicated indexing logic • VIRAM sustains only 34% of peak due to large-stride memory access • Imagine performs at 65% of peak, an impressive rate of over 13 GFlop/s • Demonstrates the significant performance Imagine can achieve on applications with many operations per memory access
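For orientation only, an unblocked, real-valued Householder QR (the benchmarked codes are blocked complex variants; this simplified sketch just shows the transform applied at each step). R overwrites the upper triangle of A and the reflectors are discarded.

```c
#include <math.h>

/* Unblocked real Householder QR (simplified sketch, not the benchmarked
   code). A is m x n row-major with m >= n; on return the upper triangle
   of A holds R. Each step reflects the trailing columns so that column k
   is zeroed below the diagonal. */
void householder_qr(double *A, int m, int n)
{
    for (int k = 0; k < n; k++) {
        double norm = 0.0;                  /* ||A[k:m, k]|| */
        for (int i = k; i < m; i++)
            norm += A[i * n + k] * A[i * n + k];
        norm = sqrt(norm);
        if (norm == 0.0) continue;
        double alpha = (A[k * n + k] > 0.0) ? -norm : norm;
        double v0 = A[k * n + k] - alpha;   /* v = x - alpha*e1 */
        double vtv = v0 * v0;
        for (int i = k + 1; i < m; i++)
            vtv += A[i * n + k] * A[i * n + k];
        for (int j = k + 1; j < n; j++) {   /* apply I - 2vv^T/(v^T v) */
            double dot = v0 * A[k * n + j];
            for (int i = k + 1; i < m; i++)
                dot += A[i * n + k] * A[i * n + j];
            double beta = 2.0 * dot / vtv;
            A[k * n + j] -= beta * v0;
            for (int i = k + 1; i < m; i++)
                A[i * n + j] -= beta * A[i * n + k];
        }
        A[k * n + k] = alpha;               /* diagonal entry of R */
        for (int i = k + 1; i < m; i++)
            A[i * n + k] = 0.0;             /* drop the reflector */
    }
}
```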
VIRAM and IMAGINE: Current Usability Problems
VIRAM:
• Compiler is poor at basic block scheduling
• No automatic unrolling or software pipelining, so assembly must be hand-tweaked
• No ISA support for multi-word records, which degrades memory performance and increases program complexity
Imagine:
• Programmer is exposed to the memory hierarchy (2 levels of programming)
• Number of clusters (8) is exposed in the ISA, so code is not portable
• Streams with (size % clusters) != 0 are hard to handle: burden falls on the programmer, and the control logic gets complicated
• The Brook language attempts to address these issues
Observations: VIRAM and IMAGINE • Relative performance depends on the computational requirements per data element (bytes/flop) • The two designs strike a different balance in memory organization • Programming complexity is high for both approaches, although VIRAM builds on established vector technology • For well-suited applications, the Imagine processor can sustain over 10 GFlop/s (simulated results) • A large amount of homogeneous computation is required to saturate Imagine, while VIRAM can operate on small vector sizes • Imagine can additionally take advantage of producer-consumer locality • Both offer significant reductions in power and space