Memory Intensive Benchmarks: IRAM vs. Cache Based Machines Parry Husbands (LBNL), Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak Biswas (NASA Ames)
Motivation • Observation: Current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones) • E.g., optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5 GHz Pentium 4 • Even worse when parallel efficiency is considered • Overall ~10% across application benchmarks • Is memory bandwidth the problem? • Performance is directly related to how well the memory system performs • But the "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)
Solutions? • Better Software • ATLAS, FFTW, Sparsity, PHiPAC • Power and packaging are important too! • New buildings and infrastructure needed for many recent/planned installations • Alternative Architectures • One idea: tighter integration of processor and memory • BlueGene/L (~25 cycles to main memory) • VIRAM • Uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM
VIRAM Overview [die photo: 14.5 mm × 20.0 mm] • MIPS core (200 MHz) • Main memory system • 13 MB of on-chip DRAM • Large on-chip bandwidth: 6.4 GBytes/s peak to vector unit • Vector unit • Energy-efficient way to express fine-grained parallelism and exploit bandwidth • Typical power consumption: 2.0 W • Peak vector performance • 1.6/3.2/6.4 Gops • 1.6 Gflops (single-precision) • Fabrication by IBM • Tape-out in O(1 month) • Our results use a simulator with Cray's vcc compiler
Our Task • Evaluate the use of processor-in-memory (PIM) chips as a building block for high performance machines • For now, focus on serial performance • Benchmark VIRAM on scientific computing kernels • VIRAM was originally designed for multimedia applications • Can we use on-chip DRAM for vector processing instead of conventional SRAM? (DRAM is denser) • Isolate performance-limiting features of the architectures • More than just memory bandwidth
Benchmarks Considered • Transitive closure (small & large data sets) • NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit) • Fetch-and-increment a stream of "random" addresses • Sparse matrix-vector product: • Order 10000, 177820 nonzeros • Computing a histogram • Different algorithms investigated: 64-element sorting kernel; privatization; retry • 2D unstructured mesh adaptation
The Results • Comparable performance at a much lower clock rate (200 MHz VIRAM vs. up to 1.5 GHz for the cache-based machines)
Power Efficiency • Large power/performance advantage for VIRAM comes from • PIM technology • The data-parallel execution model
GUPS • 1 op, 2 loads, 1 store per step • Mix of indexed and unit-stride operations • Address generation is the key limiter here (only 4 addresses per cycle on VIRAM)
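As a rough illustration of the access pattern, here is a scalar C sketch of the fetch-and-increment loop (the function and variable names are illustrative, not from the benchmark source):

```c
#include <stddef.h>
#include <stdint.h>

/* GUPS-style fetch-and-increment: each step does 1 op, 2 loads, 1 store --
   a unit-stride load of the index, an indexed (gather) load of the table
   entry, an increment, and an indexed (scatter) store back. */
void gups_increment(uint64_t *table, size_t table_size,
                    const uint64_t *indices, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t idx = (size_t)(indices[i] % table_size); /* unit-stride load of index */
        table[idx] = table[idx] + 1;                    /* gather, +1, scatter       */
    }
}
```

Every iteration needs a freshly generated table address for both the gather and the scatter, which is why address-generation throughput, not raw bandwidth, becomes the limit.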
Histogram • 1 op, 2 loads, 1 store per step • Like GUPS, but duplicate keys restrict the available parallelism and make the kernel harder to vectorize • The sort method performs best on VIRAM on real data • Competitive when the histogram doesn't fit in cache
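For reference, a scalar C sketch of the privatization variant under assumed structure (the lane count and names are illustrative; the real kernel distributes private copies across vector lanes):

```c
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

enum { NLANES = 64 };  /* assumed number of conceptual lanes */

void histogram_privatized(uint32_t *bins, size_t nbins,
                          const uint32_t *keys, size_t n)
{
    /* One private copy of the bins per lane, so duplicate keys handled
       by different lanes never collide on the same bin. */
    uint32_t *priv = calloc((size_t)NLANES * nbins, sizeof *priv);
    if (!priv) return;

    for (size_t i = 0; i < n; i++) {
        size_t lane = i % NLANES;          /* which private copy       */
        size_t b    = keys[i] % nbins;     /* bin index for this key   */
        priv[lane * nbins + b] += 1;       /* conflict-free update     */
    }

    for (size_t lane = 0; lane < NLANES; lane++)   /* final reduction  */
        for (size_t b = 0; b < nbins; b++)
            bins[b] += priv[lane * nbins + b];

    free(priv);
}
```

The trade-off is the extra memory and the final reduction pass, which is why the sort-based variant can win on real (skewed) data.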
Which Problems are Limited by Bandwidth? • What is the bottleneck in each case? • Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak) • SPMV and Mesh are limited by address generation, bank conflicts, and parallelism • For Histogram the limit is lack of parallelism, not memory bandwidth
Summary and Future Directions • Performance advantage • Large on applications limited only by bandwidth • More address generators/sub-banks would help irregular performance • Performance/power advantage • Over both low-power and high-performance processors • Both PIM and data parallelism are key • Performance advantage for VIRAM depends on the application • Need fine-grained parallelism to utilize on-chip bandwidth • Future steps • Validate our work on a real chip! • Extend to multi-PIM systems • Explore system balance issues • Other memory organizations (banks, bandwidth vs. size of memory) • Number of vector units • Network performance vs. on-chip memory
The Competition

         SPARC IIi      MIPS R10K     P III          P 4        Alpha EV6
Make     Sun Ultra 10   Origin 2000   Intel Mobile   Dell       Compaq DS10
Clock    333 MHz        180 MHz       600 MHz        1.5 GHz    466 MHz
L1       16+16 KB       32+32 KB      32 KB          12+8 KB    64+64 KB
L2       2 MB           1 MB          256 KB         256 KB     2 MB
Mem      256 MB         1 GB          128 MB         1 GB       512 MB
Transitive Closure (Floyd-Warshall) • 2 ops, 2 loads, 1 store per step • Good for vector processors: • Abundant, regular parallelism and unit stride
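A minimal C sketch of the Floyd-Warshall update (illustrative code, not the benchmark source) shows where the 2 ops, 2 loads, and 1 store per step come from and why the inner loop is unit stride:

```c
#include <stddef.h>

/* Floyd-Warshall all-pairs shortest paths over an n x n distance matrix.
   The inner j loop is unit stride: per step it does an add and a
   compare/min (2 ops), loads dist[k][j] and dist[i][j] (2 loads), and
   stores dist[i][j] (1 store). */
void transitive_closure(double *dist, size_t n)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            double dik = dist[i * n + k];
            for (size_t j = 0; j < n; j++) {           /* unit stride, vectorizable */
                double via_k = dik + dist[k * n + j];  /* load + add                */
                if (via_k < dist[i * n + j])           /* load + compare            */
                    dist[i * n + j] = via_k;           /* store                     */
            }
        }
}
```

The long, independent, unit-stride inner loop is exactly the regular parallelism a vector unit exploits well.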
SPMV • 2 ops, 3 loads per step • Mix of indexed and unit-stride operations • Good performance for ELLPACK, but only when rows have roughly the same number of nonzeros
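A scalar C sketch of SpMV in ELLPACK storage (the layout and names here are illustrative assumptions): values and column indices sit in dense n_rows × max_nnz arrays padded with zeros, giving unit-stride loads of the matrix data and an indexed (gather) load of x.

```c
#include <stddef.h>

/* y = A*x with A in ELLPACK format: val and col are n_rows x max_nnz
   arrays, zero-padded so every row has max_nnz slots. Per step: 2 ops
   (multiply, add) and 3 loads (value, column index, gathered x entry).
   Padding is wasted work when row lengths differ, which is why the
   format only pays off with a roughly uniform number of nonzeros per row. */
void spmv_ellpack(size_t n_rows, size_t max_nnz,
                  const double *val, const size_t *col,
                  const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = 0; k < max_nnz; k++) {
            double a = val[i * max_nnz + k];   /* unit-stride load        */
            size_t c = col[i * max_nnz + k];   /* unit-stride load        */
            sum += a * x[c];                   /* indexed load, mul, add  */
        }
        y[i] = sum;
    }
}
```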
Mesh Adaptation • A single level of refinement of a mesh with 4802 triangular elements, 2500 vertices, and 7301 edges • Extensive reorganization required to take advantage of vectorization • Many indexed memory operations (limited again by address generation)