Performance Evaluation of Two Emerging Media Processors: VIRAM and Imagine
Leonid Oliker, Future Technologies Group, Computational Research Division, LBNL, www.nersc.gov/~oliker
Sourav Chatterji, Jason Duell, Manikandan Narayanan
Motivation • Commodity cache-based SMP clusters achieve only a small % of peak on memory-intensive problems (especially irregular ones) • The "gap" between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr) • Power and packaging are becoming significant bottlenecks • Better software is improving some problems: ATLAS, FFTW, Sparsity, PHiPAC • Alternative architectures allow tighter integration of processor & memory. Can we build HPC systems with high-end media processor technology? • VIRAM: PIM technology combines embedded DRAM with a vector coprocessor to exploit its large bandwidth potential • IMAGINE: stream-aware memory supports the large processing potential of SIMD-controlled VLIW clusters
Motivation • General-purpose processors are badly suited to data-intensive operations • Large caches not useful • Low memory bandwidth • Superscalar methods of increasing ILP are inefficient • High power consumption • Application-specific ASICs: good, but expensive and slow to design • Solution: general-purpose "memory-aware" processors • Large number of ALUs: to exploit data parallelism • Huge memory bandwidth: to keep the ALUs busy • Concurrency: overlap memory with computation
VIRAM Overview • MIPS core (200 MHz) • Main memory system • 8 banks w/ 13 MB of on-chip DRAM • Large 6.4 GB/s on-chip peak bandwidth • Cache-less vector unit • Energy-efficient way to express fine-grained parallelism and exploit bandwidth • Single issue, in order • Low power consumption: 2.0 W • Peak vector performance • 1.6/3.2/6.4 Gops • 1.6 Gflops (single-precision) • Fabricated by IBM: taped out 02/2003 • To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages) • We use a simulator with Cray's vcc compiler (example loop below)
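For flavor, a minimal sketch (not from the talk) of the kind of C loop Cray's vcc compiler can auto-vectorize for VIRAM; the function and variable names are illustrative:

```c
/* Vectorizable SAXPY-style loop: no loop-carried dependences, so the
 * compiler can strip-mine it onto the vector unit.  Loads and stores
 * go directly to on-chip DRAM (no cache to manage); the 15-stage
 * pipeline hides the DRAM access latency. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```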
VIRAM Vector Lanes • Parallel lane design has advantages in performance, design complexity, and scalability • Each lane has 2 ALUs (1 for FP) and receives identical control signals • Vector instructions specify 64-way parallelism; the hardware executes 8-way • 8 KB vector register file partitioned into 32 vector registers • Variable data widths: 4 lanes at 64-bit, 8 lanes at 32-bit, 16 at 16-bit • Each time the data width is halved, the number of elements per register (and peak performance) doubles • Limitations: no 64-bit FP & the compiler doesn't generate fused MADD
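Worked peak arithmetic (assuming the 200 MHz clock and 2 ALUs per lane given above): 4 lanes × 2 ops × 200 MHz = 1.6 Gops at 64-bit width; halving the width doubles this to 3.2 Gops (32-bit) and 6.4 Gops (16-bit), matching the peak figures on the overview slide.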
VIRAM Power Efficiency • Comparable performance at a much lower clock rate • Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model
Stream Processing Example: stereo depth extraction • Data and functional parallelism • High computation rate • Little data reuse • Producer-consumer and spatial locality • Examples: multimedia, signal processing, graphics • Stream: ordered set of records (homogeneous, arbitrary data type) • Stream programming: data is streams, computation is kernels • A kernel loops through all stream elements (in sequential order) • Performs a compound (multiword) operation on each stream element • Vectors, by contrast, perform a single arithmetic op on each vector element (then store the result back in a register), as in the sketch below
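A plain-C sketch of the contrast (Imagine itself is programmed through stream/kernel languages; the record layout here is made up for illustration):

```c
/* A stream is an ordered set of homogeneous records; a kernel loops
 * over them in order and applies a compound (multiword) operation to
 * each record while it sits in local registers. */
typedef struct { float x, y, z; } Record;   /* hypothetical record */

void kernel(const Record *in, float *out, int n)
{
    for (int i = 0; i < n; i++) {
        Record r = in[i];              /* fetch one stream element   */
        out[i] = r.x * r.y + r.z;      /* compound op, no re-fetches */
    }
}
/* A vector machine would express the same work as separate vector
 * multiply and vector add instructions, each writing a full vector
 * register of intermediate results. */
```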
Imagine Overview • "Vector VLIW" processor • Coprocessor to an off-chip host processor • 8 arithmetic clusters controlled in SIMD with VLIW instructions • Central 128 KB Stream Register File @ 32 GB/s • SRF can overlap computation with memory (double buffering) • SRF can reuse intermediate results (producer-consumer locality) • Stream-aware memory system with 2.7 GB/s off-chip bandwidth • 544 GB/s inter-cluster communication • Host sends instructions to the stream controller; the SC issues commands to the on-chip modules
Imagine Arithmetic Clusters • 400 MHz clock, 8 clusters with 6 FUs each (48 FUs total) • Reads/writes streams to the SRF • Each cluster: 3 ADD, 2 MULT, 1 DIV/SQRT, 1 scratchpad, & 1 comm unit • 32-bit architecture: subword operations support 16- and 8-bit data (no 64-bit support) • Local registers on the functional units hold 16 words each (1.5 KB total) • Clusters receive VLIW-style instructions broadcast from the microcontroller
VIRAM and Imagine • Imagine has an order of magnitude higher peak performance • VIRAM has twice the memory bandwidth and lower power consumption • Note the peak Flop/Word ratios
SQMAT Architectural Probe: 3x3 Matrix Multiply • Sqmat: scalable synthetic probe; controls computational intensity and vector length (sketched below) • Imagine's stream model requires a large number of ops per word to amortize memory references • Poor use of the SRF, no producer-consumer locality • Long streams help hide memory latency, but reach only 7% of algorithmic peak • VIRAM performs well at low ops/word (40% of peak when L=256) • Vector pipeline overlaps computation/memory; on-chip DRAM gives high bandwidth and low latency
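A hedged sketch of the probe, assuming (per the slide title and the ops/word knob) that each of L 3x3 matrices is squared N times:

```c
/* Sqmat-style probe: square each of L 3x3 matrices N times.
 * Raising N raises computational intensity (ops per word moved);
 * raising L lengthens the vectors/streams seen by the hardware. */
void sqmat(float A[][3][3], int L, int N)
{
    for (int m = 0; m < L; m++)
        for (int n = 0; n < N; n++) {
            float T[3][3];
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++) {
                    float s = 0.0f;
                    for (int k = 0; k < 3; k++)
                        s += A[m][i][k] * A[m][k][j];
                    T[i][j] = s;
                }
            for (int i = 0; i < 3; i++)          /* write back square */
                for (int j = 0; j < 3; j++)
                    A[m][i][j] = T[i][j];
        }
}
```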
SQMAT: Performance Crossover • Large number of ops/word (each 3x3 matrix squared N=10 times) • Crossover point: L=64 (cycles), L=256 (MFlops) • Imagine's power becomes apparent: almost 4x VIRAM at L=1024 • Codes at this end of the spectrum greatly benefit from the Imagine architecture
VIRAM/Imagine Optimization • Optimization strategy: speed up the slower of computation or memory • Restructure computation for better kernel performance (memory is waiting for the ALUs) • Add more computation for better memory performance (ALUs are memory starved) • Subtle overlap effects: vector chaining, stream double buffering • Example optimization: RGB→YIQ conversion from EEMBC (reference sketch below) • Input format: R1G1B1R2G2B2R3G3B3… • Required format: R1R2R3… G1G2G3… B1B2B3…
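For reference, a scalar sketch of the conversion using the standard NTSC coefficients (the EEMBC kernel itself works on fixed-point byte data, so treat this as illustrative):

```c
/* RGB -> YIQ with the standard NTSC matrix; pixels arrive
 * interleaved as R1 G1 B1 R2 G2 B2 ... */
void rgb_to_yiq(const float *rgb, float *yiq, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        float r = rgb[3*i], g = rgb[3*i + 1], b = rgb[3*i + 2];
        yiq[3*i]     = 0.299f*r + 0.587f*g + 0.114f*b;   /* Y */
        yiq[3*i + 1] = 0.596f*r - 0.274f*g - 0.322f*b;   /* I */
        yiq[3*i + 2] = 0.211f*r - 0.523f*g + 0.312f*b;   /* Q */
    }
}
```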
VIRAM RGB→YIQ Optimization • VIRAM: poor memory performance • Strided accesses (~1/2 performance): RGBRGBRGB… strided loads → RRR…GGG…BBB… • Only 4 address generators for 8 addresses (sufficient for 64-bit data) • Word operations on byte data (1/4th of peak) • Optimization: replace strided with unit-stride access, using in-register shuffles (effect sketched below) • Increased computational overhead (packing and unpacking)
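The effect of the optimization, sketched in scalar C (on VIRAM the component separation is done with in-register vector shuffles rather than this scalar loop):

```c
/* One unit-stride sweep over the interleaved pixels, separating the
 * components with register work instead of three stride-3 loads. */
void deinterleave(const unsigned char *rgb, unsigned char *r,
                  unsigned char *g, unsigned char *b, int npixels)
{
    for (int i = 0; i < npixels; i++) {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i + 1];
        b[i] = rgb[3*i + 2];
    }
}
```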
VIRAM RGB→YIQ Results • Used functional units instead of memory to extract the components, increasing the computational overhead
Imagine RGB→YIQ Optimization • Imagine's bottleneck is computation, due to a poor ALU schedule (left) • Unoptimized: 15 cycles per pixel • Software pipelining makes the VLIW schedule denser (right; schematic below) • Optimized: 8 cycles per pixel
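A schematic of software pipelining in plain C (illustrative only, not Imagine's kernel language): the load for iteration i+1 is issued while iteration i is still computing, which lets a VLIW scheduler pack both into the same instruction words:

```c
void scale(const float *in, float *out, int n, float a)
{
    if (n == 0) return;
    float cur = in[0];                /* prologue: first load        */
    for (int i = 0; i < n - 1; i++) {
        float next = in[i + 1];       /* load for iteration i+1 ...  */
        out[i] = a * cur;             /* ... overlaps compute for i  */
        cur = next;
    }
    out[n - 1] = a * cur;             /* epilogue: last compute      */
}
```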
Imagine RGB→YIQ Results • Optimized kernel takes only ½ the cycles per element • Memory is now the bottleneck
EEMBC Benchmark • Vec-add: one add per element; performance limited by the memory system • RGB→(YIQ, CMYK): VIRAM limited by processing (cannot use the available bandwidth) • Gray filter: difficult to implement efficiently on Imagine (sliding 3x3 window) • Autocorr: uses short streams; Imagine's host latency is high
Scientific Kernels: SPMV Performance • Algorithmic peak: VIRAM 8 ops/cycle, Imagine 32 ops/cycle • LSHAPE: finite element matrix; LARGEDIS: pseudo-random nonzeros • Imagine lacks irregular access support, so the matrix is reordered before the kernel runs (CSR sketch below) • VIRAM is better suited to this class of applications (low computation/memory ratio)
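The kernel in question, sketched in compressed sparse row form (the talk does not give the exact storage format, so this is a standard CSR sketch; 32-bit floats, since VIRAM has no 64-bit FP):

```c
/* y = A*x for sparse A: the gather x[col[j]] is the irregular access
 * pattern that Imagine handles poorly without reordering. */
void spmv(int nrows, const int *rowptr, const int *col,
          const float *val, const float *x, float *y)
{
    for (int i = 0; i < nrows; i++) {
        float sum = 0.0f;
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
            sum += val[j] * x[col[j]];   /* gather: irregular access */
        y[i] = sum;
    }
}
```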
Scientific Kernels: Complex QR Decomposition • A=QR, with Q orthogonal & R upper triangular • Blocked Householder variant, rich in level 3 BLAS ops • Complex elements increase ops/word & locality (1 complex MUL = 6 real ops; see below) • VIRAM uses a CLAPACK port (insertion of vector directives) • Imagine: complex indexing of the matrix stream (each iteration works on a smaller matrix) • Imagine sustains over 10 GFlops (19x VIRAM); well suited for this architecture • Low VIRAM performance due to strided access and compiler limitations
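Where the "1 MUL = 6 ops" figure comes from: a complex multiply expands into 4 real multiplies and 2 real adds, so each word of complex data carries more arithmetic:

```c
typedef struct { float re, im; } cplx;

/* (a.re + i*a.im)(b.re + i*b.im): 4 mults + 2 adds = 6 real ops */
static cplx cmul(cplx a, cplx b)
{
    cplx c;
    c.re = a.re * b.re - a.im * b.im;
    c.im = a.re * b.im + a.im * b.re;
    return c;
}
```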
Overview • Significantly different balance of memory organization • Relative performance depends on computational intensity • Programming complexity is high for both approaches, although VIRAM is based on established vector technology • For well-suited applications, the IMAGINE processor can sustain over 10 GFlop/s (simulated results) • A large amount of homogeneous computation is required to saturate IMAGINE, while VIRAM can operate on small vector sizes • IMAGINE can take advantage of producer-consumer locality • Both offer significant reductions in power and space • May be used as coprocessors in future-generation architectures
Next Generation • CODE: next generation of VIRAM • More functional units / faster clock speed • Local registers per functional unit instead of a single register file • Looking more like Imagine… • Multi-VIRAM architecture: network interface issues? • Brook: new language for Imagine • Eliminates exposure of hardware details (e.g., # of clusters) • Streaming Supercomputer: multi-Imagine configuration • Streams can be used for functional/data parallelism • Currently evaluating the DIVA architecture