Compilers and Applications
Kathy Yelick
Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang, Adam Janin, Thinh Nguyen
Computer Science Division, UC Berkeley
Compiling for VIRAM
• Long-term success of DIS technology depends on a simple programming model, i.e., a compiler
• Needs to handle a significant class of applications
  • IRAM: multimedia, graphics, speech and image processing
  • ISTORE: databases, signal processing, other DIS benchmarks
• Needs to utilize hardware features for performance
  • IRAM: vectorization
  • ISTORE: scalability of the shared-nothing programming model
IRAM Compilers
• IRAM/Cray vectorizing compiler [Judd]
  • Production compiler
    • Used on the T90, C90, as well as the T3D and T3E
    • Being ported (by SGI/Cray) to the SV2 architecture
  • Has C, C++, and Fortran front-ends (focus on C)
  • Extensive vectorization capability
    • outer loop vectorization, scatter/gather, short loops, …
  • VIRAM port is under way
• IRAM/VSUIF vectorizing compiler [Krashinsky]
  • Based on VSUIF from Corinna Lee’s group at Toronto, which is based on MachineSUIF from Mike Smith’s group at Harvard, which is based on the SUIF compiler from Monica Lam’s group at Stanford
  • This is a “research” compiler, not intended for compiling large, complex applications
  • It has been working since 5/99
IRAM/Cray Compiler Status
[Figure: compiler structure. Frontends: C, C++, Fortran. Vectorizer: PDGCS. Code Generators: C90, IRAM.]
• MIPS backend developed this year
  • Validated using a commercial test suite for code generation
  • Generated code run through vas assembler
• Vector backend recently started
  • Testing with vsim under way this week
• Leveraging from Cray
  • Automatic vectorization
  • Basic instruction scheduling framework
ISTORE Compiler
[Figure: Titanium compilation pipeline. Recoverable labels: Java, Titanium, tc, Optimizer, Code Gen, C + comm, C compiler, t3e cc, ISTORE.]
• Titanium language is an extension of Java
• tc is the Titanium compiler
• Recent progress:
  • improved portability of generated code and the compiler itself, including port to Cray parallel machines
  • additions to generate annotations on C code to improve fine-grained parallelism (on Tera MTA) and vectorization
• New benchmarking efforts
  • database primitives: sorting, hash-join and index-nested-loop join
  • 3D FFT and linear solvers (LU)
Applications
• Hand-written kernels for single-chip VIRAM
  • focus on multimedia kernels, see IRAM hardware talk
• Compiled programs for single-chip VIRAM
  • 2 examples from IRAM/VSUIF: decryption and mvm
  • most effort devoted to IRAM/Cray compiler
• Performance benchmarks for ISTORE
  • 3D FFT
  • Others
• SAM benchmarks for ISTORE
Automatic Vectorization
• Vectorizing compilers are very successful on scientific applications
  • not entirely automatic, especially for C/C++
  • good tools for training users
• Multimedia applications
  • have shorter vector lengths
  • can sometimes exploit outer loop vectorization for longer vectors
    • often leads to non-unit strides
  • tree traversals (e.g., in image compression) could be written as scatter/gather (breadth-first), although automating this is far from solved
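A minimal sketch of the outer-loop vectorization point, assuming a row-major matrix with many rows and only a few columns. The function names and the row-sum kernel are illustrative and do not come from the talk. In the first form the only vectorizable loop is short; interchanging the loops gives a long vectorizable loop, at the cost of a stride of ncols elements.

```c
#include <stddef.h>

/* Inner-loop form: the candidate vector loop (over j) runs only
 * ncols iterations, so vectors are short when ncols is small. */
void row_sums_inner(const float *a, size_t nrows, size_t ncols, float *sum)
{
    for (size_t i = 0; i < nrows; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < ncols; j++)
            s += a[i * ncols + j];
        sum[i] = s;
    }
}

/* Outer-loop form: the loops are interchanged, so the candidate vector
 * loop (over i) runs nrows iterations. Its loads from a[] are strided
 * by ncols elements instead of being unit-stride. */
void row_sums_outer(const float *a, size_t nrows, size_t ncols, float *sum)
{
    for (size_t i = 0; i < nrows; i++)
        sum[i] = 0.0f;
    for (size_t j = 0; j < ncols; j++)
        for (size_t i = 0; i < nrows; i++)
            sum[i] += a[i * ncols + j];
}
```

Both functions add each row's elements in the same order, so they produce identical results; only the memory access pattern of the innermost loop differs.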
IRAM/VSUIF Decryption (IDEA)
[Chart: x-axis "# lanes"]
• IDEA decryption operates on 16-bit ints
• Compiled with IRAM/VSUIF (with unrolling by hand)
• Note scalability of both # lanes and data width
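To make the 16-bit data width concrete, here is a sketch of IDEA's core operation, multiplication modulo 2^16 + 1, written in a straightforward form. It is not the code that was compiled for the talk, and the names idea_mul and idea_mul_block are hypothetical. Each element is independent of the others, which is the property a vectorizer needs.

```c
#include <stddef.h>
#include <stdint.h>

/* IDEA multiplication: multiply modulo 2^16 + 1 (65537), where the
 * 16-bit value 0 stands for 2^16. */
uint16_t idea_mul(uint16_t a, uint16_t b)
{
    uint64_t x = a ? a : 65536u;
    uint64_t y = b ? b : 65536u;
    uint64_t r = (x * y) % 65537u;
    /* r is in 1..65536; 65536 truncates to 0, which encodes 2^16. */
    return (uint16_t)r;
}

/* Apply the same subkey to n independent 16-bit words. There is no
 * dependence between iterations, so the loop is a vectorization candidate. */
void idea_mul_block(const uint16_t *in, uint16_t key, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = idea_mul(in[i], key);
}
```

A production implementation would normally avoid the 64-bit remainder and use the well-known low-half minus high-half trick; the version above favors readability.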
IRAM/VSUIF Matrix/Vector Multiply
[Chart: performance of mvm and vmm]
• IRAM/VSUIF does reasonably well on long loops
  • 256x256 single-precision matrix
  • Compare to 1600 Mflop/s (peak without multadd)
• Note BLAS-2 (little reuse)
  • ~350 Mflop/s on Power3 and EV6
• Problems specific to VSUIF
  • hand strip-mining results in short loops
  • reductions
  • no multadd support
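The sketch below illustrates why reductions and strip-mining matter here. It assumes that "mvm" means y = A*x and "vmm" means y = x^T*A for a row-major n x n matrix, which the slide does not spell out. MVL is a hypothetical maximum vector length chosen only for illustration.

```c
#include <stddef.h>

#define MVL 64  /* hypothetical maximum vector length, not a VIRAM figure */

/* mvm: y = A * x. The inner loop is a dot product, i.e. a reduction
 * into the scalar s, which is harder to vectorize well. */
void mvm(const float *A, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float s = 0.0f;
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

/* vmm: y = x^T * A, strip-mined by hand into chunks of at most MVL
 * elements. The innermost loop is unit-stride and has no reduction:
 * each y element accumulates independently. */
void vmm(const float *A, const float *x, float *y, size_t n)
{
    for (size_t j0 = 0; j0 < n; j0 += MVL) {
        size_t len = (n - j0 < MVL) ? n - j0 : MVL;
        for (size_t j = 0; j < len; j++)
            y[j0 + j] = 0.0f;
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < len; j++)
                y[j0 + j] += x[i] * A[i * n + j0 + j];
    }
}
```

Note how strip-mining caps the innermost trip count at MVL, which is one way hand strip-mining can leave a compiler with short loops.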
3D FFT on ISTORE
• Performance of large 3D FFTs depends on 2 factors
  • speed of 1D FFT on a single node (next slide)
  • network bandwidth for “transposing” data
• 1.3 Tflop/s FFT possible w/ 1K IRAM nodes and 0.5 TB/s bandwidth
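As a rough single-node illustration of the "transposing" step: a 3D FFT can be computed as 1D FFTs along each axis, with the data reordered between passes so that the axis being transformed is contiguous. The sketch below does that reordering in local memory; on a multi-node system the same index mapping would become network traffic. The function name and layout are assumptions, not the ISTORE implementation.

```c
#include <stddef.h>

/* Reorder a 3D array so the x axis becomes the contiguous one.
 * in  is indexed [x][y][z] (z contiguous),
 * out is indexed [z][y][x] (x contiguous). */
void transpose_xz(const float *in, float *out,
                  size_t nx, size_t ny, size_t nz)
{
    for (size_t x = 0; x < nx; x++)
        for (size_t y = 0; y < ny; y++)
            for (size_t z = 0; z < nz; z++)
                out[(z * ny + y) * nx + x] = in[(x * ny + y) * nz + z];
}
```

Every element moves exactly once per transpose, which is why the aggregate bandwidth figure bounds the achievable FFT rate.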
1D FFT on IRAM
• FFT study on IRAM [Randi Thomas]
  • hand-coded and scheduled
  • use of ISA features to make in-register FFTs fast (128 point)
  • bit-reversal time not included; will also use ISA support

Processor (vendor)                    Data width   FFT time
DSP56002 DSP (Motorola)               24 bits      908 us
TMS320C6000 DSP (Texas Instruments)   32 bits      124 us
TigerSHARC DSP (Analog Devices)       32 bits      41 us
IRAM                                  32 bits      37 us
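For readers unfamiliar with the bit-reversal step that the timings exclude, here is a plain scalar sketch of the permutation applied to the split real and imaginary arrays of a radix-2 FFT. It is a generic textbook version with hypothetical function names; it does not use or model the ISA support mentioned in the talk.

```c
#include <stddef.h>

/* Reverse the low log2n bits of i. */
size_t reverse_bits(size_t i, unsigned log2n)
{
    size_t r = 0;
    for (unsigned b = 0; b < log2n; b++) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

/* In-place bit-reversal permutation of n = 2^log2n complex values
 * stored as separate real and imaginary arrays. */
void bit_reverse_permute(float *re, float *im, unsigned log2n)
{
    size_t n = (size_t)1 << log2n;
    for (size_t i = 0; i < n; i++) {
        size_t j = reverse_bits(i, log2n);
        if (i < j) {  /* swap each pair only once */
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}
```

The data-dependent index j makes this a scatter/gather pattern, so it does not benefit from unit-stride vector loads the way the butterfly stages do.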
Other ISTORE Applications
• Working on several performance applications for ISTORE
  • Database primitives: sorts, joins, scans, etc. [Kar Ming Tang]
  • RT_STAP
    • QR Decomposition vectorizes easily, partially complete in IRAM/VSUIF
  • Conjugate Gradient [Samson Kwok]
    • Dominated by sparse matrix-vector multiply
    • Current performance: 500/250 Mflops (single/double) on VIRAM
    • Compare to 10s of Mflops on most RISC machines
  • Dense linear algebra [Simon Yau]
• Considering other DIS benchmarks, such as MoM
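The sparse matrix-vector multiply that dominates Conjugate Gradient can be sketched as follows, assuming compressed sparse row (CSR) storage. The talk does not say which storage format was used, so the format and the name spmv_csr are assumptions. The indexed load x[col_idx[k]] is the gather that vector hardware with scatter/gather support can execute directly.

```c
#include <stddef.h>

/* y = A * x for a sparse matrix A in CSR form:
 * row_ptr has nrows + 1 entries; the nonzeros of row i are
 * val[row_ptr[i] .. row_ptr[i+1]-1], with columns in col_idx. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double s = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];  /* gather from x */
        y[i] = s;
    }
}
```

The inner loop length equals the number of nonzeros in a row, so rows with few nonzeros yield short vectors, which ties back to the short-vector concerns raised for multimedia codes.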
Conclusions
• Significant compiler progress:
  • Cray collaboration key [Dave Judd, UCB @ Eagan]
    • Good tech transfer model
  • Vector code gen and instruction scheduling are the next steps
  • Even the VSUIF version indicates reasonable performance
  • Commercial-quality compiler will allow non-toy applications, e.g., speech
• Benchmarks
  • Have been used to help with final ISA design
  • Simulated results validate performance claims
• Models show real advantage to Intelligence in Memory (and Disk)
  • Machines scale, and with a simpler programming and optimization model than conventional multiprocessors