180 likes | 193 Views
Rough Schedule. 1:30-2:15 IRAM overview 2:15-3:00 ISTORE overview break 3:15-3:30 Financial 4:00-5:00 Future. IRAM Hardware and Software. Kathy Yelick Computer Science Division UC Berkeley. L o g i c. f a b. Proc. $. $. L2 $. Bus. Bus. D. R. A. M. I/O. I/O. I/O. I/O.
E N D
Rough Schedule • 1:30-2:15 IRAM overview • 2:15-3:00 ISTORE overview • break • 3:15-3:30 Financial • 4:00-5:00 Future
IRAM Hardware and Software Kathy Yelick Computer Science Division UC Berkeley
L o g i c f a b Proc $ $ L2$ Bus Bus D R A M I/O I/O I/O I/O Proc f a b D R A M Bus D R A M Intelligent RAM: IRAM Microprocessor & DRAM on a single chip: • 10X capacity vs. DRAM • on-chip memory latency 5-10X, bandwidth 50-100X • improve energy efficiency 2X-4X (no off-chip bus) • serial I/O 5-10X v. buses • smaller board area/volume IRAM advantages extend to: • a single chip system • a building block for larger systems
C P U+$ 4 Vector Pipes/Lanes VIRAM: System on a Chip • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS Scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines @ 200 MHz • 17x17 mm, 2 Watts target • 25.6 GB/s memory (6.4 GB/s per direction and per Xbar) • 0.8 Gflops (64-bit), 6.4 GOPs (16-bit) Memory(64 Mbits / 8 MBytes) Xbar Memory(64 Mbits / 8 MBytes)
IRAM Chip Update • IBM supplying embedded DRAM/Logic (100%) • Agreement in place and technology files available • MIPS supplying scalar core (100%) • MIPS processor, caches, TLB • MIT supplying FPU (100%) • VIRAM-1 Tape-out scheduled for late-2000 • Simplifications • Floating point • Network Interface
MIPS scalar core Synthesizable RTL code received from MIPS Cache RAMs to be compiled for IBM technology FPU RTL code almost compete Vector unit RTL models for sub-blocks developed; currently integrated and tested Control logic to be compiled for IBM technology Full-custom layout for multipliers/adders developed; layout for shifters to be developed Memory system Synthesizable model for DRAM controllers done To be integrated with IBM DRAM macros Full-custom layout for crossbar under development Testing infrastructure Environment developed for automatic test & validation Directed tests for single/multiple instruction groups developed Random instruction sequence generator developed VIRAM-1 Chip Design Status
IRAM Architecture Update • ISA mostly frozen since 6/99 • Changes in 2H 99 for better fixed-point model and some instructions for short vectors (auto increment and in-register permutations) • Minor changes in 1H 00 to address new co-processor interface in MIPS core • ISA manual publicly available • http://www.cs.berkeley.edu • Suite of simulators actively used • vsim-isa (functional) • Major rewrite underway for new scalar processor • All UCB code • vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)
Vectorizer Code Generators Frontends C PDGCS C90 C++ IRAM Fortran IRAM Compiler Status • Retarget of Cray Backend • Steps in compiler development • Build MIPS backend (done) • Build VIRAM bacckend for vectorized loops (done) • Instruction scheduling for VIRAM-1 (works, but could be improved) • Insertion of memory barriers (using Cray strategy, improving) • Optimizations for short loops (reduce overhead) • Feedback results to Cray, new version from Cray (ongoing)
IRAM Compiler Update • Study of compiler quality using 100 “Dongarra loops” • 70 vectorized • Average 10x reduction in dynamic instruction count • Average vector length of 42 • 30 did not, usually due to a dependence • Some reductions missed • Vector version of math libraries (sin, cos, etc.) needed • Some failed due to bugs in benchmark • Identified 2 specific areas for improvements in loop overhead • Use VL and MVL more carefully • Use auto-increment instruction more extensively
Compiled Applications Update • Applications using compiler • Speech processing under development • Developed new small-memory algorithm for speech processing • Uses some existing kernels (FFT and MM) • Vector search algorithm is most challenging • DIS image understanding application under development • Compiles, but does not yet vectorize well • Singular Value Decomposition • Better than 2 VLIW machines (TI C67 and TM 1100) • Challenging BLAS-1,2 work well on IRAM because of memory BW • Kernels • SAXPY, MVM, etc. • Will include DIS stress-marks
(10n x n SVD, rank 10) (From Herman, Loo, Tang, CS252 project)
Hand-Coded Applications Update • Image processing kernels (old FPU model) • Note BLAS-2 performance
0 1 15 0 1 16 16 16 15 Problem: General Element Permutation • Hardware for a full vector permutation instruction (128 16b elements, 256b datapath) • Datapath: 16 x 16 (x 16b) crossbar; scales by 0(N^2) • Control: 16 16-to-1 multiplexors; scales by 0(N*logN) • Time/energy wasted on wide vector register file port
0 1 15 Simple Vector Permutations • Simple steps of butterfly permutations • A register provides the butterfly radix • Separate instructions for moving elements to left/right • Sufficient semantics for • Fast reductions of vector registers (dot products) • Fast FFT kernels
64 64 64 64 shift shift 0 3 Hardware for Simple Permutations • Hardware for 128 16b elements, 256b datapath • Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); Scales by O(N) • Control: 6 control cases; scales by O(N) • Other benefits • Consecutive result elements written together; • Buses used only for small radices
FFT: Uses In-Register Permutations Without in-register permutations
Summary • IRAM takes advantage of high on-chip bandwidth • BLAS-2 performance confirms this • Vector IRAM ISA utilizes this bandwidth • Unit, strided, and indexed memory access patterns supported • Exploits fine-grained parallelism, even with pointer chasing • Compiler • Well-understood compiler model, semi-automatic • Still some work on code generation quality • Application benchmarks • Compiled and hand-coded • Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
IRAM as Building Block for ISTORE • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • Target for + 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec disk (projected) • connected via crossbar switch • O(10) Gflops • 10,000+ nodes fit into one rack!