180 likes | 194 Views
Learn about Intelligent RAM (IRAM) advancements at UC Berkeley, including VIRAM, compilers, applications, and hardware updates with IBM, MIPS, and MIT collaborations. Stay informed on groundbreaking developments in IRAM architecture and compiler optimization strategies.
E N D
Rough Schedule • 1:30-2:15 IRAM overview • 2:15-3:00 ISTORE overview • break • 3:15-3:30 Financial • 4:00-5:00 Future
IRAM Hardware and Software Kathy Yelick Computer Science Division UC Berkeley
L o g i c f a b Proc $ $ L2$ Bus Bus D R A M I/O I/O I/O I/O Proc f a b D R A M Bus D R A M Intelligent RAM: IRAM Microprocessor & DRAM on a single chip: • 10X capacity vs. DRAM • on-chip memory latency 5-10X, bandwidth 50-100X • improve energy efficiency 2X-4X (no off-chip bus) • serial I/O 5-10X v. buses • smaller board area/volume IRAM advantages extend to: • a single chip system • a building block for larger systems
C P U+$ 4 Vector Pipes/Lanes VIRAM: System on a Chip • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS Scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines @ 200 MHz • 17x17 mm, 2 Watts target • 25.6 GB/s memory (6.4 GB/s per direction and per Xbar) • 0.8 Gflops (64-bit), 6.4 GOPs (16-bit) Memory(64 Mbits / 8 MBytes) Xbar Memory(64 Mbits / 8 MBytes)
IRAM Chip Update • IBM supplying embedded DRAM/Logic (100%) • Agreement in place and technology files available • MIPS supplying scalar core (100%) • MIPS processor, caches, TLB • MIT supplying FPU (100%) • VIRAM-1 Tape-out scheduled for late-2000 • Simplifications • Floating point • Network Interface
MIPS scalar core Synthesizable RTL code received from MIPS Cache RAMs to be compiled for IBM technology FPU RTL code almost compete Vector unit RTL models for sub-blocks developed; currently integrated and tested Control logic to be compiled for IBM technology Full-custom layout for multipliers/adders developed; layout for shifters to be developed Memory system Synthesizable model for DRAM controllers done To be integrated with IBM DRAM macros Full-custom layout for crossbar under development Testing infrastructure Environment developed for automatic test & validation Directed tests for single/multiple instruction groups developed Random instruction sequence generator developed VIRAM-1 Chip Design Status
IRAM Architecture Update • ISA mostly frozen since 6/99 • Changes in 2H 99 for better fixed-point model and some instructions for short vectors (auto increment and in-register permutations) • Minor changes in 1H 00 to address new co-processor interface in MIPS core • ISA manual publicly available • http://www.cs.berkeley.edu • Suite of simulators actively used • vsim-isa (functional) • Major rewrite underway for new scalar processor • All UCB code • vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)
Vectorizer Code Generators Frontends C PDGCS C90 C++ IRAM Fortran IRAM Compiler Status • Retarget of Cray Backend • Steps in compiler development • Build MIPS backend (done) • Build VIRAM bacckend for vectorized loops (done) • Instruction scheduling for VIRAM-1 (works, but could be improved) • Insertion of memory barriers (using Cray strategy, improving) • Optimizations for short loops (reduce overhead) • Feedback results to Cray, new version from Cray (ongoing)
IRAM Compiler Update • Study of compiler quality using 100 “Dongarra loops” • 70 vectorized • Average 10x reduction in dynamic instruction count • Average vector length of 42 • 30 did not, usually due to a dependence • Some reductions missed • Vector version of math libraries (sin, cos, etc.) needed • Some failed due to bugs in benchmark • Identified 2 specific areas for improvements in loop overhead • Use VL and MVL more carefully • Use auto-increment instruction more extensively
Compiled Applications Update • Applications using compiler • Speech processing under development • Developed new small-memory algorithm for speech processing • Uses some existing kernels (FFT and MM) • Vector search algorithm is most challenging • DIS image understanding application under development • Compiles, but does not yet vectorize well • Singular Value Decomposition • Better than 2 VLIW machines (TI C67 and TM 1100) • Challenging BLAS-1,2 work well on IRAM because of memory BW • Kernels • SAXPY, MVM, etc. • Will include DIS stress-marks
(10n x n SVD, rank 10) (From Herman, Loo, Tang, CS252 project)
Hand-Coded Applications Update • Image processing kernels (old FPU model) • Note BLAS-2 performance
0 1 15 0 1 16 16 16 15 Problem: General Element Permutation • Hardware for a full vector permutation instruction (128 16b elements, 256b datapath) • Datapath: 16 x 16 (x 16b) crossbar; scales by 0(N^2) • Control: 16 16-to-1 multiplexors; scales by 0(N*logN) • Time/energy wasted on wide vector register file port
0 1 15 Simple Vector Permutations • Simple steps of butterfly permutations • A register provides the butterfly radix • Separate instructions for moving elements to left/right • Sufficient semantics for • Fast reductions of vector registers (dot products) • Fast FFT kernels
64 64 64 64 shift shift 0 3 Hardware for Simple Permutations • Hardware for 128 16b elements, 256b datapath • Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); Scales by O(N) • Control: 6 control cases; scales by O(N) • Other benefits • Consecutive result elements written together; • Buses used only for small radices
FFT: Uses In-Register Permutations Without in-register permutations
Summary • IRAM takes advantage of high on-chip bandwidth • BLAS-2 performance confirms this • Vector IRAM ISA utilizes this bandwidth • Unit, strided, and indexed memory access patterns supported • Exploits fine-grained parallelism, even with pointer chasing • Compiler • Well-understood compiler model, semi-automatic • Still some work on code generation quality • Application benchmarks • Compiled and hand-coded • Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
IRAM as Building Block for ISTORE • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • Target for + 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec disk (projected) • connected via crossbar switch • O(10) Gflops • 10,000+ nodes fit into one rack!