Toward an Advanced Intelligent Memory System: FlexRAM
Josep Torrellas, University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu
People Involved
Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
Other faculty: David Padua, H. V. Jagadish, Daniel Reed
Technological Landscape
Merged Logic and DRAM (MLD):
• IBM, Mitsubishi, Samsung, Toshiba, and others
• Powerful: e.g., IBM SA-27E ASIC (Feb 99)
• 0.18 um (chips for 1-Gbit DRAM)
• Logic frequency: 400 MHz
• IBM PowerPC 603 processor + 16-KB I and D caches = 3% of the chip area
• Further advances on the horizon
Opportunity: how to exploit MLD best?
Terminology
Processor In Memory (PIM) = Intelligent Memory = Intelligent RAM (IRAM)
Key Applications That Benefit from This HW
• Data Mining (decision trees and neural networks)
• Computational Biology (protein sequence matching)
• Financial Modeling (stock options, derivatives)
• Molecular Dynamics (short-range forces)
• Multimedia (MPEG-2)
• Decision Support Systems (TPC-D)
• Speech Recognition
All of these are data-intensive applications
Example App: DNA Matching
• Problem: find areas of database DNA chains that match (modulo some mutations) the sample DNA chains
How the Algorithm Works
• Pick 4 consecutive amino acids from the sample
• Generate the 50+ most likely mutations
• Compare them to every position in the database DNAs
• If a match is found: try to extend it
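As a rough illustration of the seed-and-extend search just described, here is a plain-C sketch. The sequence representation, the single-substitution notion of "mutation," and the helper `extend_match` are assumptions made for exposition; this is not the FlexRAM application code.

```c
/* Sketch of the seed-and-extend search described above.
 * Assumptions: sequences are strings over the 20-letter amino-acid
 * alphabet, a "mutation" of a 4-residue seed is any seed at Hamming
 * distance <= 1, and a hit is extended while residues keep agreeing. */
#include <stdio.h>
#include <string.h>

#define SEED_LEN 4
static const char AA[] = "ACDEFGHIKLMNPQRSTVWY";   /* amino-acid alphabet */

/* Extend an exact seed hit left and right while residues still agree. */
static int extend_match(const char *db, int db_len, int d,
                        const char *sample, int s_len, int s)
{
    int left = 0, right = 0;
    while (d - left > 0 && s - left > 0 &&
           db[d - left - 1] == sample[s - left - 1])
        left++;
    while (d + SEED_LEN + right < db_len && s + SEED_LEN + right < s_len &&
           db[d + SEED_LEN + right] == sample[s + SEED_LEN + right])
        right++;
    return SEED_LEN + left + right;                 /* matched length */
}

int main(void)
{
    const char *sample = "MKTAYIAKQR";              /* toy sample chain   */
    const char *db     = "GGMKTQYIAKQRTT";          /* toy database chain */
    int s_len = (int)strlen(sample), db_len = (int)strlen(db);

    for (int s = 0; s + SEED_LEN <= s_len; s++) {   /* every sample seed  */
        for (int p = 0; p < SEED_LEN; p++) {        /* every mutated seed */
            for (const char *a = AA; *a; a++) {
                char mut[SEED_LEN + 1];
                memcpy(mut, sample + s, SEED_LEN);
                mut[SEED_LEN] = '\0';
                mut[p] = *a;
                /* compare against every database position */
                for (int d = 0; d + SEED_LEN <= db_len; d++) {
                    if (memcmp(db + d, mut, SEED_LEN) == 0) {
                        int len = extend_match(db, db_len, d, sample, s_len, s);
                        if (len > SEED_LEN)
                            printf("hit: sample %d / db %d, extended to %d\n",
                                   s, d, len);
                    }
                }
            }
        }
    }
    return 0;
}
```

The inner comparison loop, run over every database position for every mutated seed, is exactly the memory-intensive, low-locality work that the P.Arrays are meant to absorb.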
How to Use MLD
1. Main compute engine of the machine
• Add a processor to the DRAM chip: incremental gains
• Include a vector processor or multiple processors: hard to program
Examples: UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories
How to Use MLD (II)
2. Co-processor or special-purpose processor
• ATM switch controller
• Process data beside the disk
• Graphics accelerator
Examples: Stanford: Imagine; UC Berkeley: ISTORE
How to Use MLD (III)
3. Our approach: take the place of the memory chips in a workstation or server
• The PIM chip processes the memory-intensive parts of the program
Examples: Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA
Our Solution: Principles
• Extract high bandwidth from DRAM: many simple processing units
• Run legacy codes with high performance: do not replace the off-the-shelf uP in the workstation; take the place of a memory chip, with the same interface as DRAM; intelligent memory defaults to plain DRAM
• Small increase in cost over DRAM: simple processing units, still dense
• General purpose: do not hardwire any algorithm; no special-purpose logic
The FlexRAM Memory System
Can exploit multiple levels of parallelism. For a high-end workstation:
• 1 P.Host processor (e.g., Merced, IBM GP)
• 100's of P.Mems in memory (e.g., IBM PowerPC 603)
• 100,000's of very simple P.Arrays in memory
Memory in One FlexRAM Chip
• 64 Mbytes of DRAM organized as 16M x 32 bits
• Organized in 64 1-Mbyte banks
• Each bank: associated with 1 P.Array; 1 single port; 2 2-Kbyte row buffers (no P.Array cache)
• P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
• On-chip memory bandwidth: 102 Gbytes/second
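The 102 Gbytes/second figure is consistent with each of the 64 banks streaming one 32-bit word per 2.5-ns logic cycle out of its open row buffer; this back-of-the-envelope check is our own, not a statement from the slides:

$$64\ \text{banks} \times 4\ \text{bytes/cycle} \times 400\ \text{MHz} = 102.4\ \text{Gbytes/s}$$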
Memory in One FlexRAM Chip
Each group of 4 P.Arrays shares one 8-Kbyte, 4-ported SRAM instruction memory:
• Holds the P.Array code
• Small because the code is short
• Aggressive access time: 1 cycle = 2.5 ns
P.Array
• 64 P.Arrays per chip; not SIMD but SPMD
• 32-bit integer arithmetic; 16 registers
• No caches, no floating point
• 4 P.Arrays share one multiplier
• 28 different 16-bit instructions
• Can access its own 1 Mbyte of DRAM plus the DRAM of its left and right neighbors; the connections form a ring
• Broadcast and notify primitives: barrier
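To make the ring concrete, here is a speculative C sketch of a P.Array kernel that reads across its bank boundaries through the left and right neighbors. The slides do not say how a P.Array names a neighbor's bank, so the `my_bank`, `left_bank`, and `right_bank` helpers and the smoothing kernel are invented for illustration.

```c
/* Illustration only: a P.Array kernel touching its own 1-Mbyte bank
 * plus its ring neighbors' banks.  The bank-naming helpers below are
 * hypothetical, not part of the FlexRAM ISA. */
#include <stdint.h>

#define BANK_BYTES (1u << 20)              /* 1 Mbyte per bank */

extern uint8_t *my_bank(void);             /* hypothetical intrinsics */
extern uint8_t *left_bank(void);
extern uint8_t *right_bank(void);

/* Smooth a byte array striped across the banks: each P.Array averages
 * every element of its own slice with its neighbors, crossing the bank
 * boundary through the ring at the two ends. */
void smooth_my_slice(uint8_t *out)
{
    uint8_t *self  = my_bank();
    uint8_t *left  = left_bank();
    uint8_t *right = right_bank();

    for (uint32_t i = 0; i < BANK_BYTES; i++) {
        uint8_t prev = (i == 0)              ? left[BANK_BYTES - 1] : self[i - 1];
        uint8_t next = (i == BANK_BYTES - 1) ? right[0]             : self[i + 1];
        out[i] = (uint8_t)((prev + self[i] + next) / 3);
    }
}
```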
P.Mem
• 2-issue static superscalar, like the IBM PowerPC 603
• 16-Kbyte I and D caches
• Executes the serial sections
• Communication with P.Arrays: broadcast/notify or plain writes/reads to memory
• Communication with other P.Mems: memory in all chips is visible; access is via the inter-chip network; must flush caches to ensure data coherence
Issues
Communication P.Mem-P.Host:
• P.Mem cannot be the master of the bus
• P.Host starts the P.Mems by writing a register in the Rambus interface
• P.Host polls a register in the Rambus interface of the master P.Mem
• If the P.Mem is not finished, the memory controller retries; the retries are invisible to the P.Host
Virtual memory:
• P.Mems and P.Arrays use virtual memory
• They share a range of virtual addresses with the P.Host
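A minimal sketch of the P.Host side of this start/poll protocol, written as plain C against two memory-mapped Rambus-interface registers; the register addresses, names, and DONE-bit encoding are illustrative assumptions, not the actual FlexRAM interface.

```c
/* Illustrative only: P.Host starts the P.Mems and waits for completion
 * by polling a status register in the Rambus interface.  The addresses
 * and bit layout below are made up for this sketch. */
#include <stdint.h>

#define PMEM_START_REG  ((volatile uint32_t *)0xF0000000u)  /* hypothetical */
#define PMEM_STATUS_REG ((volatile uint32_t *)0xF0000004u)  /* hypothetical */
#define PMEM_DONE       0x1u

static void run_on_pmems(uint32_t task_id)
{
    /* P.Host kicks off the P.Mems by writing a register in the
     * Rambus interface (a P.Mem cannot master the bus itself). */
    *PMEM_START_REG = task_id;

    /* P.Host then polls a status register of the master P.Mem.
     * In the real system the memory controller retries the access
     * transparently; here we simply spin until the flag is set. */
    while ((*PMEM_STATUS_REG & PMEM_DONE) == 0)
        ;  /* busy-wait */
}
```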
Area Estimation (mm², very conservative)
PowerPC 603 + caches: 12
64 Mbytes of DRAM: 330
SRAM instruction memory: 34
P.Arrays: 96
Multipliers: 10
Rambus interface: 3.4
Pads + network interface + refresh logic: 20
Total = 505
Of which: 28% logic, 65% DRAM, 7% SRAM
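As a quick consistency check of these numbers (our own arithmetic, not part of the slides):

$$12 + 330 + 34 + 96 + 10 + 3.4 + 20 = 505.4 \approx 505\ \text{mm}^2,\qquad \tfrac{141.4}{505} \approx 28\%\ \text{logic},\ \tfrac{330}{505} \approx 65\%\ \text{DRAM},\ \tfrac{34}{505} \approx 7\%\ \text{SRAM}$$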
Utilization
[Charts omitted: high P.Array utilization, low P.Mem utilization, low P.Host utilization]
Speedups
[Charts omitted: constant problem size vs. scaled problem size; varying logic frequency]
Programming FlexRAM
• FlexRAM is programmed in C plus extensions: C-Flex
• Library of Intelligent Memory Operations (IMOs): C subroutines that can be called from the main program, are executed by the P.Arrays or the P.Mem, and operate on large data sets with poor locality
• The library also contains plain subroutines
• Link the program with either the IMOs or the plain subroutines
C-Flex Programming Extensions
• On processor_range: where the following code is executed
• Wait for processor_range: processors waiting for others
• Map object to processor_range: mapping of pages
• Release object
• Flush(object), Flush&Inval(object): flush from the cache
• Broadcast(address), Poll(), Receive(address), Notify()
• FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
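As an illustration of how these extensions might combine in a program, here is a speculative fragment. The slides name the constructs but do not show their concrete syntax, so the spellings used in the comments below, the processor-range name, and the vector-add kernel are assumptions, not real C-Flex code.

```c
/* Speculative sketch only: the C-Flex constructs are shown as comments
 * because their exact syntax is not given in the slides. */
#include <stddef.h>

void vector_add(const float *a, const float *b, float *c, size_t n)
{
    /* FlexRAM_malloc()/P_array_malloc() would allocate a, b, c in
       P.Array-accessible memory; Map object to p_array_range would
       place their pages near the P.Arrays that touch them.          */

    /* On p_array_range { ... }  -- the loop below would run on the
       P.Arrays, each covering a slice of 0..n-1.                     */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* Wait for p_array_range    -- the P.Mem waits until all P.Arrays
       have finished (Notify()/Poll() underneath).                    */

    /* Flush&Inval(c)            -- flush c from the P.Mem caches so
       the P.Host sees coherent data.                                 */
}
```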
Performance Evaluation
• Hardware performance monitoring embedded in the chip
• Software tools to extract and interpret performance information
Current Status
• Identified and wrote all the applications
• Designed the architecture based on the apps and feasible technology
• Conceived the ideas behind the language/compiler
• Still to do: chip layout and fabrication; development of the compiler
• Funds needed for: processor core (P.Mem); chip fabrication; hardware and software engineers
Overall Goal
• Fabricate chips
• Build a workstation with an intelligent memory system
• Build a compiler for the intelligent memory system
• Demonstrate significant speedups on real applications
Conclusion
• We have a handle on: a promising technology (MLD); key applications of industrial interest
• Real chance to transform the computing landscape