1 / 34

Toward an Advanced Intelligent Memory System

Toward an Advanced Intelligent Memory System. FlexRAM. Josep Torrellas. University of Illinois. http://iacoma.cs.uiuc.edu. torrellas@cs.uiuc.edu. People Involved. Students. Other faculty. Michael Huang. David Padua. Joe Renau. H. V. Jagadish. Seung Yoo. Daniel Reed. Jaejin Lee.

Download Presentation

Toward an Advanced Intelligent Memory System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Toward an Advanced Intelligent Memory System FlexRAM Josep Torrellas University of Illinois http://iacoma.cs.uiuc.edu torrellas@cs.uiuc.edu

  2. People Involved Students Other faculty Michael Huang David Padua Joe Renau H. V. Jagadish Seung Yoo Daniel Reed Jaejin Lee

  3. Technological Landscape Merged Logic and DRAM (MLD): • IBM, Mitsubishi, Samsung, Toshiba and others • Powerful: e.g. IBM SA-27E ASIC (Feb 99) • 0.18 um (chips for 1 Gbit DRAM) • Logic frequency: 400 MHz • IBM PowerPC 603 proc + 16 KB I, D caches = 3% • Further advances in the horizon Opportunity: How to exploit MLD best?

  4. Terminology Processor In Memory (PIM) Intelligent Memory or Intelligent RAM (IRAM) =

  5. Key Applications Benefit from HW • Data Mining (decision trees and neural networks) • Computational Biology (protein sequence matching) • Financial Modeling (stock options, derivatives) • Molecular Dynamics (short-range forces) • Multimedia (MPEG-2) • Decision Support Systems (TPC-D) • Speech Recognition All these are Data Intensive Applications

  6. Example App: DNA Matching • Problem: Find areas of database DNA chains that match (modulo some mutations) the sample DNA chains

  7. How the Algorithm Works • Pick 4 consecutive aminoacids from sample • Generate 50+ most-likely mutations

  8. Example App: DNA Matching • Compare them to every positions in the database DNAs • If match is found: try to extend it

  9. How to Use MLD 1. Main compute engine of the machine • Add proc to DRAM chip Incremental gains • Include a vector processor • or multiple processors Hard to program UC Berkeley: IRAM Notre Dame: Execube, Petaflops MIT: Raw Stanford: Smart Memories

  10. How to Use MLD (II) 2. Co-processor, special-purpose processor • ATM switch controller • Process data beside the disk • Graphics accelerator Stanford: Imagine UC Berkeley: ISTORE

  11. How to Use MLD (III) 3. Our approach: take the place of memory chips in a workstation or server • PIM chip processes the memory-intensive parts • of the program Illinois: FlexRAM UC Davis: Active Pages USC-ISI: DIVA

  12. Our Solution: Principles • Extract high bandwidth from DRAM: • Many simple processing units • Run legacy codes with high performance: • Do not replace off-the-shelf uP in workstation • Take place of memory chip. Same interface as DRAM • Intelligent memory defaults to plain DRAM • Small increase in cost over DRAM: • Simple processing units, still dense • General purpose: • Do not hardwire any algorithm. No Special purpose

  13. Architecture Proposed

  14. The FlexRAM Memory System Can exploit multiple levels of parallelism For a high-end workstation: • 1 P.Host processor (e.g. Merced, IBM GP) • 100’s of P.Mems in memory (e.g. IBM PowerPC 603) • 100,000’s of very simple P.Arrays in memory

  15. Chip Organization

  16. Memory in one FlexRAM Chip • 64 Mbytes of DRAM organized as 16Mx32 bits • Organized in 64 1-Mbyte banks • Each bank: • Associated to 1 P.Array • 1 single port • 2 2-Kbyte row buffers (no P.Array cache) • P.Array access to memory: 10 ns (row hit) or 20 ns (miss) • On-chip memory bandwidth: 102 Gbytes/second

  17. Memory in one FlexRAM Chip Group of 4 P.Arrays share one 8-Kbyte, 4-ported SRAM instruction memory • Holds the P.Array code • Small because short code • Aggressive access time: 1 cycle = 2.5 ns

  18. P.Array • 64 P.Arrays per chip. Not SIMD but SPMD • 32-bit integer arithmetic; 16 registers • No caches, no floating point • 4 P.Arrays share one multiplier • 28 different 16-bit instructions • Can access own 1 Mbyte of DRAM plus DRAM of • left and right neighbors. Connection forms a ring • Broadcast and notify primitives: Barrier

  19. P.Mem • 2-issue static superscalar like IBM PowerPC 603 • 16-Kbyte I, D caches • Executes serial sections • Communication with P.Arrays: • Broadcast/notify or plain write/read to memory • Communication with other P.Mems: • Memory in all chips is visible • Access via the inter-chip network • Must flush caches to ensure data coherence

  20. Issues Communication P.Mem-P.Host: • P.Mem cannot be the master of bus • P.Host starts P.Mems by writing register in Rambus interf. • P.Host polls a register in Rambus interf. of master P.Mem • If P.Mem not finished: memory controller retries. Retries • are invisible to P.Host Virtual memory: • P.Mems and P.Arrays use virtual memory • They share a range of virtual addresses with P.Host

  21. Chip Architecture

  22. Basic Block

  23. Area Estimation (mm ) 2 VERY CONSERVATIVE PowerPC 603+caches: 12 64 Mbytes of DRAM: 330 SRAM instruction memory: 34 P.Arrays: 96 Multipliers: 10 Rambus interface: 3.4 Pads + network interf. + refresh logic 20 Total = 505 Of which 28% logic, 65% DRAM, 7% SRAM

  24. Evaluation

  25. High P.Array Util Low P.Mem Util Utilization

  26. Utilization • Low P.Host Utilization

  27. Constant Problem Sz Scaled Problem Sz Speedups

  28. Speedups • Varying Logic Frequency

  29. Programming FlexRAM • FlexRAM programmed in C + extensions: C-Flex • Library of Intelligent Memory Operations (IMOs) C subroutines that can be called from main pgm Executed by P.Arrays or P.Mem Operate on large data sets with poor locality • Library also contains plain subroutines • Link program with IMOs or plain subroutines

  30. C-Flex Programming Extensions • On processor_range: where the following code is executed • Waitforprocessor_range: processors waiting for others • Mapobject to processor_range: mapping of pages • Release object • Flush(object), Flush&Inval(object): flush from cache • Broadcast(address), Poll(), Receive(address), Notify() • FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()

  31. Performance Evaluation • Hardware performance monitoring embedded in the chip • Software tools to extract and interpret performance info

  32. Current Status • Identified and wrote all applications • Designed architecture based on apps & feasible technology • Conceived ideas behind language/compiler • Need to do: chip layout and fabrication development of the compiler • Funds needed for: • processor core (P.Mem) • chip fabrication • hardware and software engineers

  33. Overall Goal • Fabricate chips • Build a workstation with an intelligent memory system • Build a compiler for the intelligent memory system • Demonstrate significant speedups on real applications

  34. Conclusion • We have a handle on: • A promising technology (MLD) • Key applications of industrial interest • Real chance to transform the computing landscape

More Related