Toward an Advanced Intelligent Memory System: FlexRAM
Josep Torrellas, University of Illinois
http://iacoma.cs.uiuc.edu
torrellas@cs.uiuc.edu
People Involved
Students: Michael Huang, Joe Renau, Seung Yoo, Jaejin Lee
Other faculty: David Padua, H. V. Jagadish, Daniel Reed
Technological Landscape
Merged Logic and DRAM (MLD):
• IBM, Mitsubishi, Samsung, Toshiba, and others
• Powerful: e.g., IBM SA-27E ASIC (Feb 99)
• 0.18 um (chips for 1-Gbit DRAM)
• Logic frequency: 400 MHz
• IBM PowerPC 603 processor + 16-KB I and D caches = 3% of the chip area
• Further advances on the horizon
Opportunity: how to exploit MLD best?
Terminology
Processor In Memory (PIM) = Intelligent Memory = Intelligent RAM (IRAM)
Key Applications That Benefit from This HW
• Data Mining (decision trees and neural networks)
• Computational Biology (protein sequence matching)
• Financial Modeling (stock options, derivatives)
• Molecular Dynamics (short-range forces)
• Multimedia (MPEG-2)
• Decision Support Systems (TPC-D)
• Speech Recognition
All of these are data-intensive applications
Example App: DNA Matching
• Problem: find areas of database DNA chains that match (modulo some mutations) the sample DNA chains
How the Algorithm Works
• Pick 4 consecutive amino acids from the sample
• Generate the 50+ most likely mutations
• Compare them to every position in the database DNAs
• If a match is found: try to extend it
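As a rough illustration of the seed-and-extend search just described, here is a plain-C sketch. The sequence representation, the single-substitution notion of "mutation," and the helper `extend_match` are assumptions made for exposition; this is not the FlexRAM application code.

```c
/* Sketch of the seed-and-extend search described above.
 * Assumptions: sequences are strings over the 20-letter amino-acid
 * alphabet, a "mutation" of a 4-residue seed is any seed at Hamming
 * distance <= 1, and a hit is extended while residues keep agreeing. */
#include <stdio.h>
#include <string.h>

#define SEED_LEN 4
static const char AA[] = "ACDEFGHIKLMNPQRSTVWY";   /* amino-acid alphabet */

/* Extend an exact seed hit left and right while residues still agree. */
static int extend_match(const char *db, int db_len, int d,
                        const char *sample, int s_len, int s)
{
    int left = 0, right = 0;
    while (d - left > 0 && s - left > 0 &&
           db[d - left - 1] == sample[s - left - 1])
        left++;
    while (d + SEED_LEN + right < db_len && s + SEED_LEN + right < s_len &&
           db[d + SEED_LEN + right] == sample[s + SEED_LEN + right])
        right++;
    return SEED_LEN + left + right;                 /* matched length */
}

int main(void)
{
    const char *sample = "MKTAYIAKQR";              /* toy sample chain   */
    const char *db     = "GGMKTQYIAKQRTT";          /* toy database chain */
    int s_len = (int)strlen(sample), db_len = (int)strlen(db);

    for (int s = 0; s + SEED_LEN <= s_len; s++) {   /* every sample seed  */
        for (int p = 0; p < SEED_LEN; p++) {        /* every mutated seed */
            for (const char *a = AA; *a; a++) {
                char mut[SEED_LEN + 1];
                memcpy(mut, sample + s, SEED_LEN);
                mut[SEED_LEN] = '\0';
                mut[p] = *a;
                /* compare against every database position */
                for (int d = 0; d + SEED_LEN <= db_len; d++) {
                    if (memcmp(db + d, mut, SEED_LEN) == 0) {
                        int len = extend_match(db, db_len, d, sample, s_len, s);
                        if (len > SEED_LEN)
                            printf("hit: sample %d / db %d, extended to %d\n",
                                   s, d, len);
                    }
                }
            }
        }
    }
    return 0;
}
```

The inner comparison loop, run over every database position for every mutated seed, is exactly the memory-intensive, low-locality work that the P.Arrays are meant to absorb.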
How to Use MLD
1. Main compute engine of the machine
• Add a processor to the DRAM chip: incremental gains
• Include a vector processor or multiple processors: hard to program
Examples: UC Berkeley: IRAM; Notre Dame: Execube, Petaflops; MIT: Raw; Stanford: Smart Memories
How to Use MLD (II)
2. Co-processor or special-purpose processor
• ATM switch controller
• Process data beside the disk
• Graphics accelerator
Examples: Stanford: Imagine; UC Berkeley: ISTORE
How to Use MLD (III)
3. Our approach: take the place of the memory chips in a workstation or server
• The PIM chip processes the memory-intensive parts of the program
Examples: Illinois: FlexRAM; UC Davis: Active Pages; USC-ISI: DIVA
Our Solution: Principles
• Extract high bandwidth from DRAM: many simple processing units
• Run legacy codes with high performance: do not replace the off-the-shelf uP in the workstation; take the place of a memory chip, with the same interface as DRAM; intelligent memory defaults to plain DRAM
• Small increase in cost over DRAM: simple processing units, still dense
• General purpose: do not hardwire any algorithm; no special-purpose logic
The FlexRAM Memory System
Can exploit multiple levels of parallelism. For a high-end workstation:
• 1 P.Host processor (e.g., Merced, IBM GP)
• 100's of P.Mems in memory (e.g., IBM PowerPC 603)
• 100,000's of very simple P.Arrays in memory
Memory in One FlexRAM Chip
• 64 Mbytes of DRAM organized as 16M x 32 bits
• Organized in 64 1-Mbyte banks
• Each bank: associated with 1 P.Array; 1 single port; 2 2-Kbyte row buffers (no P.Array cache)
• P.Array access to memory: 10 ns (row hit) or 20 ns (miss)
• On-chip memory bandwidth: 102 Gbytes/second
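The 102 Gbytes/second figure is consistent with each of the 64 banks streaming one 32-bit word per 2.5-ns logic cycle out of its open row buffer; this back-of-the-envelope check is our own, not a statement from the slides:

$$64\ \text{banks} \times 4\ \text{bytes/cycle} \times 400\ \text{MHz} = 102.4\ \text{Gbytes/s}$$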
Memory in One FlexRAM Chip
Each group of 4 P.Arrays shares one 8-Kbyte, 4-ported SRAM instruction memory:
• Holds the P.Array code
• Small because the code is short
• Aggressive access time: 1 cycle = 2.5 ns
P.Array
• 64 P.Arrays per chip; not SIMD but SPMD
• 32-bit integer arithmetic; 16 registers
• No caches, no floating point
• 4 P.Arrays share one multiplier
• 28 different 16-bit instructions
• Can access its own 1 Mbyte of DRAM plus the DRAM of its left and right neighbors; the connections form a ring
• Broadcast and notify primitives: barrier
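To make the ring concrete, here is a speculative C sketch of a P.Array kernel that reads across its bank boundaries through the left and right neighbors. The slides do not say how a P.Array names a neighbor's bank, so the `my_bank`, `left_bank`, and `right_bank` helpers and the smoothing kernel are invented for illustration.

```c
/* Illustration only: a P.Array kernel touching its own 1-Mbyte bank
 * plus its ring neighbors' banks.  The bank-naming helpers below are
 * hypothetical, not part of the FlexRAM ISA. */
#include <stdint.h>

#define BANK_BYTES (1u << 20)              /* 1 Mbyte per bank */

extern uint8_t *my_bank(void);             /* hypothetical intrinsics */
extern uint8_t *left_bank(void);
extern uint8_t *right_bank(void);

/* Smooth a byte array striped across the banks: each P.Array averages
 * every element of its own slice with its neighbors, crossing the bank
 * boundary through the ring at the two ends. */
void smooth_my_slice(uint8_t *out)
{
    uint8_t *self  = my_bank();
    uint8_t *left  = left_bank();
    uint8_t *right = right_bank();

    for (uint32_t i = 0; i < BANK_BYTES; i++) {
        uint8_t prev = (i == 0)              ? left[BANK_BYTES - 1] : self[i - 1];
        uint8_t next = (i == BANK_BYTES - 1) ? right[0]             : self[i + 1];
        out[i] = (uint8_t)((prev + self[i] + next) / 3);
    }
}
```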
P.Mem
• 2-issue static superscalar, like the IBM PowerPC 603
• 16-Kbyte I and D caches
• Executes the serial sections
• Communication with P.Arrays: broadcast/notify or plain writes/reads to memory
• Communication with other P.Mems: memory in all chips is visible; access is via the inter-chip network; must flush caches to ensure data coherence
Issues
Communication P.Mem-P.Host:
• P.Mem cannot be the master of the bus
• P.Host starts the P.Mems by writing a register in the Rambus interface
• P.Host polls a register in the Rambus interface of the master P.Mem
• If the P.Mem is not finished, the memory controller retries; the retries are invisible to the P.Host
Virtual memory:
• P.Mems and P.Arrays use virtual memory
• They share a range of virtual addresses with the P.Host
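A minimal sketch of the P.Host side of this start/poll protocol, written as plain C against two memory-mapped Rambus-interface registers; the register addresses, names, and DONE-bit encoding are illustrative assumptions, not the actual FlexRAM interface.

```c
/* Illustrative only: P.Host starts the P.Mems and waits for completion
 * by polling a status register in the Rambus interface.  The addresses
 * and bit layout below are made up for this sketch. */
#include <stdint.h>

#define PMEM_START_REG  ((volatile uint32_t *)0xF0000000u)  /* hypothetical */
#define PMEM_STATUS_REG ((volatile uint32_t *)0xF0000004u)  /* hypothetical */
#define PMEM_DONE       0x1u

static void run_on_pmems(uint32_t task_id)
{
    /* P.Host kicks off the P.Mems by writing a register in the
     * Rambus interface (a P.Mem cannot master the bus itself). */
    *PMEM_START_REG = task_id;

    /* P.Host then polls a status register of the master P.Mem.
     * In the real system the memory controller retries the access
     * transparently; here we simply spin until the flag is set. */
    while ((*PMEM_STATUS_REG & PMEM_DONE) == 0)
        ;  /* busy-wait */
}
```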
Area Estimation (mm², very conservative)
PowerPC 603 + caches: 12
64 Mbytes of DRAM: 330
SRAM instruction memory: 34
P.Arrays: 96
Multipliers: 10
Rambus interface: 3.4
Pads + network interface + refresh logic: 20
Total = 505
Of which: 28% logic, 65% DRAM, 7% SRAM
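As a quick consistency check of these numbers (our own arithmetic, not part of the slides):

$$12 + 330 + 34 + 96 + 10 + 3.4 + 20 = 505.4 \approx 505\ \text{mm}^2,\qquad \tfrac{141.4}{505} \approx 28\%\ \text{logic},\ \tfrac{330}{505} \approx 65\%\ \text{DRAM},\ \tfrac{34}{505} \approx 7\%\ \text{SRAM}$$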
Utilization
[Charts omitted: high P.Array utilization, low P.Mem utilization, low P.Host utilization]
Speedups
[Charts omitted: constant problem size vs. scaled problem size; varying logic frequency]
Programming FlexRAM
• FlexRAM is programmed in C plus extensions: C-Flex
• Library of Intelligent Memory Operations (IMOs): C subroutines that can be called from the main program, are executed by the P.Arrays or the P.Mem, and operate on large data sets with poor locality
• The library also contains plain subroutines
• Link the program with either the IMOs or the plain subroutines
C-Flex Programming Extensions
• On processor_range: where the following code is executed
• Wait for processor_range: processors waiting for others
• Map object to processor_range: mapping of pages
• Release object
• Flush(object), Flush&Inval(object): flush from the cache
• Broadcast(address), Poll(), Receive(address), Notify()
• FlexRAM_malloc(), P_mem_malloc(), P_array_malloc()
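As an illustration of how these extensions might combine in a program, here is a speculative fragment. The slides name the constructs but do not show their concrete syntax, so the spellings used in the comments below, the processor-range name, and the vector-add kernel are assumptions, not real C-Flex code.

```c
/* Speculative sketch only: the C-Flex constructs are shown as comments
 * because their exact syntax is not given in the slides. */
#include <stddef.h>

void vector_add(const float *a, const float *b, float *c, size_t n)
{
    /* FlexRAM_malloc()/P_array_malloc() would allocate a, b, c in
       P.Array-accessible memory; Map object to p_array_range would
       place their pages near the P.Arrays that touch them.          */

    /* On p_array_range { ... }  -- the loop below would run on the
       P.Arrays, each covering a slice of 0..n-1.                     */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    /* Wait for p_array_range    -- the P.Mem waits until all P.Arrays
       have finished (Notify()/Poll() underneath).                    */

    /* Flush&Inval(c)            -- flush c from the P.Mem caches so
       the P.Host sees coherent data.                                 */
}
```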
Performance Evaluation
• Hardware performance monitoring embedded in the chip
• Software tools to extract and interpret performance information
Current Status
• Identified and wrote all the applications
• Designed the architecture based on the apps and feasible technology
• Conceived the ideas behind the language/compiler
• Still to do: chip layout and fabrication; development of the compiler
• Funds needed for: processor core (P.Mem); chip fabrication; hardware and software engineers
Overall Goal
• Fabricate chips
• Build a workstation with an intelligent memory system
• Build a compiler for the intelligent memory system
• Demonstrate significant speedups on real applications
Conclusion
• We have a handle on: a promising technology (MLD); key applications of industrial interest
• Real chance to transform the computing landscape