Explore the evolution of big-data computation with in-memory processing: efficiency improvements and parallelism gains for very large vector operations, and solutions to the energy waste and inefficiency of conventional systems.
In-Situ Compute Memory Systems
Reetuparna Das, Assistant Professor, EECS Department
Near Data Computing: move compute near storage.
Evolution of near-data computing:
- 1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
- 2012: Resurgence of PIM, driven by the emergence of big data (data movement dominates Joules/op) and the availability of 3D memories; a logic layer near memory is enabled by 3D technology
- 2014: Automaton Processor: associative memory with custom interconnects
- 2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays
Problem 1: Memories are big but inactive. Memory, largely cache, consumes most of the aggregate die area.
Problem 2: Significant energy is wasted moving data through the memory hierarchy. An operation costs 1-50 pJ, while moving its data costs 1000-2000 pJ, roughly 20-40x more.
Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what if vectors could be 1000x larger?
Problem summary: Conventional systems process data inefficiently. Problem 1: area (memory dominates the die). Problem 2: energy (data movement). Problem 3: limited data parallelism.
Key Idea: Memory = Storage + In-place compute
Proposal: Repurpose memory logic for compute. Benefits: massive parallelism (up to 100X) and energy efficiency (up to 20X).
Compute Caches for Efficient Very Large Vector Processing
PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw
Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian
Proposal: Compute Caches. Building blocks: in-place compute SRAM sub-arrays that evaluate A op B on cache-resident operands, a cache controller extension, and memory disambiguation support. Challenges: data orchestration, managing parallelism, and coherence and consistency. A sketch of the programming interface follows below.
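To make the "A op B" interface concrete, here is a minimal sketch of a host-side primitive a compute cache might expose. The name cc_vector_op, the operation enum, and the software loop are illustrative assumptions, not the deck's actual ISA; on real hardware the cache controller would perform the loop inside the sub-arrays without moving data.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical compute-cache primitive: apply one bitwise operation to
 * two cache-resident vectors, writing the result to a third. This is a
 * software stand-in; a real controller would also require the operands
 * to be cache-line aligned and co-located in the same sub-arrays. */
typedef enum { CC_AND, CC_OR, CC_XOR, CC_COPY } cc_op_t;

static void cc_vector_op(cc_op_t op, uint8_t *dst, const uint8_t *a,
                         const uint8_t *b, size_t nbytes) {
    for (size_t i = 0; i < nbytes; i++) {
        switch (op) {
        case CC_AND:  dst[i] = a[i] & b[i]; break;
        case CC_OR:   dst[i] = a[i] | b[i]; break;
        case CC_XOR:  dst[i] = a[i] ^ b[i]; break;
        case CC_COPY: dst[i] = a[i];        break;
        }
    }
}

int main(void) {
    uint8_t a[4] = {0xF0, 0x0F, 0xAA, 0x55};
    uint8_t b[4] = {0xFF, 0x00, 0x0F, 0xF0};
    uint8_t r[4];
    cc_vector_op(CC_AND, r, a, b, sizeof r);
    for (int i = 0; i < 4; i++) printf("%02X ", r[i]); /* F0 00 0A 50 */
    printf("\n");
    return 0;
}
```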
Opportunity 1: Large vector computation. Operation width = row size. A cache is built from many small sub-arrays, and every sub-array can compute in parallel. Parallelism available in a 16 MB L3: 512 sub-arrays * 64 B = a 32 KB operand per operation (a 128X savings). The arithmetic is spelled out below.
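The small program below spells out the arithmetic behind the parallelism claim, using the geometry from the slide (512 sub-arrays, 64-byte rows in a 16 MB L3); treating every sub-array as simultaneously active is a best-case assumption.

```c
#include <stdio.h>

int main(void) {
    /* Geometry from the slide: a 16 MB L3 built from 512 sub-arrays,
     * each operating on one 64-byte row per in-place operation. */
    const int subarrays = 512;
    const int row_bytes = 64;

    int operand_bytes = subarrays * row_bytes;
    printf("Parallel operand width: %d bytes (%d KB)\n",
           operand_bytes, operand_bytes / 1024); /* 32768 bytes = 32 KB */
    return 0;
}
```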
Opportunity 2: Save data-movement energy. A significant portion of cache energy, 60-80%, is wire energy (the H-tree interconnect). Computing in place saves that wire energy and avoids moving data to higher cache levels (L3 to L2 to L1).
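A back-of-the-envelope version of the savings, assuming placeholder per-access energies consistent with the slide's 60-80% wire fraction (the 1000 pJ figure and the 70% midpoint are illustrative, not measured):

```c
#include <stdio.h>

int main(void) {
    const double access_pj = 1000.0; /* illustrative L3 access energy  */
    const double wire_frac = 0.70;   /* midpoint of the 60-80% range   */

    double wire_pj    = access_pj * wire_frac;
    double in_situ_pj = access_pj - wire_pj; /* sub-array energy only  */

    printf("conventional access: %.0f pJ (%.0f pJ in H-tree wires)\n",
           access_pj, wire_pj);
    printf("in-situ operation:   %.0f pJ (%.1fx lower)\n",
           in_situ_pj, access_pj / in_situ_pj);
    return 0;
}
```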
Compute Cache Architecture: in-place compute SRAM inside the L3 slices, cache controller extensions at each level (L1, L2, L3), an interconnect between the cores and slices, and memory disambiguation for large vector operations. More details in the upcoming HPCA paper.
SRAM array read operation: the address drives the row decoder, which raises one wordline in the sub-array; the precharged bitline pairs (BL0/BLB0 ... BLn/BLBn) discharge according to the stored bits, and differential sense amplifiers resolve the values.
In-place compute SRAM, the changes: a second row decoder (Row Decoder-O) so that two wordlines can be activated simultaneously, and the differential sense amplifiers exchanged for single-ended sense amplifiers referenced against Vref, so BL and BLB can be sensed independently.
In-place compute SRAM, computing A AND B: the two row decoders activate the wordlines of rows A and B at the same time. Each bitline BL stays high only if both activated cells store 1, so sensing BL yields bitwise A AND B (and sensing BLB yields A NOR B). A behavioral model follows below.
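The sketch below is a behavioral model (not the circuit) of what the sense amplifiers observe when the two decoders fire rows A and B together: BL stays high only where both cells store 1 (AND), and BLB stays high only where both store 0 (NOR).

```c
#include <stdint.h>
#include <stdio.h>

#define ROW_BYTES 8 /* illustrative; real rows are a cache line wide */

/* Behavioral model of dual-wordline activation:
 *   BL[i]  = A[i] AND B[i]  (true bitline acts as a wired-AND)
 *   BLB[i] = A[i] NOR B[i]  (complement bitline: high iff both are 0)
 */
static void activate_two_rows(const uint8_t *a, const uint8_t *b,
                              uint8_t *bl, uint8_t *blb) {
    for (int i = 0; i < ROW_BYTES; i++) {
        bl[i]  = a[i] & b[i];
        blb[i] = (uint8_t)~(a[i] | b[i]);
    }
}

int main(void) {
    uint8_t a[ROW_BYTES] = {0x0F, 0xAA, 0xFF, 0x00, 0x55, 0xC3, 0x3C, 0x81};
    uint8_t b[ROW_BYTES] = {0xF0, 0xAA, 0x0F, 0x00, 0xFF, 0xA5, 0x5A, 0x7E};
    uint8_t bl[ROW_BYTES], blb[ROW_BYTES];

    activate_two_rows(a, b, bl, blb);
    for (int i = 0; i < ROW_BYTES; i++)
        printf("A=%02X B=%02X  AND=%02X  NOR=%02X\n",
               a[i], b[i], bl[i], blb[i]);
    return 0;
}
```

OR and XOR follow by combining the two outputs (for example, OR = NOT(NOR)), which is why a small set of logical primitives falls out of this one circuit change.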
Applications modeled using Compute Caches: text processing (StringMatch, WordCount), in-memory checkpointing, the FastBit bitmap database, and bit matrix multiplication. A sketch of one mapping follows below.
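As one example of the mapping, here is a hedged sketch of byte-level string matching built only on the bitwise primitives above: replicate the pattern byte across a row, XOR it with the data row (an in-cache operation), and scan for zero bytes. The helper name and row layout are illustrative assumptions, not the deck's implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROW_BYTES 64 /* one cache line per sub-array row */

/* Sketch: compare a 1-byte pattern against every byte of a data row
 * using only bitwise ops of the kind a compute cache provides.
 * XOR yields 0x00 exactly where data and pattern agree; scanning for
 * zero bytes (done here in software) marks the match positions. */
static int match_positions(const uint8_t *row, uint8_t pattern,
                           int *hits, int max_hits) {
    uint8_t pat_row[ROW_BYTES], diff[ROW_BYTES];
    memset(pat_row, pattern, sizeof pat_row); /* replicate the pattern */

    for (int i = 0; i < ROW_BYTES; i++)
        diff[i] = row[i] ^ pat_row[i];        /* in-cache XOR */

    int n = 0;
    for (int i = 0; i < ROW_BYTES && n < max_hits; i++)
        if (diff[i] == 0) hits[n++] = i;      /* zero byte = match */
    return n;
}

int main(void) {
    uint8_t row[ROW_BYTES];
    memset(row, 'a', sizeof row);
    row[3] = row[40] = 'x';

    int hits[8];
    int n = match_positions(row, 'x', hits, 8);
    for (int i = 0; i < n; i++)
        printf("match at byte %d\n", hits[i]); /* bytes 3 and 40 */
    return 0;
}
```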
Compute Cache results summary: 1.9X performance improvement and 2.4X energy savings, at a 4% overhead.
Compute Cache summary: empower caches to compute. Performance comes from large vector parallelism; energy savings from reduced on-chip data movement. Enablers: in-place compute SRAM, plus data placement and cache geometry for increased operand locality. Cost and results: 8% area overhead, 2.1X performance, 2.7X energy savings.
Compute Memory System Stack:
- Redesign applications (machine learning, data analytics, crypto, image processing, bioinformatics, graphs, OS primitives, FSA): express computation using in-situ operations
- Adapt PL/compiler (Java/C++, OpenCL, data-flow languages such as Google's TensorFlow [14] and RAPID [20]): express parallelism
- Design architecture (ISA: large SIMD, data-flow): data orchestration, coherence and consistency, managing parallelism
- Design compute memories: customize the hierarchy across volatile (SRAM cache, DRAM) and non-volatile (Flash, STT-RAM/MRAM, Re-RAM) technologies; choose where to compute, the in-situ technique (bit-line computing, parallel automaton, locality), and a rich operation set (logical, data migration, comparison, search, addition, multiplication, convolution, FSM)
In-Situ Compute Memory Systems
Thank You!
Reetuparna Das