Explore the evolution of big-data computation with in-memory processing: efficiency improvements and parallelism gains for very large vector operations, and solutions to the energy waste and inefficiency of conventional systems.
In-Situ Compute Memory Systems
Reetuparna Das, Assistant Professor, EECS Department
Near Data Computing: move compute near storage.
Evolution of near-data computing:
- 1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
- 2012: Resurgence of PIM, driven by the emergence of big data (data movement dominates Joules/op) and the availability of 3D memories; a logic layer near memory is enabled by 3D technology
- 2014: Automaton Processor: associative memory with custom interconnects
- 2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays
Problem 1: Memories are big but inactive. Memory, largely cache, consumes most of the aggregate die area.
Problem 2: Significant energy is wasted moving data through the memory hierarchy. An operation costs 1-50 pJ, while moving its data costs 1000-2000 pJ, roughly 20-40x more.
Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what if vectors could be 1000x larger?
Problem summary: Conventional systems process data inefficiently. Problem 1: area (memory dominates the die). Problem 2: energy (data movement). Problem 3: limited data parallelism.
Key Idea: Memory = Storage + In-place compute
Proposal: Repurpose memory logic for compute. Benefits: massive parallelism (up to 100X) and energy efficiency (up to 20X).
Compute Caches for Efficient Very Large Vector Processing
PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw
Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian
Proposal: Compute Caches. Building blocks: in-place compute SRAM sub-arrays that evaluate A op B on cache-resident operands, a cache controller extension, and memory disambiguation support. Challenges: data orchestration, managing parallelism, and coherence and consistency. A sketch of the programming interface follows below.
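To make the "A op B" interface concrete, here is a minimal sketch of a host-side primitive a compute cache might expose. The name cc_vector_op, the operation enum, and the software loop are illustrative assumptions, not the deck's actual ISA; on real hardware the cache controller would perform the loop inside the sub-arrays without moving data.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical compute-cache primitive: apply one bitwise operation to
 * two cache-resident vectors, writing the result to a third. This is a
 * software stand-in; a real controller would also require the operands
 * to be cache-line aligned and co-located in the same sub-arrays. */
typedef enum { CC_AND, CC_OR, CC_XOR, CC_COPY } cc_op_t;

static void cc_vector_op(cc_op_t op, uint8_t *dst, const uint8_t *a,
                         const uint8_t *b, size_t nbytes) {
    for (size_t i = 0; i < nbytes; i++) {
        switch (op) {
        case CC_AND:  dst[i] = a[i] & b[i]; break;
        case CC_OR:   dst[i] = a[i] | b[i]; break;
        case CC_XOR:  dst[i] = a[i] ^ b[i]; break;
        case CC_COPY: dst[i] = a[i];        break;
        }
    }
}

int main(void) {
    uint8_t a[4] = {0xF0, 0x0F, 0xAA, 0x55};
    uint8_t b[4] = {0xFF, 0x00, 0x0F, 0xF0};
    uint8_t r[4];
    cc_vector_op(CC_AND, r, a, b, sizeof r);
    for (int i = 0; i < 4; i++) printf("%02X ", r[i]); /* F0 00 0A 50 */
    printf("\n");
    return 0;
}
```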
Opportunity 1: Large vector computation. Operation width = row size. A cache is built from many small sub-arrays, and every sub-array can compute in parallel. Parallelism available in a 16 MB L3: 512 sub-arrays * 64 B = a 32 KB operand per operation (a 128X savings). The arithmetic is spelled out below.
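The small program below spells out the arithmetic behind the parallelism claim, using the geometry from the slide (512 sub-arrays, 64-byte rows in a 16 MB L3); treating every sub-array as simultaneously active is a best-case assumption.

```c
#include <stdio.h>

int main(void) {
    /* Geometry from the slide: a 16 MB L3 built from 512 sub-arrays,
     * each operating on one 64-byte row per in-place operation. */
    const int subarrays = 512;
    const int row_bytes = 64;

    int operand_bytes = subarrays * row_bytes;
    printf("Parallel operand width: %d bytes (%d KB)\n",
           operand_bytes, operand_bytes / 1024); /* 32768 bytes = 32 KB */
    return 0;
}
```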
Opportunity 2: Save data-movement energy. A significant portion of cache energy, 60-80%, is wire energy (the H-tree interconnect). Computing in place saves that wire energy and avoids moving data to higher cache levels (L3 to L2 to L1).
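A back-of-the-envelope version of the savings, assuming placeholder per-access energies consistent with the slide's 60-80% wire fraction (the 1000 pJ figure and the 70% midpoint are illustrative, not measured):

```c
#include <stdio.h>

int main(void) {
    const double access_pj = 1000.0; /* illustrative L3 access energy  */
    const double wire_frac = 0.70;   /* midpoint of the 60-80% range   */

    double wire_pj    = access_pj * wire_frac;
    double in_situ_pj = access_pj - wire_pj; /* sub-array energy only  */

    printf("conventional access: %.0f pJ (%.0f pJ in H-tree wires)\n",
           access_pj, wire_pj);
    printf("in-situ operation:   %.0f pJ (%.1fx lower)\n",
           in_situ_pj, access_pj / in_situ_pj);
    return 0;
}
```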
Compute Cache Architecture: in-place compute SRAM inside the L3 slices, cache controller extensions at each level (L1, L2, L3), an interconnect between the cores and slices, and memory disambiguation for large vector operations. More details in the upcoming HPCA paper.
SRAM array read operation: the address drives the row decoder, which raises one wordline in the sub-array; the precharged bitline pairs (BL0/BLB0 ... BLn/BLBn) discharge according to the stored bits, and differential sense amplifiers resolve the values.
In-place compute SRAM, the changes: a second row decoder (Row Decoder-O) so that two wordlines can be activated simultaneously, and the differential sense amplifiers exchanged for single-ended sense amplifiers referenced against Vref, so BL and BLB can be sensed independently.
In-place compute SRAM, computing A AND B: the two row decoders activate the wordlines of rows A and B at the same time. Each bitline BL stays high only if both activated cells store 1, so sensing BL yields bitwise A AND B (and sensing BLB yields A NOR B). A behavioral model follows below.
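The sketch below is a behavioral model (not the circuit) of what the sense amplifiers observe when the two decoders fire rows A and B together: BL stays high only where both cells store 1 (AND), and BLB stays high only where both store 0 (NOR).

```c
#include <stdint.h>
#include <stdio.h>

#define ROW_BYTES 8 /* illustrative; real rows are a cache line wide */

/* Behavioral model of dual-wordline activation:
 *   BL[i]  = A[i] AND B[i]  (true bitline acts as a wired-AND)
 *   BLB[i] = A[i] NOR B[i]  (complement bitline: high iff both are 0)
 */
static void activate_two_rows(const uint8_t *a, const uint8_t *b,
                              uint8_t *bl, uint8_t *blb) {
    for (int i = 0; i < ROW_BYTES; i++) {
        bl[i]  = a[i] & b[i];
        blb[i] = (uint8_t)~(a[i] | b[i]);
    }
}

int main(void) {
    uint8_t a[ROW_BYTES] = {0x0F, 0xAA, 0xFF, 0x00, 0x55, 0xC3, 0x3C, 0x81};
    uint8_t b[ROW_BYTES] = {0xF0, 0xAA, 0x0F, 0x00, 0xFF, 0xA5, 0x5A, 0x7E};
    uint8_t bl[ROW_BYTES], blb[ROW_BYTES];

    activate_two_rows(a, b, bl, blb);
    for (int i = 0; i < ROW_BYTES; i++)
        printf("A=%02X B=%02X  AND=%02X  NOR=%02X\n",
               a[i], b[i], bl[i], blb[i]);
    return 0;
}
```

OR and XOR follow by combining the two outputs (for example, OR = NOT(NOR)), which is why a small set of logical primitives falls out of this one circuit change.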
Applications modeled using Compute Caches: text processing (StringMatch, WordCount), in-memory checkpointing, the FastBit bitmap database, and bit matrix multiplication. A sketch of one mapping follows below.
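As one example of the mapping, here is a hedged sketch of byte-level string matching built only on the bitwise primitives above: replicate the pattern byte across a row, XOR it with the data row (an in-cache operation), and scan for zero bytes. The helper name and row layout are illustrative assumptions, not the deck's implementation.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ROW_BYTES 64 /* one cache line per sub-array row */

/* Sketch: compare a 1-byte pattern against every byte of a data row
 * using only bitwise ops of the kind a compute cache provides.
 * XOR yields 0x00 exactly where data and pattern agree; scanning for
 * zero bytes (done here in software) marks the match positions. */
static int match_positions(const uint8_t *row, uint8_t pattern,
                           int *hits, int max_hits) {
    uint8_t pat_row[ROW_BYTES], diff[ROW_BYTES];
    memset(pat_row, pattern, sizeof pat_row); /* replicate the pattern */

    for (int i = 0; i < ROW_BYTES; i++)
        diff[i] = row[i] ^ pat_row[i];        /* in-cache XOR */

    int n = 0;
    for (int i = 0; i < ROW_BYTES && n < max_hits; i++)
        if (diff[i] == 0) hits[n++] = i;      /* zero byte = match */
    return n;
}

int main(void) {
    uint8_t row[ROW_BYTES];
    memset(row, 'a', sizeof row);
    row[3] = row[40] = 'x';

    int hits[8];
    int n = match_positions(row, 'x', hits, 8);
    for (int i = 0; i < n; i++)
        printf("match at byte %d\n", hits[i]); /* bytes 3 and 40 */
    return 0;
}
```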
Compute Cache results summary: 1.9X performance improvement and 2.4X energy savings, at a 4% overhead.
Compute Cache summary: empower caches to compute. Performance comes from large vector parallelism; energy savings from reduced on-chip data movement. Enablers: in-place compute SRAM, plus data placement and cache geometry for increased operand locality. Cost and results: 8% area overhead, 2.1X performance, 2.7X energy savings.
Compute Memory System Stack:
- Redesign applications (machine learning, data analytics, crypto, image processing, bioinformatics, graphs, OS primitives, FSA): express computation using in-situ operations
- Adapt PL/compiler (Java/C++, OpenCL, data-flow languages such as Google's TensorFlow [14] and RAPID [20]): express parallelism
- Design architecture (ISA: large SIMD, data-flow): data orchestration, coherence and consistency, managing parallelism
- Design compute memories: customize the hierarchy across volatile (SRAM cache, DRAM) and non-volatile (Flash, STT-RAM/MRAM, Re-RAM) technologies; choose where to compute, the in-situ technique (bit-line computing, parallel automaton, locality), and a rich operation set (logical, data migration, comparison, search, addition, multiplication, convolution, FSM)
In-Situ Compute Memory Systems
Thank You!
Reetuparna Das