
In-Situ Compute Memory Systems

Explore the evolution of big data computation with in-memory processing, efficiency improvements, and parallelism gains for large vector operations. Discover solutions for energy wastage and inefficiencies in conventional systems.


Presentation Transcript


  1. In-Situ Compute Memory Systems. Reetuparna Das, Assistant Professor, EECS Department

  2. Massive amounts of data generated each day

  3. Near-Data Computing: move compute near storage

  4. Evolution of Processing in Memory (PIM)
     - 1997: Processing in Memory (PIM): IRAM, DIVA, Active Pages, etc.
     - 2012: Resurgence of PIM, driven by the emergence of big data (data movement dominates Joules/Op) and the availability of 3D memories; a logic layer near memory is enabled by 3D technology.
     - 2014: Automaton Processor: associative memory with custom interconnects.
     - 2015: Compute Memories (bit-line computing): in-situ computation inside memory arrays.

  5. Problem 1: Memories are big and inactive. Memory (the caches) consumes most of the aggregate die area.

  6. Problem 2: Significant energy is wasted moving data through the memory hierarchy. Moving data costs 1000-2000 pJ versus 1-50 pJ for an operation, a 20-40x difference.

  7. Problem 3: General-purpose processors are inefficient for data-parallel applications. Scalar execution is inefficient; small vectors (32 bytes) are more efficient; what about very large vectors, 1000x larger?

  8. Problem summary: Conventional systems process data inefficiently. Memory occupies most of the die area but sits idle (Problem 1), data movement wastes energy (Problem 2), and cores handle data-parallel work inefficiently (Problem 3).

  9. Key Idea: Memory = Storage + In-place compute

  10. Proposal: Repurpose memory logic for compute. This addresses area (memory now also computes), energy (up to 20X energy efficiency), and parallelism (up to 100X massive parallelism).

  11. Compute Caches for Efficient Very Large Vector Processing. PIs: Reetuparna Das, Satish Narayanasamy, David Blaauw. Students: Shaizeen Aga, Supreet Jeloka, Arun Subramanian.

  12. Proposal: Compute Caches. In-place compute SRAM sub-arrays inside the cache hierarchy (L1, L2, and shared L3 slices) evaluate A op B directly on operands A and B held in a cache bank, supported by a cache controller extension and memory disambiguation. Challenges: data orchestration, managing parallelism, coherence and consistency.

  13. Opportunity 1: Large vector computation. Operation width equals the row size, and a cache is built from many smaller sub-arrays, each of which can compute in parallel. Parallelism available in a 16 MB L3: 512 sub-arrays * 64 B = a 32 KB operand per compute step (128X). See the worked example below.
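     As a quick sanity check on those numbers, the short C program below reproduces the slide's parallelism estimate. The geometry (512 sub-arrays with 64 B rows in a 16 MB L3) is the slide's illustrative example, not a claim about any particular chip.

     /* Worked example of the parallelism estimate from this slide.
      * The geometry below is the slide's illustrative example only. */
     #include <stdio.h>

     int main(void) {
         const int subarrays = 512;  /* sub-arrays that can compute independently */
         const int row_bytes = 64;   /* bytes operated on per sub-array per step  */

         const int operand_bytes = subarrays * row_bytes;   /* 32768 B = 32 KB */
         printf("operand width per compute step: %d KB\n", operand_bytes / 1024);
         printf("cache lines processed at once : %d\n", subarrays);
         return 0;
     }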

  14. Opportunity 2: Save data movement energy. A significant portion of cache energy (60-80%) is wire energy, largely in the H-tree interconnect. Computing in place saves this wire energy and avoids the energy of moving data up to the higher cache levels (L2, L1).

  15. Compute Cache Architecture. In-place compute SRAM in the cache slices (L1, L2, and the L3 slices across cores) connected by the existing interconnect, a cache controller extension, and memory disambiguation for large vector operations. More details in the upcoming HPCA paper.

  16. SRAM array operation (conventional read). The row decoder uses the address to raise one wordline in the sub-array; the precharged bitline pairs (BL0/BLB0 ... BLn/BLBn) are discharged on one side by the accessed cells, and differential sense amplifiers (SA) resolve the stored bits.

  17. In-place compute SRAM: changes to the array. A second row decoder (Row Decoder-O) allows two wordlines to be activated at once, and single-ended sense amplifiers referenced to Vref are added so each bitline can be sensed independently of its complement.

  18. In-place compute SRAM: computing A AND B. Both row decoders raise the wordlines of operands A and B simultaneously; a precharged bitline stays high only if every activated cell in its column stores 1, so the single-ended sense amplifiers read A AND B directly off the bitlines.
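     To make the bit-line trick concrete, here is a minimal behavioral sketch in C of what simultaneous row activation produces; it models only the logical outcome, not the circuit. The 64-bit word width is chosen for convenience, and the NOR result on the complementary bitlines follows from the same discharge argument but is not shown on this slide.

     /* Behavioral model of bit-line computing over two activated rows.
      * Electrically: every precharged bitline is discharged by any accessed
      * cell storing 0, so a bitline stays high (reads 1) only where both
      * operands store 1, giving A AND B; the complementary bitlines stay
      * high only where both operands store 0, giving A NOR B. */
     #include <stdint.h>
     #include <stdio.h>

     typedef struct {
         uint64_t bl;   /* sensed on the bitlines:            A AND B */
         uint64_t blb;  /* sensed on the complementary lines: A NOR B */
     } bitline_result;

     static bitline_result activate_two_rows(uint64_t a, uint64_t b) {
         bitline_result r;
         r.bl  = a & b;       /* bitline survives precharge only if both cells hold 1 */
         r.blb = ~(a | b);    /* complement survives only if both cells hold 0        */
         return r;
     }

     int main(void) {
         bitline_result r = activate_two_rows(0x6, 0x5);   /* 0110 and 0101 */
         printf("A AND B = 0x%llx\n", (unsigned long long)r.bl);        /* 0x4 */
         printf("A NOR B (low 4 bits) = 0x%llx\n",
                (unsigned long long)(r.blb & 0xF));                     /* 0x8 */
         return 0;
     }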

  19. SRAM Prototype Test Chip

  20. Compute Cache ISA So Far
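     The ISA table itself is a figure on the slide and is not reproduced in this transcript. The sketch below is a hypothetical C-level view of how one cache-level vector operation of the kind the talk describes (bulk logical ops on cache-resident data) might look to software; the function name, signature, and loop fallback are assumptions for illustration, not the ISA listed on the slide.

     /* Hypothetical software view of one compute-cache operation.
      * On a compute-cache system this would lower to an instruction that
      * ANDs two cache-resident blocks inside the SRAM sub-arrays; the plain
      * loop here just models the result so the example runs anywhere. */
     #include <stdint.h>
     #include <stdio.h>
     #include <string.h>

     static void cc_and(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
         for (size_t i = 0; i < n; i++)    /* conceptually: one in-place step per sub-array */
             dst[i] = a[i] & b[i];
     }

     int main(void) {
         uint8_t a[64], b[64], out[64];    /* one 64 B cache line per operand */
         memset(a, 0xF0, sizeof a);
         memset(b, 0x3C, sizeof b);
         cc_and(out, a, b, sizeof out);
         printf("out[0] = 0x%02X\n", out[0]);   /* 0xF0 & 0x3C = 0x30 */
         return 0;
     }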

  21. Applications modeled using Compute Caches:
     - Text processing: StringMatch, Wordcount
     - In-memory checkpointing
     - FastBit bitmap database (see the sketch below)
     - Bit matrix multiplication
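     As one concrete mapping from the list above, bitmap-index databases in the FastBit style answer conjunctive queries with bulk bitwise ANDs over long bit vectors, which is exactly the cache-line-wide logical work compute caches perform in place. The toy query and bit patterns below are invented for illustration.

     /* Toy bitmap-index query: rows matching "predicate 1 AND predicate 2"
      * are found by ANDing the two per-predicate bitmaps. A compute cache
      * would evaluate this AND a full cache line at a time. */
     #include <stdint.h>
     #include <stdio.h>

     #define ROWS  512
     #define WORDS (ROWS / 64)          /* 64-bit words per bitmap */

     int main(void) {
         uint64_t pred1[WORDS] = {0}, pred2[WORDS] = {0}, hits[WORDS];

         pred1[0] = 0x00000000000000FFULL;   /* rows 0-7 satisfy predicate 1          */
         pred2[0] = 0x0000000000000F0FULL;   /* rows 0-3 and 8-11 satisfy predicate 2 */

         for (int i = 0; i < WORDS; i++)     /* conjunctive query = bulk AND */
             hits[i] = pred1[i] & pred2[i];

         printf("matching rows, word 0: 0x%016llx\n", (unsigned long long)hits[0]);
         return 0;
     }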

  22. Compute Cache results summary (figure): 1.9X, 2.4X, 4%

  23. Compute Cache summary. Empower caches to compute: in-place compute SRAM, plus data placement and cache geometry chosen for increased operand locality. Performance comes from large vector parallelism; energy savings come from reduced on-chip data movement. Results: 8% area overhead, 2.1X performance, 2.7X energy savings.

  24. Future

  25. Compute Memory System Stack
     - Redesign Applications: machine learning, data analytics, crypto, image processing, bioinformatics, graphs, FSA, OS primitives; express computation using in-situ operations.
     - Adapt PL/Compiler: Java/C++, OpenCL, data-flow languages (RAPID [20], Google's TensorFlow [14]); express parallelism.
     - Design Architecture: ISA, data orchestration, coherence & consistency, large SIMD, data-flow; manage parallelism.
     - Design Compute Memories: customize the hierarchy across volatile (SRAM, DRAM) and non-volatile (Re-RAM, STT-RAM/MRAM, Flash) technologies; decide where to compute; in-situ techniques (bit-line computing, parallel automaton); locality; rich operation sets (logical, data migration, comparison, search, addition, multiplication, convolution, FSM).

  26. In-Situ Compute Memory Systems Thank You! Reetuparna Das
