
Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†

Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing. Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†. †UCSD, ‡UCSB, *ETHZ, *UNIBO. Micrel.deis.unibo.it/MultiTherman. Variability.org.


Presentation Transcript


  1. Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing. Abbas Rahimi†, A. Ghofrani‡, M. A. Montano‡, K-T Cheng‡, L. Benini*, R. K. Gupta†. †UCSD, ‡UCSB, *ETHZ, *UNIBO. Micrel.deis.unibo.it/MultiTherman. Variability.org.

  2. Energy-Efficient GPGPU. ✓ SIMD. × Conservative guardbands: loss of operational efficiency. Total delay: corner + 3σ stochastic delay (guardband) (Kakoee et al., TCAS-II'12). Thousands of deep and wide pipelines make GPGPUs power-hungry parts. Near-threshold (NT) operation and voltage overscaling (VOS) achieve energy efficiency at the cost of: • Performance loss • Increased timing sensitivity in the presence of variations

  3. Variability is about Cost and Scale. Eliminating the guardband leads to timing errors (Bowman et al., JSSC'09), and error recovery is costly for SIMD. Wide lanes: error rate × wider width. Deep pipes: recovery cycles increase linearly with pipeline length. Result: quadratically expensive recovery.

  4. Taxonomy of SIMD Variability-Tolerance (tree diagram, flattened in extraction). Guardband: eliminating (timing error) or adaptive (no timing error). Node labels, in extraction order: hierarchically focused guardbanding and uniform instruction assignment; error recovery; exact / approximate computing; exact computing; predict & prevent; independent recovery; memoization: recalling recent context of error-free execution (approximately / exactly); lane decoupling through private queues; detect-then-correct. Works cited on the slide: Rahimi et al., DATE'13; Rahimi et al., DAC'13; Pawlowski et al., ISSCC'12; Krimer et al., ISCA'12; Rahimi et al., TCAS'13; Rahimi et al., DATE'14.

  5. Contributions. Efficient spatiotemporal reuse of computation in GPGPUs through collaborative: • Micro-architectural design: an associative memristive memory (AMM) module is integrated with FPUs, representing partial functionality • Compiler profiling: fine-grained partitioning of values (searching the space of possible inputs) and pre-storing highly frequent sets of values in AMM modules. Ensure their resiliency under voltage overscaling for Evergreen GPGPUs.

  6. Collaborative compilation framework and memristive-based computing. 1) Profiling (a one-off activity): training datasets and the OpenCL kernel feed a profiler that identifies highly frequent computations. 2) Code generation: a customized clCreateBuffer inserts the AMM contents, which are programmed before launching the kernel. 3) Runtime: FPU and AMM work side by side (the diagram shows an "=?" comparison).

  7. AMM with FPU. Diagram labels: search operands; error; no recovery; return pre-stored result. AMM: software programmable; mimics partial functionality of the FPU; two pipelined stages. Components: ternary content addressable memory (TCAM) and a crossbar-based 1T-1R memristive memory block. TCAM: a self-referenced sensing scheme†, 2-bit encoding, 15% positive slack at 45nm. Memory block: avoids read disturbance. †Li et al., JSSC'14.

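  8. OpenCL Sobel AMM Hit Rates. Offline: the profiler runs on a training input (train) and extracts per-operation mappings: +: {a, b} → {q}; *: {a, b} → {q}; √: {a} → {q}; … The AMMs are programmed before launching the kernel. Runtime: each FPU is paired with its own AMM (FPU+ with AMM+, FPU* with AMM*, FPU√ with AMM√, …) and evaluated on test1, test2, test3, test4.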

  9. Efficiency under Voltage Overscaling. • At 1.0V, without any timing error, 36% average energy saving (7 kernels) • At 0.88V, on average 39% energy saving • Reduce timing errors from 38% to 24%. [Chart data labels as extracted, per-bar mapping not recoverable: 19%, 30%, 17%, 28%, 36%, 33%, 32%, 33%, 37%, 39%, 28%, 29%]

  10. Conclusion. Static compiler analysis and coordinated microarchitectural design enable efficient reuse of computations in GPGPUs. Emerging associative memristive modules are coupled with FPUs for fast spatial and temporal reuse. GPGPU kernels exhibit low entropy, yielding an average energy saving of 36% with 32-entry AMMs.
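
The flow on slides 6 to 8 (profile offline, program the AMM, look up operands at runtime, measure the hit rate) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' implementation: the names `FPU`, `profile`, `execute`, and `hit_rate` are hypothetical, the AMM is modeled as a Python dictionary with exact operand matching (the slides also mention approximate matching, which is not modeled here), and the traces are made-up data rather than anything from the Sobel kernel.

```python
import math
from collections import Counter

# Hypothetical software stand-ins for the FPU operations named on slide 8.
FPU = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sqrt": math.sqrt,
}

AMM_ENTRIES = 32  # slide 10 reports results for 32-entry AMMs


def profile(trace, entries=AMM_ENTRIES):
    """Offline step: keep the most frequent operand tuples per operation.

    trace is an iterable of (op, operands) pairs, e.g. ("add", (1.0, 2.0)).
    Returns {op: {operands: result}} with at most `entries` tuples per op.
    """
    counts = {}
    for op, operands in trace:
        counts.setdefault(op, Counter())[operands] += 1
    return {
        op: {operands: FPU[op](*operands)
             for operands, _ in counter.most_common(entries)}
        for op, counter in counts.items()
    }


def execute(op, operands, amm):
    """Runtime step: return (result, hit).

    On a hit the pre-stored result is returned; on a miss the modeled FPU
    computes the result as usual.
    """
    table = amm.get(op, {})
    if operands in table:
        return table[operands], True
    return FPU[op](*operands), False


def hit_rate(trace, amm):
    """Fraction of operations in trace served from the modeled AMM."""
    hits = total = 0
    for op, operands in trace:
        _, hit = execute(op, operands, amm)
        hits += hit
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Made-up traces: the training trace stands in for the offline input,
    # the test trace for a runtime input.
    train = [("add", (1.0, 2.0))] * 3 + [("mul", (2.0, 3.0))] * 2
    test = [("add", (1.0, 2.0)), ("add", (9.0, 1.0)), ("mul", (2.0, 3.0))]
    amm = profile(train)
    print(f"hit rate: {hit_rate(test, amm):.2f}")
```

A low-entropy operand distribution, as claimed on slide 10, is what would let a table this small cover a large share of the operations in such a model.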
