Energy-efficient Transactional Memory. Osman Sabri Ünsal, Barcelona Supercomputing Center. Euro-TM Final Workshop, January 2015.
Terminology • Power • Dynamic: $P = E_{trans}\,\alpha f_{ck} = \alpha f_{ck} C V^2 / 2$; reduced by clock gating and DVFS • Static (leakage): difficult to model, increases with temperature (T) and area; reduced by power gating • Energy: power over time • Metrics: energy, energy-delay, energy-delay²
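The terminology above can be made concrete with a small numeric sketch. The function names and all input values below are illustrative assumptions, not figures from the talk; only the formula for dynamic power and the three metrics come from the slide.

```python
def dynamic_power(alpha, f_ck, c, v):
    """Dynamic power P = alpha * f_ck * C * V^2 / 2, in watts."""
    return alpha * f_ck * c * v ** 2 / 2


def metrics(power, delay):
    """Energy, energy-delay and energy-delay^2 for a run at constant power."""
    energy = power * delay  # energy is power integrated over time
    return {
        "energy": energy,
        "energy_delay": energy * delay,
        "energy_delay2": energy * delay ** 2,
    }


# Illustrative values only: activity factor 0.2, 1 GHz clock, 1 nF, 1.0 V.
p = dynamic_power(alpha=0.2, f_ck=1e9, c=1e-9, v=1.0)
m = metrics(p, delay=2.0)
```

Because voltage enters quadratically, lowering V reduces dynamic power much faster than lowering the activity factor or the clock frequency by the same fraction.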
History • In the beginning there was Maurice (and Tali and Iris): "Energy Reduction in Multiprocessor Systems Using Transactional Memory", Tali Moreshet, Iris Bahar, Maurice Herlihy, ISLPED 2005 • Others, many from Euro-TM
This talk • In the beginning, there was Maurice: "Energy Reduction in Multiprocessor Systems Using Transactional Memory", Tali Moreshet, Iris Bahar, Maurice Herlihy, ISLPED 2005 • Clock-gate on Abort: "Clock Gate on Abort: Towards Energy-Efficient Hardware Transactional Memory", Sutirtha Sanyal, Sourav Roy, Adrián Cristal, Osman Unsal, Mateo Valero, IPDPS 2009 • Below safe-Vdd with TM: "Combining Error Detection and Transactional Memory for Energy-efficient Computing below Safe Operation Margins", Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer, PDP 2014
ISLPED 2005 at a glance • Uses a fully-associative transactional cache • Runs a microbenchmark and compares against locks • Simple energy-efficient heuristic: serialize on conflict
ISLPED 2005 at a glance (cont.) • Power calculated with CACTI and the Micron SDRAM power calculator
This talk (next part) • Clock-gate on Abort: "Clock Gate on Abort: Towards Energy-Efficient Hardware Transactional Memory", Sutirtha Sanyal, Sourav Roy, Adrián Cristal, Osman Unsal, Mateo Valero, IPDPS 2009
Futile Aborts • Aborting more than "once" is a waste of energy. • n aborts -> n wasted executions. These are futile aborts. • Clock-gate the processor that receives the abort to halt it.
Clock Gating • Well-known energy-saving technique.
Scalable-TCC • Assumes directory instead of shared bus. • Multiple Directories -> Parallel Commit • Sharers Abort
Contributions • A novel protocol on top of Scalable-TCC (SC-TCC) that dynamically gates and ungates processors to save energy. • A new contention management model that supports frequent gating -> maximizes energy savings.
Proposal • Keep the aborted processor clock-gated if: • a) the aborter thread is still present in that directory, • AND • b) the aborter thread is executing the same transaction that earlier killed the abortee transaction.
Changes in Directory • Aborter Proc Id = the processor doing the commit. • Aborter Tx Id = PC of the transaction causing this abort. • Abort Counter = the number of aborts suffered so far (in this directory). • Renew Counter = number of renewals of clock-gating sessions. • Gate Timer = duration of clock gating.
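The directory fields listed above can be pictured as one small record per gated processor. The sketch below is only an illustration: the class name, the method, the default values and the choice to reset the renew counter on a fresh abort are assumptions, not details given in the talk.

```python
from dataclasses import dataclass


@dataclass
class GateEntry:
    """Hypothetical per-processor bookkeeping kept in a directory."""
    aborter_proc_id: int = -1  # processor doing the commit that caused the abort
    aborter_tx_id: int = -1    # PC of the transaction that caused the abort
    abort_counter: int = 0     # aborts suffered so far in this directory
    renew_counter: int = 0     # renewals of the current clock-gating session
    gate_timer: int = 0        # remaining clock-gating duration, in cycles

    def record_abort(self, committer_proc_id, committer_tx_pc, gate_cycles):
        """Update the entry when this processor is aborted by a committer."""
        self.aborter_proc_id = committer_proc_id
        self.aborter_tx_id = committer_tx_pc
        self.abort_counter += 1
        self.renew_counter = 0  # assumption: a new abort starts a new session
        self.gate_timer = gate_cycles


entry = GateEntry()
entry.record_abort(committer_proc_id=0, committer_tx_pc=0x4000, gate_cycles=64)
```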
Protocol Operations • Assume a NUMA configuration such as shown above. • P2 and P3 can commit in parallel. • Assume P0 is committing; the other sharers will be invalidated.
Protocol Operations (cont.) • P1 is gated. • The timer value is loaded.
Gating Period • A new exponential back-off is proposed to obtain energy savings along with a performance gain. • The idea is to gate frequently at low abort counts and to increase the gating period exponentially as the abort (and/or renewal) count goes up -> Exponential Double Staircase.
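The slide does not give the exact back-off formula, so the following is only one possible reading of an "exponential double staircase": the period stays flat for a few events and then doubles, with one staircase driven by the abort counter and a second one by the renew counter. All constants and the function name are hypothetical.

```python
BASE_PERIOD = 64       # cycles; hypothetical shortest gating period
ABORT_STEP = 2         # aborts per stair; hypothetical
RENEW_STEP = 2         # renewals per stair; hypothetical
MAX_PERIOD = 1 << 16   # hypothetical cap so the period cannot grow unbounded


def gating_period(abort_counter, renew_counter):
    """Gating period in cycles: doubles once per completed stair of either counter."""
    exponent = abort_counter // ABORT_STEP + renew_counter // RENEW_STEP
    return min(BASE_PERIOD << exponent, MAX_PERIOD)


# Period seen by a processor as its abort count grows from 0 to 5.
schedule = [gating_period(a, 0) for a in range(6)]
```

With these assumed constants, short gating periods are used while contention is low, and the period grows quickly once aborts or renewals accumulate.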
Reminder • Keep the aborted processor clock-gated if: • a) the aborter thread is still present in that directory, • AND • b) the aborter thread is executing the same transaction that earlier killed the abortee transaction.
Protocol Operations (cont.) • Expiration of the timer -> the processor may still be gated.
Renewal of Gating • In this case: • i) the processor remains turned off; • ii) the Renew Counter is incremented by 1; • iii) a new value for the gating period is loaded.
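The renewal step can be sketched as a check performed when the gate timer expires. This is a simplified model under stated assumptions: the directory state is represented as a plain dictionary mapping each processor still present in the directory to the PC of its current transaction, and `next_period` is a hypothetical callback that supplies the new gating period.

```python
from dataclasses import dataclass


@dataclass
class GateEntry:
    aborter_proc_id: int   # processor whose commit caused the abort
    aborter_tx_id: int     # PC of the aborting transaction
    renew_counter: int = 0
    gate_timer: int = 0


def on_timer_expiry(entry, present, next_period):
    """Return True if the processor stays gated, False if it is ungated.

    present: dict of processor id -> PC of the transaction it is running,
    for processors still present in this directory (modelling assumption).
    """
    still_present = entry.aborter_proc_id in present              # condition a)
    same_tx = still_present and (
        present[entry.aborter_proc_id] == entry.aborter_tx_id     # condition b)
    )
    if same_tx:
        entry.renew_counter += 1                                  # step ii)
        entry.gate_timer = next_period(entry.renew_counter)       # step iii)
        return True                                               # step i)
    entry.gate_timer = 0
    return False


entry = GateEntry(aborter_proc_id=0, aborter_tx_id=0x4000)
stays_gated = on_timer_expiry(entry, {0: 0x4000}, lambda r: 100 * 2 ** r)
```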
Simulation and Results • Simulator: M5. • Benchmarks: 3 applications from STAMP (Genome, Yada and Intruder). • Average speed-up: 4%
Simulation and Results (cont.) • Average energy savings: 19% • Average power savings: 13%
This talk (next part) • Below safe-Vdd with TM: "Combining Error Detection and Transactional Memory for Energy-efficient Computing below Safe Operation Margins", Gulay Yalcin, Anita Sobe, Alexey Voronin, Jons-Tobias Wamhoff, Derin Harmanci, Adrián Cristal, Osman Unsal, Pascal Felber, Christof Fetzer, PDP 2014
Dark Silicon Phenomenon • Number of transistors can be increased. • In order to stay within a chip’s power budget, some must remain “dark”. • One solution: Downscale the voltage. • Go below safe voltage limit
How about Reliability? When Vdd is reduced, the error rate increases exponentially [1]. Our goal: investigate how far the voltage can be reduced while error recovery still leads to reduced energy consumption. [1] Dan Ernst et al. "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation." In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, pages 7–18, 2003
Agenda / Overview • Motivation • Experiment: Scaling Vdd in a Real System • Basics of Reliability • Error Recovery with TM • Error Detection Schemes • Analysis • Conclusion
Reducing Vdd in a Real System • AMD FX-6100 • 6-core CPU • CPU-heavy execution • Every 10 seconds, reduce Vdd by 12.5 mV • Monitor: incorrect results, system crashes, Machine Check Architecture (MCA) reports • Errors occur in the instruction cache (37%), the execution unit (61%) and elsewhere (less than 2%). • The system encounters errors that cannot be corrected by MCA after only a 10% reduction in Vdd.
Basics of Reliability Transactional Memory can provide lightweight Coordinated Local Checkpointing [2]. [2] Gulay Yalcin et al. "FaulTM: Fault Tolerance Using Hardware Transactional Memory", DATE 2013
TM provides checkpointing/rollback [Figure: processors P1 to Pn, each with its own checkpoint (log area); the checkpoints are synchronized] • Data versioning provides a synchronization mechanism between checkpoints. • TM write-sets log the tentative memory updates.
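The role of the write-set as a checkpoint log can be illustrated with a minimal software model. This is a sketch assuming lazy versioning (updates are buffered and applied only at commit); the class and method names are hypothetical and do not describe a specific hardware TM design.

```python
class Transaction:
    """Buffers tentative updates so that abort restores the checkpoint for free."""

    def __init__(self, memory):
        self.memory = memory   # shared state, untouched until commit
        self.write_set = {}    # log of tentative memory updates

    def read(self, addr):
        # A transaction sees its own tentative writes first.
        return self.write_set.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Publish all tentative updates at once.
        self.memory.update(self.write_set)
        self.write_set.clear()

    def abort(self):
        # Rollback: drop the log; memory still holds the checkpointed values.
        self.write_set.clear()


memory = {"x": 1}
tx = Transaction(memory)
tx.write("x", 2)
tentative = tx.read("x")   # value visible inside the transaction
tx.abort()                 # an error was detected: discard the updates
after_abort = memory["x"]
```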
Error Detection Schemes - Replication • Execute instruction streams multiple times • Compare the results of the executions • Fewer comparisons with TM • Dual/Triple Modular Redundancy • + High error detection rate • - High energy overhead
Error Detection Schemes - Assertions/Invariants • Assertions: conditions referring to the current and previous state of the program. • Check the state • Added manually or automatically • TM facilitates inserting invariants • Ex:
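As a hypothetical illustration (not the example from the original slide), an invariant can be checked once, just before a transaction commits, and a violation triggers a rollback and retry. The function names, the account data and the injected bit flip are all assumptions made for this sketch.

```python
def run_transaction(memory, body, invariant, max_retries=3):
    """Run body on a private copy; commit only if the invariant holds.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        tentative = dict(memory)          # private copy acts as the checkpoint
        body(tentative, attempt)
        if invariant(memory, tentative):  # compares previous and current state
            memory.clear()
            memory.update(tentative)      # commit
            return attempt
        # Invariant violated: drop the tentative state and re-execute.
    raise RuntimeError("transaction failed after retries")


def transfer(state, attempt):
    state["a"] -= 10
    state["b"] += 10
    if attempt == 0:
        state["b"] ^= 4                   # simulated hardware bit flip


def total_preserved(before, after):
    return sum(before.values()) == sum(after.values())


accounts = {"a": 100, "b": 10}
retries = run_transaction(accounts, transfer, total_preserved)
```

The invariant here relates the previous and the current state, which matches the slide's definition of an assertion.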
Error Detection Schemes - Symptoms • Monitor program execution for symptoms of hardware faults. • Symptoms: • mispredictions in high-confidence branches, • high OS activity, • fatal traps (e.g. undefined instruction code) • Reliability at a low cost
Error Detection Schemes- Encoded Processing • Apply software coding (ECC-like) techniques • The redundancy is added by applying arithmetic codes to the values. • With TM, the validation of a code word can be deferred until a TX commits. • Ex:
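One well-known arithmetic code is AN-encoding, where every value x is stored as A·x and a code word is valid only if it is a multiple of A. The sketch below uses it to show how validation can be deferred until commit; the constant A, the function names and the dictionary-based write-set are assumptions for illustration, not details from the talk.

```python
A = 59  # hypothetical odd constant; any single bit flip changes a word by
        # plus or minus a power of two, which is never a multiple of an odd A > 1


def encode(x):
    return A * x


def is_valid(code_word):
    return code_word % A == 0


def commit(write_set):
    """Validate every code word once, at commit time.

    Returns the decoded values, or None if the transaction must abort.
    """
    if not all(is_valid(c) for c in write_set.values()):
        return None
    return {addr: c // A for addr, c in write_set.items()}


# Addition works directly on code words: A*3 + A*4 == A*7.
good = {"sum": encode(3) + encode(4)}
bad = {"sum": (encode(3) + encode(4)) ^ (1 << 3)}  # simulated bit flip

committed = commit(good)
aborted = commit(bad)
```

Deferring the check means intermediate results are not validated one by one; a corrupted word is caught before it becomes visible outside the transaction.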
Analysis • gem5 full-system simulator • 1 GHz in-order cores • 4 cores • x86 ISA • 64 KB L1 data and instruction caches • Unified 2 MB L2 cache • SPLASH-2 benchmark suite.
Energy Analysis [Figure: energy model combining error detection rate, Vdd, fault injection, TX size, recovery overhead and error-free overhead] • $E \approx C \times V_{dd}^2$
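The trade-off behind this analysis can be sketched with a toy model built on the relation E ≈ C × Vdd². Everything beyond that relation is an assumption made here for illustration: every error is detected, a faulty transaction is simply re-executed until it succeeds (a geometric retry model), and the error-free overhead is a fixed fraction. The function and parameter names are hypothetical.

```python
def relative_energy(vdd_ratio, error_free_overhead=0.0, error_prob=0.0):
    """Energy relative to running at nominal Vdd with no detection scheme.

    vdd_ratio: scaled Vdd divided by nominal Vdd.
    error_free_overhead: extra energy fraction spent on detection when no error occurs.
    error_prob: probability that one execution of a transaction hits an error.
    """
    if not 0.0 <= error_prob < 1.0:
        raise ValueError("error_prob must be in [0, 1)")
    expected_executions = 1.0 / (1.0 - error_prob)  # retries until success
    return vdd_ratio ** 2 * (1.0 + error_free_overhead) * expected_executions


ideal = relative_energy(0.9)                  # 10% lower Vdd, no overheads
with_costs = relative_energy(0.9, 0.1, 0.1)   # 10% overhead, 10% error rate
too_many_errors = relative_energy(0.9, 0.1, 0.2)
```

In this model a 10% voltage reduction is worthwhile only while detection overhead and re-execution together cost less than the 19% saved by the quadratic voltage term.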
Conclusion • The energy consumption of CPUs can be reduced if we have efficient hardware support for Transactional Memory and for Error Detection.
History (revisited) • "Energy Reduction in Multiprocessor Systems Using Transactional Memory", Moreshet et al., ISLPED 2005 • "Clock Gate on Abort: Towards Energy-Efficient Hardware Transactional Memory", Sanyal et al., IPDPS 2009 • "Energy-Performance Tradeoffs in Software Transactional Memory", Baldassin et al., SBAC-PAD 2012 • "Dynamic Serialization: Improving Energy Consumption in Eager-Eager Hardware Transactional Memory Systems", Gaona et al., PDP 2012 • "Energy Efficient GPU Transactional Memory via Space-Time Optimizations", Fung et al., MICRO 2013 • "Combining Error Detection and Transactional Memory for Energy-efficient Computing below Safe Operation Margins", Yalcin et al., PDP 2014 • "Performance and Energy Analysis of the Restricted Transactional Memory Implementation on Haswell", Goel et al., IPDPS 2014 • "On the Energy and Performance of Commodity Hardware Transactional Memory", Diegues et al., SIGMETRICS 2014
Thanks to • Gulay Yalcin • Sutirtha Sanyal • Anita Sobe • Alexey Voronin • Jons-Tobias Wamhoff • Derin Harmanci • Adrian Cristal • Christof Fetzer • Pascal Felber