The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Motivation
• Multiprocessor architectures sprouting everywhere
  • large compute servers
  • small servers, desktops
  • chip multiprocessors
• High energy consumption a problem, more so in MPs
• Most power-aware techniques tailored to uniprocessors
• Multiprocessors present unique challenges
  • processor coordination, synchronization
The Thrifty Barrier – Li, Martínez, and Huang
Case: Barrier Synchronization
• Fast threads spin-wait for slower ones
• Spin-wait wasteful by definition
  • quick reaction
  • but only last iteration useful
[Figure labels: compute, spin-wait]
Proposal: Thrifty Barrier
• Reduce spin-wait energy waste in barriers
  • leverage existing processor sleep states (e.g. ACPI)
• Minimize impact on execution time
  • achieve timely wake-up
[Figure labels: conventional, thrifty]
Challenges
• Should the processor sleep?
  • transition times (sleep + wake-up) non-negligible
• What sleep state?
  • more energy savings → longer transition times
• When to wake up?
  • early w.r.t. barrier release → may hurt energy savings
  • late w.r.t. barrier release → may hurt performance
Must predict barrier stall time accurately
Findings
• Many barrier stall times large enough to leverage sleep states
• Stall times predictable
  • discriminate through PC indexing
  • predict indirectly using barrier interval times
• Timely wake-up: combination of two mechanisms
  • coherence message bounds wake-up latency
  • watchdog timer anticipates wake-up
Thrifty Barrier Mechanism
[Flow diagram: BARRIER ARRIVAL → stall time prediction → SLEEP? → (No) RESIDUAL SPIN, or sleep state S1 / S2 / S3 → wake-up signal → RESIDUAL SPIN → BARRIER DEPARTURE]
Sleep Mechanism
[Flow diagram: BARRIER ARRIVAL → stall time prediction → SLEEP? → (No) RESIDUAL SPIN, or sleep state S1 / S2 / S3 → wake-up signal → RESIDUAL SPIN → BARRIER DEPARTURE]
Predicting Stall Time
• Splash-2's FMM example: 3 important barriers, 4 iterations
  • randomly picked thread (always the same)
• PC indexing reduces variability
• Interval time (BIT) more stable metric than stall time (BST)
Stall Time vs. Interval Time
• Barriers separate computation phases
  • PC indexing reduces variability
• Barrier stall time (BST) varies considerably
  • even with PC indexing
  • barrier-, but also thread-dependent
  • computation shifts among threads across invocations
• Barrier interval time (BIT) varies much less
  • quite stable if PC indexing used
  • barrier-, but not thread-dependent
  • last-value prediction ok for most applications
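A minimal sketch of PC-indexed last-value prediction of BIT, as described on this slide. The class name, the dictionary-based table, and the choice to return None when a barrier has no history are illustrative assumptions, not details taken from the slides.

```python
class LastValueBITPredictor:
    """Last-value predictor for barrier interval time (BIT), indexed by barrier PC.

    Hypothetical structure: one entry per static barrier, shared by all threads.
    """

    def __init__(self):
        self._last_bit = {}  # barrier PC -> most recently observed BIT

    def predict(self, barrier_pc):
        # None means "no history yet"; a caller would then fall back to spinning.
        return self._last_bit.get(barrier_pc)

    def update(self, barrier_pc, observed_bit):
        # Last-value policy: the next prediction is simply the latest observation.
        self._last_bit[barrier_pc] = observed_bit


predictor = LastValueBITPredictor()
predictor.update(0x400, 1200)   # barrier at (hypothetical) PC 0x400 took 1200 cycles
predictor.update(0x500, 90)     # a different static barrier keeps its own entry
```

Indexing by PC keeps each static barrier's history separate, which is what lets a simple last-value policy work when different barriers have very different interval times.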
Predicting Stall Time Indirectly
• Can use BIT to predict BST indirectly
  • compute time measurable upon arrival at barrier
  • subtract from predicted BIT to derive predicted BST
• How to manage time info?
[Figure labels: Compute_t, BST_t, BIT]
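The subtraction on this slide can be sketched in a few lines. The function name and the clamp at zero (for a thread whose compute time already exceeds the predicted interval) are assumptions made for illustration.

```python
def predict_stall_time(predicted_bit, compute_time):
    """Derive predicted BST as predicted BIT minus this thread's measured compute time.

    Clamping at zero is an assumption: a thread that computed longer than the
    predicted interval is treated as having no stall time to exploit.
    """
    return max(0, predicted_bit - compute_time)


# A thread that computed for 700 cycles of a predicted 1200-cycle interval
# expects to stall for the remaining 500 cycles.
stall = predict_stall_time(1200, 700)
```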
Managing Time Info
• Threads depart from barrier instance b-1 toward instance b
• Each thread t has local record of release timestamp BRTS_{t,b-1}
• Assumptions:
  • no global clock
  • local wallclock active even if CPU sleeps
  • all CPUs same nominal clock frequency
[Figure labels: b-1, b, BRTS_{t,b-1}]
Managing Time Info
• Thread t arrives, knowing BRTS_{t,b-1}, Compute_{t,b}
  • make prediction pBIT_b
  • derive pBST_{t,b} = pBIT_b - Compute_{t,b}
  • use pBST_{t,b} to pick sleep state (if warranted)
  • best fit based on transition time
[Figure labels: b-1, b, Compute_{t,b}, pBST_{t,b}, pBIT_b, BRTS_{t,b-1}]
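One way to read "best fit based on transition time" is: choose the deepest sleep state whose sleep plus wake-up transition still fits inside the predicted stall time. The sketch below assumes that reading. The state names follow the S1/S2/S3 labels in the mechanism diagram, but the transition times are made-up numbers, and it assumes deeper states have longer transitions.

```python
from typing import NamedTuple, Optional, Sequence


class SleepState(NamedTuple):
    name: str
    transition_time: int  # sleep + wake-up latency in cycles (hypothetical values)


# Hypothetical table; deeper states save more energy but take longer to enter and exit.
STATES = (
    SleepState("S1", 100),
    SleepState("S2", 1_000),
    SleepState("S3", 10_000),
)


def pick_sleep_state(predicted_bst: int,
                     states: Sequence[SleepState] = STATES) -> Optional[SleepState]:
    """Return the deepest state whose transition fits in predicted_bst, or None to spin."""
    best = None
    for state in states:
        fits = state.transition_time <= predicted_bst
        if fits and (best is None or state.transition_time > best.transition_time):
            best = state
    return best
```

With these numbers, a predicted stall of 1,500 cycles selects S2, while a predicted stall of 50 cycles returns None, meaning the thread simply spins.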
Managing Time Info
• Last thread u arrives, knowing BRTS_{u,b-1}
  • derive actual BIT_b = time() - BRTS_{u,b-1}
  • update (shared) predictor with BIT_b
  • release barrier
[Figure labels: b-1, b, BIT_b, BRTS_{u,b-1}]
Managing Time Info
• Every thread t (possibly after waking up late)
  • read BIT_b from updated predictor
  • compute actual BRTS_{t,b} = BRTS_{t,b-1} + BIT_b
• Threads never use timestamps (BRTS) from other threads
  • no global clock is needed
[Figure labels: b-1, b, BIT_b, BRTS_{t,b-1}, BRTS_{t,b}]
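The bookkeeping on the "Managing Time Info" slides can be sketched with two threads whose local clocks are offset from each other. The class, method names, and all numbers are hypothetical. The point is only that each thread combines its own BRTS with the shared interval BIT_b and never reads another thread's timestamp.

```python
class LocalBarrierClock:
    """Per-thread record of the barrier release timestamp (BRTS), in local time."""

    def __init__(self, brts_prev):
        self.brts = brts_prev  # BRTS_{t,b-1}, read from this thread's own clock

    def elapsed_since_release(self, local_now):
        # On arrival this is Compute_{t,b}; for the last-arriving thread
        # it is also the actual BIT_b.
        return local_now - self.brts

    def advance(self, bit):
        # BRTS_{t,b} = BRTS_{t,b-1} + BIT_b, using only local state plus the shared interval.
        self.brts += bit
        return self.brts


# Both threads were released from b-1 at the same instant, but their
# unsynchronized local clocks read 1000 and 0 at that moment.
t = LocalBarrierClock(brts_prev=1000)
u = LocalBarrierClock(brts_prev=0)

compute_t = t.elapsed_since_release(1300)  # t arrives 300 units after release
bit_b = u.elapsed_since_release(500)       # u arrives last, 500 units after release
brts_t_b = t.advance(bit_b)                # t's local timestamp for the release of b
```

Because an interval is the same on every clock that runs at the same nominal frequency, thread t recovers its own release timestamp even if it woke up late and did not observe the release directly.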
Wake-up Mechanism
[Flow diagram: BARRIER ARRIVAL → stall time prediction → SLEEP? → (No) RESIDUAL SPIN, or sleep state S1 / S2 / S3 → wake-up signal → RESIDUAL SPIN → BARRIER DEPARTURE]
Wake-up Mechanism
• Communicate barrier completion to sleeping CPUs
  • signal sent to CPU pin
  • options: external vs. internal wake-up
• External (passive): initiated by processor that releases barrier
  • leverage coherence protocol: invalidation to spinlock
  • must supply spinlock address to cache controller
• Internal (active): triggered by watchdog timer
  • program with predicted BST before going to sleep
Early vs. Late Wake-up
• Early wake-up (underprediction)
  • energy waste: residual spin
• Late wake-up (overprediction)
  • possible impact on execution time
• External wake-up guarantees late wake-up (but bounded)
• Internal wake-up can lead to both (late not bounded)
• Our approach: hybrid wake-up
  • external provides upper bound
  • internal strives for timely wake-up using prediction
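The early/late trade-off and the hybrid scheme can be illustrated with a small timing model. This is a simplified sketch, not the paper's mechanism: it assumes the watchdog timer is set to fire one wake-up latency before the predicted release, that the external invalidation reaches the sleeping CPU exactly at barrier release, and that waking takes a fixed `wake_latency`. All names and numbers are hypothetical.

```python
def hybrid_wakeup(arrival, predicted_bst, release, wake_latency):
    """Return (ready_time, residual_spin, lateness) for one sleeping thread.

    The CPU starts waking at whichever comes first: the internal timer
    or the external signal sent at barrier release.
    """
    timer_fire = arrival + max(0, predicted_bst - wake_latency)  # internal (active)
    wake_start = min(timer_fire, release)                        # external (passive) bound
    ready = wake_start + wake_latency
    residual_spin = max(0, release - ready)  # awake too early: spins until release
    lateness = max(0, ready - release)       # awake too late: delays departure
    return ready, residual_spin, lateness


accurate = hybrid_wakeup(arrival=0, predicted_bst=1000, release=1000, wake_latency=100)
underpredicted = hybrid_wakeup(arrival=0, predicted_bst=600, release=1000, wake_latency=100)
overpredicted = hybrid_wakeup(arrival=0, predicted_bst=5000, release=1000, wake_latency=100)
```

In this model an underprediction costs residual spin (400 units in the second case), while an overprediction costs at most one wake-up latency (100 units in the third case), because the external signal caps how late the wake-up can start.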
Other Considerations (see paper)
• Sleep states that do not snoop for coherence requests
  • flush dirty data before sleeping
  • defer invalidations to clean data
• Overprediction threshold
  • case of frequent, swinging BITs of modest size
  • turn off prediction if overpredict beyond threshold
• Interaction with context switching and I/O
  • underprediction threshold
• Time sharing issues: multiprogramming, overthreading
Experimental Setup
• Simulated system: 64-node CC-NUMA
  • 6-way dynamic superscalar
  • L1 16KB 64B 2-way 2clk; L2 64KB 64B 8-way 12clk
  • 16B/4clk memory bus, 60ns SDRAM
  • hypercube, wormhole, 4clk pipelined routers
  • 16clk pin to pin
• Energy modeling: Wattch (CPU + L1 + L2)
  • sleep states along lines of Pentium family
Experimental Setup
• All Splash-2 applications except:
  • Raytrace: no barriers
  • LU: better version w/o barriers widely available
• Efficiency (64p) 40-82%, avg. 58%
[Figure label: Target Group ≥ 10%]
Energy Savings
[Figure]
Performance Impact
[Figure]
Related Work Highlights
• Quite a bit of work in uniprocessor domain
• Elnozahy et al.
  • server farms, clusters
  • thrifty barrier targets shared memory, parallel apps.
• Moshovos et al., Saldanha and Lipasti
  • energy-aware cache coherence
  • probably compatible with and complementary to thrifty barrier
Conclusions
• Energy-aware MP mechanisms can and should be pursued
• Case of energy-aware barrier synchronization
  • simple indirect prediction of barrier stall time
  • hybrid wake-up scheme to minimize impact on exec. time
• Encouraging results for target applications
  • 17% avg. energy savings
  • 2% avg. performance impact