Dynamic Frequency-Voltage Scaling for Multiple Clock Domain Processors, and Implications on Asymmetric Multiple Core Processors
Avshalom Elyada
Based primarily on the work of Greg Semeraro, David H. Albonesi et al., University of Rochester, NY, and also Diana Marculescu et al., Carnegie Mellon University, PA.
Outline
• Multiple Clock Domains
• Inter-domain communication and synchronization
• Dynamic Frequency-Voltage Scaling
• Scaling algorithms: Offline, Attack-Decay, Dynamic Profiling
• Results comparison
• DVS in Multiple Core Processors...?
End of the Road for Globally-Synchronous
• A global high-frequency clock does not scale well
• Low clock reachability within a single clock cycle
• Interconnect does not scale well
• Clock-tree complexity, skew, power inefficiency
Multiple Clock Domains, or Globally Asynchronous Locally Synchronous
• Divide the core into separate clock domains
• Synchronize communication between the synchronous "islands"
• Speed up the frequency of the separate, smaller domains
• Design inter-domain communication carefully, to minimize synchronization performance costs
• Retain the traditional synchronous knowledge base
MCD Processor (Alpha 21264-like model, Rochester DVS research)
Multi-Synchronous
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, all at the same frequency
Dynamic Frequency-Voltage Scaling
• If all domains always run at maximum frequency, power is usually wasted
• Only the critical domain needs to run at maximum frequency; the others can run slower
• This saves power
• Performance degradation should be minimal
MCD and GALS
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, all at the same frequency
• Globally Asynchronous Locally Synchronous: asynchronous domains, a different frequency per domain
Integer Dominated
Load-Store Dominated
D(F)VS Continued
• 20-40% energy-delay improvement
• Voltage scales down with frequency, saving additional power: potential for ~3x savings
• Careful: wrong scaling is catastrophic for performance
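The extra savings from scaling voltage along with frequency follow from the standard first-order CMOS dynamic-power relation; this is a textbook approximation, not a figure taken from the cited papers:

```latex
\[
P_{\mathrm{dyn}} \approx \alpha\, C\, V_{dd}^{2}\, f
\]
% If the achievable frequency is roughly proportional to V_dd, lowering both
% by a factor s (0 < s < 1) scales power by about s^3, while energy per
% operation (power x time, with time growing as 1/s) scales by about s^2:
\[
P' \approx \alpha\, C\,(s V_{dd})^{2}\,(s f) = s^{3} P_{\mathrm{dyn}},
\qquad
E'_{\mathrm{op}} \approx s^{2} E_{\mathrm{op}}
\]
```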
Scaling is Gradual and Occurs During Regular Operation
• F may be decreased before V is decreased
• V must be increased before F may be increased
[Figure: F-V working points, frequency (MHz) vs. voltage; example values 727.3 and 729.6 MHz, 1.000 V and 1.172 V]
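A minimal sketch of this ordering rule, assuming a toy Domain object with hypothetical set_frequency/set_voltage hooks (not an API from the cited work):

```python
from dataclasses import dataclass

@dataclass
class Domain:
    """Toy model of one clock domain; the set_* methods stand in for
    hypothetical platform hooks, not an interface from the cited work."""
    name: str
    freq_mhz: float
    volt: float

    def set_frequency(self, f: float) -> None:
        print(f"{self.name}: frequency -> {f} MHz")
        self.freq_mhz = f

    def set_voltage(self, v: float) -> None:
        print(f"{self.name}: voltage -> {v} V")
        self.volt = v

def change_working_point(d: Domain, new_freq: float, new_volt: float) -> None:
    """Respect the ordering rule: F may drop before V drops,
    but V must rise before F rises."""
    if new_freq <= d.freq_mhz:
        d.set_frequency(new_freq)   # slowing down: frequency first
        d.set_voltage(new_volt)
    else:
        d.set_voltage(new_volt)     # speeding up: voltage first
        d.set_frequency(new_freq)

# Example: slow the integer domain from (1000 MHz, 1.20 V) to (729.6 MHz, 1.00 V)
change_working_point(Domain("int", 1000.0, 1.20), 729.6, 1.00)
```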
MCD and GALS
• Globally Synchronous: a single clock
• Multi-Synchronous (MCD): each domain has a separate clock, all at the same frequency
• DVS (C-GALS): a different frequency per domain, centrally controlled
• GALS: asynchronous domains, a different frequency per domain, autonomous
Configuration Parameters (XScale-like)
• 320 frequency-voltage working points
• Frequency range: 250-1000 MHz
• Voltage range: 0.65-1.20 V
• Step between working points: ~1.72 mV / 2.34 MHz
• Change rate: 0.172 µs per step (~55 µs end to end)
• Time step: reconsider the working point every 50K cycles
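A small sketch that derives the working-point table from the numbers above, assuming linear spacing between the endpoints (an assumption for illustration, not a detail taken from the papers):

```python
# Build the table of F-V working points from the quoted parameters.
N_POINTS = 320
F_MIN, F_MAX = 250.0, 1000.0     # MHz
V_MIN, V_MAX = 0.65, 1.20        # V

def working_point(i: int) -> tuple[float, float]:
    """Return (freq_MHz, volt_V) of working point i in [0, N_POINTS)."""
    t = i / (N_POINTS - 1)
    return (F_MIN + t * (F_MAX - F_MIN), V_MIN + t * (V_MAX - V_MIN))

points = [working_point(i) for i in range(N_POINTS)]
f_step = points[1][0] - points[0][0]   # ~2.35 MHz, close to the quoted 2.34 MHz
v_step = points[1][1] - points[0][1]   # ~1.72 mV
print(f"step: {f_step:.2f} MHz / {v_step * 1000:.2f} mV")
print(f"end-to-end ramp: {N_POINTS * 0.172:.0f} us at 0.172 us/step")
```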
DVS per Domain: Considerations
• Scaling algorithm: determine the F-V point of each domain at any time
• Temporal granularity: how often to change the F-V point
• Synchronization:
  • Multi-Sync: all domains run at the same frequency; simple synchronization solutions exist (phase compensation)
  • GALS: different and changing frequencies; asynchronous synchronization solutions impede performance, or we can think of better solutions…
Power-Bounded DVS
• Given a power envelope
• Move energy between domains to attain maximum performance
Scaling Algorithm
• Input: a serial program
• Output: a temporal specification of which domains are slowed down, and by how much
• Temporal granularity:
  • The time step should be short enough to be dynamic
  • Too short is ineffective, due to the gradual scaling and the overhead of each change
Scaling Algorithms
• 'Offline': full preparation on a simulator; insert F-V configuration instructions for the actual run
• 'Online' (Attack-Decay): done entirely in hardware; rescale F-V according to internal queue levels
• Dynamic Profiling: a short profile run finds program phases; rescale F-V on phase transitions
Offline Algorithm
• Run the program on a simulator at maximum speed and trace primitive events
  • A primitive event is the work performed in a single domain on behalf of a single instruction
• Construct a Directed Acyclic Graph of the functional and data dependencies between primitive events
  • Arcs represent the time between events
Offline Algorithm (cont.)
• Slack appears on non-critical paths
• Stretch events that are not on the critical timing path
[Figure: events on a non-critical path stretched into the available slack]
Offline Algorithm (cont.)
• We now have the desired scale-down of individual primitive events
• We need to scale down whole domains per time step
• Construct event histograms per domain per time step: H(domain, time-step)
• Assign a tolerable performance degradation p%
• Determine the actual scale-down per domain according to (H, p)
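A simplified sketch of the last step, under the assumption that each histogram bins events by the largest scale-down factor they can tolerate; the actual selection heuristic in the Rochester work is more involved than this:

```python
from collections import Counter

def choose_scale_down(hist: Counter, p: float) -> float:
    """hist maps 'maximum tolerable scale-down factor' of an event
    (1.0 = cannot be slowed at all) to the number of such events.
    Returns the largest factor that hurts at most a fraction p of events."""
    total = sum(hist.values())
    if total == 0:
        return 1.0                 # idle interval: no event information, leave unscaled
    best = 1.0
    for factor in sorted(hist):    # ascending scale-down factors
        hurt = sum(n for f, n in hist.items() if f < factor)
        if hurt / total <= p:
            best = factor
    return best

# Example: one 50K-cycle interval of the load/store domain, where most events
# have plenty of slack and a few have none.
H = Counter({1.0: 5, 1.1: 20, 1.3: 60, 1.5: 15})
print(choose_scale_down(H, p=0.05))   # -> 1.1 (slowing more would hurt >5% of events)
```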
Online Algorithm (Attack-Decay)
• Each time step, sample the input queue levels
• Attack: if the queue level rose by ~2%, increase frequency by 6%
• Decay: if the level is unchanged, decrease frequency by ~0.2%
• Simple, hardware only; results reach ~70% of the offline algorithm
• Watch out for perturbations, local minima, over-activism and other feedback-related pitfalls
[Figure: queue fullness and frequency over time]
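A minimal sketch of one attack-decay control step using the thresholds quoted above; the exact deadband, clamping and hold behaviour of the hardware controller are assumptions here:

```python
F_MIN, F_MAX = 250.0, 1000.0   # MHz, from the XScale-like configuration

def attack_decay_step(freq: float, queue_level: float, prev_queue_level: float,
                      attack_thresh: float = 0.02, attack_gain: float = 0.06,
                      decay_rate: float = 0.002) -> float:
    """One control step for a single domain; queue levels are fractions of capacity."""
    delta = queue_level - prev_queue_level
    if delta > attack_thresh:
        freq *= (1.0 + attack_gain)     # attack: queue filling up, speed up
    elif abs(delta) <= attack_thresh:
        freq *= (1.0 - decay_rate)      # decay: queue steady, drift slower
    # else: queue draining noticeably; hold frequency (assumed behaviour)
    return min(max(freq, F_MIN), F_MAX)

# Example: queue occupancy jumps from 40% to 45% of capacity
print(attack_decay_step(500.0, 0.45, 0.40))   # -> ~530 MHz (attack: +6%)
```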
Dynamic Profiling
• Execution shows repeating program phases
  • A phase is often delimited by a subroutine call or a loop
• Dynamic profiling:
  • Identify phases with a short profiling run
  • Insert phase marks and F-V configurations into the program
  • When the program reaches a mark, reconfigure F-V
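A sketch of the phase-mark mechanism; the mark addresses, domain names and settings in the table are made-up examples standing in for values that the profiling run would produce:

```python
# Each phase mark carries a per-domain target frequency chosen during profiling.
PhaseConfig = dict[str, float]                      # domain -> target frequency (MHz)

PHASE_MARKS: dict[int, PhaseConfig] = {
    0x4010A0: {"int": 1000.0, "fp": 250.0, "ls": 730.0},   # integer-dominated phase
    0x4023F8: {"int": 620.0,  "fp": 250.0, "ls": 1000.0},  # load/store-dominated phase
}

def on_phase_mark(pc: int, apply_fv) -> None:
    """Called when execution reaches an instrumented phase mark.
    apply_fv(domain, freq) is a hypothetical hook that walks the domain to the
    new working point (raising voltage before frequency when speeding up)."""
    config = PHASE_MARKS.get(pc)
    if config is None:
        return                                      # not a marked phase boundary
    for domain, freq in config.items():
        apply_fv(domain, freq)

# Example: execution reaches the integer-dominated phase mark
on_phase_mark(0x4010A0, lambda d, f: print(f"{d} -> {f} MHz"))
```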
Results Comparison
Improved Dynamic Profiling
• Each program carries its phase information as initial setup data
  • Assuming the phase information is not processor-specific; alternatively, use processor-specific compilation
• Or, the processor itself performs the profile run: hardware-based dynamic profiling, eliminating the need for a simulation pre-run
DVS in ACCMP
• Conceptual difference:
  • MCD processor: sub-units run at different frequencies
  • MCP: threads run at different frequencies
• ACCMP: cores of different sizes
• ACCMP with DVS: cores also change frequency dynamically
DVS: a Degree of Freedom
• ACCMP: allocate each thread to a processor of static strength (S, M, L performance)
• ACCMP with DVS: scale the processor to the thread's performance needs; dynamically accommodate ("stretch-fit")
[Figure: S/M/L cores with overlapping performance ranges under DVS, e.g. 32-38, 36-44, 40-50]
Dynamic Thread Allocation
• Three sizes of DVS processors
• A thread "wants" performance between the M and L processors
• Allocate it to M only: performance is hurt, but still better than static ACCMP
• Allocate it to L only: power is wasted
• Or migrate it between both, according to its performance needs
• What is best? (See the sketch below.)
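A small sketch of the trade-off, with illustrative performance and power numbers (not measurements from the cited work), assuming negligible migration cost:

```python
def migration_fraction(target: float, perf_m: float, perf_l: float) -> float:
    """Fraction of time the thread should spend on the L core so that the
    time-weighted performance equals the target."""
    assert perf_m <= target <= perf_l
    return (target - perf_m) / (perf_l - perf_m)

def avg_power(frac_l: float, power_m: float, power_l: float) -> float:
    """Time-weighted average power of the migration schedule."""
    return frac_l * power_l + (1.0 - frac_l) * power_m

# Illustrative numbers: M delivers 36, L delivers 44 "performance units"
f = migration_fraction(target=40.0, perf_m=36.0, perf_l=44.0)
print(f)                                          # -> 0.5: half the time on each core
print(avg_power(f, power_m=1.0, power_l=2.0))     # -> 1.5 (vs. 2.0 on L alone)
```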
Migration
• k migrations between the M and L processors
• Phases φM, φL spent on each of the processors
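One way to write the effective performance of such a schedule, assuming φM + φL = 1 are the time fractions on each core and each of the k migrations costs t_mig of lost time; this model is an illustration, not a formulation taken from the source:

```latex
\[
\mathrm{Perf}_{\mathrm{eff}} \;\approx\;
\left(\varphi_M\,\mathrm{Perf}_M + \varphi_L\,\mathrm{Perf}_L\right)
\cdot \frac{T - k\,t_{\mathrm{mig}}}{T}
\]
% T is the length of the interval considered; the second factor charges the
% overhead of the k migrations against the useful execution time.
```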
The End
DVS in Multiple Core Processors
• Asymmetric cores:
  • Cores of asymmetric size have been suggested to better utilize die area when there are too few threads
  • But research shows symmetric cores perform better when there are enough threads
• With DVS, a core's performance varies dynamically with its frequency
  • Viewed in a performance/energy metric, this is a more flexible kind of asymmetry…
  • It also simplifies the software decision of which thread to assign to which asymmetric core
Inter-Domain Communication
• To minimize the synchronization penalty:
  • Divide the area into domains where a dual-port queue structure inherently exists (dual-port FIFO synchronization)
  • Otherwise, divide where inter-domain communication is minimal
[Figure: producer domain and consumer domain connected by a dual-port FIFO synchronizer; signals wclk/rclk, wen/ren, wdata/rdata, full, empty]
Dual-Port FIFO
• The producer and consumer domains can write and read independently as long as the FIFO is neither full nor empty
• Full and Empty are the only signals that need synchronizing
• Therefore, the synchronization penalty is incurred only when the FIFO is full or empty
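A behavioural sketch of the dual-port FIFO protocol (not RTL); gray-coded pointers and the actual full/empty flag synchronizers are deliberately omitted, since the point here is only that the two sides stall solely on those flags:

```python
from collections import deque

class DualPortFifo:
    """Toy model of the producer/consumer FIFO between two clock domains."""
    def __init__(self, depth: int):
        self.depth = depth
        self.buf: deque = deque()

    @property
    def full(self) -> bool:
        return len(self.buf) >= self.depth

    @property
    def empty(self) -> bool:
        return len(self.buf) == 0

    def write(self, data) -> bool:
        """Producer side (wclk domain): returns False if it must stall."""
        if self.full:
            return False          # only here does the producer pay a sync penalty
        self.buf.append(data)
        return True

    def read(self):
        """Consumer side (rclk domain): returns None if it must stall."""
        if self.empty:
            return None           # only here does the consumer pay a sync penalty
        return self.buf.popleft()

fifo = DualPortFifo(depth=4)
fifo.write("load @0x100")
print(fifo.read())                # -> "load @0x100"
```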
Syncing Periodic Domains
• Synchronization solutions that exploit no knowledge of the clock relations are sub-optimal
  • Examples: the two-flop synchronizer, and even the dual-port FIFO
• Under DVS, the clock relations are periodic, dynamic, and known
• A predictive synchronizer can predict when a conflict will occur between different periodic clocks
  • But conflict prediction sometimes adapts slowly to frequency changes
• DVS makes it possible to exploit the fact that the domain frequencies are known
• Proposal: a multi-frequency synchronizer that detects conflicts by knowing at which frequencies its producer and consumer run
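A sketch of how conflicts could be predicted when both frequencies are known; the guard-window model, parameters and phase handling are illustrative assumptions, not the proposed synchronizer's actual design:

```python
def conflict_cycles(f_prod_mhz: float, f_cons_mhz: float,
                    guard_ns: float = 0.2, horizon_cycles: int = 100,
                    phase_ns: float = 0.0) -> list[int]:
    """Return consumer-cycle indices (within the horizon) whose sampling edge
    lands within +/- guard_ns of a producer edge, i.e. where data from the
    producer domain should not be sampled directly."""
    t_prod = 1e3 / f_prod_mhz      # producer period in ns
    t_cons = 1e3 / f_cons_mhz      # consumer period in ns
    risky = []
    for n in range(horizon_cycles):
        t_edge = phase_ns + n * t_cons
        # distance from the nearest producer edge
        dist = min(t_edge % t_prod, t_prod - (t_edge % t_prod))
        if dist < guard_ns:
            risky.append(n)
    return risky

# Example: producer at 1000 MHz, consumer at 729.6 MHz; print the first few risky cycles
print(conflict_cycles(1000.0, 729.6)[:5])
```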
Gradual Scaling
• The device keeps working throughout the change
• Necessary for two reasons:
  • The online algorithm is based on a steadily changing feedback control
  • Synchronizers can't cope with a step change (?)
• Using dynamic profiling plus adequate synchronizers, instant scaling is possible