Introduction • Minimizing energy consumption is crucial for computing systems • Battery-operated systems • Data centers • A wide variety of techniques have been proposed • Static (Offline) optimizations • Compiler optimizations • Accelerators • Dynamic (Online) optimizations • OS scheduling • C-state management in Intel processors
Introduction • Static techniques • Can afford to take a global view of the problem • More complex algorithms can be used • Dynamic techniques • Fast – low overhead • Have more information about the current state of the system • Hybrid techniques
Our contributions • Static + dynamic optimizations for energy efficiency • Exploiting workload variation in DVFS capable systems • Assuring application-level correctness for programs • Fine-grained accelerator integration with processors
Energy efficient multiprocessor task scheduling under input-dependent variation
Outline • Introduction and motivation • Related work • Problem formulation • Proposed algorithm • Experimental results
Introduction and Motivation • Embedded systems are typically required to meet a fixed performance target • Example – frame rate for decoding of streaming video • A system that exceeds the performance target provides no significant additional benefit • Dynamic Voltage and Frequency Scaling (DVFS) is an effective technique for reducing the dynamic energy consumption of processors • (Almost) quadratic dependence of energy on voltage • (Almost) linear dependence of performance (frequency) on voltage
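As a rough illustration of why DVFS pays off (the linear voltage-frequency scaling and the constant below are simplifying assumptions, not the processor model used later in the experiments):

def normalized_dynamic_energy(cycles, freq_scale, c_eff=1.0):
    """Dynamic energy for a task under the usual CMOS approximation:
    energy per cycle ~ c_eff * V^2, with V assumed to scale linearly with f.
    freq_scale = f / f_max in (0, 1]; the result is normalized to full speed."""
    voltage_scale = freq_scale  # simplifying linear V-f assumption
    return cycles * c_eff * voltage_scale ** 2

# Running 100 cycles at 75% of the maximum frequency takes longer but costs
# only ~56% of the full-speed energy: 0.75^2 = 0.5625.
print(normalized_dynamic_energy(100, 0.75) / normalized_dynamic_energy(100, 1.0))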
Introduction • DVFS problem: given a task graph G (edges represent precedence constraints) and a latency constraint L • Determine the schedule and voltage assignment for each task so as to minimize energy consumption • Traditional techniques consider the worst-case computation time of every task • Ensures that the latency constraint is satisfied • Real-world applications exhibit significant variation in execution times
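A compact statement of the problem (the notation is ours, chosen to match the schedule-table formulation later in the talk):

\begin{aligned}
\min_{s,\,cp} \ & \sum_{v \in V} E\big(W(v),\, cp(v)\big) \\
\text{s.t. } & s_u + W(u)\,cp(u) \le s_v && \forall (u,v) \in E \\
& s_v + W(v)\,cp(v) \le L && \forall v \in V
\end{aligned}

where W(v) is the (worst-case) cycle count of task v, cp(v) its clock period (determined by the voltage), s_v its start time, and E the convex energy model.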
Example: Huffman Encoder in JPEG • Probability distribution of the execution time of the Huffman encoder • Shows significant variation in execution time • Variation in execution time can be exploited to further minimize energy consumption • (figure: probability vs. # cycles histogram)
Example – Energy Consumption in Worst Case and Typical Case • CMOS-based equations model the relation between energy, frequency and voltage • Workload – # cycles that a task takes to complete • Input dependent • 4-processor system – latency constraint of 300 time units • Worst-case scheduling – 1 time unit for the clock period of each task • Energy consumption 400*C • Typical case – 75 cycles per task • Energy consumption 168.75*C • Potentially 58% reduction in energy!
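A quick check of these numbers, assuming dynamic energy E ≈ C · (#cycles) · V² with V proportional to clock frequency (the exact mapping of tasks onto the 4 processors is not repeated here, so treat the per-task scaling as illustrative):

Worst case: 4 tasks × 100 cycles at full voltage (V = 1, clock period 1) gives E = 4 · 100 · C · 1² = 400C.
Typical case: 4 tasks × 75 cycles; the slide's figure is consistent with scaling voltage and frequency to 0.75 of the maximum while still meeting the 300-time-unit latency, giving E = 4 · 75 · C · 0.75² = 168.75C.
Ratio: 168.75 / 400 ≈ 0.42, i.e. roughly a 58% reduction.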
Outline • Introduction and motivation • Related work • Problem formulation • Proposed algorithm • Experimental results
Related Work • Single-processor systems • List-scheduling-based heuristics – Gruian 2003 • Minimizing expected energy consumption by exhaustive search – Leung 2005, Xu 2005, Xu 2007 • Convex optimization – Andrei DATE 2005 • Multiprocessor systems • Dynamic slack reclamation – Zhu 2001, Chen 2004 • Partitioning for expected energy minimization – Xian 2007 • Schedule table based • For conditional task graphs – Shin 2003, Wu 2003 • Restricted to conditional task graphs • Convex optimization – Andrei DATE 2005 • Exponential enumeration if applied to multiprocessor systems • Dynamic programming – Qiu DATE 2007 • Exponential enumeration
Exploiting Variation • Schedule table • Provides a list of scenarios and how to scale voltage/frequency when a particular scenario becomes active • How to build the schedule table? • Enumerate all possible scenarios and optimize separately • Enumerate all possible combinations of the number of cycles consumed by tasks • Number of scenarios explodes very quickly! • For a 10-node task graph with 4 possible execution times for each task, the number of scenarios is 4^10 (over a million) • Our contribution – a method to build the schedule table efficiently without exponential enumeration • Optimal for task chains
Processor and Application Model • Processor model • Homogeneous multiprocessor system • Voltage of each processor can be tuned independently in the range [V_lower, V_upper] • Use a quadratic approximation to model the relation between energy and frequency • Application model • Task graph G with nodes representing tasks • Edges represent precedence constraints • Mapping of tasks to processors is assumed to be given • If not, use a priority-based mapping heuristic
Idea – Task Chains • What would an (imaginary) oracle do? • For tasks 4 and 5, the voltage to use does not depend on the individual cycles consumed by tasks 1, 2 and 3 • It depends only on the total number of cycles consumed by the sub-chain • Task 4 will start at the same time for a given value of the sub-chain length • No need to enumerate # cycles for individual tasks • (figure: chain of tasks 1→2→3→4→5 with two example cycle-count scenarios for tasks 1–3, e.g. 70+100+120 and 90+60+140, both totaling 290)
Exploiting Variation – Schedule Table • W(v) • Number of cycles for v to execute • Different from execution time (which can vary with voltage) • Cycles elapsed – CE(v) • Number of cycles elapsed when a task v is ready to start • Schedule Table • One row for each task • Each entry in a row is a tuple of the form <ce, cp> • cp is the clock period of task v when the value of CE(v) is ce • Constructed statically (offline) • At run-time, a table look-up is performed to determine the clock period to use for a particular task • Goal: Construct a schedule table such that the average energy consumption of the system for the given task graph is minimized.
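A minimal sketch of the run-time lookup this implies (the data layout, names and policy below are our own illustration, not the authors' implementation):

import bisect

def lookup_clock_period(schedule_table, task, cycles_elapsed):
    """Return the clock period for `task` given the cycles elapsed so far.
    Rounds up to the next tabulated ce value, which is the conservative
    choice: more cycles elapsed means less slack and a faster clock."""
    entries = schedule_table[task]                 # [(ce, cp), ...] sorted by ce
    ce_values = [ce for ce, _ in entries]
    idx = min(bisect.bisect_left(ce_values, cycles_elapsed), len(entries) - 1)
    return entries[idx][1]

# Illustrative table for one task: three <ce, cp> entries built offline.
schedule_table = {"v4": [(0, 3.0), (100, 2.5), (175, 2.0)]}
print(lookup_clock_period(schedule_table, "v4", 150))  # -> 2.0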
Example – Schedule Table • Latency constraint of 650 time units • Legend: cp(v) – clock period for task v; W(v) – # cycles for task v; CE(v) – cycles elapsed when v is ready; each table entry is a tuple <ce, cp> • v1: CE(v1)=0, W(v1)=75, cp(v1)=2, Start(v1)=0, Finish(v1)=150 • v2: CE(v2)=75, W(v2)=100, cp(v2)=3, Start(v2)=150, Finish(v2)=450 • v3: CE(v3)=75, W(v3)=100, cp(v3)=3, Start(v3)=150, Finish(v3)=450 • v4: CE(v4)=175, W(v4)=75, cp(v4)=2, Start(v4)=450, Finish(v4)=600
Constructing the Schedule Table • Based on the J. Cong, W. Jiang and Z. Zhang ASP-DAC’07 formulation • Time budgeting for operations to minimize energy consumption in high-level synthesis • Latency constraint • Variable definitions • b_i is the latency of task i • s_i is the start time of task i • cp(i) is the clock period to use while running task i • Convex optimization with linear constraints • Does not consider variation in the latency of individual operations
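A sketch of the kind of convex program this corresponds to, reconstructed from the variable definitions above (E_i denotes the convex CMOS-based energy model mentioned earlier):

\begin{aligned}
\min \ & \sum_i E_i(b_i) \\
\text{s.t. } & s_i + b_i \le s_j && \forall (i, j) \in E \\
& s_i + b_i \le L && \forall i \\
& b_i = W(i)\,cp(i), \quad cp(i) \ \text{limited by the allowed voltage range } [V_{lower}, V_{upper}]
\end{aligned}

With E_i convex and decreasing in the time budget b_i, the whole problem is convex with linear constraints.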
Constructing the Schedule Table • Idea: instead of maintaining a single start and finish time for every task, maintain a list of start and finish times • One start time and clock period for each distinct value of CE(v) – s_{v,j}, cp_{v,j} • One finish time for each distinct value of CE(v) + W(v) – f_{v,j} • CE(v) helps decide the precedence constraints between the finish times of a task and the start times of its successors • Precedence constraints only between finish-time variables and start-time variables associated with permitted combinations of workload and CE(v) • Avoids enumeration of all possible workloads
Constructing the Schedule Table • Each task maintains a list of start and finish times • Each start time (and finish time) is associated with the number of cycles elapsed • Constraints are imposed only on valid combinations of start and finish times • Example (figure: task graph with tasks v1…v4, edge from v1 to v2 highlighted): f1 = finish time of v1 when v1 takes 75 cycles, f2 = finish time of v1 when v1 takes 100 cycles; s1 = start time of v2 when v1 takes 75 cycles, s2 = start time of v2 when v1 takes 100 cycles • Precedence constraints: s1 ≥ f1 and s2 ≥ f2 • No constraint is needed between f2 and s1 – v2 can start earlier if v1 takes only 75 cycles!
Constructing the Schedule Table • Determine the valid combinations of CE(v) for every pair of tasks connected by an edge • Precedence constraint: s_{v,j} ≥ f_{u,m}, where s_{v,j} is the start time of task v when CE(v) = ce_{v,j}, f_{u,m} is the finish time of task u when CE(u) = ce_{u,k} and W(u) = w_{u,l}, and ce_{v,j} ≥ ce_{u,k} + w_{u,l} (a valid combination) • Objective function: average energy consumption
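The objective can be written out (our reconstruction, using the workload probabilities obtained from profiling) as the expected energy over the tabulated scenarios:

\mathbb{E}[\text{energy}] \;=\; \sum_{v \in V} \; \sum_{j,\,l} \; P\big(CE(v) = ce_{v,j},\; W(v) = w_{v,l}\big) \cdot E\big(w_{v,l},\; cp_{v,j}\big)

where E(w, cp) is the convex per-task energy model, so the objective remains convex in the start-time and clock-period variables.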
Determining the Values of CE(v) • To keep the problem size from exploding, we keep a constant number of values (K) of CE(v) at each task • Profiling determines the probability distribution of the workload of a task v and of CE(v) • Heuristics to determine the values of CE(v) to use at each node • Divide the range of CE(v) into K equal parts • Divide the area under the probability vs. CE(v) graph into K equal regions • (figure: probability distribution over # cycles split into K = 5 equal-probability regions)
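A small sketch of the two heuristics, assuming the profiled distribution is available as a histogram over sorted CE values (function and variable names are ours):

import numpy as np

def equal_range_points(ce_values, k):
    """Heuristic 1: K points spread evenly across the observed CE range."""
    return np.linspace(min(ce_values), max(ce_values), k)

def equal_probability_points(ce_values, probabilities, k):
    """Heuristic 2: K points that split the probability mass into equal parts,
    i.e. roughly the (i + 0.5)/K quantiles of the profiled CE distribution.
    ce_values must be sorted and aligned with probabilities (summing to 1)."""
    cdf = np.cumsum(probabilities)
    return [ce_values[int(np.searchsorted(cdf, (i + 0.5) / k))] for i in range(k)]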
Complexity • No more than K values of CE(v) per node • Number of constraints • Up to K² precedence constraints per edge • Up to K² latency constraints per task • O(K²(m+n)) linear constraints • Number of variables • Up to K start-time, clock-period and finish-time variables per task • O(Kn) variables • Corresponds to the size of the table to be stored • Convex objective function • Solved in polynomial time
Results – Random Task Graphs • Random task graphs generated by TGFF • Compared to • A greedy, dynamic slack reclamation algorithm • An oracle that can correctly predict the workload of each task (before execution) • 15% worse (on average) than the oracle • 20% better than the dynamic slack reclamation technique
Real-world Applications • Experimentation methodology • SESC + Wattch for the energy of processor cores – 90nm • CACTI for caches • Energy values for the ALU, decoder, etc. obtained by scaling the 180nm values provided by Wattch • CACTI provides 90nm energy values for the SRAM-based array structures in the CPU • FIFOs for communication between processors • Similar to the Fast Simplex Links provided by Xilinx • Processors modeled after the Intel XScale • 7 voltage levels, with speeds varying from 100MHz to 800MHz
MJPEG Encoder – Variation • Only the Huffman encoder module shows variation • Unpredictable variation
MJPEG Encoder - Results • Only 4% energy savings (because variation is low) • 15% energy savings when workload can be predicted
Results – MPEG-4 Decoder • Main components • Parser (P), Copy Controller (CC), Inverse DCT (IDCT), Motion Compensation (MC) and Texture Update (TU) • IDCT shows no variation • Up to 6 MC and 6 IDCT tasks per macroblock • Task graph unrolled • Performance constraint of 20 frames/s
MPEG-4 Variation • CC and MC show substantial variation
MPEG-4 Decoder – Results • Comparison with the dynamic slack reclamation algorithm • Up to 20% savings in energy over dynamic slack reclamation • We also measure the effect of the number of values (entries) in the schedule table
Summary • Exploiting variation in execution time provides significant opportunity for energy minimization • Schedule table based approach • Construction of schedule table in polynomial time • Formulated as convex optimization problem with polynomial number of linear constraints • Optimal for certain special graphs – chains and trees • Average of 20% improvement over dynamic slack reclamation algorithm • Only 15% away from Oracle method • 20% energy saving for MPEG-4 decoder compared to dynamic slack reclamation algorithm
Motivation • Soft errors are an issue for the correct operation of CMOS circuits • The problem becomes more severe – ITRS 2009 • Smaller device sizes • Lower supply voltages • Effect of soft errors on circuits • Karnik 2004, Nguyen 2003 • Effect of soft errors on software and processors • Li et al. 2005, Wang et al. 2004
Motivation • Traditional notion of correctness • Every last bit of every variable in a program should be correct • Referred to as numerical correctness • Application-level correctness • Several applications can tolerate a degree of error • Image viewers, video decoding, etc. • However, critical instructions exist even in such applications • Example: the state machine in a video decoder
Motivation • Goal: Detect all “critical” instructions in the program • Protect “critical” instructions in the program against soft errors • Using duplication
Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results
Defining critical instructions • Elastic outputs – program outputs which can tolerate a certain amount of error • Media applications – image, video, etc. • Heuristics – support vector machines • Characterizing the quality of elastic outputs – fidelity metric • Examples: PSNR (peak signal-to-noise ratio) for JPEG, bit error rate
Defining critical instructions • Given application A: • I is the input to the application • A set of outputs O_c for which numerical correctness is required • A set of elastic outputs O • A fidelity metric F(I,O) for the elastic outputs • T – threshold for acceptable output quality • An execution of A is said to satisfy application-level correctness if: • All outputs in O_c are numerically correct • F(I,O) ≥ T for the elastic outputs • N_min – the minimum number of elements of O that need to be erroneous for F(I,O) to fall below T
Example: JPEG decoder • A PSNR of 35dB is assumed to be good quality • MSE = 20.56 • Using 8-bit pixel values (MAX = 255) • Maximum error per pixel = 255 • For a 1024x768-pixel image, N_min ≈ 251
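The numbers follow from the PSNR definition; here is the arithmetic as we reconstruct it (rounding in the original may differ slightly):

PSNR = 10 · log10(MAX² / MSE), so a 35dB target with MAX = 255 corresponds to MSE ≤ 255² / 10^3.5 ≈ 20.56.
If every erroneous pixel takes the maximum error of 255, each contributes 255² = 65025 to the total squared error, so N_min ≈ (20.56 × 1024 × 768) / 65025 ≈ 250 erroneous pixels before the PSNR of a 1024x768 image drops below 35dB.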
Defining critical instructions • An instruction X is said to be critical if • X affects one of the outputs in O_c (numerical correctness required), OR • X affects at least N_min elements of the elastic outputs O
Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results
Program representation • LLVM compiler infrastructure • LLVM intermediate representation • Weighted program dependence graph (PDG) – G
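As a concrete (purely illustrative) data structure, the weighted PDG can be kept as a simple adjacency map; the names and layout below are our own sketch, not the authors' implementation:

from collections import defaultdict

class WeightedPDG:
    """Program dependence graph over LLVM IR instructions.
    Each edge u -> v carries a weight: how many instances of v are
    affected by one instance of u (see the edge-weight slide below)."""
    def __init__(self):
        self.succs = defaultdict(list)       # node -> [(successor, weight), ...]

    def add_edge(self, src, dst, weight):
        self.succs[src].append((dst, weight))

# Illustrative edges matching the running example:
pdg = WeightedPDG()
pdg.add_edge("X", "out_i", 64)   # X defined outside a loop that runs e.g. 64 times
pdg.add_edge("out_i", "so", 1)   # out_i and so are in the same basic block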
Example LLVM IR – 3 address code
Example PDG - based on LLVM IR
Example • Node for computing X • Node (out_i) to compute C[Z]+X • Node (so) to store C[Z]+X into the output array • Edge representing the dependence between X and out_i • Edge representing the dependence between out_i and so
Assigning edge weights • Edge weight u → v – how many instances of node v are affected by 1 instance of u? • Example: • X outside the loop, out_i inside a loop that executes N times • Edge weight N • Nodes out_i and so are in the same basic block • Edge weight 1
Outline • Motivation • Definition of critical instructions • Program representation • Static analysis to detect critical instructions • Profiling and runtime monitoring • Results
Static analysis for detecting critical instructions • Find how many instances of output O are affected by node x • propagate(x → v) is the number of instances of v that are affected by an instance of x
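A rough sketch of how such a propagation could be computed over the weighted PDG (our illustration of the idea, not the paper's exact analysis; it assumes the relevant dependence edges form a DAG, so loop back-edges would need a fixed-point treatment instead):

def propagate_counts(pdg, src, n_min):
    """For every node v reachable from src, estimate how many instances of v
    are affected by one instance of src; counts are capped at n_min because
    beyond that threshold the source is critical anyway.
    `pdg` maps node -> list of (successor, edge_weight)."""
    # Topological order of the subgraph reachable from src (DFS post-order).
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v, _ in pdg.get(u, []):
            if v not in seen:
                dfs(v)
        order.append(u)
    dfs(src)
    order.reverse()

    affected = {src: 1}
    for u in order:
        for v, weight in pdg.get(u, []):
            affected[v] = min(n_min, affected.get(v, 0) + affected.get(u, 0) * weight)
    return affected

# Mirrors the running example: X feeds out_i across a 64-iteration loop.
pdg = {"X": [("out_i", 64)], "out_i": [("so", 1)], "so": []}
print(propagate_counts(pdg, "X", n_min=251))   # {'X': 1, 'out_i': 64, 'so': 64}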