Project Proposal:

Project Proposal: • Topic: Compiler-Directed Optimization for Low Power, by K2 and Anjan

Motivation • Phenomenal increase in processor speed • Shrinkage in size • Mobility highly desired • BUT battery technology is not improving at the same rate • Growth of embedded devices • Power is limiting the growth of microprocessors

Introduction • Reducing Energy Consumption is in focus. • Portable Devices • Servers, PCs • Embedded Systems • Clusters

Power in CMOS • P = total power • VDD = supply voltage • f = clock frequency • N = switching activity (gate transitions per clock cycle) • Ileak = leakage current • Istatic = static current • QSC = quantity of charge carried by the short-circuit current per transition
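The symbols above fit the usual three-term decomposition of CMOS power into switching, short-circuit, and leakage components. A sketch of that form is given here. Note that C (the switched node capacitance) is not in the symbol list and is an assumption, as is folding any separate static-current contribution (Istatic · VDD) into the last term.

```latex
P \;=\; \tfrac{1}{2}\, C\, V_{DD}^{2}\, f\, N \;+\; Q_{SC}\, V_{DD}\, f\, N \;+\; I_{leak}\, V_{DD}
```

The first term is the switching power, the second the short-circuit power, and the third the leakage power.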

Switching Power • Accounts for most (90%) of total power • The two major factors are supply voltage and clock frequency • Voltage scaling • Frequency scaling
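To see why voltage scaling pays off more than frequency scaling, the sketch below evaluates the textbook switching term 0.5 * C * VDD^2 * f * N. The function name, the load capacitance, and all numeric values are illustrative assumptions, not figures from this proposal.

```python
def switching_power(c_load, vdd, freq, activity):
    """Textbook dynamic switching power: 0.5 * C * VDD^2 * f * N (watts)."""
    return 0.5 * c_load * vdd ** 2 * freq * activity


# Hypothetical operating point: 1 nF switched capacitance, 1.2 V, 1 GHz.
base = switching_power(c_load=1e-9, vdd=1.2, freq=1e9, activity=0.1)
half_f = switching_power(c_load=1e-9, vdd=1.2, freq=0.5e9, activity=0.1)
half_v = switching_power(c_load=1e-9, vdd=0.6, freq=1e9, activity=0.1)

ratio_freq = half_f / base  # halving f halves the power (linear)
ratio_volt = half_v / base  # halving VDD quarters the power (quadratic)
print(ratio_freq, ratio_volt)
```

Halving the frequency alone saves power but not energy per task, since the task then runs twice as long. The quadratic voltage term is what makes combined voltage and frequency scaling attractive.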

Short Circuit Power • During switching, there is a short period when both the pull-up and pull-down transistors are ON => a path from VDD to ground => power dissipation

Leakage Power • Diode leakage • Source (and drain) together with the substrate form a diode • When this diode is reverse-biased, a small current can leak through it • Sub-threshold leakage • Even when the gate is not completely on, enough of a channel can form for some charge to move from source to drain

Software Estimation • SPICE simulation • Very slow • PowerMill from Synopsys • CAD tools • Part of many CAD tool chains, e.g. Synopsys • Architecture-level simulation • E.g. SimplePower, WATTCH, etc. (fast)

Dealing with it • System / OS • Compiler • Algorithms • Architecture • Circuit/Logic • Technology

Power Reduction Techniques • Hardware Techniques • Voltage scaling • Clock gating • Frequency Scaling • Memory => Depends on implementation of SRAM and DRAM • Libraries • System Power Management (disk etc.)

Power Reduction Techniques • Software Techniques 1. Algorithmic Level. 2. Compiler based Optimizations. • The second one is the focus of our current study.

Compilation for Low Power • Optimizing for performance coincides with optimizing for power most of the time. • But reducing power sometimes conflicts with performance. • General guidelines are required for low-power-oriented compilation.

Guidelines for Low Power Compilation: • The shortest instruction sequence usually means low power. But switching activity also depends on the bit patterns of successive instructions. • Memory operands consume more energy than register operands. • Replace expensive operations with less expensive ones. For example, i*2 with i<<1. • int consumes less energy than char or short.
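The strength-reduction guideline can be sketched as a tiny peephole pass. The tuple-based instruction format (op, dst, src, constant) and the opcode names are hypothetical, chosen only for illustration.

```python
def strength_reduce(instrs):
    """Rewrite multiplications by a power of two into left shifts.

    Each instruction is a hypothetical tuple (op, dst, src, constant).
    """
    out = []
    for op, dst, src, k in instrs:
        if op == "mul" and k > 0 and (k & (k - 1)) == 0:
            # k is a power of two: x * k == x << log2(k)
            out.append(("shl", dst, src, k.bit_length() - 1))
        else:
            out.append((op, dst, src, k))
    return out


program = [("mul", "r1", "r0", 2), ("mul", "r2", "r1", 6), ("add", "r3", "r2", 1)]
reduced = strength_reduce(program)
print(reduced)
```

Only the first multiplication is rewritten. Multiplying by 6 is left alone because 6 is not a power of two.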

Guidelines: • Avoid chains of pointers such as a->b->c. • Reduce address bus switching by keeping operands used by successive instructions in adjacent memory locations. • Voltage scaling. • Register assignment: use Gray code. • Dead code elimination and redundant computation elimination are useful.
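The Gray code suggestion can be illustrated by counting bit toggles when register numbers are accessed in sequence. This is a minimal sketch. It assumes, for illustration only, that successive accesses touch consecutively numbered registers.

```python
def gray(n):
    """Binary-reflected Gray code of n: adjacent values differ in one bit."""
    return n ^ (n >> 1)


def toggles(codes):
    """Total number of bit flips between successive codes."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


regs = list(range(8))  # hypothetical access order r0, r1, ..., r7
binary_toggles = toggles(regs)
gray_toggles = toggles([gray(r) for r in regs])
print(binary_toggles, gray_toggles)
```

With plain binary numbering the eight accesses flip 11 bits in total. With Gray-coded numbering each step flips exactly one bit, 7 in total.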

Proposals: 1. Multi Region Voltage Scaling Using Dynamic Programming. 2. Trace Scheduling for low power.

Proposal • Execution model defining a program and its properties • The optimization problem • The solution

Execution Model & Notation • Program P is a sequence of regions • Two operating frequencies • A is an execution description • R0 operates at f0 and so on.

Execution model … (contd.) • Local energy function • Global energy function • Note that this energy function is not the same as natural energy! • Switching overhead: ES

Optimization problem • Given P, generate the optimal A such that E(A,N) is minimized. • The brute force method is inefficient: exponential in N
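A brute-force search makes the exponential cost concrete. The energy model below is an assumption made for illustration: each region has a local energy per frequency, and a fixed overhead ES is charged whenever two consecutive regions run at different frequencies. The names and numbers are hypothetical.

```python
from itertools import product


def total_energy(assignment, local_energy, switch_cost):
    """Sum of local energies plus ES for every frequency change."""
    energy = sum(local_energy[i][f] for i, f in enumerate(assignment))
    switches = sum(1 for a, b in zip(assignment, assignment[1:]) if a != b)
    return energy + switches * switch_cost


def brute_force(local_energy, switch_cost):
    """Try all 2^N frequency assignments and keep the cheapest."""
    n = len(local_energy)
    return min(
        (total_energy(a, local_energy, switch_cost), a)
        for a in product((0, 1), repeat=n)
    )


# local_energy[i] = (energy of region i at f0, energy of region i at f1)
local_energy = [(5, 3), (1, 4), (6, 1), (3, 3)]
best_energy, best_assignment = brute_force(local_energy, switch_cost=1)
print(best_energy, best_assignment)
```

For four regions this enumerates 16 assignments. Every additional region doubles the search space.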

Definition

Notes • We claim this is the optimal solution! • However, its complexity is exponential in N • Common substructure => dynamic programming

Example

Dynamic Prog. to the rescue! • Note that every evaluation of Λ has two children. • The number of distinct calls to Λ is at most N * 2 * 2 • Therefore, using dynamic programming, we can evaluate Λ(N,0,0) in O(N) steps
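The linear-time idea can be sketched with a two-entry table that is updated once per region. This is not the Λ recurrence from the slides, whose exact arguments are not reproduced here. It is a simplified stand-in under an assumed energy model: a local energy per region and frequency, plus a fixed overhead ES for each frequency change.

```python
def min_energy(local_energy, switch_cost):
    """O(N) dynamic program over regions.

    best[f] is the least energy of the regions seen so far, given that
    the most recent region ran at frequency index f (0 or 1).
    """
    best = list(local_energy[0])
    for e0, e1 in local_energy[1:]:
        # Either stay at the same frequency or pay ES to switch.
        to_f0 = min(best[0], best[1] + switch_cost)
        to_f1 = min(best[1], best[0] + switch_cost)
        best = [e0 + to_f0, e1 + to_f1]
    return min(best)


# local_energy[i] = (energy of region i at f0, energy of region i at f1)
regions = [(5, 3), (1, 4), (6, 1), (3, 3)]
result = min_energy(regions, switch_cost=1)
print(result)
```

Each region does a constant amount of work, so the running time grows linearly with N rather than as 2^N.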

The Energy Function • Could be natural energy or energy-delay product • Could be execution time (this would be trivial) • Monotonically increasing function of i

Points • Simplification of a program • Can be extended to more than 2 frequencies • Why not impose a performance constraint? • Complexity becomes O(T * N) • Unacceptable • Optimal under assumptions!

Trace Instruction Scheduling • Reordering instructions to minimize Hamming distance can reduce switching activity on the instruction bus. • Global scheduling schemes can give better optimization. • Integrate trace scheduling with Lee and Lee’s algorithms.

Hamming Distance • Two adjacent instructions have a smaller Hamming distance => fewer instruction bus lines switch from 0 to 1, or vice versa
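A minimal sketch of the metric: the Hamming distance between two instruction words is the number of differing bits, which equals the number of bus lines that change when one word follows the other. The 32-bit example encodings are made up for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which words a and b differ."""
    return bin(a ^ b).count("1")


def bus_toggles(words):
    """Total bus-line transitions when the words are fetched in order."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))


original = [0x0000FFFF, 0xFFFF0000, 0x0000FFFE]
reordered = [0x0000FFFF, 0x0000FFFE, 0xFFFF0000]
toggles_before = bus_toggles(original)
toggles_after = bus_toggles(reordered)
print(toggles_before, toggles_after)
```

Placing the two similar words next to each other cuts the transitions from 63 to 32 in this toy sequence. Whether such a reordering is legal depends on the dependencies between the instructions.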

Machine Architecture • Our VLIW experimental testbed

Lee and Lee’s Algorithms: • Horizontal Scheduling • Permute the micro-instructions within a given VLIW instruction. Use a maximum weight bipartite graph matching algorithm. • Vertical Scheduling • Reorder the sequence of VLIW instructions in a basic block. Use a heuristic-based algorithm (the problem is NP-hard).

Limitations: • Two phase optimization for a multi objective problem. • Employed for local scheduling only.

Modifications. • Use Trace Scheduling Scheme. • Optimization for performance and energy is employed in a single phase.

Problem • Let X_(i) and X_(i+1) be two successive VLIW instructions. • X_(i) = (x_i1, x_i2, …) • The horizontal algorithm will output a new X_(i)’ = (y_i1, y_i2, …) where (y_i1, y_i2, …) is a permutation of (x_i1, x_i2, …).
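The horizontal problem can be sketched for a small VLIW width by trying every permutation of the slots. This exhaustive search is a stand-in for the bipartite matching step, not Lee and Lee's algorithm itself, and it assumes for illustration that every operation may be placed in any slot. The slot encodings are hypothetical.

```python
from itertools import permutations


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def horizontal_schedule(prev_slots, cur_slots):
    """Permute cur_slots to minimize slot-wise Hamming distance to prev_slots."""
    def cost(order):
        return sum(hamming(p, c) for p, c in zip(prev_slots, order))
    return min(permutations(cur_slots), key=cost)


prev = (0b1111, 0b0000, 0b1110)  # slots of the preceding instruction
cur = (0b0001, 0b1011, 0b1110)   # slots of the instruction to permute
best = horizontal_schedule(prev, cur)
print([bin(x) for x in best])
```

The search is factorial in the number of slots, so it is only workable for narrow instructions. A weighted bipartite matching finds the same minimum in polynomial time.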

Problem • Let X = X_(1), X_(2), … be an ordering of VLIW instructions. • The vertical algorithm will output a new ordering Y of the VLIW instructions, where Y = Y_(1), Y_(2), … and (Y_(1), Y_(2), …) is a permutation of (X_(1), X_(2), …).
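One possible heuristic for the vertical problem is a greedy list scheduler that always emits the ready instruction closest in Hamming distance to the previous one. This is a hypothetical illustration, not the heuristic used by Lee and Lee. The dependency map and encodings are invented.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def vertical_schedule(words, deps):
    """Greedy reordering that respects dependencies.

    words: list of instruction encodings (ints)
    deps:  dict mapping an index to the set of indices that must precede it
    """
    remaining = set(range(len(words)))
    order = []
    while remaining:
        done = set(order)
        ready = [i for i in sorted(remaining) if deps.get(i, set()) <= done]
        if not ready:
            raise ValueError("cyclic dependencies")
        if order:
            last = words[order[-1]]
            nxt = min(ready, key=lambda i: hamming(last, words[i]))
        else:
            nxt = ready[0]  # keep the original first instruction
        order.append(nxt)
        remaining.remove(nxt)
    return order


words = [0b0000, 0b1111, 0b0001, 0b1110]
deps = {3: {1}}  # instruction 3 must come after instruction 1
order = vertical_schedule(words, deps)
print(order)
```

On this input the original order costs 11 bit transitions and the greedy order costs 5. Being greedy, the heuristic carries no optimality guarantee.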

Optimization • Use a maximum weight bipartite graph matching algorithm where each node is an instruction and the graph is a DAG of control dependencies among the instructions. • Construction of the edges needs to take architectural constraints into account.

Optimization • The weight on the edge connecting two instructions x and y is: -h(x,y) - d(x,y) • h(x,y) = 100 * hamming distance(x,y) / 32 (for a 32-bit instruction) • d(x,y) = 100 * (stall if x precedes y) / (maximum stall possible)
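The edge weight can be written directly from the two normalized terms. In this sketch the function name, the parameter names, and the zero-stall guard are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def edge_weight(x, y, stall, max_stall, width=32):
    """Weight -h(x,y) - d(x,y), each term scaled to the range 0..100.

    stall:     stall cycles incurred if x precedes y
    max_stall: largest stall possible on the machine (assumed known)
    """
    h = 100 * hamming(x, y) / width
    d = 100 * stall / max_stall if max_stall else 0.0
    return -h - d


w = edge_weight(0x0000FFFF, 0x0000FF00, stall=1, max_stall=4)
print(w)
```

Both terms are percentages, so switching activity and stalls are weighed on the same scale. The negation means a maximum weight matching prefers pairs with few bit flips and short stalls.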

Total Cost • Given an instruction ordering X, Total cost = k + Sum_H(X_(i), X_(i+1)) / length(X) + Sum_D(X_(i), X_(i+1)) • length(X) is the total number of bits in X • H: Hamming distance • D: stall • k: a constant parameter

Algorithm 1. Generate a sequential program (Code Linearizer). 2. Analyze each basic block in the sequential program for independent operations (Trace Picker and Tail Duplicator).

Algorithm 3. Schedule independent operations within the same block in parallel if sufficient hardware resources are available. Use Lee and Lee's horizontal and vertical scheduling algorithms with our weighting scheme and objective function. Output the overall cost. 4. Move operations between blocks when possible. 5. Repeat steps 3 and 4 unless the cost is more than some constant.

Open Issues • Implementation of a power- and performance-directed compiler. • All of these optimizations may fail to beat simply shorter code. • Notion of power complexity. • Source-code-level specifications for low power computing. Often more useful than low power compilation.
