Hardware/Software Codesign of Embedded Systems. Power/Voltage Management. Voicu Groza School of Information Technology and Engineering Groza@SITE.uOttawa.ca. Embedded Systems. Power/Energy Aware Embedded Systems Dynamic Voltage Scheduling Dynamic Power Management.
Hardware/Software Codesign of Embedded Systems Power/Voltage Management Voicu Groza School of Information Technology and Engineering Groza@SITE.uOttawa.ca
Embedded Systems • Power/Energy Aware Embedded Systems • Dynamic Voltage Scheduling • Dynamic Power Management [Figure: Surpassed hot (kitchen) plate …? Why not use it? Source: http://www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html]
Why worry about energy and power? Processing units • Need for efficiency (power + energy): "Power is considered as the most important constraint in embedded systems" [in: L. Eggermont (ed.): Embedded Systems Roadmap 2002, STW] • Current smart phones can hardly be operated for more than an hour if data is being transmitted. [from a report of the Financial Times, Germany, on an analysis by Credit Suisse First Boston; http://www.ftd.de/tm/tk/9580232.html?nv=se]
The energy/flexibility conflict: Intrinsic Power Efficiency [Figure: operations per watt [MOPS/mW] (0.01 to 10) versus technology (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ). Curves, from most to least efficient: hardwired muxed ASIC, DSP-ASIPs, Reconfigurable Computing, Processors (µPs); annotations: "Ambient Intelligence", "poor design techniques".] Necessary to optimize HW/SW; otherwise the price for software flexibility cannot be paid! [H. de Man, Keynote, DATE'02; T. Claasen, ISSCC99]
Power and energy are related to each other: $E = \int P \, dt$ [Figure: power P over time t; the area under the curve is the energy E.] In many cases, faster execution also means less energy, but the opposite may be true if power has to be increased to allow faster execution.
Low Power vs. Low Energy Consumption • Minimizing the power consumption is important for • the design of the power supply • the design of voltage regulators • the dimensioning of interconnect • short term cooling • Minimizing the energy consumption is important due to • restricted availability of energy (mobile systems) • limited battery capacities (only slowly improving) • very high costs of energy (solar panels, in space) • cooling • high costs • limited space • dependability • long lifetimes, low temperatures
Application Specific Circuits (ASICs) or Full Custom Circuits • Custom-designed circuits necessary • if ultimate speed or • energy efficiency is the goal and • large numbers can be sold. • Approach suffers from • long design times, • lack of flexibility (changing standards) and • high costs (e.g. Mill. $ mask costs).
Mask cost for specialized HW becomes very expensive. Trend towards implementation in software. [http://www.molecularimprints.com/Technology/tech_articles/MII_COO_NIST_2001.PDF9]
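Fundamentals of dynamic voltage scaling (DVS)
Power consumption of CMOS circuits (ignoring leakage): $P = \alpha \, C_L \, V_{dd}^2 \, f$, with $\alpha$: switching activity, $C_L$: load capacitance, $V_{dd}$: supply voltage, $f$: clock frequency.
Delay for CMOS circuits: $\tau = k \, C_L \, \frac{V_{dd}}{(V_{dd} - V_t)^2}$, with $V_t$: threshold voltage ($V_t$ substantially smaller than $V_{dd}$).
Decreasing $V_{dd}$ reduces $P$ quadratically, while the run-time of algorithms is only linearly increased (ignoring the effects of the memory system).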
Potential for Energy Optimization Saving Energy under given Time Constraints: • Reduce the supply voltage Vdd • Reduce switching activity α • Reduce the load capacitance CL • Reduce the number of cycles #Cycles
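The four levers listed above can be compared with a minimal sketch. It assumes the usual switching-energy model, in which the energy of a task is alpha * C_L * Vdd^2 per cycle times the number of cycles; the function name and all numeric values are illustrative assumptions, not figures from the slides.

```python
def task_energy(alpha, c_load, vdd, cycles):
    """Switching energy in joules under the assumed model:
    alpha * C_L * Vdd^2 per cycle, multiplied by the cycle count."""
    return alpha * c_load * vdd ** 2 * cycles

# Illustrative (assumed) operating point: alpha = 0.5, C_L = 1 nF, Vdd = 5 V.
base = task_energy(alpha=0.5, c_load=1e-9, vdd=5.0, cycles=1_000_000)
half_vdd = task_energy(alpha=0.5, c_load=1e-9, vdd=2.5, cycles=1_000_000)
half_cycles = task_energy(alpha=0.5, c_load=1e-9, vdd=5.0, cycles=500_000)

print("halving Vdd scales energy by", half_vdd / base)        # quadratic lever
print("halving #cycles scales energy by", half_cycles / base)  # linear lever
```

Only the supply voltage enters quadratically, which is why voltage scaling is the most attractive lever when the time constraint leaves slack.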
Processors At the chip level, embedded chips include micro-controllers and microprocessors. Micro-controllers are the true workhorses of the embedded family. They are the original ’embedded chips’ and include those first employed as controllers in elevators and thermostats [Ryan, 1995].
Voltage Scaling and Power Management: Dynamic Voltage Scaling [Figure: energy per cycle [nJ] as a function of Vdd.]
Power density continues to get worse [Figure: power density trend; Prescott: 90 W/cm², 90 nm; reference level: nuclear reactor.] [c't 4/2004]
Mobile PC system power (600/500 MHz µP) [Figure: two pie charts, "Mobile PC Average System Power" and "Mobile PC Thermal Design (TDP) System Power", each split into µP, Memory+Graphics, LCD 10", HDD, Power Supply and Other. Slice values as extracted (assignment to slices not recoverable): 13%, 37%, 13%, 13%, 10%, 10%, 12%, 30%, 15%, 9%, 19%, 19%.] Note: Based on actual measurements. • Multiple platform components comprise average power • CPU dominates thermal design power • Need to consider CPU & system power [Courtesy: N. Dutt; Source: V. Tiwari]
New ideas can actually reduce energy consumption [Figure: Pentium vs. Crusoe running the same multimedia application.] As published by Transmeta [www.transmeta.com]
Dynamic power management (DPM) Example: STRONGARM SA1100 • RUN: operational (400 mW) • IDLE: a sw routine may stop the CPU when not in use, while monitoring interrupts (50 mW) • SLEEP: Shutdown of on-chip activity (160 µW) [State diagram: RUN to IDLE and IDLE to RUN take 10 µs each; RUN to SLEEP and IDLE to SLEEP take 90 µs each (power fault signal); SLEEP to RUN takes 160 ms.]
Variable-voltage/frequency example: INTEL Xscale OS should schedule distribution of the energy budget. From Intel’s Web Site
Key requirement #2: Code-size efficiency • CISC machines: RISC machines designed for run-time efficiency, not for code-size efficiency • Compression techniques: key idea
Code-size efficiency • Compression techniques (continued): • 2nd instruction set, e.g. ARM Thumb instruction set: • 16-bit Thumb instruction ADD Rd, #constant: 001 (major opcode) 10 (minor opcode) Rd (source = destination) Constant • Dynamically decoded at run-time into the 32-bit ARM instruction 1110 001 01001 0Rd 0Rd 0000 Constant (constant zero extended) • Reduction to 65-70 % of original code size • 130% of ARM performance with 8/16 bit memory • 85% of ARM performance with 32-bit memory • Same approach for LSI TinyRisc, … • Requires support by compiler, assembler etc. [ARM, R. Gupta]
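The run-time decoding step for this one instruction can be sketched as a bit-field rewrite. The sketch follows the field layout given on the slide (001 10 Rd Constant expanding to 1110 001 01001 0Rd 0Rd 0000 Constant); the function name is hypothetical, and real decoders are hardware and cover the whole Thumb instruction set.

```python
def expand_thumb_add_imm(thumb: int) -> int:
    """Expand the 16-bit Thumb 'ADD Rd, #constant' into its 32-bit ARM form,
    using the bit layout shown on the slide."""
    if (thumb >> 11) != 0b00110:          # major opcode 001, minor opcode 10
        raise ValueError("not a Thumb ADD Rd, #constant instruction")
    rd = (thumb >> 8) & 0b111             # 3-bit register, source = destination
    constant = thumb & 0xFF               # 8-bit constant, zero extended
    return (
        (0b1110 << 28)     # condition field: always
        | (0b001 << 25)    # immediate data-processing class
        | (0b01001 << 20)  # opcode ADD with the flag-setting bit
        | (rd << 16)       # first operand register (0Rd)
        | (rd << 12)       # destination register (0Rd)
        | constant         # rotate field 0000 followed by the constant
    )

print(hex(expand_thumb_add_imm(0x3105)))  # ADD R1, #5
```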
Dictionary approach, two level control store (indirect addressing of instructions) "Dictionary-based coding schemes cover a wide range of various coders and compressors. Their common feature is that the methods use some kind of a dictionary that contains parts of the input sequence which frequently appear. The encoded sequence in turn contains references to the dictionary elements rather than containing these over and over." [Á. Beszédes et al.: Survey of Code-Size Reduction Methods, ACM Computing Surveys, Vol. 35, Sept. 2003, pp 223-267]
Key idea (for d-bit instructions) Uncompressed storage of $a$ d-bit-wide instructions requires $a \times d$ bits. In compressed code, each instruction pattern is stored only once. For each instruction address, S contains the table address of the instruction. Hopefully, $a \times b + c \times d < a \times d$. Called nanoprogramming in the Motorola 68000. [Figure: the instruction address indexes S ($a$ entries of $b$ bits each, $b \ll d$); the entry selects one of $c \le 2^b$ d-bit words in the small table of used instructions ("dictionary"), which feeds the CPU.]
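The size comparison a x b + c x d versus a x d can be checked with a small sketch. It builds the index table S and the dictionary for a list of instruction words; the function names and the sample program words are assumptions chosen only for illustration.

```python
def dictionary_compress(program):
    """Return (index_table, dictionary): each distinct instruction word is
    stored once, and the index table refers to it by position."""
    dictionary, position, index_table = [], {}, []
    for word in program:
        if word not in position:
            position[word] = len(dictionary)
            dictionary.append(word)
        index_table.append(position[word])
    return index_table, dictionary

def storage_bits(program, d):
    """Return (uncompressed_bits, compressed_bits) for d-bit instructions."""
    index_table, dictionary = dictionary_compress(program)
    a, c = len(index_table), len(dictionary)
    b = max(1, (c - 1).bit_length())      # smallest b with c <= 2**b
    return a * d, a * b + c * d

# Hypothetical program: 8 instruction slots, only 3 distinct 32-bit words.
program = [0xE3A00000, 0xE2800001, 0xE2800001, 0xE1500001,
           0xE2800001, 0xE1500001, 0xE3A00000, 0xE2800001]
plain, packed = storage_bits(program, d=32)
print(plain, packed)
```

The saving grows with the amount of repetition; a program in which every word is unique ends up larger than the original, because the index table is pure overhead.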
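Key requirement #3: Run-time efficiency - Domain-oriented architectures - Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$, computed as $\forall i: 0 \le i \le n-1: \; y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$ Architecture: Example: Data path ADSP210x [Figure: data memory D (holding x) and program memory P (holding a) feed the multiplier inputs MX and MY and the ALU inputs AX and AY; the multiplier result x[j-i]*a[i] goes to MF and is accumulated (+,-) in MR, which holds $y_{i-1}[j]$; the ALU (+,-,..) writes AF and AR; the address generation unit (AGU) holds the address registers A0, A1, A2, .. pointing to i+1, j-i+1.] Application maps nicely onto architecture: MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0]; for ( j:=1 to n) {MR:=MR+MX*MY; MY:=a[A1]; MX:=x[A2]; A1++; A2--}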
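The register-level loop on the slide can be mirrored in software as a sketch. The variable names follow the slide's registers (MR, MX, MY, A1, A2); the guard around the operand prefetch is an addition, needed because a software list, unlike the hardware address generator, rejects an access one element past the end.

```python
def fir_output(x, a):
    """Compute y[n-1] = sum over i of x[n-1-i] * a[i], one multiply-accumulate
    per iteration, with the next operands fetched alongside the MAC."""
    n = len(a)
    if len(x) != n:
        raise ValueError("x and a must have the same length")
    mr = 0                       # MR: accumulator
    a1, a2 = 1, n - 2            # A1, A2: address registers
    mx, my = x[n - 1], a[0]      # MX, MY: multiplier input registers
    for _ in range(n):
        mr += mx * my            # MR := MR + MX * MY
        if a1 < n:               # prefetch operands for the next iteration
            my, mx = a[a1], x[a2]
        a1 += 1                  # A1++
        a2 -= 1                  # A2--
    return mr

print(fir_output([1, 2, 3], [4, 5, 6]))
```

On the ADSP210x the multiply-accumulate, both operand fetches and both address updates fit into a single instruction, which is the point of the slide; the Python loop only reproduces the data flow.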
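Modulo addressing • Modulo addressing: Am++ means Am:=(Am+1) mod n (implements ring or circular buffer in memory) • Sliding window: the buffer holds the n most recent values of x [Figure: signal x over time t with a window of n samples ending at t1. Memory at t=t1: .., x[t1-1], x[t1], x[t1-n+1], x[t1-n+2], .. Memory at t2=t1+1: .., x[t1-1], x[t1], x[t1+1], x[t1-n+2], ..]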
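A software equivalent of the ring buffer is sketched below. The class and method names are hypothetical; `am` plays the role of the address register Am, and the post-increment is taken modulo n exactly as in the definition above.

```python
class RingBuffer:
    """Sliding window over the n most recent samples using modulo addressing."""

    def __init__(self, n):
        self.n = n
        self.data = [0] * n
        self.am = 0                      # address register Am: next write slot

    def push(self, sample):
        # The new sample overwrites the oldest one, then Am := (Am + 1) mod n.
        self.data[self.am] = sample
        self.am = (self.am + 1) % self.n

    def most_recent(self):
        """Return the window contents, newest sample first."""
        return [self.data[(self.am - 1 - k) % self.n] for k in range(self.n)]

window = RingBuffer(3)
for sample in [1, 2, 3, 4, 5]:
    window.push(sample)
print(window.data, window.most_recent())
```

No data is ever moved: advancing the window by one sample costs one store and one address update, which is why DSPs provide the modulo increment in hardware.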
Saturating arithmetic • Returns largest/smallest number in case of over/underflows • Example (4 bits): a = 0111, b = 1001 • a+b: standard wrap around arithmetic: (1)0000; saturating arithmetic: 1111 • (a+b)/2: correct: 1000; wrap around arithmetic: 0000; saturating arithmetic + shifted: 0111 ("almost correct") • Appropriate for DSP/multimedia applications: • No timeliness of results if interrupts are generated for overflows • Precise values less important • Wrap around arithmetic would be worse.
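The 4-bit example can be replayed with a short sketch. It assumes unsigned 4-bit operands, as in the slide's bit patterns; the helper names are illustrative.

```python
BITS = 4
MAX_VALUE = (1 << BITS) - 1          # 1111 for 4-bit unsigned values

def add_wrap(a, b):
    """Standard wrap-around addition: the carry out of the top bit is lost."""
    return (a + b) & MAX_VALUE

def add_saturate(a, b):
    """Saturating addition: clamp to the largest representable value."""
    return min(a + b, MAX_VALUE)

a, b = 0b0111, 0b1001
exact_mean = (a + b) // 2            # what a wider datapath would deliver
wrap_mean = add_wrap(a, b) >> 1
sat_mean = add_saturate(a, b) >> 1

for label, value in [("exact", exact_mean), ("wrap", wrap_mean), ("saturate", sat_mean)]:
    print(f"{label:9s}{value:04b}")
```

The saturated result is off by one least-significant bit, while the wrapped result is off by half the value range.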
Fixed-point arithmetic Shifting required after multiplications and divisions in order to maintain binary point.
Properties of fixed-point arithmetic • Automatic scaling a key advantage for multiplications. • Example: x = 0.5 × 0.125 + 0.25 × 0.125 = 0.0625 + 0.03125 = 0.09375. For iwl=1 and fwl=3 decimal digits, the less significant digits are automatically chopped off: x = 0.093 • Like a floating point system with numbers in [0..1), with no stored exponent (bits used to increase precision). • Appropriate for DSP/multimedia applications (well-known value ranges).
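The chopping in this example can be reproduced with scaled integers. This is a sketch assuming fwl = 3 decimal fraction digits, so every value is stored as an integer count of thousandths; the helper names are hypothetical, and a binary fixed-point unit would rescale with a shift instead of a division.

```python
FWL = 3                     # fraction word length in decimal digits
SCALE = 10 ** FWL           # stored integer = value * 1000

def to_fixed(value):
    """Convert a value in [0, 1) to its scaled-integer representation."""
    return int(round(value * SCALE))

def fixed_mul(p, q):
    """Multiply two fixed-point values and rescale; the integer division
    chops off the digits that no longer fit (non-negative operands)."""
    return (p * q) // SCALE

def to_text(p):
    return f"0.{p:0{FWL}d}"

x = fixed_mul(to_fixed(0.5), to_fixed(0.125)) + fixed_mul(to_fixed(0.25), to_fixed(0.125))
print(to_text(x))
```

Each product is chopped individually (0.062 and 0.031) before the addition, which in this example gives the same result as chopping the exact sum 0.09375.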
Spatial vs. Dynamic Supply Voltage Management • Spatial: not all components require the same performance, e.g. slow module 1.3 V / 50 MHz, standard modules 1.8 V / 100 MHz, busy module 3.3 V / 200 MHz • Dynamic: required performance may change over time, e.g. normal mode 1.3 V / 50 MHz, busy mode 3.3 V / 200 MHz • Analogy of biological blood systems: • Different supply to different regions • High pressure: High pulse count and High activity • Low pressure: Low pulse count and Low activity
Example: Processor with 3 voltages. Task that needs to execute $10^9$ cycles within 25 seconds. Case a): Complete task ASAP: $E_a = 10^9 \times 40 \times 10^{-9} = 40$ [J]
Case b): Two voltages: $E_b = 750 \times 10^6 \times 40 \times 10^{-9} + 250 \times 10^6 \times 10 \times 10^{-9} = 32.5$ [J]
Case c): Optimal voltage: $E_c = 10^9 \times 25 \times 10^{-9} = 25$ [J]
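The three cases can be recomputed with a short sketch. The energy per cycle (40, 25 and 10 nJ) comes from the formulas above; the voltage labels follow the 5.0 V, 4.0 V and 2.5 V levels used elsewhere in the deck, and the clock frequencies (50, 40 and 25 MHz) are assumptions, chosen because they make cases b and c finish exactly at the 25 s deadline.

```python
# voltage [V] -> (energy per cycle [nJ], assumed maximum clock [MHz])
LEVELS = {5.0: (40, 50), 4.0: (25, 40), 2.5: (10, 25)}

def run(schedule):
    """schedule: list of (cycles, voltage) pairs executed back to back.
    Returns (energy in joules, execution time in seconds)."""
    energy_nj = sum(cycles * LEVELS[v][0] for cycles, v in schedule)
    seconds = sum(cycles / (LEVELS[v][1] * 1e6) for cycles, v in schedule)
    return energy_nj / 1e9, seconds

case_a = run([(10**9, 5.0)])                              # finish as soon as possible
case_b = run([(750 * 10**6, 5.0), (250 * 10**6, 2.5)])    # two voltages
case_c = run([(10**9, 4.0)])                              # single ideal voltage

for name, (energy, seconds) in [("a", case_a), ("b", case_b), ("c", case_c)]:
    print(f"case {name}: {energy} J in {seconds} s")
```

Under these assumed frequencies case a leaves 5 s of slack unused, which is exactly the slack that cases b and c convert into energy savings.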
Observations A minimum energy consumption is achieved for the ideal supply voltage of 4 Volts. In the following: variable voltage processor = processor that allows any supply voltage up to a certain maximum. It is expensive to support truly variable voltages, and therefore, actual processors support only a few fixed voltages. Ishihara, Yasuura: "Voltage scheduling problem for dynamically variable voltage processors", Proc. of the 1998 International Symposium on Low Power Electronics and Design (ISLPED'98)
Generalization Lemma [Ishihara, Yasuura]: • If a variable voltage processor completes a task before the deadline, then the energy consumption can be reduced. • If a processor uses a single supply voltage $V$ and completes a task $T$ just at its deadline, then $V$ is the unique supply voltage which minimizes the energy consumption of $T$. • If a processor can only use a number of discrete voltage levels, then a voltage schedule with at most two voltages minimizes the energy consumption under any time constraint. • If a processor can only use a number of discrete voltage levels, then the two voltages which minimize the energy consumption are the two immediate neighbors of the ideal voltage $V_{ideal}$ possible for a variable voltage processor.
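The last two statements of the lemma suggest a simple procedure, sketched here in terms of clock frequencies rather than voltages (each frequency is assumed to be the maximum clock of one voltage level). The function name and the frequency values are assumptions; the cycle split follows from requiring that the task ends exactly at the deadline.

```python
def two_level_split(cycles, deadline, freqs):
    """Split `cycles` between the two available frequencies neighbouring the
    ideal frequency cycles / deadline, so the task ends exactly at the deadline.
    Returns a dict {frequency in Hz: cycles executed at that frequency}."""
    ideal = cycles / deadline
    freqs = sorted(freqs)
    if ideal > freqs[-1]:
        raise ValueError("deadline cannot be met even at the highest frequency")
    if ideal <= freqs[0]:
        return {freqs[0]: cycles}            # slowest level already suffices
    hi = min(f for f in freqs if f >= ideal)
    lo = max(f for f in freqs if f <= ideal)
    if hi == lo:
        return {hi: cycles}                  # ideal level is available
    # Solve x_hi / hi + (cycles - x_hi) / lo = deadline for x_hi.
    x_hi = hi * (cycles - lo * deadline) / (hi - lo)
    return {hi: x_hi, lo: cycles - x_hi}

# Assumed levels: only 25 MHz and 50 MHz available, 10^9 cycles, 25 s deadline.
print(two_level_split(1e9, 25, [25e6, 50e6]))
# With an additional 40 MHz level the ideal frequency itself is available.
print(two_level_split(1e9, 25, [25e6, 40e6, 50e6]))
```

With only the two outer levels the sketch arrives at 750 million cycles on the fast level and 250 million on the slow one, the same split as in the two-voltage case of the example.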
The case of multiple tasks: Assigning optimum voltages to a set of tasks
• $N$: the number of tasks
• $EC_j$: the number of execution cycles of task $j$
• $L$: the number of voltages of the target processor
• $V_i$: the $i$th voltage, with $1 \le i \le L$
• $F_i$: the clock frequency for supply voltage $V_i$
• $T$: the global deadline at which all tasks must have been completed
• $SC_j$: the average switching capacitance during the execution of task $j$ ($SC_j$ comprises the actual capacitance $C_L$ and the switching activity $\alpha$)
• $X_{i,j}$: the number of clock cycles task $j$ is executed at voltage $V_i$
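Designing an IP model
Minimize $E = \sum_{j=1}^{N} \sum_{i=1}^{L} SC_j \cdot X_{i,j} \cdot V_i^2$
Subject to $\forall j: \sum_{i=1}^{L} X_{i,j} = EC_j$ and $\sum_{j=1}^{N} \sum_{i=1}^{L} \frac{X_{i,j}}{F_i} \le T$
Simplifying assumptions of the IP model include the following: • There is one target processor that can be operated at a limited number of discrete voltages. • The time for voltage and frequency switches is negligible. • The worst case number of cycles for each task is known.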
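Before handing such a model to a solver, it helps to be able to evaluate a candidate assignment. The sketch below assumes the usual form of the model (energy as the sum of SC_j * X_ij * V_i^2, every task executing all of its cycles, total run time within T); the function names are hypothetical, and the value SC = 1.6 nF is an assumption chosen so that one cycle at 5 V costs 40 nJ.

```python
def schedule_energy(X, SC, V):
    """Objective: sum over voltages i and tasks j of SC[j] * X[i][j] * V[i]^2."""
    return sum(SC[j] * X[i][j] * V[i] ** 2
               for i in range(len(V)) for j in range(len(SC)))

def is_feasible(X, EC, F, T):
    """Constraints: each task runs exactly EC[j] cycles in total, and the
    summed execution time over all voltages and tasks does not exceed T."""
    levels, tasks = range(len(F)), range(len(EC))
    cycles_ok = all(sum(X[i][j] for i in levels) == EC[j] for j in tasks)
    time_ok = sum(X[i][j] / F[i] for i in levels for j in tasks) <= T
    return cycles_ok and time_ok

# Assumed single-task instance: 10^9 cycles, deadline 25 s, three levels.
V = [5.0, 4.0, 2.5]                    # volts
F = [50e6, 40e6, 25e6]                 # hertz (assumed maximum clocks)
EC, SC, T = [10**9], [1.6e-9], 25

all_fast = [[10**9], [0], [0]]
two_levels = [[750 * 10**6], [0], [250 * 10**6]]
all_slow = [[0], [0], [10**9]]

for name, X in [("all fast", all_fast), ("two levels", two_levels), ("all slow", all_slow)]:
    print(name, is_feasible(X, EC, F, T), schedule_energy(X, SC, V))
```

The all-slow assignment has the lowest energy but misses the deadline, which is the trade-off the integer program resolves.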
Voltage Scheduling Techniques • Static Voltage Scheduling • Extension: Deadline for each task • Formulation as IP problem (SS) • Decisions taken at compile time • Dynamic Voltage Scheduling • Decisions taken at run time • 2 Variants: • arrival times of tasks are known (SD) • arrival times of tasks are unknown (DD)
Dynamic Voltage Control by Operating Systems Voltage Control and Task Scheduling by Operating System to minimize energy consumption Okuma, Ishihara, and Yasuura: "Real-Time Task Scheduling for a Variable Voltage Processor", Proc. of the 1999 International Symposium on System Synthesis (ISSS'99) Target: • single processor system • Only OS can issue voltage control instructions • Voltage can be changed anytime • only one supply voltage is used at any time • overhead for switching is negligible • static determination of worst case execution cycles
Problem for Operating Systems [Figure: three tasks on a time axis, each with an arrival time and a deadline; Task1 at 2.5 V, Task2 at 5.0 V, Task3 at 4.0 V.] What is the optimum supply voltage assignment for each task in order to obtain minimum energy consumption?
The proposed Policy [Figure: a task executed within a time slot of length T.] Consider a time slot the task can use without violating real-time constraints of other tasks executed in the future. Once the time slot is determined: • The task is executed at a frequency of WCEC / T Hz • The scheduler assigns start and end times of the time slot
Two Algorithms Two possible situations: • The arrival time of tasks is known: SD Algorithm (Static ordering and Dynamic voltage assignment) • The arrival time of tasks is unknown: DD Algorithm (Dynamic ordering and Dynamic voltage assignment) When each step is performed: • SD: CPU Time Allocation off-line, Start Time Assignment on-line, End Time Prediction off-line • DD: CPU Time Allocation on-line, Start Time Assignment on-line, End Time Prediction on-line
SD Algorithm (CPU Time Allocation) • Arrival time of all tasks is known • Deadline of all tasks is known • WCEC of all tasks is known • CPU time can be allocated statically CPU time is assigned to each task: • assuming maximum supply voltage • assuming WCEC [Figure: Task1 and Task2 placed on the time axis.]
SD Algorithm (Start Time Assignment) [Figure: Task1 (WCEC @ Vmax) finishes early at the current time, leaving free time before Task2; Task2 is then stretched over the free time.] • In SD, it is possible to assign a lower supply voltage to Task2 using the free time • In SS, the scheduler can't use the free time because it has statically assigned voltage
DD Algorithm When the task's arrival time is unknown, its end time can't be predicted statically using the SD algorithm. No predetermined CPU time, start or end times. Start Time Assignment: • New task arrives; it either: • Preempts currently executing task • Starts right after currently executing task • Starting time is determined [Figure: Task2 arriving at the current time while Task1 executes.]
DD Algorithm (cont.) End Time Prediction: Based on the currently executing task's end time prediction, add the new task's WCEC time at maximum voltage. [Figure: Task1 followed by Task2; completion time assigned at CPU time allocation.]
DD Algorithm (cont.) If the currently executing task finishes earlier, then the new task can start sooner and run slower at a lower voltage. [Figure: Task1 finishing before its predicted end; Task2 stretched over the reclaimed time.]
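The end-time prediction and the slack reclaiming described for the DD algorithm can be put together in a small sketch. It assumes a single waiting task, no preemption, and a clock that can be set to any frequency up to `f_max`; the function name and all numbers are illustrative assumptions.

```python
def dd_assign(now, predicted_end_current, wcec_new, f_max):
    """Return (predicted_end_new, frequency) for a task that starts at `now`.

    End time prediction: the predicted end of the current task plus the new
    task's worst-case execution cycles (WCEC) at maximum speed.
    Frequency: WCEC divided by the time slot actually available, so any time
    the current task did not use lets the new task run slower.
    """
    predicted_end_new = predicted_end_current + wcec_new / f_max
    slot = predicted_end_new - now
    if slot <= 0:
        raise ValueError("no time left before the predicted end time")
    return predicted_end_new, wcec_new / slot

F_MAX = 50e6            # Hz, assumed clock at the maximum voltage
WCEC_TASK2 = 100e6      # cycles, assumed worst case of the new task

# Task1 was predicted to end at t = 10 s.
on_time = dd_assign(now=10.0, predicted_end_current=10.0, wcec_new=WCEC_TASK2, f_max=F_MAX)
early = dd_assign(now=8.0, predicted_end_current=10.0, wcec_new=WCEC_TASK2, f_max=F_MAX)
print(on_time, early)
```

When Task1 ends two seconds early, the slot for Task2 doubles and its clock can be halved without moving its predicted end time.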
Comparison: SD vs. DD [Figure: two timelines, one for the SD Algorithm and one for the DD Algorithm, each marking the start time and end time of a task.]
Experimental Results: Energy Normal: Processor runs at maximum supply voltage SS: Static Scheduling SD: Scheduling done by SD Algorithm DD: Scheduling done by DD Algorithm
Dynamic power management (DPM) Dynamic power management tries to assign optimal power saving states. Requires hardware support. Example: StrongARM SA1100 • RUN: operational (400 mW) • IDLE: a sw routine may stop the CPU when not in use, while monitoring interrupts (50 mW) • SLEEP: Shutdown of on-chip activity (160 µW) [State diagram: RUN to IDLE and IDLE to RUN take 10 µs each; RUN to SLEEP and IDLE to SLEEP take 90 µs each; SLEEP to RUN takes 160 ms.]
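Choosing a power saving state amounts to a break-even calculation, sketched here for an idle period of known length. The state powers are those of the slide; the sketch assumes that entering SLEEP takes 90 µs, waking up takes 160 ms, and that the processor draws RUN power during both transitions. These assumptions and all names are illustrative, not taken from the SA1100 documentation.

```python
P_RUN, P_IDLE, P_SLEEP = 0.400, 0.050, 160e-6    # watts
T_DOWN, T_UP = 90e-6, 160e-3                     # seconds, assumed transition times

def energy_idle(period):
    """Energy in joules when the processor waits in IDLE for the whole period."""
    return P_IDLE * period

def energy_sleep(period):
    """Energy in joules when the processor sleeps and is awake again at the end
    of the period; transitions are charged at RUN power (an assumption)."""
    overhead = T_DOWN + T_UP
    if period < overhead:
        raise ValueError("idle period too short to enter and leave SLEEP")
    return P_RUN * overhead + P_SLEEP * (period - overhead)

def break_even():
    """Idle period length at which SLEEP and IDLE cost the same energy."""
    overhead = T_DOWN + T_UP
    return overhead * (P_RUN - P_SLEEP) / (P_IDLE - P_SLEEP)

print(f"break-even idle period: {break_even():.3f} s")
for period in (0.5, 10.0):
    print(period, energy_idle(period), energy_sleep(period))
```

Under these assumptions SLEEP only pays off for idle periods longer than roughly 1.3 s, so a power manager needs a prediction of how long the coming idle period will last.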