Resource management in embedded systems Tajana Šimunić Rosing HP Labs & Stanford University
Embedded System Trends • Increasingly complex systems • Mixed hard & soft real-time requirements • Coordination of subsystem & multi-system control • Demand for dynamic & adaptive response • Operation in unpredictably changing contexts • Variable performance demands • Management of resources (power, performance, availability, accessibility, throughput, security etc.) • Demand for autonomy • Human feasibility restrictions • Faster hardware • Fast processors and networks • Integrated processing, common platforms • FPGAs vs ASICs, DSPs; SoCs Source : NSF EU-US Workshop ‘01
Overview of Contributions • Resource management in wireless embedded systems • Utilize information already present in the system to deliver lower energy consumption with excellent quality of service • Simultaneous increase in accessibility • Power management of embedded HW • Evolution of SoCs into NoCs • Optimal power and performance management policy for NoCs • Energy optimization of embedded software • Methodology for lowering energy consumption in data-intensive embedded software algorithms using symbolic algebra techniques
Resource Management of Heterogeneous Wireless Embedded Systems
Networked Embedded Systems – Sensor Nodes [Figure: energy breakdown for MPEG video and for voice across encode, decode, transmit and receive; component labels: radio, TX, RX, sleep, idle, sensors, CPU] Lucent WaveLAN at 2 Mbps & SA-1100 CPU at 150 MIPS Source: Mobicom’01 Sensors Tutorial
802.11b PM - Doze Mode • Heavy traffic conditions: the card is awake most of the time • Medium traffic conditions: the card is awake half of the time • Light traffic conditions: the card is sleeping most of the time [Figure: timelines of awake and doze states with network polling for each traffic condition] • MAC layer PM is not sufficient due to continual network polling for data, increased RTT and broadcast traffic issues • Network layer management has no knowledge of application characteristics or of behavior of other clients in the environment
Related Work in PM for Networks • Homogeneous networks: • 802.11e standard allows applications to specify their QoS needs • Separate control and data channels for 802.11 based networks [Shi’02] • Decoupled low power control channel, device sends a wake-up call • Protocols for sensor networks [Estrin’02] • Low duty cycle operations on radio • Power efficient coordination in 802.11b networks [Chen’01] • Forwarding nodes remain active while neighboring nodes are in power save • Heterogeneous networks • Improved network and link layers for mobility • Mobile IP protocol for host mobility [Monarch CMU] • Contact networking [HPL & UIUC, MobiSys '03 ] • Improved hand-off mechanism in overlay networks • Buffering of data at multiple base stations [Barwan UCB] • Distributed file system for wireless [Coda,Odyssey CMU] • Caching of files at clients and duplication on accessible servers • Adjust quality of accessed data to match available resources
Related work in PM [Figure: system model with a power manager that applies a policy and issues commands, a client, a request queue and data; power states: active, idle, sleep] • Focus on power management of a single device (e.g. CPU, hard disk) • Heuristic techniques • Time-out and predictive models [Karlin’94, Hwang 97] • DVS algorithms • Stochastic methods • MDP [Benini 99, Qiu 99] • Memoryless distributions only • Discrepancy between predicted and measured savings • TISMDP [Simunic 01] • Finite history of user occurrences • Decisions based on event occurrences • Large energy savings measured
System Resource Management [Figure: client RM and server RM connected over WAN, WLAN and WPAN] • Concentration on resource management aspects of diverse devices in a heterogeneous wireless network • maintain desired QoS • increase accessibility • maximize battery lifetime • Implementation of policies to define • Which devices should communicate using which network interface • When should this communication take place • When should the communicating device be in a low power state • Separation of policy and control • “smart” clients communicate their needs to the resource manager • stochastic modeling of communication patterns for independent clients • policy takes into account client’s needs and determines control decisions
Comparison: State of the Art vs. Server-driven Management • State of the art: server continuously streaming, wireless always on • 802.11b • Doze interval regulation • Manual on/off • Bluetooth • Park, sniff, hold modes have to be specifically initiated by the server • Server-driven management: server with adaptive buffering, wireless on only when data arrives • Control power state of all components • Use application knowledge – higher savings possible than just at MAC and network layers • Schedule/buffer with multiple clients [Figure: server sends data and PM controls to the client]
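Server RM Architecture Detail [Figure: client application and server application communicate over RTP/UDP/TCP; the server RM sits at the application layer above TCP; on the client, the client RM sits in the OS above low-power device drivers and the device RM, which handle dynamic clock speed setting and CPU idle/sleep mode]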
Server RM algorithm [Figure: algorithm flow: estimation, buffer size calculation, memory energy, communication energy & switch evaluation, WNIC settings] • Server has the additional knowledge of multiple clients’ workload & traffic conditions • Efficient transmission coordination in a multi-client environment • Scheduling based on the link quality, client data consumption rates and power needs • Server decides the RM policy • Enable/disable various resources (e.g. BT park) • Adjust resource parameters (e.g. doze interval) • Set duration of the sleep intervals • Schedule on-time wake-up (no delay) • Server RM relies on the client RM • Client RM controls the devices • Utilizes Rate Monotonic and Earliest Deadline First algorithms to schedule communication with multiple clients
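The two scheduling orders named on this slide can be sketched in a few lines. This is a minimal illustration, not the server RM implementation: the `Client` class, its fields and the idea that a client's deadline is the moment its buffer runs dry are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical view the server keeps of one client (illustrative only)."""
    name: str
    rate_kbps: float      # data consumption rate of the client's application
    buffered_kbit: float  # data currently sitting in the client's buffer

    def deadline_s(self) -> float:
        # Assumed deadline: time until the buffer runs dry if nothing is sent.
        return self.buffered_kbit / self.rate_kbps


def next_client_edf(clients):
    """Earliest Deadline First: serve the client that runs dry soonest."""
    return min(clients, key=lambda c: c.deadline_s())


def next_client_rm(clients):
    """Rate Monotonic: serve the client with the highest consumption rate
    (the shortest period) first."""
    return max(clients, key=lambda c: c.rate_kbps)


clients = [
    Client("mp3", rate_kbps=128, buffered_kbit=1280),     # 10 s of data left
    Client("video", rate_kbps=500, buffered_kbit=10000),  # 20 s of data left
    Client("email", rate_kbps=10, buffered_kbit=5),       # 0.5 s of data left
]
```

With these made-up numbers the two orders disagree: EDF serves the nearly empty email client first, while RM serves the high-rate video client first.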
Estimation and buffer sizing process • Maximum likelihood estimator keeps track of changes in WNIC throughput and data usage pattern of the applications • Size of buffer chosen to maximize sleep times • Total buffer size: • Buffer region actively involved in data transfer in steady state: • Buffer size required during interface switch: • Average sleep time:
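The formulas for this slide did not survive extraction, so the sketch below is not a reconstruction of them. It only illustrates why a larger buffer lengthens sleep time, under a deliberately simple model: the client drains its buffer at a constant rate, the WNIC refills it at a constant higher throughput, and the WNIC wakes up just early enough to be ready when the buffer empties. The function name and all parameters are assumptions.

```python
def sleep_cycle(buffer_kbit, rate_kbps, throughput_kbps, wake_s=0.0):
    """Return (fill_s, sleep_s) for one burst-then-sleep cycle.

    Assumed model, not the author's: the buffer fills at the net rate
    (throughput - consumption) while the WNIC is on, then the application
    drains it at the consumption rate while the WNIC sleeps. The WNIC
    starts waking wake_s seconds before the buffer is empty.
    """
    if throughput_kbps <= rate_kbps:
        raise ValueError("WNIC throughput must exceed the consumption rate")
    fill_s = buffer_kbit / (throughput_kbps - rate_kbps)
    sleep_s = max(buffer_kbit / rate_kbps - wake_s, 0.0)
    return fill_s, sleep_s


# Example with made-up numbers: a 1280 kbit buffer, a 128 kbps MP3 stream,
# a 2000 kbps link and a 0.1 s wake-up transition.
fill_s, sleep_s = sleep_cycle(1280, 128, 2000, wake_s=0.1)
```

In this model sleep time grows linearly with buffer size, which is why the energy cost of the extra RAM has to be weighed against the WNIC savings.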
Energy consumed by WNIC and RAM [Figures: average RAM power vs. buffer size; average WNIC power vs. buffer size; interface switching] • WNIC energy: • RAM energy: • Total energy:
Server RM Scheduling Algorithms • Classical scheduling theory result: There is no dynamic scheduling algorithm for dynamically changing tasks on multiple processors that is provably optimal [Liu’73] • Consequence: server RM scheduler for a multi-client system has to be heuristic • Two different algorithms are compared: • Earliest Deadline First: • Schedule clients with earliest deadline first • Optimal dynamic preemptive algorithm based on dynamic priorities for multiple tasks on a uniprocessor can achieve 100% utilization if: $\sum_{j=1}^{n} \mathrm{WCET}_j / T_j \le 1$ • Rate Monotonic: • Schedule clients with highest data consumption rate first (shortest period) • According to RM, a set of periodic, independent tasks can be scheduled to meet their deadlines on a uniprocessor if the sum of the utilization factors satisfies: $\sum_{j=1}^{n} \mathrm{WCET}_j / T_j \le U(n) = n(2^{1/n} - 1)$ • Where: • $\mathrm{WCET}_j$ is the worst-case execution time of task j • $T_j$ is the period of task j • $U(n)$ is the utilization bound for n tasks
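The two uniprocessor schedulability conditions can be checked directly. The sketch below uses the classical bounds for periodic, independent tasks: total utilization at most 1 for EDF, and at most n(2^(1/n) - 1) for Rate Monotonic. The task sets are made-up examples, and the function names are illustrative.

```python
def utilization(tasks):
    """Sum of WCET_j / T_j over (wcet, period) pairs."""
    return sum(wcet / period for wcet, period in tasks)


def rm_bound(n):
    """Rate Monotonic utilization bound U(n) = n * (2^(1/n) - 1)."""
    return n * (2 ** (1.0 / n) - 1)


def edf_schedulable(tasks):
    """EDF meets all deadlines on a uniprocessor iff utilization <= 1."""
    return utilization(tasks) <= 1.0


def rm_schedulable(tasks):
    """Sufficient (not necessary) test: utilization <= U(n)."""
    return utilization(tasks) <= rm_bound(len(tasks))


light = [(1, 4), (2, 5), (1, 10)]  # utilization 0.75, U(3) is about 0.78
heavy = [(2, 4), (2, 5)]           # utilization 0.90, U(2) is about 0.83
```

The `heavy` set passes the EDF test but fails the RM bound. Because the RM bound is only sufficient, failing it does not prove that the set is unschedulable under RM; it only means the bound cannot guarantee it.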
Experimental setup • Research prototype of HP’s hotspot server • HP’s iPAQ 3970 with Bluetooth (CSR) and 802.11b (Cisco Aironet 350) • Simulator for multiple client scenarios • The applications used are • MPEG4 video • MP3 audio • Email • Telnet • WWW • DSR • Results for: • Server RM to single client • Server RM to multiple clients [Table: data consumption rate of applications, in kbps]
Server RM with MPEG4 Video Streaming • 30 frames/sec to a mobile • 40% power reduction • 15 frames/sec to a mobile • 65% power reduction • Larger number of concurrent clients with real-time playback
Effect of Traffic Conditions • Server RM saves 50% of power in heavy traffic conditions with respect to 802.11b PM • The server RM performs better in medium/heavy traffic conditions • 802.11b PM is useful for systems where broadcast traffic cannot be discarded • Both implementations have real time video playback
Server RM with MP3 Audio over 802.11b • Large energy savings with concurrent increase in number of clients supported • factor of 24 relative to current usage model for 802.11b • factor of 3 with respect to 802.11 PM with no broadcast traffic • Quality of service remains constant – playback is in real time
Server RM for MP3 & E-mail via Bluetooth • Large increase in availability of server to clients • Factor of 2 for MP3 streaming • Factor of 1000 for email • Concurrent large savings in power • Factor of 2 for MP3 streaming and a factor of 28 for email
WNIC switch • Comparison of power consumption for an application trace consisting of MP3 audio, Email, Telnet, WWW and MPEG4 video. • A factor of 3.0x improvement over solely using Bluetooth or 802.11b
Server RM to Multiple Clients • Compared Server RM Rate Monotonic scheduler with Earliest Deadline First over a time period of 2 hours on two WNICs: Bluetooth (3 clients) and WLAN (30 clients) with multiple applications • Clients with lower scheduling priority had an average 10% delay penalty • Maximum power savings factor of 42; average a factor of 4.6 compared to always on: 0.18 W • Maximum power savings factor of 40; average a factor of 2 compared to always on: 0.8 W with 802.11b PM
Related Work • SOC interconnect standards [AMBA,CoreConnect,VSI,OCP] • NOC architecture based on packet model • Fat tree router topology [Guerrier00] • Tiled architecture with flit-reservation flow control [Dally01] • Correct-by-construction protocol stack – MESCAL tools [Sgroi01] • Reduction of energy consumption in NOCs • Maia processor has 21 satellite units; its configuration changes according to application needs – large energy savings [Wan00] • Node and network-centric power management suggested [Benini02] • Recently proposed power management systems • Exclusively node-centric, with little or no outside information utilized • Power management & dynamic voltage scaling occur separately • Open loop control • policies designed once with no further optimization at run time
Network on a Chip [Figure: four cores connected through a router: MPEG audio core, speech processing core, MPEG video core and communications core; the cores are built from blocks such as ARM and DSP cores, DMA and buffer controllers, RAM and flash controllers, PCM, audio in/out, MAC, baseband controller, radio, display and EEPROM, each with a local power manager (PM)]
Power Manager Implementation [Figure: cores attached to a router; the local power manager contains an estimator and a controller with a policy function, fed by core traffic and network PM requests] • Power management • Node-centric – fully contained in a local power manager • Network-centric – network power management requests • Local power manager implements closed-loop power management: • Estimator • Observes incoming core traffic, core state & network PM requests • Estimates parameters used in recalculation of power management policy • Controller • Sets core’s energy and performance states based on estimator input
Node-centric PM • Power management is based on Renewal Model [Figure: state diagram with active state (f0, V0), idle state, sleep state, transition to sleep state and transition to active state; edges are labeled arrival, no arrival, departure, queue > 0 and queue = 0; the renewal point is marked]
Renewal Policy Optimization • Basic assumptions: • general distribution governs the first request arrival • exponential distribution represents arrivals after the first arrival • user, device and queue are stationary • Optimize average performance under average power constraint • randomized policy • Globally optimal policy calculated in seconds using LP
Closed-loop Renewal Policy Optimization • Formulate dual of the Lagrangian • Variables v,u & l are the Lagrangian multipliers • Obtain a minimum crossing point of a set of lines specified by the following equation: • Indexes of Lagrangian multipliers which form a solution, together with original constraints, are used to obtain the probabilities of transitioning into sleep state • Real-time closed-loop control is possible Globally optimal policy calculated in milliseconds
Node-centric estimation • Estimation of exponential and Pareto distribution parameters [Figure: exponential and Pareto panels] • estimate parameters a & b using least-squares method on the log of Pareto distribution • calculate maximum likelihood ratio for all rate settings • calculate interarrival (or interservice) time sums ($\sum_j t_j$) • evaluate natural log of maximum likelihood ratio, $\ln(P_{max})$ • if ratio is larger than the one obtained from the lookup table, assume that the rate has changed
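Both estimation steps can be sketched briefly. The code assumes a Pareto tail of the form P(T >= t) = a * t^(-b), so that the log of the tail is linear in ln t, and it uses the standard log-likelihood ratio between two exponential rates. The exact parameterization and the lookup-table thresholds used in the talk are not given on the slide, so the function names, the tail form and the empirical-tail construction are assumptions.

```python
import math


def fit_pareto_tail(samples):
    """Least-squares fit of ln P(T >= t) = ln(a) - b * ln(t).

    Uses the empirical tail: for the i-th smallest of n samples, the
    fraction of samples that are at least as large is (n - i) / n.
    Returns (a, b).
    """
    xs = sorted(samples)
    n = len(xs)
    pts = [(math.log(t), math.log((n - i) / n)) for i, t in enumerate(xs)]
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope


def log_likelihood_ratio(times, old_rate, new_rate):
    """ln of the likelihood ratio (new rate vs. old rate) for exponential
    interarrival times: m * ln(new/old) - (new - old) * sum(t_j)."""
    m = len(times)
    return m * math.log(new_rate / old_rate) - (new_rate - old_rate) * sum(times)


def rate_changed(times, old_rate, new_rate, threshold):
    """Declare a rate change when the log ratio exceeds a threshold.
    In the talk the threshold comes from a lookup table; here the caller
    supplies it."""
    return log_likelihood_ratio(times, old_rate, new_rate) > threshold
```

As a usage example, ten interarrival times of 0.5 s favor a rate of 2 per second over a rate of 1 per second, so the log ratio is positive, while ten interarrival times of 1 s give a negative log ratio.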
Controller implementation [Figure: optimal policy, Synopsys synthesis and FPGA synthesis] • Consists of LFSR for generating probability & policy logic • Controller on entry to idle state: • obtains a random number RND & finds jh for which RND > p(jh) • if no arrival during jh seconds, the core enters sleep state, otherwise it stays active • Frequency and voltage are set so the average expected processing delay in the queue is kept constant:
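A software sketch of the controller's idle-entry step follows. It is illustrative only: the 16-bit LFSR tap set is a commonly used maximal-length choice rather than the one in the synthesized design, the policy table is made up, and the comparison direction in `select_timeout` (first entry whose cumulative probability is at least RND) is one possible reading of the slide's condition on RND and p(jh).

```python
def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11).
    The tap choice is an assumption; any maximal-length set would do."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def select_timeout(policy, rnd):
    """Pick the idle timeout for this idle period.

    policy is a list of (timeout_s, cumulative_probability) pairs with
    increasing cumulative probability ending at 1.0 (hypothetical format).
    """
    for timeout_s, cumulative_p in policy:
        if rnd <= cumulative_p:
            return timeout_s
    return policy[-1][0]


def on_idle_entry(state, policy):
    """Advance the LFSR, scale it into (0, 1) and choose a timeout.
    If no request arrives within the timeout, the core goes to sleep."""
    state = lfsr16_step(state)
    rnd = state / 65536.0
    return state, select_timeout(policy, rnd)


# Made-up randomized policy: sleep after 0.1 s with probability 0.2,
# after 0.5 s with probability 0.5, otherwise after 2.0 s.
POLICY = [(0.1, 0.2), (0.5, 0.7), (2.0, 1.0)]
state, timeout_s = on_idle_entry(0xACE1, POLICY)
```

Keeping the policy as a small table means a newly computed policy can replace the old one at run time without touching the selection logic.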
Network centric PM [Figure: local power manager (estimator, controller with policy function) attached to a core and router, with core traffic and network PM requests as inputs; the renewal state diagram with active state (f0, V0), idle state, sleep state, transition to sleep state and transition to active state is extended with network request edges alongside arrival, no arrival, departure, queue > 0 and queue = 0]
Network centric PM implementation • Controller implementation changes: • When all network cores release the local core, the probability of transition to sleep is 1.0 • As soon as a request comes from a network core to the local core, the local core transitions to the active state with probability 1.0 • Estimator continues to have the same function as before • Renewal model is expanded to include network requests • Node-centric PM is still needed to implement DVS and PM in situations when early network requests are not available
Network-centric results [Figure: power savings factor for the MPEG audio core, speech processing, communications and MPEG video core] • Network-centric DPM increases power savings from a factor of 2.9 to 4.1, while at the same time reducing performance penalty by more than 10%
Complex Library Mapping for Embedded Software Using Symbolic Algebra
Related Work • Tree covering code generation [Aho] • Map to simple processor instruction • Retargetable compiler [Goossens, Paulin, Marwedel] • Instruction mapping of ASIPs • Power aware compiling • Memory access optimization [Catthoor, Kandemir] • Instruction reordering [Tiwari] • Symbolic algebra for data-flow synthesis [ICCAD’01]
Methodology [Figure: flow from algorithmic-level C code through profiling (critical code), polynomial formulation (target code) and symbolic library mapping to optimized C code using the embedded library; the pre-optimized embedded library feeds the mapping step through library characterization]
Library characterization • Target library: • Commercial library (e.g. Intel’s integrated performance primitives library) • A set of in-house pre-optimized routines • IEEE floating-point math library for Linux OS • Each library element is labeled with: • Type of inputs and outputs • Performance and energy consumption obtained from cycle-accurate simulator • Functionality & accuracy of its polynomial representation
Target code identification [Figure: algorithm source code is profiled to find the critical code, which polynomial formulation turns into target code] Algorithm source code: for (i=0; i<30; i++) { x[i] = y[i] + 2 * x[i + 1]; z[i] -= x[i]; y[i] = x[i] + z[i]; ... } Software profile (function, energy): getD 15%, sort 10%, init 2% Critical code: LD R21, #30; ADD R21, R23, R27; ... [Figure: profiler built on an ARM instruction-level simulator, with a processor core model, L1 cache and energy models for the processor & L1 cache, L2 cache, memory, interconnect, DC-DC converter and battery, producing energy consumption] • Identify critical code segments with a profiler • Formulate maximum size polynomials • Higher likelihood of finding more complex library elements • Achieved by transformations such as loop unrolling, constant and variable propagation, inlining… • Polynomial representation for the critical code segments is calculated as follows: • Linear functions are extracted directly • Bit manipulations or Boolean functions use interpolation-based algorithms [Smith01] • Nonlinear functions are approximated with a Taylor or Chebyshev expansion whose accuracy is verified via cycle-accurate simulation
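The last step, approximating a nonlinear function with a polynomial, can be illustrated with a truncated Taylor series for cosine. The sketch checks accuracy by sampling an interval numerically, which stands in for the cycle-accurate simulation mentioned on the slide; the helper names, the interval and the sample count are assumptions.

```python
import math


def taylor_cos_coeffs(order):
    """Coefficients c_k of x^k in the Taylor series of cos(x) about 0,
    truncated at the given order."""
    return [0.0 if k % 2 else (-1) ** (k // 2) / math.factorial(k)
            for k in range(order + 1)]


def horner(coeffs, x):
    """Evaluate sum(c_k * x^k) with Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def max_abs_error(coeffs, f, lo, hi, steps=1000):
    """Largest sampled |polynomial - f| on [lo, hi]."""
    worst = 0.0
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        worst = max(worst, abs(horner(coeffs, x) - f(x)))
    return worst


coeffs = taylor_cos_coeffs(4)  # 1 - x^2/2 + x^4/24
err = max_abs_error(coeffs, math.cos, -0.5, 0.5)
```

On the interval [-0.5, 0.5] the fourth-order polynomial stays within about 2.2e-5 of cos(x), the size of the first omitted term x^6/720 at the interval edge. Whether that is accurate enough depends on the application.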
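Mapping Algorithm [Figure: flowchart taking the polynomial representation of the critical code and the polynomial representation of the library elements as inputs; steps: THR, factor, expand, Horner, select side relation set, add to side relation set, simplify; the decision "Mapped?" either loops back (no) or leads to "choose best solution" (yes)]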
Example • Phase shift keying modulation • Map a code segment of PSK to Library • S := 1-.5*x0^2-x0*x1-.5*x1^2+.041667*x0^4+.166668*x0^3*x1+.250002*x0^2*x1^2+.166668*x0*x1^3+.041667*x1^4; [Figure: library containing Butterfly, IDCT, PSK, Cos, Sin and Mac elements]
Example • Phase shift keying modulation • Map a code segment of PSK to Library • S := 1-.5*x0^2-x0*x1-.5*x1^2+.041667*x0^4+.166668*x0^3*x1+.250002*x0^2*x1^2+.166668*x0*x1^3+.041667*x1^4; • siderel := {y=x0+x1}; • simplify(S, siderel, [x0,x1,y]); • Result: 1.+.041667*y^4-.5*y^2 • Mapped code: y := x0 + x1; s := cos(y); [Figure: library containing Butterfly, IDCT, PSK, Cos, Sin and Mac elements]
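The mapping in this example can be checked numerically without a computer algebra system. The sketch below evaluates the original polynomial S, the simplified form in y = x0 + x1, and the library call cos(y) at a few sample points. The sample points and tolerances are arbitrary choices for illustration.

```python
import math


def S(x0, x1):
    """Polynomial from the slide, written out term by term."""
    return (1 - .5 * x0**2 - x0 * x1 - .5 * x1**2
            + .041667 * x0**4 + .166668 * x0**3 * x1
            + .250002 * x0**2 * x1**2 + .166668 * x0 * x1**3
            + .041667 * x1**4)


def simplified(x0, x1):
    """Result of simplifying S with the side relation y = x0 + x1."""
    y = x0 + x1
    return 1. + .041667 * y**4 - .5 * y**2


def mapped(x0, x1):
    """Library mapping: the simplified polynomial is the fourth-order
    Taylor expansion of cos(y)."""
    return math.cos(x0 + x1)


points = [(0.1, 0.2), (-0.3, 0.4), (0.25, 0.25)]
gaps = [(abs(S(a, b) - simplified(a, b)), abs(simplified(a, b) - mapped(a, b)))
        for a, b in points]
```

The first gap in each pair is floating-point noise, because expanding (x0 + x1)^4 reproduces the coefficients .166668 = 4 × .041667 and .250002 = 6 × .041667 exactly. The second gap is the truncation error of the Taylor expansion, which stays small for the small values of y used here.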
Experimental Setup • SA-1100 based embedded system with MP3 from ISO as an application • System & component power measurements, timer for performance • Tool implemented in C with calls to Maple V • Fine grain power measurements by data acquisition board [Figure: board sides 1 and 2 showing the PCMCIA slot and SDRAM slot, the PCMCIA power and processor power measurement points, and the data acquisition board]
Original Code • One frame decodes in 2.6 seconds • Profiler results show three critical functions • Generate polynomials that are as large as possible for the critical functions
MP3 Final Results • Runs a factor of four faster than real-time • Additional energy savings are possible by using frequency and voltage scaling
Summary • Server Resource Management • Scheduling algorithms for multi-client heterogeneous wireless embedded systems • Large energy savings with little or no cost in performance • Embedded HW management of NOCs • Node and network-centric approaches with closed-loop control • Power savings of a factor of 4 with network-centric approach, while performance penalty reduced by more than 10% • Embedded software library mapping methodology • Symbolic algebra method maps to pre-optimized library elements • Significant productivity improvement with large energy savings
Next steps • Server Resource Management • Combine independent and managed clients • Prediction techniques for changes in the channel • Better scheduling • Knowledge of the environment for help with scheduling (e.g. location) • Seamless transition to WAN when needed • Heuristics to help EDF/RM algorithms handle conflicts • Embedded HW management of NOCs • Integrated resource management for cores and the interconnect topology • Reliability as another aspect of RM • Embedded software optimization • Integration of optimizations with compilers • Driver & OS optimization • Hardware-driven software optimization • Given specific set of hardware components, apply appropriate optimizations automatically
Bluetooth • Supports point-to-point and point-to-multipoint (piconet) synchronous and asynchronous connections • Maximum throughput for asynchronous connections is 109-723 kbps • Supported low power modes: Hold, Park, Sniff and Deep sleep (vendor specific)
WaveLAN-802.11b • Designed to work in ad hoc as well as infrastructure mode • System states: Active (Transmit/Receive), Doze (802.11b PM) & Off • IEEE standard power management • Traffic indication map (TIM) every 100 ms • Doze mode activation if no data present