
Power management in Real-time systems


Presentation Transcript


  1. Power management in Real-time systems Collaborators: Daniel Mosse, Bruce Childers. PhD students: Hakan Aydin, Dakai Zhu, Cosmin Rusu, Nevine AbouGhazaleh, Ruibin Xu

  2. Power Management • Why? • Battery operated: laptops, PDAs and cell phones • Heating: complex servers (multiprocessors) • Power aware: maintain QoS, reduce energy • How? • Power off unused parts: LCD, disk for laptops • Gracefully reduce the performance • CPU dynamic power: Pd = Cef · Vdd² · f (see the sketch below) • Cef: switching capacitance • Vdd: supply voltage • f: processor frequency, linearly related to Vdd
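As a rough illustration of the model above (all constants are invented; the slide only gives the formula and the linear Vdd-f relation), dynamic power grows cubically with frequency, so running a fixed workload slower but longer still saves energy quadratically:

    # Sketch of Pd = Cef * Vdd^2 * f with Vdd assumed proportional to f.
    def dynamic_power(f, cef=1.0, k=1.0):
        """Dynamic CPU power at frequency f, taking Vdd = k * f."""
        vdd = k * f
        return cef * vdd**2 * f                          # ~ f^3

    def energy_for_work(cycles, f, cef=1.0, k=1.0):
        """Energy for a fixed number of cycles at frequency f."""
        return dynamic_power(f, cef, k) * (cycles / f)   # power * time ~ f^2

    # Halving the frequency quarters the energy for the same work:
    print(energy_for_work(1e9, 1.0) / energy_for_work(1e9, 0.5))   # -> 4.0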

  3. Power Aware Scheduling • Static Power Management (SPM) • Static slack: uniformly slow down all tasks (see the sketch below) • Gets more interesting for multiprocessors [Figure: tasks T1 and T2 run at fmax and finish before the deadline D, leaving idle time (static slack) and consuming energy E; stretching them over the slack reduces energy, e.g., to 0.6E at intermediate speeds or to E/4 at fmax/2]
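A minimal sketch of SPM under the slide's model (the task set and deadline are invented): with only static slack, all tasks are slowed uniformly so they exactly fill the time to the deadline, and energy then scales with the square of the chosen speed:

    def static_slowdown(wcets, deadline):
        """Uniform speed, as a fraction of fmax, that fills the static slack."""
        return sum(wcets) / deadline

    s = static_slowdown([20, 30], deadline=100)   # -> 0.5, i.e., fmax/2
    print(s**2)                                   # -> 0.25, i.e., E/4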

  4. Dynamic Power Management (DPM) • Dynamic slack: the actual (average-case) execution time is typically a small fraction of the WCET (e.g., 10%) • Utilize slack to slow down future tasks (proportional, greedy, aggressive, …); a sketch follows below [Figure: CPU speed over time, between Smin and Smax; power management points (PMPs) recompute the speed from the remaining time to the deadline and the remaining WCET vs. ACET]
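The following sketch shows one plausible speed computation at a PMP (the clamping bounds and numbers are invented; the proportional/greedy/aggressive policies named in the slide differ in how the remaining work is estimated and how slack is shared):

    def speed_at_pmp(remaining_wcet, time_left, s_min=0.2, s_max=1.0):
        """Lowest speed that still meets the deadline in the worst case."""
        s = remaining_wcet / time_left       # cycles left / time left
        return min(max(s, s_min), s_max)     # clamp to the CPU's speed range

    # Greedy gives all the slack to the next ready task; proportional scales
    # each task's remaining WCET so the slack is shared among future tasks.
    print(speed_at_pmp(remaining_wcet=40, time_left=100))   # -> 0.4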

  5. Stochastic power management [Figure: the deadline D is split into β1·D and (1−β1)·D, illustrating how the slack fraction β1 is allotted]

  6. Computing βi in Reverse Order [Figure: tasks T1–T4; the fractions are computed backwards, starting from β4 = 100%, then β3, β2 and β1]

  7. Dynamic speed adjustment techniques for non-linear code At a PMP: • Remaining WCET is based on the longest path • Remaining average-case execution time is based on the branching probabilities (from trace information); a sketch follows below [Figure: a control-flow graph with branch probabilities p1, p2, p3 and its min / average / max paths from a PMP]
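A small sketch of the two quantities computed at a PMP (the CFG encoding and the numbers are invented; the probabilities would come from trace information, as the slide says):

    def remaining_times(node):
        """Return (wcet, acet) for the subgraph rooted at node.
        node = (cost, [(prob, child), ...]); an empty branch list is a leaf."""
        cost, branches = node
        if not branches:
            return cost, cost
        times = [(remaining_times(child), p) for p, child in branches]
        wcet = cost + max(w for (w, _), _ in times)        # longest path
        acet = cost + sum(p * a for (_, a), p in times)    # expected time
        return wcet, acet

    # 10 cycles, then a branch taken with probability 0.9 / 0.1:
    cfg = (10, [(0.9, (5, [])), (0.1, (50, []))])
    print(remaining_times(cfg))   # -> (60, 19.5)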

  8. Who should manage? PMPs: power management points. PMHs: power management hints. • Compiler (knows the future better): static analysis; places PMHs in the application source code • OS (knows the past better): run-time information; interrupts for executing PMPs [Figure: PMHs placed along the application's execution timeline, with interrupts triggering PMPs]

  9. Maximizing system's utility (as opposed to minimizing energy consumption) • Energy constraints • Time constraints (deadlines or rates) • System utility (reward): increased reward with increased execution • Determine appropriate versions to execute • Determine the most rewarding subset of tasks to execute

  10. Many problem formulations • Continuous frequencies, continuous reward functions • Discrete operating frequencies, no reward for partial execution • Version programming – an alternative to the IRIS (IC) QoS model • Solutions range from optimal algorithms to heuristics (a sketch of the repair heuristic follows below) EXAMPLE: for homogeneous power functions, maximum reward is achieved when power is allocated equally to all tasks. [Flowchart: add a task; if the constraints are violated, repair the schedule and repeat]
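A sketch of the add-and-repair loop from the flowchart, under assumed details (greedy ordering by reward density, a single cost budget, and dropping the lowest-density task as the repair step):

    def add_and_repair(tasks, budget):
        """tasks: list of (reward, cost) pairs; budget: total cost allowed."""
        schedule, used = [], 0.0
        for reward, cost in sorted(tasks, key=lambda t: t[0] / t[1],
                                   reverse=True):
            schedule.append((reward, cost))          # add a task
            used += cost
            while used > budget:                     # constraints violated?
                worst = min(schedule, key=lambda t: t[0] / t[1])
                schedule.remove(worst)               # repair the schedule
                used -= worst[1]
        return schedule

    print(add_and_repair([(10, 5), (6, 2), (3, 3)], budget=7))
    # -> [(6, 2), (10, 5)]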

  11. Rechargeable systems (additional constraints on energy and power) • Solar panel (needs light) recharges a battery • Tasks are continuously executed • Keep the battery level above a threshold at all times • Frame-based system • Three dynamic policies (greedy, speculative and proportional) [Figure: over time, the available power is split between consumption and storage (recharge) and merged back when needed, keeping the system schedulable]

  12. Multiprocessing systems

  13. Scheduling Policy • Partitioned (distributed): tasks are assigned to processors, and each processor applies power management individually • Global queue (global management): a shared-memory global queue feeds all processors [Figure: four processors P served from a global queue]

  14. Dynamic Power Management • Greedy: any available slack is given to the next ready task • Feasible for single-processor systems • Fails for multi-processor systems [Figure: schedules of tasks a–f on two processors with deadline D; greedily reclaiming the slack delays later tasks and causes a deadline miss that the unmodified schedule avoids]

  15. Streaming Applications • Streaming applications are prevalent • Audio, video, real-time tasks, cognitive applications • Executing on • Servers, embedded systems • Multiprocessors and processor clusters • Chip multiprocessors: TRIPS, RAW, etc. • Constraints: • Interarrival time (T) • End-to-end delay (D) • Two possible strategies: • Master-slave • Pipelining

  16. Master-slave Strategy • Single streaming application • The optimal number, n, of active PEs strikes a balance between static and dynamic power (see the sketch below) • Given n, the speed on each PE is chosen to minimize energy consumption • Multiple streaming applications • Determine the optimal number of active PEs • Given the number of active PEs, • first assign streams to groups of PEs (e.g., balance load using the minimum span algorithm), • then adjust the speed on each PE to minimize energy [Figure: a master dispatches a stream with interarrival time T to slave PEs, subject to end-to-end delay D]
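To see the static/dynamic balance, here is an invented cost model (not the authors' formulation): with total work rate W spread over n PEs, each PE runs at f = W/n, so total dynamic power falls as W³/n² while static power grows as n·Ps:

    def total_power(n, work_rate, p_static):
        f = work_rate / n                # per-PE speed with balanced load
        return n * (p_static + f**3)     # static plus cubic dynamic power

    def best_pe_count(work_rate, p_static, n_max=64):
        return min(range(1, n_max + 1),
                   key=lambda n: total_power(n, work_rate, p_static))

    # Adding PEs helps only while the dynamic savings beat the static cost:
    print(best_pe_count(work_rate=4.0, p_static=0.5))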

  17. Pipeline Strategy (1) Linear pipeline (# of stages = # of PEs) [Figure: pipeline stages mapped one-to-one onto a chain of PEs]

  18. Pipeline Strategy (2) Linear pipeline (# of stages > # of PEs) • Solution 1 (optimal): discretize the time and use dynamic programming (see the sketch below) • Solution 2: use heuristics (3) Nonlinear pipeline • # of stages = # of PEs: • Formulate an optimization problem with multiple sets of constraints, each corresponding to a linear pipeline • Problem: the number of constraints can be exponential • Solution: add variables denoting the finishing time of each stage • # of stages > # of PEs: • Apply partitioning/mapping first and then do power management
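A sketch of the dynamic-programming idea for # of stages > # of PEs, under an assumed simplification: stages are cut into contiguous groups (one per PE), each PE must finish its group's work within a fixed per-stage time budget, and energy follows the f² × work model used earlier:

    import functools

    def map_stages(work, n_pe, budget):
        """Minimum energy for mapping a chain of stage workloads onto n_pe PEs."""
        energy = lambda w: (w / budget)**2 * w   # run a group at f = w / budget

        @functools.lru_cache(None)
        def dp(i, k):                            # first i stages on k PEs
            if i == 0:
                return 0.0
            if k == 0:
                return float('inf')              # stages left but no PEs
            return min(dp(j, k - 1) + energy(sum(work[j:i]))
                       for j in range(i))        # last PE gets stages j..i-1

        return dp(len(work), n_pe)

    print(map_stages((2.0, 1.0, 3.0, 2.0), n_pe=2, budget=4.0))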

  19. Scheduling into a 2-D processor array (CMP) • Step 1: topological-sort-based morphing • Step 2: a dynamic programming approach to find the optimal # of stages and the optimal # of processors for each stage [Figure: a task graph with nodes A–J arranged in five levels is morphed into three pipeline stages]

  20. Tradeoff: Energy & Dependability

  21. Time slack (unused processor capacity) can be used: • to reduce speed → power management (raising the question of the effect of DVS on reliability) • for redundancy → fault tolerance, via space redundancy or time redundancy • to do more work → increased productivity

  22. Exploring time redundancy The slack is used to: 1) add checkpoints, 2) reserve recovery time, 3) reduce processing speed. For a given slack and checkpoint overhead, we can find the number and placement of checkpoints that minimize energy consumption while guaranteeing recovery and timeliness (see the sketch below). [Figure: energy consumption as a function of the number of checkpoints]
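An illustrative search for the number of checkpoints (the model is invented: n checkpoints of overhead r, a reserved re-execution slot of one segment W/n, the remaining time used to slow down, and energy ~ f² × work):

    def energy(n, work, deadline, r):
        reserve = work / n                   # worst-case redo of one segment
        avail = deadline - n * r - reserve   # time left for normal execution
        if avail <= 0:
            return float('inf')              # too many checkpoints: infeasible
        f = work / avail                     # required processing speed
        return f**2 * (work + n * r)         # energy for work plus overhead

    W, D, r = 10.0, 16.0, 0.2
    best = min(range(1, 30), key=lambda n: energy(n, W, D, r))
    print(best)   # an interior minimum, matching the slide's energy curve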

  23. TMR vs. Duplex • r: overhead of a checkpoint • p: ratio of static to dynamic power • Identified energy-efficient operating regions for TMR/Duplex [Figure: in the (p, r) plane, TMR is more energy efficient on one side of a load-dependent boundary and Duplex on the other, with boundary curves shown for load = 0.5, 0.6 and 0.7]

  24. Effect of DVS on SEU rate • Lower voltages → higher fault rate • Lower speed → less slack for recovery • The reliability requirement, the available slack and the fault model together determine the acceptable level of DVS

  25. Near-memory Caching for Improved Energy Consumption

  26. Near-CPU vs. Near-memory caches • Caching to mask memory delays • Where: near the CPU or near the memory? • Which is more power and performance efficient? • Thesis: need to balance the allocation of the two for better delay and energy [Figure: CPU, a near-CPU cache and a near-memory cache in front of main memory]

  27. Near-memory caching: Cached-DRAM (CDRAM) • On-memory SRAM cache • Accessing the fast SRAM cache improves performance • High internal bandwidth → use large block sizes • Improves performance but consumes more energy [Figure: performance vs. on-memory block size; same configuration as in Hsu et al., 2003]

  28. Power-aware CDRAM • Power management in near-memory caches • Use distributed near-memory caches • Choose an adequate cache configuration to reduce miss rate & energy per access • Power management in the DRAM core • Use a moderately sized SRAM cache • Turn the DRAM core to a low-power state • Use immediate shutdown [Figure: energy tradeoff between the near-memory cache (E_cache) and the DRAM (E_DRAM), and the total (E_tot), as a function of cache block size]

  29. Wireless Networks Collaborators: Daniel Mosse. PhD student: Sameh Gobrial

  30. Saving Power • Transmit power is proportional to the square of the distance • The closer the nodes, the less power is needed (see the sketch below) • Power-aware Routing (PARO) identifies new nodes “between” other nodes and re-routes packets to save energy • Nodes decide to reduce/increase their transmit power
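A two-line check of the claim (the distances are invented; power ~ d² per the slide): relaying through an intermediate node costs less transmit power than one long hop, which is exactly the gap PARO exploits:

    def tx_power(distance, k=1.0):
        return k * distance**2                 # transmit power ~ distance^2

    direct = tx_power(10.0)                    # one long hop
    relayed = tx_power(6.0) + tx_power(4.0)    # two shorter hops via a relay
    print(direct, relayed)                     # -> 100.0 52.0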

  31. Asymmetry in Transmit Power • Instead of C sending directly to A, it can go through B • Saves transmit power, but may cause some problems [Figure: nodes A, B and C, with C reaching A through the intermediate node B]

  32. Problems due to one-way links • The collision avoidance (RTS/CTS) scheme is impaired • Even across bidirectional links! • Unreliable transmissions through one-way links • May need multi-hop ACKs at the data link layer • Link outage can be discovered only at downstream nodes [Figure: RTS/CTS/MSG exchanges among nodes A, B and C disrupted by a one-way link]

  33. Problems for Routing Protocols • Route discovery mechanism • Cannot reply using the inverse path of the route request • Need to identify unidirectional links (AODV) • Route maintenance • Needs an explicit neighbor discovery mechanism • Connectivity of the network • Gets worse (partitions!) if only bidirectional links are used

  34. Wireless bandwidth and power savings • In addition to transmit power, what else can we do to save energy? • Power has a direct relation with the signal-to-noise ratio (SNR) • The higher the power, the higher the SNR, the fewer the errors, and the more data a node can transmit • Increasing the power allows for higher bandwidth • Turn transceivers off when not used – this creates problems when a node needs to relay messages for other nodes

  35. Using Optical Interconnections in Supercomputers Collaborators: Alex Jones. PhD students: Ding Zhu, Dan Li (now doing AI), Shuyi Shao

  36. Motivation for using Optical Circuit Switching (OCS) in Supercomputers • Many HPCS applications have only a small degree (6-10) of high-bandwidth communication among processes/threads • The rest of a thread's/process' communication traffic consists of low-bandwidth “exceptions” • Many HPCS applications have persistent communication patterns • Fixed over the program's entire run time, or slowly changing • But there are “bad” applications, or phases within applications, that are chaotic • GUPS…! • Optics is good for high bandwidth but bad for fast switching; electronics is the other way around and is good for processing (collectives) • The two networks need to complement each other

  37. The OCS Network Fabric • Two networks complement each other • Circuit-switched all-optical fat-trees made of 512x512 MEMS-based optical switches • An intelligent network with 1/10 or less of the bandwidth, including collective communication • A storage/IO network [Figure: multiple fat-tree networks connect PERCS D-blocks through the OCS]

  38. Communication Phases • Communication patterns change in phases lasting tens of seconds [Figure: communication pattern of node 48 in AMR and CTH, with phases of roughly 30, 50, 60 and 250 seconds]

  39. UMT2K – Fixed, Irregular Communication Pattern • The maximum communication degree from each node to any other node is about 10 • The pattern is irregular but fixed [Figure: communication matrix showing the percentage of traffic by bandwidth, binned into no communication, 0-20%, 20-40%, 40-60% and >60%]

  40. Handling HPCS applications in OCS • Statically analyzable communication (compile time) → compiled communication • Run-time predictable communication → run-time predictor • Unpredictable communication → use multiple hops through the OCS, or use the intelligent network • The classes span a temporal-locality axis from low (unpredictable) to high (statically analyzable) NOTE: no changes to the applications' code; the OCS is set up by the compiler, run-time auto-prediction, and multi-hop routing.

  41. Paradigm of compiled communication [Figure: the compiler analyzes an MPI application (aided by traces from instrumented MPI trace code run on HPC systems), extracts communication patterns, and produces optimized MPI code enhanced with network configuration instructions; at run time, a predictor supplies network configurations, and the HPC system (or a simulator) reports performance statistics]

  42. Compilation Framework Compiler: • Recognize and represent communication patterns • Communication compiling • Enhance applications with network configuration instructions • Automate trace generation • Targets MPI applications

  43. Communication Pattern • Communication classification: • Static • Persistent • Dynamic • Executions of parallel applications show phases: communication phases [Figure: an execution timeline divided into successive phases]

  44. The Communication Predictor • Initially, set up the OCS for random traffic • Keep track of connections' utilization • A migration policy to create circuits in the OCS • A simple threshold policy • An intelligent pattern predictor • An evacuation policy to remove circuits from the OCS • LRU replacement • A compiler-inserted directive (a sketch of the threshold/LRU policies follows below) [Figure: each processor node's NIC feeds a communication predictor driving the OCS control router, beside a low-bandwidth network; the high-bandwidth optical network connects D-blocks]
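A sketch of the predictor's two policies under assumed details (a byte-count threshold for migration, LRU for evacuation; the class name and numbers are invented):

    from collections import OrderedDict

    class CircuitPredictor:
        def __init__(self, n_circuits, threshold):
            self.circuits = OrderedDict()    # destinations with circuits, in LRU order
            self.traffic = {}                # bytes observed per destination
            self.n, self.thresh = n_circuits, threshold

        def on_send(self, dest, nbytes):
            if dest in self.circuits:
                self.circuits.move_to_end(dest)        # refresh LRU position
                return "optical"
            self.traffic[dest] = self.traffic.get(dest, 0) + nbytes
            if self.traffic[dest] >= self.thresh:      # hot: migrate to OCS
                if len(self.circuits) == self.n:
                    self.circuits.popitem(last=False)  # evacuate LRU circuit
                self.circuits[dest] = None
                self.traffic[dest] = 0
            return "electronic"                        # meanwhile, low-BW net

    p = CircuitPredictor(n_circuits=2, threshold=1000)
    print([p.on_send(7, 600), p.on_send(7, 600), p.on_send(7, 10)])
    # -> ['electronic', 'electronic', 'optical']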

  45. Dealing with Unpredictable Communications • Set up the OCS planes so that any D-block can reach any other D-block with at most two hops through the network • Example: route from node 2 to node 4 (the second node in the second group): 1) route from node 2 to node 7, then 2) route from node 7 to node 4 [Figure: nodes 0-8 arranged in groups, with preset circuits providing two-hop reachability]

  46. Scheduling in Buffer Limited Networks Collaborators: Taieb Znati. PhD student: Mahmoud Elhaddad

  47. Packet-switched Network with Fixed-Size Buffers • Packet routers connected via time-slotted, buffer-limited links • Packet duration is one slot • You cannot freely size packet buffers to prevent loss • All-optical packet routers • On-chip (line-driver chip) SRAM buffers • Connections: • Ingress-egress traffic aggregates • Fixed bandwidth demand • Each connection has a fixed path • Loss rate of a connection • Loss rate is the fraction of lost packets • The goal is to guarantee the loss rate • The loss guarantee depends on the path of the connection

  48. Link scheduling algorithms • Packet service discipline: • FCFS, LIFO, Fixed Priority, Nearest To Go… • Drop policy: • Drop tail, drop front, random drop, Furthest To Go… • Must be work conserving: • Drop excess packets only when the buffer overflows • Serve a packet in every slot as long as the buffer is not empty • Must use only local information • No hints or coordination between routers

  49. Link scheduling in buffer-limited networks • Problem: minimize the guaranteed loss rate for every connection • Key question: is there a class of algorithms that leads to better loss bounds as a function of utilization and path length? [Figure: FCFS scheduling with drop tail vs. the proposed Rolling Priority scheduling]

  50. Link scheduling in buffer-limited networks • Findings: • A local fairness property is necessary to minimize the guaranteed loss rate for every path length and utilization constraint • FCFS/RD (Random Drop) is locally fair • Rolling Priority, a locally-fair algorithm, improves the loss guarantees compared to FCFS/RD and is simple to implement • Rolling Priority is optimal • FCFS/RD is near-optimal at light load
