Integrated Management of Power Aware Computing & Communication Technologies Review Meeting Nader Bagherzadeh, Pai H. Chou, Fadi Kurdahi, UC Irvine Jean-Luc Gaudiot, USC, Nazeeh Aranki, Benny Toomarian, JPL DARPA Contract F33615-00-1-1719 June 13, 2001 JPL -- Pasadena, CA
Agenda • Administrative • Review of milestones, schedule • Technical presentation • Progress • Applications (UAV/DAATR, Rover, Deep Impact, distributed sensors) • Scheduling (system-level pipelining) • Advanced microarchitecture power modeling (SMT) • Architecture (mode selection with overhead) • Integration (Copper, JPL, COTS data sheet) • Lessons learned • Challenges, issues • Next accomplishments • Questions & action items review.
Quad Chart
Design flow (diagram): Behavior (behavioral system model, high-level components, composition operators) → high-level simulation → functional partitioning & scheduling → Architecture mapping (parameterizable components, system architecture, busses, protocols) → system integration & synthesis → static configuration → dynamic power management
Innovations • Component-based power-aware design • Exploit off-the-shelf components & protocols • Best price/performance, reliable, cheap to replace • CAD tool for global power policy optimization • Optimal partitioning, scheduling, configuration • Manage entire system, including mechanical & thermal • Power-aware reconfigurable architectures • Reusable platform for many missions • Bus segmentation, voltage / frequency scaling
Schedule: Kickoff 2Q 00, Year 1 by 2Q 01, Year 2 by 2Q 02 • Year 1: static & hybrid optimizations (partitioning / allocation, scheduling, bus segmentation, voltage scaling), COTS component library, FireWire and I2C bus models, static composition authoring, architecture definition, high-level simulation, benchmark identification • Year 2: dynamic optimizations (task migration, processor shutdown, bus segmentation, frequency scaling), parameterizable components library, generalized bus models, dynamic reconfiguration authoring, architecture reconfiguration, low-level simulation, system benchmarking
Impact • Enhanced mission success • More tasks for the same power • Dramatic reduction in mission completion time • Cost saving over a variety of missions • Reusable platform & design techniques • Fast turnaround time by configuration, not redesign • Confidence in complex design points • Provably correct functional/power constraints • Retargetable optimization to eliminate overdesign • Power protocol for massive scale
Program Overview • Power-aware system-level design • Amdahl's law applies to power as well as performance • Enhance mission success (time, task) • Rapid customization for different missions • Design tool • Exploration & evaluation • Optimization & specialization • Technique integration • System architecture • Statically configurable • Dynamically adaptive • Use COTS parts & protocols
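To make the Amdahl's-law point above concrete, here is a minimal sketch with hypothetical numbers: if only one subsystem's power is optimized, the fraction of system power it consumes bounds the overall savings.

```python
# Minimal sketch (hypothetical numbers): Amdahl's-law bound applied to power.
# If a subsystem draws fraction f of total system power and its power is cut
# by a factor s, overall power only drops to (1 - f) + f / s of the original.

def amdahl_power(f, s):
    """Remaining fraction of system power after scaling one subsystem."""
    return (1.0 - f) + f / s

# Example: the CPU is 40% of system power; a 10x lower-power CPU still leaves
# 64% of the original system power, so the rest of the system (radios, motors,
# heaters) must be managed as well.
print(amdahl_power(f=0.4, s=10.0))   # -> 0.64
```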
Personnel & teaming plans • UC Irvine - Design tools • Nader Bagherzadeh - PI • Pai Chou - Co-PI • Fadi Kurdahi • Jinfeng Liu • Dexin Li • Duan Tran • USC - Component power optimization • Jean-Luc Gaudiot - faculty participant • Seong-Won Lee - student • JPL - Applications & benchmarking • Nazeeh Aranki • Nikzad "Benny" Toomarian
Milestones & Schedule • Year 1 • Static & hybrid optimizations • partitioning / allocation • scheduling • bus segmentation • voltage scaling • COTS component library • FireWire and I2C bus models • Static composition authoring • Architecture definition • High-level simulation • Benchmark identification • Year 2 • Dynamic optimizations • task migration • processor shutdown • bus segmentation • frequency scaling • Parameterizable components library • Generalized bus models • Dynamic reconfiguration authoring • Architecture reconfiguration • Low-level simulation • System benchmarking
Review of Progress • May'00 Kickoff meeting (Scottsdale, AZ) • Sept'00 Review meeting (UCI) • Scheduling formulation, UI mockup, system-level configuration • Examples: Pathfinder & X-2000 (manual solution) • Nov'00 PI meeting (Annapolis, MD) • Tools: scheduler + UI v.1 (Java) • Examples: Pathfinder & X-2000 (automated) • Apr'01 PI meeting (San Diego, CA) • Tools: scheduler + UI v.2 - v.3 (Jython) • Examples: Pathfinder & initial UAV (pipelined) • June'01 Review meeting (we are here!)
New for this Review (June '01) • Tools • Scheduler + UI v.4 (pipelined, buffer matching) • Mode selector v.1 (mode change overhead, constraint-based) • SMT model • Examples: • Pathfinder, µAMPS sensors (mode selection) • UAV, Wavelet (dataflow) (pipelined, detailed estimate) • Deep Impact (command driven) (planning) • Integration • Input from COPPER: timing/power estimation (PowerPC simulation model) • Output to COPPER: power profile + budget (COPPER compiler) • Within IMPACCT: initial Scheduler + Mode Selector integration
Overview of Design Flow • Input • Tasks, constraints, component library • Estimation (measurement or simulation via COPPER) • Refinement Loop • Scheduling (pipeline/transform…) • Mode Selection (either before or after scheduling) • System level simulation (planned integration) • Output: to COPPER • Interchange Format: • Power Profile, Schedule, Selected modes • Code Generation • Microarchitecture Simulation
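As an illustration of the interchange step above, the sketch below shows what a power profile / schedule / selected-modes record handed to COPPER might look like. The field names and values are hypothetical, not the actual IMPACCT/COPPER interchange format.

```python
# Hypothetical sketch of the scheduler-to-COPPER interchange record described
# above; field names and values are illustrative only.
interchange = {
    "power_profile": [              # (time in s, total watts) samples
        (0.0, 8.5), (1.0, 12.0), (2.0, 9.5),
    ],
    "schedule": [                   # (task, resource, start, finish)
        ("fft",    "cpu0", 0.0, 1.0),
        ("filter", "cpu1", 1.0, 2.0),
    ],
    "selected_modes": {             # component -> chosen operating mode
        "cpu0":  "400MHz/1.8V",
        "radio": "sleep",
    },
}
```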
Design Flow (diagram): IMPACCT takes a task model with timing/power constraints, task allocation and component selection, a component library, and a mode model; its scheduler, mode selector, and high-level simulator produce a power profile and a C program. COPPER takes the C program through its compiler, power simulator, and low-level simulator to produce an executable, feeding power + timing estimation back to IMPACCT.
Power Aware Scheduling • Execution model • Multiple processors, multiple power consumers • Multiple domains: digital, thermal, mechanical • Constraint driven • Min / Max power • Min / Max timing constraints • Handles problems in different domains • Time Driven • System level pipelining -- in time and in space • Parallelism extraction • Experimental results • Coarse to fine grained parallelism tradeoffs
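A minimal sketch of how such a constraint-driven task model could be represented, assuming simple fixed per-task delay and power figures; the class and field names are illustrative, not the actual IMPACCT data structures.

```python
from dataclasses import dataclass, field

# Minimal sketch of a constraint-driven task model (illustrative names only):
# each task has an execution delay and a power draw in its domain (digital,
# thermal, mechanical); pairwise min/max separations express timing constraints.

@dataclass
class Task:
    name: str
    delay: float                    # execution time at the chosen speed
    power: float                    # watts while running
    domain: str = "digital"         # "digital" | "thermal" | "mechanical"
    preds: list = field(default_factory=list)   # precedence (subsumes dataflow)

@dataclass
class TimingConstraint:
    src: str                        # measured from src's start ...
    dst: str                        # ... to dst's start
    min_sep: float = 0.0            # dst at least this much later
    max_sep: float = float("inf")   # dst at most this much later

P_MAX = 25.0                        # max power: total system budget (W)
P_MIN = 5.0                         # min power: force use of "free" power (W)

heat  = Task("heat_motors", delay=2.0, power=9.0, domain="thermal")
drive = Task("drive", delay=3.0, power=7.0, domain="mechanical",
             preds=["heat_motors"])
```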
Prototype of GUI scheduling tool • Power-aware Gantt chart • Time view • Timing of all tasks on parallel resources • Power consumption of each task • Power view • System-level power profile • Min/max power constraint, energy cost • Interactive scheduling • Automated schedulers – timing, power, loop • Manual intervention – drag & drop • Demo available
Power-Aware Scheduling • New constraint-based application model [paper at Codes'01] • Min/Max timing constraints • Precedence, subsumes dataflow, general timing, shared resource • Dependency across iteration boundaries – loop pipelining • Execution delay of tasks – enables frequency/voltage scaling • Power constraints • Max power – total power budget • Min power – controls power jitter or forces utilization of free power sources • System-level, multi-scenario scheduling [paper at DAC'01] • 25% faster while saving 31% energy cost • Exploits "free" power (solar, nuclear min-output) • System-level loop pipelining [working papers] • Borrow time and power across iteration boundaries • Aggressive design space exploration by new constraint classification • Achieves 49% speedup and 24% energy reduction
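As a small illustration of the min/max power constraints listed above, the sketch below checks a candidate schedule's power profile against a budget and a forced-utilization floor; the entries and numbers are made up.

```python
# Minimal sketch: check a candidate schedule against min/max power constraints.
# A schedule entry is (start, finish, watts); all numbers are illustrative.

def power_profile_ok(schedule, p_min, p_max):
    """True if total power stays within [p_min, p_max] over every interval."""
    events = sorted({t for s, f, _ in schedule for t in (s, f)})
    for t in events[:-1]:                       # sample each interval start
        total = sum(w for s, f, w in schedule if s <= t < f)
        if not (p_min <= total <= p_max):
            return False
    return True

sched = [(0, 2, 10.0), (0, 1, 8.0), (1, 3, 6.0)]
print(power_profile_ok(sched, p_min=5.0, p_max=20.0))   # -> True
```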
Scheduling case study:Mars Pathfinder • System specification • 6 wheel motors • 4 steering motors • System health check • Hazard detection • Power supply • Battery (non-rechargeable) • Solar panel • Power consumption • Digital • Computation, imaging, communication, control • Mechanical • Driving, steering • Thermal • Motors must be heated in low-temperature environment
Scheduling case study:Mars Pathfinder • Input • Time-constrained tasks • Min/Max power constraints • Rationale: control jitter, ensure utilization of free power • Core algorithm • Static analysis of slack properties • Solves time constraints by branch & bound • Solves power constraints by local movements within slacks • Target architecture • X-2000 like configurable space platform • Symmetric multiprocessors, multi-domain power consumers, solar/battery • Results • Ability to track power availability • Finishes tasks faster while incurring less energy cost
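The sketch below illustrates only the "local movements within slacks" idea from the core algorithm: when the peak exceeds the budget, shift the largest consumer later within its slack. It is a greedy toy with made-up numbers, not the actual branch & bound scheduler.

```python
# Greedy toy illustrating "local movement within slack" for power leveling,
# not the actual branch & bound scheduler. A task is a mutable list
# [start, finish, watts, latest_start]; all numbers are illustrative.

def peak_power(schedule):
    events = sorted({t for s, f, _, _ in schedule for t in (s, f)})
    return max(sum(w for s, f, w, _ in schedule if s <= t < f)
               for t in events[:-1])

def level_power(schedule, p_max, step=0.5):
    """Shift tasks later within their slack until the peak fits the budget."""
    while peak_power(schedule) > p_max:
        movable = [t for t in schedule if t[0] + step <= t[3]]
        if not movable:
            return schedule                      # out of slack; give up
        task = max(movable, key=lambda t: t[2])  # move the biggest consumer
        task[0] += step                          # slide start ...
        task[1] += step                          # ... and finish by one step
    return schedule

tasks = [[0.0, 2.0, 12.0, 1.0], [0.0, 1.5, 10.0, 3.0]]   # overlap peaks at 22 W
level_power(tasks, p_max=15.0)
print(peak_power(tasks))                         # -> 12.0, within the 15 W budget
```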
More aggressive scheduling:System-level pipelining • Borrow tasks across iterations • Alleviates "hot spots" by spreading to another iteration • Smooth out utilization by borrowing across iterations • Core techniques • Formulation: separate pseudo dependency from true dependency • Static analysis and task transformation • Augmented scheduler for new dependency • Results -- on Mars Pathfinder example • Additional energy savings with speedup • Smoother power profile
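A tiny sketch of the "borrow across iterations" idea: tasks whose link to the current iteration is only a pseudo dependency can be retimed into a neighboring iteration to smooth the profile. Task names, power figures, and dependency flags here are illustrative, not Pathfinder's actual task set.

```python
# Tiny sketch of borrowing tasks across iteration boundaries (retiming):
# tasks with only a pseudo dependency on the current iteration can be moved
# ahead to smooth the per-iteration power profile.

iteration = [              # (task, watts, true dependency on this iteration?)
    ("heat_motors", 9.0, False),   # pseudo dependency: may be done ahead
    ("drive",       7.0, True),
    ("imaging",     6.0, True),
]

def split_iteration(tasks):
    """Separate tasks that can be borrowed ahead from those that must stay."""
    borrowed = [t for t in tasks if not t[2]]
    kept     = [t for t in tasks if t[2]]
    return borrowed, kept

ahead, now = split_iteration(iteration)
print([t[0] for t in ahead], [t[0] for t in now])
# -> ['heat_motors'] ['drive', 'imaging']
```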
Scheduling case study:UAV DAATR • Example of a very different nature! • Algorithm, rather than "system" example • Target architecture • C code -- unspecified; assume sequential execution, no parallelism • MatLab -- unmapped • Algorithm • Sequential, given in MatLab or C • Potential parallelism in space, not in time • Constraints & dependencies • Dataflow: partial ordering • Timing: latency; no pairwise Min/Max timing • Power: budget for different resolutions
Scheduling case study:UAV example (cont'd) • Challenge: Parallelism Extraction • Essential to enable scheduling • Difficult to automate; need manual code rewrite • Different pipeline stages must be relatively similar in length • Rewritten code • Inserted checkpoints for power estimation • Error-prone buffer mapping between iterations • Found a dozen bugs in benchmark C code • Missing summation in standard deviation calculation • Frame buffer off by one line • Dangling pointers not exposed until pipelined
ATR application: what we are given (dataflow diagram, with bug locations marked): 1 Frame → Target Detection → m Detections → FFT → Filter/IFFT (3 filters) → ComputeDistance
Bug report • Misread input data file • OK, no effect on the algorithm • Miscalculated mean, std for image • OK, these values not used (currently) • Wrong filter data for SUN/PowerPC • OK for us, since we operate on different platforms • Bad for SUN/PowerPC users, wrong results • Misplaced FFT module • The algorithm is wrong • If images are turned upside-down, the results are different • Not sure whether it is correct • However, these problems are not captured in the output image files
What it should look like (corrected dataflow diagram): 1 Frame → Target Detection → m Detections → FFT → Filter/IFFT (3 filters) → ComputeDistance → k distances
What it really should look like (further refined dataflow diagram, same stages): 1 Frame → Target Detection → m Detections → FFT → Filter/IFFT (3 filters) → ComputeDistance → k distances
Problems • Limited parallelism • Serial data flow with tight dependency • Parallelism available (diff. detections, filters, etc) but limited • Limited ability to extract parallelism • Limited by serial execution model (C implementation) • No available parallel platforms • Limited scalability • Cannot guarantee response time for big images (N2 complexity) • Cannot apply optimization for small images (each block is too small) • Limited system-level knowledge • High-level knowledge lost in a particular implementation
Our vision: 2-dimensional partitioning (diagram). Start from a single DFG (vertical flow), cluster by N DFGs (horizontal duplication), then partition with horizontal cuts. Input: N simultaneous frames → Target Detection (N frames, N target detections) → FFT (M targets, M FFTs) → Filter/IFFT (M targets, 3M IFFTs) → ComputeDistance (K distances, 2K IFFTs) → Output: target detection w/ distance for N simultaneous frames
System-level blocks: Input: N simultaneous frames → Target Detection (N frames, N target detections) → FFT (M targets, M FFTs) → Filter/IFFT (M targets, 3M IFFTs) → Compute Distance (K distances, 2K IFFTs) → Output: target detection w/ distance for N simultaneous frames
System-level pipelining (diagram): the same block chain with work divided into pipeline groups (Group 0 through Group 5) rotating through the stages. Input: N simultaneous frames → Target Detection → FFT → Filter/IFFT → Compute Distance → Output: target detection w/ distance for N simultaneous frames
What does it buy us? • Parallelism • All modules run in PARALLEL • Each module processes N (M, K) INDEPENDENT instances, that could all be processed in parallel • NO DATA DEPENDENCY between modules • Throughput • Throughput multiplied by processing units • Process N frames at a reduced response time • Better utilization of resources
What does it buy us? (cont'd) • Flexibility • Insert / remove modules at any time • Adjust N, (M or K) at any time • Make each module parallel / serial at any time • More knobs to tune: parallelism / response time / throughput / power • Driven by run-time constraints • Scalability • Reduced response time on big images (small N and/or deeper pipe) • Better utilization/throughput on small images • More compiler support • Simple control / data flow: each module is just a simple loop, which is essentially parallel • Need an automatic partitioning tool to take horizontal cuts
What does it buy us: how power-aware is it? • Subsystems shut-down • Turn on / off any time based on power budget • Split / merge (migrate) modules on demand • Power-aware scheduling • Each task can be scheduled at any time during one pipe stage, since they are totally independent • More scheduling opportunity with an entire system • Dynamic voltage/frequency scaling • The amount of computation N, (M or K) is known ahead of time • Scaling factor = C / N (very simple!) • Less variance of code behavior => strong guarantee to meet deadline, more accurate power estimates • Run-time code versioning • Select right code based on N, (M or K)
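A minimal sketch of the "scaling factor = C / N" rule above: since the workload per pipe stage is known in advance, the clock can be set just high enough to meet the stage deadline. The frequency steps and cycle counts here are hypothetical.

```python
# Minimal sketch of frequency scaling driven by a known workload N per pipe
# stage ("scaling factor = C / N"). Operating points and cycle counts are
# hypothetical.

FREQ_STEPS_MHZ = [100, 200, 333, 400]      # available operating points

def pick_frequency(n_items, cycles_per_item, stage_deadline_s):
    """Lowest frequency (MHz) that finishes n_items within the stage deadline."""
    required_mhz = n_items * cycles_per_item / (stage_deadline_s * 1e6)
    for f in FREQ_STEPS_MHZ:
        if f >= required_mhz:
            return f
    return FREQ_STEPS_MHZ[-1]              # saturate at the top speed

print(pick_frequency(n_items=8, cycles_per_item=5_000_000, stage_deadline_s=0.25))
# required is about 160 MHz -> picks the 200 MHz step
```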
Experimental implementation:pipelining transformation • Goal • To make everything completely independent • Methodology • Dataflow graph extraction (vertical) • Initial partitioning (currently manual with some aids from COPPER) • Horizontal clustering • Horizontal cut (final partitioning) • Techniques • Buffer assignment: each module gets its own buffer • Buffer renaming: read/write on different buffer • Circular buffer: each module gets a window of fixed buffer size • Our approach: the combination
Buffer rotation (diagram): a circular buffer (B) rotates through pipe stages a, b, c, d over time steps Time = 0 through Time = 5.
Background: acyclic dataflow (stages a → b → c → d). A single circular buffer suits one serial dataflow path where all data flows are of the same type and size; multiple buffers are needed for multiple dataflow paths with different types and sizes.
A more complete picture (diagram): circular buffers A and B rotate at the same speed through pipe stages a, b, c, d over Time = 0 through Time = 5. A head pointer marks incoming data; buffer lifecycle: 1. buffer ready (raw data, e.g. ATR images), 2. buffer live, 3. lifetime spent in pipeline, 4. buffer dead.
How does it work? • Raw data is dumped into the buffer from the data sources • A head pointer keeps incrementing • Buffer is ready, but not live (active in pipeline) yet • Example, ATR image data coming from sensors • Buffer becomes live in pipeline • Raw data are consumed and/or forwarded • New data are produced/consumed • When a buffer is no longer needed by any pipeline stages, it is dead and recycled • Is everything really independent? • Yes! • At each snapshot, each module is operating on different data
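A minimal sketch of the buffer rotation described above: raw frames enter at the head pointer, each pipe stage works on the slot a fixed distance behind the head, and a slot is recycled once the last stage is done with it. The stage names and buffer depth are illustrative.

```python
# Minimal sketch of circular-buffer rotation: raw frames enter at the head
# pointer; each pipe stage works on the slot that is its stage-index behind
# the head, so at any snapshot every stage touches a different slot.

STAGES = ["detect", "fft", "filter_ifft", "distance"]
DEPTH = len(STAGES) + 1                    # one extra slot for incoming data

buf = [None] * DEPTH                       # circular buffer of frame slots
head = 0                                   # next slot to receive raw data

def tick(new_frame):
    """Advance the pipeline by one time step."""
    global head
    buf[head] = {"frame": new_frame}       # slot becomes "ready"
    for i, stage in enumerate(STAGES):     # slot i+1 behind the head is "live"
        slot = (head - (i + 1)) % DEPTH
        if buf[slot] is not None:
            buf[slot][stage] = f"{stage} done"
    done_slot = (head - len(STAGES)) % DEPTH       # oldest slot leaves the pipe
    result, buf[done_slot] = buf[done_slot], None  # recycle the "dead" buffer
    head = (head + 1) % DEPTH
    return result

for t in range(6):
    out = tick(f"frame{t}")
    if out is not None:
        print(out["frame"], "completed")
```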
What are we trading off? (trade-off diagram over pipe stages a, b, c, d) • Speed: computation intensity, parallelism, throughput, power • Time: response time, delay • Workload: amount of computation, energy
3-D design space navigation: valid design points form a 3-D surface over Speed, Time, and Workload (N frames). Example points: N = 2 gives t = T / 2; N = 4 gives t = T / 4.
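A tiny sketch enumerating points on that surface under the t = T / N relation from the slide; T and the candidate group sizes are illustrative.

```python
# Tiny sketch of the (workload, speed, time) surface: processing N frames in
# parallel groups multiplies speed by N and cuts response time to t = T / N.

T = 8.0                              # seconds to process one frame serially

def design_points(group_sizes):
    return [{"workload_N": n, "speed_x": n, "time_s": T / n} for n in group_sizes]

for p in design_points([1, 2, 4, 8]):
    print(p)
# e.g. N = 2 -> t = T / 2 = 4.0 s;  N = 4 -> t = T / 4 = 2.0 s
```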
Design flow: C source code → IMPACCT pipeline code transformation (DFG) → pipelined C source code → COPPER power simulator → 3-D table of Power, Time, Workload (P, T, N) → IMPACCT scheduler and mode selection (task-level and system-level constraints) → power-aware schedule
Demo: power-aware ATR (screenshot): power-aware schedule and run-time power profile; input of N frames; output of N frames; control panel with timing/power constraints and group size N
What it can do • Interactive performance monitor • Run ATR (or any other) algorithms on a PC, a network, or external boards • Monitor power/performance at run-time • Give timing/power budgets on the fly • System-level simulator • Run ATR algorithms on (distributed) component-level simulators (e.g. COPPER) • Coordinate component-level simulators to construct the whole system • Examine power/performance at the system level with verified results on components
What it can do • Dynamic power manager • Apply dynamic power management policies • Power management decisions based on verified results from simulation • Pre-examine different dynamic power management policies without the real execution platform • Our first stage to go dynamic • What's in the current demo? • ATR toolbox • Run ATR on different images • Operate on all image formats, not only the .bin binary format • Performance monitor / simulator • User inputs power/time/group size • Power/time based on COPPER simulation results • Dynamic power manager • Dynamic voltage/frequency scaling based on given timing constraints • Only minimizes power; does not yet treat the power budget as a constraint
How it is implemented: C source code (algorithm, DFG) → IMPACCT pipeline code transformation → pipelined C source code → COPPER power simulator → 3-D table of Power, Time, Workload (P, T, N). ATR GUI: Python/C interfacing, with the C source code wrapped in a Python interface and compiled into a Windows DLL / UNIX shared object; the system-level simulator and scheduler are Python source code running on the Python interpreter with a Python image library and the Tkinter widget library (pieces from IMPACCT, COPPER, and other sources).
What did we learn from this? • Component-level vs. system-level • Component-level • Finer-grained algorithms on specific data (FFT) • Low-level programming (C code) • System-level • Coarse-grained algorithms on data flows (ATR) • Higher-level programming (scripting, GUI) • System-level pipelining and code transformation • More parallelism by eliminating data dependency • Need automated compiler support • System-level simulation on ATR • Can potentially plug in any other simulators, library modules • Can integrate different component-level techniques • Power management at system level with more confidence • Starting point for dynamic power management
Scheduling case study:Wavelet compression (JPL) • Algorithm in C • Wavelet decomposition • Compression: "knob" to choose lossy factor or lossless • Example category • Dataflow, similar to DAATR • Finer grained, better structure • IMPACCT improvements • Transformation to enable pipelining • Exploit lossy factor in trade space
Wavelet Algorithm • Wavelet Decomposition • Quantization • Entropy coding
Wavelet Algorithm structure (flow diagram): Initialization (check params, allocate memory); then for all image blocks: block init. (set params, read image block) → decomp() (lossless FWT) → bit_plane_decomp (set decomp param, 1st-level entropy coding) → output result to file (bit_plane encoding) • Sequential execution of blocks • No data dependency between image blocks (remove overlap)
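Since the image blocks carry no data dependency (once overlap is removed), the per-block chain can be pipelined much like the ATR example; the sketch below shows the idea with placeholder stage functions, not the actual JPL wavelet code.

```python
# Minimal sketch of pipelining the per-block wavelet chain described above:
# with no dependency between blocks, each stage can work on a different block
# in the same time step. Stage functions are placeholders.

def block_init(b):        return f"init({b})"
def decomp(b):            return f"fwt({b})"         # lossless FWT
def bit_plane_decomp(b):  return f"entropy1({b})"    # 1st-level entropy coding
def output_block(b):      return f"encoded({b})"     # bit_plane encoding

STAGES = [block_init, decomp, bit_plane_decomp, output_block]

def pipeline(blocks):
    """Run the 4-stage chain over independent image blocks, one per stage."""
    in_flight = [None] * len(STAGES)
    results = []
    for step in range(len(blocks) + len(STAGES)):
        if in_flight[-1] is not None:              # last stage's output is done
            results.append(in_flight[-1])
        for i in range(len(STAGES) - 1, 0, -1):    # shift data one stage forward
            prev = in_flight[i - 1]
            in_flight[i] = STAGES[i](prev) if prev is not None else None
        in_flight[0] = STAGES[0](blocks[step]) if step < len(blocks) else None
    return results

print(pipeline(["blk0", "blk1", "blk2"]))
```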
Wavelet: experiments • Experiments being conducted • Checkpoints marked up manually • Initial power estimation obtained • Code being manually rewritten / restructured for pipelining • Appears better structured than UAV example • Trade space • High performance to low power • Pipelining in space and in time, similar to UAV example • Lossy compression parameter
Ongoing scheduling case study:Deep Impact • "Planning" level example • Coarse grained, system level • Hardware architecture • COTS PowerPC 750 babybed, emulating a Rad-Hard PPC at 4x => models the X-2000 architecture using DS1 software • COTS PowerPC 603e board, emulating I/O devices in real time • Software architecture • vxWorks, static priority driven, preemptive • JPL's own software architecture -- command based • 1/8-second time steps; 1-second control loops • Task set • 60 tasks to schedule, 255 priority levels
NASA Deep Impact project • Platform • X-2000 configurable architecture • to be using RAD 6000 (Rad-Hard PowerPC 750 @133MHz) • Testbed (JPL Autonomy Lab) • PPC 750 single-board computer -- runs flight software • Prototype @233MHz, real flight @133MHz • COTS board, L1 only, no L2 cache • PowerPC 603e -- emulates the I/O devices • connected via compact PCI • DS1: Deep Space One (legacy flight software) • Software architecture: • 8 Hz ticks, command based • running on top of vxWorks • Perfmon: performance monitoring utility in DS1 • 11 test activities • 60 tasks
Deep Impact example (cont'd) • Available form: real-time traces • Collected using the babybed • 90 seconds of trace, time-stamped tasks, L1 cache • Input needed • Algorithm (not available) • Timing / power constraints (easy) • Functional constraints • Sequence of events • Combinations of illegal modes • Challenges • Modeling two layers of software architecture (RTOS + command)
Design Flow (diagram): IMPACCT takes a task model with timing/power constraints, task allocation and component selection, a component library, and a mode model; its scheduler, mode selector, and high-level simulator produce a power profile and a C program. COPPER takes the C program through its compiler, power simulator, and low-level simulator to produce an executable, feeding power + timing estimation back to IMPACCT.