Morphable Computer Architectures for Highly Energy Aware Systems
PACC Program Review: Nov. 1-3; Annapolis, MD
Peter M. Kogge: CSE Dept., University of Notre Dame; kogge@cse.nd.edu
Kanad Ghose: CS Dept., SUNY-Binghamton; ghose@cs.binghamton.edu
Nikzad “Benny” Toomarian: Center for Integrated Space Microsystems (CISM), Jet Propulsion Lab; benny@cism.jpl.nasa.gov

Outline
• Quad Chart
• “Gear-Shifting” Simplified
• The Morph Program
• The Morph Architecture
• Test Bed & Benchmarks

MORPH: Dynamic Low Energy Architectures
MORPH Adds An “Energy Gear” to Dynamically Configurable Embedded Systems
New Ideas
• Morphable microarchitecture to allow dynamic changes in energy expended per cycle
• Energy efficient morphable memory hierarchies
• ISA extensions to process data more energy efficiently
• Adaptive algorithms to select best configuration
• Energy aware run-time which can reconfigure system
Impact
• Focus on energy, not just power, management
• Develops suite of widely applicable energy-reducing architectural techniques
• Adds extra technology-independent degrees of freedom to dynamic energy control
• Provides an overall inherently more energy efficient embedded computing system
• Designed for transfer to real missions
Schedule
• Timeline marks: 5/00, 11/00, 5/01, 11/01, 5/02
• Tasks, in chart order: Profiles; Baseline; Morphable Node; Data Placement; Adaptive Algorithms; Run-time; Demo & Eval

What is “Gear-Shifting” all about?
• Definitions:
  • IPC = Instructions per Cycle
  • EPC = Energy per Cycle
  • C = Cycles per Second
  • Performance = “Instructions/second” = $IPC \times C$
  • Power = “Energy/second” = $EPC \times C$
  • M = performance required during some mode (instructions/second)
• Real world: performance needs change very dramatically
• Observations on Conventional Designs:
  • Conventional designs fix IPC at some $IPC_{max}$ to meet peak need
  • In such designs $EPC = K \times IPC^{a}$, where “a” can range to almost 4
  • Assume arbitrary clock selection (up to a maximum clock $C_{max}$)
  • Ignore Vdd changes for now
  • Power @ M = $K \times IPC_{max}^{a} \times (M / IPC_{max}) = K \times M \times IPC_{max}^{a-1}$
  • Dependent on clock only through M

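The conventional-design power formula can be checked numerically. The sketch below sets the clock to C = M / IPCmax (just fast enough to meet M) and multiplies by EPC. The values passed in the loop (K = 1, IPCmax = 4, a = 2) are illustrative assumptions, not figures from the slides.

```python
def conventional_power(m, k, ipc_max, a):
    """Power of a fixed-IPC design that meets requirement m by clock scaling alone.

    The clock is C = m / ipc_max, so
    Power = EPC * C = (k * ipc_max**a) * (m / ipc_max) = k * m * ipc_max**(a - 1).
    """
    clock = m / ipc_max          # cycles per second needed to deliver m
    epc = k * ipc_max ** a       # energy per cycle, fixed by the peak-IPC design
    return epc * clock


# Assumed, illustrative numbers: power grows linearly with m, but every
# instruction pays the ipc_max**(a - 1) penalty even at low demand.
for m in (1.0, 2.0, 4.0):
    print(m, conventional_power(m, k=1.0, ipc_max=4.0, a=2.0))
```

Because the design is sized for peak need, lowering M only lowers the clock; the per-cycle energy stays at its maximum.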
Some Simplified Gear Equations
• Assume IPC smoothly changeable from $IPC_{min}$ to $IPC_{max}$
• Let $R = IPC_{max} / IPC_{min}$ = “dynamic ratio” of performance range
• Let g be a gear setting, ranging from 0 to 1, to change IPC
• $IPC(g) = IPC_{min} + (IPC_{max} - IPC_{min})\,g = IPC_{max}\,[1/R + (1 - 1/R)\,g]$
• $EPC(g) = K \times \{IPC_{max}\,[1/R + (1 - 1/R)\,g]\}^{a}$
• $Power(g, C) = K \times \{IPC_{max}\,[1/R + (1 - 1/R)\,g]\}^{a} \times C$
• GEARS with large R: OUR CHALLENGE

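The gear equations translate directly into code. This is a minimal sketch; the function and parameter names are hypothetical, and any numeric values used with it are assumptions for illustration.

```python
def ipc(g, ipc_max, r):
    """IPC(g) = IPCmax * [1/R + (1 - 1/R) * g], for a gear setting g in [0, 1]."""
    if not 0.0 <= g <= 1.0:
        raise ValueError("gear setting g must lie in [0, 1]")
    return ipc_max * (1.0 / r + (1.0 - 1.0 / r) * g)


def epc(g, k, ipc_max, r, a):
    """EPC(g) = K * IPC(g)**a."""
    return k * ipc(g, ipc_max, r) ** a


def power(g, c, k, ipc_max, r, a):
    """Power(g, C) = EPC(g) * C, with C in cycles per second."""
    return epc(g, k, ipc_max, r, a) * c
```

With IPCmax = 4 and R = 4, gear 0 gives IPC = 1 (that is, IPCmin) and gear 1 gives IPC = 4, matching the two ends of the range.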
A Gear-Shifting Strategy
[Figure: two plots against Performance Rqmt, with horizontal axis marks at 0, Imin x Cmax, and Imax x Cmax: clock C (0 to Cmax) and gear G (0 to 1)]
To minimize power as we vary performance requirement M:
• Use most efficient IPCmin as long as possible (until clock at maximum)
  • G = 0
• Then smoothly vary g while using Cmax

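The two-phase strategy above can be sketched as a function that maps a performance requirement M to a gear setting and a clock. The name `select_gear` and the numbers in the usage lines are assumptions for illustration.

```python
def select_gear(m, ipc_min, ipc_max, c_max):
    """Return (g, clock) for requirement m (instructions/second).

    Phase 1: stay in the lowest gear (g = 0) and raise the clock until it
             reaches c_max.
    Phase 2: hold the clock at c_max and raise g just enough to deliver m.
    """
    if m < 0 or m > ipc_max * c_max:
        raise ValueError("requirement is outside the achievable range")
    if m <= ipc_min * c_max:
        return 0.0, m / ipc_min
    needed_ipc = m / c_max
    g = (needed_ipc - ipc_min) / (ipc_max - ipc_min)
    return g, c_max


# Assumed example: IPC range 1..4, maximum clock 100 cycles/second.
print(select_gear(50.0, 1.0, 4.0, 100.0))    # low demand: gear stays at 0
print(select_gear(250.0, 1.0, 4.0, 100.0))   # high demand: clock pinned at c_max
```

The crossover sits at M = IPCmin x Cmax, the point where the clock alone can no longer meet the requirement.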
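The Result
[Figure: Power Savings Factor vs. Performance Rqmt M, where the factor is the ratio of power under optimal gear change to conventional fixed-IPC power; vertical axis marks at 0, $(1/R)^{a-1}$, and 1; horizontal axis marks at 0, IminCmax, and ImaxCmax]
• Potentially huge for large R
• Huge savings if applications spend most of their time in the region marked on the plot
• And we can still use all the other tricks to lower peak power!
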
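The axis marks on this plot follow from the earlier definitions. A short derivation, assuming the gear-shifting strategy described earlier and the same K and a for both designs:

```latex
\[
\frac{P_{\text{gear}}(M)}{P_{\text{conv}}(M)} =
\begin{cases}
\dfrac{K\,IPC_{min}^{a}\,(M / IPC_{min})}{K\,M\,IPC_{max}^{a-1}}
  = \left(\dfrac{1}{R}\right)^{a-1},
  & 0 < M \le IPC_{min}\,C_{max}, \\[2ex]
\dfrac{K\,(M / C_{max})^{a}\,C_{max}}{K\,M\,IPC_{max}^{a-1}}
  = \left(\dfrac{M}{IPC_{max}\,C_{max}}\right)^{a-1},
  & IPC_{min}\,C_{max} \le M \le IPC_{max}\,C_{max}.
\end{cases}
\]
```

The two branches agree at $M = IPC_{min}\,C_{max}$ and the ratio reaches 1 at $M = IPC_{max}\,C_{max}$. As an assumed illustration, R = 8 and a = 3 give a low-demand ratio of $(1/8)^{2} = 1/64$.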
The Morph Program
• Develop a microarchitecture with a large dynamic R
  • “Multi-cluster” superscalar CPU
  • Intelligent placement of data within mixed memory type hierarchy
  • Inherently low energy caches
  • Low energy ISA extensions
• Define & use a realistic embedded benchmark suite
  • Drawn from deep-space processing needs - initially rovers
  • Include other DARPA benchmarks such as from DIS
  • Baseline on variety of systems
• Develop real-time algorithms for reconfiguration
• Demonstrate potential gains via simulation
  • SimpleScalar + energy models
• Technology transfer to potential future JPL missions

The Team
• Overall Goals:
  • Architectures with variable IPC, EPC
  • Tools & S/W to manage morphing
  • Realistic demonstrations
• University of Notre Dame: Peter Kogge, Vincent Freeh, Jay Brockman
  • Morphable multi-cluster architecture
  • “At the sense amps” ISA extension
  • Runtime with hooks for dynamic morphing control
• Energy Aware Data Placement
• SUNY-Binghamton: Kanad Ghose
  • Morphable Caches, RFs
  • Dynamic Bit Slicing
  • Energy Eff VLIW archs
  • Supporting compiler techniques
• Jet Propulsion Laboratory: Nikzad Toomarian, Mohammed Mojarradi, Savio Chau
  • Scenarios & benchmarks
  • Baseline characterizations
  • Runtime adaptation algorithms

Starting A Solution: Multi Cluster Architecture
[Figure: (a) Simple Pipeline; (b) Classical Superscalar with Issue Width (IW); (c) New Multi Cluster with w Clusters, each of issue width IW/w]
• Problem: single large centralized register files with many ports
  • EPC/IPC ~ $(IW)^{k}$, with k as high as 1.9
• Solution: multiple smaller register files with few ports
  • $w \times (IW/w)^{k} \ll (IW)^{k}$

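To see the size of the gap in the inequality above, the sketch below evaluates both sides in relative units. The proportionality constant is taken as 1 and the configuration (IW = 8, w = 4) is an assumed example; only k = 1.9 comes from the slide.

```python
def centralized_cost(issue_width, k):
    """Relative cost of one register file serving the full issue width: IW**k."""
    return issue_width ** k


def clustered_cost(issue_width, clusters, k):
    """Relative cost of w smaller register files: w * (IW / w)**k."""
    if clusters <= 0 or issue_width % clusters != 0:
        raise ValueError("issue width must divide evenly across clusters")
    return clusters * (issue_width / clusters) ** k


# Assumed configuration: 8-wide issue split into 4 clusters, k = 1.9.
single = centralized_cost(8, 1.9)
multi = clustered_cost(8, 4, 1.9)
print(f"centralized: {single:.1f}  clustered: {multi:.1f}  ratio: {multi / single:.2f}")
```

The advantage comes entirely from k > 1: with k = 1 the two costs are equal, and the gap widens as k grows.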
Target Morph Configuration
[Figure: variable multi-cluster microarchitecture with memory types EEPROM, FLASH, DRAM, SRAM]
• Energy-aware data placement
• Alternative ISA features
• Embedded + external memory
• Dynamic issue width
• Dynamic ALU width
• Low energy caches
• Selective substrate bias
• Dynamic data path width

Evaluation Methodology
[Figure: scatter plot of PACC Benchmarks on axes “EPC: Energy per Cycle” and “IPC: Instructions per Cycle”, with labels “Today’s Performance Only Design Point” and “Energy Efficient Family”]

Multi-Cluster vs Conventional Results
[Figure: results for the Conventional design and for multi-cluster configurations 1x8, 2x6, 4x4, 1x6, 2x4, 1x4, 4x2]
• Morph: dynamically change the cluster size & ride the EPC/IPC savings
• Up to 1/2 the energy at same IPC, or 20% better IPC at same energy

On-chip Caches: Addressing Dynamic & Static Leakage
• On-chip caches dissipate 25% to 45% of total energy
  • Likely to increase because of leakage
• Added line buffers (4 to 16) reduce dynamic energy dissipation by 40% to 65+%, with no penalty in access time and with a 4% to 6% area penalty
• Dynamic activation of recently-accessed L2 cache areas reduces the dynamic dissipation component by 40% to 80%
  • Only selected areas of L2 in active mode, rest in standby
  • Size of bit-cell groups controlled is critical
  • Additional L2 area penalty of approx. 8%
  • Heuristics for controlling transitions between active & standby modes

Exploiting Bit-Slice Inactivity in Datapaths
• Expectation: Higher-order data bits likely to be insignificant at least some of the time
• Opportunity: exploit byte slice inactivity over transfer paths, within storage devices (register files, caches) & function units
[Figures: results for SPECfp95 DP; results for integers from SPECfp95; a circuit to provide read-enables in RFs to avoid energy dissipation on access]

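The expectation above can be illustrated in software. The sketch below counts how many low-order bytes a two's complement value actually needs, treating the remaining high-order bytes (pure sign extension) as inactive. That definition of "inactive", the 4-byte width, and the function names are assumptions for illustration; this is not the read-enable circuit from the slide.

```python
def significant_bytes(value, width_bytes=4):
    """Fewest low-order bytes that hold value in two's complement."""
    bits = width_bytes * 8
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given width")
    needed = 1
    while needed < width_bytes:
        limit = 1 << (8 * needed - 1)
        if -limit <= value < limit:
            break
        needed += 1
    return needed


def inactive_fraction(values, width_bytes=4):
    """Fraction of byte slices that carry only sign extension across values."""
    if not values:
        raise ValueError("values must not be empty")
    inactive = sum(width_bytes - significant_bytes(v, width_bytes) for v in values)
    return inactive / (width_bytes * len(values))


# Assumed sample operands, not benchmark data.
print(inactive_fraction([0, 128, 32768, 2**31 - 1]))
```

A byte slice counted as inactive here is one that, in hardware, would not need to be driven or read if its value can be recovered from the sign of the next lower slice.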
Deep Space: The Ultimate Power-Constrained Embedded System
• Limited energy/power sources
  • Renewable variable power: Solar cells
  • Constant power: RPGs
  • Fixed energy: batteries
• Multiple operational modes, all compute/energy constrained
  • Cruise
  • Communication: compression vs transmission
  • Data gathering vs analysis
  • Movement: collision avoidance
• Today:
  • “Pre-canned” power management by serialized operations
• Morph Initial Focus: Rovers

Pathfinder Sojourner
Function | Time and Calculation | Energy Required
motor heating: 1 motor at a time | 7.51W x 1hr | 7.51W-hr
motor heating: 2 motors at a time | 11.26W x 0.5hr | 5.63W-hr
driving (extreme terrain @ -80degC) | 13.85W x 0.5hr | 6.92W-hr
hazard detection | 7.33W x 0.25hr | 1.83W-hr
imaging (3 images @ 2 min/image) | 4.5W x 0.1hr | 0.45W-hr
image compression (compress 3 images @ 6 min/image) | 3.7W x 0.3hr | 1.2W-hr
6Mbit communication @ 50min/sol | 6.27W x 0.8hr | 5.2W-hr
42, 10 sec health checks during day | 6.27W x 0.1hr | 0.63W-hr
remainder of 7 hr daytime CPU operation | 3.7W x 4hr | 15.0W-hr
WEB heating (as needed) | 50W-hr | 50W-hr
Total | | 95W-hr
vs peak 15 W-hr Solar Cells + 150 W-hr non-rechargeable battery
• Effects on application code:
  • Many actions sequential, not simultaneous
  • No dynamic scheduling, no autonomy
  • Not even CPU-clock management
  • Nowhere near enough CPU performance
• Designed to limit worst case power
  • Dump excess power into heaters

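The budget arithmetic can be re-derived from the power and duration figures. The sketch below pairs each function with its calculation and listed energy; that pairing is a reconstruction from the flattened slide text and should be read as an assumption. Three rows differ from power x time by more than 0.01 W-hr, which may reflect rounded durations on the slide.

```python
# (function, power in W, duration in hr, energy listed on the slide in W-hr)
ROWS = [
    ("motor heating: 1 motor at a time", 7.51, 1.0, 7.51),
    ("motor heating: 2 motors at a time", 11.26, 0.5, 5.63),
    ("driving", 13.85, 0.5, 6.92),
    ("hazard detection", 7.33, 0.25, 1.83),
    ("imaging", 4.5, 0.1, 0.45),
    ("image compression", 3.7, 0.3, 1.2),
    ("6Mbit communication", 6.27, 0.8, 5.2),
    ("health checks", 6.27, 0.1, 0.63),
    ("remainder of daytime CPU operation", 3.7, 4.0, 15.0),
]
WEB_HEATING_WH = 50.0  # listed only as an energy, with no power x time breakdown


def budget_totals(rows, fixed_wh):
    """Return (sum of listed energies, sum of power x time), both in W-hr."""
    listed = sum(row[3] for row in rows) + fixed_wh
    computed = sum(row[1] * row[2] for row in rows) + fixed_wh
    return listed, computed


def mismatches(rows, tolerance_wh=0.01):
    """Names of rows whose listed energy differs from power x time."""
    return [name for name, watts, hours, listed in rows
            if abs(watts * hours - listed) > tolerance_wh]


listed_total, computed_total = budget_totals(ROWS, WEB_HEATING_WH)
print(f"listed: {listed_total:.2f} W-hr  computed: {computed_total:.2f} W-hr")
print("rows to double-check:", mismatches(ROWS))
```

Either total lands just under the 95 W-hr shown on the slide.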
Athena/Mars ’03 Rovers
[Figure: Rover Configuration, with labels Pancam/Mini-TES and Instrument Arm Cluster: Raman Spectrometer, Alpha-Proton-X-Ray Spectrometer (APXS), Mössbauer Spectrometer, Microscopic Imager, Mini-Corer]
• 3 Hrs/day of solar @ 50 W
• 5 amp hr 16V batteries
• More complex communication
• More complex on-board eqpt
• Still statically scheduled

MUSES-CN Asteroid NanoRover
• To run a command:
  • Determine available solar power
  • Minimum required power = device + CPU power
  • If available power < minimum required:
    • If parameter enables re-orienting, re-orient to maximize solar power
    • If still not enough and parameter enables waiting, wait up to parameter limit for solar power
    • If still not enough, abort command
  • Set CPU speed to maximum allowable based on (power available) - (minimum needed for devices)
  • Perform command; during command execution, if power drops significantly (or load shed indication?...):
    • CPU speed is reduced to minimum required
    • Operate motors one-at-a-time
  • Return CPU speed to parameter-specified idle
• Still “sequential” operation
• Solar powered @ 1 watt
  • Including RF telecommunications system for communications to lander or small-body orbiter for relay to Earth
• Clock-adjustable CPU speed

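The pre-execution part of the command procedure can be sketched as a small function. Everything named here (`CommandParams`, `cpu_power_budget`, the callbacks, and counting the wait limit in polls) is hypothetical and chosen for illustration; it is not flight software, and the load shedding during execution is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CommandParams:
    device_power_w: float    # power the command's devices need
    cpu_min_power_w: float   # CPU power at its slowest usable speed
    allow_reorient: bool     # may the rover re-orient to gain solar power?
    allow_wait: bool         # may the rover wait for more solar power?
    max_wait_polls: int      # wait limit, counted in power readings (assumption)


def cpu_power_budget(read_solar_w: Callable[[], float],
                     reorient: Callable[[], None],
                     params: CommandParams) -> Optional[float]:
    """Return the power left for the CPU in watts, or None if the command aborts."""
    minimum = params.device_power_w + params.cpu_min_power_w
    available = read_solar_w()
    if available < minimum and params.allow_reorient:
        reorient()                      # try to maximize solar power
        available = read_solar_w()
    if available < minimum and params.allow_wait:
        for _ in range(params.max_wait_polls):
            available = read_solar_w()
            if available >= minimum:
                break
    if available < minimum:
        return None                     # still not enough: abort command
    # CPU speed would be set as high as this remaining budget allows.
    return available - params.device_power_w
```

A caller would map the returned budget to a clock setting, then restore the parameter-specified idle speed once the command finishes.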
Some Morph Test Beds
[Figure labels: Oscilloscope, Logic Analyzer, PowerPC 750, NT Box, Ethernet]
• Different PowerPC configurations
  • Microarchitecture
  • Clock rates
  • ISA extensions
• Run rover/PACC application code
• Measure time/power
• Use as input to SimpleScalar simulation
• PACC-Blue
  • 400MHz PPC 7400
  • Enhanced superscalar + Altivec
  • Linux
• PACC-Gold
  • 400MHz PPC 750
  • Linux
• JPL PPC-SBC
  • 200 MHz 750
  • VxWorks

The NASA X2000 Avionics System
[Figure labels: high-rate input (camera); symmetric multiprocessor modules; reconfigurable hardware blocks; communication module (CDMA); high-speed bus (e.g. IEEE 1394); low-speed bus (e.g. I2C); bus power controller; microcontroller-directed subnet (power regulations & control, analog telemetry sensors, safety inhibits, valve & pyro drive); altimeter subnet]
• Design for 10-20X reduction in power, at 10-20X performance increase
• With long-term survivability & technology scaling
• Application-specific adaptive configuration to match run-time power supply constraints

X2000 FD Testbed with Power Awareness
[Figure: testbed block diagram. Component labels: cPCI bus (6U chassis) with Built-In Power Supply and Current Meter; PPC 750 (Synergy); 1394a I/F (Saderta); Dual I2C I/F (JPL); PMC EPP Adapter; PMC; Empty Slot; FPGA Rapid Prototype; Micro Gyro; PCI Bus analyzer; GPIB; Hard Drive; Terminal Server; SUN Ultra 10 Workstation; SUN E3500 Workstation (35 GB HD); Pentium III w/1394a analyzer (Saderta). Legend entries: Ethernet, RS232, COTS, IEEE 1394, I2C, SCSI, IEEE 488, COTS Support Equipment, JPL In-House Product, Outlets for power measurement]

Near Term Activities
• Extract Rover application code
• Run on SBC & Apples for baseline data
• Continue microarchitectural design and simulation
• Continue activities not mentioned here
  • Instruction annotation for energy-aware data access
  • Benchmark analysis for data placement
  • ISA extensions