Asynchronous Architectures for Energy Efficient Computing & Communication (AEC2)
Alain J. Martin
Asynchronous VLSI Group, Department of Computer Science, California Institute of Technology
12 Jun 2002
Program Concepts and Goals
• Concepts
  - Asynchronous approach to energy efficiency
  - High-level synthesis
• Goals
  - Design and fabrication of the world’s most energy-efficient microprocessor/microcontroller
  - Methods, tools, and circuits
  - Energy complexity of computation
Microprocessor -- Results
• MIPS Energy: 33 nJ (async, 0.6 µm) vs. 70 nJ (sync, 0.6 µm)
• MIPS Cycle Time: 6 ns (async, 0.6 µm) vs. 21 ns (sync, 0.6 µm)
Microcontroller -- Estimation
• 8051 Energy per Instr
  - 10.00 nJ (1X): sync, 0.5 µm
  - 1.67 nJ (6X): async, 0.5 µm
  - 0.56 nJ (18X): async, 0.18 µm @ 1.8V
  - 0.14 nJ (72X): async, 0.18 µm @ 0.9V
• 8051 Cycle Time
  - 20 ns (1X): sync, 0.5 µm
  - 10 ns (2X): async, 0.5 µm
  - 5 ns (4X): async, 0.18 µm @ 1.8V
  - 10 ns (2X): async, 0.18 µm @ 0.9V
• Energy Breakdown (figure; labeled blocks): icache, fetch, decode, exec units (adder, shifter, fblock, mem, mult/div), write back, regfile (bypass)
• More than 100X Et² improvement over any other 8051
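The Et² claim can be checked against the 8051 estimates listed on this slide. The following is a minimal sketch, assuming that each energy figure pairs with the cycle time on the same technology row; the dictionary keys and function names are illustrative, not from the slides.

```python
# Energy per instruction (nJ) and cycle time (ns), copied from the slide.
# The row pairing is an assumption made for this sketch.
DESIGNS = {
    "sync-0.5um": (10.00, 20.0),
    "async-0.5um": (1.67, 10.0),
    "async-0.18um@1.8V": (0.56, 5.0),
    "async-0.18um@0.9V": (0.14, 10.0),
}


def et2(energy_nj, cycle_ns):
    """Return E * t^2 in nJ * ns^2."""
    return energy_nj * cycle_ns ** 2


def improvement_over(baseline, name):
    """Ratio of the baseline's Et^2 to the named design's Et^2."""
    return et2(*DESIGNS[baseline]) / et2(*DESIGNS[name])


if __name__ == "__main__":
    for name, (energy, cycle) in DESIGNS.items():
        ratio = improvement_over("sync-0.5um", name)
        print(f"{name:20s} Et2 = {et2(energy, cycle):8.1f}  ({ratio:.0f}X)")
```

With these inputs, both 0.18 µm design points come out at roughly 286X better than the sync 0.5 µm entry, which is consistent with the "more than 100X" statement. The 1.8V and 0.9V points land on nearly the same Et², which fits the usual motivation for Et² as a metric that is, to first order, insensitive to supply-voltage scaling.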
Energy Complexity Theory
• Optimization metric: Et²
• Et²-optimal pipeline is shorter (MiniMIPS was overpipelined)
• Transistor sizing is not minimal: C 2P
• Optimal Energy: E 3E0
• Optimal Delay: tt
• Sequential computation of A & B is optimal when Power(A) = Power(B)
• Most energy is in communication (only 10% in computation)
Consequences for Asynchronous Design Methodology
• Different transistor sizing
• Less communication (e.g., LAX protocol)
• Less pipelining
• Different buffers (tree buffers)
• Simpler ALU
• Different cache design (memory cell bank size)
• Shorter busses (Huffman-tree encoding of busses based on instruction group frequency)
Design Flow – Stages
Sequential CHP → Concurrent CHP → HSE → PRS → PRS for CMOS → Sized PRS → Physical Design
• CHP: Communicating Hardware Processes
  - High-level language (selections, loops, etc.)
  - Decompose a large sequential CHP process into a system of smaller, concurrent, communicating CHP processes
• HSE: Handshaking Expansion
  - Everything in boolean notation
  - 4-phase handshakes (set Data, wait for Ack, reset Data, wait for reset Ack)
  - Reshuffle the non-data-dependent portions of 4-phase communication to improve speed & size
• PRS: Production Rule Set
  - No explicit sequencing: concurrent set of rules
  - Each rule is an abstraction for pull-up (PUP) & pull-down (PDN) networks
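The 4-phase handshake named on this slide (set Data, wait for Ack, reset Data, wait for reset Ack) can be illustrated with a small simulation. This is a sketch only: it assumes a one-bit dual-rail data encoding (rails `d0`/`d1`, both low meaning "no data"), uses Python generators as stand-ins for two concurrent processes, and all wire and function names are hypothetical. It is not CHP or HSE notation.

```python
def sender(wires, bits, trace):
    """Send each bit with one full 4-phase handshake."""
    for b in bits:
        wires["d1" if b else "d0"] = 1           # set Data
        trace.append(f"S: set data={b}")
        while not wires["ack"]:                  # wait for Ack
            yield
        wires["d0"] = wires["d1"] = 0            # reset Data
        trace.append("S: reset data")
        while wires["ack"]:                      # wait for reset Ack
            yield


def receiver(wires, received, trace):
    """Latch each valid value, acknowledge it, then wait for neutral."""
    while True:
        while not (wires["d0"] or wires["d1"]):  # wait for valid data
            yield
        received.append(wires["d1"])
        wires["ack"] = 1
        trace.append("R: ack up")
        while wires["d0"] or wires["d1"]:        # wait for data to go neutral
            yield
        wires["ack"] = 0
        trace.append("R: ack down")


def run(bits):
    """Step the two processes alternately until the sender finishes."""
    wires = {"d0": 0, "d1": 0, "ack": 0}
    received, trace = [], []
    s = sender(wires, bits, trace)
    r = receiver(wires, received, trace)
    while True:
        try:
            next(s)
        except StopIteration:
            break
        next(r)
    return received, trace


if __name__ == "__main__":
    values, events = run([1, 0, 1])
    print(values)
    print(events[:4])
```

Each transferred bit produces four events, one per phase. The waits depend only on wire values, never on timing, so any interleaving of the two processes that eventually runs both yields the same received sequence.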
New Design Tools
[Figure: tool-flow diagram. Design stages: Sequential Program; Concurrent system of small processes; Low-energy system that is slack-matched; PRS; Transistor Netlist; Physical Layout. Tools: m3-3 (High-Level Simulator), esim (Energy Simulator), DDD, PL2, klay, ROMantic, edgar. Reported metrics: Energy, Throughput, Et².]
m3-3
• Programming language, built on Modula-3
• Hence includes compiler, runtime, and debugging
• Very expressive: any Modula-3 subroutine allowed
• Allows simulation and performance analysis of an asynchronous system
• Does not require the system to be already expressed in CMOS circuits
m3-3 Performance Analysis
• Energy analysis
  - Channel usage statistics
  - Measures total energy in number of bits sent
• Delay analysis
  - Forward-Backward-Internal (FBI) model
  - Allows identification of token-limited, bubble-limited, and throughput-limited critical paths
  - Each communication is marked with a timestamp and a “reason”, which is some subset of {F,B,I}
  - Measures total latency in logic transitions
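The energy side of this analysis (channel usage statistics, with total energy counted in bits sent) can be sketched as simple bookkeeping. The class below is a hypothetical illustration in Python, not the m3-3 API, and the channel names in the usage example are invented. The FBI delay model is not covered, since the slides do not give its details.

```python
from collections import defaultdict


class ChannelStats:
    """Hypothetical per-channel usage counters (not the m3-3 API)."""

    def __init__(self):
        self.bits_sent = defaultdict(int)   # channel name -> total bits sent
        self.messages = defaultdict(int)    # channel name -> number of sends

    def record_send(self, channel, width_bits):
        """Account for one communication of `width_bits` bits on `channel`."""
        self.bits_sent[channel] += width_bits
        self.messages[channel] += 1

    def total_energy_bits(self):
        """Total energy estimate, expressed as the number of bits sent."""
        return sum(self.bits_sent.values())

    def busiest_channel(self):
        """Channel that carried the most bits."""
        return max(self.bits_sent, key=self.bits_sent.get)


if __name__ == "__main__":
    stats = ChannelStats()
    for _ in range(3):
        stats.record_send("fetch->decode", 8)    # invented channel name
    for _ in range(2):
        stats.record_send("decode->exec", 16)    # invented channel name
    print(stats.total_energy_bits(), stats.busiest_channel())
```

Counting bits rather than joules keeps the estimate technology-independent, which matches the slide's point that the system need not yet be expressed in CMOS circuits.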
Accomplishments and Milestones 1
• Et² theory: done (see the book!)
• Circuit family: done
• Redesign of the MIPS: “fetch loop” done, design postponed
• Asynchronous pulse logic and SPAM processor: theory done, prototype postponed
Accomplishments and Milestones 2: Tools
• m3-3 high-level simulator: done
• esim energy simulator: done
• Automatic design decomposition: in progress
• PL2 circuit synthesizer: in progress
• klay layout synthesizer: in progress
Asynchronous 8051 – the Lutonium
The 8051 is the most common microcontroller today
• Overview
• Microcontroller Architecture
• Design Style
• Advantages
• Performance Estimates
• Relation to Tools
• Project Status & Future Work
8051 ISA
• Direct address space, 256 bytes
  - 128 general-purpose registers (RegFile): direct or indirect addressing (0..127)
  - Up to 128 special registers (SFRs): direct addressing only (128..255)
    A, B, PSW, SP, DPL, DPH, IE, IP
    Ports (external I/O and timers)
• Separate program space, up to 64K, read-only
• Separate external address space, 64K
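The address map on this slide can be written as a small decoder. This is a minimal sketch that follows the slide's split (0..127 is the register file, reachable directly or indirectly; 128..255 holds SFRs, reachable directly only). The function names are hypothetical, and the SFR addresses in the table are the standard 8051 assignments rather than values taken from the slides.

```python
# Standard 8051 addresses for the SFRs named on the slide
# (these values come from the 8051 architecture, not from the slides).
SFR_NAMES = {
    0xE0: "A", 0xF0: "B", 0xD0: "PSW", 0x81: "SP",
    0x82: "DPL", 0x83: "DPH", 0xA8: "IE", 0xB8: "IP",
}


def decode(addr, indirect=False):
    """Map an 8-bit address to ("RegFile", addr) or ("SFR", addr)."""
    if not 0 <= addr <= 0xFF:
        raise ValueError("address must fit in 8 bits")
    if addr < 0x80:
        return ("RegFile", addr)        # direct or indirect, 0..127
    if indirect:
        raise ValueError("SFRs are reachable by direct addressing only")
    return ("SFR", addr)                # direct only, 128..255


def sfr_name(addr):
    """Name of a known SFR, or a generic label for other SFR addresses."""
    return SFR_NAMES.get(addr, f"SFR_{addr:02X}")


if __name__ == "__main__":
    print(decode(0x30), decode(0xD0), sfr_name(0xD0))
```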
Complex Instructions
• Read-modify-write
• Rn registers
  - Must read the PSW to compute their actual address
• Indirect addressing (@Ri)
• Some instructions use 16-bit data
  - CALL; RET; INC DPTR; MOVX A,@DPTR
• The average execution time will be very different from the maximum execution time
• Asynchronous performance might far exceed synchronous performance
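The dependence of Rn on the PSW comes from the 8051's register banks: PSW bits 4 and 3 (RS1, RS0) select one of four 8-byte banks at the bottom of the register file. A minimal sketch of the address computation follows; the function name is hypothetical, and the bit positions are the standard 8051 definition rather than something stated on the slide.

```python
RS_MASK = 0x18  # PSW bits 4:3 (RS1:RS0) select the active register bank


def rn_address(psw, n):
    """Register-file address of Rn under the bank selected by `psw`."""
    if not 0 <= n <= 7:
        raise ValueError("Rn index must be 0..7")
    # Bank base is bank_number * 8, which is exactly the masked PSW bits.
    return (psw & RS_MASK) | n


if __name__ == "__main__":
    print(hex(rn_address(0x00, 3)))  # bank 0
    print(hex(rn_address(0x18, 7)))  # bank 3
```

Every Rn access therefore needs the current PSW value before the register file can be addressed, which is the extra step the slide refers to.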
Example: Fetch/IMem Design
• Instructions have variable length (1-3 bytes)
• Always fetches 2 bytes from memory
• Handles MOVC instructions for code reads and code writes
• Only reads interrupt registers when there is the possibility of an interrupt
8051-specific Lutonium Advantages
• Voltage adaptation is easy
• Sleep sequence without race condition
  - Modeled after wait/signal with condition variables
• Instant wake-up from deep sleep
• Pipelined but not speculative
• Enhanced off-chip interface: no static power
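The wait/signal pattern that the sleep sequence is modeled after can be shown in software. The sketch below is only a software analogy using Python's `threading.Condition`; the class and method names are hypothetical, and it says nothing about how the Lutonium implements the sequence in circuits.

```python
import threading


class SleepController:
    """Software analogy of a race-free sleep/wake sequence (hypothetical)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0    # wake-up events not yet consumed
        self.handled = 0     # events consumed so far

    def signal_event(self):
        """Record a wake-up event and wake a sleeper, if there is one."""
        with self._cond:
            self._pending += 1
            self._cond.notify()

    def sleep_until_event(self):
        """Sleep until an event is pending, then consume it."""
        with self._cond:
            # The predicate is checked while holding the lock, so an event
            # that arrives just before sleeping is never lost.
            while self._pending == 0:
                self._cond.wait()
            self._pending -= 1
            self.handled += 1


if __name__ == "__main__":
    ctrl = SleepController()
    ctrl.signal_event()        # event arrives before the sleep request
    ctrl.sleep_until_event()   # returns at once instead of hanging
    print(ctrl.handled)
```

The key property is that checking for a pending event and going to sleep form one atomic step with respect to `signal_event`, which is what rules out the lost-wake-up race.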
Lutonium Performance
• Lutonium-50 (0.5 micron):
  - Est. 100 MIPS, 600 MIPS/W (@3.3V)
  - Philips Sync.: 4.0 MIPS, 100 MIPS/W
  - Philips Async.: 4.0 MIPS, 444 MIPS/W
  - Dallas DS89C420 “ultra high speed”: 50 MIPS, 100 MIPS/W (0.5 micron)
• Lutonium-18 (0.18 micron):
  - Est. 200 MIPS, 1800 MIPS/W (@1.8V)
  - Est. 66 MIPS, 7200 MIPS/W (@0.9V)
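The MIPS/W figures convert directly to energy per instruction, since 1 MIPS/W is one million instructions per joule. The sketch below does the arithmetic; the function names are illustrative, and the implied power figures are derived values that do not appear on the slides.

```python
def nj_per_instruction(mips_per_watt):
    """Energy per instruction in nJ for a given MIPS/W rating.

    1 MIPS/W = 1e6 instructions per joule, so one instruction costs
    1 / (mips_per_watt * 1e6) J, which is 1000 / mips_per_watt nJ.
    """
    return 1000.0 / mips_per_watt


def power_mw(mips, mips_per_watt):
    """Power in mW implied by a throughput and an efficiency rating."""
    return 1000.0 * mips / mips_per_watt


if __name__ == "__main__":
    for label, mips, eff in [
        ("Lutonium-50 @3.3V", 100, 600),
        ("Lutonium-18 @1.8V", 200, 1800),
        ("Lutonium-18 @0.9V", 66, 7200),
    ]:
        print(f"{label}: {nj_per_instruction(eff):.2f} nJ/instr, "
              f"{power_mw(mips, eff):.1f} mW")
```

The results (about 1.67, 0.56, and 0.14 nJ per instruction) agree with the energy-per-instruction estimates quoted for the asynchronous 8051 earlier in the deck, and 100 MIPS/W corresponds to the 10 nJ synchronous baseline.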
Lutonium-18 Prototype
• TSMC SCN018 through MOSIS
  - 0.18 µm CMOS
  - 1.8V nominal
  - |Vt| = 0.4V to 0.5V
• Expected area: 5 mm² (including 8kB SRAM)
• Performance from low-level simulation (conservative!)
  - High Vt process (0.5V)
  - We could do better with a low Vt process
Lutonium – Project Status
• Entirely designed at component level
  - 23K lines of m3-3
  - Timing simulation
  - Energy simulation
• “Fetch-loop” designed at the transistor level
Lutonium – Future Work
• Production-rule generation for execution units, register file, and busses
• Power-saving mechanisms (supply-voltage adaptation, threshold-voltage control)
• Layout