
Asynchronous Architectures for Energy Efficient Computing & Communication (AEC2)

Alain J. Martin, Asynchronous VLSI Group, Department of Computer Science, California Institute of Technology. 12 Jun 2002.


Presentation Transcript


  1. Asynchronous Architectures for Energy Efficient Computing & Communication (AEC2) • Alain J. Martin • Asynchronous VLSI Group, Department of Computer Science, California Institute of Technology • 12 Jun 2002

  2. Program Concepts and Goals • Concepts • Asynchronous approach to energy efficiency • High level synthesis • Goals • Design and fabrication of the world’s most energy efficient microprocessor/microcontroller • Methods, tools, and circuits • Energy complexity of computation

  3. Microprocessor -- Results • MIPS energy: 33 nJ (async, 0.6 µm) vs. 70 nJ (sync, 0.6 µm) • MIPS cycle time: 6 ns (async, 0.6 µm) vs. 21 ns (sync, 0.6 µm) • Microcontroller -- Estimation • 8051 energy per instruction: 10.00 nJ (1X, sync 0.5 µm); 1.67 nJ (6X, async 0.5 µm); 0.56 nJ (18X, async 0.18 µm @ 1.8V); 0.14 nJ (72X, async 0.18 µm @ 0.9V) • 8051 cycle time: 20 ns (1X, sync 0.5 µm); 10 ns (2X, async 0.5 µm); 5 ns (4X, async 0.18 µm @ 1.8V); 10 ns (2X, async 0.18 µm @ 0.9V) • More than 100X Et^2 improvement over any other 8051 • Caltech energy breakdown (chart): icache, fetch, decode, exec units (adder, shifter, fblock, mem, mult/div), write back, regfile (bypass)

  4. Energy Complexity Theory • Optimization metric: Et^2 • Et^2-optimal pipeline is shorter (MiniMIPS was overpipelined) • Transistor sizing is not minimal: C  2P • Optimal Energy: E  3E0 • Optimal Delay: tt • Sequential computation of A & B is optimal when Power(A) = Power(B) • Most energy is in communication (only 10% in computation)
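
As a rough illustration of the Et^2 metric, the sketch below applies it to the 8051 energy and cycle-time figures quoted on slide 3. The function and dictionary names are illustrative assumptions, not part of the presentation or its tools.

```python
def et2(energy_nj: float, cycle_ns: float) -> float:
    """Return E * t^2 in nJ * ns^2 (lower is better)."""
    return energy_nj * cycle_ns ** 2


# (energy per instruction in nJ, cycle time in ns), as quoted on slide 3
DESIGNS_8051 = {
    "sync 0.5um": (10.00, 20.0),
    "async 0.5um": (1.67, 10.0),
    "async 0.18um @1.8V": (0.56, 5.0),
    "async 0.18um @0.9V": (0.14, 10.0),
}


def improvement_over(baseline: str, designs: dict) -> dict:
    """Ratio of the baseline's Et^2 to each design's Et^2."""
    base = et2(*designs[baseline])
    return {name: base / et2(*vals) for name, vals in designs.items()}


if __name__ == "__main__":
    for name, ratio in improvement_over("sync 0.5um", DESIGNS_8051).items():
        print(f"{name}: {ratio:.0f}x better Et^2 than the baseline")
```

With the slide-3 numbers, both 0.18 µm operating points give the same product (0.56 × 5^2 = 0.14 × 10^2 = 14 nJ·ns^2), so in this data lowering the supply voltage trades speed for energy without changing Et^2.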

  5. Consequences for Asynchronous Design Methodology • Different transistor sizing • Less communication (Ex: LAX protocol) • Less pipelining • Different buffers (tree buffers) • Simpler ALU • Different cache design (memory cell bank size) • Shorter busses (Huffman-tree encoding of busses based on instruction group frequency)
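
The Huffman-tree idea in the last bullet can be sketched as follows: units that are used more often sit closer to the root of the bus tree, so the average path is shorter than in a balanced tree. The unit names and frequencies below are hypothetical placeholders, not figures from the presentation, and the code is a generic Huffman construction rather than the group's actual bus design.

```python
import heapq
import itertools


def huffman_depths(freqs: dict) -> dict:
    """Return the tree depth of each item in a Huffman tree built from freqs."""
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {name: 0}) for name, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {name: depth + 1 for name, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def average_depth(freqs: dict, depths: dict) -> float:
    """Frequency-weighted average depth (a proxy for average bus length)."""
    total = sum(freqs.values())
    return sum(freqs[name] * depths[name] for name in freqs) / total


# Hypothetical instruction-group frequencies, in percent.
EXAMPLE_FREQS = {"alu": 50, "mem": 25, "branch": 15, "muldiv": 10}

if __name__ == "__main__":
    depths = huffman_depths(EXAMPLE_FREQS)
    print(depths, average_depth(EXAMPLE_FREQS, depths))
```

For these made-up frequencies the weighted average depth is 1.75 levels, against 2 levels for a balanced four-leaf tree.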

  6. Design Flow – Stages • Stages: Sequential CHP, Concurrent CHP, HSE, PRS, PRS for CMOS, Sized PRS, Physical Design • CHP: Communicating Hardware Processes • High-level language (selections, loops, etc.) • Decompose a large sequential CHP process into a system of smaller, concurrent, communicating CHP processes • HSE: Handshaking Expansion • Everything in boolean notation • 4-phase handshakes (set Data, wait for Ack, reset Data, wait for reset Ack) • Reshuffle the non-data-dependent portions of 4-phase communication to improve speed and size • PRS: Production Rule Set • No explicit sequencing: concurrent set of rules • Each rule is an abstraction for PUP (pull-up) and PDN (pull-down) networks
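
The 4-phase handshake named on this slide (set Data, wait for Ack, reset Data, wait for reset Ack) can be sketched in software with two cooperating generators. This is only a behavioural illustration under simplifying assumptions: the `wires` dictionary, the round-robin scheduler, and the use of `None` for "data reset" are all hypothetical, and no reshuffling is modelled.

```python
def sender(wires, values):
    """Sender side of a 4-phase handshake."""
    for v in values:
        wires["data"] = v          # set Data
        while not wires["ack"]:    # wait for Ack
            yield
        wires["data"] = None       # reset Data
        while wires["ack"]:        # wait for reset Ack
            yield


def receiver(wires, received):
    """Receiver side: latch the value, raise Ack, then lower it again."""
    while True:
        while wires["data"] is None:      # wait for Data
            yield
        received.append(wires["data"])
        wires["ack"] = True               # set Ack
        while wires["data"] is not None:  # wait for Data to be reset
            yield
        wires["ack"] = False              # reset Ack


def transfer(values):
    """Alternate the two processes until the sender has sent everything."""
    wires = {"data": None, "ack": False}
    received = []
    s, r = sender(wires, values), receiver(wires, received)
    while True:
        try:
            next(s)
        except StopIteration:
            return received, wires
        next(r)


if __name__ == "__main__":
    print(transfer([3, 1, 4]))
```

Each value costs two full round trips on the wires, which is why the slide mentions reshuffling the non-data-dependent half of the handshake.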

  7. New Design Tools (diagram) • Design representations: Sequential Program; Concurrent system of small processes; Low-energy system that is slack-matched; PRS; Transistor Netlist; Physical Layout • Tools: m3-3 (High-Level Simulator), esim (Energy Simulator), DDD, ROMantic, edgar, PL2, klay • Reported metrics: Energy, Throughput, Et^2

  8. m3-3 • Programming language, built on Modula-3 • Hence includes compiler, runtime, and debugging • Very expressive: any Modula-3 subroutine allowed • Allows simulation and performance analysis of an asynchronous system • Does not require the system to be already expressed in CMOS circuits

  9. m3-3 Performance Analysis • Energy analysis • Channel usage statistics • Measures total energy in number of bits sent • Delay analysis • Forward-Backward-Internal (FBI) model • Allows identification of token-limited, bubble-limited, and throughput-limited critical paths • Each communication is marked with a timestamp, and a “reason”, which is some subset of {F,B,I} • Measures total latency in logic transitions
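
To make the energy side of this analysis concrete, here is a minimal sketch of per-channel usage counters that total the number of bits sent. It is written in Python for brevity; the class and method names are hypothetical and do not reflect the actual m3-3 interface, and the FBI delay model is not covered.

```python
from collections import Counter


class ChannelStats:
    """Hypothetical channel usage counters (not the m3-3 API)."""

    def __init__(self):
        self.sends = Counter()  # number of communications per channel
        self.bits = Counter()   # number of bits sent per channel

    def record_send(self, channel: str, width_bits: int) -> None:
        """Log one communication of width_bits bits on the named channel."""
        self.sends[channel] += 1
        self.bits[channel] += width_bits

    def total_bits(self) -> int:
        """Total bits sent, used here as the energy estimate."""
        return sum(self.bits.values())

    def busiest(self, n: int = 1):
        """Channels carrying the most bits, most expensive first."""
        return self.bits.most_common(n)


if __name__ == "__main__":
    stats = ChannelStats()
    for _ in range(3):
        stats.record_send("fetch->decode", 16)
    stats.record_send("decode->alu", 8)
    print(stats.total_bits(), stats.busiest())
```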

  10. Accomplishments and Milestones 1 • Et^2 theory: done (see the book!) • Circuit family: done • Redesign of the MIPS: "fetch loop" done, design postponed • Asynchronous pulse logic and SPAM processor: theory done, prototype postponed

  11. Accomplishments and Milestones 2: Tools • m3-3 high-level simulator: done • esim energy simulator: done • Automatic design decomposition: in progress • PL2 circuit synthesizer: in progress • klay layout synthesizer: in progress

  12. Asynchronous 8051 – the Lutonium • The 8051 is the most common microcontroller today • Overview • Microcontroller Architecture • Design Style • Advantages • Performance Estimates • Relation to Tools • Project Status & Future Work

  13. 8051 ISA • Direct address space, 256 bytes • 128 general-purpose registers (RegFile) • Direct or indirect addressing (0..127) • Up to 128 special registers (SFRs) • Direct addressing only (128..255) • A,B,PSW,SP,DPL,DPH,IE,IP • Ports (external I/O and timers) • Separate program space, up to 64K, read-only • Separate external address space, 64K
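
The direct/indirect split described on this slide can be sketched as a small address decoder. The function names are illustrative assumptions; the SFR addresses are the standard 8051 memory-map values, which the slide itself does not list.

```python
# Standard 8051 addresses for the special registers named on the slide.
SFR_ADDRESSES = {
    "A": 0xE0, "B": 0xF0, "PSW": 0xD0, "SP": 0x81,
    "DPL": 0x82, "DPH": 0x83, "IE": 0xA8, "IP": 0xB8,
}


def decode_direct(addr: int):
    """Direct addressing: 0..127 selects the RegFile, 128..255 an SFR."""
    if not 0 <= addr <= 0xFF:
        raise ValueError("direct address must fit in 8 bits")
    return ("RegFile", addr) if addr < 0x80 else ("SFR", addr)


def decode_indirect(addr: int):
    """Indirect addressing reaches only the RegFile (0..127), per the slide."""
    if not 0 <= addr <= 0x7F:
        raise ValueError("indirect addressing is limited to 0..127 here")
    return ("RegFile", addr)


if __name__ == "__main__":
    print(decode_direct(0x30), decode_direct(SFR_ADDRESSES["PSW"]))
```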

  14. Complex Instructions • Read-modify-write • Rn registers • Must read the PSW to compute their actual address • Indirect addressing (@Ri) • Some instructions use 16-bit data • CALL; RET; INC DPTR; MOVX A,@DPTR • The average execution time will be very different from the maximum execution time • Asynchronous performance might far exceed synchronous performance
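
The reason Rn and @Ri accesses are data-dependent can be sketched as follows. In the standard 8051, PSW bits 4 and 3 (RS1, RS0) select one of four register banks of eight bytes, so the address of Rn depends on the current PSW. The helper names and the list-based register file are illustrative assumptions.

```python
def rn_address(psw: int, n: int) -> int:
    """Address of register Rn: bank (PSW bits 4..3) times 8, plus n."""
    if not 0 <= n <= 7:
        raise ValueError("Rn means R0..R7")
    bank = (psw >> 3) & 0b11
    return bank * 8 + n


def read_indirect(regfile, psw: int, i: int) -> int:
    """Read @Ri: first read Ri to get a pointer, then read that location."""
    if i not in (0, 1):
        raise ValueError("only R0 and R1 can be used as @Ri")
    pointer = regfile[rn_address(psw, i)]   # first read: needs the PSW
    if pointer > 0x7F:
        raise ValueError("slide 13 limits indirect addressing to 0..127")
    return regfile[pointer]                 # second, dependent read


if __name__ == "__main__":
    regs = [0] * 128
    regs[rn_address(0x08, 0)] = 0x40  # R0 of bank 1 points at 0x40
    regs[0x40] = 99
    print(read_indirect(regs, psw=0x08, i=0))
```

The chain of dependent reads is one source of the gap between average and maximum execution time noted on the slide.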

  15. Lutonium Design

  16. Example: Fetch/IMem Design • Instructions have variable length (1-3 bytes) • Always fetches 2 bytes from memory • Handles MOVC instructions for code reads and code writes • Only reads interrupt registers when there is the possibility of an interrupt
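
A minimal sketch of the first two bullets: instructions of 1 to 3 bytes are cut out of a stream that is always fetched 2 bytes at a time. The buffering policy below is an assumption made for illustration and is not taken from the Lutonium design; branches, MOVC, and interrupts are ignored. The three opcode lengths are standard 8051 encodings.

```python
# Byte length by opcode for three standard 8051 instructions:
# NOP (0x00), MOV A,#data (0x74), LJMP addr16 (0x02).
INSTR_LENGTH = {0x00: 1, 0x74: 2, 0x02: 3}


def fetch_and_split(code):
    """Split straight-line code into instructions using 2-byte fetches.

    Returns (instructions, number_of_fetches). A trailing incomplete
    instruction is dropped.
    """
    buf, instrs = [], []
    fetch_addr = fetches = 0
    while True:
        # Fetch until the buffer holds one complete instruction.
        while not buf or len(buf) < INSTR_LENGTH[buf[0]]:
            if fetch_addr >= len(code):
                return instrs, fetches
            buf.extend(code[fetch_addr:fetch_addr + 2])  # always 2 bytes
            fetch_addr += 2
            fetches += 1
        n = INSTR_LENGTH[buf[0]]
        instrs.append(tuple(buf[:n]))
        del buf[:n]


if __name__ == "__main__":
    program = [0x00, 0x74, 0x2A, 0x02, 0x12, 0x34]  # NOP; MOV A,#2Ah; LJMP 1234h
    print(fetch_and_split(program))
```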

  17. Fetch/Imem: Decomposition

  18. Fetch/Imem: Ready for Layout

  19. 8051-specific Lutonium Advantages • Voltage adaptation is easy • Sleep sequence without race condition • Modeled after wait/signal with condition variables • Instant wake-up from deep sleep • Pipelined but not speculative • Enhanced off-chip interface: no static power
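
The wait/signal pattern that the sleep sequence is said to be modeled after can be sketched in software with a condition variable. This is a software analogy only, not the Lutonium circuit; the class and method names are hypothetical. The point it illustrates is that an interrupt arriving just before the processor goes to sleep is not lost.

```python
import threading


class SleepController:
    """Software analogy of a race-free sleep/wake-up sequence."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0   # interrupts signalled but not yet handled
        self.handled = 0

    def interrupt(self) -> None:
        """Signal: record the interrupt, then wake a sleeper if there is one."""
        with self._cond:
            self._pending += 1
            self._cond.notify()

    def sleep_until_interrupt(self) -> None:
        """Wait: the check and the sleep happen under the same lock."""
        with self._cond:
            while self._pending == 0:
                self._cond.wait()
            self._pending -= 1
            self.handled += 1


if __name__ == "__main__":
    ctrl = SleepController()
    ctrl.interrupt()               # arrives before the sleep starts
    ctrl.sleep_until_interrupt()   # returns at once; nothing was lost
    print(ctrl.handled)
```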

  20. Lutonium Performance • Lutonium-50 (0.5 micron): • Est. 100 MIPS, 600 MIPS/W (@3.3V) • Philips Sync.: 4.0 MIPS, 100 MIPS/W • Philips Async.: 4.0 MIPS, 444 MIPS/W • Dallas DS89C420 “ultra high speed”: 50 MIPS, 100 MIPS/W (0.5 micron) • Lutonium-18 (0.18 micron): • Est. 200 MIPS, 1800 MIPS/W (@1.8V) • Est. 66 MIPS, 7200 MIPS/W (@0.9V)
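
The MIPS/W figures on this slide can be converted to energy per instruction and to power with simple arithmetic: 1 MIPS/W is 10^6 instructions per joule, so energy per instruction in nJ is 1000 divided by the MIPS/W figure. The function names below are illustrative.

```python
def nj_per_instruction(mips_per_watt: float) -> float:
    """Energy per instruction in nJ for a given MIPS/W rating."""
    return 1e3 / mips_per_watt


def power_watts(mips: float, mips_per_watt: float) -> float:
    """Power drawn when running at the given MIPS rate."""
    return mips / mips_per_watt


if __name__ == "__main__":
    for label, mips, eff in [
        ("Lutonium-50 @3.3V", 100, 600),
        ("Lutonium-18 @1.8V", 200, 1800),
        ("Lutonium-18 @0.9V", 66, 7200),
    ]:
        print(f"{label}: {nj_per_instruction(eff):.2f} nJ/instr, "
              f"{power_watts(mips, eff) * 1e3:.1f} mW")
```

The results (about 1.67, 0.56, and 0.14 nJ per instruction) agree with the energy-per-instruction estimates on slide 3, and the 100 MIPS/W synchronous parts correspond to 10 nJ per instruction.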

  21. Lutonium-18 Prototype • TSMC SCN018 through MOSIS • 0.18 µm CMOS • 1.8V nominal • |Vt| = 0.4V to 0.5V • Expected area: 5 mm^2 (including 8kB SRAM) • Performance from low-level simulation (conservative!) • High Vt process (0.5V) • We could do better with a low Vt process

  22. Lutonium – Project Status • Entirely designed at component level • 23K lines of m3-3 • Timing simulation • Energy simulation • “Fetch-loop” designed at the transistor level

  23. Lutonium – Future Work • Production-rule generation for execution units, register file and busses • Power-saving mechanisms (supply-voltage adaptation, threshold-voltage control) • Layout
