Clockless Logic or How do I make hardware fast, power-efficient, less noisy, and easy-to-design?

Montek Singh, Tue, Jan 14, 2003

Course Information (1)
Course Number: COMP290-084
Time and Place
• Tue/Thu 3:30-4:45pm, Sitterson Hall 325
Instructor
• Montek Singh
• montek@cs.unc.edu (not singh@cs!)
• SN 245, 962-1832
• Office hours: most afternoons/by appointment
Teaching Assistant
• None
Course Web Page
• http://www.cs.unc.edu/~montek

Course Information (2)
Prerequisites:
• undergraduate knowledge of digital logic, algorithms, and discrete math (sets and graphs)
• no knowledge of advanced circuit design or of VLSI is assumed
  • relevant topics will be covered in class as needed
• you are assumed to know the following topics:
  • digital logic: Boolean algebra, logic gates, and latches and registers
  • algorithms: search techniques, enumeration, divide and conquer, and time complexity
  • discrete math: elementary set theory and graph theory

Course Information (3)
Reading Material:
• Papers and technical reports supplied by instructor
Course Content:
• The following topics will be covered:
  • Introduction to clockless logic
  • Graphical representation of asynchronous systems
  • Algorithms for logic synthesis
    • Combinational
    • Sequential
  • Design techniques
    • High-performance
    • Low-power
  • Formal methods (performance analysis and verification)
  • Case studies of real-world asynchronous processors

Course Information (4)
Grading
• 30% homework assignments
• 35% class project
  • your choice of topic: from pure algorithms to VLSI design
• 30% exams
• 5% class participation
Honor Code is in effect
• you are encouraged to discuss ideas/concepts
• work handed in must be your own

Lecture 1: Introduction
• What is asynchronous design?
• Why do we want to study it?
• How is data represented in an asynchronous system?
• How is information exchanged?

Introduction: Clocked Digital Design
[Figure: a global clock signal driving all components]
Most current digital systems are synchronous:
• Clock: a global signal that paces operation of all components
Benefit of clocking: enables discrete-time representation
• all components operate exactly once per clock tick
• component outputs need to be ready by next clock tick
  • allows “glitchy” or incorrect outputs between clock ticks

Microelectronics Trends
Current and Future Trends: Significant Challenges
• Large-Scale “Systems-on-a-Chip” (SoC)
  • 100 million to 1 billion transistors/chip
• Very High Speeds
  • multiple-gigahertz clock rates
• Explosive Growth in Consumer Electronics
  • demand for ever-increasing functionality …
  • … with very low power consumption (limited battery life)
• Higher Portability/Modularity/Reusability
  • “plug ’n play” components, robust interfaces

Challenges to Clocked Design
Breakdown of Single-Clock Paradigm:
• Chip will be partitioned into multiple timing domains
  • challenge: gluing together multiple timing domains
  • glue logic is susceptible to “metastability” (= incorrect values transferred) and latency overheads
Increasing Difficulties with Clocked Design:
• Clock distribution: requires significant designer effort
• Performance bottleneck: a single slow component
• Clock burns large fraction of chip power (~40-70%)
• Fixed clock rate: poor match for
  • designing reusable components
  • interfacing with mixed-timing environments

What is Asynchronous Design?
[Figure: a synchronous system (centralized control, global clock) next to an asynchronous system (distributed control, handshaking interfaces)]
• Digital design with no centralized clock
• Synchronization using local “handshaking”

Why Asynchronous Design? (1)
• Higher Performance
  • May obtain “average-case” operation (not “worst-case”)
    • not limited by slowest component
  • Avoids overheads of multi-GHz clock distribution
• Lower Power
  • No clock power expended
  • Inactive components consume negligible power
• Better Electromagnetic Compatibility
  • Smooth radiation spectra: no clock spikes
  • Much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
• Greater Flexibility/Modularity
  • Naturally adapts to variable-speed environments
  • Supports reusable components

Why Asynchronous Design? (2)
• The world already is mostly asynchronous!
• Events at the level of (or in between) large-scale systems are asynchronous
  • several seconds to several milliseconds
  • e.g., PC-printer communication, keyboard inputs, network communication
• Events at the board level (or between chips) are often asynchronous
  • milliseconds to 100 nanoseconds
  • e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
• Events within a chip, at the level of functional units (e.g., adders, control logic), are currently synchronous
  • several nanoseconds to 100 picoseconds
• Events at the level of a single logic gate are asynchronous
  • 10 picoseconds
• Events at the quantum level are asynchronous
  • picoseconds to femtoseconds
• So, why bother with clocks at all?!
  • make everything asynchronous → greater elegance and robustness

Challenges of Asynchronous Design
[Figure: clean signals vs. hazardous signals between clock ticks]
• Hazards: potential “glitches” on wires
  • no problem for clocked systems, which only sample signals at the clock tick
  • asynchronous communication must be hazard-free!
  • special design challenge = “hazard-free synthesis”
• Testability Issues: absence of clock means no “single-stepping”
• Lack of Commercial CAD Tools: chicken-and-egg problem

Asynchronous Design: Past & Present
Async Design: In existence for 50 years, but …
… many recent technical advances:
• Hazard-Free Circuit Design:
  • several practical techniques for controllers [Stanford/Columbia]
• Design for Testability:
  • several test solutions, e.g. Philips Research
• Maturing Computer-Aided-Design (“CAD”) Tools:
  • software tools for automated design [Philips, Columbia, Manchester]
• Successful Fabricated Chips:
  • embedded processors, high-speed pipelines, consumer electronics…

Recent Commercial Interest
Several commercial asynchronous chips:
• Philips: asynchronous 80c51 microcontrollers
  • used in commercial pagers [1998] and smartcards [2001]
• Univ. of Manchester: async ARM processor [2000]
• Motorola: async divider in PowerPC chip [2000]
• HAL: async floating-point divider
  • in HAL-I and II processors [early 1990s]
Recent experimental chips:
• IBM, Sun and Intel:
  • fast pipelines, arbiters, instruction-length decoder…
• IBM/Columbia/UNC: asynchronous digital FIR filter
Several recent startups:
• Theseus Logic, Fulcrum, Self-Timed Solutions…

A 5-minute Homework Problem
[Figure: Alice and Bob on opposite banks of a river]
Alice and Bob live on opposite sides of a wide river:
Alice is supposed to send a message (say, a “Yes”/“No”) across to Bob around midnight.
Both have flashlights, but neither owns a watch.
What should they do? Suggest several strategies, and discuss pros and cons of each.

Solution 1
[Figure: Alice sends “ready” and “yes/no” to Bob; Bob replies “got it”]
Alice uses 2 lamps:
• 1 to indicate that she is ready with the message, and
• 1 for the message itself
Bob uses 1 lamp:
• to indicate that he has received the message

Solution 2
[Figure: Alice sends “yes” or “no” to Bob; Bob replies “got it”]
Alice uses 2 lamps:
• Green lamp to indicate “yes”
• Red lamp to indicate “no”
Bob uses 1 lamp:
• to indicate that he has received the message

Solution 3
What if Alice and Bob could keep time?
Alice uses 1 lamp for the message:
• At 12 midnight: turns on lamp if message = “yes”
• At 12:01: turns lamp off
Bob needs no lamps!
• Takes down the message between 12 and 12:01
Pros: fewer signals, less processing needed
Cons: Alice and Bob must keep their clocks closely synchronized
• If Bob’s watch is off by a minute, incorrect communication is possible

Data Representation Styles: “Bundled Data”
[Figure: a function block with data inputs bit 1 … bit n and outputs bit 1 … bit m; a matched delay turns “request” into “done”; “done” indicates valid data]
Single-rail “Bundled Datapath”: simplest approach
• widely used
Features:
• datapath: 1 wire per bit (e.g. standard sync blocks)
• matched delay: produces delayed “done” signal
  • worst-case delay: longer than slowest path
• Practical style: can reuse sync components; small area
• Fixed (worst-case) completion time
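The matched-delay rule above can be sketched in a few lines of Python. This is only an illustration: the path delays, the safety margin, and the function names `matched_delay_ps` and `done_is_safe` are assumptions made for the example, not values or names from the lecture.

```python
# Sketch: sizing the matched delay for a bundled-data function block.
# All delays are hypothetical values in picoseconds.

def matched_delay_ps(path_delays_ps, margin_ps):
    """Return a delay that exceeds the slowest datapath by a safety margin."""
    return max(path_delays_ps) + margin_ps

def done_is_safe(path_delays_ps, delay_ps):
    """'done' may only fire after every data output has settled."""
    return delay_ps > max(path_delays_ps)

paths = [120, 340, 275]  # assumed input-to-output path delays
delay = matched_delay_ps(paths, margin_ps=30)
print(delay, done_is_safe(paths, delay))
```

Because the delay is sized for the slowest path, every operation pays the worst-case time, which is the “fixed completion time” noted on the slide.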

Data Representation Styles: Dual-Rail
[Figure: a function block with dual-rail inputs bit 1 … bit n and dual-rail outputs bit 1 … bit m]
Dual-rail: uses 2 wires per data bit
Each Dual-Rail Pair: provides both data value and validity
• provides robust data-dependent completion
• needs completion detectors
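A minimal sketch of how one wire pair can carry both value and validity. The specific codewords are an assumption (a commonly used convention, not stated on the slide): (0, 0) means “no data yet”, (1, 0) means logic 0, (0, 1) means logic 1, and (1, 1) is illegal. All function names are hypothetical.

```python
# Sketch of a dual-rail code. Each bit is a pair (rail_0, rail_1).
# Assumed convention: (0,0) = spacer / no data, (1,0) = 0, (0,1) = 1.

SPACER = (0, 0)

def encode_bit(bit):
    """Raise rail_1 for a logic 1, rail_0 for a logic 0."""
    return (0, 1) if bit else (1, 0)

def decode_bit(pair):
    """Return 0 or 1 for valid data, None while the pair is still a spacer."""
    if pair == (1, 0):
        return 0
    if pair == (0, 1):
        return 1
    if pair == SPACER:
        return None
    raise ValueError("illegal dual-rail codeword (1, 1)")

def encode_word(bits):
    return [encode_bit(b) for b in bits]

def word_is_valid(pairs):
    """A word is complete only when every bit pair carries data."""
    return all(decode_bit(p) is not None for p in pairs)

word = encode_word([1, 0, 1])
print(word, word_is_valid(word))
print(word_is_valid([(0, 1), SPACER]))  # second bit has not arrived yet
```

The receiver needs no separate timing assumption: it simply waits until every pair has left the spacer state.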

Dual-Rail (contd.)
[Figure: the two rails of each bit (bit 0, bit 1, … bit n) feed an OR gate; the OR outputs feed a C-element that produces “Done”]
Dual-Rail Completion Detector:
• combines dual-rail signals
  • OR together 2 rails per bit
  • Merge results using a Muller “C-element”
• indicates when all bits are valid (or reset)
C-element:
• if all inputs = 1, output → 1
• if all inputs = 0, output → 0
• else, maintain output value
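The C-element rules and the completion detector described above can be modeled behaviorally as follows. This is a software sketch, not a circuit: the class name `CElement` and the function `completion_detector` are hypothetical, and each bit is assumed to be a pair of 0/1 rail values.

```python
# Behavioral sketch of a Muller C-element and a dual-rail completion detector.

class CElement:
    """Output follows the inputs only when they all agree; otherwise it holds."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.output = 1
        elif all(v == 0 for v in inputs):
            self.output = 0
        # mixed inputs: keep the previous output value
        return self.output

def completion_detector(c_element, pairs):
    """OR the two rails of each bit, then merge the results with a C-element."""
    per_bit_valid = [r0 | r1 for (r0, r1) in pairs]
    return c_element.update(per_bit_valid)

c = CElement()
steps = [
    [(0, 0), (0, 0)],  # all bits reset
    [(0, 1), (0, 0)],  # only the first bit has arrived
    [(0, 1), (1, 0)],  # all bits valid
    [(0, 0), (1, 0)],  # first bit reset, second still valid
    [(0, 0), (0, 0)],  # all bits reset again
]
done_trace = [completion_detector(c, pairs) for pairs in steps]
print(done_trace)
```

The hold behavior matters: “Done” rises only after the last bit arrives and falls only after the last bit resets, so it never reports a partially valid word.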

Handshaking Styles: 4-phase
[Figure: Request rises (“start event”), Acknowledge rises (“event done”), Request falls (“get ready for next event”), Acknowledge falls (“ready for next event”)]
4-Phase: requires 4 events per handshake
• “Level-sensitive”: simpler logic implementation
• Overhead of “return-to-zero” (RTZ or resetting)
  • extra events which do no useful computation
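As a rough sketch, the four events of one 4-phase handshake can be written out as a trace of wire transitions. The function name `four_phase_handshake` and the trace format are assumptions for illustration; a real implementation would be concurrent hardware, not a sequential loop.

```python
# Sketch of the 4-phase (return-to-zero) handshake as a sequence of events.

def four_phase_handshake(n_items):
    """Return the (wire, level) transitions needed to transfer n_items."""
    req = ack = 0
    trace = []
    for _ in range(n_items):
        req = 1
        trace.append(("req", req))  # sender: start event
        ack = 1
        trace.append(("ack", ack))  # receiver: event done
        req = 0
        trace.append(("req", req))  # sender: get ready for next event
        ack = 0
        trace.append(("ack", ack))  # receiver: ready for next event
    return trace

trace4 = four_phase_handshake(2)
print(len(trace4), trace4[:4])
```

Only the first two transitions of each cycle carry information; the last two are the return-to-zero overhead mentioned on the slide.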

Handshaking Styles: 2-phase
[Figure: a Request transition (“start event”) is answered by an Acknowledge transition (“event done”); the next Request transition (“start next event”) is answered by the next Acknowledge transition (“next event done”)]
2-Phase: requires 2 events per handshake
• Elegant: no return-to-zero
• Slower logic implementation:
  • logic primitives are inherently level-sensitive, not event-based (at least in CMOS)
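For comparison, a sketch of 2-phase signalling, in which every transition on a wire (rising or falling) is an event. The function name `two_phase_handshake` and the trace format are assumptions for illustration.

```python
# Sketch of the 2-phase (transition) handshake: each toggle is one event.

def two_phase_handshake(n_items):
    """Return the (wire, level) transitions needed to transfer n_items."""
    req = ack = 0
    trace = []
    for _ in range(n_items):
        req ^= 1
        trace.append(("req", req))  # sender: start event (any transition)
        ack ^= 1
        trace.append(("ack", ack))  # receiver: event done (any transition)
    return trace

trace2 = two_phase_handshake(2)
print(len(trace2), trace2)
```

Two transfers take four transitions here, and the wires end up back at 0 only because an even number of items was sent. The receiver therefore has to react to changes rather than levels, which is why the logic tends to be more complex.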

Handshaking + Data Representation
[Figure: block A sends dual-rail data bit 1 … bit m to block B; B returns “ack” to A]
Several combinations possible:
• dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
Example: dual-rail 4-phase
• dual-rail data: functions as an implicit “request”
• 4-phase cycle: between acknowledge and implicit request

Other Data Representation Styles
[Figure: one bit carried on a “data” wire and a “phase” wire]
• Level-Encoded Dual-Rail (LEDR)
  • 2 wires per bit: “data” and “phase”
  • exactly one wire per bit changes value
    • if new value is different, “data” wire changes value
    • else “phase” wire changes value
• M-of-N Codes
  • N wires used for a data word
  • M wires (M <= N) change value
  • Values of N and M have an impact on…
    • information transmitted, power consumed and logic complexity
• Knuth codes, Huffman codes, …
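The LEDR rule above (“exactly one wire per bit changes value”) can be sketched for a single bit as follows. The state layout `(data, phase)` and the function name `ledr_send` are assumptions for the example; the update rule itself follows the slide.

```python
# Sketch of Level-Encoded Dual-Rail (LEDR) signalling for one bit.
# State is the pair (data, phase); each new bit flips exactly one wire.

def ledr_send(state, new_bit):
    """Return the wire state after sending new_bit."""
    data, phase = state
    if new_bit != data:
        return (new_bit, phase)  # value changed: toggle the "data" wire
    return (data, phase ^ 1)     # same value again: toggle the "phase" wire

state = (0, 0)
history = [state]
for bit in [1, 1, 0, 0, 0]:
    state = ledr_send(state, bit)
    history.append(state)
print(history)
```

The receiver reads the current value directly from the “data” wire and detects a new arrival by noticing that either wire changed, so no return-to-zero phase is needed.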

Which to use?
Depends on several performance parameters:
• speed
  • single-rail vs. dual-rail
    • single-rail may be faster (if designed aggressively)
    • dual-rail may be faster (if completion times vary widely)
  • 2-phase vs. 4-phase
    • 2-phase may be faster (if logic overhead is small)
    • 4-phase may be faster (if overhead of return-to-zero is small)
• power consumption
  • 2-phase typically has fewer gate transitions (→ lower power)
• amount of logic used (#gates/wires/pins → chip area)
  • single-rail needs fewer gates/wires/pins
• design and verification effort
  • dual-rail, 1-of-N, M-of-N, Knuth codes…: delay-insensitive, i.e. robust in the presence of arbitrary delays
  • single-rail: requires greater timing verification effort

Sutherland’s Micropipelines Seminal Paper

Focus of Sutherland’s Turing Award Lecture: Pipelining
Motivation: Pipelining is at the heart of nearly all high-performance digital systems
Additional Benefits:
• Low power
• Interfacing with mixed systems
• Modular and scalable design

Background: Pipelining
[Figure: a “coarse-grain” pipeline (e.g. simple processor: fetch, decode, execute) and a “fine-grain” pipeline (e.g. pipelined adder), with storage elements (latches/registers) between stages]
What is Pipelining?: Breaking up a complex operation on a stream of data into simpler sequential operations
Throughput = #data items processed/second
+ Throughput: significantly increased
– Latency: somewhat degraded
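The throughput/latency trade-off can be illustrated with a small calculation. This uses a deliberately simplified model, assumed for the example only: stages advance in lockstep, so each step takes as long as the slowest stage plus a fixed latch overhead. The stage delays, the overhead, and the function names are all hypothetical.

```python
# Sketch: throughput and latency with and without pipelining.
# Simplified lockstep model; all delays are assumed values in nanoseconds.

def unpipelined(stage_delays_ns):
    """One item at a time: latency is the sum of all stage delays."""
    latency = sum(stage_delays_ns)
    return latency, 1.0 / latency  # (latency in ns, items per ns)

def pipelined(stage_delays_ns, latch_overhead_ns):
    """A new item enters every cycle; cycle = slowest stage + latch overhead."""
    cycle = max(stage_delays_ns) + latch_overhead_ns
    latency = cycle * len(stage_delays_ns)
    return latency, 1.0 / cycle

stages = [2.0, 3.0, 2.0]  # e.g. fetch, decode, execute
lat_u, thr_u = unpipelined(stages)
lat_p, thr_p = pipelined(stages, latch_overhead_ns=0.5)
print(f"unpipelined: latency {lat_u} ns, throughput {thr_u:.3f} items/ns")
print(f"pipelined:   latency {lat_p} ns, throughput {thr_p:.3f} items/ns")
```

With these numbers, throughput doubles (one item every 3.5 ns instead of every 7 ns) while latency grows from 7 ns to 10.5 ns, matching the “+ throughput, – latency” summary on the slide.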

Focus of Async Community
Our Focus: Extremely fine-grain pipelines
• “gate-level” pipelining = use narrowest possible stages
  • each stage consists of only a single level of logic gates
• some of the fastest existing digital pipelines to date
Application areas:
• multimedia hardware (graphics accelerators, video DSPs, …)
  • naturally pipelined systems, throughput is critical
  • input is often “bursty”
• optical networking
  • serializing/deserializing FIFOs
• genomic string matching?
  • KMP-style string matching: variable skip lengths
