Clockless Logic
or
How do I make hardware fast, power-efficient, less noisy, and easy to design?
Montek Singh
Tue, Jan 14, 2003
Course Information (1)
  Course Number: COMP290-084
  Time and Place
    • Tue/Thu 3:30-4:45pm, Sitterson Hall 325
  Instructor
    • Montek Singh
    • montek@cs.unc.edu (not singh@cs!)
    • SN 245, 962-1832
    • Office hours: most afternoons/by appointment
  Teaching Assistant
    • None
  Course Web Page
    • http://www.cs.unc.edu/~montek
Course Information (2)
  Prerequisites:
    • undergraduate knowledge of: digital logic, algorithms, discrete math (sets and graphs)
    • no knowledge of advanced circuit design or of VLSI is assumed
      • relevant topics will be covered in class as needed
    • you are assumed to know the following topics:
      • digital logic: Boolean algebra, logic gates, and latches and registers
      • algorithms: search techniques, enumeration, divide and conquer, and time complexity
      • discrete math: elementary set theory and graph theory
Course Information (3)
  Reading Material:
    • Papers and technical reports supplied by instructor
  Course Content:
    • The following topics will be covered:
      • Introduction to clockless logic
      • Graphical representation of asynchronous systems
      • Algorithms for logic synthesis
        • Combinational
        • Sequential
      • Design techniques
        • High-performance
        • Low-power
      • Formal methods (performance analysis and verification)
      • Case studies of real-world asynchronous processors
Course Information (4)
  Grading
    • 30% homework assignments
    • 35% class project
      • your choice of topic: from pure algorithms to VLSI design
    • 30% exams
    • 5% class participation
  Honor Code is in effect
    • you are encouraged to discuss ideas/concepts
    • work handed in must be your own
Lecture 1: Introduction
  What is asynchronous design?
  Why do we want to study it?
  How is data represented in an asynchronous system?
  How is information exchanged?
Introduction: Clocked Digital Design
  [Figure label: clock]
  Most current digital systems are synchronous:
    • Clock: a global signal that paces operation of all components
  Benefit of clocking: enables discrete-time representation
    • all components operate exactly once per clock tick
    • component outputs need to be ready by next clock tick
      • allows “glitchy” or incorrect outputs between clock ticks
Microelectronics Trends
  Current and Future Trends: Significant Challenges
    • Large-Scale “Systems-on-a-Chip” (SoC)
      • 100 million to 1 billion transistors/chip
    • Very High Speeds
      • multiple gigahertz clock rates
    • Explosive Growth in Consumer Electronics
      • demand for ever-increasing functionality …
      • … with very low power consumption (limited battery life)
    • Higher Portability/Modularity/Reusability
      • “plug ’n play” components, robust interfaces
Challenges to Clocked Design
  Breakdown of Single-Clock Paradigm:
    • Chip will be partitioned into multiple timing domains
      • challenge: gluing together multiple timing domains
      • glue logic is susceptible to “metastability” (= incorrect values transferred) and latency overheads
  Increasing Difficulties with Clocked Design:
    • Clock distribution: requires significant designer effort
    • Performance bottleneck: a single slow component
    • Clock burns large fraction of chip power (~40-70%)
    • Fixed clock rate: poor match for
      • designing reusable components
      • interfacing with mixed-timing environments
What is Asynchronous Design?
  [Figure: Synchronous System (Centralized Control), driven by a clock, vs. Asynchronous System (Distributed Control), connected by handshaking interfaces]
  • Digital design with no centralized clock
  • Synchronization using local “handshaking”
Why Asynchronous Design? (1)
  • Higher Performance
    • May obtain “average-case” operation (not “worst-case”)
      • not limited by slowest component
    • Avoids overheads of multi-GHz clock distribution
  • Lower Power
    • No clock power expended
    • Inactive components consume negligible power
  • Better Electromagnetic Compatibility
    • Smooth radiation spectra: no clock spikes
    • Much less interference with sensitive receivers [e.g., Philips pagers, smartcards]
  • Greater Flexibility/Modularity
    • Naturally adapts to variable-speed environments
    • Supports reusable components
Why Asynchronous Design? (2)
  • The world already is mostly asynchronous!
    • Events at the level of (or in between) large-scale systems are asynchronous
      • several seconds to several milliseconds
      • e.g., PC-printer communication, keyboard inputs, network comm.
    • Events at the board level (or between chips) are often asynchronous
      • milliseconds to 100 nanoseconds
      • e.g., CPU-memory interface, interface with I/O subsystem (interrupts)
    • Events within a chip, at the level of functional units (e.g., adders, control logic), are currently synchronous
      • several nanoseconds to 100 picoseconds
    • Events at the level of a single logic gate are asynchronous
      • 10 picoseconds
    • Events at the quantum level are asynchronous
      • picoseconds to femtoseconds
  • So, why bother with clocks at all?!
    • make everything asynchronous => greater elegance and robustness
Challenges of Asynchronous Design
  • Hazards: potential “glitches” on wires
    • communication must be hazard-free!
    • special design challenge = “hazard-free synthesis”
    [Figure labels: clock tick; clean signals; hazardous signals; no problem for clocked systems]
  • Testability Issues: absence of clock means no “single-stepping”
  • Lack of Commercial CAD Tools: chicken-and-egg problem
Asynchronous Design: Past & Present
  Async Design: In existence for 50 years, but …
  … many recent technical advances:
    • Hazard-Free Circuit Design:
      • several practical techniques for controllers [Stanford/Columbia]
    • Design for Testability:
      • several test solutions, e.g. Philips Research
    • Maturing Computer-Aided-Design (“CAD”) Tools:
      • software tools for automated design [Philips, Columbia, Manchester]
    • Successful Fabricated Chips:
      • embedded processors, high-speed pipelines, consumer electronics…
Recent Commercial Interest
  Several commercial asynchronous chips:
    • Philips: asynchronous 80c51 microcontrollers
      • used in commercial pagers [1998] and smartcards [2001]
    • Univ. of Manchester: async ARM processor [2000]
    • Motorola: async divider in PowerPC chip [2000]
    • HAL: async floating-point divider
      • in HAL-I and II processors [early 1990s]
  Recent experimental chips:
    • IBM, Sun and Intel:
      • fast pipelines, arbiters, instruction-length decoder…
    • IBM/Columbia/UNC: asynchronous digital FIR filter
  Several recent startups:
    • Theseus Logic, Fulcrum, Self-Timed Solutions…
A 5-minute Homework Problem
  [Figure: Alice and Bob on opposite banks of a river]
  Alice and Bob live on opposite sides of a wide river:
    Alice is supposed to send a message (say, a “Yes”/“No”) across to Bob around midnight.
    Both have flashlights, but neither owns a watch.
  What should they do?
    Suggest several strategies, and discuss pros and cons of each.
Solution 1
  [Figure: Alice sends “ready” and “yes/no” to Bob; Bob sends “got it” back]
  Alice uses 2 lamps:
    • 1 to indicate that she is ready with the message, and
    • 1 for the message itself
  Bob uses 1 lamp:
    • to indicate that he has received the message
Solution 2
  [Figure: Alice sends “yes” or “no” to Bob; Bob sends “got it” back]
  Alice uses 2 lamps:
    • Green lamp to indicate “yes”
    • Red lamp to indicate “no”
  Bob uses 1 lamp:
    • to indicate that he has received the message
Solution 3
  What if Alice and Bob could keep time?
  Alice uses 1 lamp for the message:
    • At 12 midnight: turns on lamp if message = “yes”
    • At 12:01: turns lamp off
  Bob needs no lamps!
    • Takes down the message between 12 and 12:01
  Pros: Fewer signals, less processing needed
  Cons: Alice and Bob must keep their clocks closely synchronized
    • If Bob’s watch is off by a minute, incorrect communication is possible
Data Representation Styles: “Bundled Data”
  [Figure: function block with inputs bit 1 … bit n and outputs bit 1 … bit m; a matched delay turns “request” into “done”; “done” indicates valid data]
  Single-rail “Bundled Datapath”: simplest approach
    • widely used
  Features:
    • datapath: 1 wire per bit (e.g. standard sync blocks)
    • matched delay: produces delayed “done” signal
      • worst-case delay: longer than slowest path
  Pros: Practical style: can reuse sync components; small area
  Cons: Fixed (worst-case) completion time
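To make the matched-delay idea concrete, here is a minimal Python sketch. The path delays and the 10% safety margin are hypothetical values chosen for illustration, and the function names are not from the lecture. It shows why bundled data always completes in the same (worst-case) time, whichever path the data actually took.

```python
def matched_delay_ns(path_delays_ns, margin=0.10):
    """Delay for the "done" line: slowest path plus a safety margin (assumed 10%)."""
    return max(path_delays_ns) * (1.0 + margin)


def completion_time_ns(actual_delay_ns, done_delay_ns):
    """With bundled data the receiver waits for "done", not for the data itself."""
    if actual_delay_ns > done_delay_ns:
        raise ValueError("timing violation: data is slower than the matched delay")
    return done_delay_ns


# Hypothetical path delays through the function block, in nanoseconds.
paths = [1.2, 2.0, 3.5]
done = matched_delay_ns(paths)

for delay in paths:
    # Every item reports the same completion time, even when its data was ready early.
    print(f"data ready after {delay} ns, done signalled after {completion_time_ns(delay, done):.2f} ns")
```

The check inside `completion_time_ns` is the timing assumption that a designer has to verify for this style: the matched delay must never be shorter than the slowest datapath.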
Data Representation Styles: Dual-Rail
  [Figure: function block with dual-rail inputs bit 1 … bit n and dual-rail outputs bit 1 … bit m]
  Dual-rail: uses 2 wires per data bit
  Each Dual-Rail Pair: provides both data value and validity
    • provides robust data-dependent completion
    • needs completion detectors
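A small Python sketch of how one wire pair can carry both value and validity. The slide does not say which rail pattern means what, so the mapping below is an assumption (a common convention): (1, 0) for logic 1, (0, 1) for logic 0, (0, 0) for "no data yet", with (1, 1) treated as illegal.

```python
# Assumed encoding (not specified on the slide):
#   (true_rail, false_rail) = (1, 0) -> logic 1
#                             (0, 1) -> logic 0
#                             (0, 0) -> spacer / no valid data yet
#                             (1, 1) -> illegal
SPACER = (0, 0)


def encode_bit(bit):
    """Encode a single bit as a dual-rail pair."""
    return (1, 0) if bit else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for valid data, or None while the pair is still a spacer."""
    if pair == (1, 0):
        return 1
    if pair == (0, 1):
        return 0
    if pair == SPACER:
        return None
    raise ValueError("illegal dual-rail code (1, 1)")


def encode_word(bits):
    """Encode a list of bits as a list of dual-rail pairs."""
    return [encode_bit(b) for b in bits]


def word_is_valid(pairs):
    """A word is valid only once every pair carries data."""
    return all(decode_pair(p) is not None for p in pairs)


word = encode_word([1, 0, 1])
print(word)                                      # [(1, 0), (0, 1), (1, 0)]
print(word_is_valid(word))                       # True
print(word_is_valid([SPACER, (0, 1), (1, 0)]))   # False: bit 0 has not arrived
```

Because validity is carried by the data wires themselves, the receiver can tell when a word has arrived without any separate timing assumption.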
Dual-Rail (contd.)
  [Figure: rails of bit0, bit1, …, bitn feed one OR gate per bit; the OR outputs feed a C-element that produces “Done”]
  Dual-Rail Completion Detector:
    • combines dual-rail signals
    • indicates when all bits are valid (or reset)
    • OR together 2 rails per bit
    • Merge results using a Muller “C-element”
  C-element:
    • if all inputs = 1, output 1
    • if all inputs = 0, output 0
    • else, maintain output value
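The completion detector described on this slide can be sketched behaviourally in Python. This models logic values only, not circuit timing. The class and function names are illustrative, and the rail convention (a pair is valid when exactly one rail is 1, reset when both are 0) is an assumption.

```python
class CElement:
    """C-element: output follows the inputs when they all agree, else holds its value."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        # Mixed inputs: keep the previous output.
        return self.out


def completion_detector(c_element, pairs):
    """OR the two rails of each bit, then merge the per-bit results in the C-element."""
    per_bit_valid = [t | f for (t, f) in pairs]
    return c_element.update(per_bit_valid)


c = CElement()
steps = [
    [(0, 0), (0, 0)],   # all bits reset
    [(1, 0), (0, 0)],   # bit 0 has arrived, bit 1 still pending
    [(1, 0), (0, 1)],   # all bits valid
    [(0, 0), (0, 1)],   # bit 0 reset, bit 1 still valid
    [(0, 0), (0, 0)],   # all bits reset again
]
trace = [completion_detector(c, s) for s in steps]
print(trace)  # [0, 0, 1, 1, 0]
```

The trace shows the hold behaviour: "Done" rises only when every bit is valid and falls only when every bit has been reset.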
Handshaking Styles: 4-phase
  [Figure: Request waveform: rising edge = “start event”, falling edge = “get ready for next event”; Acknowledge waveform: rising edge = “event done”, falling edge = “ready for next event”]
  4-Phase: requires 4 events per handshake
    • “Level-sensitive”: simpler logic implementation
    • Overhead of “return-to-zero” (RTZ or resetting)
      • extra events which do no useful computation
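As a rough illustration, the four events of one handshake can be listed in Python. This is a sequential sketch of the signal levels only (no concurrency or delays), and the function name is made up for this example.

```python
def four_phase_handshake():
    """Yield (event, req, ack) for one complete 4-phase handshake."""
    req, ack = 0, 0
    req = 1
    yield ("req rises: start event", req, ack)
    ack = 1
    yield ("ack rises: event done", req, ack)
    req = 0
    yield ("req falls: get ready for next event", req, ack)
    ack = 0
    yield ("ack falls: ready for next event", req, ack)


events = list(four_phase_handshake())
for name, req, ack in events:
    print(f"{name:40s} req={req} ack={ack}")
```

Only the first two events carry the transfer; the last two are the return-to-zero overhead, after which both wires are back at 0 and the next handshake can begin.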
Handshaking Styles: 2-phase
  [Figure: Request waveform: each edge = “start event” / “start next event”; Acknowledge waveform: each edge = “event done” / “next event done”]
  2-Phase: requires 2 events per handshake
    • Elegant: no return-to-zero
    • Slower logic implementation:
      • logic primitives are inherently level-sensitive, not event-based (at least in CMOS)
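A matching Python sketch for 2-phase (transition) signalling, again modelling only the wire levels in sequence. Here any toggle of a wire, rising or falling, counts as an event; the function name is illustrative.

```python
def two_phase_handshakes(n):
    """Yield (event, req, ack) for n back-to-back 2-phase handshakes."""
    req, ack = 0, 0
    for _ in range(n):
        req ^= 1   # any transition on req starts an event
        yield ("req toggles: start event", req, ack)
        ack ^= 1   # any transition on ack signals completion
        yield ("ack toggles: event done", req, ack)


events = list(two_phase_handshakes(2))
for name, req, ack in events:
    print(f"{name:28s} req={req} ack={ack}")
```

Four wire transitions complete two handshakes here, where 4-phase needs all four for a single one. The cost is that the receiver must detect changes rather than levels, which is the "slower logic implementation" noted on the slide.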
Handshaking + Data Representation
  [Figure: block A sends dual-rail data bit 1 … bit m to block B; B returns “ack” to A]
  Several combinations possible:
    • dual-rail 4-phase, single-rail 4-phase, dual-rail 2-phase, and single-rail 2-phase
  Example: dual-rail 4-phase
    • dual-rail data: functions as an implicit “request”
    • 4-phase cycle: between acknowledge and implicit request
Other Data Representation Styles
  [Figure labels: data; phase]
  • Level-Encoded Dual-Rail (LEDR)
    • 2 wires per bit: “data” and “phase”
    • exactly one wire per bit changes value
      • if new value is different, “data” wire changes value
      • else “phase” wire changes value
  • M-of-N Codes
    • N wires used for a data word
    • M wires (M <= N) change value
    • Values of N and M: have impact on…
      • information transmitted, power consumed and logic complexity
  • Knuth codes, Huffman codes, …
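The LEDR rule above can be sketched in Python exactly as the slide states it: the "data" wire toggles when the value changes, otherwise the "phase" wire toggles. The initial wire levels (0, 0) and the function names are assumptions made for this example.

```python
def ledr_send(bits, data=0, phase=0):
    """Return the (data, phase) wire levels after each transmitted bit."""
    states = []
    for bit in bits:
        if bit != data:
            data = bit    # value changed: the "data" wire toggles
        else:
            phase ^= 1    # value repeated: the "phase" wire toggles
        states.append((data, phase))
    return states


def ledr_receive(states, data=0, phase=0):
    """Recover the bits: a change on either wire marks one new data item."""
    bits = []
    prev = (data, phase)
    for state in states:
        changed = sum(a != b for a, b in zip(prev, state))
        if changed != 1:
            raise ValueError("exactly one wire must change per data item")
        bits.append(state[0])   # the value is read directly off the "data" wire
        prev = state
    return bits


sent = [1, 1, 0, 0, 1]
wires = ledr_send(sent)
print(wires)                 # [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0)]
print(ledr_receive(wires))   # [1, 1, 0, 0, 1]
```

Each item costs exactly one wire transition and needs no return-to-zero, and the value can always be read as a level from the "data" wire.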
Which to use?
  Depends on several performance parameters:
  • speed
    • single-rail vs. dual-rail
      • single-rail may be faster (if designed aggressively)
      • dual-rail may be faster (if completion times vary widely)
    • 2-phase vs. 4-phase
      • 2-phase may be faster (if logic overhead is small)
      • 4-phase may be faster (if overhead of return-to-zero is small)
  • power consumption
    • 2-phase typically has fewer gate transitions (=> lower power)
  • amount of logic used (#gates/wires/pins => chip area)
    • single-rail needs fewer gates/wires/pins
  • design and verification effort
    • dual-rail, 1-of-N, M-of-N, Knuth codes…:
      • delay-insensitive: robust in the presence of arbitrary delays
    • single-rail: requires greater timing verification effort
Sutherland’s Micropipelines Seminal Paper
Focus of Sutherland’s Turing Award Lecture: Pipelining
  Motivation: Pipelining is at the heart of nearly all high-performance digital systems
  Additional Benefits:
    • Low power
    • Interfacing with mixed systems
    • Modular and scalable design
Background: Pipelining
  What is Pipelining?: Breaking up a complex operation on a stream of data into simpler sequential operations
  [Figure: A “coarse-grain” pipeline (e.g. simple processor): fetch, decode, execute, separated by storage elements (latches/registers)]
  [Figure: A “fine-grain” pipeline (e.g. pipelined adder)]
  Throughput = #data items processed/second
    + Throughput: significantly increased
    – Latency: somewhat degraded
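A quick numeric sketch of the throughput/latency trade-off, in Python. All numbers are hypothetical: a 9 ns operation split evenly into 3 stages, with 0.5 ns of latch overhead per stage. The even split is itself a simplifying assumption.

```python
def pipeline_metrics(logic_delay_ns, stages, latch_overhead_ns):
    """Return (latency_ns, throughput_items_per_ns) for an evenly split pipeline."""
    stage_time = logic_delay_ns / stages + latch_overhead_ns
    latency = stages * stage_time    # time for one item to pass through all stages
    throughput = 1.0 / stage_time    # one item completes per stage time in steady state
    return latency, throughput


# Hypothetical numbers: 9 ns of logic, 0.5 ns latch overhead per stage.
base_latency, base_throughput = pipeline_metrics(9.0, stages=1, latch_overhead_ns=0.5)
pipe_latency, pipe_throughput = pipeline_metrics(9.0, stages=3, latch_overhead_ns=0.5)

print(f"unpipelined: latency {base_latency} ns, throughput {base_throughput:.3f} items/ns")
print(f"3 stages:    latency {pipe_latency} ns, throughput {pipe_throughput:.3f} items/ns")
print(f"speed-up in throughput: {pipe_throughput / base_throughput:.2f}x")
```

With these assumed values the latency grows from 9.5 ns to 10.5 ns because of the extra latches, while the throughput improves by well over 2x.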
Focus of Async Community
  Our Focus: Extremely fine-grain pipelines
    • “gate-level” pipelining = use narrowest possible stages
      • each stage consists of only a single level of logic gates
    • some of the fastest existing digital pipelines to date
  Application areas:
    • multimedia hardware (graphics accelerators, video DSPs, …)
      • naturally pipelined systems, throughput is critical
      • input is often “bursty”
    • optical networking
      • serializing/deserializing FIFOs
    • genomic string matching?
      • KMP-style string matching: variable skip lengths