This lecture introduces clockless computing and focuses on the basics of pipelining: breaking a complex operation into simpler sequential operations, and the resulting impact on throughput and latency. It surveys application areas for fine-grain asynchronous pipelining, presents the MOUSETRAP pipeline as an example of an ultra-high-speed transition-signaling asynchronous pipeline, and explains its implementation and control signaling. It also introduces forks and joins, discusses their circuit implementation, and points out the limitations of purely linear pipelining.
Clockless Computing: Lecture 3. Montek Singh. Thu, Aug 30, 2007
Handshaking Example: Asynchronous Pipelines
• Pipelining basics
• Fine-grain pipelining
• Example approach: MOUSETRAP pipelines
Background: Pipelining
What is pipelining? Breaking up a complex operation on a stream of data into simpler sequential operations, separated by storage elements (latches/registers).
[Figure: a "coarse-grain" pipeline (e.g. a simple processor: fetch, decode, execute) vs. a "fine-grain" pipeline (e.g. a pipelined adder).]
Performance impact (see the numeric sketch below):
• + Throughput: significantly increased (# data items processed per second)
• – Latency: somewhat degraded (# seconds from input to output)
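As a rough illustration of this throughput/latency tradeoff, the sketch below compares an unpipelined operation against the same operation split into stages. The delay values, stage count, and per-stage overhead are made-up assumptions for illustration, not figures from the lecture.

```python
# Hypothetical illustration of pipelining's performance impact.
# Assumed numbers (not from the lecture): a 9 ns operation split into 3 stages,
# with each pipeline register/latch adding 0.5 ns of overhead.
op_delay_ns = 9.0
num_stages = 3
latch_overhead_ns = 0.5

# Unpipelined: one item finishes every op_delay_ns.
unpiped_latency = op_delay_ns
unpiped_throughput = 1.0 / unpiped_latency          # items per ns

# Pipelined: cycle time is set by one stage's delay plus latch overhead.
stage_delay = op_delay_ns / num_stages + latch_overhead_ns
piped_latency = num_stages * stage_delay            # somewhat worse than unpipelined
piped_throughput = 1.0 / stage_delay                # significantly better

print(f"unpipelined: latency {unpiped_latency} ns, throughput {unpiped_throughput:.3f}/ns")
print(f"pipelined:   latency {piped_latency} ns, throughput {piped_throughput:.3f}/ns")
```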
Focus of Asynchronous Community
A key focus: extremely fine-grain pipelines
• "gate-level" pipelining = use the narrowest possible stages
• each stage consists of only a single level of logic gates
• some of the fastest existing digital pipelines to date
Application areas:
• general-purpose microprocessors: instruction pipelines, often 20-40 stages
• multimedia hardware (graphics accelerators, video DSPs, ...): naturally pipelined systems where throughput is critical and input is "bursty"
• optical networking: serializing/deserializing FIFOs
• string matching? KMP-style string matching with variable skip lengths
MOUSETRAP: Ultra-High-Speed Transition-Signaling Asynchronous Pipelines
Singh and Nowick, Intl. Conf. on Computer Design (ICCD), September 2001, and IEEE Trans. VLSI, June 2007
MOUSETRAP Pipelines
Simple asynchronous implementation style, uses...
• standard logic implementation: Boolean gates, transparent latches
• simple control: 1 gate per pipeline stage
MOUSETRAP uses a "capture protocol." Latches...
• are normally transparent: before new data arrives
• become opaque: after data arrives ("capture" the data)
Control signaling: transition signaling = 2-phase
• simple protocol: req/ack = only 2 events per handshake (not 4)
• no "return-to-zero"
• each transition (up or down) signals a distinct operation
Our goal: very fast cycle time
• simple inter-stage communication
MOUSETRAP: A Basic FIFO
Stages communicate using transition signaling: 1 transition per data item!
[Figure: three FIFO stages (Stage N-1, Stage N, Stage N+1), each a data latch plus a latch controller; signals ackN-1, ackN, reqN, reqN+1, doneN and latch enable En; successive data items shown flowing through the pipeline.]
MOUSETRAP: A Basic FIFO (contd.)
• Latch is disabled when the current stage is "done"
• Latch is re-enabled when the next stage is "done"
• Latch controller (XNOR) acts as a "protocol converter": 2 distinct transitions (up or down) produce a pulsed latch enable, i.e. 2 transitions per latch cycle (a behavioral sketch follows below)
[Figure: Stage N's data latch with its XNOR latch controller; inputs doneN and ackN, output En; surrounding signals ackN-1, reqN, reqN+1 between Stage N-1, Stage N, Stage N+1.]
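To make the control behavior concrete, here is a minimal, purely behavioral Python sketch of one MOUSETRAP stage: a level-sensitive latch whose enable is the XNOR of the stage's own done and the ack from the next stage. The class and method names are mine, not from the paper, and the model is zero-delay; it shows only the logical protocol.

```python
# Behavioral sketch (assumption: zero-delay model) of one MOUSETRAP FIFO stage.
# The latch is transparent while enable() == 1 and opaque while enable() == 0.
class MousetrapStage:
    def __init__(self):
        self.done = 0      # done_N: toggles once per data item (transition signaling)
        self.ack_in = 0    # ack from stage N+1 (a copy of done_{N+1})
        self.data = None   # value currently held/passed by the data latch

    def enable(self):
        # XNOR latch controller: transparent when done_N == ack from N+1,
        # i.e. the next stage has already consumed our previous item.
        return 1 if self.done == self.ack_in else 0

    def receive(self, req, data_in):
        # req is the incoming done_{N-1}; a toggle means a new item is offered.
        if self.enable() and req != self.done:
            self.data = data_in    # latch is transparent: data flows through
            self.done ^= 1         # done_N toggles -> enable drops (latch captures)
        return self.done           # done_N serves as ack to N-1 and req to N+1

    def acknowledge(self, ack):
        # ack arrives when stage N+1 has latched the item; the latch re-opens.
        self.ack_in = ack


# Tiny usage example: push one item through two stages.
n, n1 = MousetrapStage(), MousetrapStage()
req = n.receive(req=1, data_in="item0")     # stage N captures item0, closes
n.acknowledge(n1.receive(req, n.data))      # stage N+1 captures it; N re-opens
print(n.enable(), n1.data)                  # -> 1 item0
```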
MOUSETRAP: FIFO Cycle Time
Fast self-loop: stage N disables itself (its latch closes as soon as its own done fires).
Cycle Time = (time for N to compute) + (time for N+1 to compute) + (time for N to be re-enabled to compute)
[Figure: Stage N-1 / Stage N / Stage N+1 diagram with the three numbered events: (1) N computes, (2) N+1 computes, (3) N re-enabled to compute; signals ackN-1, ackN, En, reqN, reqN+1, doneN.]
Detailed Controller Operation
One enable pulse per data item flowing through:
• down transition: caused by the "done" of stage N
• up transition: caused by the "done" of stage N+1 (the ack from N+1)
[Figure: Stage N's latch controller, with "done" from N and "ack" from N+1 as inputs and the latch enable as output.]
MOUSETRAP: Pipeline With Logic
Simple extension to the FIFO: insert a logic block + a matching delay in each stage.
Logic blocks: can use standard single-rail (non-hazard-free) logic.
"Bundled data" requirement: each "req" must arrive after the stage's data inputs are valid and stable (a sketch of this check follows below).
[Figure: Stages N-1, N, N+1, each with a data latch, latch controller, logic block, and matched delay on the req path; signals ackN-1, ackN, reqN, reqN+1, doneN.]
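The bundled-data requirement is typically guaranteed by sizing the matched delay to cover the worst-case logic delay. The check below is a hedged back-of-the-envelope sketch of that condition; the delay values and margin are assumptions for illustration, not numbers from the lecture or paper.

```python
# Bundled-data check (illustrative assumption, not the paper's exact methodology):
# the matched delay on the req path must be at least as long as the slowest
# path through the stage's logic block, plus a safety margin.
worst_case_logic_delay_ps = 310.0   # assumed worst-case delay through the logic block
matched_delay_ps = 350.0            # assumed delay element on the req path
margin_ps = 20.0                    # assumed design margin

if matched_delay_ps >= worst_case_logic_delay_ps + margin_ps:
    print("bundling constraint met: req arrives after data is valid and stable")
else:
    print("violation: req may outrun the data (bundled-data constraint broken)")
```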
Complex Pipelining: Forks & Joins
Problem with linear pipelining: it handles only limited applications; real systems are more complex.
Non-linear pipelining: has forks and joins.
Contribution: introduce efficient circuit structures
• Forks: distribute data + control to multiple destinations
• Joins: merge data + control from multiple sources
• Enabling technology for building complex async systems
Forks and Joins: Implementation
• Join: merge multiple requests into one (C-element on the incoming reqs)
• Fork: merge multiple acknowledges into one (C-element on the returning acks)
A behavioral sketch of the C-element used for this merging follows below.
[Figure: a join stage N combining req1 and req2 through a C-element into a single req; a fork stage N combining ack1 and ack2 through a C-element into a single ack.]
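Transition-signaling joins and forks merge their handshake signals with a Muller C-element, whose output changes only when all of its inputs agree. Below is a minimal behavioral Python sketch of a two-input C-element; the class and method names are mine, chosen for illustration.

```python
# Behavioral model of a 2-input Muller C-element (a state-holding gate):
# output goes high only when both inputs are high, low only when both are low,
# and otherwise holds its previous value. Used to merge reqs (join) or acks (fork).
class CElement:
    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # both inputs agree -> output follows them
            self.out = a
        return self.out     # inputs disagree -> output holds its old value


# Usage: a join waits for transitions on both incoming reqs before passing one on.
join = CElement()
print(join.update(1, 0))  # 0: only one predecessor has produced data so far
print(join.update(1, 1))  # 1: both reqs have toggled -> the merged req fires
```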
Performance, Timing and Optzn.
MOUSETRAP with logic:
• Stage latency = t_latch + t_logic (forward delay through a transparent stage)
• Cycle time = 2 t_latch + t_logic + t_XNOR↑ (data through stage N's latch and logic, through stage N+1's latch, then back through stage N's XNOR to re-enable its latch)
(A worked numeric example follows below.)
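As a quick sanity check of these expressions, the sketch below plugs in made-up delay values (not measurements from the paper) and reports latency, cycle time, and throughput; setting t_logic to zero recovers the plain FIFO case from the earlier slide.

```python
# Worked example with assumed delays (illustrative only, not from the paper).
t_latch_ps = 60.0    # transparent-latch delay
t_logic_ps = 150.0   # logic block + matched delay (set to 0 for a plain FIFO)
t_xnor_ps = 45.0     # XNOR up-transition (latch re-enable)

stage_latency = t_latch_ps + t_logic_ps                  # forward delay per stage
cycle_time = 2 * t_latch_ps + t_logic_ps + t_xnor_ps     # steady-state cycle time
throughput_ghz = 1e3 / cycle_time                        # ps -> GHz

print(f"stage latency: {stage_latency} ps")
print(f"cycle time:    {cycle_time} ps  (~{throughput_ghz:.2f} GHz)")
```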
Timing Analysis
Main timing constraint: avoid "data overrun" (a hold-time condition). Data must be safely "captured" by stage N before new inputs arrive from stage N-1.
• simple one-sided timing constraint: fast latch disable
• stage N's "self-loop" must be faster than the entire path through the prior stage (a sketch of this inequality follows below)
[Figure: Stages N-1 and N with data latches, latch controllers, logic blocks, and matched delays; signals ackN-1, ackN, reqN, reqN+1, doneN.]
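Written out as an inequality, the one-sided constraint says that stage N's self-loop (its own done through the XNOR to latch disable) must beat the fastest path by which the next data item can arrive from stage N-1. The sketch below checks that condition with assumed delay names and values, which are mine, not the paper's.

```python
# One-sided "data overrun" (hold) check with assumed delays (illustrative only).
# Both paths start from the same event: stage N's done_N fires.
# Path 1 (self-loop): done_N goes through stage N's XNOR and closes its latch.
t_xnor_down_ps = 45.0        # stage N's XNOR down-transition
t_latch_disable_ps = 30.0    # latch enable low -> latch opaque
self_loop = t_xnor_down_ps + t_latch_disable_ps

# Path 2 (through the prior stage): done_N, acting as ackN-1, re-opens stage N-1's
# latch, and the next item races through N-1's latch and logic toward stage N.
t_xnor_up_ps = 45.0          # stage N-1's XNOR up-transition
t_prev_latch_ps = 60.0       # data through stage N-1's latch
t_prev_logic_min_ps = 120.0  # *minimum* delay through N-1's logic + matched delay
earliest_new_data = t_xnor_up_ps + t_prev_latch_ps + t_prev_logic_min_ps

if self_loop < earliest_new_data:
    print("hold constraint met: stage N closes before new data can overrun it")
else:
    print("violation: new data may reach stage N before its latch has closed")
```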
Experimental Results • Simulations of FIFO’s: • ~3 GHz (in 0.13u IBM process) • Recent fabricated chip: GCD • ~2 GHz simulated speed • Chips tested to be fully functional • Will show demo later
In-Class Exercise • Modify MOUSETRAP to remove the “data overrun” timing constraint • How is the performance affected?
Homework #3 (due Tue Sep 11, 2007)
• Read the MOUSETRAP paper [TVLSI Jun '07]
• Modify MOUSETRAP to reduce power consumption:
• Make the latches normally opaque
• Latches become transparent only when new data arrives at their inputs
• Prevents glitchy/garbage data from propagating
• How is the performance (throughput, latency) affected?
MOUSETRAP Advanced Topics
Special Case: Using "Clocked Logic"
Clocked-CMOS (C2MOS): eliminate explicit latches; the latch is folded into the logic itself. A behavioral sketch follows below.
[Figure: a general C2MOS gate (pull-up and pull-down networks gated by En and En', with a "keeper" on the output) and a C2MOS AND-gate with logic inputs A and B.]
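Functionally, a C2MOS gate behaves like a logic gate whose output is driven only while the enable is asserted and is held (by the keeper) otherwise. Here is a minimal behavioral Python sketch of a C2MOS AND-gate; it ignores electrical details such as keeper sizing and is only an assumption-laden illustration, not the circuit itself.

```python
# Behavioral model of a C2MOS AND-gate: evaluates A AND B while enabled,
# holds its previous output (the "keeper" state) while disabled.
class C2MosAnd:
    def __init__(self, initial=0):
        self.out = initial   # state retained by the keeper when En is low

    def eval(self, a, b, en):
        if en:               # "latch" transparent: output follows the logic
            self.out = a & b
        return self.out      # "latch" opaque: output held by the keeper


gate = C2MosAnd()
print(gate.eval(1, 1, en=1))  # 1: drives the new value while enabled
print(gate.eval(0, 1, en=0))  # 1: inputs changed, but the output is held
```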
Gate-Level MOUSETRAP: with C2MOS
• Use C2MOS: eliminate explicit latches
• New control optimization = "dual-rail XNOR": eliminate 2 inverters from the critical path
[Figure: Stages N-1, N, N+1 built from pairs of bit latches / C2MOS logic; dual-rail control signals (ack, ack'), (done, done'), (En, En'), each 2 bits wide; plus reqN, reqN+1, ackN-1, ackN, doneN.]
Timing Optzn: Reducing Cycle Time
Analytical cycle time = 2 t_latch + t_logic + t_XNOR↑
Goal: shorten t_XNOR↑ (the latch re-enable) in steady-state operation. Steady state = no undue pipeline congestion.
Observation:
• the XNOR switches twice per data item
• only the 2nd (up) transition is critical for performance
Solution: reduce the XNOR's output swing
• degrade the "slew" for the start of the pulse
• allows quick pulse completion: faster rise time
Still safe when congested: the pulse starts on time
• the pulse is maintained until the congestion clears
Timing Optzn (contd.)
[Waveform: "unoptimized" vs. "optimized" XNOR output, between N's "done" (N's latch disabled) and N+1's "done" (N's latch re-enabled). With the reduced swing, the latch is only partly disabled and recovers quicker; there is no pulse-width requirement.]
Comparison with Wave Pipelining
Two scenarios:
• Steady state: both MOUSETRAP and wave pipelines act like transparent, "flow-through" combinational pipelines.
• Congestion: if the right environment stalls, each MOUSETRAP stage safely captures its data; if an internal stage is slow, the MOUSETRAP stages to its left safely capture data. Congestion is properly handled in MOUSETRAP.
Conclusion: MOUSETRAP has the potential of...
• the speed of wave pipelining
• greater robustness and flexibility
Timing Issues: Handling Wide Datapaths
Buffers are inserted to amplify the latch enable signals (En).
Reducing the impact of the buffers:
• the control path uses the unbuffered signals, so the buffer delay is off the critical path
• the datapath is skewed w.r.t. the control
Timing assumption: the buffer delays are roughly equal.
[Figure: Stages N-1 and N shown with and without enable buffers; signals reqN, doneN, reqN+1, En.]