Clockless Computing

Montek Singh
Thu, Sep 13, 2007

Dynamic Logic Pipelines (contd.)
• Drawbacks of Williams’ PS0 Pipelines
• Lookahead Pipelines [Singh/Nowick 2000]
• High-Capacity Pipelines [Singh/Nowick 2000]

Drawbacks of PS0 Pipelining
• Poor throughput:
  • long cycle time: 6 events per cycle
  • data “tokens” are forced far apart in time
• Limited storage capacity:
  • max only 50% of stages can hold distinct tokens
  • data tokens must be separated by at least one spacer
My research goals: address both issues
• while still maintaining very low latency

Recent Approaches
3 novel styles for high-speed async pipelining:
• MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]
• “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00]
• “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00]
Goal: significantly improve throughput of PS0
Two Distinct Strategies:
• LP: introduce protocol optimizations
  • “shave off” components from critical cycle
• HC: fundamentally new protocol
  • greater concurrency: “loosely-coupled” stages

Outline
• New Asynchronous Pipelines:
  • MOUSETRAP Pipelines (static circuit style)
  • Lookahead Pipelines (LP) (dynamic circuit style)
  • High-Capacity Pipelines (HC) (dynamic circuit style)

Lookahead Pipeline Styles
Singh and Nowick, Async-2000 [Best Paper Award]

Lookahead Pipelines: Strategy #1
Use non-neighbor communication:
• stage receives information from multiple later stages
• allows “early evaluation”
Benefit: stage gets head-start on next cycle

Lookahead Pipelines: Strategy #2
Use early completion detection:
• completion detector moved before stage (not after)
• stage indicates “early done” in parallel with computation
Benefit: again, stage gets head-start on next cycle

Lookahead Pipelines: Overview
5 New Designs:
• “Dual-Rail” Data Signaling:
  • LP3/1: “early evaluation”
  • LP2/2: “early done”
  • LP2/1: “early evaluation” + “early done”
• “Single-Rail” Bundled-Data Signaling:
  • LPSR2/2: “early done”
  • LPSR2/1: “early evaluation” + “early done”

Dual-Rail Design #1: LP3/1
Optimization = “early evaluation”
• each stage has two control inputs: from stages N+1 and N+2
Idea: shorten precharge phase
• terminate precharge early: when N+2 is done evaluating
[Figure: stages N, N+1, N+2, each a processing block followed by a completion detector, with data in and data out; stage N has a PC input and an Eval input, the latter from N+2]

LP3/1 Protocol
• PRECHARGE N: when N+1 completes evaluation
• EVALUATE N: when N+2 completes evaluation (new: enables “early evaluation”!)
[Figure: numbered events across stages N, N+1, N+2: (1) N evaluates, (2) N+1 evaluates, (3) N+2 evaluates and N+1 indicates “done”, (4) N+2 indicates “done”]

LP3/1: Comparison with PS0
LP3/1: only 4 events in cycle!
• PRECHARGE N: when N+1 completes evaluation
• EVALUATE N: when N+2 completes evaluation (enables “early evaluation”!)
PS0: 6 events in cycle
• EVALUATE N: when N+1 completes precharging
[Figure: event-numbered diagrams for stages N, N+1, N+2 under each protocol, showing which stages evaluate and which indicate “done”]

LP3/1 Performance
Cycle Time = [equation shown as a figure on the slide; 4 events in the cycle, with the saved path highlighted]
Savings over PS0: 1 Precharge + 1 Completion Detection
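The event counts for the two protocols can be turned into a rough cycle-time estimate. The sketch below is illustrative only: the event lists are reconstructed from the PS0 and LP3/1 rules on the comparison slide, the names `PS0_CYCLE`, `LP31_CYCLE` and `cycle_time` are hypothetical, the delay values are made up, and the small delay of the NAND gate that merges the two control inputs in LP3/1 is ignored.

```python
# Critical-cycle events for stage N, in order, each tagged with the kind of
# delay it costs. Reconstructed from the protocol rules; not taken verbatim
# from the slides.
PS0_CYCLE = [
    ("N evaluates", "eval"),
    ("N+1 evaluates", "eval"),
    ("N+2 evaluates", "eval"),
    ("N+2 indicates done", "cd"),
    ("N+1 precharges", "prech"),
    ("N+1 indicates done", "cd"),   # only now may N evaluate again
]
LP31_CYCLE = [
    ("N evaluates", "eval"),
    ("N+1 evaluates", "eval"),
    ("N+2 evaluates", "eval"),
    ("N+2 indicates done", "cd"),   # early evaluation: N restarts here
]

def cycle_time(events, delays):
    """Sum the delay of every event on the critical cycle."""
    return sum(delays[kind] for _, kind in events)

# Hypothetical delays in picoseconds (integers avoid rounding noise).
delays = {"eval": 250, "prech": 300, "cd": 200}

ps0 = cycle_time(PS0_CYCLE, delays)
lp31 = cycle_time(LP31_CYCLE, delays)
saving = ps0 - lp31
print(len(PS0_CYCLE), len(LP31_CYCLE), ps0, lp31, saving)
```

Under these assumptions the difference between the two cycles is exactly one precharge plus one completion detection, matching the savings stated on the slide.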

LP3/1: Inside a Stage
Merging 2 Control Inputs:
• PC (from stage N+1) and Eval (from stage N+2) are merged by a NAND gate
[Figure: NAND gate with its inputs labeled “old Eval” and “early Eval”]
Timing Issues: must satisfy several simple constraints
• Ex.: PC must arrive before Eval is de-asserted
• 1-sided timing requirement, easily satisfied in practice
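One plausible reading of the NAND merge can be written as a truth function. The polarities here are assumptions, not taken from the slide: a "done" signal is taken to be 1 after a stage evaluates and 0 after it precharges, the input from N+2 is assumed to be inverted before the NAND, and the function name `lp31_control` is hypothetical.

```python
def lp31_control(done_n1: bool, done_n2: bool) -> bool:
    """Return True if stage N should evaluate, False if it should precharge.

    Assumed behavior: precharge only while N+1 has finished evaluating
    and N+2 has not yet finished; evaluate in every other case.
    """
    return not (done_n1 and not done_n2)   # NAND(done_n1, NOT done_n2)

# Walk through one cycle of the two control inputs.
trace = [
    (False, False),  # N+1 precharged: "old Eval" condition
    (True, False),   # N+1 done evaluating: precharge N
    (True, True),    # N+2 done evaluating: "early Eval" ends the precharge
    (False, True),   # N+1 precharges before N+2 does: N keeps evaluating
]
for n1, n2 in trace:
    print(n1, n2, "evaluate" if lp31_control(n1, n2) else "precharge")
```

In this model the one-sided timing constraint shows up directly: if the N+2 input fell back to 0 before the N+1 input did, the inputs would pass through (True, False) and stage N would precharge spuriously.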

Dual-Rail Design #2: LP2/2
Optimization = “early done”
• Idea: move completion detector before processing block
• stage indicates when “about to” precharge/evaluate
[Figure: “early” completion detector placed ahead of the processing block, producing “early done”; data in and data out pass through the processing block]

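LP2/2 Completion Detector
Modified completion detectors needed:
• Done=1 when stage starts evaluating, and inputs valid
• Done=0 when stage starts precharging
• asymmetric C-element
[Figure: each input bit (bit0, bit1, ..., bitn) feeds an OR gate; the OR outputs, marked “+”, and PC drive an asymmetric C-element (C) that produces Done]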
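The set and reset rules for Done can be captured in a small behavioral model. This is a sketch under stated assumptions, not the circuit itself: the class name `EarlyCompletionDetector`, the `evaluating` flag standing in for the PC input, and the representation of each dual-rail bit as a `(true_rail, false_rail)` pair are all choices made here for illustration.

```python
class EarlyCompletionDetector:
    """Behavioral model of an asymmetric C-element producing Done."""

    def __init__(self):
        self.done = False

    def update(self, evaluating, rails):
        # A dual-rail bit is valid when either of its two rails is high
        # (one OR gate per bit).
        all_valid = all(t or f for t, f in rails)
        if not evaluating:
            self.done = False      # reset as soon as the stage precharges
        elif all_valid:
            self.done = True       # set: stage evaluating AND inputs valid
        # otherwise hold the previous value (C-element memory)
        return self.done


cd = EarlyCompletionDetector()
print(cd.update(True, [(0, 0), (1, 0)]))   # one bit still empty
print(cd.update(True, [(0, 1), (1, 0)]))   # all bits valid
print(cd.update(True, [(0, 0), (0, 0)]))   # inputs reset, Done is held
print(cd.update(False, [(0, 1), (1, 0)]))  # precharge clears Done
```

The asymmetry is that the data bits take part only in setting Done, while the reset depends on the precharge control alone.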

LP2/2 Protocol
Completion Detection: performed in parallel with evaluation/precharge of stage
[Figure: numbered events across stages N, N+1, N+2: N evaluates, N+1 evaluates, “early done” of N+1 eval, “early done” of N+2 eval, “early done” of N+1 prech]

LP2/2 Performance 3 Cycle Time = 4 1 2 LP2/2 savings over PS0: 1 Evaluation + 1 Precharge
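The stated savings can be applied to a PS0 cycle-time expression to estimate the LP2/2 cycle time. The PS0 expression below is not given on these slides; it is inferred from the six-event PS0 cycle (three evaluations, two completion detections, one precharge). The symbols $t_{\mathrm{Eval}}$, $t_{\mathrm{Prech}}$ and $t_{\mathrm{CD}}$ are notation assumed here, and any extra gate overhead is ignored.

```latex
\begin{align*}
T_{\mathrm{PS0}}   &= 3\,t_{\mathrm{Eval}} + 2\,t_{\mathrm{CD}} + t_{\mathrm{Prech}} \\
T_{\mathrm{LP2/2}} &\approx T_{\mathrm{PS0}} - \left(t_{\mathrm{Eval}} + t_{\mathrm{Prech}}\right)
                    = 2\,t_{\mathrm{Eval}} + 2\,t_{\mathrm{CD}}
\end{align*}
```

Under this estimate the precharge time drops out of the LP2/2 cycle entirely, which is consistent with completion detection running in parallel with the stage's own precharge.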

Dual-Rail Design #3: LP2/1
Hybrid of LP3/1 and LP2/2. Combines:
• early evaluation of LP3/1
• early done of LP2/2
Cycle Time = [equation shown as a figure on the slide]

Single-Rail Design: LPSR2/1
Derivative of LP2/1, adapted to single-rail:
• bundled-data: matched delays instead of completion detectors
• “Ack” to previous stages is “tapped off early”
  • once in evaluate (precharge), dynamic logic insensitive to input changes
[Figure: three pipeline stages, each with a matched delay element]

Inside an LPSR2/1 Stage
• “done” generated by an asymmetric C-element (aC)
  • done=1 when stage evaluates, and data inputs valid
  • done=0 when stage precharges
• PC and Eval are combined exactly as in LP3/1
[Figure: stage with data in and data out; “req” in passes through a matched delay to the aC element (input marked “+”), which produces done, “req” out and “ack”; a NAND merges PC (from stage N+1) and Eval (from stage N+2)]

LPSR2/1 Protocol
[Figure: numbered events across stages N, N+1, N+2: N evaluates, N+1 evaluates, N+1 indicates “done”, N+2 evaluates, N+2 indicates “done”]
Cycle Time = [equation shown as a figure on the slide]

FIFO Results (simulations)
• LP dual-rail: over 80% faster than Williams’ PS0
  • comparable latency
• LP single-rail: even faster
Simulation conditions: 0.19 CMOS, 3.3 V, 300 K
[Figure: results for the dual-rail and single-rail designs]

Practicality of Gate-Level Pipelining
When datapath is wide:
• Can often split into narrow “streams”
• Use “localized” completion detector for each stream:
  • need to examine only a few bits → small fan-in
  • send “done” to only a few gates → small fan-out
  • comp. det. fairly low cost!
[Figure: datapath of width 32 dual-rail bits; each localized completion detector has fan-in = 2, and its “done” signal has fan-out = 2]
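The idea of localized completion detection can be sketched as follows. The function names `split_into_streams` and `stream_done` are hypothetical, the stream width of 2 is borrowed from the fan-in shown in the figure, and each dual-rail bit is modeled as a `(true_rail, false_rail)` pair.

```python
def split_into_streams(bits, width):
    """Partition a wide datapath into narrow streams of `width` bits."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]

def stream_done(stream):
    """Localized completion detector: examines only this stream's bits."""
    return all(t or f for t, f in stream)

# 32 dual-rail bits, all valid except bit 5, which has not arrived yet.
bits = [(1, 0)] * 32
bits[5] = (0, 0)

streams = split_into_streams(bits, 2)
done_flags = [stream_done(s) for s in streams]
print(len(streams), done_flags.count(False))
```

Each detector looks at only two bits, so a late bit holds up only its own stream instead of a single 32-input detector for the whole datapath.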

High-Capacity Pipelines
Singh/Nowick, WVLSI-00, ISSCC-02, Async-02

HC Pipeline Style
High-Capacity Pipelines (HC)
• bundled datapaths; dynamic logic function blocks
• latch-free: no explicit latches needed
  • dynamic logic provides implicit latching
• novel highly-concurrent protocol maximizes storage capacity
  • traditional latch-free approaches: “spacers” limit capacity to 50%
Key Idea: Obtain greater control of stage’s operation
• separate control of pull-up/pull-down
• result = new “isolate phase”
  • stage holds outputs/impervious to input changes
Advantage: Each stage can hold a distinct data item
• 100% storage capacity
Extra Benefit: Obtain greater concurrency → High throughput
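The capacity claim amounts to simple arithmetic, shown here as a minimal sketch. The function name `max_tokens` is hypothetical, and the spacer case is simplified to "at most every other stage holds data", which is the upper bound quoted on the slides rather than a model of steady-state occupancy.

```python
def max_tokens(num_stages: int, needs_spacers: bool) -> int:
    """Upper bound on distinct data items a pipeline can hold at once."""
    if needs_spacers:
        # Tokens must be separated by at least one spacer stage,
        # so at most half of the stages hold data.
        return num_stages // 2
    # With an isolate phase, every stage can hold its own item.
    return num_stages

print(max_tokens(8, needs_spacers=True))
print(max_tokens(8, needs_spacers=False))
```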

HC: Basic Structure
Key Idea: 2 independent control signals:
• pc: controls precharge
• eval: controls evaluation
Allows novel 3-phase cycle:
• Evaluate
• “Isolate” (hold)
• Precharge
Single-rail “Bundled Datapath”:
• matched delay: produces delayed “done” signal
• worst-case delay: longer than slowest path for data
[Figure: stages N, N+1, N+2, each with a stage controller driving pc and eval, a matched delay element, and an ack input]
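The matched-delay requirement is a one-sided check that can be written down directly. This is a sketch only: the function name `matched_delay_ok`, the optional `margin` parameter and the delay values are assumptions introduced for illustration.

```python
def matched_delay_ok(path_delays_ps, matched_delay_ps, margin=0.0):
    """Check that the matched delay exceeds the slowest data path.

    `margin` is an assumed fractional safety margin (0.2 means 20%).
    """
    worst_case = max(path_delays_ps)
    return matched_delay_ps > worst_case * (1.0 + margin)

# Hypothetical data-path delays through one stage, in picoseconds.
paths = [180, 220, 205]
print(matched_delay_ok(paths, 250))              # longer than slowest path
print(matched_delay_ok(paths, 220))              # equal is not enough
print(matched_delay_ok(paths, 250, margin=0.2))  # fails a 20% margin
```

If the check fails, “done” could be signaled before the data outputs are valid, so the delay element has to be sized against the worst-case path rather than the typical one.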

HC: Inside a Stage
Independent Controls of pull-up and pull-down:
• allows new 3rd phase: “isolate”
  • pc asserted: precharge
  • eval asserted: evaluate
  • pc and eval de-asserted: enter “isolate” (hold) phase
[Figure: dynamic gate with inputs, outputs and a “keeper”; pc controls precharge, eval controls evaluation]

HC: Protocol
Most Existing Protocols: 3 synchronization arcs
• 1 forward arc: data dependency
• 2 backward arcs: control synchronization
Our protocol: only 2 synchronization arcs
• only 1 backward arc
• once stage N+1 evaluates, N can complete entire next cycle!
Phases and signal levels:
• Eval: pc=1, eval=1
• Isolate: pc=1, eval=0
• Precharge: pc=0, eval=0
[Figure: Eval → Isolate → Precharge cycles for Stage N and Stage N+1 with the synchronization arcs between them; one arc is marked X]
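The three phases and their signal levels can be expressed as a small decoder. This is an illustrative sketch: the names `Phase` and `decode_phase` are hypothetical, and the fourth combination (pc=0, eval=1) is not listed on the slide; it is treated here as illegal on the assumption that it would enable precharge and evaluation at the same time.

```python
from enum import Enum

class Phase(Enum):
    EVALUATE = "evaluate"
    ISOLATE = "isolate"
    PRECHARGE = "precharge"

def decode_phase(pc: int, eval_: int) -> Phase:
    """Map the (pc, eval) signal levels to the phase of an HC stage."""
    if pc == 1 and eval_ == 1:
        return Phase.EVALUATE
    if pc == 1 and eval_ == 0:
        return Phase.ISOLATE      # outputs held, inputs ignored
    if pc == 0 and eval_ == 0:
        return Phase.PRECHARGE
    # Assumed illegal: precharge and evaluate enabled together.
    raise ValueError("pc=0 with eval=1 is not a valid HC phase")

# One full cycle of a stage.
for levels in [(1, 1), (1, 0), (0, 0)]:
    print(levels, decode_phase(*levels).value)
```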

Formal Specification of Controller
[Figure: signal transition graph with these transitions]
• pc+, eval+ (start evaluate)
• T+ (evaluate of N+1 complete)
• S+ (evaluate complete)
• eval- (isolate)
• pc- (start precharge)
• T- (precharge of N+1 complete)
• S- (precharge complete)
Problem: Specification too concurrent for direct synthesis
• desired precharge condition: N and N+1 have evaluated same data
• problem: this condition not uniquely captured by given signals!
  • N may evaluate next data item, while N+1 stuck on current item!

Modified Specification of Controller
Solution: Add a state variable ok2pc
• ok2pc records whether N+1 has “absorbed” N’s data item
  • ok2pc resets immediately when N deletes item (N precharges)
  • ok2pc is set when N+1 deletes item (N+1 precharges)
[Figure: modified signal transition graph with transitions pc+, eval+, T+ (evaluate of N+1 complete), S+, eval-, T- (precharge of N+1 complete), pc-, S-, ok2pc+, ok2pc-]

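Controller implementation
Controller implementation is very simple:
• each signal implemented using a single gate
• ok2pc typically off the critical path
[Figure: pc is produced by a NAND3 gate whose inputs include S and T; ok2pc is produced by an asymmetric C-element (aC, one input marked “+”) fed by S and T; eval is produced by an inverter (INV) driven by S]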
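A level-based model helps show how ok2pc resolves the ambiguity in the original specification. Several details below are assumptions rather than facts from the slides: that the third NAND3 input is ok2pc, the exact set/reset conditions of the aC element, the initial value of ok2pc, and the class name `HCController`. S and T are taken to be the "done" signals of stage N and stage N+1, and pc=0 means precharge, as in the phase table.

```python
class HCController:
    """Behavioral sketch of the controller for stage N."""

    def __init__(self):
        self.ok2pc = True   # assumed reset state: empty pipeline, S = T = 0

    def update(self, S: bool, T: bool):
        # ok2pc as an asymmetric C-element (assumed input arrangement):
        if not T:
            self.ok2pc = True     # N+1 precharged: it has deleted the item
        elif not S:
            self.ok2pc = False    # N precharged while N+1 still holds item
        # otherwise (S and T both high) hold the previous value
        pc = not (S and T and self.ok2pc)   # NAND3: pc = 0 starts precharge
        eval_ = not S                        # INV
        return pc, eval_


ctrl = HCController()
steps = [
    (False, False),  # idle: N free to evaluate
    (True, False),   # N evaluated: isolate
    (True, True),    # N+1 evaluated the same item: N precharges
    (False, True),   # N precharged: ok2pc resets, N may evaluate again
    (True, True),    # N holds a NEW item, N+1 stuck on the old one
    (True, False),   # N+1 precharges: ok2pc set again
    (True, True),    # N+1 evaluated the new item: N precharges
]
for S, T in steps:
    print(S, T, ctrl.update(S, T))
```

The third and fifth steps present the same S and T levels, yet only the third triggers a precharge; the difference is carried entirely by ok2pc.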

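HC: Stage Implementation
[Figure: stage with pc and eval controls, req, ack and done signals, a matched delay, a NAND, an INV and an asymmetric C-element (“+”). Annotations:]
• state variable: off the critical path
• self-loop from current stage: key to fast “isolation”
• early ack from next stage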

HC: Operation
[Figure: numbered events for stages N and N+1: N evaluates, N+1 starts to evaluate (early Ack), N precharges; through fast self-loops, N isolates and N enables itself for next evaluation]
Cycle Time = 8 CMOS gate delays

Performance
Cycle Time = [equation shown as a figure on the slide]
[Figure: numbered events across stages N, N+1, N+2: N evaluates, N+1 evaluates, N isolates, N precharges, N enables itself for next evaluation]

Fabricated Chip: HC FIFO
• 2.5 GHz in 0.18 µm
