
Clockless Computing

Explore strategies to improve throughput using Lookahead Pipelines for clockless computing. Discover LP3/1, LP2/2, LP2/1 designs, and performance benefits over PS0 pipelines. Understand Dual-Rail and Single-Rail signaling concepts.



  1. Clockless Computing. Montek Singh. Thu, Sep 13, 2007

  2. Dynamic Logic Pipelines (contd.) • Drawbacks of Williams’ PS0 Pipelines • Lookahead Pipelines [Singh/Nowick 2000] • High-Capacity Pipelines [Singh/Nowick 2000]

  3. Drawbacks of PS0 Pipelining • Poor throughput: • long cycle time: 6 events per cycle • data “tokens” are forced far apart in time • Limited storage capacity: • at most 50% of stages can hold distinct tokens • data tokens must be separated by at least one spacer Research goals: address both issues while still maintaining very low latency

  4. Recent Approaches 3 novel styles for high-speed async pipelining: • MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01] • “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00] • “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00] Goal: significantly improve throughput of PS0 Two distinct strategies: • LP: introduce protocol optimizations that “shave off” components from the critical cycle • HC: fundamentally new protocol with greater concurrency: “loosely-coupled” stages

  5. Outline • New Asynchronous Pipelines: • MOUSETRAP Pipelines (static circuit style) • Lookahead Pipelines (LP) (dynamic circuit style) • High-Capacity Pipelines (HC) (dynamic circuit style)

  6. Lookahead Pipeline Styles Singh and Nowick Async-2000 [Best Paper Award]

  7. Lookahead Pipelines: Strategy #1 Use non-neighbor communication: • stage receives information from multiple later stages • allows “early evaluation” Benefit: stage gets head-start on next cycle

  8. Lookahead Pipelines: Strategy #2 Use early completion detection: • completion detector moved before stage (not after) • stage indicates “early done” in parallel with computation Benefit: again, stage gets head-start on next cycle [Figure: stage with an early completion detector]

  9. Lookahead Pipelines: Overview 5 New Designs: • “Dual-Rail” Data Signaling: • LP3/1: “early evaluation” • LP2/2: “early done” • LP2/1: “early evaluation” + “early done” • “Single-Rail” Bundled-Data Signaling: • LPSR2/2: “early done” • LPSR2/1: “early evaluation” + “early done”

  10. Dual-Rail Design #1: LP3/1 Optimization = “early evaluation” • each stage has two control inputs: from stages N+1 and N+2 Idea: shorten precharge phase • terminate precharge early: when N+2 is done evaluating [Figure: stages N, N+1, N+2, each a processing block with a completion detector; labels: PC, Eval (from N+2), data in, data out]

  11. LP3/1 Protocol • PRECHARGE N: when N+1 completes evaluation • EVALUATE N: when N+2 completes evaluation (new: enables “early evaluation”) [Figure: stages N, N+1, N+2 with numbered events 1–4; labels: N evaluates, N+1 evaluates, N+2 evaluates, N+1 indicates “done”, N+2 indicates “done”]

  12. LP3/1: Comparison with PS0 • PS0: EVALUATE N when N+1 completes precharging; 6 events in cycle • LP3/1: PRECHARGE N when N+1 completes evaluation; EVALUATE N when N+2 completes evaluation (enables “early evaluation”); only 4 events in cycle [Figure: side-by-side event diagrams for stages N, N+1, N+2 under PS0 (events 1–6) and LP3/1 (events 1–4)]

  13. LP3/1 Performance Cycle Time = [equation not preserved in transcript] Savings over PS0: 1 Precharge + 1 Completion Detection [Figure: event diagram with events 1–4 and the saved path marked]
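
The savings stated on this slide can be checked with a small calculation. The sketch below is illustrative only: it assumes the PS0 critical cycle consists of 3 evaluations, 2 completion detections and 1 precharge (consistent with 6 events per cycle, but not spelled out in the transcript), ignores any delay added by the extra control gate in LP3/1, and uses made-up delay values. The names `StageDelays`, `cycle_time_ps0` and `cycle_time_lp31` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageDelays:
    """Hypothetical per-stage delays in picoseconds (illustrative values only)."""
    t_eval: float   # evaluation of one stage
    t_prech: float  # precharge of one stage
    t_cd: float     # completion detection


def cycle_time_ps0(d: StageDelays) -> float:
    # Assumed PS0 cycle: 3 evaluations + 2 completion detections + 1 precharge.
    return 3 * d.t_eval + 2 * d.t_cd + d.t_prech


def cycle_time_lp31(d: StageDelays) -> float:
    # Savings stated on the slide: 1 precharge + 1 completion detection.
    return cycle_time_ps0(d) - (d.t_prech + d.t_cd)


d = StageDelays(t_eval=100.0, t_prech=80.0, t_cd=120.0)
ps0 = cycle_time_ps0(d)    # 620.0 ps with these made-up numbers
lp31 = cycle_time_lp31(d)  # 420.0 ps with these made-up numbers
```

With these numbers the remaining LP3/1 cycle is 3 evaluations plus 1 completion detection, which matches the 4 events per cycle quoted for LP3/1.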

  14. LP3/1: Inside a Stage Merging 2 control inputs: • PC (from stage N+1) • Eval (from stage N+2) Timing issues: must satisfy several simple constraints • Ex.: PC must arrive before Eval is de-asserted • 1-sided timing requirement, easily satisfied in practice [Figure: NAND gate merging PC and Eval; labels: “old Eval”, “early Eval”]
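
One way to read the merged control is as a simple Boolean rule: stage N is held in precharge only while stage N+1 reports done and stage N+2 does not. The sketch below models that reading. The function name `lp31_precharging` and the encoding of the two done signals as Booleans are assumptions made for illustration; the gate-level polarity in the actual design may differ.

```python
def lp31_precharging(done_n1: bool, done_n2: bool) -> bool:
    """Return True while stage N is held in precharge.

    done_n1: "done" from stage N+1 (acts as the PC request).
    done_n2: "done" from stage N+2 (acts as the early Eval request).
    """
    return done_n1 and not done_n2


# A legal sequence of (done_n1, done_n2) values seen by stage N.
legal_trace = [
    (False, False),  # nothing downstream is done: N may evaluate
    (True, False),   # N+1 finished evaluating: N precharges
    (True, True),    # N+2 finished evaluating: precharge ends early
    (False, True),   # N+1 has precharged: PC request withdrawn first
    (False, False),  # N+2 has precharged: Eval request withdrawn last
]
phases = ["precharge" if lp31_precharging(a, b) else "evaluate"
          for a, b in legal_trace]

# If Eval were withdrawn while PC is still asserted, i.e. a step from
# (True, True) to (True, False), the stage would precharge a second time.
# This is the one-sided timing constraint mentioned on the slide.
spurious_precharge = lp31_precharging(True, False)
```

In the legal trace the stage precharges exactly once per cycle; the last two lines show why the order in which the two requests are withdrawn matters.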

  15. Dual-Rail Design #2: LP2/2 Optimization = “early done” • Idea: move completion detector before processing block • stage indicates “early” when it is “about to” precharge/evaluate [Figure: “early” completion detector placed ahead of the processing block; labels: data in, data out, “early done”]

  16. LP2/2 Completion Detector Modified completion detectors needed: • Done=1 when stage starts evaluating, and inputs valid • Done=0 when stage starts precharging • asymmetric C-element [Figure: one OR gate per dual-rail bit (bit0, bit1, ..., bitn) feeding an asymmetric C-element together with PC; output Done]
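
The behavior listed on this slide can be sketched as a small behavioral model. This is only an approximation under stated assumptions: dual-rail bits are modeled as `(rail0, rail1)` tuples, the stage's phase is passed in as a Boolean `evaluating`, and the names `bit_valid` and `next_done` are hypothetical.

```python
from typing import Sequence, Tuple

DualRailBit = Tuple[bool, bool]  # (rail0, rail1); both low means "no data yet"


def bit_valid(bit: DualRailBit) -> bool:
    """One OR gate per bit: valid as soon as either rail is high."""
    return bit[0] or bit[1]


def next_done(done: bool, evaluating: bool,
              inputs: Sequence[DualRailBit]) -> bool:
    """Behavioral model of the asymmetric C-element producing Done."""
    if not evaluating:
        return False  # stage starts precharging: Done falls at once
    if all(bit_valid(b) for b in inputs):
        return True   # stage evaluating and every input bit valid: Done rises
    return done       # otherwise hold the previous value


empty = [(False, False), (False, False)]
valid = [(True, False), (False, True)]

d0 = next_done(False, True, empty)  # evaluating, inputs not yet valid
d1 = next_done(d0, True, valid)     # inputs arrive
d2 = next_done(d1, True, empty)     # inputs reset while still evaluating
d3 = next_done(d2, False, valid)    # stage starts precharging
```

The asymmetry shows in `d2` and `d3`: the data inputs take part only in the rising transition, while the falling transition depends on the precharge control alone.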

  17. LP2/2 Protocol Completion detection: performed in parallel with evaluation/precharge of stage [Figure: stages N, N+1, N+2 with numbered events 1–4; labels: N evaluates, N+1 evaluates, “early done” of N+1 eval, “early done” of N+2 eval, “early done” of N+1 prech]

  18. LP2/2 Performance Cycle Time = [equation not preserved in transcript] LP2/2 savings over PS0: 1 Evaluation + 1 Precharge [Figure: event diagram with events 1–4]
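
As a rough cross-check of the stated savings, assume (this is not given in the transcript) that the PS0 cycle consists of three evaluations, two completion detections and one precharge, matching its 6 events per cycle. Removing one evaluation and one precharge then leaves a 4-event cycle. The symbols $t_{\text{Eval}}$, $t_{\text{CD}}$ and $t_{\text{Prech}}$ are introduced here for illustration.

```latex
\begin{align*}
T_{\text{PS0}}   &= 3\,t_{\text{Eval}} + 2\,t_{\text{CD}} + t_{\text{Prech}}
                    \quad \text{(assumed)} \\
T_{\text{LP2/2}} &\approx T_{\text{PS0}} - \left(t_{\text{Eval}} + t_{\text{Prech}}\right)
                  = 2\,t_{\text{Eval}} + 2\,t_{\text{CD}}
\end{align*}
```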

  19. Dual-Rail Design #3: LP2/1 Hybrid of LP3/1 and LP2/2. Combines: • early evaluation of LP3/1 • early done of LP2/2 Cycle Time = [equation not preserved in transcript]

  20. Lookahead Pipelines: Overview 5 New Designs: • “Dual-Rail” Data Signaling: • LP3/1: “early evaluation” • LP2/2: “early done” • LP2/1: “early evaluation” + “early done” • “Single-Rail” Bundled-Data Signaling: • LPSR2/2: “early done” • LPSR2/1: “early evaluation” + “early done”

  21. Single-Rail Design: LPSR2/1 Derivative of LP2/1, adapted to single-rail: • bundled-data: matched delays instead of completion detectors • “Ack” to previous stages is “tapped off early” • once in evaluate (precharge), dynamic logic insensitive to input changes [Figure: pipeline stages, each with a matched delay element]

  22. Inside an LPSR2/1 Stage • “done” generated by an asymmetric C-element (aC) • done=1 when stage evaluates, and data inputs valid • done=0 when stage precharges • PC and Eval are combined exactly as in LP3/1 [Figure: stage circuit; labels: PC (from stage N+1), Eval (from stage N+2), NAND, aC, matched delay, “req” in, “req” out, “ack”, done, data in, data out]

  23. LPSR2/1 Protocol Cycle Time = [equation not preserved in transcript] [Figure: stages N, N+1, N+2 with numbered events 1–3; labels: N evaluates, N+1 evaluates, N+2 evaluates, N+1 indicates “done”, N+2 indicates “done”]

  24. FIFO Results (simulations) • LP dual-rail: over 80% faster than Williams’ PS0, with comparable latency • LP single-rail: even faster • Simulation conditions: 0.19 CMOS, 3.3 V, 300 K [Figure: results for dual-rail and single-rail designs]

  25. Practicality of Gate-Level Pipelining When datapath is wide: • Can often split into narrow “streams” • Use “localized” completion detector for each stream: • need to examine only a few bits → small fan-in • send “done” to only a few gates → small fan-out • comp. det. fairly low cost! [Figure: datapath width = 32 dual-rail bits; completion detector with fan-in = 2 and “done” fan-out = 2]
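
The idea of localized completion detection can be sketched as follows. This is an illustrative model, not the circuit from the slides: the 32-bit width and stream width of 2 are taken from the figure labels, while the function names `split_streams` and `stream_done` and the tuple encoding of dual-rail bits are assumptions.

```python
from typing import List, Sequence, Tuple

DualRailBit = Tuple[bool, bool]  # (rail0, rail1)


def split_streams(bits: Sequence[DualRailBit],
                  width: int) -> List[Sequence[DualRailBit]]:
    """Partition the datapath into consecutive streams of `width` bits."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def stream_done(stream: Sequence[DualRailBit]) -> bool:
    """Local completion detector: every bit of this stream is valid."""
    return all(r0 or r1 for r0, r1 in stream)


datapath = [(True, False)] * 32             # 32 dual-rail bits, all valid
streams = split_streams(datapath, width=2)  # narrow streams of 2 bits each
done_signals = [stream_done(s) for s in streams]
```

Each detector examines only the bits of its own stream, so its fan-in stays at the stream width rather than growing with the full datapath width.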

  26. High-Capacity Pipelines Singh/Nowick WVLSI-00, ISSCC-02, Async-02

  27. HC Pipeline Style High-Capacity Pipelines (HC) • bundled datapaths; dynamic logic function blocks • latch-free: no explicit latches needed • dynamic logic provides implicit latching • novel highly-concurrent protocol maximizes storage capacity • traditional latch-free approaches: “spacers” limit capacity to 50% Key Idea: Obtain greater control of stage’s operation • separate control of pull-up/pull-down • result = new “isolate phase” • stage holds outputs/impervious to input changes Advantage: Each stage can hold a distinct data item • 100% storage capacity Extra Benefit: Obtain greater concurrency → High throughput

  28. HC: Basic Structure Key Idea: 2 independent control signals: • pc: controls precharge • eval: controls evaluation Allows novel 3-phase cycle: • Evaluate • “Isolate” (hold) • Precharge Single-rail “Bundled Datapath”: • matched delay: produces delayed “done” signal • worst-case delay: longer than slowest path for data [Figure: stages N, N+1, N+2, each with a stage controller and a matched delay; labels: pc, eval, ack]
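
The bundled-data requirement on the matched delay can be written as a one-line check. The sketch below is a minimal illustration; the function name `matched_delay_is_safe`, the optional `margin_ps` parameter and all delay values are assumptions, not figures from the slides.

```python
from typing import Iterable


def matched_delay_is_safe(data_path_delays_ps: Iterable[float],
                          matched_delay_ps: float,
                          margin_ps: float = 0.0) -> bool:
    """Check that "done" cannot arrive before the slowest data path settles."""
    slowest = max(data_path_delays_ps)
    return matched_delay_ps > slowest + margin_ps


paths_ps = [310.0, 285.0, 342.0]  # made-up delays of three data paths
safe = matched_delay_is_safe(paths_ps, matched_delay_ps=360.0)
unsafe = matched_delay_is_safe(paths_ps, matched_delay_ps=340.0)
```

A designer would typically add some margin for process variation, which is what the hypothetical `margin_ps` argument stands in for.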

  29. HC: Inside a Stage Independent control of pull-up and pull-down: • allows new 3rd phase: “isolate” • pc asserted: precharge • eval asserted: evaluate • pc and eval de-asserted: enter “isolate” (hold) phase [Figure: dynamic gate with inputs, outputs and a “keeper”; pc controls precharge, eval controls evaluation]

  30. HC: Protocol Most existing protocols: 3 synchronization arcs • 1 forward arc: data dependency • 2 backward arcs: control synchronization Our protocol: only 2 synchronization arcs • only 1 backward arc • once stage N+1 evaluates, N can complete entire next cycle! Phases of a stage: • Eval: pc=1, eval=1 • Isolate: pc=1, eval=0 • Precharge: pc=0, eval=0 [Figure: phase cycles (Eval, Isolate, Precharge) of stage N and stage N+1 with synchronization arcs; one arc marked X]
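
The three phases and their control encodings can be captured in a small lookup. The sketch below uses the signal values listed on this slide, where pc=0 selects precharge (so pc is treated as active-low here). The names `Phase` and `hc_phase`, and the choice to reject the remaining combination pc=0, eval=1 as invalid, are assumptions made for illustration.

```python
from enum import Enum


class Phase(Enum):
    EVALUATE = "evaluate"
    ISOLATE = "isolate"
    PRECHARGE = "precharge"


def hc_phase(pc: int, ev: int) -> Phase:
    """Map the two control signals of an HC stage to its current phase."""
    table = {
        (1, 1): Phase.EVALUATE,   # pull-down enabled, pull-up off
        (1, 0): Phase.ISOLATE,    # both off: outputs held, inputs ignored
        (0, 0): Phase.PRECHARGE,  # pull-up enabled, pull-down off
    }
    if (pc, ev) not in table:
        # pc=0, eval=1 would enable pull-up and pull-down together.
        raise ValueError(f"invalid control combination pc={pc}, eval={ev}")
    return table[(pc, ev)]


# One full cycle of a stage: evaluate, then isolate, then precharge.
cycle = [hc_phase(1, 1), hc_phase(1, 0), hc_phase(0, 0)]
```

Only one signal changes between consecutive phases in this cycle, which is consistent with each control being driven independently.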

  31. Formal Specification of Controller [Figure: signal transition graph; transitions: pc+ and eval+ (start evaluate), T+ (evaluate of N+1 complete), S+ (evaluate complete), eval- (isolate), pc- (start precharge), T- (precharge of N+1 complete), S- (precharge complete)] Problem: Specification too concurrent for direct synthesis • desired precharge condition: N and N+1 have evaluated same data • problem: this condition not uniquely captured by given signals! • N may evaluate next data item, while N+1 stuck on current item!

  32. Modified Specification of Controller Solution: Add a state variable ok2pc • ok2pc records whether N+1 has “absorbed” N’s data item • ok2pc resets immediately when N deletes item (N precharges) • ok2pc is set when N+1 deletes item (N+1 precharges) [Figure: signal transition graph extended with ok2pc; transitions: pc+, eval+, T+ (evaluate of N+1 complete), S+, eval-, T- (precharge of N+1 complete), pc-, ok2pc+, S-, ok2pc-]

  33. Controller implementation Controller implementation is very simple: • each signal implemented using a single gate • ok2pc typically off the critical path [Figure: pc generated by a NAND3 gate, ok2pc by an asymmetric C-element (aC), eval by an inverter (INV); gate inputs labeled S and T]
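
The role of the ok2pc state variable can be illustrated with an event-level model. This is a sketch under assumptions, not the synthesized controller: S and T are taken to mean "stage N has evaluated" and "stage N+1 has evaluated", the precharge condition is modeled as all three of S, T and ok2pc being high (a reading of the 3-input gate), ok2pc is assumed to start set in an empty pipeline, and the class and method names are hypothetical.

```python
class HCStageControl:
    """Event-level model of stage N's precharge decision (illustrative)."""

    def __init__(self) -> None:
        self.S = False     # stage N has evaluated
        self.T = False     # stage N+1 has evaluated
        self.ok2pc = True  # assumption: set when the pipeline is empty

    def can_precharge(self) -> bool:
        # N may precharge only if N+1 has absorbed N's current item.
        return self.S and self.T and self.ok2pc

    def n_evaluates(self) -> None:
        self.S = True

    def n_precharges(self) -> None:
        self.S = False
        self.ok2pc = False  # N deleted its item: reset immediately

    def n1_evaluates(self) -> None:
        self.T = True

    def n1_precharges(self) -> None:
        self.T = False
        self.ok2pc = True   # N+1 deleted its item: set again


c = HCStageControl()
c.n_evaluates()              # N evaluates item A
c.n1_evaluates()             # N+1 evaluates item A
first = c.can_precharge()    # A has been absorbed: N may precharge
c.n_precharges()
c.n_evaluates()              # N evaluates item B; N+1 still holds A
blocked = c.can_precharge()  # S and T are both high, but for different items
c.n1_precharges()            # N+1 deletes A
c.n1_evaluates()             # N+1 evaluates B
released = c.can_precharge()
```

Without ok2pc, the `blocked` step would look identical to `first` (S and T both high), which is the ambiguity the state variable is introduced to resolve.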

  34. HC: Stage Implementation [Figure: stage circuit; annotations: state variable is off the critical path; self-loop from current stage is key to fast “isolation”; early ack from next stage; labels: aC, NAND, INV, eval, pc, ack, req, done, delay]

  35. HC: Operation Cycle Time = 8 CMOS gate delays [Figure: stages N and N+1 with numbered events 1–3; labels: N evaluates, N+1 starts to evaluate, N isolates, N precharges, N enables itself for next evaluation, fast self-loop (twice), early Ack]

  36. Performance Cycle Time = [equation not preserved in transcript] [Figure: stages N, N+1, N+2 with numbered events 1–3; labels: N evaluates, N+1 evaluates, N isolates, N precharges, N enables itself for next evaluation]

  37. FIFO Results (simulations) • LP dual-rail: over 80% faster than Williams’ PS0, with comparable latency • LP single-rail: even faster • Simulation conditions: 0.19 CMOS, 3.3 V, 300 K [Figure: results for dual-rail and single-rail designs]

  38. Fabricated Chip: HC FIFO • 2.5 GHz in 0.18 µm
