1 / 51

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication. Amitava Mitra Intel Corp., Bangalore, India William F. McLaughlin Columbia University, Electrical Engineering Steven M. Nowick Columbia University, Computer Science. Outline.

ryanadan
Download Presentation

Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Efficient Asynchronous Protocol Converters for Two-Phase Delay-Insensitive Global Communication Amitava Mitra Intel Corp., Bangalore, India William F. McLaughlin Columbia University, Electrical Engineering Steven M. Nowick Columbia University, Computer Science

  2. Outline • Motivation and Contribution • System-on-Chip: Concepts and Trends • Asynchronous Signaling Styles • Target Asynchronous SOC Architecture • Contribution • Proposed System Architecture • Experimental Results • Extensions: Other Signaling Styles • Conclusions and Future Work

  3. System-on-Chip (SOC): Concept and Trends • Microelectronic trends enabling SOC design • Increasing integration density + chip size • Formerly discrete functions (memory, I/O) now integrated • Popularity of “multi-core” designs • Heterogeneous SOC: • Large complex chip with broad functionality • Many independent computation nodes • Multiple cores, memories, accelerators, multimedia processing, etc. • Often includes multiple timing domains • Complex network-style interconnect fabric • Challenges in Heterogeneous SOC design: • Wire costs not scaling down with device size • Increasing proportion of power and delay in interconnect • Robust and high-performance interconnect design: • High latencies between remote nodes • Mixed timing, timing variability/uncertainty • Need to support varied components: modular/scalable design

  4. SOC Communication Fabric • Growing factor in overall system performance • Ideal Requirements: • Speed: high throughput, low latency • Low power • Robust to timing variations • Flexibility: integrate modular IPs and upgrades • Asynchronous design well-suited to these goals • Timing robust flexible designs • Lower power than synchronous • Work by Quinton, Greenstreet, and Wilton [ICCD 2005] • GALS-style: • global LEDR interconnect + local synchronous blocks • does not provide details of protocol converters

  5. Asynchronous for SOC Communication • Advantages of asynchronous global communication • Delay-insensitive (DI) encoding • Removes timing constraints on global routing • No clock signals to route across chip • Significant power advantage • Can support both async + sync computation • Delay-insensitive async logic combats growing variability concerns • GALS style: Globally-Asynchronous Locally-Synchronous • Several popular async signaling protocols • Dual rail four-phase, LEDR, 1-of-4, bundled data, others • No single protocol ideal for both logic and communication

  6. Background: LEDR Signaling • Dual-rail encoding: two wires per bit – delay-insensitive • “Level-encoding”: • Data rail: holds actual data value • Parity rail: holds parity value • Alternating-phase protocol: • Encoding parity alternates between odd and even Bit value LEDR Encoding data rail parity rail Phase

  7. LEDR Signaling • Exactly one wire transition for each new data item Data rail: carries bit value in both phases 0 1 0 0 1 1 1 data parity even odd even odd even odd even Parity rail: phase alternates with each data item

  8. Four-Phase Dual-Rail Signaling • Alternative DI Code • Key Differences: • Four-phase (Return-to-Zero) protocol • Spacer (reset) state required between each data item • One-hot encoding: • True rail (encodes 1) & false rail (encodes 0) 1 0 1 1 Data values True rail False rail Evaluation (one rail high) Reset (both rails low)

  9. Four-Phase Dual-Rail vs. LEDR • Advantages of four-phase dual-rail: • Delay-insensitive logic using standard gates • Implementations are simple and fast: widely used • LEDR: complex & impractical • Disadvantages of four-phase dual-rail: • System-level communication throughput: • Spacer state doubles round-trip communication latency • LEDR: no spacer required • Power dissipation: • Two transitions/bit (up and down) for each data item • LEDR: only one transition/bit • Conclusion: • Four-phase dual-rail better for implementing function blocks • LEDR is better for global communication

  10. Target Asynchronous SOC Architecture • Three major components: • Global communication network (LEDR) • Local computation nodes (varied styles) • New requirement: protocol converters at interfaces • Allow full separation of computation and communication Our goal – Protocol converters to enable this global LEDR SOC

  11. Contribution • High-speed protocol converters to enable heterogeneous SOC architectures • Supports high-throughput, robust global communication • LEDR encoding • Supports efficient design of local function blocks • (i) 4-phase dual-rail, (ii) 1-of-4, (iii) single-rail bundled data • Features: • Family of low-latency protocol converters: • support above 3 local encoding styles • High throughput: • facilitates concurrent interaction of nodes • Timing-robust: • converters almost entirely QDI • Low design effort: • standard cell design flow • Fully implemented in 0.18 μm CMOS • Layout and simulation • FIFO throughputs up to 250 MHz

  12. Two Target SOC Topologies 1. “Pipeline-style” topology • Feed-forward data path: • uni-directional token flow • Receiving node returns a single ACK (control signal) • Supports concurrency between nodes Data feeds forward Acknowledge sent back

  13. Two SOC Topologies (cont.) 2. “Server-style” topology • Client passes data token to server • Server computes/returns data token to client (result) • Explicit ACK unnecessary • Proposed SOC architecture supports both topologies Four-phase server Four-phase data client Bi-directional data flow: data passed back to client on completion

  14. Outline • Motivation and Contribution • Proposed System Architecture • Architecture Overview • System Simulation • Detailed Hardware Implementation • Timing Analysis • Experimental Results • Extensions: Other Signaling Styles • Conclusions and Future Work

  15. Architecture Overview Four-phase core • External LEDR interface, internal four-phase core • Four-phase signals are shown in red • Two-phase or transition signals are shown in yellow LEDR input LEDR output

  16. Control Signals • Two-phase control signals Phase of LEDR input (request from left) Phase of LEDR output (forward complete) Acknowledge to left neighbor Acknowledge from right neighbor

  17. Control Signals • Four-phase control signals Completion detect four-phase evaluate and RZ Enable four-phase evaluate and RZ

  18. System Simulation • LEDR inputs begin arriving at quiescent system LEDR inputs arrive Completion detection

  19. System Simulation • Input completion detection sent to control All input phases matching Transition to new phase

  20. System Simulation • Control enables four-phase evaluate phase Enable rises

  21. System Simulation • LEDR input converted to four-phase Enable now high One wire of each four-phase pair rises

  22. System Simulation • Four-phase function evaluation

  23. System Simulation • Four-phase bits decoded to LEDR • Each bit converted as soon as it computes LEDR outputs to next node generated Four-phase complete not used in evaluate phase

  24. System Simulation • LEDR output completion detection Output pairs ACK from right may come any time after all pairs are sent

  25. System Simulation • Control enables four-phase reset phase Enable falls

  26. System Simulation • Function block inputs return-to-zero • ACK is sent concurrently to left Enable now low Pipeline concurrency: request new data during reset phase

  27. System Simulation • Four-phase reset propagates through logic block New data may arrive now that ACK has been sent Reset Completion detection Enable remains low

  28. System Simulation • Four-phase reset completes • Complete internal cycle has now been performed Complete falls

  29. System Simulation • New evaluate phase begins when Enable rises again • Pre-conditions: reset finished, new data REQ, and old data ACK Three-way synchronization Input phase transitions when new data ready ACK transitions when outputs safe to change Complete low (means reset finished)

  30. Detailed Hardware Implementation • Each block implemented in CMOS standard cells • Design has few non-QDI timing constraints Four-phase core LEDR input LEDR output

  31. Four-phase Encode (Input Converter) • Converts LEDR input to four-phase dual-rail • Enable=‘1’: outputs evaluate based on LEDR data • Enable=‘0’: outputs reset (LEDR data blocked)

  32. Four-phase Decode (Output Converter) • Converts four-phase bits to LEDR output • LEDR data rail encoding • Assert either S (1 value) or R (0 value), then return-to-hold • More robust alternative: C-element

  33. Four-phase Decode (Output Converter) • Converts four-phase bits to LEDR output • LEDR parity rail encoding • Parity output: based on 4-phase data and LEDR input phase (parity) • Alternating phases: green vs. red gates • D-latch: blocks new input parity arrival until 4-phase reset complete even phase odd phase

  34. 1-Bit Completion Detectors • LEDR CD at input and output • Four-phase CD in function block • Both protocols have one gate CD • XOR (parity) for LEDR • OR for four-phase dual-rail 1-bit LEDR completion detector 1-bit four-phase completion detector

  35. N-Bit Completion Detectors • C-element trees • Used for both LEDR and four-phase • C-element: standard cell implementation (AOI222 w/feedback)

  36. Control Block • Main Purpose: controls 4-phase function block • 4-phase eval requires 3-way synchronization • Function block: previous RZ complete • Primary inputs: new data arrival • Right interface (in pipeline): ACK received • In pipeline topology: also sends left ACK For pipeline topology only

  37. Control Block • Converts two-phase inputs to four-phase outputs Two-phase to four-phase conversion

  38. Control Block: Signaling Conversion Pulse-mode (timed) Transition-signal (falling or rising ) Four-phase (level-sensitive) SR latch captures the pulse Inverter and XNOR form simple pulse gen

  39. Timing Requirements • Circuits almost entirely QDI • Exceptions: • Control block: • Two-sided timing constraint on length of pulse • Sensitive to both gate and wire delays • Careful layout required • Latches: simple hold time constraints • SR latches can be replaced by C-elements • C-elements also have implementation-specific timing constraints • SR latch much faster than our standard cell C-element • D latch can be removed at cost of concurrency

  40. Outline • Motivation and Contribution • Proposed System Architecture • Experimental Results • Design Methodology • Datapath Setup • Simulation Results • Latency and Throughput Analysis • Extensions: Other Signaling Styles • Conclusions and Future Work

  41. Design Methodology • Standard cell design flow with complete layout • 0.18 μm TSMC CMOS process • 4 metal layers of 7 available used in routing • Custom place-and-route used • Only major layout concern: pulse generator circuit • Design could be automated with constraints on pulse • Analog simulations: based on layout-extracted design • Test vectors including limiting fast and slow cases

  42. Datapath Implementation • Two function blocks implemented • An 8x8 carry-save multiplier • An empty FIFO stage • FIFO contains four-phase completion detector only • Demonstrates minimum possible node latency • Blocks are QDI in evaluate, but “eager” in reset • Implemented in combinational CMOS • “DIMS”-style logic (with C-elements) could be used instead • QDI in both directions • Increases both forward and reverse latencies

  43. Multiplier Layout • Includes dual rail multiplier and all conversion circuits • Total area of 0.051 mm2 • FIFO stage has area of 0.018 mm2

  44. Measured Block Latencies

  45. Performance Results 3 Metrics: • Forward Latency: input arrival  output data available • Average Values: Multiplier:6.8 ns; FIFO:2.9 ns. • Stabilization Time: input arrival  reset complete (circuit quiescent) • Multiplier:10.5 ns; FIFO:6.3 ns. • Pipelined Cycle Time: min processing time/data item (steady-state) • Multiplier:8.3 ns; FIFO4.0 ns.

  46. Performance Analysis • Forward latency: overhead • 2.2 ns for both nodes • Overhead independent of function block size • Includes: • LEDR CD, control unit, input/output converters • Throughput: increased by concurrency • Benefit: 2.2 ns reduction in cycle time (vs. post-reset ACK) • Savings achieved even in environment without channel latency • “Core converter” overhead (no CD) extremely low • Only 1.1 ns average latency for converters + control • Completion detectors: • Account for half of forward latency overhead • Account for 55% of FIFO cycle time • Faster CDs would provide big improvement

  47. Outline • Motivation and Contribution • Proposed System Architecture • Experimental Results • Extensions: Other Signaling Styles • Converters for 1-of-4 function blocks • Converters for bundled data function block • Conclusions and Future Work

  48. Extensions to Other Local Protocols • Only small changes to handle 1-of-4 or bundled data • No change to control block • 1-of-4 encoding: • Input/output converters: • Small changes to logic • Needs standard 1-of-4 completion detector • Single-rail bundled data: • Input converter: not needed – use LEDR data rail • Output converter: • New basic circuit required (see paper for details) • Function block completion detection: • Use bundled ‘done’ signal • Asymmetric delay chain (fast reset)

  49. Outline • Background and Motivation • Contribution • Proposed System Architecture • Experimental Results • Extensions: Other Signaling Styles • Conclusions and Future Work • Summary and Conclusion • Future Work

  50. Summary and Conclusions • Support heterogeneous SOCs using hybrid protocols • LEDR: low-power, delay-insensitive communication fabric • Dual rail four-phase: Simple, fast logic blocks • Designed Converters for LEDR/four-phase SOC: • Low latency, high throughput, timing robust design • Robust concurrency system developed • Exploits four-phase reset to mask communication time • Simulations with realistic mid-sized function nodes • Demonstrated low latency overhead • Demonstrated low area overhead • Achieved throughputs up to 250 MHz for FIFO stage

More Related