Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis

Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein. Department of Computer Science, Carnegie Mellon University.


Presentation Transcript


  1. Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein Department of Computer Science Carnegie Mellon University

  2. QDI: Orphans problem • Early propagation: • “A” arrives early => Z transitions • Stale values on the other signals • Incorrect behavior: inputs acknowledged before being received [Figure: dual-rail circuit with rails A1/A0, B1/B0, C1/C0, D1/D0, X1/X0, Y1/Y0, Z1/Z0]

  3. NCL-X solution • Add completion detection [Figure: the dual-rail circuit with gates N1, N2, N3 and added completion-detection logic producing Done]

  4. QDI Gate Delays • QDI implementations always assume the worst: equal probability for any gate delay

  5. Motivation • Quasi-Delay Insensitive (QDI) circuits: • One timing constraint • Naturally tolerate parametric variation, but… • Have large area overheads • Added completion detection for correctness

  6. Parametric Variation and Gate Delays • ITRS’05: 35% parametric variation by 2020 • Goal: pay only what is necessary

  7. Goal: Optimizing Sync→Async Flow • Use timing information to reduce size of completion detection • Use mixed gates to further reduce area [Figure labels: regular gates (w/ early propagation), strict gates (w/o early propagation)]

  8. Contributions Three new relative-timing area optimizations: • Direct method: • Timing analysis + simple CD elimination • Greedy method: fast but not optimal • Uses strict gates, but may increase area • Exact method: optimal, but slow • Solves an mILP problem

  9. Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

  10. Basics • QDI circuits: • Unbounded but finite delays on gates and wires • One timing assumption: isochronic fork • Timed circuits: • Delays on gates and wires: bounded time intervals • Given input arrival times: compute propagation intervals for each gate and wire

  11. Timing Computation • Conservative assumption: any input change can trigger an output change [Figure: circuit with inputs A, B, C, D, gates N1, N2, N3, signals X, Y and output Z, annotated with (min,max) propagation intervals, e.g. (1.5,1.9) and (2.0,5.6), and the GlobalPI]
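The interval computation described on this slide can be sketched as below. This is a minimal illustration, not the authors' tool: the function name `propagate`, the data layout, and all delay numbers are hypothetical, and the rule used (earliest input opens the window, latest input closes it) is one reading of the "any input change can trigger an output change" assumption.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (earliest, latest) transition time


def propagate(
    gates: Dict[str, Tuple[List[str], Interval]],
    order: List[str],
    arrivals: Dict[str, Interval],
) -> Dict[str, Interval]:
    """Compute a propagation interval for every gate output.

    gates maps an output name to (input names, (min_delay, max_delay)).
    order lists the gate outputs in topological order.
    """
    pi = dict(arrivals)
    for out in order:
        inputs, (dmin, dmax) = gates[out]
        # Conservative: any single input change may trigger the output,
        # so the earliest input opens the window and the latest closes it.
        earliest = min(pi[i][0] for i in inputs) + dmin
        latest = max(pi[i][1] for i in inputs) + dmax
        pi[out] = (earliest, latest)
    return pi


# Hypothetical delays; the topology mirrors the slide's signal names only.
arrivals = {name: (0.0, 0.0) for name in "ABCD"}
gates = {
    "X": (["A", "B"], (1.0, 2.0)),
    "Y": (["C", "D"], (1.0, 1.5)),
    "Z": (["X", "Y"], (0.5, 1.0)),
}
result = propagate(gates, ["X", "Y", "Z"], arrivals)
print(result["Z"])  # interval at the primary output
```

With a single primary output, the interval at Z plays the role of the GlobalPI in this sketch.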

  12. Direct Optimization Method • Gate completion detection iff gate may not be stable when outputs are produced • Under any input change, gate quiescent when output produced: 1.9 < 2.0 [Figure: the annotated circuit with a Done output; intervals include (1.5,1.9) and (2.0,5.6)]
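The rule on this slide reduces to one comparison per gate, sketched below. The gate names `g1` and `g2` are hypothetical; the intervals are taken from the slide, but treating (2.0,5.6) as the GlobalPI is an assumption based on the "1.9 < 2.0" annotation.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (earliest, latest) transition time


def gates_needing_cd(gate_pi: Dict[str, Interval], global_pi: Interval) -> List[str]:
    """Return the gates that may still be switching when an output is produced."""
    earliest_output = global_pi[0]
    # A gate is safe only if its latest transition strictly precedes
    # the earliest possible output transition.
    return [g for g, (_, latest) in gate_pi.items() if not latest < earliest_output]


# g1 settles by 1.9 < 2.0, so it needs no completion detection; g2 does.
gate_pi = {"g1": (1.5, 1.9), "g2": (3.6, 4.9)}
global_pi = (2.0, 5.6)
needing = gates_needing_cd(gate_pi, global_pi)
print(needing)
```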

  13. Strict Gates • All inputs must arrive before producing an output • Eliminate early propagation effect • Extremely expensive • Decrease length of propagation interval [Figure: strict gate on inputs A, B; three elements labeled C]
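A small sketch of why a strict gate shortens the propagation interval. The two formulas are an assumed timing model rather than one stated on the slides: a regular gate may fire on its earliest input, while a strict gate cannot fire before its last input has arrived. Function names and numbers are illustrative.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (earliest, latest) transition time


def regular_interval(inputs: List[Interval], delay: Interval) -> Interval:
    """Early propagation allowed: the earliest input may trigger the output."""
    dmin, dmax = delay
    return (min(lo for lo, _ in inputs) + dmin, max(hi for _, hi in inputs) + dmax)


def strict_interval(inputs: List[Interval], delay: Interval) -> Interval:
    """All inputs must arrive first, so the window opens at the latest 'earliest'."""
    dmin, dmax = delay
    return (max(lo for lo, _ in inputs) + dmin, max(hi for _, hi in inputs) + dmax)


inputs = [(1.0, 2.0), (3.0, 4.0)]  # hypothetical input arrival intervals
delay = (0.5, 1.0)                 # hypothetical gate delay bounds
regular = regular_interval(inputs, delay)
strict = strict_interval(inputs, delay)
print(regular, strict)
```

In this model the strict gate's interval is both narrower and later, which matches the "narrower and delayed" remark made about GlobalPI on the Greedy Optimization (1) slide.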

  14. Timing Computation with Strict Gates • Entire completion detection: single OR gate • This circuit: area not reduced • Goal: smart insertion of strict gates [Figure: the circuit recomputed with strict gates; annotated intervals include (5.0,6.8), with a Done output]

  15. Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

  16. Greedy Optimization (1) • Strict gates: area implications • GlobalPI may be narrower and delayed • Fewer gates non-quiescent • Smaller completion detection • Greedy optimization framework: • Flip gates in the circuit from normal to strict • Select most promising candidate • Continue until no improvements possible

  17. Greedy Optimization (2) Algorithm: 1. For each gate Gi in the circuit: • Flip Gi from regular to strict • Perform timing analysis, compute GlobalPIi • Flip Gi back to regular 2. Select Gk with the narrowest GlobalPIk 3. If GlobalPIk is narrower than the previous best: flip Gk to strict permanently and continue (goto 1); else: finish
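The greedy loop can be sketched as follows. This is an illustration under assumptions: `analyze` stands in for the timing analysis and returns the GlobalPI for a given set of strict gates, and `toy_analyze` is a made-up substitute so the example runs; neither comes from the presentation.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Interval = Tuple[float, float]  # (earliest, latest) transition time


def width(pi: Interval) -> float:
    return pi[1] - pi[0]


def greedy_strict_selection(
    gates: Iterable[str],
    analyze: Callable[[FrozenSet[str]], Interval],
) -> Set[str]:
    """Repeatedly make strict the gate that narrows GlobalPI the most."""
    all_gates = list(gates)
    strict: Set[str] = set()
    best = width(analyze(frozenset(strict)))
    while True:
        candidates = [g for g in all_gates if g not in strict]
        if not candidates:
            return strict
        # Trial-flip each remaining gate; the flip is undone implicitly
        # because `strict` itself is not modified here.
        trials = {g: width(analyze(frozenset(strict | {g}))) for g in candidates}
        pick = min(trials, key=lambda g: trials[g])
        if trials[pick] < best:
            strict.add(pick)  # keep this flip permanently
            best = trials[pick]
        else:
            return strict     # no candidate improves on the previous best


def toy_analyze(strict: FrozenSet[str]) -> Interval:
    # Hypothetical stand-in: strict N3 and N1 delay the start of GlobalPI.
    lo = 2.0 + (1.0 if "N3" in strict else 0.0) + (0.5 if "N1" in strict else 0.0)
    return (lo, 6.0)


chosen = greedy_strict_selection(["N1", "N2", "N3"], toy_analyze)
print(sorted(chosen))
```

Each outer iteration runs one timing analysis per remaining gate, so the cost grows roughly quadratically with the number of gates in the worst case.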

  18. Greedy Optimization (3) • Algorithm does not optimize for area directly • Instead: may reduce the completion detection by narrowing the output interval • Results promising, but individual benchmarks may result in larger area

  19. Outline • Timing analysis & Direct Method • Greedy optimization method • Exact optimization method • Results • Conclusions

  20. Exact Optimization Method • mixed Integer Linear Programming (mILP) • Transform circuit graph into an optimization problem: • Introduce variables for each gate, wire and primary input/output • Matrix coefficients: from library (gate areas) and back-annotation (gate/wire delays) files • Decision variables (GS): should a gate be strict?

  21. mILP formulation • Minimize: TotalArea = GateArea + CDArea • GateArea = Σ_i (GS_i·SArea_i + (1-GS_i)·NArea_i) • CDArea = SCD·Or2Area + (SCD-1)·CArea • SCD: # gates that need completion detection • NeedsCD: does a gate need CD? • NeedsCD = 0 if PI_M < GlobalPI_m or successor is strict; otherwise 1 • Rest of the model implements timing computation
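To make the objective concrete, the sketch below evaluates TotalArea for one fixed assignment of the decision variables. It only evaluates the cost function and does not solve the mILP. SArea and NArea are read here as the strict and normal gate areas, all numbers are invented, and the guard for SCD = 0 is an added assumption (the slide formula would otherwise yield a negative C-element term).

```python
from typing import Dict


def total_area(
    strict: Dict[str, bool],    # GS_i: is gate i strict?
    needs_cd: Dict[str, bool],  # NeedsCD_i: does gate i need completion detection?
    s_area: Dict[str, float],   # SArea_i: area of the strict version
    n_area: Dict[str, float],   # NArea_i: area of the normal version
    or2_area: float,
    c_area: float,
) -> float:
    """Evaluate TotalArea = GateArea + CDArea for a given assignment."""
    gate_area = sum(s_area[g] if strict[g] else n_area[g] for g in strict)
    scd = sum(1 for g in needs_cd if needs_cd[g])
    # Assumption: with no gates to detect, the CD network costs nothing.
    cd_area = scd * or2_area + (scd - 1) * c_area if scd > 0 else 0.0
    return gate_area + cd_area


gates = ["N1", "N2", "N3"]
area = total_area(
    strict={"N1": False, "N2": False, "N3": True},
    needs_cd={"N1": True, "N2": True, "N3": False},
    s_area={g: 5.0 for g in gates},
    n_area={g: 2.0 for g in gates},
    or2_area=1.0,
    c_area=3.0,
)
print(area)  # 9.0 for the gates plus 5.0 for completion detection
```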

  22. Improving the mILP Model • Basic mILP model: too slow even for small circuits (hours for dozen gates) • Leverage problem knowledge into model improvements: • Branching order: gates closer to the output are more likely to become strict => inspected first • Single input gates: never strict • Provide initial solution (result of greedy opt) • Can solve problems with hundreds of gates in minutes

  23. Related Work: Optimizations • Cortadella et al: • logical function decompositions • can achieve substantial area savings • can be the starting point for our methods • Zhou et al: consider strict gates in optimization, but no timing information • Sokolov et al: two timing optimizations • Alternate levels: unrealistic assumptions for gate delays • Longest path: applicable only for small circuits

  24. Experimental Setup • Tool flow: • Synthesis & tech-mapping with Synopsys Design Compiler • Perl scripts for dual-rail implementations • Optimization tool reads structural Verilog and timing back-annotations • End result: optimized circuits (Verilog) • Experiments: • Arithmetic and ISCAS’89 benchmarks • Pre-layout runs in 0.18µm technology

  25. Area: Ratio vs. NCL-X method • Direct: 0.83x • Greedy: 0.55x • mILP: 0.43x • Greedy: 2.83x NCL-X area for le32 • mILP does not finish in less than 1 hour: partial results

  26. Area breakdown • 8/168 strict • 4.7% before → 40% after • Less than half the size of NCL-X

  27. Parametric Variation: BK adder

  28. Conclusions • Paper introduced: • A method to translate synchronous circuits into optimized asynchronous circuits • Three new relative-timing optimizations for improving area • Direct: extremely simple • Greedy: fast, good results • Exact: optimal, may be extremely slow • Analyzed the impact of parametric variation on these circuits

  29. Backup slides

  30. Outline • Background • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

  31. Introduction • Future deep sub-micron technologies: • large parametric variations (ITRS’05 predicts 35% by 2020). • Asynchronous design a natural fit • Asynchronous handshaking: widespread • Acceptance for asynchronous circuits is predicated on quality CAD tools: • “Pure” async: from scratch • Sync to async translation

  32. Synchronous to Asynchronous Translation • Template-based replacement of each sync gate [Figure: synchronous circuit Z = (A·B)·(C+D) with gates N1, N2, N3 and internal signals X, Y, next to its dual-rail circuit with rails A1/A0, B1/B0, C1/C0, D1/D0, X1/X0, Y1/Y0, Z1/Z0]

  33. Related Work • Numerous approaches for translating synchronous circuits into asynchronous • Dealing with the orphans problem: • Kondratiev et al: NCL-X (discussed below) • Brej: anti-tokens • Allows for early propagation • Completion detection in background • Even larger area overheads

  34. ILP optimization for 32-bit BK adder • CrtSol: current best integer solution • Best Estimation: best guess of how far the optimum is; when 0, optimum found

  35. Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions

  36. Area breakdown • 8/168 strict • 4.7% before → 40% after • Less than half the size of NCL-X

  37. mILP Run Time
