Area Optimizations for Dual-Rail Circuits Using Relative-Timing Analysis. Tiberiu Chelcea, Girish Venkataramani, Seth C. Goldstein. Department of Computer Science, Carnegie Mellon University.
QDI: Orphans problem • Early propagation: “A” arrives early => Z transitions • Stale values on the other signals • Incorrect behavior: inputs acknowledged before being received [Figure: dual-rail circuit with input rails A0/A1, B0/B1, C0/C1, D0/D1, internal rails X0/X1, Y0/Y1, and output rails Z0/Z1]
NCL-X solution • Add completion detection [Figure: the dual-rail circuit (gates N1, N2, N3) extended with completion detection that generates Done]
QDI Gate Delays • QDI implementations always assume the worst: equal probability for any gate delay
Motivation • Quasi-Delay Insensitive (QDI) circuits: • One timing constraint • Naturally tolerate parametric variation, but… • Have large area overheads • Added completion detection for correctness
Parametric Variation and Gate Delays • ITRS’05: 35% parametric variation by 2020 • Goal: pay only for what is necessary
Goal: Optimizing Sync→Async Flow • Use timing information to reduce the size of completion detection • Use mixed gates to further reduce area: regular gates (with early propagation) and strict gates (without early propagation)
Contributions Three new relative-timing area optimizations: • Direct method: • Timing analysis + simple CD elimination • Greedy method: fast but not optimal • Uses strict gates, but may increase area • Exact method: optimal, but slow • Solves an mILP problem
Outline • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions
Basics • QDI circuits: • Unbounded but finite delays on gates and wires • One timing assumption: isochronic fork • Timed circuits: • Delays on gates and wires: bounded time intervals • Given input arrival times: compute propagation intervals for each gate and wire
Timing Computation • Conservative assumption: any input change can trigger an output change [Figure: example circuit (inputs A, B, C, D; gates N1, N2, N3; signals X, Y, Z) annotated with (min, max) propagation intervals and the global output interval GlobalPI]
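The interval computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' tool: wire delays are ignored, the gate list is given in topological order, and all delays and arrival times are made-up numbers. The names `regular_interval` and `analyze` are hypothetical.

```python
def regular_interval(inputs, delay):
    """Output interval of a regular gate.

    Conservative rule: any input change may trigger the output, so the output
    can start after the earliest input and can end after the latest input.
    """
    dmin, dmax = delay
    earliest = min(lo for lo, _ in inputs) + dmin
    latest = max(hi for _, hi in inputs) + dmax
    return (earliest, latest)

def analyze(circuit, arrivals):
    """Propagate intervals through `circuit`.

    circuit: list of (output name, input names, (dmin, dmax)), topologically ordered.
    arrivals: dict mapping primary-input names to (earliest, latest) arrival times.
    """
    intervals = dict(arrivals)
    for out, ins, delay in circuit:
        intervals[out] = regular_interval([intervals[n] for n in ins], delay)
    return intervals

# Hypothetical circuit with the structure X = f(A, B), Y = g(C, D), Z = h(X, Y).
circuit = [
    ("X", ("A", "B"), (0.5, 1.0)),
    ("Y", ("C", "D"), (0.5, 1.0)),
    ("Z", ("X", "Y"), (1.0, 2.0)),
]
arrivals = {"A": (0.0, 0.0), "B": (1.0, 1.5), "C": (0.0, 0.0), "D": (0.0, 0.0)}
intervals = analyze(circuit, arrivals)
print(intervals["Z"])  # (1.5, 4.5), the GlobalPI of this toy circuit
```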
Direct Optimization Method • A gate needs completion detection iff it may not be stable when the outputs are produced • Example from the figure: under any input change, the gate is quiescent when the output is produced (1.9 < 2.0), so it needs no completion detection [Figure: the example circuit annotated with propagation intervals and the Done signal]
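The elimination rule can be written as a one-line test on intervals. In this sketch the interval values are taken from the slide figure, GlobalPI is assumed to be (2.0, 5.6), and the gate names `g1` and `g2` are hypothetical because the figure labels cannot be matched to the values.

```python
def needs_cd(gate_interval, global_pi):
    """A gate keeps its completion detection unless it is certainly quiescent
    (its latest transition time) before the earliest possible output time."""
    _, gate_latest = gate_interval
    out_earliest, _ = global_pi
    return not (gate_latest < out_earliest)

def direct_method(intervals, global_pi):
    """Return the sorted names of the gates that still need completion detection."""
    return sorted(g for g, iv in intervals.items() if needs_cd(iv, global_pi))

global_pi = (2.0, 5.6)                           # assumed output interval
intervals = {"g1": (1.5, 1.9), "g2": (3.6, 4.9)}  # hypothetical gate names
print(direct_method(intervals, global_pi))       # ['g2']
```

Here `g1` settles at 1.9 at the latest, before the earliest output at 2.0, so only `g2` contributes to the completion-detection network.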
Strict Gates • All inputs must arrive before producing an output • Eliminate the early propagation effect • Decrease the length of the propagation interval • Extremely expensive [Figure: strict gate with inputs A and B, implemented with C-elements]
Timing Computation with Strict Gates • Entire completion detection: single OR gate • This circuit: area not reduced • Goal: smart insertion of strict gates [Figure: the example circuit with strict gates, annotated with the recomputed propagation intervals and the Done signal]
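The effect of a strict gate on the propagation interval can be sketched by changing only the lower bound of the interval rule. This assumes, for simplicity, that a strict gate has the same (dmin, dmax) delay as its regular counterpart, which a real library would not guarantee; the function names and numbers are hypothetical.

```python
def regular_interval(inputs, delay):
    """Regular gate: the earliest input may already trigger the output."""
    dmin, dmax = delay
    return (min(lo for lo, _ in inputs) + dmin, max(hi for _, hi in inputs) + dmax)

def strict_interval(inputs, delay):
    """Strict gate: the output waits until every input has arrived."""
    dmin, dmax = delay
    return (max(lo for lo, _ in inputs) + dmin, max(hi for _, hi in inputs) + dmax)

inputs = [(0.0, 0.0), (1.0, 1.5)]   # made-up input intervals
delay = (0.5, 1.0)                  # made-up gate delay
print(regular_interval(inputs, delay))  # (0.5, 2.5)
print(strict_interval(inputs, delay))   # (1.5, 2.5)
```

The strict interval starts later and is narrower, which is the property the optimizations exploit.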
Greedy Optimization (1) • Strict gates: area implications • GlobalPI may be narrower and delayed • Fewer gates non-quiescent • Smaller completion detection • Greedy optimization framework: • Flip gates in the circuit from normal to strict • Select most promising candidate • Continue until no improvements possible
Greedy Optimization (2) • Algorithm: (1) For each gate Gi in the circuit: flip Gi from regular to strict, perform timing analysis to compute GlobalPIi, then flip Gi back to regular (2) Select the gate Gk with the narrowest GlobalPIk (3) If GlobalPIk is narrower than the previous best: flip Gk to strict permanently and continue (go to 1); else: finish
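The loop above can be sketched in a few lines. This is an illustration, not the authors' implementation: it ignores wire delays, assumes strict and regular gates share the same delays, takes the interval of the last gate as GlobalPI, and breaks ties by gate order. All names and numbers are hypothetical.

```python
def analyze(circuit, arrivals, strict):
    """Return GlobalPI, taken here as the interval of the last gate in `circuit`."""
    iv = dict(arrivals)
    for out, ins, (dmin, dmax) in circuit:
        los = [iv[n][0] for n in ins]
        his = [iv[n][1] for n in ins]
        start = max(los) if out in strict else min(los)  # strict gates wait for all inputs
        iv[out] = (start + dmin, max(his) + dmax)
    return iv[circuit[-1][0]]

def width(interval):
    return interval[1] - interval[0]

def greedy(circuit, arrivals):
    """Flip gates to strict one at a time while GlobalPI keeps getting narrower."""
    strict = set()
    best = width(analyze(circuit, arrivals, strict))
    while True:
        candidates = [g for g, _, _ in circuit if g not in strict]
        if not candidates:
            return strict
        # Trial-flip every candidate and record the resulting GlobalPI width.
        trial = {g: width(analyze(circuit, arrivals, strict | {g})) for g in candidates}
        gk = min(trial, key=trial.get)
        if trial[gk] < best:
            strict.add(gk)       # keep the most promising flip
            best = trial[gk]
        else:
            return strict        # no candidate improves the interval

circuit = [
    ("X", ("A", "B"), (0.5, 1.0)),
    ("Y", ("C", "D"), (1.0, 1.5)),
    ("Z", ("X", "Y"), (1.0, 2.0)),
]
arrivals = {"A": (0.0, 0.0), "B": (1.0, 1.5), "C": (0.0, 0.0), "D": (0.0, 0.0)}
print(sorted(greedy(circuit, arrivals)))  # ['X', 'Z']
```

Each iteration runs one timing analysis per remaining gate, so the sketch performs a quadratic number of analyses in the worst case. It also never looks at area, only at the width of GlobalPI.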
Greedy Optimization (3) • Algorithm does not optimize for area directly • Instead: may reduce the completion detection by narrowing the output interval • Results promising, but individual benchmarks may result in larger area
Exact Optimization Method • Mixed Integer Linear Programming (mILP) • Transform the circuit graph into an optimization problem: • Introduce variables for each gate, wire, and primary input/output • Matrix coefficients: from library (gate areas) and back-annotation (gate/wire delays) files • Decision variables (GS): should a gate be strict?
mILP formulation • Minimize: TotalArea = GateArea + CDArea • $GateArea = \sum_i (GS_i \cdot SArea_i + (1 - GS_i) \cdot NArea_i)$ • $CDArea = SCD \cdot Or2Area + (SCD - 1) \cdot CArea$ • SCD: number of gates that need completion detection • NeedsCD: does a gate need CD? NeedsCD = 0 if $PI^{M} < GlobalPI^{m}$ or the successor is strict; otherwise 1 • The rest of the model implements the timing computation
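The objective can be evaluated directly for a given assignment of the GS variables. The sketch below only computes the cost; it is not an mILP solver. The guard for SCD = 0 is an assumption (the slide formula would otherwise yield a negative CD area), and the area numbers are made up.

```python
def total_area(gs, sarea, narea, scd, or2_area, c_area):
    """TotalArea = GateArea + CDArea for one assignment of strict/regular gates.

    gs[i] is 1 if gate i is strict and 0 if it is regular.
    scd is the number of gates that still need completion detection.
    """
    gate_area = sum(s * sa + (1 - s) * na for s, sa, na in zip(gs, sarea, narea))
    # Assumption: with no gate to observe, no completion detection is built.
    cd_area = (scd * or2_area + (scd - 1) * c_area) if scd > 0 else 0
    return gate_area + cd_area

gs = [0, 1, 0]          # only the second gate is strict
sarea = [10, 10, 10]    # hypothetical strict-gate areas
narea = [4, 4, 4]       # hypothetical regular-gate areas
print(total_area(gs, sarea, narea, scd=2, or2_area=3, c_area=5))  # 29
```

With SCD = 1 the CD term reduces to a single OR gate, which matches the "single OR gate" case on the strict-gate timing slide.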
Improving the mILP Model • Basic mILP model: too slow even for small circuits (hours for a dozen gates) • Leverage problem knowledge into model improvements: • Branching order: gates closer to the output are more likely to become strict => inspected first • Single-input gates: never strict • Provide an initial solution (the result of the greedy optimization) • Can solve problems with hundreds of gates in minutes
Related Work: Optimizations • Cortadella et al: • logical function decompositions • can achieve substantial area savings • can be the starting point for our methods • Zhou et al: consider strict gates in optimization, but no timing information • Sokolov et al: two timing optimizations • Alternate levels: unrealistic assumptions for gate delays • Longest path: applicable only for small circuits
Experimental Setup • Tool flow: • Synthesis & tech-mapping with Synopsys Design Compiler • Perl scripts for dual-rail implementations • Optimization tool reads structural Verilog and timing back-annotations • End result: optimized circuits (Verilog) • Experiments: • Arithmetic and ISCAS’89 benchmarks • Pre-layout runs in 0.18 µm technology
Area: Ratio vs. NCL-X method • Direct: 0.83x • Greedy: 0.55x • mILP: 0.43x • Greedy: 2.83x NCL-X area for le32 • mILP does not finish in less than 1 hour: partial results [Figure: area ratio vs. the NCL-X method]
Area breakdown • 8/168 strict • 4.7% before → 40% after • More than twice as small as NCL-X [Figure: area breakdown]
Conclusions • Paper introduced: • A method to translate synchronous circuits into optimized asynchronous circuits • Three new relative-timing optimizations for improving area • Direct: extremely simple • Greedy: fast, good results • Exact: optimal, may be extremely slow • Analyzed the impact of parametric variation on these circuits
Outline • Background • Timing analysis & Direct Optimization • Greedy optimization method • Exact optimization method • Results • Conclusions
Introduction • Future deep sub-micron technologies: • large parametric variations (ITRS’05 predicts 35% by 2020). • Asynchronous design a natural fit • Asynchronous handshaking: widespread • Acceptance for asynchronous circuits is predicated on quality CAD tools: • “Pure” async: from scratch • Sync to async translation
Synchronous to Asynchronous Translation • Template-based replacement of each sync gate • Example: Z = (A·B)·(C+D) [Figure: synchronous circuit (inputs A, B, C, D; gates N1, N2, N3; signals X, Y, Z) and the corresponding dual-rail circuit (rails A0/A1, B0/B1, C0/C1, D0/D1, X0/X1, Y0/Y1, Z0/Z1)]
Related Work • Numerous approaches for translating synchronous circuits into asynchronous • Dealing with the orphans problem: • Kondratiev et al: NCL-X (discussed below) • Brej: anti-tokens • Allows for early propagation • Completion detection in background • Even larger area overheads
ILP optimization for 32-bit BK adder • CrtSol: current best integer solution • Best Estimation: best guess of how far the optimum is; when 0, the optimum is found [Figure: CrtSol and Best Estimation during the optimization run]