Practical Design and Performance Evaluation of Completion Detection Circuits. Fu-Chiung Cheng Department of Computer Science Columbia University. Reading 4. Outline. Motivation Previous Work New Completion Detection Circuit Performance Evaluation Conclusion. Motivation.
Practical Design and Performance Evaluation of Completion Detection Circuits Fu-Chiung Cheng Department of Computer Science Columbia University Reading 4
Outline • Motivation • Previous Work • New Completion Detection Circuit • Performance Evaluation • Conclusion
Motivation • Circuits: synchronous or asynchronous. • Synchronization: • Sync: a global clock • Async: start and completion mechanisms
Motivation • Potential advantages of async. design: • No clock skew problem • Low power consumption • Average-case performance • Modularity, composability and reusability • Easier technology migration • The promise of high performance is especially attractive.
Motivation • High performance async. design: • 1. fast self-timed components with good average-case performance • 2. fast completion detection circuits that detect completion. [Figure: self-timed component with inputs A, B and outputs S; OR gates produce Ack0 … Ackn-1, which feed a C-element producing DoneReset.]
Motivation • Fast self-timed components: • 1. Delay-insensitive carry-lookahead adders • 2. Delay-insensitive comparators
Motivation • Fast completion detection circuits: • 1. Completion detection circuits (CDCs) are considered the major overhead. • 2. This paper addresses the design of fast completion detection circuits.
Previous Work: • Self-timed components may use • 1. bundled data protocol • 2. dual-rail signaling
Previous Work: • CDCs for bundled data components • 1. Delay elements (an inverter chain): delay > worst-case delay. • 2. Speculative completion [Nowick97]: performance depends on • A. the number of matched delays and • B. the associated abort detection network • 3. Current-sensing completion detection [Dean94, Grass96] • A. consumes substantial power • B. requires several gate delays
Previous Work: • CDCs for dual-rail self-timed components • 1. General model: • A. n two-input ORs • B. 1 n-input C-element • 2. Operations: • A. computation cycle: DoneReset=1 • B. reset cycle: DoneReset=0 [Figure: self-timed component with inputs A, B and outputs S; OR gates produce Ack0 … Ackn-1, which feed a C-element producing DoneReset.]
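The general model above can be sketched at the logic level in Python. This is a behavioural illustration only, not the circuit from the slides: the function names (`c_element`, `ack_signals`, `done_reset`) and the 3-bit example values are assumptions made for the sketch, and the all-zero rail pair is assumed to be the reset (spacer) code.

```python
def c_element(inputs, prev):
    """n-input C-element: 1 when all inputs are 1, 0 when all are 0, else hold."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev


def ack_signals(dual_rail_bits):
    """One two-input OR per output bit: Ack_i is the OR of the bit's two rails."""
    return [rail0 | rail1 for (rail0, rail1) in dual_rail_bits]


def done_reset(dual_rail_bits, prev):
    """General CDC model: n two-input ORs feeding one n-input C-element."""
    return c_element(ack_signals(dual_rail_bits), prev)


# Hypothetical 3-bit output; (0, 0) is assumed to be the reset (spacer) code.
state = 0
state = done_reset([(0, 1), (0, 0), (1, 0)], state)  # one bit still computing
partial = state
state = done_reset([(0, 1), (1, 0), (1, 0)], state)  # every bit valid
complete = state
state = done_reset([(0, 0), (1, 0), (0, 0)], state)  # reset in progress
resetting = state
state = done_reset([(0, 0), (0, 0), (0, 0)], state)  # every bit reset
idle = state
```

In this sketch DoneReset rises only once every bit is valid and falls only once every bit has returned to the spacer, which matches the two cycles listed on the slide.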
Previous Work: • N-input C-element: a tree of 2-input C-elements • 1. long delay • 2. large variance [Figure: tree of 2-input C-elements merging Ack0, Ack1, …, Ackn-2, Ackn-1.]
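One way to see why the tree is slow: assuming a balanced tree (the slide does not state the tree shape), the number of C-element levels on the critical path grows with log2(n). The helper names below are hypothetical, and the sketch counts levels and gates only; it does not model actual gate delays.

```python
def c_tree_depth(n):
    """Levels of 2-input C-elements in a balanced tree over n Ack signals."""
    return (n - 1).bit_length() if n > 1 else 0


def c_tree_gate_count(n):
    """Number of 2-input C-elements needed to merge n signals into one."""
    return max(n - 1, 0)


depth_32 = c_tree_depth(32)        # 5 levels
gates_32 = c_tree_gate_count(32)   # 31 C-elements
```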
Previous Work: • N-input C-element: • 1. More efficient implementation: DoneReset = done + reset · DoneReset • A. done circuit: an n-input AND, done = Ack0 · Ack1 · … · Ackn-1 • B. reset circuit: an n-input OR, reset = Ack0 + Ack1 + … + Ackn-1 • C. a 2-input C-element • 2. Delay and variance: better than the tree of 2-input C-elements [Figure: n-input AND (done) and n-input OR (reset) over Ack0 … Ackn-1, feeding a 2-input C-element that produces DoneReset.]
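The claim that the done/reset decomposition behaves like an n-input C-element can be checked exhaustively for a small n. The sketch below is a logic-level check under the equation given on the slide; the function names and the choice of n = 4 are assumptions for illustration.

```python
from itertools import product


def c_element(inputs, prev):
    """Reference n-input C-element: set on all ones, clear on all zeros, else hold."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return prev


def done_reset_split(acks, prev):
    """DoneReset = done + reset * DoneReset, with done = AND and reset = OR."""
    done = int(all(acks))    # n-input AND
    reset = int(any(acks))   # n-input OR
    return done | (reset & prev)


n = 4  # small illustrative width
mismatches = [
    (acks, prev)
    for acks in product((0, 1), repeat=n)
    for prev in (0, 1)
    if done_reset_split(acks, prev) != c_element(acks, prev)
]
```

An empty `mismatches` list means the two formulations agree for every input pattern and every previous output at this width.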
Previous Work: • Wuu’s CDCs [Wuu93]: • A. done circuit: a tree of NAND gates • B. reset circuit: a tree of NOR gates • C. long delay • D. small variance • E. uses static gates [Figure: done and reset circuits.]
Previous Work: • Yun’s CDCs [Yun97]: • A. done circuit: a tree of domino logic • B. no reset circuit • C. variable delay • D. large variance • E. uses dynamic CMOS
Our Design • Computation-completion detection circuit • dynamic n-input NOR • static 2-input NOR
Our Design • Reset-completion detection circuit • dynamic 2n-input OR
Our Design • Computation cycle: • For the done signal, • 1. the PMOS transistor (Acki) will be closed and • 2. all NMOS transistors will be open. • 3. Thus, the done signal will be turned on.
Our Design • Computation cycle: • For the reset signal, the reset signal is turned on as soon as any Acki signal goes high.
Our Design • Reset cycle: • For the done signal, the done signal is turned off as soon as any Acki signal is turned off.
Our Design • Reset cycle: • For the reset signal, the reset signal is turned off only after all Acki signals are turned off.
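The ordering of the done and reset transitions over one full cycle can be traced with a small logic-level model. This is only a sketch of the signal behaviour described on the slides, not of the transistor-level circuit; `cdc_step`, the three-signal width, and the arrival order in `sequence` are assumptions for illustration.

```python
def cdc_step(acks, prev):
    """Return (done, reset, DoneReset) for one snapshot of the Ack signals."""
    done = int(all(acks))               # on only when every Ack is high
    reset = int(any(acks))              # on as soon as any Ack is high
    done_reset = done | (reset & prev)  # 2-input C-element on done and reset
    return done, reset, done_reset


# Hypothetical arrival order for three Ack signals.
sequence = [
    (0, 0, 0),  # quiescent
    (1, 0, 0),  # first Ack rises: reset turns on
    (1, 1, 0),
    (1, 1, 1),  # last Ack rises: done turns on, DoneReset goes to 1
    (1, 0, 1),  # first Ack falls: done turns off, DoneReset holds
    (0, 0, 1),
    (0, 0, 0),  # last Ack falls: reset turns off, DoneReset goes to 0
]

trace = []
prev = 0
for acks in sequence:
    done, reset, prev = cdc_step(acks, prev)
    trace.append((done, reset, prev))
```

Each entry of `trace` is a `(done, reset, DoneReset)` triple, so the list shows done following the latest rising Ack and reset following the latest falling Ack.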
Our Design • done + reset circuits = dual-rail multi-input C-element • done + reset circuits + 2-input C-element = single-rail multi-input C-element • Implementation of 2-input C-element:
Our Design • The PMOS in the pull-up circuit of the done circuit saves power in non-operation mode. • In a quiescent state, all Acki signals are zero. • All pull-down transistors are closed. • To save power, the pull-up transistor is open, cutting off the path from Vdd to ground.
Our Design • If the input goes low too early, power is wasted. • If the input goes low too late, the done signal takes longer to turn on. • Low power consumption: latest Acki signal • High performance: any not-latest Acki signal
SPICE Output: done circuit Delay=0.55ns ChengDone0: 1. Ack0 is the latest signal. 2. input pulses: 3 and 4 3. buffered input:1004 4. Ack0:100 5. Done:24680 6. DoneReset: 200
SPICE Output: done circuit Delay=0.22ns ChengDone1: 1. Ack1 is the latest signal. 2. input pulses: 5 and 6 3. buffered input:1006 4. Ack1:101 5. Done:24680 6. DoneReset: 200
SPICE Output: done circuit Delay=0.64ns ChengDone37: 1. All Ack arrive at the same time 2. Done:24680 3. DoneReset: 200
SPICE Output: reset circuit Delay=1.23ns ChengReset0: 1. Ack0 is the latest signal. 2. input pulse: 3 and 4 3. buffered input:1004 5. Reset:13579 6. DoneReset: 200
SPICE Output: reset circuit Delay=0.87ns ChengReset1: 1. Ack0 is the latest signal. 2. input pulse: 3 and 4 3. buffered input:1004 5. Reset:13579 6. DoneReset: 200
SPICE Output: reset circuit Delay=1.34ns ChengReset37: 1. All Ack reset at the same time 2. Done:24680 3. DoneReset: 200
Our Design • Constraint: when conducting, • when only one pull-down transistor is conducting. • This can be achieved by properly sizing transistors.
Logic Complexity # of transistors
Performance Evaluation • SPICE simulation: • 1. uses MOSIS 2-micron CMOS level 2 parameters • 2. W = 3 µm, L = 2 µm (buffer: 0.4 ns; 2-input NOR: 0.18 ns) • Computation-completion detection circuits: 38 typical cases (for Wuu, Yun and Cheng) • The measured delay includes the delay of the OR gate for Acki. • Reset-completion detection circuits: 38 typical cases (Wuu and Cheng)
Conclusions • A new completion detection circuit for dual-rail self-timed components: • 1. very fast computation-completion detection • 2. very fast reset-completion detection • A low-overhead, very fast completion detection circuit is crucial for high-performance self-timed circuits.
Conclusions • SPICE simulation results: • 1. our computation-completion detection circuit is 9 times faster than Wuu's and Yun's • 2. our reset-completion detection circuit is 4 times faster than Wuu's.