Circuit Design for SRCMOS Asynchronous Wave Pipelines
Oliver Hauck
Integrated Circuits and Systems Lab
Departments of Computer Science and Electrical Engineering
Darmstadt University of Technology
Outline
• Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)
• Comparison: AWPs vs. sync, async, and sync wave pipes
• AWP Circuit Design
• Conclusion
Pipelining
• Pipelining is the premier technique for better exploiting hardware and boosting the performance of VLSI chips
• Clocking overhead is a serious threat to deeply pipelined systems built on sub-micron CMOS processes and running at GHz frequencies
General Framework for Pipelines
[Figure: pipeline block diagram. Labels: Data, Latch/Reg, Logic, Latch/Reg, Clk]
Synchronous Pipeline
[Figure: pipeline block diagram. Labels: Data, Latch/Reg, Logic, Latch/Reg, Clk]
• Throughput is determined by the longest logic path plus clock/register overhead
• Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead
• Negative side effects of gate-level pipelining:
  • Increased latency, clock load/skew, power, area, design time
  • More area for clocking and registers than for logic
• Implementation options:
  • Register- vs. latch-based, explicit latches vs. latchless
  • TSPC vs. local clocks derived from global clock
  • Static vs. dynamic, single-ended vs. dual-rail
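The trade-off between throughput and latency in a synchronous pipeline can be illustrated with a first-order model. This is only a sketch: it assumes the logic splits evenly into stages and that every stage pays the same fixed clock/register overhead. The function names and the numbers (1600 ps of logic, 100 ps of overhead) are hypothetical and not taken from the slides.

```python
def sync_cycle_time(total_logic_delay_ps, stages, overhead_ps):
    """Cycle time when the logic is split evenly into `stages` stages
    and each stage adds a fixed clock/register overhead (assumption)."""
    return total_logic_delay_ps / stages + overhead_ps


def sync_latency(total_logic_delay_ps, stages, overhead_ps):
    """Total latency: every stage costs one full cycle."""
    return stages * sync_cycle_time(total_logic_delay_ps, stages, overhead_ps)


if __name__ == "__main__":
    # Hypothetical numbers: 1600 ps of logic, 100 ps overhead per stage.
    for stages in (1, 4, 16):
        cycle = sync_cycle_time(1600.0, stages, 100.0)
        latency = sync_latency(1600.0, stages, 100.0)
        print(f"{stages:2d} stages: cycle {cycle:7.1f} ps, latency {latency:7.1f} ps")
```

In this model, deeper pipelining shortens the cycle but the overhead term does not shrink, so latency grows with the number of stages.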
Asynchronous Pipeline
[Figure: pipeline block diagram. Labels: Data, Handshake, Logic, Handshake, req_in, req_out, ack_in, ack_out]
• Micropipeline (Sutherland 1989)
• Synchronous clock replaced by asynchronous handshaking
• Elastic operation: input and output rates may differ momentarily, and the pipeline will buffer
• Implementation options:
  • 4-phase (level) vs. 2-phase (event) protocol
  • Bundled data (matched delay) vs. completion detection
• Operation is data dependent, saves power during idle
• As with fine-grain sync pipelines, throughput can be high; handshake causes high latency and backward stall
• Plug & Play composability
• Load on req and ack lines is distributed
• Used by Furber's group at Manchester U for AMULET1/2/3
Synchronous Wave Pipeline
[Figure: pipeline block diagram. Labels: Data, Latch/Reg, Wave Logic, Latch/Reg, Clk]
• Several data waves are simultaneously active in the logic
• Logic has to minimize delay variations over P, T, V corners
• Global clock is used with constructive skew to adjust phases
• Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area and power
• However, tuning the logic and the delay elements is difficult
Wave Pipelining: A Short Outline
• Wave pipelining occurs when combinational logic is clocked faster than latency would allow
• Several data waves are then active in the logic without being separated by storage elements
• Latency remains constant and throughput is determined by delay differences rather than absolute delay
• Requirement for delay balanced logic and complicated timing are the main hurdles
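The claim that throughput depends on delay differences can be made concrete with a simplified first-order bound. The sketch assumes the minimum period is the spread between the slowest and fastest logic path plus a lumped sampling overhead. The function names and numbers are hypothetical, and real designs must also respect clock skew and the upper bound on the period.

```python
import math


def conventional_period(d_max, overhead):
    """Conventional pipeline: one wave at a time, so the period covers the slowest path."""
    return d_max + overhead


def min_wave_period(d_max, d_min, overhead):
    """Wave pipeline (first-order): the period only has to cover the delay spread."""
    return (d_max - d_min) + overhead


def waves_in_flight(d_max, period):
    """Approximate number of data waves simultaneously active in the logic."""
    return math.ceil(d_max / period)


if __name__ == "__main__":
    # Hypothetical delays in ps: slowest path 1600, fastest path 1400, overhead 100.
    period = min_wave_period(1600, 1400, 100)
    print("conventional period:", conventional_period(1600, 100))
    print("wave period:", period)
    print("waves in flight:", waves_in_flight(1600, period))
```

Balancing the logic so that the fastest path approaches the slowest one is what shrinks the period in this model, which is why delay balancing is listed as the main hurdle.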
Wave Pipelining: A Little History
• The technique stems from the 60s and has had a reputation for being exotic ever since
• Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU
• Some working academic chips exist, mainly datapath
• Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know
Asynchronous Wave Pipeline (AWP)
[Figure: pipeline block diagram. Labels: Data, Wave Latch, Wave Logic, Wave Latch, req_in, matched delay, req_out]
• Data words are associated with events on the request line
• Several data waves and protocol events are simultaneously active in the logic and the matched delay element, respectively
• AWP is a special case of the sync wave pipeline with the constructive skew set to the worst-case logic delay
• It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners
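Why the matched delay must track the logic can be sketched with a simple capture check. The model below is an assumption, not the author's timing analysis: a request launched at time t samples the output wave latch at t + d_match, the slowest bit of that wave settles at t + d_max, and the fastest bit of the following wave arrives at t + spacing + d_min. All names and numbers are hypothetical.

```python
def capture_ok(spacing, d_min, d_max, d_match, hold=0.0):
    """True if the wave latch samples each wave while its data is stable.

    spacing: time between consecutive requests at the input
    d_min/d_max: fastest/slowest logic path delay
    d_match: delay of the matched delay element on the request line
    hold: time the data must stay stable after sampling (assumed lumped)
    """
    settled = d_match >= d_max                          # slowest bit has arrived
    not_overwritten = d_match + hold <= spacing + d_min  # next wave not yet there
    return settled and not_overwritten


if __name__ == "__main__":
    # Hypothetical delays in ps, matched delay set to worst-case logic delay.
    for spacing in (150, 250, 10_000):
        print(spacing, capture_ok(spacing, d_min=1400, d_max=1600, d_match=1600))
```

With d_match equal to d_max, the only remaining condition in this model is a lower bound on the request spacing, so slowing the input rate never breaks the capture. A delay element that drifts below d_max fails at every rate.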
AWPs vs. Synchronous Pipelines
• No global clock; instead, a local clock (request) is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with an event on request
• Many pipeline registers are removed, so requirements on the clock (request) are relaxed
• Synchronous pipelines can reach the throughput of AWPs only at excessive cost in area, power and latency
AWPs vs. Asynchronous Pipelines
• AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead
• AWPs are not elastic: data at the output has to be consumed
• AWPs eliminate hazards as a side effect of delay balancing
• AWPs have in common with other async methodologies: data dependent operation (avoids redundant transitions), composability (though inelastic), no global clock
AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:
• Replacing the constructive skew by the worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate up to their maximum
• Using dynamic self-resetting logic controls delay variation and doesn't impact latency much
Wave Pipelining Combinational Logic
• Overall goal: keep the data wave coherent under all possible conditions (data, PTV)
• Desirable architecture features:
  • Most logic paths have the same depth
  • Fanin/fanout the same everywhere
• First step: pad all short paths to maximum length
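The padding step can be sketched on a gate-level netlist. The example assumes unit gate delay and a netlist given as a dictionary from each gate to its drivers. The function name, the netlist format and the "OUT" marker are hypothetical choices for illustration, not the author's tool flow.

```python
from functools import lru_cache


def buffers_needed(fanin, outputs):
    """Return {(driver, sink): n_buffers} so that every path has the same depth.

    fanin: maps each gate to the list of nodes driving it; primary inputs
           have no entry and sit at depth 0 (unit-delay assumption).
    outputs: gates whose results leave the block; they are padded up to the
             depth of the deepest output and reported with sink "OUT".
    """
    @lru_cache(maxsize=None)
    def depth(node):
        drivers = fanin.get(node, [])
        return 0 if not drivers else 1 + max(depth(d) for d in drivers)

    pads = {}
    for gate, drivers in fanin.items():
        for driver in drivers:
            gap = depth(gate) - depth(driver) - 1  # levels skipped by this wire
            if gap > 0:
                pads[(driver, gate)] = gap
    max_depth = max(depth(o) for o in outputs)
    for out in outputs:
        gap = max_depth - depth(out)
        if gap > 0:
            pads[(out, "OUT")] = gap
    return pads


if __name__ == "__main__":
    netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["a", "b"]}
    print(buffers_needed(netlist, outputs=["g2", "g3"]))
```

In the small netlist above, input c skips one level on its way to g2 and output g3 is one level shallower than g2, so each gets one buffer.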
Example: 64-b Brent-Kung Parallel Adder
[Figure: Brent-Kung adder network. Labels: 0 1 2 3 4; pg, PG, PG, G, xor]
• Buffers provide the same depth on every logic path
• All gates in the same column must have the same delay
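For readers unfamiliar with the structure, the following is a behavioural sketch of a Brent-Kung prefix adder. It models only the generate/propagate network (an up-sweep followed by a down-sweep), not the padding buffers or the SRCMOS cells shown on the slide. The function name and the list-based representation are assumptions made for illustration.

```python
def brent_kung_add(a, b, n=64):
    """Add two n-bit integers with a Brent-Kung prefix network (n a power of two).

    Returns (sum modulo 2**n, carry_out).
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]  # bit generate
    G, P = g[:], p[:]                                  # group signals, updated in place

    def combine(i, j):
        # Prefix operator: group ending at i absorbs the lower group ending at j.
        G[i] |= P[i] & G[j]
        P[i] &= P[j]

    levels = n.bit_length() - 1                        # log2(n)
    for d in range(levels):                            # up-sweep (reduction tree)
        step = 1 << (d + 1)
        for i in range(step - 1, n, step):
            combine(i, i - step // 2)
    for d in range(levels - 2, -1, -1):                # down-sweep (fills in the rest)
        step = 1 << (d + 1)
        for i in range(step + step // 2 - 1, n, step):
            combine(i, i - step // 2)

    carries = [0] + G[:-1]                             # carry into bit i, carry-in is 0
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[-1]


if __name__ == "__main__":
    print(brent_kung_add(123456789, 987654321))
    print(brent_kung_add(2**64 - 1, 1))
```

The prefix network has 2·log2(n) − 1 levels, and many signals skip levels. In a wave-pipelined version, those skipped levels are where the padding buffers would go.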
Circuits
• The logic style used has to minimize delay variation
• Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream
• Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed
• Pass transistor logic gives sloppy edges, thereby introducing delay variation
• Dynamic logic is attractive as only the output high transition is data dependent; output pulldown is done by precharge
Circuits (cont.)
• Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge
• What is needed is a dynamic logic family without precharge overhead: SRCMOS
• Work done at IBM:
  • Classic paper by Chappell et al.: "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC 26(11), 1991
  • More recently, Hwang et al.: "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC 34(8), 1999
SRCMOS
• Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced
[Figure: SRCMOS gate schematic. Labels: inputs, N, output]
Delay Balancing at Transistor Level
• The NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices
• Short paths are padded with dummy devices
• Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use
• Every output has to see the same load
• Lightly loaded outputs are given dummy capacitance
Simulation of Gim cell
• Pulses of the 4 possible input situations giving '1' at the output are tightly matched
• Note: Pxy = Gxy = 1 never occurs in this case
64-bit Adder Output Waveforms
[Figure: adder output waveforms with the latching window marked]
Transistor Sizing
[Figure: SRCMOS gate schematic. Labels: inputs, N, output, Wpd, Wprecharge, Wkeeper, Cdrive, Cfeedback, Cload]
• Wpd / Cdrive = const
• Cdrive / (Cload + Cfeedback + Wkeeper) = const
• Cfeedback / Wprecharge = const
• Wprecharge / Cdrive = const
LINEAR SIZING
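One way to read the four constant ratios is as a small linear system that can be solved for a gate's sizes once its load is known. The sketch below is an interpretation, not the author's sizing procedure. It assumes all quantities are expressed in one normalized unit, treats Wkeeper as given, and uses hypothetical names (k_pd, k_drive, k_fb, k_pre) for the four constants.

```python
def size_gate(c_load, w_keeper, k_pd, k_drive, k_fb, k_pre):
    """Solve the four ratio constraints for one gate (normalized units, assumption).

    k_pd    = Wpd / Cdrive
    k_drive = Cdrive / (Cload + Cfeedback + Wkeeper)
    k_fb    = Cfeedback / Wprecharge
    k_pre   = Wprecharge / Cdrive
    """
    # Cfeedback = k_fb * k_pre * Cdrive, so Cdrive appears on both sides.
    loop = k_drive * k_fb * k_pre
    if loop >= 1.0:
        raise ValueError("ratios are inconsistent: feedback load grows without bound")
    c_drive = k_drive * (c_load + w_keeper) / (1.0 - loop)
    w_precharge = k_pre * c_drive
    c_feedback = k_fb * w_precharge
    w_pd = k_pd * c_drive
    return {
        "Cdrive": c_drive,
        "Wprecharge": w_precharge,
        "Cfeedback": c_feedback,
        "Wpd": w_pd,
    }


if __name__ == "__main__":
    # Hypothetical constants and load, chosen only to give round numbers.
    print(size_gate(c_load=6.0, w_keeper=0.0, k_pd=2.0, k_drive=0.5, k_fb=0.5, k_pre=1.0))
```

Under these assumptions every size comes out proportional to the external load, which matches the slide's "linear sizing" label.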
Interconnect: Resistive Effects
• 0.9 µm x 900 µm MET2 wire, parasitics: C = 116 fF, R = 70 Ohms
[Figure: comparison of wire models. Legend: C only; R/3, R/3, R/3; R/2, R/2; RC only]
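The difference between the wire models in the legend can be estimated with the Elmore delay, using the R and C values from the slide. This is a sketch under assumptions: the placement of the capacitance in each model (all of C at the midpoint for R/2, R/2; C/2 at each inner node for R/3, R/3, R/3) is a guess, since the slide text does not give it. The driver resistance and the function name are hypothetical.

```python
def elmore_delay(r_driver, ladder):
    """Elmore delay of an RC ladder.

    ladder: list of (series_R, shunt_C) pairs from driver to far end; each
            capacitor sits after its resistor. Ohms times fF gives fs.
    """
    delay = 0.0
    r_upstream = r_driver
    for r, c in ladder:
        r_upstream += r
        delay += r_upstream * c  # each cap charges through all upstream resistance
    return delay


R, C = 70.0, 116.0  # Ohms, fF (values from the slide)

# Assumed capacitor placement for each model in the legend.
MODELS = {
    "C only": [(0.0, C)],
    "RC only": [(R, C)],
    "R/2, R/2": [(R / 2, C), (R / 2, 0.0)],
    "R/3, R/3, R/3": [(R / 3, C / 2), (R / 3, C / 2), (R / 3, 0.0)],
}

if __name__ == "__main__":
    for name, ladder in MODELS.items():
        # r_driver = 0 isolates the contribution of the wire itself.
        print(f"{name:15s} {elmore_delay(0.0, ladder) / 1000:.2f} ps")
```

With an ideal driver, the lumped RC model gives R·C and the two distributed approximations give R·C/2, the known Elmore delay of a distributed RC line. A nonzero driver resistance adds the same r_driver·C term to every model.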
Interconnect: Coupling Effects
• Two adjacent MET2 lines coupled by C = 54 fF
PTV Variations
• SRCMOS provides some robustness by generating fresh pulses at every gate output
• Pulsed operation reduces data dependency and coupling
• PTV noise is not critical when drift is in the same direction across the die
• Critical are: temperature gradient, supply drop, and local variations
• What is needed: a rule of thumb like "For process X, to be on the safe side, keep the area between two latches < Y mm²"
Conclusion
• AWPs presented as an alternative approach to high-speed design, showing potential for GHz throughput without clocks
• AWPs avoid some problems of conventional wave pipes and (a)synchronous systems
• 64b adder + test circuit and EC crypto layout in the making
• Not covered here: feedback + controllers
• To do: support transistor sizing