Wagging Logic: Moore's Law will eventually fix it

Charlie Brej
APT Group, University of Manchester
Group Talk

Introduction
• Quasi-Delay-Insensitive (QDI) approach
• Prove the high performance potential
• What is performance?
  • Latency
  • Throughput
• Why is async better?
  • Average case performance
  • Variability and data dependence
  • Bit level pipelining

QDI Forward Safe Guarding
• Ensure all wire pairs are cycled up and down
[Figure: circuit diagram including an element labelled C]

Behaviour
• Viewpoint of a single output
• Many inputs

Behaviour
• All or nothing
• Synchronises inputs together
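The "all or nothing" behaviour above matches a Muller C-element, which the later slides cost at two gate delays. A minimal Python sketch, assuming the standard definition (the output rises only when every input is high, falls only when every input is low, and otherwise holds). The class name `CElement` is illustrative and not from the talk.

```python
class CElement:
    """Behavioural model of a Muller C-element with any number of inputs."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, inputs):
        # All inputs high: output rises. All inputs low: output falls.
        # Mixed inputs: the element holds its previous value.
        if all(inputs):
            self.output = 1
        elif not any(inputs):
            self.output = 0
        return self.output


c = CElement()
trace = [c.update(x) for x in [(1, 0), (1, 1), (0, 1), (0, 0)]]
print(trace)  # [0, 1, 1, 0]
```

The output waits for the slowest input in both directions, which is how the element synchronises its inputs.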

Why is it so slow?
• Delays:
  • Gate: 1, C-element: 2
  • Stage data propagation: X
• Cycle time (times 2 for set and reset):
  • Forward guarding: 2X
    • C-element for each gate
  • Acknowledge propagation: 2X
    • C-element for each fork (fork depth ~ gate depth)
• About eight times slower than worst case!
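The factor of eight follows directly from the delays on this slide. A short Python sketch of the arithmetic, assuming a stage that is `depth` gates deep, so that X equals `depth` gate delays. The function names are illustrative.

```python
GATE_DELAY = 1       # from the slide
C_ELEMENT_DELAY = 2  # from the slide


def four_phase_cycle_time(depth):
    """Cycle time of one QDI stage whose logic is `depth` gates deep."""
    forward = C_ELEMENT_DELAY * depth      # forward guarding: a C-element per gate
    acknowledge = C_ELEMENT_DELAY * depth  # a C-element per fork, fork depth ~ gate depth
    return 2 * (forward + acknowledge)     # once for set, once for reset


def synchronous_worst_case(depth):
    """Worst-case propagation through `depth` plain gates."""
    return GATE_DELAY * depth


depth = 10
ratio = four_phase_cycle_time(depth) / synchronous_worst_case(depth)
print(ratio)  # 8.0
```

The ratio does not depend on `depth`: (2X + 2X) times 2 is 8X for any X.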

Why is four-phase so slow?
• Low latency
• Low throughput
• Only 1/8th of the system doing useful work
• Rest is resetting/completing
[Figure: stages labelled "Workie" or "Sleepy", one Workie for every seven Sleepy]

Solutions
• Ultra/Hyper/Super Pipelining
  • Need 8 times finer pipelining
  • Impossible
  • Each latch adds to the latency
• Faster completion detection
  • Balanced C-element trees
  • Arranging to suit arrival order
• Backward guarding
• Not even close to 8x improvement

Inspiration: Wagging Latches
• Alternate latch read/write
• Capacity of two latches
• Depth of one latch
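One way to picture a wagging latch is as a set of parallel latches that are written and read in strict rotation. The Python sketch below is a behavioural model under that assumption, not the circuit from the talk. The class `WaggingBuffer` and its methods are hypothetical names.

```python
class WaggingBuffer:
    """Parallel latches written and read in round-robin order."""

    def __init__(self, ways=2):
        self.slots = [None] * ways
        self.full = [False] * ways
        self.write_way = 0
        self.read_way = 0

    def write(self, value):
        if self.full[self.write_way]:
            return False  # the latch whose turn it is still holds unread data
        self.slots[self.write_way] = value
        self.full[self.write_way] = True
        self.write_way = (self.write_way + 1) % len(self.slots)
        return True

    def read(self):
        if not self.full[self.read_way]:
            return None  # nothing to read yet
        value = self.slots[self.read_way]
        self.full[self.read_way] = False
        self.read_way = (self.read_way + 1) % len(self.slots)
        return value


buf = WaggingBuffer()
accepted = [buf.write(v) for v in ("a", "b", "c")]
drained = [buf.read(), buf.read(), buf.read()]
print(accepted)  # [True, True, False]
print(drained)   # ['a', 'b', None]
```

The buffer holds two values (capacity of two latches), yet each value passes through only one latch on its way out (depth of one latch).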

Wagging Logic
• Apply same method to the logic
• Alternate logic allowing one to set while the other resets (precharges)
[Figure: copies of the logic alternating between Set and Reset]
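The alternation can be sketched as a round-robin schedule: each new token goes to the next copy of the logic, and the copies not handling it are free to reset. This is a simplified Python model that assumes every token takes one step and that a copy is either setting or resetting. The function `wagging_schedule` is an illustrative name.

```python
def wagging_schedule(ways, tokens):
    """For each token, list what every copy of the logic is doing."""
    schedule = []
    for token in range(tokens):
        active = token % ways  # copies take turns in round-robin order
        phases = ["set" if copy == active else "reset" for copy in range(ways)]
        schedule.append(phases)
    return schedule


for phases in wagging_schedule(ways=2, tokens=4):
    print(phases)
```

With two copies, each one sets on every other token and resets in between, so the reset phase no longer sits on the critical path.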

Wagging Logic
• Between wagging stages
  • No need to wag
  • No need to synchronize
• Wag only when communicating with non-wagging logic

Non FIFO Example

Duplicate the Logic

Connect to Complementary

A Harder Example

Duplicate the Logic

Connect to Complementary

Triplicate the Logic

Connect to the next on the list
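The diagrams for these slides are not reproduced here, so the following Python sketch is only one reading of "connect to the next on the list": the output of copy i feeds copy i + 1, wrapping around at the end. With two copies this reduces to connecting each copy to its complement. The function names and the Fibonacci state update are illustrative assumptions.

```python
def wagging_connections(ways):
    """Map each copy of the logic to the copy that consumes its output."""
    return {copy: (copy + 1) % ways for copy in range(ways)}


def run_fibonacci(ways, steps):
    """Pass Fibonacci state around the ring of copies, one step per copy."""
    connections = wagging_connections(ways)
    state = (0, 1)
    copy = 0
    visited, values = [], []
    for _ in range(steps):
        visited.append(copy)
        a, b = state
        values.append(a)
        state = (b, a + b)        # this copy computes the next state...
        copy = connections[copy]  # ...which the next copy on the list consumes
    return visited, values


print(wagging_connections(2))  # {0: 1, 1: 0}
print(wagging_connections(3))  # {0: 1, 1: 2, 2: 0}
print(run_fibonacci(3, 6))     # ([0, 1, 2, 0, 1, 2], [0, 1, 1, 2, 3, 5])
```

The sequence of values is the same as for a single copy. Only the copy that produces each value changes.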

Other example

Proof of the pudding
• Simple gate level simulation
  • My own simulator
  • Delays: C-element=2, Gate=1
• Example circuits
  • Fibonacci sequence generators
  • Vertically pipelined 64bit ripple carry adder
  • Non-pipelined 8bit ripple carry adder
  • 16 input XOR
• Backward and Forward guarded
• Relative measurements of Speed, Power, Area
• 10,000 gate delays simulation
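The ripple carry adders are a natural test case because their delay is data-dependent: the carry rarely ripples through every bit. The Python sketch below is not the simulator from the talk. It only measures the longest carry chain for each pair of 8-bit operands, as a rough stand-in for data-dependent delay. The function `longest_carry_chain` is a hypothetical helper.

```python
def longest_carry_chain(a, b, width=8):
    """Longest run of bit positions that a single carry ripples through."""
    longest = current = 0
    for bit in range(width):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        if x & y:
            current = 1    # generate: a new carry starts here
        elif (x ^ y) and current:
            current += 1   # propagate: the incoming carry ripples on
        else:
            current = 0    # no carry leaves this position
        longest = max(longest, current)
    return longest


worst = longest_carry_chain(0xFF, 0x01)
total = sum(longest_carry_chain(a, b) for a in range(256) for b in range(256))
average = total / (256 * 256)
print(worst)    # 8
print(average)
```

A synchronous adder must be clocked for the full 8-bit chain, while a self-timed one can finish as soon as its actual chain completes.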

64bit Fibonacci Performance
[Figure: performance plot; synchronous worst case: 74]

8bit Fibonacci Performance
[Figure: performance plot; synchronous worst case: 500]

XOR Performance
[Figure: performance plot]
• Synchronous worst/best case: 1250 (8 gate delays)
• Including timing margins and flip-flop: 1000 (10 gate delays)

Power Consumption
[Figure: power consumption plot; synchronous: 610]

Area

Future work
• Larger and more complex designs
  • Small CPU
  • Layout
  • Silicon?
• Improve completion time
  • Current optimal wagging ~ 5
  • Target ~ 3
• Fully automated flow
  • Verilog Input & Output
  • Partitioning

Conclusions
• Matching and surpassing synchronous performance every time
• DI logic for performance
• Very Expensive
  • 20 times more power
  • 5 times bigger (times wagging)
• Fastest logic on the planet!
  • Discounting increase in wire delays
  • Assuming other things will be able to keep up
