Challenges in Automatic Optimization of Arithmetic Circuits

csda csda Challenges in Automatic Optimizationof Arithmetic Circuits Ajay K. Verma, Philip Brisk and Paolo Ienne Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA) Ecole Polytechnique Fédérale de Lausanne (EPFL)

Circuit PerformanceDepends Heavily on the Description Multiplier with OptimizedCompressor Tree “Software” Multiplier Multiplier with Compressor Tree

Pre-Synthesis Optimization of Arithmetic Circuits Known architectures Original Circuit Description Arithmetic optimizations Logic Synthesis Physical Design Automatic architecture exploration

Automation and Computer Arithmetic • Algorithmic approaches for a particular class of circuits • Variable group size CLA adder [Lee91] • Irregular partial product compressors [Stelling98] Automation • Heuristics to optimize general classes of circuits • Kernel and co-kernel extraction [Brayton82] • Decomposition based approaches for general circuits [Bertacco97, Mishchenko01, Yang02]

Logic Synthesis • Synthesis tools have become extremely good in optimizing circuits expressed in Sum-Of-Product form • And when there are plenty of XOR gates? ? Before expansion :0.37 ns (138.2 μm2) After expansion :0.26 ns (146.9 μm2) Before expansion :0.22 ns (58.8 μm2) After expansion :0.27 ns (221.2 μm2)

Outline Verma, Brisk, & Ienne; DAC 2007 Best Paper Award nominee Verma, Brisk, & Ienne; IWLS 2008 Verma & Ienne; DAC 2006 Verma & Ienne; ICCAD 2004 Verma, Brisk, & Ienne; TCAD 2008 Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level

Outline Low Complexity High High Granularity Low Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level

Outline Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level

Clustering: Maximization of the Use of Carry-Save Representation The two addition nodes are clustered Two addition nodes are separated by NOT Goal: Swap the adders with other logic operations while preserving the semantics to cluster additions

Examples of Transformations Advancing shift left over add(distributivity of multiplication over addition) (A << k)  A . 2k Advancing shift right over addition is more complex Advancing SEL over add(existence of the identity element of addition) C ? (A + B) : D (C ? A : D) + (C ? B : 0)

Some Transformations Have a Cost Advancing PP over add(distributive property of multiplication over addition) This transformation has a significant cost in terms of area!

Generation of All Pareto-Optimal Implementations Pareto-optimal:better than any other in terms of area or critical-path delay Theorem: The transformations form a persistent and confluent reduction system

Example: adpcmdecode Kernel AND network Compressor tree 0.51 ns, 4901 μm2 0.85 ns, 5678 μm2

Outline Limited scope for optimizations Bit-level Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level

Implementation of Subcircuits Corresponding to Contiguous Layers Can Be Improved Arithmetic ADD Logic LZD A direct implementation of LZA in carry-select fashion [Gerwig99] Leading Zero Anticipator

Recursively compute leader expressions again Input Condensation • Leader expressions: • Sufficient to evaluate the whole of an expression • Once you evaluate them, you can discard the input bits IN IN 8-input parallel counter Some Large Circuit Leader expressions L |L| < |IN| s c Smaller circuit OUT OUT Compute all leader expressions in parallel

x y z z = f(x, y) Progressive Decomposition: Algorithm Overview • Choose a subset of input bits • How many bits? • Many different combinations? • Find leader expressions • Optimize via Boolean ring properties • Find identities • Discard dependent expressions • Rewrite circuit in terms of leader expressions • Recursively process the remaining circuit

a1 a0 b1 c1 b0 c0 CSA CSA 0 carry sum Carry-save adder + + 0 a1 a0 c0 b1 b0 c1 X + + 0 0 + + + 0 X Example: 3-Input Adder (s2 Output) X = [a1b1 + (a1 + b1)a0b0]  [(a1 b1  a0b0)c1 + c0(a0 b0)(c1 + (a1 b1  a0b0))] L(X, {a1, b1, c1}) ={a1 b1  c1, a1b1 b1c1 a1c1} 3:2 Compressor Ripple-Carry Adder Ripple-Carry Adder

A Better Division Is Used for Leader Expression Computation X = ab (c d e)  cd (a b e) X = (ab + cd) (a  b  c d e) Based on the identity: pq (p q) = 0 Theorem: An expression of the form (PQ RS) can be factored as (P R) T, if there exist U and V such that 1) PU = RV = 0 and 2) Q S = U  V The ideal membership problem can be used to determine the existence of such U and V

Progressive Decomposition: Qualitative Analysis • Completely agnostic of the type of circuit to optimize • Automatically infers successful circuit designs from the literature… • Carry-lookahead adder (beyond minimal sizes) • Structured LZD/LOD circuit • Optimized LZA circuit (no sum computation) • Carry-save addition • Parallel counter • …and discovers some unknown to us! • Multi-Input comparisons (min/max)

Multi-Input Comparator(Min/max of k n-bit Integers) Binary tree of comparators Pairwise comparison of inputs Number of comparators: k (k − 1)/2 Critical path delay: O(log n + log k) Hardware area: O(k2n) Number of comparators: k − 1 Critical path delay: O(log n  log k) Hardware area: O(kn) 0.46 ns, 1755 μm2 0.21 ns, 3479 μm2 With Our Structuring Algorithm: Bitwidth reduction using dominators and LODs Number of LODs: k  log* n Critical path delay: O(log n + log k  log* n) Hardware area: O(kn) 0.22 ns, 1331 μm2 log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1 – e.g., log*(265536) = 5

Outline Reed-Muller form can be very inefficient Exhaustive Exploration Efficient implementation of the leader expressions ? Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level

Problem Statement no “reuse” total “reuse” selective “reuse” Given a set of Boolean expressions, generate all their Pareto-optimal implementations

EnumeratingCommon Sub-Expressions Root: Original Reed-Muller form Eitherxy or xy replaced by a new variable The nodes of the DAG correspond to all partial implementations of the two expressions with some sharing between them

Pruning the Enumeration DAG • The size of DAG can be as large as O ((n + m) 2m), where n is the number of variables and m is the size of Boolean expressions • Enumerating the whole DAG is computationally infeasible • Pruning Criteria • Recognizing node equivalence (width reduction) • Merging some reductions into a single one(height reduction) • Delaying certain reductions (branch reduction)

There Is Scope for Further Pruning… Number of possible implementations: >1060 Number of explored implementations: 2687 Number of actual Pareto-optimal implementations: 4 Area and delay for all 6-bit adders generated by our algorithm Without any pruning, it would be impossible to handle expressions with more than five variables

+ …but the Enumeration Algorithm Finds Interesting Non-Trivial Relations! 4x4-bit multiplier: better than our best manually-designed cell-based multiplier?! The method has been generalized for higher bitwidth multipliers It reduced the delay of the best cell-based 8 x 8-bit multiplier by 10% Verma & Ienne; ASP-DAC 2007 Best Paper Award nominee

Summary Verma, Brisk, & Ienne; DAC 2007 Best Paper Award nominee Verma, Brisk, & Ienne; IWLS 2008 Verma & Ienne; DAC 2006 Verma & Ienne; ICCAD 2004 Verma, Brisk, & Ienne; TCAD 2008 Creating Macroscopic Structure Exploring Microscopic Structure Optimizing at Word-Level

Computer Arithmetic and Automation • Computer Arithmetic has been for long the domain of extremely ingenuous manually developed architectures • Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit • Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues It is perhaps high time to try to change all this…

Thanks!

Challenges in Automatic Optimization of Arithmetic Circuits