
Challenges in Automatic Optimization of Arithmetic Circuits

Challenges in Automatic Optimization of Arithmetic Circuits. Ajay K. Verma, Philip Brisk and Paolo Ienne, Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL).


Presentation Transcript


  1. Challenges in Automatic Optimization of Arithmetic Circuits. Ajay K. Verma, Philip Brisk and Paolo Ienne, Processor Architecture Laboratory (LAP) & Centre for Advanced Digital Systems (CSDA), Ecole Polytechnique Fédérale de Lausanne (EPFL)

  2. Circuit Performance Depends Heavily on the Description. Figure labels: “Software” Multiplier; Multiplier with Compressor Tree; Multiplier with Optimized Compressor Tree

  3. Pre-Synthesis Optimization of Arithmetic Circuits. Design flow: Original Circuit Description → Arithmetic optimizations → Logic Synthesis → Physical Design. Other figure labels: Known architectures; Automatic architecture exploration

  4. Automation and Computer Arithmetic • Algorithmic approaches for a particular class of circuits: variable group size CLA adder [Lee91]; irregular partial product compressors [Stelling98] • Automation: heuristics to optimize general classes of circuits: kernel and co-kernel extraction [Brayton82]; decomposition-based approaches for general circuits [Bertacco97, Mishchenko01, Yang02]

  5. Logic Synthesis • Synthesis tools have become extremely good at optimizing circuits expressed in Sum-Of-Product form • And when there are plenty of XOR gates? • First example: before expansion: 0.37 ns (138.2 μm²); after expansion: 0.26 ns (146.9 μm²) • Second example: before expansion: 0.22 ns (58.8 μm²); after expansion: 0.27 ns (221.2 μm²)
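
As background for the XOR question on this slide, the sketch below counts the product terms of an n-input XOR written in sum-of-products form. It is offered as general context, not as the slide's stated explanation for the numbers shown; the function names are illustrative.

```python
from itertools import product

def parity_minterms(n):
    """Input combinations for which an n-input XOR evaluates to 1."""
    return [bits for bits in product((0, 1), repeat=n) if sum(bits) % 2 == 1]

def differ_in_one_bit(a, b):
    """True if two input combinations differ in exactly one position."""
    return sum(x != y for x, y in zip(a, b)) == 1

def any_mergeable(n):
    """Two minterms can merge into a larger product term only if they differ in one bit."""
    ones = parity_minterms(n)
    return any(differ_in_one_bit(a, b) for a in ones for b in ones)

for n in range(2, 7):
    # Prints n, the number of minterms (2**(n-1)), and whether any pair can be merged.
    print(n, len(parity_minterms(n)), any_mergeable(n))
```

Every pair of odd-parity input combinations differs in an even number of bits, so no two minterms merge and the sum-of-products form of XOR grows exponentially with the number of inputs.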

  6. Outline • Creating Macroscopic Structure • Exploring Microscopic Structure • Optimizing at Word-Level. Publications shown: Verma, Brisk, & Ienne, DAC 2007 (Best Paper Award nominee); Verma, Brisk, & Ienne, IWLS 2008; Verma & Ienne, DAC 2006; Verma & Ienne, ICCAD 2004; Verma, Brisk, & Ienne, TCAD 2008

  7. Outline • Creating Macroscopic Structure • Exploring Microscopic Structure • Optimizing at Word-Level. Axes: Complexity (Low to High); Granularity (High to Low)

  8. Outline • Creating Macroscopic Structure • Exploring Microscopic Structure • Optimizing at Word-Level

  9. Clustering: Maximization of the Use of Carry-Save Representation • Before: two addition nodes are separated by NOT • After: the two addition nodes are clustered • Goal: swap the adders with other logic operations, while preserving the semantics, to cluster additions
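
One way a NOT between two additions could be moved out of the way is the two's-complement identity NOT(x) = -x - 1, which gives NOT(A + B) = NOT(A) + NOT(B) + 1 modulo 2^N. The slide does not say which rewrite the authors use, so the identity, the 4-bit width and the function names below are assumptions chosen for illustration.

```python
N = 4                    # assumed word width for the exhaustive check
MASK = (1 << N) - 1

def bit_not(x):
    """Bitwise NOT restricted to N bits."""
    return ~x & MASK

def separated(a, b, c):
    """Two additions with a NOT between them: NOT(a + b) + c."""
    return (bit_not((a + b) & MASK) + c) & MASK

def clustered(a, b, c):
    """Same function with the NOT pushed to the inputs, so all additions are adjacent."""
    return (bit_not(a) + bit_not(b) + 1 + c) & MASK

# Exhaustive equivalence check over all N-bit operands.
assert all(
    separated(a, b, c) == clustered(a, b, c)
    for a in range(1 << N) for b in range(1 << N) for c in range(1 << N)
)
```

In the clustered form the four operands can be summed by a single multi-operand adder, which is where a carry-save representation becomes usable.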

  10. Examples of Transformations • Advancing shift left over add (distributivity of multiplication over addition): (A << k) ≡ A · 2^k • Advancing shift right over addition is more complex • Advancing SEL over add (existence of the identity element of addition): C ? (A + B) : D ≡ (C ? A : D) + (C ? B : 0)
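
The identities on this slide can be checked by brute force on small unsigned integers. This is a minimal sketch: `sel` is a hypothetical helper standing in for the slide's `C ? X : Y`, and the operand ranges are arbitrary.

```python
def sel(c, x, y):
    """Hypothetical helper modelling the selector C ? X : Y."""
    return x if c else y

def shift_left_distributes(limit=16, max_shift=4):
    """(A + B) << k equals (A << k) + (B << k), since << k is multiplication by 2**k."""
    return all(
        ((a + b) << k) == (a << k) + (b << k)
        for a in range(limit) for b in range(limit) for k in range(max_shift)
    )

def sel_distributes(limit=8):
    """C ? (A + B) : D equals (C ? A : D) + (C ? B : 0), because 0 is the additive identity."""
    return all(
        sel(c, a + b, d) == sel(c, a, d) + sel(c, b, 0)
        for c in (0, 1) for a in range(limit) for b in range(limit) for d in range(limit)
    )

def shift_right_counterexample():
    """Naively distributing >> over + loses the carry between discarded bits."""
    a, b, k = 1, 1, 1
    return ((a + b) >> k, (a >> k) + (b >> k))

assert shift_left_distributes()
assert sel_distributes()
print(shift_right_counterexample())
```

The last function illustrates why advancing a right shift over an addition needs more care: the two results differ for A = B = 1 and k = 1.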

  11. Some Transformations Have a Cost • Advancing PP over add (distributive property of multiplication over addition) • This transformation has a significant cost in terms of area!

  12. Generation of All Pareto-Optimal Implementations • Pareto-optimal: better than any other in terms of area or critical-path delay • Theorem: The transformations form a persistent and confluent reduction system
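
The Pareto-optimality criterion can be sketched as a simple dominance filter over (delay, area) pairs. The function name is illustrative and this is not the authors' enumeration procedure; the three sample points are the comparator results quoted later in this deck.

```python
def pareto_front(points):
    """Keep the (delay, area) points that no other point matches or beats on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (delay in ns, area in square micrometres) for the three comparator designs in the deck.
designs = [(0.46, 1755), (0.21, 3479), (0.22, 1331)]
print(pareto_front(designs))
```

A design survives if it is the best choice for at least one trade-off between area and critical-path delay, which is the sense of "better than any other in terms of area or critical-path delay".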

  13. Example: adpcmdecode Kernel. Figure labels: AND network; Compressor tree. Reported results: 0.51 ns, 4901 μm²; 0.85 ns, 5678 μm²

  14. Outline • Creating Macroscopic Structure • Exploring Microscopic Structure • Optimizing at Word-Level. Annotations: Limited scope for optimizations; Bit-level

  15. Implementation of Subcircuits Corresponding to Contiguous Layers Can Be Improved • Leading Zero Anticipator (LZA): an arithmetic layer (ADD) followed by a logic layer (LZD) • A direct implementation of LZA in carry-select fashion [Gerwig99]

  16. Input Condensation • Leader expressions: sufficient to evaluate the whole of an expression; once you evaluate them, you can discard the input bits • Compute all leader expressions in parallel • Recursively compute leader expressions again. Figure labels: Some Large Circuit (IN to OUT); Leader expressions L, with |L| < |IN|; Smaller circuit (to OUT); 8-input parallel counter; s, c

  17. Progressive Decomposition: Algorithm Overview • Choose a subset of input bits • How many bits? • Many different combinations? • Find leader expressions • Optimize via Boolean ring properties • Find identities • Discard dependent expressions • Rewrite circuit in terms of leader expressions • Recursively process the remaining circuit. Figure labels: x, y, z; z = f(x, y)

  18. Example: 3-Input Adder (s2 Output) • X = [a1b1 + (a1 + b1)a0b0] ⊕ [(a1 ⊕ b1 ⊕ a0b0)c1 + c0(a0 ⊕ b0)(c1 + (a1 ⊕ b1 ⊕ a0b0))] • L(X, {a1, b1, c1}) = {a1 ⊕ b1 ⊕ c1, a1b1 ⊕ b1c1 ⊕ a1c1}. Figure labels: Ripple-Carry Adder (twice); Carry-save adder (CSA) with sum and carry outputs; 3:2 Compressor
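
The example can be checked exhaustively. The sketch below assumes the operators lost in extraction are XOR and reads `+` as OR; under that reading X equals bit 2 of the sum of three 2-bit numbers, and X is fully determined by the two leader expressions together with the untouched inputs a0, b0, c0. Function names are illustrative.

```python
from itertools import product

def s2_direct(a1, a0, b1, b0, c1, c0):
    """Bit 2 of (a + b + c) for 2-bit operands, computed arithmetically."""
    total = (2 * a1 + a0) + (2 * b1 + b0) + (2 * c1 + c0)
    return (total >> 2) & 1

def s2_expression(a1, a0, b1, b0, c1, c0):
    """The expression X from the slide, reading '+' as OR (an assumption)."""
    t = a1 ^ b1 ^ (a0 & b0)
    left = (a1 & b1) | ((a1 | b1) & a0 & b0)
    right = (t & c1) | (c0 & (a0 ^ b0) & (c1 | t))
    return left ^ right

def leaders(a1, b1, c1):
    """Leader expressions over {a1, b1, c1}: the sum and carry of a 3:2 compressor."""
    return (a1 ^ b1 ^ c1, (a1 & b1) ^ (b1 & c1) ^ (a1 & c1))

def leaders_suffice():
    """True if X agrees with the arithmetic and depends on a1, b1, c1 only via the leaders."""
    seen = {}
    for a1, a0, b1, b0, c1, c0 in product((0, 1), repeat=6):
        x = s2_expression(a1, a0, b1, b0, c1, c0)
        if x != s2_direct(a1, a0, b1, b0, c1, c0):
            return False
        key = leaders(a1, b1, c1) + (a0, b0, c0)
        if seen.setdefault(key, x) != x:
            return False
    return True

assert leaders_suffice()
```

Three input bits are replaced by two leader bits, which is the condensation step: the remaining circuit no longer needs a1, b1 or c1 individually.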

  19. A Better Division Is Used for Leader Expression Computation • X = ab(c ⊕ d ⊕ e) ⊕ cd(a ⊕ b ⊕ e) is factored as X = (ab ⊕ cd)(a ⊕ b ⊕ c ⊕ d ⊕ e) • Based on the identity: pq(p ⊕ q) = 0 • Theorem: An expression of the form (PQ ⊕ RS) can be factored as (P ⊕ R)T if there exist U and V such that 1) PU = RV = 0 and 2) Q ⊕ S = U ⊕ V • The ideal membership problem can be used to determine the existence of such U and V
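
The factorization on this slide can be confirmed by enumerating all 32 input combinations. The sketch assumes the lost operators are XOR and takes the first factor as ab XOR cd, which is what the theorem's (P ⊕ R) form requires. In the theorem's terms, P = ab, Q = c ⊕ d ⊕ e, R = cd, S = a ⊕ b ⊕ e, U = a ⊕ b, V = c ⊕ d and T = Q ⊕ U.

```python
from itertools import product

def original(a, b, c, d, e):
    """X = ab(c XOR d XOR e) XOR cd(a XOR b XOR e)."""
    return (a & b & (c ^ d ^ e)) ^ (c & d & (a ^ b ^ e))

def factored(a, b, c, d, e):
    """X = (ab XOR cd)(a XOR b XOR c XOR d XOR e)."""
    return ((a & b) ^ (c & d)) & (a ^ b ^ c ^ d ^ e)

def identity_holds():
    """pq(p XOR q) = 0 for all Boolean p, q."""
    return all((p & q & (p ^ q)) == 0 for p in (0, 1) for q in (0, 1))

def factorization_holds():
    """The two forms agree on every input combination."""
    return all(original(*bits) == factored(*bits) for bits in product((0, 1), repeat=5))

assert identity_holds()
assert factorization_holds()
```

The condition PU = RV = 0 is what lets ab(c ⊕ d ⊕ e) be rewritten as ab(a ⊕ b ⊕ c ⊕ d ⊕ e) without changing its value, so both products end up sharing the same factor T.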

  20. Progressive Decomposition: Qualitative Analysis • Completely agnostic of the type of circuit to optimize • Automatically infers successful circuit designs from the literature… • Carry-lookahead adder (beyond minimal sizes) • Structured LZD/LOD circuit • Optimized LZA circuit (no sum computation) • Carry-save addition • Parallel counter • …and discovers some unknown to us! • Multi-Input comparisons (min/max)

  21. Multi-Input Comparator (Min/max of k n-bit Integers) • Binary tree of comparators: number of comparators: k − 1; critical path delay: O(log n · log k); hardware area: O(kn); 0.46 ns, 1755 μm² • Pairwise comparison of inputs: number of comparators: k(k − 1)/2; critical path delay: O(log n + log k); hardware area: O(k²n); 0.21 ns, 3479 μm² • With Our Structuring Algorithm (bitwidth reduction using dominators and LODs): number of LODs: k · log* n; critical path delay: O(log n + log k · log* n); hardware area: O(kn); 0.22 ns, 1331 μm² • log*() is the number of times the logarithm function must be iteratively applied before the result is ≤ 1, e.g., log*(2^65536) = 5
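
The iterated logarithm in the complexity bounds can be computed directly from its definition. This is a minimal sketch assuming base-2 logarithms, which is consistent with the slide's example; `log_star` is an illustrative name.

```python
import math

def log_star(x):
    """Number of times log2 must be applied to x before the result is <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)   # math.log2 accepts arbitrarily large Python integers
        count += 1
    return count

# The slide's example: 2**65536 -> 65536 -> 16 -> 4 -> 2 -> 1, five applications.
assert log_star(2 ** 65536) == 5
```

Because log* grows so slowly, the k · log* n term in the LOD count stays close to linear in k for any practical bitwidth n.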

  22. Outline • Creating Macroscopic Structure • Exploring Microscopic Structure • Optimizing at Word-Level. Annotations: Reed-Muller form can be very inefficient; Exhaustive Exploration; Efficient implementation of the leader expressions?

  23. Problem Statement: Given a set of Boolean expressions, generate all their Pareto-optimal implementations. Figure labels: no “reuse”; total “reuse”; selective “reuse”

  24. Enumerating Common Sub-Expressions • Root: Original Reed-Muller form • Either x ⊕ y or xy replaced by a new variable • The nodes of the DAG correspond to all partial implementations of the two expressions with some sharing between them
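
A single edge of such an enumeration DAG can be sketched as one substitution step on a Reed-Muller form. The representation (a set of monomials, each a frozenset of variable names), the example expression and the function names are all assumptions for illustration; only the replacement of a product of two variables is shown.

```python
from itertools import product

def evaluate(poly, env):
    """Evaluate a Reed-Muller form: the XOR of its AND-monomials."""
    value = 0
    for mono in poly:
        value ^= int(all(env[v] for v in mono))
    return value

def substitute_product(poly, x, y, new_var):
    """Replace the product x*y by new_var in every monomial that contains both."""
    result = set()
    for mono in poly:
        if x in mono and y in mono:
            mono = (mono - {x, y}) | {new_var}
        result ^= {frozenset(mono)}   # XOR semantics: a repeated monomial cancels
    return result

# Hypothetical expression f = ab XOR abc XOR d, then share t = ab.
f = {frozenset("ab"), frozenset("abc"), frozenset("d")}
f_shared = substitute_product(f, "a", "b", "t")

def still_equivalent():
    """The rewritten form computes the same function once t is bound to a AND b."""
    for a, b, c, d in product((0, 1), repeat=4):
        env = {"a": a, "b": b, "c": c, "d": d, "t": a & b}
        if evaluate(f, env) != evaluate(f_shared, env):
            return False
    return True

assert still_equivalent()
```

Each such substitution yields a child node with a partially implemented expression; applying different substitutions in different orders is what produces the DAG of alternatives.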

  25. Pruning the Enumeration DAG • The size of the DAG can be as large as O ((n + m) 2m), where n is the number of variables and m is the size of the Boolean expressions • Enumerating the whole DAG is computationally infeasible • Pruning criteria: recognizing node equivalence (width reduction); merging some reductions into a single one (height reduction); delaying certain reductions (branch reduction)

  26. There Is Scope for Further Pruning… • Number of possible implementations: >10^60 • Number of explored implementations: 2687 • Number of actual Pareto-optimal implementations: 4 • Figure: area and delay for all 6-bit adders generated by our algorithm • Without any pruning, it would be impossible to handle expressions with more than five variables

  27. …but the Enumeration Algorithm Finds Interesting Non-Trivial Relations! • 4×4-bit multiplier: better than our best manually-designed cell-based multiplier?! • The method has been generalized for higher bitwidth multipliers • It reduced the delay of the best cell-based 8×8-bit multiplier by 10% • Verma & Ienne; ASP-DAC 2007 Best Paper Award nominee

  28. Summary • Creating Macroscopic Structure • Exploring Microscopic Structure • Optimizing at Word-Level. Publications shown: Verma, Brisk, & Ienne, DAC 2007 (Best Paper Award nominee); Verma, Brisk, & Ienne, IWLS 2008; Verma & Ienne, DAC 2006; Verma & Ienne, ICCAD 2004; Verma, Brisk, & Ienne, TCAD 2008

  29. Computer Arithmetic and Automation • Computer Arithmetic has long been the domain of extremely ingenious, manually developed architectures • Automation has mostly addressed the optimization of such architectures through the exploration of the predefined design spaces they delimit • Logic synthesis, from the “bottom”, has failed to explore beyond known territories due to fairly fundamental issues • It is perhaps high time to try to change all this…

  30. Thanks!
