Iterative Layering: Optimizing Arithmetic Circuits by Structuring the Information Flow

Ajay K. Verma (1), Philip Brisk (2), Paolo Ienne (1)
(1) Processor Architecture Laboratory, School of Computer and Communication Sciences, Ecole Polytechnique Fédérale de Lausanne (EPFL)
(2) Department of Computer Science and Engineering, Bourns College of Engineering, University of California, Riverside
International Conference on Computer-Aided Design, November 5, 2009

Logic Optimization Strategies
• Logic synthesis tools: local optimization via Boolean minimization
• Architectural transformation (e.g., ripple-carry adder to carry-lookahead adder): not with “traditional” logic synthesis

Leading Zero Detector
• Naïve implementation
• Optimized implementation [Oklobdzija, TVLSI 1994]: 16% faster, 8% smaller

Outline
• Decomposition Techniques
• Progressive Decomposition and its Shortcomings [Verma et al., DAC 2007]
• Iterative Layering Algorithm
• Experimental Results
• Conclusion

Decomposition
• Optimize the red block (highlighted in the slide figure) locally
• Recursively decompose the remaining circuit
• Input condensation: at each step, fewer input bits remain
• Imposes hierarchy on the circuit
• The result is a well-structured hierarchical circuit

Disjoint Decomposition vs. Non-disjoint Decomposition

Disjoint Decomposition Example: 8:4 Parallel Counter
[Figure: counter built from full adders, each with outputs s and c]
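The following is a minimal sketch of how a parallel counter can be assembled from full adders alone. The column-by-column arrangement and the names `full_adder` and `parallel_counter` are assumptions made for illustration; the slide figure may wire its adders differently. Every signal is consumed by exactly one adder, which is the property a disjoint decomposition relies on.

```python
from itertools import product

def full_adder(a, b, c):
    """Return (s, carry) with a + b + c == s + 2 * carry."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def parallel_counter(bits):
    """Count the ones in a non-empty bit list using full adders only.

    Bits of equal weight are kept in one column. Each adder removes up to
    three bits from a column, returns one sum bit to it, and pushes one
    carry bit into the next column. The result is LSB first.
    """
    columns = {0: list(bits)}
    out, w = [], 0
    while w in columns:
        col = columns[w]
        while len(col) > 1:
            a, b = col.pop(), col.pop()
            c = col.pop() if col else 0  # third input tied to 0 acts as a half adder
            s, carry = full_adder(a, b, c)
            col.append(s)
            columns.setdefault(w + 1, []).append(carry)
        out.append(col[0])
        w += 1
    return out

# Exhaustive check of the 8-input case: 8 input bits, 4 output bits.
results = {v: parallel_counter(v) for v in product((0, 1), repeat=8)}
```

Because each full adder preserves the weighted sum of its inputs, the bits left in the columns form the binary count of ones.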

4x4-bit Multiplier
• Inputs X and Y (4 bits each) feed the partial product generator (PPG), which produces 16 bits:
  y3x0 y2x0 y1x0 y0x0
  y3x1 y2x1 y1x1 y0x1
  y3x2 y2x2 y1x2 y0x2
  y3x3 y2x3 y1x3 y0x3
• The partial product reduction tree (Σ) sums the 16 bits
• Partial product reduction tree has a disjoint decomposition
• The partial product generator requires a non-disjoint decomposition
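The two stages can be sketched as below. The function names are hypothetical, and the reduction is modeled as a plain weighted sum rather than an actual adder tree. The point of the sketch is the structure: each input bit is read by four partial products (shared inputs), while each partial product is consumed once by the sum.

```python
def partial_products(x, y, n=4):
    """PPG stage: entry [i][j] is y_j AND x_i, carrying weight 2**(i + j)."""
    return [[((y >> j) & 1) & ((x >> i) & 1) for j in range(n)]
            for i in range(n)]

def reduce_partial_products(pp):
    """Reduction stage, modeled as a weighted sum of the partial product bits."""
    return sum(bit << (i + j)
               for i, row in enumerate(pp)
               for j, bit in enumerate(row))

def multiply(x, y, n=4):
    return reduce_partial_products(partial_products(x, y, n))

# Exhaustive check against integer multiplication for all 4-bit operands.
all_match = all(multiply(x, y) == x * y for x in range(16) for y in range(16))
```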

Compound Circuits
[Figure: two implementations of the g72x compound circuit, with signals E1, E2, M1, M2, s1, s2, sign, and out, built from and, not, xor, and neg blocks. Annotation: 12% faster, 55% larger]

Progressive Decomposition [Verma et al., DAC 2007]
Successfully structured some arithmetic circuits:
• Ripple-carry adder: inferred parallel prefix adder
• 3-input ripple-carry adder: inferred carry-save adder
• Leading zero detector: inferred design of [Oklobdzija 1994]
• Various counters, majority functions, etc.: inferred carry-free structures based on carry-save addition

Progressive Decomposition [Verma et al., DAC 2007]
• Disjoint decomposition
  • Forget about multipliers
  • Cannot handle compound arithmetic circuits
• Entire algorithm based on Reed-Muller form
  • Rewrite ‘your’ optimizer, e.g., if you use AIGs or BDDs
  • Exponential size for leading one detector
  • Leading zero detector remains polynomial

Iterative Layering
• Non-disjoint decomposition
  • Yields disjoint decompositions when appropriate
• Not tied to any specific circuit representation
  • Our implementation uses BDDs
• SAT-based functional dependence test [Lee et al., ICCAD 2007]
  • Requires efficient conversion to CNF
  • Functional dependence is inherent to any decomposition

Iterative Layering: Outline
• Bricks
  • Definition and algorithmic overview
  • Evaluation metrics
• Brick Enumeration
  • Cofactor enumeration
  • Generate bricks from cofactors
• Brick Selection
  • Problem formulation related to Set Cover

Bricks
• A subcircuit with < k inputs and one output
• Any functional dependence may exist between a brick and the original expression
• Kernels and co-kernels are bricks
  • The dependence is disjunctive by definition
• Example: E = ac + ad + bc + bd (7 gates)
  • Brick: p = a + b (1 gate)
  • E = pc + pd (4 gates)
  • E = p(c + d) (3 gates)
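The three forms of E in the example can be checked for equivalence by exhaustive simulation. This is only a small sanity check with illustrative function names; the gate counts in the comments assume two-input AND and OR gates.

```python
from itertools import product

def flat(a, b, c, d):
    # E = ac + ad + bc + bd: 4 AND gates + 3 OR gates = 7 gates
    return (a & c) | (a & d) | (b & c) | (b & d)

def with_brick(a, b, c, d):
    # p = a + b (1 gate), then E = pc + pd (2 AND + 1 OR): 4 gates in total
    p = a | b
    return (p & c) | (p & d)

def factored(a, b, c, d):
    # p = a + b (1 gate), then E = p(c + d) (1 OR + 1 AND): 3 gates in total
    p = a | b
    return p & (c | d)

equivalent = all(
    flat(*v) == with_brick(*v) == factored(*v)
    for v in product((0, 1), repeat=4)
)
```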

Iterative Layering Algorithm
• Enumerate all bricks having < k inputs (k = 6 in our implementation)
• Evaluate all bricks based on a merit function
• Select a subset of bricks
  • The subset must contain all of the information about the circuit
  • The subset should be optimal w.r.t. some optimization criteria
  • The selected bricks form a “layer”
• Stack layers on top of one another to structure the circuit

Information Fitness
• Estimated gate reduction
• Size of BDD of input expression [Macii et al., GLS-VLSI 1999]
• Info. Fitness = Size(BDD_f) / (Size(BDD_g) + Size(BDD_p))
[Figure: blocks labeled p, f, and g]

Information Coverage
• E: expression to optimize
• p: brick under consideration
• D = on-set(E) × off-set(E)
• N = {(x, y) ∈ D | p(x) ≠ p(y)}
• Info. Coverage = |N| / |D|
• Intuition: attempt to quantify the functional dependency from p to E
• Limitation: requires a completely specified truth table, whose size is exponential in the number of inputs
• Our approach: randomly sample the truth table of E
  • Theorem 1 in the paper includes some probabilistic justification
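A minimal sketch of the coverage metric follows, in an exact form over the full truth table and in a sampled form. The function names, the uniform sampling scheme, and the sample count are assumptions made for illustration; the paper's sampling procedure and its Theorem 1 are not reproduced here.

```python
from itertools import product
import random

def coverage(on, off, p):
    """|N| / |D| for D = on x off and N = pairs (x, y) with p(x) != p(y)."""
    if not on or not off:
        return 0.0
    on1 = sum(p(*x) for x in on)    # on-set points where p = 1
    off1 = sum(p(*y) for y in off)  # off-set points where p = 1
    n_pairs = on1 * (len(off) - off1) + (len(on) - on1) * off1
    return n_pairs / (len(on) * len(off))

def exact_coverage(E, p, n):
    """Coverage computed from the complete truth table of E (2**n rows)."""
    pts = list(product((0, 1), repeat=n))
    return coverage([x for x in pts if E(*x)], [x for x in pts if not E(*x)], p)

def sampled_coverage(E, p, n, samples=2000, seed=0):
    """Estimate of the coverage from uniformly sampled input vectors."""
    rng = random.Random(seed)
    pts = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(samples)]
    return coverage([x for x in pts if E(*x)], [x for x in pts if not E(*x)], p)

E = lambda a, b, c, d: (a & c) | (a & d) | (b & c) | (b & d)
p = lambda a, b, c, d: a | b

exact = exact_coverage(E, p, 4)
estimate = sampled_coverage(E, p, 4)
```

For this E and p, the on-set has 9 points (all with p = 1) and the off-set has 7 points (4 of them with p = 0), so the exact value is 36/63.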

Brute Force Cofactor Enumeration
• Enumerate every combination of k input bits
• Example: E = ab ⊕ cd ⊕ (a ⊕ b)(c ⊕ d), B = {a, b, c}, R = {d}
• Enumerate the set of cofactors with respect to R
  • S = {E|d=0, E|d=1} = {ab ⊕ bc ⊕ ac, ab ⊕ bc ⊕ ac ⊕ a ⊕ b ⊕ c}
• Problem: |S| = 2^|R|
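The brute-force enumeration can be sketched for the example above. Cofactors are represented as truth tables over the bound variables B; the helper name `cofactors` is hypothetical. The loop visits all 2^|R| assignments of the remaining variables, which is the cost the slide identifies as the problem.

```python
from itertools import product

def E(a, b, c, d):
    return (a & b) ^ (c & d) ^ ((a ^ b) & (c ^ d))

def cofactors(f, n_bound, n_rest):
    """Map each distinct cofactor (a truth table over the bound variables)
    to the assignments of the remaining variables that produce it."""
    table = {}
    for r in product((0, 1), repeat=n_rest):  # 2**n_rest iterations
        tt = tuple(f(*b, *r) for b in product((0, 1), repeat=n_bound))
        table.setdefault(tt, []).append(r)
    return table

S = cofactors(E, n_bound=3, n_rest=1)

# The two cofactors listed on the slide, written as truth tables over (a, b, c).
cof_d0 = tuple((a & b) ^ (b & c) ^ (a & c)
               for a, b, c in product((0, 1), repeat=3))
cof_d1 = tuple((a & b) ^ (b & c) ^ (a & c) ^ a ^ b ^ c
               for a, b, c in product((0, 1), repeat=3))
```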

Cofactor Enumeration via Sampling and SAT-based Functional Dependence Testing
1. Generate an initial set of cofactors using random sampling
2. Test if E depends on the cofactors and any remaining variables [Lee et al., ICCAD 2007]
   • SAT = FALSE implies a full dependence
   • SAT = TRUE implies a partial dependence
   • Satisfying assignment of input variables yields one missing cofactor
3. Repeat Step 2 until SAT = FALSE
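The loop structure of these steps can be sketched as follows. Note the substitution: an exhaustive search stands in for the SAT-based dependence test of [Lee et al., ICCAD 2007], so this sketch illustrates the control flow only and has none of the scalability of the real method. All function names are hypothetical, and E takes a single tuple of bits (bound variables first, then the remaining ones).

```python
from itertools import product
import random

def cofactor(E, r, nb):
    """Truth table of E over the nb bound variables with the rest fixed to r."""
    return tuple(E(b + r) for b in product((0, 1), repeat=nb))

def missing_assignment(E, cofs, nb, nr):
    """Exhaustive stand-in for the SAT query. Returns an assignment r of the
    remaining variables for which two bound-variable points agree on every
    known cofactor yet give different values of E, or None if E is fully
    determined by the known cofactors and the remaining variables."""
    for r in product((0, 1), repeat=nr):
        seen = {}
        for i, b in enumerate(product((0, 1), repeat=nb)):
            signature = tuple(c[i] for c in cofs)
            value = E(b + r)
            if seen.setdefault(signature, value) != value:
                return r
    return None

def enumerate_cofactors(E, nb, nr, samples=1, seed=0):
    rng = random.Random(seed)
    cofs = []
    for _ in range(samples):  # Step 1: random sampling
        c = cofactor(E, tuple(rng.randint(0, 1) for _ in range(nr)), nb)
        if c not in cofs:
            cofs.append(c)
    # Steps 2 and 3: each counterexample yields one missing cofactor.
    while (r := missing_assignment(E, cofs, nb, nr)) is not None:
        cofs.append(cofactor(E, r, nb))
    return cofs

E = lambda v: (v[0] & v[1]) ^ (v[2] & v[3]) ^ ((v[0] ^ v[1]) & (v[2] ^ v[3]))
cofs = enumerate_cofactors(E, nb=3, nr=1)
```

A cofactor that is already a function of the ones collected never triggers a counterexample, so the loop returns a set sufficient for the dependence rather than every distinct cofactor.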

Brick Computation: Summary
For every combination of at most k input bits:
• Generate the cofactors of the remaining bits
• Random sampling + SAT-based functional dependence testing
• Discard useless cofactors
• Details are in the paper
• Recursively apply iterative layering with a smaller value of k to generate the bricks from the cofactors
That’s a lot of bricks!
• Which bricks do I really need?

Brick Selection: Overview
Goal: find a minimal set of bricks that covers all points in on-set(E) × off-set(E)
• Greedy heuristic based on [Johnson, HCSS 1974]
• Select a brick that maximizes Info. Fitness × Info. Coverage
• Update Info. Fitness and Info. Coverage for the remaining bricks
• Stop when E is functionally dependent on the chosen bricks [Lee et al., ICCAD 2007]
• See the paper for details on the data structures used
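A minimal sketch of the greedy selection follows, under simplifying assumptions: the full truth table is enumerated, the fitness values are fixed placeholder weights rather than BDD-based estimates, and only the coverage is updated between rounds. The names `greedy_select`, `bricks`, and `fitness` are hypothetical. Over a complete truth table, covering every (on, off) pair is equivalent to E being a function of the chosen bricks, so the loop stops when no pair is left uncovered.

```python
from itertools import product

def greedy_select(E, bricks, fitness, n):
    """Greedy set cover over pairs in on-set(E) x off-set(E).

    bricks: dict name -> function of the n inputs
    fitness: dict name -> weight (placeholder for Info. Fitness)
    """
    pts = list(product((0, 1), repeat=n))
    on = [x for x in pts if E(*x)]
    off = [x for x in pts if not E(*x)]
    uncovered = {(x, y) for x in on for y in off}
    chosen = []
    while uncovered:
        def gain(name):
            p = bricks[name]
            newly = sum(1 for x, y in uncovered if p(*x) != p(*y))
            return fitness[name] * newly
        candidates = [b for b in bricks if b not in chosen]
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            raise ValueError("bricks cannot separate the on-set from the off-set")
        p = bricks[best]
        # Keep only the pairs the new brick still fails to distinguish.
        uncovered = {(x, y) for x, y in uncovered if p(*x) == p(*y)}
        chosen.append(best)
    return chosen

E = lambda a, b, c, d: (a | b) & (c | d)
bricks = {
    "p": lambda a, b, c, d: a | b,
    "q": lambda a, b, c, d: c | d,
    "r": lambda a, b, c, d: a & c,
    "s": lambda a, b, c, d: a ^ d,
}
fitness = {name: 1.0 for name in bricks}  # placeholder weights
layer = greedy_select(E, bricks, fitness, 4)
```

With equal weights, p and q each distinguish 36 of the 63 pairs in the first round, and together they cover all of them, so the selected layer is p followed by q.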

Experimental Setup
• Flows compared (numbered 1 to 4 in the slide figure): circuit written by hand, Iterative Layering, Progressive Decomposition [Verma et al., DAC 2007], known arithmetic circuits
• Synthesis: Synopsys Design Compiler (compile_ultra, minimize delay)
• Library: Artisan standard cells, UMC (90 nm)

Critical Path Delay (ns)
[Bar chart comparing Original, Progressive Decomposition, Iterative Layering, and Library/Manual Implementation. Annotations: "Optimized for Area, Not Delay" and "Progressive Decomposition Fails"]

Area (μm²)
[Bar chart comparing Original, Progressive Decomposition, Iterative Layering, and Library/Manual Implementation. Annotations: "Optimized for Area, Not Delay" and "Progressive Decomposition Fails"]

n-bit, k-input MAX Function
• Pairwise comparison of inputs: ½k(k − 1) comparators; delay O(log n + log k); area O(k²n); 0.21 ns, 3479 μm²
• Binary tree of comparators: k − 1 comparators; delay O(log n · log k); area O(kn); 0.46 ns, 1755 μm²
• Iterative Layering: 0.22 ns, 1331 μm² (circuit structure was unknown to us)
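The two reference architectures can be modeled in software to make the comparator counts concrete. This is a behavioral sketch with hypothetical function names, not a circuit description; it does not model the structure found by Iterative Layering, and ties in the pairwise version are broken toward the lower input index as an assumption.

```python
from itertools import combinations

def max_pairwise(values):
    """All pairs compared at once; input i wins if it beats every other input."""
    k = len(values)
    ge = {(i, j): values[i] >= values[j] for i, j in combinations(range(k), 2)}
    for i in range(k):
        # For j > i use the comparator output; for j < i use its complement.
        if all(ge[i, j] if i < j else not ge[j, i] for j in range(k) if j != i):
            return values[i], len(ge)  # len(ge) == k * (k - 1) // 2

def max_tree(values):
    """Comparators arranged as a binary tree; each level halves the candidates."""
    level, comparators = list(values), 0
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] if level[i] >= level[i + 1] else level[i + 1])
            comparators += 1
        if len(level) % 2:  # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0], comparators

inputs = [22, 59, 62, 61]
pairwise_result = max_pairwise(inputs)
tree_result = max_tree(inputs)
```

For k = 4 the pairwise version uses 6 comparators in a single level, while the tree uses 3 comparators across 2 levels, which mirrors the area and delay trade-off listed above.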

8-bit, 4-input MAX Example
[Figure: integer domination table. At each level, any all-zero column is replaced with ones and the leading 1’s of each row are counted.]
• Input rows: 0010110, 0111011, 0111110, 0111101
• After replacing the all-zero column with ones: 1010110, 1111011, 1111110, 1111101
• Count leading 1’s: 001 (1), 100 (4), 110 (6), 101 (5)
• Count leading 1’s again: 00 (0), 01 (1), 10 (2), 01 (1)
• Final level: 0, 0, 1, 0; the single 1 marks the MAX

Conclusion
• Iterative Layering structures arithmetic circuits
  • Automatically infer well-known manual designs from arithmetic literature
• Fixes shortcomings of Progressive Decomposition
  • Non-disjoint decomposition
  • Usable with any circuit representation
[Table: PD and IL compared on ADD, 3-ADD, LZD, MUL, SHFT, MAX, and Compound Arithmetic Circuits]
