LMS: A New Logic Synthesis Method Based on Pre-Computed Library
Wenlong Yang, Lingli Wang (State Key Lab of ASIC and System, Fudan University, Shanghai, China)
Alan Mishchenko (Department of EECS, University of California, Berkeley)
Outline
• Introduction
• Previous Work
• Lazy Man’s Logic Synthesis (LMS)
• Experimental Results
• Conclusion & Future Work
Introduction
• Goal of logic synthesis: deriving a circuit or improving an available circuit
• We propose a “lazy” approach that reuses optimal structures derived by other synthesis tools, based on a precomputed library
[Figure: flow diagram with the labels “Other tools”, “A function with N variables”, “LMS”, “precomputed library”, and “AIG”]
Previous Work
• Logic synthesis based on a precomputed library has been proposed in several papers, but each approach differs from LMS:
• LMS
  • Precomputes structures in terms of AIGs
  • Uses public benchmarks and existing tools
  • Looks at 6-16 input functions
  • Stores many equivalent structures
• Previous work
  • Precomputes structures in terms of LUTs [Kennings, IWLS, 2010]
  • Did not use preexisting benchmarks or tools [Bjesse, ICCAD, 2004]
  • Looks at only 4-5 input functions [Li, IWLS, 2011]
  • Only computes multiple structure choices [Chatterjee, TCAD, 2006]
Previous Work – SOP Balancing
• For each node
  • Compute several k-input cuts
  • Perform delay-optimal tree balancing of the SOP
  • Select the best one to replace the current structure
[Figure: an AIG subgraph found in benchmark s27.blif where SOP balancing loses to the proposed approach; labeled F = !c*!b + !c*a and F’ = !c*!(b*!a)]
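The two expressions on this slide describe the same function, which a short exhaustive check can confirm. The sketch below is illustrative only; the helper names are hypothetical and not part of ABC. Written with two-input ANDs and inverters and without any sharing, the SOP form F uses three AND nodes (two products plus an OR realized as an inverted AND), while F’ uses two.

```python
from itertools import product

def f_sop(a, b, c):
    # SOP form from the slide: F = !c*!b + !c*a
    return ((not c) and (not b)) or ((not c) and a)

def f_factored(a, b, c):
    # Factored form from the slide: F' = !c*!(b*!a)
    return (not c) and not (b and (not a))

def truth_table(fn, num_vars):
    # Evaluate fn on every input assignment, in a fixed order.
    return [bool(fn(*bits)) for bits in product([False, True], repeat=num_vars)]

# Both forms agree on all 8 input assignments.
assert truth_table(f_sop, 3) == truth_table(f_factored, 3)
```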
Outline
• Introduction
• Previous Work
• Lazy Man’s Logic Synthesis (LMS)
  • Equivalence Classes
  • Library Representation/Construction
  • Implementation
• Experimental Results
• Conclusion
Equivalence Classes
• LMS is based on collecting, storing, and reusing circuit structures of Boolean functions with 6-16 input variables.
• The total number of completely specified Boolean functions of N variables is 2^(2^N).
• Experiments show that even for practical functions this number can be very large. To reduce the number of functions and the memory needed to store them in a library, a canonical form is used to group them into equivalence classes.
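To give a feel for how quickly 2^(2^N) grows, the following small sketch evaluates it for a few values of N. The function name is a hypothetical helper chosen for illustration.

```python
def num_boolean_functions(num_vars):
    # A truth table has 2**num_vars rows, and each row can independently be 0 or 1.
    return 2 ** (2 ** num_vars)

for n in (2, 4, 6):
    print(n, num_boolean_functions(n))

# For 16 inputs the count is 2**65536; report its size in bits rather than printing it.
print(16, "inputs:", num_boolean_functions(16).bit_length(), "bits")
```

Already at N = 6 the count is 2^64, which is why a library cannot simply enumerate functions and must group them into classes.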
NPN
• Two functions are NPN-equivalent if one can be obtained from the other by negating inputs, permuting inputs, and/or negating the output.
• Drawbacks of NPN computation:
  • Time-consuming
  • Complicated
• Computing a complete NPN canonical form is not affordable for LMS.
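A brute-force sketch makes the cost of exact NPN canonization concrete. It is an illustration under stated assumptions, not the method used in LMS or ABC: truth tables are stored as Python integers (bit m holds the function value for minterm m), the canonical representative is taken to be the numerically smallest transformed table, and all function names are hypothetical. The search visits N! * 2^(N+1) transformations, which is only feasible for very small N.

```python
from itertools import permutations
from math import factorial

def apply_transform(tt, n, perm, in_neg, out_neg):
    # Build g(x) = out_neg XOR f(y), where y[perm[i]] = x[i] XOR (bit i of in_neg).
    result = 0
    for m in range(1 << n):
        src = 0
        for i in range(n):
            bit = ((m >> i) & 1) ^ ((in_neg >> i) & 1)
            src |= bit << perm[i]
        value = ((tt >> src) & 1) ^ out_neg
        result |= value << m
    return result

def npn_canonical(tt, n):
    # Exhaustive search: smallest truth table over all NPN transformations.
    return min(
        apply_transform(tt, n, perm, in_neg, out_neg)
        for perm in permutations(range(n))
        for in_neg in range(1 << n)
        for out_neg in (0, 1)
    )

def num_npn_transforms(n):
    # n! input permutations, 2**n input negations, 2 output polarities.
    return factorial(n) * 2 ** (n + 1)

AND2, OR2, XOR2 = 0b1000, 0b1110, 0b0110
print(npn_canonical(AND2, 2) == npn_canonical(OR2, 2))   # AND and OR share a class
print(npn_canonical(AND2, 2) == npn_canonical(XOR2, 2))  # XOR is in a different class
```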
Semi-Canonical Form
• The idea is to order the input variables and fix the polarities of the inputs and the output using the number of positive minterms of the function and of its cofactors w.r.t. each variable.
Input: truth table F
• Determine the polarity of F by the number of 1s in the truth table
• Determine the polarity of each variable by the number of 1s in the negative cofactor w.r.t. that variable
• Sort the input variables by the number of 1s in their negative cofactors and permute the inputs accordingly
Output: canonicized truth table F
• A reasonable trade-off between accuracy and speed
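The three steps above can be sketched in a few lines. This is a minimal illustration, not ABC's implementation: truth tables are Python integers (bit m is the value for minterm m), and the exact thresholds are assumptions, namely that the output is complemented when more than half of the minterms are 1 and that a variable is negated when its negative cofactor has more 1s than its positive cofactor. Ties between variables are left in their original order, which is one reason the form is only semi-canonical.

```python
def cofactor_ones(tt, n, var, value):
    # Count minterms with f = 1 where variable `var` equals `value`.
    return sum(1 for m in range(1 << n)
               if ((m >> var) & 1) == value and (tt >> m) & 1)

def flip_var(tt, n, var):
    # Truth table of f with input `var` negated.
    res = 0
    for m in range(1 << n):
        if (tt >> (m ^ (1 << var))) & 1:
            res |= 1 << m
    return res

def permute(tt, n, order):
    # New variable j takes the role of old variable order[j].
    res = 0
    for m in range(1 << n):
        src = 0
        for j, old in enumerate(order):
            src |= ((m >> j) & 1) << old
        if (tt >> src) & 1:
            res |= 1 << m
    return res

def semi_canonicize(tt, n):
    size = 1 << n
    # Step 1: output polarity, decided by the number of 1s (assumed threshold).
    if bin(tt).count("1") > size // 2:
        tt = ~tt & ((1 << size) - 1)
    # Step 2: input polarities, decided by comparing cofactor counts.
    for v in range(n):
        if cofactor_ones(tt, n, v, 0) > cofactor_ones(tt, n, v, 1):
            tt = flip_var(tt, n, v)
    # Step 3: sort variables by the number of 1s in their negative cofactors.
    order = sorted(range(n), key=lambda v: cofactor_ones(tt, n, v, 0))
    return permute(tt, n, order)

# x0 & (x1 | x2) and x2 & (x0 | x1) differ only by an input permutation.
print(semi_canonicize(168, 3) == semi_canonicize(224, 3))
# 2-input AND and OR differ by input and output negation.
print(semi_canonicize(0b1000, 2) == semi_canonicize(0b1110, 2))
```

Each step costs time roughly proportional to the truth-table size times the number of variables, in contrast to the factorial search needed for an exact NPN canonical form.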
Library Representation
• An N-input library contains functions of up to N variables.
• The structures of all functions are represented as a shared AIG.
• Each output of the AIG is the root node of one logic structure.
• When a library is loaded, the following actions are performed:
  • A hash table is created that hashes the outputs by their semi-canonical forms.
  • For each structure, the area and pin-to-output delays are computed and stored.
Pin-To-Output Delay & Dominated Structure
Example of using pin-to-output delays to compute structure delay:
  Suppose arrival times:   {3, 2, 4, 5, 2, 3, 1}
+ Pin-to-output delays:    {3, 3, 3, 5, 5, 4, 1}
=                          {6, 5, 7, 10, 7, 7, 2}
• If one structure’s pin-to-output delays are worse than another’s with respect to every input, the structure is dominated.
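The example on this slide can be reproduced with the sketch below. Two points are assumptions rather than statements from the slides: the structure delay is taken to be the maximum over all inputs of arrival time plus pin-to-output delay, and "worse" is read as strictly larger. The function names are hypothetical.

```python
def per_input_delays(arrival, pin_to_output):
    # Arrival time at the output through each input pin.
    return [a + d for a, d in zip(arrival, pin_to_output)]

def structure_delay(arrival, pin_to_output):
    # Assumption: the output is ready when the slowest path arrives.
    return max(per_input_delays(arrival, pin_to_output))

def is_dominated(candidate, other):
    # Assumption: "worse with respect to every input" means strictly larger on every pin.
    return all(c > o for c, o in zip(candidate, other))

arrival = [3, 2, 4, 5, 2, 3, 1]
pin_to_output = [3, 3, 3, 5, 5, 4, 1]

print(per_input_delays(arrival, pin_to_output))  # [6, 5, 7, 10, 7, 7, 2]
print(structure_delay(arrival, pin_to_output))   # 10
```

Storing per-pin delays instead of a single number lets the same library structure be evaluated against different arrival-time profiles.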
Library Construction
• The LUT mapper “if” in ABC is used as a structural cut browser to generate K-input cuts, whose logic structures are added to the library.
Input: cut C
• If cut C does not meet the requirements, return
• Compute the Boolean function F of cut C as a truth table
• Compute the semi-canonical form of F
• Rebuild the structure of the cut in the library
• If the structure already exists or is dominated, return
• Add a new primary output to store the structure in the hash table
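The insertion logic above can be sketched as a hash table keyed by the semi-canonical truth table. This is a simplified model under several assumptions: structures are opaque labels rather than nodes of a shared AIG, the semi-canonical key is computed elsewhere and passed in, the "requirements" check on the cut is omitted, and dominance is the strict per-pin comparison. The class and method names are hypothetical.

```python
from collections import defaultdict

class StructureLibrary:
    def __init__(self):
        # Semi-canonical truth table -> list of (structure, pin-to-output delays).
        self.classes = defaultdict(list)

    def add(self, canon_tt, structure, delays):
        """Try to add a structure; return True if it was stored."""
        entries = self.classes[canon_tt]
        for stored_structure, stored_delays in entries:
            if stored_structure == structure:
                return False  # the structure already exists
            if all(new > old for new, old in zip(delays, stored_delays)):
                return False  # worse on every pin: dominated
        entries.append((structure, tuple(delays)))
        return True

lib = StructureLibrary()
print(lib.add(0x8, "balanced", [1, 1]))  # True: first structure of this class
print(lib.add(0x8, "balanced", [1, 1]))  # False: duplicate
print(lib.add(0x8, "slow", [2, 2]))      # False: dominated by "balanced"
print(lib.add(0x8, "skewed", [0, 2]))    # True: faster on the first pin
```

Keeping several non-dominated structures per class is what allows a later lookup to pick the one best suited to the actual arrival times.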
A case study of LMS: AIG level minimization
Input: And-Inverter Graph
• For each node, in topological order
  • Compute several K-input cuts
  • For each cut
    • Compute the truth table
    • Look it up in the library
    • If there is no structure for this function
      • Mark the cut to ensure it is not selected as the best cut
    • Else if the best structure found leads to a smaller AIG level
      • Save the cut as the best cut
  • If there is an improvement in level, update the AIG
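The per-node selection step can be sketched as follows. This is a simplified model, not the ABC code: each cut is given as a pair of its semi-canonical truth table and the levels of its leaves (assumed to be already in canonical input order), the library maps a truth table to a list of pin-to-output delay profiles, and the level of a candidate is the maximum of leaf level plus pin delay. All names are hypothetical.

```python
def best_cut_for_node(current_level, cuts, library):
    """Return (index of best cut or None, resulting level)."""
    best_index = None
    best_level = current_level
    for index, (canon_tt, leaf_levels) in enumerate(cuts):
        profiles = library.get(canon_tt)
        if not profiles:
            continue  # no structure for this function: cut cannot be selected
        # Level achieved by the best library structure for this cut.
        level = min(
            max(leaf + delay for leaf, delay in zip(leaf_levels, profile))
            for profile in profiles
        )
        if level < best_level:
            best_index, best_level = index, level
    return best_index, best_level

library = {0x8: [(1, 1)], 0xE: [(2, 1), (1, 2)]}
cuts = [(0x8, (2, 3)), (0x6, (1, 1)), (0xE, (0, 1))]
print(best_cut_for_node(5, cuts, library))  # (2, 2)
```

The second cut is skipped because its function has no library entry, and the third cut wins because one of its two stored structures matches the skewed leaf levels.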
Implementation
• The LMS algorithm is implemented in ABC. The LUT mapper “if” in ABC is used as:
  • (a) A cut browser for computing the libraries
  • (b) A mapper in the case study on AIG level minimization
• Commands related to library construction:
  • rec_start: Starts the LMS recorder
  • rec_add: Adds structures from benchmarks
  • rec_filter: Removes the structures with lower frequency
  • rec_merge: Merges two previously computed libraries
  • rec_ps: Prints statistics for the currently loaded library
  • rec_use: Transforms the internal library to the current network in ABC
  • rec_stop: Deletes the current library
• Commands used to perform LMS mapping:
  • if -y -K <num> -C <num>
    • -y enables level optimization by LMS
    • -K <num> is the cut size
    • -C <num> is the number of cuts used at each node
Outline
• Introduction
• Previous Work
• Lazy Man’s Logic Synthesis (LMS)
• Experimental Results
  • Library Coverage
  • 6-input Library
  • Optimize Delay After LUT Mapping
• Conclusion
Library Coverage
• This experiment was performed to show that LMS has practical memory requirements for functions of up to 12 inputs.
• Semi-canonical classes of all functions appearing in the cuts of the benchmark circuits, without synthesis, were collected, and the frequency of their appearance was recorded.
  • ~2 M classes in total
  • ~740 K classes for 90% of functions
  • ~400 MB for truth tables
[Figure: plot with axes “Function #” and “occurrence frequency”]
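A back-of-the-envelope estimate helps relate the class counts to the memory figure. The sketch assumes every class stores one full 12-variable truth table (2^12 bits, that is, 512 bytes); the slide does not say how the truth tables are stored or which class count the ~400 MB refers to, so this is only a plausibility check, and the function names are hypothetical.

```python
def truth_table_bytes(num_vars):
    # One bit per minterm, rounded down to whole bytes (at least one byte).
    return max(1, (1 << num_vars) // 8)

def truth_table_storage_mb(num_classes, num_vars):
    # Storage in megabytes (10**6 bytes), one truth table per class.
    return num_classes * truth_table_bytes(num_vars) / 1e6

print(truth_table_bytes(12))                  # 512
print(truth_table_storage_mb(740_000, 12))    # 378.88
print(truth_table_storage_mb(2_000_000, 12))  # 1024.0
```

Under this assumption, the ~740 K classes that cover 90% of the functions would need roughly 380 MB, which is of the same order as the ~400 MB quoted on the slide.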
Constructing Library for 6-input Functions
• The goal of this experiment is to derive a 6-input library for use in the following case study of AIG level minimization.
• The following ABC scripts are used to collect structures:
  • read file; st; rec_add;
  • dc2; rec_add;
  • if -K 8; bidec; st; rec_add;
  • if -K 8; mfs; st; rec_add;
  • if -K 8; bidec; st; rec_add;
  • if -g -K 6; st; rec_add;
  • if -g -K 6; st; rec_add;
• ~77 MB AIGER file
[Table: Statistics of the precomputed 6-input library]
Optimize Delay After LUT Mapping
• Two sets of benchmarks are used in this paper: 20 MCNC benchmarks and 10 large Altera benchmarks.
• LUT mapping was performed by the following scripts:
  • Map: st; resyn2; if -K 4 or 6
  • MapC: st; resyn2; dch -f; if -K 4 or 6
  • SOPBC: st; if -gm -K 6; st; resyn2; dch -f; if -K 4 or 6
  • LMSC: st; if -ym -K 6; st; resyn2; dch -f; if -K 4 or 6
• Benchmarks were run on a workstation with an Intel Xeon quad-core CPU and 256 GB of RAM (~4 GB used for the experiment).
• The resulting networks were verified by the command cec in ABC.
Mapping results for Altera benchmarks (4-LUTs)
• LMSC reduced delay by 37% with an area increase of 13%
Mapping results for Altera benchmarks (6-LUTs)
• LMSC reduced delay by 26% with an area increase of 13%
Mapping results for MCNC benchmarks
• 4-LUTs: LMSC reduced delay by 10% with an area increase of 3%
• 6-LUTs: LMSC reduced delay by 12% with an area increase of 8%
Conclusion
• A new method to harvest and reuse circuit structures produced by different tools on benchmark circuits
• The “lazy” approach is made practical by
  • A semi-canonical form to reduce the number of equivalence classes
  • Using AIGs to store precomputed libraries in memory and on disk
  • Using truth tables to manipulate Boolean functions
• As a case study, the proposed approach was applied to improve delay after FPGA mapping
• For industrial benchmarks, compared to SOP balancing,
  • the delay was reduced by 17% (18%) for LUT4 (LUT6)
  • the area penalty was 2% (5%)
Future work
• Improving implementation
  • Reducing memory by using a low-memory AIG
  • Building libraries in terms of multi-input gates
  • Filtering libraries based on their performance
  • Giving the user control over the area increase
• Continuing experiments
  • Performing case studies with larger functions
  • Evaluating delay improvements after P&R
Q&A
Authors' e-mail:
• Wenlong Yang: allanwin@hotmail.com
• Lingli Wang: llwang@fudan.edu.cn
• Alan Mishchenko: alanmi@eecs.berkeley.edu