Learn strategies to optimize memory use, reduce runtime, and enhance performance in EDA applications. Topics include network traversal, memory allocation, wave-front traversals, and memory footprint minimization.
Improving Runtime and Memory Requirements in EDA Applications Alan Mishchenko UC Berkeley
Overview • Introduction • Topics • Network traversal • AIG package • SAT solver • BDD package • Memory management • Locality of computation • Conclusion
Network Traversal • Optimizing node memory for DFS traversal • Storing fanins/fanouts in the node • Using traversal IDs • Using wave-front traversals • Minimizing memory footprint
Memory Alloc In Topological Order • Optimize node memory for DFS traversal • Allocate memory from an array in a DFS order • [Figure: example network between primary inputs and primary outputs, with nodes numbered 1-8]
Store Fanins/Fanouts in the Node • Embed the dynamic array into the node • Leads to direct pointing or storing integer IDs of the fanin/fanouts • In rare cases when memory reallocation is needed (<0.1% of nodes), use a new piece of memory to store extended array of fanins/fanouts struct Nwk_Obj_t_ { … int nFanins; // the number of fanins int nFanouts; // the number of fanouts int nFanioAlloc; // the number of allocated fanins/fanouts Nwk_Obj_t ** pFanio; // fanins/fanouts }; pObj = (Nwk_Obj_t *)Aig_MmFlexEntryFetch( sizeof(Nwk_Obj_t) + sizeof(Nwk_Obj_t *) * (nFanins + nFanouts + p->nFanioPlus) ); pObj->pFanio = (Nwk_Obj_t **)((char *)pObj + sizeof(Nwk_Obj_t));
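The embedded-array idea can be sketched in standalone C as follows. This is a minimal illustration, not the ABC implementation: it uses `calloc` and a C99 flexible array member in place of `Aig_MmFlexEntryFetch`, and the names `Node`, `node_alloc`, `node_add_fanin`, and `node_add_fanout` are hypothetical. The `nExtra` argument plays the role of `p->nFanioPlus`.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node Node;
struct Node {
    int   nFanins;      // number of fanins in use
    int   nFanouts;     // number of fanouts in use
    int   nFanioAlloc;  // slots allocated for fanins + fanouts
    Node *pFanio[];     // fanins first, then fanouts, in the same memory block
};

// Allocates the node and its fanin/fanout slots as one contiguous block.
static Node *node_alloc(int nFanins, int nFanouts, int nExtra)
{
    assert(nFanins >= 0 && nFanouts >= 0 && nExtra >= 0);
    int nSlots = nFanins + nFanouts + nExtra;
    Node *p = calloc(1, sizeof(Node) + sizeof(Node *) * (size_t)nSlots);
    if (p) p->nFanioAlloc = nSlots;
    return p;
}

static int node_has_room(const Node *p)
{
    return p->nFanins + p->nFanouts < p->nFanioAlloc;
}

// Appends a fanout. Returns 0 when the spare slots are exhausted, which is
// the rare case where a larger array would have to be allocated elsewhere.
static int node_add_fanout(Node *p, Node *pFanout)
{
    if (!node_has_room(p)) return 0;
    p->pFanio[p->nFanins + p->nFanouts] = pFanout;
    p->nFanouts++;
    return 1;
}

// Appends a fanin, shifting the fanouts up by one slot to keep fanins first.
static int node_add_fanin(Node *p, Node *pFanin)
{
    if (!node_has_room(p)) return 0;
    for (int i = p->nFanins + p->nFanouts; i > p->nFanins; i--)
        p->pFanio[i] = p->pFanio[i - 1];
    p->pFanio[p->nFanins++] = pFanin;
    return 1;
}
```

A caller would create nodes with `node_alloc(nFanins, nFanouts, nExtra)` and fall back to a separately allocated array only when an add function returns 0.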
Traversal ID • Use a specialized integer data-member of the node to remember the number of the last traversal that visited this node void Nwk_ManDfs_rec( Nwk_Man_t * p, Nwk_Obj_t * pObj, Vec_Ptr_t * vNodes ) { if ( Nwk_ObjIsTravIdCurrent(p, pObj) ) return; Nwk_ObjSetTravIdCurrent(p, pObj); Nwk_ManDfs_rec( p, Nwk_ObjFanin0(pObj), vNodes ); Nwk_ManDfs_rec( p, Nwk_ObjFanin1(pObj), vNodes ); Vec_PtrPush( vNodes, pObj ); } Vec_Ptr_t * Nwk_ManDfs( Nwk_Man_t * p ) { Vec_Ptr_t * vNodes; Nwk_Obj_t * pObj; int i; Nwk_ManIncrementTravId( p ); vNodes = Vec_PtrAlloc(); Nwk_ManForEachPo( p, pObj, i ) Nwk_ManDfs_rec( p, pObj, vNodes ); return vNodes; }
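The code above depends on ABC's data structures. A self-contained sketch of the same technique, with hypothetical names (`Net`, `net_add`, `net_dfs`) and integer node indices instead of pointers, might look like this. The point is that incrementing one counter invalidates all marks from earlier traversals without touching the nodes.

```c
#include <assert.h>

#define MAX_NODES 16

typedef struct {
    int fanin0, fanin1;  // fanin indices, or -1 for a primary input
    int travId;          // number of the last traversal that visited this node
} Node;

typedef struct {
    Node nodes[MAX_NODES];
    int  nNodes;
    int  travIdCur;      // number of the current traversal
} Net;

// Adds a node with the given fanins and returns its index.
static int net_add(Net *p, int f0, int f1)
{
    assert(p->nNodes < MAX_NODES);
    p->nodes[p->nNodes] = (Node){ f0, f1, 0 };
    return p->nNodes++;
}

static void dfs_rec(Net *p, int id, int *order, int *nOrder)
{
    if (id < 0 || p->nodes[id].travId == p->travIdCur)
        return;                               // no node, or already visited
    p->nodes[id].travId = p->travIdCur;       // mark as visited
    dfs_rec(p, p->nodes[id].fanin0, order, nOrder);
    dfs_rec(p, p->nodes[id].fanin1, order, nOrder);
    order[(*nOrder)++] = id;                  // fanins come before the node
}

// Collects the cone of `root` in topological order and returns its size.
static int net_dfs(Net *p, int root, int *order)
{
    int n = 0;
    p->travIdCur++;       // all marks from earlier traversals become stale
    dfs_rec(p, root, order, &n);
    return n;
}
```

Calling `net_dfs` twice in a row needs no clean-up pass between the calls, which is the main saving over per-node boolean flags.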
Wave-Front Traversals • Some applications use additional memory at each node • Examples: Simulation, cut enumeration, support computation • 1 KB per node for 1M nodes = 1 GB of additional memory! • Case study: Computing input supports of each output of the network • Used, for example, to compute (a) output partitioning, (b) register dependency matrix (A. Dasdan et al, “An experimental study of minimum mean cycle algorithms”, 1998) • Code: procedure Aig_ManSupports() in file “abc\src\aig\aig\aigPart.c” • [Figure: wave-fronts at several stages of the traversal] • At any time during traversal, a wave-front is the set of nodes such that all fanins are already visited and at least one fanout is not yet visited • Additional memory is only needed for the nodes on the wave-front • For most industrial designs, the wave-front is about 1% of all nodes (1 GB → 10 MB)
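A small sketch of a wave-front support computation is given below. It is not the code of `Aig_ManSupports()`; the names (`Node`, `wavefront_supports`, `release`) are hypothetical, supports are limited to 64 primary inputs so that each fits in one machine word, and nodes are assumed to be stored in topological order with two fanins each. The per-node support is freed as soon as the last fanout has consumed it, so only wave-front nodes hold memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_NODES 64

typedef struct {
    int fanin0, fanin1;  // fanin indices, or -1/-1 for a primary input
    int nFanouts;        // number of fanout edges; 0 marks an output
} Node;

// Drops one pending fanout of node f and frees its support when none remain.
static void release(uint64_t **supp, int *pending, int f, int *live)
{
    if (f >= 0 && --pending[f] == 0) {
        free(supp[f]);
        supp[f] = NULL;
        (*live)--;
    }
}

// outSupp[i] receives the support of every output i as a bit mask over the
// primary inputs. Returns the peak number of supports alive at one time.
static int wavefront_supports(const Node *nodes, int nNodes, uint64_t *outSupp)
{
    uint64_t *supp[MAX_NODES] = {0};
    int pending[MAX_NODES];
    int live = 0, peak = 0, nPis = 0;
    assert(nNodes <= MAX_NODES);
    for (int i = 0; i < nNodes; i++) {
        const Node *n = &nodes[i];
        pending[i] = n->nFanouts;
        outSupp[i] = 0;
        supp[i] = malloc(sizeof(uint64_t));   // memory for a wave-front node
        assert(supp[i] != NULL);
        if (++live > peak) peak = live;
        if (n->fanin0 < 0) {                  // primary input: one-bit support
            assert(nPis < 64);
            *supp[i] = (uint64_t)1 << nPis++;
        } else {                              // internal node: union of fanins
            assert(n->fanin1 >= 0);
            *supp[i] = *supp[n->fanin0] | *supp[n->fanin1];
            release(supp, pending, n->fanin0, &live);
            release(supp, pending, n->fanin1, &live);
        }
        if (n->nFanouts == 0) {               // output: save result, free memory
            outSupp[i] = *supp[i];
            free(supp[i]);
            supp[i] = NULL;
            live--;
        }
    }
    return peak;
}
```

In this toy version the saving is a few words; in the setting of the slide the same bookkeeping decides whether a kilobyte-sized support is kept or released.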
Minimizing Memory Footprint • When repeatedly traversing a large network, runtime is determined by memory pumped through the CPU (pointer chasing) • Examples when repeated traversal cannot be avoided • Sequential simulation of a network for many cycles • Computing maximum-network flow during retiming, etc • In such applications, it is better to develop a specialized, static, low-memory representation of the network • Reducing memory 2x may improve runtime 3-5x • Example: Most-forward retiming (code in “abc\src\aig\aig\aigRet.c”) • If repeated topological and reverse topological traversals are performed, it may be better to have two networks, each having memory allocated to facilitate each traversal order
Implementation of AIG Package • Fixed amount of memory for each AIG node • Arbitrary fanout also uses fixed amount of memory per node! • Different memory configurations • Structural hashing • The only potentially non-cache-friendly operation • Tricks to speed up structural hashing • AIGER: Compact binary AIG representation format • Work of Armin Biere (Johannes Kepler University, Linz, Austria) • Available at http://fmv.jku.at/aiger
AIG Node • ABC has several AIG packages • A low-memory package is used for simulation and equivalence checking • A more elaborate package is used for general AIG manipulation • Observation: It is better to store node fanins as integer IDs rather than pointers • Low-memory node, 12 bytes (32b) / 12 bytes (64b): struct Gia_Obj_t_ { unsigned iDiff0 : 29; // the diff of the first fanin unsigned fCompl0: 1; // the complemented attribute unsigned fMark0 : 1; // first user-controlled mark unsigned fTerm : 1; // terminal node (CI/CO) unsigned iDiff1 : 29; // the diff of the second fanin unsigned fCompl1: 1; // the complemented attribute unsigned fMark1 : 1; // second user-controlled mark unsigned fPhase : 1; // value under 000 pattern unsigned Value; // application-specific value }; • General-purpose node, 36 bytes (32b) / 56 bytes (64b): struct Aig_Obj_t_ { Aig_Obj_t * pNext; // strashing table Aig_Obj_t * pFanin0; // fanin Aig_Obj_t * pFanin1; // fanin Aig_Obj_t * pHaig; // pointer to the HAIG node unsigned int Type : 3; // object type unsigned int fPhase : 1; // value under 00...0 pattern unsigned int fMarkA : 1; // multipurpose mask unsigned int fMarkB : 1; // multipurpose mask unsigned int nRefs : 26; // reference count unsigned Level : 24; // the topological level unsigned nCuts : 8; // the number of cuts int Id; // unique ID int TravId; // ID of the last traversal union { // temporary storage void * pData; int iData; float fData; }; };
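The following sketch shows how fanins stored as integer differences can be resolved when all nodes live in one array. The struct mirrors the fields of `Gia_Obj_t_` above, but the type name `Obj` and the helper functions are hypothetical; fanin literals are assumed to be encoded as `2 * index + complement`.

```c
#include <assert.h>

typedef struct {
    unsigned iDiff0  : 29;  // own index minus index of the first fanin
    unsigned fCompl0 :  1;  // first fanin is complemented
    unsigned fMark0  :  1;
    unsigned fTerm   :  1;
    unsigned iDiff1  : 29;  // own index minus index of the second fanin
    unsigned fCompl1 :  1;  // second fanin is complemented
    unsigned fMark1  :  1;
    unsigned fPhase  :  1;
    unsigned Value;         // application-specific value
} Obj;

// The fanin is found by stepping back in the array; no pointer is stored,
// so the node has the same size on 32-bit and 64-bit machines.
static Obj *obj_fanin0(Obj *p) { return p - p->iDiff0; }
static Obj *obj_fanin1(Obj *p) { return p - p->iDiff1; }

// Sets both fanins of node `id` from literals (2 * index + complement).
static void obj_set_fanins(Obj *objs, int id, int lit0, int lit1)
{
    int f0 = lit0 >> 1, f1 = lit1 >> 1;
    assert(f0 >= 0 && f0 < id && f1 >= 0 && f1 < id);  // fanins precede the node
    objs[id].iDiff0  = (unsigned)(id - f0);
    objs[id].fCompl0 = (unsigned)(lit0 & 1);
    objs[id].iDiff1  = (unsigned)(id - f1);
    objs[id].fCompl1 = (unsigned)(lit1 & 1);
}
```

Because only differences are stored, the whole array can be copied, written to disk, or reallocated without fixing up any fanin fields.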
Fixed-Memory Fanout for AIGs • Solution (due to Satrajit Chatterjee): • Use 5 pointers (integers) for each node • One pointer (integer) contains the first fanout of the node • The other four pointers (integers) are used to create two doubly linked lists • Each list stores the fanout representation of the corresponding fanin • Doubly linked lists allow for constant-time addition/removal of node fanouts • Code in file “abc\src\aig\aig\aigFanout.c” • [Figure: a node with its first-fanout pointer and two pairs of list pointers, one pair linking it into the fanouts of its first fanin and one into the fanouts of its second fanin; list ends are NULL]
Structural Hashing • The only potentially non-cache-friendly AIG operation • Structural hashing is very valuable, but the hash-table look-up it requires cannot be avoided • The standard hash-table is used, with nodes having the same hash key being linked into singly linked lists • The pointer to the next node is embedded in the AIG node • A linear-probing hash-table was tried, without improvement • Trick to sometimes avoid hash-table look-up • When building a new node, do not look it up in the table if at least one of its fanins has reference counter 0
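A compact sketch of structural hashing with the next-link embedded in the node, including the reference-counter trick, is shown below. All names (`Aig`, `And`, `aig_and`, the fixed table size, the hash multipliers) are assumptions made for illustration; constant propagation and table resizing are left out. The trick works because a fanin with no references has no fanouts, so no existing node can have it as a fanin.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 1024   // hypothetical fixed size; a real package resizes
#define MAX_NODES  256

typedef struct {
    int lit0, lit1;  // fanin literals (2 * index + complement), lit0 <= lit1
    int next;        // next node in the same hash bucket, 0 = end of list
    int nRefs;       // number of fanouts
} And;

typedef struct {
    And nodes[MAX_NODES];   // index 0 is reserved so that 0 can mean "none"
    int nNodes;
    int table[TABLE_SIZE];  // bucket heads, 0 = empty
    int nLookups;           // number of hash-table look-ups performed
} Aig;

static void aig_init(Aig *p) { memset(p, 0, sizeof *p); p->nNodes = 1; }

// Creates a primary input (never entered into the hash table).
static int aig_input(Aig *p)
{
    assert(p->nNodes < MAX_NODES);
    return 2 * p->nNodes++;
}

static unsigned hash_key(int lit0, int lit1)
{
    return ((unsigned)lit0 * 7937u + (unsigned)lit1 * 2971u) % TABLE_SIZE;
}

// Returns the literal of AND(lit0, lit1), reusing an existing node if any.
static int aig_and(Aig *p, int lit0, int lit1)
{
    if (lit0 > lit1) { int t = lit0; lit0 = lit1; lit1 = t; }
    unsigned key = hash_key(lit0, lit1);
    // Look-up is needed only if both fanins already have fanouts.
    if (p->nodes[lit0 >> 1].nRefs > 0 && p->nodes[lit1 >> 1].nRefs > 0) {
        p->nLookups++;
        for (int i = p->table[key]; i; i = p->nodes[i].next)
            if (p->nodes[i].lit0 == lit0 && p->nodes[i].lit1 == lit1)
                return 2 * i;
    }
    assert(p->nNodes < MAX_NODES);
    int id = p->nNodes++;
    p->nodes[id] = (And){ lit0, lit1, p->table[key], 0 };
    p->table[key] = id;                  // new node becomes the bucket head
    p->nodes[lit0 >> 1].nRefs++;
    p->nodes[lit1 >> 1].nRefs++;
    return 2 * id;
}
```

The `nLookups` counter exists only to make the effect of the trick observable; it is not part of the technique.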
AIGER • Uses ~3 bytes per AIG node, on average • 1M node AIG can be written into a 3 MB file • ~12x more compact than Verilog, BLIF, or BENCH • ~5x faster reading/writing for large files • Key observations used by AIGER • To represent a node, two integers (fanin literals) need to be represented • The fanin literals are often numerically close • It is enough to store the differences between literals, which typically take only one byte each
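The delta idea can be sketched as follows, modeled on the binary AIGER format as I understand it: for an AND node with literals lhs > rhs0 >= rhs1, the two differences lhs - rhs0 and rhs0 - rhs1 are written using 7 data bits per byte, with the high bit marking that more bytes follow. The function names are hypothetical, and file headers and I/O are omitted.

```c
#include <assert.h>
#include <stddef.h>

// Writes x with 7 bits per byte, least significant group first.
// The high bit is set on every byte except the last. Returns bytes written.
static size_t encode_uint(unsigned x, unsigned char *buf)
{
    size_t n = 0;
    while (x & ~0x7fu) {
        buf[n++] = (unsigned char)((x & 0x7f) | 0x80);
        x >>= 7;
    }
    buf[n++] = (unsigned char)x;
    return n;
}

// Reads one value written by encode_uint. Returns bytes consumed.
static size_t decode_uint(const unsigned char *buf, unsigned *x)
{
    size_t n = 0;
    unsigned shift = 0;
    unsigned char ch;
    *x = 0;
    do {
        ch = buf[n++];
        *x |= (unsigned)(ch & 0x7f) << shift;
        shift += 7;
    } while (ch & 0x80);
    return n;
}

// Encodes one AND node as two deltas. Returns bytes written.
static size_t encode_and(unsigned lhs, unsigned rhs0, unsigned rhs1,
                         unsigned char *buf)
{
    assert(lhs > rhs0 && rhs0 >= rhs1);
    size_t n = encode_uint(lhs - rhs0, buf);
    n += encode_uint(rhs0 - rhs1, buf + n);
    return n;
}

// Recovers both fanin literals of the node whose literal is lhs.
static size_t decode_and(unsigned lhs, const unsigned char *buf,
                         unsigned *rhs0, unsigned *rhs1)
{
    unsigned d0, d1;
    size_t n = decode_uint(buf, &d0);
    n += decode_uint(buf + n, &d1);
    *rhs0 = lhs - d0;
    *rhs1 = *rhs0 - d1;
    return n;
}
```

When the fanins are close to the node, both deltas are below 128 and the node takes two bytes; distant fanins cost more, which is consistent with the average of about 3 bytes quoted above.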
SAT Solver • A modern SAT solver (in particular, MiniSAT) is a treasure-trove of tricks for efficient implementation • To mention just a few • Representing clauses as arrays of integers • Using signatures to check clause containment • Using two-literal watching scheme • etc
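As an illustration of the signature trick, the sketch below keeps a 32-bit signature per clause, with one bit set for each literal (taken modulo 32). If clause C has a signature bit that clause D lacks, C cannot be contained in D, and the literal-by-literal comparison is skipped. This is a generic version of the idea, not MiniSAT's code; the names and the counter `nFullChecks` are hypothetical, and literals are assumed to be non-negative integers.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const int *lits;   // literals as non-negative integers
    int        nLits;
    uint32_t   sig;    // bit (lit mod 32) is set for every literal
} Clause;

static Clause clause_make(const int *lits, int nLits)
{
    Clause c = { lits, nLits, 0 };
    for (int i = 0; i < nLits; i++) {
        assert(lits[i] >= 0);
        c.sig |= (uint32_t)1 << (lits[i] & 31);
    }
    return c;
}

// Returns 1 if every literal of c also occurs in d.
// nFullChecks counts how often the quadratic comparison was really needed.
static int clause_contained(const Clause *c, const Clause *d, int *nFullChecks)
{
    if (c->nLits > d->nLits || (c->sig & ~d->sig) != 0)
        return 0;                         // rejected by the cheap filter
    (*nFullChecks)++;
    for (int i = 0; i < c->nLits; i++) {
        int found = 0;
        for (int j = 0; j < d->nLits && !found; j++)
            found = (c->lits[i] == d->lits[j]);
        if (!found)
            return 0;
    }
    return 1;
}
```

The filter can produce false positives (two different literals may share a bit) but never false negatives, so the full comparison is still required when the filter passes.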
SAT Solver (What’s Missing?) • Most of the modern SAT solvers are geared to solving hard problems, such as those encountered in SAT competitions (1 problem ~ 15 min) • This motivates elaborate data-structures and high memory usage • 64 bytes per variable; 16 bytes per clause; 4 bytes per literal • In ABC, runtime of several applications is dominated by SAT • SAT sweeping • Sequential SAT sweeping (register/signal correspondence) • Accumulation of structural choices • Computing don’t-cares in a window • The SAT problems solved in these applications have much in common • Incremental (each problem has +/- 10 AIG nodes, compared to the previous problem solved) • Relatively easy (less than 100 conflicts) • Numerous (10K-100K problems) • Based on these observations, a new efficient circuit-based SAT solver was developed (abc\src\aig\gia\giaCSat.c)
Experimental Results (CEC) CEC results for 8 hard industrial instances. Runtime in minutes on Intel Q9450 @ 2.66 GHz. Time1 is “cec” in ABC809xx. Time2 is “&cec” in abc90329. Timeout is 1 hour. Less than 100 MB of RAM was used in these experiments.
Why Is MiniSAT Slower? • Requires multiple intermediate steps • Window → AIG → CNF → Solving • Instead of Window → Solving • Uses too much memory • Solver + CNF = 140 bytes / AIG node • Instead of 8-16 bytes / AIG node • Decision heuristics • Are not aware of the circuit structure • Instead of using circuit information
BDD Package • Similar to a SAT solver, a modern BDD package is a well-researched computation engine, which performs • Boolean function manipulation • Garbage collection • Dynamic variable reordering • etc • The usefulness of BDD packages has been limited since the arrival of AIGs (2000) and efficient SAT solvers (2001) • However, some applications still rely on BDDs (for example, exact reachability analysis) • This motivates building a better BDD package
BDD Package (What’s Missing?) • How can a modern BDD package be improved? • Make it pointer-independent (!) • Leads to reproducible results across different runs / platforms • Improve CPU cache behavior by using 8 bytes per node • Present packages use 16 or 32 bytes (on a 32- or 64-bit computer) • Improve dynamic variable reordering • Currently, it is very slow (~1M BDD nodes takes ~5 min) • Apply variable reordering more frequently • Rather than waiting for the BDD to grow large and then running a slow reordering • These and other ideas are currently being implemented
Minimalistic BDD Data-Structure • Node representation • Node storage (8 bytes per node) • Next pointers (4 bytes per node) • Unique table (4 bytes per node) • Computed table (16 bytes per entry) • External referencing • Two relatively small arrays of integers • Variable / Level mapping • Two relatively small arrays of integers • Dynamic variable reordering • Temporary storage for nodes (8 bytes per node) • Temporary reference counters (4 bytes per node) • Temporary marks (1 bit per node)
BDD Node Representation struct Bdd_Node_t // 64 bits = 8 bytes { unsigned f0 : 24; // negative cofactor unsigned c0 : 1; // complemented attribute of negative cofactor unsigned f1 : 24; // positive cofactor unsigned lev : 15; // level }; • This node structure is optimized for frequent traversals • Allows for building BDDs with ~32K variables and ~16M nodes
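One caveat with the bit-field version: the layout of bit-fields is implementation-defined in C, and a compiler may insert padding when a field would straddle a 32-bit unit, in which case the struct is larger than 8 bytes. A sketch that guarantees the 8-byte size packs the same four fields into a `uint64_t` with explicit shifts. The bit positions and function names below are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Assumed layout: bits 0-23 f0, bit 24 c0, bits 25-48 f1, bits 49-63 lev.
typedef uint64_t BddNode;

static BddNode bdd_pack(unsigned f0, unsigned c0, unsigned f1, unsigned lev)
{
    assert(f0 < (1u << 24) && c0 < 2 && f1 < (1u << 24) && lev < (1u << 15));
    return  (uint64_t)f0
         | ((uint64_t)c0  << 24)
         | ((uint64_t)f1  << 25)
         | ((uint64_t)lev << 49);
}

static unsigned bdd_f0(BddNode n)  { return (unsigned)(n & 0xFFFFFF); }          // negative cofactor
static unsigned bdd_c0(BddNode n)  { return (unsigned)((n >> 24) & 1); }         // its complement flag
static unsigned bdd_f1(BddNode n)  { return (unsigned)((n >> 25) & 0xFFFFFF); }  // positive cofactor
static unsigned bdd_lev(BddNode n) { return (unsigned)(n >> 49); }               // level
```

The field widths give the limits stated on the slide: 2^24 (about 16M) nodes and 2^15 (about 32K) levels.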
Custom Memory Management • Three types of memory managers in ABC • Fixed-size • Allocates/recycles entries of a fixed size • Used for AIG nodes • Flexible-size • Allocates (but does not recycle) entries of variable size • Used for signal names • Step-size • Steps are powers of 2 (4-8-16-32-etc) in bytes • Used for CNF clauses in the customized version of MiniSat • Code in package “abc\src\aig\mem”
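A fixed-size manager of the first kind can be sketched in a few dozen lines. This is a generic free-list allocator written for illustration, not the code in “abc\src\aig\mem”; the names `MemFixed`, `mem_start`, `mem_fetch`, `mem_recycle`, and `mem_stop` are hypothetical. Memory is obtained in large chunks, and recycled entries are linked through their own first word, so recycling costs no extra memory.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    size_t  entrySize;     // bytes per entry, a multiple of sizeof(void *)
    size_t  chunkEntries;  // entries obtained from malloc at a time
    void   *freeList;      // recycled entries, linked through their first word
    void  **chunks;        // all chunks, kept so they can be freed at the end
    int     nChunks;
} MemFixed;

static MemFixed *mem_start(size_t entrySize, size_t chunkEntries)
{
    assert(entrySize > 0 && chunkEntries > 0);
    MemFixed *p = calloc(1, sizeof *p);
    if (!p) return NULL;
    // Round up so every entry can hold (and is aligned for) a pointer.
    p->entrySize = (entrySize + sizeof(void *) - 1) / sizeof(void *) * sizeof(void *);
    p->chunkEntries = chunkEntries;
    return p;
}

static void mem_recycle(MemFixed *p, void *entry)
{
    *(void **)entry = p->freeList;
    p->freeList = entry;
}

static void *mem_fetch(MemFixed *p)
{
    if (!p->freeList) {                       // out of entries: add a chunk
        void **grown = realloc(p->chunks, sizeof(void *) * (size_t)(p->nChunks + 1));
        if (!grown) return NULL;
        p->chunks = grown;
        char *chunk = malloc(p->entrySize * p->chunkEntries);
        if (!chunk) return NULL;
        p->chunks[p->nChunks++] = chunk;
        for (size_t i = 0; i < p->chunkEntries; i++)
            mem_recycle(p, chunk + i * p->entrySize);
    }
    void *entry = p->freeList;
    p->freeList = *(void **)entry;
    return entry;
}

static void mem_stop(MemFixed *p)
{
    for (int i = 0; i < p->nChunks; i++)
        free(p->chunks[i]);
    free(p->chunks);
    free(p);
}
```

Typical use is one manager per object type: `mem_start(sizeof(node), 1000)` once, `mem_fetch` and `mem_recycle` during the computation, and a single `mem_stop` to release everything.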
Locality of Computation • To improve speed • Use less memory • Make transformations local • Use contiguous data-structures • Case study: BDDs vs. truth tables (TTs) • In the past: “BDDs are present-day truth tables” • These days: “Truth tables are present-day BDDs” • Advantages of TTs • Computation is more local • Memory usage is predictable • For functions up to 16 vars, TTs lead to faster computation • ISOP, DSD, matching, decomposition, etc • Limitations of TTs • Do not work for more than 16 variables • Some operations are faster using BDDs, even for functions with 10 variables • E.g. cofactor satisfy counting
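To make the locality argument concrete, here is a sketch of truth-table operations for functions of up to 6 variables, where the whole table fits in one 64-bit word. The function and constant names are hypothetical. Boolean operations become single word instructions, and cofactoring is a mask plus a shift.

```c
#include <assert.h>
#include <stdint.h>

// Truth tables of the six elementary variables x0..x5.
static const uint64_t VAR_TT[6] = {
    0xAAAAAAAAAAAAAAAAull, 0xCCCCCCCCCCCCCCCCull, 0xF0F0F0F0F0F0F0F0ull,
    0xFF00FF00FF00FF00ull, 0xFFFF0000FFFF0000ull, 0xFFFFFFFF00000000ull
};

// Cofactor with respect to x_v = 1: keep the minterms where x_v = 1
// and copy them onto the positions where x_v = 0.
static uint64_t tt_cofactor1(uint64_t t, int v)
{
    assert(v >= 0 && v < 6);
    uint64_t hi = t & VAR_TT[v];
    return hi | (hi >> (1 << v));
}

// Cofactor with respect to x_v = 0.
static uint64_t tt_cofactor0(uint64_t t, int v)
{
    assert(v >= 0 && v < 6);
    uint64_t lo = t & ~VAR_TT[v];
    return lo | (lo << (1 << v));
}

// The function depends on x_v exactly when its two cofactors differ.
static int tt_has_var(uint64_t t, int v)
{
    return tt_cofactor0(t, v) != tt_cofactor1(t, v);
}

// Number of satisfying assignments over all six variables.
static int tt_count_ones(uint64_t t)
{
    int n = 0;
    for (; t; t &= t - 1)   // clears the lowest set bit each iteration
        n++;
    return n;
}
```

For example, the function x0·x1 + x2 is built as `(VAR_TT[0] & VAR_TT[1]) | VAR_TT[2]`. Larger functions use arrays of such words, which keeps the data contiguous.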
Conclusion • Lessons learned while developing ABC • Topics considered • Network traversal • AIG representation • SAT solving • Memory management • Locality of computation • Locality of computation is important • Allows for efficient control of the resources • Leads to scalability and parallelism