Explore strategies to enhance runtime and memory efficiency in EDA applications, focusing on network traversal, memory management, and computation optimization.
Improving Runtime and Memory Requirements in EDA Applications • Alan Mishchenko, UC Berkeley
Overview • Introduction • Topics • Network traversal • AIG package • SAT solving • Memory management • Locality of computation • Conclusion
Network Traversal • Optimizing node memory for DFS traversal • Storing fanins/fanouts in the node • Using traversal IDs • Using wave-front traversals • Minimizing memory footprint
Memory Alloc In Topological Order • Optimize node memory for DFS traversal • Allocate memory from an array in a DFS order (see the sketch below)
[Figure: an example network with nodes numbered 1-8 in DFS order, from the primary inputs to the primary outputs]
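A minimal sketch of this idea in C (hypothetical Node type and helper names, not the ABC code): the DFS places each node's fanins in a contiguous arena before the node itself, so a later topological sweep reads memory sequentially instead of chasing scattered pointers.

#define MAX_NODES (1 << 20)

typedef struct Node Node;
struct Node
{
    Node * pFanin0, * pFanin1;   // two fanins (NULL for primary inputs)
    Node * pCopy;                // corresponding node in the new arena
};

static Node g_Arena[MAX_NODES];  // contiguous storage, filled in DFS order
static int  g_nUsed = 0;

Node * DfsCopy_rec( Node * pObj )
{
    Node * pFan0, * pFan1;
    if ( pObj == NULL )
        return NULL;
    if ( pObj->pCopy )                  // already placed in the arena
        return pObj->pCopy;
    // place the fanins first, so they get lower addresses (topological order)
    pFan0 = DfsCopy_rec( pObj->pFanin0 );
    pFan1 = DfsCopy_rec( pObj->pFanin1 );
    // claim the next arena slot for this node
    pObj->pCopy = &g_Arena[g_nUsed++];
    pObj->pCopy->pFanin0 = pFan0;
    pObj->pCopy->pFanin1 = pFan1;
    pObj->pCopy->pCopy   = NULL;
    return pObj->pCopy;
}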
Store Fanins/Fanouts in the Node • Embed the dynamic array into the node • Fanins/fanouts can then be stored as direct pointers or as integer IDs • In rare cases when memory reallocation is needed (<0.1% of nodes), use a new piece of memory to store the extended array of fanins/fanouts

struct Nwk_Obj_t_
{
    ...
    int            nFanins;      // the number of fanins
    int            nFanouts;     // the number of fanouts
    int            nFanioAlloc;  // the number of allocated fanins/fanouts
    Nwk_Obj_t **   pFanio;       // fanins/fanouts
};

// allocate the node and its fanin/fanout array in one contiguous block
pObj = (Nwk_Obj_t *)Aig_MmFlexEntryFetch( p->pMemObjs, sizeof(Nwk_Obj_t) +
    sizeof(Nwk_Obj_t *) * (nFanins + nFanouts + p->nFanioPlus) );
// the embedded array starts right after the node fields
pObj->pFanio = (Nwk_Obj_t **)((char *)pObj + sizeof(Nwk_Obj_t));
Traversal ID • Use a specialized integer data-member of the node to remember the number of the last traversal that visited this node

void Nwk_ManDfs_rec( Nwk_Man_t * p, Nwk_Obj_t * pObj, Vec_Ptr_t * vNodes )
{
    // skip the node if it was already visited in this traversal
    if ( Nwk_ObjIsTravIdCurrent(p, pObj) )
        return;
    Nwk_ObjSetTravIdCurrent(p, pObj);
    // visit the fanins first (simplified: assumes two fanins per object)
    Nwk_ManDfs_rec( p, Nwk_ObjFanin0(pObj), vNodes );
    Nwk_ManDfs_rec( p, Nwk_ObjFanin1(pObj), vNodes );
    Vec_PtrPush( vNodes, pObj );
}

Vec_Ptr_t * Nwk_ManDfs( Nwk_Man_t * p )
{
    Vec_Ptr_t * vNodes;
    Nwk_Obj_t * pObj;
    int i;
    // starting a new traversal invalidates all previously set traversal IDs
    Nwk_ManIncrementTravId( p );
    vNodes = Vec_PtrAlloc( 100 );
    Nwk_ManForEachPo( p, pObj, i )
        Nwk_ManDfs_rec( p, pObj, vNodes );
    return vNodes;
}
Wave-Front Traversals • Some applications use additional memory at each node • Examples: simulation, cut enumeration, support computation • 1K per node for 1M nodes = 1Gb of additional memory! • Case study: computing the input support of each output of the network • Used, for example, to compute (a) output partitioning, (b) the register dependency matrix (A. Dasdan et al, “An experimental study of minimum mean cycle algorithms”, 1998) • Code: procedure Aig_ManSupports() in file “abc\src\aig\aig\aigPart.c”
[Figure: three snapshots of the wave-front sweeping across the network]
At any time during traversal, the wave-front is the set of nodes whose fanins have all been visited and at least one of whose fanouts has not. Additional memory is needed only for the nodes on the wave-front. For most industrial designs, the wave-front is about 1% of all nodes (1Gb → 10Mb). A sketch of the idea follows.
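A minimal sketch of the wave-front idea (hypothetical Obj type and field names, not the code in aigPart.c): per-node payloads are allocated when a node is processed and recycled as soon as its last fanout has consumed them, so only the wave-front carries the extra memory.

#include <stdlib.h>

typedef struct Obj Obj;
struct Obj
{
    Obj *  pFanin0, * pFanin1;   // two fanins (NULL for primary inputs)
    int    nFanoutsLeft;         // fanouts not yet processed
    void * pData;                // payload, present only on the wave-front
};

// called for each node in topological order
void ProcessNode( Obj * pObj, size_t nDataSize )
{
    pObj->pData = malloc( nDataSize );   // the node enters the wave-front
    // ... compute pObj->pData from the fanins' payloads ...

    // release a fanin's payload once its last fanout (this node) is done
    if ( pObj->pFanin0 && --pObj->pFanin0->nFanoutsLeft == 0 )
    {
        free( pObj->pFanin0->pData );    // the fanin leaves the wave-front
        pObj->pFanin0->pData = NULL;
    }
    if ( pObj->pFanin1 && --pObj->pFanin1->nFanoutsLeft == 0 )
    {
        free( pObj->pFanin1->pData );
        pObj->pFanin1->pData = NULL;
    }
}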
Minimizing Memory Footprint • When repeatedly traversing a large network, runtime is determined by the amount of memory pumped through the CPU (pointer chasing) • Examples where repeated traversal cannot be avoided • Sequential simulation of a network for many cycles • Computing maximum network flow during retiming, etc. • In such applications, it is better to develop a specialized, static, low-memory representation of the network (a sketch follows) • Reducing memory 2x may improve runtime 3-5x • Example: most-forward retiming (code in “abc\src\aig\aig\aigRet.c”) • If repeated topological and reverse topological traversals are performed, it may be better to have two networks, each with memory allocated to facilitate its own traversal order
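One way to build such a static representation (an illustrative sketch, not the layout used in aigRet.c) is a CSR-style encoding: all fanin IDs are packed into one flat integer array with per-node offsets, so a topological sweep amounts to two sequential scans.

// Sketch of a static, low-memory network representation (CSR-style):
// fanins of all nodes are packed into one flat array of integer IDs.
typedef struct
{
    int   nObjs;      // number of nodes
    int * pOffsets;   // nObjs+1 entries; fanins of node i are
    int * pFanins;    // pFanins[ pOffsets[i] .. pOffsets[i+1]-1 ]
} StaticNtk;

// visit every node and its fanins with purely sequential memory access
void ForEachFanin( StaticNtk * p, void (*pFunc)(int iObj, int iFanin) )
{
    int i, k;
    for ( i = 0; i < p->nObjs; i++ )
        for ( k = p->pOffsets[i]; k < p->pOffsets[i+1]; k++ )
            pFunc( i, p->pFanins[k] );
}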
Implementation of AIG Package • Fixed amount of memory for each AIG node • Arbitrary fanout also uses fixed amount of memory per node! • Different memory configurations • Structural hashing • The only potentially non-cache-friendly operation • Tricks to speed up structural hashing • AIGER: Compact binary AIG representation format • Work of Armin Biere (Johannes Kepler University, Linz, Austria) • Available at http://fmv.jku.at/aiger
AIG Node • ABC has several AIG packages • A low-memory package is used to represent local functions after mapping • A more elaborate package is used for general AIG manipulation

Low-memory node: 24 bytes (32-bit) / 40 bytes (64-bit)
struct Hop_Obj_t_
{
    Hop_Obj_t *   pNext;          // strashing table
    Hop_Obj_t *   pFanin0;        // fanin
    Hop_Obj_t *   pFanin1;        // fanin
    void *        pData;          // misc
    unsigned int  Type    :  3;   // object type
    unsigned int  fPhase  :  1;   // value under 00...0 pattern
    unsigned int  fMarkA  :  1;   // multipurpose mask
    unsigned int  fMarkB  :  1;   // multipurpose mask
    unsigned int  nRefs   : 26;   // reference counter
    int           Id;             // unique ID
};

General node: 36 bytes (32-bit) / 56 bytes (64-bit)
struct Aig_Obj_t_
{
    Aig_Obj_t *   pNext;          // strashing table
    Aig_Obj_t *   pFanin0;        // fanin
    Aig_Obj_t *   pFanin1;        // fanin
    Aig_Obj_t *   pHaig;          // pointer to the HAIG node
    unsigned int  Type    :  3;   // object type
    unsigned int  fPhase  :  1;   // value under 00...0 pattern
    unsigned int  fMarkA  :  1;   // multipurpose mask
    unsigned int  fMarkB  :  1;   // multipurpose mask
    unsigned int  nRefs   : 26;   // reference counter
    unsigned      Level   : 24;   // the topological level
    unsigned      nCuts   :  8;   // the number of cuts
    int           Id;             // unique ID
    int           TravId;         // ID of the last traversal
    union {                       // temporary storage
        void *    pData;
        int       iData;
        float     fData;
    };
};

Open question: How to store fanins of the node, as pointers or as integer IDs?
Fixed-Memory Fanout for AIGs • Solution (due to Satrajit Chatterjee): • Use 5 pointers (integers) for each node • One pointer (integer) contains the first fanout of the node • The other pointers (integers) are used to create two doubly-linked lists • Each list stores the fanout representation of the corresponding fanin • Doubly-linked lists allow for constant-time addition/removal of node fanouts • Code in file “abc\src\aig\aig\aigFanout.c”
[Figure: a node with its first-fanout pointer and two NULL-terminated doubly-linked lists threading it into the fanout lists of its two fanins]
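A sketch of this representation with integer IDs (field names are illustrative, not the ones in aigFanout.c); storing IDs instead of pointers also makes the structure position-independent.

// Sketch: every node stores the ID of its first fanout plus prev/next links
// that thread the node into the fanout list of each of its two fanins.
// All fanout bookkeeping therefore takes a fixed five integers per node,
// and list updates are constant-time.
typedef struct
{
    int iFanin0, iFanin1;     // IDs of the two fanins
    int iFirstFanout;         // ID of the first fanout of this node
    int iPrev0, iNext0;       // links in the fanout list of fanin 0
    int iPrev1, iNext1;       // links in the fanout list of fanin 1
} FanoutObj;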
Structural Hashing • The only potentially non-cache-friendly AIG operation • Structural hashing is very valuable, but the hashing itself cannot be avoided • The standard hash table is used, with nodes having the same hash key linked into singly-linked lists • The pointer to the next node is embedded in the AIG node • A linear-probing hash table was tried without improvement • Trick to sometimes avoid the hash-table look-up (sketched below) • When building a new node, do not look it up in the table if at least one of its fanins has reference counter 0
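A sketch of the look-up-skipping trick (TableLookup, TableInsert, and CreateNewAnd are hypothetical helpers; Aig_ObjRefs and Aig_Regular are the usual ABC accessors): a fanin with reference count 0 was just created, so no existing AND node can already use it, and the table probe can be skipped.

Aig_Obj_t * And_CreateHashed( Aig_Man_t * p, Aig_Obj_t * p0, Aig_Obj_t * p1 )
{
    Aig_Obj_t * pRes;
    // probe the table only if both fanins are already referenced somewhere
    if ( Aig_ObjRefs(Aig_Regular(p0)) > 0 && Aig_ObjRefs(Aig_Regular(p1)) > 0 )
    {
        pRes = TableLookup( p, p0, p1 );    // hypothetical table probe
        if ( pRes )
            return pRes;
    }
    pRes = CreateNewAnd( p, p0, p1 );       // hypothetical node creation
    TableInsert( p, pRes );                 // hypothetical table insertion
    return pRes;
}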
AIGER • Uses 3 bytes per AIG node, on average • A 1M-node AIG can be written into a 3Mb file • ~12x more compact than Verilog, BLIF, or BENCH • ~5x faster reading/writing for large files • Key observations used by AIGER (see the encoding sketch below) • To represent a node, two integers (fanin literals) need to be stored • The fanin literals are often numerically close • Only the difference between them needs to be stored, which typically takes one byte
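The core of the binary encoding can be sketched as follows (a simplified illustration of the idea, not the full AIGER writer): each delta is written 7 bits per byte, with the top bit marking that more bytes follow, so small deltas take a single byte.

#include <stdio.h>

// write an unsigned delta in variable-length form, 7 bits per byte
void WriteDelta( FILE * pFile, unsigned Delta )
{
    while ( Delta >= 0x80 )
    {
        fputc( (int)(0x80 | (Delta & 0x7F)), pFile );
        Delta >>= 7;
    }
    fputc( (int)Delta, pFile );
}

// for an AND node with literal lit and fanin literals lit0 >= lit1 (lit > lit0),
// only the two differences are stored
void WriteAnd( FILE * pFile, unsigned lit, unsigned lit0, unsigned lit1 )
{
    WriteDelta( pFile, lit - lit0 );
    WriteDelta( pFile, lit0 - lit1 );
}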
SAT Solving • A modern SAT solver (in particular, MiniSat) is a treasure-trove of tricks for efficient implementation • To mention just a few • Representing clauses as arrays of integers (sketched below) • Using signatures to check clause containment • Using the two-literal watching scheme • An idea for ~30% faster BCP: • For watched lists, use singly-linked lists instead of dynamic arrays • Embed the list pointers into the clauses
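A sketch of a clause stored as a flat array of integer literals with a 64-bit signature used for the containment check (an illustration of the idea, not MiniSat's exact layout):

typedef struct
{
    unsigned long long uSig;  // bit (lit mod 64) is set for every literal
    int                nLits; // number of literals
    int                pLits[]; // literals follow the header in memory
} Clause;

// clause A can be a subset of clause B only if every signature bit of A
// is also set in B; a cheap filter before the exact literal-by-literal check
int Clause_CanBeContained( Clause * pA, Clause * pB )
{
    return (pA->uSig & ~pB->uSig) == 0;
}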
Custom Memory Management • Three types of memory managers in ABC • Fixed-size • Allocates/recycles entries of a fixed size (see the sketch below) • Used for AIG nodes • Flexible-size • Allocates (but does not recycle) entries of variable size • Used for signal names • Step-size • Step sizes are powers of 2 (4-8-16-32-etc.) in bytes • Used for CNF clauses in the customized version of MiniSat • Code in package “abc\src\aig\mem”
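A minimal sketch of a fixed-size manager (a simplified illustration, not the code in abc\src\aig\mem): entries are carved out of large pages and recycled through a singly-linked free list, so fetching and recycling take only a few instructions each.

#include <stdlib.h>

typedef struct FixedMem
{
    size_t nEntrySize;   // size of one entry (>= sizeof(void *))
    char * pFreeList;    // head of the list of recycled entries
    char * pCurrent;     // next unused byte in the current page
    char * pEnd;         // end of the current page
    size_t nPageSize;    // bytes allocated per page
} FixedMem;

void * FixedMem_Fetch( FixedMem * p )
{
    char * pEntry;
    if ( p->pFreeList )                    // reuse a recycled entry
    {
        pEntry = p->pFreeList;
        p->pFreeList = *(char **)pEntry;
        return pEntry;
    }
    if ( p->pCurrent == NULL || p->pCurrent + p->nEntrySize > p->pEnd )
    {
        // start a new page (pages are never freed in this sketch)
        p->pCurrent = (char *)malloc( p->nPageSize );
        p->pEnd = p->pCurrent + p->nPageSize;
    }
    pEntry = p->pCurrent;
    p->pCurrent += p->nEntrySize;
    return pEntry;
}

void FixedMem_Recycle( FixedMem * p, void * pEntry )
{
    *(char **)pEntry = p->pFreeList;       // push onto the free list
    p->pFreeList = (char *)pEntry;
}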
Locality of Computation • To improve speed • Use less memory • Make transformations local • Use contiguous data structures • Case study: BDDs vs. truth tables (TTs) • In the past: “BDDs are present-day truth tables” • These days: “Truth tables are present-day BDDs” • Advantages of TTs • Computation is more local (see the sketch below) • Memory usage is predictable • For functions up to 16 vars, TTs lead to faster computation • ISOP, DSD, matching, decomposition, etc. • Limitations of TTs • Do not work for more than 16 variables • Some operations are faster using BDDs, even for functions with 10 variables • E.g., cofactoring and satisfy counting
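As an illustration of the locality argument: a 16-variable truth table is 2^16 bits = 1024 64-bit machine words, and a Boolean operation over it is a tight sequential loop (a sketch with a hypothetical word-count macro):

// number of 64-bit words in a truth table of a function with nVars variables
#define TT_WORDS(nVars)  ((nVars) <= 6 ? 1 : (1 << ((nVars) - 6)))

// bitwise AND of two truth tables: purely sequential, cache-friendly access
void TruthAnd( unsigned long long * pOut,
               const unsigned long long * pIn0,
               const unsigned long long * pIn1, int nVars )
{
    int w;
    for ( w = 0; w < TT_WORDS(nVars); w++ )
        pOut[w] = pIn0[w] & pIn1[w];
}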
Conclusion • Lessons learned while developing ABC • Topics considered • Network traversal • AIG representation • SAT solving • Memory management • Locality of computation • Locality of computation is important • Allows for efficient control of the resources • Leads to scalability and parallelism