Recursive Data Structure Profiling
Easwaran Raman, David I. August
Princeton University

Motivation
• Huge processor-memory performance gap
• Latency > 100 cycles
• Significant fraction of memory operations in typical programs
• In many applications, Recursive Data Structures (RDS) constitute a large fraction of memory usage
[Figure: CPU and DRAM performance by year, 1980 to 2000; log-scale vertical axis from 1 to 1000]

Motivation
• Techniques to minimize the performance impact of this gap
  • Caching, prefetching, out-of-order (OoO) execution
• Not very successful for RDS
  • Difficult to statically determine many RDS properties
  • Accesses are irregular and usually lie on the critical path of execution
Traversal code:
    while (valid(node)) {
        // do something with node->data
        node = next(node);
    }
• Short loop body prevents efficient OoO execution
• Non-contiguous layout results in irregular access patterns
[Figure: an RDS layout example, with nodes at addresses 0x1000, 0x2000, 0x3000, 0x4000]

Motivation
• Linearization [Clark76, Luk99]
• Speculation recovery cost outweighs the benefit if the next pointer field gets overwritten frequently
• Information on the dynamic behavior of the entire RDS is important
    index = 0;
    head = pos[index];
    while (head) {
        foo(head);
        head = pos[++index];   // speculated next node
        check(head);
    }
[Figure: a list reached from head, with nodes at addresses 1000, 1008, 1012, 1004, 1016, next to the pos array. Placement of the nodes in the figure corresponds to their placement in memory.]
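
To make the recovery-cost argument concrete, here is a minimal C sketch of a linearized traversal. It is not the scheme from [Clark76] or [Luk99]: the names `walk_result` and `linearized_sum` are hypothetical, and a simple counter stands in for the cost of speculation recovery. The walk predicts each successor from the `pos` array and compares it with the real next pointer.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    int data;
    struct node *next;
} node;

typedef struct {
    long sum;            /* result of the traversal */
    int mispredictions;  /* times pos[] disagreed with the real next pointer */
} walk_result;

/* Walk a list while predicting each successor from pos[0..n-1], the node
   order recorded when the list was linearized (hypothetical layout). */
walk_result linearized_sum(node *const *pos, size_t n) {
    assert(pos != NULL || n == 0);
    walk_result r = {0, 0};
    size_t i = 0;
    node *cur = (n > 0) ? pos[0] : NULL;
    while (cur != NULL) {
        r.sum += cur->data;
        node *predicted = (i + 1 < n) ? pos[i + 1] : NULL;
        node *actual = cur->next;
        if (predicted != actual) {
            r.mispredictions++;   /* stands in for speculation recovery */
        }
        cur = actual;             /* always continue from the true pointer */
        i++;
    }
    return r;
}
```

In this sketch a single overwritten next pointer also misaligns every later prediction, which is one way to see why the number of modifications to the next field matters for linearization.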

RDS Profile
• RDS profiling gives a ‘logical’ understanding of runtime behavior
  • ‘Application creates 100 trees’ instead of ‘application allocates 2 MB in heap’
  • ‘Linked list traversed 10 times’ instead of ‘Address 0x10004000 accessed 200 times’
• Profile for linearization: the next pointer field in list L is modified n times

RDS Discovery
C function for creating a tree:
    node *tree_create() {
        node *n = (node *)malloc(…);
        …
        n->left = tree_create(…);
        n->right = tree_create(…);
        return n;
    }
Execution trace in (pseudo) assembly:
    call malloc                    ; id = 1
    mov r10 = r8
    …
    call tree_create
    …
    call malloc                    ; id = 2
    …
    mov r11 = r8
    store r10[offset1] = r11       ; create 1->2
    call tree_create
    …
    call malloc                    ; id = 3
    …
    mov r12 = r8
    store r10[offset2] = r12       ; create 1->3
• Assign a unique id to each value returned by malloc and create a node labeled by that id
• Connect nodes by a directed edge if both the address and the value of a store have valid ids
[Figure: Dynamic Shape Graph (DSG) with nodes 1, 2, 3 and edges 1->2 and 1->3]
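
The two rules above can be sketched in C as follows. This is an illustration under stated assumptions, not the profiler's code: the type `dsg` and the functions `dsg_on_malloc`, `dsg_lookup`, and `dsg_on_store` are hypothetical names, lookups are a linear scan, and `free` and address reuse are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64
#define MAX_EDGES 256

typedef struct { uintptr_t base; size_t size; } alloc_rec;
typedef struct { int from, to; } edge;

typedef struct {
    alloc_rec allocs[MAX_NODES + 1];  /* indexed by id; id 0 means "no id" */
    int n_allocs;
    edge edges[MAX_EDGES];
    int n_edges;
} dsg;

void dsg_init(dsg *g) { g->n_allocs = 0; g->n_edges = 0; }

/* Called for each value returned by malloc: hand out the next id. */
int dsg_on_malloc(dsg *g, uintptr_t base, size_t size) {
    assert(g->n_allocs < MAX_NODES);
    int id = ++g->n_allocs;
    g->allocs[id].base = base;
    g->allocs[id].size = size;
    return id;
}

/* Return the id of the allocation containing addr, or 0 if there is none. */
int dsg_lookup(const dsg *g, uintptr_t addr) {
    for (int id = 1; id <= g->n_allocs; id++) {
        const alloc_rec *a = &g->allocs[id];
        if (addr >= a->base && addr - a->base < a->size) return id;
    }
    return 0;
}

/* Called for each store: add an edge only when both the stored-to address
   and the stored value fall inside tracked allocations. Returns 1 if added. */
int dsg_on_store(dsg *g, uintptr_t addr, uintptr_t value) {
    int from = dsg_lookup(g, addr);
    int to = dsg_lookup(g, value);
    if (from == 0 || to == 0) return 0;
    assert(g->n_edges < MAX_EDGES);
    g->edges[g->n_edges++] = (edge){from, to};
    return 1;
}
```

Replaying the trace above with made-up addresses (0x1000, 0x2000, 0x3000) yields the edges 1->2 and 1->3, while a store of a non-pointer value or a store outside any tracked allocation adds nothing.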

RDS Discovery
    array = malloc(…);
    for (i=…)
        array[i] = create_tree(…);
    …
• Multiple RDS instances can be connected together in the DSG!
• To separate them, we use properties of the static code
• Use another graph called the Static Shape Graph (SSG)
[Figure: DSG with nodes 1 to 7, in which the array node connects separate tree instances]

RDS Discovery
Execution trace in (pseudo) assembly:
    call malloc                    ; id = 1
    mov r20 = r8
    …call malloc                   ; id = 2
    …mov r10 = r8
    ……
    …call tree_create
    …… call malloc                 ; id = 3
    …… mov r11 = r8
    …store r10[offset1] = r11      ; create 2->3
    …call tree_create
    …… call malloc                 ; id = 4
    …… mov r12 = r8
    … store r10[offset2] = r12     ; create 2->4
    store r20[0] = r10             ; create 1->2
• For every static call to malloc, create a node with a unique id in the Static Shape Graph (SSG)
• If a store creates an edge, connect the corresponding static nodes
• Check for strongly connected components (SCCs) in the SSG
• Connect two dynamic nodes only if their corresponding static nodes are in the same SCC
[Figure: SSG with nodes A and T; DSG with nodes 1 to 7]
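
A small C sketch of the SCC filter follows, assuming only a handful of allocation sites. Instead of a linear-time SCC algorithm it uses a Warshall-style transitive closure, which is a simplification chosen for brevity; the names `ssg`, `ssg_same_scc`, and `keep_dynamic_edge`, and the `site_of` array mapping dynamic ids to static sites, are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SITES 16

typedef struct {
    int n_sites;
    bool reach[MAX_SITES][MAX_SITES];  /* reach[a][b]: path from site a to b */
} ssg;

void ssg_init(ssg *s, int n_sites) {
    assert(n_sites >= 0 && n_sites <= MAX_SITES);
    s->n_sites = n_sites;
    memset(s->reach, 0, sizeof s->reach);
}

/* Record that some store linked an object from site `from` to one from `to`. */
void ssg_add_edge(ssg *s, int from, int to) { s->reach[from][to] = true; }

/* Transitive closure (Warshall); call once after all edges are added. */
void ssg_close(ssg *s) {
    for (int k = 0; k < s->n_sites; k++)
        for (int i = 0; i < s->n_sites; i++)
            for (int j = 0; j < s->n_sites; j++)
                if (s->reach[i][k] && s->reach[k][j]) s->reach[i][j] = true;
}

/* Two sites share an SCC when each can reach the other. */
bool ssg_same_scc(const ssg *s, int a, int b) {
    return a == b || (s->reach[a][b] && s->reach[b][a]);
}

/* Keep a dynamic edge only if both endpoints come from the same static SCC. */
bool keep_dynamic_edge(const ssg *s, const int *site_of, int from_id, int to_id) {
    return ssg_same_scc(s, site_of[from_id], site_of[to_id]);
}
```

With site A for the array and site T for tree nodes, the static edges are A->T and T->T, so A and T land in different SCCs: the dynamic edge 1->2 from the trace is dropped, and 2->3 and 2->4 are kept.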

Experimental setup
• Uses Pin, a dynamic instrumentation tool for Itanium
• Mappings between address ranges and dynamic ids are stored in an AVL tree
  • The most recent mapping is cached
• A mix of benchmarks from SPEC, Olden, and other pointer-intensive applications
• Dynamic instruction count varies from a few million (ks) to over 300 billion (mesa)
• All experiments run on a 900 MHz Itanium 2 with 2 GB RAM running RH 7.1
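
The address-to-id lookup with a one-entry cache might look like the C sketch below. To stay short it swaps the AVL tree for a sorted array with binary search, which gives the same logarithmic lookup but slower inserts; `range_map` and its functions are hypothetical names, and ranges are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 128

typedef struct { uintptr_t base; size_t size; int id; } range;

typedef struct {
    range r[MAX_RANGES];  /* sorted by base address */
    int n;
    int cached;           /* index of the most recently hit range, or -1 */
    long cache_hits;
} range_map;

static int range_contains(const range *x, uintptr_t a) {
    return a >= x->base && a - x->base < x->size;
}

void range_map_init(range_map *m) { m->n = 0; m->cached = -1; m->cache_hits = 0; }

/* Insert a new range, keeping the array sorted by base address. */
void range_map_insert(range_map *m, uintptr_t base, size_t size, int id) {
    assert(m->n < MAX_RANGES);
    int i = m->n;
    while (i > 0 && m->r[i - 1].base > base) {
        m->r[i] = m->r[i - 1];
        i--;
    }
    m->r[i] = (range){base, size, id};
    m->n++;
    m->cached = -1;  /* indices may have shifted */
}

/* Return the id owning addr, or 0 if no range contains it. */
int range_map_lookup(range_map *m, uintptr_t addr) {
    if (m->cached >= 0 && range_contains(&m->r[m->cached], addr)) {
        m->cache_hits++;
        return m->r[m->cached].id;
    }
    int lo = 0, hi = m->n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (range_contains(&m->r[mid], addr)) {
            m->cached = mid;
            return m->r[mid].id;
        }
        if (addr < m->r[mid].base) hi = mid - 1;
        else lo = mid + 1;
    }
    return 0;
}
```

The cache pays off when consecutive accesses touch fields of the same object, since those lookups skip the search entirely.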

Profiler Performance
• Profile: RDS size, lifetime, access count
• Memory: <16 MB for all but 3 applications
• Baseline: execution using Pin (~10 times slower than native)

RDS usage statistics
• SCCs in the static shape graph (RDS types)
  • Usually a few (<5) per benchmark, with a maximum of 31 in parser
• #RDS instances (connected components in the DSG)
  • Exhibits a wide range (1 in mcf to around a million in parser)
  • Instances tend to be long-lived if the program creates only a few of them
• Sizes of RDS instances
  • Vary from a single-node self-loop (parser) to a few hundred thousand nodes (mcf, parser)
• #pointer-chasing loads
  • Significant in many benchmarks
• Applications show vast diversity in RDS usage
  • A good reason for profiling them!
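
Counting RDS instances as connected components can be sketched with a union-find structure in C. This assumes the DSG edges have already been filtered, and it treats edges as undirected for the purpose of grouping; `dsu` and its functions are hypothetical names, not part of the profiler.

```c
#include <assert.h>

#define MAX_IDS 1024

typedef struct {
    int parent[MAX_IDS];
    int n;
} dsu;

void dsu_init(dsu *d, int n) {
    assert(n >= 0 && n <= MAX_IDS);
    d->n = n;
    for (int i = 0; i < n; i++) d->parent[i] = i;
}

/* Find the representative of x, halving the path along the way. */
int dsu_find(dsu *d, int x) {
    while (d->parent[x] != x) {
        d->parent[x] = d->parent[d->parent[x]];
        x = d->parent[x];
    }
    return x;
}

/* Merge the components of a and b (one call per DSG edge). */
void dsu_union(dsu *d, int a, int b) {
    int ra = dsu_find(d, a);
    int rb = dsu_find(d, b);
    if (ra != rb) d->parent[ra] = rb;
}

/* Number of connected components, i.e. RDS instances in this sketch. */
int dsu_count(dsu *d) {
    int count = 0;
    for (int i = 0; i < d->n; i++)
        if (d->parent[i] == i) count++;
    return count;
}
```

A node with no edges, or only a self-loop, stays in its own component, which matches the single-node instances mentioned above.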

RDS Stability
• Stability of an RDS: a notion of how 'array-like' an RDS is
• Stability index: an attempt to quantify this notion
  • Identify the time instances (alteration points) when changes occur to the RDS structure (by stores that replace existing pointers)
  • Count the traversals between successive alteration points
  • Stability index = #intervals that account for ‘most’ of the traversals
  • Lower index means higher stability
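
One possible reading of the stability index, sketched in C. Two points are assumptions rather than the slide's definition: ‘most’ is taken to be a caller-supplied fraction (for example 0.9), and the index is taken to be the smallest number of intervals that reaches that fraction. The event string, with 'T' for a traversal and 'A' for an alteration point, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Sort helper: descending order of traversal counts. */
static int cmp_desc(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x < y) - (x > y);
}

/* Split an event string into per-interval traversal counts.
   Each 'A' closes the current interval and opens a new one. */
int split_intervals(const char *events, long *counts, int max_intervals) {
    assert(max_intervals > 0);
    int n = 1;
    counts[0] = 0;
    for (const char *p = events; *p != '\0'; p++) {
        if (*p == 'A') {
            assert(n < max_intervals);
            counts[n++] = 0;
        } else if (*p == 'T') {
            counts[n - 1]++;
        }
    }
    return n;
}

/* Smallest number of intervals covering `fraction` of all traversals.
   Sorts counts in place; returns 0 when there are no traversals. */
int stability_index(long *counts, int n, double fraction) {
    long total = 0;
    for (int i = 0; i < n; i++) total += counts[i];
    if (total == 0) return 0;
    qsort(counts, (size_t)n, sizeof counts[0], cmp_desc);
    long covered = 0;
    for (int i = 0; i < n; i++) {
        covered += counts[i];
        if ((double)covered >= fraction * (double)total) return i + 1;
    }
    return n;
}
```

Under this reading, an RDS whose traversals nearly all fall in one interval gets index 1 (stable), while one whose traversals are spread evenly over many intervals gets a high index.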

Conclusion
• Aggressive data-structure-level optimization techniques for RDS need profile information for improved performance
• RDS profiling gives a better understanding of the runtime behavior of RDS
• RDS usage varies widely across benchmarks

RDS Profiling: Definitions
• RDS type: the abstract form of the logical data structure that is manipulated by the program
  • Examples: list, binary tree, graph, etc.
  • Can be mutually recursive (nodes point to their incident edges and vice versa to form a graph)
• RDS instance: a concrete realization of the RDS type
  • Examples: the tree created in function foo, the list pointed to by the first entry of the hash table