Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA
Zheng Wei and Joseph JaJa
Rohit Nigam 200702036
Solution Approach
[Diagram: processing stages labeled, in order, GPU, GPU, GPU, CPU, GPU]
Problems Faced
• The paper does not mention that, of the s randomly generated sublists, one must start at the head of the list. I therefore fix the head of the first sublist to be the head of the list. The remaining sublist heads are chosen at random, as the paper suggests.
• Another problem arose in executing steps 4 and 5. Because the sublists are random and not ordered, computing the prefix sum over the last elements of the sublists is itself a linked-list prefix-sum problem. To solve it, we need another array that records which sublist comes after each sublist.
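The two fixes above can be sketched as a sequential, host-side C++ simulation. This is only an illustration under stated assumptions, not the paper's CUDA kernels: the list is assumed to be stored as index arrays covering all n nodes, and the names `list_prefix_sum`, `sublist_of`, `next_sublist` and the `seed` parameter are hypothetical. The step numbers in the comments are my own grouping and may not match the paper's numbering exactly.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Assumed layout: next[i] is the index of the node after node i, -1 marks the tail,
// and every node is reachable from `head`. Returns inclusive prefix sums per node.
std::vector<long long> list_prefix_sum(const std::vector<int>& next,
                                       const std::vector<long long>& value,
                                       int head, int s, unsigned seed = 42) {
    const int n = static_cast<int>(next.size());
    assert(n > 0 && value.size() == next.size());
    assert(s >= 1 && s <= n && head >= 0 && head < n);

    // Steps 1-2: sublist 0 always starts at the head of the list;
    // the other s-1 sublist heads are distinct random nodes.
    std::vector<int> sublist_of(n, -1);  // sublist id if the node is a sublist head
    std::vector<int> heads{head};
    sublist_of[head] = 0;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, n - 1);
    while (static_cast<int>(heads.size()) < s) {
        int c = pick(rng);
        if (sublist_of[c] == -1) {
            sublist_of[c] = static_cast<int>(heads.size());
            heads.push_back(c);
        }
    }

    // Step 3: walk each sublist (one thread per sublist in a parallel version),
    // computing local prefix sums and recording which sublist comes next.
    std::vector<long long> prefix(n), total(s);
    std::vector<int> owner(n, 0), next_sublist(s, -1);
    for (int k = 0; k < s; ++k) {
        long long sum = 0;
        int cur = heads[k];
        do {
            sum += value[cur];
            prefix[cur] = sum;
            owner[cur] = k;
            cur = next[cur];
        } while (cur != -1 && sublist_of[cur] == -1);
        total[k] = sum;
        next_sublist[k] = (cur == -1) ? -1 : sublist_of[cur];
    }

    // Step 4: sequential scan over sublist totals in list order,
    // following next_sublist from sublist 0 (the one holding the head).
    std::vector<long long> offset(s, 0);
    long long running = 0;
    for (int k = 0; k != -1; k = next_sublist[k]) {
        offset[k] = running;
        running += total[k];
    }

    // Step 5: add each sublist's offset to the local prefix sums of its nodes.
    for (int i = 0; i < n; ++i) prefix[i] += offset[owner[i]];
    return prefix;
}
```

Because sublist 0 is pinned to the head, the scan in step 4 has a known starting point, and `next_sublist` lets it visit the randomly placed sublists in list order without any sorting.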
Optimizations
• The head of the list is assumed to be unknown mainly to explore the impact of significant caches: the initial step that determines the head fills the cache with some of the input data, which makes the later steps run faster on such processors.
• With high probability, each thread handles about the same total number of nodes as any other thread if the number of sublists is at least lnp n and the number of processors p < , where n is the total number of nodes.
• The number of sublists is chosen to balance the desirability of a large number of sublists (for fine-grained data-parallel computation and load balancing) against the splitting and merging costs.
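The load-balancing claim can be explored empirically with a small sequential C++ sketch. It is only an illustration under assumptions of my own: cut points are positions in list order (not memory indices), sublist k is handed to thread k % p, and the name `thread_loads` and the `seed` parameter are hypothetical. It does not reproduce the bound stated on the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Number of nodes each of p threads traverses when a list of n nodes is cut at
// the head plus s-1 distinct random positions, and sublist k goes to thread k % p.
std::vector<long long> thread_loads(long long n, int s, int p, unsigned seed = 1) {
    assert(n >= 2 && s <= n && s >= p && p >= 1);
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<long long> pick(1, n - 1);

    std::vector<long long> cut{0};  // position 0 is the head of the list
    while (static_cast<int>(cut.size()) < s) {
        long long c = pick(rng);
        if (std::find(cut.begin(), cut.end(), c) == cut.end()) cut.push_back(c);
    }

    // A sublist runs from its cut point up to the next larger cut point (or n).
    std::vector<long long> sorted = cut;
    std::sort(sorted.begin(), sorted.end());
    std::vector<long long> load(p, 0);
    for (int k = 0; k < s; ++k) {
        auto it = std::upper_bound(sorted.begin(), sorted.end(), cut[k]);
        long long end = (it == sorted.end()) ? n : *it;
        load[k % p] += end - cut[k];
    }
    return load;
}
```

Comparing the largest entry of the result with n / p for a few values of s gives a rough feel for how quickly the per-thread loads even out as the number of sublists grows.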
Optimizations
• Step 4 computes the prefix sum sequentially instead of recursively, which removes a significant overhead.
• Randomizing the positions of the splitters makes the overall procedure load balanced with high probability.
• The total number of sublists per thread is min(2*(size/120),32) (size>120). This is the optimum value found experimentally: beyond it, the benefit of increasing the number of sublists is outweighed by the overhead of creating and joining them in the other stages of the algorithm.
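The sublists-per-thread rule above can be written as a one-line helper. This is a minimal sketch: the slide does not define what `size` measures or whether the division is integer division, so integer division is an assumption here, and the function name `sublists_per_thread` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// min(2*(size/120), 32), valid for size > 120. Integer division is assumed.
long long sublists_per_thread(long long size) {
    assert(size > 120);
    return std::min(2 * (size / 120), 32LL);
}
```

Under these assumptions the value grows in steps of 2 and reaches the cap of 32 once size is 1920 or more.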
Results • For List Size 64M, stride 1001, Sublists per thread 32.