Cooperative Parallelization Praveen Yedlapalli Emre Kultursay Mahmut Kandemir The Pennsylvania State University
Outline • Motivation • Introduction • Cooperative Parallelization • Programmer’s Input • Evaluation • Conclusion
Motivation • Program parallelization is a difficult task • Automatic parallelization helps parallelize sequential applications • Most parallelization techniques focus on array-based applications • Limited support for parallelizing pointer-intensive applications
Example
Tree Traversal:
void traverse_tree(Tree *tree) {
  if (tree->left) traverse_tree(tree->left);
  if (tree->right) traverse_tree(tree->right);
  process(tree);
}
List Traversal:
void traverse_list(List *list) {
  List *node = list;
  while (node != NULL) {
    process(node);
    node = node->next;
  }
}
Introduction • Program Parallelization is a 2-fold problem • First Problem: Finding where parallelism is available in the application if any • Second Problem: Deciding how to efficiently exploit the available parallelism
Finding Parallelism • Use static analysis to perform dependence checking and identify independent parts of the program • Target regular structures like arrays and for loops • Pointer-intensive code cannot be analyzed accurately with static analysis
Pointer Problem • Pointer-intensive applications typically have • data structures built from the input • and while loops to traverse those data structures • Without points-to information and without loop counts, there is very little we can do at compile time
Exploiting Parallelism • In array based applications with for loops sets of iterations are distributed to different threads • In pointer intensive applications information about the data structure is needed to run the parallel code
Programmer’s Input • The programmer has high level view of the program and can give hints about the program • Hints can indicate things like • If a loop can be parallelized • If function calls are independent • Structure of the working data • All of these bits of information are vital in program parallelization
Application Runtime Information • To efficiently exploit parallelism in pointer intensive applications we need runtime information • Size and shape of data structure (dependent on input) • Points-to information • Using the points-to information we determine the work distribution
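As a concrete illustration of the kind of runtime information involved, the sketch below walks a linked list once to measure its size and then picks a thread count from a per-thread work threshold. The names (`list_length`, `choose_threads`, `min_per_thread`) and the policy itself are hypothetical stand-ins for the runtime system described here, not the paper's actual API:

```c
#include <stddef.h>

typedef struct ListNode { struct ListNode *next; } ListNode;

/* Walk the list once at runtime to learn its size (input-dependent,
 * so unknowable at compile time). */
static size_t list_length(const ListNode *n) {
    size_t len = 0;
    while (n != NULL) { len++; n = n->next; }
    return len;
}

/* Pick a thread count so each thread gets at least min_per_thread
 * nodes; fall back to sequential (1 thread) for small inputs. */
static int choose_threads(size_t work, size_t min_per_thread, int max_threads) {
    size_t t = work / min_per_thread;
    if (t < 1) return 1;
    if (t > (size_t)max_threads) return max_threads;
    return (int)t;
}
```

The same measurement can double as the profitability check mentioned later: if the list is below the threshold, `choose_threads` simply returns 1 and the code runs sequentially.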
Cooperative Parallelization Programmer(hints) Runtime System Compiler Cooperative Parallelization SequentialProgram Parallel Program
Cooperative Parallelization • Cooperation between the programmer, the compiler and the runtime system to identify and efficiently exercise parallelism in pointer intensive applications • The task of identifying parallelism in the code is delegated to the programmer • Runtime system is responsible for monitoring the program and efficiently executing parallel code
Application Characteristics • Pointer-intensive applications • A data structure is built from the input • The data structure is traversed several times and nodes are processed • The operations on nodes are typically independent • This fact can be obtained from the programmer as a hint
Tree Example
int perimeter(QuadTree tree, int size) {
  int retval = 0;
  if (tree->color == grey) { /* node has children */
    retval += perimeter(tree->nw, size/2);
    retval += perimeter(tree->ne, size/2);
    retval += perimeter(tree->sw, size/2);
    retval += perimeter(tree->se, size/2);
  } else if (tree->color == black) {
    ... /* do something on the node */
  }
  return retval;
}
Function from perimeter benchmark (figure: tree split into nw … se subtrees)
List Example
void compute_node(node_t *nodelist) {
  int i;
  while (nodelist != NULL) {
    for (i = 0; i < nodelist->from_count; i++) {
      node_t *other_node = nodelist->from_nodes[i];
      double coeff = nodelist->coeffs[i];
      double value = other_node->value;
      nodelist->value -= coeff * value;
    }
    nodelist = nodelist->next;
  }
}
Function from em3d benchmark (figure: head pointing to sublist 1 … sublist n)
Runtime System • Processing of different parts of the data structure (subproblems) can be done in parallel • Needs access to multiple subproblems at runtime • The task of finding these subproblems in the data structure is done by a helper thread
Helper Thread • The helper thread goes over the data structure and finds multiple independent sub problems • The helper thread doesn’t need to traverse the whole data structure to find the sub problems • Using a separate thread for finding the sub problems reduces the overhead
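A minimal sketch of what the helper thread might compute for a linked list: record every `stride`-th node as the head of a subproblem and stop once enough have been found, so the whole list need not be traversed. The function name, signature, and parameters are illustrative assumptions, not the paper's implementation:

```c
#include <stddef.h>

typedef struct LNode { struct LNode *next; } LNode;

/* Collect up to max_sub sub-problem heads, one every `stride` nodes.
 * Stops early once max_sub heads are found, so only a prefix of the
 * list is traversed. Returns the number of heads collected. */
static int find_subproblems(LNode *head, int stride, LNode **heads, int max_sub) {
    int count = 0, i = 0;
    for (LNode *n = head; n != NULL && count < max_sub; n = n->next, i++)
        if (i % stride == 0)
            heads[count++] = n;
    return count;
}
```

Each application thread would then process `stride` nodes starting from its assigned head, with the last thread running to the end of the list.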
Approach (figure: sequential execution of the loop vs. parallel execution, in which a helper thread finds subproblems and the application threads execute the loop body in parallel)
Code Structure
helper thread:
  wait for signal from main thread
  find subproblems in the data structure
  signal main thread
application thread:
  wait for signal from main thread
  work on the subproblems assigned to this thread
  signal main thread
main thread:
  signal helper thread when data structure is ready
  wait for signal from helper thread
  distribute subproblems to application threads
  signal application threads
  wait for signal from application threads
  merge results from all the application threads
Profitability • The runtime information collected is used to determine the profitability of parallelization • This decision can be driven by the programmer using a hint • The program is parallelized only if the data structure is “big” enough
Programmer Hints • Interface between the programmer and the compiler • Should be simple to use with minimal essential information
#parallel tree function (threads) (degree) (struct) {children} threshold [reduction]
#parallel llist function (threads) (struct) (next_node) threshold [number]
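One plausible reading of the list-hint grammar, applied to the em3d `compute_node` function from the earlier slide (the concrete values here — 4 threads, a threshold of 1000 nodes — are illustrative guesses, not taken from the paper):

```c
/* hypothetical instantiation of the llist hint */
#parallel llist compute_node (4) (node_t) (next) 1000
void compute_node(node_t *nodelist) { /* ... as before ... */ }
```

Read as: parallelize the list traversal in `compute_node` across 4 threads, where nodes have type `node_t` linked through the `next` field, and only if the list has at least 1000 nodes.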
Automation • Implemented a source-to-source translator • Modified C language grammar to understand the hints C program with hints Parser Generator Modified C grammar Translator Parallel program
Experimental Setup All benchmarks except otter are from the Olden suite
Evaluation 15x speedup
Overheads • The helper thread can be invoked before the main thread reaches the computation, overlapping the overhead of finding the subproblems with useful work • The helper thread generally traverses only part of the data structure and takes much less time than the original function
Comparison to OpenMP • OpenMP 3.0 supports task parallelism • Directives can be added in the code to parallelize while loops and recursive functions • OpenMP tasks don't take application runtime information into consideration • Tasks tend to be fine-grained • Significant performance overhead
Related Work • Speculative parallelization can help parallelize programs that are difficult to analyze • But it comes at the cost of executing instructions that might not be useful • Power and performance overhead • Our approach is a non-speculative way of parallelizing
Conclusion • Traditional parallelization techniques cannot efficiently parallelize pointer-intensive code • By combining the programmer's knowledge with application runtime information, we can exploit parallelism in such code • The idea presented is not limited to trees and linked lists and can be extended to other dynamic structures like graphs
Thank You Questions?