Cooperative Parallelization Praveen Yedlapalli Emre Kultursay Mahmut Kandemir The Pennsylvania State University
Outline • Motivation • Introduction • Cooperative Parallelization • Programmer’s Input • Evaluation • Conclusion
Motivation • Program parallelization is a difficult task • Automatic parallelization helps parallelize sequential applications • Most parallelization techniques focus on array-based applications • Limited support for parallelizing pointer-intensive applications
Example
Tree Traversal:
void traverse_tree(Tree *tree) {
  if (tree->left) traverse_tree(tree->left);
  if (tree->right) traverse_tree(tree->right);
  process(tree);
}
List Traversal:
void traverse_list(List *list) {
  List *node = list;
  while (node != NULL) {
    process(node);
    node = node->next;
  }
}
Introduction • Program Parallelization is a 2-fold problem • First Problem: Finding where parallelism is available in the application if any • Second Problem: Deciding how to efficiently exploit the available parallelism
Finding Parallelism • Use static analysis to perform dependence checking and identify independent parts of the program • Target regular structures like arrays and for loops • Pointer-intensive code cannot be analyzed accurately with static analysis
Pointer Problem • Pointer-intensive applications typically have • data structures built from the input • and while loops to traverse those data structures • Without points-to information and without loop counts, there is very little we can do at compile time
Exploiting Parallelism • In array based applications with for loops sets of iterations are distributed to different threads • In pointer intensive applications information about the data structure is needed to run the parallel code
Programmer’s Input • The programmer has high level view of the program and can give hints about the program • Hints can indicate things like • If a loop can be parallelized • If function calls are independent • Structure of the working data • All of these bits of information are vital in program parallelization
Application Runtime Information • To efficiently exploit parallelism in pointer intensive applications we need runtime information • Size and shape of data structure (dependent on input) • Points-to information • Using the points-to information we determine the work distribution
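As a concrete illustration of the kind of runtime information involved, the sketch below walks a linked list once to measure its size and then picks a thread count from a per-thread work threshold. The names (`list_length`, `choose_threads`, `min_per_thread`) and the policy itself are hypothetical stand-ins for the runtime system described here, not the paper's actual API:

```c
#include <stddef.h>

typedef struct ListNode { struct ListNode *next; } ListNode;

/* Walk the list once at runtime to learn its size (input-dependent,
 * so unknowable at compile time). */
static size_t list_length(const ListNode *n) {
    size_t len = 0;
    while (n != NULL) { len++; n = n->next; }
    return len;
}

/* Pick a thread count so each thread gets at least min_per_thread
 * nodes; fall back to sequential (1 thread) for small inputs. */
static int choose_threads(size_t work, size_t min_per_thread, int max_threads) {
    size_t t = work / min_per_thread;
    if (t < 1) return 1;
    if (t > (size_t)max_threads) return max_threads;
    return (int)t;
}
```

The same measurement can double as the profitability check mentioned later: if the list is below the threshold, `choose_threads` simply returns 1 and the code runs sequentially.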
Cooperative Parallelization Programmer(hints) Runtime System Compiler Cooperative Parallelization SequentialProgram Parallel Program
Cooperative Parallelization • Cooperation between the programmer, the compiler and the runtime system to identify and efficiently exercise parallelism in pointer intensive applications • The task of identifying parallelism in the code is delegated to the programmer • Runtime system is responsible for monitoring the program and efficiently executing parallel code
Application Characteristics • Pointer-intensive applications • A data structure is built from the input • The data structure is traversed several times and nodes are processed • The operations on nodes are typically independent • This fact can be obtained from the programmer as a hint
Tree Example
int perimeter(QuadTree tree, int size) {
  int retval = 0;
  if (tree->color == grey) { /* node has children */
    retval += perimeter(tree->nw, size/2);
    retval += perimeter(tree->ne, size/2);
    retval += perimeter(tree->sw, size/2);
    retval += perimeter(tree->se, size/2);
  } else if (tree->color == black) {
    ... /* do something on the node */
  }
  return retval;
}
Function from perimeter benchmark (figure: tree split into nw … se subtrees)
List Example
void compute_node(node_t *nodelist) {
  int i;
  while (nodelist != NULL) {
    for (i = 0; i < nodelist->from_count; i++) {
      node_t *other_node = nodelist->from_nodes[i];
      double coeff = nodelist->coeffs[i];
      double value = other_node->value;
      nodelist->value -= coeff * value;
    }
    nodelist = nodelist->next;
  }
}
Function from em3d benchmark (figure: head pointing to sublist 1 … sublist n)
Runtime System • Processing of different parts of the data structure (subproblems) can be done in parallel • Needs access to multiple subproblems at runtime • The task of finding these subproblems in the data structure is done by a helper thread
Helper Thread • The helper thread goes over the data structure and finds multiple independent sub problems • The helper thread doesn’t need to traverse the whole data structure to find the sub problems • Using a separate thread for finding the sub problems reduces the overhead
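A minimal sketch of what the helper thread might compute for a linked list: record every `stride`-th node as the head of a subproblem and stop once enough have been found, so the whole list need not be traversed. The function name, signature, and parameters are illustrative assumptions, not the paper's implementation:

```c
#include <stddef.h>

typedef struct LNode { struct LNode *next; } LNode;

/* Collect up to max_sub sub-problem heads, one every `stride` nodes.
 * Stops early once max_sub heads are found, so only a prefix of the
 * list is traversed. Returns the number of heads collected. */
static int find_subproblems(LNode *head, int stride, LNode **heads, int max_sub) {
    int count = 0, i = 0;
    for (LNode *n = head; n != NULL && count < max_sub; n = n->next, i++)
        if (i % stride == 0)
            heads[count++] = n;
    return count;
}
```

Each application thread would then process `stride` nodes starting from its assigned head, with the last thread running to the end of the list.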
Approach (figure: sequential execution of the loop vs. parallel execution, in which a helper thread finds subproblems and the application threads execute the loop body in parallel)
Code Structure
helper thread:
  wait for signal from main thread
  find subproblems in the data structure
  signal main thread
application thread:
  wait for signal from main thread
  work on the subproblems assigned to this thread
  signal main thread
main thread:
  signal helper thread when data structure is ready
  wait for signal from helper thread
  distribute subproblems to application threads
  signal application threads
  wait for signal from application threads
  merge results from all the application threads
Profitability • The runtime information collected is used to determine the profitability of parallelization • This decision can be driven by the programmer using a hint • The program is parallelized only if the data structure is “big” enough
Programmer Hints • Interface between the programmer and the compiler • Should be simple to use with minimal essential information
#parallel tree function (threads) (degree) (struct) {children} threshold [reduction]
#parallel llist function (threads) (struct) (next_node) threshold [number]
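One plausible reading of the list-hint grammar, applied to the em3d `compute_node` function from the earlier slide (the concrete values here — 4 threads, a threshold of 1000 nodes — are illustrative guesses, not taken from the paper):

```c
/* hypothetical instantiation of the llist hint */
#parallel llist compute_node (4) (node_t) (next) 1000
void compute_node(node_t *nodelist) { /* ... as before ... */ }
```

Read as: parallelize the list traversal in `compute_node` across 4 threads, where nodes have type `node_t` linked through the `next` field, and only if the list has at least 1000 nodes.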
Automation • Implemented a source-to-source translator • Modified C language grammar to understand the hints C program with hints Parser Generator Modified C grammar Translator Parallel program
Experimental Setup All benchmarks except otter are from the Olden suite
Evaluation 15x speedup
Overheads • The helper thread can be invoked before the main thread reaches the computation, overlapping the overhead of finding the subproblems with useful work • The helper thread generally traverses only part of the data structure and takes much less time than the original function
Comparison to OpenMP • OpenMP 3.0 supports task parallelism • Directives can be added in the code to parallelize while loops and recursive functions • OpenMP tasks don't take application runtime information into consideration • Tasks tend to be fine-grained • Significant performance overhead
Related Work • Speculative parallelization can help parallelize programs that are difficult to analyze • But it comes at the cost of executing instructions that might not be useful • Power and performance overhead • Our approach is a non-speculative way of parallelizing
Conclusion • Traditional parallelization techniques cannot efficiently parallelize pointer-intensive code • By combining the programmer's knowledge with application runtime information, we can exploit parallelism in such code • The idea presented is not limited to trees and linked lists and can be extended to other dynamic structures like graphs
Thank You Questions?