
Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How


Presentation Transcript


  1. Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How Gagan Agrawal Wei Du Tahsin Kurc Umit Catalyurek Joel Saltz The Ohio State University

  2. Overall Context • NGS grant titled "An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications", funded September 2002 – August 2005 • Project Components • Runtime Optimizations in the DataCutter System • Compiler Optimization of DataCutter filters • Automatic Generation of DataCutter filters • Focus of this talk

  3. General Motivation • Language and Compiler Support for Parallelism of many forms has been explored • Shared memory parallelism • Instruction-level parallelism • Distributed memory parallelism • Multithreaded execution • Application and technology trends are making another form of parallelism desirable and feasible • Coarse-Grained Pipelined Parallelism

  4. Coarse-Grained Pipelined Parallelism (CGPP) • Definition • Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units • Example — K-Nearest Neighbor (range_query): given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), find the K nearest neighbors of p within R

  5. Coarse-Grained Pipelined Parallelism is Desirable & Feasible • Application scenarios [diagram: data repositories connected to computing sites and the user over the Internet]

  6. Coarse-Grained Pipelined Parallelism is Desirable & Feasible • A new class of data-intensive applications • Scientific data analysis • data mining • data visualization • image analysis • Two direct ways to implement such applications • Downloading all the data to the user's machine – often not feasible • Computing at the data repository – usually too slow

  7. Coarse-Grained Pipelined Parallelism is Desirable & Feasible • Our belief • A coarse-grained pipelined execution model is a good match [diagram: a pipeline of computing units between the data repository and the user, connected over the Internet]

  8. Coarse-Grained Pipelined Parallelism needs Compiler Support • Computation needs to be decomposed into stages • Decomposition decisions depend on the execution environment • How many computing sites are available • How many computing cycles are available on each site • What communication links are available • What the bandwidth of each link is • Code for each stage follows the same processing pattern, so it can be generated by the compiler • Shared or distributed memory parallelism needs to be exploited • High-level language and compiler support are necessary

  9. Outline • Coarse-grained pipelined parallelism is desirable & feasible • Coarse-grained pipelined parallelism needs high-level language & compiler support • An entire picture of the system • DataCutter runtime system & language dialect • Overview of the challenges for the compiler • Compiler Techniques • Experimental results • Related work • Future work & Conclusions

  10. An Entire Picture [diagram: Java Dialect → Compiler Support (Decomposition, Code Generation) → DataCutter Runtime System]

  11. DataCutter Runtime System • Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.) • Targets a distributed, heterogeneous environment • Allows decomposition of application-specific data processing operations into a set of interacting processes • Provides a specific low-level interface • filter • stream • layout & placement [diagram: filter1 → filter2 → filter3 connected by streams]

  12. Language Dialect • Goal • to give the compiler information about independent collections of objects, parallel loops and reduction operations, and pipelined parallelism • Extensions of Java • Pipelined_loop • Domain & RectDomain • Foreach loop • reduction variables

  13. ISO-Surface Extraction Example Code

  public class isosurface {
    public static void main(String arg[]) {
      float iso_value;
      RectDomain<1> CubeRange = [min:max];
      CUBE[1d] InputData = new CUBE[CubeRange];
      Point<1> p, b;
      RectDomain<1> PacketRange = [1:runtime_def_num_packets];
      RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
      Pipelined_loop (b in PacketRange) {
        Foreach (p in EachRange) {
          InputData[p].ISO_SurfaceTriangles(iso_value, …);
        }
        … …  // merge partial results
      }
    }
  }

  This replaces the sequential loop
  for (int i = min; i < max; i++) {
    // operate on InputData[i]
  }
  In general, a Pipelined_loop body is a sequence of n stages:
  Pipelined_loop (b in PacketRange) {
    0. foreach (…) { … }
    1. foreach (…) { … }
    … …
    n-1. S;
  }

  14. Overview of the Challenges for the Compiler • Filter Decomposition • To identify the candidate filter boundaries • Compute communication volume between two consecutive filters • Cost Model • Compute a mapping from computations in a loop to computing units in a pipeline • Filter Code Generation

  15. Identify the Candidate Filter Boundaries • Three types of candidate boundaries • Start & end of a foreach loop • Conditional statement If ( point[p].inRange(high, low) ) { local_KNN(point[p]); } • Start & end of a function call within a foreach loop • Any non-foreach loop must be completely inside a single filter

  16. Compute Required Communication • ReqComm(b) = the set of values that need to be communicated across boundary b • Cons(B) = the set of variables used in B but not defined in B • Gens(B) = the set of variables defined in B that are still live at the end of B • For a code section B delimited by boundary b2 (before B) and boundary b1 (after B): ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)
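A minimal sketch of the recurrence above, with variables as plain strings; this is illustrative only, not the compiler's actual dataflow implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Backward propagation of required communication across one code section B,
// following ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B).
public class ReqCommDemo {
    public static Set<String> across(Set<String> reqAtB1, Set<String> gens,
                                     Set<String> cons) {
        Set<String> req = new HashSet<>(reqAtB1);
        req.removeAll(gens);  // values defined inside B need not cross b2
        req.addAll(cons);     // values B uses but does not define must arrive
        return req;
    }

    public static void main(String[] args) {
        // B defines y (produced locally) and uses z (which must be sent in)
        Set<String> out = across(Set.of("x", "y"), Set.of("y"), Set.of("z"));
        System.out.println("ReqComm(b2) = " + out); // contains x and z
    }
}
```

Applied backward over consecutive code sections, this yields the communication volume at every candidate boundary.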

  17. Cost Model • A sequence of m computing units C1, …, Cm with computing powers P(C1), …, P(Cm) • A sequence of m-1 network links L1, …, Lm-1 with bandwidths B(L1), …, B(Lm-1) • A sequence of n candidate filter boundaries b1, …, bn

  18. Cost Model • Consider a pipeline C1 → L1 → C2 → L2 → C3 processing N packets • If L2 is the bottleneck stage, T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3) • If C2 is the bottleneck stage, T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3) [timing diagram: stages vs. time]
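The two formulas above share one shape: the bottleneck stage is paid once per packet, every other stage once in total. A sketch of that model (per-stage times are invented inputs, not measured values):

```java
// Total pipeline time under the slides' model: with N packets, the slowest
// stage is counted N times and every other stage contributes one fill/drain
// term, e.g. T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3) when L2 is slowest.
public class PipelineCost {
    public static double totalTime(double[] stageTimes, int numPackets) {
        double sum = 0, max = 0;
        for (double t : stageTimes) {
            sum += t;
            max = Math.max(max, t);
        }
        return sum - max + numPackets * max; // bottleneck counted N times
    }

    public static void main(String[] args) {
        // stages C1, L1, C2, L2, C3 with L2 (time 5) the bottleneck, N = 10
        double[] stages = {1, 2, 3, 5, 1};
        System.out.println(PipelineCost.totalTime(stages, 10)); // 7 + 10*5 = 57.0
    }
}
```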

  19. Filter Decomposition • Goal: find a mapping Li → bj, where 1 ≤ i ≤ m-1 and 1 ≤ j ≤ n, that minimizes the predicted execution time • Intuitively, inserting candidate filter boundary bj between computing units Ci and Ci+1 splits the n+1 code fragments f1, …, fn+1 among the m computing units • Exhaustive search is possible but expensive

  20. Filter Decomposition: A Greedy Algorithm • To minimize the predicted execution time, consider each link in turn • For link L1, estimate the cost of placing the cut at each candidate boundary: L1 to b1: T1, L1 to b2: T2, L1 to b3: T3, L1 to b4: T4 • Choose the minimum, e.g. Min{T1, …, T4} = T2, then proceed to the next link
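The greedy choice above can be sketched as follows. The cost model (stage time = work/power for units, volume/bandwidth for links, throughput limited by the slowest stage) and all numeric inputs are assumptions for illustration, not the paper's exact algorithm:

```java
import java.util.ArrayList;
import java.util.List;

// Greedy decomposition sketch: fragments f1..f_{n+1} are separated by
// candidate boundaries b1..bn, and each link is assigned a cut left to
// right by picking the candidate that minimizes the predicted bottleneck
// stage time.
public class GreedyCuts {
    // fragWork: n+1 fragment costs; commVol: n boundary volumes;
    // power: m unit speeds; bw: m-1 link bandwidths. Returns chosen cuts.
    public static List<Integer> decompose(double[] fragWork, double[] commVol,
                                          double[] power, double[] bw) {
        int m = power.length, n = commVol.length;
        List<Integer> cuts = new ArrayList<>();
        int prev = 0;
        for (int i = 0; i < m - 1; i++) {
            int best = -1;
            double bestT = Double.MAX_VALUE;
            // leave enough later boundaries for the remaining links
            for (int j = prev; j <= n - (m - 1 - i); j++) {
                double t = predict(fragWork, commVol, power, bw, cuts, j);
                if (t < bestT) { bestT = t; best = j; }
            }
            cuts.add(best);
            prev = best + 1;
        }
        return cuts;
    }

    // Bottleneck time if the next cut is boundary j, lumping all fragments
    // past j onto the next computing unit.
    static double predict(double[] w, double[] v, double[] p, double[] b,
                          List<Integer> cuts, int j) {
        List<Integer> trial = new ArrayList<>(cuts);
        trial.add(j);
        double bottleneck = 0;
        int start = 0;
        for (int u = 0; u < trial.size(); u++) {
            int cut = trial.get(u);
            double work = 0;
            for (int k = start; k <= cut; k++) work += w[k];
            bottleneck = Math.max(bottleneck, work / p[u]);   // unit u
            bottleneck = Math.max(bottleneck, v[cut] / b[u]); // link u
            start = cut + 1;
        }
        double rest = 0;
        for (int k = start; k < w.length; k++) rest += w[k];
        return Math.max(bottleneck, rest / p[trial.size()]);  // last unit
    }

    public static void main(String[] args) {
        // two units, one link: cutting at the second boundary avoids
        // sending the large volume at the first one
        List<Integer> cuts = decompose(new double[]{1, 1, 4}, // f1, f2, f3
                                       new double[]{10, 1},   // vol at b1, b2
                                       new double[]{1, 1},    // unit powers
                                       new double[]{1});      // link bandwidth
        System.out.println("cut boundaries (0-based): " + cuts); // [1]
    }
}
```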

  21. Code Generation • Abstraction of the work each filter does • Read in a buffer of data from input stream • Iterate over the set of data • Write out the results to output stream • Code generation issues • How to get the Cons(b) from the input stream --- unpacking data • How to organize the output data for the successive filter --- packing data
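The read–iterate–write abstraction above can be sketched with queues standing in for streams. FilterSkeleton, the EOS marker, and the squaring workload are all hypothetical; this is not DataCutter's real interface:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal shape of a generated filter: take a buffer from the input stream,
// iterate over its elements, pack the results for the successor filter.
public class FilterSkeleton {
    static final int[] EOS = new int[0]; // end-of-stream marker

    public static void run(BlockingQueue<int[]> in, BlockingQueue<int[]> out) {
        try {
            int[] buf;
            while ((buf = in.take()) != EOS) {       // unpack incoming buffer
                int[] result = new int[buf.length];
                for (int k = 0; k < buf.length; k++) // iterate over the data
                    result[k] = buf[k] * buf[k];
                out.put(result);                     // pack for the next filter
            }
            out.put(EOS);                            // forward end-of-stream
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BlockingQueue<int[]> in = new LinkedBlockingQueue<>();
        BlockingQueue<int[]> out = new LinkedBlockingQueue<>();
        in.offer(new int[]{1, 2, 3});
        in.offer(EOS);
        run(in, out);
        System.out.println(java.util.Arrays.toString(out.poll())); // [1, 4, 9]
    }
}
```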

  22. Experimental Results • Goal • To show that compiler-generated code is efficient • Environment settings • 700 MHz Pentium machines • Connected through Myrinet LANai 7.0 • Configurations (# data sites – # computing sites – user machine) • 1-1-1 • 2-2-1 • 4-4-1

  23. Experimental Results • Versions • Default version • Site hosting the data only reads and transmits data, no processing at all • User's desktop only views the results, no processing at all • All the work is done by the compute nodes (their workload is heavy and the communication volume is high) • Compiler-generated version • Intelligent decomposition is done by the compiler • More computations are performed on the end nodes to reduce the communication volume (workload is balanced across the nodes and the communication volume is reduced) • Manual version • Hand-written DataCutter filters with a similar decomposition to the compiler-generated version

  24. Experimental Results: ISO-Surface Rendering (Z-Buffer Based) • Small dataset (150M): speedups of 1.92 and 3.34 as the width of the pipeline grows • Large dataset (600M): speedups of 1.99 and 3.82 • 20% improvement over the default version [charts: execution time vs. width of pipeline]

  25. Experimental Results: ISO-Surface Rendering (Active Pixel Based) • Small dataset (150M) and large dataset (600M): speedup close to linear as the width of the pipeline grows • >15% improvement over the default version [charts: execution time vs. width of pipeline]

  26. Experimental Results: KNN • K = 3 (108M): speedups of 1.89 and 3.38 • K = 200 (108M): speedups of 1.87 and 3.82 • >150% improvement over the default version [charts: execution time vs. width of pipeline]

  27. Experimental Results: Virtual Microscope • Small query (800M, 512×512) • Large query (800M, 2048×2048) • ≈40% improvement over the default version [charts: execution time vs. width of pipeline]

  28. Experimental Results • Summary • The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions • In most cases, increasing the width of the pipeline results in near-linear speedup • Compared with the manual version, the compiler-decomposed versions are generally quite close

  29. Ongoing and Future Work • Buffer size optimization • Cost model refinement & implementation • More applications • More realistic environment settings: resources dynamically available

  30. Conclusions • Coarse-Grained Pipelined Parallelism is desirable & feasible • Coarse-Grained Pipelined Parallelism needs language & compiler support • An algorithm for required communication analysis is given • A greedy algorithm for filter decomposition is developed • A cost model is designed • Results of detailed evaluation of our compiler are encouraging
