
Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How


Presentation Transcript


  1. Compiler Supported Coarse-Grained Pipelined Parallelism: Why and How Gagan Agrawal Wei Du Tahsin Kurc Umit Catalyurek Joel Saltz The Ohio State University

  2. Overall Context • NGS grant titled "An Integrated Middleware and Language/Compiler Framework for Data-Intensive Applications", funded September 2002 – August 2005 • Project Components • Runtime Optimizations in the DataCutter System • Compiler Optimization of DataCutter filters • Automatic Generation of DataCutter filters • Focus of this talk

  3. General Motivation • Language and Compiler Support for Parallelism of many forms has been explored • Shared memory parallelism • Instruction-level parallelism • Distributed memory parallelism • Multithreaded execution • Application and technology trends are making another form of parallelism desirable and feasible • Coarse-Grained Pipelined Parallelism

  4. Coarse-Grained Pipelined Parallelism (CGPP) • Definition • Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units • Example — K-Nearest Neighbor (range_query): given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), find the K nearest neighbors of p within R

  5. Coarse-Grained Pipelined Parallelism is Desirable & Feasible • Application scenarios [diagram: data repositories connected to computing sites and the user over the Internet]

  6. Coarse-Grained Pipelined Parallelism is Desirable & Feasible • A new class of data-intensive applications • Scientific data analysis • data mining • data visualization • image analysis • Two direct ways to implement such applications • Downloading all the data to the user's machine – often not feasible • Computing at the data repository – usually too slow

  7. Coarse-Grained Pipelined Parallelism is Desirable & Feasible • Our belief • A coarse-grained pipelined execution model is a good match [diagram: a pipeline of computing units between the data repository and the user, connected over the Internet]

  8. Coarse-Grained Pipelined Parallelism needs Compiler Support • Computation needs to be decomposed into stages • Decomposition decisions depend on the execution environment • How many computing sites are available • How many computing cycles are available on each site • What communication links are available • What the bandwidth of each link is • Code for each stage follows the same processing pattern, so it can be generated by the compiler • Shared or distributed memory parallelism needs to be exploited • High-level language and compiler support are necessary

  9. Outline • Coarse-grained pipelined parallelism is desirable & feasible • Coarse-grained pipelined parallelism needs high-level language & compiler support • An entire picture of the system • DataCutter runtime system & language dialect • Overview of the challenges for the compiler • Compiler Techniques • Experimental results • Related work • Future work & Conclusions

  10. An Entire Picture [diagram: Java Dialect → Compiler Support (Decomposition, Code Generation) → DataCutter Runtime System]

  11. DataCutter Runtime System • Ongoing project at OSU / Maryland (Kurc, Catalyurek, Beynon, Saltz et al.) • Targets a distributed, heterogeneous environment • Allows decomposition of application-specific data processing operations into a set of interacting processes • Provides a specific low-level interface • filter • stream • layout & placement [diagram: filter1 → filter2 → filter3 connected by streams]

  12. Language Dialect • Goal • to give the compiler information about independent collections of objects, parallel loops and reduction operations, and pipelined parallelism • Extensions of Java • Pipelined_loop • Domain & RectDomain • Foreach loop • reduction variables

  13. ISO-Surface Extraction Example Code

  public class isosurface {
    public static void main(String arg[]) {
      float iso_value;
      RectDomain<1> CubeRange = [min:max];
      CUBE[1d] InputData = new CUBE[CubeRange];
      Point<1> p, b;
      RectDomain<1> PacketRange = [1:runtime_def_num_packets];
      RectDomain<1> EachRange = [1:(max-min)/runtime_def_num_packets];
      Pipelined_loop (b in PacketRange) {
        Foreach (p in EachRange) {
          InputData[p].ISO_SurfaceTriangles(iso_value, …);
        }
        … …  // merge partial results
      }
    }
  }

  This replaces the sequential loop
  for (int i = min; i < max; i++) {
    // operate on InputData[i]
  }
  In general, a Pipelined_loop body is a sequence of n stages:
  Pipelined_loop (b in PacketRange) {
    0. foreach (…) { … }
    1. foreach (…) { … }
    … …
    n-1. S;
  }

  14. Overview of the Challenges for the Compiler • Filter Decomposition • To identify the candidate filter boundaries • Compute communication volume between two consecutive filters • Cost Model • Compute a mapping from computations in a loop to computing units in a pipeline • Filter Code Generation

  15. Identify the Candidate Filter Boundaries • Three types of candidate boundaries • Start & end of a foreach loop • Conditional statement If ( point[p].inRange(high, low) ) { local_KNN(point[p]); } • Start & end of a function call within a foreach loop • Any non-foreach loop must be completely inside a single filter

  16. Compute Required Communication • ReqComm(b) = the set of values that need to be communicated across boundary b • Cons(B) = the set of variables used in B but not defined in B • Gens(B) = the set of variables defined in B that are still live at the end of B • For a code section B delimited by boundary b2 (before B) and boundary b1 (after B): ReqComm(b2) = ReqComm(b1) – Gens(B) + Cons(B)
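A minimal sketch of the recurrence above, with variables as plain strings; this is illustrative only, not the compiler's actual dataflow implementation:

```java
import java.util.HashSet;
import java.util.Set;

// Backward propagation of required communication across one code section B,
// following ReqComm(b2) = ReqComm(b1) - Gens(B) + Cons(B).
public class ReqCommDemo {
    public static Set<String> across(Set<String> reqAtB1, Set<String> gens,
                                     Set<String> cons) {
        Set<String> req = new HashSet<>(reqAtB1);
        req.removeAll(gens);  // values defined inside B need not cross b2
        req.addAll(cons);     // values B uses but does not define must arrive
        return req;
    }

    public static void main(String[] args) {
        // B defines y (produced locally) and uses z (which must be sent in)
        Set<String> out = across(Set.of("x", "y"), Set.of("y"), Set.of("z"));
        System.out.println("ReqComm(b2) = " + out); // contains x and z
    }
}
```

Applied backward over consecutive code sections, this yields the communication volume at every candidate boundary.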

  17. Cost Model • A sequence of m computing units C1, …, Cm with computing powers P(C1), …, P(Cm) • A sequence of m-1 network links L1, …, Lm-1 with bandwidths B(L1), …, B(Lm-1) • A sequence of n candidate filter boundaries b1, …, bn

  18. Cost Model • Consider a pipeline C1 → L1 → C2 → L2 → C3 processing N packets • If L2 is the bottleneck stage, T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3) • If C2 is the bottleneck stage, T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3) [timing diagram: stages vs. time]
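The two formulas above share one shape: the bottleneck stage is paid once per packet, every other stage once in total. A sketch of that model (per-stage times are invented inputs, not measured values):

```java
// Total pipeline time under the slides' model: with N packets, the slowest
// stage is counted N times and every other stage contributes one fill/drain
// term, e.g. T = T(C1) + T(L1) + T(C2) + N*T(L2) + T(C3) when L2 is slowest.
public class PipelineCost {
    public static double totalTime(double[] stageTimes, int numPackets) {
        double sum = 0, max = 0;
        for (double t : stageTimes) {
            sum += t;
            max = Math.max(max, t);
        }
        return sum - max + numPackets * max; // bottleneck counted N times
    }

    public static void main(String[] args) {
        // stages C1, L1, C2, L2, C3 with L2 (time 5) the bottleneck, N = 10
        double[] stages = {1, 2, 3, 5, 1};
        System.out.println(PipelineCost.totalTime(stages, 10)); // 7 + 10*5 = 57.0
    }
}
```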

  19. Filter Decomposition • Goal: find a mapping Li → bj, where 1 ≤ i ≤ m-1 and 1 ≤ j ≤ n, that minimizes the predicted execution time • Intuitively, inserting candidate filter boundary bj between computing units Ci and Ci+1 splits the n+1 code fragments f1, …, fn+1 among the m computing units • Exhaustive search is possible but expensive

  20. Filter Decomposition: A Greedy Algorithm • To minimize the predicted execution time, consider each link in turn • For link L1, estimate the cost of placing the cut at each candidate boundary: L1 to b1: T1, L1 to b2: T2, L1 to b3: T3, L1 to b4: T4 • Choose the minimum, e.g. Min{T1, …, T4} = T2, then proceed to the next link
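The greedy choice above can be sketched as follows. The cost model (stage time = work/power for units, volume/bandwidth for links, throughput limited by the slowest stage) and all numeric inputs are assumptions for illustration, not the paper's exact algorithm:

```java
import java.util.ArrayList;
import java.util.List;

// Greedy decomposition sketch: fragments f1..f_{n+1} are separated by
// candidate boundaries b1..bn, and each link is assigned a cut left to
// right by picking the candidate that minimizes the predicted bottleneck
// stage time.
public class GreedyCuts {
    // fragWork: n+1 fragment costs; commVol: n boundary volumes;
    // power: m unit speeds; bw: m-1 link bandwidths. Returns chosen cuts.
    public static List<Integer> decompose(double[] fragWork, double[] commVol,
                                          double[] power, double[] bw) {
        int m = power.length, n = commVol.length;
        List<Integer> cuts = new ArrayList<>();
        int prev = 0;
        for (int i = 0; i < m - 1; i++) {
            int best = -1;
            double bestT = Double.MAX_VALUE;
            // leave enough later boundaries for the remaining links
            for (int j = prev; j <= n - (m - 1 - i); j++) {
                double t = predict(fragWork, commVol, power, bw, cuts, j);
                if (t < bestT) { bestT = t; best = j; }
            }
            cuts.add(best);
            prev = best + 1;
        }
        return cuts;
    }

    // Bottleneck time if the next cut is boundary j, lumping all fragments
    // past j onto the next computing unit.
    static double predict(double[] w, double[] v, double[] p, double[] b,
                          List<Integer> cuts, int j) {
        List<Integer> trial = new ArrayList<>(cuts);
        trial.add(j);
        double bottleneck = 0;
        int start = 0;
        for (int u = 0; u < trial.size(); u++) {
            int cut = trial.get(u);
            double work = 0;
            for (int k = start; k <= cut; k++) work += w[k];
            bottleneck = Math.max(bottleneck, work / p[u]);   // unit u
            bottleneck = Math.max(bottleneck, v[cut] / b[u]); // link u
            start = cut + 1;
        }
        double rest = 0;
        for (int k = start; k < w.length; k++) rest += w[k];
        return Math.max(bottleneck, rest / p[trial.size()]);  // last unit
    }

    public static void main(String[] args) {
        // two units, one link: cutting at the second boundary avoids
        // sending the large volume at the first one
        List<Integer> cuts = decompose(new double[]{1, 1, 4}, // f1, f2, f3
                                       new double[]{10, 1},   // vol at b1, b2
                                       new double[]{1, 1},    // unit powers
                                       new double[]{1});      // link bandwidth
        System.out.println("cut boundaries (0-based): " + cuts); // [1]
    }
}
```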

  21. Code Generation • Abstraction of the work each filter does • Read in a buffer of data from input stream • Iterate over the set of data • Write out the results to output stream • Code generation issues • How to get the Cons(b) from the input stream --- unpacking data • How to organize the output data for the successive filter --- packing data
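The read–iterate–write abstraction above can be sketched with queues standing in for streams. FilterSkeleton, the EOS marker, and the squaring workload are all hypothetical; this is not DataCutter's real interface:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal shape of a generated filter: take a buffer from the input stream,
// iterate over its elements, pack the results for the successor filter.
public class FilterSkeleton {
    static final int[] EOS = new int[0]; // end-of-stream marker

    public static void run(BlockingQueue<int[]> in, BlockingQueue<int[]> out) {
        try {
            int[] buf;
            while ((buf = in.take()) != EOS) {       // unpack incoming buffer
                int[] result = new int[buf.length];
                for (int k = 0; k < buf.length; k++) // iterate over the data
                    result[k] = buf[k] * buf[k];
                out.put(result);                     // pack for the next filter
            }
            out.put(EOS);                            // forward end-of-stream
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        BlockingQueue<int[]> in = new LinkedBlockingQueue<>();
        BlockingQueue<int[]> out = new LinkedBlockingQueue<>();
        in.offer(new int[]{1, 2, 3});
        in.offer(EOS);
        run(in, out);
        System.out.println(java.util.Arrays.toString(out.poll())); // [1, 4, 9]
    }
}
```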

  22. Experimental Results • Goal • To show that compiler-generated code is efficient • Environment settings • 700 MHz Pentium machines • Connected through Myrinet LANai 7.0 • Configurations (# data sites – # computing sites – user machine) • 1-1-1 • 2-2-1 • 4-4-1

  23. Experimental Results • Versions • Default version • Site hosting the data only reads and transmits data, no processing at all • User's desktop only views the results, no processing at all • All the work is done by the compute nodes (their workload is heavy and the communication volume is high) • Compiler-generated version • Intelligent decomposition is done by the compiler • More computations are performed on the end nodes to reduce the communication volume (workload is balanced across the nodes and the communication volume is reduced) • Manual version • Hand-written DataCutter filters with a similar decomposition to the compiler-generated version

  24. Experimental Results: ISO-Surface Rendering (Z-Buffer Based) • Small dataset (150M): speedups of 1.92 and 3.34 as the width of the pipeline grows • Large dataset (600M): speedups of 1.99 and 3.82 • 20% improvement over the default version [charts: execution time vs. width of pipeline]

  25. Experimental Results: ISO-Surface Rendering (Active Pixel Based) • Small dataset (150M) and large dataset (600M): speedup close to linear as the width of the pipeline grows • >15% improvement over the default version [charts: execution time vs. width of pipeline]

  26. Experimental Results: KNN • K = 3 (108M): speedups of 1.89 and 3.38 • K = 200 (108M): speedups of 1.87 and 3.82 • >150% improvement over the default version [charts: execution time vs. width of pipeline]

  27. Experimental Results: Virtual Microscope • Small query (800M, 512×512) • Large query (800M, 2048×2048) • ≈40% improvement over the default version [charts: execution time vs. width of pipeline]

  28. Experimental Results • Summary • The compiler-decomposed versions achieve an improvement between 10% and 150% over default versions • In most cases, increasing the width of the pipeline results in near-linear speedup • Compared with the manual version, the compiler-decomposed versions are generally quite close

  29. Ongoing and Future Work • Buffer size optimization • Cost model refinement & implementation • More applications • More realistic environment settings: resources dynamically available

  30. Conclusions • Coarse-Grained Pipelined Parallelism is desirable & feasible • Coarse-Grained Pipelined Parallelism needs language & compiler support • An algorithm for required communication analysis is given • A greedy algorithm for filter decomposition is developed • A cost model is designed • Results of detailed evaluation of our compiler are encouraging
