
Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism

Wei Du, Gagan Agrawal, Ohio State University


Presentation Transcript


  1. Filter Decomposition for Supporting Coarse-grained Pipelined Parallelism Wei Du, Gagan Agrawal Ohio State University

  2. Distributed Data-Intensive Applications • Fast growing datasets • Remote data access • Distributed data storage • More connected world [Figure: data repositories connected through the Internet]

  3. Implementation: Local processing • Requirements: Huge Storage/Powerful Computer/Fast Connection [Figure: data repositories connected through the Internet]

  4. Implementation: Remote processing • Requirements: Complex Analysis at Data Centers [Figure: data repositories connected through the Internet]

  5. A Practical Solution • Our hypothesis • Coarse-grained pipelined execution model is a good match [Figure: data sources connected through the Internet]

  6. Coarse-Grained Pipelined Execution • Definition • Computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units • Example: K-nearest Neighbor (KNN) • Given a 3-D range R = <(x1, y1, z1), (x2, y2, z2)> and a point p = (a, b, c), find the K nearest neighbors of p within R • Stages: Range_query, Find the K-nearest neighbors
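The two KNN stages can be sketched as a small pipeline. This is only an illustration under assumptions: the function names `range_query` and `k_nearest`, the packet layout (lists of 3-D tuples), and the use of Python generators to stand in for separate computing units are all hypothetical and not taken from the slides.

```python
import heapq
import math


def range_query(packets, lo, hi):
    """Stage 1: keep only the points of each packet that fall inside the range R = <lo, hi>."""
    for packet in packets:
        yield [pt for pt in packet
               if all(l <= c <= h for l, c, h in zip(lo, pt, hi))]


def k_nearest(filtered_packets, p, k):
    """Stage 2: maintain the K points nearest to p while packets stream in."""
    best = []
    for packet in filtered_packets:
        # Merge the running result with the new packet and keep the K closest.
        best = heapq.nsmallest(k, best + packet, key=lambda q: math.dist(p, q))
    return best


# Hypothetical input: two packets of 3-D points.
packets = [[(0, 0, 0), (5, 5, 5)], [(1, 1, 1), (2, 2, 2), (9, 9, 9)]]
neighbors = k_nearest(range_query(packets, (0, 0, 0), (5, 5, 5)), (1, 1, 0), 2)
```

Because the first stage yields one packet at a time, the second stage can start working before the whole dataset has been filtered, which is the point of the pipelined model.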

  7. Challenges • Computation associated with an application needs to be decomposed into stages • Decomposition decisions are dependent on the execution environment • Generating code for each stage (SC03) • Other performance issues for the pipelined execution (ICPP04) • Adapting to the dynamic execution environment (SC04)

  8. RoadMap • Filter Decomposition Problem • MIN_ONETRIP Algorithm • MIN_BOTTLENECK Algorithm • MIN_TOTAL Algorithm • Experimental Results • Related Work • Conclusion

  9. Filter Decomposition [Figure: atomic filters f1, f2, …, fn-1, fn are mapped onto a computation pipeline of computing units C1, C2, …, Cm-1, Cm connected by links L1, …, Lm-1; example placements group consecutive filters on one unit, such as f1, f2 on C1 and f3 - f6 on C2]

  10. Filter Decomposition • Goal: Find a placement p(f1, f2, …, fn) = (F1, F2, …, Fm), where Fi = f_i1, f_i1+1, …, f_ik (1 ≤ i ≤ m; 1 ≤ i1, ik ≤ n), such that the predicted execution time is minimal [Figure: atomic filters f1, …, fn and the computation pipeline C1, L1, C2, …, Cm-1, Lm-1, Cm]

  11. Cost Model • Bottleneck stage: the b-th stage, the slowest stage in the pipeline • Execution time for N packets, with C2 as the bottleneck: T = T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3) = ∑i Ti + (N-1)*Tb [Figure: example pipeline C1, L1, C2, L2, C3 carrying filters f1 to f4]
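A minimal sketch of this cost model, assuming each stage (computing unit or link) is described by a single per-packet time and that N is the number of packets. The function name `pipeline_time` and the sample numbers are hypothetical. The expanded form on the slide, T(C1) + T(L1) + N*T(C2) + T(L2) + T(C3), charges every stage once and the bottleneck N times in total, which is what the code computes.

```python
def pipeline_time(stage_times, num_packets):
    """Predicted execution time: every stage once, plus (N - 1) extra passes
    through the slowest (bottleneck) stage."""
    bottleneck = max(stage_times)
    return sum(stage_times) + (num_packets - 1) * bottleneck


# Hypothetical per-packet times for C1, L1, C2, L2, C3; C2 is the bottleneck.
total = pipeline_time([2, 1, 5, 1, 3], num_packets=10)
```

With one packet the result is the plain sum of stage times; as N grows, the bottleneck term dominates.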

  12. Three Algorithms • Cost model: T = ∑i Ti + (N-1)*Tb • MIN_ONETRIP Algorithm • dynamic programming algorithm • to minimize ∑i Ti • MIN_BOTTLENECK Algorithm • dynamic programming algorithm • to minimize Tb • MIN_TOTAL Algorithm • greedy algorithm • tries to minimize T

  13. Filter Decomposition: MIN_ONETRIP • Goal: minimize time spent by one packet on the pipeline [Figure: placing fn-1 and fn on the last units Cm-2, Lm-2, Cm-1, Lm-1, Cm of the pipeline]

  14. Filter Decomposition: MIN_ONETRIP • T[i,j]: min cost of doing computations f1, …, fi on computing units C1, …, Cj, where the results of fi are on Cj • T[i,j] = min{ T[i-1,j] + Cost_comp(P(Cj), Task(fi)), T[i,j-1] + Cost_comm(B(Lj-1), Vol(fi)) } • Goal: T[n,m] • Cost: O(mn) [Figure: pipeline units Cm-2, Lm-2, Cm-1, Lm-1, Cm]
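The recurrence for T[i,j] can be sketched as a table-filling loop. Several details below are assumptions, since the slide does not give them: Cost_comp is taken as Task(fi) / P(Cj), Cost_comm as Vol(fi) / B(Lj-1), and the raw input is assumed to live on C1, so f1 always runs there. The function name `min_onetrip` and its list-based parameters are hypothetical.

```python
from math import inf


def min_onetrip(task, vol, power, bandwidth):
    """task[i-1], vol[i-1]: work and output volume of filter f_i.
    power[j-1]: computing power of C_j; bandwidth[j-1]: bandwidth of L_j.
    Returns T[n][m], the minimum time one packet spends on the pipeline."""
    n, m = len(task), len(power)
    # T[i][j]: min cost of running f_1..f_i on C_1..C_j with f_i's result on C_j.
    T = [[inf] * (m + 1) for _ in range(n + 1)]
    T[0][1] = 0.0  # assumption: the unprocessed input starts on C_1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Option 1: f_i is computed on C_j.
            compute_here = T[i - 1][j] + task[i - 1] / power[j - 1]
            # Option 2: f_i's result arrives from C_(j-1) over link L_(j-1).
            ship_result = inf
            if j > 1:
                ship_result = T[i][j - 1] + vol[i - 1] / bandwidth[j - 2]
            T[i][j] = min(compute_here, ship_result)
    return T[n][m]


# Hypothetical two-filter, two-unit instance.
one_trip = min_onetrip(task=[4, 2], vol=[2, 1], power=[1, 4], bandwidth=[1])
```

Each of the (n+1)(m+1) table cells is filled in constant time, which matches the O(mn) cost stated on the slide.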

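  15. Filter Decomposition: MIN_BOTTLENECK • Goal: minimize time spent at the bottleneck stage [Figure: candidate splits of f1, …, fn between the earlier units and Cm, from placing only fn on Cm up to placing f2 … fn on Cm; pipeline units Cm-2, Lm-2, Cm-1, Lm-1, Cm]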

  16. Filter Decomposition: MIN_BOTTLENECK • N[i,j]: min cost of bottleneck stage for computing f1, …, fi on computing units C1, …, Cj, where the results of fi are on Cj • N[i,j] = min{ max{ N[i,j-1], Cost_comm(B(Lj-1), Vol(fi)) }, max{ N[i-1,j-1], Cost_comm(B(Lj-1), Vol(fi-1)), Cost_comp(P(Cj), Task(fi)) }, …, max{ N[1,j-1], Cost_comm(B(Lj-1), Vol(f1)), Cost_comp(P(Cj), Task(f2) + … + Task(fi)) } } • Cost: O(mn^2)
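The recurrence for N[i,j] can be sketched in the same style. As before, the cost functions are assumptions (Cost_comp as total task work divided by P(Cj), Cost_comm as Vol divided by B(Lj-1)), and the base case N[i,1], where all of f1, …, fi run on C1, is filled in by assumption because the slide does not state it. The name `min_bottleneck` and its parameters are hypothetical.

```python
from math import inf


def min_bottleneck(task, vol, power, bandwidth):
    """task[i-1], vol[i-1]: work and output volume of filter f_i.
    power[j-1]: computing power of C_j; bandwidth[j-1]: bandwidth of L_j.
    Returns N[n][m], the smallest achievable bottleneck stage time."""
    n, m = len(task), len(power)
    prefix = [0.0]  # prefix[i] = Task(f_1) + ... + Task(f_i)
    for t in task:
        prefix.append(prefix[-1] + t)
    N = [[inf] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        N[i][1] = prefix[i] / power[0]  # assumed base case: f_1..f_i all on C_1
    for j in range(2, m + 1):
        link = bandwidth[j - 2]  # B(L_(j-1))
        for i in range(1, n + 1):
            best = inf
            # f_1..f_k finish on C_1..C_(j-1); f_(k+1)..f_i run on C_j.
            for k in range(1, i + 1):
                comm = vol[k - 1] / link
                comp = (prefix[i] - prefix[k]) / power[j - 1]
                best = min(best, max(N[k][j - 1], comm, comp))
            N[i][j] = best
    return N[n][m]


# Hypothetical two-filter, two-unit instance.
slowest = min_bottleneck(task=[4, 2], vol=[2, 1], power=[1, 2], bandwidth=[1])
```

The inner loop over k is what raises the cost from O(mn) to the O(mn^2) quoted on the slide.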

  17. Filter Decomposition: MIN_TOTAL • To minimize the predicted execution time T • Estimated cost of each candidate group for C1: f1 : T1 • f1, f2 : T2 • f1 - f3 : T3 • f1 - f4 : T4 • Min{T1 … T4} = T2, so f1, f2 are placed on C1 [Figure: pipeline C1, L1, C2, L2, C3, L3, C4 with atomic filters f1 to f5]
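The greedy step on this slide (estimate T for each candidate group of filters on the current unit and keep the cheapest) can be sketched as follows. The slides do not say how T is estimated for a partial placement, so this sketch makes loud assumptions: the not-yet-placed filters are provisionally put on the last unit, f1 stays on C1, fn stays on Cm, and stage costs are work divided by power and volume divided by bandwidth. All names (`predicted_time`, `min_total_greedy`) are hypothetical.

```python
from math import inf


def predicted_time(counts, task, vol, power, bandwidth, packets):
    """T = sum of stage times + (packets - 1) * bottleneck time, where
    counts[j] is the number of consecutive filters placed on unit j."""
    stage_times, done = [], 0
    for j, c in enumerate(counts):
        stage_times.append(sum(task[done:done + c]) / power[j])
        done += c
        if j < len(counts) - 1:
            # The link forwards the output of the last filter run so far.
            stage_times.append(vol[done - 1] / bandwidth[j])
    return sum(stage_times) + (packets - 1) * max(stage_times)


def min_total_greedy(task, vol, power, bandwidth, packets):
    """Greedy placement sketch; assumes at least two filters when m >= 2."""
    n, m = len(task), len(power)
    chosen, remaining = [], n
    for j in range(m - 1):
        lowest = 1 if j == 0 else 0  # assumption: f_1 always runs on C_1
        best_c, best_t = None, inf
        for c in range(lowest, remaining):  # always leave f_n for C_m
            # Provisionally place everything still unassigned on the last unit.
            trial = chosen + [c] + [0] * (m - j - 2) + [remaining - c]
            t = predicted_time(trial, task, vol, power, bandwidth, packets)
            if t < best_t:
                best_c, best_t = c, t
        chosen.append(best_c)
        remaining -= best_c
    chosen.append(remaining)
    return chosen


# Hypothetical three-filter, two-unit instance with a single packet.
placement = min_total_greedy([4, 2, 2], [2, 1, 1], [1, 4], [1], packets=1)
```

The result lists how many consecutive filters each unit receives, so `[1, 2]` means f1 on C1 and f2, f3 on C2. Being greedy, the procedure fixes each unit's group before looking at later units and may miss the global optimum.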

  18. RoadMap • Filter Decomposition Problem • MIN_ONETRIP Algorithm • MIN_BOTTLENECK Algorithm • MIN_TOTAL Algorithm • Experimental Results • Related Work • Conclusion

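  19. Experimental Results • 4 Configurations • 3 Applications • Virtual Microscope • Iso-Surface Rendering [Figure: diagrams of the four configurations, labeled with values between 0.001 and 1; the assignment of values to configurations is not recoverable from the transcript]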

  20. Used Applications • Virtual Microscope (Vmscope) • an emulation of a microscope • input: a rectangular region, a resolution value • output: portion of the original image with certain resolution

  21. Experimental Results: Virtual Microscope • 3 queries • Q1 : 1 packet • Q2 : 4 packets • Q3 : 4500 packets • 4 Algorithms • MIN_ONETRIP • MIN_BOTTLENECK • MIN_TOTAL • Exhaustive_Search

  22. Experimental Results: Virtual Microscope [Figure: charts with y-axis Execution Time (in ms) and x-axis Application]

  23. Experimental Results: Virtual Microscope • Two observations • The performance difference between the algorithms is small • Exhaustive_Search does not always give the best placement • filter characteristics are based on information from one packet • combining two filters into one saves copying cost

  24. Used Applications • Iso-surface rendering (Iso) • input: a 3-D grid, a scalar value, a view screen with angle specified • output: a surface seen from certain angle, which captures points in the grid whose scalar value matches the given iso-surface value

  25. Experimental Results: Iso • 2 Implementations • ZBUF • ACTP • 2 Datasets • small : 3 packets • large : 47 packets • 4 Algorithms • MIN_ONETRIP • MIN_BOTTLENECK • MIN_TOTAL • Exhaustive_Search

  26. Experimental Results: Iso [Figure: two charts, Small dataset and Large dataset, with y-axis Execution Time (in ms) and x-axis Application]

  27. Experimental Results: Iso • The MIN_TOTAL algorithm gives the best placement for the small dataset • The MIN_ONETRIP algorithm finds the best placement for the large dataset • This application is very data-dependent!

  28. Experimental Results: Iso [Figure: two charts, ZBUF and ACTP, with y-axis Execution Time (in ms) and x-axis Number of Runs]

  29. Conclusion & Future Work • Our algorithms perform quite well • Future Work • To find more accurate characteristics of applications • estimate of the performance change resulting from combining multiple atomic filters • estimate of the impact of data dependence

  30. Thank you !!!
