On-line adaptative parallel prefix computation

On-line adaptative parallel prefix computation. Jean-Louis Roch, Daouda Traore, Julien Bernard. INRIA-CNRS Moais team - LIG Grenoble, France. Contents: I. Motivation; II. Work-stealing scheduling of parallel algorithms; III. Processor-oblivious parallel prefix computation.


Presentation Transcript


  1. On-line adaptative parallel prefix computation
     Jean-Louis Roch, Daouda Traore, Julien Bernard
     INRIA-CNRS Moais team - LIG Grenoble, France
     Contents
     • I. Motivation
     • II. Work-stealing scheduling of parallel algorithms
     • III. Processor-oblivious parallel prefix computation
     EUROPAR'2006 - Dresden, Germany - August 29th, 2006

  2. Sequential algorithm: • for ([0] = a[0], i = 1 ; i <= n; i++ ) [ i ] = [ i – 1 ] * a [ i ] ; performs only noperations • Fine grain optimal parallel algorithm : • [Ladner-Fisher-81] Critical time= 2. log n but performs 2.n ops Parallel requires twice more operations than sequential !! • Tight lower bound on p identical processors: Optimal time Tp = 2n / (p+1) but performs 2.n.p/(p+1) ops • [Nicolau&al. 1996] Parallel prefix on fixed architecture • Prefix problem : • input : a0, a1, …, an • output : 1, …, n with
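
The gap between the two operation counts on this slide can be checked on a small example. The sketch below is not the Ladner-Fischer circuit itself; it assumes a simpler pairwise recursive scheme (combine neighbours, recurse on the half-size problem, expand) and runs it sequentially only to count operations. The names `sequential_prefix` and `recursive_prefix` are illustrative.

```python
from operator import mul


def sequential_prefix(a, op=mul):
    """Return [a0, a0*a1, ..., a0*...*a_last] using len(a) - 1 applications of op."""
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out


def recursive_prefix(a, op=mul, counter=None):
    """Pairwise recursive prefix; returns (prefixes, number of op applications).

    The pairing step and the expansion step each consist of independent
    operations, so each could be one parallel step. They are executed
    sequentially here so that the operations can be counted.
    """
    if counter is None:
        counter = [0]
    n = len(a)
    if n == 1:
        return list(a), counter[0]
    # Step 1: combine adjacent pairs.
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    counter[0] += len(pairs)
    # Step 2: solve the half-size prefix problem.
    sub, _ = recursive_prefix(pairs, op, counter)
    # Step 3: expand. Odd positions are read from sub, even positions cost one op.
    out = [a[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], a[i]))
            counter[0] += 1
    return out, counter[0]
```

On 1024 inputs the sequential loop applies `op` 1023 times, while this recursive scheme applies it 2036 times, i.e. roughly twice as often, in exchange for a logarithmic number of parallel steps.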

  3. The problem
     To design a single algorithm that efficiently computes prefix(a) on an arbitrary dynamic architecture.
     [Figure: candidate algorithms (sequential algorithm, parallel P=2, parallel P=100, …, parallel P=max) and target platforms (heterogeneous network, multi-user SMP server, grid): which algorithm to choose?]
     Dynamic architecture: non-fixed number of resources and variable speeds, e.g. a grid, but also an SMP server in multi-user mode.

  4. Lower bound for prefix on processors with changing speeds
     • Model of heterogeneous processors with changing speeds [Bender&al 02]:
       Πi(t) = instantaneous speed of processor i at time t (in operations per second)
       Assumption: Πmax(t) < constant · Πmin(t)
       Definition: Πave = average speed per processor for a computation of duration T
     • Theorem 2: lower bound for the time of prefix computation on p processors with changing speeds [formula lost in extraction]
       Sketch of the proof:
       - extension of the lower bound on p identical processors [Faith82]
       - based on the analysis of the number of performed operations

  5. Changing speeds and work-stealing
     « Work » W1 = total number of operations performed
     « Depth » W∞ = number of operations on a critical path (parallel time on ∞ resources)
     • A work-stealing schedule adapts on-line to processor availability and speeds [Bender-02]
     • Principle of work-stealing = "greedy" schedule, but distributed and randomized
     • Each processor locally manages the tasks it creates
     • When idle, a processor steals the oldest ready task from a remote, non-idle victim processor chosen at random [Bender-Rabin02]
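
The stealing rule on this slide can be illustrated with a toy, round-based simulation. This is only a sketch under simplifying assumptions that are not in the slides: all tasks are independent, take one round each, and start on processor 0. The function name `simulate_work_stealing` and its parameters are hypothetical.

```python
import random
from collections import deque


def simulate_work_stealing(tasks, p, seed=0):
    """Toy simulation of work-stealing; returns (executed_by, steals).

    Each round, every processor that holds tasks runs the newest task of its
    own deque; every processor that was idle at the start of the round picks
    a random victim and, if the victim holds tasks, steals its oldest one.
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].extend(tasks)            # all work starts on processor 0
    executed_by = {}
    steals = 0
    while any(deques):
        idle = [proc for proc in range(p) if not deques[proc]]
        # Busy processors work on their own newest task (local end of the deque).
        for proc in range(p):
            if deques[proc]:
                task = deques[proc].pop()
                executed_by[task] = proc
        # Idle processors try to steal the oldest task of a random victim.
        for proc in idle:
            victim = rng.choice([q for q in range(p) if q != proc])
            if deques[victim]:
                deques[proc].append(deques[victim].popleft())
                steals += 1
    return executed_by, steals
```

Calling `simulate_work_stealing(list(range(100)), 4)` returns a mapping from each task to the processor that ran it, together with the number of successful steals. A real scheduler would also need synchronization on the deques, which this sequential simulation sidesteps.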

  6. Work-stealing and adaptation
     « Work » W1 = total number of operations performed
     « Depth » W∞ = number of operations on a critical path (parallel time on ∞ resources)
     • Interest: if W1 is fixed and W∞ is small, the schedule is near-optimal and adaptive, with good probability, on p processors with average speed Πave
     • Moreover: #steals = #task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
     • But lower bounds for prefix:
       • Minimal work W1 = n ⇒ W∞ = n
       • Minimal depth W∞ < 2 log n ⇒ W1 > 2n
     • With work-stealing, how to reach the lower bound?

  7. How to get both work W1 and depth W∞ small?
     • General approach: couple two algorithms:
       • a sequential algorithm with an optimal number of operations Ws
       • a fine-grain parallel algorithm with minimal critical time W∞, but parallel work >> Ws
     • Folk technique: parallel, then sequential
       • Run the parallel algorithm down to a certain « grain », then use the sequential one
       • Drawback with changing speeds: either too many idle processors or too many operations
     • Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel
       Cascading [Jaja92] = careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
       • Use the work-optimal sequential algorithm to reduce the size
       • Then use the time-optimal parallel algorithm to decrease the time
       Drawback: sequential at coarse grain and parallel at fine grain
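
One common way to combine a sequential algorithm with a fixed « grain » is a two-level blocked prefix. The sketch below is an illustration of that general idea, not the specific constructions cited on the slide; the function name `blocked_prefix` and the `grain` parameter are assumptions, and the input is assumed to be non-empty.

```python
from operator import add


def blocked_prefix(a, grain, op=add):
    """Two-level prefix of a non-empty list with a fixed block size `grain`."""
    blocks = [a[i:i + grain] for i in range(0, len(a), grain)]
    # Phase 1: reduce each block sequentially (blocks are independent of each other).
    totals = []
    for block in blocks:
        total = block[0]
        for x in block[1:]:
            total = op(total, x)
        totals.append(total)
    # Phase 2: prefix over the few block totals; a fine-grain parallel
    # prefix could be substituted here.
    offsets = [totals[0]]
    for total in totals[1:]:
        offsets.append(op(offsets[-1], total))
    # Phase 3: sequential prefix inside each block, seeded with the
    # prefix of everything that precedes the block.
    out = []
    for k, block in enumerate(blocks):
        acc = None if k == 0 else offsets[k - 1]
        for x in block:
            acc = x if acc is None else op(acc, x)
            out.append(acc)
    return out
```

The choice of `grain` is exactly the difficulty the slide points at: it is fixed before the run, so with changing speeds a large grain leaves processors idle and a small grain inflates the number of operations.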

  8. Alternative: concurrently sequential and parallel
     Based on work-stealing and the work-first principle: always execute a sequential algorithm to reduce parallelism overhead
     • use the parallel algorithm only if a processor becomes idle (i.e. work-stealing), by extracting parallelism from the sequential computation (i.e. adaptive granularity)
     Hypothesis: two algorithms
     • 1 sequential: SeqCompute
     • 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
     • Self-adaptive granularity based on work-stealing
     [Figure labels: SeqCompute, SeqCompute, Extract_par, LastPartComputation]
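
The idea of extracting parallelism from the remaining part of a sequential computation can be sketched with an index interval that the owner consumes from the front while a thief detaches its last part. This is a hypothetical illustration: the class `Interval` and its methods `extract_seq` and `extract_par` are names chosen here (loosely echoing the figure labels), the half-split rule is an assumption, and a real implementation would need synchronization between owner and thief.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    begin: int  # first index not yet processed by the owner
    end: int    # one past the last index owned

    def remaining(self):
        return self.end - self.begin

    def extract_seq(self):
        """Owner side: take the next index to process, or None when finished."""
        if self.begin >= self.end:
            return None
        i = self.begin
        self.begin += 1
        return i

    def extract_par(self, min_grain=1):
        """Thief side: detach the last half of the remaining work.

        Returns the stolen Interval, or None if too little work is left.
        """
        if self.remaining() < 2 * min_grain:
            return None
        mid = self.begin + (self.remaining() + 1) // 2
        stolen = Interval(mid, self.end)
        self.end = mid
        return stolen
```

For example, starting from `Interval(0, 12)`, after the owner has taken index 0 a call to `extract_par()` detaches `Interval(7, 12)` and leaves the owner with indices 1 to 6. The granularity is therefore decided by when steals happen, not by a parameter fixed in advance.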

  9. Alternative: concurrently sequential and parallel
     [Figure labels: SeqCompute, SeqCompute, preempt]

  10. Alternative: concurrently sequential and parallel
      [Figure labels: SeqCompute, merge/jump, SeqCompute, Seq complete]

  11. Adaptive Prefix on 3 processors
      [Figure: the main sequential process holds a1 … a12; work-stealer 1 sends a steal request; work-stealer 2 is idle. Legend: sequential / parallel.]

  12. Adaptive Prefix on 3 processors
      [Figure: the main sequential process keeps a1 … a4 and computes the first prefixes; work-stealer 1 has extracted a5 … a12 and computes the local products a5*…*ai; work-stealer 2 sends a steal request.]

  13. Adaptive Prefix on 3 processors
      [Figure: the main sequential process has computed prefixes 1 to 4 and preempts work-stealer 1 (a5 … a8, local products a5*…*ai) to obtain prefix 8; work-stealer 2 holds a9 … a12 and computes the local products a9*…*ai.]

  14. Adaptive Prefix on 3 processors
      [Figure: from prefix 8, the main sequential process preempts work-stealer 2 (a9 … a12, local products a9*…*ai) to obtain prefix 11; the remaining prefixes 5, 6, 9 are still to be finalized in parallel.]

  15. Adaptive Prefix on 3 processors
      [Figure: the main sequential process holds prefixes 8 and 11 and takes a12 to compute prefix 12; work-stealers 1 and 2 finalize the prefixes of a5 … a8 and a9 … a11 from their local products.]

  16. Adaptive Prefix on 3 processors
      [Figure: final state; the main sequential process has produced prefixes 1, 2, 3, 4, 8, 11, 12; work-stealer 1 has finalized prefixes 5, 6, 7; work-stealer 2 has finalized prefixes 9, 10.]
      Implicit critical path on the sequential process

  17. Analysis of the algorithm
      • Theorem 3: execution time [formula lost in extraction; one of its terms is labelled "lower bound"]
      • Sketch of the proof: analysis of the operations performed:
        • The sequential main process performs S operations on one processor
        • The (p-1) work-stealers perform X = 2(n-S) operations with depth log X
        • Each non-constant-time task can potentially be split (variable speeds)
      The coupling ensures that both algorithms complete simultaneously: Ts = Tp - O(log X)
      => this makes it possible to bound the total number X of operations performed and the overhead of parallelism = (S+X) - #ops_optimal
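
As a consistency check of the counting argument above, consider the idealized case of p identical processors of unit speed that all finish at the same time T, neglecting the O(log X) term (these simplifications are assumptions made here, not part of the theorem). The main process then performs S = T operations and the p-1 work-stealers perform X = (p-1)T operations:

```latex
\begin{aligned}
S &= T, \qquad X = (p-1)\,T \\
X = 2(n - S) \;&\Longrightarrow\; (p-1)\,T = 2(n - T) \\
&\Longrightarrow\; T = \frac{2n}{p+1}, \qquad S + X = p\,T = \frac{2np}{p+1}
\end{aligned}
```

Both values agree with the optimal time 2n/(p+1) and the operation count 2np/(p+1) quoted for p identical processors on slide 2.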

  18. Adaptive prefix: experiments 1
      Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context
      [Figure: time (s) versus #processors; curves: pure sequential, optimal off-line on p procs, adaptive]
      Single-user context: processor-adaptive prefix achieves near-optimal performance:
      - close to the lower bound both on 1 processor and on p processors
      - less sensitive to system overhead: even better than the theoretically "optimal" off-line parallel algorithm on p processors

  19. Adaptive prefix: experiments 2
      Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context
      [Figure: time (s) versus #processors; curves: off-line parallel algorithm for p processors, adaptive]
      Multi-user context: additional external load: (9-p) external dummy processes are executed concurrently
      Processor-adaptive prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with off-line schedule

  20. Conclusion
      The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
      Application to prefix computation:
      - theoretically, reaches the lower bound on heterogeneous processors with changing speeds
      - practically, achieves near-optimal performance on multi-user SMPs
      Generic adaptive scheme to implement parallel algorithms with provable performance
      - work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]

  21. Interactive Distributed Simulation [B Raffin & E Boyer]
      - 5 cameras, 6 PCs
      - 3D-reconstruction + simulation + rendering
      -> Adaptive scheme to maximize 3D-reconstruction precision within fixed timestamp [L Suares, B Raffin, JL Roch]
      Thank you!

  22. The Prefix race: sequential / parallel fixed / adaptive
      [Figure: race between Sequential, Parallel 2 proc., Parallel 3 proc., Parallel 4 proc., Parallel 5 proc., Parallel 6 proc., Parallel 7 proc., Parallel 8 proc. and Adaptative 8 proc.]
      On each of the 10 executions, adaptive completes first

  23. Adaptive prefix: some experiments
      Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux)
      [Figure: two panels, time (s) versus #processors; curves: parallel, adaptive; one panel with external load]
      • Single-user context: adaptive is equivalent to:
        - sequential on 1 processor
        - optimal parallel-2 proc. on 2 processors
        - …
        - optimal parallel-8 proc. on 8 processors
      • Multi-user context: adaptive is the fastest: 15% benefit over a static-grain algorithm

  24. With * = double sum (r[i] = r[i-1] + x[i])
      Finest "grain" limited to 1 page = 16384 bytes = 2048 doubles
      [Figure panels: single user; processors with variable speeds]
      Remark for n = 4,096,000 doubles:
      - "pure" sequential: 0.20 s
      - minimal "grain" = 100 doubles: 0.26 s on 1 processor and 0.175 s on 2 processors (close to the lower bound)
