On-line adaptative parallel prefix computation

Jean-Louis Roch, Daouda Traore, Julien Bernard
INRIA-CNRS Moais team - LIG, Grenoble, France
EUROPAR’2006 - Dresden, Germany - August 29th, 2006
Contents
• I. Motivation
• II. Work-stealing scheduling of parallel algorithms
• III. Processor-oblivious parallel prefix computation

Parallel prefix on fixed architecture
• Prefix problem:
  - input: a0, a1, …, an
  - output: π1, …, πn with πi = a0 * a1 * … * ai
• Sequential algorithm:
  - for (π[0] = a[0], i = 1; i <= n; i++) π[i] = π[i-1] * a[i];
  - performs only n operations
• Fine-grain optimal parallel algorithm [Ladner-Fischer-81]:
  - critical time = 2 log n, but performs 2n operations
  - the parallel algorithm requires twice as many operations as the sequential one
• Tight lower bound on p identical processors [Nicolau&al. 1996]:
  - optimal time Tp = 2n / (p+1), but performs 2np/(p+1) operations
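The gap in operation counts between the two approaches can be checked on a small input. The sketch below is illustrative only: `sequential_prefix` mirrors the loop above, while `recursive_prefix` is a simple pairwise-recursive scheme in the same spirit as fine-grain parallel prefix (it is an assumption made for illustration, not the Ladner-Fischer circuit itself). Both function names are hypothetical, and the code runs sequentially; it only counts the operations a parallel execution would perform.

```python
from operator import mul

def sequential_prefix(a, op=mul):
    """Left-to-right prefix; returns (prefixes, number of op applications)."""
    pi = [a[0]]
    ops = 0
    for x in a[1:]:
        pi.append(op(pi[-1], x))
        ops += 1
    return pi, ops

def recursive_prefix(a, op=mul):
    """Pairwise-recursive prefix; returns (prefixes, number of op applications).

    Each phase consists of independent operations, so the depth is
    logarithmic, but the total number of operations is close to 2n.
    """
    n = len(a)
    if n == 1:
        return [a[0]], 0
    # Phase 1: combine adjacent pairs (all independent).
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    ops = len(pairs)
    # Recurse on the half-size problem: sub[k] = a[0] * ... * a[2k+1].
    sub, sub_ops = recursive_prefix(pairs, op)
    ops += sub_ops
    # Phase 2: odd positions are read from sub, even positions need one more op.
    out = [a[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], a[i]))
            ops += 1
    return out, ops

a = list(range(1, 17))            # 16 inputs
seq, seq_ops = sequential_prefix(a)
par, par_ops = recursive_prefix(a)
print(seq == par, seq_ops, par_ops)
```

On 16 inputs the sequential loop uses 15 operations and the recursive scheme 26, which illustrates the "about twice as many operations" cost of fine-grain parallelism.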

The problem
To design a single algorithm that efficiently computes prefix(a) on an arbitrary dynamic architecture.
Which algorithm to choose: the sequential algorithm, parallel P=2, …, parallel P=100, …, parallel P=max?
Target platforms: heterogeneous network, multi-user SMP server, grid.
Dynamic architecture: non-fixed number of resources, variable speeds; e.g. a grid, but also an SMP server in multi-user mode.

Lower bound for prefix on processors with changing speeds
- Model of heterogeneous processors with changing speeds [Bender&al 02]: Πi(t) = instantaneous speed of processor i at time t (in number of * operations per second)
- Assumption: Πmax(t) < constant · Πmin(t)
- Definition: Πave = average speed per processor for a computation of duration T
- Theorem 2: lower bound for the time of prefix computation on p processors with changing speeds
- Sketch of the proof: extension of the lower bound on p identical processors [Faith82], based on an analysis of the number of operations performed.
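To make the speed model concrete, the following sketch computes an average speed per processor and the largest max/min speed ratio from sampled speeds. It assumes a discretized model (speeds held constant over equal-length time slots), which is a simplification introduced here; the function names and the sample data are hypothetical.

```python
def average_speed(speeds):
    """speeds[i][k] = speed of processor i during time slot k (ops per second).

    With equal-length slots, the average speed per processor over the
    whole duration is the mean of all samples.
    """
    p = len(speeds)
    slots = len(speeds[0])
    total = sum(sum(row) for row in speeds)
    return total / (p * slots)

def max_speed_ratio(speeds):
    """Largest ratio (fastest / slowest processor) observed in any time slot."""
    slots = len(speeds[0])
    return max(
        max(row[k] for row in speeds) / min(row[k] for row in speeds)
        for k in range(slots)
    )

# Two processors observed over four time slots (made-up numbers).
speeds = [
    [2.0, 2.0, 4.0, 4.0],
    [1.0, 3.0, 1.0, 3.0],
]
print(average_speed(speeds))    # average speed per processor
print(max_speed_ratio(speeds))  # must stay below the model's constant
```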

Changing speeds and work-stealing
« Work » W1 = total number of operations performed
« Depth » W∞ = number of operations on a critical path (parallel time on ∞ resources)
• The work-stealing schedule adapts on-line to processor availability and speeds [Bender-02]
• Principle of work-stealing = “greedy” schedule, but distributed and randomized
  - Each processor locally manages the tasks it creates
  - When idle, a processor steals the oldest ready task from a remote, non-idle victim processor chosen at random [Bender-Rabin02]
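The stealing rule can be sketched with one double-ended queue per processor. This is a round-based, single-threaded simulation written for illustration: the `Worker` class and `run` function are hypothetical names, tasks are opaque labels that do not spawn children, and a real scheduler would need synchronization on the deques.

```python
import random
from collections import deque

class Worker:
    def __init__(self, name):
        self.name = name
        self.tasks = deque()   # oldest task on the left, newest on the right

    def push(self, task):
        self.tasks.append(task)

    def pop_local(self):
        # The owner works on its most recently created task.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the oldest ready task of the victim.
        return victim.tasks.popleft() if victim.tasks else None

def run(workers, rng):
    """Simulate rounds until no task is left; returns (worker, task, kind) records."""
    log = []
    while any(w.tasks for w in workers):
        for w in workers:
            task = w.pop_local()
            kind = "local"
            if task is None:
                # Idle: pick a random non-idle victim, if any.
                victims = [v for v in workers if v is not w and v.tasks]
                if not victims:
                    continue
                task = w.steal_from(rng.choice(victims))
                kind = "stolen"
            log.append((w.name, task, kind))
    return log

workers = [Worker("P0"), Worker("P1")]
for k in range(6):
    workers[0].push(f"t{k}")
for record in run(workers, random.Random(0)):
    print(record)
```

In this run P0 consumes its newest tasks while the idle P1 repeatedly steals the oldest ones, so every task is executed exactly once without a central scheduler.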

Work-stealing and adaptation
« Work » W1 = total number of operations performed
« Depth » W∞ = number of operations on a critical path (parallel time on ∞ resources)
• Interest: if W1 is fixed and W∞ is small, work-stealing gives a near-optimal adaptive schedule with good probability on p processors with average speed Πave
• Moreover: #steals = #task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
• But lower bounds for prefix:
  - Minimal work W1 = n implies W∞ = n
  - Minimal depth W∞ < 2 log n implies W1 > 2n
• With work-stealing, how to reach the lower bound?

How to get both work W1 and depth W∞ small?
• General approach: couple two algorithms:
  - a sequential algorithm with an optimal number of operations Ws
  - a fine-grain parallel algorithm with minimal critical time W∞ but parallel work >> Ws
• Folk technique: parallel, then sequential
  - Run the parallel algorithm down to a certain « grain », then use the sequential one
  - Drawback with changing speeds: either too many idle processors or too many operations
• Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel. Cascading [Jaja92] = careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
  - Use the work-optimal sequential algorithm to reduce the size
  - Then use the time-optimal parallel algorithm to decrease the time
  - Drawback: sequential at coarse grain and parallel at fine grain

Alternative: concurrently sequential and parallel
Based on work-stealing and the work-first principle: always execute a sequential algorithm to reduce parallelism overhead
• Use the parallel algorithm only if a processor becomes idle (i.e. work-stealing), by extracting parallelism from a sequential computation (i.e. adaptive granularity)
Hypothesis: two algorithms:
  - 1 sequential: SeqCompute
  - 1 parallel: LastPartComputation; at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
• Self-adaptive granularity based on work-stealing
(Figure: SeqCompute, Extract_par, LastPartComputation, SeqCompute)
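One way to picture "extracting parallelism from the remaining computations" is a work descriptor that the owner consumes from the front while a thief may cut off the last part. The sketch below is an assumption-laden illustration: the `Interval` class, the `extract_seq` method and the halving policy are hypothetical choices (only the name `extract_par` echoes the figure label), and a real implementation would protect the descriptor against concurrent access.

```python
class Interval:
    """Remaining work [begin, end) of a sequential computation."""

    def __init__(self, begin, end):
        self.begin = begin
        self.end = end

    def remaining(self):
        return self.end - self.begin

    def extract_seq(self):
        """Owner side: take the next index to process, or None when done."""
        if self.begin >= self.end:
            return None
        i = self.begin
        self.begin += 1
        return i

    def extract_par(self, min_grain=1):
        """Thief side: cut off the last half of the remaining work.

        Returns a new Interval for the thief, or None if too little is left.
        """
        size = self.remaining()
        if size < 2 * min_grain:
            return None
        mid = self.end - size // 2
        stolen = Interval(mid, self.end)
        self.end = mid
        return stolen

work = Interval(1, 13)            # indices 1..12
first = work.extract_seq()        # owner processes index 1
stolen = work.extract_par()       # an idle processor takes the last part
print(first, (work.begin, work.end), (stolen.begin, stolen.end))
```

The granularity is never fixed in advance: work is split only when a steal actually happens, so a run without idle processors stays purely sequential.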

Alternative: concurrently sequential and parallel (figure, continued: SeqCompute is preempted)

Alternative: concurrently sequential and parallel (figure, continued: merge/jump, then Seq complete)

Adaptive Prefix on 3 processors (animated figure; rows: Main Seq., Work-stealer 1, Work-stealer 2; inputs a1 … a12)
• Main Seq. starts the sequential prefix on a1 … a12; both work-stealers send steal requests.
• Work-stealer 1 takes a5 … a12 and computes the local products a5*…*ai; Main Seq. keeps a1 … a4 and computes π1 … π4.
• Work-stealer 2 takes a9 … a12 and computes the local products a9*…*ai; Work-stealer 1 keeps a5 … a8.
• Main Seq. preempts Work-stealer 1 and jumps from π4 to π8.
• Main Seq. preempts Work-stealer 2 at index 11, jumps from π8 to π11, then computes π12 from a12.
• The work-stealers complete π5 … π7 and π9 … π10 from their local products.
• The critical path is implicit on the sequential process.

Analysis of the algorithm
• Theorem 3: execution time, compared with the lower bound
• Sketch of the proof: analysis of the operations performed:
  - The sequential main performs S operations on one processor
  - The (p-1) work-stealers perform X = 2(n-S) operations with depth log X
  - Each non-constant-time task can potentially be split (variable speeds)
  - The coupling ensures that both algorithms complete simultaneously: Ts = Tp - O(log X)
  => this makes it possible to bound the total number X of operations performed and the overhead of parallelism = (S+X) - #ops_optimal
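As a sanity check on these operation counts, here is a back-of-the-envelope balance. It assumes p processors of equal, constant unit speed and neglects the O(log X) term; both simplifications are assumptions made here, not part of the slide.

```latex
\begin{align*}
T_s &= S, \qquad T_p \approx \frac{X}{p-1} = \frac{2(n-S)}{p-1} \\
T_s = T_p \;&\Rightarrow\; S\,(p-1) = 2(n-S) \;\Rightarrow\; S = \frac{2n}{p+1} \\
S + X &= \frac{2n}{p+1} + \frac{2n(p-1)}{p+1} = \frac{2np}{p+1}
\end{align*}
```

Under these assumptions, the time 2n/(p+1) and the operation count 2np/(p+1) coincide with the tight bound for p identical processors quoted earlier in the talk.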

Adaptive prefix: experiments 1
Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context
(Plot: time (s) versus #processors; curves: pure sequential, optimal off-line on p procs, adaptive)
Single-user context: processor-adaptive prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically “optimal” off-line parallel algorithm on p processors

Adaptive prefix: experiments 2
Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context
(Plot: time (s) versus #processors; curves: off-line parallel algorithm for p processors, adaptive)
Multi-user context: additional external load: (9-p) external dummy processes are executed concurrently
Processor-adaptive prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with off-line schedule

Conclusion
The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
Application to prefix computation:
- theoretically reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs
Generic adaptive scheme to implement parallel algorithms with provable performance
- work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]

Interactive Distributed Simulation [B Raffin & E Boyer]
- 5 cameras
- 6 PCs
3D-reconstruction + simulation + rendering
-> Adaptive scheme to maximize 3D-reconstruction precision within fixed timestamp [L Suares, B Raffin, JL Roch]
Thank you!

The Prefix race: sequential / parallel fixed / adaptive
(Figure: competing runs: Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc., Sequential)
On each of the 10 executions, adaptive completes first

Adaptive prefix: some experiments
Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux)
(Two plots, one per context: time (s) versus #processors; curves: parallel, adaptive)
• Single-user context
  Adaptive is equivalent to:
  - sequential on 1 proc
  - optimal parallel-2 proc. on 2 processors
  - …
  - optimal parallel-8 proc. on 8 processors
• Multi-user context (external load)
  Adaptive is the fastest: 15% benefit over a static-grain algorithm

With * = double sum ( r[i] = r[i-1] + x[i] )
Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles
(Plot panels: single user; processors with variable speeds)
Remark for n = 4,096,000 doubles:
- “pure” sequential: 0.20 s
- minimal “grain” = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to lower bound)
