Mini Symposium: Adaptive Algorithms for Scientific Computing
• Adaptive, hybrid, oblivious: what do these terms mean?
• Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
• Objective: towards an analysis based on algorithm performance
9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch & al., AHA Team, INRIA-CNRS Grenoble, France
10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
IMAG-INRIA Workgroup on "Adaptive and Hybrid Algorithms", Grenoble, France
• Contents
• I. Some criteria to analyze adaptive algorithms
• II. Work-stealing and adaptive parallel algorithms
• III. Adaptive parallel prefix computation
Why adaptive algorithms, and how?
Resource availability is versatile and input data vary; hence measures on resources, measures on data, and adaptation to improve performance:
• Scheduling: partitioning, load-balancing, work-stealing
• Calibration: tuning parameters (block size / cache, choice of instructions, …), priority management
An algorithm is «hybrid» iff there is a high-level choice between at least two algorithms, each of which could solve the same problem.
Choices in the algorithm:
• sequential / parallel(s)
• approximated / exact
• in memory / out of core
• …
Modeling a hybrid algorithm
• Several algorithms solve the same problem f, e.g. algo_f1, algo_f2(block size), …, algo_fk, each possibly recursive:
algo_fi ( n, … ) { … f ( n - 1, … ); … f ( n / 2, … ); … }
• Adaptation: choose algo_fj for each call to f
• E.g. "practical" hybrids: Atlas, Goto, FFPack; FFTW; cache-oblivious B-tree; any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl, TLib, …
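To make this model concrete, here is a toy C++ sketch (the names f, algo_iterative, algo_recursive and the choice rule are illustrative, not from the talk): each call to f is a choice point between two algorithms that solve the same problem.

```cpp
#include <cstddef>

long f(std::size_t n);                    // the problem to solve

// Hypothetical choice function: here tuned on input size, but it could
// just as well be adaptive (based on run-time measures of data or resources).
bool use_iterative(std::size_t n) { return n < 64; }

long algo_iterative(std::size_t n) {      // algo_f1: a base algorithm
    long r = 0;
    for (std::size_t i = 0; i < n; ++i) r += 1;
    return r;
}

long algo_recursive(std::size_t n) {      // algo_f2: recursive, calls back f
    return f(n / 2) + f(n - n / 2);       // each sub-call re-chooses
}

// The hybrid: a choice point at every call.
long f(std::size_t n) {
    return use_iterative(n) ? algo_iterative(n) : algo_recursive(n);
}
```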
How to manage the overhead due to choices?
• Classification 1/2:
• Simple hybrid iff O(1) choices [e.g. block size in Atlas, …]
• Baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]
• Choices are either dynamic or pre-computed based on input properties.
Choices may or may not be based on architecture parameters.
• Classification 2/2: a hybrid is
• Oblivious: control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]
• Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity, …]; either engineered tuned or self-tuned [e.g. ATLAS and GOTO libraries, FFTW, LinBox/FFLAS [Saunders & al]]
• Adaptive: self-configuration of the algorithm, dynamic, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties, …] [e.g. TLib, Rauber & Rünger]
Examples
• BLAS libraries:
• Atlas: simple tuned (self-tuned)
• Goto: simple engineered (engineered tuned)
• LinBox / FFLAS: simple self-tuned, adaptive [Saunders & al]
• FFTW:
• Halving factor: baroque tuned
• Stopping criterion: simple tuned
• Parallel algorithms and scheduling:
• Choice of parallel degree: e.g. TLib [Rauber & Rünger]
• Work-stealing schedule: baroque hybrid
Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS Project on "Adaptive and Hybrid Algorithms", Grenoble, France
• Contents
• I. Some criteria to analyze adaptive algorithms
• II. Work-stealing and adaptive parallel algorithms
• III. Adaptive parallel prefix computation
Work-stealing (1/2)
«Work» W1 = #total operations performed
«Depth» W∞ = #ops on a critical path (parallel time on ∞ resources)
• Work-stealing = "greedy" schedule, but distributed and randomized
• Each processor manages locally the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
Work-stealing (2/2)
«Work» W1 = #total operations performed
«Depth» W∞ = #ops on a critical path (parallel time on ∞ resources)
• Interest:
-> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]
-> with good probability, near-optimal schedule on p processors with average speed πave:
Tp < W1/(p·πave) + O(W∞/πave)
NB: #successful steals = #task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
• Implementation: work-first principle [Cilk, Kaapi]
• Local parallelism is implemented by sequential function calls
• Restrictions to ensure validity of the default sequential schedule: series-parallel / Cilk, reference order / Kaapi
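The deque discipline described above can be sketched as follows (a minimal C++ illustration only; real schedulers such as Cilk's THE protocol or Kaapi use lock-free deques rather than a mutex): the owner works depth-first at the bottom of its deque, while a thief steals the oldest task at the top of a randomly chosen victim.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

using Task = std::function<void()>;

struct Worker {
    std::deque<Task> dq;
    std::mutex m;                          // a real scheduler avoids this lock

    void push(Task t) {                    // owner: newest task at the bottom
        std::lock_guard<std::mutex> g(m);
        dq.push_back(std::move(t));
    }
    std::optional<Task> pop() {            // owner: depth-first (sequential) order
        std::lock_guard<std::mutex> g(m);
        if (dq.empty()) return std::nullopt;
        Task t = std::move(dq.back()); dq.pop_back();
        return t;
    }
    std::optional<Task> steal() {          // thief: OLDEST ready task first
        std::lock_guard<std::mutex> g(m);
        if (dq.empty()) return std::nullopt;
        Task t = std::move(dq.front()); dq.pop_front();
        return t;
    }
};

// An idle worker picks a random victim and tries to steal from it.
std::optional<Task> try_steal(std::vector<Worker>& ws, std::mt19937& rng,
                              std::size_t self) {
    std::uniform_int_distribution<std::size_t> pick(0, ws.size() - 1);
    std::size_t victim = pick(rng);
    if (victim == self) return std::nullopt;
    return ws[victim].steal();
}
```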
Work-stealing and adaptability
• Work-stealing allocates processors to tasks transparently to the application, with provable performance
• Supports the addition of new resources
• Supports resilience of resources and fault-tolerance (crash faults, network, …): checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]
• "Baroque hybrid" adaptation: there is an implicit dynamic choice between two algorithms:
• a sequential (local) algorithm: depth-first (default choice)
• a parallel algorithm: breadth-first
• The choice is performed at runtime, depending on resource idleness
• Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]: parallel divide&conquer computations, tree searching, branch&X, …
-> suited when both the sequential and the parallel algorithm perform (almost) the same number of operations
But parallelism often has a cost!
• Solution: mix a sequential and a parallel algorithm
• Basic technique: run the parallel algorithm down to a certain «grain», then use the sequential one
• Problem: W∞ increases, and so do the number of migrations … and the inefficiency ;o(
• Work-preserving speed-up [Bini-Pan 94] = cascading [JaJa 92]: careful interplay of both algorithms to build one with both W∞ small and W1 = O(Wseq)
• Divide the sequential algorithm into blocks; each block is computed with the (non-optimal) parallel algorithm
• Drawback: sequential at coarse grain and parallel at fine grain ;o(
• Adaptive granularity, the dual approach: parallelism is extracted at run-time from any sequential task
Self-adaptive grain algorithm
[Diagram: SeqCompute runs locally; on a steal request, Extract_par splits off a LastPartComputation]
Based on the work-first principle: always execute a sequential algorithm to reduce parallelism overhead => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation.
Hypothesis, two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
• Examples: iterated product [Vernizzi 05], gzip / compression [Kerfali 04], MPEG-4 / H264 [Bernard 06], prefix computation [Traore 06]
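A minimal C++ sketch of this coupling for a loop over [first, last) (illustrative only; a single lock stands in for the fine synchronization of a real runtime such as Kaapi): SeqCompute advances one step at a time, and LastPartComputation lets an idle thief extract the last half of the remaining work.

```cpp
#include <mutex>

struct Range {
    long first = 0, last = 0;
    std::mutex m;
};

// SeqCompute: the owner runs the sequential algorithm, claiming one
// index at a time so a thief can always see what work remains.
void SeqCompute(Range& r, double* x) {
    for (;;) {
        long i;
        {
            std::lock_guard<std::mutex> g(r.m);
            if (r.first >= r.last) return;   // nothing left (possibly stolen)
            i = r.first++;
        }
        x[i] *= 2.0;                         // placeholder for one sequential step
    }
}

// LastPartComputation: an idle thief extracts the LAST half of the
// work remaining in r, splittable again down to the finest grain.
bool LastPartComputation(Range& r, Range& stolen) {
    std::lock_guard<std::mutex> g(r.m);
    long remaining = r.last - r.first;
    if (remaining < 2) return false;         // finest grain: nothing to extract
    long mid = r.first + remaining / 2;
    stolen.first = mid;
    stolen.last  = r.last;
    r.last = mid;                            // the owner keeps the first half
    return true;
}
```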
Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS Project on "Adaptive and Hybrid Algorithms", Grenoble, France
• Contents
• I. Some criteria to analyze adaptive algorithms
• II. Work-stealing and adaptive parallel algorithms
• III. Adaptive parallel prefix computation
Prefix computation: an example where parallelism always costs
π1 = a0*a1, π2 = a0*a1*a2, …, πn = a0*a1*…*an
• Sequential algorithm: π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i];
W1 = W∞ = n
• Parallel algorithm [Ladner-Fischer]: pairwise products of a0 a1, a2 a3, …, an-1 an feed a prefix of size n/2, which yields π1, π3, …, πn; the remaining π2, π4, …, πn-1 are then filled in
W∞ = 2·log n, but W1 = 2·n: twice as expensive as the sequential algorithm …
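Written out as plain sequential C++ (a sketch: the two loops at each level are exactly what the parallel algorithm runs in parallel), the recursive scheme makes the operation count visible:

```cpp
#include <cstddef>
#include <vector>

// In-place prefix products: a[i] becomes a[0]*...*a[i].
void tree_prefix(std::vector<double>& a) {
    std::size_t n = a.size();
    if (n <= 1) return;
    std::vector<double> pairs(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)   // n/2 products: one parallel level
        pairs[i] = a[2 * i] * a[2 * i + 1];
    tree_prefix(pairs);                        // prefix of size n/2
    for (std::size_t i = 1; i < n; ++i)        // ~n/2 more products to fill in
        a[i] = (i % 2 == 1) ? pairs[i / 2]
                            : pairs[i / 2 - 1] * a[i];
    // Total work W1 ~ 2n versus n sequentially; depth W∞ = 2·log n.
}
```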
Adaptive prefix computation
• Any parallel prefix performs at least W1 ≥ 2·n - W∞ operations
• Strict lower bound on p identical processors: Tp ≥ 2n/(p+1), reached by a block algorithm + pipeline [Nicolau & al. 2000]
• Application of the adaptive scheme: one process performs the main "sequential" computation; the other, work-stealer, processes compute a parallel «segmented» prefix
• Near-optimal performance on processors with changing speeds: Tp < 2n/((p+1)·πave) + O(log n / πave), close to the lower bound
Scheme of the proof
• Dynamic coupling of two algorithms that complete simultaneously:
• Sequential: (optimal) number of operations S
• Parallel: performs X operations; dynamic splitting is always possible down to the finest grain, but local execution is sequential; scheduled by work-stealing on p-1 processors; small critical path (log X); every non-constant-time task can be split (variable speeds)
• Analysis:
• The algorithmic scheme ensures Ts = Tp + O(log X) => this bounds the total number X of operations performed, and hence the overhead of parallelism = (S + X) - #ops_optimal
• Comparison to the lower bound on the number of operations.
Adaptive prefix on 3 processors (animation, 6 steps)
[Diagram: the main sequential process computes π1, π2, … over a1 … a12; on a steal request, work-stealer 1 takes the last half and computes local products πi = a5*…*ai, then work-stealer 2 steals in turn and computes πi = a9*…*ai; at each preemption (after π8, then π11) the main process merges the stolen partial results and jumps ahead, finishing with π12.]
Implicit critical path on the sequential process
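The merge step of this animation can be sketched sequentially in C++ (without the actual threads or steal protocol): the owner computes the plain prefix up to the steal point m, the thief computes local products on [m, n), and preemption finalizes each stolen result with one extra multiplication.

```cpp
#include <cstddef>
#include <vector>

// Prefix products of a[0..n) when a thief has stolen the tail [m, n),
// with 1 <= m < n. The owner computes P[0..m) sequentially; the thief
// computes local products L[i] = a[m]*...*a[i]; at preemption each L[i]
// is finalized by a single multiplication with P[m-1].
std::vector<double> adaptive_prefix_merge(const std::vector<double>& a,
                                          std::size_t m) {
    std::size_t n = a.size();
    std::vector<double> P(n);
    P[0] = a[0];
    for (std::size_t i = 1; i < m; ++i)      // owner: plain sequential prefix
        P[i] = P[i - 1] * a[i];
    std::vector<double> L(n - m);            // thief: «segmented» prefix
    L[0] = a[m];
    for (std::size_t i = m + 1; i < n; ++i)
        L[i - m] = L[i - m - 1] * a[i];
    for (std::size_t i = m; i < n; ++i)      // preemption/merge: one op each
        P[i] = P[m - 1] * L[i - m];
    return P;
}
```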
Adaptive prefix: some experiments (joint work with Daouda Traore)
Prefix of 10,000 elements on an 8-processor SMP (IA64 / Linux)
[Two plots: time (s) vs #processors, comparing Parallel and Adaptive, with and without external load]
• Multi-user context: adaptive is the fastest, with a 15% benefit over a static-grain algorithm
• Single-user context: adaptive is equivalent to:
- sequential on 1 proc
- optimal 2-proc. parallel on 2 processors
- …
- optimal 8-proc. parallel on 8 processors
The prefix race: sequential / fixed parallel / adaptive
[Plot legend: Adaptive 8 proc.; Parallel 8, 7, 6, 5, 4, 3, 2 proc.; Sequential]
On each of the 10 executions, adaptive completes first
With * = double sum (r[i] = r[i-1] + x[i]); finest «grain» limited to 1 page = 16384 bytes = 2048 doubles
Single user; processors with variable speeds
Remark, for n = 4,096,000 doubles:
- «pure» sequential: 0.20 s
- minimal «grain» = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)
E.g. triangular system solving: A·x = b, with A lower triangular
[Diagram: successive lower-triangular systems ·x = b of decreasing dimension]
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
1/ x1 = b1 / a11
2/ for k = 2..n: bk = bk - ak1·x1
=> a system of dimension n is reduced to a system of dimension n-1
E.g. triangular system solving: A·x = b (continued)
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
• Using parallel matrix inversion, x = A⁻¹·b: T1 = n³; T∞ = log² n (fine grain), with
A = [ A11 0 ; A21 A22 ] and A⁻¹ = [ A11⁻¹ 0 ; S A22⁻¹ ], where S = -A22⁻¹·A21·A11⁻¹
• Self-adaptive granularity algorithm: T1 = n²; T∞ = n·log n: a self-adaptive sequential algorithm combined with self-adaptive matrix inversion and a self-adaptive scalar product; ExtractPar: choice of the extracted height h = m
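For reference, a minimal C++ sketch of the sequential elimination above (standard forward substitution, about n²/2 multiply-adds); the self-adaptive version extracts parallelism from the update loop:

```cpp
#include <cstddef>
#include <vector>

// Solve A x = b for lower-triangular A by the elimination on the slide:
// fix x_k, then subtract its contribution from the remaining right-hand
// side, reducing a system of dimension n to one of dimension n-1.
std::vector<double> solve_lower(const std::vector<std::vector<double>>& A,
                                std::vector<double> b) {
    std::size_t n = b.size();
    std::vector<double> x(n);
    for (std::size_t k = 0; k < n; ++k) {
        x[k] = b[k] / A[k][k];                  // 1/ x1 = b1 / a11
        for (std::size_t i = k + 1; i < n; ++i)
            b[i] -= A[i][k] * x[k];             // 2/ bk = bk - ak1 * x1
    }
    return x;
}
```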
Conclusion
• Adaptive: what choices, and how to choose?
• Illustration: adaptive parallel prefix based on work-stealing
- self-tuned baroque hybrid: O(p·log n) choices
- achieves near-optimal performance; processor-oblivious
• A generic adaptive scheme to implement parallel algorithms with provable performance
Mini Symposium: Adaptive Algorithms for Scientific Computing
• Adaptive, hybrid, oblivious: what do these terms mean?
• Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
• Objective: towards an analysis based on algorithm performance
9h45 Adaptive algorithms - Theory and applications. Jean-Louis Roch & al., AHA Team, INRIA-CNRS Grenoble, France
10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA
Some examples (1/2)
• Adaptive algorithms used empirically and theoretically:
• Atlas [2001], dense linear algebra library: instruction set and instruction schedule; self-calibration of the block size at installation on the machine
• FFTW (1998, …): FFT(n) => p FFT(q) and q FFT(p), with n = p·q; for any n, for any recursive call FFT(n), the best value for p, i.e. the optimal splitting for vector size n, is pre-computed on the machine
• Cache-oblivious B-trees: recursive block splitting to minimize #page faults; self-adaptation to the memory hierarchy
• Work-stealing (Cilk (1998, …), Kaapi (2000, …)): recursive parallelism; choice between a sequential depth-first schedule and a breadth-first schedule; «work-first principle»: optimize the local sequential execution and put the overhead on the rare steals by idle processors; implicitly adaptive
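A sketch of the "baroque tuned" idea behind such planners (hypothetical, not FFTW's actual API): one pre-computed choice per size reached during the recursion, hence an unbounded number of choices overall.

```cpp
#include <map>

// Hypothetical planner table: best_radix[n] = best p for splitting
// FFT(n) into p FFT(n/p) + (n/p) FFT(p); filled off-line by timing
// trials on the target machine.
std::map<long, long> best_radix;

long plan_split(long n) {
    auto it = best_radix.find(n);
    if (it != best_radix.end())
        return it->second;                   // pre-computed (tuned) choice
    for (long p = 2; p * p <= n; ++p)        // fallback: smallest factor
        if (n % p == 0) return p;
    return n;                                // prime size: base case
}
```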
Some examples (2/2)
• Moldable tasks: bi-criteria scheduling with performance guarantee [Trystram & al 2004]: alternating recursive combination of approximation algorithms for each criterion; self-adaptation with guaranteed performance for each criterion
• «Cache-oblivious» algorithms [Bender & al 2004]: recursive block splitting that minimizes page faults; self-adaptation to the memory hierarchy (B-tree)
• «Processor-oblivious» algorithms [Roch & al 2005]: recursive combination of 2 algorithms, one sequential and one parallel; self-adaptation to resource idleness
Best case: the parallel algorithm is efficient
W∞ is small and W1 = Wseq: the parallel algorithm is an optimal sequential one
Examples: parallel D&C algorithms
Implementation: work-first principle - no overhead for local execution of tasks
Examples: Cilk: THE protocol; Kaapi: compare&swap only
Experimentation: knary benchmark
[Table: distributed architecture: iCluster, Athapascan; SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan; Ts = 2397 s, T1 = 2435 s]
[Task graph: F(2,a), G(a,b), H(b), H(a), O(b,7)]
High potential degree of parallelism.
In «practice»: coarse granularity, splitting into p = #resources. Drawback: heterogeneous and dynamic architectures, with πi(t) = speed of processor i at time t.
In «theory»: fine granularity, maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
How to obtain an efficient fine-grain algorithm?
• Hypotheses for efficiency of work-stealing:
• the parallel algorithm is «work-optimal»
• T∞ is very small (recursive parallelism)
• Problem: fine-grain (small T∞) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm:
• overhead due to parallelism creation and synchronization
• but also arithmetic overhead
Self-adaptive grain algorithms
• Recursive computations with local sequential computation
• Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm
• Hypothesis, two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
• Examples: iterated product [Vernizzi], gzip / compression [Kerfali], MPEG-4 / H264 [Bernard], prefix computation [Traore]
Illustration: adaptive parallel prefix • Adaptive parallel computing on non-uniform and shared resources • Example of adaptive prefix computation
Indeed parallelism often costs … e.g. prefix computation:
P1 = a0*a1, P2 = a0*a1*a2, …, Pn = a0*a1*…*an
• Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i]; W1 = n
• Parallel algorithm [Ladner-Fischer]: pairwise products of a0 a1, …, an-1 an feed a Prefix(n/2), which yields P1, P3, …, Pn; the remaining P2, P4, …, Pn-1 are then filled in
W∞ = 2·log n, but W1 = 2·n: twice as expensive as the sequential