Adaptability
• Data vary
• Resources vary
• Application: adaptation is needed to improve performance
MiniSymposium: Adaptive Algorithms for Scientific Computing
• 9h45 Adaptive algorithms - Theory and applications. Collective work, AHA Team; Jean-Louis Roch, INRIA-CNRS Grenoble, France
• 10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
• 10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, U. Bayreuth, Germany
• 11h15 Cache-Oblivious algorithms. Michael Bender, Stony Brook U., USA
Why adaptive algorithms? Resource availability is versatile; data vary.
AHA objective: an integrated view of adaptation. Algorithmic approach: self-adaptive combination of algorithms whose global behavior is justified from a theoretical point of view.
Measurements on resources + measurements on data => adaptations:
• Algorithm choice: sequential / parallel, approximate / exact, in memory / out of core
• Scheduling: planning (scheduling): computation volume / heterogeneity; redistribution (load balancing)
• Tuning: pre-parameterization: block / cache sizes, instruction choice; priority management
Adaptive-grain parallel algorithms: the prefix example
Jean-Louis.Roch@imag.fr
MOAIS Project (www-id.imag.fr/MOAIS)
Laboratoire ID-IMAG (CNRS-INRIA INPG-UJF)
How to adapt the application?
• By minimizing communications, e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity
• By controlling latency (interactivity constraints): FlowVR [Allard, Menier, Raffin]: overhead
• By managing node failures and resilience [checkpoint/restart] [checkers]: FlowCert [Jafar, Krings, Leprevost; Roch, Varrette]
• By adapting granularity: malleable tasks [Trystram, Mounié]; dataflow cactus-stack: Athapascan/Kaapi [Gautier]; recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, ...] [Bender Rabin 2002]
• Self-adaptive grain algorithms: dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2005] [Roch, Traore, Bernard - ...]
Adaptive-grain parallel algorithms: some examples
• Scheduling fine-grain parallel programs: work-stealing
• Adaptive-grain algorithms: principle of a dynamic « cascade »; example of the iterated product
• Coupling sequential and parallel algorithms: the prefix example
[Dataflow task-graph figure: tasks F(2,a), G(a,b), H(b), H(a), O(b,7) linked by data a and b]
High potential degree of parallelism.
• In « practice »: coarse granularity; splitting into p = #resources. Drawback: heterogeneous, dynamic architectures: Πi(t) = speed of processor i at time t.
• In « theory »: fine granularity; maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
Greedy scheduling
• « Work » W1 = sequential time = #operations
• « Depth » W∞ = parallel time on ∞ resources = #operations on a critical path
• Homogeneous case [Graham 69], greedy scheduling (no ready task is left waiting while a processor is idle): Tp < W1/p + (1 - 1/p)·W∞, hence Tp < W1/p + W∞
• Heterogeneous case [Jaffe 80], maximum-utilization schedule: if i < p tasks are ready, assign them to the i fastest processors
• High-utilization schedule [Bender 02], parameter B: if i < p tasks are ready, the fastest idle processor is at most B times faster than the slowest busy processor: Tp < W1/(p·Πave) + B·W∞/Πave
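As a quick numeric instance of the Graham bound (illustrative numbers, not measurements from these slides):

\[
W_1 = 10^6,\quad W_\infty = 10^3,\quad p = 8:\qquad
T_8 \;<\; \frac{10^6}{8} + \Bigl(1 - \frac{1}{8}\Bigr)\cdot 10^3 \;=\; 125000 + 875 \;=\; 125875,
\]

i.e. within 0.7% of the ideal W1/p: speedup stays near-linear as long as W∞ << W1/p.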
Work stealing
• Distributed randomized implementation of greedy scheduling: each processor manages locally the tasks it creates; when idle, a processor steals the oldest ready task on a remote, non-idle processor (randomly chosen).
• Implementation: local stack = deque [Cilk, Kaapi]; local parallelism is implemented by sequential function calls; correctness of the local sequential execution => restrictions: series-parallel (Cilk), reference order (Kaapi).
• On heterogeneous processors, slight modification: when a processor steals from a B-times slower busy processor, it preempts its task.
• Interest: with good probability, #successful steals < p·W∞ => few task migrations [Blumofe 98, Narlikar 01, Bender 02, Revire-Roch 03, ...]; suited to heterogeneous architectures [Bender-Rabin 02]: Tp < W1/(p·Πave) + O(W∞/Πave) with good probability.
=> How to get W∞ small while keeping W1 = #operations of the sequential algorithm?
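To make the stealing discipline concrete, here is a minimal standalone sketch (plain C++ threads with mutex-guarded deques; an illustration of the principle above, not the Cilk/Kaapi implementation, which is lock-free and work-first):

    // Minimal work-stealing sketch: each worker owns a deque of tasks,
    // pops its own newest task from the bottom, and, when idle, steals
    // the OLDEST ready task from a randomly chosen victim's deque.
    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <random>
    #include <thread>
    #include <vector>

    using Task = std::function<void()>;

    struct Worker {
        std::deque<Task> tasks;  // bottom = newest, top = oldest ready task
        std::mutex m;
    };

    std::vector<Worker> workers(4);
    std::atomic<long> pending{0};  // tasks created but not yet executed

    void spawn(int self, Task t) {  // local push at the bottom of own deque
        ++pending;
        std::lock_guard<std::mutex> lk(workers[self].m);
        workers[self].tasks.push_back(std::move(t));
    }

    bool pop_local(int self, Task& t) {  // depth-first locally: newest first
        std::lock_guard<std::mutex> lk(workers[self].m);
        if (workers[self].tasks.empty()) return false;
        t = std::move(workers[self].tasks.back());
        workers[self].tasks.pop_back();
        return true;
    }

    bool steal(int self, Task& t, std::mt19937& rng) {  // oldest task of a random victim
        int victim = std::uniform_int_distribution<int>(0, (int)workers.size() - 1)(rng);
        if (victim == self) return false;
        std::lock_guard<std::mutex> lk(workers[victim].m);
        if (workers[victim].tasks.empty()) return false;
        t = std::move(workers[victim].tasks.front());
        workers[victim].tasks.pop_front();
        return true;
    }

    void worker_loop(int self) {
        std::mt19937 rng(self + 1);
        Task t;
        while (pending > 0) {
            if (pop_local(self, t) || steal(self, t, rng)) { t(); --pending; }
            else std::this_thread::yield();  // idle: retry a steal
        }
    }

    int main() {
        std::atomic<long> sum{0};
        for (int i = 0; i < 100; ++i) spawn(0, [&] { sum += 1; });  // 100 unit tasks
        std::vector<std::thread> pool;
        for (int w = 0; w < 4; ++w) pool.emplace_back(worker_loop, w);
        for (auto& th : pool) th.join();
        return sum == 100 ? 0 : 1;  // every task executed exactly once
    }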
Best case: the parallel algorithm is efficient: W∞ is small and W1 = Wseq, i.e. the parallel algorithm is an optimal sequential one. Examples: parallel divide-and-conquer algorithms.
Implementation: work-first principle: no overhead when tasks execute locally. Examples: Cilk: THE protocol; Kaapi: compare&swap only.
Experimentation: knary benchmark
• Distributed architecture: iCluster / Athapascan
• SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan
• Ts = 2397 s, T1 = 2435 s
But usually, when W∞ is small, W1 >> Wseq.
• Solution: mix the sequential and the parallel algorithm.
• Basic technique: run the parallel algorithm down to a certain « grain », then switch to the sequential one. Problem: T∞ increases, and so do the number of migrations and the inefficiency ;o(
• Work-preserving speed-up [Bini-Pan 94] = cascading technique [Jaja 92]: careful interplay of both algorithms to build one with both T∞ small and T1 = O(Ts): divide the sequential algorithm into blocks; each block is computed with the (non-optimal) parallel algorithm. Drawback: sequential at coarse grain and parallel at fine grain ;o( (see the sketch below).
• Adaptive grain: the dual approach: parallelism is extracted from any sequential task.
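A minimal sketch of the cascading idea on a prefix sum (assuming C++17 std::inclusive_scan as a stand-in « parallel » block primitive; block size and names are illustrative):

    // Cascading [Bini-Pan 94 / Jaja 92] on a prefix sum: the sequential
    // algorithm is cut into blocks; inside each block the (non-optimal)
    // parallel primitive is used; the carry chains the blocks.
    #include <algorithm>
    #include <numeric>   // std::inclusive_scan (C++17)
    #include <vector>

    std::vector<double> cascaded_prefix(const std::vector<double>& a,
                                        std::size_t block = 1024) {
        std::vector<double> out(a.size());
        double carry = 0.0;                       // prefix of all previous blocks
        for (std::size_t lo = 0; lo < a.size(); lo += block) {
            std::size_t hi = std::min(lo + block, a.size());
            // within a block: a scan primitive that may run in parallel
            std::inclusive_scan(a.begin() + lo, a.begin() + hi,
                                out.begin() + lo, std::plus<>{}, carry);
            carry = out[hi - 1];                  // sequential dependency between blocks
        }
        return out;
    }

The carry creates exactly the coarse-grain sequential chain the slide criticizes: blocks run one after the other, and parallelism lives only inside a block.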
How to obtain an efficient fine-grain algorithm?
• Hypotheses for efficiency of work-stealing: the parallel algorithm is « work-optimal »; T∞ is very small (recursive parallelism).
• Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm: overhead due to parallelism creation and synchronization, but also arithmetic overhead.
Self-adaptive grain algorithms
• Recursive computations with local sequential computation. Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm.
• Hypothesis: two algorithms: 1 sequential (SeqCompute) and 1 parallel (LastPartComputation) => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
• Examples: iterated product [Vernizzi]; gzip / compression [Kerfali]; MPEG-4 / H264 [Bernard ...]; prefix computation [Traore]
Self-adaptive grain algorithm
[Diagram: SeqCompute splits into SeqCompute + Extract_par / LastPartComputation]
Principle: save the parallelism overhead by privileging a sequential algorithm: use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the remaining sequential computation.
Hypothesis: two algorithms: 1 sequential (SeqCompute) and 1 parallel (LastPartComputation) => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples: iterated product [Vernizzi]; gzip / compression [Kerfali]; MPEG-4 / H264 [Bernard ...]; prefix computation [Traore]
A sketch of this coupling follows.
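A minimal standalone sketch of that principle (a mutex-guarded work descriptor; extract_seq/extract_par mirror the extractSeq/extractPar calls of the Kaapi code later in this deck, but this is plain C++, not the Kaapi API):

    // One owner thread consumes the range [lo, hi) sequentially from the
    // front (SeqCompute); an idle thief extracts the second half of what
    // remains (LastPartComputation) and processes it.
    #include <atomic>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    struct WorkDesc {
        long lo, hi;       // remaining work: [lo, hi)
        std::mutex m;
        bool extract_seq(long& i) {  // owner takes one unit from the front
            std::lock_guard<std::mutex> lk(m);
            if (lo >= hi) return false;
            i = lo++;
            return true;
        }
        bool extract_par(long& l2, long& h2) {  // thief grabs the last half
            std::lock_guard<std::mutex> lk(m);
            if (hi - lo < 2) return false;  // too little left: nothing to extract
            long mid = lo + (hi - lo) / 2;
            l2 = mid; h2 = hi; hi = mid;
            return true;
        }
    };

    int main() {
        WorkDesc dw{0, 100};
        std::atomic<long> sum{0};
        std::thread owner([&] {  // SeqCompute: one unit at a time
            long i;
            while (dw.extract_seq(i)) sum += i;
        });
        std::thread thief([&] {  // an idle processor extracts the last part
            long l, h;
            if (dw.extract_par(l, h))
                for (long i = l; i < h; ++i) sum += i;
        });
        owner.join();
        thief.join();
        std::printf("sum = %ld (expected 4950)\n", sum.load());
    }

Whether or not the thief arrives in time to extract anything, every unit of work is executed exactly once; the thief only shortens the owner's remaining range.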
[Prefix-circuit figure: inputs a0 ... an, outputs P1 ... Pn, recursive Prefix(n/2) block]
Indeed, parallelism often costs... E.g. prefix computation: P1 = a0*a1, P2 = a0*a1*a2, ..., Pn = a0*a1*...*an
• Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i]; => T1 = n
• Parallel algorithm: T∞ = 2·log n, but T1 = 2·n
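The parallel algorithm behind T∞ = 2·log n and T1 = 2·n is, plausibly, the classic pairwise recursive scheme; a compact sequential sketch of it (the pairing loop, the recursive call, and the fix-up loop are the points where tasks would be forked; names are illustrative):

    // In-place inclusive prefix product: a[i] becomes a[0]*...*a[i].
    #include <cstddef>
    #include <vector>

    void parallel_prefix(std::vector<double>& a) {
        std::size_t n = a.size();
        if (n < 2) return;
        std::vector<double> pairs(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i)   // n/2 independent products
            pairs[i] = a[2 * i] * a[2 * i + 1];
        parallel_prefix(pairs);                   // prefix of the pair products
        for (std::size_t i = 1; i < n; i += 2)    // odd positions: plain copies
            a[i] = pairs[i / 2];
        for (std::size_t i = 2; i < n; i += 2)    // even positions >= 2: one product each
            a[i] = pairs[i / 2 - 1] * a[i];
        // Products per level: n/2 + n/2, so W1(n) = n + W1(n/2) ~ 2n;
        // two levels per halving, so T_inf ~ 2 log n.
    }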
Adaptive prefix computation
• Any (parallel) algorithm of depth T∞ = d performs at least 2n - d operations => lower bound on p identical processors: Tp ≥ 2n/(p+1)
• Block algorithm + pipeline [Nicolau 2000]
• Adaptive scheme: one process performs the sequential computation; the p-1 others perform a parallel « segmented » prefix computation: Tp < 2n/((p+1)·Πave) + O(log n / Πave)
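A one-line derivation of that lower bound from the two stated facts (with time T on p processors, the depth is at most T, so at least 2n - T operations are performed, while p processors execute at most p·T operations in time T):

\[
p\,T \;\ge\; \#\mathrm{ops} \;\ge\; 2n - T
\;\Longrightarrow\; (p+1)\,T \;\ge\; 2n
\;\Longrightarrow\; T \;\ge\; \frac{2n}{p+1}.
\]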
Adaptive prefix with variable speeds
• Lower bound: decreasing the parallel time => #ops increases: > 2n·(1 - 1/p)
• Adaptive-grain algorithm with provable performance: dynamic cascading of two algorithms (sequential/parallel) [TSI 2005]
• Theorem: Tp = 2n/(p*+1) + O(log n): ~ optimal on processors with average speed p* [soon 2006]
[Experiment plots: external load; fixed parallel vs adaptive]
• Single-user context: adaptive is equivalent to: sequential on 1 proc; optimal parallel-2 proc on 2 processors; ...; optimal parallel-8 proc on 8 processors
• Multi-user context: adaptive is the fastest: 15% benefit over a static-grain algorithm
The race: sequential vs fixed-grain parallel vs adaptive prefix
[Plot legend: Adaptive 8 procs; Parallel 8, 7, 6, 5, 4, 3, 2 procs; Sequential]
Conclusion
• Adaptive algorithm with provable performance, also confirmed by first experiments
• To experiment: on SMP at fine grain [floating-point prefix sum] (memory, pinning the work-stealer on CPUs); on distributed heterogeneous architectures
• The scheme (and its complexity analysis) appears general: to apply the technique to other problems [AHA]
Implementation of work-stealing
[Figure: processor P executes f1() { ... fork f2; ... }; f2 is pushed on P's stack; an idle processor P' steals f2]
Hypothesis: a sequential schedule is valid + non-preemptive execution of ready tasks.
• Interest: « static » fine grain, but dynamic control
• Drawback: possible overhead of the parallel algorithm [e.g. prefix]
Illustration: f(i), i = 1..100
1. SeqComp(w) on CPU=A executes f(1); remaining work w = 2..100, with LastPart(w) exposed for extraction.
2. SeqComp(w) on CPU=A executes f(1); f(2); remaining work w = 3..100.
3. An idle CPU=B invokes LastPart(w) while w = 3..100.
4. LastPart(w) splits the remaining work: w = 3..51 stays with SeqComp(w) on CPU=A; w' = 52..100 is handed to a new SeqComp(w').
5. Both SeqComp(w) and SeqComp(w') now expose LastPart(w) and LastPart(w') for further extraction.
6. SeqComp(w') runs on CPU=B and executes f(52); w' = 53..100; SeqComp(w) continues on CPU=A.
Adaptivity
• Kaapi: reification, interaction with the environment (adding resources), ... (interaction)
• But also: impact on algorithm design / scheduling. Example: work-stealing-based algorithms: recursive parallel computations with local sequential computation; special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm.
• Example: prefix computation: sequential: n operations; parallel on p identical resources: at least 2n·(p/(p+1)) operations.
• Adaptive with work-stealing: coupling of sequential and parallel partial-prefix computations; may benefit from an unbounded number of resources; performance on p processors of variable speeds: 2n/(p+1) + O(log n)
E.g. triangular system solving: A·x = b, A lower triangular
[Figure: a sequence of lower-triangular systems A·x = b of decreasing dimension]
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
1/ x1 = b1 / a11
2/ for k = 2..n: bk ← bk - ak1·x1
=> a system of dimension n is reduced to a system of dimension n-1
E.g. triangular system solving A·x = b (continued)
• Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
• Using parallel matrix inversion: T1 = n³; T∞ = log² n (fine grain): with A = [ A11 0 ; A21 A22 ], A⁻¹ = [ A11⁻¹ 0 ; S A22⁻¹ ] where S = -A22⁻¹·A21·A11⁻¹, and x = A⁻¹·b
• Self-adaptive granularity algorithm: T1 = n²; T∞ = n·log n: a self-adaptive sequential algorithm coupled, via ExtractPar (choice of the split h of the dimension m), with self-adaptive matrix inversion and a self-adaptive scalar product. A sketch of the recursive splitting follows.
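A compact sketch of that recursive splitting (plain sequential C++; trsv_lower is an illustrative name, and in the self-adaptive version the update loop and its scalar products are where parallelism would be extracted):

    // Solve L·x = b for a lower-triangular L (n x n, row-major, stride lda):
    // solve the top-left h x h triangle, update b2 -= A21·x1, recurse on A22.
    #include <cstddef>

    void trsv_lower(const double* L, double* x, double* b,
                    std::size_t n, std::size_t lda) {
        if (n == 1) { x[0] = b[0] / L[0]; return; }  // base case: x1 = b1 / a11
        std::size_t h = n / 2;
        trsv_lower(L, x, b, h, lda);                 // A11·x1 = b1
        for (std::size_t i = h; i < n; ++i) {        // update: b2 -= A21·x1
            double dot = 0.0;                        // one scalar product per row
            for (std::size_t j = 0; j < h; ++j)
                dot += L[i * lda + j] * x[j];
            b[i] -= dot;
        }
        trsv_lower(L + h * lda + h, x + h, b + h, n - h, lda);  // A22·x2 = b2
    }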
Adaptive-grain parallel algorithms: some examples
• Scheduling fine-grain parallel programs: work-stealing and efficiency
• Adaptive-grain algorithms: principle of a dynamic « cascade »
• Examples: iterated product, prefix; gzip compression; triangular system inversion; 3D vision / oct-tree computation
Experimentation: parallel <=> adaptive. Iterated product: sequential, parallel, adaptive [Davide Vernizzi]
• Sequential: input: array of n values; output: the accumulated result; c/c++ code: for (i=0; i<n; i++) res += atoi(x[i]);
• Parallel algorithm: recursive computation by blocks (binary tree with merge); block size = pagesize; Kaapi code: Athapascan API
Experimental result: the parallel algorithm costs about twice as much as the sequential algorithm; the adaptive algorithm has an efficiency close to 1.
Variant: sum of pages
• Input: a set of n pages; each page is an array of values
• Output: one page where each element is the sum of the elements with the same index in the input pages
• c/c++ code: for (i=0; i<n; i++) for (j=0; j<pageSize; j++) res[j] += f(pages[i][j]);
Demonstration on ensibull
Script: [vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &
Result: [vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
0: res = -2.048e+07, time = 0.408178 s, threads created: 54 => ADAPTIVE (3 procs)
0: res = -2.048e+07, time = 0.964014 s, #fork = 7497 => PARALLEL (3 procs)
: res = -2.048e+07, time = 1.15204 s => SEQUENTIAL (1 proc)
Where does the difference come from? ... The program sources.
Sources of the page-sum codes:
• parallel: binary tree
• adaptive: by coupling: sequential + Fork<LastPartComp>; LastPartComp: (recursive) generation of 3 tasks
Parallel algorithm:
struct Iterated {
  void operator()(a1::Shared_w<Page> res, int start, int stop) {
    if ((stop - start) < 2) {  // max number of pages reached: sequential algorithm
      Page resLocal(pageSize);
      IteratedSeq(start, resLocal);
      res.write(resLocal);
    } else {                   // otherwise: split and recurse
      int half = (start + stop) / 2;
      a1::Shared<Page> res1;   // first thread result
      a1::Shared<Page> res2;   // second thread result
      a1::Fork<Iterated>()(res1, start, half);  // first thread
      a1::Fork<Iterated>()(res2, half, stop);   // second thread
      a1::Fork<Merge>()(res, res1, res2);       // merge the results
    }
  }
};
Adaptive parallelization
• Block computation on an input split into k blocks: 1 block = pagesize
• Independent execution of the k tasks
• Merge of the results
Adaptive algorithm (1/3)
• Hypothesis: non-preemptive, work-stealing-type scheduling
• Coupling of the adaptive sequential part:
void Adaptative(a1::Shared_w<Page>* resLocal, DescWork dw) {
  a1::Shared<Page> resLPC;                       // result of the extractable last part
  a1::Fork<LPC>()(resLPC, dw);                   // expose LastPartComputation
  Page resSeq(pageSize);
  AdaptSeq(dw, &resSeq);                         // run the sequential side
  a1::Fork<Merge>()(resLPC, *resLocal, resSeq);  // merge both partial results
}
Adaptive algorithm (2/3)
• Sequential side:
void AdaptSeq(DescWork dw, Page* resSeq) {
  DescLocalWork w;
  Page resLoc(pageSize);
  double k;
  while (!dw.desc->extractSeq(&w)) {      // pop the next unit of work from the front
    for (int i = 0; i < pageSize; i++) {  // accumulate one input page (global buffer buff)
      k = resLoc.get(i) + (double)buff[w * pageSize + i];
      resLoc.put(i, k);
    }
  }
  *resSeq = resLoc;
}
Adaptive algorithm (3/3)
• Extraction side = parallel algorithm:
struct LPC {
  void operator()(a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(&dw2)) {  // steal the last part of the remaining work
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>()(res2, dw2.desc->i, dw2.desc->j);  // process the stolen part
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>()(resLPCold, dw);  // re-expose the rest of dw for further extraction
      a1::Fork<MergeLPC>()(resLPCold, res2, resLPC);  // merge the partial results
    }
  }
};
Adaptive parallelization
• A single computation task is started for the whole input
• The remaining work is divided only when a processor becomes idle
• Fewer tasks, fewer merges
Example 2: parallelizing gzip
• gzip: widely used (web) and costly, although of linear complexity; source code: 10000 lines of C, complex data structures; principle: LZ77 + Huffman tree
• Why gzip? A P-complete problem, yet practical parallelization is possible; drawback: every (known) parallelization incurs an overhead -> loss of compression ratio
How to parallelize gzip?
[Pipeline diagram: input file -> static partition into blocks -> parallel compression -> compressed blocks -> compressed file; versus dynamic partition into blocks with on-the-fly compression]
« Easy » parallelization, 100% compatible with gzip/gunzip. Problems: loss of compression ratio, the grain depends on the machine, overhead. A block-compression sketch follows.
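A minimal sketch of block-wise compression using zlib (an assumption for illustration: compress2 emits raw zlib streams, whereas a gzip-compatible tool like the one described here must frame each block as a gzip member before concatenation):

    // Compress independent blocks with zlib; each call could be one task.
    #include <zlib.h>
    #include <vector>

    std::vector<Bytef> compress_block(const Bytef* src, uLong n) {
        uLongf cap = compressBound(n);       // worst-case compressed size
        std::vector<Bytef> out(cap);
        if (compress2(out.data(), &cap, src, n, Z_BEST_COMPRESSION) != Z_OK)
            return {};
        out.resize(cap);                     // keep only the bytes written
        return out;
    }

Blocks are independent, so one task per block can be forked; but the LZ77 window cannot cross block boundaries, which is exactly the compression-ratio loss mentioned above.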
Adaptive-grain gzip parallelization
[Pipeline diagram: input file -> on-the-fly compression (SeqComp) coupled with dynamic partition into blocks (LastPartComputation) -> parallel compression -> compressed blocks -> cat -> output compressed file]
[Plots: overhead in compressed file size; gain in time]
Performance: Pentium, 4 x 200 MHz [results plot]