This study focuses on developing heuristics for work distribution in parallel dynamic programming on heterogeneous systems, with the goal of obtaining parallel routines with autotuning capacity. The study explores various parallel dynamic programming schemes and their performance on different types of processors. The results provide insights into the optimal work distribution strategies for efficient parallel processing on heterogeneous systems.
Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems

Javier Cuenca
Departamento de Ingeniería y Tecnología de Computadores
Universidad de Murcia, Spain
javiercm@ditec.um.es

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
domingo@dif.um.es
dis.um.es/~domingo

Juan-Pedro Martínez
Departamento de Estadística y Matemática Aplicada
Universidad Miguel Hernández de Elche, Spain
jp.martinez@uhm.es

HeteroPar2004
Our Goal
• General goal: to obtain parallel routines with autotuning capacity.
• Previous work: linear algebra routines on homogeneous systems.
• This communication: parallel dynamic programming schemes on heterogeneous systems.
• In the future: apply the techniques to other algorithmic schemes.
Outline
• Parallel Dynamic Programming Schemes
• Autotuning in Parallel Dynamic Programming Schemes
• Work Distribution
• Experimental Results
Parallel Dynamic Programming Schemes
• There are different parallel dynamic programming schemes.
• The simple scheme of the "coins problem" is used: given a quantity C and n types of coins with values v = (v1, v2, …, vn) and quantities q = (q1, q2, …, qn), minimize the number of coins used to make up C.
• The granularity of the computation has been varied in order to study the scheme, not the problem.
Parallel Dynamic Programming Schemes
• Sequential scheme (filling an n × N table: one row i = 1…n per decision, one column j = 1…N per problem size):

for i = 1 to number_of_decisions
  for j = 1 to problem_size
    obtain the optimum solution with i decisions and problem size j
  endfor
  complete row i of the table with the recurrence formula of the problem
endfor
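For the coins problem, the recurrence compares, for each amount j, using k coins of the current value (bounded by qi) against the best solution with the previous coin types. A minimal sequential sketch; the function name and table representation are my own, the slide only gives the loop structure:

```python
import math

def min_coins(C, v, q):
    """Minimum number of coins summing to C, with q[i] coins of
    value v[i] available (bounded-knapsack dynamic programming)."""
    INF = math.inf
    best = [0] + [INF] * C  # best[j]: fewest coins for amount j so far
    for vi, qi in zip(v, q):  # one decision (table row) per coin type
        new = []
        for j in range(C + 1):
            # try using k coins of value vi, bounded by qi and by j
            cand = min(
                (best[j - k * vi] + k
                 for k in range(min(qi, j // vi) + 1)
                 if best[j - k * vi] != INF),
                default=INF,
            )
            new.append(cand)
        best = new
    return best[C]
```

Each outer iteration corresponds to one row of the table, matching the sequential scheme above.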
Parallel Dynamic Programming Schemes
• Parallel scheme (the columns j of the table are distributed among processors P0, P1, …, PK):

for i = 1 to number_of_decisions
  in parallel:
    for j = 1 to problem_size
      obtain the optimum solution with i decisions and problem size j
    endfor
  end in parallel
endfor
Parallel Dynamic Programming Schemes
• Message-passing scheme (each processor Pj holds a contiguous block of the N columns):

in each processor Pj:
  for i = 1 to number_of_decisions
    communication step
    obtain the optimum solution with i decisions and the problem sizes assigned to Pj
  endfor
end in each processor
Parallel Dynamic Programming Schemes
• There are different possibilities on heterogeneous systems:
  • heterogeneous algorithms;
  • homogeneous algorithms with an assignment of:
    • one process to each processor, or
    • a variable number of processes to each processor, depending on its relative speed.
• The general assignment problem is NP-hard, hence the use of heuristic approximations.
Parallel Dynamic Programming Schemes
• Dynamic programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution.
• Processes p0, p1, …, pr are mapped onto processors P0, P1, …, PK, with a variable number of processes per processor, e.g.:

  processes:  p0  p1  p2  p3  p4  p5  …  ps  …  pr-1  pr
  processors: P0  P0  P1  P3  P3  P3  …  PS  …  PK    PK
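With such a mapping, the distribution amounts to splitting the table columns into one contiguous block per process, with block sizes proportional to relative speed. A minimal sketch of the proportional split; the helper name and the rounding scheme are assumptions, not taken from the slides:

```python
def distribute_columns(N, speeds):
    """Split columns 0..N-1 of the DP table into one contiguous
    block per process, with block sizes proportional to the
    relative speed of the processor running each process."""
    total = sum(speeds)
    blocks, start, acc = [], 0, 0.0
    for s in speeds:
        acc += s
        end = round(N * acc / total)  # cumulative proportional cut point
        blocks.append((start, end))   # half-open interval [start, end)
        start = end
    return blocks
```

Cutting at cumulative (rather than per-process) proportions guarantees the blocks are contiguous and exactly cover the N columns.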
Autotuning in Parallel Dynamic Programming Schemes
• The model: t(n, C, v, q, tc(n,C,v,q,p,b,d), ts(n,C,v,q,p,b,d), tw(n,C,v,q,p,b,d))
• Problem size:
  • n: number of types of coins
  • C: value to give
  • v: array of values of the coins
  • q: quantity of coins of each type
• Algorithmic parameters (AP):
  • p: number of processes
  • b: block size (here n/p)
  • d: processes-to-processors assignment
• System parameters (SP):
  • tc: cost of a basic arithmetic operation
  • ts: start-up time
  • tw: word-sending time
Autotuning in Parallel Dynamic Programming Schemes
• Theoretical model: sequential cost, computational parallel cost (for large qi), and communication cost (per step, taking maximum values).
• The APs are p and the assignment array d.
• The SPs are the one-dimensional array tc and the two-dimensional arrays ts and tw.
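The cost equations themselves do not survive in this copy of the slides. Purely as an illustrative assumption consistent with the parameters defined above, and not the authors' exact expressions, a model of this shape could be:

```latex
% Illustrative only: the slide's actual formulas are not preserved.
% Sequential cost: one pass over the n x C table, each entry costing t_c
t_{\mathrm{seq}} = n \, C \, t_c

% Computational parallel cost (q_i large): the C columns are split among
% p processes, so each of the n steps costs the largest block
t_{\mathrm{comp}} = \frac{n \, C}{p} \, \max_i t_c(i)

% Communication cost: one communication step per decision,
% with block size b and maximum start-up / word-sending times
t_{\mathrm{comm}} = n \left( t_s + b \, t_w \right)
```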
Work Distribution
• Assignment tree (P types of processors, p processes): the root branches to types 1, 2, …, P, and a node of type s branches to types s, s+1, …, P, so each path of length p is a non-decreasing assignment of p processes to processor types.
• Some limit on the height of the tree (the number of processes) is necessary.
Work Distribution
• Assignment tree (P types of processors and p processes): for P = 2 and p = 3 the tree has 10 nodes; in general, the level sizes follow Pascal's triangle, so the tree grows combinatorially with P and p.
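Level k of the tree holds one node per multiset of size k over P types, i.e. C(P+k-1, k) nodes, which reproduces the 10 nodes quoted for P = 2 and p = 3. The closed form below is my reconstruction, not copied from the slide:

```python
from math import comb

def tree_nodes(P, p):
    """Count the nodes of the assignment tree: level k holds the
    non-decreasing assignments of k processes to P processor types,
    i.e. C(P + k - 1, k) multisets; summing levels 0..p gives
    C(P + p, p) by the hockey-stick identity."""
    return sum(comb(P + k - 1, k) for k in range(p + 1))
```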
Work Distribution
• Systems:
  • SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster), connected by Ethernet.
  • TORC (Innovative Computing Laboratory): 21 nodes of different types (dual and single Pentium II, III and 4, AMD Athlon, …), connected by FastEthernet, Myrinet, …
Work Distribution
• Assignment tree for SUNEt, P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5): the root branches to U5 and U1, first assigning one process to each processor, with deeper levels adding further processes.
• When more processes than available processors are assigned to a type of processor, the costs of the operations (the SPs) change.
Work Distribution
• Assignment tree for TORC, with P = 4 types of processors used:
  • Type 1: one 1.7 GHz Pentium 4 (only one process can be assigned)
  • Type 2: one 1.2 GHz AMD Athlon
  • Type 3: one 600 MHz single Pentium III
  • Type 4: eight 550 MHz dual Pentium III
• The remaining processors are not in the tree.
• When two consecutive processes are assigned to the same node, the values of the SPs change.
Work Distribution
• Branch and bound or backtracking (with node elimination) is used to search the tree.
• The theoretical execution model estimates the cost at each node using the highest SP values among the processor types considered, multiplied by the number of processes assigned to the most heavily loaded processor of that type.
Work Distribution
• Branch and bound or backtracking (with node elimination) is used to search the tree, with the theoretical execution model providing a lower bound for each node.
• For example, with an array of processor types (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si and assignment array a = (2,2,3), the array of possible assignments is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed is the sum of the speeds si of the processors marked in pa. The minimum arithmetic cost is obtained from this speed, and the lowest communication costs from those between the processors in the assignment array.
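A sketch of that optimistic bound: the committed processes keep their processors, every still-unused processor of a type not below the last assigned one is added, and the bound is the work divided by this maximum achievable speed. The function shape and the per-type speed map are my assumptions, not the authors' code:

```python
from collections import Counter

def lower_bound(work, types, speed, assigned):
    """Optimistic time bound for a partial assignment in the
    backtracking search. `types` lists the type of every processor,
    `speed` maps a type to its relative speed, and `assigned` is the
    non-decreasing list of types already given a process."""
    used = Counter(assigned)
    committed = sum(speed[t] for t in assigned)
    last = assigned[-1] if assigned else min(types)
    # Processors still eligible: those of a type >= the last assigned
    # one, minus the ones the partial assignment already occupies.
    avail = Counter(t for t in types if t >= last) - used
    max_speed = committed + sum(speed[t] * c for t, c in avail.items())
    return work / max_speed
```

With the slide's example (types (1,1,1,2,2,2,3,3,3,4,4,4), a = (2,2,3)), the processors counted here are exactly those marked in pa.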
Experimental Results
• Systems:
  • SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet.
  • TORC: 11 nodes of different types (1.7 GHz Pentium 4 + 1.2 GHz AMD Athlon + 600 MHz Pentium III + 8 × 550 MHz dual Pentium III) + FastEthernet.
• Varying:
  • the problem size C = 10000, 50000, 100000, 500000, with large values of qi;
  • the granularity of the computation (the cost of a computational step).
Experimental Results
• Arithmetic SPs are estimated by solving a small problem on each type of processor.
• Communication SPs are estimated either:
  • with a ping-pong between each pair of processors, and between processes on the same processor (CP1), which does not reflect the characteristics of the system; or
  • by solving a small problem while varying the number of processors, with linear interpolation (CP2), which has a larger installation time.
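A minimal sketch of the CP2-style interpolation, assuming the installation phase has produced (number of processes, measured value) samples; the function name and the extrapolation choice are assumptions:

```python
def interpolate_sp(samples, p):
    """CP2-style estimate of a system parameter for p processes,
    from (processes, measured value) samples taken when installing
    the routine: piecewise-linear interpolation, extrapolating
    from the last segment beyond the measured range."""
    xs = sorted(samples)
    for (x0, y0), (x1, y1) in zip(xs, xs[1:]):
        if x0 <= p <= x1:
            return y0 + (y1 - y0) * (p - x0) / (x1 - x0)
    # p lies outside the sampled range: extend the last segment
    (x0, y0), (x1, y1) = xs[-2], xs[-1]
    return y0 + (y1 - y0) * (p - x0) / (x1 - x0)
```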
Experimental Results
• Three types of users are considered:
  • GU (greedy user): uses all the available processors, with one process per processor.
  • CU (conservative user): uses half of the available processors (the fastest ones), with one process per processor.
  • EU (user expert in the problem, the system and heterogeneous computing): uses a number of processes and processors that depends on the granularity:
    • one process on the fastest processor, for low granularity;
    • half as many processes as available processors, on the appropriate processors, for medium granularity;
    • as many processes as processors, on the appropriate processors, for large granularity.
Experimental Results
• Quotient between the execution time obtained with the parameters chosen by each selection method and modelled user, and the lowest execution time, on SUNEt:
Experimental Results
• Parameter selection on TORC, with CP2:

  C       gra   LT      CP2
  50000    10   (1,2)   (1,2)
  50000    50   (1,2)   (1,2,4,4)
  50000   100   (1,2)   (1,2,4,4)
  100000   10   (1,2)   (1,2)
  100000   50   (1,2)   (1,2,4,4)
  100000  100   (1,2)   (1,2,4,4)
  500000   10   (1,2)   (1,2)
  500000   50   (1,2)   (1,2,3,4)
  500000  100   (1,2)   (1,2,3,4)
Experimental Results
• Parameter selection on TORC (without the 1.7 GHz Pentium 4), with CP2:

  C       gra   LT         CP2
  50000    10   (1,1,2)    (1,1,2,3,3,3,3,3,3)
  50000    50   (1,1,2)    (1,1,2,3,3,3,3,3,3,3,3)
  50000   100   (1,1,3,3)  (1,1,2,3,3,3,3,3,3,3,3)
  100000   10   (1,1,2)    (1,1,2)
  100000   50   (1,1,3)    (1,1,2,3,3,3,3,3,3,3,3)
  100000  100   (1,1,3)    (1,1,2,3,3,3,3,3,3,3,3)
  500000   10   (1,1,2)    (1,1,2)
  500000   50   (1,1,2)    (1,1,2,3)
  500000  100   (1,1,2)    (1,1,2)

• Type 1: one 1.2 GHz AMD Athlon; Type 2: one 600 MHz single Pentium III; Type 3: eight 550 MHz dual Pentium III.
Experimental Results
• Quotient between the execution time obtained with the parameters chosen by each selection method and modelled user, and the lowest execution time, on TORC:
Experimental Results
• Quotient between the execution time obtained with the parameters chosen by each selection method and modelled user, and the lowest execution time, on TORC (without the 1.7 GHz Pentium 4):
Conclusions and future work
• The inclusion of autotuning capabilities in a parallel dynamic programming scheme for heterogeneous networks of processors has been considered.
• Parameter selection is combined with heuristic search in the assignment tree.
• Experimentally, the selection proves satisfactory, providing users with routines capable of reduced execution times.
• In the future we plan to apply this technique to other algorithmic schemes.