
Heuristics for Work Distribution in Parallel Dynamic Programming on Heterogeneous Systems

This study develops heuristics for work distribution in parallel dynamic programming on heterogeneous systems, with the goal of obtaining parallel routines with autotuning capacity. It explores several parallel dynamic programming schemes and their performance on different types of processors, and the results provide insight into work distribution strategies for efficient parallel processing on heterogeneous systems.


Presentation Transcript


  1. Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems
  Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain, javiercm@ditec.um.es
  Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain, domingo@dif.um.es, dis.um.es/~domingo
  Juan-Pedro Martínez, Departamento de Estadística y Matemática Aplicada, Universidad Miguel Hernández de Elche, Spain, jp.martinez@uhm.es
  HeteroPar2004

  2. Our Goal General Goal: to obtain parallel routines with autotuning capacity • Previous works: Linear Algebra Routines, Homogeneous Systems • This communication: Parallel Dynamic Programming Schemes on Heterogeneous Systems • In the future: apply the techniques to other algorithmic schemes HeteroPar2004

  3. Outline • Parallel Dynamic Programming Schemes • Autotuning in Parallel Dynamic Programming Schemes • Work Distribution • Experimental Results HeteroPar2004

  4. Parallel Dynamic Programming Schemes • There are different Parallel Dynamic Programming Schemes. • The simple scheme of the “coins problem” is used: given a quantity C and n types of coins with values v=(v1,v2,…,vn) and available quantities q=(q1,q2,…,qn), minimize the number of coins used to give C. • But the granularity of the computation has been varied, to study the scheme rather than the problem. HeteroPar2004

  5. Parallel Dynamic Programming Schemes
  [Slide figure: the DP table, with rows i = 1 … n (decisions) and columns j = 1 … N (problem sizes).]
  • Sequential scheme:
      for i = 1 to number_of_decisions
          for j = 1 to problem_size
              obtain the optimum solution with i decisions and problem size j
          endfor
          Complete the table with the formula (given as a figure in the slide)
      endfor
  HeteroPar2004
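A minimal sketch of the sequential scheme above, assuming the standard bounded coin-change recurrence T[i][j] = min over k = 0..q[i] of (T[i-1][j - k*v[i]] + k); the slide gives its recurrence only as a figure, so this exact form, and the function name coins_dp, are assumptions for illustration:

    INF = float("inf")

    def coins_dp(C, v, q):
        """Minimum number of coins needed to give quantity C, with coin values v
        and available quantities q (assumed recurrence, see note above)."""
        n = len(v)
        # T[i][j]: minimum coins for quantity j using only the first i coin types
        T = [[0] + [INF] * C] + [[INF] * (C + 1) for _ in range(n)]
        for i in range(1, n + 1):              # one row per decision (coin type)
            for j in range(C + 1):             # one column per problem size
                best = T[i - 1][j]             # use no coin of type i
                for k in range(1, q[i - 1] + 1):
                    if k * v[i - 1] > j:
                        break
                    best = min(best, T[i - 1][j - k * v[i - 1]] + k)
                T[i][j] = best
        return T[n][C]

    print(coins_dp(11, [1, 2, 5], [10, 10, 10]))   # -> 3 coins (5 + 5 + 1)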

  6. Parallel Dynamic Programming Schemes
  [Slide figure: the DP table, with its columns distributed among processors P0, P1, P2, …, PS, …, PK-1, PK.]
  • Parallel scheme:
      for i = 1 to number_of_decisions
          In Parallel:
              for j = 1 to problem_size
                  obtain the optimum solution with i decisions and problem size j
              endfor
          endInParallel
      endfor
  HeteroPar2004

  7. Parallel Dynamic Programming Schemes
  [Slide figure: the DP table (rows i = 1 … n, columns j = 1 … N), with blocks of columns assigned to processors P0, P1, P2, …, PK-1, PK.]
  • Message-passing scheme:
      In each processor Pj:
          for i = 1 to number_of_decisions
              communication step
              obtain the optimum solution with i decisions and the problem sizes Pj has assigned
          endfor
      endInEachProcessor
  HeteroPar2004
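A minimal sketch of the message-passing scheme, assuming mpi4py and the same bounded coin-change recurrence as above; the transcript does not give the paper's exact data distribution, so the block distribution of the columns and the allgather used as the communication step are illustrative choices:

    from mpi4py import MPI

    INF = float("inf")

    def parallel_coins(C, values, quantities, comm):
        rank, size = comm.Get_rank(), comm.Get_size()
        block = (C + size) // size              # block distribution of columns j = 0..C
        lo, hi = rank * block, min((rank + 1) * block, C + 1)
        prev = [0] + [INF] * C                  # row i = 0: no coin types used yet
        for i in range(len(values)):            # one step per decision (coin type)
            if i > 0:                           # communication step: rebuild the full previous row
                prev = sum(comm.allgather(prev[lo:hi]), [])
            cur = [INF] * (hi - lo)
            for j in range(lo, hi):             # only the columns this process has assigned
                best = prev[j]                  # use no coin of type i
                for k in range(1, quantities[i] + 1):
                    if k * values[i] > j:
                        break
                    best = min(best, prev[j - k * values[i]] + k)
                cur[j - lo] = best
            prev[lo:hi] = cur                   # local block of the new row
        full = sum(comm.allgather(prev[lo:hi]), [])
        return full[C]

    if __name__ == "__main__":
        comm = MPI.COMM_WORLD
        best = parallel_coins(100, [1, 2, 5, 10], [100, 50, 20, 10], comm)
        if comm.Get_rank() == 0:
            print("minimum number of coins:", best)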

  8. Parallel Dynamic Programming Schemes • There are different possibilities in heterogeneous systems: • Heterogeneous algorithms. • Homogeneous algorithms and assignation of: one process to each processor, or a variable number of processes to each processor, depending on the relative speed. • The general assignation problem is NP-hard, so heuristic approximations are used. HeteroPar2004

  9. Parallel Dynamic Programming Schemes
  • Dynamic Programming (the coins problem scheme): homogeneous algorithm + heterogeneous distribution.
  [Slide figure: the DP table with processes p0, p1, p2, …, ps, …, pr-1, pr mapped onto processors, e.g. p0, p1 → P0; p2 → P1; p3, p4, p5 → P3; …; ps → PS; …; pr-1, pr → PK; below, the table columns distributed among P0, P1, P2, …, PS, …, PK-1, PK.]
  HeteroPar2004
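An illustrative sketch (not taken from the paper) of the idea of assigning a variable number of homogeneous processes to each processor in proportion to its relative speed; the function name and the largest-remainder rounding are hypothetical choices:

    def proportional_assignment(num_processes, relative_speeds):
        """counts[i] = processes given to processor i; mapping = process -> processor."""
        total = sum(relative_speeds)
        shares = [num_processes * s / total for s in relative_speeds]   # ideal fractional shares
        counts = [int(x) for x in shares]
        leftover = num_processes - sum(counts)
        # give the processes lost to rounding to the largest remainders
        order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
        for i in order[:leftover]:
            counts[i] += 1
        mapping = [proc for proc, c in enumerate(counts) for _ in range(c)]
        return counts, mapping

    # Example: one SUN Ultra 5 (2.5 times faster) and five SUN Ultra 1, 7 processes
    print(proportional_assignment(7, [2.5, 1, 1, 1, 1, 1]))
    # -> ([2, 1, 1, 1, 1, 1], [0, 0, 1, 2, 3, 4, 5])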

  10. Autotuning in Parallel Dynamic Programming Schemes
  • The model: t(n, C, v, q, tc(n, C, v, q, p, b, d), ts(n, C, v, q, p, b, d), tw(n, C, v, q, p, b, d))
  • Problem size:
      n  number of types of coins
      C  value to give
      v  array of values of the coins
      q  quantity of coins of each type
  • Algorithmic parameters (AP):
      p  number of processes
      b  block size (here n/p)
      d  processes-to-processors assignment
  • System parameters (SP):
      tc  cost of basic arithmetic operations
      ts  start-up time
      tw  word-sending time
  HeteroPar2004

  11. Autotuning in Parallel Dynamic Programming Schemes
  • Theoretical model (the cost formulas, labelled "one step" and "Maximum values", appear as figures in the slide):
      Sequential cost
      Computational parallel cost (qi large)
      Communication cost
  • The APs are p and the assignation array d
  • The SPs are the one-dimensional array tc and the two-dimensional arrays ts and tw
  HeteroPar2004
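Since the cost formulas themselves are only images in the slide, the following is an assumed, simplified stand-in (not the paper's exact model) that follows the surrounding text: arithmetic work is shared among the p processes and charged at the worst effective tc (the tc of a type multiplied by the number of processes placed on that type), and each of the n steps pays a start-up plus word-sending cost taken as the maximum over the types used; the function name estimated_time and the parameter layout are assumptions:

    def estimated_time(n, C, p, d, tc, ts, tw):
        """d: tuple with the processor type of each process (the assignation array);
        tc: dict type -> arithmetic cost; ts, tw: dicts (type, type) -> comm costs."""
        load = {}                                   # processes placed on each type
        for t in d:
            load[t] = load.get(t, 0) + 1
        worst_tc = max(tc[t] * load[t] for t in load)
        t_arith = n * (C / p) * worst_tc            # about C/p table entries per process and step
        if p == 1:
            return t_arith
        types = list(load)
        worst_ts = max(ts[(a, b)] for a in types for b in types)
        worst_tw = max(tw[(a, b)] for a in types for b in types)
        return t_arith + n * (worst_ts + C * worst_tw)

    # toy usage with made-up values for two processor types
    tc = {1: 1.0e-7, 2: 2.5e-7}
    ts = {(a, b): 5.0e-4 for a in (1, 2) for b in (1, 2)}
    tw = {(a, b): 1.0e-6 for a in (1, 2) for b in (1, 2)}
    print(estimated_time(n=4, C=100000, p=3, d=(1, 1, 2), tc=tc, ts=ts, tw=tw))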

  12. Work distribution
  • Assignment tree (P types of processors and p processes):
  [Slide figure: a tree with one level per assigned process; the children of the root are the types 1, 2, 3, …, P, and the children of a node of type t are the types t, t+1, …, P, so the types along a branch never decrease.]
  • Some limit in the height of the tree (the number of processes) is necessary.
  HeteroPar2004

  13. Work distribution
  • Assignment tree (P types of processors and p processes):
  [Slide figure: Pascal's triangle, used to count the nodes of the tree.]
  • P = 2 and p = 3: 10 nodes; in general, the number of nodes is given by a binomial-coefficient formula (shown in the slide).
  HeteroPar2004
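A small illustrative sketch (not from the paper) that enumerates the nodes of this assignment tree: each node is a non-decreasing sequence of processor types, i.e. a multiset of at most p types out of P, which is why the node counts follow Pascal's triangle; for P = 2 and p = 3 the enumeration gives the 10 nodes stated on the slide (counting the root):

    from itertools import combinations_with_replacement

    def assignment_tree_nodes(P, p):
        """Yield every node of the tree as a non-decreasing tuple of types 1..P."""
        for length in range(p + 1):                     # the root is the empty tuple
            yield from combinations_with_replacement(range(1, P + 1), length)

    nodes = list(assignment_tree_nodes(2, 3))
    print(len(nodes))    # 10
    print(nodes)         # (), (1,), (2,), (1, 1), (1, 2), (2, 2), (1, 1, 1), ..., (2, 2, 2)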

  14. Work distribution • Systems: • SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster), connected by Ethernet. • TORC (Innovative Computing Laboratory): 21 nodes of different types (dual and single nodes; Pentium II, III and 4; AMD Athlon; …), connected by FastEthernet, Myrinet, … HeteroPar2004

  15. Work distribution
  • Assignment tree for SUNEt: P = 2 types of processors (five SUN Ultra 1 + one SUN Ultra 5).
  [Slide figure: the tree over the two processor types (U5, U1), one process to each processor, down to p processes.]
  • Nodes: when more processes than available processors are assigned to a type of processor, the costs of operations (the SPs) change.
  HeteroPar2004

  16. Work distribution
  • Assignment tree for TORC, using P = 4 types of processors:
      Type 1: one 1.7 GHz Pentium 4 (only one process can be assigned)
      Type 2: one 1.2 GHz AMD Athlon
      Type 3: one 600 MHz single Pentium III
      Type 4: eight 550 MHz dual Pentium III
  [Slide figure: the four-type assignment tree, with the 4 processor types as children of the root and the children of a node of type t being the types t, t+1, …, 4, down to p processes.]
  • The dual processors are not reflected in the tree: two consecutive processes are assigned to a same node, and the values of the SPs change.
  HeteroPar2004

  17. Work distribution • Use Branch and Bound or Backtracking (with node elimination) to search through the tree: • Use the theoretical execution model to estimate the cost at each node, taking the highest values of the SPs among the types of processors considered and multiplying them by the number of processes assigned to the most loaded processor of that type (formula shown as a figure in the slide). A backtracking sketch follows below. HeteroPar2004
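An illustrative backtracking/branch-and-bound sketch over the assignment tree (a sketch under assumptions, not the paper's routine): estimate and lower_bound stand for the model-based node cost and lower bound of these two slides and are passed in as parameters, so any cost model (for instance the estimated_time sketch above) can be plugged in; the toy cost at the end is made up purely to make the example runnable:

    def search_assignments(P, max_p, estimate, lower_bound):
        best = {"cost": float("inf"), "assignment": ()}

        def backtrack(node, first_type):
            if node:                                    # every non-root node is a candidate
                cost = estimate(node)
                if cost < best["cost"]:
                    best["cost"], best["assignment"] = cost, node
            if len(node) == max_p:                      # height limit: the number of processes
                return
            for t in range(first_type, P + 1):          # children keep the types non-decreasing
                child = node + (t,)
                if lower_bound(child) < best["cost"]:   # node elimination (pruning)
                    backtrack(child, t)

        backtrack((), 1)
        return best["assignment"], best["cost"]

    # toy usage: a made-up cost that favours several processes on fast (low-numbered) types
    toy_cost = lambda a: 100.0 / sum(1.0 / t for t in a) + 2.0 * len(a)
    toy_bound = lambda a: 2.0 * len(a)                  # never exceeds toy_cost, so it is a valid bound
    print(search_assignments(P=4, max_p=6, estimate=toy_cost, lower_bound=toy_bound))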

  18. Work distribution • Use Branch and Bound or Backtracking (with node elimination) to search through the tree: • Use the theoretical execution model to obtain a lower bound for each node. For example, with an array of processor types (1,1,1,2,2,2,3,3,3,4,4,4) with relative speeds si and an assignment array a = (2,2,3), the array of possible assignments is pa = (0,0,0,1,1,0,1,1,1,1,1,1), and the maximum achievable speed (given by the formula in the slide) is computed from pa and the si. The minimum arithmetic cost is obtained from this speed, and the lowest communication costs are obtained from those between the processors in the assignment array. A small worked sketch follows below. HeteroPar2004
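A hedged reconstruction of this example (the exact maximum-achievable-speed formula is an image in the slide, so summing the relative speeds of the still-usable processors, and the numeric speeds below, are assumptions): processors of a type smaller than the last type in the partial assignment can no longer receive processes, because the tree assigns types in non-decreasing order:

    def possible_assignments(types, a):
        """pa[i] = 1 if processor i already holds a process or can still receive one."""
        used = list(a)                      # processes already placed, identified by type
        last = a[-1] if a else min(types)   # the search can only continue with types >= last
        pa = []
        for t in types:
            if t >= last:
                pa.append(1)                # still reachable by the search
            elif t in used:
                pa.append(1)                # already holds one of the assigned processes
                used.remove(t)
            else:
                pa.append(0)
        return pa

    types = (1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
    a = (2, 2, 3)
    pa = possible_assignments(types, a)
    print(pa)                                       # [0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1] as on the slide
    speeds = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0}       # hypothetical relative speeds si
    print(sum(speeds[t] for t, flag in zip(types, pa) if flag))   # assumed maximum achievable speed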

  19. Experimental Results • Systems: • SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet • TORC: 11 nodes of different types (1.7 GHz Pentium 4 + 1.2 GHz AMD Athlon + 600 MHz Pentium III + eight 550 MHz dual Pentium III) + FastEthernet • Varying: the problem size C = 10000, 50000, 100000, 500000; a large value of qi; and the granularity of the computation (the cost of a computational step). HeteroPar2004

  20. Experimental Results • How to estimate the arithmetic SPs: solving a small problem on each type of processor. • How to estimate the communication SPs: • Using a ping-pong between each pair of processors, and between processes in the same processor (CP1); this does not reflect the characteristics of the system. • Solving a small problem while varying the number of processors, with linear interpolation (CP2); larger installation time. A ping-pong sketch follows below. HeteroPar2004
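A minimal ping-pong sketch for estimating ts and tw between one pair of processes (as in CP1), assuming mpi4py; the message sizes, the repetition count and the least-squares fit are illustrative choices, not the paper's installation procedure:

    import numpy as np
    from mpi4py import MPI

    def ping_pong(comm, sizes=(1, 1024, 65536), reps=100):
        rank = comm.Get_rank()
        times = []
        for n in sizes:
            buf = np.zeros(n, dtype="d")
            comm.Barrier()
            t0 = MPI.Wtime()
            for _ in range(reps):
                if rank == 0:
                    comm.Send(buf, dest=1); comm.Recv(buf, source=1)
                elif rank == 1:
                    comm.Recv(buf, source=0); comm.Send(buf, dest=0)
            times.append((MPI.Wtime() - t0) / (2 * reps))   # one-way time per message
        if rank == 0:
            tw, ts = np.polyfit(sizes, times, 1)            # fit time(n) ~ ts + tw * n
            print(f"ts = {ts:.3e} s, tw = {tw:.3e} s/word (8-byte words)")

    if __name__ == "__main__":
        ping_pong(MPI.COMM_WORLD)    # run with at least two processes, e.g. mpirun -np 2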

  21. Experimental Results • Three types of users are considered:
  • GU (greedy user): uses all the available processors, with one process per processor.
  • CU (conservative user): uses half of the available processors (the fastest), with one process per processor.
  • EU (user expert in the problem, the system and heterogeneous computing): uses a different number of processes and processors depending on the granularity:
      • One process on the fastest processor, for low granularity.
      • A number of processes equal to half of the available processors, on the appropriate processors, for middle granularity.
      • A number of processes equal to the number of processors, on the appropriate processors, for large granularity.
  HeteroPar2004

  22. Experimental Results • Quotient between the execution time obtained with the parameters selected by each of the selection methods and modelled users, and the lowest execution time, in SUNEt (results shown as a chart in the slide). HeteroPar2004

  23. Experimental Results
  • Parameters selection, in TORC, with CP2:

      C        gra   LT      CP2
      50000     10   (1,2)   (1,2)
      50000     50   (1,2)   (1,2,4,4)
      50000    100   (1,2)   (1,2,4,4)
      100000    10   (1,2)   (1,2)
      100000    50   (1,2)   (1,2,4,4)
      100000   100   (1,2)   (1,2,4,4)
      500000    10   (1,2)   (1,2)
      500000    50   (1,2)   (1,2,3,4)
      500000   100   (1,2)   (1,2,3,4)

  HeteroPar2004

  24. Experimental Results
  • Parameters selection, in TORC (without the 1.7 GHz Pentium 4), with CP2:
      Type 1: one 1.2 GHz AMD Athlon
      Type 2: one 600 MHz single Pentium III
      Type 3: eight 550 MHz dual Pentium III

      C        gra   LT          CP2
      50000     10   (1,1,2)     (1,1,2,3,3,3,3,3,3)
      50000     50   (1,1,2)     (1,1,2,3,3,3,3,3,3,3,3)
      50000    100   (1,1,3,3)   (1,1,2,3,3,3,3,3,3,3,3)
      100000    10   (1,1,2)     (1,1,2)
      100000    50   (1,1,3)     (1,1,2,3,3,3,3,3,3,3,3)
      100000   100   (1,1,3)     (1,1,2,3,3,3,3,3,3,3,3)
      500000    10   (1,1,2)     (1,1,2)
      500000    50   (1,1,2)     (1,1,2,3)
      500000   100   (1,1,2)     (1,1,2)

  HeteroPar2004

  25. Experimental Results • Quotient between the execution time obtained with the parameters selected by each of the selection methods and modelled users, and the lowest execution time, in TORC (results shown as a chart in the slide). HeteroPar2004

  26. Experimental Results • Quotient between the execution time obtained with the parameters selected by each of the selection methods and modelled users, and the lowest execution time, in TORC (without the 1.7 GHz Pentium 4) (results shown as a chart in the slide). HeteroPar2004

  27. Conclusions and future work • The inclusion of autotuning capacities in a Parallel Dynamic Programming Scheme for heterogeneous networks of processors has been considered. • Parameter selection is combined with a heuristic search in the assignment tree. • Experimentally, the selection proves satisfactory and useful in providing users with routines that achieve reduced execution times. • In the future we plan to apply this technique to other algorithmic schemes. HeteroPar2004
