Automatic Optimization in Parallel Dynamic Programming Schemes
Juan-Pedro Martínez, Departamento de Estadística y Matemática Aplicada, Universidad Miguel Hernández de Elche, Spain, jp.martinez@uhm.es
Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain, domingo@dif.um.es, dis.um.es/~domingo
VECPAR 2004
Our Goal
General goal: to obtain parallel routines with autotuning capacity
• Previous work: linear algebra routines
• This communication: parallel dynamic programming schemes
• In the future: apply the techniques to hybrid, heterogeneous and distributed systems
Outline
• Modelling Parallel Routines for Autotuning
• Parallel Dynamic Programming Schemes
• Autotuning in Parallel Dynamic Programming Schemes
• Experimental Results
Modelling Parallel Routines for Autotuning
The execution time must be predicted accurately in order to select:
• The number of processes
• The number of processors
• Which processors
• The number of rows and columns of processes (the topology)
• The assignment of processes to processors
• The computational block size (in linear algebra algorithms)
• The communication block size
• The algorithm (polyalgorithms)
• The routine or library (polylibraries)
Modelling Parallel Routines for Autotuning
Cost of a parallel program:
$$t_{par} = t_{arith} + t_{com} + t_{over} - t_{overlap}$$
• $t_{arith}$: arithmetic time
• $t_{com}$: communication time
• $t_{over}$: overhead, for synchronization, imbalance, process creation, ...
• $t_{overlap}$: overlapping of communication and computation
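Modelling Parallel Routines for Autotuning
Estimation of the time, from the arithmetic time $t_{arith}$ and the communication time $t_{com}$:
$$t_{par} = t_{arith} + t_{com}$$
Considering computation and communication divided into a number of steps:
$$t_{par} = \sum_{i} \left( t_{arith,i} + t_{com,i} \right)$$
For each part of the formula, the value taken is that of the process which gives the highest value.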
Modelling Parallel Routines for Autotuning
The time depends on the problem size ($n$) and the system size ($p$):
$$t(n, p)$$
It also depends on some ALGORITHMIC PARAMETERS, such as the block size ($b$) and the number of processors ($q$) used from the total available:
$$t(n, p, b, q)$$
Modelling Parallel Routines for Autotuning
The time also depends on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system. Typically these are the cost of an arithmetic operation ($t_c$), the start-up time ($t_s$) and the word-sending time ($t_w$).
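As a rough illustration of how the System Parameters enter a time estimate, the following Python sketch evaluates a simplified cost of the form tc * operations + ts * messages + tw * words. The class name, the function name and the omission of the overhead and overlapping terms are assumptions made for this example; they are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class SystemParameters:
    """Hypothetical container for the three typical System Parameters."""
    tc: float  # cost of one arithmetic operation, in seconds
    ts: float  # start-up time of one message, in seconds
    tw: float  # time to send one word, in seconds


def estimated_time(n_ops, n_messages, n_words, sp):
    """Simplified estimate: arithmetic time plus communication time.

    Overhead and overlapping are ignored here (an assumption of this sketch).
    """
    arithmetic = sp.tc * n_ops
    communication = sp.ts * n_messages + sp.tw * n_words
    return arithmetic + communication
```

With illustrative values such as `SystemParameters(tc=1e-8, ts=1e-4, tw=1e-7)`, a call like `estimated_time(10**6, 10, 1000, sp)` adds one arithmetic term and two communication terms, which is the shape of model the autotuning relies on.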
Modelling Parallel Routines for Autotuning
The values of the System Parameters can be obtained:
• With installation routines associated with the routine being installed
• From information stored when the library was installed in the system
• At execution time, by testing the system conditions prior to the call to the routine
Modelling Parallel Routines for Autotuning
These values can be obtained as single values (the traditional method) or as functions of the Algorithmic Parameters. In the second case:
• A multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored
• When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size
• The problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time
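The table-based selection described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the table layout, the function names, the cost model `toy_model` and all numeric values are hypothetical and only serve to show the two steps (closest stored size, then lowest predicted time).

```python
def closest_size(stored_sizes, n):
    """Return the stored problem size nearest to the real size n."""
    return min(stored_sizes, key=lambda s: abs(s - n))


def select_parameter(table, n, candidates, model):
    """Pick the Algorithmic Parameter value with the lowest predicted time.

    table[(size, ap)] holds the System Parameters stored at installation
    for that problem size and Algorithmic Parameter value.
    """
    sizes = sorted({size for size, _ in table})
    size = closest_size(sizes, n)
    return min(candidates, key=lambda ap: model(n, ap, table[(size, ap)]))


def toy_model(n, p, sp):
    """Hypothetical cost model, used only to exercise the selection."""
    return sp["tc"] * n * n / p + (p - 1) * (sp["ts"] + sp["tw"] * n)


# Hypothetical installation table: the same values in every entry, for brevity.
TABLE = {(size, p): {"tc": 1e-8, "ts": 1e-3, "tw": 1e-6}
         for size in (10000, 100000) for p in (1, 2, 4)}
```

With this toy model, `select_parameter(TABLE, 12000, [1, 2, 4], toy_model)` favours more processors, while a very small problem such as `n = 100` is predicted to run fastest on a single processor.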
Parallel Dynamic Programming Schemes
• There are different Parallel Dynamic Programming Schemes.
• The simple scheme of the “coins problem” is used:
    • Given a quantity C, n types of coins with values v = (v_1, v_2, …, v_n) and a quantity q = (q_1, q_2, …, q_n) of coins of each type, minimize the number of coins used to give C.
• The granularity of the computation has been varied, because the aim is to study the scheme, not the problem.
Parallel Dynamic Programming Schemes
• Sequential scheme:
    for i = 1 to number_of_decisions
        for j = 1 to problem_size
            obtain the optimum solution with i decisions and problem size j
        endfor
    endfor
[Figure: table with one row per problem size (1, 2, ..., j, ..., N) and one column per decision (1, 2, ..., i, ..., n), completed with the recurrence formula of the problem]
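The sequential scheme can be made concrete for the coins problem with a short Python sketch. The recurrence used here, the minimum over k of (solution with one coin type fewer for amount j - k * v_i) + k, with k limited by both j // v_i and q_i, is the standard one for this problem and is an assumption of the example, as are the function and variable names.

```python
import math


def min_coins(C, values, quantities):
    """Minimum number of coins needed to give C, or math.inf if impossible.

    values[i] is the value of coin type i, quantities[i] how many are available.
    """
    # Row for "zero coin types": only the amount 0 can be given.
    prev = [0] + [math.inf] * C
    for v, q in zip(values, quantities):        # one decision per coin type
        cur = [math.inf] * (C + 1)
        for j in range(C + 1):                  # every problem size
            best = math.inf
            for k in range(min(j // v, q) + 1):  # use k coins of this type
                best = min(best, prev[j - k * v] + k)
            cur[j] = best
        prev = cur
    return prev[C]
```

For instance, `min_coins(11, [1, 5, 10], [5, 2, 1])` combines one coin of value 10 and one of value 1. Only the previous row is kept, since row i depends on row i - 1 alone.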
Parallel Dynamic Programming Schemes
• Parallel scheme:
    for i = 1 to number_of_decisions
        In Parallel:
            for j = 1 to problem_size
                obtain the optimum solution with i decisions and problem size j
            endfor
        endInParallel
    endfor
[Figure: table of problem sizes (1, 2, ..., j, ...) and decisions (1, 2, ..., i, ..., n), computed by processors P0, P1, P2, ..., PS, ..., PK-1, PK]
Parallel Dynamic Programming Schemes
• Message-passing scheme:
    In each processor Pj
        for i = 1 to number_of_decisions
            communication step
            obtain the optimum solution with i decisions and the problem sizes assigned to Pj
        endfor
    endInEachProcessor
[Figure: table of problem sizes (1, 2, ..., j, ..., N) and decisions (1, 2, ..., i, ..., n), with the problem sizes distributed among processors P0, P1, P2, ..., PK-1, PK]
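The message-passing scheme can be imitated in a single Python process to show the data distribution. In this sketch each simulated process owns a contiguous block of problem sizes, and the "communication step" is modelled by letting every block read the whole previous row. The block distribution, the function names and the coins recurrence are assumptions of the example; no real message passing takes place.

```python
import math


def block_ranges(C, p):
    """Split the problem sizes 0..C into p contiguous blocks, one per process."""
    base, extra = divmod(C + 1, p)
    ranges, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def min_coins_blocks(C, values, quantities, p):
    """Coins problem computed block by block, as p processes would do it."""
    prev = [0] + [math.inf] * C
    ranges = block_ranges(C, p)
    for v, q in zip(values, quantities):
        # Communication step (simulated): every block can read row i - 1.
        pieces = []
        for lo, hi in ranges:                   # work of one simulated process
            pieces.append([
                min(prev[j - k * v] + k for k in range(min(j // v, q) + 1))
                for j in range(lo, hi)
            ])
        # Gather the blocks to form row i.
        prev = [x for piece in pieces for x in piece]
    return prev[C]
```

Because entry j only reads entries of the previous row with index at most j, a real implementation would need data only from its own block and from lower-numbered blocks; that refinement is left out here.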
Autotuning in Parallel Dynamic Programming Schemes
• Theoretical model, with expressions for:
    • the sequential cost
    • the computational parallel cost (for large q_i)
    • the communication cost
• The only Algorithmic Parameter (AP) is p
• The System Parameters (SPs) are t_c, t_s and t_w
[Figure labels: process Pp, one step]
Autotuning in Parallel Dynamic Programming Schemes
• How to estimate the arithmetic SPs: solving a small problem
• How to estimate the communication SPs:
    • Using a ping-pong (CP1)
    • Solving a small problem varying the number of processors (CP2)
    • Solving problems of selected sizes in systems of selected sizes (CP3)
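For the ping-pong approach (CP1), one common way to obtain the start-up and word-sending times is to time messages of several sizes and fit a straight line t = ts + tw * m. The Python sketch below does the fit with ordinary least squares; the function name, the linear model and the sample values in the usage note are assumptions of this example, not measurements from the slides.

```python
def fit_ts_tw(samples):
    """Least-squares fit of t = ts + tw * m.

    samples is a list of (m, t) pairs: message size in words and the
    measured one-way time in seconds. At least two distinct sizes are needed.
    """
    n = len(samples)
    mean_m = sum(m for m, _ in samples) / n
    mean_t = sum(t for _, t in samples) / n
    sxx = sum((m - mean_m) ** 2 for m, _ in samples)
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in samples)
    tw = sxy / sxx              # slope: time per word
    ts = mean_t - tw * mean_m   # intercept: start-up time
    return ts, tw
```

Given synthetic timings generated with ts = 1e-4 and tw = 2e-8 for sizes 1000, 10000 and 100000 words, the fit recovers both values up to rounding error.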
Experimental Results
• Systems:
    • SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster) + Ethernet
    • PenFE: seven Pentium III + FastEthernet
• Varying:
    • The problem size: C = 10000, 50000, 100000, 500000
    • Large values of q_i
    • The granularity of the computation (the cost of a computational step)
Experimental Results
• CP1:
    • Ping-pong (point-to-point communication)
    • Does not reflect the characteristics of the system
• CP2:
    • Executions with the smallest problem (C = 10000), varying the number of processors
    • Reflects the characteristics of the system, but the time also changes with C
    • Larger installation time (6 and 9 seconds)
• CP3:
    • Executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes
    • Larger installation time (76 and 35 seconds)
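The linear interpolation used in CP3 can be sketched in Python for one dimension, for example interpolating a measured value between two stored problem sizes. The function name, the clamping outside the measured range and the numbers in the usage note are assumptions of this example.

```python
from bisect import bisect_left


def interpolate_linear(points, x):
    """Linear interpolation over points, a list of (x_i, y_i) sorted by x_i.

    Outside the measured range the nearest measured value is returned
    (clamping is a choice made for this sketch).
    """
    xs = [px for px, _ in points]
    if x <= xs[0]:
        return points[0][1]
    if x >= xs[-1]:
        return points[-1][1]
    i = bisect_left(xs, x)          # first index with xs[i] >= x
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

With hypothetical measurements `[(10000, 2.0), (100000, 11.0)]`, a problem of size 55000 lies halfway between the stored sizes and gets the halfway value. Interpolating in both problem size and system size would apply the same idea once per dimension.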
Experimental Results
Parameter selection
[Tables: parameter selection in SUNEt and in PenFE]
Experimental Results
• Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in SUNEt:
Experimental Results
• Quotient between the execution time with the parameter selected by each one of the selection methods and the lowest execution time, in PenFE:
Experimental Results
• Three types of users are considered:
    • GU (greedy user): uses all the available processors
    • CU (conservative user): uses half of the available processors
    • EU (expert user): uses a different number of processors depending on the granularity:
        • 1 for low granularity
        • Half of the available processors for middle granularity
        • All the processors for high granularity
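The three user profiles, and the quotient used to compare them with the best choice, can be written down as a small Python sketch. The function names, the string labels and the rounding down of "half" for an odd number of processors are assumptions of this example.

```python
def processors_for_user(user, available, granularity=None):
    """Number of processors chosen by each type of user.

    user is "GU", "CU" or "EU"; granularity ("low", "middle" or "high")
    is only used by the expert user.
    """
    half = max(1, available // 2)   # rounding down is an assumption
    if user == "GU":
        return available
    if user == "CU":
        return half
    if user == "EU":
        return {"low": 1, "middle": half, "high": available}[granularity]
    raise ValueError(f"unknown user type: {user}")


def quotient(time_with_selection, lowest_time):
    """Ratio between the time obtained with a selection and the lowest time."""
    return time_with_selection / lowest_time
```

A quotient of 1.0 means the selection matched the best number of processors; larger values measure how much time the selection lost.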
Experimental Results
• Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt:
Experimental Results
• Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE:
Conclusions and future work
• The inclusion of autotuning capacities in a Parallel Dynamic Programming Scheme has been considered.
• Different ways of modelling the scheme and of selecting the parameters have been studied.
• Experimentally, the selection proves to be satisfactory, and useful in providing users with routines that achieve reduced execution times.
• In the future we plan to apply this technique:
    • to other algorithmic schemes,
    • in hybrid, heterogeneous and distributed systems.