A GPU algorithm design for the Resource Constrained Project Scheduling Problem
Libor Bukata and Přemysl Šůcha
{bukatlib,suchap}@fel.cvut.cz
The Czech Technical University in Prague
Motivation
• Our motivation is to use the power of the GPU to solve combinatorial problems.
• Existing works:
  [1] M. Czapinski and S. Barnes, “Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform,” J. Parallel Distrib. Comput., vol. 71, pp. 802–811, June 2011.
  [2] V. Boyer, D. El-Baz, and M. Elkihel, “Solving knapsack problems on GPU,” Computers & Operations Research, vol. 39, no. 1, pp. 42–47, 2012.
• We tackle a more complex combinatorial problem than [1, 2].
• We focus on a homogeneous model.
Outline
• Problem Statement (RCPSP)
• Sequential Solution (Tabu Search Algorithm)
• Parallelization
• Parallelization on the Nvidia CUDA Framework
• Experimental Results
• Conclusions
Problem Statement
• The Resource Constrained Project Scheduling Problem (RCPSP) is a general scheduling problem.
• It is one of the most important problems in project management, manufacturing, and production optimization.
• The problem is NP-hard, since P2||Cmax (a two-way partitioning problem) is already NP-hard.
[Figure: precedence graph of an example project with activities 0 to 7]
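Problem Statement
• A set of $N$ activities $V = \{0, \dots, N-1\}$ with durations $D = (d_0, \dots, d_{N-1})$, where $d_i \in \mathbb{Z}^+$. Activity 0 is the first activity of the project and activity $N-1$ is the last one.
• Precedence relations among activities are given by a directed acyclic graph $G(V, E)$, where $E$ is the set of edges $(i, j) \in E$; an edge $(i, j)$ means that activity $i$ precedes activity $j$.
[Figure: example precedence graph G(V, E) with activities 0 to 7]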
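Problem Statement
• A set of $M$ renewable resources with capacities $R = \{R_0, \dots, R_{M-1}\}$, where $R_k \in \mathbb{Z}^+$.
• Each activity $i$ has a resource requirement $r_{i,k} \in \mathbb{Z}^+$ for resource $k$.
[Figure: example schedule drawn as two Gantt charts, one for Resource 1 (capacity R1) and one for Resource 2 (capacity R2), over time t, with the makespan Cmax marked]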
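Problem Statement
• A schedule $S$ is a vector $(s_0, \dots, s_{N-1})$ of activity start times $s_i \in \mathbb{Z}^+$ that satisfies the constraints of the mathematical model.
• The model consists of:
  • an objective function (minimize the makespan $C_{max}$),
  • precedence constraints,
  • resource constraints.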
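The three parts of the model are named on the slide, but the formulas themselves are not in this transcript. The following is a sketch of the commonly used RCPSP formulation written in the notation defined above. It is an assumption that the slide used this exact form; the set $A(t)$ of activities running at time $t$ is a helper symbol introduced here.

```latex
% Objective function: minimize the makespan
\min \; C_{max} = \max_{i \in V} \, (s_i + d_i)

% Precedence constraints: j cannot start before i has finished
s_j \ge s_i + d_i \qquad \forall (i, j) \in E

% Resource constraints: capacity of every resource is respected at every time t
\sum_{i \in A(t)} r_{i,k} \le R_k
\qquad \forall t \ge 0,\; \forall k \in \{0, \dots, M-1\},
\qquad A(t) = \{\, i \in V : s_i \le t < s_i + d_i \,\}
```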
The Tabu Search Algorithm for the RCPSP
• The RCPSP can be solved with the Tabu Search (TS) meta-heuristic:
• Set $l = 0$; find an initial solution $W_l \in \mathcal{W}$ (a topological order); set $W_{best} = W_l$.
• While $l < L$:
  • Determine the neighborhood $\mathcal{W}(W_l)$ of $W_l$.
  • Eliminate infeasible solutions: $\mathcal{W}(W_l) \to \mathcal{W}'(W_l)$.
  • Compute $C_{max}(W_{next})$ for each solution $W_{next} \in \mathcal{W}'(W_l)$.
  • Assign $W_{l+1} = \arg\min C_{max}(W_{next})$ such that $W_{next} \notin TL$.
  • $TL = TL \cup \{W_{l+1}\}$.
  • If $C_{max}(W_{best}) > C_{max}(W_{l+1})$, then $W_{best} = W_{l+1}$.
  • If the solution has not improved for a given number of iterations, perform diversification of $W_{l+1}$.
  • $l = l + 1$.
• Return $W_{best}$.
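The loop above can be sketched in a few lines of sequential Python. This is only an illustration of the control flow, not the authors' implementation: the function names, the parameters `tabu_size` and `stall_limit`, the random-swap diversification, and the toy cost function (which stands in for the makespan evaluation) are all assumptions made for this example.

```python
import random
from collections import deque


def tabu_search(initial, cost, neighbors, is_feasible,
                max_iters=100, tabu_size=10, stall_limit=20, seed=0):
    """Generic Tabu Search over activity orders stored as tuples."""
    rng = random.Random(seed)
    current = tuple(initial)
    best, best_cost = current, cost(current)
    tabu = deque([current], maxlen=tabu_size)  # TL of recently visited solutions
    stall = 0
    for _ in range(max_iters):  # while l < L
        # Neighborhood without infeasible and tabu solutions.
        candidates = [w for w in neighbors(current)
                      if is_feasible(w) and w not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)  # W_{l+1} = argmin Cmax
        tabu.append(current)                 # TL = TL U {W_{l+1}}
        current_cost = cost(current)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
            stall = 0
        else:
            stall += 1
        if stall >= stall_limit:
            # Diversification (assumed form): a few random feasible swaps.
            for _ in range(3):
                i, j = rng.sample(range(len(current)), 2)
                trial = list(current)
                trial[i], trial[j] = trial[j], trial[i]
                if is_feasible(tuple(trial)):
                    current = tuple(trial)
            stall = 0
    return best, best_cost


def all_swaps(order):
    """Yield every order obtained by exchanging two positions."""
    for i in range(len(order) - 1):
        for j in range(i + 1, len(order)):
            w = list(order)
            w[i], w[j] = w[j], w[i]
            yield tuple(w)


def toy_cost(order):
    """Toy stand-in for Cmax: distance of the order from (0, 1, 2, ...)."""
    return sum(abs(v - idx) for idx, v in enumerate(order))


best, best_cost = tabu_search((3, 2, 1, 0), toy_cost, all_swaps, lambda w: True)
```

In a real RCPSP setting, `cost` would be the makespan of the schedule built from the order and `is_feasible` would check the precedence constraints.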
Representation of the Solution
• Representing a solution as a vector of start times $(s_0, \dots, s_{N-1})$ results in a huge solution space.
• For this reason, we use the order of activities $W = (w_0, \dots, w_{N-1})$ as the solution representation, e.g. (1,5,6,3,4,2).
[Figure: schedule corresponding to the order (1,5,6,3,4,2), drawn as Gantt charts for resources R1 and R2 over time t, with the makespan Cmax marked]
The Neighborhood of the Solution
• The neighborhood $\mathcal{W}(W_l)$ is the set of solutions obtained by applying all possible swap operators to $W_l$.
• A swap operator exchanges two activities in $W_l$.
• For example, swap(3,7): (1,5,2,3,4,6) → (1,5,6,3,4,2)
[Figure: Gantt charts of resource R2 before and after the swap, over time t, with the makespan Cmax marked]
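A minimal sketch of the swap neighborhood with the precedence check might look as follows. The function names are hypothetical, positions are 0-based indices into the order (a convention chosen for this sketch, which need not match the slide's swap arguments), and the single precedence edge in the usage lines is made up for illustration.

```python
def swap(order, p, q):
    """Return a copy of the order with the activities at positions p and q exchanged."""
    w = list(order)
    w[p], w[q] = w[q], w[p]
    return tuple(w)


def is_precedence_feasible(order, edges):
    """An order is feasible if every predecessor appears before its successor."""
    position = {activity: idx for idx, activity in enumerate(order)}
    return all(position[i] < position[j] for i, j in edges)


def feasible_swap_neighborhood(order, edges):
    """Apply every swap and keep only the precedence-feasible results."""
    result = []
    for p in range(len(order) - 1):
        for q in range(p + 1, len(order)):
            w = swap(order, p, q)
            if is_precedence_feasible(w, edges):
                result.append(((p, q), w))
    return result


order = (1, 5, 2, 3, 4, 6)
edges = [(1, 2)]  # hypothetical: activity 1 must precede activity 2
swapped = swap(order, 2, 5)
neighborhood = feasible_swap_neighborhood(order, edges)
```

Here `swapped` reproduces the pair of orders from the slide's example, and the neighborhood keeps only those swaps that leave activity 1 ahead of activity 2.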
TS Parallelization on the GPU
• Parallelization was inspired by [3].
• There is a set of independent solutions.
• Each CPU thread tries to improve an assigned solution until the given number of iterations is reached.
• Each thread processes solutions one by one.
• Access is controlled via atomic operations.
[Figure: set of solutions, each stored with its makespan and Tabu List]
[3] T. James, C. Rego, and F. Glover, “A cooperative parallel tabu search algorithm for the quadratic assignment problem,” European Journal of Operational Research, vol. 195, no. 3, pp. 810–826, 2009.
CUDA Mapping
• Each CUDA block executes an independent TS algorithm.
• A thread processes one or more solutions in the neighborhood of the current solution (elimination of infeasible solutions and computation of $C_{max}(W_{next})$).
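One way to picture "a thread processes one or more solutions" is a strided assignment of neighbors to the threads of a block, followed by a reduction that picks the best neighbor. The sequential Python simulation below is only a sketch of that idea: the strided pattern, the reduction step, and all names are assumptions, since the slide does not state how the work is divided.

```python
def thread_workload(tid, num_threads, num_neighbors):
    """Neighbor indices handled by simulated thread `tid` (strided assignment)."""
    return range(tid, num_neighbors, num_threads)


def block_best(neighbor_costs, num_threads):
    """Simulate one block: each thread finds its local best, then reduce."""
    local_best = []
    for tid in range(num_threads):
        mine = [(neighbor_costs[n], n)
                for n in thread_workload(tid, num_threads, len(neighbor_costs))]
        if mine:  # a thread may have no work when there are few neighbors
            local_best.append(min(mine))
    # Reduction over the per-thread results: smallest cost, ties broken by index.
    return min(local_best)


costs = [9, 4, 7, 4, 8]  # hypothetical Cmax values of five neighbors
best_cost, best_index = block_best(costs, num_threads=2)
```

On a GPU the per-thread loops would run concurrently and the final `min` would be a parallel reduction within the block.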
CUDA Mapping
[Figure: memory layout of the CUDA mapping for Block 0 … Block 27. Each block is drawn with its shared memory (current solution W), registers (helper variables), precedence constraints, and durations of activities D. The global, local, and texture memories hold the arrays for evaluation of resources, the Tabu Lists (TL of Block 0 … TL of Block 27), the required resources r_{i,k}, the activities' predecessors, and the activities' start time values.]
Implementation of the Tabu List
• The TL is stored in global memory, so access to it needs to be accelerated.
• The TLC (Tabu List Cache) is a two-dimensional array of Boolean values.
• Testing whether a move is in the TL takes a single read operation.
• Adding a new move (i, j) to the TL:
    (iold, jold) = TL[index]
    TLC[iold, jold] = false
    TL[index] = (i, j)
    TLC[i, j] = true
    index = (index + 1) % |TL|
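The update rule above translates almost directly into code. The class below is a minimal Python sketch of a circular Tabu List paired with a Boolean cache; the class and method names are hypothetical, and it assumes that a move is never added while it is still in the list, which holds when tabu moves are never selected.

```python
class TabuListWithCache:
    """Circular Tabu List of moves (i, j) with a 2D Boolean lookup cache."""

    def __init__(self, num_activities, capacity):
        self.moves = [None] * capacity  # TL: circular buffer of moves
        self.cache = [[False] * num_activities
                      for _ in range(num_activities)]  # TLC
        self.index = 0  # next slot to overwrite

    def is_tabu(self, i, j):
        """Single read from the cache instead of a scan of the list."""
        return self.cache[i][j]

    def add(self, i, j):
        """Overwrite the oldest move and keep the cache in sync."""
        old = self.moves[self.index]
        if old is not None:  # slot is empty until the list fills up
            i_old, j_old = old
            self.cache[i_old][j_old] = False
        self.moves[self.index] = (i, j)
        self.cache[i][j] = True
        self.index = (self.index + 1) % len(self.moves)


tl = TabuListWithCache(num_activities=6, capacity=2)
tl.add(1, 2)
tl.add(3, 4)
tabu_before = tl.is_tabu(1, 2)
tl.add(0, 5)  # evicts the oldest move (1, 2)
tabu_after = tl.is_tabu(1, 2)
```

The cache trades memory quadratic in the number of activities for a constant-time membership test, which matters when every thread queries the list for each neighbor.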
Computation of Cmax
• The goal is to minimize memory consumption.
• Activities are added to the schedule one by one in the order given by $W_l$, taking the precedence constraints and resource constraints into account.
[Figure: usage profile of resource k (capacity $R_k$) over time t, showing the earliest start time $s_i$ at which activity i with $r_{i,k} = 3$ and $d_i = 3$ can be executed, ending at $s_i + d_i$]
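The one-by-one construction can be illustrated with a simple schedule builder. Note that this sketch uses a plain time-indexed usage table, which is easy to read but is not the memory-efficient resource arrays the slide refers to. The function name and the tiny instance are made up, and the code assumes a topological order and requirements that never exceed the capacities.

```python
def evaluate_order(order, durations, preds, req, capacity):
    """Build a schedule from an activity order; return (start times, Cmax).

    Assumes `order` is topological and req[a][k] <= capacity[k] for all a, k.
    """
    horizon = sum(durations)  # running all activities in sequence always fits
    num_res = len(capacity)
    usage = [[0] * (horizon + 1) for _ in range(num_res)]
    start = {}
    cmax = 0
    for a in order:
        # Precedence constraints: wait for all predecessors to finish.
        t = max((start[p] + durations[p] for p in preds[a]), default=0)
        # Resource constraints: shift right until the whole duration fits.
        while any(usage[k][u] + req[a][k] > capacity[k]
                  for k in range(num_res)
                  for u in range(t, t + durations[a])):
            t += 1
        for k in range(num_res):
            for u in range(t, t + durations[a]):
                usage[k][u] += req[a][k]
        start[a] = t
        cmax = max(cmax, t + durations[a])
    return start, cmax


# Hypothetical instance: 4 activities, 1 resource with capacity 2.
durations = [2, 3, 2, 1]
preds = {0: [], 1: [], 2: [0], 3: [1, 2]}
req = [[1], [1], [2], [1]]
capacity = [2]

start_a, cmax_a = evaluate_order((0, 1, 2, 3), durations, preds, req, capacity)
start_b, cmax_b = evaluate_order((0, 2, 1, 3), durations, preds, req, capacity)
```

The two orders are both precedence-feasible but lead to different makespans, which is what makes the order a useful search representation.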
Experimental Results
• Experiments were performed on an Intel Xeon 2.66 GHz server with an Nvidia Tesla C2050 graphics card (448 CUDA cores, 14 multiprocessors).
• The J120 benchmark instances (600 projects with 120 activities each) were used for performance measurements.
• The GPU algorithm tests $1.8 \times 10^6$ solutions per second on average.
• The GPU performs the same number of iterations 55 times faster than the CPU.
Conclusions
• The first known GPU algorithm for solving the RCPSP.
• Compared to [1], we propose a more efficient TL (Tabu List Cache).
• The algorithm for schedule evaluation is suitable for the GPU (low memory requirements).
• The homogeneous model reduces the required communication bandwidth between the CPU and the GPU.
[1] M. Czapinski and S. Barnes, “Tabu Search with two approaches to parallel flowshop evaluation on CUDA platform,” J. Parallel Distrib. Comput., vol. 71, pp. 802–811, June 2011.