Parallel processing is not easy

Good schedules for parallel implementations of composite tasks are hard to find.

User's idea on parallel computations (e.g. STAP)
→ Parallel implementation
  • Mapping and scheduling
  • Optimization for makespan or throughput
→ Executable file for a given parallel computer
Parallel processing can be made easy with the ALPS software framework

User's idea on parallel computations (e.g. STAP)
→ 1. Task Graph: the user describes the computation as a task graph
→ 2. ALPS Benchmarker & Modeler: models execution times of basic building blocks under various configurations
→ 3. ALPS Scheduler: finds the optimal configuration that minimizes overall makespan or maximizes throughput
→ ALPS CodeGen: generates the executable source, using the ALPS Library (various STAP heuristics constructed from ALPS building blocks)

/* This is a .cpp file of the partially adaptive STAP method that the user had
   in mind, optimized for minimum latency or maximum throughput on a given
   parallel machine. */
#include <alps.h>
...
1. Task Graph: PRI-overlapped STAP

[Task-graph figure: read, split with overlap, fft1d, STAP, join, and write nodes operating on the Data cube, Target cube, and Steering vector]

• The user describes the algorithm using task graphs
• The ALPS software framework determines the implementation parameters, such as the number of processors for each task, and returns the executable code
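As a purely illustrative sketch (this is not ALPS's actual front-end syntax; the keywords and node parameters are hypothetical), a text description of the graph above might list each task and its predecessors, leaving the per-task processor counts for the scheduler to choose:

  # hypothetical task-graph text; node names taken from the figure above
  node read     inputs = DataCube, TargetCube, SteeringVector
  node overlap  after = read        # split with PRI overlap
  node fft1d    after = overlap
  node stap     after = fft1d
  node join     after = stap
  node write    after = join
  # processors per task are intentionally unspecified; the ALPS Scheduler fills them in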
2. ALPS Benchmarker & Modeler

• Measurements of actual timings y_i are taken for a set of input parameters
• Candidate terms {g_k}, such as the size of the data cube or the size of the processor cube, are generated based on the input parameters
• Model coefficients x_k are found using least-squares:

    T_task ≈ ∑_k x_k g_k

    min_x ||W(Ax − b)||₂²,  where the columns of A hold the candidate terms g_1 … g_n evaluated at each measurement, b = (y_1, …, y_m) holds the measured timings, and x = (x_1, …, x_n) holds the model coefficients

• Solved incrementally by adding the best column to the model one at a time (see the sketch below)
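Below is a minimal sketch, in C++ with toy data, of the incremental fitting described above: at each step it tries every unused candidate column, refits the model by ordinary least-squares (W = I) via the normal equations, and keeps the column that most reduces the residual. The function and variable names are illustrative and not taken from ALPS.

// Greedy forward selection for T_task ≈ sum_k x_k * g_k (illustrative sketch).
// A[i][k] holds candidate term g_k evaluated at measurement i; y[i] is the
// measured timing. Names and structure are hypothetical, not ALPS code.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Solve the normal equations (S^T S) x = S^T y for the selected columns by
// Gauss-Jordan elimination; returns the residual sum of squares of the fit.
double fitAndResidual(const Matrix& A, const std::vector<double>& y,
                      const std::vector<int>& cols, std::vector<double>& x) {
  const int m = (int)y.size(), n = (int)cols.size();
  Matrix G(n, std::vector<double>(n + 1, 0.0));          // [S^T S | S^T y]
  for (int a = 0; a < n; ++a) {
    for (int b = 0; b < n; ++b)
      for (int i = 0; i < m; ++i) G[a][b] += A[i][cols[a]] * A[i][cols[b]];
    for (int i = 0; i < m; ++i) G[a][n] += A[i][cols[a]] * y[i];
  }
  for (int p = 0; p < n; ++p) {                          // elimination with pivoting
    int piv = p;
    for (int r = p + 1; r < n; ++r)
      if (std::fabs(G[r][p]) > std::fabs(G[piv][p])) piv = r;
    std::swap(G[p], G[piv]);
    for (int r = 0; r < n; ++r) {
      if (r == p || G[p][p] == 0.0) continue;
      double f = G[r][p] / G[p][p];
      for (int c = p; c <= n; ++c) G[r][c] -= f * G[p][c];
    }
  }
  x.assign(n, 0.0);
  for (int p = 0; p < n; ++p) x[p] = (G[p][p] != 0.0) ? G[p][n] / G[p][p] : 0.0;
  double rss = 0.0;
  for (int i = 0; i < m; ++i) {
    double pred = 0.0;
    for (int a = 0; a < n; ++a) pred += x[a] * A[i][cols[a]];
    rss += (y[i] - pred) * (y[i] - pred);
  }
  return rss;
}

int main() {
  // Toy data: 4 measurements, 3 candidate terms; y is driven mostly by term 0.
  Matrix A = {{1, 2, 1}, {2, 1, 1}, {3, 5, 1}, {4, 2, 1}};
  std::vector<double> y = {2.1, 4.0, 6.2, 7.9};
  std::vector<int> chosen;
  std::vector<double> x;
  for (int step = 0; step < 2; ++step) {                 // add the best column twice
    int best = -1;
    double bestRss = 1e300;
    for (int k = 0; k < 3; ++k) {
      if (std::find(chosen.begin(), chosen.end(), k) != chosen.end()) continue;
      std::vector<int> trial = chosen;
      trial.push_back(k);
      std::vector<double> xt;
      double rss = fitAndResidual(A, y, trial, xt);
      if (rss < bestRss) { bestRss = rss; best = k; }
    }
    chosen.push_back(best);
    fitAndResidual(A, y, chosen, x);
    std::printf("step %d: added term %d, RSS = %g\n", step, best, bestRss);
  }
  return 0;
}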
Choices of W in min_x ||W(Ax − b)||₂²:

• Ordinary least-squares: W = I
• COV-weighted least-squares (BLUE estimator): W = R^(−1/2), where R is the covariance matrix of the measurements; one of the best, but it requires many replications and hence is not practical
• VAR-weighted least-squares: W = D^(−1/2), where D = diag(R)
• INV-weighted least-squares: W = diag(1/b)

We used ordinary, VAR-weighted, and INV-weighted least-squares with 5 replications for each measurement.
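A standard least-squares identity (not specific to ALPS) shows how a diagonal W is applied in practice: the weighted problem reduces to an ordinary one on rescaled data,

    min_x ||W(Ax − b)||₂² = min_x ||(WA)x − (Wb)||₂²

so with W = diag(w_1, …, w_m), row i of A and entry y_i of b are simply multiplied by w_i before the ordinary solve. For INV-weighting, w_i = 1/y_i, which makes the fit minimize relative rather than absolute timing errors.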
[Scatter plot: relative modeling error from ordinary least-squares (x-axis, 0 to 0.7) vs. from weighted least-squares (y-axis, 0 to 0.7), with reference lines at 10%, 20%, 30%, and 40% improvement. Ordinary least-squares: 26.6% error; INV-weighted least-squares: 7.5% error.]
3. ALPS Scheduler

• Schedules tree-shaped task graphs
• (Tree-shaped graphs) = (linear chains) + (joining nodes): linear chains are rather simple, joining nodes are complicated

[Figure: three processor-vs-time charts of the same tasks A, B, and C, illustrating a data-parallelism-only schedule [3], a task-parallelism-preferred schedule [4], and other, possibly better, schedules.]
We consider both data-parallel (series) and task-parallel (parallel) schedules for a joining node.

Scheduling a joining node with two incoming arcs:
• Use dynamic programming, considering the two incoming arcs both in series and in parallel

Scheduling a joining node with more than two incoming arcs (see the sketch below):
1. Sort the incoming arcs based on their makespan when maximally parallelized
2. Schedule two incoming arcs at a time

[Figure: example joining node with incoming arcs A, B, C, and D whose maximally parallelized makespans are 10, 1, 8, and 5; the arcs are sorted and merged two at a time.]
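The pairwise merging above can be sketched as follows, under strong simplifying assumptions: each incoming arc is summarized only by its makespan as a function of processor count, and a merged pair is summarized the same way. The perfect-speedup makespan model and all names are illustrative, not the actual ALPS scheduler.

// Illustrative sketch: merging the incoming arcs of a joining node by dynamic
// programming over processor counts. Each arc is summarized by makespan[p],
// its makespan when run on p processors (1-based index). This is a simplified
// interpretation of the slide, not the full ALPS scheduler.
#include <algorithm>
#include <cstdio>
#include <vector>

using Makespan = std::vector<double>;   // makespan[p] for p = 1..P (index 0 unused)

// Combine two arcs feeding the same joining node, for every processor budget q:
// either in series (both get all q processors) or in parallel (split q = p + (q - p)).
Makespan combine(const Makespan& a, const Makespan& b) {
  const int P = (int)a.size() - 1;
  Makespan out(P + 1, 0.0);
  for (int q = 1; q <= P; ++q) {
    double best = a[q] + b[q];                            // series: data-parallel style
    for (int p = 1; p < q; ++p)                           // parallel: task-parallel style
      best = std::min(best, std::max(a[p], b[q - p]));
    out[q] = best;
  }
  return out;
}

int main() {
  const int P = 8;                                        // total processors available
  // Hypothetical makespan curves for four incoming arcs (perfect-speedup toy model
  // with serial work 10, 1, 8, and 5, as in the figure above).
  std::vector<double> work = {10.0, 1.0, 8.0, 5.0};
  std::vector<Makespan> arcs;
  for (double w : work) {
    Makespan m(P + 1, 0.0);
    for (int p = 1; p <= P; ++p) m[p] = w / p;
    arcs.push_back(m);
  }
  // 1. Sort arcs by makespan when maximally parallelized.
  std::sort(arcs.begin(), arcs.end(),
            [P](const Makespan& x, const Makespan& y) { return x[P] < y[P]; });
  // 2. Schedule two incoming arcs at a time.
  Makespan merged = arcs[0];
  for (size_t i = 1; i < arcs.size(); ++i) merged = combine(merged, arcs[i]);
  std::printf("Best makespan of the joining node's inputs on %d processors: %.3f\n",
              P, merged[P]);
  return 0;
}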
Makespan: achieved vs. estimated

Experiments on an 8-node Linux cluster; executable codes were generated by the ALPS software framework.

[Plots of estimated and achieved makespan (sec) vs. number of processors used (1, 2, 4, 8) for our scheduling algorithm, data-parallelism only [3], and task-parallelism preferred [4].]
Schedules found in makespan minimization

Input sizes: Target cube (32x2048x32), Data cube (32x2048x32), Steering vector (32x1x32)

[Gantt charts over processors P1..P8:]
• Data-parallelism only [3]: ~31 sec
• Task-parallelism preferred [4]: ~24 sec
• Our scheduling algorithm: ~24 sec
Throughput: achieved vs. estimated

[Plots of estimated and achieved throughput (1/sec, up to ~0.12) vs. number of processors used (1, 2, 4, 8) for our scheduling algorithm, data-parallelism only [3], and task-parallelism preferred [4].]
Schedules found in throughput maximization

Input sizes: Target cube (32x2048x32), Data cube (32x2048x32), Steering vector (32x1x32)

[Gantt charts over processors P1..P8:]
• Data-parallelism only [3] and our scheduling algorithm: schedule on 1 processor, replicated 8 times (~77 sec)
• Task-parallelism preferred [4]: schedule on 4 processors, replicated 2 times (~45 sec)
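Assuming the ~77 sec and ~45 sec above are the lengths of one copy of each schedule, the throughput plot on the previous slide follows directly: 8 copies of the 1-processor schedule deliver about 8 / 77 ≈ 0.10 results/sec, while 2 copies of the 4-processor schedule deliver about 2 / 45 ≈ 0.044 results/sec, consistent with the roughly 2x gap between our schedule and the task-parallelism-preferred one.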
Conclusion

• The schedules found by the ALPS framework are as good as, and often better than, those computed by other published algorithms.
• The relative error between the estimated and the achieved makespan/throughput was around 10%, confirming that the modeled execution times are useful for predicting performance.
• Weighted least-squares produced estimates 20-30% better than ordinary least-squares in terms of relative error; it may be useful when modeling unreliable timings.
Related work

Frameworks: [1] PVL and [2] S3P
• We use a text version of the task graph as the front-end and source-code generation as the back-end, so users do not deal with C++ code.
• We cover a larger scheduling problem space (tree-structured task graphs) than S3P.

Scheduling algorithms:
• [3] J. Subhlok and G. Vondran: chains of tasks only
• [4] D. Nicol, R. Simha, and A. Choudhary: series-parallel task graphs, but task-parallelism is preferred
(… and many other works; refer to the references in the abstract)
References

[1] Eddie Rutledge and Jeremy Kepner, "PVL: An Object Oriented Software Library for Parallel Signal Processing", IEEE Cluster 2001.
[2] H. Hoffmann, J. V. Kepner, and R. A. Bond, "S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures", in Proceedings of the Fifth Annual High-Performance Embedded Computing (HPEC) Workshop, Nov. 2001.
[3] J. Subhlok and G. Vondran, "Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations", Journal of Parallel and Distributed Computing, vol. 60, 2000.
[4] D. Nicol, R. Simha, and A. Choudhary, "Optimal Processor Assignments for a Class of Pipelined Computations", IEEE Transactions on Parallel and Distributed Systems, vol. 5(4), 1994.