A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches
Martin Burtscher¹ and Hassan Rabeti²
¹Department of Computer Science, Texas State University-San Marcos
²Department of Mathematics, Texas State University-San Marcos
Problem: HPC is Hard to Exploit
• HPC application writers are domain experts
  • They are not typically computer scientists and have little or no formal education in parallel programming
  • Parallel programming is difficult and error prone
• Modern HPC systems are complex
  • They consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  • They require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance
Target Area: Iterative Local Searches
• Important application domain
  • Widely used in engineering & real-time environments
• Examples
  • All sorts of random-restart greedy algorithms
  • Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
• ILS properties
  • Iteratively produce better solutions
  • Can exploit large amounts of parallelism
  • Often have exponential search spaces
Our Solution: ILCS Framework
• Iterative Local Champion Search (ILCS) framework
  • Also supports non-random-restart heuristics
    • Genetic algorithms, tabu search, particle swarm optimization, etc.
  • Simplifies implementation of ILS on parallel systems
• Design goals
  • Ease of use and scalability
• Framework benefits
  • Handles threading, communication, locking, resource allocation, heterogeneity, load balancing, termination decisions, and result recording (checkpointing)
User Interface
• User writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions:
  size_t CPU_Init(int argc, char *argv[]);
  void CPU_Exec(long seed, void const *champion, void *result);
  void CPU_Output(void const *champion);
• See the paper for the GPU interface and sample code
• The framework runs the Exec (map) functions in parallel
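To make the three-function interface concrete, here is a hypothetical user code sketch for the CPU side: a toy ILS that minimizes f(x) = (x - 3)² from seed-derived starting points. The three signatures come from the framework; the Solution struct, the toy problem, and the descent loop are illustrative assumptions, not the framework's sample code.

```c
/* Hypothetical ILCS user code: the three CPU functions for a toy ILS
   that minimizes f(x) = (x - 3)^2 via seed-derived restarts.
   Only the signatures are from the framework; everything else
   (Solution, f, the descent loop) is an illustrative assumption. */
#include <stddef.h>
#include <stdio.h>

typedef struct {
    double x;      /* candidate point */
    double cost;   /* f(x); lower is better */
} Solution;

static double f(double x) { return (x - 3.0) * (x - 3.0); }

/* Tell the framework how large one solution record is. */
size_t CPU_Init(int argc, char *argv[]) {
    (void)argc; (void)argv;
    return sizeof(Solution);
}

/* Evaluate one seed: derive a start point from the seed, descend
   locally, and write the best solution found into 'result'. */
void CPU_Exec(long seed, void const *champion, void *result) {
    (void)champion;             /* a random-restart heuristic ignores it */
    double x = (double)(seed % 1000) / 50.0;  /* deterministic start */
    double step = 1.0;
    while (step > 1e-9) {       /* simple hill-climbing descent */
        if (f(x + step) < f(x))      x += step;
        else if (f(x - step) < f(x)) x -= step;
        else                         step *= 0.5;
    }
    Solution *r = (Solution *)result;
    r->x = x;
    r->cost = f(x);
}

/* Print the globally best solution when the framework terminates. */
void CPU_Output(void const *champion) {
    Solution const *c = (Solution const *)champion;
    printf("best x = %f, cost = %f\n", c->x, c->cost);
}
```

The framework calls Exec with many distinct seeds in parallel and keeps whichever result record compares best, so the user code never deals with threads, MPI, or checkpointing.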
Internal Operation: Threading
• ILCS master thread starts
• Master forks a worker per core
  • Workers evaluate seeds, record local optimum
• Master forks a handler per GPU
  • Handlers launch GPU code, sleep, record result
  • GPU workers evaluate seeds, record local optimum
• Master sporadically finds global optimum via MPI, sleeps
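The fork/evaluate/reduce scheme above can be sketched with plain pthreads: a master forks one worker per core, each worker scans its share of seeds and records a local optimum, and the master joins the workers and reduces their local optima to a champion. The cost function, worker count, and seed counts are illustrative; the real framework additionally runs GPU handler threads and an MPI reduction, omitted here.

```c
/* Minimal sketch of the ILCS threading scheme (CPU side only):
   master forks workers, workers record local optima, master reduces.
   eval_seed, NWORKERS, and SEEDS_PER_WORKER are illustrative. */
#include <pthread.h>

#define NWORKERS 4
#define SEEDS_PER_WORKER 1000

typedef struct {
    long first_seed;   /* start of this worker's seed range */
    long best_seed;    /* seed of the local optimum */
    double best_cost;  /* cost of the local optimum */
} Worker;

/* Toy evaluation: scrambles the seed; lower result is "better". */
static double eval_seed(long seed) {
    return (double)((seed * 2654435761u) % 10007);
}

static void *worker_main(void *arg) {
    Worker *w = (Worker *)arg;
    w->best_seed = w->first_seed;
    w->best_cost = eval_seed(w->first_seed);
    for (long s = w->first_seed + 1;
         s < w->first_seed + SEEDS_PER_WORKER; s++) {
        double c = eval_seed(s);
        if (c < w->best_cost) { w->best_cost = c; w->best_seed = s; }
    }
    return NULL;
}

/* Master: fork a worker per "core", then join and reduce. */
long run_search(double *out_cost) {
    pthread_t tid[NWORKERS];
    Worker w[NWORKERS];
    for (int i = 0; i < NWORKERS; i++) {
        w[i].first_seed = (long)i * SEEDS_PER_WORKER;
        pthread_create(&tid[i], NULL, worker_main, &w[i]);
    }
    long best = -1;
    double cost = 0.0;
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(tid[i], NULL);
        if (best < 0 || w[i].best_cost < cost) {
            cost = w[i].best_cost;
            best = w[i].best_seed;
        }
    }
    if (out_cost) *out_cost = cost;
    return best;
}
```

Because each worker touches only its own Worker record until the join, the reduction needs no locking, which matches the framework's decentralized, low-overhead design.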
Internal Operation: Seed Distribution each node gets chunk of 64-bit seed range CPUs process chunk bottom up GPUs process chunk top down • E.g., 4 nodes w/ 4 cores (a,b,c,d) and 2 GPUs (1,2) • Benefits • Balanced workload irrespective of number of CPU cores or GPUs (or their relative performance) • Users can generate other distributions from seeds • Any injective mapping results in no redundant evaluations
Related Work
• MapReduce/Hadoop/MARS and PADO
  • Their generality and features unnecessary for ILS incur overhead and steepen the learning curve
  • Some do not support accelerators; some require Java
• The ILCS framework is optimized for ILS applications
  • Provides reduction, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random-restart heuristics, allows early termination, works with GPUs and MICs, and targets everything from single-node workstations to HPC clusters
Evaluation Methodology
• Three HPC systems (at TACC and NICS): Ranger, Stampede, and Keeneland
• [Figure: system photos and table of configurations; largest tested configuration highlighted]
Sample ILS Codes
• Traveling Salesman Problem (TSP)
  • Find the shortest tour
  • 4 inputs from TSPLIB
  • 2-opt hill climbing
• Finite State Machine (FSM)
  • Find the best FSM configuration to predict hit/miss events
  • 4 sizes (n = 3, 4, 5, 6)
  • Monte Carlo method
FSM Transitions/Second Evaluated
[Figure: FSM evaluation throughput on the three systems]
• Peak rate: 21,532,197,798,304 transitions/s
• GPU shared-memory limit caps the largest FSM configuration
• Ranger uses twice as many cores as Stampede
TSP Tour-Changes/Second Evaluated
[Figure: TSP evaluation throughput on the three systems]
• Peak rate: 12,239,050,704,370 tour changes/s (based on serial CPU code)
• GPU re-computes distances: O(n) memory; CPU pre-computes them: O(n²) memory
• Each core evaluates a tour change every 3.6 cycles
TSP Moves/Second/Node Evaluated
[Figure: per-node TSP throughput]
• GPUs provide >90% of the performance on Keeneland
ILCS Scaling on Ranger (FSM)
[Figure: FSM scaling curve]
• >99% parallel efficiency on 2048 nodes
• The other two systems behave similarly
ILCS Scaling on Ranger (TSP)
[Figure: TSP scaling curve]
• >95% parallel efficiency on 2048 nodes
• Longer runs scale even better
Intra-Node Scaling on Stampede (TSP)
[Figure: intra-node scaling curve]
• >98.9% parallel efficiency on 16 threads
• Framework overhead is very small
Tour Quality Evolution (Keeneland)
[Figure: tour quality over time]
• Quality depends on chance: ILS provides a good solution quickly, then progressively improves it
Tour Quality after 6 Steps (Stampede)
[Figure: tour quality vs. node count]
• Larger node counts typically yield better results faster
Summary and Conclusions
• ILCS Framework
  • Automatic parallelization of iterative local searches
  • Provides MPI, OpenMP, and multi-GPU support
  • Checkpoints the currently best solution every few seconds
  • Scales very well (decentralized design)
• Evaluation
  • 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  • AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs
• ILCS source code is freely available: http://cs.txstate.edu/~burtscher/research/ILCS/
Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS