Adaptive Parallel Sorting Algorithms in STAPL
Olga Tkachyshyn, Gabriel Tanase, Nancy M. Amato
olgat@cs.tamu.edu, gabrielt@cs.tamu.edu, amato@cs.tamu.edu
Parasol Lab, Department of Computer Science, Texas A&M University, http://parasol.tamu.edu/

STAPL: Standard Template Adaptive Parallel Library

The Standard Template Adaptive Parallel Library (STAPL) is a parallel library designed as a superset of the ISO Standard C++ Standard Template Library (STL). It executes on uni- or multi-processor systems that use shared or distributed memory. The goal of STAPL is to let users work at a high level of abstraction, insulating them from the complexities of parallel programming (problem decomposition, mapping, scheduling, and execution) while still providing scalable performance.

STAPL Design Goals
• Ease of use: STAPL emulates shared-memory programming. Users can program assuming a single address space on both shared- and distributed-memory systems.
• Efficiency: STAPL provides building blocks equivalent to STL containers, iterators, and algorithms that are automatically tuned for parallel and distributed systems.
• Portability: STAPL has its own runtime that hides machine-specific details and provides a uniform and efficient communication interface.

STAPL Main Components
• pContainer: distributed data structures (the STAPL counterpart of an STL container).
• pRange: presents an abstract view of a scoped data space, allowing random access to a partition or subrange of the data in a pContainer; also stores data-dependence information (the counterpart of an STL iterator).
• pAlgorithms: parallel algorithms that provide basic functionality and are bound to a pContainer by a pRange (the counterpart of STL algorithms).
• Adaptive Runtime System: the adaptive remote method invocation (aRMI) communication library hides machine specifics and provides a uniform communication interface; an adaptive performance-optimization toolbox includes a scheduler, a load balancer, and system-profiling tools.

(The original poster shows these components as a layered diagram: STL's container, iterator, and algorithms map to pContainer, pRange, and pAlgorithms, built on the aRMI runtime and the performance-optimization and system-profiling toolboxes.)

Parallel Sorting Algorithms

Illustrative code sketches of all three algorithms follow the descriptions below.

Sample Sort
• Sequential algorithm:
  1. Select p-1 splitters.
  2. Sort the splitters; they become the upper and lower bounds that define p buckets.
  3. Compare each element to the splitters and place it in the appropriate bucket.
  4. Sort the contents of each bucket.
  5. Copy the values from the buckets back into the original container.
• Parallelization: if each processor is responsible for one bucket, the steps can be done in parallel. The running time then depends on the maximum number of elements in any bucket, i.e., on how evenly the elements are distributed among buckets, so we want all buckets to contain an equal number of elements. The technique used to achieve this is oversampling.
• Works for all types of elements that can be compared.
• Sequential complexity: O(n log n). Parallel complexity: O((n/p) log(n/p)).

Bitonic Sort
• Parallel algorithm:
  1. Locally sort the elements on each thread.
  2. Form a bitonic sequence (a sequence that first increases and then decreases, or that can be circularly shifted to become so).
  3. Sort in increasing order.
• Each step of bitonic sort consists of two threads exchanging data, merging the two sequences, and each keeping its corresponding half.
• Heuristic applied: the threads exchange their minimum and maximum first, then trade only the elements necessary for the merge.
• Works for all types of elements that can be compared.
• Sequential complexity: O(n log n). Parallel complexity: O((n/p) log(n/p) + (n/p) log p) (sort + merge).

Radix Sort
• Sequential algorithm:
  • Radix sort is not a comparison sort, so it is not subject to the O(n log n) lower bound on comparison sorting.
  • Each element is represented by b bits (e.g., 32-bit integers).
  • The algorithm performs a number of passes; each pass considers only r bits of each element at a time, with the i-th pass sorting according to the i-th group of least-significant bits.
  • The sort applied to those r bits must be stable: if two elements have the same value, they appear in the same order in the output sequence as they did in the input. Counting sort is usually used here (see the radix sort sketch below).
• Works only for integers.
• Sequential complexity: O(n). Parallel complexity: O(n/p).
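A minimal sequential sketch of the five sample sort steps above. The random splitter selection (a full oversampling refinement would draw several candidates per bucket and pick evenly spaced ones) and the bucket count p taken as a plain parameter are illustrative simplifications, not the STAPL implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

void sample_sort(std::vector<int>& data, std::size_t p) {
    if (data.size() <= 1 || p < 2) { std::sort(data.begin(), data.end()); return; }

    // 1-2. Select p-1 splitters (here: sampled at random) and sort them;
    //      the sorted splitters bound p buckets.
    std::mt19937 gen(0);
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    std::vector<int> splitters;
    for (std::size_t i = 0; i + 1 < p; ++i) splitters.push_back(data[pick(gen)]);
    std::sort(splitters.begin(), splitters.end());

    // 3. Place each element into the bucket found by binary search.
    std::vector<std::vector<int>> buckets(p);
    for (int x : data) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), x)
                        - splitters.begin();
        buckets[b].push_back(x);
    }

    // 4-5. Sort each bucket (in the parallel version, one processor per
    //      bucket), then copy the buckets back in order.
    auto out = data.begin();
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());
        out = std::copy(bucket.begin(), bucket.end(), out);
    }
}
```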
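Each bitonic step pairs two threads that trade blocks, merge, and keep half. Below is a minimal sketch of that merge-exchange step, assuming both blocks are already sorted and of equal size; it performs a full exchange rather than the min/max heuristic described above, which would ship only the elements able to cross between the two halves.

```cpp
#include <algorithm>
#include <vector>

// `mine` and `partner` are the two threads' locally sorted blocks.
// After the call, `mine` holds the lower half of the merge if keep_low
// is true, otherwise the upper half; the partner keeps the other half.
void merge_exchange(std::vector<int>& mine,
                    const std::vector<int>& partner,
                    bool keep_low) {
    std::vector<int> merged(mine.size() + partner.size());
    std::merge(mine.begin(), mine.end(),
               partner.begin(), partner.end(), merged.begin());
    if (keep_low)
        mine.assign(merged.begin(), merged.begin() + mine.size());
    else
        mine.assign(merged.end() - mine.size(), merged.end());
}
```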
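A minimal sequential sketch of the LSD radix sort described above, using a stable counting sort on each group of r bits. The digit width r = 8 and 32-bit unsigned keys are illustrative assumptions, not STAPL parameters.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

void radix_sort(std::vector<std::uint32_t>& data) {
    const int b = 32, r = 8;                  // key width, bits per pass
    const std::uint32_t mask = (1u << r) - 1;
    std::vector<std::uint32_t> buffer(data.size());

    for (int shift = 0; shift < b; shift += r) {
        // Counting sort on the current r-bit digit. It is stable: equal
        // digits keep their input order, which later passes rely on.
        std::vector<std::size_t> count(1u << r, 0);
        for (std::uint32_t x : data) ++count[(x >> shift) & mask];

        // Turn the counts into starting offsets via a prefix sum.
        std::size_t offset = 0;
        for (auto& c : count) { std::size_t n = c; c = offset; offset += n; }

        // Stable scatter into the buffer, then reuse it as the input.
        for (std::uint32_t x : data) buffer[count[(x >> shift) & mask]++] = x;
        data.swap(buffer);
    }
}
```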
Performance
(The original poster shows three plots here: Radix Sort, Sample Sort, and a comparison of the two.)
• Performance of parallel sorts depends on:
  • machine architecture,
  • number of processors,
  • type of elements to sort, and
  • how presorted the elements are.
• Radix Sort: the fastest sort for integers, but its scalability on random data is poor.
• Sample Sort: scales better than radix sort and performs well on various data types.
• Random data: radix sort is faster up to 8 processors; sample sort outperforms radix sort as the number of processors increases.
• Nearly sorted data: radix sort is faster, and the difference in performance shrinks as the number of processors increases.

Goal
To adaptively select the best algorithm based on the data provided and the system information available.