
A Dynamically Tuned Sorting Library

A Dynamically Tuned Sorting Library. Xiaoming Li, María Jesús Garzarán, and David Padua. In 2004 International Symposium on Code Generation and Optimization (CGO’04). University of Illinois at Urbana-Champaign. Motivation: sorting is a core operation in many applications, such as databases.


Presentation Transcript


  1. A Dynamically Tuned Sorting Library Xiaoming Li, María Jesús Garzarán, and David Padua In 2004 International Symposium on Code Generation and Optimization (CGO’04) University of Illinois at Urbana-Champaign

  2. Motivation • Sorting • Core operation in many applications, such as databases • Well-understood symbolic computing problem • Library generators such as ATLAS and SPIRAL have used empirical search to adapt to • Architectural features of the target machine • Size of the input data • However, sorting performance also depends on the distribution of the values to be sorted

  3. Motivation • Main difficulties in building a sorting library • Theoretical complexity is not sufficient to measure quality • Cache effects and the number of instructions executed also matter • Performance depends on the characteristics of the input • Amount and distribution of the data to sort • No single algorithm is optimal for all possible input sets

  4. Contributions • Identify the architectural and runtime factors that affect the performance of the sorting algorithms. • Use empirical search to identify the best shape and parameter values of a sorting algorithm. • Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.

  5. Contributions [Figure: execution time (cycles) versus standard deviation of the inputs; IBM Power3, sorting 12 M keys (32-bit integers)]

  6. Outline • Sorting Algorithms • Factors that determine performance • The Library • Evaluation • Future Work • Conclusions

  7. Sorting Algorithms • Our sorting library contains • Quicksort • CC-Radix • Multiway Merge • Insertion Sort (for small partitions) • Sorting Networks (for small partitions)

  8. Quicksort • Divide-and-conquer, in-place sorting algorithm • Our implementation includes Sedgewick’s optimizations: • Set guardians (sentinels) at both ends of the input array. • Eliminate recursion. • Select the pivot carefully. • Use insertion sort for small partitions.
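A minimal Python sketch of three of the optimizations listed on this slide: recursion replaced by an explicit stack, a median-of-three pivot, and insertion sort for small partitions. The cutoff `small=16` and the median-of-three rule are assumptions chosen for illustration, not values taken from the slides, and the guardian (sentinel) optimization is not shown.

```python
def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] (inclusive bounds) in place."""
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def quicksort(a, small=16):
    """Iterative quicksort; `small` is a hypothetical cutoff for insertion sort."""
    stack = [(0, len(a) - 1)]          # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= small:       # small (or empty) partition
            insertion_sort(a, lo, hi)
            continue
        mid = (lo + hi) // 2
        # Median of three: order a[lo], a[mid], a[hi] so that a[mid] is the median.
        if a[mid] < a[lo]:
            a[lo], a[mid] = a[mid], a[lo]
        if a[hi] < a[lo]:
            a[lo], a[hi] = a[hi], a[lo]
        if a[hi] < a[mid]:
            a[mid], a[hi] = a[hi], a[mid]
        pivot = a[mid]
        i, j = lo, hi
        while i <= j:                  # partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))
        stack.append((i, hi))
    return a
```

In a tuned library the cutoff would be one of the parameters found by empirical search rather than a fixed constant.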

  9. Radix sort • Non-comparison algorithm [Figure: worked radix sort example showing the vector to sort, the counter and accumulated-counter (accum.) arrays, and the destination vector]
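The counter, accumulated counter, and destination vector named on this slide can be sketched as a least-significant-digit radix sort. This is an illustrative sketch that assumes non-negative integer keys smaller than 2**key_bits; the parameter names `bits` and `key_bits` are hypothetical.

```python
def radix_sort(keys, bits=8, key_bits=32):
    """LSD radix sort for non-negative integers below 2**key_bits."""
    radix = 1 << bits
    mask = radix - 1
    src = list(keys)
    for shift in range(0, key_bits, bits):
        # Counter: how many keys have each value of the current digit.
        counter = [0] * radix
        for k in src:
            counter[(k >> shift) & mask] += 1
        # Accum: starting position of each digit value in the destination.
        accum = [0] * radix
        total = 0
        for d in range(radix):
            accum[d] = total
            total += counter[d]
        # Dest: stable scatter of the keys by the current digit.
        dest = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dest[accum[d]] = k
            accum[d] += 1
        src = dest
    return src
```

Each pass is stable, which is what lets later (more significant) digits preserve the order established by earlier ones.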

  10. CC-radix (Cache Conscious Radix Sort) • Tries to exploit data locality in caches • Based on radix sort (Jimenez and Larriba – UPC) • Pseudocode:
        CC-radix(bucket)
          if fits in cache (bucket) then
            radix sort (bucket)
          else
            sub-buckets = Reverse sorting(bucket)
            for each sub-bucket in sub-buckets
              CC-radix(sub-bucket)
            endfor
          endif
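The recursion on this slide can be sketched in Python as follows. This is a sketch under stated assumptions: "fits in cache" is modelled by a hypothetical key count `cache_keys`, "reverse sorting" is taken to mean partitioning on the most significant remaining digit, and keys are non-negative integers below 2**top_bit. None of the names come from the original library.

```python
def lsd_radix(keys, top_bit, bits):
    """Sort non-negative ints on their low `top_bit` bits, one digit per pass."""
    mask = (1 << bits) - 1
    for shift in range(0, top_bit, bits):
        buckets = [[] for _ in range(1 << bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for b in buckets for k in b]
    return keys


def cc_radix(bucket, top_bit=32, bits=8, cache_keys=4096):
    """Partition on the top digit until a bucket is small, then radix sort it."""
    if top_bit <= 0 or len(bucket) <= cache_keys:
        # Bucket assumed to fit in cache: plain radix sort on the remaining digits.
        return lsd_radix(bucket, top_bit, bits)
    # Assumed meaning of "reverse sorting": split by most significant digit.
    shift = max(top_bit - bits, 0)
    width = top_bit - shift
    sub_buckets = [[] for _ in range(1 << width)]
    for k in bucket:
        sub_buckets[(k >> shift) & ((1 << width) - 1)].append(k)
    out = []
    for sub in sub_buckets:            # sub-buckets are already in key order
        out.extend(cc_radix(sub, shift, bits, cache_keys))
    return out
```

Because every key in a sub-bucket shares the digits above `shift`, only the lower digits still need sorting once the sub-bucket is small enough.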

  11. Multiway Merge Sort • This algorithm exploits data locality very efficiently [Figure: p sorted subsets feeding a heap of 2*p - 1 nodes]
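A rough sketch of the idea: split the input into p subsets, sort each one, and merge them through a heap. It uses Python's `heapq` binary heap holding one entry per subset, which is a simplification of the 2*p - 1 node heap shown on the slide; the default `p=4` and the use of `sorted` for the subsets are assumptions for illustration.

```python
import heapq


def multiway_merge_sort(keys, p=4):
    """Sort by merging about p individually sorted subsets through a heap."""
    n = len(keys)
    if n == 0:
        return []
    size = -(-n // p)                                   # ceil(n / p)
    subsets = [sorted(keys[i:i + size]) for i in range(0, n, size)]
    # Heap entries: (current key, subset index, position inside that subset).
    heap = [(s[0], idx, 0) for idx, s in enumerate(subsets)]
    heapq.heapify(heap)
    out = []
    while heap:
        key, idx, pos = heapq.heappop(heap)
        out.append(key)
        if pos + 1 < len(subsets[idx]):
            heapq.heappush(heap, (subsets[idx][pos + 1], idx, pos + 1))
    return out
```

Each subset is read sequentially during the merge, which is where the data locality mentioned on the slide comes from.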

  12. Sorting algorithms for small partitions • Insertion sort → exploits locality in the cache line • Sorting networks → register blocking

  13. Performance Comparison [Figure: Pentium III Xeon, 16 M keys (float)]

  14. Outline • Sorting Algorithms • Factors that determine performance • The Library • Evaluation • Future Work • Conclusions

  15. Factors that determine performance • Architectural factors considered • Cache / TLB size • Number of registers • Cache line size • Runtime factors considered • Amount of data to sort • Distribution of the data

  16. Architectural: Cache Size/TLB Size • Tiling: partition the data into subsets that fit in the cache • Quicksort • Use multiple pivots to tile • CC-radix • Fit each partition into the cache • Keep the number of active partitions below the TLB size • Multiway Merge Sort • Fit the heap into the cache • Fit the sorted subsets into the cache

  17. Architectural: Number of Registers • For small partitions, sort in place using the processor registers • Optimizations such as loop unrolling and instruction scheduling can be applied • Before scheduling: cmp&swap(r0,r1) cmp&swap(r2,r3) cmp&swap(r1,r2) cmp&swap(r0,r3) cmp&swap(r4,r5) … • After scheduling: cmp&swap(r0,r1) cmp&swap(r2,r3) cmp&swap(r4,r5) cmp&swap(r1,r2) cmp&swap(r0,r3)
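To illustrate what a fixed sequence of cmp&swap operations looks like, here is a small sketch of a sorting network for four values. It uses the well-known five-comparator network for four inputs, not the sequence printed on the slide (which appears to be an excerpt of a larger network). Python cannot pin values to registers, so the list `v` merely stands in for r0..r3.

```python
def cmp_swap(v, i, j):
    """Compare-and-swap: afterwards v[i] <= v[j]."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]


# Standard 5-comparator network for 4 inputs. The first two pairs are
# independent of each other, as are the next two, so they can be scheduled
# in either order or in parallel.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(v):
    """Sort a 4-element list in place with a fixed comparator sequence."""
    for i, j in NETWORK_4:
        cmp_swap(v, i, j)
    return v
```

The comparator sequence does not depend on the data, which is what makes full unrolling and instruction scheduling possible.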

  18. Architectural: Cache Line Size • Fanout = cache line size • Increases cache line utilization when accessing children nodes [Figure: cache line]

  19. Runtime: Amount and Distribution Shape [Figure: execution time (cycles) versus number of keys (millions)]

  20. Runtime: Amount and Distribution Shape [Figure: execution time (cycles) versus number of keys (millions)]

  21. Runtime: Standard Deviation [Figure: execution time (cycles) versus standard deviation of the keys; Pentium III Xeon, 16 M keys]

  22. Outline • Sorting Algorithms • Factors that determine performance • The Library • Evaluation • Future Work • Conclusions

  23. Library adaptation • Architectural factors → empirical search • Cache / TLB size • Number of registers • Cache line size • Runtime factors • Distribution shape of the data → does not matter • Amount of data to sort → machine learning and runtime adaptation • Standard deviation → machine learning and runtime adaptation

  24. The Library • Building the library → installation time • Empirical search • Learning procedure • Use of training data • Running the library → runtime • Runtime procedure • The learning procedure and the runtime procedure together make up runtime adaptation

  25. Runtime Adaptation: Learning Procedure • Goal function: f: (N, E) → {Multiway Merge Sort, Quicksort, CC-radix}, where N is the amount of input data and E is the entropy vector • Use N to choose between Multiway Merge Sort and Quicksort • Use the entropy and the Winnow algorithm to learn the best algorithm • Output: weight vector (w) and threshold (θ)

  26. Runtime Adaptation: Runtime Procedure • Sample the input array • Compute the entropy vector • Compute S = ∑_i w_i * entropy_i • If S ≥ threshold, choose CC-radix; otherwise choose between Multiway Merge Sort and Quicksort based on N
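The four steps on this slide can be sketched as below. Everything numeric here is an assumption for illustration: the sample size, the 8-bit digits, the 32-bit non-negative integer keys, and the weights and threshold passed in by the caller. In the library those last two would come from the learning procedure. The string `"other"` stands for the choice between Multiway Merge Sort and Quicksort, which the slides say is made from N.

```python
import math
import random


def entropy_vector(keys, bits=8, key_bits=32):
    """One entropy value per digit of the keys (non-negative ints assumed)."""
    n = len(keys)
    mask = (1 << bits) - 1
    vec = []
    for shift in range(0, key_bits, bits):
        counts = {}
        for k in keys:
            d = (k >> shift) & mask
            counts[d] = counts.get(d, 0) + 1
        # Entropy = sum_i -P_i * log2(P_i), with P_i = c_i / N.
        vec.append(sum(-(c / n) * math.log2(c / n) for c in counts.values()))
    return vec


def choose_algorithm(keys, w, threshold, sample_size=1024, seed=0):
    """Sample the input, compute S = sum_i w_i * entropy_i, compare to threshold."""
    rng = random.Random(seed)
    sample = keys if len(keys) <= sample_size else rng.sample(keys, sample_size)
    e = entropy_vector(sample)
    s = sum(wi * ei for wi, ei in zip(w, e))
    return "CC-radix" if s >= threshold else "other"
```

Working on a sample keeps the selection cost small relative to the sort itself.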

  27. Outline • Sorting Algorithms • Factors that determine performance • The Library • Evaluation • Future Work • Conclusions

  28. Experimental Setup • Test platforms: • SGI R12000: 300 MHz; L1 I/D = 32 KB; L2 = 4 MB • UltraSparc III: 750 MHz; L1 I/D = 32 KB / 64 KB; L2 = 8 MB • Pentium III Xeon: 550 MHz; L1 I/D = 16 KB; L2 = 512 KB • IBM Power3: 375 MHz; L1 I/D = 64 KB; L2 = 8 MB

  29. Sun UltraSparc III: 12 M keys [Figure: execution time (cycles per key) versus standard deviation of the keys]

  30. IBM Power3: 12 M keys [Figure: execution time (cycles per key) versus standard deviation of the keys]

  31. Conclusions • Identified the architectural and runtime factors • Used empirical search to find the best parameter values • Our machine learning techniques prove to be quite effective: • They always select the best algorithm. • A wrong decision would introduce a 37% average performance degradation. • Overhead: average 5%, worst case 7%

  32. Future Work • Search in the space of sorting algorithms using high-level primitives • Extend sorting to include more data types • Include other comparison strategies (for example, a less-than relation to sort vectors, graphs, …) • Parallel algorithms • Explore other database operations, such as join

  33. A Memory Hierarchy Conscious and Self-tunable Sorting Library Xiaoming Li, María Jesús Garzarán, and David Padua To appear in 2004 International Symposium on Code Generation and Optimization (CGO’04) University of Illinois at Urbana-Champaign

  34. Empirical search for small partitions • Sorting networks obtain the best performance improvement (average 15%) [Figure: Intel Pentium III Xeon]

  35. Runtime: Amount and Distribution Shape [Figure: execution time (cycles) versus number of keys (millions)]

  36. Performance vs. Distribution [Figure]

  37. Performance vs. Distribution [Figure]

  38. Performance vs. Sdev [Figure]

  39. Performance vs. Sdev [Figure]

  40. Multiway Merge Sort [Figure]

  41. Runtime: Distribution of Data • Distribution shapes: Uniform, Normal, Exponential, …

  42. Architectural: Number of Registers [Figure]

  43. Sorting algorithms for small partitions • Insertion sort → exploits locality in the cache line • Sorting networks → register blocking

  44. Runtime: Distribution of Data • Distribution shapes: Uniform, Normal, Exponential, … • Distribution width: • Standard deviation (sdev): • Only good for single-peak distributions • Expensive to calculate • Entropy • Represents the distribution of each bit • The goal is to distinguish the comparison-based algorithms from the radix-based one

  45. Entropy • Goal: determine when CC-radix is best • Standard deviation • Expensive to compute • Not a good metric for our goal • Compute the entropy of each digit: Entropy = ∑_i −P_i * log2 P_i, where P_i = c_i / N and c_i is the number of keys whose digit has value i.
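A small worked example of the entropy formula for a single digit, written as a sketch. The function name `digit_entropy` and the sample digit lists are made up for illustration.

```python
import math
from collections import Counter


def digit_entropy(digits):
    """Entropy = sum_i -P_i * log2(P_i), with P_i = c_i / N."""
    n = len(digits)
    counts = Counter(digits)           # c_i: number of keys with digit value i
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Two digit values, each seen in half of the keys: P = 0.5 for both.
print(digit_entropy([0, 0, 1, 1]))     # 1.0
# Every key has the same digit value: P = 1, so the entropy is zero.
print(digit_entropy([7, 7, 7, 7]))     # 0.0
```

A digit with zero entropy carries no information for partitioning, while a digit with high entropy spreads the keys evenly across buckets.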

  46. Learning Procedure • f: (N, E) → {Multiway Merge, CC-radix} is a linearly separable problem: • f(x1, x2, …, xn) is a decision problem for which there exist a weight vector w and a threshold θ such that f(x) is true if w · x ≥ θ and false otherwise • Use the Winnow machine learning algorithm to learn f: (N, E) • The results of the learning are w and θ
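For readers unfamiliar with Winnow, here is a sketch of the textbook algorithm on binary features: predict true when w · x ≥ θ, multiply the weights of active features by α after a missed positive, and divide them by α after a false positive. The slides do not say how N and the real-valued entropy vector are encoded as features, so the toy target below (x0 OR x2 over four binary features), the choice θ = n/2, and α = 2 are all assumptions for illustration.

```python
from itertools import product


def predict(w, x, theta):
    """Linear threshold decision: true if w . x >= theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) >= theta


def winnow_train(examples, n, alpha=2.0, max_epochs=100):
    """Textbook Winnow with multiplicative promotion and demotion."""
    w = [1.0] * n
    theta = n / 2
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in examples:
            if predict(w, x, theta) == label:
                continue
            mistakes += 1
            # Promote active features on a missed positive, demote on a false positive.
            factor = alpha if label else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:              # a full pass with no errors: done
            break
    return w, theta


# Toy, linearly separable target used only for this sketch: x0 OR x2.
examples = [(x, bool(x[0] or x[2])) for x in product((0, 1), repeat=4)]
w, theta = winnow_train(examples, n=4)
```

The learned pair (w, θ) plays the same role as the weight vector and threshold that the learning procedure outputs on the slide.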

  47. Intel PIII Xeon [Figure]

  48. SGI R12000 [Figure]

  49. Runtime: Amount of Data to Sort • Quicksort • Cache misses increase as the amount of data grows. • CC-radix • As the amount of data increases, CC-radix needs more partitioning passes. • Multiway Merge Sort • Shows an advantage only when the amount of data is large, i.e., when the reduction in cache misses compensates for the complexity of the algorithm.

  50. Empirical Search • Adaptation to the architecture of the machine • Quicksort and CC-radix: the best configuration does not change significantly with the characteristics of the input data set • Quicksort and CC-radix: • Use of insertion sort or sorting networks for small partitions • Threshold at which to use them • CC-radix: • Size of the radix • Multiway Merge Sort: • The best configuration changes with the amount and the distribution of the input data • The best values are searched for during the learning procedure
