A Dynamically Tuned Sorting Library Xiaoming Li, María Jesús Garzarán, and David Padua In 2004 International Symposium on Code Generation and Optimization (CGO’04) University of Illinois at Urbana-Champaign
Motivation
• Sorting
  • Core operation in many applications, such as databases
  • Well-understood symbolic computing problem
• Library generators such as ATLAS and SPIRAL have used empirical search to adapt to
  • Architectural features of the target machine
  • Size of the input data
• But the performance of sorting also depends on the distribution of the values to be sorted
Motivation
• Main difficulties in building a sorting library:
  • Theoretical complexity is not sufficient to measure quality
    • Cache effects, number of instructions executed
  • Performance depends on the characteristics of the input
    • Amount and distribution of the data to sort
  • A single algorithm is not optimal for all possible input sets
Contributions
• Identify the architectural and runtime factors that affect the performance of the sorting algorithms.
• Use empirical search to identify the best shape and parameter values of a sorting algorithm.
• Use machine learning and runtime adaptation to select the best sorting algorithm for a specific input set.
Contributions
[Figure: IBM Power3, sorting 12M keys (32-bit integers); execution time (cycles) vs. standard deviation of the inputs]
Outline
• Sorting Algorithms
• Factors that determine performance
• The Library
• Evaluation
• Future Work
• Conclusions
Sorting Algorithms
• Our sorting library contains:
  • Quicksort
  • CC-radix
  • Multiway Merge Sort
  • Insertion Sort and Sorting Networks (for small partitions)
Quicksort
• Divide-and-conquer, in-place sorting algorithm
• Our implementation includes Sedgewick's optimizations:
  • Set guardians (sentinels) at both ends of the input array.
  • Eliminate recursion.
  • Select the pivot carefully (median of three).
  • Use insertion sort for small partitions.
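The pieces combine roughly as in the following sketch (hypothetical C code, not the paper's implementation; the SMALL cutoff stands in for the threshold the library finds by empirical search):

#include <stddef.h>

#define SMALL 16  /* small-partition cutoff; the library would tune this */

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void insertion_sort(int *v, long lo, long hi)
{
    for (long i = lo + 1; i <= hi; i++) {
        int key = v[i];
        long j = i - 1;
        while (j >= lo && v[j] > key) { v[j + 1] = v[j]; j--; }
        v[j + 1] = key;
    }
}

void tuned_quicksort(int *v, long n)
{
    struct range { long lo, hi; } stack[64];  /* explicit stack replaces recursion */
    int top = 0;
    if (n > 1) { stack[top].lo = 0; stack[top].hi = n - 1; top++; }
    while (top > 0) {
        long lo = stack[--top].lo, hi = stack[top].hi;
        if (hi - lo < SMALL) { insertion_sort(v, lo, hi); continue; }
        /* Median-of-three pivot; afterwards v[lo] and v[hi] act as
           guardians (sentinels) for the two inner scans. */
        long mid = lo + (hi - lo) / 2;
        if (v[mid] < v[lo]) swap_int(&v[mid], &v[lo]);
        if (v[hi] < v[lo])  swap_int(&v[hi],  &v[lo]);
        if (v[hi] < v[mid]) swap_int(&v[hi],  &v[mid]);
        int pivot = v[mid];
        long i = lo, j = hi;
        while (i <= j) {
            while (v[i] < pivot) i++;   /* guardian at hi stops this scan */
            while (v[j] > pivot) j--;   /* guardian at lo stops this scan */
            if (i <= j) { swap_int(&v[i], &v[j]); i++; j--; }
        }
        /* Push the larger half first so the stack depth stays O(log n). */
        if (j - lo > hi - i) {
            if (lo < j) { stack[top].lo = lo; stack[top].hi = j;  top++; }
            if (i < hi) { stack[top].lo = i;  stack[top].hi = hi; top++; }
        } else {
            if (i < hi) { stack[top].lo = i;  stack[top].hi = hi; top++; }
            if (lo < j) { stack[top].lo = lo; stack[top].hi = j;  top++; }
        }
    }
}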
Radix Sort
• Non-comparison-based algorithm
[Figure: worked example of one counting pass, showing the vector to sort, the per-digit counters, the accumulated offsets, and the destination vector]
CC-radix (Cache-Conscious Radix Sort)
• Tries to exploit data locality in the caches
• Based on radix sort (Jiménez and Larriba, UPC)

CC-radix(bucket)
  if fits_in_cache(bucket) then
    radix_sort(bucket)
  else
    sub_buckets = reverse_sorting(bucket)
    for each sub_bucket in sub_buckets
      CC-radix(sub_bucket)
    endfor
  endif
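A compact C rendering of the same skeleton (a sketch under assumptions: 32-bit unsigned keys, fixed 8-bit digits, and a hard-coded CACHE_BYTES standing in for the detected cache size; the real library tunes the radix size empirically):

#include <stdlib.h>
#include <string.h>

#define DIGIT_BITS  8
#define NBUCKETS    (1u << DIGIT_BITS)
#define CACHE_BYTES (4u << 20)   /* stand-in for the detected L2 size */

/* One counting-sort pass on the digit that starts at bit `shift`. */
static void counting_pass(unsigned *src, unsigned *dst, size_t n, int shift)
{
    size_t count[NBUCKETS] = {0}, pos[NBUCKETS];
    for (size_t i = 0; i < n; i++) count[(src[i] >> shift) & (NBUCKETS-1)]++;
    pos[0] = 0;
    for (size_t b = 1; b < NBUCKETS; b++) pos[b] = pos[b-1] + count[b-1];
    for (size_t i = 0; i < n; i++)
        dst[pos[(src[i] >> shift) & (NBUCKETS-1)]++] = src[i];
}

/* If the bucket fits in cache, finish it with plain LSD radix sort;
   otherwise split it by its most significant digit ("reverse sorting")
   and recurse into each sub-bucket. */
void cc_radix(unsigned *v, unsigned *tmp, size_t n, int msb_shift)
{
    if (n * sizeof(unsigned) <= CACHE_BYTES || msb_shift < 0) {
        for (int s = 0; s <= msb_shift; s += DIGIT_BITS) {
            counting_pass(v, tmp, n, s);        /* LSD passes, in cache */
            memcpy(v, tmp, n * sizeof(unsigned));
        }
        return;
    }
    counting_pass(v, tmp, n, msb_shift);        /* MSD split */
    memcpy(v, tmp, n * sizeof(unsigned));
    size_t start = 0;
    for (size_t b = 0; b < NBUCKETS; b++) {     /* walk the sub-buckets */
        size_t len = 0;
        while (start + len < n &&
               ((v[start + len] >> msb_shift) & (NBUCKETS-1)) == b) len++;
        if (len > 1) cc_radix(v + start, tmp, len, msb_shift - DIGIT_BITS);
        start += len;
    }
}

A caller would allocate a scratch buffer of n keys and invoke cc_radix(v, tmp, n, 24) for 32-bit keys split into four 8-bit digits.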
Multiway Merge Sort
• This algorithm exploits data locality very efficiently
[Figure: p sorted subsets feed a heap of 2*p - 1 nodes, which produces the merged output]
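A minimal sketch of the merge phase (hypothetical code; the slide's heap of 2*p - 1 nodes is rendered here as an array min-heap over the run heads, and every run is assumed non-empty):

#include <stddef.h>

struct run { const int *cur, *end; };   /* head and end of one sorted subset */

static void sift_down(struct run *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2*i + 1, r = l + 1, m = i;
        if (l < n && *h[l].cur < *h[m].cur) m = l;
        if (r < n && *h[r].cur < *h[m].cur) m = r;
        if (m == i) return;
        struct run t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge p sorted runs into out; total is the sum of the run lengths. */
void multiway_merge(struct run *heap, size_t p, int *out, size_t total)
{
    for (size_t i = p; i-- > 0; ) sift_down(heap, p, i);  /* heapify */
    for (size_t k = 0; k < total; k++) {
        out[k] = *heap[0].cur++;              /* pop the global minimum */
        if (heap[0].cur == heap[0].end)       /* run exhausted: drop it */
            heap[0] = heap[--p];
        sift_down(heap, p, 0);
    }
}

Each pop touches only the heap and the head of one run, which is what keeps the working set small enough to stay in cache.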
Sorting algorithms for small partitions
• Insertion sort: exploits locality in the cache line
• Sorting networks: register blocking
Performance Comparison
[Figure: Pentium III Xeon, 16M keys (float)]
Outline
• Sorting Algorithms
• Factors that determine performance
• The Library
• Evaluation
• Future Work
• Conclusions
Factors that determine performance
• Architectural factors considered:
  • Cache / TLB size
  • Number of registers
  • Cache line size
• Runtime factors considered:
  • Amount of data to sort
  • Distribution of the data
Architectural: Cache / TLB Size
• Tiling: partition the data into subsets that fit in the cache
• Quicksort
  • Use multiple pivots to tile
• CC-radix
  • Fit each partition into the cache
  • Keep the number of active partitions below the TLB size
• Multiway Merge Sort
  • Fit the heap into the cache
  • Fit the sorted subsets into the cache
Architectural: Number of Registers
• For small partitions, sort in place using the processor registers
• Optimizations like unrolling and scheduling can be applied; for example, the sequence

  cmp&swap(r0,r1)
  cmp&swap(r2,r3)
  cmp&swap(r1,r2)
  cmp&swap(r0,r3)
  cmp&swap(r4,r5)
  ...

can be rescheduled so that independent operations are adjacent:

  cmp&swap(r0,r1)
  cmp&swap(r2,r3)
  cmp&swap(r4,r5)
  cmp&swap(r1,r2)
  cmp&swap(r0,r3)
  ...
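For instance, a 4-key sorting network might look as follows (hypothetical sketch; the keys are held in locals so the compiler can keep them in registers, and independent comparators are grouped so they can issue in the same cycle):

/* Standard optimal 5-comparator network for 4 keys. */
#define CSWAP(a, b) do { if ((b) < (a)) { int t = (a); (a) = (b); (b) = t; } } while (0)

static void sort4(int *v)
{
    int r0 = v[0], r1 = v[1], r2 = v[2], r3 = v[3];
    CSWAP(r0, r1); CSWAP(r2, r3);   /* independent: can run in parallel */
    CSWAP(r0, r2); CSWAP(r1, r3);   /* independent */
    CSWAP(r1, r2);                  /* final cleanup comparator */
    v[0] = r0; v[1] = r1; v[2] = r2; v[3] = r3;
}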
Architectural: Cache Line Size
• Heap fanout = cache line size (in keys)
• Increases cache line utilization when accessing the children of a node
[Figure: heap node whose children fill one cache line]
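A sketch of the idea (hypothetical; LINE_BYTES would come from the empirical search rather than being hard-coded):

#include <stddef.h>

/* Choose the fanout so that all children of a node share one cache line. */
#define LINE_BYTES 64
#define KEY_BYTES  sizeof(int)
#define FANOUT     (LINE_BYTES / KEY_BYTES)   /* e.g., 16 children per node */

/* With children stored contiguously, one line fill serves all of them. */
static size_t first_child(size_t i) { return i * FANOUT + 1; }
static size_t parent(size_t i)      { return (i - 1) / FANOUT; }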
Runtime: Amount and Distribution Shape
[Figures: execution time (cycles) vs. number of keys (millions), for different input distributions]
Runtime: Standard Deviation
[Figure: Pentium III Xeon, 16M keys; execution time (cycles) vs. standard deviation of the keys]
Outline
• Sorting Algorithms
• Factors that determine performance
• The Library
• Evaluation
• Future Work
• Conclusions
Library adaptation
• Architectural factors (handled by empirical search):
  • Cache / TLB size
  • Number of registers
  • Cache line size
• Runtime factors (handled by machine learning and runtime adaptation):
  • Distribution shape of the data (does not matter)
  • Amount of data to sort
  • Standard deviation
The Library
• Building the library (installation time):
  • Empirical search
  • Learning procedure, using training data
• Running the library (runtime):
  • Runtime procedure: runtime adaptation
Runtime Adaptation: Learning Procedure
• Goal function: f: (N, E) → {Multiway Merge Sort, Quicksort, CC-radix}
  • N: amount of input data
  • E: the entropy vector
• Use N to choose between Multiway Merge Sort and Quicksort
• Use the entropy vector and the Winnow algorithm to learn the best algorithm
• Output: weight vector (w) and threshold (Ө)
Runtime Adaptation: Runtime Procedure
• Sample the input array
• Compute the entropy vector
• Compute S = ∑i wi * entropyi
• If S ≥ Ө, choose CC-radix; otherwise choose between the comparison-based algorithms
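Putting the procedure together (a hypothetical sketch, not the paper's code; it assumes 32-bit keys split into four 8-bit digits and a fixed sample size, both of which would really be tuned):

#include <math.h>
#include <stddef.h>

#define DIGITS   4      /* 8-bit digits in a 32-bit key (assumption) */
#define SAMPLES  1024   /* sample size (assumption)                  */

/* Sample the input, build the per-digit entropy vector, and apply the
   learned weights w and threshold from the Winnow training. */
int choose_cc_radix(const unsigned *keys, size_t n,
                    const double *w, double threshold)
{
    double S = 0.0;
    size_t step = n > SAMPLES ? n / SAMPLES : 1;
    for (int d = 0; d < DIGITS; d++) {
        size_t count[256] = {0}, total = 0;
        for (size_t i = 0; i < n; i += step) {   /* sample the array */
            count[(keys[i] >> (8 * d)) & 0xff]++;
            total++;
        }
        double H = 0.0;                          /* entropy of digit d */
        for (int b = 0; b < 256; b++)
            if (count[b]) {
                double p = (double)count[b] / (double)total;
                H -= p * log2(p);
            }
        S += w[d] * H;                           /* S = sum_i w_i * entropy_i */
    }
    return S >= threshold;   /* true: CC-radix; false: comparison-based */
}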
Outline
• Sorting Algorithms
• Factors that determine performance
• The Library
• Evaluation
• Future Work
• Conclusions
Experimental Setup
• Test platforms:
  • SGI R12000: 300 MHz; L1I/D = 32KB; L2 = 4MB
  • UltraSparc III: 750 MHz; L1I = 32KB, L1D = 64KB; L2 = 8MB
  • Pentium III Xeon: 550 MHz; L1I/D = 16KB; L2 = 512KB
  • IBM Power3: 375 MHz; L1I/D = 64KB; L2 = 8MB
Sun UltraSparc III: 12M keys
[Figure: execution time (cycles per key) vs. standard deviation of the keys]
IBM Power3: 12M keys
[Figure: execution time (cycles per key) vs. standard deviation of the keys]
Conclusions
• We identify the architectural and runtime factors that affect sorting performance.
• We use empirical search to find the best parameter values.
• Our machine learning techniques prove to be quite effective:
  • The library always selects the best algorithm.
  • A wrong decision would introduce a 37% average performance degradation.
  • Overhead is low (average 5%, worst case 7%).
Future Work
• Search in the space of sorting algorithms using high-level primitives
• Extend sorting to include more data types
• Include other comparison strategies (for example, "less than" to sort vectors, graphs, …)
• Parallel algorithms
• Explore other database operations, such as join
A Memory Hierarchy Conscious and Self-tunable Sorting Library Xiaoming Li, María Jesús Garzarán, and David Padua To appear in 2004 International Symposium on Code Generation and Optimization (CGO’04) University of Illinois at Urbana-Champaign
Empirical search for small partitions
[Figure: Intel Pentium III Xeon]
• Sorting networks obtain the best performance improvement (average 15%)
Runtime: Amount and Distribution Shape
[Figure: execution time (cycles) vs. number of keys (millions)]
Runtime: Distribution of Data
• Distribution shapes: Uniform, Normal, Exponential, …
• Distribution width:
  • Standard deviation (sdev):
    • Only good for one-peak distributions
    • Expensive to calculate
  • Entropy:
    • Represents the distribution of each bit
• The goal is to distinguish the comparison-based algorithms from the radix-based one
Entropy
• Goal: determine when CC-radix is the best algorithm
• Standard deviation:
  • Expensive to compute
  • Not a good metric for our goal
• Instead, compute the entropy of each digit:
  Entropy = ∑i -Pi * log2 Pi, where Pi = ci / N and ci is the number of keys that have value i in that digit
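As a small illustration (hypothetical helper, with ci and N as defined above): a digit that takes the same value in all N keys gives H = 0, while a digit uniform over its 256 values gives H = log2(256) = 8 bits; skewed distributions fall in between.

#include <math.h>
#include <stddef.h>

/* Entropy of one digit, given the value counts over the N keys. */
double digit_entropy(const size_t count[256], size_t n)
{
    double H = 0.0;
    for (int b = 0; b < 256; b++)
        if (count[b]) {
            double p = (double)count[b] / (double)n;  /* P_b = c_b / N */
            H -= p * log2(p);                         /* -sum P log2 P */
        }
    return H;
}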
Learning Procedure
• f: (N, E) → {Multiway Merge Sort, CC-radix} is a linearly separable problem:
  • f(x1, x2, …, xn) is a decision problem for which there exists a weight vector w such that f(x) is true if w · x ≥ Ө and false otherwise
• Use the Winnow machine learning algorithm to learn f: (N, E).
• The results of the learning are w and Ө.
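A minimal sketch of the classic Winnow update rule (binary features; the paper adapts the idea to the real-valued entropy vector, so this is illustrative only):

/* Weights start at 1; a mistake triggers multiplicative promotion or
   demotion of the weights of the active features. */
#define NFEAT 4
#define ALPHA 2.0

void winnow_train(double w[NFEAT], double theta,
                  const int x[NFEAT], int label /* 1: CC-radix wins */)
{
    double s = 0.0;
    for (int i = 0; i < NFEAT; i++) s += w[i] * x[i];
    int predict = (s >= theta);
    if (predict == label) return;             /* correct: no update */
    for (int i = 0; i < NFEAT; i++)
        if (x[i]) w[i] *= label ? ALPHA : 1.0 / ALPHA;
}

Training repeats this update over labeled training inputs until the weights stop changing; the classic choice of threshold Ө is the number of features.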
SGI R12000
[Figure: execution time results]
Runtime: Amount of Data to Sort
• Quicksort
  • Cache misses increase as the amount of data grows.
• CC-radix
  • As the amount of data increases, CC-radix needs more partitioning passes.
• Multiway Merge Sort
  • Only shows an advantage when the amount of data is large, i.e., when the gain in cache misses compensates for the complexity of the algorithm.
Empirical Search
• Adaptation to the architecture of the machine
• Quicksort and CC-radix:
  • The best configuration does not change significantly with the characteristics of the input data set.
  • Parameters searched at installation time:
    • Use of insertion sort / sorting networks for small partitions
    • Threshold at which to use them
    • Size of the radix (CC-radix)
• Multiway Merge Sort:
  • The best configuration changes with the amount and the distribution of the input data.
  • The best values are searched for during the learning procedure.
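The installation-time search might look like this sketch (hypothetical code; sort_with_threshold and make_training_input are assumed helpers, and a real search would time several runs per candidate and cover more parameters):

#include <time.h>

extern void sort_with_threshold(int *v, long n, int threshold); /* assumed */
extern void make_training_input(int *v, long n);                /* assumed */

/* Try candidate small-partition thresholds and keep the fastest. */
int search_threshold(int *buf, long n)
{
    int best = 8;
    double best_time = 1e30;
    for (int t = 8; t <= 64; t *= 2) {
        make_training_input(buf, n);
        clock_t start = clock();
        sort_with_threshold(buf, n, t);
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (elapsed < best_time) { best_time = elapsed; best = t; }
    }
    return best;
}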