150 likes | 325 Views
A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting. Arne Maus and Stein Gjessing Dept. of Informatics, University of Oslo, Norway. Overview. Motivation, CPU versus Memory speed Caches A cache test A simple model for the execution times of algorithms
E N D
A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting Arne Maus and Stein Gjessing Dept. of Informatics, University of Oslo, Norway OMS 2007
Overview • Motivation, CPU versus Memory speed • Caches • A cache test • A simple model for the execution times of algorithms • Does theoretical cache tests carry over to real programs? • A real example – three Radix sorting algorithms compared The number of instructions executed is no longer a good measure for the performance of an algorithm OMS 2007
The need for caches, the CPU-Memory performance gap from: John L. Hennessy , David A. Patterson, Computer architecture a quantitative approach,: Morgan Kaufmann Publishers Inc., San Francisco, CA, 2003 OMS 2007
A cache test random vs. sequential access i large arrays • Both a and b are of length n (n= 100, 200, 400,..., 97m) • 2 test runs – the same number of instruction performed: • Random access: set b[i] = random(0..n-1) • We will get 15 random accesses i b and 1 in a, and 1 sequential access i b (the innermost) • Sequential access :set b[i] = i. • then b[b[.....b[i]....]] = i, and we will get 16 sequential accesses in b and 1 in a for (int i= 0; i < n; i++) a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i; OMS 2007
Random vs. sequential access times, the same number of instructions performed. Cache-misses slowing random access down a factor: 50 – 60 (4 CPUs) start of cache miss from L2 to memory start of cache miss from L1 to L2 OMS 2007
Why a slowdown of 50-60 and not factor 400 ? • Patterson and Hennessy suggests a slowdown factor of 400, test shows 50 to 60 – why? • Answer: Every array access in Java is checked for lower and upper array limits – say: • load array index • compare with zero (lower limit) • load upper limit • compare index and upper limit • load array base address • load/store array element ( = possible cache miss) • We see 5 cache hit operations + one cache miss – then average = (5 + 400)/6 = 67 OMS 2007
n L1 L2 A simple model for the execution time of a program • For every loop in program • Count the number of sequential references • Count the number of random accesses and the number of places n in which the randomly accessed object (array) is used • From the figure for random access, we see a asymptotical slowdown factor of: 1 if n < L1 4 if L1 < n < L2 50 if L2 < n • The access time TR for one random read or write is then: TR = 1* Pr (access in L1) + 4* Pr (access in L2) + 50* Pr (access in memory) ( = 1* L1/n + 4* L2/n + 50* (n - L2)/n , when n > L2 ) • The sequential reads and writes is set to 1, and we can then estimate the total execution time as the weighted sum over all loop accesses OMS 2007
Applying the model to Radix sorting – the test • Three Radix algorithms • radix1, sorting the array in one pass with one ‘large’ digit • radix2, sorting the array in two passes with two half sized digits • radix3, sorting the array in three passes with three ‘small’ digits • radix3 performs almost three times as many instructions as radix1 • should be almost 3 times as slow as radix1? • radix2 performs almost twice as many instructions as radix1 • should be almost 2 times as slow as radix1? OMS 2007
static void radixSort ( int [] a, int [] b ,int left, int right, int maskLen, int shift) { int acumVal = 0, j, n = right-left+1; int mask = (1<<maskLen) -1; int [] count = new int [mask+1]; // a) count=the frequency of each radix value in a for (int i = left; i <=right; i++) count[(a[i]>> shift) & mask]++; // b) Add up in 'count' - accumulated values for (int i = 0; i <= mask; i++) { j = count[i]; count[i] = acumVal; acumVal += j; } // c) move numbers in sorted order a to b for (int i = 0; i < n; i++) b[count[(a[i+left]>>shift) & mask]++] = a[i+left]; // d) copy back b to a for (int i = 0; i < n; i++) a[i+left] = b[i] ; } Base: Right Radix sorting algorithm : One pass of array a with one sorting digit of width: maskLen(shifted shift bits up) OMS 2007
Radix sort with 1, 2 and 3 digits = 1,2 and 3 passes static void radix1 (int [] a, int left, int right) { // 1 digit radixSort: a[left..right] int max = 0, numBit = 1, n = right-left+1; for (int i = left ; i <= right ; i++) if (a[i] > max) max = a[i]; while (max >= (1<<numBit)) numBit++; int [] b = new int [n]; radixSort( a,b, left, right, numBit, 0); } static void radix3(int [] a, int left, int right) { // 3 digit radixSort: a[left..right] int max = 0,numBit = 3, n = right-left+1; for (int i = left ; i <= right ; i++) if (a[i] > max) max = a[i]; while (max >= (1<<numBit)) numBit++; int bit1 = numBit/3, bit2 = bit1, bit3 = numBit-(bit1+bit2); int [] b = new int [n]; radixSort( a,b, left, right, bit1, 0); radixSort( a,b, left, right, bit2, bit1); radixSort( a,b, left, right, bit3, bit1+bit2); } 1 3 OMS 2007
Random /sequential test (AMD Opteron) , Radix 1, 2 and 3 compared withQuicksort and Flashsort radix1 slowed down by a factor 7 radix3, no slowdown radix2, slowdown started OMS 2007
Random /sequential test (Intel Xeon) , Radix 1, 2 and 3 compared withQuicksort and Flashsort OMS 2007
The model, careful counting of loops in radix1,2,3 • Let Ek denote the number of the different operations for a k-pass radix algorithm (k=1,2,..), S denote a sequential read or write, and Rk a random read or write in m different places in an array where: • After some simplification: • and (+ some more simplifications): OMS 2007
Conclusions • The effects of cache-misses are real and show up in ordinary user algorithms when doing random access in large arrays. • We have demonstrated that radix3, that performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n. • i.e. radix1 experiences a slowdown of factor 7-10 because of cache-misses The number of instructions executed is no longer a good measure for the performance of an algorithm. Algorithms should be rewritten such that random access inlarge data structures is removed. OMS 2007