300 likes | 429 Views
Hitting for performance. Cache Performance Analysis. Standard Matrix Multiplication. for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1 load(a[i,k]); temp2 load(b[k,j]);
E N D
Hitting for performance Cache Performance Analysis CMPUT 229
CMPUT 229 Standard Matrix Multiplication for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1 load(a[i,k]); temp2 load(b[k,j]); sum sum + temp1*temp2; } store(c[i,j]) sum; } } for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ c[i,j] = 0.0; for(k = 0; k<n ; k++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?
CMPUT 229 Data Access Pattern A B
CMPUT 229 32K-byte cache = 256 lines/cache 128-byte cache line Cache Access Analysis Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?
CMPUT 229 32K-byte cache = 256 lines/cache 128-byte cache lines 128-byte cache line 8-byte element = 16 elements/line Cache Access Analysis Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program?
CMPUT 229 # hits 960 hits Hit ratio = = # of accesses 2048 accesses 256 lines/cache 16 elements/line Cache Data Access Pattern If we ignore conflict misses, then: Every 16th access of A is a miss; Every access to B is a miss; How many hits and misses will occur to compute one element of C? In A there will be 1024/16 = 64 misses and 1024-64 = 960 hits. In B there will be 1024 misses. Thus, what is the hit ratio? = 0.47 = 47%
CMPUT 229 31 15 14 7 6 0 Tag Index Offset 7 bits 17 bits 8 bits 256 lines/cache 16 elements/line Address anatomy The data cache has 32 Kbytes and 128-byte cache lines; 128 = 27 256 = 28
CMPUT 229 256 lines/cache 16 elements/line Conflict Misses Cache Access Address Index Outcome A[0,0] $80000000 0 miss B[0,0] $80800000 0 miss A[0,1] $80000004 0 miss B[1,0] $80801000 32 miss A[0,2] $80000008 0 hit B[2,0] $80802000 64 miss A[0,3] $8000000C 0 hit B[3,0] $80803000 96 miss A[0,4] $80000010 0 hit B[4,0] $80804000 128 miss A[0,5] $80000014 0 hit B[5,0] $80805000 160 miss A[0,6] $80000018 0 hit B[6,0] $80806000 192 miss A[0,7] $8000001C 0 hit B[7,0] $80807000 244 miss A[0,8] $80000020 0 hit B[8,0] $80808000 0 miss A[0,9] $80000024 0 miss B[9,0] $80809000 32 miss 0 32 64 In General: A 1024-element row of A Occupies 64 16-element cache lines. There will be 2 conflict misses in two of these rows. A total of 4 conflict misses per row. Thus the accesses of A will result in 68 misses and 986 hits for each 1024 accesses. The conflict misses are not significant and can be ignored. 96 128 160 192 244
CMPUT 229 Matrix Multiplication with Transpose for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ b1[i,j] = b[j,i]; } } for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ for(k = 0; k<n ; k++){ c[i,j] = c[i,j] + a[i,k] * b1[j,k]; } } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; n = 1024, Address(a[0,0]) = $8000000, Address(b[0,0]) = $80800000 Address(c[0,0]) = $8100000 What is the data cache hit ratio for this program? Where in memory should we place matrix b1 to reduce conflict misses?
CMPUT 229 31 15 14 7 6 0 Tag Index Offset Where to place matrix b1? Intuitively the index of b1[0][0] should be away from the index of a[0][0]. The index of a[0][0] is 0. Thus we could aim to place b1 at an address whose index is 128.
CMPUT 229 # hits 960 hits # of accesses 2048 accesses Hit ratio = = Cache Access Pattern for the Transpose for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ b1[i,j] = b[j,i]; } } If we ignore conflict misses, then: Every 16th access of b1 is a miss; Every access to b is a miss; The transpose’s inner loop yields: 2048 accesses 960 hits. And the inner loop is repeated 1024 times: 1024 2048 accesses 1024 960 hits Thus, the hit ratio is: = 0.47 = 47%
CMPUT 229 Cache Access Pattern for the Multiplication for (i = 0; i<n ; i++){ for(j = 0; j<n ; j++){ sum = 0.0; for(k = 0; k<n ; k++){ temp1 load(a[i,k]); temp2 load(b1[j,k]); sum sum + temp1*temp2; } store(c[i,j]) sum; } } If we ignore conflict misses, then: Every 16th access of a is a miss; Every 16th access to b1 is a miss; Thus the inner loop yields 2048 accesses and 1920 hits. The inner loop is executed n2 times. The total number of accesses (ignoring accesses to c) in the multiplication is: 1024 1024 2048 accesses 1024 1024 1920 hits
CMPUT 229 1024 960+ 1024 1024 1920 hits Hit ratio = 2048 1024 + 1024 1024 2048 accesses 960+ 1024 1920 hits Hit ratio = 1025 2048 accesses Hit Ratio for Multiplication with Transpose The total number of accesses (ignoring accesses to c) in the multiplication is: 1024 1024 2048 accesses 1024 1024 1920 hits The transpose yields: 1024 2048 accesses 1024 960 hits. = 0.937 = 93.7%
CMPUT 229 Blocked Matrix Multiplication* for (i0 = 0; i0<n ; i0 = i0 + b){ for(j0 = 0; j0<n ; j0 = j0 + b){ for(k0 = 0; k0<n ; k0 = k0 + b){ for(i = i0; i< min(i0+b-1,n) ; i++){ for(j = j0; j< min(j0+b-1,n) ; j++){ for(k = k0; j< min(k0+b-1,n) ; j++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } } } } } • Code adapted from http://www.netlib.org/utk/papers/autoblock/node2.html Assumes that all elements of matrix c were initialized to zero beforehand
CMPUT 229 miss 2 Data Access Pattern hit 0 A B
CMPUT 229 miss 3 Data Access Pattern hit 1 A B
CMPUT 229 miss 4 Data Access Pattern hit 2 A B
CMPUT 229 miss 4 Data Access Pattern hit 4 A B
CMPUT 229 miss 4 Data Access Pattern hit 6 A B
CMPUT 229 miss 4 Data Access Pattern hit 8 A B
CMPUT 229 miss 4 Data Access Pattern hit 10 A B
CMPUT 229 miss 4 Data Access Pattern hit 12 A B
CMPUT 229 miss 4 Data Access Pattern hit 14 A B Multiplying the first row of the block of A by the block of B required 18 accesses that resulted in 4 misses. How many of the 18 accesses required to multiply the second row of the block of A by the block of B will be misses?
CMPUT 229 miss 4|1 Data Access Pattern hit 14|1 A B
CMPUT 229 miss 4|1 Data Access Pattern hit 14|17 A B
CMPUT 229 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B
CMPUT 229 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B What is the hit ratio for the next block multiplication? 2b3 - b Hit ratio = 3 hits and 48 references 2b3 In general, there are b misses and 2b3 accesses
CMPUT 229 2b2 - 1 Hit ratio = 2b2 miss 4| 1 | 1 = 6 Data Access Pattern hit 14|17|17 = 48 A B What is the hit ratio for the next block multiplication? 3 hits and 48 references In general, there are b misses and 2b3 accesses
CMPUT 229 miss Data Access Pattern hit A B Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; What should be the value of b? Do the memory locations of A and B matter?
CMPUT 229 2b2 - 1 2(16)2 - 1 Hit ratio = Hit ratio = 2b2 2(16)2 Cache Usage for Blocked Matrix Multiplication for (i0 = 0; i0<n ; i0 = i0 + b){ for(j0 = 0; j0<n ; j0 = j0 + b){ for(k0 = 0; k0<n ; k0 = k0 + b){ for(i = i0; i< min(i0+b-1,n) ; i++){ for(j = j0; j< min(j0+b-1,n) ; j++){ for(k = k0; j< min(k0+b-1,n) ; j++){ c[i,j] = c[i,j] + a[i,k] * b[k,j]; } }} } } Assume that: Each matrix element is stored in 8 bytes; The data cache has 32 Kbytes and 128-byte cache lines; The data cache is direct associative; Ignore conflict misses. Estimate the hit ratio for the block computation if b=16. Hit ratio = 99.8%