Computer Systems

Computer Systems Cache characteristics Computer Systems – cache characteristics

How do you know?

(ow133)cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processor
Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
Data TLB: 4K or 4M pages, fully associative, 64 entries
1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size
No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache
Trace cache: 12K-uops, 8-way set associative
2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

CPUID instruction

    asm("push %ebx;"
        "movl $2,%eax;"
        "CPUID;"
        "movl %eax,cache_eax;"
        "movl %ebx,cache_ebx;"
        "movl %edx,cache_edx;"
        "movl %ecx,cache_ecx;"
        "pop %ebx");

A limited number of answers
http://www.sandpile.org/ia32/cpuid.htm
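CPUID with EAX = 2 answers with one-byte descriptors packed into the four registers, and each descriptor is looked up in a fixed table. The sketch below unpacks those bytes and maps a few of them to text. The function names are hypothetical, the six table entries are a small excerpt quoted from memory of Intel's descriptor table (check them against the manual before relying on them), and the register values in the usage note are illustrative values chosen to reproduce the cpuid output shown earlier, not values captured from that machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack the descriptor bytes returned by CPUID with EAX = 2.
 * regs[] holds EAX, EBX, ECX, EDX in that order.
 * - A register with bit 31 set carries no valid descriptors.
 * - The low byte of EAX is a call count, not a descriptor.
 * - A 0x00 byte is a null descriptor and is skipped.
 * Returns the number of descriptors written to out[] (at most 15). */
size_t unpack_descriptors(const uint32_t regs[4], uint8_t out[15])
{
    assert(regs != NULL && out != NULL);
    size_t n = 0;
    for (int r = 0; r < 4; r++) {
        if (regs[r] & 0x80000000u)
            continue;
        for (int byte = 0; byte < 4; byte++) {
            if (r == 0 && byte == 0)
                continue;
            uint8_t d = (uint8_t)(regs[r] >> (8 * byte));
            if (d != 0x00)
                out[n++] = d;
        }
    }
    return n;
}

/* Excerpt of the descriptor table (assumed values, see lead-in). */
const char *describe(uint8_t d)
{
    switch (d) {
    case 0x40: return "No 2nd-level cache or, if valid 2nd-level cache, no 3rd-level cache";
    case 0x51: return "Instruction TLB: 4K, 2M or 4M pages, 128 entries";
    case 0x5B: return "Data TLB: 4K or 4M pages, 64 entries";
    case 0x66: return "1st-level data cache: 8K-bytes, 4-way, 64-byte line size";
    case 0x70: return "Trace cache: 12K-uops, 8-way set associative";
    case 0x7B: return "2nd-level cache: 512K-bytes, 8-way, 64-byte line size";
    default:   return "descriptor not in this excerpt";
    }
}
```

With the illustrative registers EAX = 0x665B5101, EBX = 0, ECX = 0, EDX = 0x007B7040, unpack_descriptors yields the six descriptors 0x51, 0x5B, 0x66, 0x40, 0x70, 0x7B, one for each line of cache and TLB information in the cpuid listing.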

Direct-Mapped Cache
• Simplest kind of cache
• Characterized by exactly one line per set (E = 1 lines per set)
[Figure: sets 0 through S-1, each holding a single line made of a valid bit, a tag, and a cache block]

Fully Associative Cache
• E = C/B lines in the one and only set (set 0)
• Difficult and expensive to build one that is both large and fast
[Figure: a single set 0 holding all lines, each made of a valid bit, a tag, and a cache block]

E-way Set Associative Caches
• Characterized by more than one line per set
[Figure: sets 0 through S-1 with E = 2 lines per set; each line holds a valid bit, a tag, and a cache block]

Accessing Set Associative Caches
• Set selection
  • Use the set index bits to determine the set of interest.
• An m-bit address (bit m-1 down to bit 0) consists of t tag bits, s set index bits, and b block offset bits
• s = log2(S), S = C / (B x E)
[Figure: an address whose set index bits are 00001 selects set 1 among sets 0 through S-1; each set holds two lines with a valid bit, a tag, and a cache block]
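A minimal sketch of set selection, assuming 32-bit addresses and power-of-two sizes. The type and function names are hypothetical. The usage note plugs in the 1st-level data cache from the cpuid listing (C = 8K bytes, E = 4, B = 64) and ignores the fact that the real cache is sectored.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t S;   /* number of sets, S = C / (B x E) */
    uint32_t s;   /* set index bits, s = log2(S)     */
    uint32_t b;   /* block offset bits, b = log2(B)  */
} geometry;

/* log2 of a power of two; anything else is a usage error. */
static uint32_t log2_exact(uint32_t x)
{
    assert(x != 0 && (x & (x - 1)) == 0);
    uint32_t n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* C = capacity in bytes, B = block size in bytes, E = lines per set. */
geometry make_geometry(uint32_t C, uint32_t B, uint32_t E)
{
    geometry g;
    g.S = C / (B * E);
    g.s = log2_exact(g.S);
    g.b = log2_exact(B);
    return g;
}

/* Low b bits: position of the byte inside its block. */
uint32_t block_offset(geometry g, uint32_t addr)
{
    return addr & ((1u << g.b) - 1u);
}

/* Middle s bits: which set to search. */
uint32_t set_index(geometry g, uint32_t addr)
{
    return (addr >> g.b) & (g.S - 1u);
}

/* Remaining high t bits: identifies the block within its set. */
uint32_t tag_bits(geometry g, uint32_t addr)
{
    return addr >> (g.b + g.s);
}
```

For C = 8192, B = 64, E = 4 this gives S = 32 sets, s = 5, and b = 6. The address 0x12345678 then splits into block offset 56, set index 25, and tag 0x2468A.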

Accessing Set Associative Caches
• Line matching and word selection
  • Must compare the tag in each valid line in the selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2), then cache hit, and the block offset selects the starting byte.
• b = log2(B)
[Figure: selected set i holds two valid lines with tags 1001 and 0110 and 8-byte blocks (bytes 0 to 7, words w0 to w3); the address has tag 0110, set index i, and block offset 100]
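The three conditions for a hit can be sketched as a search over the lines of the selected set. The struct layout, the constants, and the function name are assumptions made for illustration; the two-line set and the 8-byte block mirror the figure on this slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINES_PER_SET 2   /* E, as in the figure      */
#define BLOCK_BYTES   8   /* B, as in the figure      */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  block[BLOCK_BYTES];
} line;

/* Search the selected set. Returns a pointer to the starting byte on a
 * hit, or NULL on a miss. */
const uint8_t *lookup(const line set[LINES_PER_SET], uint32_t tag, uint32_t offset)
{
    assert(offset < BLOCK_BYTES);
    for (int i = 0; i < LINES_PER_SET; i++) {
        /* (1) valid bit set and (2) tag matches ... */
        if (set[i].valid && set[i].tag == tag)
            /* (3) ... so this is a hit; the offset selects the byte. */
            return &set[i].block[offset];
    }
    return NULL;
}
```

With lines tagged 1001 and 0110 both valid, a lookup for tag 0110 at offset 100 (byte 4) hits in the second line. Clearing that line's valid bit turns the same lookup into a miss even though the tag still matches.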

Observations
• A contiguous region of addresses forms a block
• Multiple address blocks share the same set index
• Blocks that map to the same set can be uniquely identified by the tag

Caching in a Memory Hierarchy
• Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.
• Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks.
• Data is copied between levels in block-sized transfer units.
[Figure: level k holds four blocks (8, 9, 14, 3); level k+1 is partitioned into blocks 0 to 15; blocks 4 and 10 are shown being copied between the levels; blocks with the same set index are marked]

General Caching Concepts
• Program needs object d, which is stored in some block b.
• Cache hit
  • Program finds b in the cache at level k. E.g., block 14.
• Cache miss
  • b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
  • If the level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”?
[Figure: level k holds blocks 4*, 9, 14, 3 and level k+1 holds blocks 0 to 15. Request 14 is a hit. Request 12 is a miss: block 12 is fetched from level k+1 and takes the place of block 4*. The replacement is labelled "Conflict miss".]
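The hit, miss, and eviction cases can be replayed with a tiny model of level k. This is a sketch under one assumption that the slide does not spell out: block i may only be placed in slot (i mod 4), which is consistent with the figure (4, 9, 14, 3 sit in slots 0 to 3, and 12 competes with 4 for slot 0). The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4   /* level k holds four blocks, as in the figure */

typedef struct {
    bool valid[NUM_SLOTS];
    int  block[NUM_SLOTS];
} level_k;

/* Request block b from level k.
 * Returns true on a hit. On a miss, b is fetched into its slot and
 * *victim receives the evicted block number, or -1 if the slot was empty. */
bool request(level_k *c, int b, int *victim)
{
    assert(b >= 0);
    int slot = b % NUM_SLOTS;   /* assumed placement policy */
    *victim = -1;
    if (c->valid[slot] && c->block[slot] == b)
        return true;            /* cache hit */
    if (c->valid[slot])
        *victim = c->block[slot];
    c->valid[slot] = true;      /* fetch b from level k+1 */
    c->block[slot] = b;
    return false;               /* cache miss */
}
```

Starting from 4, 9, 14, 3, a request for 14 hits, a request for 12 misses and evicts 4, and a following request for 4 misses again and evicts 12. The last two misses happen although three other slots still hold blocks that are not being used, which is what makes them conflict misses.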

Intel Processors Cache SRAM
http://www11.brinkster.com/bayup/dodownload.asp?f=37

Intel Processors Cache SRAM
ftp://download.intel.com/design/Pentium4/manuals/25366514.pdf (section 2.4)

Impact
• Associativity: more lines per set decrease the vulnerability to thrashing. Higher associativity requires more tag bits and control logic, which increases the hit time (1-2 cycles for L1).
• Block size: larger blocks exploit spatial locality (not temporal locality) to increase the hit rate. Larger blocks also increase the miss penalty (5-100 cycles for L1).
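One common way to weigh hit time against miss penalty is the average memory access time: hit time plus miss rate times miss penalty. The formula is standard but does not appear on the slide, and the numbers in the usage note are hypothetical values picked from inside the ranges quoted above.

```c
#include <assert.h>

/* Average access time in cycles.
 * hit_time and miss_penalty are in cycles; miss_rate is a fraction 0..1. */
double average_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 1-cycle hit time with a 5% miss rate and a 100-cycle penalty averages about 6 cycles. A design with a 2-cycle hit time that lowers the miss rate to 3% averages about 5 cycles, so in this made-up case the slower hit is worth it.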

Optimum Block size
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture15.pdf

Absolute Miss Rate
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf

Relative Miss Rate
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf

Matrix Multiplication Example
• Major Cache Effects to Consider
  • Total cache size
    • Exploit temporal locality and keep the working set small (e.g., by using blocking)
  • Block size
    • Exploit spatial locality
• Description:
  • Multiply N x N matrices
  • O(N^3) total operations
  • Accesses
    • N reads per source element
    • N values summed per destination
• Assumption
  • N is so large that the cache is not big enough to hold multiple rows

    /* ijk */
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

Variable sum is held in a register.

Matrix Multiplication (ijk)

    /* ijk */
    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
• Misses per Inner Loop Iteration:
    A      B      C
    0.25   1.0    0.0    = 1.25
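The per-matrix miss figures follow from the access pattern alone. The sketch below assumes 8-byte doubles and 32-byte cache blocks; the block size is not stated on the slide and is inferred from the 0.25 figure (one miss per four row-wise elements). It also relies on the earlier assumption that N is large, so every column-wise step lands in a different block. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { FIXED, ROW_WISE, COLUMN_WISE } access_pattern;

/* Expected misses per inner-loop iteration for one matrix. */
double misses_per_iteration(access_pattern p, size_t block_bytes, size_t elem_bytes)
{
    assert(elem_bytes > 0 && block_bytes >= elem_bytes);
    switch (p) {
    case FIXED:
        return 0.0;   /* same element every iteration, kept in a register */
    case ROW_WISE:
        /* stride 1: one miss per block, shared by block_bytes/elem_bytes elements */
        return (double)elem_bytes / (double)block_bytes;
    case COLUMN_WISE:
        return 1.0;   /* stride N: every access touches a new block */
    }
    return 0.0;
}
```

For ijk this gives 0.25 for A (row-wise), 1.0 for B (column-wise), and 0.0 for C (fixed), which adds up to the 1.25 misses per iteration shown on the slide.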

Matrix Multiplication (jik)

    /* jik */
    for (j=0; j<n; j++) {
      for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

Inner loop: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
• Misses per Inner Loop Iteration:
    A      B      C
    0.25   1.0    0.0    = 1.25

Matrix Multiplication (kij)

    /* kij */
    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
• Misses per Inner Loop Iteration:
    A      B      C
    0.0    0.25   0.25   = 0.5

Matrix Multiplication (ikj)

    /* ikj */
    for (i=0; i<n; i++) {
      for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

Inner loop: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
• Misses per Inner Loop Iteration:
    A      B      C
    0.0    0.25   0.25   = 0.5

Matrix Multiplication (jki)

    /* jki */
    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
• Misses per Inner Loop Iteration:
    A      B      C
    1.0    0.0    1.0    = 2.0

Matrix Multiplication (kji)

    /* kji */
    for (k=0; k<n; k++) {
      for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }

Inner loop: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
• Misses per Inner Loop Iteration:
    A      B      C
    1.0    0.0    1.0    = 2.0

Summary of Matrix Multiplication

• ijk (& jik):
  • 2 loads, 0 stores
  • misses/iter = 1.25

    for (i=0; i<n; i++) {
      for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
          sum += a[i][k] * b[k][j];
        c[i][j] = sum;
      }
    }

• kij (& ikj):
  • 2 loads, 1 store
  • misses/iter = 0.5

    for (k=0; k<n; k++) {
      for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
          c[i][j] += r * b[k][j];
      }
    }

• jki (& kji):
  • 2 loads, 1 store
  • misses/iter = 2.0

    for (j=0; j<n; j++) {
      for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
          c[i][j] += a[i][k] * r;
      }
    }
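The three loop orders compute the same product and differ only in the order of memory accesses. A self-contained sketch for checking that is given here, assuming a fixed hypothetical size N = 8 and a helper zero() that is not on the slides. The helper matters: the kij and jki versions accumulate into c, so c has to be cleared before they run, whereas ijk overwrites each element.

```c
#define N 8   /* hypothetical matrix dimension */

/* ijk: row of a times column of b, sum held in a register. */
void mm_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

/* Helper (not on the slides): clear c before accumulating into it. */
static void zero(double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
}

/* kij: a[i][k] is fixed, rows of b and c are walked with stride 1. */
void mm_kij(double a[N][N], double b[N][N], double c[N][N])
{
    zero(c);
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

/* jki: b[k][j] is fixed, columns of a and c are walked with stride N. */
void mm_jki(double a[N][N], double b[N][N], double c[N][N])
{
    zero(c);
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) {
            double r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}
```

Filling a and b with small integers keeps every partial sum exact in double precision, so the three results can be compared with ==. With general floating-point data the different summation orders may differ in the last bits.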

Matrix Multiply Performance
• Miss rates are helpful but not perfect predictors.
• Code scheduling matters, too.

Improving Temporal Locality by Blocking
• Example: Blocked matrix multiplication
  • “Block” here does not mean “cache block”.
  • Instead, it means a sub-block within the matrix.
  • Example: N = 8; sub-block size = 4

    | A11 A12 |     | B11 B12 |     | C11 C12 |
    | A21 A22 |  X  | B21 B22 |  =  | C21 C22 |

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

    C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
    C21 = A21B11 + A22B21    C22 = A21B12 + A22B22

Blocked Matrix Multiply (bijk)

    for (jj=0; jj<n; jj+=bsize) {
      for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
          c[i][j] = 0.0;
      for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
          for (j=jj; j < min(jj+bsize,n); j++) {
            sum = 0.0;
            for (k=kk; k < min(kk+bsize,n); k++) {
              sum += a[i][k] * b[k][j];
            }
            c[i][j] += sum;
          }
        }
      }
    }

Blocked Matrix Multiply Analysis
• Innermost loop pair multiplies
  • a 1 X bsize sliver of A by
  • a bsize X bsize block of B and accumulates into a 1 X bsize sliver of C
• Loop over i steps through n row slivers of A & C, using the same block of B
[Figure: A, B, and C with block offsets kk and jj and row i marked. A: row sliver accessed bsize times. B: block reused n times in succession. C: update successive elements of sliver.]
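A runnable version of bijk can be checked against the plain ijk loop. This is a sketch with a fixed hypothetical size N = 8 (the size used in the blocking example); min_int() stands in for the min() used on the slide, and bsize is passed as a parameter so that block sizes that do not divide N can be tried as well.

```c
#include <assert.h>

#define N 8   /* hypothetical matrix dimension */

static int min_int(int x, int y)
{
    return x < y ? x : y;
}

/* Blocked multiply: for each bsize x bsize block of b, step i through all
 * rows and accumulate a 1 x bsize sliver of a times the block into c. */
void mm_bijk(int bsize, double a[N][N], double b[N][N], double c[N][N])
{
    assert(bsize > 0);
    for (int jj = 0; jj < N; jj += bsize) {
        /* clear the column strip of c that this jj iteration fills */
        for (int i = 0; i < N; i++)
            for (int j = jj; j < min_int(jj + bsize, N); j++)
                c[i][j] = 0.0;
        for (int kk = 0; kk < N; kk += bsize)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + bsize, N); j++) {
                    double sum = 0.0;
                    for (int k = kk; k < min_int(kk + bsize, N); k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] += sum;
                }
    }
}

/* Unblocked reference for comparison. */
void mm_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
```

Calling mm_bijk(4, a, b, c) reproduces the N = 8, sub-block size 4 case. A bsize of 3 exercises the min_int() bounds, because the last block in each direction is then only two elements wide.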

Matrix Multiply Performance
• Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions
• Performance of the blocked versions is relatively insensitive to array size.

Concluding Observations
• Programmer can optimize for cache performance
  • How data structures are organized
  • How data are accessed
    • Nested loop structure
    • Blocking is a general technique
• All systems favor “cache friendly code”
  • Getting absolute optimum performance is very platform specific
    • Cache sizes, line sizes, associativities, etc.
  • Can get most of the advantage with generic code
    • Keep working set reasonably small (temporal locality)
    • Use small strides (spatial locality)
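"Use small strides" in its simplest form: the two functions in this sketch add up the same array, but the first walks memory with stride 1 and the second with stride COLS. The array shape and the function names are hypothetical, and the 4 x 4 size is only there to keep the example short; a difference in miss rate shows up once the array is much larger than the cache.

```c
#define ROWS 4   /* hypothetical array shape */
#define COLS 4

/* Stride 1: the inner loop visits consecutive addresses, so every
 * fetched cache block is fully used before the next one is needed. */
long sum_stride_1(int a[ROWS][COLS])
{
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}

/* Stride COLS: the inner loop jumps a whole row ahead on each step,
 * so consecutive accesses tend to fall in different cache blocks. */
long sum_stride_cols(int a[ROWS][COLS])
{
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions return the same value; only the order of the accesses changes.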

Assignment
• Homework problem 6.28: What percentage of writes will miss in the cache for the blank screen function?
