John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

CS252 Graduate Computer Architecture, Lecture 14: 3+1 Cs of Caching and many ways Cache Optimizations
http://www.eecs.berkeley.edu/~kubitron/cs252

Review: Cache performance
• Miss-oriented Approach to Memory Access:
• Separating out Memory component entirely
• AMAT = Average Memory Access Time
cs252-S09, Lecture 15
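The equations on this slide did not survive extraction. As a reminder, and assuming the slide showed the usual textbook forms (IC = instruction count), the two approaches and the AMAT definition are commonly written as:

```latex
\begin{align*}
% Miss-oriented approach
\text{CPU time} &= \text{IC} \times \left( \text{CPI}_{\text{Execution}}
  + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty} \right)
  \times \text{Cycle time} \\
% Separating out the memory component entirely
\text{CPU time} &= \text{IC} \times \left( \frac{\text{ALU ops}}{\text{Instruction}} \times \text{CPI}_{\text{ALU ops}}
  + \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{AMAT} \right)
  \times \text{Cycle time} \\
\text{AMAT} &= \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\end{align*}
```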

12 Advanced Cache Optimizations (Con’t)
• Reducing hit time
  • Small and simple caches
  • Way prediction
  • Trace caches
• Increasing cache bandwidth
  • Pipelined caches
  • Multibanked caches
  • Nonblocking caches
• Reducing Miss Penalty
  • Critical word first
  • Merging write buffers
• Reducing Miss Rate
  • Victim Cache
  • Hardware prefetching
  • Compiler prefetching
  • Compiler Optimizations

3. Fast (Instruction Cache) Hit times via Trace Cache
Key Idea: Pack multiple non-contiguous basic blocks into one contiguous trace cache line
[Figure: instruction stream with branches (BR), shown before and after packing into a trace cache line]
• Single fetch brings in multiple basic blocks
• Trace cache indexed by start address and next n branch predictions

3. Fast Hit times via Trace Cache (Pentium 4 only; and last time?)
• How to find more instruction level parallelism? How to avoid translation from x86 to micro-ops?
• Trace cache in Pentium 4
• Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
• Built-in branch predictor
• Cache the micro-ops vs. x86 instructions
• Decode/translate from x86 to micro-ops on trace cache miss
+ Better utilizes long blocks (don’t exit in middle of block, don’t enter at label in middle of block)
- Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of word size
- Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes

4: Increasing Cache Bandwidth by Pipelining
• Pipeline cache access to maintain bandwidth, at the cost of higher latency
• Instruction cache access pipeline stages:
  • 1: Pentium
  • 2: Pentium Pro through Pentium III
  • 4: Pentium 4
• ⇒ greater penalty on mispredicted branches
• ⇒ more clock cycles between the issue of the load and the use of the data

5. Increasing Cache Bandwidth: Non-Blocking Caches
• A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
  • requires F/E (full/empty) bits on registers or out-of-order execution
  • requires multi-bank memories
• “hit under miss” reduces the effective miss penalty by working during a miss vs. ignoring CPU requests
• “hit under multiple miss” or “miss under miss” may further lower the effective miss penalty by overlapping multiple misses
  • Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
  • Requires multiple memory banks (otherwise cannot support)
  • Pentium Pro allows 4 outstanding memory misses

Value of Hit Under Miss for SPEC (old data)
[Figure: “Hit under n Misses” bar chart for the Integer and Floating Point programs; series 0->1, 1->2, 2->64, Base]
• FP programs on average: Miss Penalty = 0.68 -> 0.52 -> 0.34 -> 0.26
• Int programs on average: Miss Penalty = 0.24 -> 0.20 -> 0.19 -> 0.19
• 8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss, SPEC 92

6: Increasing Cache Bandwidth via Multiple Banks
• Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
  • E.g., T1 (“Niagara”) L2 has 4 banks
• Banking works best when accesses naturally spread themselves across banks ⇒ mapping of addresses to banks affects behavior of memory system
• Simple mapping that works well is “sequential interleaving”
  • Spread block addresses sequentially across banks
  • E.g., if there are 4 banks, Bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; …
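Sequential interleaving can be sketched in a few lines of C. The block size and bank count below are illustrative assumptions, as are the function names; they are not taken from any particular machine.

```c
#include <stdint.h>

/* Hypothetical parameters for illustration: 64-byte blocks, 4 banks. */
enum { BLOCK_BYTES = 64, NUM_BANKS = 4 };

/* Block address: the byte address with the block offset stripped off. */
uint64_t block_address(uint64_t byte_addr)
{
    return byte_addr / BLOCK_BYTES;
}

/* Sequential interleaving: consecutive blocks go to consecutive banks. */
unsigned bank_of(uint64_t byte_addr)
{
    return (unsigned)(block_address(byte_addr) % NUM_BANKS);
}

/* Position of the block inside its bank. */
uint64_t index_within_bank(uint64_t byte_addr)
{
    return block_address(byte_addr) / NUM_BANKS;
}
```

With these parameters, a unit-stride sweep through memory touches banks 0, 1, 2, 3, 0, 1, … so consecutive block accesses never collide in the same bank.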

7. Reduce Miss Penalty: Early Restart and Critical Word First
• Don’t wait for full block before restarting CPU
• Early restart: As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  • Spatial locality ⇒ the CPU tends to want the next sequential word, so the size of the benefit of early restart alone is not clear
• Critical Word First: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
  • Long blocks are more popular today ⇒ Critical Word First is widely used
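As a small illustration of critical word first, the sketch below lists the order in which the words of a block would be delivered. The wrap-around order for the remaining words is an assumption made here for illustration; the slide only says that the missed word comes first. The function name is hypothetical.

```c
#include <stddef.h>

/*
 * Fill out[0..words_per_block-1] with the order in which the words of a
 * block are returned when word `missed_word` caused the miss.
 * Assumption: the remaining words follow in wrap-around order.
 */
void critical_word_first_order(size_t words_per_block, size_t missed_word,
                               size_t *out)
{
    for (size_t k = 0; k < words_per_block; k++)
        out[k] = (missed_word + k) % words_per_block;
}
```

For an 8-word block with a miss on word 5, the delivery order under this assumption is 5, 6, 7, 0, 1, 2, 3, 4, so the CPU can restart after the first transfer instead of the sixth.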

8. Merging Write Buffer to Reduce Miss Penalty
• Write buffer allows the processor to continue while waiting to write to memory
• If the buffer contains modified blocks, the address of new data can be checked against the addresses of the valid write buffer entries
• If there is a match, the new data are combined with that entry
• For a write-through cache, merging increases the effective block size of writes to sequential words or bytes, since multiword writes use memory more efficiently
• The Sun T1 (Niagara) processor, among many others, uses write merging
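The address check and merge can be sketched as follows. This is a minimal model under assumed sizes (4 entries of 4 words each); the structure and function names are hypothetical, and draining entries to memory is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes for illustration. */
enum { WB_ENTRIES = 4, WORDS_PER_ENTRY = 4 };

struct wb_entry {
    bool     valid;
    uint64_t block_addr;              /* word address / WORDS_PER_ENTRY */
    uint8_t  word_valid;              /* bit i set: data[i] holds a pending write */
    uint32_t data[WORDS_PER_ENTRY];
};

struct write_buffer {
    struct wb_entry e[WB_ENTRIES];
};

void wb_init(struct write_buffer *wb)
{
    memset(wb, 0, sizeof *wb);
}

/* Returns true if the write was buffered, false if the buffer is full
 * (in which case the processor would have to stall). */
bool wb_write(struct write_buffer *wb, uint64_t word_addr, uint32_t value)
{
    uint64_t block = word_addr / WORDS_PER_ENTRY;
    unsigned word = (unsigned)(word_addr % WORDS_PER_ENTRY);
    struct wb_entry *free_slot = NULL;

    for (int i = 0; i < WB_ENTRIES; i++) {
        struct wb_entry *en = &wb->e[i];
        if (en->valid && en->block_addr == block) {
            /* Address matches a valid entry: merge into it. */
            en->data[word] = value;
            en->word_valid |= (uint8_t)(1u << word);
            return true;
        }
        if (!en->valid && free_slot == NULL)
            free_slot = en;
    }
    if (free_slot == NULL)
        return false;

    /* No match: allocate a fresh entry. */
    free_slot->valid = true;
    free_slot->block_addr = block;
    free_slot->word_valid = (uint8_t)(1u << word);
    free_slot->data[word] = value;
    return true;
}

/* Number of entries currently occupied. */
int wb_entries_used(const struct write_buffer *wb)
{
    int used = 0;
    for (int i = 0; i < WB_ENTRIES; i++)
        used += wb->e[i].valid ? 1 : 0;
    return used;
}
```

Four writes to sequential words of one block occupy a single entry here, whereas without merging they would fill the whole buffer.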

9. Reducing Misses: a “Victim Cache”
• How to combine fast hit time of direct mapped yet still avoid conflict misses?
• Add buffer to place data discarded from cache
• Jouppi [1990]: 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache
• Used in Alpha, HP machines
[Figure: direct-mapped cache (TAGS, DATA) backed by four victim entries, each with a tag and comparator and one cache line of data; path to next lower level in hierarchy]

10. Reducing Misses by Hardware Prefetching of Instructions & Data
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Instruction Prefetching
  • Typically, CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
  • Requested block is placed in instruction cache when it returns, and prefetched block is placed into instruction stream buffer
• Data Prefetching
  • Pentium 4 can prefetch data into L2 cache from up to 8 streams from 8 different 4 KB pages
  • Prefetching is invoked after 2 successive L2 cache misses to a page, if the distance between those cache blocks is < 256 bytes

Issues in Prefetching
• Usefulness – should produce hits
• Timeliness – not late and not too early
• Cache and bandwidth pollution
[Figure: CPU with RF, L1 Instruction and L1 Data caches, and Unified L2 Cache; prefetched data path]

Hardware Instruction Prefetching
Instruction prefetch in Alpha AXP 21064
• Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
• Requested block placed in cache, and next block in instruction stream buffer
• If miss in cache but hit in stream buffer, move stream buffer block into cache and prefetch next block (i+2)
[Figure: CPU with RF and L1 Instruction cache; Stream Buffer between L1 and Unified L2 Cache; requested block goes to L1, prefetched instruction block goes to the Stream Buffer]

Hardware Data Prefetching
• Prefetch-on-miss:
  • Prefetch b + 1 upon miss on b
• One Block Lookahead (OBL) scheme
  • Initiate prefetch for block b + 1 when block b is accessed
  • Why is this different from doubling block size?
  • Can extend to N block lookahead
• Strided prefetch
  • If observe sequence of accesses to block b, b+N, b+2N, then prefetch b+3N etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of current access
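The strided rule (see b, b+N, b+2N, then prefetch b+3N) can be sketched as a tiny single-stream detector. This is an illustrative model with hypothetical names, not a description of how the Power 5 or any other prefetcher is built; real hardware tracks several streams and prefetches further ahead.

```c
#include <stdbool.h>
#include <stdint.h>

/* State for one access stream (hypothetical structure). */
struct stride_detector {
    int64_t last_block;   /* most recently accessed block number */
    int64_t last_stride;  /* stride between the previous two accesses */
    int     accesses;     /* accesses seen so far, saturating at 2 */
};

void sd_init(struct stride_detector *d)
{
    d->last_block = 0;
    d->last_stride = 0;
    d->accesses = 0;
}

/*
 * Record an access to `block`. Returns true and sets *prefetch_block when
 * the last three accesses were b, b+N, b+2N with N != 0, in which case the
 * block to prefetch is b+3N.
 */
bool sd_access(struct stride_detector *d, int64_t block, int64_t *prefetch_block)
{
    bool predict = false;

    if (d->accesses >= 1) {
        int64_t stride = block - d->last_block;
        if (d->accesses >= 2 && stride != 0 && stride == d->last_stride) {
            *prefetch_block = block + stride;
            predict = true;
        }
        d->last_stride = stride;
    }
    d->last_block = block;
    if (d->accesses < 2)
        d->accesses++;
    return predict;
}
```

Feeding it blocks 10, 14, 18 yields a prefetch request for block 22 on the third access; a following access to block 19 breaks the pattern and produces no request.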

Administrivia
• Exam: This Wednesday. Location: 310 Soda. TIME: 6:00-9:00pm
• Material: Everything up to next Monday, including papers (especially ones discussed in detail in class)
• Closed Book, but 1 page hand-written notes (both sides)
• Meet at LaVal’s afterwards for Pizza and Beverages
• We have been reading Chapter 5
  • You should take a look, since it might show up on the test

11. Reducing Misses by Software Prefetching Data
• Data Prefetch
  • Load data into register (HP PA-RISC loads)
  • Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v. 9)
  • Special prefetching instructions cannot cause faults; a form of speculative execution
• Issuing Prefetch Instructions takes time
  • Is cost of prefetch issues < savings in reduced misses?
  • Wider superscalar issue reduces the difficulty of finding issue bandwidth for prefetches
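A cache prefetch can be sketched in C with the `__builtin_prefetch` builtin provided by GCC and Clang, which compiles to the target's non-faulting prefetch instruction where one exists. The prefetch distance below is a hypothetical tuning value, not a recommendation; the right distance depends on memory latency and the work per iteration.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
enum { PREFETCH_DISTANCE = 16 };

/* Sum an array while prefetching elements that will be needed soon. */
double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = prefetch for read, 3 = keep in all cache levels. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

Each prefetch occupies an issue slot, which is the cost the slide asks about: for a simple unit-stride loop like this one, a hardware prefetcher may already cover the misses, so the extra instructions only pay off on access patterns the hardware cannot predict.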

12. Reducing Misses by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
• Instructions
  • Reorder procedures in memory so as to reduce conflict misses
  • Profiling to look at conflicts (using tools they developed)
• Data
  • Merging Arrays: improve spatial locality by single array of compound elements vs. 2 arrays
  • Loop Interchange: change nesting of loops to access data in order stored in memory
  • Loop Fusion: Combine 2 independent loops that have same looping and some variables overlap
  • Blocking: Improve temporal locality by accessing “blocks” of data repeatedly vs. going down whole columns or rows

Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];
Reduces conflicts between val & key; improves spatial locality

Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];
/* After */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
Sequential accesses instead of striding through memory every 100 words; improved spatial locality

Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
2 misses per access to a & c vs. one miss per access; improves temporal locality, since a[i][j] and c[i][j] are reused while still in the cache

Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j];
    }
    x[i][j] = r;
  }
• Two Inner Loops:
  • Read all NxN elements of z[]
  • Read N elements of 1 row of y[] repeatedly
  • Write N elements of 1 row of x[]
• Capacity Misses a function of N & Cache Size:
  • Worst case 2N^3 + N^2 words accessed (assuming no conflict; otherwise …)
• Idea: compute on BxB submatrix that fits

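Blocking Example
/* After */
/* x[][] must be zeroed beforehand: each (jj, kk) block adds a partial sum into it */
/* The loop bounds use jj+B and kk+B (not jj+B-1, kk+B-1) because the comparison is <; otherwise the last row/column of each block is skipped */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B,N); k = k+1) {
          r = r + y[i][k]*z[k][j];
        }
        x[i][j] = x[i][j] + r;
      }
• B called Blocking Factor
• Capacity Misses from 2N^3 + N^2 to 2N^3/B + N^2
• Conflict Misses Too?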
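A self-contained version of the blocked multiply, checked against the unblocked loop nest, might look like the sketch below. The sizes are hypothetical and chosen so that N is not a multiple of B, which exercises the `min` bounds; the inner bounds are written as `jj + B` and `kk + B` so that every element of each block is covered.

```c
/* Hypothetical sizes; N is deliberately not a multiple of B. */
enum { N = 6, B = 4 };

int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Reference: the unblocked triple loop. */
void matmul_naive(double x[N][N], double y[N][N], double z[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Blocked version: works on B x B tiles of z and B-wide strips of x and y. */
void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    /* x accumulates partial sums across kk blocks, so it must start at zero. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0.0;

    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Both functions compute the same product; the blocked one only changes the order of the accesses so that a B x B tile of z is reused while it is still resident in the cache.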

Reducing Conflict Misses by Blocking
• Conflict misses in caches that are not fully associative (FA) vs. blocking size
• Lam et al [1991]: a blocking factor of 24 had a fifth the misses of a blocking factor of 48, even though both fit in the cache

Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

Impact of Hierarchy on Algorithms
• Today CPU time is a function of (ops, cache misses)
• What does this mean to Compilers, Data structures, Algorithms?
  • Quicksort: fastest comparison based sorting algorithm when keys fit in memory
  • Radix sort: also called “linear time” sort. For keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
• “The Influence of Caches on the Performance of Sorting” by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January, 1997, 370-379.
  • For Alphastation 250, 32 byte blocks, direct mapped L2 2MB cache, 8 byte keys, job sizes from 4000 to 4000000 keys

Quicksort vs. Radix: Instructions
[Figure: instructions vs. job size in keys]

Quicksort vs. Radix Inst & Time
[Figure: instructions (Insts) and time (Time) vs. job size in keys]

Quicksort vs. Radix: Cache misses
[Figure: cache misses vs. job size in keys]

Experimental Study (Membench)
• Microbenchmark for memory system performance
• for array A of length L from 4KB to 8MB by 2x
    for stride s from 4 Bytes (1 word) to L/2 by 2x
      time the following loop (repeat many times and average)
        for i from 0 to L by s
          load A[i] from memory (4 Bytes)
• Each timed loop is 1 experiment
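One experiment of this microbenchmark could be written roughly as below. This is a sketch, not the actual Membench source: the function name is hypothetical, the stride is expressed in array elements rather than bytes, and `clock()` is used for timing even though a real measurement would want a finer-grained timer and many more repetitions.

```c
#include <stddef.h>
#include <time.h>

/*
 * Walk `a` (len elements) with the given stride (in elements, > 0),
 * `repeats` times. Returns the average time per load in seconds and
 * stores the number of loads performed in *loads_out.
 * The array is read through a volatile pointer so the compiler
 * cannot optimize the loads away.
 */
double membench_once(const volatile int *a, size_t len, size_t stride,
                     int repeats, size_t *loads_out)
{
    size_t loads = 0;
    unsigned sink = 0;
    clock_t start = clock();

    for (int r = 0; r < repeats; r++) {
        for (size_t i = 0; i < len; i += stride) {
            sink += (unsigned)a[i];   /* the timed load */
            loads++;
        }
    }

    clock_t end = clock();
    (void)sink;
    *loads_out = loads;
    if (loads == 0)
        return 0.0;
    return (double)(end - start) / CLOCKS_PER_SEC / (double)loads;
}
```

The full study would call this for every array length from 4 KB to 8 MB and every stride up to half the length, doubling each time, and plot the average cost per load against the stride.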

Membench: What to Expect
[Figure: average cost per access vs. s = stride; curve for total size < L1 stays at the cache hit time, curve for size > L1 rises toward the memory time]
• Consider the average cost per load
  • Plot one line for each array length, time vs. stride
  • Small stride is best: if cache line holds 4 words, at most ¼ miss
  • If array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough runs)
  • Picture assumes only one level of cache
  • Values have gotten more difficult to measure on modern procs

Memory Hierarchy on a Sun Ultra-2i
[Figure: Membench results for Sun Ultra-2i, 333 MHz; one curve per array length]
• L1: 16 KB, 16 B line, 2 cycles (6 ns)
• L2: 2 MB, 64 byte line, 12 cycles (36 ns)
• Mem: 396 ns (132 cycles)
• 8 K pages, 32 TLB entries
See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details

Memory Hierarchy on a Power3
[Figure: Membench results for Power3, 375 MHz; one curve per array size]
• L1: 32 KB, 128B line, .5-2 cycles
• L2: 8 MB, 128 B line, 9 cycles
• Mem: 396 ns (132 cycles)

Compiler Optimization vs. Memory Hierarchy Search
• Compiler tries to figure out memory hierarchy optimizations
• New approach: “Auto-tuners” first run variations of the program on the computer to find the best combinations of optimizations (blocking, padding, …) and algorithms, then produce C code to be compiled for that computer
• Each “Auto-tuner” is targeted to a numerical method
  • E.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (Sparse linear algebra), Spiral (DSP), FFT-W

Sparse Matrix – Search for Blocking for finite element problem [Im, Yelick, Vuduc, 2005]
[Figure: Mflop/s for each block size; Best: 4x2; Reference marked]

Best Sparse Blocking for 8 Computers
• All possible column block sizes selected for 8 computers; How could compiler know?
[Figure: row block size (r) vs. column block size (c), each in {1, 2, 4, 8}, marking the best block size for each computer]


Main Memory Background
• Performance of Main Memory:
  • Latency: Cache Miss Penalty
    • Access Time: time between a request and the arrival of the word
    • Cycle Time: time between requests
  • Bandwidth: I/O & Large Block Miss Penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  • Dynamic since needs to be refreshed periodically (8 ms, 1% time)
  • Addresses divided into 2 halves (Memory as a 2D matrix):
    • RAS or Row Address Strobe
    • CAS or Column Address Strobe
• Cache uses SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor)
  • Size: DRAM/SRAM 4-8; Cost/Cycle time: SRAM/DRAM 8-16

Core Memories (1950s & 60s)
• Core Memory stored data as magnetization in iron rings
  • Iron “cores” woven into a 2-dimensional mesh of wires by hand (25 billion a year at peak production)
  • Invented by Forrester in late 40s/early 50s at MIT for Whirlwind
  • Origin of the term “Dump Core”
  • Rumor that IBM consulted Life Saver company
• Robust, non-volatile storage
  • Used on space shuttle computers until recently
• Core access time ~ 1 µs
• See: http://www.columbia.edu/acis/history/core.html
[Figure: DEC PDP-8/E Board, 4K words x 12 bits (1968)]
[Figure: First magnetic core memory, from IBM 405 Alphabetical Accounting Machine]

Semiconductor Memory, DRAM
• Semiconductor memory began to be competitive in early 1970s
  • Intel formed to exploit market for semiconductor memory
• First commercial DRAM was Intel 1103
  • 1Kbit of storage on single chip
  • charge on a capacitor used to hold value
• Semiconductor memory quickly replaced core in 1970s
• Today (March 2009), 4GB DRAM < $40
  • People can easily afford to fill 32-bit address space with DRAM (4GB)
  • New Vista systems often shipping with 6GB

DRAM Architecture
[Figure: N-bit row address into the Row Address Decoder, which drives the word lines (Row 1 … Row 2^N); bit lines (Col. 1 … Col. 2^M) feed the Column Decoder & Sense Amplifiers, selected by the M-bit column address; N+M address bits in total; Data (D) out; one memory cell (one bit) at each word line / bit line crossing]
• Bits stored in 2-dimensional arrays on chip
• Modern chips have around 4 logical banks on each chip
  • each logical bank physically implemented as many smaller arrays

Review: 1-T Memory Cell (DRAM)
• Write:
  • 1. Drive bit line
  • 2. Select row
• Read:
  • 1. Precharge bit line to Vdd/2
  • 2. Select row
  • 3. Cell and bit line share charges
    • Very small voltage changes on the bit line
  • 4. Sense (fancy sense amp)
    • Can detect changes of ~1 million electrons
  • 5. Write: restore the value
• Refresh
  • 1. Just do a dummy read to every cell.
[Figure: one-transistor cell with row select line and bit line]

DRAM Capacitors: more capacitance in a small area
• Trench capacitors:
  • Logic ABOVE capacitor
  • Gain in surface area of capacitor
  • Better Scaling properties
  • Better Planarization
• Stacked capacitors
  • Logic BELOW capacitor
  • Gain in surface area of capacitor
  • 2-dim cross-section quite small

DRAM Operation: Three Steps
• Precharge
  • charges bit lines to known value, required before next row access
• Row access (RAS)
  • decode row address, enable addressed row (often multiple Kb in row)
  • bitlines share charge with storage cell
  • small change in voltage detected by sense amplifiers, which latch whole row of bits
  • sense amplifiers drive bitlines full rail to recharge storage cells
• Column access (CAS)
  • decode column address to select small number of sense amplifier latches (4, 8, 16, or 32 bits depending on DRAM package)
  • on read, send latched bits out to chip pins
  • on write, change sense amplifier latches, which then charge storage cells to required value
  • can perform multiple column accesses on same row without another row access (burst mode)

DRAM Read Timing (Example)
[Figure: 256K x 8 DRAM with control inputs RAS_L, CAS_L, WE_L, OE_L, a 9-bit address bus A, and an 8-bit data bus D]
• Every DRAM access begins at:
  • The assertion of the RAS_L
  • 2 ways to read: early or late v. CAS
[Figure: timing diagram over one DRAM Read Cycle Time showing RAS_L, CAS_L, WE_L, OE_L; A carries Row Address, Col Address, Junk; D goes High Z, Junk, Data Out, High Z; Read Access Time and Output Enable Delay are marked]
• Early Read Cycle: OE_L asserted before CAS_L
• Late Read Cycle: OE_L asserted after CAS_L

Main Memory Performance
[Figure: timeline showing Access Time as a portion of the longer Cycle Time]
• DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
  • 2:1; why?
• DRAM (Read/Write) Cycle Time:
  • How frequently can you initiate an access?
  • Analogy: A little kid can only ask his father for money on Saturday
• DRAM (Read/Write) Access Time:
  • How quickly will you get what you want once you initiate an access?
  • Analogy: As soon as he asks, his father will give him the money
• DRAM Bandwidth Limitation analogy:
  • What happens if he runs out of money on Wednesday?

Increasing Bandwidth - Interleaving
[Figure: access pattern without interleaving: CPU and one Memory; start access for D1, wait until D1 is available, then start access for D2]
[Figure: access pattern with 4-way interleaving: CPU and Memory Banks 0-3; access Bank 0, Bank 1, Bank 2, Bank 3 in turn, after which Bank 0 can be accessed again]

Main Memory Performance
• Simple:
  • CPU, Cache, Bus, Memory same width (32 bits)
• Wide:
  • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
• Interleaved:
  • CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved

Main Memory Performance
• Timing model
  • 1 to send address,
  • 4 for access time, 10 cycle time, 1 to send data
  • Cache Block is 4 words
• Simple M.P. = 4 x (1+10+1) = 48
• Wide M.P. = 1 + 10 + 1 = 12
• Interleaved M.P. = 1+10+1 + 3 = 15
• Word-interleaved address layout:
  Bank 0: addresses 0, 4, 8, 12
  Bank 1: addresses 1, 5, 9, 13
  Bank 2: addresses 2, 6, 10, 14
  Bank 3: addresses 3, 7, 11, 15
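The three miss penalties can be reproduced with a few lines of C. The structure and function names are hypothetical; the formulas are the ones on the slide, which charge the 10-cycle cycle time per access, and the interleaved case assumes there are at least as many banks as words in the block.

```c
/* Timing parameters in clock cycles (values from the slide's model). */
struct mem_timing {
    int send_addr;   /* cycles to send the address */
    int cycle;       /* DRAM cycle time */
    int send_data;   /* cycles to send one word back */
};

/* Simple: one word wide everywhere, so every word pays the full cost. */
int miss_penalty_simple(struct mem_timing t, int words_per_block)
{
    return words_per_block * (t.send_addr + t.cycle + t.send_data);
}

/* Wide: the whole block is fetched and transferred in one access. */
int miss_penalty_wide(struct mem_timing t)
{
    return t.send_addr + t.cycle + t.send_data;
}

/* Interleaved: banks are accessed in parallel, then the words are
 * returned one per cycle over the one-word bus. */
int miss_penalty_interleaved(struct mem_timing t, int words_per_block)
{
    return t.send_addr + t.cycle + words_per_block * t.send_data;
}
```

With `{1, 10, 1}` and a 4-word block these return 48, 12, and 15 cycles, matching the slide: interleaving gets close to the wide organization while keeping a one-word bus.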
