Improving Data Cache Performance Under a Cache Miss

$ Improving Data Cache Performance Under a Cache Miss J. Dundas and T. Mudge Supercomputing ‘97 Laura J. Spencer, ljspence@cs.wisc.edu Jim Gast, jgast@cs.wisc.edu CS703, Spring 2000 UW/Madison

Automatic I/O Hint Generation through Speculative Execution F. Chang and G. Gibson OSDI ‘99

$ Similar algorithms in different worlds • The Run Ahead paper tries to hide cache miss latency • The I/O Hinting paper tries to hide disk read latency

$ Basic Concept: Prefetching RunAhead via Shadow Thread • Prefetch • Try to get long-latency events started as soon as possible • Shadow Thread • Start a copy of the program to run-ahead to find the next few long-latency events • Let the RunAhead Speculate • Don’t let your shadow change any of your data • Every time your shadow goes off-track put it back on-track

$ Shadow Code • Prefetch • Far enough ahead to hide latency • Perhaps incorrectly • Runs speculatively during stall • Don’t wait for the data • Contents might be invalid • Keep shadow values privately • Suppress exceptions • Stay ahead until end of leash • Low confidence of being on-track • Outrunning resources • Example: b ← c + d; c ← a[b]; f ← e / b; if (d == 1) then . . .

$ Talk Roadmap • Show how to RunAhead • Back up the registers, speculate under stall • Copy-on-write the RAM, speculate when stalled • How far to speculate? • Fill DMAQ with prefetches • A constant number of hints (if on-track) • Experimental Results (Dundas; Chang)

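$ Simple Array Example • Original loop: for (int i = 0; i < size; i++) { r[i] = a[i] + b[i]; } • Run-ahead view of the same loop: for (int i = 0; i < size; i++) { _r[i] = prefetch(a[i]) + prefetch(b[i]); } [Figure: timeline. The main thread executes LD a[0], LD b[0], LD a[1], LD b[1] and sleeps on a cache miss; during the sleep, run-ahead issues PreFetch(b[0]), PreFetch(a[1]), PreFetch(b[1]), PreFetch(a[2]); then the main thread executes again] * Run-ahead only needs execution logic (which would otherwise be wasted during the stall)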
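
The run-ahead idea in this loop can be loosely imitated in software. The following is a minimal sketch, assuming GCC or Clang (for the `__builtin_prefetch` builtin); the function name `add_arrays_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical and come from neither paper. Hardware run-ahead discovers the addresses by itself, whereas here the programmer supplies them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning knob: how many elements ahead to request. */
#define PREFETCH_DISTANCE 16

/* r[i] = a[i] + b[i], asking the cache for later elements early.
 * __builtin_prefetch is only a hint: it does not change the values
 * computed, it only tries to start the memory access sooner. */
void add_arrays_prefetch(int *r, const int *a, const int *b, size_t size)
{
    assert(r != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < size; i++) {
        if (i + PREFETCH_DISTANCE < size) {
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE]);
        }
        r[i] = a[i] + b[i];
    }
}
```

The prefetch distance would normally be tuned so that a line arrives just before it is used; 16 is an arbitrary placeholder.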

$ Long-latency Events • Miss in L2 cache costing 100-200 cycles • Whenever L1 cache misses, start shadow • Decide which values will be needed next and place them into the Direct Memory Access Queue (DMAQ) as prefetches • The longer the miss, the more chance this thread has of finding useful things to prefetch [Figure: DMAQ with slots [1] to [8], the first three holding prefetch value 1, prefetch value 2, prefetch value 3]

$ Backup Register File • Checkpoint current state to backup register file • Thread will execute. When you don’t know something, mark a state bit invalid (INV) Register File latch Backup Register File Save Address of Faulting Instruction Register file and cache also maintain an invalid bit INV = read after write hazard
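
The checkpoint step can be sketched in software terms. This is a toy model under stated assumptions, not the hardware design from the paper: the types `CpuState` and `Checkpoint`, the 32-register size, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 32  /* assumed register count, for illustration only */

typedef struct {
    long reg[NUM_REGS];
    unsigned long pc;
} CpuState;

typedef struct {
    CpuState backup;  /* backup register file plus faulting instruction address */
    bool active;      /* true while running ahead */
} Checkpoint;

/* On the miss: save the architectural state and remember where to resume. */
void runahead_enter(Checkpoint *cp, const CpuState *cpu, unsigned long faulting_pc)
{
    assert(!cp->active);
    cp->backup = *cpu;
    cp->backup.pc = faulting_pc;
    cp->active = true;
}

/* When the miss has been serviced: discard speculative register state
 * and resume at the faulting instruction. */
void runahead_exit(Checkpoint *cp, CpuState *cpu)
{
    assert(cp->active);
    *cpu = cp->backup;
    cp->active = false;
}
```

Whatever the run-ahead phase wrote into the registers is lost on exit; only the prefetches it triggered remain useful.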

$ What is invalid? • Register-to-register op: mark dest reg INV if any source reg is INV • Load op: marks dest reg INV if • address reg is INV • load causes miss • prev store marked cache INV • Store op: marks cache INV if address is known and no miss would occur *If store does not mark cache INV, LD may use INV data
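
The register and load rules above can be written down as a small sketch. The `ShadowRegs` type, the function names, and the boolean parameters standing in for cache state (`would_miss`, `line_inv`) are hypothetical simplifications; the store rule is left out because it needs a cache model.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS 32  /* assumed register count, for illustration only */

/* Shadow register file: a value plus an INV bit per register. */
typedef struct {
    long value[NUM_REGS];
    bool inv[NUM_REGS];
} ShadowRegs;

/* Register-to-register op (here: add). The destination is INV if any
 * source register is INV. */
void shadow_add(ShadowRegs *rf, int dst, int src1, int src2)
{
    assert(dst >= 0 && dst < NUM_REGS);
    assert(src1 >= 0 && src1 < NUM_REGS && src2 >= 0 && src2 < NUM_REGS);
    bool inv = rf->inv[src1] || rf->inv[src2];
    rf->value[dst] = inv ? 0 : rf->value[src1] + rf->value[src2];
    rf->inv[dst] = inv;
}

/* Load op. The destination is INV if the address register is INV, the
 * load would miss, or an earlier run-ahead store marked the line INV. */
void shadow_load(ShadowRegs *rf, int dst, int addr_reg,
                 bool would_miss, bool line_inv, long cached_value)
{
    assert(dst >= 0 && dst < NUM_REGS);
    assert(addr_reg >= 0 && addr_reg < NUM_REGS);
    bool inv = rf->inv[addr_reg] || would_miss || line_inv;
    rf->value[dst] = inv ? 0 : cached_value;
    rf->inv[dst] = inv;
}
```

Once a load misses, every value computed from it carries the INV bit, so the shadow never treats a guessed value as a trustworthy address.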

Disks in 1973 "The first guys -- when they started out to try and make these disks -- they would take an epoxy paint mixture, ground some rust particles into it, put that in a Dixie cup, strain that through a women's nylon to filter it down, and then pour it on a spinning disk and let it spread out as it was spinning, to coat the surface of the disk." Source: http://www.newmedianews.com/032798/ts_harddisk.html Rotational Latency? 65 milliseconds (1973) vs. 10 milliseconds (2000)

Existing Predictors Work Well • Sequential Read Ahead • History-based Habits [Figure: cache of disk blocks in RAM (100 ns latency) in front of blocks on disk (10,000,000 ns latency)]

Sequential Read Ahead • Prefetch a few blocks ahead • Read Ahead / Stay Ahead • Works well with Scatter / Gather [Figure: reading block 4 pulls blocks 4-6 into the cache]
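
A sequential read-ahead policy of this kind can be sketched as a tiny per-file detector. This is a hypothetical illustration, not any particular kernel's implementation; `ReadAhead`, `readahead_init`, `readahead_on_read`, and the `depth` parameter are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file state for a sequential read-ahead detector. */
typedef struct {
    long last_block;       /* last block the application read, -1 if none */
    long prefetched_upto;  /* highest block already requested */
} ReadAhead;

void readahead_init(ReadAhead *ra)
{
    ra->last_block = -1;
    ra->prefetched_upto = -1;
}

/* Called on each application read of `block`. If the read continues a
 * sequential run, fill `out` with the blocks to prefetch so that we stay
 * `depth` blocks ahead, and return how many were written (at most `cap`). */
size_t readahead_on_read(ReadAhead *ra, long block, long depth,
                         long *out, size_t cap)
{
    assert(ra != NULL && out != NULL && depth >= 0);
    size_t n = 0;
    int sequential = (ra->last_block >= 0 && block == ra->last_block + 1);
    ra->last_block = block;
    if (!sequential) {
        ra->prefetched_upto = block;  /* random read: restart, prefetch nothing */
        return 0;
    }
    long start = ra->prefetched_upto > block ? ra->prefetched_upto + 1
                                             : block + 1;
    for (long b = start; b <= block + depth && n < cap; b++) {
        out[n++] = b;
    }
    if (n > 0) {
        ra->prefetched_upto = out[n - 1];
    }
    return n;
}
```

After the first sequential read the detector requests `depth` blocks at once (read ahead), and afterwards one new block per read (stay ahead). A random read resets it, which is exactly the case this predictor cannot help with.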

What about random reads? • Programmer could manually modify app • tipio_seg • tipio_fd_seg • Good performance, if human is smart • Hard to do • Old programs • Hard to predict how far ahead to prefetch

Kernel thread coordinates hints from multiple processes

Sample TIPIO
/* Process records from file f1 */
/* Prefetch the first 5 records */
tipio_seg(f1, 0, 5*REC_LEN);
/* Process the records */
for (rec = 0; ; rec++) {
    /* Hint the record 5 ahead of the one about to be read */
    tipio_seg(f1, (rec+5)*REC_LEN, REC_LEN);
    bytes = read(f1, bf, REC_LEN);
    if (bytes <= 0) break;   /* stop at end of file or on error */
    process(bf);
}
Warning: over-simplification of tipio_seg

History-based Habits • EXAMPLE: • Edit / Compile / Link cycle is very predictable [Figure: edit, compile, link cycle]

Normal vs. Prefetch on 3 Disks

Too Much Prefetch? • Disk head busy and far away when an unexpected useful read happens • Speculated block becomes victim before it is used

Chang / Gibson Approach • Create a kernel thread w/ shadow code • Run speculatively when real code stalls • Copy-on-write for all memory stores • Ignore exceptions (e.g. div by 0) • Speculation is safe • No real disk writes • Shadow page table • Predicts reads far in advance • Perhaps incorrectly
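
Why copy-on-write makes speculation safe can be shown with a toy address space. This sketch is an assumption-laden illustration, not the mechanism of the paper's tool: the `CowSpace` type, the page counts, and the three function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_PAGES 4
#define PAGE_SIZE 16

/* Hypothetical toy address space: the real process's pages plus private
 * copies that only the speculating thread writes to. */
typedef struct {
    unsigned char real[NUM_PAGES][PAGE_SIZE];
    unsigned char shadow[NUM_PAGES][PAGE_SIZE];
    bool copied[NUM_PAGES];  /* stand-in for the shadow page table */
} CowSpace;

/* Speculative store: copy the page on first write, then write the copy. */
void spec_store(CowSpace *s, int page, int off, unsigned char v)
{
    assert(page >= 0 && page < NUM_PAGES && off >= 0 && off < PAGE_SIZE);
    if (!s->copied[page]) {
        memcpy(s->shadow[page], s->real[page], PAGE_SIZE);
        s->copied[page] = true;
    }
    s->shadow[page][off] = v;
}

/* Speculative load: prefer the private copy if one exists. */
unsigned char spec_load(const CowSpace *s, int page, int off)
{
    assert(page >= 0 && page < NUM_PAGES && off >= 0 && off < PAGE_SIZE);
    return s->copied[page] ? s->shadow[page][off] : s->real[page][off];
}

/* Restart after going off-track: discard every private copy. */
void spec_invalidate(CowSpace *s)
{
    memset(s->copied, 0, sizeof s->copied);
}
```

The real process's pages are never written by the speculative path, so a wrong guess costs only wasted work, never corrupted data.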

Staying on-track • Hint Log (reads hinted by Spec, checked against Real): 23 412 6 92 408 409 410 54 16 17 18 19 • If next hinted read == this read • Then on-track • Else • OOPS • What if the actual program reads 23, 412, 6, then 88?
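
The hint-log comparison can be sketched in a few lines. The `HintLog` type and `hint_log_check` are hypothetical names, and treating an exhausted log as off-track is a simplification made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hint log: block numbers the shadow thread predicted,
 * consumed in order by the real process. */
typedef struct {
    const long *hinted;
    size_t count;
    size_t next;
} HintLog;

/* Called by the real process before each read. Returns true if the read
 * matches the next hint (on-track); false means the shadow must restart.
 * Simplification: an exhausted log is also reported as off-track. */
bool hint_log_check(HintLog *log, long block)
{
    assert(log != NULL);
    if (log->next < log->count && log->hinted[log->next] == block) {
        log->next++;
        return true;
    }
    return false;
}
```

With the log from the slide, reads of 23, 412, and 6 each match the next hint; a read of 88 where 92 was hinted returns false, which is the point at which speculation has to be restarted from the real process's state.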

$ Staying On Track - 2 ways • Conservative Approach • Stop when you reach an INV branch and wait for the main thread to return • Aggressive Approach • Use branch prediction to go beyond branch and stop only when cache miss has been serviced * Aggressive approach can execute farther, but may introduce useless fetches

$ Possible prefetch results • Generate prefetch using correct address • Fill up DMAQ, drop the prefetch • Used incorrect address • Prefetch is redundant with an outstanding cache-line fetch
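
Three of these outcomes can be told apart at the moment a prefetch is offered to the queue. The sketch below is a hypothetical model: the `Dmaq` type, the enum names, and the eight-slot depth (taken from the [1] to [8] boxes on the Long-latency Events slide) are assumptions. The fourth outcome, a prefetch built from an incorrect address, cannot be detected here because it looks like any other issued prefetch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DMAQ_SLOTS 8  /* assumed queue depth, for illustration only */

typedef enum { PF_ISSUED, PF_DROPPED_FULL, PF_REDUNDANT } PrefetchResult;

/* Hypothetical queue of outstanding cache-line fetches. */
typedef struct {
    unsigned long line[DMAQ_SLOTS];
    size_t used;
} Dmaq;

/* Offer a prefetch for `line_addr` and report what happened to it. */
PrefetchResult dmaq_prefetch(Dmaq *q, unsigned long line_addr)
{
    assert(q != NULL && q->used <= DMAQ_SLOTS);
    for (size_t i = 0; i < q->used; i++) {
        if (q->line[i] == line_addr) {
            return PF_REDUNDANT;  /* line is already being fetched */
        }
    }
    if (q->used == DMAQ_SLOTS) {
        return PF_DROPPED_FULL;   /* no free slot: the prefetch is lost */
    }
    q->line[q->used++] = line_addr;
    return PF_ISSUED;
}
```

A real queue would also free slots as fetches complete; that part is omitted to keep the classification visible.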

$ Fetch Parallelism • Prefetching overlaps cache misses rather than paying each sequentially [Figure: main-thread timeline: prefetch value 1, prefetch value 2, use value 1, prefetch value 3, use value 2, use value 3]
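
The benefit of overlapping misses is simple arithmetic, sketched here under an idealized assumption that the memory system can service all outstanding fetches in parallel. The function names and the `issue_gap` parameter are hypothetical.

```c
#include <assert.h>

/* Cycles spent waiting if each of n misses is paid one after another. */
long stall_sequential(long n, long miss_latency)
{
    assert(n >= 0 && miss_latency >= 0);
    return n * miss_latency;
}

/* Cycles spent waiting if the n fetches are issued `issue_gap` cycles
 * apart and overlap: the last one completes at (n-1)*gap + latency. */
long stall_overlapped(long n, long miss_latency, long issue_gap)
{
    assert(n >= 0 && miss_latency >= 0 && issue_gap >= 0);
    if (n == 0) {
        return 0;
    }
    return (n - 1) * issue_gap + miss_latency;
}
```

For three misses at 100 cycles each (the low end of the 100-200 cycle range quoted earlier) and an assumed 10 cycles between issues, the sequential cost is 300 cycles and the overlapped cost is 120 cycles.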

If I/O Gets Off-Track • Real Process copies registers to shadow thread’s register save area • Lights “restart” flag • Then performs blocking I/O • Which causes shadow thread to run • Shadow thread grabs a copy of the real stack • Invalidates copy-on-write pages • Cancels all hints: tipio_cancel_all

Overhead in I/O case • Before Read • Check hint log • If OK, continue else restart spec thread with MY stack and MY registers right here

$ Overhead in cache case • Add the backup register file • Add INV bits in the cache

$ Results - Simulations * Conservative values in cache sizes

$ Results for PERL next-miss: fetch next cache line on miss next-always: always fetch next cache line *Run ahead improves performance. Sequential policies hurt performance!

$ Results for TomcatV *Even for a “scientific” style benchmark, run-ahead does better than sequential.

$ What Happens to Prefetches? Aggressive Case * DMAQ is often dropping potential prefetches.

$ Prepare resources across a longer critical path section [Figure labels: Stalled Load, Instruction Window, Speculated Instructions, Effective Instruction Window, time]

I/O Spec Hint Tool • Transforms subject binary into speculative code solely to predict I/O • Add copy-on-write checks • Fix dynamic memory allocations (malloc ...) • Fix control transfers that cannot be statically resolved (jump tables, function pointer calls) • Remove system calls (printf, fprintf, flsbuf, ...) • Complex control transfers stop spec • Could benefit from code slicing

Experimental Setup • 12 MByte disk cache • Prefetch limited to 64 blocks • 4 disks, striped 64 KBytes per stripe

Original vs. Spec vs. Manual

Use Lots of Overlapping Reads

$ Conclusion • Response time can benefit from hints • The latencies being hidden are getting bigger (in terms of instruction opportunities) every year • Static hinting is too hard • And not smart enough • Dynamic run-ahead can get improvements • Without programmer involvement Thought to leave you with: Do these techniques attack the critical path or do they mitigate resource constraints?
