
Improving Data Cache Performance Under a Cache Miss


Presentation Transcript


  1. $ Improving Data Cache Performance Under a Cache Miss. J. Dundas and T. Mudge, Supercomputing '97. Presented by Laura J. Spencer (ljspence@cs.wisc.edu) and Jim Gast (jgast@cs.wisc.edu), CS703, Spring 2000, UW/Madison

  2. Automatic I/O Hint Generation through Speculative Execution. F. Chang and G. Gibson, OSDI '99

  3. $ Similar algorithms in different worlds • The Run Ahead paper tries to hide cache miss latency • The I/O Hinting paper tries to hide disk read latency

  4. $ Basic Concept: Prefetching RunAhead via a Shadow Thread • Prefetch: try to get long-latency events started as soon as possible • Shadow Thread: start a copy of the program that runs ahead to find the next few long-latency events • Let the RunAhead speculate: don't let your shadow change any of your data, and every time your shadow goes off-track, put it back on-track

  5. $ Shadow Code • Prefetch • Far enough ahead to hide latency • Perhaps incorrectly • Runs speculatively during the stall • Don't wait for the data • Contents might be invalid • Keep shadow values private • Suppress exceptions • Stay ahead until the end of the leash • Low confidence of being on-track • Outrunning resources. Example fragment: b ← c + d; c ← a[b]; f ← e / b; if (d == 1) then . . .

  6. $ Talk Roadmap (Dundas vs. Chang) • Show how to RunAhead: Dundas ($) backs up the registers and speculates under a stall; Chang copy-on-writes the RAM and speculates when stalled • How far to speculate? Dundas ($) fills the DMAQ with prefetches; Chang issues a constant number of hints (if on-track) • Experimental Results for both

  7. $ Simple Array Example. Original loop:
     for (int i = 0; i < size; i++) { r[i] = a[i] + b[i]; }
  The shadow version issues prefetches instead of real work:
     for (int i = 0; i < size; i++) { _r[i] = prefetch(a[i]) + prefetch(b[i]); }
  [Timeline: while the main thread sleeps on the cache miss for LD a[0] / LD b[0], the run-ahead thread executes ahead and issues PreFetch(b[0]); PreFetch(a[1]); PreFetch(b[1]); PreFetch(a[2]); and so on.] * Run-ahead only needs execution logic, which would otherwise sit idle during the stall.
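
  The run-ahead mechanism in the paper is hardware, but the slide's transformation has a familiar software analogue. The fragment below is a minimal sketch using GCC/Clang's __builtin_prefetch intrinsic; the prefetch distance DIST is a hypothetical tuning constant chosen to cover the miss latency, not a value from the paper.

     /* Software analogue of the slide's shadow loop: start the loads
        DIST iterations early so they overlap with the real work. */
     #define DIST 8

     void add_arrays(int *r, const int *a, const int *b, int size)
     {
         for (int i = 0; i < size; i++) {
             if (i + DIST < size) {
                 __builtin_prefetch(&a[i + DIST]);  /* hide a[i+DIST] miss */
                 __builtin_prefetch(&b[i + DIST]);  /* hide b[i+DIST] miss */
             }
             r[i] = a[i] + b[i];                    /* the real work */
         }
     }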

  8. $ Long-latency Events • A miss in the L2 cache costs 100-200 cycles • Whenever the L1 cache misses, start the shadow • Decide which values will be needed next and place them into the Direct Memory Access Queue (DMAQ) as prefetches. [Figure: prefetch value 1, prefetch value 2, prefetch value 3 entering an 8-entry DMAQ.] The longer the miss, the more chance this thread has of finding useful things to prefetch. (A queue sketch follows.)
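
  A minimal sketch of such a fixed-size prefetch queue, assuming (as the figure's eight slots suggest) that a full queue silently drops new requests rather than stalling the shadow thread; the type and function names are hypothetical.

     #include <stdint.h>

     #define DMAQ_SIZE 8

     typedef struct {
         uintptr_t addr[DMAQ_SIZE];  /* pending prefetch addresses */
         int       count;
     } dmaq_t;

     /* Returns 1 if the prefetch was queued, 0 if it was dropped. */
     int dmaq_push(dmaq_t *q, uintptr_t miss_addr)
     {
         if (q->count == DMAQ_SIZE)
             return 0;               /* queue full: drop, never stall */
         q->addr[q->count++] = miss_addr;
         return 1;
     }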

  9. $ Backup Register File • Checkpoint the current state to a backup register file (a latch copies the register file, and the address of the faulting instruction is saved) • The thread then executes; whenever a value is unknown, mark its state bit invalid (INV) • The register file and the cache each maintain an invalid bit • INV acts like a read-after-write hazard marker. (A checkpoint sketch follows.)
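
  In hardware the checkpoint is a one-shot latch copy; the C sketch below only illustrates the enter/exit protocol, and every name in it is hypothetical.

     #include <stdbool.h>

     #define NREGS 32

     static long regs[NREGS], backup[NREGS];
     static bool reg_inv[NREGS];

     void enter_runahead(void)           /* on the L1 miss */
     {
         for (int i = 0; i < NREGS; i++) {
             backup[i]  = regs[i];       /* checkpoint the register file */
             reg_inv[i] = false;         /* everything starts valid */
         }
     }

     void exit_runahead(void)            /* when the miss is serviced */
     {
         for (int i = 0; i < NREGS; i++)
             regs[i] = backup[i];        /* discard speculative state */
     }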

  10. $ What is invalid? • Register-to-register op: mark the dest reg INV if any source reg is INV • Load op: mark the dest reg INV if • the address reg is INV, • the load causes a miss, or • a previous speculative store marked the cache line INV • Store op: mark the cache line INV if the address is known and no miss would occur. *If a store cannot mark the cache INV, a later LD may consume INV data. (These rules are sketched below.)
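
  A software rendering of the first two rules, purely for illustration; the real INV bits live in hardware, per register and per cache line, and everything named below is hypothetical.

     #include <stdbool.h>
     #include <stdint.h>

     typedef struct { long val; bool inv; } reg_t;

     /* Register-to-register op: dest is INV if any source is INV. */
     void alu_add(reg_t *dst, const reg_t *s1, const reg_t *s2)
     {
         dst->inv = s1->inv || s2->inv;
         if (!dst->inv)
             dst->val = s1->val + s2->val;
     }

     /* Load op: dest is INV if the address reg is INV, the access
        misses, or a prior speculative store marked the line INV. */
     void load(reg_t *dst, const reg_t *addr, bool hit, bool line_inv)
     {
         dst->inv = addr->inv || !hit || line_inv;
         if (!dst->inv)
             dst->val = *(const long *)(uintptr_t)addr->val;
     }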

  11. Disks in 1973 "The first guys -- when they started out to try and make these disks -- they would take an epoxy paint mixture, ground some rust particles into it, put that in a Dixie cup, strain that through a women's nylon to filter it down, and then pour it on a spinning disk and let it spread out as it was spinning, to coat the surface of the disk.” Source: http://www.newmedianews.com/032798/ts_harddisk.html Rotational Latency? 65 milliseconds (1973) vs. 10 milliseconds (2000)

  12. Existing Predictors Work Well • Sequential Read Ahead • History-based Habits. [Figure: a cache of disk blocks in RAM (~100 ns latency) sits in front of the blocks on disk (~10,000,000 ns latency); blocks 1-3 are staged into the RAM cache.]

  13. Sequential Read Ahead • Prefetch a few blocks ahead • Read Ahead / Stay Ahead • Works well with Scatter / Gather. [Figure: blocks 4-6 are staged into the cache while block 4 is read.] (A user-level sketch follows.)
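
  Kernel read-ahead is internal, but the "stay a few blocks ahead" idea can be sketched at user level with the standard POSIX posix_fadvise() hint. BLOCK and AHEAD are hypothetical tuning constants, not values from the paper.

     #define _POSIX_C_SOURCE 200112L
     #include <fcntl.h>

     #define BLOCK (64 * 1024)   /* assumed block size */
     #define AHEAD 3             /* how far to stay ahead */

     /* Hint that the next AHEAD blocks after next_block will be read,
        so the kernel can start fetching them before we ask. */
     void hint_ahead(int fd, long next_block)
     {
         posix_fadvise(fd, (off_t)next_block * BLOCK,
                       (off_t)AHEAD * BLOCK, POSIX_FADV_WILLNEED);
     }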

  14. What about random reads? • The programmer could manually modify the app: • tipio_seg • tipio_fd_seg • Good performance, if the human is smart • But hard to do: • old programs must be revisited • it is hard to predict how far ahead to prefetch

  15. Kernel thread coordinates hints from multiple processes

  16. Sample TIPIO
     /* Process records from file f1 */
     /* Prefetch the first 5 records */
     tipio_seg(f1, 0, 5*REC_LEN);
     /* Process the records, hinting 5 records ahead */
     for (rec = 0; ; rec++) {
         tipio_seg(f1, (rec+5)*REC_LEN, REC_LEN);
         bytes = read(f1, bf, REC_LEN);  /* read(fd, buf, count) */
         if (bytes <= 0) break;          /* stop at EOF or error */
         process(bf);
     }
  Warning: over-simplification of tipio_seg

  17. History-based Habits • EXAMPLE: the edit / compile / link cycle is very predictable. [Figure: the edit → compile → link cycle.]

  18. Normal vs. Prefetch on 3 Disks

  19. Too Much Prefetch? • The disk head is busy, and far away, when an unexpected but genuinely useful read arrives • A speculated block becomes an eviction victim before it is ever used

  20. Chang / Gibson Approach • Create a kernel thread with shadow code • Run it speculatively when the real code stalls • Copy-on-write for all memory stores • Ignore exceptions (e.g., divide by 0) • Speculation is safe: • no real disk writes • a shadow page table • Predicts reads far in advance • Perhaps incorrectly. (A copy-on-write sketch follows.)
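
  A minimal sketch of the copy-on-write store path the speculative thread takes; Chang and Gibson implement this with a shadow page table in the kernel, so the table, sizes, and names below are all hypothetical.

     #include <stdlib.h>
     #include <string.h>

     #define NPAGES  1024
     #define PAGE_SZ 4096

     static char *shadow[NPAGES];   /* NULL until first speculative store */

     /* All speculative stores land in private page copies, so the real
        process's memory is never modified. (Error handling omitted.) */
     void spec_store(const char *base, long off, char byte)
     {
         long pg = off / PAGE_SZ;
         if (shadow[pg] == NULL) {                 /* first store: copy */
             shadow[pg] = malloc(PAGE_SZ);
             memcpy(shadow[pg], base + pg * PAGE_SZ, PAGE_SZ);
         }
         shadow[pg][off % PAGE_SZ] = byte;         /* write the copy only */
     }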

  21. Staying on-track • Hint Log: the speculative thread logs every block it hints; the real thread compares each actual read against the log • If the next hinted read == this read, then on-track • Else: OOPS. [Figure: the real thread's reads (23, 412, 6, 92, …) are checked against the speculative thread's hint log (…, 408, 409, 410, 54, 16, 17, 18, 19).] What if the actual program reads 23, 412, 6, then 88? (A sketch of the check follows.)
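
  A minimal sketch of that comparison, assuming a simple ring buffer of hinted block numbers; what to do when the log is empty is a guess here, and every name is hypothetical.

     #include <stdbool.h>

     #define LOG_LEN 64

     static long hint_log[LOG_LEN];
     static int  head, tail;          /* ring buffer of pending hints */

     /* Called by the real thread before each read. Returns true if
        speculation is still on-track for this block. */
     bool check_hint(long block)
     {
         if (head == tail)            /* no pending hints: assume OK */
             return true;
         if (hint_log[head] == block) {
             head = (head + 1) % LOG_LEN;   /* consume the matched hint */
             return true;
         }
         return false;                /* OOPS: restart the shadow thread */
     }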

  22. $ Staying On Track - 2 ways • Conservative Approach: stop when you reach an INV branch and wait for the main thread to catch up • Aggressive Approach: use branch prediction to go beyond the branch, stopping only when the cache miss has been serviced. *The aggressive approach can execute farther, but may introduce useless fetches

  23. $ Possible prefetch results • The prefetch is generated with the correct address • The DMAQ fills up, so the prefetch is dropped • The prefetch used an incorrect address • The prefetch is redundant with an outstanding cache-line fetch

  24. $ Fetch Parallelism. [Timeline: the main thread issues prefetch value 1 and prefetch value 2, uses value 1 while prefetch value 3 is issued, then uses values 2 and 3.] * Prefetching overlaps cache misses rather than paying for each sequentially: three 200-cycle misses cost roughly 600 cycles back-to-back, but little more than 200 when overlapped.

  25. If I/O Gets Off-Track • The real process copies its registers to the shadow thread's register save area • lights the "restart" flag • then performs the blocking I/O, which causes the shadow thread to run • The shadow thread grabs a copy of the real stack • invalidates its copy-on-write pages • cancels all hints: tipio_cancel_all

  26. Overhead in I/O case • Before each read: • check the hint log • if it matches, continue; else restart the spec thread with MY stack and MY registers, right here

  27. $ Overhead in cache case • Add the backup register file • Add INV bits in the cache

  28. $ Results - Simulations * Conservative choices of cache sizes were used

  29. $ Results for PERL • next-miss: fetch the next cache line on a miss • next-always: always fetch the next cache line • *Run-ahead improves performance; the sequential policies hurt it!

  30. $ Results for TomcatV *Even for a “scientific” style benchmark, run-ahead does better than sequential.

  31. $ What Happens to Prefetches? (Aggressive Case) * The DMAQ often drops potential prefetches.

  32. $ Prepare resources across a longer critical path section. [Figure: over time, a stalled load extends the ordinary instruction window with speculated instructions, yielding a larger effective instruction window.]

  33. I/O Spec Hint Tool • Transforms the subject binary into speculative code solely to predict I/O: • adds copy-on-write checks • fixes dynamic memory allocations (malloc ...) • fixes control transfers that cannot be statically resolved (jump tables, function-pointer calls) • removes system calls (printf, fprintf, flsbuf, ...) • Complex control transfers stop speculation • Could benefit from code slicing

  34. Experimental Setup • 12 MByte disk cache • Prefetch limited to 64 blocks • 4 disks, striped 64 KBytes per stripe

  35. Original vs. Spec vs. Manual

  36. Use Lots of Overlapping Reads

  37. $ Conclusion • Response time can benefit from hints • The latencies being hidden are getting bigger (in terms of instruction opportunities) every year • Static hinting is too hard • and not smart enough • Dynamic run-ahead can get improvements • without programmer involvement. A thought to leave you with: do these techniques attack the critical path, or do they mitigate resource constraints?
