Accelerating Multiprocessor Simulation

Presentation Transcript


  1. Accelerating Multiprocessor Simulation • Kenneth C. Barr • 6.895 Final Project

  2. Motivation • Detailed simulation of large benchmarks can take days… • …and that’s just a uniprocessor. Parallel simulations are more complex. • Cache coherence • Interconnect / bus timing models • N CPUs • Memory Address Record (MAR): a structure to speed up simulation of directory-based cache coherent computers. Kenneth C. Barr — MIT Computer Science and Artificial Intelligence Lab

  3. Directory-Based Cache Coherence: Review • Same state idea as snooping (e.g., MESI), but more scalable

  4. Directory-Based Cache Coherence: Review • Same state idea as snooping (e.g., MESI), but more scalable • Add directory to hold state, and replace bus with network

  5. Directory-Based Cache Coherence: Review • Same state idea as snooping (e.g., MESI), but more scalable • Add directory to hold state, and replace bus with network • Each cache line has state in directory • On load and store, contact the “home node” for line’s state • Directory entry formats: Exclusive + Owner (e.g., EXCL, CPU17); Shared + Readers (e.g., SHRD, 01101…01110); Invalid (INV, remaining bits are don’t care)

  6. Directory Example • CPU1 prepares to modify word at 0x8000 • CPU1 contacts home with a “read-exclusive” request • Home replies: “<data> is shared by CPU2, CPU3, and CPU4” • Home transitions line to Exclusive • CPU1 sends invalidates to CPU2, CPU3, and CPU4 • Sharers invalidate their copies • CPU1 waits for all three replies • CPU1 puts data in cache (possibly evicting LRU line) • Simulating all of this is expensive: 11 hard-to-predict if-statements in the top-level function alone! • Let’s skip it!
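The home node's part of the exchange above can be sketched in a few lines of C++. This is an illustration only, not the simulator's code: `DirEntry`, `DirState`, and `read_exclusive` are hypothetical names, the sharer set is assumed to be a simple bit vector, and network messages, timing, and the requester's cache fill are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical directory entry: one per cache line, kept at the home node.
enum class DirState { Invalid, Shared, Exclusive };

struct DirEntry {
    DirState state = DirState::Invalid;
    int owner = -1;              // meaningful only in Exclusive
    std::vector<bool> sharers;   // one bit per CPU, meaningful only in Shared
    explicit DirEntry(int ncpus) : sharers(ncpus, false) {}
};

// Home node handles a "read-exclusive" request from `requester`.
// Returns the CPUs whose copies the requester must invalidate,
// and leaves the line Exclusive with the requester as owner.
std::vector<int> read_exclusive(DirEntry& e, int requester) {
    assert(requester >= 0 &&
           static_cast<std::size_t>(requester) < e.sharers.size());
    std::vector<int> to_invalidate;
    if (e.state == DirState::Shared) {
        for (std::size_t cpu = 0; cpu < e.sharers.size(); ++cpu) {
            if (e.sharers[cpu] && static_cast<int>(cpu) != requester) {
                to_invalidate.push_back(static_cast<int>(cpu));
            }
        }
    } else if (e.state == DirState::Exclusive && e.owner != requester) {
        to_invalidate.push_back(e.owner);
    }
    e.sharers.assign(e.sharers.size(), false);
    e.state = DirState::Exclusive;
    e.owner = requester;
    return to_invalidate;
}
```

Even this stripped-down version branches on state, owner, and sharer bits for every request, which hints at why the full model is costly to run on every access.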

  7. Sampling Microarchitecture Simulation • MAR used in conjunction with “warming” • Long periods of fast “functional warming” • Short periods of detailed simulation • 35-60 times faster than detailed simulation • Less than 1% error in CPI (Wunderlich et al., 2003) • [Figure: timeline with Functional Warming, Detailed Warming, and Detailed Simulation phases]

  8. Proposal: Memory Address Record to enable fast warmup • Quick updates to MAR during warming • No detailed cache or directory model; all accesses go straight to the shared memory space (on the simulator’s heap, so it’s easy and fast to access) • Everything looks like a hit • For each access, record {CPU, read/write, time} • Playback from MAR to enable detailed simulation • [Figure: timeline with Functional Warming, Detailed Warming, and Detailed Simulation phases]

  9. Proposal: Memory Address Record to enable fast warmup • For each access, record {CPU, read/write, time} • [Figure: Physical Memory] • struct mar_record { int writer; stime_t writetime; vector<stime_t> readers; };

  10. Proposal: Memory Address Record to enable fast warmup • For each access, record {CPU, read/write, time} • 3:07am, CPU1 issues “load r1, 0x8004” • [Figure: Physical Memory] • struct mar_record { int writer; stime_t writetime; vector<stime_t> readers; };
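A minimal sketch of the update step, built around the `mar_record` struct from the slide. Everything else is an assumption made for illustration: `stime_t` is taken to be a 64-bit counter, time 0 is used to mean "never", `readers` is assumed to hold one last-read time per CPU, the record is keyed by cache-line number in a `std::unordered_map`, and `kNumCpus`, `kLineBytes`, and `mar_update` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using stime_t = std::uint64_t;            // assumed: simulator time as a 64-bit count
constexpr int kNumCpus = 4;               // assumed: matches the 4-CPU test setup
constexpr std::uint64_t kLineBytes = 64;  // assumed cache-line size

struct mar_record {
    int writer = -1;        // last CPU to write the line (-1: never written)
    stime_t writetime = 0;  // time of that write (0: never)
    std::vector<stime_t> readers = std::vector<stime_t>(kNumCpus, 0);  // last read per CPU
};

// Assumed container: one record per touched cache line, keyed by line number.
using MAR = std::unordered_map<std::uint64_t, mar_record>;

// Record one access, overwriting whatever was stored before.
void mar_update(MAR& mar, int cpu, bool is_write, std::uint64_t addr, stime_t now) {
    assert(cpu >= 0 && cpu < kNumCpus);
    mar_record& rec = mar[addr / kLineBytes];  // creates the record on first touch
    if (is_write) {
        rec.writer = cpu;
        rec.writetime = now;
    } else {
        rec.readers[cpu] = now;
    }
}
```

For the slide's example, `mar_update(mar, 1, false, 0x8004, now)` would stamp CPU1's slot in the record for the line containing 0x8004. The update touches no cache, directory, or network model.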

  11. Algorithm • Update: simply record <cpu, read/write, time>, overwriting old values • Playback, in two stages: (1) reconstruct caches; (2) use caches to build directory

  12. Algorithm: Reconstructing Caches • Uses N w-deep priority queues and N “latest writers”

      for each set {
          for each line in set {
              update latest w writers
              throw out all prior reads
              insert writers and remaining (subsequent) reads in queue
          }
          for each CPU {
              empty priority queue (tag, dirty, valid bits) into set
          }
      }
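The per-set reconstruction can be sketched as follows. This is a simplified illustration under several assumptions rather than the project's implementation: it sorts and truncates a candidate list instead of using the w-deep priority queues named on the slide, it tracks a single latest writer per line (as in `mar_record`), it treats time 0 as "never", and it assumes the writer's copy becomes clean once another CPU reads the line afterwards. `CacheLine`, `reconstruct_set`, and `kNumCpus` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using stime_t = std::uint64_t;  // assumed time type
constexpr int kNumCpus = 4;     // assumed CPU count

struct mar_record {
    int writer = -1;
    stime_t writetime = 0;
    std::vector<stime_t> readers = std::vector<stime_t>(kNumCpus, 0);
};

struct CacheLine { std::uint64_t tag; stime_t last_use; bool dirty; };

// `lines` holds the record of every touched line that maps to one set, keyed by tag.
// Returns, per CPU, up to `ways` most recently used lines for that set.
std::vector<std::vector<CacheLine>> reconstruct_set(
        const std::map<std::uint64_t, mar_record>& lines, std::size_t ways) {
    std::vector<std::vector<CacheLine>> per_cpu(kNumCpus);
    for (const auto& kv : lines) {
        const std::uint64_t tag = kv.first;
        const mar_record& rec = kv.second;
        assert(rec.readers.size() == static_cast<std::size_t>(kNumCpus));
        // Did any other CPU read the line after the latest write?
        bool read_after_write = false;
        for (int cpu = 0; cpu < kNumCpus; ++cpu) {
            if (cpu != rec.writer && rec.readers[cpu] > rec.writetime) read_after_write = true;
        }
        for (int cpu = 0; cpu < kNumCpus; ++cpu) {
            if (cpu == rec.writer) {
                // Writer keeps the line; a later read of its own refreshes recency.
                CacheLine cl = {tag, std::max(rec.writetime, rec.readers[cpu]), !read_after_write};
                per_cpu[cpu].push_back(cl);
            } else if (rec.readers[cpu] > rec.writetime) {
                // Only reads newer than the latest write survive.
                CacheLine cl = {tag, rec.readers[cpu], false};
                per_cpu[cpu].push_back(cl);
            }
        }
    }
    for (auto& cand : per_cpu) {
        // Keep the `ways` most recent lines, newest first.
        std::sort(cand.begin(), cand.end(),
                  [](const CacheLine& a, const CacheLine& b) { return a.last_use > b.last_use; });
        if (cand.size() > ways) cand.resize(ways);
    }
    return per_cpu;
}
```

Sorting is used here only to keep the sketch short; a bounded priority queue per CPU, as the slide describes, avoids holding every candidate at once.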

  13. Algorithm: Reconstructing Directory

      for each CPU {
          for each set in current cache {
              check other caches ((N-1)W places) to get correct state
          }
      }

      • All lines start as I • Line present in one place (dirty) -> M • Line present in one place (clean) -> S (or E, but evictions may not let us distinguish) • Line present in many places (clean) -> S • Other cases should never arise
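The state rules above can be sketched as a pass over the reconstructed caches. As an assumption for brevity, this version accumulates holders in a `std::map` instead of probing the other caches' (N-1)W ways as the slide describes; `CachedLine`, `DirEntry`, and `build_directory` are hypothetical names, and lines absent from the returned map are taken to be in state I.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

enum class DirState { I, S, M };

struct CachedLine { std::uint64_t line_addr; bool dirty; };

struct DirEntry {
    DirState state = DirState::I;  // all lines start as I
    std::vector<int> holders;      // CPUs whose reconstructed cache holds the line
    bool any_dirty = false;
};

// caches[cpu] lists the lines present in that CPU's reconstructed cache.
std::map<std::uint64_t, DirEntry> build_directory(
        const std::vector<std::vector<CachedLine>>& caches) {
    std::map<std::uint64_t, DirEntry> dir;
    for (std::size_t cpu = 0; cpu < caches.size(); ++cpu) {
        for (const CachedLine& line : caches[cpu]) {
            DirEntry& e = dir[line.line_addr];
            e.holders.push_back(static_cast<int>(cpu));
            e.any_dirty = e.any_dirty || line.dirty;
        }
    }
    for (auto& kv : dir) {
        DirEntry& e = kv.second;
        if (e.holders.size() == 1 && e.any_dirty) {
            e.state = DirState::M;  // present in one place, dirty
        } else {
            // Clean in one or many places. A dirty line with several
            // holders is one of the cases that should never arise.
            assert(!e.any_dirty);
            e.state = DirState::S;
        }
    }
    return dir;
}
```

A single clean holder is reported as S here, following the slide's note that evictions may make E indistinguishable.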

  14. Proof (or intuition) of correctness • We’ve got everything the cache has and more! • So how can MAR be faster? No timing models and no state machines until playback. And when we do play back, we’re only interested in the final state, not what happened in between • Default scheme time: all accesses ∙ T(simulate cache + simulate directory + simulate network) • MAR time, update: all accesses ∙ T(hash table update) • MAR time, playback: touched lines ∙ (writes ∙ sharers + reads ∙ cpus) • In our favor: the MAR “all accesses” step is fast; touched lines are tiny compared to all accesses; sharing should be minimal at the application level
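The comparison above can be written out as a break-even condition. The symbols are introduced here for illustration and are not from the slides: $A$ is the number of accesses, $L$ the number of touched lines, $t_{\text{sim}}$ the per-access cost of simulating cache, directory, and network, $t_{\text{hash}}$ the cost of one hash table update, $w$, $s$, $r$, $n$ the writes, sharers, reads, and CPUs as counted on the slide, and $c$ an assumed constant cost per playback step.

```latex
\begin{align*}
T_{\text{default}} &= A \cdot t_{\text{sim}} \\
T_{\text{MAR}} &= A \cdot t_{\text{hash}} + c \cdot L \cdot (w \cdot s + r \cdot n) \\
T_{\text{MAR}} < T_{\text{default}}
  &\iff c \cdot L \cdot (w \cdot s + r \cdot n) < A \cdot (t_{\text{sim}} - t_{\text{hash}})
\end{align*}
```

Under these assumptions MAR wins whenever the playback term stays below the per-access savings summed over all accesses, which is what the three "in our favor" points argue: $t_{\text{hash}} \ll t_{\text{sim}}$, $L \ll A$, and $s$ small.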

  15. Measured sharing and cache lines

  16. Testing • UVSIM • Models SGI Origin 3000 CPU (R10K) and most structures • Used 4 CPUs • Added MAR update, playback, and stats

  17. Splash2 Benchmarks • Parallel Scientific Computing • Fast Fourier Transform (FFT) • Lower-upper (LU) dense matrix factorization • Barnes-Hut n-body interaction • Ocean current simulation • Chosen subset has diverse behavior • Data from Hennessy & Patterson, 2003

  18. Remaining work • Time it! • Note and reduce space requirements • Simplify formulas

  19. Future work • Can we save space in the structure by making it more cache-like? • Other optimizations (e.g., don’t stride, but take advantage of locality during replay) • What extra state needs to be stored for other schemes (e.g., MESI, MOESI, etc.)? • A per-processor structure to ease parallel-hosted simulation • Caveats: if application threads would get out of sync on real hardware, we’re modeling a correct execution, but not one we’d see in real life; don’t forget to sync/barrier when the applications sync

  20. Questions…

  21. Reconstructing Caches and Directory • Algorithm, step 1: reconstruct caches • “Stride” through MAR, looking at all cache lines that map to a set • For each line: keep track of the most recent write (per CPU); throw out all reads older than the most recent write; if there are readers, then for each CPU insert tag and read time into a priority queue (sorted by time) • For each CPU, for each set of the cache: copy the W most recent reads and writes • Step 2: reconstruct directory from caches • For each CPU’s cache, for each address, call an add_sharer() sort of function • Note that the cache-building step leaves us in a consistent state (e.g., only one cache can be writing) • Step 3: post-process directory for subtleties • If only one reader, it’s E, not S

  22. Conclusion • Pros: replaces most of the work in simulation with a quick update and O(N) replay; supports multisimulation of cache parameters and directory schemes (MSI vs. MESI) • Cons: memory (grows with number of touched cache lines)

  23. Results • Benchmarks: the Splash2 paper shows the fraction of time spent in load/store/lock/sync/etc.; make a graph for this • How far off are certain metrics if you don’t warm up the directory? • Graphs showing timing of my scheme vs. the default (on four Splash2 apps) • Growth of the structure over time with these apps