Accelerating Multiprocessor Simulation
Kenneth C. Barr, 6.895 Final Project
Motivation
• Detailed simulation of large benchmarks can take days…
• …and that's just a uniprocessor. Parallel simulations are more complex:
  • Cache coherence
  • Interconnect / bus timing models
  • N CPUs
• Memory Address Record (MAR): a structure to speed up simulation of directory-based cache-coherent computers
Directory-Based Cache Coherence: Review
• Same state idea as snooping (e.g., MESI), but more scalable
• Add a directory to hold state, and replace the bus with a network
• Each cache line has state in the directory; on a load or store, contact the line's "home node" for its state:
  • Exclusive + owner (e.g., EXCL: CPU17)
  • Shared + readers (e.g., SHRD: 01101…01110)
  • Invalid (INV: don't care)
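For concreteness, a minimal C++ sketch of one directory entry with these three states; the 64-CPU sharer vector and the field names are my assumptions, not the slides':

#include <bitset>
#include <cstdint>

// Minimal sketch of a single directory entry (assumes at most 64 CPUs).
// Mirrors the slide's states: Exclusive + owner, Shared + readers, Invalid.
enum class DirState { Invalid, Shared, Exclusive };

struct DirEntry {
    DirState state = DirState::Invalid;
    uint32_t owner = 0;        // meaningful only when state == Exclusive (e.g., CPU17)
    std::bitset<64> sharers;   // meaningful only when state == Shared (one bit per reader)
};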
Directory Example
• CPU1 prepares to modify the word at 0x8000
  • Contact home node with a "read-exclusive" request
  • Home replies: "<data> is shared by CPU2, CPU3, and CPU4"
  • Home transitions the line to Exclusive
  • CPU1 sends invalidates to CPU2, CPU3, and CPU4
  • Sharers invalidate their copies
  • CPU1 waits for all three replies
  • CPU1 puts the data in its cache (possibly evicting the LRU line)
• Not hard to see that this is intense simulation effort: 11 hard-to-predict if-statements in the top-level function!
• Let's skip it!
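A hedged sketch of the transaction above, building on the DirEntry sketch; the function names and send_invalidate stub are hypothetical stand-ins for the network model, not UVSIM calls:

#include <bitset>
#include <cstdint>
// Builds on the DirEntry / DirState sketch above.

void send_invalidate(uint32_t cpu) { /* hypothetical stub: queue an invalidate to cpu */ }

// Home node's side: record the new owner and hand back the set of caches to invalidate.
std::bitset<64> home_read_exclusive(DirEntry& e, uint32_t requester) {
    std::bitset<64> to_invalidate;
    if (e.state == DirState::Shared) {
        to_invalidate = e.sharers;          // "<data> is shared by CPU2, CPU3, and CPU4"
        to_invalidate.reset(requester);
        e.sharers.reset();
    } else if (e.state == DirState::Exclusive && e.owner != requester) {
        to_invalidate.set(e.owner);         // previous owner must give up its dirty copy
    }
    e.state = DirState::Exclusive;          // home transitions the line to Exclusive
    e.owner = requester;
    return to_invalidate;
}

// Requester's side (CPU1 in the example): invalidate the sharers, wait, fill the cache.
void cpu_read_exclusive(DirEntry& home_entry, uint32_t me) {
    std::bitset<64> victims = home_read_exclusive(home_entry, me);
    for (uint32_t cpu = 0; cpu < victims.size(); ++cpu)
        if (victims.test(cpu))
            send_invalidate(cpu);           // sharers invalidate their copies and reply
    // ... wait for all replies, then install the data (possibly evicting the LRU line)
}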
Sampling Microarchitecture Simulation
• MAR used in conjunction with "warming"
  • Long periods of fast "functional warming"
  • Short periods of detailed simulation
• 35-60 times faster than detailed simulation
• Less than 1% error in CPI (Wunderlich et al., 2003)
[Phase diagram: alternating Functional Warming, Detailed Warming, and Detailed Simulation phases]
Proposal: Memory Address Record to enable fast warmup
• Quick updates to MAR during warming
  • No detailed cache or directory model; all accesses go straight to the shared memory space (on the simulator's heap, so it's easy and fast to access)
  • Everything looks like a hit
  • For each access, record {CPU, read/write, time}
• Playback from MAR to enable detailed simulation
Proposal: Memory Address Record to enable fast warmup
• For each access, record {CPU, read/write, time}
• Example: at 3:07 am, CPU1 issues "load r1, 0x8004"
[Figure: physical memory annotated with MAR entries]
  struct mar_record {
      int writer;
      stime_t writetime;
      vector<stime_t> readers;
  };
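A hedged sketch of the update step, assuming the MAR lives in a hash table keyed by line-aligned address and that readers is indexed by CPU id; the stime_t typedef, the line size, and the function name are my assumptions:

#include <cstdint>
#include <unordered_map>
#include <vector>

typedef uint64_t stime_t;                 // assumed simulator-time type
const int NUM_CPUS = 4;                   // matches the 4-CPU UVSIM setup
const uint64_t LINE_BYTES = 128;          // assumed cache-line size

struct mar_record {
    int writer = -1;                      // last CPU to write the line (-1: none yet)
    stime_t writetime = 0;                // time of that write
    std::vector<stime_t> readers = std::vector<stime_t>(NUM_CPUS, 0);  // last read time per CPU
};

std::unordered_map<uint64_t, mar_record> mar;   // keyed by line-aligned physical address

// Record {CPU, read/write, time}, overwriting old values; this is the whole warming step.
void mar_update(int cpu, bool is_write, uint64_t addr, stime_t now) {
    mar_record& r = mar[addr & ~(LINE_BYTES - 1)];   // align to the cache line
    if (is_write) { r.writer = cpu; r.writetime = now; }
    else          { r.readers[cpu] = now; }
}

// The 3:07 am example: CPU1 loads from 0x8004.
//   mar_update(/*cpu=*/1, /*is_write=*/false, 0x8004, now);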
Algorithm
• Update
  • Simply record <cpu, read/write, time>, overwriting old values
• Playback: two stages
  • Reconstruct caches
  • Use caches to build directory
Algorithm: Reconstructing Caches
• Uses N w-deep priority queues and N "latest writers"
  for each set {
      for each line in set {
          update latest w writers
          throw out all prior reads
          insert writers and remaining (subsequent) reads in queue
      }
      for each CPU {
          empty priority queue (tag, dirty, valid bits) into set
      }
  }
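One way this playback stage could look in C++, building on the MAR sketch above; the set/way geometry, and the simplification of folding a writer's own later reads into a single dirty entry, are my assumptions:

#include <algorithm>
#include <queue>
#include <vector>
// Builds on mar, mar_record, NUM_CPUS, LINE_BYTES from the sketch above.

struct Access { stime_t time; uint64_t tag; bool dirty; };
struct ByTime { bool operator()(const Access& a, const Access& b) const { return a.time < b.time; } };

struct CacheLine { uint64_t tag; bool dirty; };    // presence implies the valid bit

const int WAYS = 2;        // assumed associativity (the slide's w)
const int SETS = 512;      // assumed number of sets

// reconstructed[cpu][set]: the lines we believe are resident at the end of warming.
std::vector<std::vector<std::vector<CacheLine>>> reconstructed(
    NUM_CPUS, std::vector<std::vector<CacheLine>>(SETS));

void reconstruct_caches() {
    for (uint64_t set = 0; set < SETS; ++set) {
        // One time-ordered queue per CPU; top() is that CPU's most recent surviving access.
        std::vector<std::priority_queue<Access, std::vector<Access>, ByTime>> pq(NUM_CPUS);

        for (const auto& kv : mar) {                       // stride over lines that map to this set
            uint64_t line = kv.first / LINE_BYTES;
            if (line % SETS != set) continue;
            const mar_record& rec = kv.second;
            uint64_t tag = line / SETS;
            for (int cpu = 0; cpu < NUM_CPUS; ++cpu) {
                if (rec.writer == cpu)                     // latest writer holds the line dirty
                    pq[cpu].push({std::max(rec.writetime, rec.readers[cpu]), tag, true});
                else if (rec.readers[cpu] > rec.writetime) // reads before the last write are stale
                    pq[cpu].push({rec.readers[cpu], tag, false});
            }
        }
        // Empty each queue: the W most recent accesses become that CPU's lines in this set.
        for (int cpu = 0; cpu < NUM_CPUS; ++cpu)
            for (int way = 0; way < WAYS && !pq[cpu].empty(); ++way) {
                Access a = pq[cpu].top(); pq[cpu].pop();
                reconstructed[cpu][set].push_back({a.tag, a.dirty});
            }
    }
}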
Algorithm: Reconstructing Directory
  for each CPU {
      for each set in current cache {
          check other caches ((N-1)W places) to get correct state
      }
  }
• All lines start as I
• Line present in one place (dirty) -> M
• Line present in one place (clean) -> S (or E, but evictions may not let us distinguish)
• Line present in many places (clean) -> S
• Other cases should never arise
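A sketch of this second playback stage under the rules above, reusing the reconstructed caches and the DirEntry type from the earlier sketches (DirState::Exclusive standing in for the slide's M):

#include <cstdint>
#include <unordered_map>
// Builds on reconstructed, CacheLine, DirEntry, DirState, NUM_CPUS, SETS from the sketches above.

std::unordered_map<uint64_t, DirEntry> directory;   // keyed by line number; absent lines are Invalid

void reconstruct_directory() {
    for (int cpu = 0; cpu < NUM_CPUS; ++cpu)
        for (uint64_t set = 0; set < SETS; ++set)
            for (const CacheLine& l : reconstructed[cpu][set]) {
                uint64_t line = l.tag * SETS + set;
                DirEntry& e = directory[line];                 // all lines start as I
                if (l.dirty) {                                 // dirty in exactly one cache -> M
                    e.state = DirState::Exclusive;
                    e.owner = static_cast<uint32_t>(cpu);
                } else {                                       // clean in one or more caches -> S
                    e.state = DirState::Shared;                // (a mixed dirty/clean line should never arise)
                    e.sharers.set(cpu);
                }
            }
}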
Proof (or intuition) of correctness
• We've got everything the cache has and more!
• So how can MAR be faster?
  • No timing models, no state machines until playback. And when we do playback, we're only interested in the final state, not what happened in between.
• Default scheme time
  • all accesses ∙ T(simulate cache + simulate directory + simulate network)
• MAR time
  • Update: all accesses ∙ T(hash table update)
  • Playback: touched lines ∙ (writes ∙ sharers + reads ∙ cpus)
• In our favor
  • MAR's "all accesses" step is fast
  • Touched lines are tiny compared to all accesses
  • Sharing should be minimal at the application level
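Written out as formulas (the symbols A, L, n_w, n_r, and N are my notation, not the slides'):

\[
T_{\text{default}} \;\approx\; A \cdot T(\text{cache} + \text{directory} + \text{network})
\]
\[
T_{\text{MAR}} \;\approx\; \underbrace{A \cdot T(\text{hash update})}_{\text{update}}
\;+\; \underbrace{L \cdot \bigl(n_w \cdot \text{sharers} + n_r \cdot N\bigr)}_{\text{playback}}
\]

Here A is the number of accesses during warming, L the number of touched lines, n_w and n_r the recorded writes and reads per line, and N the number of CPUs; the win requires T(hash update) to be cheap and L to be much smaller than A, as the slides argue.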
Measured sharing and cache lines
Testing
• UVSIM
  • Models the SGI Origin 3000 CPU (R10K) and most structures
• Used 4 CPUs
• Added MAR update, playback, and stats
Splash2 Benchmarks
• Parallel scientific computing
  • Fast Fourier Transform (FFT)
  • Lower-upper (LU) dense matrix factorization
  • Barnes-Hut n-body interaction
  • Ocean current simulation
• Chosen subset has diverse behavior
(Data from Hennessy & Patterson, 2003)
Remaining work
• Time it!
• Note and reduce space requirements
• Simplify formulas
Future work
• Can we save space in the structure by making it more cache-like?
• Other optimizations (e.g., don't stride, but take advantage of locality during replay)
• What extra state needs to be stored for other schemes (e.g., MESI, MOESI, etc.)?
• A per-processor structure to ease parallel-hosted simulation
• Caveats
  • If application threads get out of sync, we're modeling a correct execution, but not one we'd see in real life
  • Don't forget to sync/barrier when the application syncs
Questions…
Reconstructing Caches and Directory
• Algorithm
  • Reconstruct caches
    • "Stride" through MAR, looking at all cache lines that map to a set
    • For each line
      • Keep track of the most recent write (per CPU)
      • Throw out all reads older than the most recent write
      • If there are readers, then for each CPU, insert tag and read time into a priority queue (sorted by time)
    • For each CPU, for each set of the cache, copy the W most recent reads and writes
  • Reconstruct directory from caches
    • For each CPU's cache, for each address, call an add_sharer()-style function
    • Note that the cache-building step leaves us in a consistent state (e.g., only one cache can be writing)
  • Post-process directory for subtleties
    • If there is only one reader, it's E, not S
Conclusion
• Pros
  • Replaces most of the simulation work with a quick update and O(N) replay
  • Supports multisimulation of cache params and directory schemes (MSI vs. MESI)
• Cons
  • Memory (grows with the number of touched cache lines)
Results
• Benchmarks: the Splash2 paper shows the fraction of time spent in load/store/lock/sync/etc.; make a graph for this
• How far off are certain metrics if you don't warm up the directory?
• Graphs showing timing of my scheme vs. the default (on four Splash2 apps)
• Growth of the structure over time with these apps