
Are We Trading Consistency Too Easily? A Case for Sequential Consistency

Madan Musuvathi (Microsoft Research); Dan Marino, Todd Millstein (UCLA); Abhay Singh, Satish Narayanasamy (University of Michigan)


Presentation Transcript


  1. Are We Trading Consistency Too Easily? A Case for Sequential Consistency
     Madan Musuvathi (Microsoft Research); Dan Marino, Todd Millstein (UCLA); Abhay Singh, Satish Narayanasamy (University of Michigan)

  2. Memory Consistency Model
     • Abstracts the program runtime (compiler + hardware)
       • Hides compiler transformations
       • Hides hardware optimizations, cache hierarchy, …
     • Sequential consistency (SC) [Lamport ’79]: “The result of any execution is the same as if the operations were executed in some sequential order, and the operations of each individual processor (thread) in this sequence appear in the program order”

  3. Sequential Consistency Explained
     int X = F = 0; // F = 1 implies X is initialized
     Thread A: X = 1; F = 1;
     Thread B: t = F; u = X;
     The six SC interleavings and their results:
       X = 1; F = 1; t = F; u = X;   →  t=1, u=1
       X = 1; t = F; F = 1; u = X;   →  t=0, u=1
       X = 1; t = F; u = X; F = 1;   →  t=0, u=1
       t = F; X = 1; F = 1; u = X;   →  t=0, u=1
       t = F; X = 1; u = X; F = 1;   →  t=0, u=1
       t = F; u = X; X = 1; F = 1;   →  t=0, u=0
     In every case, t=1 implies u=1
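
The interleavings on this slide can be enumerated mechanically. The following is a minimal sketch in C, not code from the talk: the names `run_interleaving`, `ALL_ORDERS`, and `flag_implies_data` are made up for illustration, and the two threads are simulated sequentially instead of being run concurrently. Each schedule is a string in which `A` runs the writer's next statement and `B` runs the reader's next statement.

```c
/* Result of one interleaving of the slide's two threads. */
struct outcome { int t, u; };

/*
 * Run one sequentially consistent interleaving.
 * `order` is a string over {'A','B'}:
 *   'A' executes the writer's next statement (X = 1; then F = 1;)
 *   'B' executes the reader's next statement (t = F; then u = X;)
 */
struct outcome run_interleaving(const char *order)
{
    int X = 0, F = 0;
    int a_pc = 0, b_pc = 0;
    struct outcome r = { -1, -1 };

    for (int i = 0; order[i] != '\0'; i++) {
        if (order[i] == 'A') {
            if (a_pc == 0) X = 1; else F = 1;
            a_pc++;
        } else {
            if (b_pc == 0) r.t = F; else r.u = X;
            b_pc++;
        }
    }
    return r;
}

/* The six ways to interleave two threads of two statements each. */
const char *const ALL_ORDERS[6] = {
    "AABB", "ABAB", "ABBA", "BAAB", "BABA", "BBAA"
};

/* Returns 1 if every SC interleaving satisfies "t == 1 implies u == 1". */
int flag_implies_data(void)
{
    for (int i = 0; i < 6; i++) {
        struct outcome r = run_interleaving(ALL_ORDERS[i]);
        if (r.t == 1 && r.u != 1)
            return 0;
    }
    return 1;
}
```

Only the schedule `AABB` lets the reader see `F == 1`, and in that schedule `X` has already been written, which is why the flag idiom is safe under SC.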

  4. Conventional Wisdom
     • SC is slow
       • Disables important compiler optimizations
       • Disables important hardware optimizations
     • Relaxed memory models are faster

  5. Conventional Wisdom
     • SC is slow [✗]
       • Hardware speculation can hide the cost of SC hardware [Gharachorloo et al. ’91, … , Blundell et al. ’09]
       • Compiler optimizations that break SC provide negligible performance improvement [PLDI ’11]
     • Relaxed memory models are faster [?]
       • Need fences for correctness
       • Programmers conservatively add more fences than necessary
       • Libraries use the strongest fence necessary for all clients
       • Fence implementations are slow
       • Efficient fence implementations require speculation support
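
To make "need fences for correctness" concrete, here is a hedged sketch of the flag idiom written against a relaxed model using C11 atomics. It is not from the talk: `publish` and `try_consume` are hypothetical names, and the choice of release/acquire fences around relaxed accesses is one possible way to write it, shown only to illustrate where a programmer has to place fences by hand.

```c
#include <stdatomic.h>

static int X;          /* plain data, published through F */
static atomic_int F;   /* flag: F == 1 is meant to imply X is initialized */

/* Writer: initialize X, then raise the flag. */
void publish(int value)
{
    X = value;
    /* Without this fence a relaxed model may make F visible before X. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&F, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *out if the flag was seen, 0 otherwise. */
int try_consume(int *out)
{
    if (atomic_load_explicit(&F, memory_order_relaxed) == 0)
        return 0;
    /* Without this fence the read of X may be satisfied before the read of F. */
    atomic_thread_fence(memory_order_acquire);
    *out = X;
    return 1;
}
```

Under SC neither fence would be needed, because the program order `X = value; F = 1;` and `t = F; u = X;` is already respected.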

  6. Implementing Sequential Consistency Efficiently
     src: t = X;
     • SC-Preserving Compiler (this talk): every SC behavior of the binary is an SC behavior of the source
     asm: mov eax, [X];
     • SC Hardware: every observed runtime behavior is an SC behavior of the binary

  7. Challenge: Important Compiler Optimizations are not SC-Preserving
     • Example: Common Subexpression Elimination (CSE)
     t, u, v are local variables; X, Y are possibly shared
     Before: L1: t = X*5; L2: u = Y; L3: v = X*5;
     After:  L1: t = X*5; L2: u = Y; L3: v = t;

  8. Common Subexpression Elimination is not SC-Preserving
     Original program:
       Init: X = Y = 0;
       Thread 1: L1: t = X*5; L2: u = Y; L3: v = X*5;
       Thread 2: M1: X = 1; M2: Y = 1;
       u == 1 implies v == 5
     After CSE:
       Init: X = Y = 0;
       Thread 1: L1: t = X*5; L2: u = Y; L3: v = t;
       Thread 2: M1: X = 1; M2: Y = 1;
       possibly u == 1 && v == 0
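
The claim on this slide can be checked by brute force over all interleavings of the two threads. The sketch below is an illustration only, with hypothetical names (`run_hits_bad_outcome`, `bad_outcome_reachable`); it simulates the threads sequentially, picking at each of the five steps which thread runs its next statement.

```c
#include <stdbool.h>

/*
 * Execute one interleaving. Bit i of `mask` (i = 0..4) set means step i runs
 * the next statement of the writer (M1: X = 1; M2: Y = 1;); clear means it
 * runs the next statement of the reader (L1, L2, L3).
 * `use_cse` selects the transformed L3 (v = t) instead of v = X*5.
 * Returns true if the run ends with u == 1 && v == 0.
 */
static bool run_hits_bad_outcome(unsigned mask, bool use_cse)
{
    int X = 0, Y = 0;
    int t = 0, u = 0, v = 0;
    int l_pc = 0, m_pc = 0;

    for (int step = 0; step < 5; step++) {
        if (mask & (1u << step)) {
            if (m_pc == 0) X = 1; else Y = 1;
            m_pc++;
        } else {
            if (l_pc == 0)      t = X * 5;           /* L1 */
            else if (l_pc == 1) u = Y;               /* L2 */
            else                v = use_cse ? t : X * 5;  /* L3 */
            l_pc++;
        }
    }
    return u == 1 && v == 0;
}

/* Number of set bits among the low five bits. */
static int count_bits5(unsigned m)
{
    int n = 0;
    for (int i = 0; i < 5; i++)
        n += (m >> i) & 1u;
    return n;
}

/* True if some SC interleaving produces u == 1 && v == 0. */
bool bad_outcome_reachable(bool use_cse)
{
    for (unsigned mask = 0; mask < 32; mask++) {
        if (count_bits5(mask) != 2)
            continue;  /* the writer has exactly two statements */
        if (run_hits_bad_outcome(mask, use_cse))
            return true;
    }
    return false;
}
```

For the original program no schedule reaches `u == 1 && v == 0`, because seeing `Y == 1` at L2 means `X == 1` was already written before L3. After CSE, the schedule L1, M1, M2, L2, L3 reaches it, since `v` reuses the stale `t`.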

  9. Implementing CSE in an SC-Preserving Compiler
     Before: L1: t = X*5; L2: u = Y; L3: v = X*5;
     After:  L1: t = X*5; L2: u = Y; L3: v = t;
     • Enable this transformation when
       • X is a local variable, or
       • Y is a local variable
     • In these cases, the transformation is SC-preserving
     • Identifying local variables:
       • Compiler-generated temporaries
       • Stack-allocated variables whose address is not taken
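
The enabling condition above amounts to a small predicate. The following sketch assumes a hypothetical `struct var_info` summarizing what the compiler knows about a variable; it is not LLVM's representation, and the function names are invented for illustration.

```c
#include <stdbool.h>

/* Hypothetical summary of what the compiler knows about a variable. */
struct var_info {
    bool compiler_temp;    /* compiler-generated temporary */
    bool stack_allocated;  /* lives on the current thread's stack */
    bool address_taken;    /* its address is taken, so it may be shared */
};

/* A variable is treated as local if no other thread can reach it. */
bool is_local(const struct var_info *v)
{
    return v->compiler_temp || (v->stack_allocated && !v->address_taken);
}

/*
 * Reusing t for the second X*5 across the read of Y is allowed
 * when X is local or Y is local.
 */
bool cse_is_sc_preserving(const struct var_info *x, const struct var_info *y)
{
    return is_local(x) || is_local(y);
}
```

The check is purely syntactic, which matches the talk's point that the modified compiler needs no additional analysis.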

  10. An SC-preserving LLVM Compiler for C programs
     • Modify each of ~70 phases in LLVM to be SC-preserving
       • Without any additional analysis
     • Enable trace-preserving optimizations
       • These do not change the order of memory operations
       • e.g. loop unrolling, procedure inlining, control-flow simplification, dead-code elimination, …
     • Enable transformations on local variables
     • Enable transformations involving a single shared variable
       • e.g. t=X; u=X; v=X; → t=X; u=t; v=t;

  11. Average Performance Overhead is ~2%
     [Bar chart: per-benchmark performance overhead relative to the baseline; off-scale bar labels not recoverable]
     • Baseline: LLVM -O3
     • Experiments on Intel Xeon, 8 cores, 2 threads/core, 6GB RAM

  12. How Far Can an SC-Preserving Compiler Go?
     Source:
       float s, *x, *y; int i;
       s=0;
       for( i=0; i<n; i++ ){
         s += (x[i]-y[i]) * (x[i]-y[i]);
       }
     No optimization:
       float s, *x, *y; int i;
       s=0;
       for( i=0; i<n; i++ ){
         s += (*(x + i*sizeof(float)) - *(y + i*sizeof(float)))
            * (*(x + i*sizeof(float)) - *(y + i*sizeof(float)));
       }
     SC-preserving:
       float s, *x, *y; float *px, *py, *e;
       s=0; py=y; e = &x[n];
       for(px=x; px<e; px++, py++){
         s += (*px-*py) * (*px-*py);
       }
     Full optimization:
       float s, *x, *y; float *px, *py, *e, t;
       s=0; py=y; e = &x[n];
       for(px=x; px<e; px++, py++){
         t = (*px-*py);
         s += t*t;
       }
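
The loop variants on this slide can be written as ordinary C functions to make the difference visible. This is a sketch under assumptions: the function names are invented, `x` and `y` stand for possibly shared arrays, and a real C compiler will apply its own optimizations to each function, so the code shows the shape of each variant and not what any particular compiler emits.

```c
/* Source form: indexes the arrays and reads each element twice. */
float sum_sq_source(const float *x, const float *y, int n)
{
    float s = 0;
    for (int i = 0; i < n; i++)
        s += (x[i] - y[i]) * (x[i] - y[i]);
    return s;
}

/*
 * SC-preserving shape: the index arithmetic on local variables is replaced
 * by pointer increments, but every load of *px and *py is still performed.
 */
float sum_sq_sc_preserving(const float *x, const float *y, int n)
{
    float s = 0;
    const float *py = y;
    const float *e = &x[n];
    for (const float *px = x; px < e; px++, py++)
        s += (*px - *py) * (*px - *py);
    return s;
}

/*
 * Fully optimized shape: the repeated loads are also eliminated by keeping
 * the difference in a local t, which is not SC-preserving if x or y is shared.
 */
float sum_sq_full_opt(const float *x, const float *y, int n)
{
    float s = 0;
    const float *py = y;
    const float *e = &x[n];
    for (const float *px = x; px < e; px++, py++) {
        float t = *px - *py;
        s += t * t;
    }
    return s;
}
```

When no other thread writes the arrays, all three functions compute the same sum; they differ only in how many loads of possibly shared memory they perform per iteration.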

  13. We Can Reduce the FaceSim Overhead (if we cheat a bit)
     • 30% overhead comes from the inability to perform CSE in:
       return MATRIX_3X3<T>(
         x[0]*A.x[0]+x[3]*A.x[1]+x[6]*A.x[2],
         x[1]*A.x[0]+x[4]*A.x[1]+x[7]*A.x[2],
         x[2]*A.x[0]+x[5]*A.x[1]+x[8]*A.x[2],
         x[0]*A.x[3]+x[3]*A.x[4]+x[6]*A.x[5],
         x[1]*A.x[3]+x[4]*A.x[4]+x[7]*A.x[5],
         x[2]*A.x[3]+x[5]*A.x[4]+x[8]*A.x[5],
         x[0]*A.x[6]+x[3]*A.x[7]+x[6]*A.x[8],
         x[1]*A.x[6]+x[4]*A.x[7]+x[7]*A.x[8],
         x[2]*A.x[6]+x[5]*A.x[7]+x[8]*A.x[8] );
     • But argument evaluation in C is nondeterministic
     • The specification explicitly allows overlapped evaluation of function arguments

  14. Improving Performance of an SC-Preserving Compiler
     • Request programmers to reduce shared accesses in hot loops
     • Use sophisticated static analysis
       • Infer more thread-local variables
       • Infer data-race-free shared variables
     • Use program annotations
       • Requires changing the programming language
       • Minimum annotations sufficient to optimize the hot loops
     • Perform load optimizations speculatively
       • Hardware exposes speculative-load optimization to the software
       • Load optimizations reduce the max overhead to 6%

  15. Conclusion
     • Hardware should support strong memory models
       • TSO is efficiently implementable [Mark Hill]
       • Speculation support for SC over TSO is not currently justifiable
       • Can we quantify the programmability cost for TSO?
     • Compiler optimizations should preserve the hardware memory model
     • High-level programming models can abstract TSO/SC
       • Further enable compiler/hardware optimizations
       • Improve programmer productivity, testability, and debuggability

  16. Eager-Load Optimizations
     • Eagerly perform loads or use values from previous loads or stores
     Common Subexpression Elimination:
       Before: L1: t = X*5; L2: u = Y; L3: v = X*5;
       After:  L1: t = X*5; L2: u = Y; L3: v = t;
     Constant/Copy Propagation:
       Before: L1: X = 2; L2: u = Y; L3: v = X*5;
       After:  L1: X = 2; L2: u = Y; L3: v = 10;
     Loop-Invariant Code Motion:
       Before: L1: (empty) L2: for(…) L3: t = X*5;
       After:  L1: u = X*5; L2: for(…) L3: t = u;

  17. Performance Overhead
     [Bar chart: per-benchmark performance overhead; off-scale bar labels not recoverable]
     • Allowing eager-load optimizations alone reduces max overhead to 6%

  18. Correctness Criteria for Eager-Load Optimizations
     • Eager-load optimizations rely on a variable remaining unmodified in a region of code
     • Sequential validity: no modifications to X by the current thread in L1-L3
     • SC-preservation: no modifications to X by any other thread in L1-L3
     L1: t = X*5;   (establish invariant “t == 5*X”)
     L2: *p = q;    (maintain invariant “t == 5*X”)
     L3: v = X*5;   (use invariant “t == 5*X” to transform L3 to v = t;)

  19. Speculatively Performing Eager-Load Optimizations
     Original:
       L1: t = X*5;
       L2: u = Y;
       L3: v = X*5;
     Speculative version:
       L1: t = monitor.load(X, tag) * 5;
       L2: u = Y;
       L3: v = t;
       C4: if (interference.check(tag))
       C5:   v = X*5;
     • On monitor.load, hardware starts tracking coherence messages on X’s cache line
     • The interference check fails if X’s cache line has been downgraded since the monitor.load
     • In our implementation, a single instruction checks interference on up to 32 tags
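
The control flow of the speculative version can be modeled in plain software. The sketch below is only an emulation under stated assumptions: a per-location version counter stands in for the hardware's cache-line tracking, `remote_store` simulates a write by another thread, and all names (`struct monitored`, `monitor_load`, `interference_check`, `speculative_cse`) are hypothetical stand-ins, not the hardware interface from the talk.

```c
/* Software stand-in for a shared location whose cache line would be monitored. */
struct monitored {
    int value;
    unsigned version;  /* bumped on every simulated remote write */
};

/* Stand-in for a hardware tag: remembers what was seen at monitor_load. */
struct monitor_tag {
    const struct monitored *loc;
    unsigned seen;
};

/* Load the value and start "monitoring" the location. */
int monitor_load(const struct monitored *loc, struct monitor_tag *tag)
{
    tag->loc = loc;
    tag->seen = loc->version;
    return loc->value;
}

/* Nonzero if the location was written since the matching monitor_load. */
int interference_check(const struct monitor_tag *tag)
{
    return tag->loc->version != tag->seen;
}

/* Simulated write by another thread. */
void remote_store(struct monitored *loc, int v)
{
    loc->value = v;
    loc->version++;
}

/*
 * The slide's speculative code. `interfere` injects a simulated remote
 * write to X between L1 and L3 so both paths can be exercised.
 */
void speculative_cse(struct monitored *X, const int *Y, int interfere,
                     int *u, int *v)
{
    struct monitor_tag tag;
    int t = monitor_load(X, &tag) * 5;  /* L1 */
    if (interfere)
        remote_store(X, 1);
    *u = *Y;                            /* L2 */
    *v = t;                             /* L3: speculatively reuse t */
    if (interference_check(&tag))       /* C4 */
        *v = X->value * 5;              /* C5: recover by redoing the load */
}
```

When nothing writes `X` in the region, the recovery path is skipped and the reuse of `t` stands; when a write is detected, `v` is recomputed from the current value of `X`, as in the unoptimized code.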
