Sheriff: Precise Detection & Automatic Mitigation of False Sharing. Tongping Liu, Emery Berger, University of Massachusetts, Amherst. Multi-core: expectation is awesome. int count[8]; // global array thread_func(int id) { for (i = 0; i < M; i++) count[id]++; }. Reality is awful.
Sheriff: Precise Detection & Automatic Mitigation of False Sharing
Tongping Liu, Emery Berger
University of Massachusetts, Amherst
Multi-core: expectation is awesome
    int count[8]; // global array
    void thread_func(int id) {
        for (int i = 0; i < M; i++)
            count[id]++; // each thread updates only its own slot
    }
Reality is awful
    int count[8]; // global array
    void thread_func(int id) {
        for (int i = 0; i < M; i++)
            count[id]++; // each thread updates only its own slot
    }
13X (callout on count[id]++;)
False sharing kills scaling
False sharing = performance problem
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; a write by one thread invalidates the line in the other core's cache]
20X slower
Interleaved writes cause cache invalidations
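The invalidation pattern above only occurs when the threads' data lands on the same cache line. A minimal sketch of that check follows, assuming a 64-byte line (common on x86-64 but not universal); the names CACHE_LINE_SIZE, cache_line_of, and lines_spanned are illustrative and do not come from Sheriff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Check your target CPU. */
#define CACHE_LINE_SIZE 64

/* Index of the cache line that holds the given address. */
static uintptr_t cache_line_of(const void *addr) {
    return (uintptr_t)addr / CACHE_LINE_SIZE;
}

/* Number of distinct cache lines touched by count[0..n-1]. */
static size_t lines_spanned(const int *count, size_t n) {
    assert(n > 0);
    return (size_t)(cache_line_of(&count[n - 1]) - cache_line_of(&count[0])) + 1;
}
```

With 4-byte ints, an 8-element array such as count[8] occupies 32 bytes, so all eight per-thread counters fall on at most two lines under this assumption, and every increment competes with the other threads' increments for those lines.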
False sharing is invisible
    me = 1; you = 1;                 // globals
    me = new Foo; you = new Bar;     // heap
    class X { int me; int you; };    // fields
    arr[me] = 12; arr[you] = 13;     // array indices
False sharing detector: instrument every memory access
Related work:
• S. M. Gunther et al. [WBIA 2009]
• C. Liu [Master's thesis, 2009]
• Q. Zhao et al. [VEE 2011]
Shortcomings:
• Slow
• No actionable output
• False positives
False sharing detector: state of the art
PTU
[Screenshot of PTU output: + 850 lines…]
Shortcomings:
• Imprecise
• Too many false positives
Sheriff-Detect
• No false positives
• Efficient (20%)
• Actionable output
    Object has 13767 interleaving writes.
    The object starts at 0xd5c8e160, length 32.
    Allocation call stack:
    0: word_count.c: 136
    1: word_count.c: 444
Threads:
    t1 = spawn f(x);
    t2 = spawn g(y);
    sync;
Processes:
    if (!fork()) f(x);
    if (!fork()) g(y);
Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
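The fork() form above gives each task its own address space, so a child's writes are invisible to the parent. The sketch below shows only that isolation property with plain POSIX calls; it is not Sheriff's implementation, which also has to share state and commit changes. The helper names run_isolated and bump are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0; /* global: each process gets a private copy after fork() */

/* Run work() in a child process and wait for it.
   Returns the low 8 bits of the child's counter, or -1 on failure. */
static int run_isolated(void (*work)(void)) {
    assert(work != NULL);
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {            /* child: writes land in its private copy */
        work();
        _exit(counter & 0xff); /* report the child's view via the exit status */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example task: modifies the global. */
static void bump(void) { counter += 5; }
```

Calling run_isolated(bump) reports 5 from the child while the parent's counter stays 0, which is why a process-based runtime needs an explicit step to publish changes.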
Sheriff: isolated execution
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; main memory holds a separate region for Process 1 and for Process 2, plus the global state]
Sheriff: isolated execution
Pthreads:
    Lock();
    XXX;
    Unlock();
    YYY;
    Lock();
Sheriff:
    Begin_isolated_execution
    XXX; // isolated execution
    Commit_local_changes
    Begin_isolated_execution
    YYY; // isolated execution
    Commit_local_changes
Sheriff-Detect: Find false sharing at commit points
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; main memory holds a separate region for each process plus the global state, where interleaved writes are found]
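One way to picture a commit point is a diff against a snapshot: each process compares its private copy of a page with a "twin" saved when the isolated region began, and publishes only the words it changed. The slides do not spell out this mechanism, so treat the following as a sketch under that assumption; PAGE_WORDS and commit_local_changes are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_WORDS 8 /* hypothetical tiny "page", for illustration only */

/* Copy into shared[] only the words that differ between the private working
   copy (local) and the snapshot taken at the start of the isolated region
   (twin). Returns the number of words committed. */
static size_t commit_local_changes(int *shared, const int *twin,
                                   const int *local, size_t n) {
    assert(shared != NULL && twin != NULL && local != NULL);
    size_t committed = 0;
    for (size_t i = 0; i < n; i++) {
        if (local[i] != twin[i]) { /* this process wrote word i */
            shared[i] = local[i];
            committed++;
        }
    }
    return committed;
}
```

Because only changed words are written back, two processes that modified different words of the same page can commit in either order without overwriting each other, and the set of changed words is exactly the information a detector needs at that point.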
Output: PTU vs. Sheriff-Detect
    Benchmark        PTU      Sheriff-Detect
    kmeans           1916     2
    reverse_index    N/A      5
    Total            2,664    15
Example case study: linear_regression
Allocation call stack:
    0: linear_regression-pthread.c: line number: 136
Step 1: find allocation site
    136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);
Step 2: find references
    152: pthread_create(&tid_args[i].tid, &attr, linear_regression_pthread, (void*)&tid_args[i]) != 0);
Example case study: linear_regression
    void *linear_regression_pthread(void *args_in) {
        lreg_args* args = (lreg_args*)args_in;
        ……
        for (i = 0; i < args->num_elems; i++) {
            args->SX += args->points[i].x;
            args->SXX += args->points[i].x * args->points[i].x;
            ……
"lreg_args" is not aligned
Example case study: linear_regression
Step 3: fix false sharing using padding
    typedef struct {
        …..
        char padding[128]; // Padding to avoid false sharing
    } lreg_args;
9.2X
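The effect of the padding can be checked by asking whether the hot fields of two adjacent array elements can still touch a common cache line. The sketch below assumes 64-byte lines; padded_args is a hypothetical stand-in for lreg_args (its real field list is elided in the slide), and ranges_share_line is an illustrative helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */

/* Hypothetical stand-in for lreg_args; only two accumulators are shown. */
typedef struct {
    long long SX;
    long long SXX;
    char padding[128]; /* padding to avoid false sharing */
} padded_args;

/* Returns 1 if byte ranges [a, a+a_len) and [b, b+b_len) touch a common
   cache line, 0 otherwise. */
static int ranges_share_line(const void *a, size_t a_len,
                             const void *b, size_t b_len) {
    assert(a_len > 0 && b_len > 0);
    uintptr_t a_first = (uintptr_t)a / CACHE_LINE_SIZE;
    uintptr_t a_last = ((uintptr_t)a + a_len - 1) / CACHE_LINE_SIZE;
    uintptr_t b_first = (uintptr_t)b / CACHE_LINE_SIZE;
    uintptr_t b_last = ((uintptr_t)b + b_len - 1) / CACHE_LINE_SIZE;
    return a_first <= b_last && b_first <= a_last;
}
```

With 128 bytes of padding, the accumulators of element i and element i+1 start more than two line lengths apart, so they cannot share a line wherever the array happens to be placed. The cost is extra memory per element, which is one reason a fix like this is applied only where a detector reports a problem.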
Sheriff-Detect performance
[Chart: labeled values 11.4 and 8.2; 20%]
?
Speedup due to isolation
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; main memory holds a separate region for Process 1 and for Process 2, plus the global state]
Sheriff-Protect
Prevents ALL false sharing
Basis of Sheriff-Protect
Sheriff-Detect - [image not recovered] = Sheriff-Protect
[Chart: labeled values 8.2 and 11.4; 13%]
Sheriff libraries: easy to use
Sheriff-Detect
    % g++ myprog.cpp -lsheriffdetect -o myprog
Sheriff-Protect
    % g++ myprog.cpp -lsheriffprotect -o myprog
Workflow: using Sheriff
• Run the original program with Sheriff-Detect
• False sharing found: fix it (padding, alignment, local variables), then run the modified program with libpthread
• Fix would degrade performance or use too much memory, or there is no source code or no time: run the original program with Sheriff-Protect
• No false sharing: run the original program with libpthread
Why no false positives?
(1) Reports only actual interleaved writes (the performance problem)
(2) Word status: rules out true sharing
(3) Avoids heap reuse problems
(4) Our experimental results bear this out
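Point (2) can be illustrated with per-word bookkeeping: record which thread wrote each word of a cache line, and flag the line only when different threads own different words. This is a simplified sketch, not Sheriff's actual data structure; it ignores reads and write counts, assumes 64-byte lines with 4-byte words, and all names (line_status, record_write, is_false_sharing) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_LINE 16     /* assumption: 64-byte line, 4-byte words */
#define NO_WRITER (-1)        /* word never written */
#define MULTIPLE_WRITERS (-2) /* word written by more than one thread */

typedef struct {
    int writer[WORDS_PER_LINE]; /* sole writer's thread id, or a marker above */
} line_status;

static void line_init(line_status *ls) {
    for (size_t i = 0; i < WORDS_PER_LINE; i++) ls->writer[i] = NO_WRITER;
}

/* Record that thread tid wrote the given word of the line. */
static void record_write(line_status *ls, size_t word, int tid) {
    assert(word < WORDS_PER_LINE && tid >= 0);
    if (ls->writer[word] == NO_WRITER) ls->writer[word] = tid;
    else if (ls->writer[word] != tid) ls->writer[word] = MULTIPLE_WRITERS;
}

/* Returns 1 if two different threads each solely own some word of the line. */
static int is_false_sharing(const line_status *ls) {
    int first = NO_WRITER;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        int w = ls->writer[i];
        if (w < 0) continue;           /* unwritten, or a truly shared word */
        if (first == NO_WRITER) first = w;
        else if (w != first) return 1; /* distinct threads, distinct words */
    }
    return 0;
}
```

Two threads writing words 0 and 1 are flagged, while two threads writing the same word are not, because that word is genuinely shared and padding would not help.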
Key Optimizations
• Isolate small heap objects and globals
• Adaptive false sharing prevention
• Protect on long transactions only
Key Optimizations
• Find shared pages: falsely shared objects can only be on shared pages
• Reduce overhead
    • Use sampling
    • Sample only long transactions (> 5 ms)