Sheriff: Precise Detection & Automatic Mitigation of False Sharing

Sheriff: Precise Detection & Automatic Mitigation of False Sharing
Tongping Liu, Emery Berger
University of Massachusetts, Amherst

Multi-core: expectation is awesome
    int count[8]; // Global array
    void thread_func(int id) {
        for (int i = 0; i < M; i++)
            count[id]++;
    }

Reality is awful
    int count[8]; // Global array
    void thread_func(int id) {
        for (int i = 0; i < M; i++)
            count[id]++;
    }
Highlighted: count[id]++; (13X)
False sharing kills scaling
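The counter loop on these slides can be turned into a small harness. The following is a minimal sketch, not the authors' benchmark: `run_counters`, `NUM_SLOTS`, and the `iterations` variable (standing in for `M`) are names assumed here, and the 64-byte cache line mentioned in the comment is a typical value rather than a guarantee.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define NUM_SLOTS 8

/* Adjacent ints: with a typical 64-byte line, the slots usually share
   one or two cache lines. volatile keeps the increments in the loop. */
static volatile int count[NUM_SLOTS];
static long iterations;  /* assumed name; plays the role of M on the slide */

static void *thread_func(void *arg) {
    int id = (int)(intptr_t)arg;
    for (long i = 0; i < iterations; i++)
        count[id]++;     /* each thread writes only its own slot */
    return NULL;
}

/* Runs nthreads workers and returns the sum of their counters. */
long run_counters(int nthreads, long iters) {
    pthread_t tids[NUM_SLOTS];
    assert(nthreads >= 1 && nthreads <= NUM_SLOTS);
    iterations = iters;
    for (int t = 0; t < NUM_SLOTS; t++)
        count[t] = 0;
    for (int t = 0; t < nthreads; t++) {
        int rc = pthread_create(&tids[t], NULL, thread_func,
                                (void *)(intptr_t)t);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    long total = 0;
    for (int t = 0; t < nthreads; t++)
        total += count[t];
    return total;
}
```

The program is correct because no two threads touch the same slot; the cost is purely in cache traffic. Timing `run_counters` with this layout and again with each slot spaced a cache line apart is one way to observe the slowdown, whose size depends on the hardware.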

False sharing = performance problem
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" arrow runs between the caches.]

False sharing = performance problem
[Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; an "Invalidate" arrow runs between the caches. Label: 20X slower.]
Interleaved writes cause cache invalidations

False sharing is invisible
    me = 1; you = 1;                 // globals
    me = new Foo; you = new Bar;     // heap
    class X { int me; int you; };    // fields
    arr[me] = 12; arr[you] = 13;     // array indices

False sharing detector: instrument every memory access
Related work:
• S. M. Gunther et al. [WBIA 2009]
• C. Liu [Master thesis 2009]
• Q. Zhao et al. [VEE 2011]
Shortcomings:
• Slow
• No actionable output
• False positives

False sharing detector: state of the art (PTU)
[Screenshot of PTU output: + 850 lines…]
Shortcomings:
• Imprecise
• Too many false positives

Sheriff-Detect
• No false positives
• Efficient (20%)
• Actionable output
    Object has 13767 interleaving writes.
    The object starts at 0xd5c8e160, length 32.
    Allocation call stack:
    0: word_count.c: 136
    1: word_count.c: 444

    t1 = spawn f(x);        if (!fork()) f(x);
    t2 = spawn g(y);        if (!fork()) g(y);
    sync;
Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
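The `if (!fork()) f(x);` idiom above can be sketched with plain POSIX calls. This is an illustration only, assuming hypothetical helpers `spawn` and `sync_child` and toy tasks `task_f` and `task_g`; it omits the shared global state and commit step that the following slides describe, so each child's writes simply stay private.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int global_state = 0;  /* written by a child; the parent's copy is unaffected */

/* Hypothetical helper: run fn(arg) in a child process. */
pid_t spawn(int (*fn)(int), int arg) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(fn(arg) & 0xff);  /* child: report the result as exit status */
    return pid;                 /* parent: child's pid, or -1 on failure */
}

/* Hypothetical helper for the `sync` step: wait for one child and
   return its exit status, or -1 on error. */
int sync_child(pid_t pid) {
    int status = 0;
    if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Toy tasks standing in for f(x) and g(y) on the slide. */
int task_f(int x) { global_state = x; return x + 1; }
int task_g(int y) { return y * 2; }
```

Because each child runs in its own address space, the assignment to `global_state` inside `task_f` never reaches the parent. Separate address spaces also mean the two tasks cannot invalidate each other's cache lines through false sharing.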

Sheriff: isolated execution
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; in main memory, Process 1 and Process 2 each have a private region alongside the shared Global State.]

Sheriff: isolated execution
    Pthreads        Sheriff
    Lock();         Begin_isolated_execution
    XXX;            XXX; // isolated execution
    Unlock();       Commit_local_changes
                    Begin_isolated_execution
    YYY;            YYY; // isolated execution
    Lock();         Commit_local_changes

Snapshot and diffing: local changes
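Snapshot-and-diff can be illustrated in a few lines. The sketch below is a simplification under stated assumptions, not Sheriff's implementation: `page_view`, `begin_isolated`, `commit_changes`, and the page size `PAGE_WORDS` are hypothetical names, and a plain array stands in for the shared global state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_WORDS 1024  /* assumed: one 4 KiB page of 4-byte words */

typedef struct {
    int working[PAGE_WORDS];   /* private copy the thread writes to */
    int snapshot[PAGE_WORDS];  /* contents when isolated execution began */
} page_view;

/* Begin isolated execution: take a private copy and a snapshot. */
void begin_isolated(page_view *v, const int *shared) {
    assert(v != NULL && shared != NULL);
    memcpy(v->working, shared, sizeof v->working);
    memcpy(v->snapshot, shared, sizeof v->snapshot);
}

/* Commit: diff the working copy against the snapshot and write back
   only the words this thread changed. Returns the number of changes. */
size_t commit_changes(const page_view *v, int *shared) {
    size_t changed = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (v->working[i] != v->snapshot[i]) {
            shared[i] = v->working[i];
            changed++;
        }
    }
    return changed;
}
```

Writing back only the differing words means two threads that modified different words of the same page can both commit without overwriting each other's updates.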

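Sheriff-Detect: Find false sharing at commit points
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; in main memory, Process 1 and Process 2 each have a private region alongside the shared Global State. Label: interleaved writes.]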

Output: PTU vs. Sheriff-Detect
    Benchmark        PTU      Sheriff-Detect
    kmeans           1916     2
    reverse_index    N/A      5
    Total            2,664    15

Example case study: linear_regression
    Allocation call stack:
    0: linear_regression-pthread.c: line number: 136
Step 1: find allocation site
    136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs);
Step 2: find references
    152: pthread_create(&tid_args[i].tid, &attr, linear_regression_pthread, (void*)&tid_args[i]) != 0);

Example case study: linear_regression
    void *linear_regression_pthread(void *args_in) {
        lreg_args* args = (lreg_args*)args_in;
        ……
        for (i = 0; i < args->num_elems; i++) {
            args->SX += args->points[i].x;
            args->SXX += args->points[i].x * args->points[i].x;
            ……
"lreg_args" is not aligned

Example case study: linear_regression
Step 3: fix false sharing using padding
    typedef struct {
        …..
        char padding[128]; // Padding to avoid false sharing
    } lreg_args;
9.2X
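An alternative to a hand-sized padding array is to let the compiler round the struct up to the cache line size. The sketch below assumes C11 `alignas`, a 64-byte line (`CACHE_LINE` is an assumption and varies by hardware), and hypothetical stand-in types `acc_unpadded` and `acc_padded` rather than the real `lreg_args`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; hardware-dependent */

/* Unpadded: neighbouring array elements can share a cache line. */
typedef struct {
    long sx;
    long sxx;
} acc_unpadded;

/* Aligned: alignas raises the struct's alignment to CACHE_LINE, and
   sizeof is rounded up to match, so array elements never share a line. */
typedef struct {
    alignas(CACHE_LINE) long sx;
    long sxx;
} acc_padded;

/* Returns 1 if the two addresses fall in the same cache line. */
int same_cache_line(const void *a, const void *b) {
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

For heap arrays such as the `calloc` call in Step 1, the allocation itself would also need cache-line alignment (for example via C11 `aligned_alloc`), since a general-purpose allocator is not required to return 64-byte-aligned blocks.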

Sheriff-Detect performance
[Chart: labeled values 11.4 and 8.2; callout 20%.]

Speedup due to isolation
[Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; in main memory, Process 1 and Process 2 each have a private region alongside the shared Global State.]

Sheriff-Protect
Prevents ALL false sharing

Basis of Sheriff-Protect
[Diagram: an equation of the form "… - … = …" relating Sheriff-Detect and Sheriff-Protect; the other operand is not recoverable.]

[Chart: labeled values 8.2 and 11.4; callout 13%.]

Sheriff libraries: easy to use
Sheriff-Detect
    % g++ myprog.cpp -lsheriffdetect -o myprog
Sheriff-Protect
    % g++ myprog.cpp -lsheriffprotect -o myprog

Workflow: using Sheriff
[Flowchart: run the original program with Sheriff-Detect.
• No false sharing: original program + libpthread
• Fix by hand (padding, alignment, local variables): modified program + libpthread
• Fix not practical (degrades performance, too much memory, no source code, no time): original program + Sheriff-Protect]


Why no false positives?
(1) Reports only actual interleaved writes (the performance problem)
(2) Word status: excludes true sharing
(3) Avoids heap re-use problems
(4) The experimental results illustrate this
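The word-status idea in point (2) can be sketched as per-word bookkeeping inside one cache line. This is a simplified illustration, not Sheriff's actual data structure: `line_status`, `record_write`, `classify`, and the 16-word line are all assumptions made for the example.

```c
#include <assert.h>

#define WORDS_PER_LINE 16   /* assumed: 64-byte line, 4-byte words */
#define NO_WRITER      (-1)
#define MULTI_WRITER   (-2)

typedef struct {
    int writer[WORDS_PER_LINE];  /* thread id that wrote the word, or a marker */
} line_status;

typedef enum { NOT_SHARED, TRUE_SHARING, FALSE_SHARING } sharing_kind;

void line_init(line_status *ls) {
    for (int i = 0; i < WORDS_PER_LINE; i++)
        ls->writer[i] = NO_WRITER;
}

/* Record that thread tid changed the given word (e.g. found at commit). */
void record_write(line_status *ls, int word, int tid) {
    assert(word >= 0 && word < WORDS_PER_LINE && tid >= 0);
    if (ls->writer[word] == NO_WRITER)
        ls->writer[word] = tid;
    else if (ls->writer[word] != tid)
        ls->writer[word] = MULTI_WRITER;  /* same word, several threads */
}

/* False sharing: different threads own different words of the line.
   True sharing: some word was written by more than one thread. */
sharing_kind classify(const line_status *ls) {
    int owner = NO_WRITER;
    int saw_multi = 0;
    for (int i = 0; i < WORDS_PER_LINE; i++) {
        int w = ls->writer[i];
        if (w == NO_WRITER)
            continue;
        if (w == MULTI_WRITER) {
            saw_multi = 1;
            continue;
        }
        if (owner == NO_WRITER)
            owner = w;
        else if (owner != w)
            return FALSE_SHARING;
    }
    return saw_multi ? TRUE_SHARING : NOT_SHARED;
}
```

Tracking writers per word rather than per line is what lets a detector separate lines that threads genuinely share from lines where they merely sit next to each other.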

Key Optimizations
• Isolate small heap objects and globals
• Adaptive false sharing prevention
• Protect on long transactions only

Key Optimizations
• Find sharing pages: false sharing objects -> shared page
• Reduce overhead
    • Using sampling
    • Sampling only for long transactions (> 5 ms)
