Sheriff : Precise Detection & Automatic Mitigation of False Sharing

Tongping Liu, Emery Berger. University of Massachusetts, Amherst.

Presentation Transcript


  1. Sheriff: Precise Detection & Automatic Mitigation of False Sharing. Tongping Liu, Emery Berger, University of Massachusetts, Amherst

  2. Multi-core: expectation is awesome. int count[8]; /* global array */ void thread_func(int id) { for (int i = 0; i < M; i++) count[id]++; }

  3. Reality is awful. int count[8]; /* global array */ void thread_func(int id) { for (int i = 0; i < M; i++) count[id]++; } [Chart label: 13X.] Highlighted statement: count[id]++; False sharing kills scaling.

  4. False sharing = performance problem. [Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; "Invalidate" marked between the caches.]

  5. False sharing = performance problem: 20X slower. Interleaved writes cause cache invalidations. [Diagram: Thread 1 on Core 1 and Thread 2 on Core 2, each core with its own cache above main memory; "Invalidate" marked between the caches.]
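
A small sketch of why the `count` array from the earlier slides is exposed: it works out how many cache lines the eight counters occupy. The 64-byte line size and the 64-byte alignment of the array are assumptions (the slides state neither), and `distinct_lines` is a hypothetical helper, not part of Sheriff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; not stated on the slides */

/* Aligned (an assumption) so the array is guaranteed to start on a line boundary. */
static _Alignas(CACHE_LINE) int count[8];

/* Number of distinct cache lines touched by n contiguous elements of elem_size bytes. */
static size_t distinct_lines(const void *base, size_t elem_size, size_t n)
{
    assert(base != NULL);
    if (n == 0 || elem_size == 0)
        return 0;
    uintptr_t first = (uintptr_t)base / CACHE_LINE;
    uintptr_t last  = ((uintptr_t)base + elem_size * n - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}
```

Under these assumptions, `distinct_lines(count, sizeof count[0], 8)` covers a single line when `int` is 4 bytes, so every thread's `count[id]++` writes to the same line even though no two threads touch the same element.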

  6. False sharing is invisible. Globals: me = 1; you = 1; Heap: me = new Foo; you = new Bar; Fields: class X { int me; int you; }; Array indices: arr[me] = 12; arr[you] = 13;

  7. False sharing detector: instrument every memory access. Related work: • S. M. Gunther et al. [WBIA 2009] • C. Liu [Master's thesis, 2009] • Q. Zhao et al. [VEE 2011]. Shortcomings: • Slow • No actionable output • False positives

  8. False sharing detector: state of the art (PTU). [Screenshot of PTU output, "+ 850 lines…".] Shortcomings: • Imprecise • Too many false positives

  9. Sheriff-Detect: • No false positives • Efficient (20%) • Actionable output. Sample output: Object has 13767 interleaving writes. The object starts at 0xd5c8e160, length 32. Allocation call stack: 0: word_count.c: 136 1: word_count.c: 444

  10. Threads as processes: t1 = spawn f(x); t2 = spawn g(y); sync; → if (!fork()) f(x); if (!fork()) g(y); Related work: Grace [OOPSLA 2009], Dthreads [SOSP 2011]
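
The fork-per-thread idea on this slide can be sketched roughly as follows. This is only an illustration, not Sheriff's implementation: the functions `f`, `g`, and `spawn_as_fork` are hypothetical, and results are returned through a small `MAP_SHARED` region for simplicity, whereas the later slides describe private working copies that are committed at synchronization points.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical "thread" bodies: each writes only its own result slot. */
static void f(int *slot, int x) { *slot = x * 2; }
static void g(int *slot, int y) { *slot = y + 1; }

/* Run f and g as child processes, then "sync" by waiting for both.
   Returns 0 on success and fills out[0], out[1]; returns -1 on failure. */
static int spawn_as_fork(int x, int y, int out[2])
{
    assert(out != NULL);
    int *res = mmap(NULL, 2 * sizeof *res, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (res == MAP_FAILED)
        return -1;

    pid_t p1 = fork();
    if (p1 == 0) { f(&res[0], x); _exit(0); }   /* child 1 */
    pid_t p2 = (p1 > 0) ? fork() : -1;
    if (p2 == 0) { g(&res[1], y); _exit(0); }   /* child 2 */

    int ok = (p1 > 0 && p2 > 0);
    if (p1 > 0 && waitpid(p1, NULL, 0) < 0) ok = 0;   /* sync */
    if (p2 > 0 && waitpid(p2, NULL, 0) < 0) ok = 0;
    if (ok) { out[0] = res[0]; out[1] = res[1]; }
    munmap(res, 2 * sizeof *res);
    return ok ? 0 : -1;
}
```

Because each child has its own address space, writes to ordinary (non-shared) memory in one child never reach the other; only the explicitly shared region is visible to the parent.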

  11. Sheriff: isolated execution. [Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; main memory holds a region for Process 1, a region for Process 2, and the global state.]

  12. Sheriff: isolated execution. Pthreads: Lock(); XXX; Unlock(); YYY; Lock(); Sheriff: Begin_isolated_execution; XXX; // isolated execution; Commit_local_changes; Begin_isolated_execution; YYY; // isolated execution; Commit_local_changes

  13. Snapshot and diffing: local changes
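
A minimal sketch of snapshot-and-diff, assuming a byte-granularity comparison (the slide does not give the granularity). The function name `commit_local_changes` echoes the earlier slide, but the signature is hypothetical: `snapshot` is the copy taken when isolated execution began, `local` is the process's working copy, and `shared` is the global state.

```c
#include <assert.h>
#include <stddef.h>

/* Copy into `shared` only the bytes this process modified since `snapshot`
   was taken. Returns the number of bytes committed. */
static size_t commit_local_changes(unsigned char *shared,
                                   const unsigned char *snapshot,
                                   const unsigned char *local,
                                   size_t len)
{
    assert(shared != NULL && snapshot != NULL && local != NULL);
    size_t changed = 0;
    for (size_t i = 0; i < len; i++) {
        if (local[i] != snapshot[i]) {   /* a local change */
            shared[i] = local[i];
            changed++;
        }
    }
    return changed;
}
```

Bytes the process did not touch are left alone, so updates that other processes have already committed to `shared` survive the commit.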

  14. Sheriff-Detect: find false sharing at commit points. [Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; main memory holds a region for Process 1, a region for Process 2, and the global state, with "Interleaved writes" marked.]

  15. Output: PTU vs. Sheriff-Detect
       Benchmark        PTU      Sheriff-Detect
       kmeans           1,916    2
       reverse_index    N/A      5
       Total            2,664    15

  16. Example case study: linear_regression. Allocation call stack: 0: linear_regression-pthread.c: line number: 136. Step 1: find allocation site. 136: tid_args = (lreg_args *)calloc(sizeof(lreg_args), num_procs); Step 2: find references. 152: pthread_create(&tid_args[i].tid, &attr, linear_regression_pthread, (void*)&tid_args[i]) != 0);

  17. Example case study: linear_regression. void *linear_regression_pthread(void *args_in) { lreg_args* args = (lreg_args*)args_in; …… for (i = 0; i < args->num_elems; i++) { args->SX += args->points[i].x; args->SXX += args->points[i].x * args->points[i].x; …… Problem: "lreg_args" is not aligned.

  18. Example case study: linear_regression. Step 3: fix false sharing using padding. typedef struct { ….. char padding[128]; /* padding to avoid false sharing */ } lreg_args; [Chart label: 9.2X.]
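
The effect of the padding can be checked with a little layout arithmetic. The sketch below uses two hypothetical structs with only two of the accumulator fields (the real `lreg_args` has more, elided on the slide) and assumes 64-byte cache lines; it measures the distance between the last hot field of one array element and the first hot field of the next.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

/* Hypothetical stand-ins for lreg_args, before and after the fix. */
typedef struct { long long SX; long long SXX; } unpadded_args;
typedef struct { long long SX; long long SXX; char padding[128]; } padded_args;

/* Bytes from the last hot field of element i to the start of element i+1. */
static size_t neighbour_gap(size_t elem_size, size_t last_field_offset)
{
    assert(last_field_offset < elem_size);
    return elem_size - last_field_offset;
}

/* Fields closer than one line apart may land on the same cache line. */
static int neighbours_can_share_line(size_t gap)
{
    return gap < CACHE_LINE;
}
```

With the padding, fields that belong to different threads start at least one full cache line apart, so they cannot fall on the same line regardless of where the array begins.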

  19. Sheriff-Detect performance. [Chart; surviving labels: 11.4, 8.2, 20%.]

  20. Speedup due to isolation. [Diagram: Process 1 on Core 1 and Process 2 on Core 2, each core with its own cache; main memory holds a region for Process 1, a region for Process 2, and the global state.]

  21. Sheriff-Protect: prevents ALL false sharing

  22. Basis of Sheriff-Protect. [Diagram: Sheriff-Detect − (component not captured in the transcript) = Sheriff-Protect.]

  23. [Chart; surviving labels: 8.2, 11.4, 13%.]

  24. Sheriff libraries: easy to use. Sheriff-Detect: % g++ myprog.cpp -lsheriffdetect -o myprog Sheriff-Protect: % g++ myprog.cpp -lsheriffprotect -o myprog

  25. Workflow: using Sheriff. Run the original program with Sheriff-Detect in place of libpthread. No false sharing: keep the original program with libpthread. False sharing found: fix it with padding, alignment, or local variables, and run the modified program with libpthread. If the fix degrades performance or uses too much memory, or there is no source code or no time, run the original program with Sheriff-Protect.

  26. [Chart; surviving labels: 8.2, 11.4, 13%.]

  27. Why no false positives? (1) Reports only actual interleaved writes (the performance problem) (2) Word status: rules out true sharing (3) Avoids heap reuse problems (4) The experimental results bear this out.

  28. Key Optimizations • Isolate small heap objects and globals • Adaptive false sharing prevention • Protect long transactions only

  29. Key Optimizations • Find shared pages: objects with false sharing are on shared pages • Reduce overhead: use sampling; sample only long transactions (> 5 ms)
