Predator: Predictive False Sharing Detection
Tongping Liu*, Chen Tian, Ziang Hu, Emery Berger*
• *University of Massachusetts Amherst
• Huawei US Research Center
Parallelism: Expectation is Awesome
Parallel program:
    int count[8];
    int W;
    void increment(int S) {
        for (in = S; in < S + W; in++)
            for (j = 0; j < 1M; j++)
                count[in]++;
    }
    int main(int THREADS) {
        W = 8 / THREADS;
        for (i = 0; i < 8; i += W)
            spawn(increment, i);
    }
[Chart: expected runtime (s)]
Parallelism: Reality is Awful
Parallel program:
    int count[8];
    int W;
    void increment(int S) {
        for (in = S; in < S + W; in++)
            for (j = 0; j < 1M; j++)
                count[in]++;
    }
    int main(int THREADS) {
        W = 8 / THREADS;
        for (i = 0; i < 8; i += W)
            spawn(increment, i);
    }
[Chart: runtime (s), reality vs. expectation, with the gap labeled "False sharing"]
False sharing slows the program by 13X
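To see why the eight counters in the slide's program collide, here is a minimal sketch that maps each `count[i]` to a cache line index. It assumes a 64-byte cache line and that `count[0]` starts on a line boundary; neither is stated on this slide, and the helper names `line_of` and `line_of_count` are hypothetical.

```c
#include <stddef.h>

/* Assumed cache line size in bytes; 64 is common but not universal. */
#define LINE_SIZE 64

/* Hypothetical helper: index of the cache line holding a byte offset. */
static size_t line_of(size_t byte_offset) {
    return byte_offset / LINE_SIZE;
}

/* Cache line of count[i], assuming count[0] sits on a line boundary. */
static size_t line_of_count(size_t i) {
    return line_of(i * sizeof(int));
}
```

Under these assumptions `count[0]` through `count[7]` all fall on the same line, so an increment by one thread can invalidate that line in every other thread's cache even though no two threads touch the same counter.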
False Sharing in Real Applications
False sharing slows MySQL by 50%
False Sharing vs. True Sharing
[Figure: Tasks 1 to 4 accessing a cache line, with one access pattern labeled false sharing and the other true sharing]
False Sharing Causes Performance Problems
[Figure: Thread 1 on Core 1 and Thread 2 on Core 2, each with its own cache above main memory, with an "Invalidate" message between the caches]
• Cache line: basic unit of data transfer
• Interleaved accesses cause cache invalidations
False Sharing is Everywhere
    me = 1; you = 1;                    // globals
    me = new Foo; you = new Bar;        // heap
    class X { int me; int you; };       // fields
    array[me] = 12; array[you] = 13;    // array indices
False Sharing is Hard to Diagnose
Multiple experts worked together to diagnose a MySQL scalability issue (1.5M LOC)
Problems of Existing Tools
• No precise information / false positives
  • WBIA’09, VEE’11, EuroSys’13, SC’13
• Accurate & precise
  • OOPSLA’11 (cannot detect read-write false sharing)
Shared problem: they only detect observed false sharing
False Sharing Causes Performance Problems
[Figure: Task 1 on Core 1 and Task 2 on Core 2, each with its own cache above main memory]
Interleaved accesses → cache invalidations → performance problems
• Detect false sharing causing performance problems
• Find cache lines with many cache invalidations
Find Lines with Many Invalidations
[Figure: memory (globals, heap) divided into cache lines]
Track cache invalidations on each cache line
Track Cache Invalidations
• Conservative assumptions
  • Each thread runs on a different core with its private cache
  • Infinite cache capacity
• Hardware-based approach
  • Needs hardware support
  • No portability
• Simulation-based approach
  • Needs hardware info such as cache hierarchy, cache capacity
  • Very slow
Predator: based on memory access history of each cache line
Track Cache Invalidations
Each entry: {Thread ID, Access Type}
[Figure: a sequence over time of reads (r) and writes (w) by threads T1 and T2 to one cache line, with the running number of invalidations]
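The counting idea can be sketched as follows. This is not Predator's history table of {Thread ID, Access Type} entries; it swaps in a simpler per-line ownership model that follows the conservative assumptions from the earlier slide (every thread has its own private cache of unlimited capacity). The type and function names, and the `MAX_THREADS` limit, are assumptions made for this illustration.

```c
#include <stdbool.h>

#define MAX_THREADS 8   /* assumption for this sketch */

typedef enum { ACCESS_READ, ACCESS_WRITE } access_type;

/* State of one cache line: which threads currently hold a valid copy,
   and how many writes have invalidated someone else's copy. */
typedef struct {
    bool has_copy[MAX_THREADS];
    unsigned long invalidations;
} line_state;

/* Record one access by thread tid (0 <= tid < MAX_THREADS). */
static void record_access(line_state *line, int tid, access_type type) {
    if (type == ACCESS_WRITE) {
        bool invalidated = false;
        /* A write removes every other thread's copy of the line. */
        for (int t = 0; t < MAX_THREADS; t++) {
            if (t != tid && line->has_copy[t]) {
                line->has_copy[t] = false;
                invalidated = true;
            }
        }
        /* Count one invalidation per write that evicted any copy. */
        if (invalidated)
            line->invalidations++;
    }
    /* After any access, the accessing thread holds a valid copy. */
    line->has_copy[tid] = true;
}
```

Lines whose `invalidations` count grows large are the candidates worth reporting; repeated writes by a single thread never increase the count.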
Predator Components
• Compiler instrumentation: instruments every memory read/write access
• Runtime system: collects memory accesses and reports false sharing
Detect Problems Correctly & Precisely
• Correctly: no false alarms
  • Track memory accesses on each word
• Precisely
  • Global variables
  • Heap objects: pinpoint the line of memory allocation
[Figure: Tasks 1 to 4 accessing cache lines, contrasting false sharing with true sharing]
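Word-level tracking is what lets a detector avoid false alarms: if different threads touch the same word, the sharing is true; if they touch different words of one line, it is false. The sketch below is one simplified way to express that distinction, assuming a 64-byte line of sixteen 4-byte words. All names are hypothetical and this is not Predator's actual data structure.

```c
#include <stdbool.h>

#define WORDS_PER_LINE 16    /* assumes 64-byte line, 4-byte words */
#define NO_THREAD (-1)       /* word not accessed yet */
#define MANY_THREADS (-2)    /* word accessed by more than one thread */

typedef struct {
    int owner[WORDS_PER_LINE];     /* NO_THREAD, a thread id, or MANY_THREADS */
    bool written[WORDS_PER_LINE];  /* has any thread written this word? */
} word_tracker;

typedef enum { NOT_SHARED, FALSE_SHARING, TRUE_SHARING } sharing_kind;

static void tracker_init(word_tracker *t) {
    for (int w = 0; w < WORDS_PER_LINE; w++) {
        t->owner[w] = NO_THREAD;
        t->written[w] = false;
    }
}

/* Record an access by thread tid (tid >= 0) to one word of the line. */
static void tracker_access(word_tracker *t, int word, int tid, bool is_write) {
    if (t->owner[word] == NO_THREAD)
        t->owner[word] = tid;
    else if (t->owner[word] != tid)
        t->owner[word] = MANY_THREADS;
    if (is_write)
        t->written[word] = true;
}

/* A line holding both kinds is reported as true sharing in this sketch. */
static sharing_kind classify(const word_tracker *t) {
    int first_owner = NO_THREAD;
    bool distinct_owners = false, exclusive_write = false;
    for (int w = 0; w < WORDS_PER_LINE; w++) {
        int o = t->owner[w];
        if (o == NO_THREAD)
            continue;
        if (o == MANY_THREADS) {
            if (t->written[w])
                return TRUE_SHARING;   /* same word, several threads, a write */
            distinct_owners = true;    /* read-only word seen by several threads */
            continue;
        }
        if (t->written[w])
            exclusive_write = true;    /* written word used by one thread only */
        if (first_owner == NO_THREAD)
            first_owner = o;
        else if (o != first_owner)
            distinct_owners = true;
    }
    return (distinct_owners && exclusive_write) ? FALSE_SHARING : NOT_SHARED;
}
```

A real tool would also need per-word access counts to decide whether the sharing is frequent enough to matter; this sketch only classifies the pattern.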
Necessity of False Sharing Prediction
[Figure: data of Thread 1 and Thread 2 placed across cache line 1 and cache line 2 under different layouts, some of which are labeled false sharing]
Properties Affecting False Sharing Occurrence
• Change of memory layout
  • 32-bit platform → 64-bit platform
  • Different memory allocator
  • Different compiler or optimization
  • Different allocation order by changing the code, e.g., printf
• Run on hardware with different cache line size
Example of False Sharing Sensitivity
Cache line size = 64 bytes
[Figure: memory layout for offset = 0, offset = 8, and offset = 56; colors represent threads]
Example of False Sharing Sensitivity
Predator predicts false sharing problems even when they do not occur in the observed run
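The sensitivity to starting offset and line size can be checked with simple arithmetic. The sketch below assumes two adjacent per-thread objects of equal size; the object size is an assumption chosen for illustration (the slide only gives the 64-byte line size and the offsets 0, 8, and 56), and `neighbors_share_line` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if two adjacent per-thread objects of obj_size bytes touch a common
   cache line. The first object starts start_offset bytes past a line
   boundary and the second follows it immediately. obj_size must be > 0. */
static bool neighbors_share_line(size_t start_offset, size_t obj_size,
                                 size_t line_size) {
    size_t last_byte_of_first = start_offset + obj_size - 1;
    size_t first_byte_of_second = start_offset + obj_size;
    return last_byte_of_first / line_size == first_byte_of_second / line_size;
}
```

With 64-byte objects and 64-byte lines, an offset of 0 keeps the neighbors on separate lines, while offsets of 8 or 56 put the boundary inside a line. Keeping the offset at 0 but moving to 128-byte lines also makes the neighbors share a line, which is why a run that shows no false sharing on one layout or machine says little about another.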
Prediction Based on Virtual Cache Lines
[Figure: accesses by Thread 1 and Thread 2. Real case: cache line 1 and cache line 2. Virtual cache line 1 and virtual cache line 2 give false sharing prediction 1 and false sharing prediction 2]
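Track Invalidations on Virtual Cache Lines
• d < sz (the cache line size), where d is the distance between accesses X and Y
• (X, Y) come from different threads && one of them is a write
[Figure: the tracked virtual line extends (sz-d)/2 beyond X and beyond Y; the other virtual lines are not tracked]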
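The two conditions and the placement of the tracked virtual line can be written out as a short sketch. It treats X and Y as byte addresses and the virtual line as a half-open range of `sz` bytes centered on the pair, which is one reading of the (sz-d)/2 margins in the figure. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, end) of a virtual cache line. */
typedef struct { uintptr_t start, end; } virtual_line;

/* Distance in bytes between two addresses. */
static uintptr_t distance(uintptr_t x, uintptr_t y) {
    return (x < y) ? y - x : x - y;
}

/* Candidate test from the slide: d < sz, different threads, and at least
   one of the two accesses is a write. */
static bool may_falsely_share(uintptr_t x, int tid_x, bool x_writes,
                              uintptr_t y, int tid_y, bool y_writes,
                              uintptr_t sz) {
    return distance(x, y) < sz && tid_x != tid_y && (x_writes || y_writes);
}

/* Virtual line leaving (sz - d) / 2 bytes before the lower address.
   Assumes d < sz and that the lower address is at least (sz - d) / 2. */
static virtual_line tracked_virtual_line(uintptr_t x, uintptr_t y,
                                         uintptr_t sz) {
    uintptr_t lo = (x < y) ? x : y;
    uintptr_t d = distance(x, y);
    virtual_line v;
    v.start = lo - (sz - d) / 2;
    v.end = v.start + sz;
    return v;
}
```

For example, with sz = 64, X at address 1000 and Y at address 1040, d is 40 and the margin is 12, so the tracked virtual line covers addresses 988 up to but not including 1052. Invalidations would then be counted on that range the same way as on a physical line.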
Real Applications Results
• MySQL
  • Problem: false sharing occurs when different threads update the shared bitmap simultaneously
  • Performance improves 180% after fixes
• Boost library
  • Problem: “there will be 16 spinlocks per cache line”
  • Performance improves about 100%
[Summary figure: compiler instrumentation and runtime system; cache invalidation between the caches of Core 1 and Core 2; real cache lines 1 and 2 compared with virtual cache lines 1 and 2, giving false sharing predictions 1 and 2; precise report]
Detailed Prediction Algorithm
1. Find suspected cache lines
2. Track detailed memory accesses
3. Predict based on hot accesses: if d < sz && (X, Y) come from different threads, potential false sharing
4. Track cache invalidations on the virtual line
[Figure: X and Y at distance d; the tracked virtual line extends (sz-d)/2 beyond X and beyond Y; the other virtual lines are not tracked]