Efficient Data Race Detection for Distributed Memory Parallel Programs
CS267 Spring 2012 (3/8). Originally presented at Supercomputing 2011 (Seattle, WA).
Chang-Seo Park and Koushik Sen, University of California, Berkeley
Paul Hargrove and Costin Iancu, Lawrence Berkeley National Laboratory
Current State of Parallel Programming • Parallelism everywhere! • The top supercomputer has 500K+ cores • Quad-core is standard on desktops and laptops • Dual-core smartphones • Parallelism and concurrency make programming harder • Scheduling non-determinism can cause subtle bugs • Yet testing and correctness tools see limited use • We rely on hero programmers • Hero programmers can find bugs (in sequential code) • Tools are hard to find and use
Outline • Introduction • Example Bug and Motivation • Efficient Data Race Detection with Active Testing • Prediction phase • Confirmation phase • HOWTO: Primer on using UPC-Thrille • Conclusion • Q&A and Project Ideas
Example Parallel Program • Simple matrix-vector multiply C = A × B
Example Parallel Program in UPC
 1: void matvec(shared [N] double A[N][N], shared double B[N], shared double C[N]) {
 2:   double sum[N];
 3:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 4:     sum[i] = 0;
 5:     for (int j = 0; j < N; j++)
 6:       sum[i] += A[i][j] * B[j];
 7:   }
 8:   upc_forall(int i = 0; i < N; i++; &C[i]) {
 9:     C[i] = sum[i];
10:   }
11: } // assert (C == A*B)
[Figure: matrices C = A × B, filled in element by element as the two threads compute]
UPC Example: Problem? • No apparent bug in this program.
UPC Example: Data Race • No apparent bug in this program. But what if we call matvec(A, B, B)? Data Race! [Figure: matrices B = A × B; with C aliased to B, statement 9's writes to B race with statement 6's reads of B]
UPC Example: Trace
Example Trace:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[1] += A[1][0]*B[0];
6: sum[0] += A[0][1]*B[1];
6: sum[1] += A[1][1]*B[1];
9: B[0] = sum[0];
9: B[1] = sum[1];
Data Race?
UPC Example: Trace • Would be nice to have a trace exhibiting the data race:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
6: sum[1] += A[1][0]*B[0];
9: B[0] = sum[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];
Data Race!
UPC Example: Trace • Would be nice to have a trace exhibiting the assertion failure:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];
Data Race! C != A*B
Desiderata • Would be nice to have a trace • Showing a data race (or some other concurrency bug) • Showing an assertion violation due to a data race (or some other visible artifact)
Active Testing • Would be nice to have a trace • Showing a data race (or some other concurrency bug) • Showing an assertion violation due to a data race (or some other visible artifact) • Leverage program analysis to make testing quickly find real concurrency bugs • Phase 1: Use imprecise static or dynamic program analysis to find bug patterns where a potential concurrency bug can happen (Race Detector) • Phase 2: Directed testing to confirm potential bugs as real (Race Tester)
Active Testing: Phase 1 • 1. Insert instrumentation into the matvec source at compile time.
Active Testing: Phase 1 • 2. Run the instrumented program normally and obtain a trace.
Generated Trace:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[1] += A[1][0]*B[0];
6: sum[0] += A[0][1]*B[1];
6: sum[1] += A[1][1]*B[1];
9: B[0] = sum[0];
9: B[1] = sum[1];
Active Testing: Phase 1 • 3. The race detection algorithm analyzes the trace and reports a potential race between statements 6 and 9.
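What the phase 1 analysis checks can be sketched as a pairwise scan over logged accesses. This is a minimal sketch with hypothetical types and names; the real detector tracks barrier phases, locksets, and distributed data structures rather than this O(n²) loop:

#include <stdio.h>
#include <stdbool.h>

typedef struct {          /* one logged shared-memory access */
    int thread;           /* issuing thread */
    void *addr;           /* target address */
    size_t size;          /* bytes accessed */
    bool is_write;
    int phase;            /* barrier phase in which it occurred */
    int stmt;             /* source statement, for reporting */
} access_t;

/* Two accesses may race if they overlap in memory, come from different
   threads, at least one is a write, and no barrier orders them
   (modeled here as "same barrier phase"). */
static bool conflicts(const access_t *a, const access_t *b) {
    bool overlap = (char *)a->addr < (char *)b->addr + b->size &&
                   (char *)b->addr < (char *)a->addr + a->size;
    return overlap && a->thread != b->thread &&
           (a->is_write || b->is_write) && a->phase == b->phase;
}

/* Report every potentially racing pair in a recorded trace. */
void report_potential_races(const access_t *trace, int n) {
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (conflicts(&trace[i], &trace[j]))
                printf("Potential race: statement %d and statement %d\n",
                       trace[i].stmt, trace[j].stmt);
}

int main(void) {
    double B0 = 0;        /* stand-in for shared element B[0] */
    access_t trace[] = {
        { 1, &B0, sizeof B0, false, 4, 6 },  /* thread 1 reads B[0] at stmt 6 */
        { 0, &B0, sizeof B0, true,  4, 9 },  /* thread 0 writes B[0] at stmt 9 */
    };
    report_potential_races(trace, 2);        /* reports the (6, 9) pair */
    return 0;
}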
Active Testing: Phase 2 • Goal 1: Confirm races. • Goal 2: Create an assertion failure.
Active Testing: Phase 2 • Generate this execution:
4: sum[0] = 0;
4: sum[1] = 0;
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];
6: sum[1] += A[1][0]*B[0];
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];
Data Race!
Active Testing: Phase 2 • Control the scheduler, knowing that (6, 9) could race:
4: sum[1] = 0;
4: sum[0] = 0;
[thread 1 reaches 6: sum[1] += A[1][0]*B[0]; postpone it, so Postponed = { 6: sum[1] += A[1][0]*B[0]; } (but do not postpone if that would cause a deadlock)]
6: sum[0] += A[0][0]*B[0];
6: sum[0] += A[0][1]*B[1];
9: B[0] = sum[0];   [thread 0 reaches statement 9 on the same address: Race? yes; release the postponed access next, Postponed = {}]
6: sum[1] += A[1][0]*B[0];   [racing statements temporally adjacent: Achieved Goal 1, confirmed race]
6: sum[1] += A[1][1]*B[1];
9: B[1] = sum[1];   [C != A*B: Achieved Goal 2, assertion failure]
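The postponement logic above can be sketched in a few lines of C. This is a minimal, single-pair sketch with hypothetical names; the real UPC-Thrille tester handles many pairs, real thread scheduling, and the deadlock case mentioned above:

#include <stdio.h>

/* (s1, s2) is the statement pair that phase 1 predicted to race. */
typedef struct { int stmt; const char *desc; } pending_t;

static pending_t postponed;       /* at most one postponed access here */
static int have_postponed = 0;

/* Called before each instrumented access; returns 1 if the calling
   thread should pause (postpone the access) so the other racing
   statement can catch up. */
int should_postpone(int stmt, const char *desc, int s1, int s2) {
    if (stmt != s1 && stmt != s2)
        return 0;                 /* not part of the predicted racy pair */
    if (!have_postponed) {
        postponed = (pending_t){ stmt, desc };
        have_postponed = 1;
        return 1;                 /* first arrival: hold it back */
    }
    /* Second arrival of the pair: the race is realized; release the
       postponed access right after this one so the two racing
       statements execute temporally adjacent. */
    printf("race realized: %s then %s\n", desc, postponed.desc);
    have_postponed = 0;
    return 0;
}

int main(void) {
    should_postpone(6, "sum[1] += A[1][0]*B[0]", 6, 9);  /* postponed */
    should_postpone(9, "B[0] = sum[0]",          6, 9);  /* realizes race */
    return 0;
}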
UPC-Thrille • Thread Interposition Library and Lightweight Extensions • Framework for active testing UPC programs • Instrument UPC source code at compile time • Using macro expansions, add hooks for analyses • Phase 1: Race detector • Observe execution and predict which accesses may potentially have a data race • Phase 2: Race tester • Re-execute program while controlling the scheduler to create actual data race scenarios predicted in phase 1
UPC-Thrille • Extension to the Berkeley UPC compiler and runtime • Unfortunately, disabled by default on NERSC clusters • Fortunately, compilers with Thrille enabled are available: • /global/homes/p/parkcs/hopper/bin/thrille_upcc • /global/homes/p/parkcs/franklin/bin/thrille_upcc • You can also build Berkeley upcc with Thrille enabled from source by following the steps at http://upc.lbl.gov/thrille • Add the switch "-thrille=[mode]" to a Thrille-enabled upcc, where [mode] is • empty (default, no instrumentation) • racer (phase 1 of race detection, predicts racy statement pairs) • tester (phase 2, tries to create a race on a given statement pair) • You can also add "default_options = -thrille=[mode]" to ~/.upccrc
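For example, a one-line ~/.upccrc that makes every compile default to the prediction phase (the mode choice here is illustrative):

default_options = -thrille=racer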
UPC-Thrille • No changes needed to source file(s) • Separate binary for each analysis phase, including one "empty" uninstrumented version • Build and run each phase with its corresponding binary, e.g.:
$ upcc hello.upc -o a.out
$ upcc -thrille=racer hello.upc -o b.out
$ upcc -thrille=tester hello.upc -o c.out
$ upcrun b.out
UPC-Thrille: racer • -thrille=racer • Finds potential data race pairs • Records them in upct.race.N files (N = 1, 2, 3, …) • Only works with a static number of threads for now (needs -T n) • This limitation will be lifted soon • Example:
$ upcc -T 4 -thrille=racer matvec.upc -o matvec-racer
$ upcrun matvec-racer (in an interactive batch job)
…
[2] Potential race #1 found:
[2] Read from [0x3ff7004,0x3ff7008) by thread 2 at phase 4 (matvec.upc:17)
[3] Write to [0x3ff7004,0x3ff7008) by thread 3 at phase 4 (matvec.upc:26)
…
UPC-Thrille: tester • -thrille=tester • Confirms data races predicted in phase 1 • Reads in the upct.race.N files (N = 1, 2, 3, …) and tests each race individually • A script, upctrun, is provided to automatically test all races and skip equivalent ones • One can also test a specific race with the environment variable UPCT_RACE_ID=n • Example:
$ upcc -T 4 -thrille=tester matvec.upc -o matvec-tester
$ upctrun matvec-tester
…
('matvec.upc:17', 'matvec.upc:26') : (8, 1, True)
…
(the result tuple gives the number of pairs tested and the number of equivalent races; True indicates the race was confirmed)
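For instance, to re-test only one predicted race, a hypothetical invocation might look like this (the race ID is illustrative, and on a batch system the variable may need to be exported to the compute nodes):

$ UPCT_RACE_ID=1 upcrun matvec-tester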
Limitations • Limitations of the prediction phase • Dynamic analysis can only analyze collected data • Cannot predict races in code that was not executed • Cannot predict races in binary-only libraries whose source was not instrumented • Limitations of the confirmation phase • Non-confirmation does not guarantee race freedom • "Benign" data races
Conclusion • Active testing finds bugs in parallel programs • Combines dynamic analysis with testing • Observe executions for potential concurrency bugs • Re-execute to confirm bugs • UPC-Thrille is an efficient, scalable, and extensible analysis framework for UPC • Currently provides race detection analysis • Other analyses in progress (class projects?) http://upc.lbl.gov/thrille parkcs@cs.berkeley.edu
Optimization 1: Distributed Checking • Minimize interaction between threads • Store shared memory accesses locally • At barrier boundaries, send access information to the respective owner of the memory • Conflict checking is distributed among threads [Figure: three thread timelines showing shared access between barriers, shared access after wait, and shared access after notify]
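A rough sketch of the owner-distributed exchange follows. All names, the block size, and the layout function are hypothetical; the real implementation piggybacks on the UPC runtime's barrier and communication layers:

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

typedef struct { int thread, phase, stmt, is_write; void *addr; size_t size; } access_t;

enum { MAXLOG = 4096, NTHREADS = 4, BLOCK = 1024 };

static access_t local_log[MAXLOG];  /* this thread's shared accesses since the last barrier */
static int local_n;

/* Owner of an address under a made-up block layout (hypothetical). */
static int owner_of(void *addr) {
    return (int)(((uintptr_t)addr / BLOCK) % NTHREADS);
}

/* Stand-in for the runtime's message send: ship one access record to
   the owning thread, which performs the conflict check locally. */
static void send_to(int owner, const access_t *a) {
    printf("thread %d: send access at stmt %d to owner %d\n",
           a->thread, a->stmt, owner);
}

/* Called at each barrier: every record goes to the thread that owns
   that memory, so conflict checking is distributed across threads
   rather than centralized in one place. */
void flush_log_at_barrier(void) {
    for (int i = 0; i < local_n; i++)
        send_to(owner_of(local_log[i].addr), &local_log[i]);
    local_n = 0;
}

int main(void) {
    static double B0;   /* stand-in for a shared element */
    local_log[local_n++] = (access_t){ 1, 4, 6, 0, &B0, sizeof B0 };
    flush_log_at_barrier();
    return 0;
}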
Optimization 2: Filter Redundancy • Information collected up to a synchronization point may be redundant • Reading and writing the same memory address • Accessing the same memory with different sizes or different locksets • (Extended) weaker-than relation • Only keep the least-protected accesses • Prune provably redundant accesses [Choi et al. '02] • Also reduces superfluous race reports • e1 ⊑ e2 (access e1 is weaker than e2) iff • larger memory range (e1.m ⊇ e2.m) • accessed by more threads (e1.t = * ∨ e1.t = e2.t) • smaller lockset (e1.L ⊆ e2.L) • weaker access type (e1.a = Write ∨ e1.a = e2.a); see the sketch below
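The weaker-than relation translates almost directly into code. A sketch that represents memory ranges as [lo, hi) intervals and locksets as bitmasks (these representation choices are ours, not UPC-Thrille's):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uintptr_t lo, hi;   /* memory range [lo, hi) */
    int thread;         /* issuing thread, or -1 for "any thread" (the * case) */
    uint64_t lockset;   /* bitmask of locks held during the access */
    bool is_write;
} access_t;

/* e1 "weaker-than" e2: e1 subsumes e2, so e2 is provably redundant. */
bool weaker_than(const access_t *e1, const access_t *e2) {
    bool larger_range    = e1->lo <= e2->lo && e1->hi >= e2->hi;        /* e1.m ⊇ e2.m */
    bool more_threads    = e1->thread == -1 || e1->thread == e2->thread;
    bool smaller_lockset = (e1->lockset & ~e2->lockset) == 0;           /* e1.L ⊆ e2.L */
    bool weaker_access   = e1->is_write || e1->is_write == e2->is_write;
    return larger_range && more_threads && smaller_lockset && weaker_access;
}

int main(void) {
    access_t wide  = { 0, 64, -1, 0x0, true  };  /* write to [0,64) by any thread, no locks */
    access_t small = { 8, 16,  2, 0x1, false };  /* read of [8,16) by thread 2 under one lock */
    printf("%d\n", weaker_than(&wide, &small));  /* prints 1: small can be pruned */
    return 0;
}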
Optimization 3: Sampling • Scientific applications have tight loops • Same computation and communication pattern each time • Inefficient to check for races at every loop iteration • Reduce overhead by sampling • Probabilistically sample each instrumentation point • Reduce the probability after each unsuccessful check • Set the probability to 0 when a race is found (disable the check), as sketched below
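A sketch of the sampling scheme (the backoff factor and state layout are illustrative):

#include <stdio.h>
#include <stdlib.h>

/* Per-instrumentation-point sampling state. */
typedef struct {
    double p;        /* probability of checking on this visit */
    int disabled;    /* set once a race has been found here */
} sample_state_t;

/* Should we pay the race-checking cost on this loop iteration? */
int should_check(sample_state_t *s) {
    if (s->disabled)
        return 0;
    return (double)rand() / RAND_MAX < s->p;
}

/* Exponential backoff: halve the probability after each check that
   finds nothing; disable the check entirely once a race is found. */
void after_check(sample_state_t *s, int race_found) {
    if (race_found) {
        s->disabled = 1;
        s->p = 0.0;
    } else {
        s->p *= 0.5;
    }
}

int main(void) {
    sample_state_t s = { 1.0, 0 };          /* always check at first */
    for (int iter = 0; iter < 10; iter++)
        if (should_check(&s)) {
            printf("checked at iteration %d (p=%g)\n", iter, s.p);
            after_check(&s, /*race_found=*/0);
        }
    return 0;
}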
How Well Does it Scale? • Maximum 8% slowdown at 8K cores • Franklin Cray XT4 supercomputer at NERSC • Quad-core 2.3 GHz CPU and 8 GB RAM per node • Portals interconnect • Optimizations for scalability • Efficient data structures • Minimize communication • Sampling with exponential backoff
Active Testing Cartoon: Phase I Potential Collision
New Landscape for HPC • Shared memory for scalability and utilization • Hybrid programming models: MPI + OpenMP • PGAS: UPC, CAF, X10, etc. • Asynchronous access to shared data is likely to cause bugs • Unified Parallel C (UPC) • Parallel extension of the ISO C99 standard for shared- and distributed-memory hardware • Single Program Multiple Data (SPMD) + Partitioned Global Address Space (PGAS) • Shared memory concurrency • Transparent access using pointers to shared data (arrays) • Bulk transfers with memcpy, memput, memget • Fine-grained (lock) and bulk (barrier) synchronization (see the sketch below)
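A minimal UPC sketch of these features: a shared array with one element per thread, SPMD execution, and barrier synchronization (the program itself is illustrative, not from the talk; compile with upcc):

#include <upc.h>
#include <stdio.h>

shared int hits[THREADS];            /* one element per thread, globally addressable */

int main(void) {
    hits[MYTHREAD] = MYTHREAD;       /* each thread writes its own element */
    upc_barrier;                     /* bulk synchronization across all threads */
    if (MYTHREAD == 0)               /* thread 0 transparently reads remote data */
        for (int i = 0; i < THREADS; i++)
            printf("hits[%d] = %d\n", i, hits[i]);
    return 0;
}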
Phase 1: Checking for Conflicts • To predict possible races, we need to check all shared accesses for conflicts • Collect the necessary information through instrumentation, along the lines of the sketch below
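A sketch of the kind of hook such compile-time instrumentation inserts around each shared access (the macro and function names are hypothetical, not UPC-Thrille's actual API):

#include <stddef.h>
#include <stdio.h>

/* Hypothetical logging callbacks invoked by the inserted hooks. */
static void upct_log_read(void *addr, size_t sz, int line) {
    printf("R %p %zu (line %d)\n", addr, sz, line);
}
static void upct_log_write(void *addr, size_t sz, int line) {
    printf("W %p %zu (line %d)\n", addr, sz, line);
}

/* Wrappers the instrumented source calls instead of raw accesses:
   log the access, then perform it. */
#define UPCT_READ(x)     (upct_log_read(&(x), sizeof(x), __LINE__), (x))
#define UPCT_WRITE(x, v) (upct_log_write(&(x), sizeof(x), __LINE__), (x) = (v))

int main(void) {
    double sum = 0, a = 2, b = 3;
    /* Conceptually, "sum = a * b" in instrumented code becomes: */
    UPCT_WRITE(sum, UPCT_READ(a) * UPCT_READ(b));   /* logs the accesses, then sum = 6 */
    printf("sum = %g\n", sum);
    return 0;
}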