Symbolic Program Consistency Checking of OpenMP Parallel Programs with Relaxed Memory Models

Based on an LCTES 2012 paper. Fang Yu (National Cheng Chi University); Shun-Ching Yang, Guan-Cheng Chen, Che-Chang Chan (National Taiwan University); Farn Wang (National Taiwan University & Academia Sinica)

Outline • Introduction • Motivation • Parallel program correctness • Related work • 2-step program consistency checking • Step 1: Static race constraint solution • Step 2: Guided simulation • Extended finite-state machine (EFSM), relaxed memory models • Implementation • Experiments • Conclusion

Motivation (1/4) • Parallel Programming • Multi-cores, • General purpose computation on GPU (GPGPU) • Distributed computing, cloud computing • Challenges: • Parallel loops, chunk sizes, # threads, schedules • Arrays, pointer aliases, • Relaxed memory models

Motivation (2/4) A running example of C & OpenMP
    for (k = 0; k < size-1; k++) {
      #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,1) num_threads(4)
      for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
          M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
      }
    }

Motivation (3/4)
    for (k = 0; k < size-1; k++) {
      #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,c) num_threads(4)
      for (i = k+1; i < size; i++) {
        L[i][k] = M[i][k] / M[k][k];
        for (j = k+1; j < size; j++) {
          M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
        }
      }
    }
Iterations of i assigned to each thread, in chunks of size c:
    Thread1: k+1, … , k+1+c-1
    Thread2: k+1+c, … , k+1+2c-1
    Thread3: k+1+2c, … , k+1+3c-1
    Thread4: k+1+3c, … , k+1+4c-1
    Thread1: k+1+4c, … , k+1+5c-1
    …
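The round-robin chunk assignment listed above can be written as a small helper. This is a minimal sketch, not code from the paper: the names static_owner and print_schedule are hypothetical, threads are numbered from 1 as on the slide (OpenMP itself numbers them from 0), and lo stands for the first iteration k+1.

```c
#include <assert.h>
#include <stdio.h>

/* Thread (1-based, as on the slide) that runs iteration i of a loop
 * starting at lo under schedule(static, c) with nthreads threads.
 * Hypothetical helper for illustration only. */
int static_owner(int i, int lo, int c, int nthreads) {
    assert(i >= lo && c > 0 && nthreads > 0);
    return ((i - lo) / c) % nthreads + 1;
}

/* Print the owner of every iteration in [lo, hi). */
void print_schedule(int lo, int hi, int c, int nthreads) {
    for (int i = lo; i < hi; i++)
        printf("iteration %d -> thread %d\n", i, static_owner(i, lo, c, nthreads));
}
```

For example, with k = 0 (so lo = 1), c = 2 and 4 threads, iterations 1 and 2 go to thread 1, iterations 3 and 4 to thread 2, and iteration 9 wraps around to thread 1 again.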

Motivation (4/4) Many kinds of programming support • forks & joins • Pthreads • Open Multi-Processing (OpenMP) • Threading Building Blocks • Microsoft …

Parallel Program Correctness (1/4) • Program level, what users care about • Determinism: • For all input, all executions yield the same output. • Consistency: • All executions yield the same output as the sequential execution. • Race-freedom: • Parallel executions do not yield different results. • All seemingly equivalent at program level. • unless sequential execution is not a parallel execution.

Parallel Program Correctness (2/4) • Checking the correctness property of each parallel region (PR) • Correctness at PRs ⇒ correctness of the program [Figure: a program as a sequence of parallel regions: parallel for, parallel while, parallel for, parallel for]

Parallel Program Correctness (3/4) In practice • It may be unclear what the program result is. • Instead, properties for correctness at PR level are usually checked. • determinism • consistency • race-freedom • At RW schedule levels, values do not count. • linearizability (transaction levels)

Parallel Program Correctness (4/4) Linearizability (transaction level) ⇒ race-freedom (PR RW level) ⇒ determinism (PR level) = consistency (PR level) ⇒ race-freedom (program level) = determinism (program level) = consistency (program level) ⇒ program correctness

Related Work (1/4) • Thread analyzer of Sun Studio [Lin 2008] • Static race detection, no arrays • Intel Thread Checker [Petersen & Shah 2003] • Dynamic approach • Instrumentation approach on client-server for race detection [Kang et al. 2009] • Run-time monitoring in OpenMP programs • OmpVerify [Basupalli et al. 2011] • Polyhedral analysis for Affine Control Loops

Related Work in PLDI 2012 (2/4): no simulation as the 2nd step • Detect races via liquid effects [Kawaguchi, Rondon, Bakst, Jhala] • type inference for precise race detection. • no arrays. • Speculative Linearizability [Guerraoui, Kuncak, Losa] • Reasoning about Relaxed Programs [Carbin, Kim, Misailovic, Rinard] • Parallelizing Top-Down Interprocedural Analysis [Albarghouthi, Kumar, Nori, Rajamani]

Related Work in PLDI 2012 (3/4): no simulation as the 2nd step • Sound and Precise Analysis of Parallel Programs through Schedule-Specialization [Wu, Tang, Hu, et al.] • Race Detection for Web Applications [Petrov, Vechev, Sridharan, Dolby] • Concurrent Data Representation Synthesis [Hawkins, Aiken, Fisher, et al.] • Dynamic Synthesis for Relaxed Memory Models [Liu, Nedev, Prisadnikov, et al.]

Related Work in PLDI 2012 (4/4): no simulation as the 2nd step Tools: • Parcae [Raman, Zaks, Lee, et al.] • Chimera [Lee, Chen, Flinn, Narayanasamy] • Janus [Tripp, Manevich, Field, Sagiv] • Reagents [Turon]

Methodology (1/2) Assumptions: • Arrays do not overlap. • No pointers other than arrays. • Fixed #threads, chunk size, scheduling policy. • We analyze consistency of program implementation. • Focusing on OpenMP. • The techniques should be applicable to other frameworks. • Output result prescribed by users.

Why OpenMP? • Complicated enough • Practical enough • Parallelizes programs automatically; • Is an industry-standard application programming interface (API); • Is supported by Sun Studio, Intel Parallel Studio, Visual C++, GNU Compiler Collection (GCC).

Methodology (2/2) 2-step program consistency checking. Flow: potential race analysis at PR level → potential race report → guided simulation for program consistency violations → end

Step 1: Potential Races at PR level Necessary constraints as Presburger formulas • A race constraint between each pair of memory references to the same location by different threads. • Solution of the pairwise constraints via Presburger formula solving.

Step 1: Potential Race Analysis Flow: C program with OpenMP → pairwise constraints generator → pairwise race constraints → constraint solver → Sat? No: race-freedom. Yes: potential races (truth assignment).

Potential Race Constraint A Potential Race Constraint = Thread Path Condition ∧ Race Condition • Thread Path Condition • Necessary for a thread to access a memory location in a statement • Obtained by symbolic postcondition analysis • Race Condition • The necessary condition of an access by two threads in a parallel region

Thread Path Condition of L[i][k] (the write in the running example of Motivation (3/4); conditions shown for chunk size c = 1 and 4 threads) Thread 1: (i_t1 - (k+1)) mod 4 = 0 ∧ k+1 ≤ i_t1 < size

Thread Path Condition of L[i-1][k] (the read inside the j loop of the running example) Thread 2: (i_t2 - (k+1) - 1) mod 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size

Race Condition of L[i][k] & L[i-1][k] Both path conditions hold and the two references name the same address: (i_t1 - (k+1)) mod 4 = 0 ∧ k+1 ≤ i_t1 < size ∧ (i_t2 - (k+1) - 1) mod 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size ∧ k = k ∧ i_t1 = i_t2 - 1

Potential Race Constraint Solving All constraints are Presburger formulas.
Input constraint: (i_t1 - (k+1)) mod 4 = 0 ∧ k+1 ≤ i_t1 < size ∧ (i_t2 - (k+1) - 1) mod 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size ∧ k = k ∧ i_t1 = i_t2 - 1
Potential races (Omega library output):
    i_1 = k+1+4alpha
    i_2 = k+2+4alpha
    i_2 = i_1+1
    i_1 < size
    i_2 < size
    k+1 <= i_1
    k+1 <= i_2
    k+1 <= j_2
    j_2 < size
    i_1 – 0 [0,), not_tight
    i_2 – 0 [0,), not_tight
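For small concrete values, the race constraint between the write L[i][k] and the read L[i-1][k] can be checked by plain enumeration. The sketch below is a brute-force stand-in for the Presburger solving step, not the Omega library and not the authors' tool; the names owner and count_races are hypothetical, and it assumes schedule(static,1) with threads numbered from 1.

```c
#include <assert.h>

/* Thread (1-based) that runs iteration i under schedule(static,1). */
static int owner(int i, int k, int nthreads) {
    return (i - (k + 1)) % nthreads + 1;
}

/* Count pairs (i1, i2) where one thread writes L[i1][k] and a different
 * thread reads L[i2-1][k] at the same address (i1 == i2 - 1).
 * The condition k+1 <= j < size is implied once k+1 <= i1 < size holds.
 * *first_i receives the smallest racing write index, or -1 if none. */
int count_races(int k, int size, int nthreads, int *first_i) {
    assert(k >= 0 && nthreads > 0);
    int n = 0;
    *first_i = -1;
    for (int i1 = k + 1; i1 < size; i1++) {         /* writer iteration */
        for (int i2 = k + 1; i2 < size; i2++) {     /* reader iteration */
            if (i1 != i2 - 1) continue;             /* different address */
            if (owner(i1, k, nthreads) == owner(i2, k, nthreads))
                continue;                           /* same thread: no race */
            if (*first_i < 0) *first_i = i1;
            n++;
        }
    }
    return n;
}
```

With the hypothetical values k = 1, size = 7 and 4 threads, this finds four potential races, the first on L[2][1]; with a single thread it finds none, since every access belongs to the same thread.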

Step 2: Guided symbolic simulation • Program models: • Extended finite-state machine (EFSM) • Relaxed memory model • Simulator of EFSM • Stepwise, backtrack, fixed point • Witness of program consistency violations • comparison with the sequential execution result.

Guided Simulation Flow: C program with OpenMP → model generator → model (EFSM). Model (EFSM) + potential races (from step 1) → simulation → fixed point? No: continue simulation. Yes: consistency? No: consistency violations. Yes: consistency (with benign races).

C Program Model Construction (1/2) Example:
    #pragma omp for schedule(static, c) num_threads(m)
    for (x = i; x <= j; x++) S
y is an auxiliary local variable for the position within a chunk. t is the serial number of the thread. EFSM transitions:
    start: (true) x = (t-1)*c + i; y = 0;
    loop body: S
    (x < j ∧ y < c-1) x++; y++;
    (x-y+m*c ≤ j ∧ y = c-1) x = x-y+m*c; y = 0;
    stop: (x > j) or (x-y+m*c > j ∧ y = c-1)
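The transitions of this EFSM can be stepped through directly to see which iterations thread t executes. The following is a minimal sketch under stated assumptions: efsm_iterations is a hypothetical name, S is replaced by recording the value of x, and the case where x reaches j in the middle of a chunk (not spelled out on the slide) is treated as a stop.

```c
#include <assert.h>

/* Walk the per-thread EFSM for schedule(static, c) with m threads over
 * the loop x = i..j. Records each x for which S would run in out[] and
 * returns how many were recorded. t is the 1-based thread number. */
int efsm_iterations(int t, int m, int c, int i, int j, int *out, int max) {
    assert(t >= 1 && t <= m && c >= 1);
    int n = 0;
    int x = (t - 1) * c + i;   /* (true) x=(t-1)*c+i; y=0; */
    int y = 0;
    while (x <= j) {           /* guard (x > j) leads to stop */
        assert(n < max);
        out[n++] = x;          /* stands in for the loop body S */
        if (x < j && y < c - 1) {
            x++; y++;                      /* next iteration in this chunk */
        } else if (y == c - 1 && x - y + m * c <= j) {
            x = x - y + m * c; y = 0;      /* jump to this thread's next chunk */
        } else {
            break;             /* stop (includes the assumed mid-chunk case) */
        }
    }
    return n;
}
```

For instance, with m = 4, c = 2, i = 1 and j = 10, thread 1 executes iterations 1, 2, 9 and 10, which agrees with the round-robin chunk assignment of the running example.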

C Program Model Construction (2/2) To model races in a C statement: y = f(x1, x2, …, xn) assume reads x1, x2,…, xn in order. • other orders can also be modeled. Translate to the following n+1 EFSM transitions: a1=x1; a2=x2; …; an=xn; y=f(a1,…,an); a1, a2, …, an are auxiliary variables in EFSM.

Relaxed Memory Models • Out-of-order execution of accesses to the memory for hardware efficiency. • local caches, multiprocessors • for customized synchronizations, controlled races • May lead to unexpected results. A classical example: initially x = 0, y = 0. Thread 1: x = 1; z = y; Thread 2: y = 1; w = x; assert z = 1 ∨ w = 1

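Relaxed Memory Models A classical example: initially x = 0, y = 0. Thread 1 (cache 1): x = 1; z = y; Thread 2 (cache 2): y = 1; w = x; assert z = 1 ∨ w = 1 [Figure: each thread writes its own cache copy first (x.c1 = 1 for thread 1, y.c2 = 1 for thread 2), loads from memory (load(z.c1, y), load(w.c2, x)), and only later commits the pending write to memory (store(x.c1): x = x.c1; store(y.c2): y = y.c2).] If both loads reach memory before either store is committed, then z = 0 and w = 0, and the assertion fails.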
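The failing outcome can be replayed with one fixed interleaving. This is a minimal single-threaded sketch, assuming each thread has a private cache copy (named x_c1 and y_c2 after the slide) that is committed to memory either before or after the loads; run_litmus, assertion_holds and Outcome are hypothetical names.

```c
#include <assert.h>

typedef struct { int z, w; } Outcome;

/* Replay the two-thread example with one fixed interleaving.
 * delay_stores = 0: writes reach memory before the loads.
 * delay_stores = 1: writes stay in the local caches until after the loads. */
Outcome run_litmus(int delay_stores) {
    int x = 0, y = 0;          /* shared memory, initially 0 */
    int x_c1, y_c2;            /* pending writes in cache 1 and cache 2 */
    Outcome r;

    x_c1 = 1;                  /* thread 1: x = 1 (lands in cache 1) */
    y_c2 = 1;                  /* thread 2: y = 1 (lands in cache 2) */
    if (!delay_stores) { x = x_c1; y = y_c2; }

    r.z = y;                   /* thread 1: z = y, read from memory */
    r.w = x;                   /* thread 2: w = x, read from memory */
    if (delay_stores) { x = x_c1; y = y_c2; }

    assert(x == 1 && y == 1);  /* both writes are eventually committed */
    return r;
}

/* The slide's assertion: z = 1 or w = 1. */
int assertion_holds(Outcome r) { return r.z == 1 || r.w == 1; }
```

Calling run_litmus(0) gives an outcome that satisfies the assertion, while run_litmus(1) gives z = 0 and w = 0, which violates it.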

Relaxed Memory Models Total store order (TSO) • From SPARC • Adapted to Intel 80x86 series • Description: • Local reads can use pending writes in the local store. • Problem: Peer reads are not aware of the local pending writes. • Local stores must be FIFO.

Modeling TSO w. m threads (1/4) • An array x[0..m] for each shared variable x • x[0] is the memory copy. • x[i] is the cache copy of x of thread i ∈ [1,m] • x now becomes an address variable instead of the value variable for x.

Modeling TSO w. m threads (2/4) • An array ls[0..n] of objects for the load-store (LS) buffer of size n+1. • ls_st[k]: status of load-store buffer cell k • 0: not used, 1: load, 2: store • ls_th[k]: thread that uses load-store buffer cell k. • ls_dst[k], ls_src[k]: destination and source addresses • ls_value: value to store The single shared buffer is purely for convenience. It can be changed to m load-store buffers, one for each thread; that needs to know the mapping from threads to cores.

Modeling TSO w. m threads (3/4) Load a ← x by thread j; ‘a’ is private.

Modeling TSO w. m threads (4/4) Store x ← a by thread j; ‘a’ is private.
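A simplified version of the load and store operations over a shared load-store buffer can be sketched as follows. This is an illustration under assumptions, not the model from the slides: only stores are buffered (so ls_src is omitted), addresses are small integers indexing a memory array, and the Tso type and the tso_* functions are hypothetical names. The field names ls_st, ls_th, ls_dst and ls_value and the status codes follow the slide.

```c
#include <assert.h>
#include <string.h>

#define LS_SIZE 8    /* buffer cells (hypothetical size) */
#define MEM_SIZE 4   /* shared addresses (hypothetical size) */

enum { LS_UNUSED = 0, LS_LOAD = 1, LS_STORE = 2 };  /* ls_st codes */

typedef struct {
    int mem[MEM_SIZE];       /* memory copies of the shared variables */
    int ls_st[LS_SIZE];      /* status of each buffer cell */
    int ls_th[LS_SIZE];      /* thread that uses the cell */
    int ls_dst[LS_SIZE];     /* destination address of the store */
    int ls_value[LS_SIZE];   /* value to store */
    int count;               /* cells 0..count-1 are in use, oldest first */
} Tso;

void tso_init(Tso *m) { memset(m, 0, sizeof *m); }

/* Store by thread th: the write is only queued, not yet visible to peers. */
void tso_store(Tso *m, int th, int addr, int value) {
    assert(m->count < LS_SIZE && addr >= 0 && addr < MEM_SIZE);
    int k = m->count++;
    m->ls_st[k] = LS_STORE;
    m->ls_th[k] = th;
    m->ls_dst[k] = addr;
    m->ls_value[k] = value;
}

/* Load by thread th: use its own newest pending write if any, else memory. */
int tso_load(const Tso *m, int th, int addr) {
    assert(addr >= 0 && addr < MEM_SIZE);
    for (int k = m->count - 1; k >= 0; k--)
        if (m->ls_th[k] == th && m->ls_dst[k] == addr)
            return m->ls_value[k];
    return m->mem[addr];
}

/* Commit the oldest pending store of thread th (FIFO); 1 if one was committed. */
int tso_flush_one(Tso *m, int th) {
    for (int k = 0; k < m->count; k++) {
        if (m->ls_th[k] != th) continue;
        m->mem[m->ls_dst[k]] = m->ls_value[k];
        for (int q = k; q + 1 < m->count; q++) {   /* close the gap, keep order */
            m->ls_st[q] = m->ls_st[q + 1];
            m->ls_th[q] = m->ls_th[q + 1];
            m->ls_dst[q] = m->ls_dst[q + 1];
            m->ls_value[q] = m->ls_value[q + 1];
        }
        m->count--;
        m->ls_st[m->count] = LS_UNUSED;
        return 1;
    }
    return 0;
}
```

After tso_store(&m, 1, 0, 1), thread 1 reads 1 from address 0 while thread 2 still reads 0; once tso_flush_one(&m, 1) commits the write, both threads read 1. This mirrors the TSO description: local reads may use pending writes, peers cannot see them, and commits happen in FIFO order.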

Guided Simulation • For each pairwise race condition truth assignment, perform a simulation session. • Use a stack to explore the simulation paths. • Explore all paths compatible with the truth assignment. • Check consistency at the end of each path. • Mark benign races.
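The stack-based exploration and the end-of-path consistency check can be illustrated on a toy model. This is a minimal sketch, not REDLIB and not the authors' simulator: the two-thread program, the State type and the functions step, enabled and explore are all hypothetical, and for brevity it explores every interleaving instead of only those compatible with a given truth assignment. Thread 0 performs x = 1; thread 1 performs y = x + 1, split into a read step and a write step.

```c
#include <assert.h>

/* Toy model: thread 0 does x = 1; thread 1 does a = x; then y = a + 1. */
typedef struct { int pc0, pc1, x, y, a; } State;

static int enabled(State s, int t) {
    return t == 0 ? s.pc0 < 1 : s.pc1 < 2;
}

static State step(State s, int t) {
    if (t == 0)          { s.x = 1;       s.pc0 = 1; }
    else if (s.pc1 == 0) { s.a = s.x;     s.pc1 = 1; }   /* read step */
    else                 { s.y = s.a + 1; s.pc1 = 2; }   /* write step */
    return s;
}

/* Explore all interleavings with an explicit stack. Returns the number of
 * complete paths whose final x, y differ from the sequential execution;
 * *paths receives the total number of complete paths. */
int explore(int *paths) {
    State stack[16];
    int top = 0, violations = 0;
    State init = {0, 0, 0, 0, 0};
    State seq = step(step(step(init, 0), 1), 1);   /* sequential (golden) run */

    *paths = 0;
    stack[top++] = init;
    while (top > 0) {
        State s = stack[--top];
        int moved = 0;
        for (int t = 0; t < 2; t++) {
            if (enabled(s, t)) {
                assert(top < 16);
                stack[top++] = step(s, t);          /* branch on thread t */
                moved = 1;
            }
        }
        if (!moved) {                               /* end of a path */
            (*paths)++;
            if (s.x != seq.x || s.y != seq.y) violations++;
        }
    }
    return violations;
}
```

The toy model has three interleavings; in the two where thread 1 reads x before thread 0 writes it, the final y differs from the sequential result, so explore reports two violations. A guided version would additionally prune paths that contradict the race assignment from step 1.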

Implementation Pathg – path generator • Potential race condition solving • Presburger Omega library • Model construction: • REDLIB for EFSM with synchronizations, arrays, variable declarations, address arithmetic • Guided EFSM simulation • REDLIB semi-symbolic simulator • step, backtrack, check fixpoint/consistency

Implementation: Guided Symbolic Simulation [Figure: guided multi-threaded simulation compared with the sequential execution (golden model). Each side runs master thread → parallel tasks 1, 2, 3 → master thread → output and records a memory accessing sequence (e.g. Read: L[2][1], Write: L[2][1], Read: L[2][1], ...); the two outputs are compared.]

Implementation: Potential Race Report tg indicates the threads involved in the race. tw indicates the threads that write the memory address. Race is where the race condition is. We enumerate variables to limit the solutions.
    ===tg:i_4,i_1=====tw:i_4 Race::L[5][1]
    ===tg:i_3,i_4=====tw:i_3 Race::L[4][1]
    ===tg:i_2,i_3=====tw:i_2 Race::L[3][1]
    ===tg:i_1,i_2=====tw:i_1 Race::L[2][1]

Experiments • Environment • Ubuntu 9.10 64bit • i5-760 2.8GHz and 2GB RAM • Benchmarks • OpenMP Source Code Repository (OmpSCR) • NAS Parallel Benchmarks (NPB)

Constraint Solving of OmpSCR • Bug v1: Races manually introduced (between any two threads dealing with consecutive iterations) • Bug v2: Rare races introduced (only between two specific threads on a particular shared memory location) • Fixed: A barrier statement manually inserted (removes the race in Bug v2)

Symbolic Simulation of OmpSCR • Blind simulation needs to explore (many) more traces to hit a consistency violation! • Standard OpenMP tools fail to report the races in these benchmarks.

NAS Parallel Benchmarks • Medium-size benchmarks (1200+ to 3500+ LOC) • Efficient race constraint solving • e.g., 150000+ race constraints solved in 38 minutes by the Omega library • Satisfiable constraints are rare • 8 of 85067 constraints of nas_lu.c

nas_lu.c • Slice the program to the segment of the parallel region with satisfiable race conditions • Construct the symbolic model of the sliced segment: • 35 modes (EFSM) • Reaching the fixed point without consistency violation after 205 steps and 16.93 seconds • Benign races • All of them are used as mutual exclusion semaphores • nas_lu.c is consistent

Conclusion • Static analysis of program consistency • for real C/C++ programs with OpenMP directives • Highly automated solution • Constraint solving • Symbolic simulation • High precision: relaxed memory models • High efficiency • Extension to TBB, other memory models? • Partial order reduction?

Conclusion Symbolic approach for static consistency checking • Detect and identify races by solving race constraints (Presburger formulas) • Construct symbolic models and perform guided simulation with races • Support relaxed memory models • Find consistency violations effectively (when they exist)
