Thread Contracts for Safe Parallelism

Thread Contracts for Safe Parallelism Rajesh Karmani, P Madhusudan, Brandon Moore University of Illinois at Urbana-Champaign PPoPP 2011 NSF

What’s the problem? Data-races in parallel programs is just plain wrong. • For example, C++ gives no semantics for programs with races! But no mechanism currently for building large data-race-free programs, without significantly changing the programming style Proposal of this paper: • An annotation mechanism for data-parallel programs that helps building large data-race-free programs • Annotation augments existing code • Annotation  Race-freedom (automatically checked) • Program  Annotation (tested)

Data-race and language semantics “simultaneous” accesses to a memory location by two different threads, where one of them is a write. • Memory models for programming languages • Specifies what exactly will be ensured for a read instruction in the program • Java memory model [Manson et al., POPL ‘05] • A complex , buggy model for programs with races • C++ (new version) [Boehm, Adve, PLDI ’08] • No semantics for programs with races! Consensus – Data-Race-Free (DRF) Guarantee: Memory model assures that Data-race-free programs  sequentially consistent

ACCORD (Annotations for Concurrent Co-ORDination) • Light-weight annotation that help develop parallel software • Formally express the coordination strategy that the programmer has in mind • Check the strategy (for data-race-freedom) and its implementation • Annotations have been successful in avoiding memory errors in sequential programs • Eiffel, JML, Microsoft Code Contracts, Cofoja • NOT a type system to express memory regions or program synchronization

In this work… Annotations for data race freedom in • data-parallel programs that access arrays, vectors, matrices • with fork-join synchronization and locks • Typical for OpenMP applications Annotations express the sharing strategy

Illustration: Parallel Matrix Multiplication void mm (int[m,n] A, int[n,p] B) { for (inti:=0; i < m; i := i + 1) for (int j:=0; j < n; j := j + 1) foreach (int k:=0; k < p; k := k+1) { C[i,k] := C[i,k] + (A[i,j] * B[j,k]) ; } } • foreach loop: • semantically creates p threads; one for each iteration of the loop • implicit barrier at end of foreach loop

Illustration: Parallel Matrix Multiplication ACCORDannotation void mm (int[m,n] A, int[n,p] B) { for (inti:=0; i < m; i := i + 1) for (int j:=0; j < n; j := j + 1) foreach (int k:=0; k < p; k := k+1) reads A[i,j], B[j,k], C[i,k] writes C[i,k] { C[i,k] := C[i,k] + (A[i,j] * B[j,k]) ; } } Annotation specifies the readand write sets for kth thread, for each k. Sets can be an over-approximation. Annotation utilizes currentvalues of the program variables (i,j) and thread-id (k)

Fully Parallel Matrix Multiplication void mm ( int [m,n] A, int[n,p] B ) { foreach (inti:=0; i < m; i := i + 1) readsA[i,$x], B[$x, $y], C[i, $y] writes C[i, $y] where 0 <=$x and $x< n and 0 <= $y and $y <p { for (int j:=0; j < n; j := j + 1) foreach (int k:=0; k < p ; k := k+1) reads A[i,j], B[j,k], C[i,k] writes C[i,k] { C[i,k] := C[i,k] + (A[i,j] * B[j,k]) ; } } Aux variables ($x,$y) are quantified over specified ranges whereclause: arithmetic and logical constraints

Checking data-race-freedom Given a program with an ACCORD annotation, program has no data-races iff: I. ACCORD annotations imply race-freedom? • Check the strategy, independent of the program II. Program satisfies the ACCORD annotations? • Check program implementation against strategy

Task I: Annotations imply race-freedom? • Annotation aLogical Formula alpha (automatically) • alpha is satisifiableiff annotations allow a race • Is alpha satisfiable? Ask an SMT solver Thread 0 Thread 1 foreach (inti=0; i < 2; i++) writes A[$x] where ($x % 2 = i) Can two threads i1 and i2 write to the same position in A[] ? • i1, i2, x1, x2.(i1  i2  x1%2 = i1  x2%2 = i2  x1 = x2) Write to the Same index Thread ids are different Constraints on $x

SMT solver: What is that? • Satisfiability Modulo Theory solvers • Solvers for logical and arithmetic constraints • Extremely fast automatic decision procedures for logic developed in the verification community • We use the Z3 SMT Solver from Microsoft Research

Buggy Parallel Matrix Multiplication void mm ( int [m,n] A, int [n,p] B ) { foreach (inti:=0; i < m; i := i + 1) reads A[i,x], B[x,y], C[i,y] writes C[i,y] where 0 <=x and x< n and 0 <= y and y <p foreach (int j:=0; j < n; j := j + 1) reads A[i,j], B[j,$y], C[i,$y] writes C[i,$y] where 0 <= $y and $y <p foreach (int k:=0; k < p ; k := k+1) reads A[i,j], B[j,k], C[i,k] writes C[i,k]{ C[i,k] := C[i,k] + (A[i,j] * B[j,k]) ; } } • j1, j2 (j1  j2  0  y  y < p  i=i  y=y) • Satisfiable; hence annotation does not imply race-freedom

Task II: Program satisfies annotations? • Does a program P satisfy its ACCORDannotations? • Transformation (automatic): Program P with ACCORD annotations  Program P’ with assertions • P’ checks whether annotations hold at run-time

Task II: Program satisfies annotations? void mm ( int [m,n] A, int[n,p] B ) { foreach (inti:=0; i < m; i := i + 1) // reads A[i,$x], B[$x, $y], C[i, $y] writes C[i, $y] // where 0 <=$x and $x< n and 0 <= $y and $y <p { n’ := n; p’ := p; for (int j:=0; j < n; j := j + 1) for(int k:=0; k < p ; k := k+1){ assert ( i=i and 0 <= j and j < n’); // A[i,$x] assert ( 0 <= j and j < n’ and 0 <= k and k < p’); // B[$x,$y] assert ( i=i and 0 <= k and k < p’); // C[i, $y] C[i,k] := C[i,k] + (A[i,j] * B[j,k]) ; } } Memoize variables Convert annotation to assertions

is P + ACCORD data-race-free? TASK I Yes Transformer I SMT Solver (Z3) Formula ACCORDAnnotation No P No Transformer II Runtime Testing P’ with Assertions TASK II

Successive Over-Relaxation Checker board pattern -Phase Green: write to green elements, read blue elements, in parallel -Barrier -Phase Blue: write to blue elements, read green elements, in parallel

Successive Over-Relaxation // Phase Green: write to greens, read from blues foreach (id:=0; id < m; id := id + 1) readsw, A[id,$j], A[id-1,$j], A[id+1,$j], A[id,$j-1], A[id,$j+1] writesA[id,$j] where0 <= $j and $j < n and ((id+$j) % 2 = 0) { for (k := 1-(id%2); k<n; k := k + 2) { A[id,k] := (1-w)*A[id,k] + w*0.25*(A[id-1,k]+ A[id+1,k]+A[id,k-1]+A[id,k+1]); } } // Phase Blue: write to blues, read from greens … Modulo arithmetic constraints

Montecarlo void mc (int[nTasks] tasks) { int slice := (nTasks+nThreads-1)/nThreads; foreach (inti:=0;i<nThreads;i:=i+1) reads next under lockgl, tasks[$k] writesnext,results[$j] under lockgl where (i*slice)<= $k and $k<((i+1)*slice) and 0 <= $j and $j < nTasks { intilow := i*slice; intiupper := (i+1)*slice; if (i = nThreads-1) iupper := nTasks; for(int run:=ilow; run<iupper; run:=run+1) { int result := simulate(tasks[run]); synchronized (gl) { next := next + 1; results[next] := result; } } } } Annotations for expressing accesses protected by a lock Threads read, write to a global location by first acquiring a lock

Quicksort void qsort(int[n] A, inti, int j) reads i, j writes A[$k] where (i <= $k and $k < j) { if (j-i < 2) return; int pivot := A[i]; //first element intp_index := partition(A, pivot, i, j); // swap 1st element and element at p_index par requires p_index >= i and p_index < j { thread writes A[$k] where (i<=$k and $k<p_index) { qsort(A, i, p_index); } with thread writes A[$k] where (p_index<$k and $k<j) { qsort(A, p_index + 1, j); } } }

Sparse Matrix Vector Multiplicationsparsematmult-jgf writes SparseMatmult.yt[row[$j]] where $j >= low[t-id] and $j <= high[t-id] requires forall$i1, $i2,$x, $y. ($i1 != $i2 and low[$i1] <= $x and $x <= high[$i1] and low[$i2] <= $y and $y <= high[$i2] implies row[$x] != row [$y]) Thread 0 Low Thread 1 High SparseMatmult.yt Row

Experience with ACCORD QF_LIA – Quantifier Free Linear Integer Arithmetic, QF_NIA – Quantifier Free Non-linear Integer Arithmetic, QF_UFLIA+MA –Quantifier Free Linear Integer Arithmetic with Uninterpreted Functions and MultiplicationAxioms, AUFLIA – Arrays and Linear Integer Arithmetic with Uninterpreted Functions

Experience with ACCORD (Spec OMP) QF_LIA – Quantifier Free Linear Integer Arithmetic, QF_NIA – Quantifier Free Non-linear Integer Arithmetic, QF_UFLIA+MA –Quantifier Free Linear Integer Arithmetic with Uninterpreted Functions and MultiplicationAxioms, AUFLIA – Arrays and Linear Integer Arithmetic with Uninterpreted Functions

Experience with ACCORD (Spec OMP) • Used the ACCORD methodology to discover long-unknown and erroneous data races in applu_l benchmark in Spec OMP2001 suite (v3.1c) • Recovered the sharing strategy of complex parallel fma3d benchmark from Spec OMP2001 suite (v3.1c) • Summarized and tested the sharing strategy of 10k+ lines using 7 lines of annotation

Applications • Serve as formal specification and documentation of data sharing strategy among parallel threads • Incrementally parallelizing sequential code • May enable runtime and hardware simplification (DeNovo project at Illinois, other architectures with simple or no cache-coherence) • Safe and efficient sharing of large data among Actors and MPI processes on shared memory

Take home message “Thread Contracts for Safe Parallelism,” Rajesh Karmani, P Madhusudan, Brandon Moore. • ACCORD: a light-weight, formal annotation language that allows programmers to document the sharing strategy among threads • Reduces the task of checking race-freedom to two simpler problems of constraint satisfaction and program testing • Does NOT change the style of programming • Only for fork-join parallelism on arrays/matrices now • Future work: • Combine Static regions of dynamic heaps (See work on DPJ) and Dynamic regions (ACCORD) • Support aliasing and richer synchronization idioms • Help in efficiently testing parallel programs

Thread Contracts for Safe Parallelism

Thread Contracts for Safe Parallelism

Presentation Transcript

Chapter 4: Multiprocessors and Thread-Level Parallelism

CPE 631: Multiprocessors and Thread-Level Parallelism

Chapter 6 Multiprocessors and Thread-Level Parallelism

CPE 731 Advanced Computer Architecture Thread Level Parallelism

Chapter 5: Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism

Encoding H.264 by Thread Level Parallelism

Designing a thread-safe class

Multiprocessors and Thread-Level Parallelism

Neither More Nor Less: Optimizing Thread-level Parallelism for GPGPUs

Programming Explicit Thread-level Parallelism

Chapter 5 Multiprocessors and Thread-Level Parallelism

Chapter 5: Multiprocessors (Thread-Level Parallelism)– Part 2

Optimistic Intra-Transaction Parallelism using Thread Level Speculation

Single Thread Parallelism

Chapter 5: Thread Level Parallelism and Cache Coherence

Thread Level Parallelism (TLP)

CPE 432 Computer Design 11 – Thread Level Parallelism

Chapter 5 Multiprocessors and Thread-Level Parallelism

CPE 731 Advanced Computer Architecture Thread Level Parallelism

Chapter 5 Thread-Level Parallelism