Gauss: A Framework for Verifying Scientific Software
Robert Palmer, Steve Barrus, Yu Yang, Ganesh Gopalakrishnan, Robert M. Kirby
University of Utah
Supported in part by NSF Award ITR-0219805
Motivations • "One of the simulations will run for 30 days. A Cray supercomputer built in 1995 would take 60,000 years to perform the same calculations." • 12,300 GFLOPS • Requires permission or a grant to use.
Motivations • 136,800 GFLOPS • Max $10k/week on a Blue Gene (180 GFLOPS) at IBM's Deep Computing Lab
Motivations • 50% of the development effort for parallel scientific codes is spent in debugging [Vetter and deSupinski 2000] • Programmers come from a variety of backgrounds, often not computer science
Overview • What scientific programs look like • What challenges scientific code developers face • How formal methods can help • The Utah Gauss project
SPMD Programs • Single Program Multiple Data • Same image runs on each node in the grid • Processes do different things based on rank • Possible to impose a virtual topology within the program
MPI Library for communication • MPI is to HPC what PThreads is to systems programming or OpenGL is to graphics • More than 60% of HPC applications use MPI libraries in some form • There are both proprietary and open-source implementations • MPI-1 provides both communication primitives and virtual topologies (see the sketch below)
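To make the virtual-topology bullet concrete, here is a minimal sketch, not taken from the talk, that uses MPI-1 calls to arrange the processes in a periodic 2-D Cartesian grid and look up each rank's neighbors; the particular grid shape is simply whatever MPI_Dims_create chooses for the launched process count.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int size, rank, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    int left, right;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);               /* factor the processes into a 2-D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);       /* my (row, column) coordinates */
    MPI_Cart_shift(grid, 1, 1, &left, &right);    /* neighbors along the second dimension */

    printf("rank %d at (%d,%d): left=%d right=%d\n",
           rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}

Because the grid is periodic, every rank has both neighbors; with reorder set to 1 the implementation may renumber ranks to match the machine, which is exactly the kind of virtual topology MPI-1 offers.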
Concurrency Primitives
• Point-to-point communications that either
  • don't specify system buffering (though some implementations provide it), and either block or don't block, or
  • use user-program-provided buffering (with possibly hard or soft limits), and either block or don't block
• Collective communications that "can (but are not required to) return as soon as their participation in the collective communication is complete." [MPI-1.1 Standard, p. 93, lines 10-11]
The point-to-point send modes are contrasted in the sketch below.
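The following sketch, illustrative rather than taken from the talk, contrasts the point-to-point modes just listed: a standard blocking MPI_Send (system buffering unspecified), a non-blocking MPI_Isend completed by MPI_Wait, and a buffered MPI_Bsend that draws on user-attached buffer space with a hard size limit. It assumes at least two processes are launched.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, data = 7, recv0, recv1, recv2;
    MPI_Request req;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Standard blocking send: may or may not use system buffering;
           it only guarantees the send buffer is reusable on return. */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Non-blocking send: returns immediately; completion is checked with MPI_Wait. */
        MPI_Isend(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);

        /* Buffered send: uses user-provided buffering with a hard size limit. */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&data, 1, MPI_INT, 1, 2, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);   /* blocks until the buffered message is handed off */
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&recv0, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Recv(&recv1, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &st);
        MPI_Recv(&recv2, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &st);
    }

    MPI_Finalize();
    return 0;
}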
MPI Tutorial

#include <mpi.h>
#define CNT 1
#define TAG 1

int main(int argc, char **argv) {
  int mynode = 0, totalnodes = 0, recvdata0 = 0, recvdata1 = 0;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
  MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

  if (mynode % 2 == 0) {
    MPI_Send(&mynode, CNT, MPI_INT, (mynode+1) % totalnodes, TAG, MPI_COMM_WORLD);
    MPI_Send(&mynode, CNT, MPI_INT, (mynode-1+totalnodes) % totalnodes, TAG, MPI_COMM_WORLD);
  } else {
    MPI_Recv(&recvdata0, CNT, MPI_INT, (mynode-1+totalnodes) % totalnodes, TAG, MPI_COMM_WORLD, &status);
    MPI_Recv(&recvdata1, CNT, MPI_INT, (mynode+1) % totalnodes, TAG, MPI_COMM_WORLD, &status);
  }
  MPI_Barrier(MPI_COMM_WORLD);

  if (mynode % 2 == 1) {
    MPI_Send(&mynode, CNT, MPI_INT, (mynode+1) % totalnodes, TAG, MPI_COMM_WORLD);
    MPI_Send(&mynode, CNT, MPI_INT, (mynode-1+totalnodes) % totalnodes, TAG, MPI_COMM_WORLD);
  } else {
    MPI_Recv(&recvdata0, CNT, MPI_INT, (mynode-1+totalnodes) % totalnodes, TAG, MPI_COMM_WORLD, &status);
    MPI_Recv(&recvdata1, CNT, MPI_INT, (mynode+1) % totalnodes, TAG, MPI_COMM_WORLD, &status);
  }
  MPI_Barrier(MPI_COMM_WORLD);

  MPI_Finalize();
}
Why is parallel scientific programming hard? • Portability • Scaling • Performance
Variety of bugs that are common in parallel scientific programs • Deadlock • Race conditions • Misunderstanding the semantics of MPI procedures • Resource-related assumptions • Incorrectly matched sends/receives (a minimal deadlock example follows)
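As a concrete instance of the first bullet, here is a minimal sketch, not taken from the talk, of the simplest deadlock pattern: both ranks post a blocking receive before their send, so neither call can ever be matched. Reversing the order so that both ranks send first yields the "resource-related assumptions" variant, which hangs only when the messages exceed the MPI implementation's internal buffering.

/* Illustrative deadlock: run with exactly 2 processes (assumed). */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, other, recvbuf, sendbuf = 42;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* the partner rank */

    /* Both ranks block here waiting for a message that is never sent. */
    MPI_Recv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &st);
    MPI_Send(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}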
State of the art in Debugging • TotalView • Parallel debugger – trace visualization • Parallel DBX • gdb • MPICHECK • Does some deadlock checking • Uses trace analysis
Related work • Verification of wildcard-free models [Siegel, Avrunin, 2005]: deadlock-free with zero-length buffers implies deadlock-free with buffers of length greater than zero • SPIN models of MPI programs [Avrunin, Siegel, Siegel, 2005] and [Siegel, Mironova, Avrunin, Clarke, 2005]: compare serial and parallel versions of numerical computations for numerical equivalence
Automatic Formal Analysis • Correctness could be proved by hand in a theorem prover, but we don't want to spend time building models • The approach should be completely automatic (it is intended for use by the scientific community at large)
The Big Picture • Automatic model extraction • Improved static analysis • Model checking • Better partial-order reduction • Parallel state-space enumeration • Symmetry • Abstraction Refinement • Integration with existing tools • Visual Studio • TotalView
The Big Picture (tool-chain diagram)

The slide shows the Gauss tool chain. An example MPI program (rank 0 sends an index to every other rank, which receives and prints it) is compiled into an MPI binary. The Model Generator, driven by the compiler, extracts a program model from the source and composes it with an Environment Model and a hand-written MPI Library Model (e.g., proctype MPI_Send(chan out, int c){ out!c; }, MPI_Bsend, MPI_Isend, and a typedef for MPI_Status with MPI_SOURCE, MPI_TAG, and MPI_ERROR fields). The resulting Zing model (illustrated on the slide by two threads T1 and T2 racing on a shared variable y, ending in assert(y == 0)) is checked by a distributed model checker: an MC Server coordinates many MC Clients. The Result Analyzer either reports OK or hands a counterexample to the Error Simulator; the Abstractor and Refinement components close the abstraction-refinement loop.
Environment modeling • C • Very prevalent among HPC developers • We want to analyze the code as it is written for performance • Zing has almost everything needed • Numeric types • Pointers, arrays, casting, recursion, … • Missing only one thing: the address-of operator "&" (not bitwise AND) • We provide a layer that makes this possible • It can also track pointer arithmetic and unsafe casts • We also provide a variety of stubs for system calls
Environment Example

class pointer {
    object reference;

    static object addressof(pointer p) {
        pointer ret;
        ret = new pointer;
        ret.reference = p;
        return ret;
    }
    …
}

• Encapsulating data in Zing objects makes it possible to handle additional C-isms
MPI Library • MPI Library modeled carefully by hand from the MPI Specification • Preliminary shared memory based implementation • Send, Recv, Barrier, BCast, Init, Rank, Size, Finalize.
Library Example

integer MPI_Send(pointer buf, integer count, integer datatype,
                 integer dest, integer tag, integer c) {
    …
    comm = getComm(c);
    atomic {
        ret = new integer;
        msg1 = comm.create(buf, count, datatype, _mpi_rank, dest, tag, true);
        msg2 = comm.find_match(msg1);
        if (msg1 != msg2) {
            comm.copy(msg1, msg2);
            comm.remove(msg2);
            msg1.pending = false;
        } else {
            comm.add(msg1);
        }
        ret.value = 0;
    }
    select {
        wait(!msg1.pending) -> ;
    }
    …
}
Model Extraction • Map C onto Zing (using CIL) • The source is first run through the C preprocessor (cpp) • Processes become Zing threads • Each file becomes a Zing class • Structs and unions are also extracted to classes • Integral data types map to the environment layer • All numeric types • The pointer class
Extraction Example

Original C call:

    MPI_Recv(&recvdata1, CNT, MPI_INT, (mynode+1)%totalnodes, TAG, MPI_COMM_WORLD, &status);

Extracted Zing code:

    __cil_tmp45 = integer.addressof(recvdata1);
    __cil_tmp46 = integer.create(1);
    __cil_tmp47 = integer.create(6);
    __cil_tmp48 = integer.create(1);
    __cil_tmp49 = integer.add(mynode, __cil_tmp48);
    __cil_tmp50 = integer.mod(__cil_tmp49, totalnodes);
    __cil_tmp51 = integer.create(1);
    __cil_tmp52 = integer.create(91);
    __cil_tmp53 = __anonstruct_MPI_Status_1.addressof(status);
    MPI_Recv(__cil_tmp45, __cil_tmp46, __cil_tmp47, __cil_tmp50,
             __cil_tmp51, __cil_tmp52, __cil_tmp53);
Experimental Results • Correct example • 2 processes: 12,882 states • 4 processes: Does not complete • Deadlock example • 24 processes: 2,522 states
Possible Improvements • Atomic regions • Constant reuse • More formal extraction semantics
Looking ahead (in the coming year) • Full MPI-1.1 library model in Zing • All point-to-point and collective communication primitives • Virtual topologies • ANSI C-capable model extractor • Dependencies: CIL, GCC, CYGWIN • Preliminary tool integration • Error visualization and simulation • Textbook and validation-suite examples
Looking ahead (beyond) • Better Static Analysis through • Partial-order reduction • MPI library model is intended to leverage transaction based reduction • Can improve by determining transaction independence
Looking ahead (beyond) • Better Static Analysis through • Abstraction Refinement • Control flow determined mostly by rank • Nondeterministic over-approximation
Looking ahead (beyond) • Better Static Analysis through • Distributed computing • Grid based • Client server based
Looking ahead (beyond) • More library support • MPI-2 Library • One sided communication • PThread Library • Mixed MPI/PThread concurrency • More languages • FORTRAN • C++ • Additional static analysis techniques
Can we get more performance? • Can we phrase a performance bug as a safety property, e.g., "there does not exist a communication chain longer than N" (sketched below)? • Is there a way to leverage formal methods to reduce synchronizations? • Can formal methods help determine the right balance between MPI and PThreads for concurrency?
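As an illustration only, and not part of the Gauss tool, the "no communication chain longer than N" property could be phrased as an executable safety property by carrying a hop count in each forwarded message and asserting a hypothetical bound MAX_CHAIN; a model checker, or simply a run with assertions enabled, then flags any chain that grows too long.

#include <mpi.h>
#include <assert.h>

#define MAX_CHAIN 4   /* hypothetical bound N on the chain length */

int main(int argc, char **argv) {
    int rank, size, hops = 0;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank > 0)   /* receive the hop count from the predecessor in the chain */
        MPI_Recv(&hops, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &st);

    assert(hops <= MAX_CHAIN);   /* the safety property: chain length is bounded */

    hops++;
    if (rank < size - 1)   /* forward along the chain */
        MPI_Send(&hops, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}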