
Presentation Transcript


  1. Dynamic Formal Verification Methods for Concurrency. Ganesh Gopalakrishnan, School of Computing, University of Utah, Salt Lake City, UT 84112, USA. http://www.cs.utah.edu/fv Supported by NSF CNS 0509379, CCF 0811429, CCF 0903408, SRC tasks TJ 1847.001 and TJ 1993, and Microsoft

  2. Acknowledgements for this talk • This talk is based on research done by the following students: Sarvani Vakkalanka, Anh Vo, Subodh Sharma, Ben Meakin, Michael DeLisi, Alan Humphrey, Chris Derrick, Sriram Aananthakrishnan, Guodong Li, Grzegorz Szubzda, Simone Atzeni, Wei-Fan Chiang, Carson Jones, Geof Sawaya

  3. Computing Used to Enjoy the “Free Lunch” – no more! • A quote from Bill Joy from the early 1990s (paraphrased): “If you have to solve a compute-intensive problem that would take 4 years to run, you are probably better off waiting 3 years doing nothing, then buying a machine of a modern vintage, and solving the problem in 6 months.” I.e., by doing NOTHING to the code base, magic happened: • The code got faster! (of course, bit-rot had to be reckoned with, but that is not a show-stopper) • This was called “the Free Lunch” by Herb Sutter of Microsoft • It is over! Now we have to do actual work to speed up code!

  4. Alas, the Cessation of “The Free Lunch” has led to a “Feeding Frenzy” of solutions in concurrency Some of today’s proposals: • Threads (various) • Message Passing (various) • Transactional Memory (various) • OpenMP • MPI • MCAPI • Intel’s Ct • RapidMind (now owned by Intel) • Microsoft’s Task Parallel Library • Axum • Cilk (now owned by Intel) • Intel’s TBB • Nvidia’s CUDA • OpenCL (photo courtesy of Intel Corporation)

  5. This causes more ways in which codes crash! • Sequential Program Bugs Remain a Challenge • Null pointer dereferences • Array out of bounds • Resource leaks • Wrong computations • Concurrency bugs get introduced • Deadlocks • Livelocks • Race conditions • Concurrency situates sequential bugs in a hard-to-reproduce state space • Exponential interleaving space • “Heisenbugs”

  6. Two general styles of process/thread interaction: Shared Memory Threading, and Message Passing. Let us intuitively understand what bug density to expect in each of these styles.

  7. Message Passing • More isolation (processes operate in separate memory spaces) • Hence lower bug density • Most interleavings (schedule changes) do not trigger new bugs • So we have to go after those elusive (hard-to-find) schedules using much more semantic information!

  8. How testing can miss errors in message passing P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);

  9. How testing can miss errors in message passing P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);

  10. How testing can miss errors P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);

  11. How testing can miss errors P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);

  12. How testing can miss errors P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);

  13. How testing can miss errors P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); unlucky

  14. How testing can miss errors P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); lucky
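The scenario animated on slides 8–14 can be fleshed out into a small runnable MPI program. The following is a minimal sketch, assuming illustrative rank numbers, tags, and an assert standing in for error1 (it is not the exact test program behind the slides): whether error1 is reached depends entirely on whose second message the MPI runtime happens to match to P1’s wildcard receive. Build with mpicc and run with mpirun -np 3.

```c
/* Minimal sketch of the slides' scenario (illustrative ranks/tags, not the
 * exact test program): P1 posts a wildcard receive that can match either
 * P0's data=22 send or P2's data=33 send.  Only the P0 match reaches
 * error1, so a plain test run may never expose it. */
#include <mpi.h>
#include <assert.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, tmp, hdr = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* P0 */
        int data = 22;
        MPI_Send(&hdr,  1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* P1 */
        MPI_Recv(&tmp, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&tmp, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Wildcard receive: matches whichever data message arrives first. */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (x == 22) {
            fprintf(stderr, "error1 reached\n");   /* the elusive schedule */
            assert(0);
        } else {
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    } else if (rank == 2) {                /* P2 */
        int data = 33;
        MPI_Send(&hdr,  1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```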

  15. Deterministic Msg Passing Steps Should Not be Interleaved in Alternate Ways to eke out Msg. Passing Bugs (low priority) Don’t interleave these deterministic send/receive actions, as they are independent P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33);

  16. We MUST go after those interleavings (or message matches) which affect the behavior (in this case, non-deterministic message matches) Don’t interleave these deterministic send/receive actions, as they are independent P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); But do consider these cases. Perform dynamic rewriting of * into P0 and P2, in turn (to force-match these cases in the MPI runtime), and verify both these cases.
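To make “dynamic rewriting of * into P0 and P2” concrete, here is a sketch of the mechanism (not ISP’s actual implementation): before a wildcard receive is handed to the MPI runtime, the verifier substitutes the concrete rank it wants to force-match for MPI_ANY_SOURCE. The FORCE_SRC environment variable below is a hypothetical stand-in for the scheduler’s decision, so both matches can be exercised by replaying with FORCE_SRC=0 and then FORCE_SRC=2.

```c
/* Conceptual sketch (not ISP's actual code) of dynamic wildcard rewriting:
 * the interposition layer substitutes a concrete source rank for
 * MPI_ANY_SOURCE before forwarding the receive to the MPI runtime,
 * forcing a particular send/receive match to be verified. */
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical stand-in for the verifier's scheduler: the forced source
 * rank is read from the environment (FORCE_SRC=0 on one replay,
 * FORCE_SRC=2 on the next). */
static int scheduler_pick_source(void) {
    const char *s = getenv("FORCE_SRC");
    return s ? atoi(s) : MPI_ANY_SOURCE;
}

int verified_recv(void *buf, int count, MPI_Datatype type, int source,
                  int tag, MPI_Comm comm, MPI_Status *status) {
    if (source == MPI_ANY_SOURCE)
        source = scheduler_pick_source();   /* rewrite '*' into P0 or P2 */
    return MPI_Recv(buf, count, type, source, tag, comm, status);
}
```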

  17. The table of results in the URL below shows how badly testing fares with respect to message passing programs! http://www.cs.utah.edu/fv/ISP_Tests

  18. Interesting Study of Schedule Perturbation Testing @InProceedings{PADTAD2006:JitterBug, author = {Richard Vuduc and Martin Schulz and Dan Quinlan and Bronis de Supinski and Andreas S{\ae}bj{\"o}rnsen}, title = {Improving distributed memory applications testing by message perturbation}, booktitle = {Proc.~4th Parallel and Distributed Testing and Debugging (PADTAD) Workshop, at the International Symposium on Software Testing and Analysis}, address = {Portland, ME, USA}, month = {July}, year = {2006} } http://vuduc.org/research/jitterbugindex.html [Figures: image without perturbations / image with perturbations]

  19. Trouble with Jitterbug Approach for Message Passing: Density of Deterministic Msg Matches is higher; hence Most Perturbations Likely to be Unproductive! Don’t interleave these deterministic send/receive actions, as they are independent P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); But do consider these cases. Perform dynamic rewriting of * into P0 and P2, in turn (to force match these cases in the MPI runtime), and verify both these cases.

  20. The probability of perturbing non-det message matches would also be very low (back of envelope argument). REALLY GO AFTER non-det variety! Leave det. Matches alone! Don’t interleave these deterministic send/receive actions, as they are independent P0 --- MPI_Send(to P1…); MPI_Send(to P1, data=22); P1 --- MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then error1 else MPI_Recv(*, x); P2 --- MPI_Send(to P1…); MPI_Send(to P1, data=33); But do consider these cases. Perform dynamic rewriting of * into P0 and P2, in turn (to force match these cases in the MPI runtime), and verify both these cases.

  21. Another illustration of schedule perturbation in message passing

  22. MPI program ‘lucky.c’ (tester catches bug; gets a raise) Process P0 R(from:*, r1) ; R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws… Process P1 Sleep(3); S(to:0, r1); All the Ws… Process P2 //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…

  23. MPI program ‘unlucky.c’ (tester misses bug; gets fired!) Process P0 R(from:*, r1) ; R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws… Process P1 // Sleep(3); S(to:0, r1); All the Ws… Process P2 Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…

  24. ‘lucky.c’ Process P0 R(from:*, r1) ; R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws… Process P1 Sleep(3); S(to:0, r1); All the Ws… Process P2 //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…

  25. ‘lucky.c’ Process P0 R(from:*, r1) ; R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws… Process P1 Sleep(3); S(to:0, r1); All the Ws… Process P2 //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws… deadlock

  26. ‘unlucky.c’ Process P0 R(from:*, r1) ; R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws… Process P1 // Sleep(3); S(to:0, r1); All the Ws… Process P2 Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…

  27. ‘unlucky.c’ Process P0 R(from:*, r1) ; R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws… Process P1 // Sleep(3); S(to:0, r1); All the Ws… Process P2 Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws… No deadlock

  28. Shared Memory “Threading” • Less isolation (every shared memory update is potentially “seen” by other threads) • Hence schedule perturbation is going to be much more productive • Still one must use a CONTROLLED or GUIDED way of schedule perturbation – not totally random • Use Dynamic Partial Order Reduction • Exact, and TRULY tries to GO AFTER DEPENDENCIES • Use Preemption Bounded Searching • Much easier to realize across multiple platforms, but still produces wasted interleavings if the bug density is low

  29. Relevant Interleavings for Shared Memory init: x = 0; y = 0; t0: x++; if (x > 1) assert(0); t1: y++; x++; Question : How can you shuffle the actions of threads t0 and t1 so as to cause the “assert(0)” to be reached?

  30. Relevant Interleavings for Shared Memory init: x = 0; y = 0; t0: x++; if (x > 1) assert(0); t1: y++; x++; Question : How can you shuffle the actions of threads t0 and t1 so as to cause the “assert(0)” to be reached? Ans: See the arrows! BEGIN HERE !

  31. Relevant Interleavings for Shared Memory init: x = 0; y = 0; t0: x++; if (x > 1) assert(0); t1: y++; x++; Question: How can you shuffle the actions of threads t0 and t1 so as to cause the “assert(0)” to be reached? MAIN IDEA: You can avoid playing unnecessary schedules out! That is what kills testing!! I.e. no need to try x++ ; y++ ; x++ ; .. no need to try x++ ; if ; y++ ; x++
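The two-thread example on slides 29–31 can be written as a small Pthreads program; here is a minimal sketch (the thread bodies follow the slides, the scaffolding around them is assumed). The accesses to x are deliberately unsynchronized: only schedules in which t1’s x++ takes effect before t0’s test can drive x above 1 and trip the assert, which is exactly the “relevant interleaving” a guided search should target.

```c
/* Minimal Pthreads rendering of the slides' example.  x is deliberately
 * left unsynchronized: assert(0) fires only under the few interleavings
 * where t1's x++ takes effect before t0 evaluates (x > 1). */
#include <pthread.h>
#include <assert.h>

int x = 0, y = 0;                 /* init: x = 0; y = 0; */

static void *t0(void *arg) {
    (void)arg;
    x++;
    if (x > 1) assert(0);         /* reached only on some schedules */
    return NULL;
}

static void *t1(void *arg) {
    (void)arg;
    y++;
    x++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;                     /* build with: cc -pthread example.c */
}
```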

  32. All Interleavings versus Relevant Interleavings illustrated [Figure: two card decks, Card Deck 0 and Card Deck 1, each holding six cards numbered 0–5] • Suppose only the interleavings of the red cards matter • Then don’t try all riffle-shuffles: (12!) / ((6!) (6!)) = 924 • Just do TWO shuffles!!
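As a quick check of the count on the slide: a riffle shuffle keeps each deck’s internal order, so it is determined by choosing which 6 of the 12 final positions hold Deck 0’s cards:

\[
\frac{12!}{6!\,6!} \;=\; \binom{12}{6} \;=\; 924 .
\]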

  33. How about Combined Models?! • It is less well understood how best to mix shared memory and message passing • Yet quite important going forward • Making message passing codes more efficient by eliminating blatant copying of large arrays • Emerging embedded heterogeneous multi-core communication standards from MCA (see next) • However, without disciplined programming practices, we will be in a debugging arena of extreme pain!

  34. The “big picture” of multi-core computing, and where the Multi-core Association APIs are situated

  35. Background: Concurrency Space in Multicore Era

  36. Background/Motivation: Formalize Emerging Communications APIs in the Embedded Space • Demonstrate and Evaluate Prototype Solutions • Formalize Standards, build Query Oracle, Derive Tests • Build Dynamic Formal Verifier for Applications

  37. Background: What is the Multicore Communication API? • An API specification from MCA (the Multicore Association) • To program embedded systems like mobile phones, PDAs, routers, servers, etc. • Not restricted to SPMD (like MPI) or multi-threaded style of programming.

  38. Scope of work: Formalize Related APIs and their Inter-API Interactions MTAPI – Task Management API • Specification work yet to begin • Thread pooling, work-stealing queues, e.g. Cilk, TBB, TPL, etc. MCAPI – Communication API • Message based • Packet/Scalar channel based MRAPI – Resource Management API • Semaphores, mutexes • Shared memory segment allocation, deallocation

  39. MCC – MCAPI Checker (in progress) [Workflow of MCC: an MCAPI C program is instrumented and compiled into an executable; at run time, threads 1..n interact with a scheduler through an MCAPI library wrapper via request/permit exchanges.]

  40. We have built a tool for Thread App. Verification – Inspect [Workflow of Inspect: a multithreaded C program is passed through a program analyzer and a program instrumentor; the instrumented program is compiled into an executable; at run time, threads 1..n interact with a scheduler through a thread library wrapper via request/permit exchanges.]
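The request/permit handshake in the Inspect workflow can be illustrated with a small wrapper; the sketch below is conceptual (it is not Inspect’s actual code, and sched_request/sched_done are hypothetical hooks that a real tool would connect to the central scheduler, e.g. over a socket). The point is simply that every visible operation asks the scheduler for permission before it runs, which is what lets the scheduler dictate the interleaving.

```c
/* Conceptual sketch of a thread-library wrapper implementing the
 * request/permit protocol: each visible operation is announced to a
 * central scheduler before it executes.  The hooks below are stubs so
 * the sketch compiles stand-alone; a real tool would block in
 * sched_request() until the scheduler grants permission. */
#include <pthread.h>
#include <stdio.h>

static void sched_request(const char *op, void *obj) {   /* hypothetical hook */
    fprintf(stderr, "request: %s on %p\n", op, obj);
}
static void sched_done(const char *op, void *obj) {       /* hypothetical hook */
    fprintf(stderr, "done:    %s on %p\n", op, obj);
}

int wrapped_mutex_lock(pthread_mutex_t *m) {
    sched_request("mutex_lock", m);     /* ask before acting */
    int rc = pthread_mutex_lock(m);     /* the real operation */
    sched_done("mutex_lock", m);        /* report completion */
    return rc;
}

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    wrapped_mutex_lock(&m);             /* usage example */
    pthread_mutex_unlock(&m);
    return 0;
}
```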

  41. Our tool for Msg Passing App Verification – ISP [Architecture of ISP: the compiled MPI program runs as Proc1 … Procn; an interposition layer sits between the processes and the MPI runtime, driven by a scheduler.] • Hijack MPI calls • Scheduler decides how they are sent to the MPI runtime • Scheduler plays out only the RELEVANT interleavings (to detect safety violations such as deadlocks and assertion violations)
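The “hijacking” of MPI calls can be done with the MPI standard’s profiling interface: every call also exists under a PMPI_ name, so a wrapper linked with the application can intercept MPI_Send, consult a scheduler, and then issue the real call. The sketch below illustrates the interception idea only; consult_scheduler is a hypothetical stand-in for ISP’s scheduler logic, and the prototype follows the MPI-3 const-qualified signature.

```c
/* Sketch of MPI call interception via the standard PMPI profiling
 * interface.  Linking this wrapper with the application routes MPI_Send
 * through our code; a dynamic verifier's scheduler decides when (and,
 * for wildcard receives, how) the call reaches the MPI runtime. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for the scheduler: a real tool would block here
 * until this call is permitted; we only log the intercepted call. */
static void consult_scheduler(const char *call, int peer) {
    fprintf(stderr, "intercepted %s (peer rank %d)\n", call, peer);
}

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    consult_scheduler("MPI_Send", dest);
    return PMPI_Send(buf, count, type, dest, tag, comm);   /* the real call */
}
```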

  42. Research Portfolio • Dynamic Verification of MPI programs • ISP tool released – runs on multiple platforms • GEM – Graphical Explorer of Message Passing – released • Built using Eclipse PTP and CDT • Will ship with Eclipse PTP v 3.0 • Of relevance to Multi-core Association in their Tool Infrastructure effort • Distributed ISP being built • Search heuristics • Bounded Mixing • Other bounded search methods • Deterministic replay of mixed MPI / Thread programs • Teaching MPI and Message Passing principles (e.g. Happens Before of Message Passing)

  43. Research Portfolio • Dynamic Verification of Pthread programs • Inspect tool released • Scaling techniques for Dynamic Verification studied • Significant uptake of Inspect at NEC Research • Useful for unit-testing of Pthread codes • Useful for teaching

  44. Research Portfolio • Dynamic Verification of MCAPI applications • MCC tool under construction • MCAPI Applications being developed • Formal specification of MCAPI in progress • Specification driven platform test generation • Putative query oracle to be built • Study of mixed API interactions

  45. Research Portfolio • Distributed Verification • Eddy Murphi – Murphi on Clusters – for verifying Cache Coherence Protocols • Inspect can be parallelized on clusters (promising results) • This may be of significant value, given that people don’t find the time to write unit tests and/or want end-to-end push-button formally directed “tests”

  46. Research Portfolio • Hardware design and verification • Design of XUM in progress • MCAPI in a MIPS Core • Network on Chips • May be a good platform for SixthSense based verification

  47. Research Portfolio • Verifying CUDA Programs for Races / Assertions • CUDA and OpenCL programs are important going forward • Use of vector parallelism • Hand-in-hand growth of programming models such as Intel’s Ct • Introduction of architectures such as Larrabee / Fermi makes this space very interesting • Predictions are that future HPC systems will use these • We have built a tool-flow that takes CUDA • Translate CUDA using LLNL’s ROSE compiler into Yices SMT formulae • Formulate race-checking and assertion-checking queries, and solve using an SMT solver • Optimizations for handling loops, barriers, etc. in progress • May be able to calculate bank conflicts using SMT (in progress)

  48. XUM: Evaluation Platform for work on MCA APIs • Modern embedded multicore Systems on Chip (SoC) have scalable on-chip interconnects • Communication between cores is done through sending/receiving packets • MCAPI should have low-level control of this type of hardware • Greater communication performance than existing software methods • Lower power by giving control of the network to MCAPI • The prototype hardware is called the eXtensible Utah Multicore (XUM)

  49. Focus of the Rest of the talk : ISP

  50. MPI is dominant in the high-performance computing world
