Distributed Dynamic Partial Order Reduction based Verification of Threaded Software Yu Yang (PhD student; summer intern at CBL) Xiaofang Chen (PhD student; summer intern at IBM) Ganesh Gopalakrishnan Robert M. Kirby School of Computing University of Utah SPIN 2007 Workshop Presentation Supported by: Microsoft HPC Institutes NSF CNS 0509379
Thread programming will become more prevalent; FV of thread programs will grow in importance
Why FV for Threaded Programs • > 80% of chips shipped will be multi-core
Model Checking will Increasingly be through Dynamic Methods. Also known as Runtime or In-Situ methods
Why Dynamic Verification Methods • Even after early life-cycle modeling and validation, the final code will have far more details • Early life-cycle modeling is often impossible • Use of library APIs such as MPI, OpenMP, Shmem, … • Library function semantics can be tricky • The bug may be in the library function implementation
Why Stateless • One may not be able to access much of the state, e.g. the state of the OS • It is expensive to hash states and look up revisits • Stateless search is easier to parallelize
Why POR?
Process P0: 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize
Process P1: 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize
Only a few of these operations are dependent on each other.
• 504 interleavings without POR: (2 * 10!) / (5!)^2 = 504
• 2 interleavings with POR !!
Dynamic POR is almost a “must”! (Dynamic POR as in Flanagan and Godefroid, POPL 2005)
Why Dynamic POR? Two concurrent statements: a[k]-- and a[j]++ • The ample set depends on whether j == k • Can be very difficult to determine statically • Can be determined dynamically (see the sketch below)
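A minimal pthreads sketch of the situation above (not from the talk; the array size and index values are illustrative). Whether the two threads conflict depends on j and k, which in general are known only at run time:

#include <pthread.h>
#include <stdio.h>

int a[8];
int j, k;                      /* in general computed at run time, e.g. from input */

void *dec(void *arg) { a[k]--; return NULL; }   /* thread 1: a[ k ]-- */
void *inc(void *arg) { a[j]++; return NULL; }   /* thread 2: a[ j ]++ */

int main(void) {
    j = 3; k = 3;              /* j == k: the two accesses are dependent (and race); */
                               /* with j != k they commute and are independent       */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, dec, NULL);
    pthread_create(&t2, NULL, inc, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a[3] = %d\n", a[3]);
    return 0;
}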
Why Dynamic POR? The notion of action dependence (crucial to POR methods) is a function of the execution
Computation of “ample” sets in Static POR versus in DPOR
• Static POR: the ample set is determined using “local” criteria
• DPOR: the ample set is built incrementally, based on dependencies observed during execution
• Looking back from the current state, when the next move of a process is found to be dependent on an earlier transition, that process is added to the “Backtrack Set” ({BT}) of the nearest dependent transition; processes already explored from a state are in its “Done” set
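A hedged sketch of this incremental backtrack-set update, in the spirit of Flanagan and Godefroid's DPOR. The data structures and the dependence test (two transitions conflict if they touch the same object) are illustrative simplifications; co-enabledness checks are omitted:

#include <stdbool.h>

#define MAX_DEPTH   1024
#define MAX_THREADS 8

typedef struct {
    int tid;                       /* thread that executed this transition        */
    int obj;                       /* object (lock, shared variable) it touched   */
} transition_t;

typedef struct {
    transition_t trans;            /* transition taken from this state            */
    bool backtrack[MAX_THREADS];   /* threads still to be explored from here      */
    bool done[MAX_THREADS];        /* threads already explored from here          */
} stack_entry_t;

static stack_entry_t stack[MAX_DEPTH];
static int depth;                  /* current size of the search stack            */

static bool dependent(const transition_t *a, const transition_t *b) {
    return a->obj == b->obj;       /* simplified dependence test                  */
}

/* Before executing `next`, look back for the NEAREST dependent transition and
 * add next->tid to the backtrack set of the state it was executed from. */
void update_backtrack(const transition_t *next) {
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(&stack[i].trans, next)) {
            if (!stack[i].done[next->tid])
                stack[i].backtrack[next->tid] = true;
            break;                 /* standard DPOR stops at the nearest one      */
        }
    }
}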
Putting it all together … • We target C/C++ PThread programs • Instrument the given program (largely automated) • Run the concurrent program “till the end” • Record interleaving variants while advancing • When the number of recorded backtrack points reaches a soft limit, spill work to other nodes • In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes • A heuristic to avoid recomputations was essential for the speed-up • First known distributed DPOR
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} (each state on the stack is annotated with its {backtrack set}, {done set}; both start empty)
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock t0: unlock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock t0: unlock t1: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock t1: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {}, {} t1: lock t1: unlock t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1}
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2} t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2} t2: lock t2: unlock …
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2}
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t2}, {t0,t1}
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t2}, {t0, t1} t1: lock t1: unlock …
For this example, all the paths are explored during DPOR; for others, DPOR explores a proper subset
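The three-thread lock/unlock program from the walkthrough, written out as a runnable C/pthreads sketch (not code from the talk; the names match the slides). All three threads contend for the same lock, so every acquisition order is dependent, which is why DPOR ends up exploring every path here:

#include <pthread.h>

pthread_mutex_t t = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    pthread_mutex_lock(&t);     /* lock(t)   */
    pthread_mutex_unlock(&t);   /* unlock(t) */
    return NULL;
}

int main(void) {
    pthread_t t0, t1, t2;
    pthread_create(&t0, NULL, worker, NULL);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}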
Idea for parallelization: Explore computations from the backtrack set in other processes. “Embarrassingly Parallel” – it seems so, anyway!
We first built a sequential DPOR explorer for C/Pthreads programs, called “Inspect”. The multithreaded C/C++ program is instrumented and compiled into an instrumented executable; at run time, threads 1..n go through a thread library wrapper and exchange request/permit messages with a central scheduler.
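A hedged sketch of the request/permit wrapper idea from the diagram: before each visible operation, the wrapped call asks the central scheduler for a permit and then reports back. The function names and the no-op stubs are illustrative, not Inspect's actual interface:

#include <pthread.h>

/* Stand-ins for the instrumentation runtime; the real versions would send a
 * request message to the scheduler and block until a permit arrives. */
static void request_permit(const char *op, void *obj) { (void)op; (void)obj; }
static void report_done(const char *op, void *obj)    { (void)op; (void)obj; }

/* Called by the instrumented program in place of pthread_mutex_lock. */
int wrapped_mutex_lock(pthread_mutex_t *m) {
    request_permit("mutex_lock", m);   /* request: wait for the scheduler's permit */
    int rc = pthread_mutex_lock(m);    /* the real, now-scheduled operation        */
    report_done("mutex_lock", m);      /* lets the scheduler record what happened  */
    return rc;
}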
We then made the following observations • Stateless search does not maintain search history • Different branches of an acyclic space can be explored concurrently • A simple master-slave scheme can work here: one load balancer + workers
We then devised a work-distribution scheme: a central load balancer keeps track of idle node ids; workers send it unloading requests and report results, and it replies with work descriptions.
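A hedged MPI sketch of one round of such a scheme (the tags, the fixed-size work descriptor, and the three-rank cast are illustrative assumptions, not the tool's actual protocol): a busy worker unloads a backtrack point to the load balancer, which forwards it to the next worker that reports idle:

#include <mpi.h>
#include <stdio.h>

enum { TAG_UNLOAD = 1, TAG_IDLE = 2, TAG_WORK = 3 };
#define DESC_LEN 256

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 3) MPI_Abort(MPI_COMM_WORLD, 1);   /* run with at least 3 ranks */

    char desc[DESC_LEN];
    if (rank == 0) {                               /* the load balancer */
        MPI_Status st;
        char dummy;
        /* a busy worker unloads one piece of work ... */
        MPI_Recv(desc, DESC_LEN, MPI_CHAR, MPI_ANY_SOURCE, TAG_UNLOAD,
                 MPI_COMM_WORLD, &st);
        /* ... and it goes to the next worker that reports idle */
        MPI_Recv(&dummy, 0, MPI_CHAR, MPI_ANY_SOURCE, TAG_IDLE,
                 MPI_COMM_WORLD, &st);
        MPI_Send(desc, DESC_LEN, MPI_CHAR, st.MPI_SOURCE, TAG_WORK,
                 MPI_COMM_WORLD);
    } else if (rank == 1) {                        /* a busy worker spilling work */
        snprintf(desc, DESC_LEN, "execution prefix + backtrack point");
        MPI_Send(desc, DESC_LEN, MPI_CHAR, 0, TAG_UNLOAD, MPI_COMM_WORLD);
    } else if (rank == 2) {                        /* an idle worker asking for work */
        char dummy = 0;
        MPI_Send(&dummy, 0, MPI_CHAR, 0, TAG_IDLE, MPI_COMM_WORLD);
        MPI_Recv(desc, DESC_LEN, MPI_CHAR, 0, TAG_WORK,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node %d received: %s\n", rank, desc);
    }
    MPI_Finalize();
    return 0;
}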
We got zero speedup! Why? Deeper investigation revealed that multiple nodes ended up exploring the same interleavings
Illustration of the problem (1 of 5) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock
Illustration of the problem (2 of 5) {t1}, {t0} (to Node 1) t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock Heuristic: hand off the DEEPEST backtrack point for another node to explore. Reason: the largest number of paths emanate from there
Detail of (2 of 5): Node 0's stack is {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock. t1 is forced into the DONE set before the work is handed to Node 1, so Node 0 keeps { }, {t0,t1} at the initial state, while Node 1 receives the same stack and keeps t1 in its backtrack set: {t1}, {t0}.
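A hedged sketch of this hand-off, using illustrative bitmask sets rather than the tool's data structures: the local node moves t1 from its backtrack set into its done set, while the work description shipped to Node 1 keeps t1 in the backtrack set:

#include <stdio.h>

typedef struct {
    unsigned backtrack;   /* bit i set: thread i still to be explored here */
    unsigned done;        /* bit i set: thread i already explored here     */
} btpoint_t;

int main(void) {
    btpoint_t local   = { .backtrack = 1u << 1, .done = 1u << 0 };  /* {t1}, {t0} */
    /* work description handed to Node 1: t1 stays in its backtrack set */
    btpoint_t handoff = { .backtrack = 1u << 1, .done = local.done };
    /* the local node will never re-explore t1: move it into the done set */
    local.done      |= 1u << 1;                                     /* { }, {t0,t1} */
    local.backtrack &= ~(1u << 1);
    printf("local:   backtrack=%#x done=%#x\n", local.backtrack, local.done);
    printf("handoff: backtrack=%#x done=%#x\n", handoff.backtrack, handoff.done);
    return 0;
}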
Illustration of the problem (3 of 5) {t1}, {t0} (to Node 1) t0: lock t0: unlock {t2}, {t1} (Node 0 decides to do this work itself…) t1: lock t1: unlock t2: lock t2: unlock
Illustration of the problem (4 of 5) Node 0: {}, {t0,t1}; Node 1: {t1}, {t0} (being expanded by Node 1) t0: lock t0: unlock {t2}, {t1} (being expanded by Node 0)
Illustration of the problem (5 of 5) Node 0: {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock
Illustration of the problem (5 of 5) Node 0: {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock Node 1: {t1}, {t0} t1: lock t1: unlock
Illustration of the problem (5 of 5) Node 0: {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock Node 1: {t2}, {t0, t1} t1: lock t1: unlock {}, {t2} t2: lock t2: unlock Redundancy!
New Backtrack Set Computation: Aggressively mark up the stack! • Update the backtrack sets of ALL dependent operations (see the sketch below) • Forms a good allocation scheme • Does not involve any synchronization • Redundant work may still be performed • The likelihood is reduced because a node aggressively “owns” one operation and all its dependents {t1,t2}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock
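A hedged sketch of the aggressive variant, reusing the illustrative data structures from the earlier DPOR sketch: the loop no longer stops at the nearest dependent transition but marks every dependent transition on the stack. No inter-node synchronization is involved; only the local bookkeeping changes:

/* Aggressive variant of update_backtrack(): do NOT stop at the nearest
 * dependent transition; mark every dependent one on the stack. */
void update_backtrack_aggressive(const transition_t *next) {
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(&stack[i].trans, next) && !stack[i].done[next->tid])
            stack[i].backtrack[next->tid] = true;   /* no break: mark them all */
    }
}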
Implementation and Evaluation • Using MPI for communication among nodes • Did experiments on a 72-node cluster • 2.4 GHz Intel Xeon processors, 2 GB memory/node • Two (small) benchmarks: Indexer & file system, the benchmarks used in Flanagan and Godefroid’s DPOR paper • Aget – a multithreaded ftp client • Bbuf – an implementation of a bounded buffer
Speedup on indexer & fs (small examples); hence diminishing returns beyond 40 nodes…