Distributed Dynamic Partial Order Reduction based Verification of Threaded Software Yu Yang (PhD student; summer intern at CBL) Xiaofang Chen (PhD student; summer intern at IBM) Ganesh Gopalakrishnan Robert M. Kirby School of Computing University of Utah SPIN 2007 Workshop Presentation Supported by: Microsoft HPC Institutes NSF CNS 0509379
Thread programming will become more prevalent; FV of thread programs will grow in importance
Why FV for Threaded Programs • > 80% of chips shipped will be multi-core
Model Checking will Increasingly be through Dynamic Methods. Also known as Runtime or In-Situ methods
Why Dynamic Verification Methods • Even after early life-cycle modeling and validation, the final code will have far more details • Early life-cycle modeling is often impossible • Use of library APIs such as MPI, OpenMP, Shmem, … • Library function semantics can be tricky • The bug may be in the library function implementation
Why Stateless • One may not be able to access much of the state, e.g. the state of the OS • It is expensive to hash states and look up revisits • Stateless search is easier to parallelize
Why POR?
Process P0: 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize
Process P1: 0: MPI_Init 1: MPI_Win_lock 2: MPI_Accumulate 3: MPI_Win_unlock 4: MPI_Barrier 5: MPI_Finalize
Only a few of these operations are dependent on each other.
• 504 interleavings without POR: (2 * 10!) / (5!)^2 = 504
• 2 interleavings with POR !!
Dynamic POR is almost a “must”! (Dynamic POR as in Flanagan and Godefroid, POPL 2005)
Why Dynamic POR? Two concurrent statements: a[k]-- and a[j]++ • The ample set depends on whether j == k • Can be very difficult to determine statically • Can be determined dynamically (see the sketch below)
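A minimal pthreads sketch of the situation above (not from the talk; the array size and index values are illustrative). Whether the two threads conflict depends on j and k, which in general are known only at run time:

#include <pthread.h>
#include <stdio.h>

int a[8];
int j, k;                      /* in general computed at run time, e.g. from input */

void *dec(void *arg) { a[k]--; return NULL; }   /* thread 1: a[ k ]-- */
void *inc(void *arg) { a[j]++; return NULL; }   /* thread 2: a[ j ]++ */

int main(void) {
    j = 3; k = 3;              /* j == k: the two accesses are dependent (and race); */
                               /* with j != k they commute and are independent       */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, dec, NULL);
    pthread_create(&t2, NULL, inc, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a[3] = %d\n", a[3]);
    return 0;
}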
Why Dynamic POR? The notion of action dependence (crucial to POR methods) is a function of the execution
Computation of “ample” sets in Static POR versus in DPOR
• Static POR: the ample set is determined using “local” criteria
• DPOR: the ample set is built incrementally, based on dependencies observed during execution
• Looking back from the current state, when the next move of a process is found to be dependent on an earlier transition, that process is added to the “Backtrack Set” ({BT}) of the nearest dependent transition; processes already explored from a state are in its “Done” set
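A hedged sketch of this incremental backtrack-set update, in the spirit of Flanagan and Godefroid's DPOR. The data structures and the dependence test (two transitions conflict if they touch the same object) are illustrative simplifications; co-enabledness checks are omitted:

#include <stdbool.h>

#define MAX_DEPTH   1024
#define MAX_THREADS 8

typedef struct {
    int tid;                       /* thread that executed this transition        */
    int obj;                       /* object (lock, shared variable) it touched   */
} transition_t;

typedef struct {
    transition_t trans;            /* transition taken from this state            */
    bool backtrack[MAX_THREADS];   /* threads still to be explored from here      */
    bool done[MAX_THREADS];        /* threads already explored from here          */
} stack_entry_t;

static stack_entry_t stack[MAX_DEPTH];
static int depth;                  /* current size of the search stack            */

static bool dependent(const transition_t *a, const transition_t *b) {
    return a->obj == b->obj;       /* simplified dependence test                  */
}

/* Before executing `next`, look back for the NEAREST dependent transition and
 * add next->tid to the backtrack set of the state it was executed from. */
void update_backtrack(const transition_t *next) {
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(&stack[i].trans, next)) {
            if (!stack[i].done[next->tid])
                stack[i].backtrack[next->tid] = true;
            break;                 /* standard DPOR stops at the nearest one      */
        }
    }
}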
Putting it all together … • We target C/C++ PThread programs • Instrument the given program (largely automated) • Run the concurrent program “till the end” • Record interleaving variants while advancing • When the number of recorded backtrack points reaches a soft limit, spill work to other nodes • In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes • A heuristic to avoid recomputations was essential for the speed-up • First known distributed DPOR
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} (each state on the stack is annotated with its {backtrack set}, {done set}; both start empty)
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock t0: unlock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {}, {} t0: lock t0: unlock t1: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock t1: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {}, {} t1: lock t1: unlock t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1}, {t0} t0: lock t0: unlock {t2}, {t1}
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2} t2: lock
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2} t2: lock t2: unlock …
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t1,t2}, {t0} t0: lock t0: unlock {}, {t1, t2}
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t2}, {t0,t1}
A Simple DPOR Example t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) {t2}, {t0, t1} t1: lock t1: unlock …
For this example, all the paths are explored during DPOR; for others, DPOR explores a proper subset
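The three-thread lock/unlock program from the walkthrough, written out as a runnable C/pthreads sketch (not code from the talk; the names match the slides). All three threads contend for the same lock, so every acquisition order is dependent, which is why DPOR ends up exploring every path here:

#include <pthread.h>

pthread_mutex_t t = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    pthread_mutex_lock(&t);     /* lock(t)   */
    pthread_mutex_unlock(&t);   /* unlock(t) */
    return NULL;
}

int main(void) {
    pthread_t t0, t1, t2;
    pthread_create(&t0, NULL, worker, NULL);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}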
Idea for parallelization: Explore computations from the backtrack set in other processes. “Embarrassingly Parallel” – it seems so, anyway!
We first built a sequential DPOR explorer for C/Pthreads programs, called “Inspect”. The multithreaded C/C++ program is instrumented and compiled into an instrumented executable; at run time, threads 1..n go through a thread library wrapper and exchange request/permit messages with a central scheduler.
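A hedged sketch of the request/permit wrapper idea from the diagram: before each visible operation, the wrapped call asks the central scheduler for a permit and then reports back. The function names and the no-op stubs are illustrative, not Inspect's actual interface:

#include <pthread.h>

/* Stand-ins for the instrumentation runtime; the real versions would send a
 * request message to the scheduler and block until a permit arrives. */
static void request_permit(const char *op, void *obj) { (void)op; (void)obj; }
static void report_done(const char *op, void *obj)    { (void)op; (void)obj; }

/* Called by the instrumented program in place of pthread_mutex_lock. */
int wrapped_mutex_lock(pthread_mutex_t *m) {
    request_permit("mutex_lock", m);   /* request: wait for the scheduler's permit */
    int rc = pthread_mutex_lock(m);    /* the real, now-scheduled operation        */
    report_done("mutex_lock", m);      /* lets the scheduler record what happened  */
    return rc;
}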
We then made the following observations • Stateless search does not maintain search history • Different branches of an acyclic space can be explored concurrently • A simple master-slave scheme can work here: one load balancer + workers
We then devised a work-distribution scheme: a central load balancer keeps track of idle node ids; workers send it unloading requests and report results, and it replies with work descriptions.
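A hedged MPI sketch of one round of such a scheme (the tags, the fixed-size work descriptor, and the three-rank cast are illustrative assumptions, not the tool's actual protocol): a busy worker unloads a backtrack point to the load balancer, which forwards it to the next worker that reports idle:

#include <mpi.h>
#include <stdio.h>

enum { TAG_UNLOAD = 1, TAG_IDLE = 2, TAG_WORK = 3 };
#define DESC_LEN 256

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 3) MPI_Abort(MPI_COMM_WORLD, 1);   /* run with at least 3 ranks */

    char desc[DESC_LEN];
    if (rank == 0) {                               /* the load balancer */
        MPI_Status st;
        char dummy;
        /* a busy worker unloads one piece of work ... */
        MPI_Recv(desc, DESC_LEN, MPI_CHAR, MPI_ANY_SOURCE, TAG_UNLOAD,
                 MPI_COMM_WORLD, &st);
        /* ... and it goes to the next worker that reports idle */
        MPI_Recv(&dummy, 0, MPI_CHAR, MPI_ANY_SOURCE, TAG_IDLE,
                 MPI_COMM_WORLD, &st);
        MPI_Send(desc, DESC_LEN, MPI_CHAR, st.MPI_SOURCE, TAG_WORK,
                 MPI_COMM_WORLD);
    } else if (rank == 1) {                        /* a busy worker spilling work */
        snprintf(desc, DESC_LEN, "execution prefix + backtrack point");
        MPI_Send(desc, DESC_LEN, MPI_CHAR, 0, TAG_UNLOAD, MPI_COMM_WORLD);
    } else if (rank == 2) {                        /* an idle worker asking for work */
        char dummy = 0;
        MPI_Send(&dummy, 0, MPI_CHAR, 0, TAG_IDLE, MPI_COMM_WORLD);
        MPI_Recv(desc, DESC_LEN, MPI_CHAR, 0, TAG_WORK,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node %d received: %s\n", rank, desc);
    }
    MPI_Finalize();
    return 0;
}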
We got zero speedup! Why? Deeper investigation revealed that multiple nodes ended up exploring the same interleavings
Illustration of the problem (1 of 5) {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock
Illustration of the problem (2 of 5) {t1}, {t0} (to Node 1) t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock Heuristic: hand off the DEEPEST backtrack point for another node to explore. Reason: the largest number of paths emanate from there
Detail of (2 of 5): Node 0's stack is {t1}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock. t1 is forced into the DONE set before the work is handed to Node 1, so Node 0 keeps { }, {t0,t1} at the initial state, while Node 1 receives the same stack and keeps t1 in its backtrack set: {t1}, {t0}.
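A hedged sketch of this hand-off, using illustrative bitmask sets rather than the tool's data structures: the local node moves t1 from its backtrack set into its done set, while the work description shipped to Node 1 keeps t1 in the backtrack set:

#include <stdio.h>

typedef struct {
    unsigned backtrack;   /* bit i set: thread i still to be explored here */
    unsigned done;        /* bit i set: thread i already explored here     */
} btpoint_t;

int main(void) {
    btpoint_t local   = { .backtrack = 1u << 1, .done = 1u << 0 };  /* {t1}, {t0} */
    /* work description handed to Node 1: t1 stays in its backtrack set */
    btpoint_t handoff = { .backtrack = 1u << 1, .done = local.done };
    /* the local node will never re-explore t1: move it into the done set */
    local.done      |= 1u << 1;                                     /* { }, {t0,t1} */
    local.backtrack &= ~(1u << 1);
    printf("local:   backtrack=%#x done=%#x\n", local.backtrack, local.done);
    printf("handoff: backtrack=%#x done=%#x\n", handoff.backtrack, handoff.done);
    return 0;
}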
Illustration of the problem (3 of 5) {t1}, {t0} (to Node 1) t0: lock t0: unlock {t2}, {t1} (Node 0 decides to do this work itself…) t1: lock t1: unlock t2: lock t2: unlock
Illustration of the problem (4 of 5) Node 0: {}, {t0,t1}; Node 1: {t1}, {t0} (being expanded by Node 1) t0: lock t0: unlock {t2}, {t1} (being expanded by Node 0)
Illustration of the problem (5 of 5) Node 0: {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock
Illustration of the problem (5 of 5) Node 0: {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock Node 1: {t1}, {t0} t1: lock t1: unlock
Illustration of the problem (5 of 5) Node 0: {t2}, {t0,t1} t0: lock t0: unlock {}, {t2} t2: lock t2: unlock Node 1: {t2}, {t0, t1} t1: lock t1: unlock {}, {t2} t2: lock t2: unlock Redundancy!
New Backtrack Set Computation: Aggressively mark up the stack! • Update the backtrack sets of ALL dependent operations (see the sketch below) • Forms a good allocation scheme • Does not involve any synchronization • Redundant work may still be performed • The likelihood is reduced because a node aggressively “owns” one operation and all its dependents {t1,t2}, {t0} t0: lock t0: unlock {t2}, {t1} t1: lock t1: unlock t2: lock t2: unlock
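A hedged sketch of the aggressive variant, reusing the illustrative data structures from the earlier DPOR sketch: the loop no longer stops at the nearest dependent transition but marks every dependent transition on the stack. No inter-node synchronization is involved; only the local bookkeeping changes:

/* Aggressive variant of update_backtrack(): do NOT stop at the nearest
 * dependent transition; mark every dependent one on the stack. */
void update_backtrack_aggressive(const transition_t *next) {
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(&stack[i].trans, next) && !stack[i].done[next->tid])
            stack[i].backtrack[next->tid] = true;   /* no break: mark them all */
    }
}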
Implementation and Evaluation • Using MPI for communication among nodes • Did experiments on a 72-node cluster • 2.4 GHz Intel Xeon processors, 2 GB memory/node • Two (small) benchmarks: Indexer & file system, the benchmarks used in Flanagan and Godefroid’s DPOR paper • Aget – a multithreaded ftp client • Bbuf – an implementation of a bounded buffer
Speedup on indexer & fs (small examples); hence diminishing returns beyond 40 nodes…