Umpire: Making MPI Programs Safe
Bronis R. de Supinski and Jeffrey S. Vetter
Center for Applied Scientific Computing
August 15, 2000
Umpire
• Writing correct MPI programs is hard
• Unsafe or erroneous MPI programs
  • Deadlock
  • Resource errors
• Umpire
  • Automatically detect MPI programming errors
  • Dynamic software testing
  • Shared memory implementation
Umpire Architecture
[Architecture diagram: the MPI application's tasks 0 through N-1 are interposed using the MPI profiling layer and exchange transactions via shared memory with the Umpire Manager's corresponding tasks 0 through N-1, which run the verification algorithms; everything sits on top of the MPI runtime system.]
Collection system
• Calling task
  • Use MPI profiling layer (interposition sketched after this slide)
  • Perform local checks
  • Communicate with manager if necessary
    • Call parameters
    • Return program counter (PC)
    • Call-specific information (e.g., buffer checksum)
• Manager
  • Allocate Unix shared memory
  • Receive transactions from calling tasks
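A minimal sketch of the interposition mechanism the profiling layer enables. The wrapper pattern (define MPI_Send, forward to PMPI_Send) is standard MPI; the umpire_record helper and its signature are illustrative assumptions, not Umpire's actual internals:

    #include <mpi.h>

    /* Hypothetical helper that ships a transaction record to the
     * manager over shared memory (name and signature assumed). */
    extern void umpire_record(const char *call, void *buf, int count,
                              MPI_Datatype type, int peer, int tag,
                              MPI_Comm comm);

    /* The MPI profiling interface lets a tool define MPI_Send itself
     * and reach the real implementation through PMPI_Send. */
    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        umpire_record("MPI_Send", buf, count, type, dest, tag, comm);
        return PMPI_Send(buf, count, type, dest, tag, comm);
    }

Linking such wrappers ahead of the MPI library intercepts every MPI call without recompiling the application.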
Manager
• Detects global programming errors
• Unix shared memory communication
• History queues (one plausible layout is sketched after this slide)
  • One per MPI task
  • Chronological lists of MPI operations
• Resource registry
  • Communicators
  • Derived datatypes
  • Required for message matching
• Performs verification algorithms
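One plausible shape for the per-task history queues, shown only to make the discussion concrete; the field and type names are assumptions, not Umpire's actual data structures:

    #include <mpi.h>

    /* One record per MPI operation, appended in program order. */
    typedef struct op_record {
        int               call;   /* operation code, e.g. send, bcast */
        int               peer;   /* destination/source task, if any  */
        int               tag;
        MPI_Comm          comm;   /* needed for message matching      */
        struct op_record *next;   /* next operation, chronologically  */
    } op_record;

    /* One queue per MPI task; entries are added by transactions and
     * removed once they are safely matched. */
    typedef struct {
        op_record *head;          /* oldest unmatched operation */
        op_record *tail;
    } history_queue;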
Configuration Dependent Deadlock
• Unsafe MPI programming practice
• Code result depends on:
  • MPI implementation limitations
  • User input parameters
• Classic example code (a complete program sketch follows):

    Task 0        Task 1
    MPI_Send      MPI_Send
    MPI_Recv      MPI_Recv
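A complete toy program exhibiting this pattern, assuming two tasks; the message size is arbitrary and chosen only to exceed typical eager-send buffering:

    #include <mpi.h>

    #define N (1 << 20)   /* large enough to defeat buffering on many MPIs */

    int main(int argc, char **argv)
    {
        int rank, peer;
        static double sbuf[N], rbuf[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = 1 - rank;  /* run with exactly two tasks */

        /* Both tasks send first. This completes only if the MPI
         * implementation buffers the message, so the outcome depends
         * on the configuration rather than on the program. */
        MPI_Send(sbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(rbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }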
Mismatched Collective Operations
• Erroneous MPI programming practice
• Simple example code (sketched in full after this slide):

    Tasks 0, 1, & 2    Task 3
    MPI_Bcast          MPI_Barrier
    MPI_Barrier        MPI_Bcast

• Possible code results:
  • Deadlock
  • Correct message matching
  • Incorrect message matching
  • Mysterious error messages
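A minimal runnable version of the mismatch, assuming four tasks; variable names are illustrative:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Task 3 reaches the two collectives in the opposite order,
         * violating the rule that all tasks call collectives on a
         * communicator in the same order. */
        if (rank != 3) {
            MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        } else {
            MPI_Barrier(MPI_COMM_WORLD);
            MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }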
Deadlock detection
• MPI history queues
  • One per task in Manager
  • Track MPI messaging operations
  • Items added through transactions
  • Removed when safely matched
• Automatically detect deadlocks
  • MPI operations only
  • Wait-for graph
  • Recursive algorithm (sketched after this slide)
  • Invoked when queue head changes
  • Also supports timeouts
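An illustrative recursive cycle check over a wait-for graph; this is a sketch of the general technique, not Umpire's actual algorithm. It simplifies each blocked task to a single outgoing edge, whereas collectives generalize to waiting on a set of tasks:

    #define NO_WAIT (-1)

    /* waits_for[i] is the task that task i is currently blocked on
     * (NO_WAIT if task i is not blocked); visiting[] must be zeroed
     * before each top-level call, e.g. whenever a queue head changes. */
    static int deadlocked(int task, const int *waits_for, int *visiting)
    {
        if (waits_for[task] == NO_WAIT)
            return 0;            /* chain ends at an unblocked task */
        if (visiting[task])
            return 1;            /* task revisited: cycle, so deadlock */
        visiting[task] = 1;
        return deadlocked(waits_for[task], waits_for, visiting);
    }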
Deadlock Detection Example
[History-queue diagram for tasks 0-3: tasks 0, 1, and 2 issue MPI_Bcast and MPI_Barrier while task 3 issues the same collectives in the opposite order. Transactions reach the Manager as: Task 1: MPI_Bcast, Task 0: MPI_Bcast, Task 2: MPI_Bcast, Task 2: MPI_Barrier, Task 0: MPI_Barrier, Task 3: MPI_Barrier, Task 1: MPI_Barrier. The wait-for graph over the queue heads contains a cycle: ERROR! Report it!]
Resource Tracking Errors
• Many MPI features require resource allocations
  • Communicators, datatypes, and requests
• Detect "leaks" automatically
• Simple "lost request" example (a corrected version follows):

    MPI_Irecv (..., &req);
    MPI_Irecv (..., &req);   /* overwrites req: first request is lost */
    MPI_Wait (&req, ...);

• Complicated by assignment
• Also detect errant writes to send buffers
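For contrast, one way to write the example correctly, keeping each request in its own handle until it completes; the buffer, count, source, and tag names here are illustrative placeholders:

    MPI_Request req[2];

    MPI_Irecv(buf0, count, MPI_INT, src0, tag, comm, &req[0]);
    MPI_Irecv(buf1, count, MPI_INT, src1, tag, comm, &req[1]);

    /* Completing both requests releases both allocations; overwriting
     * req[0] before completion would have leaked the first request. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);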
Conclusion
• First automated MPI debugging tool
  • Detects deadlocks
  • Eliminates resource leaks
  • Assures correct non-blocking sends
• Performance
  • Low overhead (21% for sPPM)
  • Located deadlock in code set-up
• Limitations
  • MPI_Waitany and MPI_Cancel
  • Shared memory implementation
  • Prototype only
Future Work
• Further prototype testing
• Improve user interface
• Handle all MPI calls
• Tool distribution
  • LLNL application group testing
  • Exploring mechanisms for wider availability
• Detection of other errors
  • Datatype matching
  • Others?
• Distributed memory implementation
UCRL-VG-139184
Work performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory, under Contract W-7405-Eng-48.