VolpexMPI: Performance Evaluation of VolpexMPI over Infiniband Stephen Herbein Mentors: Jaspal Subhlok & Edgar Gabriel
Volpex: Parallel Execution on Volatile Nodes • Fault tolerance: why? • Node failures on machines with thousands of processors (large clusters) • Node and communication failures in distributed environments (volunteer computing) • Volpex Project Goals: • Execution on failure-prone platforms • Key problem: high failure rates AND communicating parallel programs
VolpexMPI • MPI library for execution of parallel applications on volatile nodes • Key features: • Controlled redundancy: each MPI process can have multiple replicas • Receiver-based direct communication between processes • Distributed sender logging to support slow processes (pictured in the sketch below)
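The sender-logging idea can be pictured with a small sketch. The bounded, sequence-numbered message log below is a conceptual illustration only; the structure, names, and sizes are assumptions, not VolpexMPI's actual implementation.

```c
/* Conceptual sketch of a bounded sender-side message log, indexed by
 * sequence number, so a lagging replica can re-request old messages.
 * Names and sizes are illustrative assumptions, not VolpexMPI's code. */
#include <stdlib.h>
#include <string.h>

#define LOG_SLOTS 1024                 /* bounded: old entries get overwritten */

typedef struct {
    long   seq;                        /* sequence number of the logged send */
    size_t len;
    char  *data;
} log_entry_t;

static log_entry_t sender_log[LOG_SLOTS];

/* Record a copy of each message at send time. */
void log_send(long seq, const void *buf, size_t len)
{
    log_entry_t *e = &sender_log[seq % LOG_SLOTS];
    free(e->data);                     /* drop whatever this slot held before */
    e->data = malloc(len);
    memcpy(e->data, buf, len);
    e->seq = seq;
    e->len = len;
}

/* Serve a replay request from a slow replica; NULL if already overwritten. */
const log_entry_t *log_lookup(long seq)
{
    log_entry_t *e = &sender_log[seq % LOG_SLOTS];
    return (e->data && e->seq == seq) ? e : NULL;
}
```

In a design like this the log has fixed size, so a replica that falls further behind than the log can hold can no longer be served from it and would have to be dropped or restarted.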
Managing Replicated MPI Processes • Only one replica of each process needs to survive for the program to execute successfully
Bandwidth and latency comparison • 4-byte latency over Gigabit Ethernet: • Open MPI v1.4.1: ~50 µs • VolpexMPI: ~1.8 ms
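For context, a small-message latency of this kind is commonly measured with a two-rank ping-pong loop. The sketch below is illustrative only; the iteration count and averaging are assumptions, not the exact benchmark behind the numbers above.

```c
/* Minimal MPI ping-pong latency sketch (illustrative only).
 * Run with two ranks, e.g.  mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char buf[4] = {0};                 /* 4-byte message, as on the slide */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 4, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 4, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 4, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    /* One-way latency is half the average round-trip time. */
    if (rank == 0)
        printf("average one-way latency: %f us\n",
               (t1 - t0) / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}
```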
NAS Parallel Benchmarks • VolpexMPI execution times are comparable to the reference Open MPI execution times
Overhead of redundancy and processor failures • Performance impact of executing with replicas (left side) • Performance impact of processor failures (right side) • Both runs use 16 processes
Use in High Performance Clusters • Not limited to volunteer computing • Tested on a small cluster using Ethernet • Not yet tested on a large-scale cluster with high-performance communication such as Infiniband • Goal: evaluate and validate the use of VolpexMPI on high-performance clusters • Specifically, clusters that use Infiniband
What is Infiniband • High-speed interconnect (over copper or optical fiber links) • Associated protocols designed to remove the overhead associated with Ethernet and IP • Leads to higher bandwidth, lower latency, and lower CPU usage • Widespread use in HPC • Most-used interconnect in the TOP500 (42%)
How to Run VolpexMPI over Infiniband • Ways to use Infiniband: • IPoIB (IP over InfiniBand) • Sockets Direct Protocol (SDP) • IPoIB: • High bandwidth • High latency • SDP: • Higher bandwidth • Low latency • Bypasses the TCP stack (see the sketch below)
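Because SDP keeps the ordinary sockets API, a TCP-based socket library can usually be pointed at SDP by changing the address family passed to socket(). The sketch below assumes the OFED SDP stack, where the SDP address family is conventionally defined as 27; the helper name and fallback logic are illustrative, not VolpexMPI's actual code.

```c
/* Illustrative sketch: open a stream socket over SDP when available,
 * falling back to plain TCP. Assumes the OFED SDP stack, where the SDP
 * address family is conventionally 27; not VolpexMPI's actual code. */
#include <sys/socket.h>
#include <stdio.h>

#ifndef AF_INET_SDP
#define AF_INET_SDP 27   /* value used by OFED's libsdp (assumption) */
#endif

static int open_stream_socket(void)
{
    /* Try SDP first: same sockaddr_in addressing as TCP, but the data
     * path bypasses the kernel TCP stack and runs over InfiniBand. */
    int fd = socket(AF_INET_SDP, SOCK_STREAM, 0);
    if (fd >= 0) {
        fprintf(stderr, "using SDP socket\n");
        return fd;
    }
    /* Fall back to ordinary TCP over IP (works over IPoIB or Ethernet). */
    fprintf(stderr, "SDP unavailable, falling back to TCP\n");
    return socket(AF_INET, SOCK_STREAM, 0);
}

int main(void)
{
    int fd = open_stream_socket();
    return fd >= 0 ? 0 : 1;
}
```

An alternative that needs no code change is OFED's libsdp.so preload library, which transparently redirects AF_INET stream sockets to SDP.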
Summary • Status: • Currently implementing SDP in the underlying socket library of VolpexMPI • Challenges: • Parallel programs are notoriously hard to debug • Unfamiliarity with network and socket programming • Goals: • Re-run bandwidth and latency tests using SDP • Re-run NAS benchmarks using SDP • Evaluate and validate the use of VolpexMPI on high-performance clusters