Interconnect and MPI
Bill Saphir
NERSC Users' Group, Oct. 3, 2005
What this talk will cover
• Infiniband fabric
  • Overview of Infiniband – past, present, future
  • Configuration on Jacquard
• MPI
  • How to use it
  • Limitations/workarounds
  • Plans
Infiniband
• Industry standard high performance network
• Many years in development. Near death in 2000; has come roaring back
• Originally seen as a PCI replacement
  • Retains the ability to connect directly to disk controllers
• High performance
  • Direct user space access to hardware
  • Kernel not involved in the transfer
  • No memory-to-memory copies needed
  • Protected access ensures security, safety
• Supports RDMA (put/get) and Send/Receive models
IB Link Speeds
• Single channel – 2.5 Gb/s in each direction, simultaneously
• "4X" link is 10 Gb/s
  • 10 bits encode 8 bits of data – error correction/detection
  • 1 GB/s of data in each direction per 4X link (see the quick check below)
  • PCI-X is 1 GB/s total bandwidth, so it is not possible to fully utilize IB 4X with PCI-X
• 12X link is three 4X links
  • One fat pipe to the routing logic rather than three separate links
• Double Data Rate (DDR) is 2X the speed (5 Gb/s links)
  • Just now becoming available
  • Very short copper cable lengths (~3 m)
• Quad Data Rate (QDR) is envisioned
  • Cable length issues with copper
  • Optical possible but expensive
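A quick back-of-the-envelope check of those numbers, using only the lane count and 8b/10b encoding given above (plain shell arithmetic, nothing Jacquard-specific):

  # 4X SDR link: 4 lanes x 2.5 Gb/s signaling, 8b/10b encoding
  echo "4 * 2.5 * 8 / 10" | bc -l    # 8 Gb/s of data per direction, i.e. ~1 GB/s
  # a 12X link is three 4X links
  echo "3 * 8" | bc                  # 24 Gb/s of data per direction, i.e. ~3 GB/s

With a PCI-X bus limited to roughly 1 GB/s in total, this is why a 4X host adapter cannot be driven at full rate through PCI-X.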
IB Switching
• Large switches are created by connecting smaller switches together
• Current small-switch building block is 24 ports
• Usual configuration is a fat tree
  • An N-port switch can be used to build an N^2/2-port fat tree using 3N/2 switches (max 288 ports for N = 24; see the quick calculation below)
  • Larger switches available from some vendors are actually a 2-level fat tree based on 24-port switches
• A fat tree has "full bisection bandwidth" – it supports all nodes communicating at the same time
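Plugging the 24-port building block into that formula (simple shell arithmetic; the result matches the 288-port figure quoted above):

  echo "24^2 / 2" | bc    # 288 ports maximum
  echo "3 * 24 / 2" | bc  # 36 small switches used to build it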
[Diagram] Example: 120-port "thin" tree
• All small switches have 24 ports (7 switches total)
• Each L1 switch has 4 "up" connections to the single L2 switch and 20 "down" connections (to nodes)
• 120 connections to nodes in total
[Diagram] Example: 96-port fat tree (Clos)
• All small switches have 24 ports (12 switches total)
• Each L1 switch has 12 "up" and 12 "down" (to nodes) connections
• Each L1 switch has 3 connections to each L2 switch
• 96 connections to nodes in total
Infiniband Routing
• Infiniband is "destination routed"
  • Switches make forwarding decisions based on the destination of a packet
• Even though a fat tree has full bisection bandwidth, hot spots are possible
  • The routing scheme makes it more difficult to avoid network "hot spots" (not yet clear whether Jacquard users are impacted)
  • Workarounds are available – this will be addressed in future versions of MPI
Jacquard configuration
• Jacquard is a "2-level" fat tree
  • 24-port switches at L1 (connecting to nodes)
  • 96-port switches at L2
  • Really a 3-level tree, because the 96-port switches are 2-level trees internally
• 4X connections (1 GB/s) to all nodes
• Innovation: 12X uplinks from L1 to L2 – a smaller number of fat pipes
• Full bisection bandwidth
  • Supports all nodes communicating at the same time
  • The network supports 2X what the PCI-X busses can sustain
Infiniband Software
• The IB software interface was originally called the "Virtual Interface Architecture" (VI Architecture or VIA)
• NERSC wrote the first MPI for VIA (MVICH) – the basis for the current MPI implementation on Jacquard
• Microsoft derailed the API in the standard
• The de-facto current standard is VAPI, from Mellanox (part of the OpenIB generation 1 software)
• OpenIB Gen 2 will have a slightly different interface
MPI for Infiniband
• Jacquard uses MVAPICH (MPICH + VAPI)
  • Based on MVICH from NERSC (MPICH + VIA) and on MPICH from ANL
  • OSU: porting to VAPI + performance improvements
• Support path: OSU -> Mellanox -> LNXI -> NERSC
  • Support mechanisms/responsibilities being discussed
• MPI-1 functionality
• NERSC is tracking Open MPI for Infiniband
Compiling/Linking MPI
• MPI versioning is controlled by modules
  • "module load mvapich" is in the default startup files
  • the compiler is loaded independently
• mpicc/mpif90 (see the example below)
  • mpicc -o myprog myprog.c
  • mpif90 -o myprog myprog.f
  • uses the currently loaded pathscale module
  • automatically finds MPI include files
  • automatically finds MPI libraries
  • the latest version uses shared libraries
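A minimal compile session might look like the following (a sketch only; "myprog.c" and "myprog.f" are the illustrative file names used above, and the mvapich module is normally loaded for you already):

  module load mvapich          # usually already done by the default startup files
  mpicc  -o myprog myprog.c    # C, using the currently loaded pathscale compiler
  mpif90 -o myprog myprog.f    # Fortran equivalent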
Running MPI programs
• Always use the "mpirun" command
  • written by NERSC
  • integrates PBS and MPI
  • runs with processor affinity enabled
• Inside a PBS job (see the sample script below):
  • mpirun ./a.out
  • runs a.out on all processors allocated by PBS
  • no need for the "$PBS_NODEFILE" hack
  • make sure to request ppn=2 with PBS
  • "-np N" is optional; it can be used to run on fewer processors
• On a login node:
  • "mpirun -np 32 ./a.out" just works
  • internally: creates a PBS script (for 32 processors), then runs the script interactively using "qsub -I" and expect
  • max wallclock time: 30 minutes
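For reference, a batch job using mpirun might look like this minimal sketch (the node count, walltime, and working-directory handling are illustrative placeholders, not site defaults):

  #!/bin/bash
  #PBS -l nodes=4:ppn=2        # request both processors on each node
  #PBS -l walltime=00:30:00
  cd $PBS_O_WORKDIR            # PBS starts jobs in the home directory by default
  mpirun ./a.out               # runs on all 8 processors allocated by PBS

Note that no machinefile or "$PBS_NODEFILE" handling is needed; mpirun picks up the PBS allocation itself.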
mpirun current limitations
• Currently propagates only these environment variables:
  • FILENV, LD_LIBRARY_PATH, LD_PRELOAD
  • to propagate other variables: ask NERSC
• Does not directly support MPMD
  • to run different binaries on different nodes, use a starter script that "execs" the correct binary based on the value of MPIRUN_RANK (see the sketch below)
• Does not allow redirection of standard input, e.g.
  • mpirun a.out < file
• Does not propagate $PATH, so "./a.out" is needed even if "." is in $PATH
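A minimal starter script along those lines (a sketch only; the binary names and the rank split are made up for illustration, but MPIRUN_RANK is the variable described above):

  #!/bin/sh
  # launched as:  mpirun ./starter.sh
  if [ "$MPIRUN_RANK" -eq 0 ]; then
      exec ./master.x      # rank 0 runs one binary...
  else
      exec ./worker.x      # ...all other ranks run another
  fi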
Orphan processes
• mpirun (using ssh) has a habit of leaving "orphan" processes on nodes when a program fails
• PBS (with NERSC additions) goes to great lengths to clean these up between jobs
• mpirun detects whether it has been previously called in the same PBS job; if so, it first tries to clean up orphan processes in case the previous run failed
Peeking inside mpirun
• mpirun currently uses ssh to start up processes (the internal starter is called "mpirun_rsh" – do not use this yourself)
• NERSC expects to move to PBS-based startup (internal starter called "mpiexec")
  • may help with orphan processes, accounting, the ability to redirect standard input, and direct MPMD support
• Do not use mpirun_rsh or mpiexec directly – they are not supported by NERSC
MPI Memory Use
• Current MVAPICH uses a lot of memory per process – linear in the number of MPI processes
• Per process (see the worked example below):
  • 64 MB, plus
  • 276 KB for each of the first 64 processes, plus
  • 1.2 MB for each process beyond 64
• Due to a limitation in the VI Architecture that does not exist in Infiniband but was carried forward
• Future implementations of MPI will have lower memory use
• Note: getrusage() doesn't report memory use under Linux
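As a rough illustration of that formula (the 256-process job size is just an example; the per-peer costs are the approximate figures quoted above):

  # per-process MPI memory for a 256-process job, in MB
  # 64 MB base + 64 x 0.276 MB + 192 x 1.2 MB
  echo "64 + 64*0.276 + 192*1.2" | bc -l    # ~312 MB per MPI process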
MPI Performance
• Ping-pong bandwidth:
  • 800 MB/s (Seaborg: 320 MB/s)
  • drops to 500 MB/s for messages above 200 KB
  • theoretical peak: 1000 MB/s
• Ping-pong latency:
  • 5.6 us between nodes (Seaborg: 24 us by default; 21 us with MPI_SINGLE_THREAD)
  • 0.6 us within a node
• "Random ring bandwidth":
  • 184 MB/s (Seaborg: ~43 MB/s at 4 nodes)
  • measures contention in the network
  • theoretical peak: 250 MB/s
MPI Futures
• Different startup mechanism – fewer orphans, faster startup, full environment propagated
• Lower memory use
• More control over the memory registration cache
• Higher bandwidth
Summary
• All you need to know:
  • mpicc/mpif77/mpif90/mpicxx
  • mpirun -np N ./a.out