IBM SP Switch: Theory and Practice. Carlo Fantozzi (carlo.fantozzi@dei.unipd.it)
Summary of presentation • Theory • Architecture of the SP node • Architecture of the SP switch (detailed) • The software layer • Practice • Performance measurement using MPI • The PUB library • MPI vs. PUB
SP System Overview • Flexibility: db server, web server, storage server, “number cruncher” • Scalability: up to 512 nodes • Modularity; building blocks: • SMP nodes • 16-port switches • Many different building blocks available • Result: a cluster of SMPs
Scalability • 8192 processors • November 2002: #4 on the TOP 500 list
POWER3 SMP Thin Node • 4 processors, disks & AIX OS • CPU: 64-bit POWER3-II @ 375 MHz • 2 FP units, 1500 MFLOPS (peak performance) • L1 cache: 32 KB inst, 64 KB data • L2 cache: 8 MB @ 250 MHz per CPU • 256 bit width, dedicated bus • Main memory: up to 16 GB @ 62.5 MHz • Shared among processors; 512 bit bus
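The 1500 MFLOPS peak figure can be reproduced with a short calculation. This is a minimal sketch that assumes each FP unit completes one fused multiply-add per cycle and that a fused multiply-add counts as 2 floating-point operations; the function and constant names are illustrative, not taken from the slides.

```python
def peak_mflops(clock_mhz, fp_units, flops_per_unit_per_cycle):
    """Peak MFLOPS = clock (MHz) x FP units x flops each unit retires per cycle."""
    return clock_mhz * fp_units * flops_per_unit_per_cycle

# Assumption: one fused multiply-add per unit per cycle, counted as 2 flops.
FMA_FLOPS = 2

per_cpu = peak_mflops(375, 2, FMA_FLOPS)   # one POWER3-II CPU at 375 MHz
per_node = 4 * per_cpu                     # a thin node holds 4 CPUs

print(f"per CPU:  {per_cpu} MFLOPS")
print(f"per node: {per_node} MFLOPS")
```

Under these assumptions the per-CPU value matches the slide; the per-node value is derived here and is not stated in the slides.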
Thin node: 6xx-MX bus • 64 bit wide; runs @ 60 MHz • Independent from memory buses • Peak bandwidth: 480 MB/s • 6xx-MX bus shared by: • ultra SCSI disks • 100 Mbps Ethernet • PCI expansion slots • SP Switch adapter
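The 480 MB/s peak follows directly from the bus width and clock. The sketch below assumes one transfer per clock cycle and uses 1 MB = 10^6 bytes; it also compares the bus against a switch adapter running at 150 MB/s in each direction (the per-direction link figure quoted on the "SP Switch Port" slide). All names are illustrative.

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz, transfers_per_clock=1):
    """Peak bus bandwidth in MB/s (1 MB = 10**6 bytes)."""
    return width_bits / 8 * clock_mhz * transfers_per_clock

mx_bus = peak_bandwidth_mb_s(64, 60)   # 6xx-MX bus: 64 bits wide at 60 MHz

# The switch adapter shares this bus with disks, Ethernet and PCI slots.
switch_both_directions = 2 * 150.0     # MB/s, send plus receive at link peak
headroom = mx_bus - switch_both_directions

print(f"6xx-MX peak:            {mx_bus:.0f} MB/s")
print(f"switch, both directions: {switch_both_directions:.0f} MB/s")
print(f"left for other devices:  {headroom:.0f} MB/s")
```

The comparison suggests that a fully loaded bidirectional link already consumes most of the shared bus, which is one plausible reason measured bidirectional bandwidth stays well below the link peak.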
The SP Switch • Scalable, reconfigurable • Redundant, reliable • High bandwidth • Low latency • Split into 2 parts: • the SP Switch Board • the SP Switch Adapter (on each node)
SP Switch Port • Synchronous • Bidirectional • Phit size: 1 byte • Flit size: 2 bytes • Flow control: credits and tokens • Peak bw: 150 MB/s per direction
SP Switch Basics • [Figure: packet layout: BOP | P1 | P2 | P3 | payload | EOP] • Link-level flow control • Buffered wormhole routing • Cut-through switching • Routing strategy is: • source-based • designed to choose shortest paths • non-adaptive • “non-deterministic”
SP Switch Chip • 16x16 unbuffered crossbar • Conflict resolution: least recently served (LRS) • Packet refused by output port if… • port is already transmitting • input port is not the LRS • Refused packets go to a unified buffer • Dynamic allocation of chunks • Multiple next-chunk lists
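The least-recently-served policy can be illustrated with a toy arbiter for a single output port. The slide only names the policy; the class, its data structure, and the 4-input example are assumptions made for illustration and do not describe the chip's actual logic.

```python
class LrsArbiter:
    """Toy least-recently-served arbiter for one output port."""

    def __init__(self, n_inputs):
        # order[0] is the input served longest ago; order[-1] was served last.
        self.order = list(range(n_inputs))

    def grant(self, requests):
        """Return the requesting input served longest ago, or None."""
        for port in self.order:
            if port in requests:
                # The winner becomes the most recently served input.
                self.order.remove(port)
                self.order.append(port)
                return port
        return None


arbiter = LrsArbiter(4)
# Inputs 1 and 3 keep competing for the same output port.
winners = [arbiter.grant({1, 3}) for _ in range(4)]
print(winners)
```

In this model the two competing inputs alternate, so neither starves. In the real chip the losing packet is not dropped; per the slide it is diverted to the unified buffer.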
SP Switch Board • 2-stage BMIN • 4 shortest paths for each (src, dst) pair • 16 ports on the right for (up to) 16 nodes • 16 ports on the left for multiple boards • 2-dimensional 4-ary butterfly
SP Switch Adapter • On the 6XX-MX Bus (480 MB/s peak bandwidth) • $ 12,500 (?!)
SP Switch: the Software • IP or User Space • IP: “easy” but slow • Kernel extensions to handle IP packets • Any IP application can use the switch • TCP/IP and system call overheads • User Space: high performance solution • DirectX-like: low overhead • Portability/complexity force a protocol stack
User Space Protocols • MPI: industry standard message passing interface • MPCI: point-to-point transport; hides multi-thread issues • PIPE: byte-stream transport; splits messages into flits; chooses among 4 routes; does buffering/DMA; manages flow control/ECC • MPI implementation is native • Anything faster? LAPI • [Figure: protocol stack: User application, MPI, MPCI, PIPE, UDP (5+ apps), HW Abstraction Layer, SP Switch]
Testing the Switch • ANSI C • Non-threaded MPI 1.2 library • Switch tuning: tuning.scientific • Inside a node: shared memory • Outside a node: switch+User Space • Expect hierarchy effects • Beware: extraneous loads beyond LoadLeveler control ! 50%
Measuring Latency • [Figure: processors P1 and P2] • Latency measured as “round trip time” • Packet size: 0 bytes • Unloaded latency (best scenario) • Synchronous calls: they give better results
Latency: Results • Same latency for intra- and inter-node comms • Best latency ever seen: 7.13 µs (inter-node!) • Worst latency ever seen: 330.64 ms • “Catastrophic events” (> 9 ms) happen! • What about Ethernet?
[verona:~] bambu% ping -q -r -c 10 pandora
PING pandora.dei.unipd.it (147.162.96.156): 56 data bytes
--- pandora.dei.unipd.it ping statistics ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max = 0.379/0.477/0.917 ms
Latency: the True Story • [Figure: latency histograms (frequency vs. latency in microseconds) for P1 and P4, job #7571; annotations: “2 different events?”, “3 different events?”] • Data taken from a 16-CPU job • Average for the job: 81 µs • At least 6 times better than Ethernet…
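The comparison with Ethernet can be checked against the ping statistics quoted on the previous slide. This sketch assumes the job average is 81 microseconds (the unit used on the histogram axes) and that ping round-trip times are comparable to the MPI round-trip measurement; the names are illustrative.

```python
# Round-trip times reported by ping over Ethernet, in milliseconds.
ETHERNET_RTT_MS = {"min": 0.379, "avg": 0.477, "max": 0.917}

# Average switch latency for the 16-CPU job, assumed to be in microseconds.
SWITCH_AVG_US = 81.0


def speedup(ethernet_ms, switch_us):
    """How many times shorter the switch round trip is than Ethernet's."""
    return ethernet_ms * 1000.0 / switch_us


for name, rtt_ms in ETHERNET_RTT_MS.items():
    print(f"Ethernet {name}: {speedup(rtt_ms, SWITCH_AVG_US):.1f}x slower than the switch")
```

Against the Ethernet average the ratio comes out just under 6, so the slide's figure appears to be a rounded value.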
Measuring Bandwidth • Big packets (tens of MB) to overcome buffers and latency effects • Only one sender/receiver pair active: P0 sends, Pi receives • Separate buffers for send and receive • Unidirectional BW: MPI_Send & MPI_Recv • Bidirectional BW: MPI_Sendrecv is bad! Use MPI_Isend and MPI_Irecv instead
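Why tens of MB are needed can be seen with the usual linear cost model, in which a transfer takes a fixed startup time plus the message size divided by the asymptotic bandwidth. The model itself and the parameter values (81 µs startup, 133 MB/s asymptotic bandwidth, both borrowed from the measurements reported in these slides) are assumptions for illustration, not the author's method.

```python
def effective_bandwidth(n_bytes, startup_s, asymptotic_bw):
    """Bandwidth actually observed for one message under T(n) = t0 + n / B."""
    return n_bytes / (startup_s + n_bytes / asymptotic_bw)


STARTUP_S = 81e-6        # assumed startup cost, in seconds
ASYMPTOTIC_BW = 133e6    # assumed asymptotic bandwidth, in bytes/s

# Message size at which half of the asymptotic bandwidth is reached.
half_size = STARTUP_S * ASYMPTOTIC_BW

for size in (1_000, 100_000, 32_000_000):
    bw = effective_bandwidth(size, STARTUP_S, ASYMPTOTIC_BW)
    print(f"{size:>11} bytes -> {bw / 1e6:6.1f} MB/s")
print(f"half-performance message size: {half_size:.0f} bytes")
```

With these values a 1 KB message sees only a small fraction of the asymptotic bandwidth, while a message of tens of MB is within a fraction of a percent of it.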
Bandwidth: Results (1) • 2-level hierarchy: no advantage for same-switch-chip nodes • Unidirectional results • Some catastrophic events, as usual • Intra-node: best 338.8, worst 250.1 MB/s; typically over 333 MB/s • Inter-node: best 133.9, worst 101.3 MB/s; typically over 133 MB/s (89% of p.p.)
Bandwidth: Results (2) • [Figure: bandwidth oscillation in job #7571; bandwidth (MB/s) vs. run # (1 to 15) for P1, P4, P8] • Bidirectional results • Intra-node: best 351.3, worst 274.6 MB/s • Inter-node: best 183.5, worst 106.7 MB/s; 61% of p.p. or even less
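The "% of p.p." figures can be reproduced from the 150 MB/s per-direction link peak. The sketch assumes that "p.p." stands for peak performance and that the bidirectional peak is twice the per-direction peak; the function name is illustrative.

```python
LINK_PEAK_MB_S = 150.0   # per direction, from the "SP Switch Port" slide


def percent_of_peak(measured_mb_s, directions=1):
    """Measured bandwidth as a percentage of the link peak."""
    return 100.0 * measured_mb_s / (LINK_PEAK_MB_S * directions)


print(f"unidirectional, typical 133 MB/s: {percent_of_peak(133.0):.1f}%")
print(f"unidirectional, worst 101.3 MB/s: {percent_of_peak(101.3):.1f}%")
print(f"bidirectional, best 183.5 MB/s:   {percent_of_peak(183.5, directions=2):.1f}%")
print(f"bidirectional, worst 106.7 MB/s:  {percent_of_peak(106.7, directions=2):.1f}%")
```

Rounded to whole percentages, the typical unidirectional value and the best bidirectional value agree with the 89% and 61% quoted on the slides.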
Barrier Synchronization • [Figure: MPI_Barrier() times in different runs; sync time (microseconds) vs. processors involved (0 to 24), runs 7736, 7737, 7738, 7741, 7742, 7743, 7744] • Data averaged over hundreds of calls • Not influenced by node allocation, but…
Barrier: the True Story • [Figure: UNFILTERED MPI_Barrier() times in different runs; sync time (microseconds) vs. processors involved (0 to 24), runs 7736, 7737, 7738, 7741, 7742, 7743, 7744, plus the filtered curve] • For 24 processors: 325 vs 534 µs • Which value should we use?
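The gap between filtered and unfiltered averages is what a handful of very slow calls would produce. The slides do not say how the filtering was done, so the median-based rule below is only one plausible choice, and the sample values are synthetic, chosen for illustration rather than taken from any run.

```python
import statistics


def filter_outliers(samples, factor=3.0):
    """Keep only samples no larger than `factor` times the median."""
    median = statistics.median(samples)
    return [s for s in samples if s <= factor * median]


# Synthetic barrier times in microseconds: 97 ordinary calls, 3 very slow ones.
samples = [300.0] * 97 + [9000.0] * 3

unfiltered_mean = statistics.mean(samples)
filtered_mean = statistics.mean(filter_outliers(samples))

print(f"unfiltered mean: {unfiltered_mean:.1f} us")
print(f"filtered mean:   {filtered_mean:.1f} us")
print(f"median:          {statistics.median(samples):.1f} us")
```

Three slow calls out of a hundred are enough to nearly double the mean here, while the median is unaffected. Which value to report depends on whether the slow calls are considered part of the system's behavior or external noise.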
PUB Library: Facts • PUB = “Paderborn University BSP” • Alternative to MPI (higher level); GPL’ed • Uses the BSP computational model • Put messages in buffer • Call barrier synchronization • Messages are now at their destinations • Native implementation for old architectures, runs over TCP/IP or MPI otherwise
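The three-step BSP cycle (buffer messages, synchronize, find them delivered) can be mimicked with a toy single-process model. The class and method names below are hypothetical and are not the PUB API; the sketch only shows the delivery semantics of a superstep.

```python
class BspMachine:
    """Toy model of BSP supersteps: messages are delivered only at sync()."""

    def __init__(self, n_procs):
        self.outboxes = [[] for _ in range(n_procs)]  # buffered, per sender
        self.inboxes = [[] for _ in range(n_procs)]   # delivered, per receiver

    def send(self, src, dst, payload):
        # Step 1: the message is only buffered; the receiver cannot see it yet.
        self.outboxes[src].append((dst, payload))

    def sync(self):
        # Step 2: barrier. Step 3: all buffered messages reach their destinations.
        self.inboxes = [[] for _ in self.inboxes]
        for src, outbox in enumerate(self.outboxes):
            for dst, payload in outbox:
                self.inboxes[dst].append((src, payload))
            outbox.clear()


machine = BspMachine(3)
machine.send(0, 2, "a")
machine.send(1, 2, "b")
print("before sync:", machine.inboxes[2])
machine.sync()
print("after sync: ", machine.inboxes[2])
```

Because nothing is visible before the barrier, per-message latency is not a natural quantity in this model; the cost that matters is the whole superstep.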
MPI vs. PUB: Interfaces
MPI              PUB
MPI_Send         bsp_send
MPI_Isend        bsp_hpsend
MPI_Barrier      bsp_sync
MPI_Comm_split   bsp_partition
MPI_Reduce       bsp_reduce
MPI_Scan         bsp_scan
MPI_Bcast        bsp_gsend
• MPI: scatter/gather, topologies • PUB: bsp_oblsync, virtual processors
MPI vs. PUB: latency • [Figure: latency histogram (frequency vs. latency in microseconds) for P4, job #7835] • Measuring latency is against the spirit of PUB! • Best is 42 µs, but average is 86 µs (like MPI)
MPI vs. PUB: bandwidth • Unidirectional results • Intra-node: best 124.9, worst 74.0 MB/s; typically over 124 MB/s (no shared memory?) • Inter-node: best 79.4, worst 74.8 MB/s; typically over 79 MB/s (53% of p.p.) • Bidirectional results • Intra-node: best 222.4, worst 121.4 MB/s • Inter-node: best 123.7, worst 82.3 MB/s • Much slower than MPI, but… • …sometimes bw is higher than link capacity ?!
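How much slower PUB is can be quantified by comparing the best figures reported for each library. The numbers below are copied from the MPI and PUB bandwidth slides; comparing best against best is a choice made here for illustration, and the names are not from the source.

```python
# Best measured bandwidth in MB/s: (MPI, PUB), taken from the results slides.
BEST_MB_S = {
    "unidirectional intra-node": (338.8, 124.9),
    "unidirectional inter-node": (133.9, 79.4),
    "bidirectional intra-node": (351.3, 222.4),
    "bidirectional inter-node": (183.5, 123.7),
}


def slowdown_percent(mpi_mb_s, pub_mb_s):
    """Percentage by which PUB bandwidth falls short of MPI bandwidth."""
    return 100.0 * (1.0 - pub_mb_s / mpi_mb_s)


for case, (mpi, pub) in BEST_MB_S.items():
    print(f"{case}: PUB is {slowdown_percent(mpi, pub):.1f}% slower")
```

The largest gap appears in the unidirectional intra-node case, consistent with the slide's suspicion that PUB does not use shared memory inside a node.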
MPI vs PUB: barrier • [Figure: MPI_Barrier() versus bsp_sync(), filtered; sync time (microseconds) vs. processors involved (0 to 24), jobs 7810, 7811, 7812, 7813 for both MPI and PUB; annotation: 50 µs] • PUB is slower for few processors, then faster • PUB also needs filtering for >16 processors
Further Results (1) • Further performance measures with more complex communication patterns • 10% difference in results from job to job • MPI: previous figures still hold if we consider aggregate bandwidth per node • PUB is at least 20% slower than MPI (much more for bidirectional patterns) • Some PUB figures are, again, meaningless
Further Results (2) • [Figure: streams among nodes N1, N3, N5, N6 (3 streams), with bandwidth shares annotated 1/3, 2/3, 1/2?!, 1/2?!, 2/3?] • Switch topology emerges in complex patterns • Main bottleneck: node-switch channel • Other effects present (e.g. concurrency handling)
Conclusions • Variability in results… due to load effects? Due to SW? • Variability makes a switch model impossible • PUB benefits? • If I had more time/resources… • Higher level: collective communications • Lower level: the LAPI interface • GigaEthernet, Myrinet, CINECA’s SP4 • Native PUB library on the SP