Three Topics in Parallel Communications
Thesis presentation by Emin Gabrielyan
Parallel communications: bandwidth enhancement or fault-tolerance?
• We do not know whether parallel communications were first used for fault-tolerance or for bandwidth enhancement
• In 1964 Paul Baran proposed parallel communications for fault-tolerance (inspiring the design of ARPANET and the Internet)
• In 1981 IBM introduced the 8-bit parallel port for faster communication
Bandwidth enhancement by parallelizing the sources and sinks
• Bandwidth enhancement can be achieved by adding parallel paths
• But a greater capacity enhancement is achieved if we can replace the senders and destinations with parallel sources and sinks
• This is possible in parallel I/O (the first topic of the thesis)
Parallel transmissions in coarse-grained networks cause congestion
• In coarse-grained circuit-switched HPC networks, uncoordinated parallel transmissions cause congestion
• The overall throughput degrades due to access conflicts on shared resources
• Coordination of parallel transmissions is covered by the second topic of my thesis (liquid scheduling)
Classical backup parallel circuits for fault-tolerance
• Typically the redundant resource remains idle
• As soon as the primary resource fails, the backup resource replaces it
Parallelism in living organisms
• Parallelism is observed in almost every living organism
• Duplication of organs serves primarily for fault-tolerance
• And, as a secondary purpose, for capacity enhancement
Simultaneous parallelism for fault-tolerance in fine-grained networks
• A challenging bio-inspired solution is to use all available paths simultaneously to achieve fault-tolerance
• This topic is addressed in the last part of my presentation (capillary routing)
Fine Granularity Parallel I/O for Cluster Computers
SFIO, a Striped File parallel I/O library
Why is parallel I/O required?
• A single I/O gateway for a cluster computer saturates
• It does not scale with the size of the cluster
What is Parallel I/O for Cluster Computers?
• Some or all of the cluster computers can be used for parallel I/O
Objectives of parallel I/O
• Resistance to concurrent access
• Scalability as the number of I/O nodes increases
• High level of parallelism and load balance for all application patterns and all types of I/O requests
Parallel I/O subsystem: concurrent access by multiple compute nodes
• No concurrent access overheads
• No performance degradation when the number of compute nodes increases
Scalable throughput of the parallel I/O subsystem
• The overall parallel I/O throughput should increase linearly as the number of I/O nodes increases
(Chart: throughput versus the number of I/O nodes)
Concurrency and scalability = scalable all-to-all communication
• Concurrency and scalability (as the number of I/O nodes increases) can be represented by a scalable overall throughput as the number of compute and I/O nodes increases together
(Chart: all-to-all throughput between the compute nodes and the I/O nodes versus the number of nodes)
High level of parallelism and load balance
• Balanced distribution across the parallel disks must be ensured:
• For all types of application patterns
• Using small or large I/O requests
• With contiguous or fragmented I/O request patterns
How is parallelism achieved?
• Split the logical file into stripes
• Distribute the stripes cyclically across the subfiles
(Figure: a logical file striped cyclically across six subfiles, file1 to file6)
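As a minimal sketch (hypothetical helper, not the actual SFIO internals), the cyclic distribution reduces to index arithmetic on the logical offset:

  #include <stdio.h>

  /* Map a logical file offset to a subfile and an offset inside it,
     for a given stripe unit size and number of subfiles. */
  void stripe_map(long offset, long stripe, int nsubfiles,
                  int *subfile, long *local)
  {
      long unit = offset / stripe;            /* global stripe unit index */
      *subfile = (int)(unit % nsubfiles);     /* cyclic distribution      */
      *local = (unit / nsubfiles) * stripe    /* full cycles before it    */
               + offset % stripe;             /* position inside the unit */
  }

  int main(void)
  {
      int sf; long lo;
      stripe_map(13, 5, 2, &sf, &lo);  /* offset 13, stripe 5, 2 subfiles */
      printf("subfile %d, local offset %ld\n", sf, lo);  /* subfile 0, 8  */
      return 0;
  }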
The POSIX-like interface of Striped File I/O
• Using SFIO from MPI
• Simple POSIX-like interface

  #include <mpi.h>
  #include "/usr/local/sfio/mio.h"

  int _main(int argc, char *argv[])
  {
      MFILE *f;
      int r = rank();

      /* Collective open operation */
      f = mopen("p1/tmp/a.dat;p2/tmp/a.dat;", 5);

      /* Each process writes 8 to 14 characters at its own position */
      if (r == 0) mwritec(f, 0, "Good*morning!", 13);
      if (r == 1) mwritec(f, 13, "Bonjour!", 8);
      if (r == 2) mwritec(f, 21, "Buona*mattina!", 14);

      /* Collective close operation */
      mclose(f);
      return 0;
  }
Distribution of the global file data across the subfiles
• Example with three compute nodes and two I/O nodes
(Figure: the global file holds "Good*morning!" at offset 0, "Bonjour!" at offset 13 and "Buona*mattina!" at offset 21; the 5-byte stripe units alternate between the first and the second subfile)
Impact of the stripe unit size on the load balance
• When the stripe unit size is large, there is no guarantee that an I/O request will be well parallelized
(Figure: an I/O request on the logical file falling entirely into a single subfile)
Fine granularity striping with good load balance
• Low granularity ensures good load balance and a high level of parallelism
• But it results in high network communication and disk access costs
(Figure: the same I/O request spread over all subfiles)
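The trade-off can be quantified with a small helper (illustrative, not part of SFIO): the number of distinct subfiles a request touches bounds its parallelism, so a large stripe unit often leaves a request on a single disk, while a fine stripe unit spreads it over all subfiles:

  /* Number of distinct subfiles touched by the request [off, off+len)
     under cyclic striping. */
  int touched_subfiles(long off, long len, long stripe, int nsubfiles)
  {
      long units = (off + len - 1) / stripe - off / stripe + 1;
      return units < (long)nsubfiles ? (int)units : nsubfiles;
  }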
Fine granularity striping is to be maintained
• Most HPC parallel I/O solutions are optimized only for large I/O blocks (on the order of megabytes)
• But we focus on maintaining fine granularity
• The problems of network communication and disk access cost are addressed by dedicated optimizations
Overview of the implemented optimizations
• Disk access request aggregation (sorting, cleaning overlaps, merging)
• Network communication aggregation
• Zero-copy streaming between the network and fragmented memory patterns (MPI derived datatypes)
• A multi-block interface that efficiently handles application-related file and memory fragmentation (as in MPI-I/O)
• Overlapping of network communication with disk access in time (at the moment, for the write operation only)
Disk access optimizations
• Sorting
• Cleaning the overlaps
• Merging
• Input: striped user I/O requests
• Output: an optimized set of I/O requests
• No data copy
(Figure: a multi-block I/O request of three blocks; 6 I/O access requests on the local subfile are merged into 2 accesses)
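A sketch of the aggregation step (types and names are illustrative, not SFIO's): the per-subfile requests are sorted by offset, overlaps are cleaned and contiguous accesses are merged, without copying any data:

  #include <stdlib.h>

  typedef struct { long off, len; } req_t;

  static int by_offset(const void *a, const void *b)
  {
      long d = ((const req_t *)a)->off - ((const req_t *)b)->off;
      return d < 0 ? -1 : d > 0;
  }

  /* Sort, clean overlaps and merge in place; returns the new count. */
  int aggregate(req_t *r, int n)
  {
      int i, m = 0;
      qsort(r, n, sizeof *r, by_offset);           /* sorting             */
      for (i = 1; i < n; i++) {
          long end = r[m].off + r[m].len;
          if (r[i].off <= end) {                   /* overlap or contact  */
              long e2 = r[i].off + r[i].len;
              if (e2 > end) r[m].len = e2 - r[m].off;  /* extend, drop overlap */
          } else
              r[++m] = r[i];                       /* start a new access  */
      }
      return n ? m + 1 : 0;
  }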
Network communication aggregation without copying
• Striping across 2 subfiles
• Derived datatypes built on the fly
• Contiguous streaming
(Figure: fragments of the logical file in application memory are streamed to the two remote I/O nodes without intermediate copies)
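A minimal sketch of the technique with standard MPI calls; the fragment layout is invented for the example, while in SFIO such datatypes are built on the fly by mkbset (see the functional architecture below):

  #include <mpi.h>

  /* Send three fragments of the application buffer as one contiguous
     stream, with no intermediate packing copy. */
  void send_fragments(const char *buf, int dest)
  {
      int lengths[3] = { 5, 5, 3 };    /* bytes per fragment           */
      int displs[3]  = { 0, 10, 20 };  /* fragment offsets within buf  */
      MPI_Datatype frags;

      MPI_Type_indexed(3, lengths, displs, MPI_CHAR, &frags);
      MPI_Type_commit(&frags);
      MPI_Send((void *)buf, 1, frags, dest, 0, MPI_COMM_WORLD);
      MPI_Type_free(&frags);
  }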
Functional architecture
• Blue: interface functions
• Green: striping functionality
• Red: I/O request optimizations
• Orange: network communication and the relevant optimizations
• bkmerge: overlap cleaning and aggregation
• mkbset: creates MPI derived datatypes on the fly
(Diagram: the SFIO library on the compute node layers the interface functions mread, mreadc, mreadb, mwrite, mwritec, mwriteb over the cyclic distribution mrw, the request cache sfp_rdwrc with sortcache, flushcache and bkmerge, the flush functions sfp_rflush and sfp_wflush, and the network functions sfp_readc, sfp_writec, sfp_readb, sfp_writeb, sfp_waitall and mkbset; on each I/O node an I/O listener serves the SFP_CMD_READ, SFP_CMD_WRITE, SFP_CMD_BREAD and SFP_CMD_BWRITE commands over MPI)
Optimized throughput as a function of the stripe unit size
• 3 I/O nodes
• 1 compute node
• Global file size: 660 MB
• TNET
• About 10 MB/s per disk
All-to-all stress test on the Swiss-Tx cluster supercomputer
• The stress test is carried out on the Swiss-Tx machine
• 8 full-crossbar 12-port TNet switches
• 64 processors
• Link throughput is about 86 MB/s
SFIO on the Swiss-Tx cluster supercomputer
• MPI-FCI
• Global file size: up to 32 GB
• Mean of 53 measurements for each number of nodes
• Nearly linear scaling with a 200-byte stripe unit!
• The network becomes a bottleneck above 12 nodes
Liquid scheduling for low-latency circuit-switched networks
Reaching liquid throughput in HPC wormhole switching and in optical lightpath routing networks
Upper limit of the network capacity
• Given a set of parallel transmissions and a routing scheme,
• the upper limit of the network's aggregate capacity is its liquid throughput
Distinction: packet switching versus circuit switching
• Packet switching has been replacing circuit switching since the 1970s (more flexible, manageable, scalable)
• But new circuit-switching networks are emerging (HPC clusters, optical switching)
• In HPC, wormhole routing targets extremely low latency requirements
• In optical networks, packet switching is not possible due to the lack of technology
Coarse-grained networks
• In circuit switching, large messages are transmitted entirely (coarse-grained switching)
• Low latency: the sink starts receiving the message as soon as the sender starts the transmission
(Figure: fine-grained packet switching versus coarse-grained circuit switching between a message source and a message sink)
Parallel transmissions in coarse-grained networks
• When the nodes transmit in parallel across a coarse-grained network in an uncoordinated fashion, congestion may occur
• The resulting throughput can be far below the expected liquid throughput
Congestion and blocked paths in wormhole routing
• When a message encounters a busy outgoing port, it waits
• The previously acquired portion of the path remains occupied
(Figure: three source-sink pairs; a blocked message keeps its partial path busy)
Hardware solution in Virtual Cut-Through routing
• In VCT, when the port is busy, the switch buffers the entire message
• Much more expensive hardware than in wormhole switching
(Figure: the blocked message is buffered inside the switch instead of occupying the path)
Other hardware solutions
• In optical networks, OEO conversion can be used
• Significant impact on the cost (versus memory-less wormhole switches and MEMS optical switches)
• Affects the properties of the network (e.g. latency)
Application-level coordinated liquid scheduling
• Liquid scheduling is a software solution
• Implemented at the application level
• No investments in network hardware
• Coordination between the edge nodes is required
• Knowledge of the network topology is assumed
Example of a simple traffic pattern
• 5 sending nodes (above)
• 5 receiving nodes (below)
• 2 switches
• 12 links of equal capacity
• The traffic consists of 25 transfers
Round-robin schedule of the all-to-all traffic pattern
• First, all nodes simultaneously send a message to the node in front of them
• Then, simultaneously, to the next node
• etc.
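For N nodes this schedule is generated by a simple rule: in phase p, node i sends to node (i + p) mod N, so within a phase no two senders target the same receiver. A minimal sketch:

  #include <stdio.h>

  int main(void)
  {
      int N = 5, p, i;                 /* 5 nodes, as in the example      */
      for (p = 0; p < N; p++) {        /* one phase per destination shift */
          printf("phase %d:", p);
          for (i = 0; i < N; i++)
              printf("  %d->%d", i, (i + p) % N);
          printf("\n");
      }
      return 0;
  }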
Throughput of the round-robin schedule
• The 3rd and 4th phases each require two timeframes
• 7 timeframes are needed in total
• Link throughput = 1 Gbps
• Overall throughput = 25/7 × 1 Gbps ≈ 3.57 Gbps
A liquid schedule and its throughput
• 6 timeframes of non-congesting transfers
• Overall throughput = 25/6 × 1 Gbps ≈ 4.17 Gbps
The problem of liquid scheduling
• Build a liquid schedule for an arbitrary traffic of transfers
• This is the problem of partitioning the traffic into a minimal number of subsets of non-congesting transfers
• Timeframe = a subset of non-congesting transfers
Definitions of our mathematical model
• A transfer is the set of links lying on the path of the transmission
• The load of a link is the number of transfers in the traffic using that link
• The most loaded links are called bottlenecks
• The duration of the traffic is the load of its bottlenecks
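These definitions translate directly into code. A sketch under an illustrative data layout (each transfer given as a row of link-membership flags; the names are not from the thesis):

  #define NLINKS 12

  /* transfers[t][l] != 0  iff  transfer t uses link l */
  int traffic_duration(int ntransfers, const int transfers[][NLINKS])
  {
      int load[NLINKS] = { 0 }, t, l, duration = 0;
      for (t = 0; t < ntransfers; t++)          /* load of each link     */
          for (l = 0; l < NLINKS; l++)
              if (transfers[t][l]) load[l]++;
      for (l = 0; l < NLINKS; l++)              /* bottleneck = max load */
          if (load[l] > duration) duration = load[l];
      return duration;
  }

The liquid throughput is then the number of transfers divided by this duration, times the link throughput; in the traffic pattern above, 25 transfers over a bottleneck load of 6 give 25/6 × 1 Gbps.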
Teams = non-congesting transfers using all bottleneck links
• The shortest possible time to carry out the traffic is the active time of the bottleneck links
• So the schedule must keep the bottleneck links busy all the time
• Therefore, the timeframes of a liquid schedule must consist of transfers using all bottlenecks
(Figure: a team covering both bottleneck links, and a set of transfers that is not a team)
Retrieval of teams without repetitions by subdivisions
• Teams can be retrieved without repetitions by recursive partitioning
• By the choice of a transfer, all teams are divided into the teams using that transfer and the teams not using it
• Each half can be similarly subdivided until individual teams are retrieved
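A minimal sketch of this recursion over transfers encoded as bitmasks of links (the sample traffic, the congestion test and the omission of the fullness check are illustrative simplifications of the thesis algorithm):

  #include <stdio.h>

  #define NT 4                                  /* number of transfers     */
  unsigned links[NT] = { 0x1, 0x4, 0x6, 0x8 };  /* links used, as bit sets */
  unsigned bottleneck = 0x4;   /* link 2 carries two transfers: bottleneck */

  /* Branch on transfer i: first the teams that use it, then the teams
     that do not, so each team is reached exactly once. */
  void subdivide(int i, unsigned team, unsigned used)
  {
      if (i == NT) {
          if ((used & bottleneck) == bottleneck) /* uses all bottlenecks   */
              printf("team: 0x%02x\n", team);
          return;
      }
      if (!(used & links[i]))    /* transfer i congests no chosen transfer */
          subdivide(i + 1, team | 1u << i, used | links[i]);
      subdivide(i + 1, team, used);              /* teams avoiding i       */
  }

  int main(void) { subdivide(0, 0, 0); return 0; }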
Teams use all bottlenecks: retrieving the teams of the traffic skeleton
• Since teams must use transfers that use the bottleneck links,
• we can first build the teams using only such transfers (the traffic skeleton)
(Chart: the skeleton's fraction of the traffic)
Optimization by first retrieving the teams of the skeleton
• Speedup obtained by the skeleton optimization
• The search space is reduced 9.5 times
Liquid schedule assembly from the retrieved teams
• Relying on the efficient retrieval of full teams (subsets of non-congesting transfers using all bottlenecks),
• we assemble a liquid schedule by trying different combinations of teams together,
• until all transfers of the traffic are used
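The assembly itself can be pictured as a backtracking search; a toy sketch with timeframes as bitmasks of transfers (the team list is invented, and the reduced-traffic and saturated-team optimizations of the next slide are omitted):

  #include <stdio.h>

  #define NTEAMS 5
  #define ALL 0x1Fu                  /* the traffic: 5 transfers, bits 0..4 */
  unsigned teams[NTEAMS] = { 0x03, 0x0C, 0x10, 0x15, 0x0A };

  /* Cover all transfers with at most `limit` mutually disjoint teams. */
  int assemble(unsigned used, int first, unsigned *out, int depth, int limit)
  {
      int i;
      if (used == ALL) return depth;            /* liquid schedule found   */
      if (depth == limit) return 0;             /* too long, backtrack     */
      for (i = first; i < NTEAMS; i++)
          if (!(teams[i] & used)) {             /* team reuses no transfer */
              out[depth] = teams[i];
              int d = assemble(used | teams[i], i + 1, out, depth + 1, limit);
              if (d) return d;
          }
      return 0;
  }

  int main(void)
  {
      unsigned sched[NTEAMS];
      int limit, i, n = 0;
      for (limit = 1; limit <= NTEAMS && !n; limit++) /* shortest first    */
          n = assemble(0, 0, sched, 0, limit);
      for (i = 0; i < n; i++)
          printf("timeframe %d: 0x%02x\n", i, sched[i]);
      return 0;
  }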
Liquid schedule assembly optimizations (reduced traffic)
• Proved: if we remove a team from a traffic, new bottlenecks can emerge
• The new bottlenecks add constraints on the teams of the reduced traffic
• Proved: a liquid schedule can be assembled using the teams of the reduced traffic (instead of constructing the teams of the initial traffic from the remaining transfers)
• Proved: a liquid schedule can be assembled by considering only saturated full teams
Liquid schedule construction speed with our algorithm
• 360 traffic patterns across the Swiss-Tx network
• Up to 32 nodes
• Up to 1024 transfers
• Comparison of our optimized construction algorithm with a MILP method (optimized for discrete optimization problems)