
Three Topics in Parallel Communications


Presentation Transcript


  1. Three Topics in Parallel Communications
  Thesis presentation by Emin Gabrielyan

  2. Parallel communications: bandwidth enhancement or fault-tolerance?
  • We do not know whether parallel communications were first used for fault-tolerance or for bandwidth enhancement
  • In 1964 Paul Baran proposed parallel communications for fault-tolerance (inspiring the design of ARPANET and the Internet)
  • In 1981 IBM introduced the 8-bit parallel port for faster communication

  3. Bandwidth enhancement by parallelizing the sources and sinks
  • Bandwidth enhancement can be achieved by adding parallel paths
  • But a greater capacity enhancement is achieved if we can replace the senders and destinations with parallel sources and sinks
  • This is possible in parallel I/O (the first topic of the thesis)

  4. Parallel transmissions in coarse-grained networks cause congestion
  • In coarse-grained circuit-switched HPC networks, uncoordinated parallel transmissions cause congestion
  • The overall throughput degrades due to access conflicts on shared resources
  • Coordination of parallel transmissions is covered by the second topic of my thesis (liquid scheduling)

  5. Classical backup parallel circuits for fault-tolerance
  • Typically the redundant resource remains idle
  • As soon as the primary resource fails, the backup resource replaces it

  6. Parallelism in living organisms
  • Parallelism is observed in almost every living organism
  • Duplication of organs primarily serves fault-tolerance
  • And, as a secondary purpose, capacity enhancement

  7. Simultaneous parallelism for fault-tolerance in fine-grained networks
  • A challenging bio-inspired solution is to use all available paths simultaneously to achieve fault-tolerance
  • This topic is addressed in the last part of my presentation (capillary routing)

  8. Fine Granularity Parallel I/O for Cluster Computers
  SFIO, a Striped File I/O library for parallel I/O

  9. Why is parallel I/O required?
  • A single I/O gateway for a cluster computer saturates
  • It does not scale with the size of the cluster

  10. What is Parallel I/O for Cluster Computers?
  • Some or all of the cluster's computers can be used for parallel I/O

  11. Objectives of parallel I/O
  • Resistance to concurrent access
  • Scalability as the number of I/O nodes increases
  • High level of parallelism and load balance for all application patterns and all types of I/O requests

  12. Parallel I/O subsystem: concurrent access by multiple compute nodes
  • No concurrent access overheads
  • No performance degradation when the number of compute nodes increases

  13. Scalable throughput of the parallel I/O subsystem
  • The overall parallel I/O throughput should increase linearly as the number of I/O nodes increases
  [Chart: throughput of the parallel I/O subsystem versus the number of I/O nodes]

  14. Concurrency and Scalability = Scalable All-to-All Communication
  • Concurrency and scalability (as the number of I/O nodes increases) can be represented by a scalable overall throughput as the number of compute and I/O nodes increases
  [Chart: all-to-all throughput between compute nodes and I/O nodes versus the number of I/O and compute nodes]

  15. High level of parallelism and load balance
  • Balanced distribution across the parallel disks must be ensured for all types of application patterns:
  • Small or large I/O requests
  • Contiguous or fragmented I/O request patterns

  16. How is parallelism achieved?
  • Split the logical file into stripes
  • Distribute the stripes cyclically across the subfiles, as sketched below
  [Figure: a logical file striped cyclically across six subfiles]
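
  As an illustration of the cyclic distribution, here is a minimal C sketch (hypothetical code, not SFIO's implementation; all names are invented) that maps a logical file offset to a subfile and an offset inside it, using the 5-byte stripe unit and the two subfiles of the mopen() example on the next slide:

      #include <stdio.h>

      /* Hypothetical sketch: locate a logical offset under cyclic striping. */
      typedef struct { int subfile; long offset; } stripe_loc;

      static stripe_loc locate(long logical, int stripe_unit, int n_subfiles)
      {
          long stripe = logical / stripe_unit;          /* global stripe number */
          stripe_loc l;
          l.subfile = (int)(stripe % n_subfiles);       /* cyclic distribution  */
          l.offset  = (stripe / n_subfiles) * stripe_unit
                    + logical % stripe_unit;            /* position in subfile  */
          return l;
      }

      int main(void)
      {
          for (long off = 0; off < 35; off += 5) {      /* one probe per stripe */
              stripe_loc l = locate(off, 5, 2);
              printf("logical %2ld -> subfile %d, offset %ld\n",
                     off, l.subfile, l.offset);
          }
          return 0;
      }

  For example, logical offsets 0, 5 and 10 fall into subfile 0, subfile 1 and again subfile 0 (at offset 5), which is exactly the cyclic pattern of the figure.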

  17. The POSIX-like Interface of Striped File I/O
  • Using SFIO from MPI
  • Simple POSIX-like interface

      #include <mpi.h>
      #include "/usr/local/sfio/mio.h"

      int _main(int argc, char *argv[])
      {
          MFILE *f;
          int r = rank();
          /* Collective open operation */
          f = mopen("p1/tmp/a.dat;p2/tmp/a.dat;", 5);
          /* Each process writes 8 to 14 characters at its own position */
          if (r == 0) mwritec(f, 0,  "Good*morning!",  13);
          if (r == 1) mwritec(f, 13, "Bonjour!",        8);
          if (r == 2) mwritec(f, 21, "Buona*mattina!", 14);
          /* Collective close operation */
          mclose(f);
          return 0;
      }

  18. Distribution of the global file data across the subfiles
  • Example with three compute nodes and two I/O nodes
  • The global file contains "Good*morning!Bonjour!Buona*mattina!", written at offsets 0, 13 and 21
  [Figure: the global file's characters, striped in 5-byte units, interleaved between the first and the second subfile]

  19. Impact of the stripe unit size on the load balance
  • When the stripe unit size is large, there is no guarantee that an I/O request will be well parallelized
  [Figure: an I/O request on the logical file falling mostly into one of the subfiles]

  20. Fine granularity striping with good load balance
  • Low granularity ensures good load balance and a high level of parallelism
  • But it results in high network communication and disk access costs
  [Figure: an I/O request on the logical file striped at fine granularity across all subfiles]

  21. Fine granularity striping is to be maintained
  • Most HPC parallel I/O solutions are optimized only for large I/O blocks (on the order of megabytes)
  • But we focus on maintaining fine granularity
  • The network communication and disk access costs are addressed by dedicated optimizations

  22. Overview of the implemented optimizations
  • Disk access request aggregation (sorting, cleaning overlaps and merging)
  • Network communication aggregation
  • Zero-copy streaming between the network and fragmented memory patterns (MPI derived datatypes)
  • Support of the multi-block interface, which efficiently optimizes application-related file and memory fragmentation (MPI-I/O)
  • Overlapping of network communication with disk access in time (at the moment for the write operation only)

  23. Disk access optimizations
  • Sorting
  • Cleaning the overlaps
  • Merging
  • Input: striped user I/O requests
  • Output: an optimized set of I/O requests
  • No data copy (see the sketch below)
  [Figure: a multi-block I/O request (block 1, block 2, block 3) whose 6 disk access requests on the local subfile are merged into 2 accesses]
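
  A minimal C sketch of the aggregation step (hypothetical, not the SFIO source): requests are sorted by offset, then overlapping or adjacent ones are merged, so fewer and larger accesses reach the disk; only (offset, size) descriptors are touched, no file data is copied:

      #include <stdlib.h>

      typedef struct { long offset; long size; } io_req;

      static int by_offset(const void *a, const void *b)
      {
          long d = ((const io_req *)a)->offset - ((const io_req *)b)->offset;
          return d < 0 ? -1 : d > 0;              /* sorting */
      }

      /* Merges the request list in place; returns the new count. */
      static int aggregate(io_req *r, int n)
      {
          if (n == 0) return 0;
          qsort(r, n, sizeof *r, by_offset);
          int out = 0;
          for (int i = 1; i < n; i++) {
              long end = r[out].offset + r[out].size;
              if (r[i].offset <= end) {           /* overlap or adjacency */
                  long iend = r[i].offset + r[i].size;
                  if (iend > end)                 /* cleaning the overlaps */
                      r[out].size = iend - r[out].offset;
              } else {
                  r[++out] = r[i];                /* start a new merged run */
              }
          }
          return out + 1;
      }

  On the figure's example, the six striped requests on the local subfile collapse into two merged accesses.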

  24. Network communication aggregation without copying
  • Striping across 2 subfiles
  • Derived datatypes built on the fly
  • Contiguous streaming from the application memory to the remote I/O nodes (see the sketch below)
  [Figure: fragments of the application memory / logical file streamed to remote I/O nodes 1 and 2]
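
  The zero-copy streaming can be illustrated with a short hedged sketch (a hypothetical helper, not SFIO code): an MPI derived datatype is created on the fly over the fragmented application memory, so a single send streams all fragments without gathering them into an intermediate buffer:

      #include <mpi.h>

      /* Hypothetical sketch: describe 'count' scattered blocks (byte
       * displacements relative to 'base') and ship them in one send. */
      static void send_fragments(void *base, int count,
                                 int blocklens[], MPI_Aint displs[],
                                 int io_node, MPI_Comm comm)
      {
          MPI_Datatype frags;
          MPI_Type_create_hindexed(count, blocklens, displs, MPI_BYTE, &frags);
          MPI_Type_commit(&frags);
          MPI_Send(base, 1, frags, io_node, 0, comm);   /* contiguous stream */
          MPI_Type_free(&frags);
      }

  This mirrors the role of the mkbset mechanism named on the next slide, though the exact SFIO internals may differ.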

  25. Functional Architecture
  • Blue: interface functions
  • Green: striping functionality
  • Red: I/O request optimizations
  • Orange: network communication and the relevant optimizations
  • bkmerge: overlapping and aggregation
  • mkbset: creates MPI derived datatypes on the fly
  [Diagram: the SFIO library on the compute node, comprising the interface functions (mread, mreadc, mreadb, mwrite, mwritec, mwriteb), cyclic distribution (mrw), request caching (sfp_rdwrc), flushing (sfp_rflush, sfp_wflush), request optimization (sortcache, flushcache, bkmerge) and the network functions (sfp_read, sfp_write, sfp_readc, sfp_writec, sfp_readb, sfp_writeb, sfp_waitall, mkbset), talking over MPI to the I/O listener on each I/O node, which serves the SFP_CMD_READ, SFP_CMD_WRITE, SFP_CMD_BREAD and SFP_CMD_BWRITE commands]

  26. Optimized throughput as a function of the stripe unit size
  • 3 I/O nodes
  • 1 compute node
  • Global file size: 660 MB
  • TNet network
  • About 10 MB/s per disk

  27. All-to-all stress test on the Swiss-Tx cluster supercomputer
  • The stress test is carried out on the Swiss-Tx machine
  • 8 full-crossbar 12-port TNet switches
  • 64 processors
  • Link throughput is about 86 MB/s

  28. SFIO on the Swiss-Tx cluster supercomputer
  • MPI-FCI
  • Global file size: up to 32 GB
  • Mean of 53 measurements for each number of nodes
  • Nearly linear scaling with a 200-byte stripe unit!
  • The network becomes a bottleneck above 12 nodes

  29. Liquid scheduling for low-latency circuit-switched networks
  Reaching the liquid throughput in HPC wormhole switching and in optical lightpath routing networks

  30. Upper limit of the network capacity
  • Given a set of parallel transmissions and a routing scheme
  • The upper limit of the network's aggregate capacity is its liquid throughput

  31. Distinction: packet switching versus circuit switching
  • Packet switching has been replacing circuit switching since the 1970s (more flexible, manageable, scalable)
  • But new circuit-switching networks are emerging (HPC clusters, optical switching)
  • In HPC, wormhole routing targets extremely low latency requirements
  • In optical networks, packet switching is not possible due to the lack of technology

  32. Coarse-grained networks
  • In circuit switching, large messages are transmitted entirely (coarse-grained switching)
  • Low latency: the sink starts receiving the message as soon as the sender starts transmission
  [Figure: a message source and sink, contrasting fine-grained packet switching with coarse-grained circuit switching]

  33. Parallel transmissions in coarse-grained networks
  • When the nodes transmit in parallel across a coarse-grained network in an uncoordinated fashion, congestion may occur
  • The resulting throughput can be far below the expected liquid throughput

  34. Congestion and blocked paths in wormhole routing
  • When a message encounters a busy outgoing port, it waits
  • The previously acquired portion of the path remains occupied
  [Figure: three source-sink transmissions, one blocking another at a busy port]

  35. Hardware solution in Virtual Cut-Through routing
  • In VCT, when the outgoing port is busy, the switch buffers the entire message
  • This requires much more expensive hardware than wormhole switching
  [Figure: the blocked message being buffered inside the switch]

  36. Other hardware solutions
  • In optical networks, OEO (optical-electrical-optical) conversion can be used
  • Significant impact on the cost (versus memory-less wormhole switches and MEMS optical switches)
  • It also affects the properties of the network (e.g. latency)

  37. Application-level coordinated liquid scheduling
  • Liquid scheduling is a software solution implemented at the application level
  • No investments in network hardware
  • Coordination between the edge nodes is required
  • Knowledge of the network topology is assumed

  38. Example of a simple traffic pattern
  • 5 sending nodes (above) and 5 receiving nodes (below)
  • 2 switches
  • 12 links of equal capacity
  • The traffic consists of 25 transfers

  39. Round-robin schedule of the all-to-all traffic pattern
  • First, all nodes simultaneously send their message to the node in front of them
  • Then, simultaneously, to the next node
  • And so on

  40. Throughput of the round-robin schedule
  • The 3rd and 4th phases each require two timeframes
  • 7 timeframes are needed in total
  • Link throughput = 1 Gbps
  • Overall throughput = 25/7 × 1 Gbps ≈ 3.57 Gbps

  41. A liquid schedule and its throughput
  • 6 timeframes of non-congesting transfers
  • Overall throughput = 25/6 × 1 Gbps ≈ 4.17 Gbps (the arithmetic is written out below)
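
  Written out, the arithmetic of the two previous slides is one formula:

      overall throughput = (number of transfers / number of timeframes) × link throughput

      round-robin schedule:  25 / 7 × 1 Gbps ≈ 3.57 Gbps
      liquid schedule:       25 / 6 × 1 Gbps ≈ 4.17 Gbps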

  42. The problem of liquid scheduling
  • Build a liquid schedule for an arbitrary traffic of transfers
  • This is the problem of partitioning the traffic into a minimal number of subsets of non-congesting transfers
  • Timeframe = a subset of non-congesting transfers

  43. Definitions of our mathematical model
  • A transfer is the set of links lying on the path of the transmission
  • The load of a link is the number of transfers in the traffic using that link
  • The most loaded links are called bottlenecks
  • The duration of the traffic is the load of its bottlenecks (see the sketch below)
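
  The model's quantities are easy to state in code; here is a hedged C sketch (hypothetical types, not the thesis implementation) that computes the link loads and the duration from a set of transfers:

      #define MAX_LINKS 64

      typedef struct {
          int nlinks;
          int links[MAX_LINKS];   /* ids of the links on the path */
      } transfer;

      /* Fills load[] and returns the bottleneck load, i.e. the
       * duration of the traffic in timeframes. */
      static int link_loads(const transfer tr[], int ntr, int load[MAX_LINKS])
      {
          int duration = 0;
          for (int l = 0; l < MAX_LINKS; l++) load[l] = 0;
          for (int t = 0; t < ntr; t++)
              for (int i = 0; i < tr[t].nlinks; i++)
                  if (++load[tr[t].links[i]] > duration)
                      duration = load[tr[t].links[i]];
          return duration;
      }

  A link l is then a bottleneck exactly when load[l] == duration.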

  44. Teams = non-congesting transfers using all bottleneck links
  • The shortest possible time to carry out the traffic is the active time of the bottleneck links
  • So the schedule must keep the bottleneck links busy all the time
  • Therefore each timeframe of a liquid schedule must consist of transfers using all bottlenecks: a team (see the predicate sketched below)
  [Figure: a team covering all bottlenecks versus a transfer set that is not a team]
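
  Continuing the sketch above (still hypothetical code), the team property is a simple predicate: no two transfers may share a link, and every bottleneck link must be used:

      #include <stdbool.h>

      static bool is_team(const transfer team[], int nteam,
                          const int load[MAX_LINKS], int duration)
      {
          int used[MAX_LINKS] = {0};
          for (int t = 0; t < nteam; t++)
              for (int i = 0; i < team[t].nlinks; i++)
                  if (used[team[t].links[i]]++)     /* shared link: congestion */
                      return false;
          for (int l = 0; l < MAX_LINKS; l++)
              if (load[l] == duration && !used[l])  /* an idle bottleneck */
                  return false;
          return true;
      }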

  45. Retrieval of teams without repetition by subdivision
  • Teams can be retrieved without repetition by recursive partitioning
  • By the choice of a transfer, all teams are divided into teams using that transfer and teams not using it
  • Each half can be similarly subdivided until individual teams are retrieved (see the recursion sketched below)
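
  The subdivision idea can be sketched as a recursion over the transfers (hypothetical code; the real algorithm prunes congesting branches early and, as the next slides show, works on the traffic skeleton first):

      /* Decide for each transfer in turn whether it is in the team;
       * every team is reached exactly once, with no repetitions. */
      static long teams_found;

      static void subdivide(const transfer all[], int n, int next,
                            transfer chosen[], int nchosen,
                            const int load[MAX_LINKS], int duration)
      {
          if (next == n) {                          /* all transfers decided */
              if (nchosen > 0 && is_team(chosen, nchosen, load, duration))
                  teams_found++;                    /* emit: here we just count */
              return;
          }
          chosen[nchosen] = all[next];              /* teams using 'next' ...   */
          subdivide(all, n, next + 1, chosen, nchosen + 1, load, duration);
          subdivide(all, n, next + 1, chosen, nchosen, load, duration);
                                                    /* ... and teams without it */
      }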

  46. Teams use all bottlenecks: retrieving the teams of the traffic skeleton
  • Since teams must consist of transfers using the bottleneck links
  • We can first create teams using only such transfers (the traffic skeleton)
  [Chart: the skeleton's fraction of the whole traffic]

  47. Optimization by first retrieving the teams of the skeleton
  • Speedup achieved by the skeleton optimization
  • The search space is reduced 9.5 times

  48. Liquid schedule assembly from the retrieved teams
  • Relying on the efficient retrieval of full teams (subsets of non-congesting transfers using all bottlenecks)
  • We assemble a liquid schedule by trying different combinations of teams together
  • Until all transfers of the traffic are used (see the sketch below)
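
  At a high level the assembly is a backtracking search; here is a hedged C sketch (hypothetical, and simplified: it picks from a precomputed list of full teams, whereas the thesis, as the next slide explains, re-derives the teams of the reduced traffic at each step):

      #include <stdbool.h>
      #define MAX_TR 64

      /* teams[k][t] is true iff team k contains transfer t.
       * remaining[t] marks the transfers not yet scheduled. */
      static bool assemble(const bool teams[][MAX_TR], int nteams,
                           bool remaining[MAX_TR], int ntr, int frames_left)
      {
          int todo = 0;
          for (int t = 0; t < ntr; t++) todo += remaining[t];
          if (todo == 0) return true;               /* all transfers placed */
          if (frames_left == 0) return false;       /* schedule not liquid  */

          for (int k = 0; k < nteams; k++) {
              bool fits = true;                     /* the team must lie     */
              for (int t = 0; t < ntr && fits; t++) /* within the remaining  */
                  if (teams[k][t] && !remaining[t]) /* traffic               */
                      fits = false;
              if (!fits) continue;

              for (int t = 0; t < ntr; t++)         /* use team as a frame */
                  if (teams[k][t]) remaining[t] = false;
              if (assemble(teams, nteams, remaining, ntr, frames_left - 1))
                  return true;
              for (int t = 0; t < ntr; t++)         /* backtrack */
                  if (teams[k][t]) remaining[t] = true;
          }
          return false;
      }

  Called with frames_left equal to the traffic's duration, a true result means a liquid schedule was found.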

  49. Liquid schedule assembly optimizations (reduced traffic)
  • Proved: if we remove a team from a traffic, new bottlenecks can emerge
  • The new bottlenecks add additional constraints on the teams of the reduced traffic
  • Proved: a liquid schedule can be assembled using the teams of the reduced traffic (instead of constructing teams of the initial traffic from the remaining transfers)
  • Proved: a liquid schedule can be assembled by considering only saturated full teams

  50. Liquid schedule construction speed with our algorithm
  • 360 traffic patterns across the Swiss-Tx network
  • Up to 32 nodes and up to 1024 transfers
  • Comparison of our optimized construction algorithm with the MILP method (designed for discrete optimization problems)
