Collective Buffering: Improving Parallel I/O Performance

By Bill Nitzberg and Virginia Lo.

Presentation Transcript


  1. Collective Buffering: Improving Parallel I/O Performance By Bill Nitzberg and Virginia Lo

  2. Outline • Introduction • Concepts • Collective parallel I/O algorithms • Collective buffering experiments • Conclusion • Question

  3. Introduction • Existing parallel I/O systems evolved directly from I/O systems for serial machines • Serial I/O systems are heavily tuned for: • Sequential, large accesses with limited file sharing between processes • A high degree of both spatial and temporal locality

  4. Introduction (cont.) • This paper presents a set of algorithms known as collective buffering algorithms • These algorithms seek to improve I/O performance on distributed-memory machines by using global knowledge of the I/O operations

  5. Concepts • Global data structure • The global data structure is the logical view of the data from the application’s point of view • Scientific applications generally use global data structures consisting of arrays distributed in one, two, or three dimensions

  6. Concepts (cont.) • Data distribution • The global data structure is distributed among node memories by cutting it into data chunks • The HPF BLOCK distribution partitions the global data structure into P equally sized pieces • The HPF CYCLIC distribution divides the global data structure into small pieces (whose size is the distribution size, or block size) and deals these pieces out to the P nodes in round-robin fashion
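The two distributions can be illustrated with a small sketch. This is only an illustration for a one-dimensional array: the function names `block_owner` and `cyclic_owner` are hypothetical, and the BLOCK piece size is assumed to be ceil(n / p), as in HPF.

```python
def block_owner(i, n, p):
    """Node owning global index i under a BLOCK distribution
    of n elements over p nodes (piece size assumed to be ceil(n / p))."""
    block = -(-n // p)  # ceiling division
    return i // block


def cyclic_owner(i, p, k=1):
    """Node owning global index i under a CYCLIC(k) distribution:
    pieces of k elements are dealt to the p nodes round-robin."""
    return (i // k) % p


n, p = 16, 4
print([block_owner(i, n, p) for i in range(n)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print([cyclic_owner(i, p, k=2) for i in range(n)])
# [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```

With BLOCK, each node holds one long contiguous run of the array; with CYCLIC(k), each node holds many short runs of k elements.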

  8. Concepts (cont.) • File layout • File layout is another form of data distribution • The file represents a linearization of the global data structure, such as the row-major ordering of a three-dimensional array • This linearization is called the canonical file • The file is distributed among the I/O nodes
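As a rough illustration of the file layout, the sketch below maps an element of a three-dimensional array to its byte offset in a row-major canonical file, and then to an I/O node. The round-robin striping, the 8-byte element size, and the names `canonical_offset` and `io_node_for` are assumptions made for the example; the slides do not say how PFS or PIOFS place file blocks.

```python
def canonical_offset(i, j, k, shape, elem_size=8):
    """Byte offset of element (i, j, k) in a row-major canonical file."""
    _, ny, nz = shape
    return ((i * ny + j) * nz + k) * elem_size


def io_node_for(offset, stripe_size, num_io_nodes):
    """I/O node holding a byte offset, assuming round-robin striping."""
    return (offset // stripe_size) % num_io_nodes


shape = (4, 4, 4)
off = canonical_offset(1, 2, 3, shape)
print(off)                       # 216
print(io_node_for(off, 64, 6))   # 3
```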

  10. Collective parallel I/O algorithms • Naïve algorithm • The naïve algorithm treats parallel I/O the same as workstation I/O • The order of writes depends on the data layout in each node’s memory, which has no relation to the layout of data on disk • The unit of data transferred in each I/O operation is the data block: the smallest unit of local data that is contiguous with respect to the canonical file

  11. Collective parallel I/O algorithms (cont.) • Naïve algorithm (cont.) • The data block is very small, and its size is unrelated to the size of a file block because of the disparity between data distribution and file layout parameters • The overall effects are: • The network is flooded with many small messages • Messages arrive at I/O nodes in an uncoordinated fashion, resulting in highly inefficient disk writes
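To see why the naïve approach produces many small requests, the sketch below lists the data blocks a single node owns under a one-dimensional CYCLIC(k) distribution. It is a simplified model, not the paper's code: `naive_data_blocks` is a hypothetical helper, and each data block is assumed to become one separate I/O request.

```python
def naive_data_blocks(n, p, k, node):
    """(start, length) runs of global indices owned by `node` under
    CYCLIC(k); each run is one data block, i.e. one naive I/O request."""
    blocks = []
    start = node * k
    while start < n:
        blocks.append((start, min(k, n - start)))
        start += k * p  # the next piece dealt to this node
    return blocks


n, p, k = 32, 4, 2
print(naive_data_blocks(n, p, k, node=1))
# [(2, 2), (10, 2), (18, 2), (26, 2)]
total = sum(len(naive_data_blocks(n, p, k, q)) for q in range(p))
print(total)  # 16 requests of 2 elements each
```

The request size is set by k, the distribution's block size, and has nothing to do with the file block size on the I/O nodes.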

  13. Collective parallel I/O algorithms (cont.) • Collective buffering algorithm • This method rearranges the data on compute nodes before I/O operations are issued, to minimize the number of disk operations • The permutation can be performed “in place”, where the compute nodes transpose the data among themselves • It can also be performed “on auxiliary nodes”, where the compute nodes transpose the data by sending it to a set of auxiliary buffering nodes
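The two phases can be sketched as a single-process simulation. This is a toy model under stated assumptions, not the authors' implementation: nodes are plain Python lists, the source distribution is CYCLIC(k), the intermediate distribution is BLOCK, and the helper names (`cyclic_local`, `permute_to_block`, `write_collective`) are hypothetical.

```python
def cyclic_local(data, p, k):
    """Split a global list into per-node lists under CYCLIC(k),
    keeping the global index alongside each value."""
    local = [[] for _ in range(p)]
    for i, x in enumerate(data):
        local[(i // k) % p].append((i, x))
    return local


def permute_to_block(local, n):
    """Permutation phase: each element is sent to the node that owns it
    under a BLOCK intermediate distribution."""
    p = len(local)
    block = -(-n // p)  # ceil(n / p)
    buffers = [[] for _ in range(p)]
    for node_items in local:
        for i, x in node_items:
            buffers[i // block].append((i, x))
    return [sorted(buf) for buf in buffers]  # order by global index


def write_collective(buffers):
    """I/O phase: each buffering node issues one contiguous write."""
    file, writes = [], 0
    for buf in buffers:
        if buf:
            file.extend(x for _, x in buf)
            writes += 1
    return file, writes


data = list(range(100, 132))           # 32 elements
local = cyclic_local(data, p=4, k=2)   # 8 small data blocks per node
file, writes = write_collective(permute_to_block(local, len(data)))
print(file == data, writes)            # True 4
```

In this toy case the permutation turns 16 two-element requests into 4 contiguous writes, at the cost of an extra exchange of data between nodes.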

  16. Collective parallel I/O algorithms (cont.) • Four techniques are developed and evaluated: 1 - All compute nodes are used to permute the data to a simple HPF BLOCK intermediate distribution in a single step 2 - The first technique is refined by realistically limiting the amount of buffer space and using an intermediate distribution that matches the file layout

  17. Collective parallel I/O algorithms (cont.) • Four techniques (cont.): 3 - This technique uses an HPF CYCLIC intermediate distribution 4 - This technique uses scatter/gather hardware to eliminate the latency-dominated overhead of the permutation phase

  18. Collective buffering experiments • Experiment systems: • The Paragon consists of 224 processing nodes connected in a 16x32 mesh • Applications space-share 208 compute nodes with 32 MB of memory each • Nine I/O nodes, each with one SCSI-1 RAID-3 disk array consisting of 5 disks of 2 gigabytes each • The parallel file system, PFS, is configured to use 6 of the 9 I/O nodes

  19. Collective buffering experiments (cont.) • Experiment systems (cont.): • The SP2 consists of 160 nodes. Each node is an IBM RS6000/590 with 128 MB of memory and a SCSI-1 attached 2 GB disk • The parallel file system, the IBM AIX Parallel I/O File System (PIOFS), is configured with 8 I/O nodes (semi-dedicated servers) and 150 compute nodes

  31. Conclusion • Collective buffering improves on naïve parallel I/O performance by two orders of magnitude for small data block sizes • Peak performance can be obtained with minimal buffer space (approximately 1 megabyte per I/O node) • Performance depends on the intermediate distribution (by up to a factor of 2)

  32. Conclusion (cont.) • There is no single intermediate distribution which provides the best performance for all cases, but a few come close • Collective buffering with scatter/gather can potentially deliver peak performance for all data block sizes.

  33. Question • What are the advantages and disadvantages of the naïve algorithm? • What is collective buffering, and how can this technique improve parallel I/O performance?
