
I/O Strategies for the T3E

Jonathan Carter, NERSC User Services


Presentation Transcript


  1. I/O Strategies for the T3E Jonathan Carter NERSC User Services

  2. T3E Overview • T3E is a set of Processing Elements (PE) connected by a fast 3D torus. • PEs do not have local disk • All PEs access all filesystems equivalently • Path for I/O generally looks like: • user buffer space • system buffer space • I/O device buffer space

  3. Filesystems • /usr/tmp • fast • subject to 14 day purge, not backed up • check quota with quota -s /usr/tmp (usually 75 GB and 6000 inodes) • $TMPDIR • fast • purged at end of job or session • shares quota with /usr/tmp • $HOME • slower • permanent, backed up • check quota with quota (usually 2 GB and 3500 inodes)

  4. Types of I/O • Language I/O: Fortran or C (ANSI or POSIX) • Cray FFIO library (can be used from Fortran or C) • MPI I/O • Cray extensions to Fortran and C I/O (mostly for compatibility with PVP systems)

  5. I/O Strategies - Exclusive access files • Each PE reads and writes to a separate file • Language I/O • MPI I/O • Increase language I/O performance with FFIO library (C must use POSIX style calls)

  6. I/O Strategies - Communication and I/O PE • One PE coordinates reading and writing and communicates data back and forth between other PEs via message passing • Language I/O • MPI I/O • Increase language I/O performance with FFIO library

  7. I/O Strategies - Shared files • All PEs read and write the same file simultaneously • Language I/O with FFIO library global layer • MPI I/O • Language I/O with FFIO library global layer and Cray extensions for additional flexibility

  8. Cray FFIO library • FFIO is a set of I/O layers tuned for different I/O characteristics • Buffering of data (configurable size) • Caching of data (configurable size) • Available to regular Fortran I/O without reprogramming • Available for C through POSIX-like calls, e.g. ffopen, ffwrite

  9. The assign command • The assign command controls • which FFIO layer is active • striping across multiple partitions • lots more • Scope of assign • File name • Fortran unit number • File type (e.g. all sequential unformatted files)

  10. assign Examples • Read and write the file restart.file from all PEs by using the FFIO library global layer:
        assign -F global:128:2 f:restart.file
      • Use the FFIO library bufa layer to improve performance for the file opened on Fortran unit 10:
        assign -F bufa:128:2 u:10
      • Use the FFIO library bufa layer to improve performance for all unformatted sequential Fortran files:
        assign -F bufa:128:2 g:su

  11. assign Examples • To see all active assigns:
        assign -V
      • To remove all active assigns:
        assign -R

  12. bufa FFIO layer • bufa is an asynchronous buffering layer • performs read-ahead, write-behind • specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4 KB blocks, and nbufs is the number of buffers • buffer space increases your application's memory requirements
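The memory cost of a layer specification can be estimated from the two numbers in it. The following is a minimal sketch, not a Cray tool: the function name is hypothetical, and it assumes a 4 KB block is exactly 4096 bytes and that the specification always has the three-part layer:bs:nbufs form shown on the slide.

```python
BLOCK_BYTES = 4096  # assumption: one "4 KB block" is 4096 bytes


def ffio_buffer_bytes(spec: str) -> int:
    """Return the buffer space, in bytes, implied by a 'layer:bs:nbufs' spec.

    bs is the size of one buffer in 4 KB blocks and nbufs is the number
    of buffers, as described on the slide. Hypothetical helper.
    """
    layer, bs, nbufs = spec.split(":")
    return int(bs) * BLOCK_BYTES * int(nbufs)


if __name__ == "__main__":
    # The spec used in the assign examples: two buffers of 128 blocks each.
    print(ffio_buffer_bytes("bufa:128:2"))
```

For bufa:128:2 this comes to 128 x 4096 x 2 = 1,048,576 bytes, so roughly 1 MB of extra memory for each file opened through the layer.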

  13. global FFIO layer • global is a caching and buffering layer which enables multiple PEs to read and write to the same file • if one PE has already read the data, an additional read request from another PE will result in a remote memory copy • file open is a synchronizing event • By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm) • specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4 KB blocks, and nbufs is the number of buffers per PE

  14. File positioning with the global FFIO layer • Positioning of a read or write is your responsibility • File pointers are private • Fortran • Use a direct access file, and read/write(rec=num) • Use Cray extensions setpos and getpos to position file pointer (not portable) • C • Use ffseek
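Because file pointers are private, each PE has to work out for itself where its data belongs in the shared file. The sketch below illustrates that arithmetic only: it is a serial Python simulation, not FFIO or Fortran I/O, the function names are hypothetical, and it assumes fixed-length records of 4-byte reals, with PE number pe owning record pe+1 as it would with read/write(rec=num) on a direct access file.

```python
import os
import struct
import tempfile

REAL_BYTES = 4  # assumption: 4-byte reals


def write_slice(path, pe, values):
    """Write one PE's record at the offset that PE computes for itself."""
    record_bytes = REAL_BYTES * len(values)
    with open(path, "r+b") as f:
        f.seek(pe * record_bytes)  # plays the role of rec=pe+1
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_all(path, count):
    """Read the whole file back as a flat list of reals."""
    with open(path, "rb") as f:
        return list(struct.unpack(f"<{count}f", f.read(REAL_BYTES * count)))


def demo():
    npes, m = 3, 2
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Writes arrive in arbitrary order; explicit offsets keep them apart.
        for pe in (2, 0, 1):
            write_slice(path, pe, [float(pe)] * m)
        return read_all(path, npes * m)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

The order in which the simulated PEs write does not matter, because no write depends on where another PE left its pointer.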

  15. FFIO considerations • The examples above use an unblocked file structure, whereas normal Fortran files are blocked. To read the file without the global or bufa layers you must use assign -s unblocked f:filename • bufa and global do not allow backspace, or skipping over a partially read record. You can allow this behavior by using the cos layer in addition to bufa or global, but then setpos doesn’t work. assign -F cos:128,bufa:128:2 f:filename

  16. More on FFIO • There are many other FFIO layers, some pretty obscure • cache and cachea layers, good for random access files • man intro_ffio for a terse description • Cray Publication - Application Programmer’s I/O Guide

  17. More on assign • Many text processing options • Switch between Fortran 77 and Fortran 90 namelist • File pre-allocation • File striping

  18. Further Information • I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials • Cray Publication - Application Programmer’s I/O Guide • Cray Publication - Cray T3E Fortran Optimization Guide • man assign

  19. MPI I/O • Part of MPI-2 • Interface for High Performance Parallel I/O • data partitioning • collective I/O • asynchronous I/O • portability and interoperability

  20. MPI I/O Definitions • An MPI file is an ordered collection of MPI types. • A file may be opened individually or collectively by a group of processes • The fileview defines a template for accessing the file and is used to partition the file amongst processes

  21. Fileviews • A fileview is composed of three pieces: • a displacement (in bytes) from the beginning of the file • an elementary datatype (etype), which is the unit of data access and positioning within the file • a filetype, which defines a template for accessing the file. A filetype can contain etypes or holes of the same extent as etypes.

  22. Fileviews (cont.) • The filetype pattern is repeated, “tiling” the file • Only the non-empty slots are available to read or write

  23. Fileviews (cont.) • Each process can have a different filetype [Figure: filetypes for Process 0, Process 1, and Process 2]
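The tiling rule can be made concrete by computing which byte offsets a process can reach through its view. This is a minimal sketch in plain Python, not MPI: the function name is hypothetical, the filetype is modeled as a list of booleans (True for an etype slot, False for a hole), and the interleaved three-process layout is an assumed example rather than the one in the original figure.

```python
def visible_offsets(disp, etype_bytes, filetype, count):
    """Byte offsets of the first `count` etypes visible through a fileview.

    disp        -- displacement in bytes from the start of the file
    etype_bytes -- size of one etype in bytes
    filetype    -- list of booleans, one per etype-sized slot
                   (True = data slot, False = hole)
    """
    if not any(filetype):
        raise ValueError("filetype needs at least one data slot")
    offsets = []
    tile = 0
    while len(offsets) < count:
        # The filetype pattern repeats end to end, tiling the file.
        for slot, is_data in enumerate(filetype):
            if is_data and len(offsets) < count:
                position = tile * len(filetype) + slot
                offsets.append(disp + position * etype_bytes)
        tile += 1
    return offsets


if __name__ == "__main__":
    nprocs = 3
    for rank in range(nprocs):
        # Assumed layout: process `rank` owns slot `rank` of every tile.
        filetype = [slot == rank for slot in range(nprocs)]
        print(rank, visible_offsets(0, 4, filetype, 3))
```

With 4-byte etypes and no displacement, process 0 sees offsets 0, 12, 24, process 1 sees 4, 16, 28, and process 2 sees 8, 20, 32, so the three views partition the file without overlapping.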

  24. MPI_File_set_view • Called after MPI_File_open to set fileview • MPI_File_set_view(fh, disp, etype, filetype, datarep, info) • fh is a file handle • disp, etype, and filetype define the fileview • datarep is one of “native”, “internal”, or “external32” • info is a set of hints to optimize performance

  25. MPI Info object • An info object bundles up a set of parameters:
        integer finfo
        call MPI_Info_create(finfo, ierr)
        call MPI_Info_set(finfo, 'access_style', 'write_mostly', ierr)
      • MPI I/O defines a set of parameters used to help optimize I/O performance • MPI_INFO_NULL can be used instead of an info object

  26. Open and Close • MPI_File_open(comm, filename, amode, info, fh) • comm, open is collective over this communicator • filename, string or character variable • file access mode: MPI_MODE_RDONLY, MPI_MODE_RDWR etc. • info object, used to pass hints to open • file handle • MPI_File_close(fh)

  27. Utility routines • MPI_File_delete • MPI_File_set_size • MPI_File_preallocate • MPI_File_set_info

  28. Query routines • MPI_File_get_size • MPI_File_get_group • MPI_File_get_amode • MPI_File_get_info • MPI_File_get_view

  29. Data access routines • Positioning • Explicit, each call has an offset • Individual, each PE maintains an individual file pointer • Shared, the file pointer is maintained globally • Synchronism • Blocking, routine returns when complete • Non-blocking, must call a termination routine to ensure completion • Coordination • Non-collective • Collective
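The three axes combine into twelve read routines and twelve matching write routines. The lookup below is a sketch that takes the routine names from the MPI-2 standard rather than from the slides; the dictionary and helper names are hypothetical. For the non-blocking collective (split collective) entries only the begin call is listed, and each has a matching end call.

```python
# Keys are (positioning, synchronism, coordination).
READ_ROUTINES = {
    ("explicit", "blocking", "non-collective"): "MPI_File_read_at",
    ("explicit", "blocking", "collective"): "MPI_File_read_at_all",
    ("explicit", "non-blocking", "non-collective"): "MPI_File_iread_at",
    ("explicit", "non-blocking", "collective"): "MPI_File_read_at_all_begin",
    ("individual", "blocking", "non-collective"): "MPI_File_read",
    ("individual", "blocking", "collective"): "MPI_File_read_all",
    ("individual", "non-blocking", "non-collective"): "MPI_File_iread",
    ("individual", "non-blocking", "collective"): "MPI_File_read_all_begin",
    ("shared", "blocking", "non-collective"): "MPI_File_read_shared",
    ("shared", "blocking", "collective"): "MPI_File_read_ordered",
    ("shared", "non-blocking", "non-collective"): "MPI_File_iread_shared",
    ("shared", "non-blocking", "collective"): "MPI_File_read_ordered_begin",
}


def write_routine(positioning, synchronism, coordination):
    """Name of the write routine matching a read routine in the table."""
    name = READ_ROUTINES[(positioning, synchronism, coordination)]
    return name.replace("read", "write")


if __name__ == "__main__":
    print(READ_ROUTINES[("individual", "blocking", "non-collective")])
    print(write_routine("explicit", "non-blocking", "non-collective"))
```

Not every entry is necessarily available on a given system; an implementation may omit some combinations.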

  30. Summary of access routines

  31. Summary of access routines (cont.) • MPI_File_seek • MPI_File_get_position • MPI_File_get_byte_offset • MPI_File_seek_shared (collective) • MPI_File_get_position_shared

  32. T3E Implementation • No shared file pointers • No non-blocking collective (split collective) • SPR filed on non-blocking read • Work in progress

  33. Examples • All the program fragments are available as working programs on the T3E • Do “module load training”, then look in $EXAMPLES/mpi_io • All examples are of a distributed dot product • initialize data with random numbers • compute dot product of whole vector • write out data into a shared file • read back in and check dot product [Figure: PE 0, PE 1, PE 2]

  34. Naming convention • First letter is positioning: explicit, individual, or shared • Second letter is synchronism: blocking or non-blocking • Third letter is coordination: non-collective or collective • ebn.f90 is the explicit, blocking non-collective example • There are several “ibn” examples dealing with different fileviews
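The convention is easy to decode mechanically. The helper below is a hypothetical sketch, not part of the training module; it assumes the three letters are always the first three characters of the file name, so that any extra characters in a longer name are ignored.

```python
POSITIONING = {"e": "explicit", "i": "individual", "s": "shared"}
SYNCHRONISM = {"b": "blocking", "n": "non-blocking"}
COORDINATION = {"n": "non-collective", "c": "collective"}


def decode_example_name(filename):
    """Decode names such as 'ebn.f90' into the three access properties."""
    stem = filename.split(".")[0]
    if len(stem) < 3:
        raise ValueError(f"need at least three letters, got {filename!r}")
    first, second, third = stem[:3]
    return (POSITIONING[first], SYNCHRONISM[second], COORDINATION[third])


if __name__ == "__main__":
    print(decode_example_name("ebn.f90"))
```

Note that the letter n means non-blocking in the second position and non-collective in the third, which is why the lookup is done per position.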

  35. Filetype Example [Figure: filetype layout for Process 0, Process 1, and Process 2]

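  36. Filetype Example
        ! open the shared file read/write, creating it if needed
        filemode = MPI_MODE_RDWR + MPI_MODE_CREATE
        call MPI_INFO_CREATE(finfo, ierr)
        call MPI_INFO_SET(finfo, 'access_style', 'write_mostly', ierr)
        call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', filemode, &
                           finfo, fhv, ierr)
        ! filetype: this PE's m elements, starting at m*me, out of m*nprocs
        call MPI_TYPE_CREATE_SUBARRAY(1, m*nprocs, m, m*me, &
                                      MPI_ORDER_FORTRAN, MPI_REAL, mpi_fileslice, ierr)
        ! a derived datatype must be committed before use (not shown on the slide)
        call MPI_TYPE_COMMIT(mpi_fileslice, ierr)
        disp = 0
        call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL, mpi_fileslice, &
                               'native', MPI_INFO_NULL, ierr)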

  37. Individual, blocking, non-collective
        call MPI_FILE_WRITE(fhv, b, m, MPI_REAL, status, ierr)
        lresult = sdot(m, b, 1, b, 1)
        call MPI_REDUCE(lresult, result, 1, MPI_REAL, MPI_SUM, 0, &
                        MPI_COMM_WORLD, ierr)
        if (me.eq.0) then
          write(6,*) 'dot product: ', result
        end if
        ! zero vector and read it back in
        b = 0.0
        disp = 0
        call MPI_FILE_SEEK(fhv, disp, MPI_SEEK_SET, ierr)
        call MPI_FILE_READ(fhv, b, m, MPI_REAL, status, ierr)

  38. Further Information on MPI I/O • MPI-The Complete Reference • Volume 1, The MPI Core • Volume 2, The MPI Extensions
