Comparison of Communication and I/O of the Cray T3E and IBM SP

Jonathan Carter, NERSC User Services

Overview • Node Characteristics • Interconnect Characteristics • MPI Performance • I/O Configuration • I/O Performance

T3E Architecture • Distributed memory, single-CPU processing elements • [Figure: processing element with CPU and memory attached to the interconnect]

T3E Communication Network • Processing Elements (PE) are connected by a 3D torus.

T3E Communication Network • The peak bandwidth of the torus is about 600 Mbyte/sec bidirectional • Sustainable bandwidth is about 480 Mbyte/sec bidirectional • Latency is 1μs • Shmem API gives latency of 1μs, bandwidth 350 Mbyte/sec bidirectional

SP Architecture • Cluster of SMP nodes • [Figure: node with two CPUs and shared memory attached to the interconnect]

SP Communication Network • Nodes are connected via adapters to the SP Switch. • The switch is composed of boards, each of which links 16 nodes. • Boards are linked to form a larger network. • [Figure: nodes attached to a switch board]

SP Communication Network • The peak bandwidth of adapter and switch is 300 Mbyte/sec bidirectional • Latency of the switch is about 2μs • Sustainable bandwidth is about 185 Mbyte/sec bidirectional
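
The latency and bandwidth figures quoted for the two networks can be combined into a rough transfer-time estimate, time = latency + size / bandwidth. The sketch below is only an illustration of that arithmetic: it assumes the sustainable bidirectional figures apply in full to a single transfer and that 1 Mbyte is 10^6 bytes, neither of which the slides state. The function and dictionary names are hypothetical, and measured MPI numbers will differ.

```python
def transfer_time_us(nbytes, latency_us, bandwidth_mbyte_per_s):
    """Estimated transfer time in microseconds: latency + size / bandwidth.

    With 1 Mbyte taken as 10**6 bytes (an assumption), 1 Mbyte/sec
    equals 1 byte per microsecond, so no unit conversion is needed.
    """
    return latency_us + nbytes / bandwidth_mbyte_per_s


# Sustainable figures quoted on the slides (hardware level, not MPI level).
T3E = {"latency_us": 1.0, "bandwidth_mbyte_per_s": 480.0}
SP = {"latency_us": 2.0, "bandwidth_mbyte_per_s": 185.0}

for nbytes in (8, 1024, 1024 * 1024):
    t3e = transfer_time_us(nbytes, **T3E)
    sp = transfer_time_us(nbytes, **SP)
    print(f"{nbytes:>8} bytes: T3E {t3e:10.2f} us, SP {sp:10.2f} us")
```

For small messages the estimate is dominated by latency, and for large messages by bandwidth, which is why both numbers matter when comparing the machines.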

MPI Performance • Intra-node is 1 MPI process per node; 2 MPI processes (typical) will halve bandwidth

MPI Performance

T3E I/O Configuration • PEs do not have local disk • All PEs access all filesystems equivalently • Path for (optimum) I/O generally looks like: • PE to I/O node via torus • I/O node to Fibre Channel Node (FCN) via Gigaring • FCN to Disk Array via Fibre loop • In some cases data on APP PE must be transferred to a system buffer on an OS PE then out to an FCN

T3E I/O Configuration • [Figure: I/O nodes connected through the Gigaring to FCNs and disk arrays]

SP I/O Configuration • Nodes have local disk. One SCSI disk for all local filesystems. Non-optimal. • All nodes access Global Parallel File System (GPFS) filesystems equivalently • Path for GPFS I/O looks like: • Node to GPFS Node via IP over the switch • GPFS Node to Disk Array via SSA loop

SP I/O Configuration • [Figure: nodes and GPFS nodes connected through the switch, with GPFS nodes attached to the disk array]

T3E Filesystems • /usr/tmp • fast • subject to 14-day purge, not backed up • check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes) • $TMPDIR • fast • purged at end of job or session • shares quota with /usr/tmp • $HOME • slower • permanent, backed up • check quota with quota (usually 2 Gbyte and 3500 inodes)

SP Filesystems • /scratch and $SCRATCH • global • fast (GPFS) • subject to 14-day purge (or at session end for $SCRATCH), not backed up • check quota with myquota (usually 100 Gbyte and 6000 inodes) • $TMPDIR • local (created in /scr), only 2 Gbyte total • slower • purged at end of job or session • $HOME • global • slower (GPFS) • permanent, not backed up yet • check quota with myquota (usually 4 Gbyte and 5000 inodes)

Types of I/O • Bewildering number of choices on both machines: • Standard Language I/O: Fortran or C (ANSI or POSIX) • Vendor extensions to language I/O • MPI I/O • Cray FFIO library (can be used from Fortran or C) • IBM MIO library, requires code changes

Standard Language I/O • Fortran direct access is slightly more efficient than sequential access on both the T3E (see comments on FFIO later) and the SP. It also allows file transferability. • C language I/O (fopen, fwrite, etc.) is inefficient on both machines. • POSIX standard I/O (open, read, etc.) can be efficient on the T3E, but requires care (see comments on FFIO later). It works well on the SP.

Vendor Extensions to Language I/O • Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable. • IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.

MPI I/O • Part of MPI-2 • Interface for High Performance Parallel I/O • data partitioning • collective I/O • asynchronous I/O • portability and interoperability between T3E and SP • Different subset implemented on T3E and SP

Summary of access routines for T3E

Summary of access routines for SP

Cray FFIO library • FFIO is a set of I/O layers tuned for different I/O characteristics • Buffering of data (configurable size) • Caching of data (configurable size) • Available to regular Fortran I/O without reprogramming • Available for C through POSIX-like calls, e.g. ffopen, ffwrite

FFIO - The assign command • controls program behavior at runtime • the assign command controls: • which FFIO layer is active • striping across multiple partitions • lots more • scope of assign • File name • Fortran unit number • File type (e.g. all sequential unformatted files)

IBM MIO library • User interface based on POSIX I/O routines, so requires program modification • Useful trace module to collect statistics • Not much experience yet with using it on GPFS filesystems • Coming soon

I/O Strategies - Exclusive access files • Each process reads and writes to a separate file • Language I/O • Increase language I/O performance with the FFIO library on the T3E (for example, specify a large buffer with the bufa layer). For Fortran direct access the default buffer is only the larger of the record length and 32 Kbytes • read/write large amounts of data per request on the SP • MPI I/O • read/write large amounts of data per request
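
The advice to use one file per process and to move large amounts of data per request can be sketched as follows. This is a minimal single-process illustration using POSIX-style calls from Python, not code for either machine: the file naming scheme `out.NNNN`, the function name, and the request sizes are all assumptions made for the example.

```python
import os
import tempfile


def write_private_file(directory, rank, data, request_bytes):
    """Write `data` to a file owned by one process, `request_bytes` per call.

    Returns the file path and the number of write calls issued.
    The per-rank file name is a hypothetical naming scheme.
    """
    path = os.path.join(directory, f"out.{rank:04d}")
    calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        pos = 0
        while pos < len(view):
            # os.write returns the number of bytes actually written
            pos += os.write(fd, view[pos:pos + request_bytes])
            calls += 1
    finally:
        os.close(fd)
    return path, calls


with tempfile.TemporaryDirectory() as workdir:
    payload = bytes(1 << 20)  # 1 Mbyte of zeros
    _, small_calls = write_private_file(workdir, 0, payload, 4096)
    _, large_calls = write_private_file(workdir, 1, payload, 1 << 20)
    print("write calls with 4 Kbyte requests:", small_calls)
    print("write calls with 1 Mbyte requests:", large_calls)
```

The same volume of data needs far fewer requests when each request is large, which is the effect a large buffer in an I/O library is meant to achieve without changing the program.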

bufa FFIO layer Overview • bufa is an asynchronous buffering layer • performs read-ahead, write-behind • specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4 Kbyte blocks, and nbufs is the number of buffers • buffer space increases your application's memory requirements
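
Because bs is given in 4 Kbyte blocks, the memory cost of a bufa specification is easy to misjudge. The helper below is a small sketch of the arithmetic, assuming 4 Kbyte means 4096 bytes and that the total is simply bs × 4096 × nbufs; the function names and the example values 256 and 4 are illustrative, not recommended settings.

```python
BLOCK_BYTES = 4096  # one "4 Kbyte block"; assumes Kbyte = 1024 bytes


def bufa_memory_bytes(bs, nbufs):
    """Buffer memory implied by bufa:bs:nbufs, in bytes."""
    return bs * BLOCK_BYTES * nbufs


def bufa_spec(bs, nbufs):
    """Layer specification string in the bufa:bs:nbufs form."""
    return f"bufa:{bs}:{nbufs}"


bs, nbufs = 256, 4  # hypothetical values
total = bufa_memory_bytes(bs, nbufs)
print(f"-F {bufa_spec(bs, nbufs)} uses {total} bytes "
      f"({total / (1024 * 1024):.1f} Mbyte) of buffer space")
```

This memory is allocated in addition to the application's own data, so it should be counted against the memory available on each PE.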

I/O Strategies - Shared files • All PEs read and write the same file simultaneously • Language I/O (requires FFIO library global layer for T3E) • MPI I/O • On T3E, language I/O with FFIO library global layer and Cray extensions for additional flexibility

Positioning with a shared file • Positioning of a read or write is your responsibility • File pointers are private • Fortran • Use a direct access file, and read/write(rec=num) • Use Cray T3E extensions setpos and getpos to position file pointer (not portable) • C • Use ffseek • MPI I/O • MPI I/O fileview generally takes care of this. Positioning routines also available.
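
Positioning by record number, as with Fortran direct access, amounts to computing a byte offset of (rec - 1) × record length. The sketch below mimics this with POSIX positional reads and writes from Python in a single process: each simulated process opens its own descriptor (so its file position is private) and writes one record at an explicit offset. The function names and the 8-byte record length are assumptions made for the example, and it does not use FFIO or MPI I/O.

```python
import os
import tempfile


def record_offset(rec, recl):
    """Byte offset of 1-based record `rec` with fixed record length `recl`."""
    return (rec - 1) * recl


def write_shared(path, nprocs, recl):
    """Each simulated process writes one record at its own offset."""
    for rank in range(nprocs):
        # A separate descriptor per "process": file positions are not shared.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            payload = bytes([ord("A") + rank]) * recl
            os.pwrite(fd, payload, record_offset(rank + 1, recl))
        finally:
            os.close(fd)


def read_record(path, rec, recl):
    """Read one record by number without moving any file pointer."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, recl, record_offset(rec, recl))
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as workdir:
    shared = os.path.join(workdir, "shared.dat")
    write_shared(shared, nprocs=4, recl=8)
    print(read_record(shared, rec=3, recl=8))
```

Since every write names its offset explicitly, the result does not depend on the order in which the processes run.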

global FFIO layer Overview • global is a caching and buffering layer which enables multiple PEs to read and write to the same file • if one PE has already read the data, an additional read request from another PE will result in a remote memory copy • file open is a synchronizing event • By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm) • specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4 Kbyte blocks, and nbufs is the number of buffers per PE

GPFS and shared files • On the T3E the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs. • On the SP, GPFS implements a safe update scheme via tokens and a token manager. • If two processes access the same block of a GPFS file (256 Kbytes), a negotiation is conducted between the nodes and the token manager to determine the order of updates. This can slow down I/O considerably. • MPI I/O merges requests from different processes to alleviate this problem
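
Whether two processes end up negotiating over the same GPFS block depends on how the file is divided among them. The sketch below checks a partitioning for blocks touched by more than one process, assuming the 256 Kbyte block size quoted above with Kbyte = 1024 bytes. The function names and the two example partitionings are hypothetical, and the check says nothing about how GPFS schedules the token traffic.

```python
GPFS_BLOCK = 256 * 1024  # block size from the slide; assumes Kbyte = 1024 bytes


def blocks_touched(offset, length, block=GPFS_BLOCK):
    """Set of block indices covered by the byte range [offset, offset + length)."""
    if length <= 0:
        return set()
    first = offset // block
    last = (offset + length - 1) // block
    return set(range(first, last + 1))


def contended_blocks(extents, block=GPFS_BLOCK):
    """Map block index -> ranks, for blocks touched by more than one rank.

    `extents` holds one (offset, length) pair per process.
    """
    owners = {}
    for rank, (offset, length) in enumerate(extents):
        for index in blocks_touched(offset, length, block):
            owners.setdefault(index, set()).add(rank)
    return {index: ranks for index, ranks in owners.items() if len(ranks) > 1}


def contiguous_extents(nprocs, chunk):
    """Each rank owns one contiguous chunk of `chunk` bytes."""
    return [(rank * chunk, chunk) for rank in range(nprocs)]


unaligned = contended_blocks(contiguous_extents(4, 100_000))
aligned = contended_blocks(contiguous_extents(4, GPFS_BLOCK))
print("blocks shared with 100,000-byte chunks:", sorted(unaligned))
print("blocks shared with block-sized chunks: ", sorted(aligned))
```

Choosing per-process regions that start and end on block boundaries avoids shared blocks in this model; where that is not possible, collective MPI I/O is the route the slides suggest.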

I/O Performance Comparison • Each process writes a 200 Mbyte file. 2 processes per node on SP.

Further Information • I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials • Cray Publication - Application Programmer’s I/O Guide • Cray Publication - Cray T3E Fortran Optimization Guide • man assign • XL Fortran User’s Guide
