Using GPFS • Terry Jones • Lawrence Livermore National Laboratory • August 15, 2000
Outline • Scope • Presentation aimed at scientific/technical app writers • Many interesting topics skipped: fs reliability, fs maintenance, ... • Brief Overview of GPFS Architecture • Disk Drive Basics • GPFS -vs- NFS -vs- JFS • What’s New In GPFS • libgpfs • Metadata improvements • Performance oriented changes • Application Do’s and Don’ts • 5 Tips that will keep your codes humming • Recent Performance Measurements • “You can’t put tailfins on data” -- H.B.
Disk Drive Basics • Disks are mechanical thingies: huge latencies • Performance is a function of buffer size • Caching • Disk speed is advancing, but slower than processor speed • Disk parallelism
File System Architectures • JFS (native AIX fs): no file sharing (on-node only); apps must do their own partitioning • NFS / DFS / AFS: clients share files on a server node (a single storage controller on an unsecure network); the server node becomes a bottleneck • GPFS parallel file system: GPFS file systems are striped across multiple disks on multiple I/O nodes; all compute nodes can share all disks; larger files, larger file systems, higher bandwidths • [Diagram: compute nodes run the application and an mmfsd daemon and reach VSD servers on the I/O nodes across the switch; mmfsd management roles include metanode, stripe group manager, and token manager server]
New in Version 1.3 • Multiple levels of indirect blocks => larger max file size, max number of files increases • SDR interaction improved • VSD KLAPI (flow control, one less copy, relieves usage of CSS send and receive pool) • Prefetch algorithms now recognize strided & reverse sequential access • Batching => much better metadata performance
New Programming Interfaces • Support for mmap functions as described in X/Open 4.2 • libgpfs • gpfs_prealloc() Preallocates space for a file that has already been opened, before data is written to it (use to ensure sufficient space). • gpfs_stat() and gpfs_fstat() Provide exact mtime and ctime. • gpfs_getacl(), gpfs_putacl(), gpfs_fgetattrs(), gpfs_fputattrs() ACLs for products like ADSM. • gpfs_fcntl() Provides hints about access patterns; used by MPI-IO. Can be used to set an access range, clear the cache, and invoke Data Shipping...
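A minimal usage sketch for gpfs_prealloc(), not taken from the original slides: it assumes the libgpfs prototype int gpfs_prealloc(int fd, offset_t start, offset_t bytes) declared in gpfs.h and linking against libgpfs; the file path and size are illustrative.

/* Sketch: preallocate space for an output file before writing.
 * Assumes gpfs_prealloc() from gpfs.h (libgpfs); path and size
 * are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <gpfs.h>

int main(void)
{
    int fd = open("/gpfs/scratch/results.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 GB starting at offset 0 so the space is guaranteed
     * (and allocated) before the application starts writing. */
    if (gpfs_prealloc(fd, (offset_t)0, (offset_t)1 << 30) < 0) {
        perror("gpfs_prealloc");
        return 1;
    }

    /* ... write data as usual ... */
    close(fd);
    return 0;
}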
GPFS Data Shipping • Not to be confused with IBM’s MPI-IO data shipping • A mechanism for coalescing small I/Os on a single node for disk efficiency (no distributed locking overhead, no false sharing). • It’s targeted at users who do not use MPI. • It violates POSIX semantics and affects other users of the same file. • Operation: • Data shipping entered via collective operation at file open time • Normal access from other programs inhibited while file is in DS mode • Each member of the collective assigned to serve a subset of blocks • Reads and writes go through this server regardless of where they originate • Data blocks only buffered at server, not at other clients
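The sketch below illustrates the call pattern for entering data shipping mode; it is not from the original slides. The structure and constant names (gpfsFcntlHeader_t, gpfsDataShipStart_t, GPFS_DATA_SHIP_START, GPFS_FCNTL_CURRENT_VERSION) follow my reading of gpfs_fcntl.h and may differ by GPFS release, so treat it as an illustration of the hint mechanism rather than a drop-in implementation.

/* Sketch only: each task of the collective opens the file and issues
 * the same DATA_SHIP_START hint via gpfs_fcntl(); numInstances is the
 * number of tasks that will access the file.  Structure layouts here
 * are assumptions based on gpfs_fcntl.h and may need adjustment. */
#include <stdio.h>
#include <gpfs_fcntl.h>

int start_data_shipping(int fd, int num_tasks)
{
    struct {
        gpfsFcntlHeader_t   hdr;
        gpfsDataShipStart_t start;
    } hint;

    hint.hdr.totalLength   = sizeof(hint);
    hint.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
    hint.hdr.fcntlReserved = 0;

    hint.start.structLen    = sizeof(hint.start);
    hint.start.structType   = GPFS_DATA_SHIP_START;
    hint.start.numInstances = num_tasks;   /* members of the collective */
    hint.start.reserved     = 0;

    /* Fails if another program still has normal access to the file. */
    if (gpfs_fcntl(fd, &hint) < 0) {
        perror("gpfs_fcntl(GPFS_DATA_SHIP_START)");
        return -1;
    }
    return 0;
}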
Application Do’s and Don’ts #1 • Avoid small (less than 64K) reads and writes • Use block-aligned records
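A sketch of tip #1, not from the original slides: one large, block-aligned write per batch instead of many small writes. The 512 KB size matches the GPFS block size used on the systems measured here; the file path is illustrative.

/* Sketch: write in large, block-aligned records rather than many
 * small ones.  BLOCK_SIZE matches the 512 KB GPFS block size. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE (512 * 1024)

int main(void)
{
    int fd = open("/gpfs/scratch/out.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BLOCK_SIZE);
    memset(buf, 0, BLOCK_SIZE);

    /* One 512 KB write per iteration instead of thousands of tiny
     * writes: far fewer client/server round trips and no
     * read-modify-write of partial blocks. */
    for (int i = 0; i < 64; i++) {
        if (write(fd, buf, BLOCK_SIZE) != BLOCK_SIZE) {
            perror("write");
            return 1;
        }
    }

    free(buf);
    close(fd);
    return 0;
}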
Application Do’s and Don’ts #2 • Avoid overlapped writes • [Diagram: an 8 GB file written by four nodes; node 1 locks offsets 0 - 2 GB, node 2 locks 2 - 4 GB, node 3 locks 4 - 6.1 GB, node 4 locks 6 - 8 GB. File offsets 0 - 6.0 GB and 6.1 - 8.0 GB are accessed in parallel with no conflict; the conflict over 6.0 - 6.1 GB is detected by the lock manager, and only one node may access that region at a time.]
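A sketch of tip #2, not from the original slides: each MPI task writes only inside its own 2 GB region, mirroring the figure, so no two tasks ever contend for the same byte-range token. It assumes 64-bit file offsets (large-file support) and an illustrative path.

/* Sketch: disjoint, per-task byte ranges so the GPFS lock manager
 * never has to resolve overlapping write tokens.  Assumes 64-bit
 * off_t (e.g. compile with large-file support). */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK ((off_t)2 << 30)      /* 2 GB per task, as in the figure */
#define BUFSIZE (512 * 1024)

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int fd = open("/gpfs/scratch/big.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); MPI_Abort(MPI_COMM_WORLD, 1); }

    off_t my_start = (off_t)rank * CHUNK;   /* rank 0: 0-2 GB, rank 1: 2-4 GB, ... */
    char *buf = calloc(1, BUFSIZE);

    /* Stay strictly inside [my_start, my_start + CHUNK): no two tasks
     * ever touch the same block, so there is no lock conflict. */
    for (off_t off = 0; off < CHUNK; off += BUFSIZE)
        pwrite(fd, buf, BUFSIZE, my_start + off);

    free(buf);
    close(fd);
    MPI_Finalize();
    return 0;
}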
Application Do’s and Don’ts #3 • Partition data so as to make a given client’s accesses largely sequential (e.g. segmented data layouts are optimal, strided requires more token management). • [Diagram: strided vs. segmented data layouts]
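A small sketch, not from the original slides, of the offset arithmetic behind the two layouts, for P tasks each owning N records of size R.

/* Sketch: file offsets for segmented vs. strided layouts.
 *   segmented: task p writes records [p*N, (p+1)*N) contiguously;
 *   strided:   task p writes its i-th record at global index i*P + p.
 * Segmented keeps each task's accesses sequential, which GPFS token
 * management and prefetch handle best. */
#include <sys/types.h>

off_t segmented_offset(int p, long i, long n_per_task, long rec_size)
{
    return ((off_t)p * n_per_task + i) * rec_size;   /* contiguous per task */
}

off_t strided_offset(int p, long i, int n_tasks, long rec_size)
{
    return ((off_t)i * n_tasks + p) * rec_size;      /* interleaved across tasks */
}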
Application Do’s and Don’ts #4 • Use access patterns that can exploit prefetching and write-behind (e.g. sequential access is good, random access is bad). Use multiple opens if necessary (see the sketch below). • [Diagram: GPFS stripes successive blocks of each file across successive disks on the I/O nodes; for instance, disks 1, 2, 3 may be read concurrently if the access pattern is a match.]
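A sketch of the "multiple opens" tip, not from the original slides: it assumes GPFS recognizes a sequential stream per open instance, so giving each widely separated region its own descriptor keeps both streams looking sequential to the prefetcher. Path and sizes are illustrative.

/* Sketch: two sequential streams over one file, each on its own
 * descriptor, so the prefetcher sees two clean sequential patterns
 * instead of one seemingly random one. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLK  (512 * 1024)
#define NBLK 64

int main(void)
{
    const char *path = "/gpfs/scratch/field.dat";
    char *bufA = malloc(BLK), *bufB = malloc(BLK);

    int fdA = open(path, O_RDONLY);              /* stream A: start of file  */
    int fdB = open(path, O_RDONLY);              /* stream B: 1 GB into file */
    lseek(fdB, (off_t)1 << 30, SEEK_SET);

    /* Interleaved processing of the two regions; each descriptor still
     * advances strictly sequentially. */
    for (int i = 0; i < NBLK; i++) {
        read(fdA, bufA, BLK);
        read(fdB, bufB, BLK);
        /* ... combine bufA and bufB ... */
    }

    close(fdA); close(fdB);
    free(bufA); free(bufB);
    return 0;
}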
Application Do’s and Don’ts #5 • Use MPI-IO • Collective operations • Predefined Data Layouts for memory & file • Opportunities for coalescing and improved prefetch/write-behind • (See notes from Dick Treumann’s MPI-IO talk)
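A minimal MPI-IO sketch, not from the original slides: a collective write in which each rank contributes one contiguous slice, giving the MPI-IO layer a chance to coalesce requests and drive GPFS write-behind. The file name and sizes are illustrative.

/* Sketch: collective MPI-IO write, one contiguous (segmented) slice
 * per rank.  Compile with the system MPI compiler. */
#include <mpi.h>
#include <stdlib.h>

#define COUNT (1 << 20)                 /* 1M doubles per rank (8 MB) */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(COUNT * sizeof(double));
    for (long i = 0; i < COUNT; i++) buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/field.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: the MPI-IO layer can coalesce requests and
     * exploit prefetch/write-behind far better than independent
     * small writes from each rank. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}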
Read & Write Performance taken from ASCI White machine • IBM RS/6000 SP System • 4 switch frames (1 in use at present) • 4 Nighthawk-2 nodes per frame • 16 375 MHz 64-bit IBM Power3 CPUs per node • Now: 120 Nodes & 1920 CPUs Later: 512 Nodes & 8192 CPUs • Now: 8 GBytes RAM per node (0.96 TBytes total memory) Later: 266 @ 16 GB, 246 @ 8 GB (6.2 TBytes total memory) • Now: Peak ~2.9 TFLOPS Later: Peak ~12.3 TFLOPS • Now: 4 I/O nodes (dedicated) Later: 16 I/O nodes (dedicated) • 4 RIO optical extenders per node (machine requires too much floor space for copper SSA cabling) • 4 SSA adapters per RIO • 3 73-GByte RAID sets per SSA adapter • Now: 14.0 TB total disk space Later: 150.0 TB total disk space • 5 disks in each RAID set (4 + P) – parity information is distributed among the drives (i.e., not a dedicated parity drive)
Pre-upgrade White I/O Diagram: 4*(16*(3*(4+1))) • 116 NH2 compute nodes, each with sixteen 375 MHz Power3 CPUs and a Colony adapter; clients (VSDs) communicate using the LAPI protocol over the Colony switch (1 switch/frame) • Speculation: switch comm is the client bottleneck at ~340 MB/sec/task • 4 NH2 I/O nodes, each with sixteen 375 MHz Power3 CPUs, a Colony adapter, and four 3500 MB/sec RIOs • Colony adapter: ~400 MB/sec point-to-point, unidirectional; switch comm is the server bottleneck at ~1600 MB/sec total • Each RIO connects to 4 Santa Cruz adapters; each Santa Cruz adapter is capable of ~48 MB/sec writes and ~88 MB/sec reads • Disks are configured into 4+P RAID sets (73 GB); each RAID set is capable of 14 MB/sec • Each disk is an IBM SSA disk: max 8.5 MB/sec, typical 5.5 MB/sec, 18.2 GB capacity
Single Client Performance • Machine: white.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 4*(16*(3*(4+1))) • CPU: 375 MHz P3 • Node: NH-2 (16-way) • ServerCache: 128*512K • ClientCache: 100 MB • MetadataCache: 2000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
Performance Increase for 1x1 to 2x2 • Machine: white.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 4*(16*(3*(4+1))) • CPU: 375 MHz P3 • Node: NH-2 (16-way) • ServerCache: 128*512K • ClientCache: 100 MB • MetadataCache: 2000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
Multi-Node Scalability • Machine: white.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 4*(16*(3*(4+1))) • CPU: 375 MHz P3 • Node: NH-2 (16-way) • ServerCache: 128*512K • ClientCache: 100 MB • MetadataCache: 2000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
On-Node Scalability • Machine: white.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 4*(16*(3*(4+1))) • CPU: 375 MHz P3 • Node: NH-2 (16-way) • ServerCache: 128*512K • ClientCache: 100 MB • MetadataCache: 2000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
Metadata Improvements #1: Batching on Removes • Segmented block allocation map • Each segment contains bits representing blocks on all disks; each segment is lockable • Minimizes contention for the allocation map • Machine: snow.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 2*(3*(3*(4+1))) • CPU: 222 MHz P3 • Node: NH-1 (8-way) • ServerCache: 64*512K • ClientCache: 20 MB • MetadataCache: 1000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
Metadata Improvements #2: Batching on Creates • Machine: snow.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 2*(3*(3*(4+1))) • CPU: 222 MHz P3 • Node: NH-1 (8-way) • ServerCache: 64*512K • ClientCache: 20 MB • MetadataCache: 1000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
Metadata Improvements #3: Batching on Directory Operations • Machine: snow.llnl.gov • Dedicated: yes • Date: Aug 2000 • PSSP: 3.2 + PTF 1 • AIX: 4.3.3 + PTF 16 • GPFS: 1.3 • I/O Config: 2*(3*(3*(4+1))) • CPU: 222 MHz P3 • Node: NH-1 (8-way) • ServerCache: 64*512K • ClientCache: 20 MB • MetadataCache: 1000 files • GPFSprotocol: tcp • VSDprotocol: K-LAPI • GPFSblocksize: 512K
“…and in conclusion…” • Further Info • Redbooks at http://www.redbooks.ibm.com (2, soon to be 3) • User guides at http://www.rs6000.ibm.com/resource/aix_resource/sp_books/gpfs/index.html • Europar 2000 paper on MPI-IO • Terry Jones, Alice Koniges, Kim Yates, “Performance of the IBM General Parallel File System”, Proc. International Parallel and Distributed Processing Symposium, May 2000. • Message Passing Interface Forum, “MPI-2: A Message Passing Interface Standard”, Standards Document 2.0, University of Tennessee, Knoxville, July 1997. • Acknowledgements • The metadata measurements (file creation & file deletion) are due to our summer intern, Bill Loewe • IBMers who reviewed this document for accuracy: Roger Haskin, Lyle Gayne, Bob Curran, Dan McNabb • This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.