
Presentation Transcript


  1. Using GPFS • Terry Jones • Lawrence Livermore National Laboratory • August 15, 2000

  2. Outline • Scope • Presentation aimed at scientific/technical app writers • Many interesting topics skipped: fs reliability, fs maintenance, ... • Brief Overview of GPFS Architecture • Disk Drive Basics • GPFS -vs- NFS -vs- JFS • What’s New In GPFS • libgpfs • Metadata improvements • Performance oriented changes • Application Do’s and Don’ts • 5 Tips that will keep your codes humming • Recent Performance Measurements • “You can’t put tailfins on data” -- H.B.

  3. Disk Drive Basics • Disks are mechanical thingies → huge latencies • Performance is a function of buffer size → Caching • Disk speed is advancing, but slower than processor speed → Disk Parallelism

  4. File System Architectures
  • GPFS Parallel File System: GPFS file systems are striped across multiple disks on multiple I/O nodes; all compute nodes can share all disks; larger files, larger file systems, higher bandwidths
  • JFS (native AIX fs): no file sharing, on-node only; apps must do their own partitioning
  • NFS / DFS / AFS: clients share files on a server node; the server node (single storage controller) becomes a bottleneck, reached over an unsecure network
  [Diagram: compute nodes (app + mmfsd) and I/O nodes (mmfsd + VSD) connected by the switch; per-file-system GPFS roles shown include the metanode, stripe group manager, and token manager server]

  5. New in Version 1.3 • Multiple levels of indirect blocks => larger max file size, max number of files increases • SDR interaction improved • VSD KLAPI (flow control, one less copy, relieves usage of CSS send and receive pool) • Prefetch algorithms now recognize strided & reverse sequential access • Batching => much better metadata performance

  6. New Programming Interfaces • Support for mmap functions as described in X/Open 4.2 • libgpfs • gpfs_prealloc() Preallocates space for a file that has already been opened, prior to writing data to it (use to ensure sufficient space is available). • gpfs_stat() and gpfs_fstat() Provide exact mtime and ctime. • gpfs_getacl(), gpfs_putacl(), gpfs_fgetattrs(), gpfs_putfattrs() ACL access for products like ADSM. • gpfs_fcntl() Provides hints about access patterns. Used by MPI-IO. Can be used to set an access range, clear the cache, and invoke Data Shipping...
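A minimal C sketch of how the libgpfs calls above might be used. The file path is hypothetical, and the prototypes assumed here (offset_t arguments, a stat64 buffer, the <gpfs.h> header, linking with -lgpfs) follow the GPFS programming documentation and should be verified against the gpfs.h shipped with your GPFS level.

/* Sketch: preallocate space for an open file and read its exact mtime.
 * Assumes <gpfs.h> from the GPFS fileset and linking with -lgpfs;
 * check the prototypes against your GPFS level. Path is hypothetical. */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <gpfs.h>

int main(void)
{
    int fd = open("/gpfs/scratch/outfile", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Reserve 1 GB up front so later writes cannot fail for lack of space. */
    if (gpfs_prealloc(fd, 0, (offset_t)1 << 30) != 0)
        perror("gpfs_prealloc");

    /* gpfs_fstat() returns exact mtime/ctime (plain fstat may be approximate). */
    struct stat64 sb;
    if (gpfs_fstat(fd, &sb) == 0)
        printf("exact mtime: %ld\n", (long)sb.st_mtime);
    else
        perror("gpfs_fstat");

    close(fd);
    return 0;
}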

  7. GPFS Data Shipping • Not to be confused with IBM’s MPI-IO data shipping • A mechanism for coalescing small I/Os on a single node for disk efficiency (no distributed locking overhead, no false sharing). • It’s targeted at users who do not use MPI. • It violates POSIX semantics and affects other users of the same file. • Operation: • Data shipping entered via collective operation at file open time • Normal access from other programs inhibited while file is in DS mode • Each member of the collective assigned to serve a subset of blocks • Reads and writes go through this server regardless of where they originate • Data blocks only buffered at server, not at other clients
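A hedged sketch of entering data-shipping mode through gpfs_fcntl(). The structure and constant names (gpfsFcntlHeader_t, gpfsDataShipStart_t, GPFS_DATA_SHIP_START, GPFS_FCNTL_CURRENT_VERSION) are assumed from the GPFS programming reference of this era and should be checked against <gpfs_fcntl.h>; every task participating in the collective open would call something like this on its own descriptor.

/* Hedged sketch: start GPFS data shipping on an open file via gpfs_fcntl().
 * Structure and constant names follow the GPFS programming reference but
 * should be verified in <gpfs_fcntl.h> for your release. */
#include <gpfs_fcntl.h>
#include <stdio.h>

int start_data_shipping(int fd, int num_instances)
{
    struct {
        gpfsFcntlHeader_t   hdr;
        gpfsDataShipStart_t start;
    } ds;

    ds.hdr.totalLength   = sizeof(ds);
    ds.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
    ds.hdr.errorOffset   = 0;
    ds.hdr.fcntlReserved = 0;

    ds.start.structLen    = sizeof(ds.start);
    ds.start.structType   = GPFS_DATA_SHIP_START;
    ds.start.numInstances = num_instances;  /* all participating tasks issue this call */
    ds.start.reserved     = 0;

    if (gpfs_fcntl(fd, &ds) != 0) {         /* collective: blocks until all tasks join */
        perror("gpfs_fcntl(GPFS_DATA_SHIP_START)");
        return -1;
    }
    return 0;
}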

  8. Application Do’s and Don’ts #1 • Avoid small (less than 64K) reads and writes • Use block-aligned records
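A minimal POSIX sketch of this tip: small records are staged in a buffer sized to one file-system block (the 512K GPFSblocksize used on the later performance slides), so the file system only ever sees large, block-aligned writes. The file name, record size, and record count are illustrative.

/* Sketch: buffer small records and issue block-aligned, block-sized writes.
 * GPFS_BLOCK (512 KB) matches the GPFSblocksize on later slides;
 * RECORD_SIZE and the record count are arbitrary example values. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define GPFS_BLOCK  (512 * 1024)
#define RECORD_SIZE 4096              /* small application record */

int main(void)
{
    int fd = open("out.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(GPFS_BLOCK);   /* one full file-system block */
    size_t used = 0;
    char record[RECORD_SIZE];
    memset(record, 'x', sizeof(record));

    for (int i = 0; i < 1024; i++) {  /* 1024 records = 4 MB total */
        memcpy(buf + used, record, RECORD_SIZE);
        used += RECORD_SIZE;
        if (used == GPFS_BLOCK) {     /* flush exactly one aligned block */
            write(fd, buf, GPFS_BLOCK);
            used = 0;
        }
    }
    if (used > 0)                     /* final partial block, if any */
        write(fd, buf, used);

    free(buf);
    close(fd);
    return 0;
}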

  9. Application Do’s and Don’ts #2 • Avoid overlapped writes
  [Diagram: an 8 GB file written by four nodes. Node 1 locks offsets 0 - 2 GB, node 2 locks 2 - 4 GB, node 3 locks 4 - 6.1 GB, node 4 locks 6 - 8 GB. File offsets 0 - 6.0 GB and 6.1 - 8.0 GB are accessed in parallel with no conflict; the conflict over 6.0 - 6.1 GB is detected by the lock manager, and only one node may access that region at a time.]
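A sketch of one way to carve a shared file into non-overlapping, block-aligned regions so the byte-range locks of different tasks never conflict. The task id and count, file name, and 8 GB file size are illustrative; in practice the id and count would come from MPI or the job launcher.

/* Sketch: each task writes a disjoint, block-aligned region of one shared
 * file, so GPFS byte-range locks never overlap. All sizes are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define GPFS_BLOCK (512LL * 1024)

int main(int argc, char **argv)
{
    long long task   = (argc > 1) ? atoll(argv[1]) : 0;   /* this task's id   */
    long long ntasks = (argc > 2) ? atoll(argv[2]) : 4;   /* number of tasks  */
    long long filesz = 8LL << 30;                         /* 8 GB shared file */

    /* Per-task chunk, rounded up to a whole number of file-system blocks. */
    long long chunk = (filesz + ntasks - 1) / ntasks;
    chunk = ((chunk + GPFS_BLOCK - 1) / GPFS_BLOCK) * GPFS_BLOCK;

    long long start = task * chunk;                       /* disjoint by construction */
    long long end   = start + chunk;
    if (end > filesz) end = filesz;

    int fd = open("shared.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = calloc(1, GPFS_BLOCK);
    for (long long off = start; off < end; off += GPFS_BLOCK)
        pwrite(fd, buf, GPFS_BLOCK, off);                 /* aligned, non-overlapping */

    free(buf);
    close(fd);
    return 0;
}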

  10. Application Do’s and Don’ts #3 • Partition data so as to make a given client’s accesses largely sequential (e.g. segmented data layouts are optimal; strided layouts require more token management). [Diagram: strided vs. segmented data layouts]
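The two layouts differ only in how a task maps its i-th record to a file offset; a small sketch of that arithmetic (record size and counts are illustrative):

/* Sketch: file offset of record i for task `rank` out of `ntasks`,
 * under the strided and segmented layouts. nrec is records per task. */
#include <stdio.h>

#define REC 4096LL   /* illustrative record size in bytes */

/* Strided: tasks interleave record by record; each task's accesses jump by
 * ntasks*REC, and neighbouring records belong to different tasks
 * (more token/lock traffic, little sequential locality). */
long long strided_offset(long long rank, long long ntasks, long long i)
{
    return (i * ntasks + rank) * REC;
}

/* Segmented: each task owns one contiguous segment, so its accesses are
 * purely sequential within that segment (good for GPFS prefetch). */
long long segmented_offset(long long rank, long long nrec, long long i)
{
    return (rank * nrec + i) * REC;
}

int main(void)
{
    long long ntasks = 4, nrec = 8;
    for (long long i = 0; i < 3; i++)
        printf("rank 1, record %lld: strided %lld, segmented %lld\n",
               i, strided_offset(1, ntasks, i), segmented_offset(1, nrec, i));
    return 0;
}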

  11. Application Do’s and Don’ts #4 • Use access patterns that can exploit prefetching and write-behind (e.g. sequential access is good, random access is bad). Use multiple opens if necessary.
  [Diagram: a compute node (app + mmfsd) and three I/O nodes (mmfsd, metanode, stripe group manager, token manager server) connected by the switch, with blocks 1-6 of a file alternating across the I/O nodes. GPFS stripes successive blocks of each file across successive disks; for instance, disks 1, 2, 3 may be read concurrently if the access pattern is a match.]
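A hedged sketch of pairing this tip with the gpfs_fcntl() access-range hint from slide 6: the task declares the byte range it is about to read sequentially so GPFS can prefetch it. The structure and constant names (gpfsAccessRange_t, GPFS_ACCESS_RANGE) are assumed from the GPFS programming reference and should be checked against <gpfs_fcntl.h>.

/* Hedged sketch: declare an upcoming sequential read range with gpfs_fcntl().
 * Structure and constant names follow the GPFS programming reference but
 * should be verified in <gpfs_fcntl.h> for your release. */
#include <gpfs_fcntl.h>
#include <stdio.h>

int declare_sequential_read(int fd, long long start, long long length)
{
    struct {
        gpfsFcntlHeader_t hdr;
        gpfsAccessRange_t range;
    } hint;

    hint.hdr.totalLength   = sizeof(hint);
    hint.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
    hint.hdr.errorOffset   = 0;
    hint.hdr.fcntlReserved = 0;

    hint.range.structLen  = sizeof(hint.range);
    hint.range.structType = GPFS_ACCESS_RANGE;
    hint.range.start      = start;    /* byte range this task will read next */
    hint.range.length     = length;
    hint.range.isWrite    = 0;        /* 0 = read access */

    if (gpfs_fcntl(fd, &hint) != 0) {
        perror("gpfs_fcntl(GPFS_ACCESS_RANGE)");
        return -1;
    }
    return 0;
}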

  12. Application Do’s and Don’ts #5 • Use MPI-IO • Collective operations • Predefined Data Layouts for memory & file • Opportunities for coalescing and improved prefetch/write-behind • (See notes from Dick Treumann’s MPI-IO talk)
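A minimal MPI-IO sketch of this tip: each task describes its piece of the file with a file view and writes it with a collective call, giving the MPI-IO layer the chance to coalesce requests and drive prefetch/write-behind. The file name and sizes are illustrative.

/* Sketch: collective MPI-IO write of one contiguous segment per task
 * (MPI-2 standard calls; file name and sizes are illustrative). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const MPI_Offset count = 1 << 20;            /* 1 M doubles per task */
    double *data = malloc(count * sizeof(double));
    for (MPI_Offset i = 0; i < count; i++)
        data[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "segments.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Segmented layout: rank r owns the r-th contiguous block of doubles. */
    MPI_Offset disp = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* Collective write: lets MPI-IO coalesce and schedule the I/O. */
    MPI_File_write_all(fh, data, (int)count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}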

  13. Read & Write Performance taken from ASCI White machine • IBM RS/6000 SP System • 4 switch frames (1 in use at present) • 4 Nighthawk-2 nodes per frame • 16 375MHz 64-bit IBM Power3 CPUs per node • Now: 120 Nodes & 1920 CPUs Later: 512 Nodes & 8192 CPUs • Now: 8 GBytes RAM per node (0.96 TBytes total memory) Later: 266 @ 16 GB, 246 @ 8 GB (6.2 Tbytes total memory)  • Now: Peak ~2.9 GFLOPS Later: Peak ~12.3 Tflops • Now: 4 I/O nodes (dedicated) Later: 16 I/O nodes (dedicated) • 4 RIO optical extenders per node (machine requires too much floor space for copper SSA cabling) • 4 SSA adapters per RIO • 3 73-GByte RAID sets per SSA adapter • Now: 14.0 TB total disk space Later: 150.0 TB total disk space • 5 Disks on each RAID (4 + P) – parity information is distributed among drives (i.e., not a dedicated parity drive) 

  14. Pre-upgrade White I/O Diagram: 4*(16*(3*(4+1)))
  [Diagram summary]
  • Compute side: 116 NH-2 compute nodes (sixteen 375 MHz Power3 CPUs each), each attached through a Colony adapter to the Colony switch (1 switch/frame). Clients (VSDs) communicate with the LAPI protocol. Speculation: switch comm is the client bottleneck at ~340 MB/sec/task.
  • Server side: 4 NH-2 I/O nodes (sixteen 375 MHz Power3 CPUs each), each with a Colony adapter (point-to-point ~400 MB/sec unidirectional). Switch comm is the server bottleneck: ~1600 MB/sec total.
  • Each I/O node has four 3500 MB/sec RIOs; each RIO connects to 4 Santa Cruz adapters. Each Santa Cruz adapter is capable of ~48 MB/sec writes and ~88 MB/sec reads.
  • Santa Cruz disks are configured into 4+P RAID sets (73 GB); each RAID set is capable of 14 MB/sec. Each disk is an IBM SSA disk: max 8.5 MB/sec, typical 5.5 MB/sec, capacity 18.2 GB.

  15. Single Client Performance Machine: white.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 4*(16*(3*(4+1))) CPU: 375Mhz P3 Node: NH-2 (16-way) ServerCache: 128*512K ClientCache: 100 MB MetadataCache: 2000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  16. Performance Increase for 1x1 to 2x2 Machine: white.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 4*(16*(3*(4+1))) CPU: 375Mhz P3 Node: NH-2 (16-way) ServerCache: 128*512K ClientCache: 100 MB MetadataCache: 2000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  17. Multi-Node Scalability Machine: white.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 4*(16*(3*(4+1))) CPU: 375Mhz P3 Node: NH-2 (16-way) ServerCache: 128*512K ClientCache: 100 MB MetadataCache: 2000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  18. On Node Scalability Machine: white.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 4*(16*(3*(4+1))) CPU: 375Mhz P3 Node: NH-2 (16-way) ServerCache: 128*512K ClientCache: 100 MB MetadataCache: 2000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  19. Metadata Improvements #1: Batching on Removes • Segmented Block Allocation Map • Each segment contains bits representing blocks on all disks, each is lockable • Minimize contention for allocation map Machine: snow.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 2*(3*(3*(4+1))) CPU: 222Mhz P3 Node: NH-1 (8-way) ServerCache: 64*512K ClientCache: 20 MB MetadataCache: 1000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  20. Metadata Improvements #2: Batching on Creates Machine: snow.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 2*(3*(3*(4+1))) CPU: 222Mhz P3 Node: NH-1 (8-way) ServerCache: 64*512K ClientCache: 20 MB MetadataCache: 1000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  21. Metadata Improvements #3: Batching on Dir Operations Machine: snow.llnl.gov Dedicated: yes Date: Aug, 2000 PSSP: 3.2 + PTF 1 AIX: 4.3.3 + PTF 16 GPFS: 1.3 I/O Config: 2*(3*(3*(4+1))) CPU: 222Mhz P3 Node: NH-1 (8-way) ServerCache: 64*512K ClientCache: 20 MB MetadataCache: 1000 files GPFSprotocol: tcp VSDprotocol: K-LAPI GPFSblocksize: 512K

  22. “…and in conclusion…” • Further Info • Redbooks at http://www.redbooks.ibm.com (2, soon to be 3) • User guides at http://www.rs6000.ibm.com/resource/aix_resource/sp_books/gpfs/index.html • Europar 2000 paper on MPI-IO • Terry Jones, Alice Koniges, Kim Yates, “Performance of the IBM General Parallel File System”, Proc. International Parallel and Distributed Processing Symposium, May 2000. • Message Passing Interface Forum, “MPI-2: A Message Passing Interface Standard”, Standards Document 2.0, University of Tennessee, Knoxville, July 1997. • Acknowledgements • The metadata measurements (file creation & file deletion) are due to our summer intern, Bill Loewe • IBMers who reviewed this document for accuracy: Roger Haskin, Lyle Gayne, Bob Curran, Dan McNabb • This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
