GPFS Tuning and Debugging HEPiX 2007 Hamburg, Germany by Cary Whitney
GPFS Overview
• Version 2.3 ptf12
Hardware
• Put metadata on faster disks: FC for metadata, SATA for data; ideally metadata sits on a RAM-based disk (Texas Memory). See the descriptor-file sketch below.
• Use direct data access when possible.
• Nodes should be treated as data-transfer nodes; tuning for the WAN is very similar.
• GPFS is great for finding weak hardware (memory, CPU, etc.).
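A minimal sketch of how the metadata/data split can be expressed at NSD creation time; the device names, server names, and failure groups are hypothetical, and the colon-separated descriptor format shown is the GPFS 2.x style read by mmcrnsd:
• # disks.desc: FC LUN carries metadata only, SATA LUN carries data only (hypothetical)
• /dev/sda:sv01:sv02:metadataOnly:4001
• /dev/sdb:sv01:sv02:dataOnly:4001
• # ./mmcrnsd -F disks.desc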
/etc/sysctl.conf
• net.core.rmem_max = 16777216
• net.core.wmem_max = 16777216
• #
• # Set buffers
• net.ipv4.tcp_rmem = 4096 87380 8388608
• net.ipv4.tcp_wmem = 4096 65536 8388608
• net.ipv4.tcp_mem = 16777216 16777216 16777216
• # Set ephemeral ports
• #
• net.ipv4.ip_local_port_range = 32769 65535
• #
• # Large read ahead
• vm.max-readahead = 256
• vm.min-readahead = 128
• # GPFS
• net.ipv4.tcp_synack_retries = 10
• net.core.netdev_max_backlog = 2500
• vm.min_free_kbytes = 10400
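These settings can be picked up without a reboot; a small sketch, assuming the file above is already in place:
• sysctl -p /etc/sysctl.conf   # re-read and apply the settings
• sysctl net.core.rmem_max     # spot-check a single value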
Network
• Routing (asymmetric routes): commonly caused by the I/O nodes. A quick route check is sketched below.
• Visibility:
• V2.3: all nodes need to be able to communicate with all nodes in the cluster.
• V3.1: the owning cluster takes back metadata ownership.
• Network reboots / switching cables: even though GPFS is communicating via Ethernet, it has its own internal timeouts that can be exceeded.
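One way to spot an asymmetric route is to ask each side which path it uses toward the other; the addresses here are placeholders:
• # on the NSD/I/O node: which interface and source address are used toward a client?
• ip route get 10.1.2.3
• # on the client: same question back toward the server; if the interfaces differ from what you expect, the route is asymmetric
• ip route get 10.1.1.10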
General Linux
• /etc/ssh/sshd_config:
• # For GPFS
• MaxStartups 1024
• iptables: open the GPFS ports to the cluster network (a full command sketch follows below):
• -I Networks 1 -p tcp -s X.X.0.0/16 --destination-port 790:792 -d 0.0.0.0/0 -j ACCEPT
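For completeness, a hedged sketch of installing the rule above; "Networks" is assumed to be a locally defined chain, and the source network is a placeholder:
• iptables -N Networks            # create the custom chain if it does not already exist
• iptables -A INPUT -j Networks   # hand incoming traffic to that chain
• iptables -I Networks 1 -p tcp -s X.X.0.0/16 --destination-port 790:792 -d 0.0.0.0/0 -j ACCEPT
• service iptables save           # persist across reboots (RHEL-style; an assumption)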
GPFS Configuration
• # ./mmlsconfig
• tscTcpPort 790
• eventsExporterTcpPort 791
• Talk over secure (privileged) ports.
• maxblocksize 4096k
• Increase the block size to match the disk.
• maxMBpS 500
• Helps if the NSDs are on a fast network.
• leaseDuration 120
• Helps control timeouts based on the network.
• A sketch of setting these values follows below.
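These are set with mmchconfig; a sketch, with the caveat that most of them only take effect after GPFS is restarted on the affected nodes:
• mmchconfig tscTcpPort=790,eventsExporterTcpPort=791
• mmchconfig maxblocksize=4096K
• mmchconfig maxMBpS=500
• mmchconfig leaseDuration=120
• # ./mmlsconfig to verify, then recycle GPFS (mmshutdown/mmstartup) where required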
Design Tuning
DLFS LUN Layout (8-Node)
[diagram: eight NSD servers (SV01-SV04, SV08-SV11) attached through two QLogic FC switches to two DDN controllers; metadata LUNs are presented as sda and data LUNs 16-23 as sdb, spread across Tier-1 to Tier-8]
Admin Node (SV05) LUN Layout (2-FC Ports)
[diagram: the admin node SV05, with two FC ports, reaching the same metadata (sda) and data (sdb) LUNs through both QLogic switches and both DDN controllers]
DLFS Failover Pairs
[diagram: the eight NSD servers grouped into failover pairs across Zone 1 and Zone 2 of the two QLogic FC switches]
DLFS Failover Scenario – An HBA Failure
[diagram: Step 1 – NSD fails over; Step 2 – enable spare port; Step 3 – switch to spare HBA; Step 4 – NSD fails back]
DLFS Failover Scenario – A DDN Controller Failure
[diagram: same layout, with the failover triggered by the loss of DDN Controller 1]
DLFS Failover Scenario – A FC Switch Failure
[diagram: same layout, with the failover triggered by the loss of QLogic FC Switch 1]
DLFS Failover Scenario – A Node Failure
[diagram: same layout, with a failover following the loss of one NSD server node]
DLFS Failover Scenario – Multiple Failures Possible
[diagram: same layout, showing the possible failover paths when several of the above failures occur together]
Caveat
• The design is geared toward Ethernet-attached clusters.
• For N5 (Cray) we installed larger FC switches to allow the I/O nodes to have direct access.
• Because of IBM issues, the Cray now mounts NGF via 2 NFS servers; this is for /project only.
• The NFS servers are a separate GPFS cluster, and thus they are the only owner of the GPFS metadata.
• Exports work the same as for ext3 (a sketch follows below).
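Since a GPFS file system exports like any local file system, a minimal /etc/exports sketch; the path, client network, and options are hypothetical:
• /project 10.0.0.0/16(rw,sync,no_root_squash)
• # exportfs -ra   to re-read /etc/exports after editing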
Logs, errors and commands
• /var/adm/ras/mmfs.log.latest
• /var/log/messages
• ./mmfsadm dump waiters
• ./tsstatus filesystem
• pdsh and dshbak (there is a version for Linux); an example follows below.
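A sketch of collecting GPFS waiters across the cluster with pdsh and dshbak; the host list and install path are assumptions:
• pdsh -w sv[01-11] /usr/lpp/mmfs/bin/mmfsadm dump waiters | dshbak -c
• # dshbak -c folds identical output from different nodes into a single block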
Coffee
• Tue Jan 9 01:40:14 2007: Expel X.X.X.X (host03 in cluster1) request from X.X.X.X (host19 in cluster1). Expelling: X.X.X.X (host19 in cluster1)
• Tue Jan 9 01:40:14 2007: Recovering nodes in cluster cluster2: X.X.X.X (in cluster1)
• Tue Jan 9 01:40:14 2007: Recovery: fs7, delay 121 sec. for safe recovery.
• Tue Jan 9 02:38:53 2007: Expel X.X.X.X (pdhost) request from X.X.X.X (host13 in cluster1). Expelling: X.X.X.X (host13 in cluster1)
• Wed Apr 18 11:45:29 2007: Recovery: software, delay 99 sec. for safe recovery.
Expired Lease
• GPFS: 6027-2710 Node XX.XX.XX.XX (hostname) is being expelled due to expired lease.
• On both clusters, increase the missed-ping timeout:
• mmchconfig minMissedPingTimeout=60
• Only the quorum nodes need to be recycled (one at a time) to pick up the minMissedPingTimeout change; do the lease manager node last (see the sketch below).
• The negative impact of increasing this value is that when a node is experiencing problems or goes down, it takes longer to expel it.
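A hedged sketch of recycling the quorum nodes one at a time; the host names are hypothetical and the exact flags may differ by GPFS release:
• mmshutdown -N quorum1 && mmstartup -N quorum1
• mmgetstate -N quorum1   # wait for 'active' before moving on to the next quorum node
• # repeat for each quorum node, recycling the lease manager node last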
Waiters
• 0x88868B0 waiting 0.053385000 seconds, Msg handler quotaMsgRequestShare: on ThCond 0xB6763E34 (0xB6763E34) (Quota entry get share), reason 'wait for getShareIsActive = false'
• 0xB656AEC0 waiting 0.043670000 seconds, Alloc summary file worker: on ThCond 0xF92206DC (0xF92206DC) (wait for inodeFlushFlag), reason 'waiting for the flush flag'
• pdsh to the cluster to see where the longest waiter is, or whether there is a node with a communication problem (sketch below).
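A sketch for finding the longest waiters cluster-wide; the host list and install path are assumptions:
• pdsh -w sv[01-11] /usr/lpp/mmfs/bin/mmfsadm dump waiters 2>/dev/null | grep ' waiting ' | sort -k4 -g | tail -5
• # pdsh prefixes each line with the node name, so the longest waiters and the nodes they are on appear at the bottom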
Waiting on HD
• 0x85E4D60 waiting 0.016863000 seconds, NSD I/O Worker: for I/O completion on disk sdb3
• 0x85E28E0 waiting 0.015721000 seconds, NSD I/O Worker: for I/O completion on disk sdb3
• 0x85E1730 waiting 0.005890000 seconds, NSD I/O Worker: for I/O completion on disk sdb3
• The disks are simply not keeping up with demand (local disk behind a 3ware controller).
• If the times are in multiple seconds, there could be a disk problem; the check below is one way to confirm.
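A quick way to confirm the disk itself is the bottleneck, assuming the sysstat package is installed:
• iostat -x 5
• # watch the await and %util columns for the busy device; sustained utilisation near 100% means the disk cannot keep up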
LocalPanic waiter
• 0xB5B1D650 waiting 0.175064000 seconds, SG Exception LocalPanic: delaying for 0.024934000 more seconds, reason: StripeGroup::EndUse waiting for CacheObj holds
• GPFS is restarting.
• The only fix is to restart GPFS (see below).
• Caused by a security scan and possibly load on the node.
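A minimal restart sketch; run on the affected node, where mmshutdown and mmstartup with no arguments act on the local node:
• mmshutdown && mmstartup
• tail -f /var/adm/ras/mmfs.log.latest   # confirm the daemon comes back cleanly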
Filesystem corrupted
• Wed Apr 18 11:50:50 2007: Command: err 46: mount sgespool 27858
• Wed Apr 18 11:50:50 2007: Device not ready.
• mount: wrong fs type, bad option, bad superblock on /dev/sgespool, or too many mounted file systems (could this be the IDE device where you in fact use ide-scsi so that sr0 or sda or so is needed?)
• Failed to open sgespool. Device not ready.
File system corrupt
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700590: Invalid disk data structure. Error code 108. Volume common. Sense Data
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700584: A8 DB 59 84 B8 C3 4E 3F
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700587: 08 AF B3 14 B3 51 B8 38
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700588: 00 00 00 00 09 3B 24 A2
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700585: 2D 68 19 78 FA 4D 1B 3E
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700585: 31 40 4D 91 F2 EC 9A 10
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700586: A8 DB 4D F9 B8 C3 48 3F
• Apr 16 15:44:41 host03 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=10700590: 6C 00 01 00 00 00 00 01
File system problem
• Mon Jan 8 23:34:59 2007: File system manager takeover failed.
• Mon Jan 8 23:34:59 2007: No such device
• Mon Jan 8 23:34:59 2007: Node 128.X.X.X (host33) resigned as manager for fs12.
• Mon Jan 8 23:34:59 2007: No such device
• Mon Jan 8 23:35:00 2007: Node 128.X.X.X (host05) appointed as manager for fs10.
• Mon Jan 8 23:35:00 2007: Node 128.X.X.X (host05) resigned as manager for fs10.
• Mon Jan 8 23:35:00 2007: Too many disks are unavailable.
• Mon Jan 8 23:35:00 2007: Node 128.X.X.X (host05) appointed as manager for fs10.
• Mon Jan 8 23:35:01 2007: Node 128.X.X.X (host05) resigned as manager for fs10.
• Mon Jan 8 23:35:01 2007: Too many disks are unavailable.
• Mon Jan 8 23:35:01 2007: Node 128.X.X.X (host05) appointed as manager for fs10.
• Mon Jan 8 23:35:01 2007: Node 128.X.X.X (host05) resigned as manager for fs10.
• Mon Jan 8 23:35:01 2007: Too many disks are unavailable.
• Mon Jan 8 23:35:02 2007: Node 128.X.X.X (host05) appointed as manager for fs10.
• Mon Jan 8 23:35:02 2007: Node 128.X.X.X (host05) resigned as manager for fs10.
• Mon Jan 8 23:35:02 2007: Too many disks are unavailable.
• Mon Jan 8 23:36:19 2007: Node 128.X.X.X (host05) appointed as manager for fs5.
• Mon Jan 8 23:36:19 2007: Node 128.X.X.X (host05) resigned as manager for fs5.
• Mon Jan 8 23:36:19 2007: Log recovery failed.
File system corrupt
• Mon Jan 8 23:34:55 2007: Node 128.X.X.X (host05) resigned as manager for fs10.
• Mon Jan 8 23:34:55 2007: Too many disks are unavailable.
• Mon Jan 8 23:34:55 2007: Cannot mount file system fs10 because it does not have a manager.
• Mon Jan 8 23:34:55 2007: The last file system manager was node 128.X.X.X (host05). It has failed with error:
• Mon Jan 8 23:34:55 2007: Too many disks are unavailable.
• Mon Jan 8 23:34:55 2007: File System fs10 unmounted by the system with return code 212 reason code 218
• Mon Jan 8 23:34:55 2007: The current file system manager failed and no new manager will be appointed.
Disk down
• # /usr/lpp/mmfs/bin/mmlsdisk fs1
• disk         driver   sector failure holds    holds
• name         type     size   group   metadata data  status        availability
• ------------ -------- ------ ------- -------- ----- ------------- ------------
• host05       nsd      512    4001    yes      yes   ready         up
• host11       nsd      512    4003    yes      yes   ready         up
• host14       nsd      512    4004    yes      yes   ready         up
• host17       nsd      512    4005    yes      yes   ready         unrecovered
• host20       nsd      512    4006    yes      yes   ready         up
• host23       nsd      512    4007    yes      yes   ready         up
Starting a down disk
• ./mmchdisk fs1 start -a
• Scanning file system metadata, phase 1 ...
• 3 % complete on Thu Dec 1 14:49:16 2005
• 100 % complete on Thu Dec 1 14:50:23 2005
• Scan completed successfully.
• Scanning file system metadata, phase 2 ...
• 12 % complete on Thu Dec 1 14:50:26 2005
• 100 % complete on Thu Dec 1 14:50:50 2005
• Scan completed successfully.
• Scanning file system metadata, phase 3 ...
• 68 % complete on Thu Dec 1 14:50:53 2005
• 100 % complete on Thu Dec 1 14:50:54 2005
• Scan completed successfully.
• Scanning user file metadata ...
• 1 % complete on Thu Dec 1 14:50:57 2005
• 25 % complete on Thu Dec 1 14:51:00 2005
• 81 % complete on Thu Dec 1 14:51:03 2005
• 100 % complete on Thu Dec 1 14:51:04 2005
• Scan completed successfully.
fsck
• ./mmfsck
• Multiple passes over all inodes will be performed due to a shortage of available memory. File system check would need a minimum available pagepool memory of 572948480 bytes to perform only one pass. The currently available memory for use by mmfsck is 32944100 bytes.
• Continue with multiple passes? yes
• Checking inodes
• Regions 0 to 596 of total 17077
• Node 128.X.X.X (host633) starting inode scan 0 to 284671
• Node 128.X.X.X (host433) starting inode scan 284672 to 569343
• Node 128.X.X.X (host433) ending inode scan 284672 to 569343
• Node 128.X.X.X (host433) starting inode scan 569344 to 854015
• Node 128.X.X.X (host633) ending inode scan 0 to 284671
• Node 128.X.X.X (host633) starting inode scan 854016 to 1138687
• Node 128.X.X.X (host633) ending inode scan 854016 to 1138687
• …
• Checking inode map file
• Checking directories and files
• Checking log files
• Checking extended attributes file
• Checking allocation summary file
• Checking file reference counts
• Checking file system replication status
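If a single-pass check is wanted, the pagepool on the node(s) running mmfsck has to exceed the requirement printed above (about 573 MB here); a hedged sketch:
• mmchconfig pagepool=1024M
• # recycle GPFS (mmshutdown/mmstartup) on the node that will run mmfsck so the larger pagepool takes effect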
Where are things mounted
• # ./tsstatus fs5
• The file system is still mounted on.
• File system fs5 is managed by node X.X.X.X and mounted on:
• X.X.X.X host1
• X.X.X.X host2 cluster1
• X.X.X.X host3 cluster1
• X.X.X.X host4 cluster2
Concluding Resources
• Some additional information is in the slide notes.
• Read the source.
• Coffee
• NFS stale handles
• 95% is bad
• An application exceeding memory can kill GPFS
• https://lists.sdsc.edu/mailman/listinfo/gpfs-general
• http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs23/bl1ins10/bl1ins1043.html
• http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs31/bl1ins1146.html