Scyld ClusterWare System Administration Confidential – Internal Use Only
Orientation Agenda – Part 1 • Scyld ClusterWare foundations • Booting process • Startup scripts • File systems • Name services • Cluster Configuration • Cluster Components • Networking infrastructure • NFS File servers • IPMI Configuration • Break Confidential – Internal Use Only
Orientation Agenda – Part 2 • Parallel jobs • MPI configuration • Infiniband interconnect • Queuing • Initial setup • Tuning • Policy case studies • Other software and tools • Troubleshooting • Questions and Answers Confidential – Internal Use Only
Orientation Agenda – Part 1 • Scyld ClusterWare foundations • Booting process • Startup scripts • File systems • Name services • Cluster Configuration • Cluster Components • Networking infrastructure • NFS File servers • Break Confidential – Internal Use Only
[Diagram: master node, interconnection network, compute nodes with optional disks, Internet or internal network] Cluster Virtualization Architecture Realized • Minimal in-memory OS with a single daemon deployed rapidly in seconds - no disk required • Less than 20 seconds • Virtual, unified process space enables intuitive single sign-on and job submission • Effortless job migration to nodes • Monitor & manage efficiently from the Master • Single System Install • Single Process Space • Shared cache of the cluster state • Single point of provisioning • Better performance due to lightweight nodes • No version skew, so the system is inherently more reliable • Manage & use a cluster like a single SMP machine Confidential – Internal Use Only
[Diagram: master node, interconnection network, compute nodes with optional disks, Internet or internal network] Elements of Cluster Systems • Some important elements of a cluster system • Booting and Provisioning • Process creation, monitoring and control • Update and consistency model • Name services • File Systems • Physical management • Workload virtualization Confidential – Internal Use Only
Booting and Provisioning • Integrated, automatic network boot • Basic hardware reporting and diagnostics in the Pre-OS stage • Only CPU, memory and NIC needed • Kernel and minimal environment from master • Just enough to say “what do I do now?” • Remaining configuration driven by master • Logs are stored in: • /var/log/messages • /var/log/beowulf/node.* Confidential – Internal Use Only
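When a node fails to come up, the per-node boot log is the first place to look. A minimal sketch (the node number and search strings are illustrative):

    # Follow the boot log for node 0 while it boots
    tail -f /var/log/beowulf/node.0
    # Scan all node logs and the system log for obvious failures
    grep -i -E 'error|fail' /var/log/beowulf/node.* /var/log/messages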
DHCP and TFTP services • Started from /etc/rc.d/init.d/beowulf • Locates vmlinuz in /boot • Configures syslog and other parameters on the head node • Loads kernel modules • Sets up libraries • Creates the ramdisk image for compute nodes • Starts the DHCP/TFTP server (beoserv) • Configures NAT for IP forwarding if needed • Starts the kickback name service daemon (4.2.0+) • Tunes the network stack Confidential – Internal Use Only
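All of these steps run from the standard init script, so the usual service commands apply on the head node; for example (a sketch, assuming a default install):

    # Start cluster services, including the DHCP/TFTP server (beoserv)
    service beowulf start
    # Re-read /etc/beowulf/config without restarting running nodes
    service beowulf reload
    # Verify node state from the master
    bpstat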
Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only
Subnet configuration • Default used to be a class C network • netmask 255.255.255.0 • Limited to 155 compute nodes ( 100 + $NODE < 255 ) • Last octet denotes special devices • x.x.x.10 switches • x.x.x.30 storage • Infiniband is a separate network • x.x.1.$(( 100 + $NODE )) • Needed eth0:1 to reach the IPMI network • x.x.2.$(( 100 + $NODE )) • /etc/sysconfig/network-scripts/ifcfg-eth0:1 • ifconfig eth0:1 10.54.2.1 netmask 255.255.255.0 Confidential – Internal Use Only
Subnet configuration • New standard is a class B network • netmask 255.255.0.0 • Limited to 100 * 256 compute nodes • 10.54.50.x – 10.54.149.x • Third octet denotes special devices • x.x.10.x switches • x.x.30.x storage • Infiniband is a separate network • x.$(( x+1 )).x.x • IPMI is on the same network (eth0:1 not needed) • x.x.150.$NODE Confidential – Internal Use Only
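Under the class B scheme a compute node address is pure arithmetic on the node number. A small illustration (the 10.54 prefix and the 50-subnet offset follow the layout above; whether node 0 maps to .50.0 or .50.1 depends on the iprange configured for your site):

    NODE=12
    # Third octet starts at 50 and advances every 256 nodes; fourth octet wraps
    echo "10.54.$(( 50 + NODE / 256 )).$(( NODE % 256 ))"   # compute node -> 10.54.50.12
    echo "10.54.150.$NODE"                                   # IPMI interface -> 10.54.150.12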
Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only
Setup_fs • Script is in /usr/lib/beoboot/bin/setup_fs • Configuration file: /etc/beowulf/fstab • The script selects a per-node fstab if one exists:

    # Select which FSTAB to use.
    if [ -r /etc/beowulf/fstab.$NODE ] ; then
        FSTAB=/etc/beowulf/fstab.$NODE
    else
        FSTAB=/etc/beowulf/fstab
    fi
    echo "setup_fs: Configuring node filesystems using $FSTAB..."

• $MASTER is determined and populated • The "nonfatal" option allows compute nodes to finish the boot process and log errors in /var/log/beowulf/node.* • NFS mounts of external servers need to be done via IP address because name services have not been configured yet Confidential – Internal Use Only
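For reference, a compute node fstab typically mixes local scratch space and NFS mounts; the entries below are illustrative only (device names, export paths, and mount options are assumptions for a generic cluster, not a mandated layout):

    # /etc/beowulf/fstab (example)
    /dev/sda1   swap      swap   defaults           0 0
    /dev/sda2   /scratch  ext2   defaults,nonfatal  0 0
    # External NFS server referenced by IP, since name services are not up yet
    10.54.30.0:/home   /home   nfs   nolock,nonfatal   0 0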
beofdisk • beofdisk configures partition tables on compute nodes • To configure the first drive: • bpsh 0 fdisk /dev/sda • Typical interactive usage • Query the partition table: • beofdisk -q --node 0 • Write partition tables to other nodes: • for i in $(seq 1 10); do beofdisk -w --node $i ; done • Create device nodes initially • Use the head node's /dev/sd* entries as a reference:

    [root@scyld beowulf]# ls -l /dev/sda*
    brw-rw---- 1 root disk 8, 0 May 20 08:18 /dev/sda
    brw-rw---- 1 root disk 8, 1 May 20 08:18 /dev/sda1
    brw-rw---- 1 root disk 8, 2 May 20 08:18 /dev/sda2
    brw-rw---- 1 root disk 8, 3 May 20 08:18 /dev/sda3
    [root@scyld beowulf]# bpsh 0 mknod /dev/sda1 b 8 1

Confidential – Internal Use Only
Create local filesystems • After partitions have been created, run mkfs • bpsh -an mkswap /dev/sda1 • bpsh -an mkfs.ext2 /dev/sda2 • ext2 is a non-journaled filesystem, faster than ext3 for a scratch file system • If corruption occurs, simply mkfs again • Copy the int18 bootblock if needed: • bpcp /usr/lib/beoboot/bin/int18_bootblock $NODE:/dev/sda • /etc/beowulf/config options for file system creation:

    # The compute node file system creation and consistency checking policies.
    fsck full
    mkfs never

Confidential – Internal Use Only
Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only
Name services • /usr/lib/beoboot/bin/node_up populates /etc/hosts and /etc/nsswitch.conf on compute nodes • beo name service determines values from /etc/beowulf/config file • bproc name service determines values from current environment • ‘getent’ can be used to query entries • getent netgroup cluster • getent hosts 10.54.0.1 • getent hosts n3 • If system-config-authentication is run, ensure that proper entries still exist in /etc/nsswitch.conf (head node) Confidential – Internal Use Only
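The resulting lookup order on a compute node can be inspected directly; the nsswitch.conf excerpt below is a sketch of the kind of entries to expect (the exact service names and their order vary by release, so treat it as illustrative):

    # /etc/nsswitch.conf on a compute node (illustrative)
    passwd:    files bproc
    group:     files bproc
    hosts:     files beo bproc dns
    netgroup:  files beo

    # Query through whatever services are configured
    getent hosts n3
    getent netgroup cluster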
[Diagram: master node and compute nodes n0 n1 n2 n3 n4 n5 on the interconnection network] BeoNSS Hostnames • Opportunity: we control IP address assignment • Assign node IP addresses in node order • Turns name lookup into simple addition (base address + node number) • Master: 10.54.0.1 • GigE Switch: 10.54.10.0 • IB Switch: 10.54.11.0 • NFS/Storage: 10.54.30.0 • Nodes: 10.54.50.$node • Name format • Cluster hostnames have the base form n<N> • Options for admin-defined names and networks • Special names for "self" and "master" • The current machine is ".-2" or "self" • The master is known as ".-1", "master", or "master0" Confidential – Internal Use Only
Changes • Prior to 4.2.0 • Hostnames default to .<NODE> form • /etc/hosts had to be populated with alternative names and IP addresses • May break @cluster netgroup and hence NFS exports • /etc/passwd and /etc/group needed on compute nodes for Torque • 4.2.0+ • Hostnames default to n<NODE> form • Configuration is driven by /etc/beowulf/config and beoNSS • Username and groups can be provided by kickback daemon for Torque Confidential – Internal Use Only
Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only
ClusterWare Filecache functionality • Provided by the filecache kernel module • Configured by the /etc/beowulf/config libraries directives • Dynamically controlled by 'bplib' • Capabilities exist in all ClusterWare 4 versions • 4.2.0 adds the prestage keyword in /etc/beowulf/config • Prior versions needed additional scripts in /etc/beowulf/init.d • For libraries listed in /etc/beowulf/config, files can be prestaged by running 'md5sum' against each file, which pulls it through the cache:

    # Prestage selected libraries. The keyword is generic, but the current
    # implementation only knows how to "prestage" a file that is open'able on
    # the compute node: through the libcache, across NFS, or already exists
    # locally (which isn't really a "prestaging", since it's already there).
    prestage_libs=`beoconfig prestage`
    for libname in $prestage_libs ; do
        # failure isn't always fatal, so don't use run_cmd
        echo "node_up: Prestage file:" $libname
        bpsh $NODE md5sum $libname > /dev/null
    done

Confidential – Internal Use Only
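In /etc/beowulf/config the relevant directives look roughly like the following (the paths are placeholders and the exact argument syntax should be checked against 'man beowulf-config' for your release):

    # Make these directories available through the filecache
    libraries /lib64 /usr/lib64 /usr/lib64/MPICH
    # 4.2.0+: pull the listed files onto each node at boot
    prestage /usr/lib64/MPICH/p4/gnu/libmpich.a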
Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only
Compute nodes init.d scripts • Located in /etc/beowulf/init.d • Scripts start on the head node and need explicit bpsh and beomodprobe to operate on compute nodes • $NODE has been prepopulated by /usr/lib/beoboot/bin/node_up • Order is based on file name • Numbered files can be used to control order • beochkconfig is used to set +x bit on files Confidential – Internal Use Only
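A minimal node init script might look like the sketch below; the file name 25_sysctl and the sysctl setting are hypothetical, while the use of bpsh and the pre-populated $NODE follow the slide above:

    #!/bin/sh
    # /etc/beowulf/init.d/25_sysctl  (hypothetical example)
    # Runs on the head node during node boot; $NODE is set by node_up
    bpsh $NODE sysctl -w vm.overcommit_memory=1
    exit 0

The numeric prefix controls ordering, and the script only runs once beochkconfig (or a plain chmod +x) has set the execute bit.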
Cluster Configuration • /etc/beowulf/config is the central location for cluster configuration • Features are documented in ‘man beowulf-config’ • Compute node order is determined by ‘node’ parameters • Changes can be activated by doing a ‘service beowulf reload’ Confidential – Internal Use Only
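An abbreviated /etc/beowulf/config might contain entries along these lines; the keywords and values shown are illustrative (MAC addresses are placeholders), and 'man beowulf-config' remains the authoritative reference:

    interface eth0
    nodes 64
    iprange 10.54.50.0 10.54.50.63
    # One 'node' line per compute node, in node order
    node 00:30:48:xx:xx:01
    node 00:30:48:xx:xx:02
    libraries /lib64 /usr/lib64

    # Apply changes without a full restart
    service beowulf reload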
Orientation Agenda – Part 1 • Scyld ClusterWare foundations • Booting process • Startup scripts • File systems • Name services • Cluster Configuration • Cluster Components • Networking infrastructure • NFS File servers • IPMI configuration • Break Confidential – Internal Use Only
[Diagram: master node, interconnection network, compute nodes with optional disks, Internet or internal network] Elements of Cluster Systems • Some important elements of a cluster system • Booting and Provisioning • Process creation, monitoring and control • Update and consistency model • Name services • File Systems • Physical management • Workload virtualization Confidential – Internal Use Only
Compute Node Boot Process • Starts with /etc/beowulf/node_up • Calls /usr/lib/beoboot/bin/node_up • Usage: node_up <nodenumber> • Sets up: • System date • Basic network configuration • Kernel modules (device drivers) • Network routing • setup_fs • Name services • chroot • Prestages files (4.2.0+) • Other init scripts in /etc/beowulf/init.d Confidential – Internal Use Only
[Diagram: master node, interconnection network, compute nodes with optional disks, Internet or internal network] Remote Filesystems • Remote - Share a single disk among all nodes • Every node sees the same filesystem • Synchronization mechanisms manage changes • Locking has either high overhead or causes serial blocking • "Traditional" UNIX approach • Relatively low performance • Doesn't scale well; the server becomes a bottleneck in large systems • Simplest solution for small clusters reading/writing small files Confidential – Internal Use Only
NFS Server Configuration • Head node NFS services • Configuration in /etc/exports • Provides system files (/bin, /usr/bin) • Increase the number of NFS daemons: • echo "RPCNFSDCOUNT=64" > /etc/sysconfig/nfs ; service nfs restart • Dedicated NFS server • SLES10 was recommended; RHEL5 now includes some xfs support • xfs has better performance • The OS has better IO performance than RHEL4 • Network trunking can be used to increase bandwidth (with caveats) • Hardware RAID • Adaptec RAID card • CTRL-A at boot • arcconf utility from http://www.adaptec.com/en-US/support/raid/ • External storage (Xyratex or nStor) • SAS-attached • Fibre Channel attached Confidential – Internal Use Only
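A head node exports file for the 10.54.0.0/16 cluster network might look like the sketch below; the paths and options are typical choices, not a mandated configuration:

    # /etc/exports (illustrative)
    /home    10.54.0.0/255.255.0.0(rw,no_root_squash,sync)
    /opt     10.54.0.0/255.255.0.0(ro,no_root_squash,async)

    # Re-export after editing
    exportfs -ra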
Network trunking • Use multiple physical links as a single pipe for data • Configuration must be done on both the host and the switch • SLES 10 configuration • Create a configuration file /etc/sysconfig/network/ifcfg-bond0 for the bond0 interface:

    BOOTPROTO=static
    DEVICE=bond0
    IPADDR=10.54.30.0
    NETMASK=255.255.0.0
    STARTMODE=onboot
    MTU=''
    BONDING_MASTER=yes
    BONDING_SLAVE_0=eth0
    BONDING_SLAVE_1=eth1
    BONDING_MODULE_OPTS='mode=0 miimon=500'

Confidential – Internal Use Only
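Once the interface is up, the kernel bonding driver reports per-slave link status; a quick check using standard Linux paths (iperf is only an example throughput test and must be installed separately):

    # Bring up the bond and verify both slaves are active
    ifup bond0
    cat /proc/net/bonding/bond0
    # Optional throughput sanity check against the head node
    iperf -c 10.54.0.1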
Network trunking • HP switch configuration • Create trunk group via serial or telnet interface • Netgear (admin:password) • Create trunk group via http interface • Cisco • Create etherchannel configuration Confidential – Internal Use Only
External Storage • Xyratex arrays have a configuration interface • Text based via serial port • Newer devices (nStor 5210, Xyratex F/E 5402/5412/5404) have embedded StorView • http://storage0:9292 • admin:password • RAID arrays, logical drives are configured and monitored • LUNs are numbered and presented on each port. Highest LUN is the controller itself • Multipath or failover needs to be configured Confidential – Internal Use Only
Need for QLogic Failover • Collapse LUN presentation in the OS to a single instance per LUN • Minimize the potential for user error while maintaining failover and static load balancing Confidential – Internal Use Only
Physical Management • ipmitool • Intelligent Platform Management Interface (IPMI) is integrated into the baseboard management controller (BMC) • Serial-over-LAN (SOL) can be implemented • Allows access to hardware such as sensor data or power states • E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off} • bpctl • Controls the operational state and ownership of compute nodes • Examples might be to reboot or power off a node • Reboot: bpctl -S all -R • Power off: bpctl -S all -P • Limit user and group access to run on a particular node or set of nodes Confidential – Internal Use Only
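Combining the two tools gives a simple power and health check loop; the n$NODE-ipmi hostname pattern and the admin credentials follow the example above and are site-specific assumptions:

    # Query power status of every node's BMC
    for NODE in $(seq 0 63); do
        echo -n "n$NODE: "
        ipmitool -H n$NODE-ipmi -U admin -P admin power status
    done
    # Read sensors (temperatures, fans, voltages) from one BMC
    ipmitool -H n0-ipmi -U admin -P admin sensor list
    # Reboot all compute nodes through bproc
    bpctl -S all -R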
IPMI Configuration • The full spec is available here: • http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf • Penguin-specific configuration • Recent products all have IPMI implementations. Some are in-band (share physical media with eth0), some are out-of-band (separate port and cable from eth0) • Altus 1300, 600, 650 – In-band, lan channel 6 • Altus 1600, 2600, 1650, 2650; Relion 1600, 2600, 1650, 2650, 2612 – Out-of-band, lan channel 2 • Relion 1670 – In-band, lan channel 1 • Altus x700/x800, Relion x700 – Out-of-band OR in-band, lan channel 1 • Some ipmitool versions have a bug and need the following command to commit a write: • bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0 Confidential – Internal Use Only
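Setting a BMC's LAN parameters from the node itself uses the channel numbers listed above; the addresses below are illustrative and follow the x.x.150.$NODE convention from the subnet slides:

    NODE=0; CHANNEL=2
    bpsh $NODE ipmitool lan set $CHANNEL ipsrc static
    bpsh $NODE ipmitool lan set $CHANNEL ipaddr 10.54.150.$NODE
    bpsh $NODE ipmitool lan set $CHANNEL netmask 255.255.0.0
    bpsh $NODE ipmitool lan set $CHANNEL access on
    # Verify, then commit (raw command needed on some ipmitool versions, see above)
    bpsh $NODE ipmitool lan print $CHANNEL
    bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0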
Orientation Agenda – Part 2 • Parallel jobs • MPI configuration • Infiniband interconnect • Queuing • Initial setup • Tuning • Policy case studies • Other software and tools • Questions and Answers Confidential – Internal Use Only
Explicitly Parallel Programs • Different paradigms exist for parallelizing programs • Shared memory • OpenMP • Sockets • PVM • Linda • MPI • Most distributed parallel programs are now written using MPI • Different options for MPI stacks: MPICH, OpenMPI, HP-MPI, and Intel MPI • ClusterWare comes integrated with customized versions of MPICH and OpenMPI Confidential – Internal Use Only
Compiling MPICH programs • mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH • GNU, PGI, and Intel compilers are supported • The wrappers effectively set the libraries and include paths for compiling and linking:

    prefix="/usr"
    part1="-I${prefix}/include"
    part2=""
    part3="-lmpi -lbproc"
    ...
    part1="-L${prefix}/${lib}/MPICH/p4/gnu $part1"
    ...
    $cc $part1 $part2 $part3

Confidential – Internal Use Only
Running MPICH programs • mpirun is used to launch MPICH programs • If Infiniband is installed, the interconnect fabric can be chosen using the machine flag: • -machine p4 • -machine vapi • Done by changing LD_LIBRARY_PATH at runtime: • export LD_LIBRARY_PATH=${libdir}/MPICH/${MACHINE}/${compiler}:${LD_LIBRARY_PATH} • Hooks for using mpiexec with the queue system:

    elif [ -n "${PBS_JOBID}" ]; then
        for var in NP NO_LOCAL ALL_LOCAL BEOWULF_JOB_MAP
        do
            unset $var
        done
        for hostname in `cat $PBS_NODEFILE`
        do
            NODENUMBER=`getent hosts ${hostname} | awk '{print $3}' | tr -d '.'`
            BEOWULF_JOB_MAP="${BEOWULF_JOB_MAP}:${NODENUMBER}"
        done
        # Clean a leading : from the map
        export BEOWULF_JOB_MAP=`echo ${BEOWULF_JOB_MAP} | sed 's/^://g'`
        # The -n 1 argument is important here
        exec mpiexec -n 1 ${progname} "$@"

Confidential – Internal Use Only
Environment Variable Options • Additional environment variable control: • NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes. • ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs. • ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU. • ALL_LOCAL — Run every process on the master node; used for debugging purposes. • NO_LOCAL — Don't run any processes on the master node. • EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment. • BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on. Confidential – Internal Use Only
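Putting a few of these together for an interactive run (the binary name and node numbers are placeholders):

    # 8 processes, none on the master node, skipping node 3
    NP=8 NO_LOCAL=1 EXCLUDE=3 ./a.out
    # Explicit placement: rank 0 on node 4, rank 1 on node 5, and so on
    BEOWULF_JOB_MAP=4:5:6:7 NP=4 ./a.out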
Running MPICH programs • Prior to ClusterWare 4.1.4, MPICH jobs were spawned outside of the queue system • BEOWULF_JOB_MAP had to be set based on the machines listed in $PBS_NODEFILE:

    number_of_nodes=`cat $PBS_NODEFILE | wc -l`
    hostlist=`cat $PBS_NODEFILE | head -n 1`
    for i in $(seq 2 $number_of_nodes) ; do
        hostlist=${hostlist}:`cat $PBS_NODEFILE | head -n $i | tail -n 1`
    done
    BEOWULF_JOB_MAP=`echo $hostlist | sed 's/\.//g' | sed 's/n//g'`
    export BEOWULF_JOB_MAP

• Starting with ClusterWare 4.1.4, mpiexec was included with the distribution. mpiexec is an alternative spawning mechanism that starts processes as part of the queue system • Other MPI implementations have alternatives. HP-MPI and Intel MPI use rsh and run outside of the queue system. OpenMPI uses libtm to properly start processes Confidential – Internal Use Only
MPI Primer • Only a brief introduction to MPI is provided here. Many other in-depth tutorials are available on the web and in published sources. • http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html • http://www.llnl.gov/computing/tutorials/mpi/ • Paradigms for writing parallel programs depend upon the application • SIMD (single-instruction multiple-data) • MIMD (multiple-instruction multiple-data) • MISD (multiple-instruction single-data) • SIMD is presented here as it is a commonly used template • A single application source is compiled to perform operations on different sets of data • The data is read by the different processes or passed between them via messages (hence MPI = Message Passing Interface) • Contrast this with shared memory or OpenMP, where data is shared locally via memory • Optimizations in the MPI implementation can take advantage of processes on the same host; however, the program is still written using a message-passing construct • The MPI specification has many functions; however, most MPI programs can be written with only a small subset Confidential – Internal Use Only
Infiniband Primer • Infiniband provides a low-latency, high-bandwidth interconnect for message passing, minimizing communication overhead for tightly coupled parallel applications • Infiniband requires hardware, kernel drivers, O/S support, user-space drivers, and application support • Prior to 4.2.0, the software stack was provided by SilverStorm • Starting with 4.2.0, ClusterWare migrated to the OpenFabrics (OFED, OpenIB) stack Confidential – Internal Use Only
Infiniband Subnet Manager • Every Infiniband network requires a Subnet Manager to discover and manage the topology • Our clusters typically ship with a Managed QLogic Infiniband switch with an embedded subnet manager (10.54.0.20; admin:adminpass) • Subnet Manager is configured to start at switch boot • Alternatively, a software Subnet Manager (e.g. openSM) can be run on a host connected to the Infiniband fabric. • Typically the embedded subnet manager is more robust and provides a better experience Confidential – Internal Use Only
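Whichever subnet manager is used, standard OFED diagnostics confirm the fabric is up; the utilities below ship with the OpenFabrics stack and are run from any host on the fabric:

    # Local HCA state and link rate
    ibstat
    # Identity of the active subnet manager on the fabric
    sminfo
    # List the hosts the subnet manager has discovered
    ibhosts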
Communication Layers • Verbs API (VAPI) provides a hardware specific interface to the transport media • Any program compiled with VAPI can only run on the same hardware profile and drivers • Makes portability difficult • Direct Access Programming Language (DAPL) provides a more consistent interface • DAPL layers can communicate with IB, Myrinet, and 10GigE hardware • Better portability for MPI libraries • TCP/IP interface • Another upper layer protocol provides IP-over-IB (IPoIB) where the IB interface is assigned an IP address and most standard TCP/IP applications work Confidential – Internal Use Only
MPI Implementation Comparison • MPICH is provided by Argonne National Laboratory • Runs only over Ethernet • Ohio State University has ported MPICH to use the Verbs API => MVAPICH • Similar to MPICH but uses Infiniband • LAM-MPI was another implementation which provided a more modular design • OpenMPI is the successor to LAM-MPI and has many options • Can use different physical interfaces and spawning mechanisms • http://www.openmpi.org • HP-MPI, Intel-MPI • Licensed MPICH2 code and added functionality • Can use a variety of physical interconnects Confidential – Internal Use Only
OpenMPI Configuration • Build from source with the cluster-relevant options, then install:

    ./configure --prefix=/opt/openmpi --with-udapl=/usr --with-tm=/usr \
        --with-openib=/usr --without-bproc --without-lsf_bproc --without-grid \
        --without-slurm --without-gridengine --without-portals --without-gm \
        --without-loadleveler --without-xgrid --without-mx \
        --enable-mpirun-prefix-by-default --enable-static
    make all
    make install

• Create scripts in /etc/profile.d to set default environment variables for all users • Example run:

    mpirun -v -mca pls_rsh_agent rsh -mca btl openib,sm,self -machinefile machinefile ./IMB-MPI1

Confidential – Internal Use Only
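A profile.d fragment matching the install above might look like this; the file name and paths are assumptions that simply mirror the --prefix used during configure:

    # /etc/profile.d/openmpi.sh (illustrative)
    export PATH=/opt/openmpi/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
    export MANPATH=/opt/openmpi/share/man:$MANPATH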
Queuing • How are resources allocated among multiple users and/or groups? • Statically, by using bpctl user and group permissions • ClusterWare supports a variety of queuing packages • TaskMaster (the advanced MOAB policy-based scheduler integrated with ClusterWare) • Torque • SGE Confidential – Internal Use Only
Interacting with TaskMaster • Because TaskMaster uses the MOAB scheduler with Torque pbs_server and pbs_mom components, all of the Torque commands are still valid • qsub will submit a job to Torque, MOAB then polls pbs_server to detect new jobs • msub will submit a job to Moab which then pushes the job to pbs_server • Other TaskMaster commands • qstat -> showq • qdel, qhold, qrls -> mjobctl • pbsnodes -> showstate • qmgr -> mschedctl, mdiag • Configuration in /opt/moab/moab.cfg Confidential – Internal Use Only
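Since the Torque commands remain valid, a basic submission looks the same as on any Torque cluster; the script below is a generic example (resource requests, job name, and program are placeholders):

    #!/bin/sh
    #PBS -N mpi_test
    #PBS -l nodes=2:ppn=4
    #PBS -j oe
    cd $PBS_O_WORKDIR
    mpiexec ./a.out

    # Submit through Torque or through Moab, then watch the queue
    qsub job.sh        # or: msub job.sh
    qstat              # Torque view
    showq              # Moab/TaskMaster view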