Administration Tools for Managing Large Scale Linux Cluster
CRC, KEK, Japan
S. Kawabata, A. Manabe (atsushi.manabe@kek.jp)
Linux PC Clusters in KEK
PC Cluster 1: Pentium III Xeon 500MHz, 144 CPUs (36 nodes)
PC Cluster 2: Pentium III 800MHz, 80 CPUs (40 nodes)
PC Cluster 3 (Belle): Pentium III Xeon 700MHz, 320 CPUs (80 nodes)
PC Cluster 4 (Neutron simulation)
• Fujitsu TS225, 50 nodes
• Pentium III 1GHz x 2 CPUs
• 512MB memory
• 31GB disk
• 100BaseTX x 2
• 1U rack-mount model
• RS232C x 2
• Remote BIOS setting
• Remote reset/power-off
PC Cluster 5 (Belle): 1U servers, Pentium III 1.2GHz, 256 CPUs (128 nodes)
PC Cluster 6: 3U blade server, LP Pentium III 700MHz, 40 CPUs (40 nodes)
PC clusters
• More than 400 nodes (>800 CPUs) of Linux PC clusters have already been installed.
• Only medium-sized and larger PC clusters are counted.
• A major experiment group (Belle) plans to install several hundred blade-server nodes this year.
• All PC clusters are managed by the individual user groups themselves.
Center Machine (KEK CRC)
• Currently the machines in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers.
• We plan to have a Linux computing cluster of more than 1000 nodes in the near future (~2004).
• It will be installed under a ~4-year rental contract (hardware update every 2 years?).
Center Machine
• The system will be shared among many user groups (not dedicated to one group only).
• Their demand for CPU power varies from month to month (high demand before international conferences and so on).
• Of course, we use a load-balancing batch system.
• Big groups use their own software frameworks.
• Their jobs run only under specific versions of the OS (Linux), middleware, and configuration.
R&D system
• Frequent changes of system configuration and CPU partitioning.
• To manage a PC cluster of this size under such user requests, we need sophisticated administration tools.
Necessary admin. tools
• System (SW) installation/update
• Configuration
• Status monitoring / system health check
• Command execution
Installation tool
Installation tool
• Two types of `installation tool':
• Disk cloning
• Application package installer
• (the system/kernel counts as an application in this sense)
Installation tool (cloning)
[Diagram: install the system/applications on a `master host', then copy the disk partition image to the nodes.]
Installation tool (package installer)
[Diagram: clients send requests to a package server, which delivers images and control using a package information DB and a package archive.]
Remote installation via network
• Cloning a disk image
• SystemImager (VA) http://systemimager.sourceforge.net/
• CATS-i (Soongsil Univ.)
• CloneIt http://www.ferzkopp.net/Software/CloneIt/
• Commercial: ImageCast, Ghost, ...
• Package/application installation
• Kickstart + rpm (RedHat)
• LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
• Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
• LCFGng, Arusha
(public domain software)
Dolly+
• We developed `dolly+', an installer that clones images over the network.
• WHY ANOTHER ONE?
• We install/update
• perhaps frequently (according to user needs)
• 100~1000 nodes simultaneously.
• Making packages for our own software is tedious.
• Traditional server/client-type software suffers from a server bottleneck.
• Multicast copy of a ~GB image seems unstable (no free software?).
(Few) server - (many) client model
• The server can be a daemon process (you don't need to start it by hand).
• Performance does not scale with the number of nodes: server bottleneck, network congestion.
Multicasting or broadcasting
• No server bottleneck.
• Gets the maximum performance of a network whose switch fabric supports multicasting.
• A node failure does not affect the whole process very much, so it can be robust.
• But a failed node needs re-transfer, so the speed is governed by the slowest node, as in a RING topology.
• Uses UDP rather than TCP, so the application must take care of transfer reliability.
Dolly and Dolly+
Dolly
• A Linux application to copy/clone files and/or disk images among many PCs through a network.
• Dolly was originally developed by the CoPs project at ETH (Switzerland) and is open software.
Dolly+ features
• Transfer/copy of sequential files (no 2GB size limitation) and/or normal files (optional: decompress and untar on the fly) over a TCP/IP network.
• Virtual RING network connection topology to cope with the server bottleneck problem.
• Pipelining and multi-threading mechanism for speed-up.
• Fail recovery mechanism for robust operation.
Dolly: virtual ring topology
Master = the host holding the original image.
• The physical network connection can be anything you like.
• Logically, Dolly arranges the nodes into a ring chain, specified in dolly's config file, and passes the data from node to node like a bucket relay (see the sketch below).
• Although each transfer is only between two adjacent nodes, it can exploit the maximum performance of a switched network with full-duplex ports.
• Good for a network complex of many switches.
[Diagram: node PCs and network hubs/switches; physical connections vs the logical (virtual) ring connection.]
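The bucket-relay idea can be illustrated with standard tools. The sketch below is only an illustration under assumed host names (first-node, next-node) and port (9998); dolly+ itself adds 4MB chunking, multi-threading and fail recovery on top of this. Start the node-side commands first; depending on the netcat variant, the listen syntax may be `nc -l 9998' instead.

    # on each intermediate node: receive the image, keep a local copy,
    # and forward the same byte stream to the next node in the ring
    nc -l -p 9998 | tee /tmp/hda1.img | nc next-node 9998

    # on the last node of the ring: receive and write only
    nc -l -p 9998 > /tmp/hda1.img

    # on the master (holds the original image): push it to the first node
    dd if=/dev/hda1 bs=4M | nc first-node 9998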
Cascade topology
• The server bottleneck can be overcome.
• Cannot reach the maximum network performance, but better than a many-clients-to-one-server topology.
• Weak against a node failure: a failure also spreads in a cascade and is difficult to recover from.
Pipelining & multi-threading
[Diagram: the image is split into 4MB file chunks (1, 2, 3, ...); while the server sends one chunk over the network, node 1 writes an earlier chunk to its disk and forwards another chunk to node 2, so three threads work in parallel on each node.]
Performance of dolly+
Less than 5 minutes expected for 100 nodes!
HW: Fujitsu TS225, Pentium III 1GHz x 2, SCSI disk, 512MB memory, 100BaseT network
Dolly+ transfer speed: scalability with the size of the image
PC hardware spec (server & nodes): 1GHz Pentium III x 2, IDE-ATA/100 disk, 100BASE-TX network, 256MB memory.
[Plot: transferred bytes (MB) vs elapsed time (sec); the measured curves run close to the 7MB/s and 10MB/s reference lines.]
setup               elapsed time   speed
1 server-1 node     230 sec        8.2MB/s
1 server-2 nodes    252 sec        7.4MB/s x 2
1 server-7 nodes    266 sec        7.0MB/s x 7
1 server-10 nodes   260 sec        7.2MB/s x 10
Fail recovery mechanism (short-cutting)
• A single node failure could be a show stopper in a RING (series connection) topology.
• Dolly+ provides an automatic `short cut' mechanism against node trouble.
• When a node fails, the upstream node detects it through a send timeout.
• The upstream node then negotiates with the downstream node for reconnection and re-transfer of the file chunk (see the sketch below).
• The RING topology makes this easy to implement.
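As a rough illustration of the short-cut idea (not dolly+'s actual code), an upstream node could fall back to the node after its dead neighbour when a send times out. NEXT, NEXT_NEXT, the port and the 60-second timeout are made-up values; the GNU coreutils `timeout' command is assumed.

    # send one chunk downstream; on timeout, skip the dead node and
    # re-send the same chunk to the node after it
    send_chunk() {
        chunk=$1
        timeout 60 nc "$NEXT" 9998 < "$chunk" \
            || timeout 60 nc "$NEXT_NEXT" 9998 < "$chunk"
    }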
Re-transfer in short-cutting
[Diagram: the same 4MB-chunk pipeline as before; when a node in the chain fails, its upstream node reconnects to the next node downstream and re-sends the chunks in flight. This works even with a sequential file.]
Dolly+: how do you start it on Linux

Server side (which has the original file):
    % dollyS [-v] -f config_file
Node side:
    % dollyC [-v]

Config file example:
    iofiles 3                    <- number of files to transfer
    /dev/hda1 > /tmp/dev/hda1
    /data/file.gz >> /data/file
    boot.tar.Z >> /boot
    server n000.kek.jp           <- master name
    firstclient n001.kek.jp
    lastclient n020.kek.jp
    client 20                    <- number of client nodes
    n001                         <- client names
    n002
    :
    n020
    endconfig                    <- end of config

The left of `>' is the input file on the server; the right is the output file on the clients. `>' means dolly+ does not modify the image; `>>' indicates dolly+ should cook (decompress, untar, ...) the file according to the name of the file.
How does dolly+ clone the system after booting?
• Nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).
• The PXE/DHCP server responds to the nodes with information about the node's IP address and the kernel download server (a hedged config sketch follows below).
• The kernel and a `ram disk image' are multicast-TFTP'ed to the nodes, and the kernel starts.
• The kernel hands off to an installation script which runs a disk tool and `dolly+' (the scripts and applications are in the ram disk image).
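For illustration only, a DHCP entry for PXE boot with ISC dhcpd might look like the sketch below. Note that this assumes plain pxelinux over unicast TFTP, whereas the setup described here used the RedHat PXE server with multicast TFTP (mtftp); all addresses and file names are made up.

    subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.101 192.168.1.200;   # addresses handed to the booting nodes
        next-server 192.168.1.1;             # server holding the kernel and ram disk image
        filename "pxelinux.0";               # network bootstrap program sent to the PXE client
    }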
How does dolly+ start after rebooting?
• The installation script partitions the hard drive, creates file systems, and starts the `dolly+' client on the node.
• You start the `dolly+' master on the master host to kick off the disk cloning process.
• The script then configures unique node information, such as the host name and IP address, from the DHCP information (see the sketch below).
• The node is now ready to boot from its hard drive for the first time.
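A minimal sketch of what such a node-side script might do, assuming hypothetical device names, partition layout and file locations (the real scripts are the ones carried in the ram disk image described above):

    #!/bin/sh
    sfdisk /dev/hda < /tmp/partition.layout   # partition the hard drive
    mkswap /dev/hda2                          # swap space
    mke2fs /dev/hda3                          # file system for a partition not covered by the image
    dollyC                                    # receive the system image for /dev/hda1 from the dolly+ ring
    # write per-node identity (taken from the DHCP lease) into the cloned system
    mount /dev/hda1 /mnt
    echo "HOSTNAME=$(hostname)" >> /mnt/etc/sysconfig/network
    umount /mnt
    reboot                                    # first boot from the local hard drive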
PXE trouble
• BY THE WAY, we sometimes suffered PXE mtftp transfer failures when more than 20 nodes booted simultaneously. If you have the same trouble, please mail me. We have started rewriting the mtftp client code of the RedHat Linux PXE server.
Configuration
(Sub)system configuration
• Linux (Unix) has a lot of configuration files for configuring sub-systems. If you have 1000 nodes, you have to manage (many) x 1000 config files.
• To manage them, three types of solution:
• A centralized information service server (like NIS).
• Needs support from the sub-system (nsswitch).
• Automatic remote editing of the raw config files (like cfengine; a hedged example follows below).
• Must take care of each node's files separately.
(The third approach is described on the next slide.)
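As a hedged example of the second approach, a cfengine 2 rule that edits a raw config file in place might look roughly like this; the file name and the NTP server are made up.

    # cfagent.conf fragment: make sure every node points at the site NTP server
    control:
        actionsequence = ( editfiles )

    editfiles:
        { /etc/ntp.conf
          AppendIfNoSuchLine "server ntp1.example.jp"
        }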
Configuration: a new proposal from computer science
• Program (configure) the whole system from source code in an object-oriented way.
• Systematic and uniform configuration.
• Source reuse (inheritance) as much as possible.
• Templates
• Overriding another site's configuration.
• Arusha (http://ark.sourceforge.net)
• LCFGng (http://www.lcfg.org)
LCFGng (Univ. Edinburgh)
[Diagram: the administrator edits and compiles a new source profile on the server; the server notifies the nodes, the nodes acknowledge and fetch the new profile, and the components generate configuration files and execute control commands.]
LCFGng
• Good points
• The authors say that it works on ~1000 nodes.
• Fully automatic (you just edit the source code and compile it on one host).
• Differences between sub-systems are hidden from the user (administrator), or rather moved into `components' (DB -> actual config file).
LCFGng
• Problems
• The configuration language is too primitive:
    Hostname.Component.Parameter  Value
  (an illustrative example follows below)
• There are not that many components, so you must write your own component scripts for each sub-system yourself.
• It is far easier to write the config file itself than to write a component.
• The timing of activating a config change cannot be controlled.
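For illustration only, resources in the Hostname.Component.Parameter format quoted above might look like the lines below; the component and parameter names are hypothetical, not actual LCFGng components.

    node001.mail.relayhost     mail.example.jp
    node001.nfsmount.homepath  /home
    node001.auth.users         kawabata manabe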
Status monitoring
Status monitoring
• System state monitoring
• CPU/memory/disk/network utilization: Ganglia*1, Plantir*2
• (Sub-)system service sanity check: Pikt*3 / Pica*4 / cfengine
*1 http://ganglia.sourceforge.net
*2 http://www.netsonde.com
*3 http://pikt.org
*4 http://pica.sourceforge.net/wtf.html
Ganglia (Univ. of California)
• gmond (on each node)
• All nodes `multicast' their system status information to each other, so every node holds the current status of all nodes -> good redundancy and robustness (a quick query example follows below).
• The authors state that it works on ~1000 nodes.
• Meta-daemon (on the web server)
• Stores the volatile gmond data in a round-robin database and presents an XML image of all nodes' activity.
• Web interface
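Since every gmond holds the state of the whole cluster, the full XML dump can be pulled from any single node; a quick check might look like this (the host name is hypothetical, 8649 is gmond's default TCP port):

    # dump the cluster-wide XML state from one node and count the hosts it knows about
    nc node001 8649 | grep -c '<HOST '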
Plantir (Network adaption)
• Quick understanding of the system status from one web page.
Remote Execution
Remote execution
• The administrator sometimes needs to issue a command to all (or part of) the nodes urgently.
• Remote execution could use rsh/ssh/pikt/cfengine/SUT (mpich)*/gexec, ...
• The key points are:
• Making it easy to see the execution result (failure or success) at a glance.
• Parallel execution across the nodes (a hedged sketch follows below); otherwise, if a command takes 1 second per node, it takes 1000 seconds for 1000 nodes.
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
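A minimal sketch of the parallel-execution point using plain ssh; the node list file and log names are made up (WANI itself is built from PIKT, Webmin and lpd, as shown later).

    #!/bin/sh
    # run the given command on every node in parallel and record the nodes that failed
    CMD="$*"
    : > failed.list
    for n in $(cat nodes.list); do
        ( ssh -n "$n" "$CMD" > "log.$n" 2>&1 || echo "$n" >> failed.list ) &
    done
    wait
    echo "nodes that failed:"; cat failed.list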
WANI
• Web-based remote command executor.
• Easy to select the nodes concerned.
• Easy to specify a script or type in command lines to execute on the nodes.
• Issues the commands to the nodes in parallel.
• Collects the results with error/failure detection.
• Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
WANI is implemented on the `Webmin' GUI.
[Screenshot: node selection, command input, and a Start button.]
Command execution results
[Screenshot: results from 200 nodes on one page, one cell per host name, with links to switch to other pages.]
The frame color represents the state: white = initial, yellow = command started, black = finished.
Error detection (shown by the background color):
1. Exit code
2. `grep -i' for the words "fail/error"
3. Check against the *sys_errlist[] (perror) message list
4. Check against the `strings /bin/sh' output
(A hedged sketch of this kind of check follows below.)
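A hedged sketch of the first two checks (exit code and keyword grep); the perror-string and `strings /bin/sh' checks used by the real tool are omitted, and all names are made up.

    # classify one node's collected output
    check_result() {
        node=$1; rc=$2                               # rc = exit code reported by the node
        if [ "$rc" -ne 0 ]; then
            echo "$node: FAILED (exit code $rc)"
        elif grep -qiE 'fail|error' "log.$node"; then
            echo "$node: FAILED (error keyword in output)"
        else
            echo "$node: OK"
        fi
    }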
[Screenshot: clicking a result cell shows that node's stdout and stderr output.]
[Diagram: WANI prototype architecture. A web browser talks to the Webmin server; piktc on the PIKT server drives piktc_svc on the node hosts for command execution; the results are returned via lpr/lpd with a print_filter acting as the error detector, and the error-marked results appear on the command result pages.]