Administration Tools for Managing Large Scale Linux Cluster CRC KEK Japan S.Kawabata, A.Manabe atsushi.manabe@kek.jp
Linux PC Clusters in KEK • PC Cluster 1: Pentium III Xeon 500MHz, 144 CPUs (36 nodes) • PC Cluster 2: Pentium III 800MHz, 80 CPUs (40 nodes) • PC Cluster 3 (Belle): Pentium III Xeon 700MHz, 320 CPUs (80 nodes)
PC Cluster 4 (Neutron simulation) • Fujitsu TS225, 50 nodes • Pentium III 1GHz x 2 CPUs • 512MB memory • 31GB disk • 100Base-TX x 2 • 1U rack-mount model • RS232C x 2 • Remote BIOS setting • Remote reset/power-off
PC Cluster 5 (Belle) 1U server Pentium III 1.2GHz 256 CPU (128 nodes)
PC Cluster 6 (3U blade server): LP Pentium III 700MHz, 40 CPUs (40 nodes)
PC clusters • More than 400 nodes (>800 CPUs) of Linux PC clusters have already been installed. • Only medium-sized or larger PC clusters are counted. • A major experiment group (Belle) plans to install several hundred nodes of blade servers this year. • All PC clusters are managed by the individual user groups themselves.
Center Machine (KEK CRC) • Currently, the machines in the KEK Computer Center (CRC) are UNIX (Solaris, AIX) servers. • We plan to have a Linux computing cluster of more than 1000 nodes in the near future (~2004). • It will be installed under a ~4-year rental contract (hardware update every 2 years?).
Center Machine • The system will be shared among many user groups (not dedicated to a single group). • Their demand for CPU power varies from month to month (e.g. high demand before an international conference). • Of course, we use a load-balancing batch system. • Big groups use their own software frameworks. • Their jobs run only under certain restricted versions of the OS (Linux), middleware and configuration.
R&D system • Frequent changes of system configuration / CPU partitioning. • To manage a PC cluster of this size under such user requirements, we need sophisticated administration tools.
Necessary admin. tools • System (SW) installation/update • Configuration • Status monitoring / system health check • Command execution
Installation tool
Installation tool • Two types of `installation tool': • Disk cloning • Application package installer • The system (kernel) is itself an application in this sense.
Installation tool (image cloning) • Install the system/applications on a `master host'. • Copy the disk partition image to the nodes.
Installation tool (package installer) • Clients send requests to a package server. • The server returns the image and control information, drawing on a package information DB and a package archive.
Remote Installation via NW • Cloning disk image • SystemImager (VA) http://systemimager.sourceforge.net/ • CATS-i (Soongsil Univ.) • CloneIt http://www.ferzkopp.net/Software/CloneIt/ • Commercial: ImageCast, Ghost, ... • Packages/applications installation • Kickstart + rpm (Red Hat) • LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui • Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/ • LCFGng, Arusha (public domain software)
Dolly+ • We developed an `image cloning via network' installer, `dolly+'. • WHY ANOTHER? • We install/update frequently (according to user needs) and on 100~1000 nodes simultaneously. • Making packages for our own software is tedious. • Traditional server/client-type software suffers from a server bottleneck. • Multicast copy of a ~GB image seems unstable (no free software?).
(Few) server - (many) client model • The server can be a daemon process (you don't need to start it by hand). • Performance is not scalable with the number of nodes: server bottleneck, network congestion. Multicasting or broadcasting • No server bottleneck. • Gets the maximum performance of a network whose switch fabrics support multicasting. • A node failure does not affect the whole process very much, so it can be robust; however, a failed node needs re-transfer, so the speed is governed by the slowest node, as in a RING topology. • Uses UDP, not TCP, so the application must take care of transfer reliability.
Dolly and Dolly+ Dolly • A Linux application to copy/clone files and/or disk images among many PCs through a network. • Dolly was originally developed by the CoPs project at ETH (Switzerland) and is open software. Dolly+ features • Transfers/copies sequential files (no 2GB size limitation) and/or normal files via a TCP/IP network (optional: decompress and untar on the fly). • Virtual RING network connection topology to cope with the server bottleneck problem. • Pipelining and multi-threading mechanism for speed-up. • Fail recovery mechanism for robust operation.
Dolly: Virtual Ring Topology • Master = the host having the original image. • The physical network connection can be anything you like. • Logically, `Dolly' makes a ring chain of nodes, specified in dolly's config file, and sends the data node to node, bucket-relay style. • Although each transfer is only between two adjacent nodes, it can exploit the maximum performance of a switched network with full-duplex ports. • Good for a network complex of many switches. (Diagram: node PCs and network hubs/switches; physical connection vs. logical (virtual) ring connection.)
Cascade Topology • The server bottleneck could be overcome. • Cannot reach the maximum network performance, but better than a many-clients-to-one-server topology. • Weak against a node failure: the failure spreads down the cascade as well and is difficult to recover from.
Pipelining & multi-threading (diagram): the file is split into 4MB chunks (1, 2, 3, ...); while the server sends chunk N over the network to Node 1, Node 1 writes an earlier chunk to disk and forwards another to Node 2, and so on to the next node - 3 threads run in parallel on each node.
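The slides describe this mechanism only in a picture; the following is a minimal sketch of the idea in Python, not dolly+'s actual code (socket setup and ring bookkeeping are left out): each node receives a chunk from its upstream neighbour while a second thread writes the previous chunk to disk and forwards it downstream.

    import queue
    import socket
    import threading

    CHUNK = 4 * 1024 * 1024   # 4 MB file chunk, as in dolly+

    def relay_node(upstream_sock, downstream_sock, out_path):
        """Receive chunks from upstream, write them to the local disk and
        forward them downstream; writing/forwarding runs in its own thread so
        that receiving the next chunk overlaps with handling the previous one."""
        work = queue.Queue(maxsize=4)            # a few chunks in flight

        def write_and_forward():
            with open(out_path, "wb") as out:
                while True:
                    chunk = work.get()
                    if chunk is None:            # end-of-transfer marker
                        break
                    out.write(chunk)             # store locally
                    if downstream_sock is not None:   # last node has no downstream
                        downstream_sock.sendall(chunk)
            if downstream_sock is not None:
                downstream_sock.shutdown(socket.SHUT_WR)

        worker = threading.Thread(target=write_and_forward)
        worker.start()
        while True:
            chunk = upstream_sock.recv(CHUNK)    # may return less than CHUNK bytes
            if not chunk:                        # upstream closed: transfer done
                work.put(None)
                break
            work.put(chunk)
        worker.join()

Because every node is simultaneously receiving, writing and sending different chunks, the whole ring streams at roughly the speed of a single point-to-point copy, which is what the measurements on the next slides show.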
Performance of dolly+ • Less than 5 min expected for 100 nodes! • HW: Fujitsu TS225, Pentium III 1GHz x2, SCSI disk, 512MB memory, 100Base-T network.
Dolly+ transfer speed: scalability with the size of the image
Hardware spec (server & nodes): 1GHz Pentium III x 2, IDE-ATA/100 disk, 100BASE-TX network, 256MB memory.
(Graph: transferred bytes (MB) vs. elapsed time (sec); all setups lie between the 10MB/s and 7MB/s reference lines.)
setup                 elapsed time   speed
1 server - 1 node     230 sec        8.2 MB/s
1 server - 2 nodes    252 sec        7.4 MB/s x2
1 server - 7 nodes    266 sec        7.0 MB/s x7
1 server - 10 nodes   260 sec        7.2 MB/s x10
Fail recovery mechanism (short-cutting) • A single node failure could be a "show stopper" in a RING (= series connection) topology. • Dolly+ provides an automatic `short cut' mechanism against a node trouble. • When a node fails, the upstream node detects it via a send timeout. • The upstream node then negotiates with the downstream node to reconnect and re-transfer a file chunk. • The RING topology makes this easy to implement.
Re-transfer in short-cutting (diagram): the failed node is bypassed; its upstream neighbour reconnects to the next node in the ring and re-sends the 4MB chunks that may have been lost. This works even with a sequential file.
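A rough sketch of the short-cut logic under the same caveats (hypothetical helper connect(), simplified bookkeeping; not the real dolly+ protocol): the sender keeps the last few chunks, declares a failure on a send timeout, reconnects to the next host in the ring list and replays the possibly lost chunks.

    import socket

    SEND_TIMEOUT = 30.0    # seconds before the sender declares the next node dead

    def send_ring(ring, my_index, chunks, keep=8):
        """Send (seq, chunk) pairs downstream; on timeout, short-cut to the
        node after the failed one and replay the last `keep` chunks."""
        next_index = my_index + 1
        down = connect(ring[next_index])          # hypothetical helper: open a TCP socket
        down.settimeout(SEND_TIMEOUT)
        recent = []                               # chunks the dead node may not have forwarded
        for seq, chunk in chunks:
            try:
                down.sendall(chunk)
            except (socket.timeout, OSError):
                down.close()
                next_index += 1                   # skip the failed node
                down = connect(ring[next_index])
                down.settimeout(SEND_TIMEOUT)
                for _, old in recent:             # re-transfer in short-cutting
                    down.sendall(old)
                down.sendall(chunk)
            recent.append((seq, chunk))
            recent = recent[-keep:]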
Dolly+: how you start it on Linux
Server side (which has the original file):
  % dollyS [-v] -f config_file
Node side:
  % dollyC [-v]
Config file example:
  iofiles 3                <- number of files to transfer
  /dev/hda1 > /tmp/dev/hda1
  /data/file.gz >> /data/file
  boot.tar.Z >> /boot
  server n000.kek.jp       <- master name
  firstclient n001.kek.jp
  lastclient n020.kek.jp
  client 20                <- number of client nodes
  n001                     <- client node names
  n002
  :
  n020
  endconfig                <- end of config
The left side of `>' is the input file on the server; the right side is the output file on the clients. `>' means dolly+ does not modify the image; `>>' indicates that dolly+ should cook (decompress, untar, ...) the file according to the name of the file.
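For 100~1000 clients the node list grows quickly; a tiny helper like the following (not part of dolly+; the file list and host names are simply those of the example above) could generate such a config file:

    def write_dolly_config(path, server, n_clients, domain="kek.jp"):
        """Write a dolly+-style config for n_clients nodes named n001..nNNN,
        reusing the file list from the example above."""
        clients = ["n%03d" % i for i in range(1, n_clients + 1)]
        with open(path, "w") as f:
            f.write("iofiles 3\n")
            f.write("/dev/hda1 > /tmp/dev/hda1\n")
            f.write("/data/file.gz >> /data/file\n")
            f.write("boot.tar.Z >> /boot\n")
            f.write("server %s.%s\n" % (server, domain))
            f.write("firstclient %s.%s\n" % (clients[0], domain))
            f.write("lastclient %s.%s\n" % (clients[-1], domain))
            f.write("client %d\n" % n_clients)
            for c in clients:
                f.write(c + "\n")
            f.write("endconfig\n")

    write_dolly_config("dolly.conf", "n000", 20)   # reproduces the 20-node example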
How does dolly+ clone the system after booting? • Nodes broadcast over the LAN in search of an installation server (PXE: Pre-eXecution Environment). • The PXE/DHCP server responds to the nodes with the node's IP address and the kernel download server. • The kernel and a `ram disk image' are multicast-TFTP'ed to the nodes, and the kernel starts. • The kernel hands off to an installation script which runs a disk tool and `dolly+' (the scripts and applications are contained in the ram disk image).
How does dolly+ start after rebooting? • The code partitions the hard drive, creates the file systems, and starts the `dolly+' client on the node. • You start the `dolly+' master on the master host to start the disk-clone process. • The code then configures node-unique information such as the host name and IP address from the DHCP information. • The node is then ready to boot from its hard drive for the first time.
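The installation script itself is not shown on the slides; a minimal sketch of what such a ram-disk script could do, with illustrative device names, partition layout file and mount point (not taken from the real system), might look like this:

    import socket
    import subprocess

    def run(cmd):
        """Run a shell command and stop the installation if it fails."""
        subprocess.run(cmd, shell=True, check=True)

    def install_node():
        # 1. Partition the disk and create file systems (illustrative layout file).
        run("sfdisk /dev/hda < /tmp/partition.layout")
        run("mkfs -t ext3 /dev/hda1")
        run("mkswap /dev/hda2")
        run("mount /dev/hda1 /mnt")
        # 2. Receive the system image over the ring (dolly+ client side,
        #    cf. the config example on the previous slide).
        run("dollyC -v")
        # 3. Configure node-unique information obtained from DHCP at PXE boot
        #    (the target file is illustrative).
        with open("/mnt/etc/HOSTNAME", "w") as f:
            f.write(socket.gethostname() + "\n")
        # 4. Ready to boot from the local disk for the first time.
        run("reboot")

    install_node()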
PXE Trouble • BY THE WAY, we sometimes suffered PXE mtftp transfer failures when more than 20 nodes booted simultaneously. If you have the same trouble, please mail me. We have started rewriting the mtftp client code of the Red Hat Linux PXE server.
Configuration
(Sub)system Configuration • Linux (UNIX) has a lot of configuration files for configuring its sub-systems. If you have 1000 nodes, you have to manage (many) x 1000 config files. • To manage them, there are three types of solution: • A centralized information service server (like NIS) - needs support from the sub-system (nsswitch). • Automatic remote editing of the raw config files (like cfengine) - must take care of each node's files separately. • Programming the whole configuration (the new proposal described on the next slide).
Configuration: a new proposal from CS • Program (configure) the whole system from a source code in an object-oriented way. • Systematic and uniform configuration. • Source reuse (inheritance) as much as possible: • Templates • Overriding another site's configuration. • Arusha (http://ark.sourceforge.net) • LCFGng (http://www.lcfg.org)
LCFGng (Univ. of Edinburgh) (diagram): a new source is compiled into node profiles on the server; each node is notified, acknowledges, fetches its new profile, and its components turn the profile into configuration files and execute control commands.
LCFGng • Good things • The author says it works on ~1000 nodes. • Fully automatic (you just edit the source code and compile it on one host). • Differences between sub-systems are hidden from the user (administrator), or rather moved into `components' (DB -> actual config file).
LCFGng • The configuration language is too primitive: Hostname.Component.Parameter Value • The available components are not so many, or you must write your own component script for each sub-system yourself - it is far easier to write the config file itself than to write a component. • The timing at which a configuration change is activated cannot be controlled.
Status monitoring
Status Monitoring • System state monitoring • CPU/memory/disk/network utilization: Ganglia*1, Plantir*2 • (Sub-)system service sanity check: Pikt*3 / Pica*4 / cfengine *1 http://ganglia.sourceforge.net *2 http://www.netsonde.com *3 http://pikt.org *4 http://pica.sourceforge.net/wtf.html
Ganglia (Univ. of California) • gmond (on each node) • All nodes `multicast' their system status information to each other, so each node holds the current status of all nodes -> good redundancy and robustness. • Declared to work on ~1000 nodes. • Meta-daemon (web server) • Stores the volatile data from gmond in a round-robin DB and presents an XML image of all nodes' activity. • Web interface
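This is not Ganglia's actual wire format, but a minimal sketch of the multicast idea described above (the group address, port and status fields are made up): every node periodically multicasts its own load and listens for everyone else's, so each node ends up holding the state of the whole cluster.

    import json
    import socket
    import struct
    import threading
    import time

    GROUP, PORT = "239.2.0.1", 4000     # made-up multicast group and port

    def announcer():
        """Periodically multicast this node's own status."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            with open("/proc/loadavg") as f:
                load1 = f.read().split()[0]
            status = {"host": socket.gethostname(), "load1": load1}
            s.sendto(json.dumps(status).encode(), (GROUP, PORT))
            time.sleep(10)

    def listener(cluster_state):
        """Collect every node's announcements into one dictionary."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            data, _ = s.recvfrom(65535)
            status = json.loads(data)
            cluster_state[status["host"]] = status   # every node sees every node

    cluster_state = {}
    threading.Thread(target=listener, args=(cluster_state,), daemon=True).start()
    announcer()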
Plantir (Network adaption) • Quick understanding of the system status from one web page.
Remote Execution
Remote execution • The administrator sometimes needs to issue a command to all (or part of the) nodes urgently. • Remote execution can be done with rsh/ssh/pikt/cfengine/SUT(mpich)*/gexec... • The points are: • Make it easy to see the execution result (failure or success) at a glance. • Execute in parallel across the nodes - otherwise, if it takes 1 sec per node, it takes 1000 sec for 1000 nodes. *) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
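The slides do not show code for this; a minimal sketch of the "issue in parallel, check the result at a glance" idea on top of plain ssh (the host names and the command are placeholders) could look like:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    NODES = ["n%03d" % i for i in range(1, 101)]   # placeholder host names
    COMMAND = "uptime"                             # placeholder command

    def run_on(node):
        """Run the command on one node via ssh and return (node, rc, stdout, stderr)."""
        p = subprocess.run(["ssh", "-o", "ConnectTimeout=10", node, COMMAND],
                           capture_output=True, text=True)
        return node, p.returncode, p.stdout.strip(), p.stderr.strip()

    # Issue the command to all nodes in parallel instead of one after another.
    with ThreadPoolExecutor(max_workers=50) as pool:
        for node, rc, out, err in pool.map(run_on, NODES):
            print("%-4s %s: %s" % ("OK" if rc == 0 else "FAIL", node, err or out))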
WANI • A web-based remote command executor. • Easy to select the nodes concerned. • Easy to specify a script, or to type in command lines, to execute on the nodes. • Issues the commands to the nodes in parallel. • Collects the results with error/failure detection. • Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
WANI is implemented on the `Webmin' GUI. (Screenshot: node selection, command input and a Start button.)
Command execution result. (Screenshot: results from 200 nodes on one page, one cell per host name, with a link to switch to another page.)
Error detection • Frame color represents: white = initial, yellow = command started, black = finished. • Background color marks errors, detected by: 1. the exit code; 2. a case-insensitive grep for the words "fail"/"error"; 3. a check against the *sys_errlist[] (perror) message list; 4. a check against the output of `strings /bin/sh`.
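A sketch of the same heuristics applied to one node's result; the two word lists below are stand-ins for the sys_errlist (perror) messages and the `strings /bin/sh` output that WANI checks against.

    import re

    # Stand-ins for the real word lists (sys_errlist/perror messages and
    # strings extracted from /bin/sh); only two entries each for illustration.
    ERRLIST_WORDS = ["No such file or directory", "Permission denied"]
    SHELL_WORDS = ["command not found", "bad substitution"]

    def classify(exit_code, output):
        """Apply the four checks from the slide: exit code, fail/error grep,
        sys_errlist phrases, and shell error messages."""
        if exit_code != 0:
            return "error (exit code)"
        if re.search(r"fail|error", output, re.IGNORECASE):
            return "error (fail/error word)"
        for phrase in ERRLIST_WORDS + SHELL_WORDS:
            if phrase in output:
                return "error (known error message)"
        return "success"

    print(classify(0, "lpr: Permission denied"))          # -> error (known error message)
    print(classify(0, "up 12 days, load average: 0.02"))  # -> success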
(Screenshot: clicking on a result cell shows the stdout output; clicking again shows the stderr output.)
WANI implementation (diagram): the web browser talks to the Webmin server, which issues the command via piktc on the PIKT server; piktc_svc on the node hosts executes it; the results are returned via lpr to lpd and a print_filter containing the error detector, which produces the error-marked result pages shown in the web browser.