Administration Tools for Managing Large Scale Linux Cluster
CRC, KEK, Japan
S. Kawabata, A. Manabe
Linux PC Clusters in KEK
PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
PC Cluster 3: Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
PC Cluster 4: 1U server, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
PC Cluster 5: 3U blade server, LP Pentium III 700 MHz, 40 CPUs (40 nodes)
ACAT2002
PC clusters
• More than 400 nodes are already installed, counting only middle-sized or larger PC clusters.
• Each PC cluster is managed by an individual user group.
• A major experiment group plans to install several hundred blade-server nodes this year.
Center Machine
• The KEK Computer Center plans to have more than 1000 nodes in the near future (~2004).
• The system will be installed under a ~4-year rental contract.
• It will be shared among many user groups (not dedicated to any one group).
• The system partitioning will vary according to each group's demand for CPU power.
PC cluster for system R&D
• Fujitsu TS225, 50 nodes
• Pentium III 1 GHz x 2 CPUs
• 512 MB memory
• 31 GB disk
• 100Base-TX x 2
• 1U rack-mount model
• RS-232C x 2
• Remote BIOS setting
• Remote reset/power-off
Necessary admin tools
• Installation/update
• Command execution
• Configuration
• Status monitoring
Installation tool
Installation tool (1): image cloning — install the system and applications by copying a disk-partition image to the nodes.
Installation tool (2): package-based — a package server, with a package-information DB, serves the package archive to the clients.
Remote installation via network
• Disk-image cloning
  • SystemImager (VA) http://systemimager.sourceforge.net/
  • CATS-i (Soongsil Univ.)
  • CloneIt http://www.ferzkopp.net/Software/CloneIt/
  • Commercial: ImageCast, Ghost, ...
• Package/application installation
  • Kickstart + rpm (Red Hat)
  • LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
  • Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
Dolly+
• We developed `dolly+`, an "image cloning via network" installer.
• Why another one?
  • We install/update frequently (according to user needs), on 100~1000 nodes at a time.
  • Traditional server/client software suffers from a server bottleneck.
  • Multicast copying of ~GB images seems unstable (no free software?).
(Few) server - (many) client model
• The server can be a daemon process (no need to start it by hand).
• Performance is not scalable with the number of nodes: server bottleneck and network congestion.
Multicasting or broadcasting
• No server bottleneck.
• Gets maximum performance from a network whose switch fabric supports multicasting.
• A node failure does not affect the whole process very much, so it can be robust.
• But a failed node needs a re-transfer, and speed is governed by the slowest node, as in a RING topology.
• Uses UDP rather than TCP, so the application must handle transfer reliability itself.
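The trade-off between these topologies can be made concrete with a back-of-envelope timing model. All numbers below are illustrative assumptions (a 100BASE-TX link moving ~10 MB/s and a 2 GB image, roughly matching the hardware in these slides), not measurements from dolly+:

```python
# Rough transfer-time model (seconds) for the three topologies above.
LINK = 10.0    # usable link speed, MB/s (assumed)
IMAGE = 2000.0 # image size, MB (assumed)
CHUNK = 4.0    # chunk size, MB (dolly+ uses 4 MB chunks)

def server_client(nodes):
    # one server NIC is shared by all clients: time grows linearly with N
    return IMAGE * nodes / LINK

def multicast(nodes):
    # every node receives at once; the slowest node sets the pace
    return IMAGE / LINK

def ring(nodes):
    # each link carries the image once; pipelining adds only one chunk
    # of delay per hop, so time is nearly flat in N
    return (IMAGE + CHUNK * nodes) / LINK

for n in (1, 10, 100):
    print(n, server_client(n), multicast(n), ring(n))
```

For 100 nodes the model gives ~20000 s for server/client but only ~240 s for the ring, which is why the ring stays close to multicast performance without needing UDP.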
Dolly and Dolly+
Dolly
• A Linux application to copy/clone files and/or disk images among many PCs over a network.
• Dolly was originally developed by the CoPs project at ETH (Switzerland) and is free software.
Dolly+ features
• Transfers/copies sequential files (no 2 GB size limit) and/or normal files (optionally decompressed and untarred on the fly) over a TCP/IP network.
• Virtual RING network connection topology.
• Pipelining and multi-threading for speed-up.
• Fail-recovery mechanism for robust operation.
Dolly: virtual ring topology
• The server is the host holding the original image.
• The physical network connection can be anything you like; logically, Dolly chains the nodes into a ring specified by Dolly's config file.
• Although each transfer is only between two adjacent nodes, this exploits the maximum performance of a switching network with full-duplex ports.
• Good for a network complex built from many switches.
(Diagram: node PCs physically connected through hubs/switches vs. the logical (virtual) ring connection.)
Cascade topology
• The server bottleneck can be overcome.
• Cannot reach maximum network performance, but better than a many-clients-to-one-server topology.
• Weak against node failure: a failure spreads down the cascade as well, and is difficult to recover from.
Pipelining & multi-threading
• The file is split into 4 MB chunks (BOF ... 1 2 3 4 5 6 7 8 9 ... EOF).
• Each node runs 3 threads in parallel, so receiving a chunk, writing it to disk, and sending it to the next node all overlap.
(Diagram: chunks 5-9 simultaneously in flight between the server, node 1, and node 2.)
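The pipelining idea can be sketched in a few lines of Python. This is a toy model, not dolly+'s actual code (dolly+ is a C program): queues stand in for the disk and the downstream socket, and in-memory lists stand in for the network.

```python
import queue
import threading

def relay_node(recv, disk, send):
    """One ring node: three threads run in parallel, so chunk i is written
    to disk while chunk i+1 is received and chunk i-1 is forwarded."""
    q_disk, q_send = queue.Queue(maxsize=2), queue.Queue(maxsize=2)

    def receiver():                    # reads chunks from the upstream node
        for chunk in recv:
            q_disk.put(chunk)
            q_send.put(chunk)
        q_disk.put(None)               # EOF markers for the other threads
        q_send.put(None)

    def writer():                      # stands in for writing the disk image
        while (c := q_disk.get()) is not None:
            disk.append(c)

    def sender():                      # stands in for the downstream socket
        while (c := q_send.get()) is not None:
            send.append(c)

    threads = [threading.Thread(target=t) for t in (receiver, writer, sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Simulate server -> node 1 -> node 2 with in-memory "wires".
chunks = [b"chunk%d" % i for i in range(9)]
disk1, wire12, disk2, wire2x = [], [], [], []
relay_node(chunks, disk1, wire12)      # node 1 stores and forwards
relay_node(wire12, disk2, wire2x)      # node 2 receives node 1's stream
print(disk2 == chunks)                 # → True
```

Because every node stores and forwards concurrently, the whole ring finishes only about one chunk-time per hop later than a single point-to-point copy.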
Fail-recovery mechanism
• In a RING (series-connection) topology, a single node failure could be a show-stopper.
• Dolly+ provides an automatic "short-cut" mechanism for node problems:
  • The upstream node detects the trouble by a send timeout.
  • The upstream node then negotiates with the downstream node to reconnect and re-transfer the file chunk.
• The RING topology makes this easy to implement.
Re-transfer in short-cutting
(Diagram: the file chunks 1 2 3 4 5 6 7 8 9 ... in flight as on the previous slide; when node 1 fails, the server reconnects directly to node 2 and re-sends the affected chunks.)
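The short-cut logic itself is simple. Below is a minimal sketch with the network abstracted away: `send` is an injected function that raises on a dead node (real dolly+ detects this with a TCP send timeout), and the hostnames are illustrative.

```python
def forward_chunk(chunk, downstream, send):
    """Deliver a chunk to the nearest live node in ring order.
    If the immediate downstream node is dead, 'short-cut' it by
    reconnecting to the node after it and re-sending the same chunk."""
    for host in downstream:            # ring order after this node
        try:
            send(host, chunk)
            return host                # delivered; ring repaired here
        except (TimeoutError, ConnectionError):
            continue                   # dead node: skip it and go on
    raise RuntimeError("no reachable downstream node")

# Toy wire: n001 is dead, so the chunk short-cuts to n002.
dead, delivered = {"n001"}, []
def send(host, chunk):
    if host in dead:
        raise TimeoutError(host)
    delivered.append((host, chunk))

print(forward_chunk(b"chunk-7", ["n001", "n002"], send))  # → n002
```

Only the one chunk in flight needs re-transfer; everything already stored downstream of the failed node is untouched.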
Dolly+: how do you start it on Linux
Server side (which has the original file):
  % dollyS [-v] -f config_file
Node side:
  % dollyC [-v]

Config file example (annotations on the right):
  iofiles 3               # number of files to transfer
  /dev/hda1 > /tmp/dev/hda1
  /data/file.gz >> /data/file
  boot.tar.Z >> /boot
  server n000.kek.jp      # server name
  firstclient n001.kek.jp
  lastclient n020.kek.jp
  client 20               # number of client nodes
  n001
  n002
  :
  n020                    # client names
  endconfig               # end code

The left side of '>' is the input file on the server; the right side is the output file on the clients. '>' means dolly+ does not modify the image; '>>' indicates dolly+ should cook (decompress, untar, ...) the file according to its name.
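To make the config layout above concrete, here is a hypothetical parser for it in Python. It is only an illustration of the format (real dolly+ is a C program with its own parser), and it ignores keywords it does not know:

```python
def parse_dolly_config(text):
    """Parse the minimal dolly+ config layout shown above."""
    cfg = {"iofiles": [], "clients": []}
    lines = iter(text.splitlines())
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue
        key = parts[0]
        if key == "iofiles":                 # next N lines: input > output
            cfg["iofiles"] = [next(lines) for _ in range(int(parts[1]))]
        elif key in ("server", "firstclient", "lastclient"):
            cfg[key] = parts[1]
        elif key == "client":                # next N lines: client names
            cfg["clients"] = [next(lines).strip() for _ in range(int(parts[1]))]
        elif key == "endconfig":
            break
    return cfg

example = """iofiles 3
/dev/hda1 > /tmp/dev/hda1
/data/file.gz >> /data/file
boot.tar.Z >> /boot
server n000.kek.jp
firstclient n001.kek.jp
lastclient n003.kek.jp
client 3
n001
n002
n003
endconfig"""
cfg = parse_dolly_config(example)
print(cfg["server"], len(cfg["clients"]))  # → n000.kek.jp 3
```

The `server` entry plus the ordered `client` list is all dolly+ needs to build the virtual ring: each host forwards only to the next name in the list.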
Performance of dolly+
• Less than 5 min expected for 100 nodes!
• HW: Fujitsu TS225, Pentium III 1 GHz x 2, SCSI disk, 512 MB memory, 100BaseT network
Dolly+ transfer speed: scalability with image size
Hardware spec (server & nodes): 1 GHz Pentium III x 2, IDE-ATA/100 disk, 100BASE-TX network, 256 MB memory
(Plot: transferred bytes (MB) vs. elapsed time (sec), with 10 MB/s and 7 MB/s reference lines.)

  setup               elapsed time   speed
  1 server-1 node     230 sec        8.2 MB/s
  1 server-2 nodes    252 sec        7.4 MB/s x2
  1 server-7 nodes    266 sec        7.0 MB/s x7
  1 server-10 nodes   260 sec        7.2 MB/s x10
How does dolly+ start after rebooting? (1)
• Nodes broadcast over the LAN in search of an installation server.
• The PXE server responds to each node with its IP address and the kernel-download server.
• The kernel and a "RAM disk / FS" are multicast-TFTP'ed to the nodes, and the kernel starts.
• The kernel hands off to an installation script, which runs a disk tool and dolly+.
How does dolly+ start after rebooting? (2)
• The script partitions the hard drive, creates file systems, and starts the dolly+ client on the node.
• You start the dolly+ master on the master host to begin the disk-clone process.
• The script then configures the individual nodes (host name, IP address, etc.).
• The node is then ready to boot from its hard drive for the first time.
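The node-side sequence above can be summarized as an ordered plan. The sketch below only returns the commands instead of executing them, and every command name and flag is an illustrative placeholder, not the actual install script:

```python
def install_plan(disk="/dev/hda", node="n001"):
    """Hypothetical node-side install sequence, in the order the
    slides describe; a real script would run these via the shell."""
    return [
        ["sfdisk", disk],         # partition the hard drive
        ["mke2fs", disk + "1"],   # create file systems
        ["dollyC"],               # receive the disk image over the ring
        ["hostname", node],       # per-node configuration (name, IP, ...)
        ["reboot"],               # first boot from the local hard drive
    ]

plan = install_plan()
print(plan[0][0])  # → sfdisk
```

Keeping the plan as data makes it easy to log, dry-run, or re-order the steps per cluster.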
Remote Execution
Remote execution
• Administrators sometimes need to issue a command to all nodes urgently.
• Remote execution could use rsh/ssh/pikt/cfengine/SUT (mpich)* ...
• The key points are:
  • Making it easy to see the execution result (failure or success) at a glance.
  • Parallel execution across the nodes.
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
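Both points, parallel fan-out and results collected for at-a-glance inspection, fit in a short Python sketch. The ssh wrapper is just one assumed transport (rsh, pikt, etc. would also do), and the function is illustrative, not part of WANI:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_everywhere(nodes, command, shell=("ssh",), timeout=30):
    """Run one command on many nodes in parallel and gather
    (node, exit status, output) triples so failures stand out."""
    def one(node):
        try:
            r = subprocess.run(list(shell) + [node, command],
                               capture_output=True, text=True,
                               timeout=timeout)
            return node, r.returncode, (r.stdout + r.stderr).strip()
        except subprocess.TimeoutExpired:
            return node, None, "TIMEOUT"   # hung node, flagged explicitly

    with ThreadPoolExecutor(max_workers=min(64, len(nodes) or 1)) as pool:
        return list(pool.map(one, nodes))
```

A summary like `[n for n, rc, _ in results if rc != 0]` then gives the failed nodes at a glance, which is exactly the property the slide asks for.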
WANI
• A web-based remote command executor.
• Easy to select the nodes concerned.
• Easy to specify a script or type in commands.
• Issues the commands to the nodes in parallel.
• Collects the results after error/failure detection.
• Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
WANI is implemented on the `Webmin' GUI.
(Screenshots: start page, command input, node selection.)
(Screenshot: command execution results — host names with results from 200 nodes on one page, and links to switch to other pages.)
Error detection
• Exit code
• "fail/failure/error" words (`grep -i`)
• sys_errlist[] (perror) message-list check
• `strings /bin/sh` output check
Frame colors represent progress: white = initial, yellow = command started, black = finished; the background color marks the detected result.
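These heuristics combine naturally into a single predicate. The sketch below is an illustration of the approach, not WANI's code, and the perror-style message list is only a small assumed subset of sys_errlist[]:

```python
import re

# the `grep -i` word list from the slide
ERROR_WORDS = re.compile(r"fail|failure|error", re.IGNORECASE)

# illustrative subset of sys_errlist[] / perror messages
ERRNO_MSGS = ("No such file or directory", "Permission denied",
              "No space left on device")

def looks_failed(exit_code, output):
    """Flag a node's command as failed when any heuristic trips."""
    return (exit_code != 0
            or ERROR_WORDS.search(output) is not None
            or any(msg in output for msg in ERRNO_MSGS))

print(looks_failed(0, "tar: /boot: Permission denied"))  # → True
print(looks_failed(0, "all filesystems clean"))          # → False
```

Checking output text as well as the exit code matters because many Unix tools exit 0 even after printing an error for one file.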
(Screenshot: clicking a result cell shows its stdout or stderr output.)
(Diagram: WANI architecture — the web browser talks to the Webmin server; piktc sends commands via piktc_svc on the PIKT server to the node hosts for execution; results return through lpr/lpd and a print_filter, where the error detector marks errors, onto the command-result pages.)
Status Monitoring
• Cfengine / PIKT*1 / PICA*2
• Ganglia*3
*1 http://pikt.org
*2 http://pica.sourceforge.net/wtf.html
*3 http://ganglia.sourceforge.net
Conclusion
• Installation: dolly+ — install/update/switch systems very quickly.
• Remote command execution:
  • "Results at a glance" is important for quick iteration.
  • Parallel execution is important.
• Status monitoring / configuration management: not matured yet.
• The software is available from http://corvus.kek.jp/~manabe/pcf/dolly
• Thank you for reading!