Administration Tools for Managing Large Scale Linux Cluster
CRC, KEK, Japan
S. Kawabata, A. Manabe
Linux PC Clusters in KEK
PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
PC Cluster 3: Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
PC Cluster 4: 1U server, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
PC Cluster 5: 3U blade server, LP Pentium III 700 MHz, 40 CPUs (40 nodes)
ACAT2002
PC clusters
• More than 400 nodes are already installed, counting only middle-sized or larger PC clusters.
• Each PC cluster is managed by an individual user group.
• A major experiment group plans to install several hundred blade-server nodes this year.
Center Machine
• The KEK Computer Center plans to have more than 1000 nodes in the near future (~2004).
• The system will be installed under a ~4-year rental contract.
• It will be shared among many user groups (not dedicated to any one group).
• The system partitioning will vary according to each group's demand for CPU power.
PC cluster for system R&D
• Fujitsu TS225, 50 nodes
• Pentium III 1 GHz x 2 CPUs
• 512 MB memory
• 31 GB disk
• 100Base-TX x 2
• 1U rack-mount model
• RS-232C x 2
• Remote BIOS setting
• Remote reset/power-off
Necessary admin tools
• Installation/update
• Command execution
• Configuration
• Status monitoring
Installation tool
Installation tool (1): image cloning — install the system and applications by copying a disk-partition image to the nodes.
Installation tool (2): package-based — a package server, with a package-information DB, serves the package archive to the clients.
Remote installation via network
• Disk-image cloning
  • SystemImager (VA) http://systemimager.sourceforge.net/
  • CATS-i (Soongsil Univ.)
  • CloneIt http://www.ferzkopp.net/Software/CloneIt/
  • Commercial: ImageCast, Ghost, ...
• Package/application installation
  • Kickstart + rpm (Red Hat)
  • LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
  • Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
Dolly+
• We developed `dolly+`, an "image cloning via network" installer.
• Why another one?
  • We install/update frequently (according to user needs), on 100~1000 nodes at a time.
  • Traditional server/client software suffers from a server bottleneck.
  • Multicast copying of ~GB images seems unstable (no free software?).
(Few) server - (many) client model
• The server can be a daemon process (no need to start it by hand).
• Performance is not scalable with the number of nodes: server bottleneck and network congestion.
Multicasting or broadcasting
• No server bottleneck.
• Gets maximum performance from a network whose switch fabric supports multicasting.
• A node failure does not affect the whole process very much, so it can be robust.
• But a failed node needs a re-transfer, and speed is governed by the slowest node, as in a RING topology.
• Uses UDP rather than TCP, so the application must handle transfer reliability itself.
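The trade-off between these topologies can be made concrete with a back-of-envelope timing model. All numbers below are illustrative assumptions (a 100BASE-TX link moving ~10 MB/s and a 2 GB image, roughly matching the hardware in these slides), not measurements from dolly+:

```python
# Rough transfer-time model (seconds) for the three topologies above.
LINK = 10.0    # usable link speed, MB/s (assumed)
IMAGE = 2000.0 # image size, MB (assumed)
CHUNK = 4.0    # chunk size, MB (dolly+ uses 4 MB chunks)

def server_client(nodes):
    # one server NIC is shared by all clients: time grows linearly with N
    return IMAGE * nodes / LINK

def multicast(nodes):
    # every node receives at once; the slowest node sets the pace
    return IMAGE / LINK

def ring(nodes):
    # each link carries the image once; pipelining adds only one chunk
    # of delay per hop, so time is nearly flat in N
    return (IMAGE + CHUNK * nodes) / LINK

for n in (1, 10, 100):
    print(n, server_client(n), multicast(n), ring(n))
```

For 100 nodes the model gives ~20000 s for server/client but only ~240 s for the ring, which is why the ring stays close to multicast performance without needing UDP.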
Dolly and Dolly+
Dolly
• A Linux application to copy/clone files and/or disk images among many PCs over a network.
• Dolly was originally developed by the CoPs project at ETH (Switzerland) and is free software.
Dolly+ features
• Transfers/copies sequential files (no 2 GB size limit) and/or normal files (optionally decompressed and untarred on the fly) over a TCP/IP network.
• Virtual RING network connection topology.
• Pipelining and multi-threading for speed-up.
• Fail-recovery mechanism for robust operation.
Dolly: virtual ring topology
• The server is the host holding the original image.
• The physical network connection can be anything you like; logically, Dolly chains the nodes into a ring specified by Dolly's config file.
• Although each transfer is only between two adjacent nodes, this exploits the maximum performance of a switching network with full-duplex ports.
• Good for a network complex built from many switches.
(Diagram: node PCs physically connected through hubs/switches vs. the logical (virtual) ring connection.)
Cascade topology
• The server bottleneck can be overcome.
• Cannot reach maximum network performance, but better than a many-clients-to-one-server topology.
• Weak against node failure: a failure spreads down the cascade as well, and is difficult to recover from.
Pipelining & multi-threading
• The file is split into 4 MB chunks (BOF ... 1 2 3 4 5 6 7 8 9 ... EOF).
• Each node runs 3 threads in parallel, so receiving a chunk, writing it to disk, and sending it to the next node all overlap.
(Diagram: chunks 5-9 simultaneously in flight between the server, node 1, and node 2.)
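The pipelining idea can be sketched in a few lines of Python. This is a toy model, not dolly+'s actual code (dolly+ is a C program): queues stand in for the disk and the downstream socket, and in-memory lists stand in for the network.

```python
import queue
import threading

def relay_node(recv, disk, send):
    """One ring node: three threads run in parallel, so chunk i is written
    to disk while chunk i+1 is received and chunk i-1 is forwarded."""
    q_disk, q_send = queue.Queue(maxsize=2), queue.Queue(maxsize=2)

    def receiver():                    # reads chunks from the upstream node
        for chunk in recv:
            q_disk.put(chunk)
            q_send.put(chunk)
        q_disk.put(None)               # EOF markers for the other threads
        q_send.put(None)

    def writer():                      # stands in for writing the disk image
        while (c := q_disk.get()) is not None:
            disk.append(c)

    def sender():                      # stands in for the downstream socket
        while (c := q_send.get()) is not None:
            send.append(c)

    threads = [threading.Thread(target=t) for t in (receiver, writer, sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Simulate server -> node 1 -> node 2 with in-memory "wires".
chunks = [b"chunk%d" % i for i in range(9)]
disk1, wire12, disk2, wire2x = [], [], [], []
relay_node(chunks, disk1, wire12)      # node 1 stores and forwards
relay_node(wire12, disk2, wire2x)      # node 2 receives node 1's stream
print(disk2 == chunks)                 # → True
```

Because every node stores and forwards concurrently, the whole ring finishes only about one chunk-time per hop later than a single point-to-point copy.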
Fail-recovery mechanism
• In a RING (series-connection) topology, a single node failure could be a show-stopper.
• Dolly+ provides an automatic "short-cut" mechanism for node problems:
  • The upstream node detects the trouble by a send timeout.
  • The upstream node then negotiates with the downstream node to reconnect and re-transfer the file chunk.
• The RING topology makes this easy to implement.
Re-transfer in short-cutting
(Diagram: the file chunks 1 2 3 4 5 6 7 8 9 ... in flight as on the previous slide; when node 1 fails, the server reconnects directly to node 2 and re-sends the affected chunks.)
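The short-cut logic itself is simple. Below is a minimal sketch with the network abstracted away: `send` is an injected function that raises on a dead node (real dolly+ detects this with a TCP send timeout), and the hostnames are illustrative.

```python
def forward_chunk(chunk, downstream, send):
    """Deliver a chunk to the nearest live node in ring order.
    If the immediate downstream node is dead, 'short-cut' it by
    reconnecting to the node after it and re-sending the same chunk."""
    for host in downstream:            # ring order after this node
        try:
            send(host, chunk)
            return host                # delivered; ring repaired here
        except (TimeoutError, ConnectionError):
            continue                   # dead node: skip it and go on
    raise RuntimeError("no reachable downstream node")

# Toy wire: n001 is dead, so the chunk short-cuts to n002.
dead, delivered = {"n001"}, []
def send(host, chunk):
    if host in dead:
        raise TimeoutError(host)
    delivered.append((host, chunk))

print(forward_chunk(b"chunk-7", ["n001", "n002"], send))  # → n002
```

Only the one chunk in flight needs re-transfer; everything already stored downstream of the failed node is untouched.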
Dolly+: how do you start it on Linux
Server side (which has the original file):
  % dollyS [-v] -f config_file
Node side:
  % dollyC [-v]

Config file example (annotations on the right):
  iofiles 3               # number of files to transfer
  /dev/hda1 > /tmp/dev/hda1
  /data/file.gz >> /data/file
  boot.tar.Z >> /boot
  server n000.kek.jp      # server name
  firstclient n001.kek.jp
  lastclient n020.kek.jp
  client 20               # number of client nodes
  n001
  n002
  :
  n020                    # client names
  endconfig               # end code

The left side of '>' is the input file on the server; the right side is the output file on the clients. '>' means dolly+ does not modify the image; '>>' indicates dolly+ should cook (decompress, untar, ...) the file according to its name.
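To make the config layout above concrete, here is a hypothetical parser for it in Python. It is only an illustration of the format (real dolly+ is a C program with its own parser), and it ignores keywords it does not know:

```python
def parse_dolly_config(text):
    """Parse the minimal dolly+ config layout shown above."""
    cfg = {"iofiles": [], "clients": []}
    lines = iter(text.splitlines())
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue
        key = parts[0]
        if key == "iofiles":                 # next N lines: input > output
            cfg["iofiles"] = [next(lines) for _ in range(int(parts[1]))]
        elif key in ("server", "firstclient", "lastclient"):
            cfg[key] = parts[1]
        elif key == "client":                # next N lines: client names
            cfg["clients"] = [next(lines).strip() for _ in range(int(parts[1]))]
        elif key == "endconfig":
            break
    return cfg

example = """iofiles 3
/dev/hda1 > /tmp/dev/hda1
/data/file.gz >> /data/file
boot.tar.Z >> /boot
server n000.kek.jp
firstclient n001.kek.jp
lastclient n003.kek.jp
client 3
n001
n002
n003
endconfig"""
cfg = parse_dolly_config(example)
print(cfg["server"], len(cfg["clients"]))  # → n000.kek.jp 3
```

The `server` entry plus the ordered `client` list is all dolly+ needs to build the virtual ring: each host forwards only to the next name in the list.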
Performance of dolly+
• Less than 5 min expected for 100 nodes!
• HW: Fujitsu TS225, Pentium III 1 GHz x 2, SCSI disk, 512 MB memory, 100BaseT network
Dolly+ transfer speed: scalability with image size
Hardware spec (server & nodes): 1 GHz Pentium III x 2, IDE-ATA/100 disk, 100BASE-TX network, 256 MB memory
(Plot: transferred bytes (MB) vs. elapsed time (sec), with 10 MB/s and 7 MB/s reference lines.)

  setup               elapsed time   speed
  1 server-1 node     230 sec        8.2 MB/s
  1 server-2 nodes    252 sec        7.4 MB/s x2
  1 server-7 nodes    266 sec        7.0 MB/s x7
  1 server-10 nodes   260 sec        7.2 MB/s x10
How does dolly+ start after rebooting? (1)
• Nodes broadcast over the LAN in search of an installation server.
• The PXE server responds to each node with its IP address and the kernel-download server.
• The kernel and a "RAM disk / FS" are multicast-TFTP'ed to the nodes, and the kernel starts.
• The kernel hands off to an installation script, which runs a disk tool and dolly+.
How does dolly+ start after rebooting? (2)
• The script partitions the hard drive, creates file systems, and starts the dolly+ client on the node.
• You start the dolly+ master on the master host to begin the disk-clone process.
• The script then configures the individual nodes (host name, IP address, etc.).
• The node is then ready to boot from its hard drive for the first time.
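The node-side sequence above can be summarized as an ordered plan. The sketch below only returns the commands instead of executing them, and every command name and flag is an illustrative placeholder, not the actual install script:

```python
def install_plan(disk="/dev/hda", node="n001"):
    """Hypothetical node-side install sequence, in the order the
    slides describe; a real script would run these via the shell."""
    return [
        ["sfdisk", disk],         # partition the hard drive
        ["mke2fs", disk + "1"],   # create file systems
        ["dollyC"],               # receive the disk image over the ring
        ["hostname", node],       # per-node configuration (name, IP, ...)
        ["reboot"],               # first boot from the local hard drive
    ]

plan = install_plan()
print(plan[0][0])  # → sfdisk
```

Keeping the plan as data makes it easy to log, dry-run, or re-order the steps per cluster.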
Remote Execution
Remote execution
• Administrators sometimes need to issue a command to all nodes urgently.
• Remote execution could use rsh/ssh/pikt/cfengine/SUT (mpich)* ...
• The key points are:
  • Making it easy to see the execution result (failure or success) at a glance.
  • Parallel execution across the nodes.
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
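Both points, parallel fan-out and results collected for at-a-glance inspection, fit in a short Python sketch. The ssh wrapper is just one assumed transport (rsh, pikt, etc. would also do), and the function is illustrative, not part of WANI:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_everywhere(nodes, command, shell=("ssh",), timeout=30):
    """Run one command on many nodes in parallel and gather
    (node, exit status, output) triples so failures stand out."""
    def one(node):
        try:
            r = subprocess.run(list(shell) + [node, command],
                               capture_output=True, text=True,
                               timeout=timeout)
            return node, r.returncode, (r.stdout + r.stderr).strip()
        except subprocess.TimeoutExpired:
            return node, None, "TIMEOUT"   # hung node, flagged explicitly

    with ThreadPoolExecutor(max_workers=min(64, len(nodes) or 1)) as pool:
        return list(pool.map(one, nodes))
```

A summary like `[n for n, rc, _ in results if rc != 0]` then gives the failed nodes at a glance, which is exactly the property the slide asks for.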
WANI
• A web-based remote command executor.
• Easy to select the nodes concerned.
• Easy to specify a script or type in commands.
• Issues the commands to the nodes in parallel.
• Collects the results after error/failure detection.
• Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
WANI is implemented on the `Webmin' GUI.
(Screenshots: start page, command input, node selection.)
(Screenshot: command execution results — host names with results from 200 nodes on one page, and links to switch to other pages.)
Error detection
• Exit code
• "fail/failure/error" words (`grep -i`)
• sys_errlist[] (perror) message-list check
• `strings /bin/sh` output check
Frame colors represent progress: white = initial, yellow = command started, black = finished; the background color marks the detected result.
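These heuristics combine naturally into a single predicate. The sketch below is an illustration of the approach, not WANI's code, and the perror-style message list is only a small assumed subset of sys_errlist[]:

```python
import re

# the `grep -i` word list from the slide
ERROR_WORDS = re.compile(r"fail|failure|error", re.IGNORECASE)

# illustrative subset of sys_errlist[] / perror messages
ERRNO_MSGS = ("No such file or directory", "Permission denied",
              "No space left on device")

def looks_failed(exit_code, output):
    """Flag a node's command as failed when any heuristic trips."""
    return (exit_code != 0
            or ERROR_WORDS.search(output) is not None
            or any(msg in output for msg in ERRNO_MSGS))

print(looks_failed(0, "tar: /boot: Permission denied"))  # → True
print(looks_failed(0, "all filesystems clean"))          # → False
```

Checking output text as well as the exit code matters because many Unix tools exit 0 even after printing an error for one file.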
(Screenshot: clicking a result cell shows its stdout or stderr output.)
(Diagram: WANI architecture — the web browser talks to the Webmin server; piktc sends commands via piktc_svc on the PIKT server to the node hosts for execution; results return through lpr/lpd and a print_filter, where the error detector marks errors, onto the command-result pages.)
Status Monitoring
• Cfengine / PIKT*1 / PICA*2
• Ganglia*3
*1 http://pikt.org
*2 http://pica.sourceforge.net/wtf.html
*3 http://ganglia.sourceforge.net
Conclusion
• Installation: dolly+ — install/update/switch systems very quickly.
• Remote command execution:
  • "Results at a glance" is important for quick iteration.
  • Parallel execution is important.
• Status monitoring / configuration management: not matured yet.
• The software is available from http://corvus.kek.jp/~manabe/pcf/dolly
• Thank you for reading!