Beowulf Software
Monitoring and Administration
• Beowulf Watch
  • http://www.kaybee.org/~kirk/html/linux.html
  • Tcl/Tk and rsh based
  • reports memory, user, and process information
• EPCKPT Checkpoint for Linux
  • http://www.cos.ufrj.br/~edpin/epckpt
  • checkpointing kernel patch for Linux
  • saves a snapshot of a running process for later restart
  • useful for fault tolerance, process tracing/debugging, transaction rollback, and migration
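The "save a snapshot, restart later" idea can be sketched at application level. This is not how EPCKPT works (it is a kernel patch that snapshots an arbitrary process); the sketch below only illustrates the restart pattern, and the function names, the pickle format, and the `stop_after` crash simulation are all assumptions made for the example.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write the state snapshot to disk, replacing any older snapshot."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # Rename over the old file so a crash never leaves a half-written snapshot.
    os.replace(tmp, path)

def load_checkpoint(path, default):
    """Return the saved snapshot, or `default` if none exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run(path, total_steps, stop_after=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    `stop_after` (hypothetical) simulates a failure after that many
    steps of the current run; the function then returns None.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    done = 0
    while state["step"] < total_steps:
        if stop_after is not None and done >= stop_after:
            return None  # simulated failure
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
        done += 1
    return state["acc"]
```

Calling `run` a second time with the same path resumes from the last saved step instead of starting over, which is the fault-tolerance use case listed above.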
Monitoring and Administration
• lperfex
  • http://www.osc.edu/~troy/lperfex
  • performance monitoring and analysis tool
  • for Linux/IA32 systems
  • uses information from the P-pro/PIII status registers (???)
• Compaq CMU
  • http://www.compaq.com/solutions/customsystems/hps/linux-cmu.html
  • Disk Image Cloning
    • can do network installation and disk partitioning
  • Console Broadcasting
    • serial console connection to each compute node
Monitoring and Administration
• SCMS
  • http://smile.cpe.ku.ac.th/software/scms/index.html
  • Parallel Unix commands
    • pls, pps, …
  • Display node status
    • CPU, memory, device info
  • Administration
    • shutdown, reboot, remote login, remote command execution
• FAI (Fully Automatic Installation)
  • http://www.informatik.uni-koeln.de/fai
  • automatic installation across cluster PCs
  • for Debian Linux, no interaction needed
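A parallel Unix command in the spirit of pls/pps runs the same command on every node at once and collects the output per node. The sketch below is not SCMS code; it assumes passwordless `ssh` to each node (an rsh-based cluster would swap the argv builder), and `run_everywhere`, `remote_argv`, and `local_argv` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def remote_argv(node, command):
    # Assumption: passwordless ssh is configured for every node.
    return ["ssh", node, command]

def local_argv(node, command):
    # Dry-run helper: echoes "node: command" from a local Python process
    # instead of contacting the node.
    code = "import sys; print(sys.argv[1] + ': ' + sys.argv[2])"
    return [sys.executable, "-c", code, node, command]

def run_everywhere(nodes, command, build_argv=remote_argv, timeout=30):
    """Run `command` on all nodes concurrently.

    Returns {node: (returncode, stdout)}.
    """
    def run_one(node):
        proc = subprocess.run(
            build_argv(node, command),
            capture_output=True, text=True, timeout=timeout,
        )
        return node, (proc.returncode, proc.stdout.strip())

    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        return dict(pool.map(run_one, nodes))
```

For example, `run_everywhere(["node1", "node2"], "ps aux")` would behave roughly like a cluster-wide `ps`, while passing `build_argv=local_argv` allows a dry run without any nodes.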
Global Process Space
• BPROC
  • http://beowulf.gsfc.nasa.gov/software/bproc.html
  • starts remote processes without remote login
  • Ghost processes implemented with kernel threads
    • each ghost process on the master node corresponds to a real process running on a remote node
  • PID masquerading
    • a daemon controls operations related to masqueraded PIDs
  • Starting processes
    • rexec: similar to the execve syscall; nodes must be homogeneous
    • move or rfork: saves the process's memory regions and recreates them on the remote node
    • can transport the binary and anything mmap'ed (e.g. DLLs)
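The ghost-process bookkeeping can be pictured as a table on the master node that maps each ghost PID to the real process it stands for. The following is a toy user-space model only, not BPROC's kernel implementation; `GhostTable`, `RemoteProcess`, and the starting PID are hypothetical.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteProcess:
    node: str      # node where the real process runs
    real_pid: int  # PID of the real process on that node

class GhostTable:
    """Toy model of the master's view: ghost PID -> real remote process."""

    def __init__(self, first_pid=1000):
        self._next_pid = itertools.count(first_pid)
        self._ghosts = {}

    def register(self, node, real_pid):
        """Create a ghost entry for a process started on `node`."""
        ghost_pid = next(self._next_pid)
        self._ghosts[ghost_pid] = RemoteProcess(node, real_pid)
        return ghost_pid

    def route_signal(self, ghost_pid, signum):
        """Translate kill(ghost_pid, signum) on the master into the
        (node, real_pid, signum) request a daemon would forward."""
        target = self._ghosts[ghost_pid]
        return (target.node, target.real_pid, signum)

    def reap(self, ghost_pid):
        """Drop the ghost entry once the real process has exited."""
        return self._ghosts.pop(ghost_pid)
```

Two processes may share the same real PID on different nodes; the ghost PID is what keeps them distinct in the single, cluster-wide process space.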
Global Process Space
• bexec (brexec)
  • ftp://ftp.parl.clemson.edu/pub/beowulf/bexec-1.1.2.tgz
  • uses a daemon to start tasks and deliver signals
  • user-level implementation
Load-balancing & Allocations
• job manager
  • http://bond.imm.dtu.dk/jobd
  • load balancing and queue control of jobs
  • addresses the problems of batch-queue computing systems
• Condor
  • http://www.cs.wisc.edu/project/condor
  • load balancing over a large number of systems owned by different people
  • process migration, node status monitoring, resource allocation
• Condor + BPROC ??
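As a rough picture of what "load balancing" means for a job queue, the sketch below hands each job to whichever node currently has the least accumulated work. This greedy policy is an assumption chosen for illustration; it is not the algorithm used by jobd or Condor, and `assign_jobs` and the cost estimates are hypothetical.

```python
import heapq

def assign_jobs(job_costs, nodes):
    """Greedy least-loaded placement.

    job_costs: iterable of (job_name, estimated_cost) in queue order.
    Returns {node: [job_name, ...]}.
    """
    heap = [(0, node) for node in nodes]  # (accumulated load, node)
    heapq.heapify(heap)
    placement = {node: [] for node in nodes}
    for job, cost in job_costs:
        load, node = heapq.heappop(heap)  # node with the smallest load
        placement[node].append(job)
        heapq.heappush(heap, (load + cost, node))
    return placement
```

A real scheduler would also react to node status changes and migrate running processes, as the Condor bullets note; this sketch covers only the initial allocation.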
Cluster Networking
• Channel Bonding
  • http://pdsf.nersc.gov/linux
  • allows multiple network devices to be used as one to increase bandwidth
  • low-level approach
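One way to use several devices as one is to spread outgoing frames across them in turn. The sketch below simulates that in user space; the round-robin policy and the device names are assumptions for illustration, since the slide does not say which distribution policy is used, and real bonding happens in the kernel network driver layer.

```python
from itertools import cycle

def bond_round_robin(frames, devices):
    """Distribute frames across devices in turn.

    Returns {device: [frames sent on that device]}.
    """
    queues = {dev: [] for dev in devices}
    for frame, dev in zip(frames, cycle(devices)):
        queues[dev].append(frame)
    return queues
```

With two devices each link carries about half the frames, which is where the bandwidth gain comes from; the receiver then has to cope with frames arriving on different links.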
File Systems
• GFS (Global File System)
  • http://www.globalfilesystem.org
  • multiple nodes can share storage over the network
• SFS (Secure File System)
  • http://elbe.borg.umn.edu
  • stores files securely on remote sites using normal network protocols (FTP, HTTP, NFS, …)
  • uses smartcards for authentication and signatures
File Systems
• PVFS (Parallel Virtual File System)
  • http://ece.clemson.edu/parl/pvfs
  • improves performance of coarse-grain parallel applications with large I/O requirements
  • operates at the user level
  • no kernel modifications needed
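A parallel file system gains I/O bandwidth by spreading one file's bytes over several I/O servers. The sketch below assumes a simple round-robin striping layout with a fixed stripe size; the slide does not describe PVFS's actual layout, so treat `locate`, `split_request`, and the parameters as a hypothetical illustration of the idea.

```python
def locate(offset, stripe_size, n_servers):
    """Map a file byte offset to (server index, offset within that server's data)."""
    stripe_index = offset // stripe_size
    server = stripe_index % n_servers
    local_offset = (stripe_index // n_servers) * stripe_size + offset % stripe_size
    return server, local_offset

def split_request(offset, length, stripe_size, n_servers):
    """Break a contiguous byte range into per-server pieces.

    Returns a list of (server, local_offset, nbytes) tuples in file order.
    """
    pieces = []
    end = offset + length
    while offset < end:
        server, local = locate(offset, stripe_size, n_servers)
        # Stop at the end of the current stripe or the end of the request.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((server, local, chunk))
        offset += chunk
    return pieces
```

A large read therefore turns into several smaller requests that different servers can serve at the same time, which is what benefits applications with large I/O requirements.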