Beowulf

Presentation Transcript


  1. Beowulf Pēteris Lediņš University of Latvia Nov 16, 2004

  2. What is Beowulf • Historically: Beowulf is the earliest surviving epic poem written in English. It is a story about a hero of great strength and courage who defeated a monster called Grendel. Famed was this Beowulf: far flew the boast of him, son of Scyld, in the Scandian lands. So becomes it a youth to quit him well with his father's friends, by fee and gift, that to aid him, aged, in after days, come warriors willing, should war draw nigh, liegemen loyal: by lauded deeds shall an earl have honor in every clan. • Computer related: the Beowulf architecture was first introduced by scientists (Thomas Sterling and Don Becker) at NASA who were trying to build a supercomputer-class machine out of inexpensive off-the-shelf computer technology.

  3. COW • COW – Cluster of Workstations • A set of workstations plus one server that exports the users' /home and /usr/local directories via NFS • All clients have a local *nix installation and are reachable via rsh (remote shell) • Workstations are used by people during the day and by computing processes at night
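
As a sketch only (the hostname and subnet are placeholders, not from the talk), the server's /etc/exports and a client's /etc/fstab for such a COW might contain:

# on the server, /etc/exports: share the directories with the workstation subnet
/home       192.168.1.0/255.255.255.0(rw,sync)
/usr/local  192.168.1.0/255.255.255.0(ro,sync)

# on each client, /etc/fstab: mount them from the server at boot
server:/home       /home       nfs  defaults  0 0
server:/usr/local  /usr/local  nfs  defaults  0 0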

  4. COW vs Beowulf • Ordinary Beowulf nodes have no video cards, keyboards, etc. • Users see Beowulf as one computer • Global PID space • Beowulf is dedicated to parallel computing

  5. Beowulf vs. <openMosix | many-processor server> • openMosix – in openMosix parallel computing is supplied by the kernel; Beowulf's approach is based on message passing between instances of a process. • Many-processor server – with Beowulf each node has its own memory space, while a many-processor server provides one memory space in hardware.

  6. More or less precise def. The accepted definition of a true beowulf is that it is a cluster of computers (``compute nodes'') interconnected with a network with the following characteristics: • The nodes are dedicated to the beowulf and serve no other purpose. • The network(s) on which the nodes reside is(are) dedicated to the beowulf and serve(s) no other purpose. • An essential part of the beowulf definition (that distinguishes it from, for example, a vendor-produced massively parallel processor – MPP – system) is that its compute nodes are mass-produced commodities, readily available ``off the shelf'', and hence relatively inexpensive. • Ordinary network – at least to the extent that it must integrate with popular computers and hence must interconnect through a standard bus. • The nodes all run open source software. • The resulting cluster is used for High Performance Computing (HPC, also called ``parallel supercomputing'' and other names).

  7. Hardware classification • From Beowulf how-to: • A CLASS I Beowulf is a machine that can be assembled from parts found in at least 3 nationally/globally circulated advertising catalogs – “Computer shopper test” • A CLASS II Beowulf is simply any machine that does not pass the Computer Shopper certification test

  8. Beowulfery • Explore the possibilities of building supercomputers out of the mass-market electronic equivalent of coat hangers and chewing gum. • Open source • Specific tasks – no standard solution • Beowulf design is best driven (and extended) by one's needs of the moment and vision of the future and not by a mindless attempt to slavishly follow the original technical definition anyway.

  9. Software • Building a cluster is straightforward, but managing its software can be complex. • Oscar (Open Source Cluster Application Resources) • Scyld – scyld.com, from the NASA scientists – commercial • true beowulf in a box • Beowulf Operating System • Rocks • OpenSCE • WareWulf • Clustermatic

  10. OSCAR • http://oscar.openclustergroup.org/ • Cluster on a CD – automates the cluster install process • Wizard driven • Can be installed on any Linux system that supports RPMs • Components: open source and BSD-style licenses

  11. Rocks • Award-winning open source high performance Linux cluster solution • The current release of NPACI Rocks is 3.3.0. • Rocks is built on top of RedHat Linux releases • Two types of nodes • Frontend • Two ethernet interfaces • Lots of disk space • Compute • Disk drive for caching the base operating environment (OS and libraries) • Rocks uses an SQL database for storing global variables

  12. Rocks physical structure

  13. Rocks frontend installation • 3 CDs: Rocks Base CD, HPC Roll CD and Kernel Roll CD • Bootable Base CD • User-friendly wizard-mode installation • Cluster information • Local hardware • Both ethernet interfaces

  14. Rocks frontend installation

  15. Rocks compute nodes • Installation • Log in to the frontend as root • Run insert-ethers • insert-ethers captures compute node DHCP requests and puts their information into the Rocks MySQL database

  16. Rocks compute nodes • Use install CD to boot the compute nodes • Insert-ethers also authenticates compute nodes. When insert-ethers is running, the frontend will accept new nodes into the cluster based on their presence on the local network. Insert-ethers must continue to run until a new node requests its kickstart file, and will ask you to wait until that event occurs. When insert-ethers is off, no unknown node may obtain a kickstart file.

  17. Rocks compute nodes • Monitor the installation • When one node is finished, do the next node • Nodes are grouped into cabinets • It is possible to install compute nodes of a different architecture than the frontend

  18. MPI • MPI is a software system that allows you to write message-passing parallel programs that run on a cluster, in Fortran and C. • MPI (Message Passing Interface) is a de facto standard for portable message-passing parallel programs standardized by the MPI Forum and available on all massively parallel supercomputers.

  19. Parallel programming • Keep in mind that memory is distributed – each node has its own memory space • Decomposition – divide large problems into smaller pieces, as in the sketch below • Include mpi.h in C programs
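
A minimal decomposition sketch (an illustration, not part of the original slides): each rank works on its own slice of the index range and MPI_Reduce combines the partial sums on rank 0. The problem size N and the per-element work are arbitrary assumptions.

#include <stdio.h>
#include <mpi.h>

#define N 1000000   /* illustrative problem size */

int main(int argc, char *argv[])
{
    int rank, size, i;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank handles only every size-th element, starting at its rank */
    for (i = rank; i < N; i += size)
        local += 1.0 / (i + 1);          /* stand-in for real per-element work */

    /* combine the partial results on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}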

  20. Message passing • A message-passing program consists of multiple instances of a serial program that communicate by library calls. These calls may be roughly divided into four classes (see the sketch below): • Calls used to initialise, manage and finally terminate communication • Calls used to communicate between pairs of processors • Calls that communicate operations among a group of processors • Calls used to create arbitrary data types
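
The Helloworld and ring examples on the following slides cover the first two classes. As an illustrative sketch of the other two (an addition, not from the original talk), MPI_Bcast is a group operation and MPI_Type_contiguous creates a derived data type:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[4] = {0};
    MPI_Datatype quad;                           /* a derived data type: 4 ints */

    MPI_Init(&argc, &argv);                      /* class 1: initialise */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(4, MPI_INT, &quad);      /* class 4: create a data type */
    MPI_Type_commit(&quad);

    if (rank == 0)
        data[0] = 42;                            /* only the root has the value */

    MPI_Bcast(data, 1, quad, 0, MPI_COMM_WORLD); /* class 3: group operation */
    printf("rank %d sees %d\n", rank, data[0]);

    MPI_Type_free(&quad);
    MPI_Finalize();                              /* class 1: terminate */
    return 0;
}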

  21. Helloworld.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

  22. Communication

/*
 * The root node sends out a message to the next node in the ring and
 * each node then passes the message along to the next node. The root
 * node times how long it takes for the message to get back to it.
 */
#include <stdio.h>      /* for input/output */
#include <mpi.h>        /* for mpi routines */

#define BUFSIZE 64      /* the size of the message being passed */

int main(int argc, char **argv)
{
    double start, finish;
    int my_rank;            /* the rank of this process */
    int n_processes;        /* the total number of processes */
    char buf[BUFSIZE];      /* a buffer for the message */
    int tag = 0;            /* not important here */
    MPI_Status status;      /* not important here */

    MPI_Init(&argc, &argv);                       /* initializing mpi */
    MPI_Comm_size(MPI_COMM_WORLD, &n_processes);  /* getting # of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);      /* getting my rank */

  23. Communication again

    /*
     * If this process is the root process, send a message to the next node
     * and wait to receive one from the last node. Time how long it takes
     * for the message to get around the ring. If this process is not the
     * root node, wait to receive a message from the previous node and
     * then send it to the next node.
     */
    start = MPI_Wtime();
    printf("Hello world! I am %d of %d\n", my_rank, n_processes);
    if (my_rank == 0) {
        /* send to the next node */
        MPI_Send(buf, BUFSIZE, MPI_CHAR, my_rank+1, tag, MPI_COMM_WORLD);
        /* receive from the last node */
        MPI_Recv(buf, BUFSIZE, MPI_CHAR, n_processes-1, tag, MPI_COMM_WORLD, &status);
    }

  24. Even more of communication

    if (my_rank != 0) {
        /* receive from the previous node */
        MPI_Recv(buf, BUFSIZE, MPI_CHAR, my_rank-1, tag, MPI_COMM_WORLD, &status);
        /* send to the next node */
        MPI_Send(buf, BUFSIZE, MPI_CHAR, (my_rank+1)%n_processes, tag, MPI_COMM_WORLD);
    }
    finish = MPI_Wtime();
    MPI_Finalize();     /* I'm done with mpi stuff */

    /* Print out the results. */
    if (my_rank == 0) {
        printf("Total time used was %f seconds\n", finish-start);
    }
    return 0;
}

  25. Compiling • Compile code using mpicc – MPI C compiler. • /u1/local/mpich-pgi/bin/mpicc -o helloworld2 helloworld2.c • Run using mpirun.
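
For example (the installation path comes from the slide above; the process count and the assumption that mpirun lives in the same directory are illustrative):

$ /u1/local/mpich-pgi/bin/mpicc -o helloworld2 helloworld2.c
$ /u1/local/mpich-pgi/bin/mpirun -np 4 helloworld2

On a workstation cluster you also supply a machine file, as described on the next slides.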

  26. Rocks computing • Mpirun on Rocks clusters is used to launch jobs that are linked with the Ethernet device for MPICH. • "mpirun" is a shell script that attempts to hide the differences in starting jobs for various devices from the user. On workstation clusters, you must supply a file that lists the different machines that mpirun can use to run remote jobs • MPICH is an implementation of MPI, the Standard for message-passing libraries.

  27. Rocks computing example • High-Performance Linpack • a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. • Launch HPL on two processors: • Create a file in your home directory named machines, and put two entries in it, such as: • compute-0-0 • compute-0-1 • Download the two-processor HPL configuration file and save it as HPL.dat in your home directory. • Now launch the job from the frontend: • $ /opt/mpich/gnu/bin/mpirun -nolocal -np 2 -machinefile machines /opt/hpl/gnu/bin/xhpl

  28. Rocks cluster-fork • Runs the same standard Unix commands on different nodes • By default, cluster-fork uses a simple series of ssh connections to launch the task serially on every compute node in the cluster. • My processes on all nodes: • $ cluster-fork ps -U$USER • Hostnames of all nodes: • $ cluster-fork hostname

  29. Rocks cluster-fork again • Often you wish to name the nodes your job is started on • $ cluster-fork --query="select name from nodes where name like 'compute-1-%'" [cmd] • Or use --nodes=compute-0-%d:0-2

  30. Sun Grid Engine • Rocks ships with Sun Grid Engine • Grid Engine is Distributed Resource Management (DRM) software. • It optimally places computing tasks and balances the load on a set of networked computers • It allows users to generate and queue more computing tasks than can be run at the moment • It ensures that tasks are executed with respect to priority and that all users get a fair share of access over time

  31. Rocks and Grid engine • Jobs are submitted to Grid Engine via scripts • The script gives parameters for the processes to be executed and for how they should be executed • Run a script: qsub testjob.sh • Query status: qstat -f • STDOUT and STDERR output is kept in files named after the script that is run
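
For illustration only (this is not the script used in this talk – that testjob.sh follows on later slides, and the "mpi" parallel environment is assumed to exist as on a Rocks frontend), a minimal submission script for two slots could look like:

#!/bin/bash
#$ -S /bin/bash        # interpret the job script with bash
#$ -cwd                # start in the directory the job was submitted from
#$ -pe mpi 2           # request the "mpi" parallel environment with 2 slots
echo "Running on $NSLOTS slots"

Submit it with qsub and check its state with qstat -f, as above.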

  32. Grid Engine output • Grid Engine puts the output of job into 4 files. • $HOME/sge-qsub-test.sh.o<job id> for (STDOUT messages) • $HOME/sge-qsub-test.sh.e<job id> (STDERR messages). • The other 2 files pertain to Grid Engine status and they are named: • $HOME/sge-qsub-test.sh.po<job id> (STDOUT messages) • $HOME/sge-qsub-test.sh.pe<job id> (STDERR messages)

  33. testjob.sh

#!/bin/bash
#$ -S /bin/bash
#
# Set the Parallel Environment and number of procs.
#$ -pe mpi 2

# Where we will make our temporary directory.
BASE="/tmp"

# make a temporary key
export KEYDIR=`mktemp -d $BASE/keys.XXXXXX`

# Make a temporary password. Makepasswd is quieter, and presumably more
# efficient. We must use the -s 0 flag to make sure the password contains
# no quotes.
if [ -x `which mkpasswd` ]; then
    export PASSWD=`mkpasswd -l 32 -s 0`
else
    export PASSWD=`dd if=/dev/urandom bs=512 count=100 | md5sum | gawk '{print $1}'`
fi

  34. testjob.sh

/usr/bin/ssh-keygen -t rsa -f $KEYDIR/tmpid -N "$PASSWD"
cat $KEYDIR/tmpid.pub >> $HOME/.ssh/authorized_keys2

# make a script that will run under its own ssh-agent
#
cat > $KEYDIR/launch-script <<"EOF"
#!/bin/bash
expect -c 'spawn /usr/bin/ssh-add $env(KEYDIR)/tmpid' -c \
    'expect "Enter passphrase for $env(KEYDIR)/tmpid" \
    { send "$env(PASSWD)\n" }' -c 'expect "Identity"'
echo

# Put your Job commands here.
#------------------------------------------------
/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines \
    /opt/hpl/gnu/bin/xhpl
#------------------------------------------------
EOF

chmod u+x $KEYDIR/launch-script

  35. testjob.sh

# start a new ssh-agent from scratch -- make it forget previous ssh-agent connections
unset SSH_AGENT_PID
unset SSH_AUTH_SOCK
/usr/bin/ssh-agent $KEYDIR/launch-script

#
# cleanup
#
grep -v "`cat $KEYDIR/tmpid.pub`" $HOME/.ssh/authorized_keys2 \
    > $KEYDIR/authorized_keys2
mv $KEYDIR/authorized_keys2 $HOME/.ssh/authorized_keys2
chmod 644 $HOME/.ssh/authorized_keys2
rm -rf $KEYDIR

  36. Monitoring Rocks • A set of web pages to monitor activities and configuration • Apache webserver, with access from the internal network only • From outside, viewing the web pages involves sending a web browser screen over a secure, encrypted SSH channel: • ssh to the frontend, start Mozilla there (mozilla --no-remote) and browse http://localhost • Access from the public network (not recommended) • Modify IPtables
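
For example (the user and hostname are placeholders; assumes X11 forwarding is permitted by the frontend's sshd):

$ ssh -X user@frontend.example.org
$ mozilla --no-remote http://localhost/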

  37. Rocks monitoring pages (screenshot)

  38. More monitoring through web • Access through web pages includes PHPMyAdmin for SQL server • TOP command for cluster • Graphical monitoring of cluster • Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.

  39. Ganglia (screenshot)

  40. Default services • 411 Secure information service • Distribute files to nodes – password changes, login files • In cron – run every hour • DNS for local communication • Postfix mail software

  41. Sources • http://www.tldp.org/HOWTO/Beowulf-HOWTO.html • http://www.cs.iusb.edu/beowulf.html • Indiana University South Bend • http://www.beowulf.org • http://www.rocksclusters.org/Rocks/ • NPACI Rocks Cluster Distribution: Users Guide • http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book/beowulf_book/index.html • http://www.scyld.com/ • http://www.ganglia.info • http://www.sci.hkbu.edu.hk/mscsc/lab/cluster/cluster1.pdf
