Kolkata Tier-3 Rocks Cluster Grid-Peer Rocks Cluster for ALICE & CBM Development Work at Variable Energy Cyclotron Centre
Kolkata Tier-3 Rocks Cluster • Rocks Overview • Managing Rocks • Under the hood with Torque/Maui • Implementing Disk Quotas • Using Rocks
USING THE ROCKS CLUSTER • Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters and grid endpoints. • It is based on CentOS with a modified anaconda installer that simplifies mass installation onto many computers. (We use Scientific Linux CERN 6.x as the OS on the Kolkata Tier-3 Rocks Cluster.) • Installations are customised with additional software packages called rolls.
Rocks Philosophies • Quick to install. • It should not take a month (or even more than a day) to install a thousand node cluster. • Nodes are 100% configured. • No “after the fact” tweaking. • If a node is out of configuration, just reinstall. • Don’t spend time on configuration management of nodes. • Just reinstall. • Nodes install from kickstart files generated from a database. • Modify default disk partitioning • Add software via RPMs or “Rolls”
[Figure: network layout of grid-peer. Labelled components: Dell blade enclosure, HP blade enclosure, IBM storage, public switch. Labelled links: 10 Gb OFC Ethernet, 8 Gb OFC SAN, 1 Gb Ethernet.]
Hardware Information Frontend Node (one HP DL585 G7 rack-mount server) • Disk Capacity: 4 TB usable with RAID 1 disk mirroring (8 disks * 1 TB = 8 TB raw) • Memory Capacity: 256 GB • Ethernet: 6 physical ports, "eth0" to "eth5" (eth0, eth1, eth2 & eth3 are 1 Gbps RJ-45; eth4 & eth5 are 10 Gb OFC) • CPU: 4 AMD Opteron 2.60 GHz 64-bit processors with 16 cores each, 64 cores in total (4 CPUs * 16 cores)
Computation Nodes We have two types of compute node, which differ in processor and in RAM per core:
• compute-0-0 to compute-0-7 (8 HP full-length blade servers)
    • CPU: 4 AMD Opteron 64-bit processors with 16 cores each, 64 cores in total (4 CPUs * 16 cores)
    • Disk Capacity: 900 GB with RAID 1 (2 disks * 900 GB)
    • Memory Capacity: 256 GB
• compute-0-8 to compute-0-13 (6 Dell half-length blade servers)
    • CPU: 2 Intel Xeon 64-bit processors with 8 cores each, 16 cores in total (2 CPUs * 8 cores)
    • Disk Capacity: 146 GB with RAID 1 (2 disks * 146 GB)
    • Memory Capacity: 16 GB
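As a quick cross-check of the figures on this slide, the total core count and the RAM per core for each node type can be worked out with shell arithmetic. This is only a sketch: the variable names are illustrative, and the numbers are simply the ones quoted above.

```bash
#!/bin/bash
# Node counts, cores per node and RAM per node as quoted on the slide.
amd_nodes=8;   amd_cores=64;   amd_mem_gb=256    # compute-0-0 .. compute-0-7
intel_nodes=6; intel_cores=16; intel_mem_gb=16   # compute-0-8 .. compute-0-13

# Total cores across all compute nodes (the frontend is not counted).
total_cores=$(( amd_nodes * amd_cores + intel_nodes * intel_cores ))

# RAM available per core on each node type, in GB (integer division).
amd_gb_per_core=$(( amd_mem_gb / amd_cores ))
intel_gb_per_core=$(( intel_mem_gb / intel_cores ))

echo "Total compute cores: ${total_cores}"
echo "AMD nodes:   ${amd_gb_per_core} GB RAM per core"
echo "Intel nodes: ${intel_gb_per_core} GB RAM per core"
```

The difference in RAM per core is worth keeping in mind when deciding which node type suits a memory-hungry job.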
Access, Utilize & Manage Rocks • Access the Grid-Peer Rocks Cluster • Utilize Rocks
Access into Grid-peer Rocks Cluster Access to the Kolkata Tier-3 Grid-Peer Rocks Cluster: • From a Linux or Mac PC/desktop: open a terminal, then type one of the commands below: • ssh -XY <userid>@grid-peer.tier2-kol.res.in or ssh -X <userid>@144.16.112.6 • From a Windows PC/desktop: download PuTTY, open it, then type one of the commands below: • ssh -XY <userid>@grid-peer.tier2-kol.res.in or ssh -X <userid>@144.16.112.6
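For repeated logins, the ssh command can be wrapped in a small shell function. The sketch below is an illustration only: the function names grid_peer_target and grid_peer_login are made up for this example, and the host name is the one given on the slide. Note that the option must be typed with a plain ASCII hyphen; a dash pasted from a slide is not recognised by ssh as an option.

```bash
#!/bin/bash
# Default host name taken from the slide above.
GRID_PEER_HOST="grid-peer.tier2-kol.res.in"

# Build the user@host target string; the host is optional.
grid_peer_target() {
    printf '%s@%s\n' "$1" "${2:-$GRID_PEER_HOST}"
}

# Open an interactive session with trusted X11 forwarding (-X -Y).
# Defined here but not called, since it needs network access.
grid_peer_login() {
    ssh -XY "$(grid_peer_target "$@")"
}
```

Usage would be `grid_peer_login <userid>`, or `grid_peer_login <userid> 144.16.112.6` to connect by IP address.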
Utilize Rocks • To see all active nodes together with the frontend server, type the following command: # rocks list host (It shows the frontend and 14 compute nodes named compute-0-0, compute-0-1 ... compute-0-8 ... compute-0-13.)
==============================================
[root@grid-peer ~]# rocks list host
HOST           MEMBERSHIP  CPUS  RACK  RANK  RUNACTION  INSTALLACTION
grid-peer:     Frontend    64    0     0     os         install
compute-0-9:   Compute     16    0     9     os         install
compute-0-2:   Compute     64    0     2     os         install
compute-0-3:   Compute     64    0     3     os         install
compute-0-4:   Compute     64    0     4     os         install
compute-0-5:   Compute     64    0     5     os         install
compute-0-6:   Compute     64    0     6     os         install
compute-0-10:  Compute     16    0     10    os         install
compute-0-11:  Compute     16    0     11    os         install
compute-0-12:  Compute     16    0     12    os         install
compute-0-13:  Compute     16    0     13    os         install
compute-0-7:   Compute     64    0     7     os         install
compute-0-1:   Compute     64    0     1     os         install
compute-0-0:   Compute     64    0     0     os         install
compute-0-8:   Compute     16    0     8     os         install
==============================================
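The host listing can also be summarised per membership with awk. The sketch below assumes the column layout shown above (HOST, MEMBERSHIP, CPUS, ...); the function name sum_cpus_by_membership is hypothetical, and the sample is an abbreviated copy of three rows from the listing so the snippet runs without a cluster.

```bash
#!/bin/bash
# Sum the CPUS column per MEMBERSHIP from `rocks list host` output on stdin.
# Rows whose third column is not a number (such as the header) are skipped.
sum_cpus_by_membership() {
    awk '$3 ~ /^[0-9]+$/ { cpus[$2] += $3; hosts[$2]++ }
         END { for (m in cpus) printf "%s %d hosts %d cpus\n", m, hosts[m], cpus[m] }' | sort
}

# Abbreviated sample copied from the listing above.
sample='HOST MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
grid-peer: Frontend 64 0 0 os install
compute-0-0: Compute 64 0 0 os install
compute-0-9: Compute 16 0 9 os install'

printf '%s\n' "$sample" | sum_cpus_by_membership
```

On the frontend itself the same function could be fed directly, as in `rocks list host | sum_cpus_by_membership`.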
To see the list of installed Rocks rolls, type the following command: # rocks list roll
=====================================
[root@grid-peer ~]# rocks list roll
NAME                    VERSION  ARCH    ENABLED
kernel:                 6.1.1    x86_64  yes
ganglia:                6.1.1    x86_64  yes
base:                   6.1.1    x86_64  yes
hpc:                    6.1.1    x86_64  yes
torque:                 6.1.0    x86_64  yes
Scientific_Linux_CERN:  6.1.1    x86_64  yes
=====================================
Torque/Maui We have implemented Torque/Maui, which includes pbs_server and pbs_mom. The batch system covers the first 7 AMD-based nodes, i.e. compute-0-0 to compute-0-6. These are non-interactive nodes: direct login to them via ssh is not allowed, so submit jobs to them with “qsub”.
Installed daemons • On Frontend • maui • pbs_server • pbs_mom (not running) • mpiexec (mostly for the man-page) • On Compute nodes • pbs_mom • mpiexec
Debugging and Analysing • Lots of tools: • pbsnodes -- node status; pbsnodes -a -- list all nodes with their attributes; pbsnodes -l -- list nodes that are down or offline • qstat -f -- all details of a job • checkjob -- detailed status of a single job • showq -- show the job queue • tracejob -- trace a job through the logs, including node information.
Check jobs running on the nodes • Check for jobs running on the cluster (showq):
[root@fotcluster2 nodes]# showq
ACTIVE JOBS--------------------
JOBNAME  USERNAME   STATE    PROC  REMAINING    STARTTIME
1472     asmith     Running  64    99:13:35:10
1473     klangfeld  Running  120   99:22:01:12
2 Active Jobs  184 of 948 Processors Active (19.41%)
               16 of 79 Nodes Active (20.25%)
====================================
• showq -r : display the list of currently running jobs • showq -i : display the list of idle jobs • showq -b : display the list of blocked jobs
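The percentages in the showq summary are simply the active count divided by the total. A minimal sketch of that arithmetic, using the numbers from the sample output above (the helper name percent_used is made up for this example):

```bash
#!/bin/bash
# Print "used of total" as a percentage with two decimals.
# A total of zero is reported as 0.00 rather than dividing by zero.
percent_used() {
    awk -v used="$1" -v total="$2" \
        'BEGIN { if (total == 0) print "0.00"; else printf "%.2f\n", 100 * used / total }'
}

percent_used 184 948   # processors active in the sample summary
percent_used 16 79     # nodes active in the sample summary
```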
Check which nodes jobs are running on • Which nodes is a job running on? (tracejob JOBID):
+++++++++++++++++++++++++++++++++
[root@grid-peer ~]# tracejob 100
/var/spool/torque/mom_logs/20140722: No such file or directory
/var/spool/torque/sched_logs/20140722: No such file or directory
Job: 100.grid-peer.tier2-kol.res.in
07/22/2014 14:17:28 S enqueuing into route, state 1 hop 1
07/22/2014 14:17:28 S dequeuing from route, state QUEUED
07/22/2014 14:17:28 S enqueuing into general, state 1 hop 1
07/22/2014 14:17:28 A queue=route
07/22/2014 14:17:28 A queue=general
07/22/2014 14:17:29 S Job Run at request of root@grid-peer.tier2-kol.res.in
07/22/2014 14:17:29 S Not sending email: User does not want mail of this type.
07/22/2014 14:17:29 A user=prasun group=generalgroup jobname=psr.job queue=general ctime=1406018848 qtime=1406018848 etime=1406018848 start=1406018849
    owner=prasun@grid-peer.tier2-kol.res.in exec_host=compute-0-1/1 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1
07/22/2014 14:17:59 S Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3240kb resources_used.vmem=31180kb resources_used.walltime=00:00:31
07/22/2014 14:17:59 S Not sending email: User does not want mail of this type.
07/22/2014 14:17:59 S on_job_exit valid pjob: 100.grid-peer.tier2-kol.res.in (substate=50)
07/22/2014 14:17:59 A user=prasun group=generalgroup jobname=psr.job queue=general ctime=1406018848 qtime=1406018848 etime=1406018848 start=1406018849
    owner=prasun@grid-peer.tier2-kol.res.in exec_host=compute-0-1/1 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 session=52126
    end=1406018879 Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=3240kb resources_used.vmem=31180kb resources_used.walltime=00:00:31
[root@grid-peer ~]#
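The qtime, start and end fields in the accounting lines are Unix timestamps, so queue wait and run time can be derived from them. The sketch below is illustrative: the helper acct_field is a made-up name, and the record is a shortened copy of the fields from the tracejob output above.

```bash
#!/bin/bash
# Extract the value of KEY from a space-separated list of KEY=VALUE fields.
acct_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Accounting fields copied from the tracejob output above (job 100).
record='ctime=1406018848 qtime=1406018848 etime=1406018848 start=1406018849 end=1406018879'

queued=$(acct_field qtime "$record")
started=$(acct_field start "$record")
ended=$(acct_field end "$record")

echo "Queue wait: $(( started - queued )) s"
echo "Run time:   $(( ended - started )) s"
```

For this job the run time derived from the timestamps comes out close to the resources_used.walltime of 00:00:31 that Torque reports.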
Submit job in torque : qsub • Create a job file, e.g. abc.job:
#!/bin/bash
# Set the job name
#PBS -N abc
# Set the job's standard output file
#PBS -o abc.out
# Set the job's standard error file
#PBS -e abc.err
# Your own mail address
#PBS -M <your mail id>
# ---------------- enter your job commands below ----------------
sleep 10
hostname
sleep 20
Submit job in torque : qsub • Then save it. • Now submit your job with qsub: # qsub abc.job • After submitting a job, you get a unique job id. • Then check your job with the “showq -r” command.
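Submission and the first status check can be combined in one helper. This is a sketch under assumptions, not a site-provided tool: the function names job_number and submit_and_check are hypothetical, and it relies on qsub printing the new job id (such as 100.grid-peer.tier2-kol.res.in) on standard output.

```bash
#!/bin/bash
# Strip the server suffix from a Torque job id, e.g. 100.grid-peer... -> 100.
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Submit a job file and show its full status with qstat -f.
# Defined here but not called, since it needs qsub and qstat on the PATH.
submit_and_check() {
    local jobid
    jobid=$(qsub "$1") || return 1
    echo "Submitted job $(job_number "$jobid") (${jobid})"
    qstat -f "$jobid"
}
```

On the frontend this would be used as `submit_and_check abc.job`; the short job number it prints is the form passed to tracejob earlier.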