Pipeline, Medulla, and Cranium Clusters Status
LONI Open Discussion Forum, November 2012
Pipeline Architecture
- Client: GUI application on Windows, Mac, and Linux; Command Line Interface; Web Start
- Server: Server Installer (DPS); Amazon EC2 image; DRMAA plugin; JGDI plugin (a DRMAA submission sketch follows below)
- Grid managers: Grid Engine, LSF, Torque/PBS (beta)
- Compute nodes
- File systems
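The Pipeline server's own DRMAA plugin is not shown here; as a general illustration of how a DRMAA-capable grid manager such as Grid Engine accepts jobs, here is a minimal Python sketch using the drmaa-python package. The command path and arguments are hypothetical, and this is not Pipeline's internal mechanism.

```python
import drmaa

# Submit one job through the DRMAA interface of the local grid manager and
# wait for it to finish. Requires libdrmaa and drmaa-python on the submit host.
s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/usr/local/bin/my_tool"       # hypothetical executable
    jt.args = ["--input", "/ifs/data/subject01.nii"]  # hypothetical arguments
    job_id = s.runJob(jt)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job %s exited with status %s" % (job_id, info.exitStatus))
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```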
Medulla.loni.ucla.edu (Server)
- 412 nodes with dual quad-core Xeon X5550 CPUs and 24 GB of RAM each; 3064 slots available in total
- Nodes are stateless; the /tmp directory may cause problems for some tools. Use /ifs/tmp for scratch space (see the sketch below).
- Significant upgrades compared with cranium (new OS, new scheduler, etc.)
- Ongoing testing and validation (support@loni.ucla.edu, pipeline@loni.ucla.edu)
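As one way to follow the scratch-space recommendation above, a minimal Python sketch that keeps temporary files on /ifs/tmp rather than the node-local /tmp; it assumes /ifs/tmp is mounted and writable, and the directory prefix and file contents are placeholders.

```python
import os
import shutil
import tempfile

# Create a per-job scratch directory on the shared file system instead of the
# stateless node's /tmp.
scratch = tempfile.mkdtemp(prefix="pipeline_", dir="/ifs/tmp")
os.environ["TMPDIR"] = scratch  # child tools that honor TMPDIR will use it too

try:
    # Temporary files created with an explicit dir= land under /ifs/tmp.
    with tempfile.NamedTemporaryFile(dir=scratch, suffix=".dat") as tmp:
        tmp.write(b"intermediate results")  # placeholder for real tool output
finally:
    shutil.rmtree(scratch, ignore_errors=True)  # clean up shared scratch space
```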
Cranium.loni.ucla.edu (Production)
- 304 Sun v20z nodes from the original purchase: 2 AMD Opteron 250 processors and 8 GB RAM each, 608 slots
- 76 Sun X2200 nodes from a supplementary purchase: 2x quad-core AMD 2354 each, 608 slots
- Queue allocations: pipeline.q: 794, short.q: 88, medium.q: 64, long.q: 64 (see the sketch below for targeting a specific queue)
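As a hedged sketch of how a script might route a submission to one of these queues from a DRMAA-capable submit host: the drmaa-python package and the sleep command are assumptions for illustration, not Pipeline's own submission path.

```python
import drmaa

# Pass a Grid Engine native specification to request a particular queue
# (short.q here, matching the allocations listed above).
s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"
    jt.args = ["60"]
    jt.nativeSpecification = "-q short.q"  # choose the queue that fits the job length
    print("submitted", s.runJob(jt))
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```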
Cranium.loni.ucla.edu "Missing Nodes"
- Node hardware issues
- The dev environment and servers draw from the same node pool shown on the previous slide
- 110 slots have been reallocated for purposes including Dev/Q&A and PWS
- In addition, our 4 qmasters, 6 submission servers, and other management nodes draw from this pool
- Accurate counts for all environments will be posted online internally for all to see
User Feedback
- More nodes appear to be in use than are listed.
- A process goes to one CPU on a node but uses all of the memory, so no jobs can be submitted to the remaining three despite their availability.
- When a job fails in Pipeline, it may be because it randomly landed on a bad node; restarting the module may finish just fine (i.e., there is a random-chance component to job submission). A retry sketch follows this list.
- Some users get around submission quotas and use more nodes than they are allowed.
- IFS is very slow and gets bogged down with file I/O operations: 1) IFS went beyond 95% capacity, the point at which Isilon processes begin to die; 2) the cranium cluster has 100G aggregated bandwidth to our core router, while the Isilon cluster has only 10G aggregated bandwidth to the same core.
- Individual nodes are flagged as needing to be re-mounted, but whole racks may need to be taken down and restarted as a whole(?)
- Availability of 1800+ nodes is reported, yet only 400 show up.
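The "restart the module" workaround above can be automated; a minimal sketch that resubmits a command a few times, since a failure may only mean the job landed on a bad node. The qsub invocation and job script name are illustrative assumptions, not part of Pipeline.

```python
import subprocess

def run_with_retries(cmd, attempts=3):
    """Run cmd, resubmitting up to `attempts` times if it exits non-zero."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return result
        print("attempt %d failed (exit %d), resubmitting" % (attempt, result.returncode))
    raise RuntimeError("command failed on all %d attempts" % attempts)

# Example: submit a Grid Engine job and wait for it (-sync y blocks until it ends).
run_with_retries(["qsub", "-sync", "y", "my_job.sh"])  # hypothetical job script
```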