Pipeline, Medulla, and Cranium Clusters Status
LONI Open Discussion Forum, November 2012
Pipeline Architecture
- Client: GUI application on Windows, Mac, and Linux; Command Line Interface; Web Start
- Server: Server Installer (DPS); Amazon EC2 image; DRMAA plugin; JGDI plugin (a DRMAA submission sketch follows below)
- Grid managers: Grid Engine, LSF, Torque/PBS (beta)
- Compute nodes
- File systems
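The Pipeline server's own DRMAA plugin is not shown here; as a general illustration of how a DRMAA-capable grid manager such as Grid Engine accepts jobs, here is a minimal Python sketch using the drmaa-python package. The command path and arguments are hypothetical, and this is not Pipeline's internal mechanism.

```python
import drmaa

# Submit one job through the DRMAA interface of the local grid manager and
# wait for it to finish. Requires libdrmaa and drmaa-python on the submit host.
s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/usr/local/bin/my_tool"       # hypothetical executable
    jt.args = ["--input", "/ifs/data/subject01.nii"]  # hypothetical arguments
    job_id = s.runJob(jt)
    info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("job %s exited with status %s" % (job_id, info.exitStatus))
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```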
Medulla.loni.ucla.edu (Server)
- 412 nodes with dual quad-core Xeon X5550 CPUs and 24 GB of RAM each; 3064 slots available in total
- Nodes are stateless; the /tmp directory may cause problems for some tools. Use /ifs/tmp for scratch space (see the sketch below).
- Significant upgrades compared with cranium (new OS, new scheduler, etc.)
- Ongoing testing and validation (support@loni.ucla.edu, pipeline@loni.ucla.edu)
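As one way to follow the scratch-space recommendation above, a minimal Python sketch that keeps temporary files on /ifs/tmp rather than the node-local /tmp; it assumes /ifs/tmp is mounted and writable, and the directory prefix and file contents are placeholders.

```python
import os
import shutil
import tempfile

# Create a per-job scratch directory on the shared file system instead of the
# stateless node's /tmp.
scratch = tempfile.mkdtemp(prefix="pipeline_", dir="/ifs/tmp")
os.environ["TMPDIR"] = scratch  # child tools that honor TMPDIR will use it too

try:
    # Temporary files created with an explicit dir= land under /ifs/tmp.
    with tempfile.NamedTemporaryFile(dir=scratch, suffix=".dat") as tmp:
        tmp.write(b"intermediate results")  # placeholder for real tool output
finally:
    shutil.rmtree(scratch, ignore_errors=True)  # clean up shared scratch space
```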
Cranium.loni.ucla.edu (Production)
- 304 Sun v20z nodes from the original purchase: 2 AMD Opteron 250 processors and 8 GB RAM each, 608 slots
- 76 Sun X2200 nodes from a supplementary purchase: 2x quad-core AMD 2354 each, 608 slots
- Queue allocations: pipeline.q: 794, short.q: 88, medium.q: 64, long.q: 64 (see the sketch below for targeting a specific queue)
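As a hedged sketch of how a script might route a submission to one of these queues from a DRMAA-capable submit host: the drmaa-python package and the sleep command are assumptions for illustration, not Pipeline's own submission path.

```python
import drmaa

# Pass a Grid Engine native specification to request a particular queue
# (short.q here, matching the allocations listed above).
s = drmaa.Session()
s.initialize()
try:
    jt = s.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"
    jt.args = ["60"]
    jt.nativeSpecification = "-q short.q"  # choose the queue that fits the job length
    print("submitted", s.runJob(jt))
    s.deleteJobTemplate(jt)
finally:
    s.exit()
```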
Cranium.loni.ucla.edu "Missing Nodes"
- Node hardware issues
- The dev environment and servers draw from the same node pool shown on the previous slide
- 110 slots have been reallocated for purposes including Dev/Q&A and PWS
- In addition, our 4 qmasters, 6 submission servers, and other management nodes draw from this pool
- Accurate counts for all environments will be posted online internally for all to see
User Feedback
- More nodes appear to be in use than are listed.
- A process goes to one CPU on a node but uses all of the memory, so no jobs can be submitted to the remaining three despite their availability.
- When a job fails in Pipeline, it may be because it randomly landed on a bad node; restarting the module may finish just fine (i.e., there is a random-chance component to job submission). A retry sketch follows this list.
- Some users get around submission quotas and use more nodes than they are allowed.
- IFS is very slow and gets bogged down with file I/O operations: 1) IFS went beyond 95% capacity, the point at which Isilon processes begin to die; 2) the cranium cluster has 100G aggregated bandwidth to our core router, while the Isilon cluster has only 10G aggregated bandwidth to the same core.
- Individual nodes are flagged as needing to be re-mounted, but whole racks may need to be taken down and restarted as a whole(?)
- Availability of 1800+ nodes is reported, yet only 400 show up.
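The "restart the module" workaround above can be automated; a minimal sketch that resubmits a command a few times, since a failure may only mean the job landed on a bad node. The qsub invocation and job script name are illustrative assumptions, not part of Pipeline.

```python
import subprocess

def run_with_retries(cmd, attempts=3):
    """Run cmd, resubmitting up to `attempts` times if it exits non-zero."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return result
        print("attempt %d failed (exit %d), resubmitting" % (attempt, result.returncode))
    raise RuntimeError("command failed on all %d attempts" % attempts)

# Example: submit a Grid Engine job and wait for it (-sync y blocks until it ends).
run_with_retries(["qsub", "-sync", "y", "my_job.sh"])  # hypothetical job script
```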