Triton Shared Computing Cluster Project Update

Jim Hayes jhayes@sdsc.edu 2014-05-21

TSCC To Date • A Look Under the Hood • Plans for the Near Future

Project Model
• TSCC is a heterogeneous cluster with a hybrid usage model
• General computing, GPU, large-memory nodes; h/w choices reviewed annually and expected to change
• “Condo” nodes purchased by researchers, access shared among condo owners; “hotel” nodes provided by RCI project, access via cycle purchase
• Operation and infrastructure costs paid by condo owners, subsidized by RCI project and hotel sales
• All users have access to Infiniband and 10GbE networking, home and parallel file systems
• TSCC one-year production anniversary 2014-05-10

Participation
• Condo
  • 15 groups w/169 users
  • 116 general compute nodes (5+ in pipeline), 20 GPU
• Hotel
  • 192 groups w/410 users
  • 46 general compute nodes, 3 GPU, 4 PDAFM
  • Of 94 trial (90-day, 250 SU) accounts, 20 bought cycles
• Classes

Jobs
• ~1.54 million jobs since 5/10/2013 production
  • 800K hotel
  • 391K home
  • 42K condo
  • 98K glean
• >6 million SUs spent

Job Stats – Node Count
• Single-node job count is 1.53M, about 233x the 2-node job count

TSCC Job Stats - ppn

TSCC Job Stats - Runtime

TSCC Job Stats – Node Usage

Issues from Prior Meeting
• Maui scheduler crashing/hanging/unresponsive – Fixed
  • Upgrade of torque fixed communication problems
• Duplicate job charges/hanging reservations – Managed
  • Less frequent post-upgrade; wrote scripts to detect/reverse
• Glean queue not working – Fixed
  • Gave up on maui and wrote script to manage glean queue
• X11 Forwarding failure – Fixed
  • Problem was missing ipv6 conf file
• Home filesystem unresponsive under write load – Fixed
  • Post upgrade, fs handles user write load gracefully
  • zfs/nfs quotas broken; handled manually
  • User access to snapshots not working; restores via ticket

TSCC Rack Layout tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n

Networking tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n Infiniband x36, 1GbE x40, 10GbE x32

Node Classes tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n Hotel GPU Condo GPU Hotel Hotel PDAFM Condo Administration Home F/S

Processors tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n 2x8 2.6 GHz Sandy Bridge 4x8 2.5GHz Opteron 2x6 2.3GHz Sandy Bridge 2x6 2.6 GHz Ivy Bridge 2x8 2.2GHz Sandy Bridge

Memory tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n 128GB 64GB 256GB 512GB 32GB

GPUs tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n 4x GTX 780 Ti 4x GTX 780 4x GTX 680

Infiniband Connectivity tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n rack0 ibswitch1 rack0 ibswitch2 rack0 ibswitch2 rack1 ibswitch4 rack1 ibswitch3 ibswitch6 ibswitch7 IB switch

qsub Switches for IB Jobs
• Only necessary for condo/glean jobs, because
  • Hotel nodes all in same rack (rack0)
  • Home nodes on same switch
• To run on a single switch, specify switch property, e.g.,
  • qsub -q condo -l nodes=2:ppn=8:ib:ibswitch3
  • qsub -q hotel -l nodes=2:ppn=8:ib:ibswitch1
• To run in a single rack (IB switches interconnected), specify rack property, e.g.,
  • qsub -q condo -l nodes=2:ppn=8:ib:rack1
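A minimal sketch of how the switch and rack properties fit into the `-l nodes=...` string. The helper name `ib_nodes_spec` and the script name `job.sh` are hypothetical, not part of TSCC. The commands are printed, not submitted.

```shell
# Hypothetical helper: build the "-l" node specification for an IB job.
# Usage: ib_nodes_spec NODES PPN [PROPERTY]
#   PROPERTY is a switch or rack property such as ibswitch3 or rack1.
ib_nodes_spec() {
    local nodes=$1 ppn=$2 prop=${3:-}
    local spec="nodes=${nodes}:ppn=${ppn}:ib"
    # Append the switch/rack property only when one was given.
    if [ -n "$prop" ]; then
        spec="${spec}:${prop}"
    fi
    printf '%s\n' "$spec"
}

# Print the qsub command lines instead of running them (job.sh is a placeholder).
echo "qsub -q condo -l $(ib_nodes_spec 2 8 ibswitch3) job.sh"   # single switch
echo "qsub -q condo -l $(ib_nodes_spec 2 8 rack1) job.sh"       # single rack
```

Omitting the third argument yields a plain `:ib` request with no switch or rack constraint.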

Queues
• All users have access to hotel, gpu-hotel, and pdafm queues
  • qsub [-q hotel] – max time 7d, total cores/user 176
  • qsub -q gpu-hotel – max time 14d, total cores/user 36
    • Non-GPU jobs may be run in gpu-hotel
  • qsub -q pdafm – max time 3d, total cores/user 96
• Condo owners have access to home, condo, glean queues
  • qsub -q home – unlimited time, cores
  • qsub -q condo – max time 8h, total cores/user 512
  • qsub -q glean – unlimited time, max total cores/user 1024
    • No charge to use, but jobs will be killed for higher-priority jobs
• GPU owners have access to gpu-condo queue
  • qsub -q gpu-condo – max time 8h, total cores/user 84
• GPU node jobs allocated 1 GPU per 3 cores
• Queue limits subject to change w/out notice
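The queue limits above can be captured in a small lookup, sketched here with hypothetical function names (`queue_limits`, `gpus_for_cores`, `fits_queue`). The numbers are copied from this slide and are subject to change. The core limits are totals per user, so the check below is only meaningful for a user with no other jobs running. Rounding down in the GPU calculation is an assumption.

```shell
# Hypothetical lookup: print "MAX_HOURS MAX_CORES" for a queue,
# using "unlimited" where the slide lists no cap (7d = 168h, 14d = 336h, 3d = 72h).
queue_limits() {
    case "$1" in
        hotel)     echo "168 176" ;;
        gpu-hotel) echo "336 36" ;;
        pdafm)     echo "72 96" ;;
        home)      echo "unlimited unlimited" ;;
        condo)     echo "8 512" ;;
        glean)     echo "unlimited 1024" ;;
        gpu-condo) echo "8 84" ;;
        *)         echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# GPUs allocated to a GPU-node job at 1 GPU per 3 cores
# (integer division; the rounding rule is an assumption).
gpus_for_cores() {
    echo $(( $1 / 3 ))
}

# fits_queue QUEUE HOURS CORES: exit status 0 if the request is within limits,
# 1 if it exceeds them, 2 if the queue is unknown.
fits_queue() {
    local limits max_hours max_cores
    limits=$(queue_limits "$1") || return 2
    max_hours=${limits% *}
    max_cores=${limits#* }
    if [ "$max_hours" != unlimited ] && [ "$2" -gt "$max_hours" ]; then return 1; fi
    if [ "$max_cores" != unlimited ] && [ "$3" -gt "$max_cores" ]; then return 1; fi
    return 0
}

fits_queue condo 9 16 || echo "9h exceeds the 8h condo limit"
echo "12 cores on a GPU node -> $(gpus_for_cores 12) GPUs"
```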

Commands to Answer FAQs
• “Why isn’t my job running?”
  • checkjob job_id, e.g., checkjob 1234567
  • “BankFailure” indicates not enough SUs remain to run job
  • “No Resources” may indicate bad request (e.g., ppn=32 in hotel)
• “What jobs are occupying these nodes?”
  • lsjobs --property=string, e.g., lsjobs --property=hotel
• “How many SUs do I have left?”
  • gbalance -u login, e.g., gbalance -u jhayes
• “Why is my balance so low?”
  • gstatement -u login, e.g., gstatement -u jhayes
• “How much disk space am I using?”
  • df -h /home/login, e.g., df -h /home/jhayes
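As an illustration of the first FAQ, the sketch below maps the two checkjob messages quoted above to their explanations. The function name `explain_checkjob` is hypothetical. It matches the strings exactly as written on this slide; the wording of real checkjob output may differ.

```shell
# Hypothetical filter: read checkjob output on stdin and print a hint
# based on the two messages named on the slide.
explain_checkjob() {
    local output
    output=$(cat)
    case "$output" in
        *BankFailure*)
            echo "Not enough SUs remain to run the job; check the balance with gbalance." ;;
        *"No Resources"*)
            echo "Possibly a bad request (e.g. ppn=32 in hotel); review the -l resource list." ;;
        *)
            echo "No known hint found; read the full checkjob output." ;;
    esac
}

# Intended use on a login node (not run here):
#   checkjob 1234567 | explain_checkjob
```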

Hardware Selection/Pricing
• Jump to Haswell processors in 4th quarter 2014
• Go back to vendors for fresh quotes
  • Original vendor agreements for fixed pricing expired 12/2013-1/2014
  • Interim pricing on HP nodes $4,300
  • GPU pricing still ~$6,300
    • Final price depends on GPU selected; many changes in NVIDIA offerings since January 2013

Participation
• First Haswell purchase will be hotel expansion
  • Hotel getting increasingly crowded in recent months
  • 8 nodes definite, 16 if $$ available
• Goal for coming year is 100 new condo nodes
  • Nominal “break even” point for cluster is ~250 condo nodes
  • Please help spread the word!

Cluster Operations
• Adding i/o nodes by mid-June
  • Offload large i/o from logins w/out burning SUs
• General s/w upgrade late June
  • Latest CentOS, application versions
• Research user-defined web services over summer
• oasis refresh toward end of summer
• Automate Infiniband switch/rack grouping
• Contemplating transition from torque/maui to slurm
  • maui is no longer actively developed/supported
  • If we make the jump, we’ll likely use translation scripts to ease transition
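To give a sense of what a translation script might do, here is a purely illustrative sketch, not the TSCC team's script. The function name `qsub_to_sbatch` is hypothetical, and mapping queues one-to-one onto Slurm partitions is an assumption. It handles only `-q`, `-l nodes=N:ppn=M`, and `-l walltime=`, and it silently drops node properties such as `ib` or `rack1`.

```shell
# Hypothetical translator: rewrite a small subset of qsub options
# into an sbatch command line and print it (nothing is submitted).
qsub_to_sbatch() {
    local out="sbatch" n p
    while [ $# -gt 0 ]; do
        case "$1" in
            -q)
                # Assumes each queue maps to a partition of the same name.
                out="$out --partition=$2"; shift 2 ;;
            -l)
                case "$2" in
                    nodes=*:ppn=*)
                        n=${2#nodes=}; n=${n%%:*}     # node count
                        p=${2#*:ppn=}; p=${p%%:*}     # processes per node
                        # Properties after ppn (ib, rack1, ...) are dropped here.
                        out="$out --nodes=$n --ntasks-per-node=$p" ;;
                    walltime=*)
                        out="$out --time=${2#walltime=}" ;;
                    *)
                        echo "untranslated resource: $2" >&2 ;;
                esac
                shift 2 ;;
            *)
                # Anything else (e.g. the script name) passes through unchanged.
                out="$out $1"; shift ;;
        esac
    done
    printf '%s\n' "$out"
}

qsub_to_sbatch -q condo -l nodes=2:ppn=8:ib:rack1 -l walltime=08:00:00 job.sh
```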

Q&A
