
Triton Shared Computing Cluster Project Update

Triton Shared Computing Cluster Project Update. Jim Hayes jhayes@sdsc.edu 2014-05-21. TSCC To Date A Look Under the Hood Plans for the Near Future. Project Model. TSCC is a heterogeneous cluster with a hybrid usage model


Presentation Transcript


  1. Triton Shared Computing Cluster Project Update Jim Hayes jhayes@sdsc.edu 2014-05-21

  2. TSCC To Date • A Look Under the Hood • Plans for the Near Future

  3. Project Model • TSCC is a heterogeneous cluster with a hybrid usage model • General computing, GPU, large-memory nodes; h/w choices reviewed annually and expected to change • “Condo” nodes purchased by researchers, access shared among condo owners; “hotel” nodes provided by RCI project, access via cycle purchase • Operation and infrastructure costs paid by condo owners, subsidized by RCI project and hotel sales • All users have access to infiniband and 10GbE networking, home and parallel file systems • TSCC one-year production anniversary 2014-05-10

  4. Participation • Condo • 15 groups w/169 users • 116 general compute nodes (5+ in pipeline), 20 GPU • Hotel • 192 groups w/410 users • 46 general compute nodes, 3 GPU, 4 PDAFM • Of 94 trial (90-day, 250 SU) accounts, 20 bought cycles • Classes

  5. Jobs • ~1.54 million jobs since 5/10/2013 production • 800K hotel • 391K home • 42K condo • 98K glean • >6 million SUs spent

  6. Job Stats – Node Count • Single-node job count is 1.53M, about 233x the 2-node job count

  7. TSCC Job Stats - ppn

  8. TSCC Job Stats - Runtime

  9. TSCC Job Stats – Node Usage

  10. Issues from Prior Meeting • Maui scheduler crashing/hanging/unresponsive – Fixed • Upgrade of torque fixed communication problems • Duplicate job charges/hanging reservations – Managed • Less frequent post-upgrade; wrote scripts to detect/reverse • Glean queue not working – Fixed • Gave up on maui and wrote script to manage glean queue • X11 Forwarding failure – Fixed • Problem was missing ipv6 conf file • Home filesystem unresponsive under write load – Fixed • Post upgrade, fs handles user write load gracefully • zfs/nfs quotas broken; handled manually • User access to snapshots not working; restores via ticket

  11. TSCC To Date • A Look Under the Hood • Plans for the Near Future

  12. TSCC Rack Layout tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n

  13. Networking tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n • Infiniband x36 • 1GbE x40 • 10GbE x32

  14. Node Classes tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n Hotel GPU Condo GPU Hotel Hotel PDAFM Condo Administration Home F/S

  15. Processors tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n 2x8 2.6 GHz Sandy Bridge 4x8 2.5GHz Opteron 2x6 2.3GHz Sandy Bridge 2x6 2.6 GHz Ivy Bridge 2x8 2.2GHz Sandy Bridge

  16. Memory tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n 128GB 64GB 256GB 512GB 32GB

  17. GPUs tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n 4x GTX 780 Ti 4x GTX 780 4x GTX 680

  18. Infiniband Connectivity tscc-0-n tscc-1-n tscc-2-n tscc-3-n tscc-5-n tscc-6-n tscc-7-n rack0 ibswitch1 rack0 ibswitch2 rack0 ibswitch2 rack1 ibswitch4 rack1 ibswitch3 ibswitch6 ibswitch7 IB switch

  19. qsub Switches for IB Jobs • Only necessary for condo/glean jobs, because • Hotel nodes all in same rack (rack0) • Home nodes on same switch • To run on a single switch, specify switch property, e.g., • qsub -q condo -l nodes=2:ppn=8:ib:ibswitch3 • qsub -q hotel -l nodes=2:ppn=8:ib:ibswitch1 • To run in a single rack (IB switches interconnected), specify rack property, e.g., • qsub -q condo -l nodes=2:ppn=8:ib:rack1
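The switch and rack properties on this slide compose mechanically into the `-l nodes=` resource string. A minimal Python sketch of that composition follows. The helper name `ib_qsub_args` is hypothetical (it is not part of TSCC's tooling), and it only builds the argument list; it does not submit a job.

```python
def ib_qsub_args(queue, nodes, ppn, locality=None):
    """Build a qsub argument list for an Infiniband job.

    `locality` is an optional node property such as "ibswitch3" or
    "rack1", as named on the slide. Hypothetical helper for illustration.
    """
    props = [f"nodes={nodes}", f"ppn={ppn}", "ib"]
    if locality:
        # Pin the job to one IB switch or one rack.
        props.append(locality)
    return ["qsub", "-q", queue, "-l", ":".join(props)]


# Reproduce two of the slide's examples as command strings.
print(" ".join(ib_qsub_args("condo", 2, 8, "ibswitch3")))
print(" ".join(ib_qsub_args("condo", 2, 8, "rack1")))
```

Omitting `locality` leaves only the `ib` property, which matches the slide's note that hotel and home jobs do not need a switch or rack property.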

  20. Queues • All users have access to hotel, gpu-hotel, and pdafm queues • qsub [-q hotel] – max time 7d, total cores/user 176 • qsub -q gpu-hotel – max time 14d, total cores/user 36 • Non-GPU jobs may be run in gpu-hotel • qsub -q pdafm – max time 3d, total cores/user 96 • Condo owners have access to home, condo, glean queues • qsub -q home – unlimited time, cores • qsub -q condo – max time 8h, total cores/user 512 • qsub -q glean – unlimited time, max total cores/user 1024 • No charge to use, but jobs will be killed for higher-priority jobs • GPU owners have access to gpu-condo queue • qsub -q gpu-condo – max time 8h, total cores/user 84 • GPU node jobs allocated 1 GPU per 3 cores • Queue limits subject to change w/out notice
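The queue limits on this slide can be written down as a small lookup table, which makes it easy to sanity-check a request before submitting. The sketch below is illustrative only: `QUEUE_LIMITS` and `check_request` are hypothetical names, the values are transcribed from the slide (which notes they may change without notice), and the check simplifies the per-user total core limit to a single request.

```python
# Limits transcribed from the slide; None means "unlimited".
QUEUE_LIMITS = {
    "hotel":     {"max_hours": 7 * 24,  "max_cores": 176},
    "gpu-hotel": {"max_hours": 14 * 24, "max_cores": 36},
    "pdafm":     {"max_hours": 3 * 24,  "max_cores": 96},
    "home":      {"max_hours": None,    "max_cores": None},
    "condo":     {"max_hours": 8,       "max_cores": 512},
    "glean":     {"max_hours": None,    "max_cores": 1024},
    "gpu-condo": {"max_hours": 8,       "max_cores": 84},
}


def check_request(queue, hours, cores):
    """Return a list of limit violations for a request (empty if none).

    Assumption: the per-user total core limit is applied to one request,
    ignoring any other jobs the user already has in the queue.
    """
    limits = QUEUE_LIMITS[queue]
    max_hours, max_cores = limits["max_hours"], limits["max_cores"]
    problems = []
    if max_hours is not None and hours > max_hours:
        problems.append(f"{hours}h exceeds the {max_hours}h limit for {queue}")
    if max_cores is not None and cores > max_cores:
        problems.append(f"{cores} cores exceeds the {max_cores}-core limit for {queue}")
    return problems


print(check_request("condo", 12, 64))     # one violation: run time
print(check_request("home", 1000, 4096))  # no violations: home is unlimited
```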

  21. Commands to Answer FAQs • “Why isn’t my job running?” • checkjob job_id, e.g., checkjob 1234567 • “BankFailure” indicates not enough SUs remain to run job • “No Resources” may indicate bad request (e.g., ppn=32 in hotel) • “What jobs are occupying these nodes?” • lsjobs --property=string, e.g., lsjobs --property=hotel • “How many SUs do I have left?” • gbalance -u login, e.g., gbalance -u jhayes • “Why is my balance so low?” • gstatement -u login, e.g., gstatement -u jhayes • “How much disk space am I using?” • df -h /home/login, e.g., df -h /home/jhayes
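The FAQ commands on this slide all take a single argument, so they fit a simple question-to-command table. The following sketch assumes hypothetical question keys (`"why_not_running"`, `"balance"`, and so on) chosen for illustration; only the command names and flags come from the slide. It returns an argument list and does not execute anything.

```python
# Command templates taken from the slide; the dictionary keys are
# hypothetical labels invented for this example.
FAQ_TEMPLATES = {
    "why_not_running": ["checkjob", "{v}"],
    "jobs_on_nodes":   ["lsjobs", "--property={v}"],
    "balance":         ["gbalance", "-u", "{v}"],
    "statement":       ["gstatement", "-u", "{v}"],
    "disk_usage":      ["df", "-h", "/home/{v}"],
}


def faq_command(question, value):
    """Fill in the job id, node property, or login for one FAQ command."""
    return [part.format(v=value) for part in FAQ_TEMPLATES[question]]


print(" ".join(faq_command("why_not_running", "1234567")))
print(" ".join(faq_command("disk_usage", "jhayes")))
```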

  22. TSCC To Date • A Look Under the Hood • Plans for the Near Future

  23. Hardware Selection/Pricing • Jump to Haswell processors in 4th quarter 2014 • Go back to vendors for fresh quotes • Original vendor agreements for fixed pricing expired 12/2013-1/2014 • Interim pricing on HP nodes $4,300 • GPU pricing still ~$6,300 • Final price depends on GPU selected--many changes in NVIDIA offerings since January, 2013

  24. Participation • First Haswell purchase will be hotel expansion • Hotel getting increasingly crowded in recent months • 8 nodes definite, 16 if $$ available • Goal for coming year is 100 new condo nodes • Nominal “break even” point for cluster is ~250 condo nodes • Please help spread the word!
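The condo-node goal can be compared against the break-even figure using the counts from slide 4. This is a rough worked calculation, not a figure from the presentation: it assumes that GPU condo nodes count toward the ~250-node break-even point and it ignores the "5+ in pipeline" nodes.

```python
condo_general = 116  # general compute condo nodes (slide 4)
condo_gpu = 20       # GPU condo nodes (slide 4)
goal_new = 100       # new condo nodes targeted for the coming year (slide 24)
break_even = 250     # nominal break-even point, approximate (slide 24)

# Assumption: GPU nodes are included in the condo node total.
current = condo_general + condo_gpu
projected = current + goal_new
shortfall = break_even - projected

print(f"current condo nodes:   {current}")
print(f"projected after goal:  {projected}")
print(f"remaining to break even: {shortfall}")
```

Under these assumptions, meeting the 100-node goal would bring the cluster close to, but still slightly under, the nominal break-even point.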

  25. Cluster Operations • Adding i/o nodes by mid-June • Offload large i/o from logins w/out burning SUs • General s/w upgrade late June • Latest CentOS, application versions • Research user-defined web services over summer • oasis refresh toward end of summer • Automate Infiniband switch/rack grouping • Contemplating transition from torque/maui to slurm • maui is no longer actively developed/supported • If we make the jump, we’ll likely use translation scripts to ease transition

  26. Q&A
