Memory restriction, limits and heterogeneous grids. A case study. • Txema Heredia Or an example of how to adapt your policies to your needs
DISCLAIMER What I am going to present is neither a panacea nor guaranteed to fit or immediately solve your cluster issues. This is just a brief description of the problems we faced and how we used different SGE options to handle them. Also, no animal was harmed in the making of this powerpoint.
“hey, let’s buy a cluster” - my boss
What did we need? • Users: • biologists, not programmers • Processes: • user-made scripts • single core biological software
What did we NOT need? • Nopes: • threads / parallel programming (mostly) • GPUs • Ayes: • thousands of single-core jobs
Our cluster • 8 computing nodes • 8 cores • 8 Gb RAM • 1 front-end
Our cluster • NFS • Rocks cluster (CentOS) • SGE
First steps with SGE • 1st try: • One queue to rule them all
First steps with SGE • 1st try: • all.q queue • free for all
First steps with SGE • 1st try - conclusions: • chaos reigned • constant conflicts between users (especially time related) • FIFO queuing • swapping
2nd try • 2nd try • round-robin-like scheduling • share tree/functional tickets • split cluster by time usage: • 3 queues: fast / medium / slow
2nd try • fast: • 2 hours / 2 nodes • medium: • 48 hours / 3 nodes • slow: • ∞ hours / 3 nodes
2nd try • Conclusions: • ↓ chaos • ↓ user conflicts • Still swapping • High undersubscription of the cluster
2nd try • 3 types of jobs • Don’t need to coexist at the same time • 1 user → 1 type of job • User knowledge • Saturation of the unlimited queue
2nd try • Queue tinkering: • wallclock time • number of hosts • Better results, but not good enough: • Waiting jobs & idle nodes
2nd try • There are 2 wars here: • memory / swap • splitting leads to undersubscription
Memory • Buy more memory • from 8x8Gb • to 4x 32Gb, 3x 16Gb, 1x 8Gb • This reduces our problem, but doesn’t fix it
Swap • Swapping in a cluster is the root of all evil
Swap • Complex attribute “h_vmem”
SGE limit attributes: • h_core, h_fsize, h_rss, h_stack • h_rt ≠ h_cpu (wallclock vs. CPU time) • h_data = h_vmem
h_vmem • h_vmem exceeded → job killed with SIGKILL • s_vmem exceeded → job sent SIGXCPU (catchable) • You can combine both: soft warning first, hard kill later
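Because SIGXCPU is catchable, a job script can react to the soft limit before the hard kill arrives. A minimal sketch (the message and the flag are our own illustration; the `kill` line just simulates what SGE would send when s_vmem is exceeded):

```shell
#!/bin/bash
# Sketch: trap the SIGXCPU that SGE sends when s_vmem is exceeded,
# so the job can checkpoint/clean up before h_vmem kills it outright.
soft_limit_hit=0
trap 'soft_limit_hit=1; echo "soft memory limit reached, cleaning up"' XCPU

# Simulate the soft-limit signal for demonstration purposes.
kill -XCPU $$
```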
h_vmem • Requestable by default • We want them to be consumable • qmon / qconf -mc
h_vmem • requestable = YES • consumable = YES / JOB • default = whatever you want
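In practice this means editing the h_vmem row of the complex configuration. A sketch of what the relevant line in the `qconf -mc` editor might look like (the 4G default is just an example value, not a recommendation):

```
#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         4G       0
```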
h_vmem • The YES/JOB distinction only matters for parallel environment jobs: • consumable = YES → sge_shepherd accounts memory as h_vmem × slots • consumable = JOB → sge_shepherd accounts memory as h_vmem, once per job
h_vmem • default = 100M • “everything” dies • default = 6G • “everything” works
h_vmem • Now we can limit the memory • But we can still have swapping
h_vmem • Define h_vmem in each host • qmon / qconf -me hostname
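A sketch of declaring the consumable on one host via `qconf -me` (the hostname and value are illustrative):

```
# qconf -me compute-0-0   (illustrative hostname)
# ... in the complex_values field:
complex_values    h_vmem=32G
```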
h_vmem • Declare the exact physical memory: safer against swapping • Declare a bigger value: more scheduling margin
Memory • From now on, any job submission must contain a memory request: • qsub ... -l h_vmem=3G...
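For example (script names are hypothetical), a hard limit alone, or a soft warning combined with the hard kill:

```shell
# Hard limit only: job is killed with SIGKILL if it exceeds 3G.
qsub -l h_vmem=3G run_analysis.sh

# Soft warning at 2.5G (SIGXCPU), hard kill at 3G.
qsub -l s_vmem=2.5G -l h_vmem=3G run_analysis.sh
```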
Undersubscription • Dual restriction: • 8 jobs/slots per node • 32 / 16 / 8 GB mem per node • The minimum of both will apply
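The "minimum of both" rule can be sketched as a tiny helper (the function name and the whole-Gb simplification are ours):

```shell
# Effective capacity of a node = min(slot count, how many h_vmem
# requests of a given size fit in the node's memory).
capacity() {
  slots=$1; node_mem_gb=$2; job_mem_gb=$3
  by_mem=$(( node_mem_gb / job_mem_gb ))
  if [ "$slots" -lt "$by_mem" ]; then echo "$slots"; else echo "$by_mem"; fi
}

capacity 8 32 1   # slot-limited: prints 8
capacity 8 32 8   # memory-limited: prints 4
```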
[Diagram: placing one 8 Gb job and eight 1 Gb jobs on an 8 Gb node and a 32 Gb node] • Stupid scheduling: the 8 Gb job goes to the 8 Gb node (7 slots free, 0 Gb free) and the eight 1 Gb jobs fill the 32 Gb node (0 slots free, 24 Gb free) • Smart scheduling: the eight 1 Gb jobs fill the 8 Gb node (0 slots free, 0 Gb free) and the 8 Gb job goes to the 32 Gb node (7 slots free, 24 Gb free)
Smart scheduling • We want each job to go to the node where it fits best.
(another) DISCLAIMER This is strictly for our case and needs. It may appeal to you, or some ideas can inspire you, but it is not intended to be a step-by-step solution for everyone. It is just an example of “things that can be done”.
Smart scheduling • Create 3 hostgroups: • @32G, @16G and @8G • Group nodes by memory
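Hostgroups can be created with `qconf -ahgrp`; a sketch of one group's configuration (node names are illustrative):

```
# qconf -ahgrp @32G
group_name  @32G
hostlist    compute-0-0 compute-0-1 compute-0-2 compute-0-3
```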
Smart scheduling • Maximize the ratio memory/core: • job ≤ 1Gb → 8Gb nodes • 1Gb < job ≤ 2Gb → 16Gb nodes • job > 2Gb → 32Gb nodes
Smart scheduling • 3 different queues: • all-32 • all-16 • all-8 • assign the corresponding hostgroup
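Putting the ratios together, a user-side helper that maps a per-slot memory request to one of the three queues might look like this (the helper is our own sketch, not part of SGE):

```shell
# Map a job's per-slot memory request (whole Gb) to the queue whose
# nodes give the best memory/core ratio for it.
pick_queue() {
  mem_gb=$1
  if   [ "$mem_gb" -le 1 ]; then echo "all-8"    # 8 Gb nodes: 1 Gb/core
  elif [ "$mem_gb" -le 2 ]; then echo "all-16"   # 16 Gb nodes: 2 Gb/core
  else                           echo "all-32"   # 32 Gb nodes: 4 Gb/core
  fi
}
```

Usage would then be something like `qsub -q "$(pick_queue 3)" -l h_vmem=3G job.sh`.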