Memory restriction, limits and heterogeneous grids. A case study. • Txema Heredia Or an example of how to adapt your policies to your needs
DISCLAIMER What I am going to present is neither a panacea nor guaranteed to fit or immediately solve your cluster issues. This is just a brief description of the problems we faced and how we used different SGE options to handle them. Also, no animal was harmed in the making of this powerpoint.
“hey, let’s buy a cluster” - my boss
What did we need? • Users: • biologists, not programmers • Processes: • user-made scripts • single core biological software
What did we NOT need? • Nopes: • threads / parallel programming (mostly) • GPUs • Ayes: • thousands of single-core jobs
Our cluster • 8 computing nodes • 8 cores • 8 Gb RAM • 1 front-end
Our cluster • NFS • Rocks cluster (CentOS) • SGE
First steps with SGE • 1st try: • One queue to rule them all
First steps with SGE • 1st try: • all.q queue • free for all
First steps with SGE • 1st try - conclusions: • chaos reigned • constant conflicts between users (especially time related) • FIFO queuing • swapping
2nd try • 2nd try • round-robin-like scheduling • share tree/functional tickets • split cluster by time usage: • 3 queues: fast / medium / slow
2nd try • fast: • 2 hours / 2 nodes • medium: • 48 hours / 3 nodes • slow: • ∞ hours / 3 nodes
2nd try • Conclusions: • ↓ chaos • ↓ user conflicts • Still swapping • High undersubscription of the cluster
2nd try • 3 types of jobs • Don’t need to coexist at the same time • 1 user → 1 type of job • User knowledge • Saturation of the unlimited queue
2nd try • Queue tinkering: • wallclock time • number of hosts • Better results, but not good enough: • Waiting jobs & idle nodes
2nd try • There are 2 wars here: • memory / swap • splitting leads to undersubscription
Memory • Buy more memory • from 8x8Gb • to 4x 32Gb, 3x 16Gb, 1x 8Gb • This reduces our problem, but doesn’t fix it
Swap • Swapping in a cluster is the root of all evil
Swap • Complex attribute “h_vmem”
SGE limit attributes: • h_core, h_fsize, h_rss, h_stack • h_rt ≠ h_cpu (wallclock vs. CPU time) • h_data = h_vmem
h_vmem • h_vmem exceeded → job killed with SIGKILL • s_vmem exceeded → job sent SIGXCPU (catchable) • You can combine both: soft warning first, hard kill later
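Because SIGXCPU is catchable, a job script can react to the soft limit before the hard kill arrives. A minimal sketch (the message and the flag are our own illustration; the `kill` line just simulates what SGE would send when s_vmem is exceeded):

```shell
#!/bin/bash
# Sketch: trap the SIGXCPU that SGE sends when s_vmem is exceeded,
# so the job can checkpoint/clean up before h_vmem kills it outright.
soft_limit_hit=0
trap 'soft_limit_hit=1; echo "soft memory limit reached, cleaning up"' XCPU

# Simulate the soft-limit signal for demonstration purposes.
kill -XCPU $$
```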
h_vmem • Requestable by default • We want them to be consumable • qmon / qconf -mc
h_vmem • requestable = YES • consumable = YES / JOB • default = whatever you want
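In practice this means editing the h_vmem row of the complex configuration. A sketch of what the relevant line in the `qconf -mc` editor might look like (the 4G default is just an example value, not a recommendation):

```
#name    shortcut  type    relop  requestable  consumable  default  urgency
h_vmem   h_vmem    MEMORY  <=     YES          YES         4G       0
```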
h_vmem • The YES/JOB distinction only matters for parallel environment jobs: • consumable = YES → sge_shepherd accounts memory as h_vmem × slots • consumable = JOB → sge_shepherd accounts memory as h_vmem, once per job
h_vmem • default = 100M • “everything” dies • default = 6G • “everything” works
h_vmem • Now we can limit the memory • But we can still have swapping
h_vmem • Define h_vmem in each host • qmon / qconf -me hostname
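A sketch of declaring the consumable on one host via `qconf -me` (the hostname and value are illustrative):

```
# qconf -me compute-0-0   (illustrative hostname)
# ... in the complex_values field:
complex_values    h_vmem=32G
```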
h_vmem • Declare the exact physical memory: safer against swapping • Declare a bigger value: more scheduling margin
Memory • From now on, any job submission must contain a memory request: • qsub ... -l h_vmem=3G...
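For example (script names are hypothetical), a hard limit alone, or a soft warning combined with the hard kill:

```shell
# Hard limit only: job is killed with SIGKILL if it exceeds 3G.
qsub -l h_vmem=3G run_analysis.sh

# Soft warning at 2.5G (SIGXCPU), hard kill at 3G.
qsub -l s_vmem=2.5G -l h_vmem=3G run_analysis.sh
```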
Undersubscription • Dual restriction: • 8 jobs/slots per node • 32 / 16 / 8 GB mem per node • The minimum of both will apply
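The "minimum of both" rule can be sketched as a tiny helper (the function name and the whole-Gb simplification are ours):

```shell
# Effective capacity of a node = min(slot count, how many h_vmem
# requests of a given size fit in the node's memory).
capacity() {
  slots=$1; node_mem_gb=$2; job_mem_gb=$3
  by_mem=$(( node_mem_gb / job_mem_gb ))
  if [ "$slots" -lt "$by_mem" ]; then echo "$slots"; else echo "$by_mem"; fi
}

capacity 8 32 1   # slot-limited: prints 8
capacity 8 32 8   # memory-limited: prints 4
```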
[Diagram: placing one 8 Gb job and eight 1 Gb jobs on an 8 Gb node and a 32 Gb node] • Stupid scheduling: the 8 Gb job goes to the 8 Gb node (7 slots free, 0 Gb free) and the eight 1 Gb jobs fill the 32 Gb node (0 slots free, 24 Gb free) • Smart scheduling: the eight 1 Gb jobs fill the 8 Gb node (0 slots free, 0 Gb free) and the 8 Gb job goes to the 32 Gb node (7 slots free, 24 Gb free)
Smart scheduling • We want each job to go to the node where it fits best.
(another) DISCLAIMER This is strictly for our case and needs. It may appeal to you, or some ideas can inspire you, but it is not intended to be a step-by-step solution for everyone. It is just an example of “things that can be done”.
Smart scheduling • Create 3 hostgroups: • @32G, @16G and @8G • Group nodes by memory
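Hostgroups can be created with `qconf -ahgrp`; a sketch of one group's configuration (node names are illustrative):

```
# qconf -ahgrp @32G
group_name  @32G
hostlist    compute-0-0 compute-0-1 compute-0-2 compute-0-3
```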
Smart scheduling • Maximize the ratio memory/core: • job ≤ 1Gb → 8Gb nodes • 1Gb < job ≤ 2Gb → 16Gb nodes • job > 2Gb → 32Gb nodes
Smart scheduling • 3 different queues: • all-32 • all-16 • all-8 • assign the corresponding hostgroup
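Putting the ratios together, a user-side helper that maps a per-slot memory request to one of the three queues might look like this (the helper is our own sketch, not part of SGE):

```shell
# Map a job's per-slot memory request (whole Gb) to the queue whose
# nodes give the best memory/core ratio for it.
pick_queue() {
  mem_gb=$1
  if   [ "$mem_gb" -le 1 ]; then echo "all-8"    # 8 Gb nodes: 1 Gb/core
  elif [ "$mem_gb" -le 2 ]; then echo "all-16"   # 16 Gb nodes: 2 Gb/core
  else                           echo "all-32"   # 32 Gb nodes: 4 Gb/core
  fi
}
```

Usage would then be something like `qsub -q "$(pick_queue 3)" -l h_vmem=3G job.sh`.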