Cluster Resource Management: A Scalable Approach

Ning Li and Jordan Parker
CS 736 Class Project

Outline
• Introduction
• A Scalable Approach: Hierarchy
• Results
• Conclusions
• Questions

Why Study Resource Management?
• Clusters have become increasingly popular for large parallel computing
  • Web servers
• Clusters are growing to the order of thousands of nodes
• Clusters are providing multiple services
• Resource management is hard to evaluate
  • Bad management is easy to identify
  • Good management is much harder to identify

Resource Management Example
• Two services, A and B; the 4th node services only B
• Poor management
  • Nodes 1-3: A 50%, B 50%
  • Node 4: B 100%
  • Overall: A 37.5%, B 62.5%
• Ideal
  • Nodes 1-3: A 66%, B 33%
  • Node 4: B 100%
  • Overall: A 50%, B 50%
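The overall figures in this example follow from averaging each service's share across the nodes. A minimal Python sketch of that arithmetic, assuming all four nodes have equal capacity (an assumption the slide does not state); the function name `overall_share` is illustrative only:

```python
def overall_share(per_node):
    """Average each service's share across equally sized nodes.

    per_node is a list of dicts mapping service name -> share in percent.
    Equal node capacity is an assumption made for this sketch.
    """
    services = sorted({s for node in per_node for s in node})
    n = len(per_node)
    return {s: sum(node.get(s, 0.0) for node in per_node) / n for s in services}


# Poor management: nodes 1-3 split evenly, node 4 serves only B.
poor = [{"A": 50, "B": 50}] * 3 + [{"B": 100}]
# Ideal: nodes 1-3 give A two thirds and B one third to offset node 4.
ideal = [{"A": 200 / 3, "B": 100 / 3}] * 3 + [{"B": 100}]

print(overall_share(poor))   # A 37.5, B 62.5
print(overall_share(ideal))  # approximately A 50, B 50
```

The per-node 66% / 33% split is what brings the cluster-wide totals back to an even share when one node can serve only B.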

Clustering Goals
• Scalability
• Reliability
• High Performance
• Affordability

Related Work
• Proportional-Share
• Cluster Reserves

Related Work: Approach Differences
• Our goal: provide a scalable solution for resource management
• Other work focused primarily on achieving good management
  • This often meant one manager for all the nodes
  • A single manager can become a scalability bottleneck
• Effectiveness: other solutions are probably better for smaller clusters; we hope to do better on large (>1000 nodes) clusters

Hierarchy: A Scalable Approach
• Hierarchical management
• Nodes service jobs
• Managers facilitate resource management
[Figure: hierarchy of managers and nodes, numbered 1-12]
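One way to picture the hierarchy: each manager talks only to its direct children and forwards a single aggregated report upward, so no manager has to hear from every node. The sketch below is hypothetical; the class names `Node` and `Manager` and the per-class usage dictionaries are assumptions for illustration, not the project's implementation:

```python
from collections import Counter


class Node:
    """A leaf that services jobs and reports per-class usage (in tickets)."""

    def __init__(self, usage):
        self.usage = Counter(usage)

    def report(self):
        return Counter(self.usage)


class Manager:
    """Aggregates the reports of its children, which may be nodes or managers."""

    def __init__(self, children):
        self.children = list(children)

    def report(self):
        total = Counter()
        for child in self.children:
            total.update(child.report())  # one report per direct child
        return total


# Two low-level managers with two nodes each, under one top-level manager.
left = Manager([Node({"A": 50, "B": 50}), Node({"A": 50, "B": 50})])
right = Manager([Node({"A": 50, "B": 50}), Node({"B": 100})])
root = Manager([left, right])

print(dict(root.report()))
```

Here the top-level manager receives two reports rather than four; with larger fan-out and more levels the per-manager load stays bounded by the number of direct children.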

Banking Algorithm
• Goal
  • Determine the best allocation given previous usage
• Primitives
  • Tickets
  • Bank accounts
  • Deposit / withdraw tickets
• 6 steps

Banking Algorithm
• Step 1: For each service class on each node
  • Deposit unused tickets
• Step 2: For each service class on each node
  • Reallocate service class
  • Full utilization: allocation = usage + k
  • Under-utilization: allocation = usage - k
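Steps 1 and 2 can be read as a per-node, per-class update. The sketch below is one literal reading of the slide, not the authors' implementation: the `Bank` class, the `step_1_and_2` function, treating "full utilization" as usage at or above the allocation, and clamping allocations at zero are all assumptions, and k is left as a parameter because the slide gives no value.

```python
from collections import defaultdict


class Bank:
    """Holds deposited tickets per service class (hypothetical structure)."""

    def __init__(self):
        self.accounts = defaultdict(int)

    def deposit(self, service, tickets):
        self.accounts[service] += tickets


def step_1_and_2(allocation, usage, bank, k):
    """Return new per-class allocations for one node.

    allocation, usage: dicts mapping service class -> tickets.
    Step 1 deposits unused tickets; step 2 sets the allocation to
    usage + k for fully utilized classes and usage - k otherwise,
    as written on the slide.
    """
    new_allocation = {}
    for service, allocated in allocation.items():
        used = usage.get(service, 0)
        unused = max(allocated - used, 0)
        if unused:
            bank.deposit(service, unused)               # Step 1
        if used >= allocated:                           # assumed meaning of "full"
            new_allocation[service] = used + k          # Step 2, full utilization
        else:
            new_allocation[service] = max(used - k, 0)  # Step 2, clamped at 0 (assumption)
    return new_allocation


bank = Bank()
new = step_1_and_2({"A": 50, "B": 50}, {"A": 50, "B": 20}, bank, k=5)
print(new, dict(bank.accounts))
```

In this run class A used its whole allocation and is offered more, while class B left 30 tickets unused, which go to the bank for the later steps to redistribute.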

Banking Algorithm Cont.
• Step 3: For each service class
  • Compare total allocation to desired
  • Subtract from over-allocated
  • Add to needy & under-allocated
• Step 4: For each service class
  • Deposit / withdraw
  • If still over-allocated, withdraw
  • If still under-allocated, deposit

Banking Algorithm Cont.
• Step 5:
  • Withdraw and allocate
  • Reward the needy nodes
• Step 6:
  • Done, clear the bank accounts

Reliability
• Bottom-up manager replacement
[Figure: hierarchy of nodes 1-12 illustrating manager replacement]
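Bottom-up manager replacement can be sketched as promoting a child into a failed manager's position and then filling the child's old position from below in the same way. This is a hypothetical reading of the slide: the dictionary representation, the function name `replace_manager`, and the choice of the first child as successor are assumptions, and a manager with no children is not handled.

```python
def replace_manager(tree, failed):
    """Promote the first child of `failed` into its place, recursively.

    tree maps each manager id to the ordered list of its children;
    leaves do not appear as keys. Returns the id that took over.
    Choosing the first child as successor is an assumption of this sketch.
    """
    children = tree.pop(failed)
    successor = children[0]
    remaining = children[1:]
    if successor in tree:
        # The successor was itself a manager: fill its old slot from below.
        remaining = [replace_manager(tree, successor)] + remaining
    tree[successor] = remaining
    # Re-point the parent of the failed manager (if any) at the successor.
    for kids in tree.values():
        for i, kid in enumerate(kids):
            if kid == failed:
                kids[i] = successor
    return successor


# Manager 1 on top, managers 2-4 below it, nodes 5-12 at the bottom.
tree = {1: [2, 3, 4], 2: [5, 6, 7], 3: [8, 9, 10], 4: [11, 12]}
new_root = replace_manager(tree, 1)
print(new_root, tree)
```

After manager 1 fails in this example, 2 takes over the top position and node 5 moves up to manage 2's former children.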

Results

Implementation Details
• Simulations via NS, the Network Simulator
• Low-bandwidth 10 Mb/s communication network
• UDP for lower server overhead
• Assumptions
  • Node-level resource management works ideally

Test 1: Overview
• 4 nodes, 3 services, 60/30/10 allocation
• 4th node receives all of the 3rd class’s requests
• Steady workload
• Allocation shown in the figure:
  • Nodes 1-3: 1st 66%, 2nd 33%
  • Node 4: 1st 40%, 2nd 20%, 3rd 40%
  • Overall: 1st 60%, 2nd 30%, 3rd 10%

Test 1: Data

Test 2: Overview
• 100 nodes, 3 services, 60/30/10 allocation
• Nodes 1-30 receive all of the 3rd class’s requests
• Steady workload

Test 3: Overview
• 100 nodes, 3 services, 60/30/10 allocation
• Nodes 1-30 receive all of the 3rd class’s requests
• Dynamic workload

Test 4: Overview
• 100 nodes, 3 services, 60/30/10 allocation
• Nodes 1-30 receive all of the 3rd class’s requests
• Steady workload
• Reporting 1/5
  • Nodes report every 0.3 seconds
  • Managers report every 1.5 seconds

Test 5: Overview
• 900 nodes, 3 services, 60/30/10 allocation
• Nodes 1-300 receive all of the 3rd class’s requests
• Steady workload

Conclusions
• Benefits of a hierarchy
  • Scalable
  • Reliable
  • Geographic applications
• Implemented a new management scheme: Banking
  • Comparable results
  • Improved scalability

Conclusions
• Clusters are sensitive to small policy changes
  • Clusters are built for specific workloads
  • Their performance is important, and small changes have significant impact
  • No scheme is universally applicable
• Future work
  • Real system implementation
  • Real workloads
  • Real node-level resource management
  • Steadier performance

Questions

Related Work: Proportional-Share
• Stride scheduling
  • Ticket-based and similar to lottery scheduling
• Scale
  • Randomly query k nodes to find the best allocation
• Different application
  • Condor-like resource allocation/applications

Related Work: Cluster Reserves
• Resource container schedulers
• Constrained optimization algorithm
• Scale
  • Centralized single manager

Hierarchical Cluster Reserves – Version 1
• Modify the Cluster Reserves optimization algorithm
• Use it when a manager manages nodes
• AND when a level_n+1 manager manages level_n managers

Hierarchical Cluster Reserves – Version 2
• Cluster Reserves optimization algorithm
  • Use it when a manager manages nodes
  • Don’t use it for upper-level managers
• Modify the manager-to-manager reporting
  • Lie to the algorithm
