Scheduling and Resource Management for Next-generation Clusters

Yanyong Zhang, Penn State University, www.cse.psu.edu/~yyzhang

What is a Cluster? • Cost effective • Easily scalable • Highly available • Readily upgradeable

Scientific & Engineering Applications • HPTi wins a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm) • Sandia's expansion of its Alpha-based C-plant system. • Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm) • A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …. (http://www.swiss.ai.mit.edu/~pas/p/sc95.html) • The PC-cluster-based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …. (http://www.osc.edu/press/releases/2001/approved.shtml)

Commercial Applications • Business applications • Transaction Processing (IBM DB2, Oracle …) • Decision Support System (IBM DB2, Oracle …) • Internet applications • Web serving / searching (Google.com …) • Infowares (Yahoo.com, AOL.com) • Email, eChat, ePhone, eBook, eBank, eSociety, eAnything • Computing portal

Resource Management • Each application is demanding • Several applications/users can be present at the same time • As a result, resource management and quality of service become important.

System Model • Each node is independent • Maximum MPL • Arrival queue [Figure: nodes P0 to P4 connected by a high-speed network and fed by an arrival queue]

Two Phases in Resource Management • Allocation Issues • Admission Control • Arrival Queue Principle • Scheduling Issues (CPU Scheduling) • Resource Isolation • Co-allocation

Co-allocation / Co-scheduling [Figure: timeline from t0 to t1 showing a SEND and its matching RECV on processes P0 and P1, annotated with "scheduling skewness" and "switch"]

Outline • From the OS's perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From the application's perspective • Contribution 4: optimizing clustered DB2

Contribution 1: Boosting CPU Utilization at Supercomputing Centers

Objective • Minimize slowdown, where slowdown = (Response Time) / (Execute Time in Isolation) [Figure: response time broken down into wait time and execute time, with waits in the arrival Q and in the ready/blocked Q]
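
The slowdown metric on this slide can be worked through numerically. The sketch below assumes that response time is the wait in the arrival queue plus the execute time (which itself absorbs any time in the ready/blocked queue); the function name and the sample numbers are hypothetical, not taken from the talk.

```python
def slowdown(arrival_wait: float, execute_time: float, isolated_time: float) -> float:
    """Slowdown = response time / execution time in isolation.

    Assumption: response time = wait in the arrival queue + execute time,
    where execute time already includes waits in the ready/blocked queue.
    """
    if isolated_time <= 0:
        raise ValueError("isolated_time must be positive")
    response_time = arrival_wait + execute_time
    return response_time / isolated_time


# Hypothetical job: 30 s in the arrival queue, 90 s to execute while sharing
# the machine, and 60 s if it had the machine to itself.
example = slowdown(30.0, 90.0, 60.0)
```

A slowdown of 1 means the job ran as if it were alone on the cluster; larger values mean queueing and sharing stretched its response time.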

Existing Techniques • Back Filling (BF) • Gang Scheduling (GS) • Migration (M) [Figure: space-time diagrams of job placement; space axis: # of CPUs = 14; the other axis is time]
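
To make the idea behind back filling concrete, here is a deliberately simplified sketch: jobs are started in arrival order, and a job that does not fit is skipped so that later, smaller jobs can use the idle CPUs. Production backfilling schedulers typically also use run-time estimates and reservations so that the skipped job is not delayed indefinitely; that part is omitted here. The job names and sizes are made up; only the 14-CPU machine size comes from the slide.

```python
from typing import List, Tuple

Job = Tuple[str, int]  # (job name, CPUs requested); hypothetical representation


def backfill(queue: List[Job], free_cpus: int) -> Tuple[List[str], List[Job]]:
    """Start queued jobs in arrival order, letting later jobs jump ahead of
    any job that does not currently fit.

    Returns (names of started jobs, jobs still waiting in arrival order).
    Simplification: no reservations or run-time estimates are used.
    """
    started: List[str] = []
    waiting: List[Job] = []
    for name, cpus in queue:
        if cpus <= free_cpus:
            free_cpus -= cpus
            started.append(name)
        else:
            waiting.append((name, cpus))
    return started, waiting


# 14 CPUs: J1 starts, J2 does not fit, J3 is backfilled, J4 does not fit.
started, waiting = backfill([("J1", 8), ("J2", 8), ("J3", 3), ("J4", 6)], free_cpus=14)
```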

Proposed Scheme • MBGS = GS + BF + M • Use GS as the basic framework • At each row of the GS matrix, apply the BF technique • Whenever the GS matrix is recalculated, M should be considered.
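
The GS matrix can be pictured as rows of time slices, each holding jobs that run together on disjoint CPUs. The sketch below fills such a matrix first-fit, so a later job can land in an earlier row that still has room. It is only an illustration of the data structure: it is not the MBGS algorithm from the talk, it ignores migration entirely, and the function name, job sizes, and `max_rows` limit (standing in for the maximum MPL) are assumptions.

```python
from typing import Dict, List


def build_gs_matrix(jobs: Dict[str, int], n_cpus: int, max_rows: int) -> List[List[str]]:
    """Place each job (name -> CPUs needed) into the first row (time slice)
    with enough free CPUs. Each returned row lists jobs that run together.
    Jobs that fit nowhere once max_rows rows exist are left out (they would
    stay in the arrival queue).
    """
    rows: List[List[str]] = []
    free: List[int] = []  # free CPUs remaining in each row
    for name, cpus in jobs.items():
        if cpus > n_cpus:
            raise ValueError(f"{name} needs more CPUs than the machine has")
        for r in range(len(rows)):
            if cpus <= free[r]:
                rows[r].append(name)
                free[r] -= cpus
                break
        else:
            # No existing row had room: open a new row if the limit allows.
            if len(rows) < max_rows:
                rows.append([name])
                free.append(n_cpus - cpus)
    return rows


matrix = build_gs_matrix({"A": 8, "B": 8, "C": 3, "D": 6, "E": 2}, n_cpus=14, max_rows=3)
```

With these made-up jobs, C and E fill the holes left next to A, and D fills the hole next to B, so two time slices suffice instead of three or more.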

How Does MBGS Perform?

Contribution 2: Reducing Response Times for Commercial Applications

Objective • Minimize wait time • Minimize response time [Figure: response time broken down into wait time and execute time, with waits in the arrival Q and in the ready/blocked Q]

Previous Work I: Gang Scheduling (GS) • (1) MINUTES! • (2) GS is not responsive enough! [Figure: gang-scheduled time slices with idle slots marked "wasted"]

Previous Work II: Dynamic Co-scheduling • The scheduler on each node makes independent decisions based on local events, without global synchronization. [Figure: nodes P0 to P3 running processes A to D, annotated "It's A's turn", "C just finishes I/O", "B just gets a msg", and "Everybody else is blocked"]

Dynamic Co-scheduling Heuristics
Columns: How do you wait for a message? Rows: What do you do on message arrival?

                           Busy Wait   Spin Block   Spin Yield
No Explicit Reschedule     Local       SB           SY
Interrupt & Reschedule     DCS         DCS-SB       DCS-SY
Periodically Reschedule    PB          PB-SB        PB-SY
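
As a rough user-level analogy for the spin-block way of waiting, the sketch below spins on a flag for a bounded time and then falls back to a blocking wait. It assumes a `threading.Event` stands in for message arrival; the function name and return labels are hypothetical. Python threads are not a faithful model of a kernel scheduler or a network interface, so this only illustrates the control flow.

```python
import threading
import time


def spin_block_wait(arrived: threading.Event, spin_seconds: float, block_timeout: float) -> str:
    """Spin for up to spin_seconds; if the message has still not arrived,
    block (give up the CPU) until it does or until block_timeout expires.

    Returns "spin" if the message arrived while spinning, "block" if it
    arrived while blocked, and "timeout" otherwise.
    """
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        if arrived.is_set():
            return "spin"
    # Spin budget exhausted: stop burning CPU and block instead.
    if arrived.wait(timeout=block_timeout):
        return "block"
    return "timeout"
```

The spin budget is the tuning knob: too short and the process pays blocking and wake-up costs for messages that were about to arrive, too long and it wastes CPU that another process could have used.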

Simulation Study • A detailed simulator at a microsecond granularity • System parameters • System configurations (maximum MPL, to partition or not) • System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)

Simulation Study (Cont’d) • Application parameters • Injection load • Characteristics (CPU intensive, IO intensive, communication intensive or somewhere in the middle)

Impact of Load

Impact of Workload Characteristics [Figure panels: Comm intensive; I/O intensive]

Periodic Boost Heuristics • S1: Compute Phase • S2: S1 + Unconsumed Msg. • S3: Recv. + Msg. Arrived • S4: Recv. + No Msg. • A: S3-> {S2,S1} • B: S3->S2->S1 • C: {S3,S2,S1} • D: {S3,S2}->S1 • E: S2->S3->S1
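
One way to read the five orderings above is as priority lists consulted each time the periodic boost mechanism runs. The sketch below assumes that `->` means "strictly before", that states inside `{...}` share one priority level, and that a process in S4 (waiting in a receive with no message) is never boosted. Those readings, the round-robin tie-break, and all names are assumptions, not details confirmed by the slides.

```python
from typing import Dict, List, Optional, Sequence, Set

# Each policy is an ordered list of priority groups; states within one group
# are treated as equal priority (assumed reading of the {...} notation).
POLICIES: Dict[str, List[Set[str]]] = {
    "A": [{"S3"}, {"S2", "S1"}],
    "B": [{"S3"}, {"S2"}, {"S1"}],
    "C": [{"S3", "S2", "S1"}],
    "D": [{"S3", "S2"}, {"S1"}],
    "E": [{"S2"}, {"S3"}, {"S1"}],
}


def pick_to_boost(states: Sequence[str], policy: str, start: int = 0) -> Optional[int]:
    """Return the index of the process to boost, or None if nothing qualifies.

    states[i] is the state ("S1".."S4") of local process i. Within a priority
    group, processes are scanned round-robin beginning at index `start`.
    """
    n = len(states)
    for group in POLICIES[policy]:
        for offset in range(n):
            i = (start + offset) % n
            if states[i] in group:
                return i
    return None
```

For example, with local process states `["S1", "S4", "S2", "S3"]`, policies A and B pick process 3 (a receive whose message has arrived), D and E pick process 2, and C picks process 0 because all three boostable states are treated alike.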

Analytical Modeling Study • Dynamic arrival • The state space is impossible to handle. [Figure: nodes P0, P1, P2, P3, …, Pp connected by a high-speed network]

Analysis Description • Original state space (impossible to handle!!): the joint state of all P nodes, including the number of jobs on node k for k = 1,…,P [set definition garbled in extraction] • Reduced state space (much more tractable!!): [set definition garbled in extraction] • Assumption: The state of each processor is stochastically independent and identical to the state of the other processors.

Analysis Description (Cont'd) • Derive the state transition rates using a continuous-time Markov model; build the generator matrix Q • Get the invariant probability vector π by solving πQ = 0 and πe = 1 • Use fixed-point iteration to get the solution
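
For a small generator matrix, the invariant vector of a continuous-time Markov chain can be computed as sketched below. The solver here is uniformization followed by power iteration, chosen because it needs no external libraries; it is a stand-in and not necessarily the fixed-point iteration the slides refer to. The two-state chain and its rates are made up, and the chain is assumed to be irreducible.

```python
from typing import List


def stationary_distribution(Q: List[List[float]], tol: float = 1e-12,
                            max_iter: int = 100_000) -> List[float]:
    """Approximate pi with pi Q = 0 and sum(pi) = 1 for an irreducible CTMC.

    Uniformization: P = I + Q / rate is a stochastic matrix whenever `rate`
    is at least the largest exit rate, and P has the same invariant vector.
    """
    n = len(Q)
    rate = 1.1 * max(-Q[i][i] for i in range(n))  # strictly above every exit rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(new)
        new = [x / total for x in new]  # renormalize so the entries sum to 1
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    raise RuntimeError("power iteration did not converge")


# Hypothetical two-state chain: rate 1 from state 0 to 1, rate 3 back.
pi = stationary_distribution([[-1.0, 1.0], [3.0, -3.0]])
```

For this chain the balance equation gives π = (3/4, 1/4), which the iteration reproduces to within the tolerance.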

SB Example [Figure: state-transition diagram for the SB example; states are labeled with per-job statuses (C, IO, B, SP, SN, Q) and transitions with rates r1, r1', r2, …; the rate formulas are not recoverable from the transcript]

Results [Figure panels: Optimal PB Frequency; Optimal Spin Time for SB]

Results: Optimal Quantum Length [Figure panels: CPU Intensive; Comm Intensive; I/O Intensive]

Contribution 3: Scheduling Multiple Classes of Applications (interactive, real time, batch)

Objective BE RT How long did it take me to finish?? Response time How many deadlines have been missed? Miss rate cluster

Fairness Ratio (x:y) • The cluster resource is divided into shares of x/(x+y) and y/(x+y).

How to Adhere to Fairness Ratio? [Figure: three schemes (1GS, 2DCS-TDM, 2DCS-PS) scheduling RT and BE jobs on P0 and P1 over time, with x:y = 2:1]
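
The fairness ratio can be turned into numbers as follows. The sketch computes the two shares for a ratio x:y and lays out one cycle of a simple time-division schedule that honors it. Mapping x to the RT class and y to the BE class, as well as the slot-based layout, are assumptions made for illustration rather than the mechanisms evaluated in the talk.

```python
from fractions import Fraction
from typing import List, Tuple


def shares(x: int, y: int) -> Tuple[Fraction, Fraction]:
    """Shares of the cluster implied by a fairness ratio x:y."""
    total = x + y
    return Fraction(x, total), Fraction(y, total)


def tdm_cycle(x: int, y: int, first: str = "RT", second: str = "BE") -> List[str]:
    """One time-division cycle: x slots for the first class, then y slots
    for the second class (class names are assumed labels)."""
    return [first] * x + [second] * y


rt_share, be_share = shares(2, 1)
cycle = tdm_cycle(2, 1)
```

With x:y = 2:1 the shares are 2/3 and 1/3, and one cycle holds two slots for the first class followed by one for the second.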

BE Response Time [Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]

RT Deadline Miss Rate [Figure panels: RT : BE = 1:9; RT : BE = 2:1; RT : BE = 9:1]

Outline • From the OS's perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From the application's perspective • Characterizing decision support workloads on the clustered database server • Resource management for transaction processing workloads on the clustered database server

Experiment Setup • IBM DB2 Universal Database for Linux, EEE, Version 7.2 • 8 dual node Linux/Pentium cluster, that has 256 MB RAM and 18 GB disk on each node. • TPC-H workload. Queries are run sequentially (Q1 – Q20). Completion time for each query is measured.

Platform [Figure: a client issues "Select * from T" to the coordinator node; the rows of Table T (A 001, B 002, C 003, D 004) are held on server nodes connected by Myrinet]

Methodology • Identify the components with high system overhead. • For each such component, characterize the request distribution. • Come up with ways of optimization. • Quantify potential benefits from the optimization.

Sampling OS Statistics • Sample the statistics provided by stat, net/dev, process/stat. • User/system CPU % • # of page faults • # of blocks read/written • # of reads/writes • # of packets sent/received • CPU utilization during I/O
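
Sampling user/system CPU percentages can be sketched as below, assuming that "stat" on the slide refers to the Linux `/proc/stat` file, whose aggregate `cpu` line starts with cumulative user, nice, system, and idle tick counts. The functions take the file text as a string so they can be tried without a Linux machine; the function names and sample numbers are hypothetical.

```python
from typing import Dict


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Extract cumulative user/nice/system/idle ticks from the aggregate
    'cpu' line of /proc/stat-style text. Later fields (iowait, irq, ...),
    present on newer kernels, are ignored in this sketch."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, nice, system, idle = (int(v) for v in fields[1:5])
            return {"user": user, "nice": nice, "system": system, "idle": idle}
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before: str, after: str) -> Dict[str, float]:
    """User and system CPU % over the interval between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in a}
    total = sum(delta.values())
    if total == 0:
        raise ValueError("the two samples are identical")
    return {
        "user": 100.0 * (delta["user"] + delta["nice"]) / total,
        "system": 100.0 * delta["system"] / total,
    }


# Two made-up samples taken some interval apart.
pct = cpu_percentages("cpu 100 0 50 850\n", "cpu 160 0 70 970\n")
```

On a live system the two strings would come from reading `/proc/stat` twice with a sleep in between.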

Kernel Instrumentation • Instrument each system call in the kernel. [Figure: timeline of an instrumented system call with the events "enter system call", "exit system call", "block", "unblock", and "resume execution"]

Operating System Profile • A considerable part of the execution time is taken by the pread system call. • There is good overlap of computation with I/O for some queries. • There are more reads than writes.

TPC-H pread Overhead • pread overhead = (# of preads) × (overhead per pread)

pread Optimization
pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in cache { bring it in from disk }
        copy the page into dest
    }
}
• Optimization: • Re-mapping the buffer • Copy on write
[Figure: user space, page table, and page cache, with steps labeled 1 and 2 and a "30s" annotation]
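
The pread loop above can be mimicked in user space to see where the two costs come from: disk reads happen only on cache misses, while the copy into the destination buffer happens for every page on every call. This is a toy model with dictionaries standing in for the disk and the page cache; all names and the 4-byte "pages" are made up.

```python
from typing import Dict, List, Tuple


def simulated_pread(disk: Dict[int, bytes], cache: Dict[int, bytes],
                    pages: List[int]) -> Tuple[bytes, int, int]:
    """Mimic the slide's loop: bring missing pages into the cache, then copy
    every requested page into the destination buffer.

    Returns (data, disk_reads, page_copies).
    """
    dest = bytearray()
    disk_reads = 0
    page_copies = 0
    for page_no in pages:
        if page_no not in cache:
            cache[page_no] = disk[page_no]  # bring it in from disk
            disk_reads += 1
        dest += cache[page_no]              # copy the page into dest
        page_copies += 1
    return bytes(dest), disk_reads, page_copies


disk = {0: b"aaaa", 1: b"bbbb"}
cache: Dict[int, bytes] = {}
first = simulated_pread(disk, cache, [0, 1])   # cold cache
second = simulated_pread(disk, cache, [0, 1])  # warm cache
```

Even with a warm cache the second call still copies both pages, which is the per-call cost that re-mapping the buffer aims to avoid.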

Copy-on-write • % reduction = 1 - (# of copy-on-writes) / (# of preads) [Figure: user-space buffer mapped read-only onto the page cache]
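
The two formulas on the pread slides can be evaluated directly. The counts and the per-call overhead below are invented numbers used only to show the arithmetic; the talk's measured values are not in the transcript.

```python
def pread_overhead(n_preads: int, overhead_per_pread: float) -> float:
    """Total pread overhead = number of preads x overhead per pread."""
    return n_preads * overhead_per_pread


def cow_reduction(n_copy_on_write: int, n_preads: int) -> float:
    """Fraction of copies avoided: 1 - (# of copy-on-writes) / (# of preads).

    With re-mapping, a copy is paid only when a re-mapped page is later
    written, so fewer copy-on-writes means a larger reduction.
    """
    if n_preads <= 0:
        raise ValueError("n_preads must be positive")
    return 1.0 - n_copy_on_write / n_preads


# Hypothetical query: 1000 preads at 0.5 ms each, 250 of which later
# trigger a copy-on-write.
total_ms = pread_overhead(1000, 0.5)
reduction = cow_reduction(250, 1000)
```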

Operating System Profile • Socket calls are the next dominant system calls.

Message Characteristics [Figure panels for queries Q11 and Q16: message size (bytes); message inter-injection time (milliseconds); message destination]
