Scheduling and Resource Management for Next-generation Clusters. Yanyong Zhang Penn State University www.cse.psu.edu/~yyzhang. What is a Cluster?. Cost effective Easily scalable Highly available Readily upgradeable. Scientific & Engineering Applications.
Scheduling and Resource Management for Next-generation Clusters Yanyong Zhang Penn State University www.cse.psu.edu/~yyzhang
What is a Cluster? • Cost effective • Easily scalable • Highly available • Readily upgradeable
Scientific & Engineering Applications • HPTi wins a 5-year, $15M procurement to provide systems for weather modeling (NOAA). (http://www.noaanews.noaa.gov/stories/s419.htm) • Sandia's expansion of their Alpha-based C-plant system. • Maui HPCC LosLobos Linux Super-cluster (http://www.dl.ac.uk/CFS/benchmarks/beowulf/tsld007.htm) • A performance-price ratio of … is demonstrated in simulations of wind instruments using a cluster of 20 …. (http://www.swiss.ai.mit.edu/~pas/p/sc95.html) • The PC cluster based parallel simulation environment and the technologies … will have a positive impact on networking research nationwide …. (http://www.osc.edu/press/releases/2001/approved.shtml)
Commercial Applications • Business applications • Transaction Processing (IBM DB2, Oracle …) • Decision Support System (IBM DB2, Oracle …) • Internet applications • Web serving / searching (Google.com …) • Infowares (Yahoo.com, AOL.com) • Email, eChat, ePhone, eBook, eBank, eSociety, eAnything • Computing portal
Resource Management • Each application is demanding • Several applications/users can be present at the same time • Resource management and quality of service become important.
System Model • Each node is independent • Maximum MPL • Arrival queue [Figure: nodes P0 to P4 connected by a high-speed network and fed by an arrival queue]
Two Phases in Resource Management • Allocation Issues • Admission Control • Arrival Queue Principle • Scheduling Issues (CPU Scheduling) • Resource Isolation • Co-allocation
Co-allocation / Co-scheduling [Figure: timeline from t0 to t1 for processes P0 and P1, marking SEND, RECV, a context switch, and the resulting scheduling skewness]
Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Contribution 4: optimizing clustered DB2 NEXT
Contribution 1: Boosting CPU Utilization at Supercomputing Centers
Objective • Response Time = Wait Time + Execute Time • Waiting happens in the arrival Q and in the ready/blocked Q • Minimize slowdown = Response Time / Execute Time in Isolation
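The slowdown metric on this slide can be illustrated with a few lines of Python. This is only a sketch: the function name and the sample numbers are hypothetical, and it assumes, as the slide's labels suggest, that response time is the sum of wait time and execute time.

```python
def slowdown(wait_time, execute_time, isolated_execute_time):
    """Slowdown = response time / execute time in isolation.

    Assumption (not stated explicitly on the slide): response time is
    wait time plus execute time, all in the same unit.
    """
    if isolated_execute_time <= 0:
        raise ValueError("isolated_execute_time must be positive")
    response_time = wait_time + execute_time
    return response_time / isolated_execute_time


# Hypothetical job: waits 30 s, runs 60 s on the shared cluster,
# and would have run 45 s on a dedicated machine.
print(slowdown(30.0, 60.0, 45.0))
```

A slowdown of 1.0 would mean the job finished as fast as it would have alone; larger values mean queueing and sharing cost it proportionally more.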
Existing Techniques • Back Filling (BF) • Gang Scheduling (GS) • Migration (M) [Figure: space-time diagram of job allocations; # of CPUs = 14]
Proposed Scheme • MBGS = GS + BF + M • Use GS as the basic framework • At each row of GS matrix, apply BF technique • Whenever GS matrix is re-calculated, M should be considered.
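To make the idea of filling holes in the rows of a GS matrix concrete, here is a minimal sketch. It is not the MBGS algorithm from the talk: it only does first-fit packing of jobs into matrix rows, so that a later, smaller job can occupy CPUs an earlier job left idle. It omits the reservation checks that real backfilling uses and omits migration entirely. The function name `pack_rows` and the job list are hypothetical.

```python
def pack_rows(jobs, num_cpus, max_rows):
    """First-fit packing of jobs into gang-scheduling matrix rows.

    jobs: list of (name, cpus_needed) in arrival order.
    Each row is one time slice; a row can hold jobs whose CPU
    demands sum to at most num_cpus.
    Returns (rows, free_cpus_per_row, names_left_waiting).
    """
    rows = [[] for _ in range(max_rows)]
    free = [num_cpus] * max_rows
    waiting = []
    for name, need in jobs:
        for r in range(max_rows):
            if need <= free[r]:
                rows[r].append(name)   # fill a hole in row r
                free[r] -= need
                break
        else:
            waiting.append(name)       # no row has room: stay in arrival queue
    return rows, free, waiting


# Hypothetical workload on a 14-CPU cluster with two time slices.
jobs = [("A", 8), ("B", 8), ("C", 3), ("D", 6), ("E", 2), ("F", 5)]
rows, free, waiting = pack_rows(jobs, num_cpus=14, max_rows=2)
print(rows, free, waiting)
```

Here jobs C and E slip into the first row next to A even though B, which arrived before them, had to open a second row.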
Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Contribution 4: optimizing clustered DB2 NEXT
Contribution 2: Reducing Response Times for Commercial Applications
Objective • Response Time = Wait Time + Execute Time • Waiting happens in the arrival Q and in the ready/blocked Q • Minimize wait time • Minimize response time
Previous Work I: Gang Scheduling (GS) • (1) MINUTES! • (2) GS is not responsive enough! [Figure: gang-scheduling timeline with wasted slots marked]
Previous Work II: Dynamic Co-scheduling • The scheduler on each node makes independent decisions based on local events, without global synchronization. [Figure: nodes P0 to P3 running jobs A to D, annotated "It's A's turn", "C just finishes I/O", "B just gets a msg", "Everybody else is blocked"]
Dynamic Co-scheduling Heuristics
Columns: How do you wait for a message? Rows: What do you do on message arrival?
                           Busy Wait    Spin Block    Spin Yield
No Explicit Reschedule     Local        SB            SY
Interrupt & Reschedule     DCS          DCS-SB        DCS-SY
Periodically Reschedule    PB           PB-SB         PB-SY
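The "Spin Block" waiting strategy can be sketched in user-level Python as follows. This is an illustration only, not the kernel or messaging-layer code behind the talk: a `threading.Event` stands in for message arrival, and the function name and spin budget are hypothetical.

```python
import threading
import time


def spin_block_wait(msg_arrived, spin_seconds):
    """Spin for at most spin_seconds, then block until the message arrives.

    msg_arrived: threading.Event set by the (simulated) sender.
    Returns "spin" if the message showed up while spinning,
    "block" if the waiter had to give up the CPU and sleep.
    """
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        if msg_arrived.is_set():      # busy-wait phase: poll without sleeping
            return "spin"
    msg_arrived.wait()                # block phase: sleep until woken
    return "block"


# Message already there: the spin phase catches it.
ev = threading.Event()
ev.set()
fast = spin_block_wait(ev, spin_seconds=0.05)

# Message arrives well after the spin budget: the waiter blocks.
ev2 = threading.Event()
threading.Timer(0.05, ev2.set).start()
slow = spin_block_wait(ev2, spin_seconds=0.001)
print(fast, slow)
```

The spin budget is the tunable here: too short and the waiter pays for blocking and waking on messages that were about to arrive, too long and it burns CPU that another job could have used.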
Simulation Study • A detailed simulator at a microsecond granularity • System parameters • System configurations (maximum MPL, to partition or not) • System overheads (context switch overheads, interrupt costs, costs associated with manipulating queues)
Simulation Study (Cont’d) • Application parameters • Injection load • Characteristics (CPU intensive, IO intensive, communication intensive or somewhere in the middle)
Impact of Workload Characteristics [Figure panels: Comm intensive; I/O intensive]
Periodic Boost Heuristics • S1: Compute Phase • S2: S1 + Unconsumed Msg. • S3: Recv. + Msg. Arrived • S4: Recv. + No Msg. • A: S3-> {S2,S1} • B: S3->S2->S1 • C: {S3,S2,S1} • D: {S3,S2}->S1 • E: S2->S3->S1
Analytical Modeling Study • The state space is impossible to handle. [Figure: nodes P0, P1, P2, P3, …, Pp on a high-speed network with dynamic arrivals]
Analysis Description • Original State Space (impossible to handle!!) • Assumption: The state of each processor is stochastically independent and identical to the state of the other processors. • Reduced State Space (much more tractable!!) [Equations: formal definitions of the original and reduced state spaces; legible labels include "number of nodes" and "Number of jobs on node k"]
Analysis Description (Cont) Address the state transition rates using Continuous Markov model; Build the Generator Matrix Q Get the invariant probability vector by solving Q = 0,and e = 1. Use fixed-point iteration to get the solution
SB Example [Figure: state-transition diagram for two jobs under the SB heuristic, with per-job states labeled C, B, IO, SP, SN and queue Q, and transition rates r1, r1', r2 = … expressed as sums of state probabilities times rates]
Results [Figure panels: Optimal PB Frequency; Optimal Spin Time for SB]
Results – Optimal Quantum Length [Figure panels: CPU Intensive; Comm Intensive; I/O Intensive]
Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Contribution 4: optimizing clustered DB2 NEXT
Contribution 3: Scheduling Multiple Classes of Applications • interactive • real time • batch
Objective • BE: How long did it take me to finish? Metric: response time • RT: How many deadlines have been missed? Metric: miss rate [Figure: BE and RT jobs sharing a cluster]
Fairness Ratio (x:y) • The cluster resource is split into shares x/(x+y) and y/(x+y)
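The arithmetic behind a fairness ratio is small enough to show directly. The sketch below turns a ratio x:y into the two resource shares; the function name is hypothetical, and exact fractions are used only to keep the output readable.

```python
from fractions import Fraction


def shares(x, y):
    """Split a resource according to the fairness ratio x:y.

    Returns (x/(x+y), y/(x+y)) as exact fractions.
    """
    if x < 0 or y < 0 or x + y == 0:
        raise ValueError("x and y must be non-negative and not both zero")
    total = x + y
    return Fraction(x, total), Fraction(y, total)


first, second = shares(2, 1)
print(first, second)  # 2/3 1/3
```

With a ratio of 2:1 the first class is entitled to two thirds of the cluster and the second class to one third.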
How to Adhere to Fairness Ratio? [Figure: schedules over time on P0 and P1 for 1GS, 2DCS-TDM, and 2DCS-PS with x:y = 2:1; jobs labeled RT, RT1, RT2, BE]
BE Response Time [Figure panels: RT : BE = 2:1; RT : BE = 1:9; RT : BE = 9:1]
RT Deadline Miss Rate [Figure panels: RT : BE = 1:9; RT : BE = 2:1; RT : BE = 9:1]
Outline • From OS’s perspective • Contribution 1: boosting the CPU utilization at supercomputing centers • Contribution 2: providing quick responses for commercial workloads • Contribution 3: scheduling multiple classes of applications • From application’s perspective • Characterizing decision support workloads on the clustered database server • Resource management for transaction processing workloads on the clustered database server NEXT
Experiment Setup • IBM DB2 Universal Database for Linux, EEE, Version 7.2 • 8 dual node Linux/Pentium cluster with 256 MB RAM and 18 GB of disk on each node. • TPC-H workload. Queries are run sequentially (Q1 – Q20). The completion time of each query is measured.
Platform [Figure: a client sends "Select * from T" to the coordinator node; server nodes are connected by Myrinet; Table T holds rows A 001, B 002, C 003, D 004, shown distributed over the nodes]
Methodology • Identify the components with high system overhead. • For each such component, characterize the request distribution. • Come up with ways of optimization. • Quantify potential benefits from the optimization.
Sampling OS Statistics • Sample the statistics provided by stat, net/dev, process/stat. • User/system CPU % • # of page faults • # of blocks read/written • # of reads/writes • # of packets sent/received • CPU utilization during I/O
Kernel Instrumentation • Instrument each system call in the kernel. [Figure: timeline with events: enter system call, block, unblock, resume execution, exit system call]
Operating System Profile • A considerable part of the execution time is taken by the pread system call. • There is good overlap of computation with I/O for some queries. • There are more reads than writes.
TPC-H pread Overhead • pread overhead = # of preads × overhead per pread.
pread Optimization
pread(dest, chunk) {
    for each page in the chunk {
        if the page is not in cache {
            bring it in from disk
        }
        copy the page into dest
    }
}
• Optimization: • Re-mapping the buffer • Copy on write
[Figure: page table, user space, and page cache, with steps 1 and 2 marked; annotation "30s"]
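The pread pseudocode on this slide can be mimicked with a toy model to show where the per-page copy happens. This is a user-level illustration, not kernel code: the page cache and disk are plain dictionaries, and the name `pread_model` is hypothetical.

```python
def pread_model(pages_wanted, page_cache, disk):
    """Toy model of the pread loop from the slide.

    pages_wanted: page numbers making up the requested chunk.
    page_cache:   dict page_no -> bytes (updated in place on a miss).
    disk:         dict page_no -> bytes (the backing store).
    Returns (dest, disk_reads): dest holds one private copy per page.
    """
    dest = []
    disk_reads = 0
    for page_no in pages_wanted:
        if page_no not in page_cache:
            page_cache[page_no] = disk[page_no]   # bring it in from disk
            disk_reads += 1
        # The copy the slide's optimization targets: one per page, every call.
        dest.append(bytearray(page_cache[page_no]))
    return dest, disk_reads


disk = {0: b"aa", 1: b"bb"}
cache = {}
first_dest, first_reads = pread_model([0, 1], cache, disk)
second_dest, second_reads = pread_model([0, 1], cache, disk)
print(first_reads, second_reads)  # 2 0
```

Even when every page is already cached, as in the second call, each page is still copied into the caller's buffer; re-mapping the buffer with copy-on-write aims to avoid that copy when the caller never modifies the data.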
Copy-on-write • % reduction = 1 - (# of copy-on-write / # of preads) [Figure: user space mapping the page cache read only]
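The reduction formula can be checked with a quick calculation. The counts below are made up for illustration and are not measurements from the talk; the function name is hypothetical.

```python
def copy_reduction(num_copy_on_write, num_preads):
    """Fraction of copies avoided: 1 - (# of copy-on-write / # of preads)."""
    if num_preads <= 0:
        raise ValueError("num_preads must be positive")
    return 1 - num_copy_on_write / num_preads


# Hypothetical counts: 1000 preads, of which 250 later trigger a copy-on-write.
print(copy_reduction(250, 1000))  # 0.75
```

The fewer re-mapped buffers the application later writes to, the closer the reduction gets to 1, that is, to eliminating the copy entirely.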
Operating System Profile • Socket calls are the next dominant system calls.
Message Characteristics [Figure panels for Q11 and Q16: Message Size (bytes); Message Inter-injection Time (milliseconds); Message Destination]