Decentralized Dynamic Scheduling across Heterogeneous Multi-core Desktop Grids Jaehwan Lee, Pete Keleher, Alan Sussman Department of Computer Science University of Maryland
Multi-core is not enough • Multi-core CPUs are the current trend in desktop computing • It is not easy to exploit multiple cores in a single machine for high-throughput computing • “Multicore Is Bad News for Supercomputers”, S. Moore, IEEE Spectrum, 2008 • We have proposed a decentralized solution for initial job placement in multi-core grids, but..
Motivation and Challenges • Why is dynamic scheduling needed? • Load information from neighbors may be stale • Unpredictable job completion times can change the current load situation • Probabilistic initial job assignment can be improved • Challenges for decentralized dynamic scheduling on multi-core grids • Multiple resource requirements • Decentralized algorithm • No job starvation
Our Contribution • New decentralized dynamic scheduling schemes for multi-core grids • Local scheduling • Internode scheduling • Queue balancing • Experimental results via extensive simulation • Performance competitive with an online centralized scheduler
Outline • Background • Related work • Our approach • Experimental Results • Conclusion & Future Work
Overall System Architecture • P2P grids over a peer-to-peer network (DHT - CAN) • [Diagram: a client assigns a GUID to Job J and injects it at an injection node; heartbeats route Job J to its owner node, which initiates matchmaking and sends Job J to a run node, where it is inserted into a FIFO job queue]
Matchmaking Mechanism in CAN • [Diagram: the CAN space has CPU and Memory dimensions; the client sends Job J with requirements (CJ, MJ) to its owner node, which inserts J and pushes it toward a run node satisfying CPU &gt;= CJ && Memory &gt;= MJ; the run node places J in its FIFO queue]
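The matchmaking predicate above can be sketched as follows. This is a minimal illustration assuming a two-resource (CPU, memory) CAN; the `Node`, `Job`, and `can_run` names are illustrative, not from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Job:
    cpu: float   # required CPU speed (GHz), C_J
    mem: float   # required memory (GB), M_J

@dataclass
class Node:
    cpu: float   # node CPU speed (GHz)
    mem: float   # node free memory (GB)

def can_run(node: Node, job: Job) -> bool:
    """Matchmaking test from the slide: CPU >= C_J and Memory >= M_J."""
    return node.cpu >= job.cpu and node.mem >= job.mem

# Example: a 2 GHz / 3 GB node can host a 1.5 GHz / 2 GB job.
print(can_run(Node(cpu=2.0, mem=3.0), Job(cpu=1.5, mem=2.0)))  # True
```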
Approach for Multi-core nodes (Lee:IPDPS’10) • Two logical nodes represent a multi-core node • Max-node: maximum & static status • Residue-node: current & dynamic status • Resource management & matchmaking • Dual-CAN: the two kinds of logical nodes form two separate CANs (a primary CAN of max-nodes and a secondary CAN of residue-nodes) • Balloon Model: a residue-node is represented as a simple structure on top of the existing CAN • [Diagram: node B (2 GHz CPU, 3 GB memory, Qlen=1) runs Job 1 (CPU: 2 GHz, Mem: 2 GB), leaving residue-node B' with 1.5 GHz free CPU and 1 GB free memory]
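The max-node/residue-node split can be sketched with two small records: the max-node holds static capacity, while the residue-node shrinks as jobs are placed. This is an illustrative sketch only; the type names and the exact accounting (decrementing memory and one core per job) are assumptions, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class MaxNode:
    cpu_ghz: float   # static maximum CPU speed
    mem_gb: float    # static total memory
    cores: int       # static core count

@dataclass
class ResidueNode:
    cpu_ghz: float   # currently available CPU speed
    mem_gb: float    # currently free memory
    free_cores: int  # currently idle cores

def place_job(res: ResidueNode, job_cpu: float, job_mem: float) -> bool:
    """Placing a job shrinks the residue-node's dynamic free resources;
    the max-node is never modified (simplified accounting)."""
    if res.free_cores >= 1 and res.cpu_ghz >= job_cpu and res.mem_gb >= job_mem:
        res.mem_gb -= job_mem
        res.free_cores -= 1
        return True
    return False
```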
Outline • Background • Related work • Our approach • Experimental Results • Conclusion & Future Work
Backfilling • Concept • Assumption: job running times must be given • Conservative backfilling: do NOT delay any other waiting job • EASY backfilling: do NOT delay the first job in the queue • Overall performance is closely tied to the accuracy of job running time estimates • [Diagram: CPUs vs. time, showing Jobs 1–4 with a later job backfilled into an idle slot ahead of an earlier one]
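A simplified sketch of the EASY backfilling test, assuming known runtimes (the very assumption the slide notes does not hold in our setting). This simplification only admits a candidate that finishes before the head job's reserved start; the full EASY rule also admits jobs that avoid the head job's reserved CPUs. All names are illustrative.

```python
def easy_backfill_ok(free_cpus: int, candidate, head_reservation_time: float,
                     now: float) -> bool:
    """candidate = (cpus_needed, runtime). The candidate may start now iff
    it fits in the currently free CPUs and finishes before the head job's
    reserved start time, so the head job is never delayed."""
    cpus_needed, runtime = candidate
    return cpus_needed <= free_cpus and now + runtime <= head_reservation_time
```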
Previous approaches under multiple resource requirements • Backfilling under multiple resource requirements (Leinberger:SC’99) • Backfilling on a single machine • Heuristic approaches: Backfill Lowest (BL), Backfill Balanced (BB) • Its assumption is not applicable to our system • Job running times are not given • Job migration to balance K resources between nodes (Leinberger:HCW’00) • Reduces local load imbalance by exchanging jobs, but does not consider overall system load • No backfilling scheme • Assumption: near-homogeneous environment
Outline • Background • Related work • Our approach • Experimental Results • Conclusion & Future Work
Dynamic Scheduling • After initial job assignment (but before the job starts running) • Periodic dynamic scheduling is invoked • Costs of dynamic scheduling • Communication cost • None: local scheduling • Minimal: internode scheduling & queue balancing (a job profile is small) • CPU cost: none • No preemptive scheduling: once a job starts running, it is never stopped by dynamic scheduling
Local Scheduling • Extension of backfilling under multiple resource requirements • Backfilling Counter (BC) • Every job profile keeps this value, initially 0 • The number of other jobs that have bypassed the job in the waiting queue • Only a job whose BC is equal to or greater than the BC of the job at the head of the queue can be backfilled • No job starvation or long waiting times • [Diagram: running job JR; waiting queue J1–J4 with BC values, where J3 is backfilled past the head of the queue]
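The BC rule above can be sketched as a queue scan: pick the first fitting job whose BC is at least the head's BC, and bump the BC of every job it bypasses. This is a hedged sketch under an assumed queue representation (the `QJob` type and the `fits` flag standing in for the multi-resource fit test are illustrative).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class QJob:
    name: str
    bc: int = 0          # Backfilling Counter, starts at 0
    fits: bool = False   # stand-in for "fits in current residual resources"

def pick_backfill(queue: List[QJob]) -> Optional[QJob]:
    """Remove and return a backfillable job, or None.
    Eligibility: the job fits AND its BC >= BC of the head job."""
    if not queue:
        return None
    head = queue[0]
    for i, job in enumerate(queue):
        if job.fits and job.bc >= head.bc:
            for bypassed in queue[:i]:
                bypassed.bc += 1   # each bypassed job's counter grows,
            queue.pop(i)           # bounding how often it can be overtaken
            return job
    return None
```

Because a bypassed job's BC only grows, it eventually exceeds the counters of newly arriving jobs, which is how the scheme avoids starvation.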
Which job should be backfilled? • If multiple jobs in the queue can be backfilled, choose the best job for better utilization • Backfill Balanced (BB) (Leinberger:SC’99) algorithm • Choose the job with the minimum objective function (= BM x FM) • Balance Measure (BM) • uk : normalized utilization of resource k • rj,k : job j's normalized requirement for resource k • Minimizes unevenness across the utilizations of the multiple resources • Fullness Measure (FM) • Maximizes average utilization
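A hedged sketch of the BB objective follows. The formulas are one plausible instantiation in the spirit of Leinberger et al., not a verbatim transcription: BM = max_k uk / mean_k uk (1.0 when perfectly balanced) penalizes uneven utilization, and FM = 1 - mean_k uk (small when the node is full) rewards high average utilization; BB picks the fitting candidate minimizing BM * FM.

```python
def bb_objective(used, capacity, job_req):
    """used, capacity, job_req: per-resource lists of equal length.
    Returns BM * FM for the projected state if the job starts now."""
    u = [(us + r) / c for us, r, c in zip(used, job_req, capacity)]
    mean_u = sum(u) / len(u)
    bm = max(u) / mean_u if mean_u > 0 else 1.0   # balance measure
    fm = 1.0 - mean_u                             # fullness measure
    return bm * fm

def choose_job(used, capacity, candidates):
    """candidates: list of (job_id, per-resource requirements).
    Returns the id of the fitting job with minimum BM * FM, or None."""
    fitting = [(jid, req) for jid, req in candidates
               if all(us + r <= c for us, r, c in zip(used, req, capacity))]
    if not fitting:
        return None
    return min(fitting, key=lambda jr: bb_objective(used, capacity, jr[1]))[0]
```

For example, on a node with half of each of two resources used, a job that evens out utilization beats one that loads a single resource.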
Internode Scheduling • Extension of local scheduling across nodes • Node Backfilling Counter (NBC) • BC of the job at the head of the node's waiting queue • Only jobs whose BC is equal to or greater than the NBC of the target node can be migrated • No job starvation or long waiting times • [Diagram: jobs J3 and J4 migrating to neighbor nodes (running JA and JB, with NBC 2 and 0) whose NBC conditions their BCs satisfy]
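The NBC condition can be sketched in a few lines, representing a queue simply as the list of its jobs' BC values (an illustrative simplification): a job may migrate to a neighbor only if its BC is at least the neighbor's NBC, so remote jobs cannot overtake, and hence starve, the neighbor's head-of-queue job.

```python
def node_bc(queue_bcs):
    """NBC = BC of the job at the head of the waiting queue (0 if empty)."""
    return queue_bcs[0] if queue_bcs else 0

def can_migrate(job_bc, target_queue_bcs):
    """A job is eligible for migration iff its BC >= the target's NBC."""
    return job_bc >= node_bc(target_queue_bcs)
```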
Internode Scheduling – PUSH vs. PULL • PUSH • A job sender initiates the process • The sender tries to match every job in its queue with residual free resources in its neighbors • If a job can be sent to multiple nodes, pick the destination with the objective function, preferring the node with the fastest CPU • PULL • A job receiver initiates the process • The receiver sends a PULL-Request message to the potential sender (the neighbor with the maximum queue length) • The potential sender checks whether it has a job that can be backfilled and that satisfies the BC condition • If multiple jobs can be sent, it chooses the job with the minimum objective function (= BM x FM) • If no job can be found, it sends a PULL-Reject message to the receiver • The receiver then continues to look for another potential sender among its neighbors
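The PULL handshake above can be sketched as a single round, with the message exchange collapsed into direct calls and the fit test, BC condition, and BM×FM objective passed in as callables (all names and the exact tie-breaking are assumptions for illustration):

```python
def pull_once(receiver_free, neighbors, fits, bc_ok, objective):
    """One PULL round. neighbors: list of (node_id, queue) pairs.
    The receiver tries potential senders in decreasing queue length;
    each sender offers its best eligible job (minimum objective, i.e.
    BM * FM) or effectively replies PULL-Reject, and the receiver
    moves on to the next potential sender."""
    for node_id, queue in sorted(neighbors, key=lambda nq: -len(nq[1])):
        candidates = [j for j in queue
                      if fits(receiver_free, j) and bc_ok(j, queue)]
        if candidates:
            job = min(candidates, key=objective)
            queue.remove(job)          # the sender hands the job over
            return node_id, job
        # else: PULL-Reject; continue with the next potential sender
    return None
```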
Queue Balancing • Local scheduling & internode scheduling • find a job that can start running immediately using the current residual resources • Proactive job migration for queue (load) balancing • A migrated job does not have to start immediately • (Normalized) load of a node (Leinberger:HCW’00) • “Load” for multiple resources is defined as follows • For each resource, sum the requirements of all jobs in the queue • The load of the node is the maximum of those per-resource sums • PUSH & PULL schemes are available • Minimize the total local load (TLL, the sum of the neighbors' loads) • Minimize the maximum local load among neighbors (MLL)
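The load definition and the two balancing objectives translate directly into code. This sketch assumes requirements are already normalized per resource; the function names are illustrative.

```python
def node_load(queued_reqs):
    """queued_reqs: list of per-resource requirement lists (normalized).
    Load = max over resources of the summed queued requirements."""
    if not queued_reqs:
        return 0.0
    per_resource = [sum(job[k] for job in queued_reqs)
                    for k in range(len(queued_reqs[0]))]
    return max(per_resource)

def tll(neighbor_queues):
    """Total local load: the sum of the neighbors' loads."""
    return sum(node_load(q) for q in neighbor_queues)

def mll(neighbor_queues):
    """Maximum local load among the neighbors."""
    return max((node_load(q) for q in neighbor_queues), default=0.0)
```

A balancing move is worthwhile when it lowers TLL or MLL over the neighbor set; note that because the load is a max over resources, moving a job can lower one node's load without raising another's.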
Outline • Background • Related work • Our approach • Experimental Results • Conclusion & Future Work
Experimental Setup • Event-driven simulations • A set of nodes and events • 1000 initial nodes and 5000 job submissions • Jobs are submitted as a Poisson process (with various mean intervals T) • A node has 1, 2, 4 or 8 cores • Job run times follow a uniform distribution (30–90 minutes) • Node capabilities (job requirements) • CPU, memory, disk and the number of cores • A job's requirement for a resource can be omitted (don't care) • Job Constraint Ratio: the probability that each resource type for a job is specified • Steady-state experiments
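The workload described above can be generated in a few lines. This is an illustrative sketch, not the authors' simulator: exponential inter-arrival times yield Poisson arrivals with mean interval T, and runtimes are drawn uniformly from 30–90 minutes.

```python
import random

def generate_jobs(n, mean_interval_s, seed=0):
    """Generate n job-submission events: Poisson arrivals (mean interval
    mean_interval_s seconds) and uniform runtimes in [30, 90] minutes."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for i in range(n):
        t += rng.expovariate(1.0 / mean_interval_s)  # Poisson process
        runtime = rng.uniform(30 * 60, 90 * 60)      # 30-90 min, in seconds
        jobs.append({"id": i, "submit": t, "runtime": runtime})
    return jobs
```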
Experimental Setup • Performance metrics • [Timeline: a job is injected into the system, arrives at its owner, arrives at its run node, starts execution, and finishes execution; matchmaking cost + wait time + running time = job turn-around time]
Comparison Models • Centralized scheduler (CENT) • Online, global scheduling mechanism with a single waiting queue • Not feasible in a complete implementation of a P2P system • Combinations of our approaches • Vanilla: no dynamic scheduling (initial job assignment only) • L: local scheduling only • LI: local scheduling + internode scheduling • LIQ: all three methods (local scheduling + internode scheduling + queue balancing) • LI(Q)-PUSH/PULL: LI & LIQ with the PUSH (PULL) option • [Diagram: Job J (CPU 2.0 GHz, Mem 500 MB, Disk 1 GB) submitted to the centralized scheduler]
Performance varying system load • LIQ-PULL &gt; LI-PULL &gt; LIQ-PUSH &gt; LI-PUSH &gt; L &gt;= Vanilla • Internode scheduling plays the major role • PULL is better than PUSH • In an overloaded system, PULL spreads information better because receivers aggressively try to obtain jobs (Demers:PODC’87) • Local scheduling alone cannot guarantee better performance than Vanilla • Our Backfilling Counter does not guarantee that other waiting jobs are never delayed (unlike conservative backfilling)
Overheads • PULL has a higher cost than PUSH • Active search (many trials and errors) • The other schemes are similar to Vanilla • Our schemes do not incur significant additional overhead
Performance varying Job Constraint Ratio • LIQ-PULL: best • LIQ == LI • LIQ-PULL is competitive with CENT • For an 80% Job Constraint Ratio • LIQ-PULL's performance worsens compared to the other ratio cases • Reason: it is difficult to find a capable neighbor for job migration because jobs are more highly constrained
Evaluation Summary • Performance • LIQ-PULL is competitive with CENT • Internode scheduling has the major impact on performance • PULL is better than PUSH (more aggressive search) • Performance stays consistently good regardless of system load and job constraint ratio • Overheads • PULL &gt; PUSH (more aggressive search) • Competitive with Vanilla
Conclusion and Future Work • New decentralized dynamic scheduling for multi-core P2P grids • Local scheduling: backfilling in a single node • Internode scheduling: backfilling across neighbors • Queue balancing: proactive queue (load) adjustment among neighbors • Evaluation via simulation • LIQ-PULL is the best • Competitive performance with CENT • PULL is better than PUSH, but PULL has higher cost • Low overhead (similar to Vanilla) • Future work • Real experiments (in cooperation with the Astronomy Dept.) • Decentralized resource management for heterogeneous asymmetric multi-processors