ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar

Context
• GPUs are used in supercomputers
  • Some of the TOP500 supercomputers use GPUs
  • Tianhe-1A
    • 14,336 Xeon X5670 processors
    • 7,168 Nvidia Tesla M2050 GPUs
  • Stampede
    • about 6,000 nodes: Xeon E5-2680 8C, Intel Xeon Phi
• GPUs are used in cloud computing
• Need for resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs

Categories of Scheduling Objectives
• Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput & latency
• A market-based service world is emerging: focus on provider’s profit and user’s satisfaction
  • Cloud: pay-as-you-go model
  • Amazon: different users (On-Demand, Free, Spot, …)
  • Recent resource managers for supercomputers (e.g. MOAB) have the notion of service-level agreement (SLA)

Motivation: State of the Art
• Our goal: reconsider market-based scheduling for heterogeneous clusters including GPUs
• Open-source batch schedulers start to support GPUs
  • TORQUE, SLURM
  • Users guide the mapping of jobs to heterogeneous nodes
  • Simple scheduling schemes (goals: throughput & latency)
• Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs
  • [gViM HPCVirt’09] [vCUDA IPDPS’09] [rCUDA HPCS’10] [gVirtuS Euro-Par 2010] [our HPDC’11, CCGRID’12, HPDC’12]
  • Simple scheduling schemes (goals: throughput & latency)
• Proposals on market-based scheduling policies focus on homogeneous CPU clusters
  • [Irwin HPDC’04] [Sherwani Soft. Pract. Exp.’04]

Considerations
• Community is looking into code portability between CPU and GPU
  • OpenCL
  • PGI CUDA-x86
  • MCUDA (CUDA-C), Ocelot, SWAN (CUDA-OpenCL), OpenMPC
  → Opportunity to flexibly schedule a job on CPU/GPU
• In cloud environments, oversubscription is commonly used to reduce infrastructural costs
  → Use of resource sharing to improve performance by maximizing hardware utilization

Problem Formulation
• Given a CPU-GPU cluster
• Schedule a set of jobs on the cluster
  • To maximize the provider’s profit / aggregate user satisfaction
• Exploit the portability offered by OpenCL
  • Flexibly map the job onto either CPU or GPU
• Maximize resource utilization
  • Allow sharing of multi-core CPU or GPU
Assumptions/Limitations
• 1 multi-core CPU and 1 GPU per node
• Single-node, single-GPU jobs
• Only space-sharing, limited to two jobs per resource

Value Function: Market-based Scheduling Formulation
• For each job, Linear-Decay Value Function [Irwin HPDC’04]
  • Max Value → Importance/Priority of job
  • Decay → Urgency of job
• Delay due to: queuing, execution on non-optimal resource, resource sharing
• Yield = maxValue - decay * delay
[Figure: linear-decay value function; axes: Yield/Value vs. Execution time T; labels: Max Value, Decay rate]
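The linear-decay formula above can be sketched in a few lines of Python. The function name and the sample numbers are hypothetical and chosen only for illustration; the sketch also assumes the yield is allowed to go negative once the delay is long enough.

```python
def linear_decay_yield(max_value: float, decay: float, delay: float) -> float:
    """Yield = maxValue - decay * delay (assumed to be unbounded below)."""
    return max_value - decay * delay


# Hypothetical job: worth 100 units, losing 2 units per time unit of delay.
for delay in (0, 10, 60):
    print(delay, linear_decay_yield(100.0, 2.0, delay))
```

With these sample values the yield turns negative once the delay exceeds 50 time units, which is the situation the scheduler tries to avoid.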

Overall Scheduling Approach: Scheduling Flow
• Jobs arrive in batches
• Phase 1: Mapping
  • Jobs are enqueued on their optimal resource (CPU queue or GPU queue)
  • Phase 1 is oblivious of other jobs (based on optimal walltime)
• Phase 2: Sorting
  • Inter-job scheduling considerations
  • Sort the jobs in each queue to improve yield
• Phase 3: Re-mapping
  • Different schemes: when to remap? What to remap?
• Execute on CPU / Execute on GPU

Phase 1: Mapping
• Users provide the walltime on CPU and GPU
  • Walltime is used as an indicator of the optimal/non-optimal resource
• Each job is mapped onto its optimal resource
NOTE: in our experiments we assumed maxValue = optimal walltime
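A minimal sketch of the mapping phase, assuming each job carries the two user-provided walltimes. The `Job` fields, the function names, and the tie-breaking rule (ties go to the GPU) are assumptions made for this illustration, not details given in the slides.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpu_walltime: float  # user-provided estimate on the multi-core CPU
    gpu_walltime: float  # user-provided estimate on the GPU


def optimal_resource(job: Job) -> str:
    """The resource with the shorter walltime; ties go to the GPU (assumption)."""
    return "CPU" if job.cpu_walltime < job.gpu_walltime else "GPU"


def map_jobs(batch):
    """Phase 1: enqueue every job on its optimal resource, ignoring other jobs."""
    queues = {"CPU": [], "GPU": []}
    for job in batch:
        queues[optimal_resource(job)].append(job)
    return queues


batch = [Job("kmeans", 40.0, 12.0), Job("sort", 8.0, 15.0)]
queues = map_jobs(batch)
print([j.name for j in queues["CPU"]], [j.name for j in queues["GPU"]])
```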

Phase 2: Sorting
• Sort jobs based on Reward [Irwin HPDC’04]
• Present Value: f(maxValue_i, discount_rate)
  • Value after discounting the risk of running a job
  • The shorter the job, the lower the risk
• Opportunity Cost
  • Degradation in value due to the selection of one among several alternatives
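The slides do not spell out f or the exact reward formula, so the following is only one plausible instantiation, not the authors' definition. It assumes a simple discounting form for the present value, an opportunity cost equal to the value that the other queued jobs lose while this job runs, and a reward normalized by walltime. All names and the default discount rate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    max_value: float
    decay: float
    walltime: float


def present_value(job, discount_rate):
    # Assumed form: longer jobs are riskier, so their value is discounted more.
    return job.max_value / (1.0 + discount_rate * job.walltime)


def opportunity_cost(job, queue):
    # Assumed form: value lost by every other queued job while this one runs.
    return job.walltime * sum(o.decay for o in queue if o is not job)


def reward(job, queue, discount_rate):
    # Assumed form: net gain per unit of execution time.
    gain = present_value(job, discount_rate) - opportunity_cost(job, queue)
    return gain / job.walltime


def sort_by_reward(queue, discount_rate=0.01):
    """Phase 2: highest reward first."""
    return sorted(queue, key=lambda j: reward(j, queue, discount_rate), reverse=True)


queue = [Job("long", 100.0, 1.0, 100.0), Job("short", 100.0, 1.0, 10.0)]
print([j.name for j in sort_by_reward(queue)])
```

Under these assumptions, a short job outranks a long job of equal value and urgency, matching the remark that shorter jobs carry lower risk.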

Phase 3: Remapping
• When to remap:
  • Uncoordinated schemes: the queue is empty and the resource is idle
  • Coordinated scheme: the CPU and GPU queues are imbalanced
• What to remap:
  • Which job will have the best reward on the non-optimal resource?
  • Which job will suffer the least reward penalty?

Phase 3: Uncoordinated Schemes
• Last Optimal Reward (LOR)
  • Remap the job with the least reward on its optimal resource
  • Idea: least reward → least risk in moving
• First Non-Optimal Reward (FNOR)
  • Compute the reward each job could produce on the non-optimal resource
  • Remap the job with the highest reward on the non-optimal resource
  • Idea: consider the non-optimal penalty
• Last Non-Optimal Reward Penalty (LNORP)
  • Remap the job with the least reward degradation
  • RewardDegradation_i = OptimalReward_i - NonOptimalReward_i
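The three selection rules differ only in the key used to pick a job from the busy queue. A minimal sketch, assuming the two rewards of each job have already been computed; the `Candidate` type, the function names, and the sample rewards are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    optimal_reward: float      # reward on the job's optimal resource
    non_optimal_reward: float  # reward if moved to the other resource


def pick_lor(queue):
    """Last Optimal Reward: least reward on the optimal resource."""
    return min(queue, key=lambda j: j.optimal_reward)


def pick_fnor(queue):
    """First Non-Optimal Reward: highest reward on the non-optimal resource."""
    return max(queue, key=lambda j: j.non_optimal_reward)


def pick_lnorp(queue):
    """Last Non-Optimal Reward Penalty: least reward degradation."""
    return min(queue, key=lambda j: j.optimal_reward - j.non_optimal_reward)


queue = [Candidate("A", 10.0, 7.0), Candidate("B", 4.0, 0.0), Candidate("C", 6.0, 5.0)]
print(pick_lor(queue).name, pick_fnor(queue).name, pick_lnorp(queue).name)
```

On this sample queue the three rules select three different jobs (B, A, and C respectively), which shows why the choice of scheme matters.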

Phase 3: Coordinated Scheme
Coordinated Least Penalty (CORLP)
• When to remap: imbalance between queues
  • Imbalance is affected by the decay rates and execution times of jobs
  • Total Queuing-Delay Decay-Rate Product (TQDP)
  • Remap if |TQDP_CPU - TQDP_GPU| > threshold
• What to remap
  • Remap the job with the least reward degradation (penalty)
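The slides name TQDP but do not define it term by term. The sketch below assumes that the TQDP of a queue is the sum, over its jobs, of each job's queuing delay times its decay rate, where the queuing delay is the total walltime of the jobs ahead of it. The function names, the tuple layout, and the threshold value are hypothetical.

```python
def tqdp(queue):
    """Assumed definition: sum of (queuing delay * decay rate) over the queue.

    `queue` is an ordered list of (walltime, decay_rate) pairs; the queuing
    delay of a job is the total walltime of the jobs ahead of it.
    """
    total, wait = 0.0, 0.0
    for walltime, decay in queue:
        total += wait * decay
        wait += walltime
    return total


def should_remap(cpu_queue, gpu_queue, threshold):
    """Coordinated trigger: remap when the two queues are imbalanced."""
    return abs(tqdp(cpu_queue) - tqdp(gpu_queue)) > threshold


cpu_queue = [(10.0, 1.0), (20.0, 2.0), (5.0, 0.5)]
gpu_queue = [(4.0, 1.0)]
print(tqdp(cpu_queue), tqdp(gpu_queue), should_remap(cpu_queue, gpu_queue, 20.0))
```

Under this reading, a queue holding long jobs ahead of urgent ones accumulates a large TQDP, so the trigger reacts to both execution times and decay rates, as the slide indicates.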

Heuristic for Sharing: Resource Sharing Heuristic
• Limitation: at most two jobs can space-share a CPU/GPU
• Factors affecting sharing
  - Slowdown incurred by jobs using half of a resource
  + More resources available for other jobs
• Jobs
  • Categorized as low, medium, or high scaling (based on models/profiling)
• When to enable sharing
  • A large fraction of the jobs in the pending queues have negative yield
• Which jobs share a resource
  • Scalability-DecayRate factor
  • Jobs are grouped based on scalability
  • Within each group, jobs are ordered by decay rate (urgency)
  • Pick the top K fraction of jobs; ‘K’ is tunable (low scalability, low decay)
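The heuristic above can be sketched as two small functions: one decides whether sharing is enabled, the other ranks jobs by the Scalability-DecayRate factor and keeps the top K fraction. The field names, the 0.5 trigger fraction, and the rounding of K are assumptions for this illustration; the slides only say that K is tunable and that low-scalability, low-decay jobs are preferred.

```python
from dataclasses import dataclass

SCALING_ORDER = {"low": 0, "medium": 1, "high": 2}


@dataclass
class Job:
    name: str
    scaling: str            # "low", "medium", or "high" (from models/profiling)
    decay: float            # urgency
    projected_yield: float  # yield expected if the job keeps waiting


def sharing_enabled(pending, negative_fraction=0.5):
    """Enable sharing when a large share of pending jobs has negative yield."""
    if not pending:
        return False
    negative = sum(1 for j in pending if j.projected_yield < 0)
    return negative / len(pending) >= negative_fraction


def pick_sharing_candidates(pending, k):
    """Group by scalability, order by decay rate, keep the top K fraction."""
    ranked = sorted(pending, key=lambda j: (SCALING_ORDER[j.scaling], j.decay))
    return ranked[: int(len(ranked) * k)]


pending = [
    Job("a", "low", 2.0, -5.0),
    Job("b", "high", 0.1, -1.0),
    Job("c", "low", 0.5, 3.0),
    Job("d", "medium", 1.0, -2.0),
]
if sharing_enabled(pending):
    print([j.name for j in pick_sharing_candidates(pending, k=0.5)])
```

Jobs that scale poorly lose little when given half a resource, and jobs with a low decay rate lose little value from the resulting slowdown, which is why they are ranked first in this sketch.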

Overall System Prototype
[Figure: system architecture]
• Master Node (centralized decision making)
  • Submission Queue
  • Cluster-Level Scheduler: Scheduling Schemes & Policies
  • Pending Queues (CPU, GPU), Execution Queues (CPU, GPU), Finished Queues (CPU, GPU)
  • TCP Communicator
• Compute Nodes (execution & sharing mechanisms)
  • TCP Communicator
  • Node-Level Runtime
    • CPU Execution Processes: OS-based scheduling & sharing
    • GPU Execution Processes: GPU Consolidation Framework
  • Multi-core CPU and GPU on each node
• Assumption: shared file system

GPU Sharing Framework
[Figure: GPU-related node-level runtime]
• GPU execution processes (Front-End): CUDA app1 … CUDA appN, each with a CUDA Interception Library
• Front End – Back End Communication Channel
• GPU Consolidation Framework (Back-End): Back-End Server, Virtual Context, Workload Consolidator
  • CUDA calls arrive from the Front-End
  • Manipulates kernel configurations to allow GPU space sharing
  • CUDA stream1, CUDA stream2, … CUDA streamN
• CUDA Runtime, CUDA Driver, GPU
Simplified version of our HPDC’11 runtime

Experimental Setup
• 16-node cluster
  • CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
  • GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory
• 256-job workload
  • 10 benchmark programs
  • 3 configurations: small, large, very large datasets
  • Various application domains: scientific computations, financial analysis, data mining, machine learning
• Baselines
  • TORQUE (always optimal resource)
  • Minimum Completion Time (MCT) [Maheswaran et al., HCW’99]

Comparison with Torque-based Metrics: Throughput & Latency
[Figure: panels COMPLETION TIME and AVERAGE LATENCY; annotations: 10-20% better, ~20% better]
• Baselines suffer from idle resources
• By privileging shorter jobs, our schemes reduce queuing delays

Results with Average Yield Metric: Effect of Job Mix
[Figure: yield for Skewed-GPU, Skewed-CPU, and Uniform job mixes; annotations: up to 8.8x better, up to 2.3x better]
• Better on skewed job mixes:
  • More idle time in case of baseline schemes
  • More room for dynamic mapping

Results with Average Yield Metric: Effect of Value Function
[Figure: yield under different value functions; annotations: up to 6.9x better, up to 3.8x better]
• Adaptability of our schemes to different value functions

Results with Average Yield Metric: Effect of System Load
[Figure: yield vs. system load; annotation: up to 8.2x better]
• As load increases, yield from baselines decreases linearly
• Proposed schemes achieve initially increased yield and then sustained yield

Yield Improvements from Sharing: Effect of Sharing
[Figure: yield vs. fraction of jobs to share; annotation: up to 23x improvement]
• Careful space sharing can help performance by freeing resources
• Excessive sharing can be detrimental to performance

Summary: Conclusion
• Value-based scheduling on CPU-GPU clusters
  • Goal: improve aggregate yield
• Coordinated and uncoordinated scheduling schemes for dynamic mapping
• Automatic space sharing of resources based on heuristics
• Prototypical framework for evaluating the proposed schemes
• Improvement over state-of-the-art
  • Based on completion time & latency
  • Based on average yield
