Mapping and Scheduling W+A: Chapter 4 CSE 160/Berman
Outline • Mapping and Scheduling • Static Mapping Strategies • Dynamic Mapping Strategies • Scheduling CSE 160/Berman
Mapping and Scheduling Models • Basic Models: • Program model is a task graph with dependencies • Platform model is a set of processors with an interconnection network CSE 160/Berman
Mapping and Scheduling • Mapping and scheduling involve the following activities • Select a set of resources on which to schedule the task(s) of the application. • Assign application task(s) to compute resources. • Distribute data or co-locate data and computation. • Order tasks on compute resources. • Order communication between tasks. CSE 160/Berman
Mapping and Scheduling Terminology • 1. Select a set of resources on which to schedule the task(s) of the application. 2. Assign application task(s) to compute resources. 3. Distribute data or co-locate data and computation. 4. Order tasks on compute resources. 5. Order communication between tasks. • Activity 1 = resource selection • Activities 1-3 are generally termed mapping • Activities 4-5 are generally termed scheduling • For many researchers, scheduling is also used to describe activities 1-5 • Mapping is an assignment of tasks in space; scheduling focuses on ordering in time CSE 160/Berman
Goals • Want the mapping and scheduling algorithms and models to promote the assignment/ordering with the smallest execution time • Accuracy vs. Ranking: [Figure: model predictions (A, B) compared against real execution times (A', B') and the optimum, contrasting prediction accuracy with correct ranking of candidate assignments] CSE 160/Berman
What is the best mapping? • [Figure: example task graph with node and edge weights (1, 2, 3, 4, 7) and two candidate mappings onto processors P1 and P2] CSE 160/Berman
Static and Dynamic Mapping Strategies • Static methods generate the partitioning prior to execution • Static mapping strategies work well when we can reasonably predict the time to perform application tasks during execution • When it is not easy to predict task execution time, dynamic strategies may perform better • Dynamic methods generate the partitioning during execution • For example, the work queue and master/slave (M/S) are dynamic methods CSE 160/Berman
Static Mapping • Static mapping can involve • partitioning of tasks (functional decomposition); the Sieve of Eratosthenes is an example • partitioning of data (data decomposition); the fixed decomposition of Mandelbrot (k blocks per processor) is an example of this (see the sketch below) • [Figure: processors P2, P3, P5, P7, P11, P13, P17 in the Sieve of Eratosthenes example] CSE 160/Berman
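To make the fixed data decomposition concrete, here is a minimal sketch (not from the slides) of assigning image rows to processors so that each processor receives k blocks; the processor count, blocks per processor, and rows per block are hypothetical parameters.

#include <stdio.h>

/* Sketch: fixed (static) data decomposition with K blocks per processor.
 * P, K, and B are hypothetical parameters chosen so that ROWS = P*K*B.   */
#define P    4                 /* processors           */
#define K    2                 /* blocks per processor */
#define B    2                 /* rows per block       */
#define ROWS (P * K * B)

int owner(int row) {
    /* Blocks of B rows are dealt to processors round-robin, so each
     * processor ends up owning exactly K blocks.                         */
    return (row / B) % P;
}

int main(void) {
    for (int row = 0; row < ROWS; row++)
        printf("row %2d -> processor %d\n", row, owner(row));
    return 0;
}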
Load Balancing • Load balancing = a strategy to partition the application so that • all processors perform an equivalent amount of work • (all processors finish in an equivalent amount of time; this is really time-balancing) • Processors may take different amounts of time to do equivalent amounts of work • Load balancing is an important technique in parallel processing • There are many ways to achieve a balanced load • Both dynamic and static load balancing techniques exist CSE 160/Berman
Static and Dynamic Mapping for the N-body Problem • The N-body problem: Given n bodies in 3D space, determine the gravitational force F between each pair of them at any given point in time: F = G m_a m_b / r^2, where G is the gravitational constant, r is the distance between the two bodies, and m_a and m_b are the masses of the bodies CSE 160/Berman
Exact N-body serial pseudo-code • At each time step t, the velocity v and position x of body i may change • The real problem is a bit more complicated than this; see 4.2.3 in the book
for (t = 0; t < tmax; t++) {
  for (i = 0; i < N; i++) {           /* compute new velocity and position of each body */
    F = Force_routine(i);
    v_new[i] = v[i] + F * dt;
    x_new[i] = x[i] + v_new[i] * dt;
  }
  for (i = 0; i < N; i++) {           /* commit the updates once all forces are computed */
    x[i] = x_new[i];
    v[i] = v_new[i];
  }
}
CSE 160/Berman
Exact N-body and static partitioning • Can parallelize the n-body computation by tagging the velocity and position of each body and updating bodies using correctly tagged information • This can be implemented as a data-parallel algorithm. What is the worst-case complexity of a single iteration? • How should we partition this? (A sketch of one possible static block partitioning follows.) • Static partitioning can be a bad strategy for the n-body problem • The load can be very unbalanced for some configurations CSE 160/Berman
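A minimal sketch of one possible static block partitioning of the exact algorithm, assuming N bodies and P processors. The stub Force_routine and all parameters are hypothetical; the program only illustrates which bodies each processor would update.

#include <stdio.h>

#define N  8                           /* hypothetical number of bodies     */
#define P  4                           /* hypothetical number of processors */

double x[N], v[N], x_new[N], v_new[N];
double dt = 0.01;

/* Stub standing in for the O(N) force computation of the pseudo-code. */
double Force_routine(int i) {
    double F = 0.0;
    for (int j = 0; j < N; j++)
        if (j != i) F += (x[j] - x[i]);      /* placeholder, not real gravity */
    return F;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (double)i;   /* hypothetical initial positions */
    /* Static block partitioning: processor "me" owns bodies lo..hi-1.  In a   */
    /* real SPMD program "me" would be this processor's rank; here we loop     */
    /* over all ranks to show which bodies each one would update.              */
    int chunk = (N + P - 1) / P;
    for (int me = 0; me < P; me++) {
        int lo = me * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        for (int i = lo; i < hi; i++) {
            double F = Force_routine(i);             /* still reads all N bodies: O(N) */
            v_new[i] = v[i] + F * dt;
            x_new[i] = x[i] + v_new[i] * dt;
        }
    }
    for (int i = 0; i < N; i++) { x[i] = x_new[i]; v[i] = v_new[i]; }
    printf("one time step done; x[0] = %.3f\n", x[0]);
    return 0;
}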
Improving the complexity of the N-body code • The complexity of the serial n-body algorithm is very large: O(n^2) per iteration • The communication structure is not local: each body must gather data from all other bodies • The most interesting problems have large n, for which the exact method is not feasible • The Barnes-Hut algorithm is a well-known approximation to the exact n-body problem and can be efficiently parallelized CSE 160/Berman
Barnes-Hut Approximation • The Barnes-Hut algorithm is based on the observation that a cluster of distant bodies can be approximated as a single distant body • Total mass = aggregate mass of the bodies in the cluster • Distance to cluster = distance to the center of mass of the cluster • This clustering idea can be applied recursively CSE 160/Berman
Barnes-Hut idea • Dynamic divide-and-conquer approach: • Each region (cube) of space is divided into 8 subcubes • If a subcube contains more than 1 body, it is recursively subdivided • If a subcube contains no bodies, it is removed from consideration • In the 2D analogue, each square region is divided into 4 subregions CSE 160/Berman
Barnes-Hut idea • For a 3D decomposition, the result is an octtree • For a 2D decomposition, the result is a quadtree (a minimal construction sketch follows) CSE 160/Berman
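A minimal 2D quadtree construction sketch (the 3D octtree case subdivides into 8 children instead of 4). The bodies, masses, and region size are hypothetical; only the subdivide-on-occupancy rule and the mass/center-of-mass accumulation described on the slides are illustrated.

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    double cx, cy, half;            /* square region: center and half-width  */
    double mass, mx, my;            /* total mass and mass-weighted position  */
    int    nbodies;                 /* number of bodies in this subtree       */
    struct Node *child[4];          /* NW, NE, SW, SE subregions              */
} Node;

Node *new_node(double cx, double cy, double half) {
    Node *n = calloc(1, sizeof(Node));
    n->cx = cx; n->cy = cy; n->half = half;
    return n;
}

int quadrant(Node *n, double x, double y) {
    return (x >= n->cx) + 2 * (y < n->cy);          /* 0..3 */
}

void insert(Node *n, double x, double y, double m) {
    if (n->nbodies >= 1) {
        if (n->nbodies == 1) {
            /* Region held a single body: push it down before adding the new one. */
            double ox = n->mx / n->mass, oy = n->my / n->mass, om = n->mass;
            int q = quadrant(n, ox, oy);
            double h = n->half / 2;
            if (!n->child[q])
                n->child[q] = new_node(n->cx + (q & 1 ? h : -h),
                                       n->cy + (q & 2 ? -h : h), h);
            insert(n->child[q], ox, oy, om);
        }
        int q = quadrant(n, x, y);
        double h = n->half / 2;
        if (!n->child[q])
            n->child[q] = new_node(n->cx + (q & 1 ? h : -h),
                                   n->cy + (q & 2 ? -h : h), h);
        insert(n->child[q], x, y, m);
    }
    /* Accumulate total mass and (mass-weighted) center of mass at this node. */
    n->mass += m; n->mx += m * x; n->my += m * y; n->nbodies++;
}

int main(void) {
    Node *root = new_node(0.0, 0.0, 1.0);           /* unit square around the origin */
    double bodies[4][3] = { {0.3,0.4,1}, {-0.2,0.1,2}, {0.5,-0.6,1}, {0.31,0.41,1} };
    for (int i = 0; i < 4; i++)
        insert(root, bodies[i][0], bodies[i][1], bodies[i][2]);
    printf("root mass %.1f, center (%.2f, %.2f)\n",
           root->mass, root->mx / root->mass, root->my / root->mass);
    return 0;
}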
Barnes-Hut Pseudo-code
for (t = 0; t < tmax; t++) {
  Build octtree;
  Compute total mass and center of mass of each cell;
  Traverse the tree, computing the forces;
  Update the position and velocity of all bodies;
}
• Notes: • The total mass and center of mass of each subcube are stored at its root • Tree traversal stops at a node when the clustering approximation can be used for a particular body • In the gravitational n-body problem described here, this can happen when r >= d/c, where r is the distance from the body to the center of mass of a subcube of side d and c is a constant (a traversal sketch follows) CSE 160/Berman
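A hedged sketch of the tree traversal for one body, reusing the Node structure from the construction sketch above. It applies the r >= d/c test to decide whether to use the clustering approximation or to open the node. Returning only the scalar force magnitude, the constant G, and the handling of r near 0 are simplifications for illustration.

#include <math.h>

static const double G = 6.674e-11;                /* gravitational constant        */

/* Magnitude of the approximate force on a body of mass m at (x, y).              */
double force_on(Node *n, double x, double y, double m, double c) {
    if (n == NULL || n->nbodies == 0) return 0.0;
    double cmx = n->mx / n->mass, cmy = n->my / n->mass;   /* center of mass       */
    double dx = cmx - x, dy = cmy - y;
    double r = sqrt(dx * dx + dy * dy);
    double d = 2.0 * n->half;                     /* side length of the cell       */
    if (n->nbodies == 1 || r >= d / c) {
        /* Far enough away (or a single body): treat the whole subtree as one body */
        /* located at its center of mass.                                          */
        return (r > 0.0) ? G * m * n->mass / (r * r) : 0.0;
    }
    double F = 0.0;                               /* otherwise open the node       */
    for (int q = 0; q < 4; q++)
        F += force_on(n->child[q], x, y, m, c);
    return F;
}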
Barnes-Hut Complexity • Partitioning is dynamic: the whole octtree must be reconstructed at each time step because the bodies will have moved • Constructing the tree can be done in O(n log n) • Computing the forces can be done in O(n log n) • Barnes-Hut for one iteration is O(n log n) [compare to O(n^2) for one iteration of the exact solution] CSE 160/Berman
Generalizing the Barnes-Hut approach • The approach can be used for applications that repeatedly perform some calculation on particles/bodies/data indexed by position • Recursive bisection (see the sketch below): • Divide the region in half at each step so that the particles are balanced • Map the resulting rectangular regions onto processors so that the load is balanced CSE 160/Berman
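A small sketch of recursive coordinate bisection: the particle set is sorted along one axis, cut so that the two halves carry particle counts proportional to the processors assigned to each side, and the cutting axis alternates at each level. The particle coordinates and processor count are hypothetical.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; int owner; } Particle;

static int cmp_x(const void *a, const void *b) {
    double d = ((const Particle *)a)->x - ((const Particle *)b)->x;
    return (d > 0) - (d < 0);
}
static int cmp_y(const void *a, const void *b) {
    double d = ((const Particle *)a)->y - ((const Particle *)b)->y;
    return (d > 0) - (d < 0);
}

/* Assign particles p[0..n-1] to processors first .. first+nprocs-1.        */
void bisect(Particle *p, int n, int first, int nprocs, int axis) {
    if (nprocs == 1) {
        for (int i = 0; i < n; i++) p[i].owner = first;
        return;
    }
    /* Sort along the current axis and cut so the two halves hold particle  */
    /* counts proportional to the processors on each side (balanced load).  */
    qsort(p, n, sizeof(Particle), axis ? cmp_y : cmp_x);
    int left_procs = nprocs / 2;
    int cut = (int)((long long)n * left_procs / nprocs);
    bisect(p, cut, first, left_procs, !axis);
    bisect(p + cut, n - cut, first + left_procs, nprocs - left_procs, !axis);
}

int main(void) {
    Particle p[8] = { {0.1,0.9,0},{0.8,0.2,0},{0.4,0.5,0},{0.7,0.7,0},
                      {0.2,0.3,0},{0.9,0.8,0},{0.3,0.6,0},{0.6,0.1,0} };
    bisect(p, 8, 0, 4, 0);
    for (int i = 0; i < 8; i++)
        printf("(%.1f, %.1f) -> P%d\n", p[i].x, p[i].y, p[i].owner);
    return 0;
}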
Recursive Bisection Programming Issues • How do we keep track of the regions mapped to each processor? • What should the density of each region be? [granularity!] • What is the complexity of performing the partitioning? How often should we repartition to optimize the load balance? • How can locality of communication or processor configuration be leveraged? CSE 160/Berman
Scheduling • Application scheduling: ordering and allocation of tasks/communication/data to processors • Application-centric performance measure, e.g. minimal execution time • Job Scheduling: ordering and allocation of jobs on an MPP • System-centric performance measure, e.g. processor utilization, throughput CSE 160/Berman
Job Scheduling Strategies • Gang-scheduling • Batch scheduling using backfilling CSE 160/Berman
Gang scheduling • Gang scheduling is a technique for allocating a collection of jobs on an MPP • One or more jobs are clustered as a gang • Gangs share time slices on the whole machine • The strategy combines time-sharing (gangs get time slices) and space-sharing (gangs partition space) approaches • There are many flavors of gang scheduling in the literature CSE 160/Berman
Gang Scheduling • Formal definition from Dror Feitelson: • Gang scheduling is a scheme that combines three features: • The threads of a set of jobs are grouped into gangs with the threads in a single job considered to be a single gang. • The threads in each gang execute simultaneously on distinct PEs, using a 1-1 mapping. • Time slicing is used, with all the threads in a gang being preempted and rescheduled at the same time. CSE 160/Berman
Why gang scheduling? • Gang scheduling promotes efficient performance of individual jobs as well as efficient utilization and fair allocation of machine resources. • Gang scheduling leads to two desirable properties: • It promotes efficient fine-grain interactions among the threads of a gang, since they are executing simultaneously. • Periodic preemption prevents long jobs from monopolizing system resources. • (The overhead of preemption can reduce performance, so preemption must be implemented efficiently.) • Used as the scheduling policy for the CM-5, Meiko CS-2, Paragon, etc. CSE 160/Berman
Batch Job Scheduling • Problem: How to schedule jobs waiting in a queue to run on a multicomputer? • Each job requests some number n of nodes and some time t to run • Goal: promote utilization of machine, fairness to jobs, short queue wait times CSE 160/Berman
One approach: Backfilling • Main idea: pack the jobs in the processor/time space • Allow the job at the head of the queue to be scheduled in the first available slot. • If other jobs in the queue can run without changing the start time of previous jobs in the queue, schedule them. • Promote jobs if they can start earlier. • Many versions of backfilling exist: • EASY: promote jobs as long as they don't delay the start time of the first job in the queue • Conservative: promote jobs as long as they don't delay the start time of any job in the queue • (A sketch of the EASY backfill test follows.) CSE 160/Berman
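A hedged sketch of the test an EASY-style backfiller applies to a candidate job: the job may be started now if it fits in the currently idle nodes and either finishes before the reserved start time of the job at the head of the queue or uses only nodes that would still be idle at that time. The names (free_now, shadow_time, extra_nodes) are hypothetical stand-ins for the scheduler's internal state.

/* Sketch of the EASY backfilling admission test for one candidate job.     */
int can_backfill(int req_nodes, double req_time, double now,
                 int free_now,        /* nodes idle right now                          */
                 double shadow_time,  /* reserved start time of the head-of-queue job  */
                 int extra_nodes)     /* nodes still idle at shadow_time after the     */
                                      /* head job's reservation is accounted for       */
{
    if (req_nodes > free_now) return 0;            /* cannot start right now           */
    if (now + req_time <= shadow_time) return 1;   /* finishes before the head starts  */
    return req_nodes <= extra_nodes;               /* or never delays the head job     */
}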
Backfilling Example • Submitting five requests… [Figure: five job requests plotted in the processors × time space] CSE 160/Berman
Backfilling Example • Submitting five requests… • Using backfilling... [Figure: the same requests packed more tightly in the processors × time space] CSE 160/Berman
Backfilling Example [Figure: intermediate backfilling steps in the processors × time space] CSE 160/Berman
Backfilling Example • An existing job finishes • Backfilling promotes the yellow job and then schedules the purple job [Figure: the resulting schedule in the processors × time space] CSE 160/Berman
Backfilling Scheduling • Backfilling is used in the Maui Scheduler at SDSC on the SP-2, PBS at NASA, the Computing Condominium Scheduler at Penn State, etc. • Backfilling issues: • What if the processors of the platform have different capacities (are not homogeneous)? • What if some jobs get priority over others? • Should parallel jobs be treated separately from serial jobs? • If multiple queues are used, how should they be administered? • Should users be charged to wait in the queue as well as to run on the machine? CSE 160/Berman
Optimizing Application Performance • Backfilling and MPP scheduling strategies typically optimize for throughput • Optimizing throughput and optimizing application performance (e.g. execution time) can often conflict • How can applications optimize performance in an MPP environment? • Moldable jobs = jobs which can run with more than one partition size • Question: What is the optimal partition size for moldable jobs? • We can answer this question when the MPP scheduler runs a conservative backfilling strategy and publishes the list of available nodes.
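A minimal sketch of the idea (not the actual SA algorithm described on the next slide): given the availability list published by a conservative backfilling scheduler and an assumed speedup curve, try each available (start time, free nodes) slot and request the partition size that minimizes the estimated completion time. All numbers and the speedup model are hypothetical.

#include <stdio.h>

#define SLOTS 4
double t[SLOTS]     = { 0.0, 10.0, 25.0, 60.0 };   /* availability list: from time t[i]...  */
int    avail[SLOTS] = { 8,   16,   32,   64  };    /* ...at least avail[i] nodes are free   */

double seq_time = 1000.0;                           /* serial runtime of the job            */
double runtime(int n) { return seq_time / (0.8 * n + 0.2); }   /* assumed speedup curve     */

int main(void) {
    int best_n = 1; double best_done = 1e30;
    for (int i = 0; i < SLOTS; i++) {
        int n = avail[i];                           /* request everything free at t[i]      */
        double done = t[i] + runtime(n);            /* start time + parallel run time       */
        if (done < best_done) { best_done = done; best_n = n; }
    }
    /* A larger partition may lose: waiting for more nodes can outweigh the extra speedup.  */
    printf("request %d nodes, estimated completion %.1f\n", best_n, best_done);
    return 0;
}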
Optimizing Applications targeted to a Batch-scheduled MPP • SA = generic AppLeS scheduler developed for jobs submitted to a backfilling MPP • SA uses the availability list of the MPP scheduler to determine the size of the partition to be requested by the application • The speedup curve is known for the Gas applications • Static = jobs submitted without SA • Workload taken from KTH (the Swedish Royal Institute of Technology) • Experiments developed by Walfredo Cirne