Mapping and Scheduling W+A: Chapter 4 CSE 160/Berman
Outline • Mapping and Scheduling • Static Mapping Strategies • Dynamic Mapping Strategies • Scheduling CSE 160/Berman
Mapping and Scheduling Models • Basic Models: • Program model is a task graph with dependencies • Platform model is a set of processors with an interconnection network CSE 160/Berman
Mapping and Scheduling • Mapping and scheduling involve the following activities • Select a set of resources on which to schedule the task(s) of the application. • Assign application task(s) to compute resources. • Distribute data or co-locate data and computation. • Order tasks on compute resources. • Order communication between tasks. CSE 160/Berman
Mapping and Scheduling Terminology • 1. Select a set of resources on which to schedule the task(s) of the application. 2. Assign application task(s) to compute resources. 3. Distribute data or co-locate data and computation. 4. Order tasks on compute resources. 5. Order communication between tasks. • Activity 1 = resource selection • Activities 1-3 are generally termed mapping • Activities 4-5 are generally termed scheduling • For many researchers, scheduling is also used to describe activities 1-5 • Mapping is an assignment of tasks in space; scheduling focuses on ordering in time CSE 160/Berman
Goals • Want the mapping and scheduling algorithms and models to promote the assignment/ordering with the smallest execution time • Accuracy vs. Ranking: [Figure: model predictions (A, B) compared against real execution times (A', B') and the optimum, contrasting prediction accuracy with correct ranking of candidate assignments] CSE 160/Berman
What is the best mapping? • [Figure: example task graph with node and edge weights (1, 2, 3, 4, 7) and two candidate mappings onto processors P1 and P2] CSE 160/Berman
Static and Dynamic Mapping Strategies • Static methods generate the partitioning prior to execution • Static mapping strategies work well when we can reasonably predict the time to perform application tasks during execution • When it is not easy to predict task execution time, dynamic strategies may perform better • Dynamic methods generate the partitioning during execution • For example, the work queue and master/slave (M/S) are dynamic methods CSE 160/Berman
Static Mapping • Static mapping can involve • partitioning of tasks (functional decomposition); the Sieve of Eratosthenes is an example • partitioning of data (data decomposition); the fixed decomposition of Mandelbrot (k blocks per processor) is an example of this (see the sketch below) • [Figure: processors P2, P3, P5, P7, P11, P13, P17 in the Sieve of Eratosthenes example] CSE 160/Berman
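To make the fixed data decomposition concrete, here is a minimal sketch (not from the slides) of assigning image rows to processors so that each processor receives k blocks; the processor count, blocks per processor, and rows per block are hypothetical parameters.

#include <stdio.h>

/* Sketch: fixed (static) data decomposition with K blocks per processor.
 * P, K, and B are hypothetical parameters chosen so that ROWS = P*K*B.   */
#define P    4                 /* processors           */
#define K    2                 /* blocks per processor */
#define B    2                 /* rows per block       */
#define ROWS (P * K * B)

int owner(int row) {
    /* Blocks of B rows are dealt to processors round-robin, so each
     * processor ends up owning exactly K blocks.                         */
    return (row / B) % P;
}

int main(void) {
    for (int row = 0; row < ROWS; row++)
        printf("row %2d -> processor %d\n", row, owner(row));
    return 0;
}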
Load Balancing • Load balancing = a strategy to partition the application so that • all processors perform an equivalent amount of work • (all processors finish in an equivalent amount of time; this is really time-balancing) • Processors may take different amounts of time to do equivalent amounts of work • Load balancing is an important technique in parallel processing • There are many ways to achieve a balanced load • Both dynamic and static load balancing techniques exist CSE 160/Berman
Static and Dynamic Mapping for the N-body Problem • The N-body problem: Given n bodies in 3D space, determine the gravitational force F between each pair of them at any given point in time: F = G m_a m_b / r^2, where G is the gravitational constant, r is the distance between the two bodies, and m_a and m_b are the masses of the bodies CSE 160/Berman
Exact N-body serial pseudo-code • At each time step t, the velocity v and position x of body i may change • The real problem is a bit more complicated than this; see 4.2.3 in the book
for (t = 0; t < tmax; t++) {
  for (i = 0; i < N; i++) {           /* compute new velocity and position of each body */
    F = Force_routine(i);
    v_new[i] = v[i] + F * dt;
    x_new[i] = x[i] + v_new[i] * dt;
  }
  for (i = 0; i < N; i++) {           /* commit the updates once all forces are computed */
    x[i] = x_new[i];
    v[i] = v_new[i];
  }
}
CSE 160/Berman
Exact N-body and static partitioning • Can parallelize the n-body computation by tagging the velocity and position of each body and updating bodies using correctly tagged information • This can be implemented as a data-parallel algorithm. What is the worst-case complexity of a single iteration? • How should we partition this? (A sketch of one possible static block partitioning follows.) • Static partitioning can be a bad strategy for the n-body problem • The load can be very unbalanced for some configurations CSE 160/Berman
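A minimal sketch of one possible static block partitioning of the exact algorithm, assuming N bodies and P processors. The stub Force_routine and all parameters are hypothetical; the program only illustrates which bodies each processor would update.

#include <stdio.h>

#define N  8                           /* hypothetical number of bodies     */
#define P  4                           /* hypothetical number of processors */

double x[N], v[N], x_new[N], v_new[N];
double dt = 0.01;

/* Stub standing in for the O(N) force computation of the pseudo-code. */
double Force_routine(int i) {
    double F = 0.0;
    for (int j = 0; j < N; j++)
        if (j != i) F += (x[j] - x[i]);      /* placeholder, not real gravity */
    return F;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (double)i;   /* hypothetical initial positions */
    /* Static block partitioning: processor "me" owns bodies lo..hi-1.  In a   */
    /* real SPMD program "me" would be this processor's rank; here we loop     */
    /* over all ranks to show which bodies each one would update.              */
    int chunk = (N + P - 1) / P;
    for (int me = 0; me < P; me++) {
        int lo = me * chunk;
        int hi = (lo + chunk < N) ? lo + chunk : N;
        for (int i = lo; i < hi; i++) {
            double F = Force_routine(i);             /* still reads all N bodies: O(N) */
            v_new[i] = v[i] + F * dt;
            x_new[i] = x[i] + v_new[i] * dt;
        }
    }
    for (int i = 0; i < N; i++) { x[i] = x_new[i]; v[i] = v_new[i]; }
    printf("one time step done; x[0] = %.3f\n", x[0]);
    return 0;
}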
Improving the complexity of the N-body code • The complexity of the serial n-body algorithm is very large: O(n^2) per iteration • The communication structure is not local: each body must gather data from all other bodies • The most interesting problems have large n, for which the exact method is not feasible • The Barnes-Hut algorithm is a well-known approximation to the exact n-body problem and can be efficiently parallelized CSE 160/Berman
Barnes-Hut Approximation • The Barnes-Hut algorithm is based on the observation that a cluster of distant bodies can be approximated as a single distant body • Total mass = aggregate mass of the bodies in the cluster • Distance to cluster = distance to the center of mass of the cluster • This clustering idea can be applied recursively CSE 160/Berman
Barnes-Hut idea • Dynamic divide-and-conquer approach: • Each region (cube) of space is divided into 8 subcubes • If a subcube contains more than 1 body, it is recursively subdivided • If a subcube contains no bodies, it is removed from consideration • In the 2D analogue, each square region is divided into 4 subregions CSE 160/Berman
Barnes-Hut idea • For a 3D decomposition, the result is an octtree • For a 2D decomposition, the result is a quadtree (a minimal construction sketch follows) CSE 160/Berman
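A minimal 2D quadtree construction sketch (the 3D octtree case subdivides into 8 children instead of 4). The bodies, masses, and region size are hypothetical; only the subdivide-on-occupancy rule and the mass/center-of-mass accumulation described on the slides are illustrated.

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    double cx, cy, half;            /* square region: center and half-width  */
    double mass, mx, my;            /* total mass and mass-weighted position  */
    int    nbodies;                 /* number of bodies in this subtree       */
    struct Node *child[4];          /* NW, NE, SW, SE subregions              */
} Node;

Node *new_node(double cx, double cy, double half) {
    Node *n = calloc(1, sizeof(Node));
    n->cx = cx; n->cy = cy; n->half = half;
    return n;
}

int quadrant(Node *n, double x, double y) {
    return (x >= n->cx) + 2 * (y < n->cy);          /* 0..3 */
}

void insert(Node *n, double x, double y, double m) {
    if (n->nbodies >= 1) {
        if (n->nbodies == 1) {
            /* Region held a single body: push it down before adding the new one. */
            double ox = n->mx / n->mass, oy = n->my / n->mass, om = n->mass;
            int q = quadrant(n, ox, oy);
            double h = n->half / 2;
            if (!n->child[q])
                n->child[q] = new_node(n->cx + (q & 1 ? h : -h),
                                       n->cy + (q & 2 ? -h : h), h);
            insert(n->child[q], ox, oy, om);
        }
        int q = quadrant(n, x, y);
        double h = n->half / 2;
        if (!n->child[q])
            n->child[q] = new_node(n->cx + (q & 1 ? h : -h),
                                   n->cy + (q & 2 ? -h : h), h);
        insert(n->child[q], x, y, m);
    }
    /* Accumulate total mass and (mass-weighted) center of mass at this node. */
    n->mass += m; n->mx += m * x; n->my += m * y; n->nbodies++;
}

int main(void) {
    Node *root = new_node(0.0, 0.0, 1.0);           /* unit square around the origin */
    double bodies[4][3] = { {0.3,0.4,1}, {-0.2,0.1,2}, {0.5,-0.6,1}, {0.31,0.41,1} };
    for (int i = 0; i < 4; i++)
        insert(root, bodies[i][0], bodies[i][1], bodies[i][2]);
    printf("root mass %.1f, center (%.2f, %.2f)\n",
           root->mass, root->mx / root->mass, root->my / root->mass);
    return 0;
}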
Barnes-Hut Pseudo-code
for (t = 0; t < tmax; t++) {
  Build octtree;
  Compute total mass and center of mass of each cell;
  Traverse the tree, computing the forces;
  Update the position and velocity of all bodies;
}
• Notes: • The total mass and center of mass of each subcube are stored at its root • Tree traversal stops at a node when the clustering approximation can be used for a particular body • In the gravitational n-body problem described here, this can happen when r >= d/c, where r is the distance from the body to the center of mass of a subcube of side d and c is a constant (a traversal sketch follows) CSE 160/Berman
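A hedged sketch of the tree traversal for one body, reusing the Node structure from the construction sketch above. It applies the r >= d/c test to decide whether to use the clustering approximation or to open the node. Returning only the scalar force magnitude, the constant G, and the handling of r near 0 are simplifications for illustration.

#include <math.h>

static const double G = 6.674e-11;                /* gravitational constant        */

/* Magnitude of the approximate force on a body of mass m at (x, y).              */
double force_on(Node *n, double x, double y, double m, double c) {
    if (n == NULL || n->nbodies == 0) return 0.0;
    double cmx = n->mx / n->mass, cmy = n->my / n->mass;   /* center of mass       */
    double dx = cmx - x, dy = cmy - y;
    double r = sqrt(dx * dx + dy * dy);
    double d = 2.0 * n->half;                     /* side length of the cell       */
    if (n->nbodies == 1 || r >= d / c) {
        /* Far enough away (or a single body): treat the whole subtree as one body */
        /* located at its center of mass.                                          */
        return (r > 0.0) ? G * m * n->mass / (r * r) : 0.0;
    }
    double F = 0.0;                               /* otherwise open the node       */
    for (int q = 0; q < 4; q++)
        F += force_on(n->child[q], x, y, m, c);
    return F;
}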
Barnes-Hut Complexity • Partitioning is dynamic: the whole octtree must be reconstructed at each time step because the bodies will have moved • Constructing the tree can be done in O(n log n) • Computing the forces can be done in O(n log n) • Barnes-Hut for one iteration is O(n log n) [compare to O(n^2) for one iteration of the exact solution] CSE 160/Berman
Generalizing the Barnes-Hut approach • The approach can be used for applications that repeatedly perform some calculation on particles/bodies/data indexed by position • Recursive bisection (see the sketch below): • Divide the region in half at each step so that the particles are balanced • Map the resulting rectangular regions onto processors so that the load is balanced CSE 160/Berman
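A small sketch of recursive coordinate bisection: the particle set is sorted along one axis, cut so that the two halves carry particle counts proportional to the processors assigned to each side, and the cutting axis alternates at each level. The particle coordinates and processor count are hypothetical.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; int owner; } Particle;

static int cmp_x(const void *a, const void *b) {
    double d = ((const Particle *)a)->x - ((const Particle *)b)->x;
    return (d > 0) - (d < 0);
}
static int cmp_y(const void *a, const void *b) {
    double d = ((const Particle *)a)->y - ((const Particle *)b)->y;
    return (d > 0) - (d < 0);
}

/* Assign particles p[0..n-1] to processors first .. first+nprocs-1.        */
void bisect(Particle *p, int n, int first, int nprocs, int axis) {
    if (nprocs == 1) {
        for (int i = 0; i < n; i++) p[i].owner = first;
        return;
    }
    /* Sort along the current axis and cut so the two halves hold particle  */
    /* counts proportional to the processors on each side (balanced load).  */
    qsort(p, n, sizeof(Particle), axis ? cmp_y : cmp_x);
    int left_procs = nprocs / 2;
    int cut = (int)((long long)n * left_procs / nprocs);
    bisect(p, cut, first, left_procs, !axis);
    bisect(p + cut, n - cut, first + left_procs, nprocs - left_procs, !axis);
}

int main(void) {
    Particle p[8] = { {0.1,0.9,0},{0.8,0.2,0},{0.4,0.5,0},{0.7,0.7,0},
                      {0.2,0.3,0},{0.9,0.8,0},{0.3,0.6,0},{0.6,0.1,0} };
    bisect(p, 8, 0, 4, 0);
    for (int i = 0; i < 8; i++)
        printf("(%.1f, %.1f) -> P%d\n", p[i].x, p[i].y, p[i].owner);
    return 0;
}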
Recursive Bisection Programming Issues • How do we keep track of the regions mapped to each processor? • What should the density of each region be? [granularity!] • What is the complexity of performing the partitioning? How often should we repartition to optimize the load balance? • How can locality of communication or processor configuration be leveraged? CSE 160/Berman
Scheduling • Application scheduling: ordering and allocation of tasks/communication/data to processors • Application-centric performance measure, e.g. minimal execution time • Job Scheduling: ordering and allocation of jobs on an MPP • System-centric performance measure, e.g. processor utilization, throughput CSE 160/Berman
Job Scheduling Strategies • Gang-scheduling • Batch scheduling using backfilling CSE 160/Berman
Gang scheduling • Gang scheduling is a technique for allocating a collection of jobs on an MPP • One or more jobs are clustered as a gang • Gangs share time slices on the whole machine • The strategy combines time-sharing (gangs get time slices) and space-sharing (gangs partition space) approaches • There are many flavors of gang scheduling in the literature CSE 160/Berman
Gang Scheduling • Formal definition from Dror Feitelson: • Gang scheduling is a scheme that combines three features: • The threads of a set of jobs are grouped into gangs with the threads in a single job considered to be a single gang. • The threads in each gang execute simultaneously on distinct PEs, using a 1-1 mapping. • Time slicing is used, with all the threads in a gang being preempted and rescheduled at the same time. CSE 160/Berman
Why gang scheduling? • Gang scheduling promotes efficient performance of individual jobs as well as efficient utilization and fair allocation of machine resources. • Gang scheduling leads to two desirable properties: • It promotes efficient fine-grain interactions among the threads of a gang, since they are executing simultaneously. • Periodic preemption prevents long jobs from monopolizing system resources. • (The overhead of preemption can reduce performance, so preemption must be implemented efficiently.) • Used as the scheduling policy for the CM-5, Meiko CS-2, Paragon, etc. CSE 160/Berman
Batch Job Scheduling • Problem: How to schedule jobs waiting in a queue to run on a multicomputer? • Each job requests some number n of nodes and some time t to run • Goal: promote utilization of machine, fairness to jobs, short queue wait times CSE 160/Berman
One approach: Backfilling • Main idea: pack the jobs in the processor/time space • Allow the job at the head of the queue to be scheduled in the first available slot. • If other jobs in the queue can run without changing the start time of previous jobs in the queue, schedule them. • Promote jobs if they can start earlier. • Many versions of backfilling exist: • EASY: promote jobs as long as they don't delay the start time of the first job in the queue • Conservative: promote jobs as long as they don't delay the start time of any job in the queue • (A sketch of the EASY backfill test follows.) CSE 160/Berman
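A hedged sketch of the test an EASY-style backfiller applies to a candidate job: the job may be started now if it fits in the currently idle nodes and either finishes before the reserved start time of the job at the head of the queue or uses only nodes that would still be idle at that time. The names (free_now, shadow_time, extra_nodes) are hypothetical stand-ins for the scheduler's internal state.

/* Sketch of the EASY backfilling admission test for one candidate job.     */
int can_backfill(int req_nodes, double req_time, double now,
                 int free_now,        /* nodes idle right now                          */
                 double shadow_time,  /* reserved start time of the head-of-queue job  */
                 int extra_nodes)     /* nodes still idle at shadow_time after the     */
                                      /* head job's reservation is accounted for       */
{
    if (req_nodes > free_now) return 0;            /* cannot start right now           */
    if (now + req_time <= shadow_time) return 1;   /* finishes before the head starts  */
    return req_nodes <= extra_nodes;               /* or never delays the head job     */
}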
Backfilling Example • Submitting five requests… [Figure: five job requests plotted in the processors × time space] CSE 160/Berman
Backfilling Example • Submitting five requests… • Using backfilling... [Figure: the same requests packed more tightly in the processors × time space] CSE 160/Berman
Backfilling Example [Figure: intermediate backfilling steps in the processors × time space] CSE 160/Berman
Backfilling Example • An existing job finishes • Backfilling promotes the yellow job and then schedules the purple job [Figure: the resulting schedule in the processors × time space] CSE 160/Berman
Backfilling Scheduling • Backfilling is used in the Maui Scheduler at SDSC on the SP-2, PBS at NASA, the Computing Condominium Scheduler at Penn State, etc. • Backfilling issues: • What if the processors of the platform have different capacities (are not homogeneous)? • What if some jobs get priority over others? • Should parallel jobs be treated separately from serial jobs? • If multiple queues are used, how should they be administered? • Should users be charged to wait in the queue as well as to run on the machine? CSE 160/Berman
Optimizing Application Performance • Backfilling and MPP scheduling strategies typically optimize for throughput • Optimizing throughput and optimizing application performance (e.g. execution time) can often conflict • How can applications optimize performance in an MPP environment? • Moldable jobs = jobs which can run with more than one partition size • Question: What is the optimal partition size for moldable jobs? • We can answer this question when the MPP scheduler runs a conservative backfilling strategy and publishes the list of available nodes.
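A minimal sketch of the idea (not the actual SA algorithm described on the next slide): given the availability list published by a conservative backfilling scheduler and an assumed speedup curve, try each available (start time, free nodes) slot and request the partition size that minimizes the estimated completion time. All numbers and the speedup model are hypothetical.

#include <stdio.h>

#define SLOTS 4
double t[SLOTS]     = { 0.0, 10.0, 25.0, 60.0 };   /* availability list: from time t[i]...  */
int    avail[SLOTS] = { 8,   16,   32,   64  };    /* ...at least avail[i] nodes are free   */

double seq_time = 1000.0;                           /* serial runtime of the job            */
double runtime(int n) { return seq_time / (0.8 * n + 0.2); }   /* assumed speedup curve     */

int main(void) {
    int best_n = 1; double best_done = 1e30;
    for (int i = 0; i < SLOTS; i++) {
        int n = avail[i];                           /* request everything free at t[i]      */
        double done = t[i] + runtime(n);            /* start time + parallel run time       */
        if (done < best_done) { best_done = done; best_n = n; }
    }
    /* A larger partition may lose: waiting for more nodes can outweigh the extra speedup.  */
    printf("request %d nodes, estimated completion %.1f\n", best_n, best_done);
    return 0;
}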
Optimizing Applications targeted to a Batch-scheduled MPP • SA = generic AppLeS scheduler developed for jobs submitted to a backfilling MPP • SA uses the availability list of the MPP scheduler to determine the size of the partition to be requested by the application • The speedup curve is known for the Gas applications • Static = jobs submitted without SA • Workload taken from KTH (the Swedish Royal Institute of Technology) • Experiments developed by Walfredo Cirne