A Survey on Scheduling Methods of Task-Parallel Processing Chikayama and Taura Lab M1 48-096415 Jun Nakashima
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Motivation • Threads and tasks have much in common • Both are units of execution • Multiple threads/tasks may be executed simultaneously • Scheduling methods for tasks can therefore be useful for scheduling threads
Background • Demand for exploiting dynamic and irregular parallelism • Simple parallelization (pthreads, OpenMP, …) is not efficient • Few threads: load balancing is difficult • Many threads: good load balance, but the overhead is unbearable • Examples: • N-Queens puzzle • Strassen's algorithm (matrix-matrix product) • LU factorization of a sparse matrix
Task-Parallel Processing • Decompose the entire computation into tasks and execute them in parallel • Task: a unit of execution much lighter than a thread • Fairness of tasks is not considered • Tasks may be deferred or suspended • Representation of dependences: • Task creation by a task • Waiting for child tasks • Programming environments with task support: Cilk, X10, Intel TBB, OpenMP (≥ 3.0), etc.
Task-Parallel Processing (2) • A simple example:

task task_fib(n) {
  if (n <= 1) return 1;
  t1 = create_task(task_fib(n-2));  // create task
  t2 = create_task(task_fib(n-1));
  ret1 = task_wait(t1);             // wait for children
  ret2 = task_wait(t2);
  return ret1 + ret2;
}

[Task graph: fib(n) spawns fib(n-2) and fib(n-1), which recursively spawn fib(n-3), fib(n-4), …; independent tasks (same color in the original figure) can be executed in parallel]
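The pseudocode above can be mirrored in runnable form. A minimal Python sketch follows; purely for illustration, each "task" here is a full OS thread, which is exactly the heavyweight approach that real task runtimes avoid:

```python
import threading

# Hypothetical stand-in for the slide's create_task/task_wait API.
class Task:
    def __init__(self, fn, *args):
        self._result = None
        def run():
            self._result = fn(*args)
        self._thread = threading.Thread(target=run)
        self._thread.start()

    def wait(self):
        self._thread.join()
        return self._result

def task_fib(n):
    if n <= 1:
        return 1
    t1 = Task(task_fib, n - 2)   # create task
    t2 = Task(task_fib, n - 1)
    ret1 = t1.wait()             # wait for children
    ret2 = t2.wait()
    return ret1 + ret2

print(task_fib(10))  # 89, with fib(0) = fib(1) = 1
```

Because the parent blocks in `wait()` while both children run, the independent subtrees do execute concurrently, matching the task graph on this slide.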
Basic execution model • Fork threads up to the number of CPU cores • Each thread has a queue for tasks • Each task is assigned to and executed by one thread [Diagram: two threads, each holding a queue of fib tasks]
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Basic scheduling strategies: Breadth-First and Work-First • Breadth-First: at task creation, enqueue the new task; execute a child only when the parent task suspends • Work-First: at task creation, the parent always suspends and the child runs; the parent continues when the child task finishes [Diagram: per-thread queues of ready/running/waiting fib tasks under each strategy]
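The difference between the two creation policies can be shown with a toy single-threaded sketch (my own illustration, not from the survey; "executing" a task just records its name):

```python
from collections import deque

def run_breadth_first(root):
    # Breadth-First: at creation, children are enqueued and the
    # parent keeps running; the queue is drained in FIFO order.
    queue = deque([root])
    order = []
    while queue:
        name, children = queue.popleft()
        order.append(name)
        queue.extend(children)
    return order

def run_work_first(root):
    # Work-First: at creation, the parent suspends and the child
    # runs to completion before the parent continues.
    name, children = root
    order = [name]
    for child in children:
        order.extend(run_work_first(child))
    return order

# Tiny task tree: A creates B and C; B creates D and E.
tree = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])
print(run_breadth_first(tree))  # ['A', 'B', 'C', 'D', 'E']
print(run_work_first(tree))     # ['A', 'B', 'D', 'E', 'C']
```

Breadth-First visits the tree level by level, while Work-First behaves like a depth-first function call into each child.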
Work stealing • Load-balancing technique for threads under a work-first scheduler • Idle threads steal runnable tasks from other threads • Basic strategy is FIFO: steal the oldest task in the task queue • The victim thread should be chosen at random [Diagram: an idle thread sends a steal request and takes the oldest task, fib(n)]
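The FIFO steal policy can be sketched with a lock-protected deque (a simplified illustration; production work-stealing deques avoid a single global lock):

```python
import collections
import threading

class WorkStealingQueue:
    # The owner pushes and pops at the tail (newest tasks);
    # a thief steals from the head, i.e. the oldest task.
    def __init__(self):
        self._deque = collections.deque()
        self._lock = threading.Lock()

    def push(self, task):            # owner: newest at the tail
        with self._lock:
            self._deque.append(task)

    def pop(self):                   # owner: take the newest task
        with self._lock:
            return self._deque.pop() if self._deque else None

    def steal(self):                 # thief: take the oldest task
        with self._lock:
            return self._deque.popleft() if self._deque else None

q = WorkStealingQueue()
for t in ["fib(n)", "fib(n-2)", "fib(n-4)"]:
    q.push(t)
print(q.steal())  # 'fib(n)': the oldest task is stolen
print(q.pop())    # 'fib(n-4)': the owner keeps the newest work
```

Stealing the oldest task matters because, as the next slide notes, old tasks tend to represent the largest remaining subtrees.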
Effect of Work Stealing • An old task tends to create many tasks in the future • Especially under recursive parallelism [Diagram: in the task graph of the previous slide, the stolen fib(n-1) subtree gives Thread 2 a large share of the work, balancing it against Thread 1]
Lazy Task Creation • Save the continuation of the parent task instead of creating a child task • A continuation is lighter than a task • On a steal request, create a task from the continuation and steal it [Diagram: the thief's steal request turns the saved continuation of fib(n) (≠ a task) into a task, which is then stolen]
Cut-off • Execute the child sequentially instead of creating a task • To avoid too fine-grained tasks • Basic cut-off criteria: • Number of tasks • Recursion depth [Diagram: below a certain depth, the fib subtrees are executed serially]
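A depth-based cut-off can be sketched as follows (the cut-off depth of 3 is an arbitrary illustrative choice, and threads again stand in for much lighter tasks):

```python
import threading

CUTOFF_DEPTH = 3  # illustrative threshold, not from the survey

def fib(n, depth=0):
    if n <= 1:
        return 1
    if depth >= CUTOFF_DEPTH:
        # Cut-off: plain sequential recursion, no task-creation overhead.
        return fib(n - 2, depth + 1) + fib(n - 1, depth + 1)
    # Shallow levels: spawn the children in parallel.
    results = {}
    def child(key, m):
        results[key] = fib(m, depth + 1)
    t1 = threading.Thread(target=child, args=(1, n - 2))
    t2 = threading.Thread(target=child, args=(2, n - 1))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results[1] + results[2]

print(fib(12))  # 233, with fib(0) = fib(1) = 1
```

Only the top few levels of the recursion pay the spawning cost; the many tiny leaf calls run serially.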
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Challenges • Architecture-aware scheduling • Scalable implementation • Determination of cut-off threshold
Architecture-aware scheduling • The basic methods take no account of the architecture • On some architectures, performance is degraded • Example: NUMA architectures [Diagram: four cores and two memories connected by an interconnect]
NUMA Architecture • NUMA = Non-Uniform Memory Access • Memory access cost depends on the CPU core and the address • Local memory access is fast; remote memory access is slow • Considering locality is very important!
A bad case on NUMA • When a thread steals a task from a remote CPU • The task's data stays in the victim's local memory, causing more remote memory accesses [Diagram: a stolen task running on Core 3 keeps accessing data in Core 1's memory]
Affinity Bubble Scheduler • "Scheduling Dynamic OpenMP Applications over Multicore Architectures" (Broquedis et al.) • A locality-aware thread scheduler • Based on BubbleSched: a framework for implementing schedulers on hierarchical architectures • Threads are grouped into bubbles • The scheduler uses bubbles as hints
What is a bubble? • A group of tasks and (nested) bubbles • Describes the affinities of tasks • Created by calling a library function • Grouped tasks use shared data
Initial task distribution • Explode bubbles hierarchically • Explode the root bubble and divide its contents to balance the load • Explode an inner bubble to distribute its tasks over 2 CPU cores [Diagram: nested bubbles of tasks being distributed over Cores 1-4]
NUMA-aware Work Stealing • Idle threads steal tasks from threads as local as possible [Diagram: an idle core steals from a core on the same NUMA node first]
Challenges • Architecture-aware scheduling • Affinity Bubble-scheduler • Scalable implementation • Determination of cut-off threshold
Scalable implementation • When operating on task queues, threads have to acquire a lock • Because task queues may be accessed by multiple threads • A queue operation occurs at every task creation and destruction • Locks may become a serious bottleneck! [Diagram: a steal request forces locking the entire queue]
A simple way to reduce locking • Two task queues per thread: one local and one public • Tasks are stolen only from the public queue • The local queue is lock-free; only the public queue needs a lock [Diagram: a steal request touches only the public queue]
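A sketch of the double-queue idea (simplified and of my own construction; the explicit `publish` step makes visible the task move whose cost motivates the split queue on the following slides):

```python
import collections
import threading

class DoubleQueue:
    def __init__(self):
        self.local = collections.deque()    # owner-only: no lock needed
        self.public = collections.deque()   # shared: lock-protected
        self._lock = threading.Lock()

    def push_local(self, task):
        self.local.append(task)             # lock-free fast path

    def publish(self):
        # Move the oldest local task to the public queue so it can be
        # stolen. This move between queues is the extra cost.
        if self.local:
            task = self.local.popleft()
            with self._lock:
                self.public.append(task)

    def steal(self):
        with self._lock:
            return self.public.popleft() if self.public else None
```

The owner's common case (push/pop on `local`) never takes the lock; only publishing and stealing do.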
Problem of the double task queue • When a task is moved from the local queue to the public queue, a memory copy is required [Diagram: a task being copied between the two queues of a thread]
Split Task Queues • "Scalable Work Stealing" (Dinan et al.) • Split a single task queue with a "split pointer" • From head to split pointer: local portion (lock-free) • From split pointer to tail: public portion
Split Task Queues (2) • Move the pointer toward the head if the public portion becomes empty • This operation is lock-free • Move the pointer toward the tail if the local portion becomes empty • No task copy is required [Diagram: the same array of tasks, repartitioned by moving only the split pointer]
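The split-pointer mechanism can be sketched as below. This is my own simplified illustration: a single list holds all tasks, and releasing work only moves an index, never copies a task. (In the real design the release is made lock-free with careful memory ordering; here a plain index update by the owner stands in for it.)

```python
import threading

class SplitQueue:
    # Indices [0, split) are the local portion (owner-only);
    # indices [split, tail) are the public portion (stealable).
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.split = len(self.tasks)      # everything local at first
        self._lock = threading.Lock()     # guards the public portion only

    def release(self):
        # Public portion empty? Move the split toward the head:
        # one more task becomes public, with no copying at all.
        if self.split > 0:
            self.split -= 1

    def steal(self):
        with self._lock:
            if self.split < len(self.tasks):
                return self.tasks.pop()   # thief takes from the tail
            return None

sq = SplitQueue(["t1", "t2", "t3"])
print(sq.steal())   # None: nothing is public yet
sq.release()        # t3 becomes stealable
print(sq.steal())   # 't3'
```

Compare with the double queue: transferring work is an index decrement instead of a dequeue-plus-enqueue of the task itself.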
And more… • Also in "Scalable Work Stealing" (Dinan et al.): • Efficient task creation: initialize the task queue entry directly • A better amount of work to steal: half of the public queue
Challenges • Architecture-aware scheduling • Affinity Bubble-Scheduler • Scalable implementation • Split Task Queues • Determination of cut-off threshold
Determination of the cut-off threshold • An appropriate cut-off threshold cannot be determined simply • It depends on the algorithm, the scheduling method, and the input data • Too large: tasks become too coarse-grained, leading to load imbalance • Too small: tasks become too fine-grained, causing large overhead
Profile-based cut-off determination • "An Adaptive Cut-off for Task Parallelism" (Duran et al.) • Uses two profiling modes: • Full Mode • Minimal Mode • Estimates execution time and decides whether to cut off
Full Mode • Measures every task's execution time • Heavy overhead • Complete information [Diagram: execution times are collected for every node of the fib task graph]
Minimal Mode • Measures the execution time of "real tasks" only • Small overhead • Incomplete information: cut-off tasks are not measured [Diagram: serially executed (cut-off) subtrees of the fib task graph are not measured]
Adaptive Profiling • Collects execution times for each depth of recursion • Uses Full Mode until enough information has been collected • After that, switches to Minimal Mode [Diagram: tasks early in the execution order are profiled in Full Mode; later ones may not be profiled in Minimal Mode]
Cut-off strategy • Estimates a task's execution time from the collected information (average of previous executions) • If the estimated time is smaller than the threshold, apply cut-off and execute serially • If it is larger, create a new task and execute in parallel
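The decision rule can be sketched as follows. This is my own illustration of the slide's description: averages are kept per recursion depth, and the 1 ms threshold and sample numbers are made up for the example.

```python
from collections import defaultdict

THRESHOLD_MS = 1.0  # illustrative threshold, not from the paper

class Profiler:
    def __init__(self):
        self.samples = defaultdict(list)   # depth -> measured times (ms)

    def record(self, depth, elapsed_ms):
        self.samples[depth].append(elapsed_ms)

    def estimate(self, depth):
        # Average of previous executions at this depth, or None if
        # no information has been collected yet.
        times = self.samples.get(depth)
        return sum(times) / len(times) if times else None

def should_cut_off(profiler, depth):
    est = profiler.estimate(depth)
    if est is None:
        return False            # no data yet: keep creating tasks
    return est < THRESHOLD_MS   # too short to be worth a task

p = Profiler()
p.record(3, 4.0); p.record(3, 2.0)   # depth 3 averages 3.0 ms
p.record(5, 0.4)                     # depth 5 averages 0.4 ms
print(should_cut_off(p, 3))  # False: big enough to run in parallel
print(should_cut_off(p, 5))  # True: execute serially instead
```

Tasks estimated to run longer than the threshold are spawned in parallel; the rest are inlined, adapting the cut-off to the measured workload.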
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Consideration • When adopting task-scheduling methods for thread scheduling, their side effects must be considered • The main difference between tasks and threads is fairness • Fairness: runnable threads get equal CPU time (weighted by priority) • No thread may hold a CPU forever
Consideration of fairness • Affinity Bubble Scheduler: originally designed for threads • Split task queues: a data structure that reduces locking and improves scalability; the basic idea does not impede fairness • Profile-based cut-off: cut-off can be restricted to short-lived threads, which makes it easier to apply safely
Summary • Basic scheduling methods • Challenges and solutions • Architecture-aware scheduling: Affinity Bubble Scheduler • Scalable implementation: Split Task Queues • Determination of the cut-off threshold: Profile-based cut-off • Consideration • These solutions are not especially harmful to fairness