A Survey on Scheduling Methods of Task-Parallel Processing Chikayama and Taura Lab M1 48-096415 Jun Nakashima
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Motivation • Threads and tasks have much in common • Both are units of execution • Multiple threads/tasks may be executed simultaneously • Scheduling methods for tasks can therefore be useful for scheduling threads
Background • Demand for exploiting dynamic and irregular parallelism • Simple parallelization (pthreads, OpenMP, …) is not efficient • Few threads: load balancing is difficult • Many threads: good load balance, but the overhead is unbearable • Examples: • N-Queens puzzle • Strassen's algorithm (matrix-matrix product) • LU factorization of a sparse matrix
Task-Parallel Processing • Decompose the entire computation into tasks and execute them in parallel • Task: a unit of execution much lighter than a thread • Fairness of tasks is not considered • Tasks may be deferred or suspended • Representation of dependences: • Task creation by a task • Waiting for child tasks • Programming environments with task support: Cilk, X10, Intel TBB, OpenMP (≥ 3.0), etc.
Task-Parallel Processing (2) • A simple example:

task task_fib(n) {
  if (n <= 1) return 1;
  t1 = create_task(task_fib(n-2));  // create task
  t2 = create_task(task_fib(n-1));
  ret1 = task_wait(t1);             // wait for children
  ret2 = task_wait(t2);
  return ret1 + ret2;
}

[Task graph: fib(n) spawns fib(n-2) and fib(n-1), which recursively spawn fib(n-3), fib(n-4), …; independent tasks (same color in the original figure) can be executed in parallel]
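The pseudocode above can be mirrored in runnable form. A minimal Python sketch follows; purely for illustration, each "task" here is a full OS thread, which is exactly the heavyweight approach that real task runtimes avoid:

```python
import threading

# Hypothetical stand-in for the slide's create_task/task_wait API.
class Task:
    def __init__(self, fn, *args):
        self._result = None
        def run():
            self._result = fn(*args)
        self._thread = threading.Thread(target=run)
        self._thread.start()

    def wait(self):
        self._thread.join()
        return self._result

def task_fib(n):
    if n <= 1:
        return 1
    t1 = Task(task_fib, n - 2)   # create task
    t2 = Task(task_fib, n - 1)
    ret1 = t1.wait()             # wait for children
    ret2 = t2.wait()
    return ret1 + ret2

print(task_fib(10))  # 89, with fib(0) = fib(1) = 1
```

Because the parent blocks in `wait()` while both children run, the independent subtrees do execute concurrently, matching the task graph on this slide.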
Basic execution model • Fork threads up to the number of CPU cores • Each thread has a queue for tasks • Each task is assigned to and executed by one thread [Diagram: two threads, each holding a queue of fib tasks]
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Basic scheduling strategies: Breadth-First and Work-First • Breadth-First: at task creation, enqueue the new task; execute a child only when the parent task suspends • Work-First: at task creation, the parent always suspends and the child runs; the parent continues when the child task finishes [Diagram: per-thread queues of ready/running/waiting fib tasks under each strategy]
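The difference between the two creation policies can be shown with a toy single-threaded sketch (my own illustration, not from the survey; "executing" a task just records its name):

```python
from collections import deque

def run_breadth_first(root):
    # Breadth-First: at creation, children are enqueued and the
    # parent keeps running; the queue is drained in FIFO order.
    queue = deque([root])
    order = []
    while queue:
        name, children = queue.popleft()
        order.append(name)
        queue.extend(children)
    return order

def run_work_first(root):
    # Work-First: at creation, the parent suspends and the child
    # runs to completion before the parent continues.
    name, children = root
    order = [name]
    for child in children:
        order.extend(run_work_first(child))
    return order

# Tiny task tree: A creates B and C; B creates D and E.
tree = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])
print(run_breadth_first(tree))  # ['A', 'B', 'C', 'D', 'E']
print(run_work_first(tree))     # ['A', 'B', 'D', 'E', 'C']
```

Breadth-First visits the tree level by level, while Work-First behaves like a depth-first function call into each child.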
Work stealing • Load-balancing technique for threads under a work-first scheduler • Idle threads steal runnable tasks from other threads • Basic strategy is FIFO: steal the oldest task in the task queue • The victim thread should be chosen at random [Diagram: an idle thread sends a steal request and takes the oldest task, fib(n)]
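The FIFO steal policy can be sketched with a lock-protected deque (a simplified illustration; production work-stealing deques avoid a single global lock):

```python
import collections
import threading

class WorkStealingQueue:
    # The owner pushes and pops at the tail (newest tasks);
    # a thief steals from the head, i.e. the oldest task.
    def __init__(self):
        self._deque = collections.deque()
        self._lock = threading.Lock()

    def push(self, task):            # owner: newest at the tail
        with self._lock:
            self._deque.append(task)

    def pop(self):                   # owner: take the newest task
        with self._lock:
            return self._deque.pop() if self._deque else None

    def steal(self):                 # thief: take the oldest task
        with self._lock:
            return self._deque.popleft() if self._deque else None

q = WorkStealingQueue()
for t in ["fib(n)", "fib(n-2)", "fib(n-4)"]:
    q.push(t)
print(q.steal())  # 'fib(n)': the oldest task is stolen
print(q.pop())    # 'fib(n-4)': the owner keeps the newest work
```

Stealing the oldest task matters because, as the next slide notes, old tasks tend to represent the largest remaining subtrees.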
Effect of Work Stealing • An old task tends to create many tasks in the future • Especially under recursive parallelism [Diagram: in the task graph of the previous slide, the stolen fib(n-1) subtree gives Thread 2 a large share of the work, balancing it against Thread 1]
Lazy Task Creation • Save the continuation of the parent task instead of creating a child task • A continuation is lighter than a task • On a steal request, create a task from the continuation and steal it [Diagram: the thief's steal request turns the saved continuation of fib(n) (≠ a task) into a task, which is then stolen]
Cut-off • Execute the child sequentially instead of creating a task • To avoid too fine-grained tasks • Basic cut-off criteria: • Number of tasks • Recursion depth [Diagram: below a certain depth, the fib subtrees are executed serially]
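A depth-based cut-off can be sketched as follows (the cut-off depth of 3 is an arbitrary illustrative choice, and threads again stand in for much lighter tasks):

```python
import threading

CUTOFF_DEPTH = 3  # illustrative threshold, not from the survey

def fib(n, depth=0):
    if n <= 1:
        return 1
    if depth >= CUTOFF_DEPTH:
        # Cut-off: plain sequential recursion, no task-creation overhead.
        return fib(n - 2, depth + 1) + fib(n - 1, depth + 1)
    # Shallow levels: spawn the children in parallel.
    results = {}
    def child(key, m):
        results[key] = fib(m, depth + 1)
    t1 = threading.Thread(target=child, args=(1, n - 2))
    t2 = threading.Thread(target=child, args=(2, n - 1))
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results[1] + results[2]

print(fib(12))  # 233, with fib(0) = fib(1) = 1
```

Only the top few levels of the recursion pay the spawning cost; the many tiny leaf calls run serially.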
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Challenges • Architecture-aware scheduling • Scalable implementation • Determination of cut-off threshold
Architecture-aware scheduling • The basic methods take no account of the architecture • On some architectures, performance is degraded • Example: NUMA architectures [Diagram: four cores and two memories connected by an interconnect]
NUMA Architecture • NUMA = Non-Uniform Memory Access • Memory access cost depends on the CPU core and the address • Local memory access is fast; remote memory access is slow • Considering locality is very important!
A bad case on NUMA • When a thread steals a task from a remote CPU • The task's data stays in the victim's local memory, causing more remote memory accesses [Diagram: a stolen task running on Core 3 keeps accessing data in Core 1's memory]
Affinity Bubble Scheduler • "Scheduling Dynamic OpenMP Applications over Multicore Architectures" (Broquedis et al.) • A locality-aware thread scheduler • Based on BubbleSched: a framework for implementing schedulers on hierarchical architectures • Threads are grouped into bubbles • The scheduler uses bubbles as hints
What is a bubble? • A group of tasks and (nested) bubbles • Describes the affinities of tasks • Created by calling a library function • Grouped tasks use shared data
Initial task distribution • Explode bubbles hierarchically • Explode the root bubble and divide its contents to balance the load • Explode an inner bubble to distribute its tasks over 2 CPU cores [Diagram: nested bubbles of tasks being distributed over Cores 1-4]
NUMA-aware Work Stealing • Idle threads steal tasks from threads as local as possible [Diagram: an idle core steals from a core on the same NUMA node first]
Challenges • Architecture-aware scheduling • Affinity Bubble-scheduler • Scalable implementation • Determination of cut-off threshold
Scalable implementation • When operating on task queues, threads have to acquire a lock • Because task queues may be accessed by multiple threads • A queue operation occurs at every task creation and destruction • Locks may become a serious bottleneck! [Diagram: a steal request forces locking the entire queue]
A simple way to reduce locking • Two task queues per thread: one local and one public • Tasks are stolen only from the public queue • The local queue is lock-free; only the public queue needs a lock [Diagram: a steal request touches only the public queue]
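A sketch of the double-queue idea (simplified and of my own construction; the explicit `publish` step makes visible the task move whose cost motivates the split queue on the following slides):

```python
import collections
import threading

class DoubleQueue:
    def __init__(self):
        self.local = collections.deque()    # owner-only: no lock needed
        self.public = collections.deque()   # shared: lock-protected
        self._lock = threading.Lock()

    def push_local(self, task):
        self.local.append(task)             # lock-free fast path

    def publish(self):
        # Move the oldest local task to the public queue so it can be
        # stolen. This move between queues is the extra cost.
        if self.local:
            task = self.local.popleft()
            with self._lock:
                self.public.append(task)

    def steal(self):
        with self._lock:
            return self.public.popleft() if self.public else None
```

The owner's common case (push/pop on `local`) never takes the lock; only publishing and stealing do.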
Problem of the double task queue • When a task is moved from the local queue to the public queue, a memory copy is required [Diagram: a task being copied between the two queues of a thread]
Split Task Queues • "Scalable Work Stealing" (Dinan et al.) • Split a single task queue with a "split pointer" • From head to split pointer: local portion (lock-free) • From split pointer to tail: public portion
Split Task Queues (2) • Move the pointer toward the head if the public portion becomes empty • This operation is lock-free • Move the pointer toward the tail if the local portion becomes empty • No task copy is required [Diagram: the same array of tasks, repartitioned by moving only the split pointer]
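The split-pointer mechanism can be sketched as below. This is my own simplified illustration: a single list holds all tasks, and releasing work only moves an index, never copies a task. (In the real design the release is made lock-free with careful memory ordering; here a plain index update by the owner stands in for it.)

```python
import threading

class SplitQueue:
    # Indices [0, split) are the local portion (owner-only);
    # indices [split, tail) are the public portion (stealable).
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.split = len(self.tasks)      # everything local at first
        self._lock = threading.Lock()     # guards the public portion only

    def release(self):
        # Public portion empty? Move the split toward the head:
        # one more task becomes public, with no copying at all.
        if self.split > 0:
            self.split -= 1

    def steal(self):
        with self._lock:
            if self.split < len(self.tasks):
                return self.tasks.pop()   # thief takes from the tail
            return None

sq = SplitQueue(["t1", "t2", "t3"])
print(sq.steal())   # None: nothing is public yet
sq.release()        # t3 becomes stealable
print(sq.steal())   # 't3'
```

Compare with the double queue: transferring work is an index decrement instead of a dequeue-plus-enqueue of the task itself.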
And more… • Also in "Scalable Work Stealing" (Dinan et al.): • Efficient task creation: initialize the task queue entry directly • A better amount of work to steal: half of the public queue
Challenges • Architecture-aware scheduling • Affinity Bubble-Scheduler • Scalable implementation • Split Task Queues • Determination of cut-off threshold
Determination of the cut-off threshold • An appropriate cut-off threshold cannot be determined simply • It depends on the algorithm, the scheduling method, and the input data • Too large: tasks become too coarse-grained, leading to load imbalance • Too small: tasks become too fine-grained, causing large overhead
Profile-based cut-off determination • "An Adaptive Cut-off for Task Parallelism" (Duran et al.) • Uses two profiling modes: • Full Mode • Minimal Mode • Estimates execution time and decides whether to cut off
Full Mode • Measures every task's execution time • Heavy overhead • Complete information [Diagram: execution times are collected for every node of the fib task graph]
Minimal Mode • Measures the execution time of "real tasks" only • Small overhead • Incomplete information: cut-off tasks are not measured [Diagram: serially executed (cut-off) subtrees of the fib task graph are not measured]
Adaptive Profiling • Collects execution times for each depth of recursion • Uses Full Mode until enough information has been collected • After that, switches to Minimal Mode [Diagram: tasks early in the execution order are profiled in Full Mode; later ones may not be profiled in Minimal Mode]
Cut-off strategy • Estimates a task's execution time from the collected information (average of previous executions) • If the estimated time is smaller than the threshold, apply cut-off and execute serially • If it is larger, create a new task and execute in parallel
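The decision rule can be sketched as follows. This is my own illustration of the slide's description: averages are kept per recursion depth, and the 1 ms threshold and sample numbers are made up for the example.

```python
from collections import defaultdict

THRESHOLD_MS = 1.0  # illustrative threshold, not from the paper

class Profiler:
    def __init__(self):
        self.samples = defaultdict(list)   # depth -> measured times (ms)

    def record(self, depth, elapsed_ms):
        self.samples[depth].append(elapsed_ms)

    def estimate(self, depth):
        # Average of previous executions at this depth, or None if
        # no information has been collected yet.
        times = self.samples.get(depth)
        return sum(times) / len(times) if times else None

def should_cut_off(profiler, depth):
    est = profiler.estimate(depth)
    if est is None:
        return False            # no data yet: keep creating tasks
    return est < THRESHOLD_MS   # too short to be worth a task

p = Profiler()
p.record(3, 4.0); p.record(3, 2.0)   # depth 3 averages 3.0 ms
p.record(5, 0.4)                     # depth 5 averages 0.4 ms
print(should_cut_off(p, 3))  # False: big enough to run in parallel
print(should_cut_off(p, 5))  # True: execute serially instead
```

Tasks estimated to run longer than the threshold are spawned in parallel; the rest are inlined, adapting the cut-off to the measured workload.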
Agenda • Introduction • Basic Scheduling Methods • Challenges and solutions • Consideration • Summary
Consideration • When adopting task-scheduling methods for thread scheduling, their side effects must be considered • The main difference between tasks and threads is fairness • Fairness: runnable threads get equal CPU time (weighted by priority) • No thread may hold a CPU forever
Consideration of fairness • Affinity Bubble Scheduler: originally designed for threads • Split task queues: a data structure that reduces locking and improves scalability; the basic idea does not impede fairness • Profile-based cut-off: cut-off can be restricted to short-lived threads, which makes it easier to apply safely
Summary • Basic scheduling methods • Challenges and solutions • Architecture-aware scheduling: Affinity Bubble Scheduler • Scalable implementation: Split Task Queues • Determination of the cut-off threshold: Profile-based cut-off • Consideration • These solutions are not especially harmful to fairness