Evaluating the potential of age-based scheduling in improving performance on asymmetric multiprocessors and overcoming challenges in thread scheduling.
Age Based Scheduling for Asymmetric Multiprocessors
Nagesh B Lakshminarayana, Jaekyu Lee & Hyesoon Kim
Outline • Background and Motivation • Age Based Scheduling • Evaluation • Conclusion
Asymmetric (Chip) Multiprocessors • Heterogeneous architectures where all cores have the same ISA but different performance [Figure: Heterogeneous Architecture with processing elements PE A and PE B]
Asymmetric (Chip) Multiprocessors • Potential for better performance than SMPs occupying the same area and consuming the same power [Figure: Symmetric Chip Multiprocessor (SMP/CMP) with Core0 to Core3 vs. Asymmetric Chip Multiprocessor (AMP/ACMP) with Core0 to Core3]
AMPs present new challenges • Thread scheduling is one of them
Scheduling in Multiprocessor OSes • Thread Assignment • assign to the least loaded core • Load Balancing • make the load on all cores uniform • Idle Balancing • move threads from busy cores to idle cores
Scheduling in Multiprocessor OSes • Assume that all cores are identical • Results in bad performance and application instability [Figure: Parsec benchmarks on a (real) AMP using the Linux Scheduler]
Problem with Current Scheduling • Not taking advantage of the fast core
Outline • Background and Motivation • Age Based Scheduling (ABS) • Evaluation • Conclusion
Motivation for Age Based Scheduling • Many compute-intensive multithreaded applications follow the fork-join model • Milestones (barriers) in thread execution [Figure: Application Model: main thread, fork, barrier … barrier, join]
Symmetry of Applications • Threads created together are symmetric • Based on instruction count • Degree of Symmetry = Std Dev / Average • Symmetric benchmarks are benchmarks with degree of symmetry <= 0.1 [Figure: Degree of Symmetry of Parsec Benchmarks]
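The symmetry metric on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the slide does not say whether population or sample standard deviation is meant.

```python
from statistics import mean, pstdev


def degree_of_symmetry(instruction_counts):
    """Std Dev / Average of per-thread instruction counts.

    Assumption: population standard deviation (pstdev); the slide
    does not specify which variant is used.
    """
    avg = mean(instruction_counts)
    if avg == 0:
        raise ValueError("average instruction count must be non-zero")
    return pstdev(instruction_counts) / avg


def is_symmetric(instruction_counts, threshold=0.1):
    """Apply the slide's cutoff: symmetric if degree of symmetry <= 0.1."""
    return degree_of_symmetry(instruction_counts) <= threshold
```

A lower value means the threads executed more similar amounts of work; identical instruction counts give a degree of symmetry of 0.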
Insight • Difficult to predict absolute execution duration, so predict relative execution duration • exe_dur (T1) = exe_dur (T2) = exe_dur (T3) = exe_dur (T4) [Figure: threads T1, T2, T3, T4 between two barriers; absolute execution duration unknown]
Putting It Together • Applications follow the fork-join model with milestones in between • Many applications are symmetric • Easy to predict relative execution duration to the next milestone • Hence: Age Based Scheduling
What is Age? • Age is the progress made by a thread towards its next milestone
Age Calculation • Threads created together have the same age • As a thread executes, it ages • Reset age when a milestone is crossed • tA – age of thread A • tB – age of thread B • X – unknown, assumed to be a large value [Figure: execution timeline from creation through milestone (barrier) to milestone (termination), with age labels tA = X, 0, 30, 0 and tB = 0, 50, 0]
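The three age rules above can be sketched in a few lines. The class and method names here are hypothetical, and the unit of progress is left abstract (the slides mention instruction counts and performance counters, but this sketch does not model either).

```python
from dataclasses import dataclass


@dataclass
class ThreadAge:
    """Hypothetical bookkeeping record for one thread's age."""

    name: str
    age: int = 0  # progress since creation or since the last milestone

    def execute(self, progress: int) -> None:
        # As a thread executes, it ages.
        if progress < 0:
            raise ValueError("progress must be non-negative")
        self.age += progress

    def cross_milestone(self) -> None:
        # Crossing a milestone (e.g. a barrier) resets the age.
        self.age = 0


# Threads created together start with the same age (0).
a, b = ThreadAge("A"), ThreadAge("B")
a.execute(30)
b.execute(50)
```

After these calls thread A has age 30 and thread B has age 50, matching the values on the slide; calling `cross_milestone()` on either returns its age to 0.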
Age Based Scheduling Algorithm • To make a scheduling decision: • Calculate the remaining execution duration to the next milestone based on age • Assign threads with longer remaining execution durations to the fast core: Longest Job to Fast Core First (LJFCF)
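A minimal sketch of the LJFCF idea follows. It is an illustration under simplifying assumptions, not the scheduler evaluated in the talk: the function name, the dictionary-based interface, the core names, and the one-thread-per-core pairing are all hypothetical, and ties are broken by name only to keep the result deterministic.

```python
def ljfcf(rem_exe, core_speeds):
    """Pair threads with cores: longest remaining duration to fastest core.

    rem_exe:     {thread_name: predicted remaining execution duration}
    core_speeds: {core_name: relative speed}
    Returns {core_name: thread_name}. Assumption: at most one thread per
    core; surplus threads or cores are simply left unpaired.
    """
    # Threads with the longest remaining execution first.
    threads = sorted(rem_exe, key=lambda t: (-rem_exe[t], t))
    # Fastest cores first.
    cores = sorted(core_speeds, key=lambda c: (-core_speeds[c], c))
    return dict(zip(cores, threads))


assignment = ljfcf(
    {"A": 50, "B": 70, "C": 90, "D": 30},
    {"fast": 8, "slow1": 1, "slow2": 1, "slow3": 1},
)
```

With these inputs, thread C (the longest remaining duration, 90) is paired with the fast core and the other threads with the slow cores.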
Application of LJFCF • Apply whenever • Thread is created • A core becomes idle • Reassignment timer expires (for load balancing)
Working of the Algorithm • rem_exe = (X – 30) [Figure: execution timeline of thread T1 from creation through milestone (barrier) to milestone (termination), with tA = 0, tA = 30, and age at barrier = X]
Remaining Execution Duration (I) • Track progress of threads • Using Prediction [AGE] • Predict that all threads have the same inter-milestone distance • tA – age of thread A • tB – age of thread B [Figure: execution timeline from creation through milestone (barrier) to milestone (termination), with age labels tA = X, 0, 0, X and tB = 0, X]
Remaining Execution Duration (II) • Using Profiling [AGE(PROF)] • Threads have different inter-milestone distances, calculated based on a metric obtained by profiling • tA – age of thread A • tB – age of thread B • r is from the profiler • Only one r value for each thread [Figure: execution timeline from creation through milestone (barrier) to milestone (termination), with age labels tA = X, 0, X, 0 and tB = 0, rX]
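One way to read the two variants is as a single formula, sketched below. This is an interpretation of the slides rather than a stated definition: the earlier slide gives rem_exe = (X – 30) for a thread of age 30, and this slide labels thread B's age at the milestone as rX, so the sketch assumes rem_exe = r·X – age, with r = 1 for plain AGE. The function name and the concrete value chosen for X are hypothetical.

```python
def remaining_execution(age, x, r=1.0):
    """Predicted remaining duration to the next milestone.

    age: progress made since the last milestone
    x:   assumed inter-milestone distance (unknown in practice,
         taken to be a large value)
    r:   per-thread scaling factor from the profiler; r = 1.0 models
         the AGE policy, other values model AGE(PROF) (assumption)
    """
    return r * x - age


X = 100  # hypothetical stand-in for the unknown, large X

# AGE: every thread shares the same X, so the younger thread has more left.
rem_a = remaining_execution(30, X)
rem_b = remaining_execution(50, X)

# AGE(PROF): thread B is profiled as having a longer inter-milestone distance.
rem_b_prof = remaining_execution(50, X, r=1.5)
```

Under plain AGE the value of X cancels out when two threads are compared, so ordering by remaining duration is the same as ordering by age, youngest first. With profiling, a thread with a larger r can have more work remaining even though it is older.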
Working of the Algorithm • rem_exeA = 50 • rem_exeB = 70 • rem_exeC = 90 • rem_exeD = 30 [Figure: threads A, B, C, D on one fast core and three slow cores; labels rem_exeC = 90 and rem_exeA = 50 repeated]
Benefit of Age Based Scheduling • Asymmetry aware • Utilizes all cores • Gives all threads opportunities to run on fast cores
Implementation • OS • Track progress using performance counters • Disable the counter on interrupts • Compiler (AGE(PROF)) • Passes profiled information • One value for each thread
Outline • Background and Motivation • Age Based Scheduling • Evaluation • Conclusion
Evaluation • Simulation based experiments • Trace + execution hybrid simulator • Lock, barriers are modeled • Context switch and migration overhead simulated • 10 ms time slice for each thread • Machine configuration • 1 fast, 7 slow, 8:1 speed ratio (others are in the paper) • Benchmarks • Symmetric • Parsec (simmedium input) • Asymmetric • Splash-2 • OMPSCR • SuperLU
LJFCF vs Other Policies (I) • Parsec • Baseline: SCALEDLD • The default Linux policy, which performs considerably worse than the other policies, is not shown
LJFCF vs Other Policies (II) • Asymmetric Benchmarks • Baseline: SCALEDLD
Idle Cycles • Linux Scheduler – Most of the idle cycles contributed by fast core • SCALEDLD – keeps same thread(s) on fast core • AGE – assigns different threads to fast core
Different AMP Configurations • X/1: ratio of speeds of fast and slow cores is X:1 • Need for asymmetry-aware scheduling increases as cores become more asymmetric • AGE based policies show more improvement over SCALEDLD as asymmetry increases
Outline • Background and Motivation • Age Based Scheduling • Evaluation • Conclusion
Conclusion • Age based scheduling (ABS) for Asymmetric Multiprocessors • ABS assumes threads created at the same time are symmetric • ABS assigns threads to cores based on their predicted remaining execution durations • Predictions are made based on Age of threads • Improvement of 10.4% (Pred) and 13.2% (Prof) for Parsec and 7.6% (Pred) and 9.4% (Prof) for Asymmetric benchmarks over Li’s mechanism