Exploiting Unbalanced Thread Scheduling for Energy and Performance on a CMP of SMT Processors
Matt DeVuyst, Rakesh Kumar, Dean Tullsen
Some Definitions
• Balanced schedule: a schedule of threads to contexts such that the number of threads per core is equal
• Unbalanced schedule: a schedule of threads to contexts such that the number of threads per core is not equal
[Figure: two example schedules of Threads 1 to 7 on Cores 1 to 3]
IPDPS: DeVuyst, Kumar, Tullsen
Why a CMP of SMT cores?
• Chip makers are manufacturing more chip multiprocessors (CMPs) with simultaneous multithreading (SMT)
  • Power5
  • Niagara
• Very little work has been done on thread scheduling for such an architecture
• Scheduling on this architecture is challenging
Application Diversity
• Different applications have different needs
• One way to effectively cope with application diversity is hardware heterogeneity [Kumar03]
Hardware Heterogeneity
[Figure: threads and cores]
Application Diversity
• Different applications have different needs
• One way to effectively cope with application diversity is hardware heterogeneity
• Another way to deal with application diversity is soft heterogeneity
Soft Heterogeneity
[Figure: threads and SMT cores]
Scheduling Complexity
• Given a 4-core CMP with 4 contexts per core and 12 threads:
  • There are 15,400 balanced schedules
  • There are 644,875 unbalanced schedules
[Figure: cores and contexts]
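The balanced count can be checked with a short calculation. The sketch below assumes that cores are interchangeable and that the order of threads within a core does not matter; under that assumption it reproduces the 15,400 figure. The function name is made up for illustration, and the slide does not state the counting convention behind the unbalanced figure, so that number is not reproduced here.

```python
from math import factorial


def count_balanced_schedules(num_threads: int, num_cores: int) -> int:
    """Count ways to split threads into equal-sized groups, one group per core.

    Assumption (not stated on the slide): cores are interchangeable and the
    order of threads within a core does not matter.
    """
    if num_threads % num_cores != 0:
        raise ValueError("threads must divide evenly across cores")
    per_core = num_threads // num_cores
    # Multinomial coefficient over the per-core groups, divided by the
    # number of ways to relabel the interchangeable cores.
    return factorial(num_threads) // (
        factorial(per_core) ** num_cores * factorial(num_cores)
    )


print(count_balanced_schedules(12, 4))  # 15400
```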
Our Goals
• Find good scheduling policies
• System-level scheduling
  → Granularity is an OS time-slice
• Optimize for both power and performance
  • Performance
  • Power
  • Energy
  • Energy Delay Product (EDP) = Energy × Delay (execution time)
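As a rough illustration of the last metric: EDP is conventionally energy multiplied by execution time (delay), so a schedule that saves energy by running slower can still have a worse EDP. The function name and the power and runtime figures below are invented for illustration and are not results from the talk.

```python
def energy_delay_product(avg_power_watts: float, delay_seconds: float) -> float:
    """Return EDP in joule-seconds for a run with the given power and runtime."""
    energy_joules = avg_power_watts * delay_seconds
    return energy_joules * delay_seconds


# Hypothetical schedules: B uses less energy (75 J vs 80 J) but runs longer.
edp_a = energy_delay_product(avg_power_watts=40.0, delay_seconds=2.0)
edp_b = energy_delay_product(avg_power_watts=30.0, delay_seconds=2.5)
print(edp_a, edp_b)  # 160.0 187.5
```

Here schedule B wins on energy alone but loses on EDP, which is why the two metrics are tracked separately.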
Outline
• Architecture
• Methodology
• Scheduling Policies
• Conclusions
Architecture
• 4 SMT cores
• 4 contexts per core
• Shared L2, L3
• Cores can be power-gated
[Figure: four cores, each with four contexts (Ctx) and shared L1s, around the shared L2 and L3 caches]
Methodology
• Benchmarks
  • 12 SPEC 2k benchmarks
  • TLP varied over 4, 6, 8, 12, and 16 threads
  • 8 benchmark sets for each level of TLP
  • Each benchmark is given fair coverage
• Dynamic scheduling policies are seeded with the best static schedule
• Simulated with a variant of SMTSIM and a CMP-aware version of Wattch
Outline
• Architecture
• Methodology
• Scheduling Policies
  • Naïve balanced scheduling policy
  • Sampling-based policies
  • Electron policies
• Conclusions
Naïve Balanced Scheduling Policy
• Main idea
  • Spreading threads evenly across cores results in good resource utilization
• How it works
  • Each thread is assigned to a context such that the resulting schedule is balanced
  • The schedule is changed randomly over time
• This was our baseline for comparison
  • Easy to implement
  • Most common
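A minimal sketch of this baseline, assuming a schedule is represented as one list of threads per core and that "changed randomly over time" means drawing a fresh random balanced assignment at each rescheduling point. The function name and the representation are illustrative choices, not taken from the talk.

```python
import random


def naive_balanced_schedule(threads, num_cores, rng=random):
    """Return one thread list per core, with core sizes differing by at most one."""
    shuffled = list(threads)
    rng.shuffle(shuffled)  # random order gives a different balanced schedule each call
    cores = [[] for _ in range(num_cores)]
    for i, thread in enumerate(shuffled):
        cores[i % num_cores].append(thread)  # deal threads round-robin across cores
    return cores


# Called again at each rescheduling point to change the schedule over time.
schedule = naive_balanced_schedule(range(8), num_cores=4, rng=random.Random(0))
print(schedule)
```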
What We Learn From Static Schedules
[Figure: results for static schedules; baseline is the naïve balanced dynamic policy]
Sampling-based Policies
• Main idea
  • Try different schedules to find an effective one
  • Oblivious to underlying hardware
• How they work
  • Two alternating phases
    • Sampling phase: different schedules are sampled
    • Steady phase: best schedule from sampling phase is used
  • Steady phase is much longer than sampling phase
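The two alternating phases can be sketched as a simple loop. This is only an illustration under stated assumptions: `measure` is a hypothetical callback that runs a schedule for one time-slice and returns a cost where lower is better (for example EDP), candidates are drawn uniformly at random, and the phase lengths are arbitrary defaults.

```python
import random


def random_schedule(threads, num_cores, contexts_per_core, rng):
    """Place each thread on a random core that still has a free context."""
    if len(threads) > num_cores * contexts_per_core:
        raise ValueError("more threads than hardware contexts")
    cores = [[] for _ in range(num_cores)]
    for thread in threads:
        free = [core for core in cores if len(core) < contexts_per_core]
        rng.choice(free).append(thread)
    return cores


def sampling_policy(threads, measure, num_samples=4, steady_slices=50,
                    epochs=2, num_cores=4, contexts_per_core=4, rng=None):
    """Return the (phase, schedule) pair used in each time-slice."""
    rng = rng or random.Random()
    timeline = []
    for _ in range(epochs):
        # Sampling phase: try each candidate for one time-slice and score it.
        scored = []
        for _ in range(num_samples):
            candidate = random_schedule(threads, num_cores, contexts_per_core, rng)
            scored.append((measure(candidate), candidate))
            timeline.append(("sample", candidate))
        # Steady phase: run the best sampled schedule for many time-slices.
        _, best = min(scored, key=lambda pair: pair[0])
        timeline.extend(("steady", best) for _ in range(steady_slices))
    return timeline
```

Keeping `steady_slices` much larger than `num_samples` mirrors the point that the steady phase should dominate, so that time spent in poor sampled schedules is amortized.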
Outline
• Architecture
• Methodology
• Scheduling Policies
  • Naïve balanced scheduling policy
  • Sampling-based policies
    • Symbiosis policies [Snavely02]
    • “Prefer Last” policies
  • Electron policies
• Conclusions
Symbiosis Policy
• Main idea
  • Some threads run well together, others do not
• How it works
  • Sampling phase: random schedules are created and their performance is sampled
  • Steady phase: the schedule in which threads achieve the most symbiosis is run
• Two versions:
  • Balanced: only balanced schedules considered
  • Unbalanced
Symbiosis Policy
[Figure: symbiosis policy results; baseline is the naïve balanced policy]
Outline
• Architecture
• Methodology
• Scheduling Policies
  • Naïve balanced scheduling policy
  • Sampling-based policies
    • Symbiosis policies
    • “Prefer Last” policies
  • Electron policies
• Conclusions
“Prefer Last” Policies
• Main idea
  • The current schedule has merit
  • A similar schedule might be a little better
• How they work
  • Create multiple permutations of the current schedule
  • Create a few random samples to avoid getting stuck in a local minimum
  • Sample the schedules and pick the best
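One sampling round of this idea might look like the sketch below. The slides do not specify how permutations are generated, so several details here are assumptions: a "permutation" is modeled as moving one thread to another core with a free context, the current schedule is itself kept as a candidate, and `measure` is a hypothetical callback returning a cost where lower is better.

```python
import random


def perturb(schedule, contexts_per_core, rng):
    """Return a copy of the schedule with one thread moved to a different core."""
    new = [list(core) for core in schedule]
    sources = [i for i, core in enumerate(new) if core]
    if not sources:
        return new
    src = rng.choice(sources)
    targets = [i for i, core in enumerate(new)
               if i != src and len(core) < contexts_per_core]
    if not targets:
        return new
    dst = rng.choice(targets)
    new[dst].append(new[src].pop(rng.randrange(len(new[src]))))
    return new


def random_schedule(threads, num_cores, contexts_per_core, rng):
    """Place each thread on a random core that still has a free context."""
    cores = [[] for _ in range(num_cores)]
    for thread in threads:
        free = [core for core in cores if len(core) < contexts_per_core]
        rng.choice(free).append(thread)
    return cores


def prefer_last_step(current, measure, contexts_per_core=4,
                     num_neighbors=3, num_random=1, rng=None):
    """Pick the best of: the current schedule, nearby variants, a few random ones."""
    rng = rng or random.Random()
    threads = [t for core in current for t in core]
    candidates = [current]
    candidates += [perturb(current, contexts_per_core, rng)
                   for _ in range(num_neighbors)]
    # Random samples give the search a chance to leave a local minimum.
    candidates += [random_schedule(threads, len(current), contexts_per_core, rng)
                   for _ in range(num_random)]
    return min(candidates, key=measure)
```

Because the current schedule stays in the candidate set, a round in this sketch never returns something that measured worse than what was already running.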
Issues With Sampling-Based Policies
• Non-scalable
  • As the search space grows, the number of samples needed grows
• Overhead of sampling
  • Some schedules result in improvement
  • …but most just make things worse
Electron Policies
• Main idea
  • One core attracts a thread
  • Another core repels a thread
• How it works (EDP)
  • Highest EDP core identified
  • Lowest EDP core identified
  • A thread running on the low EDP core is moved to the high EDP core
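A single migration step, following the direction stated on the slide (low-EDP core to high-EDP core), could be sketched as follows. The per-core EDP values are assumed to be measured elsewhere and passed in as a list. Which thread to move and how full or empty cores are handled are not specified in the slides, so the choices below (move the last-listed thread, skip full receivers and empty donors) are assumptions.

```python
def electron_step(schedule, core_edp, contexts_per_core=4):
    """Move one thread from the lowest-EDP core to the highest-EDP core.

    schedule: one list of threads per core.
    core_edp: measured EDP per core, same indexing as schedule.
    """
    new = [list(core) for core in schedule]
    # Only cores with a free context can receive a thread.
    receivers = [i for i, core in enumerate(new) if len(core) < contexts_per_core]
    if not receivers:
        return new
    high = max(receivers, key=lambda i: core_edp[i])
    # Only non-empty cores other than the receiver can give one up.
    donors = [i for i, core in enumerate(new) if core and i != high]
    if not donors:
        return new
    low = min(donors, key=lambda i: core_edp[i])
    new[high].append(new[low].pop())  # arbitrary choice: the last-listed thread
    return new


before = [["t1", "t2", "t3"], ["t4"], ["t5", "t6", "t7"], ["t8"]]
after = electron_step(before, core_edp=[9.0, 4.0, 6.0, 1.0])
print(after)  # [['t1', 't2', 't3', 't8'], ['t4'], ['t5', 't6', 't7'], []]
```

Each step needs only one comparison across per-core measurements rather than a set of sampled schedules, which is consistent with the low overhead claimed for these policies.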
Electron Policies
[Figure: threads t1 to t8 on Cores 1 to 4, with the highest-EDP core and the lowest-EDP core marked]
Electron Policy Results
Conclusions
• A good scheduling policy for a CMP of SMTs must consider unbalanced schedules to achieve the most efficiency.
• “Prefer Last” policies yield more energy savings than symbiotic scheduling policies and the naïve balanced policy.
• Electron policies have low overhead and are particularly effective when TLP is high.