Architectural Support for Enhanced SMT Job Scheduling
Alex Settle, Joshua Kihm, Andy Janiszewski, Daniel A. Connors
University of Colorado at Boulder
Introduction
• The shared memory system of an SMT processor limits performance
• Threads continuously compete for shared cache resources
• Interference between threads slows down the workload
• Detecting thread interference is a challenge for real systems
  • Requires low-level cache monitoring
  • Run-time data is difficult to exploit
• Goal: design the performance monitoring hardware needed to capture thread interference information and expose it to the operating system scheduler to improve workload performance
Simultaneous Multithreading (SMT)
• Concurrently executes instructions from different contexts
• Exploits thread-level parallelism (TLP)
• Improves instruction-level parallelism (ILP)
• Improves utilization of the base processor
• Intel Pentium 4 Xeon
  • 2-level cache hierarchy
  • Instruction trace cache
  • 8 KB data cache, 4-way associative, 64 bytes per line
  • 512 KB unified L2 cache, 8-way associative, 64 bytes per line
  • 2-way SMT
Inter-thread Interference
• Competition for shared resources
  • Memory system: buses, physical cache storage
  • Fetch and issue queues
  • Functional units
• Threads evict cache data belonging to other threads
  • Increases cache misses
  • Diminishes processor utilization
• Inter-thread kick-outs (ITKO)
  • Measured in the simulator (see the sketch after this slide)
  • The thread ID of the evicted cache line is compared to that of the incoming line
  • Increased ITKO leads to decreased IPC
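The ITKO metric can be made concrete with a small sketch. The following is a minimal, hypothetical illustration of how a cache simulator might count inter-thread kick-outs on eviction; the structure and variable names (cache_line_t, itko_count, record_eviction) are inventions for this example, not the paper's simulator interface.

```c
#include <stdint.h>

#define NUM_THREADS 2              /* 2-way SMT, as on the Xeon */

/* One cache line's bookkeeping state (hypothetical simulator structure). */
typedef struct {
    int      valid;
    uint64_t tag;
    int      owner_tid;            /* hardware thread that brought the line in */
} cache_line_t;

static uint64_t itko_count[NUM_THREADS];   /* ITKOs suffered by each thread */

/*
 * Called when a miss forces 'victim' to be replaced by a line fetched on
 * behalf of 'new_tid'.  If the evicted line belonged to a different thread,
 * the eviction counts as an inter-thread kick-out (ITKO).
 */
static void record_eviction(cache_line_t *victim, int new_tid, uint64_t new_tag)
{
    if (victim->valid && victim->owner_tid != new_tid)
        itko_count[victim->owner_tid]++;   /* victim loses a line to the other thread */

    victim->valid     = 1;
    victim->tag       = new_tag;
    victim->owner_tid = new_tid;
}
```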
ITKO to IPC Correlation (Level 3 Cache)
• IPC recorded for each phase interval
• A high ITKO rate leads to a significant drop in IPC
• Large variability in IPC over the workload lifetime is caused by cache interference
Related Work
• The interference problem has been addressed at different levels
• Compiler
  • [Kumar, Tullsen; MICRO '02] Procedure placement optimization: workload fixed at compile time
  • [J. Lo; MICRO '97] Tailoring compiler optimizations for SMT: effects of traditional, static optimizations on SMT performance
• Operating system
  • [Tullsen, Snavely; ASPLOS '00] Symbiotic job scheduling: profile based, simulated OS and architecture
  • [J. Lo; ISCA '98] Data cache address remapping: workload dependent, database applications
• Microarchitecture
  • [Brown; MICRO '01] Issue policy feedback from the memory system: improves fetch and issue resource allocation but does not tackle inter-thread interference
Motivation
• Improve performance by reducing inter-thread interference
• Interference is a multi-faceted problem
  • Dependent on thread pairings
  • Occurs at low-level cache-line granularity
  • Difficult to detect at runtime
• OS scheduling decisions affect microarchitecture performance
  • Observed on both the simulator and a real system
• Observation
  • Cache access footprints vary over program lifetimes
  • Accesses are concentrated in small cache regions
Concentration of L2-Cache Access
• Cache access and miss footprints vary across program phases
• Intervals with high access and miss rates are concentrated in small physical regions of the cache (shown in green and red)
• Current performance counters cannot detect that activity is concentrated in small regions
Cache Use Map: Runtime Monitoring
• Spatial locality appears along the vertical axis
• Temporal locality appears along the horizontal axis
Benchmark Pairings: ITKO
• Pairings: gzip/mesa, mesa/equake, gzip/equake, mesa/perl, equake/perl, gzip/perl
• Yellow represents very high interference
• Interference is dependent on the job mix
Performance Guided Scheduling Theory
• [Charts: total ITKOs for gzip, equake, perl, and mesa under the best static schedule vs. the dynamic schedule, e.g. best static 2.91 million vs. dynamic 2.55 million, and best static 7.30 million vs. dynamic 6.70 million]
• In each phase the scheduler selects the jobs with the least interference
Solution to Inter-thread Interference
• Predict future interference
• Capture inter-thread interference behavior
  • Introduce cache-line activity counters
  • Expose them to the operating system
• Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
• Activity-based job scheduler
  • Schedules for minimal inter-thread interference
Activity Vectors
• [Figure: per-super-set access counters (e.g. 1234, 526, 876, 1635, ...) compared against thresholds (Xi > 1024? / 2048? / 4096?) to set the corresponding vector bits]
• Interface between the OS and the microarchitecture
• Divide the cache into super sets
• An access counter is assigned to each super set
• One vector bit corresponds to each counter
• A bit is set when its counter exceeds a threshold
• Job scheduler
  • Compares the active vector with the vectors of jobs in the run queue
  • Selects the job with the fewest common set bits (see the sketch after this slide)
• Thresholds are established through static analysis: the global median across all benchmarks
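A minimal sketch of the activity-vector idea described above, assuming the cache is divided into 32 super sets with one access counter each; the super-set count, the threshold value, the 32-bit vector width, and the helper names are illustrative assumptions, not the hardware interface defined in the paper.

```c
#include <stdint.h>

#define NUM_SUPER_SETS     32       /* assumed number of super sets */
#define ACTIVITY_THRESHOLD 1024     /* assumed per-interval threshold (global median) */

/* Per-super-set access counters, reset at the start of each phase interval. */
static uint32_t superset_accesses[NUM_SUPER_SETS];

/* Collapse the counters into a bit vector: bit i is set when super set i
 * saw more than ACTIVITY_THRESHOLD accesses during the interval. */
static uint32_t build_activity_vector(void)
{
    uint32_t vector = 0;
    for (int i = 0; i < NUM_SUPER_SETS; i++)
        if (superset_accesses[i] > ACTIVITY_THRESHOLD)
            vector |= 1u << i;
    return vector;
}

/* Count the super sets where both jobs are active.  The scheduler would
 * evaluate this for every runnable job and pick the one with the fewest
 * common set bits, i.e. the lowest expected interference. */
static int expected_interference(uint32_t running_vec, uint32_t candidate_vec)
{
    return __builtin_popcount(running_vec & candidate_vec);
}
```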
Vector Prediction - Simulator
• Use the last observed vector to approximate the next vector
• Average prediction accuracy is 91%
• Simple and effective (see the sketch after this slide)
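A sketch of the last-value prediction described above: the vector observed in the previous interval is the prediction for the next one. Measuring accuracy as the fraction of matching vector bits is an assumption made for illustration; the paper reports the 91% figure without specifying this code.

```c
#include <stdint.h>

#define NUM_SUPER_SETS 32

/* Last-value predictor state for one thread. */
static uint32_t last_vector;
static uint64_t bits_checked, bits_correct;

/* Called at the end of each phase interval with the vector that was
 * actually observed; returns the prediction for the next interval. */
static uint32_t update_prediction(uint32_t observed)
{
    uint32_t mispredicted = last_vector ^ observed;   /* bits that differed */

    bits_checked += NUM_SUPER_SETS;
    bits_correct += NUM_SUPER_SETS - __builtin_popcount(mispredicted);

    last_vector = observed;       /* predict "same as last time" */
    return last_vector;
}

/* Accuracy so far: (double)bits_correct / bits_checked,
 * around 0.91 on average in the paper's simulations. */
```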
OS Scheduling Algorithm
• [Diagram: one physical processor with two logical CPUs (CPU 0, CPU 1), each with its own run queue; run queue 0 holds perlbmk, gzip, mesa, and an OS task, run queue 1 holds mcf, ammp, parser, twolf, and an OS task, and each job is annotated with its activity vector]
• Jobs are compared using a weighted sum of their vectors at each cache level
• Vectors from the L2 cache are given the highest weight (see the sketch after this slide)
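A sketch of the selection step described above, assuming one activity vector per cache level combined into a weighted interference score in which the L2 vector dominates; the level count, the weight values, and the struct layout are illustrative assumptions, not the kernel patch from the paper.

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_LEVELS 2                           /* assumed: L1 data cache and L2 */

/* Per-level weights; the L2 vector gets the highest weight (values illustrative). */
static const int level_weight[NUM_LEVELS] = { 1, 4 };

struct job {
    uint32_t activity_vec[NUM_LEVELS];         /* one activity vector per cache level */
};

/* Weighted count of overlapping active super sets between two jobs. */
static int interference_score(const struct job *a, const struct job *b)
{
    int score = 0;
    for (int lvl = 0; lvl < NUM_LEVELS; lvl++)
        score += level_weight[lvl] *
                 __builtin_popcount(a->activity_vec[lvl] & b->activity_vec[lvl]);
    return score;
}

/* Pick the runnable job expected to interfere least with the job already
 * running on the sibling logical CPU. */
static struct job *pick_co_runner(struct job *running, struct job *runnable, size_t n)
{
    struct job *best = NULL;
    int best_score = 0;

    for (size_t i = 0; i < n; i++) {
        int score = interference_score(running, &runnable[i]);
        if (best == NULL || score < best_score) {
            best = &runnable[i];
            best_score = score;
        }
    }
    return best;
}
```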
Activity Vector Procedure
• Real system
  • Modified Linux kernel 2.6.0
  • Tested on an Intel P4 Xeon with Hyper-Threading
  • Activity counter registers are emulated
    • Vectors generated off-line with the Valgrind memory simulator (text file output)
    • Vectors copied into kernel memory space
  • Activate the vector scheduler
  • Time and run the workloads
• Simulator
  • Vector hardware modeled directly
  • Simulated OS
Workloads - Xeon
• 8 SPEC 2000 jobs per workload
• Combination of integer and floating-point applications
• Run to completion in parallel with OS-level jobs
Comparison of Scheduling Algorithms
• Default Linux scheduler vs. the activity-based scheduler
• More than 30% of the default scheduler's decisions could have been improved by the activity-based scheduler
Activity Vector Performance - Xeon
Comparing Activity Vectors to Existing Performance Counters - Simulation
• On average, the activity-based schedule makes different decisions than the performance-counter-based schedule 23% of the time
ITKO Reduction - Simulation
Contributions
• Interference analysis of cache accesses
• Introduction of fine-grained performance counters
  • A general-purpose, adaptable optimization
  • Exposes the microarchitecture to the OS
  • Workload independent
• Tested on a real SMT machine
  • Implemented in the Linux kernel
  • 2-way SMT core
Activity Based Scheduling Summary
• Prevents inter-thread interference
  • Monitors cache access behavior
  • Co-schedules jobs with expected low interference
  • Adapts to phased workload behavior
• Performance improvements
  • Greater than 30% opportunity to improve the default Linux scheduling decisions
  • 22% reduction in inter-thread interference
  • 5% improvement in execution time
Thank You
Super Set Size (backup notes)
• What happens when we change the number of super sets used? Can we include a graph here?
• Slide 17 once we have the data…
• May want to include the tree chart
Performance Challenges
• Interference is difficult to detect
• Inter-thread interference is a multi-faceted problem
  • Occurs at low-level cache-line granularity
  • Temporal variability in benchmark memory requests
  • Dependent on thread pairings
• OS scheduling decisions affect performance
• Current systems
  • Increased cache associativity
  • Could use PMU register feedback
Activity Vectors
• [Figure: per-super-set access counters (e.g. 1234, 526, 876, 1635, ...) compared against a threshold (Xi > 1024?); super sets where both jobs exceed the threshold are labeled "expect interference", the rest "expect no interference"]
• Interface between the OS and the microarchitecture
• Divide the cache into super sets
• An access counter is assigned to each super set
• One vector bit corresponds to each counter
• A bit is set when its counter exceeds a threshold
• Job scheduler
  • Compares the active vector with the vectors of jobs in the run queue
  • Selects the job with the fewest common set bits
OS Scheduling
• OS scheduling matters when there are more jobs than hardware contexts
• Current schedulers use symmetric multiprocessing (SMP) algorithms for SMT processors
• Proposed work
  • For each time interval, co-schedule jobs whose cache accesses fall in different regions
  • Prevent jobs from running together during program phases where they exhibit high degrees of cache interference