USING OS OBSERVATIONS TO IMPROVE PERFORMANCE IN MULTICORE SYSTEMS. Presented By Ankit Patel. Authors: Rob Knauerhase Paul Brett Barbara Hohlt Tong Li Scott Hahn. Paper summary. Shows practical way of improving performance of a system by modifying the operating system
USING OS OBSERVATIONS TO IMPROVE PERFORMANCE IN MULTICORE SYSTEMS Presented By Ankit Patel Authors: Rob Knauerhase, Paul Brett, Barbara Hohlt, Tong Li, Scott Hahn
Paper summary • Shows a practical way of improving system performance by modifying the operating system • The paper also shows that using the features provided by multicore processors inside the operating system kernel can significantly improve system performance
Today’s operating systems don’t handle the complexity of multicore processors • The goal of this research paper is to show that the OS can use data obtained from dynamic runtime observation of task behavior. • The paper also demonstrates the utility of observation-based policies
Background • A multicore processor has several independent processing units on a single chip. Also known as chip multiprocessing (CMP) • E.g. dual- and quad-core processors from Intel and AMD • The cores share certain on-chip resources, e.g. the last-level cache (LLC)
Another type of multicore system • Simultaneous multithreading (SMT) • Looks like multicore, but implements each virtual core with a combination of functional units within a single processor. E.g. Pentium 4 processors with HT technology.
Problem • As the number of cores within a die increases, design complexity also increases for multithreaded cores. • Current operating systems are completely unaware of the internal design of the multicore processor. • So they neither exploit CMP features nor avoid CMP challenges.
From OS level • Current operating systems exploit multicore systems by using a multiprocessing kernel, e.g. Linux and Windows use symmetric multiprocessing (SMP), where the kernel treats each core as an independent processor.
Features provided by the processor to improve performance • Modern processors also include hardware features for monitoring the CPU’s performance and behavior • For example, the Intel VTune system allows programmers to optimize for cache usage and for floating-point and multimedia extensions (MMX) instruction usage.
OS should make an observation of the behavior of threads running in the system • These observations, combined with knowledge of the processor architecture, allow the implementation of different policies in the OS.
But this is expensive, and the CEO will ask for business value before investing?
Advantages • Good policies can improve overall system performance • Improve application performance • Decrease system power consumption (a huge issue in data centers) • Or provide arbitrary user-defined combinations of these benefits.
…and this is enough business value to bring a new operating system to market. Don’t you think?
…Very exciting. Can you show how this can be done in the OS kernel?
Testing environment used in this paper • Linux (kernel 2.6.20) on Intel Xeon 5300 series (contains two 4MB LLC arrays, each shared by two cores) • Mac OS X (Darwin) on Intel Core 2 Duo (contains one 2MB LLC shared by two cores) • No Microsoft Windows • Why?
Challenges • Cache interference in the last-level cache (LLC): • If a task runs on core A, it can use the entire LLC. Another task, running on core B, shares the LLC resource; the resulting contention slows both tasks. Worse, the amount of contention is quite dynamic because it depends on each task’s behavior at a given time. This behavior is impossible for the application to know at compile time • Lack of intelligent thread migration: • Linux and OS X include migration for basic load balancing, but they treat each core equivalently, without the notion of resources shared among cores. • No accommodation of cores with different features: • Current Intel-compatible multicore implementations feature cores that are exact copies of each other. In the future, however, some cores will likely be functionally asymmetric for reasons of power, die area, cost, and complexity.
Observation subsystem • Inspects relevant performance-monitoring counters and kernel data structures and gathers information on a per-thread basis.
Processor counters used for several measurable events • LLC misses (INVALID_L2_RQSTS) • LLC references (L2_RQSTS) • instructions retired (INSTR_RETIRED.ANY) • core cycles (CPU_CLK_UNHALTED.CORE) • reference cycles (CPU_CLK_UNHALTED.REF)
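To make the counter list concrete, here is a minimal sketch of how per-thread counter deltas might be turned into derived metrics. The `CounterSample` layout and the `derived_metrics` helper are hypothetical names chosen for illustration; they are not taken from the paper or from any kernel API.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Per-thread counter deltas over one scheduling quantum (hypothetical layout)."""
    llc_misses: int
    llc_references: int
    instructions_retired: int
    core_cycles: int
    reference_cycles: int


def derived_metrics(sample: CounterSample) -> dict:
    """Turn raw counter deltas into ratios a scheduler policy could compare."""
    def ratio(numerator: int, denominator: int) -> float:
        # Guard against an empty quantum (no cycles or no references recorded).
        return numerator / denominator if denominator else 0.0

    return {
        "miss_ratio": ratio(sample.llc_misses, sample.llc_references),
        "misses_per_cycle": ratio(sample.llc_misses, sample.core_cycles),
        "ipc": ratio(sample.instructions_retired, sample.core_cycles),
    }
```

A caller would build one `CounterSample` per thread at each context switch and pass the resulting dictionary to whatever policy is active.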
Policy: Reducing cache interference • Cache misses per cycle are the best indication of cache interference. • How to predict future behavior from historical behavior: • Basing the weight on the metrics from the immediately past quantum (that is, using temporal locality as our predictor) is the best solution.
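The predictor described above can be sketched in a few lines: the task's weight is simply its cache misses per cycle from the quantum that just ended, overwritten each time rather than averaged. The class and method names below are assumptions made for this illustration.

```python
def misses_per_cycle(llc_misses: int, cycles: int) -> float:
    """Cache-interference indicator for one quantum; 0.0 if the task did not run."""
    return llc_misses / cycles if cycles else 0.0


class TaskObservation:
    """Hypothetical per-task record kept by the observation subsystem."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.weight = 0.0  # predicted cache heaviness for the next quantum

    def end_of_quantum(self, llc_misses: int, cycles: int) -> None:
        # Temporal locality as the predictor: only the immediately past
        # quantum counts, so the old weight is replaced, not blended.
        self.weight = misses_per_cycle(llc_misses, cycles)
```

Replacing the weight outright keeps the bookkeeping trivial; a decaying average would be the obvious alternative if one quantum proved too noisy.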
From Implementation Point of View • Implemented the observation subsystem and scheduler modifications in the Linux and Mac OS X kernels. • Added a cache interference policy among cores sharing an LLC. • When core 0 is ready to be assigned a new task, the scheduler examines the weights of tasks on other cores and chooses a task whose weight best complements the corunning tasks for core 0. Thus, heavy and light tasks tend to be coscheduled on the shared cache, avoiding the interference that results from coscheduling two heavy tasks.
Results • Linux: 30% improvement over the current implementation in the worst-case scenario, and 6% overall performance improvement • Mac OS X: 3% overall performance improvement
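One simple way to read "best complements" is sketched here: if the tasks already running on the shared cache are heavier than the typical ready task, pick the lightest candidate, otherwise pick the heaviest. This heuristic and the function name `pick_complementary` are assumptions for illustration; the slides do not give the exact selection rule.

```python
def pick_complementary(candidates: dict, corunner_weights: list):
    """Choose which ready task to run next on a core that shares an LLC.

    candidates: task name -> weight (for example, misses per cycle).
    corunner_weights: weights of tasks currently running on the other
    cores that share the same LLC.
    """
    if not candidates:
        return None
    if not corunner_weights:
        # Cache is otherwise idle, so any choice is interference-free.
        return next(iter(candidates))

    mean_candidate = sum(candidates.values()) / len(candidates)
    mean_corunner = sum(corunner_weights) / len(corunner_weights)

    if mean_corunner > mean_candidate:
        # Heavy neighbours: run the lightest task to limit contention.
        return min(candidates, key=candidates.get)
    # Light neighbours: this is a good moment to run a heavy task.
    return max(candidates, key=candidates.get)
```

For example, with candidates `{"a": 0.9, "b": 0.1, "c": 0.5}` and a single heavy corunner of weight 0.8, the sketch selects `"b"`.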
Policy: Migrating across caches • Uses observations to inform task migration decisions in the OS. • The goal is to distribute cache-heavy threads throughout the system, which not only helps spread out cache load but also gives OBS-L more opportunity to achieve benefits with its local policies.
Results • The Linux load balancer produced speedups of between 8 and 18 percent for the heavy cachebuster tasks. • With the addition of OBS-X, cachebuster performance increased by between 12 and 62 percent. • Why? • OBS-X distributed the cache-heavy tasks across LLC groups, which minimizes the coscheduling of heavy tasks on the same cache
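The idea of spreading cache-heavy tasks across LLC groups can be illustrated with a greedy placement: sort tasks from heaviest to lightest and give each one to the group with the least accumulated weight. This is a generic balancing heuristic used here as a stand-in, not the algorithm OBS-X actually uses, and the function name is hypothetical.

```python
def spread_across_llc_groups(task_weights: dict, num_groups: int) -> list:
    """Assign tasks to LLC groups so that heavy tasks land on different caches.

    task_weights: task name -> cache weight.
    Returns a list of task-name lists, one per LLC group.
    """
    groups = [[] for _ in range(num_groups)]
    loads = [0.0] * num_groups

    # Place the heaviest tasks first so they are separated before the
    # light tasks fill in the remaining capacity.
    for name, weight in sorted(task_weights.items(),
                               key=lambda item: item[1], reverse=True):
        target = loads.index(min(loads))  # least-loaded group so far
        groups[target].append(name)
        loads[target] += weight
    return groups
```

With two heavy and two light tasks on a system with two LLC groups, each group ends up with one heavy and one light task, which is also the arrangement a local complementing policy can exploit.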
Policy: Addressing fairness • Under this policy (implemented as OBS-C), the system computes the difference in weights between any coscheduled tasks and transfers CPU time (tcredit) proportionally from the heavier to the lighter task
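As a rough sketch of the credit transfer, the function below moves CPU time from the heavier to the lighter of two coscheduled tasks in proportion to their weight difference. The `scale` factor, the cap at one full quantum, and the use of microseconds are assumptions for this illustration; the slides only state that the transfer is proportional.

```python
def transfer_credit(weight_a: float, weight_b: float,
                    quantum_us: float, scale: float = 1.0) -> tuple:
    """Return the CPU-time adjustments (delta_a, delta_b) in microseconds.

    The heavier task gives up time and the lighter task receives the same
    amount, so the two deltas always sum to zero.
    """
    diff = weight_a - weight_b
    # Proportional to the weight difference, never more than one quantum.
    credit = min(abs(diff) * scale, 1.0) * quantum_us
    if diff > 0:
        return (-credit, credit)
    if diff < 0:
        return (credit, -credit)
    return (0.0, 0.0)
```

Because real weights such as misses per cycle are small numbers, a practical `scale` would have to be tuned; the value 1.0 is only a placeholder.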
Policy: Accommodating functional asymmetry • Simulated functional asymmetry in an existing multicore system. • Used CPU-specific flags to disable floating-point (FP) and SIMD instructions (including MMX and streaming SIMD) on a subset of cores in the system • A second version of the policy (OBS-F2) monitors the task after it migrates
Another Feature of the policy • OBS-F tracks the accumulation of unavailable instructions over time (total number and number of faulting quanta). Using the task’s history, OBS-F can determine that the frequency of faulting instructions is high enough that the task should be banned from a core for the rest of its life in the system, saving both migration costs and cache interference caused by a task’s frequently moving to and from FP-enabled cores.
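The history-based ban can be sketched as a small record that counts faults and faulting quanta, plus a threshold test. The thresholds `min_quanta` and `ban_fraction` and all names below are hypothetical; the slides do not say how "high enough" is decided.

```python
from dataclasses import dataclass


@dataclass
class FaultHistory:
    """Per-task history of unavailable-instruction faults on a restricted core."""
    total_faults: int = 0
    faulting_quanta: int = 0
    total_quanta: int = 0

    def record_quantum(self, faults: int) -> None:
        """Update the history at the end of one scheduling quantum."""
        self.total_quanta += 1
        self.total_faults += faults
        if faults > 0:
            self.faulting_quanta += 1

    def should_ban(self, min_quanta: int = 10,
                   ban_fraction: float = 0.5) -> bool:
        """Ban the task from the restricted core once faults are frequent."""
        if self.total_quanta < min_quanta:
            return False  # not enough history to decide yet
        return self.faulting_quanta / self.total_quanta >= ban_fraction
```

Waiting for a minimum amount of history avoids banning a task that only needs FP or SIMD instructions briefly, for example during start-up.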
Other observation-driven policies • Reducing functional-unit interference • Observation of contention will allow the OS to implement analogous policies: either to migrate a task somewhere with less contention or to credit CPU time back to a task that has suffered disproportionately from contention. • Multicore power management • Implementations allow cores to change power states independently. Initial research indicates that OS-level observation, of both hardware power events and per-task activity, will enable better power management policies. • For example: tasks that are largely memory bound might run on a lower-power core and obtain similar performance, whereas computation-intensive tasks might benefit from migration to a higher-speed or higher-power core.
Virtualization: • Runtime observation of behavior is even more important for VMs because an even greater opacity prevents knowledge of what a VM is doing or will do next.
Write a simple program and run two instances of it on Windows: one at normal priority and the other at a higher priority. See how the normal-priority process suffers, then compare the same experiment on Linux.
Research Avenues • More OS research is needed on managing the cache • Changing the kernel requires a full understanding of (at least) the OS kernel’s memory management and scheduler, which are themselves big research areas. • Because processor technology progresses quickly and OS kernels progress slowly, OS kernels underutilize the functionality provided by modern multicore processors. How to better utilize the processor to its full potential could be another research area to explore
Business Value • Implementing even one of the policies above could be a big turnaround for a new OS and add business value for the OS vendor.
Summary • OS can help by making dynamic observations of task behavior and then implementing smarter policies based on the results of these observations.