Many-Core Operating Systems
Burton Smith
Technical Fellow, Advanced Strategies and Policy
The von Neumann Premise
• Simply put: “there is exactly one program counter”
• It has led to some artifacts:
  • Synchronous coprocessor coroutining (e.g. 8087)
  • Interrupts for asynchronous concurrency
  • Demand paging
    • to make memory allocation incremental
    • to let |virtual| > |physical|
• And some serious problems:
  • The memory wall (insufficient memory concurrency)
  • The ILP wall (diminished improvement in ILP)
  • The power wall (the cost of run-time ILP exploitation)
• Given multiple program counters, what should we change?
  • Scheduling?
  • Synchronization?
Computing is at a Crossroads
• Continual performance improvement is our lifeblood
  • It encourages people to buy new hardware
  • It opens up new software possibilities
• Single-thread performance is nearing the end of the line
  • But Moore’s Law will continue for some time to come
  • What can we do with all those transistors?
• Computation needs to become as parallel as possible
  • Henceforth, serial means slow
  • Systems must support general purpose parallel computing
  • The alternative to all this is commoditization
• New many-core chips will need new system software
  • And vice versa!
• This talk is about the interplay between OS and hardware
Many-Core OS Challenges
• Architecture of the parallel virtual machine
• Processor management
  • Multiple processors
  • A mix of in-order and out-of-order CPUs
  • GPUs and other performance accelerators
  • I/O processors and devices
• Memory management
  • Performance problems due to paging
  • TLB pressure from larger working sets
  • Bandwidth resources
• Quality of service (time management)
  • For media applications, games, real-time apps, etc.
  • For deadlines
The Parallel Virtual Machine
• What should the interface that the OS presents to parallel application software look like?
  • Stable, negotiated resource allocation
  • Isolation among protection domains
  • Freedom from bottlenecks in OS services
• The key objective is fine-grain application parallelism
  • We need the whole tree, not just the low-hanging fruit
Fine-grain Parallelism
• Exploitable parallelism grows as task granularity shrinks
  • But dependences among tasks become more numerous
• Inter-task dependence enforcement demands scheduling
  • A task needing a value from elsewhere must wait for it
• User-level work scheduling is called for (see the sketch after this list)
  • No privilege change is needed to stop or restart a task
  • Locality (e.g. cache content) can be better preserved
• Today’s OS and hardware don’t encourage waiting
  • OS thread scheduling makes blocking dangerous
  • Instruction sets encourage non-blocking approaches
  • Busy-waiting wastes instruction issue opportunities
• Impact:
  • Better instruction set support for blocking synchronization
  • Changes to OS processor and memory resource management
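A minimal sketch of the user-level scheduling idea, assuming a single worker for clarity; the Task, Future, and run-queue names are illustrative, not from the talk. A task whose input is not yet available "waits" by re-enqueueing itself on the user-level run queue, so no privilege change occurs and no issue slots are burned busy-waiting:

```c
#include <stdio.h>
#include <stdbool.h>

typedef struct Future { bool full; int value; } Future;

typedef struct Task {
    void (*run)(struct Task *);
    Future *input;                     /* value this task depends on */
    struct Task *next;
} Task;

static Task *head, *tail;              /* user-level run queue */

static void enqueue(Task *t) {
    t->next = NULL;
    if (tail) tail->next = t; else head = t;
    tail = t;
}

static void consumer(Task *self) {
    if (!self->input->full) {          /* value not here yet: */
        enqueue(self);                 /* wait by rescheduling, */
        return;                        /* not by spinning */
    }
    printf("consumed %d\n", self->input->value);
}

static void producer(Task *self) {
    self->input->value = 42;           /* fill the future */
    self->input->full = true;
}

int main(void) {
    Future f = { false, 0 };
    Task c = { consumer, &f, NULL }, p = { producer, &f, NULL };
    enqueue(&c);                       /* consumer first: it must wait once */
    enqueue(&p);
    while (head) {                     /* the worker drains the queue */
        Task *t = head;
        head = t->next;
        if (!head) tail = NULL;
        t->run(t);
    }
    return 0;
}
```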
Multithreading and Synchronization
• Fine-grain multithreading can use TLP to tolerate latency
  • Memory latency
  • Other operation latency, e.g. branch latency
  • Synchronization latency
• In the latter case, some architectural support is helpful
  • To stop issuing from a context while it is waiting
  • To resume issuing when the wait is over
  • To free up the context if and when a wait becomes long
• The benefits:
  • Waiting does not consume issue slots
  • Overhead is automatically amortized
• I talked about this stuff in my 1996 FCRC keynote
Resource Scheduling Consequences
• Since the user runtime is scheduling work on processors, the OS should not attempt to do the same
  • An asynchronous OS API is a necessary corollary
  • Scheduling memory via demand paging is also problematic
• Instead, the two schedulers should negotiate (see the sketch below)
  • The application tells the OS its resource needs/desires
  • The OS makes decisions based on the big picture:
    • Availability of resources
    • Appropriateness of power consumption level
    • Requirements for quality of service
  • The OS can preempt resources to reclaim them
    • But with notification, so the application can rearrange things
• Resources should be time- and space-shared in chunks
  • Scheduling turns into a bin-packing problem
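A hypothetical sketch of what such a negotiated, asynchronous API could look like; none of these names are real system calls, and the fields are assumptions chosen to match the bullets above. The application states needs, the grant arrives asynchronously, and preemption comes with advance notification so the user-level runtime can repack its work:

```c
/* Hypothetical negotiation API (illustrative, not an existing OS interface). */
typedef struct {
    int   min_cores, desired_cores;   /* processor needs/desires */
    long  memory_bytes;               /* memory needs */
    long  qos_period_usec;            /* 0 if no quality-of-service deadline */
} resource_request;

typedef struct {
    int   cores_granted;
    long  memory_granted;
} resource_grant;

/* Asynchronous: returns at once; the grant is delivered later via callback,
 * so the application never blocks in the kernel waiting for resources. */
int os_request_resources(const resource_request *req,
                         void (*on_grant)(const resource_grant *));

/* Registered callback runs before the OS reclaims resources, giving the
 * user-level scheduler a chance to drain or migrate the affected tasks. */
int os_on_preempt(void (*notify)(int cores_lost, long memory_lost));
```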
Bin Packing
• The more resources allocated, the more swapping overhead
  • It would be nice to amortize it…
  • The more resources you get, the longer you may keep them
• Roughly, this means scheduling = packing squarish blocks (a first-fit sketch follows this list)
  • QOS applications might need long rectangles instead
• When the blocks don’t fit, the OS can morph them a little
  • Or cut corners when absolutely necessary
[Figure: blocks of allocated resource packed into a plane of quantity of resource vs. time]
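A minimal first-fit sketch of this packing view, assuming a small fixed grid of time quanta by processors; the sizes and the greedy policy are illustrative only. Each job is a block `cores` wide and `quanta` long, placed at the earliest time where it fits:

```c
#include <stdio.h>
#include <stdbool.h>

#define SLOTS 16                       /* fixed time quanta in the horizon */
#define CORES 8                        /* processors available per quantum */

static int used[SLOTS];                /* cores already packed per slot */

/* Place a cores x quanta block at the earliest slot where it fits;
 * returns the start slot, or -1 if the block must wait. */
static int pack(int cores, int quanta) {
    for (int s = 0; s + quanta <= SLOTS; s++) {
        bool fits = true;
        for (int t = s; t < s + quanta; t++)
            if (used[t] + cores > CORES) { fits = false; break; }
        if (!fits) continue;
        for (int t = s; t < s + quanta; t++)
            used[t] += cores;           /* commit the block */
        return s;
    }
    return -1;
}

int main(void) {
    printf("A starts at slot %d\n", pack(4, 4));   /* squarish block */
    printf("B starts at slot %d\n", pack(4, 4));   /* packs beside A */
    printf("C starts at slot %d\n", pack(2, 12));  /* long QOS rectangle */
    return 0;
}
```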
What About Priority Scheduling?
• Priorities are appropriate for some kinds of scheduling
  • Especially when some things to be scheduled are optional
• If it all has to be done, how do the priorities get set?
  • The answer is usually “ad-hoc, and often!”
  • Fairness is seldom maintained in the process
• Quality of service needs a different approach
  • “How much work must be done before the next deadline?”
  • Even highly interactive tasks can benefit
• Deadlines are harder to implement than priorities
  • Then again, so is bin packing compared to fixed quanta
• Fairness can also be based on quality-of-service concepts
  • Relative work rates rather than absolute
  • “In the next 16 milliseconds, give level i activities r times as many processor-seconds as level i-1 activities” (see the sketch below)
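A small sketch of that rate-based rule: if level i must get r times the processor-seconds of level i-1 within the period, the shares form a geometric series normalized to the period. The number of levels and the value of r are illustrative:

```c
#include <stdio.h>

#define LEVELS 4

int main(void) {
    const double period_ms = 16.0, r = 2.0;
    double weight[LEVELS], total = 0.0;

    weight[0] = 1.0;
    for (int i = 1; i < LEVELS; i++)
        weight[i] = weight[i - 1] * r;  /* level i = r * level i-1 */
    for (int i = 0; i < LEVELS; i++)
        total += weight[i];
    for (int i = 0; i < LEVELS; i++)    /* normalize to the period */
        printf("level %d: %5.2f ms of the %.0f ms period\n",
               i, period_ms * weight[i] / total, period_ms);
    return 0;
}
```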
Heterogeneous Processors
• There are two kinds of heterogeneity:
  • In architecture (HA), i.e. different instruction sets
  • In implementation (HI), i.e. different performance characteristics
• Both are likely to be important
  • A single application might ask for a heterogeneous mix
  • Failure in the HA case might need multiple versions or JIT
  • In the HI case, scheduling might be based on instrumentation (see the sketch below)
• A key question is whether a processor is time-sharable
  • If not, the OS has to dedicate it to one application at a time
  • With user-level scheduling and some support for preemption, application state save and restore can be done at user level
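A hypothetical sketch of the HI case: the runtime keeps a measured work rate per implementation class and dispatches to the best available class. The class names, the instrumentation source, and the policy are all assumptions for illustration:

```c
#include <stdio.h>

enum cpu_class { IN_ORDER, OUT_OF_ORDER, ACCEL, NCLASSES };

static double rate[NCLASSES];   /* measured tasks/sec, from instrumentation */
static int    avail[NCLASSES];  /* free processors of each class */

/* Pick the fastest class with a free processor, or -1 if none
 * (in which case the caller queues the task for later). */
static int pick_class(void) {
    int best = -1;
    for (int c = 0; c < NCLASSES; c++)
        if (avail[c] > 0 && (best < 0 || rate[c] > rate[best]))
            best = c;
    return best;
}

int main(void) {
    rate[IN_ORDER] = 1.0; rate[OUT_OF_ORDER] = 3.0; rate[ACCEL] = 10.0;
    avail[IN_ORDER] = 4;  avail[OUT_OF_ORDER] = 2;  avail[ACCEL] = 0;
    printf("dispatch to class %d\n", pick_class()); /* out-of-order wins */
    return 0;
}
```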
Virtual Memory Design Alternatives
• Swapping instead of demand paging
• Address-space names/identifiers
  • TLB shootdown becomes a rarer event
• Hardware TLB coherence
• Two-dimensional addressing (segmentation w/o registers)
  • To assist with variable granularity memory allocation
  • To help mitigate upward pressure on TLB size
  • To leverage persistent memory via segment sharing
    • A variation of mmap() might suffice for this purpose (see the sketch below)
  • To accommodate variations in memory bank architecture
    • Local versus global, for example
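As a baseline for that mmap() variation, today's POSIX calls already let two protection domains map one named segment; the segment name and size here are arbitrary example values:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 1 << 20;        /* 1 MiB example segment */
    int fd = shm_open("/demo_seg", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED makes stores visible to every domain that maps the
     * same name, which is the segment-sharing effect wanted above. */
    char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(seg, "hello from one protection domain");
    munmap(seg, len);
    close(fd);
    return 0;
}
```

(On older Linux systems this needs linking with -lrt.)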
Physical Memory Bank Architecture
• Consider this example:
  • An application is using 31 cores, about half of them
  • 50% of its cache misses are stack references
  • The stacks are all allocated in a compact virtual region
  • How many of the 128 memory banks are available?
• Interleaving addresses across the banks is a solution
  • Page granularity is the standard choice
  • If memory access is non-uniform, this is not the best idea (see the sketch below)
• Stacks should be allocated near their processors
  • So should compiler-allocated temporary arrays on the heap
• Does one bank architecture scheme fit all, or not?
  • If not, how do we manage the virtual address space?
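A small sketch of why granularity matters in the example above: with page-granularity interleaving, a page at address a lands on bank (a / PAGE) % BANKS, so a compact region of 31 stacks covers only as many banks as it has pages. The 8 KiB stack size and zero base address are assumptions for illustration:

```c
#include <stdio.h>

#define BANKS 128
#define PAGE  4096

int main(void) {
    unsigned long base = 0;            /* start of the compact stack region */
    int hit[BANKS] = {0}, banks = 0;

    for (int s = 0; s < 31; s++) {     /* 31 cores, one 8 KiB stack each */
        unsigned long stack = base + (unsigned long)s * 8192;
        for (unsigned long a = stack; a < stack + 8192; a += PAGE)
            hit[(a / PAGE) % BANKS] = 1;   /* bank serving this page */
    }
    for (int b = 0; b < BANKS; b++) banks += hit[b];
    printf("%d of %d banks carry half of all miss traffic\n", banks, BANKS);
    return 0;
}
```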
“Hot Spots”
• When processors share memory, they can interfere
  • Not only data races, but also bandwidth oversubscription
• Within an application, this creates performance problems
  • Hardware help is needed to discover “where” these are
• Between applications, interference is even more serious
  • Performance unpredictability
  • Denial of service
  • Covert-channel signaling
• Bandwidth is a resource like any other
  • We need to be able to partition and isolate it
I/O Architecture
• Direct memory access is usually a good way to do I/O
  • Today’s DMA mostly demands “wired down” pages
  • This leads to lots of data copying and other OS warts
• But I/O devices are getting smarter all the time
  • Transistors are cheaper than almost anything else
• Why not treat I/O devices like heterogeneous processors?
  • Teach them to do virtual address translation
  • Allocate them to real-time or sensor-intensive applications
  • Allocate them to a not-very-trusted “driver application”
  • Address space sharing can be partial, as it is now
• There is a problem, though: inter-domain signaling (IPC)
  • This is what interrupts do
  • I have some issues with interrupts
Interrupts
• Interrupts are OK when there is only one processor
  • Some people avoid them to make systems more predictable
• If there are many processors, which one do you interrupt?
  • The usual solution: “just pick one and leave it to software”
• A better idea is to signal via an address space you already share (perhaps only in part) with the intended recipient
  • “The DMA at address <a> is (ready)(done)”
  • This is kinda like doing programmed I/O via device CSRs
  • It’s also the way the CDC 6600 and 7600 did things
  • You may not want to have the signal recipient busy-wait (see the sketch below)
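A minimal sketch of that shared-word signal, with a thread standing in for the DMA engine and C11 atomics carrying the status; the status word and its states are illustrative. As the last bullet warns, a real runtime would reschedule a long-waiting task rather than poll as this fragment does:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

enum { IDLE, READY, DONE };
static _Atomic int dma_status = IDLE;  /* word in the shared address space */

static void *device(void *arg) {       /* stands in for the DMA engine */
    (void)arg;
    atomic_store_explicit(&dma_status, DONE, memory_order_release);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, device, NULL);
    /* Recipient observes the shared word; no interrupt is taken.
     * (A user-level scheduler would requeue the task instead.) */
    while (atomic_load_explicit(&dma_status, memory_order_acquire) != DONE)
        sched_yield();
    printf("the DMA at address <a> is done\n");
    pthread_join(t, NULL);
    return 0;                          /* compile with -pthread */
}
```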
Conclusions
• It is time to rethink some of the basics of computing
• There is lots of work for everyone to do
  • e.g. I’ve left out compilers, debuggers, and applications
• We need basic research as well as industrial development
  • Research in computer systems is deprecated these days
  • In the USA, NSF and DOD need to take the initiative