Capriccio: Scalable Threads for Internet Services
Introduction • Internet services have ever-increasing scalability demands • Current hardware is meeting these demands • Software has lagged behind • Recent approaches are event-based • Pipeline stages of events
Drawbacks of Events • Event systems hide the control flow • Difficult to understand and debug • Eventually evolve into call-and-return event pairs • Programmers need to match related events • Need to save/restore state across events • Capriccio: instead of the event-based model, fix the thread-based model
Goals of Capriccio • Support for existing thread API • Few changes to existing applications • Scalability to thousands of threads • One thread per connection • Flexibility to address application-specific needs • [Figure: ease of programming vs. performance, with threads and events each falling short of the ideal]
Thread Design Principles • Kernel-level threads are for true concurrency • User-level threads provide a clean programming model with useful invariants and semantics • Decouple user-level threads from kernel-level threads • More portable
Capriccio • Thread package • All thread operations are O(1) • Linked stacks • Address the problem of stack allocation for large numbers of threads • Combination of compile-time and run-time analysis • Resource-aware scheduler
Thread Design and Scalability • POSIX API • Backward compatible
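As a rough illustration of what POSIX compatibility buys, ordinary pthread code like the sketch below is meant to run unchanged when relinked against a Capriccio-style library; the thread-per-connection structure and thread count here are illustrative, not taken from the paper.

```c
/* Ordinary POSIX thread code; the point is that it needs no source
 * changes to run on a user-level package that exports the pthread API. */
#include <pthread.h>
#include <stdio.h>

static void *handle_connection(void *arg)
{
    /* Per-connection work would go here: one thread per connection. */
    printf("handling connection %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 1000 };
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, handle_connection, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```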
User-Level Threads • Advantages: performance, flexibility • Disadvantages: complex preemption, bad interaction with the kernel scheduler
Flexibility • Decoupling user and kernel threads allows faster innovation • Can use new kernel thread features without changing application code • Scheduler tailored for applications • Lightweight
Performance • Reduce the overhead of thread synchronization • No kernel crossing for preemptive threading • More efficient memory management at user level
Disadvantages • Blocking calls must be replaced with nonblocking ones so the user-level scheduler keeps the CPU • Translation overhead • Problems with multiple processors • Synchronization becomes more expensive
Context Switches • Built on top of Edgar Toernig’s coroutine library • Fast context switches when threads voluntarily yield
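Capriccio's switches come from Toernig's coroutine library; the sketch below instead uses POSIX ucontext, purely to illustrate how a voluntary user-level yield switches contexts without any kernel crossing.

```c
/* Illustrative only: a cooperative user-level "thread" that yields back
 * to a trivial scheduler via swapcontext(). */
#include <ucontext.h>
#include <stdio.h>
#include <stdlib.h>

static ucontext_t main_ctx, thr_ctx;

static void thread_body(void)
{
    for (int i = 0; i < 3; i++) {
        printf("user thread: step %d\n", i);
        swapcontext(&thr_ctx, &main_ctx);   /* voluntary yield */
    }
}

int main(void)
{
    char *stack = malloc(64 * 1024);
    getcontext(&thr_ctx);
    thr_ctx.uc_stack.ss_sp = stack;
    thr_ctx.uc_stack.ss_size = 64 * 1024;
    thr_ctx.uc_link = &main_ctx;            /* return here when done */
    makecontext(&thr_ctx, thread_body, 0);

    for (int i = 0; i < 3; i++) {
        swapcontext(&main_ctx, &thr_ctx);   /* "schedule" the thread */
        printf("scheduler: thread yielded\n");
    }
    free(stack);
    return 0;
}
```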
I/O • Capriccio intercepts blocking I/O calls • Uses epoll for asynchronous I/O
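The following is a simplified sketch of the interception idea, not Capriccio's actual code: a read() wrapper makes the descriptor nonblocking and waits for readiness with epoll. In the real package the wrapper would hand the descriptor to the scheduler and yield to another user thread instead of calling epoll_wait() directly; the epfd argument is assumed to come from epoll_create1().

```c
/* Sketch of intercepting a blocking read() with nonblocking I/O + epoll. */
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

ssize_t capriccio_style_read(int epfd, int fd, void *buf, size_t count)
{
    /* Make the descriptor nonblocking so the call never blocks the CPU. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0 || errno != EAGAIN)
            return n;                        /* data ready, or a real error */

        /* Would block: register interest and wait for readiness.  A real
         * user-level scheduler would run other threads here instead. */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        epoll_wait(epfd, &ev, 1, -1);
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev);
    }
}
```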
Scheduling • Very much like an event-driven application • Events are hidden from programmers
Synchronization • Supports cooperative threading on single-CPU machines • Requires only Boolean checks
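A minimal sketch of why this is cheap: under cooperative scheduling on a single CPU, a thread cannot be preempted between the test and the set, so a lock is just a Boolean flag with no atomic instructions. The thread_yield() call stands in for the package's yield primitive and is an assumption here.

```c
/* Cooperative mutex: a plain Boolean check suffices on one CPU. */
typedef struct { int locked; } coop_mutex_t;

extern void thread_yield(void);   /* assumed: yields to the user-level scheduler */

void coop_mutex_lock(coop_mutex_t *m)
{
    while (m->locked)     /* safe: no preemption between the test and the set */
        thread_yield();
    m->locked = 1;
}

void coop_mutex_unlock(coop_mutex_t *m)
{
    m->locked = 0;
}
```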
Threading Microbenchmarks • SMP, two 2.4 GHz Xeon processors • 1 GB memory • two 10 K RPM SCSI Ultra II hard drives • Linux 2.5.70 • Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux
Thread Scalability • Producer-consumer microbenchmark • LinuxThreads begins to degrade after 20 threads • NPTL degrades after 100 • Capriccio scales to 32K producers and consumers (64K threads total)
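For concreteness, a pthread producer-consumer microbenchmark in this spirit might look like the sketch below; the queue size, iteration counts, and command-line handling are illustrative choices, not the paper's actual setup.

```c
/* N producers and N consumers sharing one bounded queue. */
#include <pthread.h>
#include <stdlib.h>

#define QSIZE 128
static int queue[QSIZE], head, tail, count;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&qlock);
        while (count == QSIZE)
            pthread_cond_wait(&not_full, &qlock);
        queue[tail] = i; tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&qlock);
    }
    return arg;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &qlock);
        (void)queue[head]; head = (head + 1) % QSIZE; count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&qlock);
    }
    return arg;
}

int main(int argc, char **argv)
{
    int n = argc > 1 ? atoi(argv[1]) : 100;   /* threads per side */
    pthread_t *tids = malloc(2 * n * sizeof(*tids));

    for (int i = 0; i < n; i++) {
        pthread_create(&tids[i], NULL, producer, NULL);
        pthread_create(&tids[n + i], NULL, consumer, NULL);
    }
    for (int i = 0; i < 2 * n; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return 0;
}
```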
I/O Performance • Network performance • Token passing among pipes • Simulates the effect of slow client links • 10% overhead compared to epoll • Twice as fast as both LinuxThreads and NPTL with more than 1,000 threads • Disk I/O comparable to kernel threads
Linked Stack Management • LinuxThreads allocates 2 MB per stack • 1 GB of VM holds only 500 threads • [Figure: fixed-size stacks]
Linked Stack Management • But most threads consume only a few KB of stack space at any given time • Dynamic stack allocation can significantly reduce virtual memory usage • [Figure: linked stack]
Compiler Analysis and Linked Stacks • Whole-program analysis • Based on the call graph • Problematic for recursion • Static estimation may be too conservative
Compiler Analysis and Linked Stacks • Grow and shrink the stack on demand • Insert checkpoints that decide whether more stack must be allocated before the next checkpoint is reached • Results in noncontiguous stacks
Placing Checkpoints • At least one checkpoint in every cycle of the call graph • Bound the stack growth between checkpoints by the deepest call path
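Conceptually, each compiler-inserted checkpoint performs a check like the sketch below. The stack_checkpoint() name, the chunk layout, and the switch_to_chunk()/stack_bytes_left() helpers are assumptions for illustration; the actual stack-pointer switch would be a few instructions of assembly.

```c
/* Conceptual run-time check behind a compiler-inserted stack checkpoint. */
#include <stddef.h>
#include <stdlib.h>

struct stack_chunk {
    char  *base;                /* lowest address of this chunk        */
    size_t size;                /* chunk size in bytes                 */
    struct stack_chunk *prev;   /* chunk to return to when unwinding   */
};

extern void   switch_to_chunk(struct stack_chunk *c);  /* assumed, asm */
extern size_t stack_bytes_left(void);                  /* assumed, asm */

static struct stack_chunk *current;

/* Inserted before call paths that may need up to 'needed' bytes before
 * the next checkpoint is reached. */
void stack_checkpoint(size_t needed)
{
    if (stack_bytes_left() >= needed)
        return;                              /* common, cheap case */

    struct stack_chunk *c = malloc(sizeof(*c));
    c->size = needed > 4096 ? needed : 4096; /* illustrative minimum chunk */
    c->base = malloc(c->size);
    c->prev = current;
    current = c;
    switch_to_chunk(c);                      /* continue on the new chunk */
}
```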
Dealing with Special Cases • Function pointers • The target procedure is unknown at compile time • A set of potential targets can be determined
Dealing with Special Cases • External functions • Allow programmers to annotate external library functions with trusted stack bounds • Allow larger stack chunks to be linked for external functions
Tuning the Algorithm • Stack space can be wasted • Internal and external fragmentation • Tradeoff between the number of stack link operations and external fragmentation
Memory Benefits • Tuning can be application-specific • No preallocation of large stacks • Less memory needed to run a large number of threads • Better paging behavior • Stack chunks are used in LIFO order
Case Study: Apache 2.0.44 • Maximum stack allocation chunk: 2KB • Apache under SPECweb99 • Overall slowdown is about 3% • Dynamic allocation 0.1% • Link to large chunks for external functions 0.5% • Stack removal 10%
Resource-Aware Scheduling • Advantage of event-based scheduling: it can be tailored to the application through its event handlers • Events provide two important pieces of information for scheduling • Whether a task is close to completion • Whether the system is overloaded
Resource-Aware Scheduling • Thread-based equivalent • View the application as a sequence of stages, separated by blocking calls • Analogous to an event-based scheduler
Blocking Graph • Node: a location in the program where a thread blocked • Edge: connects two nodes that were consecutive blocking points • Generated at run time
Resource-Aware Scheduling 1. Keep track of resource utilization 2. Annotate each node and its outgoing edges with the resources used 3. Dynamically prioritize nodes • Prefer nodes that release resources when the system nears its limits
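One way to picture steps 1-3 is the sketch below; the structures, the 90% threshold, and the fallback to FIFO order are illustrative assumptions, not Capriccio's internals.

```c
/* Favor runnable threads whose next blocking-graph step is expected to
 * release a resource that is close to its limit. */
#include <stdbool.h>

enum resource { RES_CPU, RES_MEMORY, RES_FDS, RES_COUNT };

struct bg_node {
    double delta[RES_COUNT];   /* avg. resource change on outgoing edges
                                  (negative means the step releases it) */
};

struct thread {
    struct bg_node *node;      /* node where this thread last blocked */
};

/* Current utilization per resource, 0.0 .. 1.0, maintained elsewhere. */
extern double utilization[RES_COUNT];

static bool releases_scarce_resource(const struct thread *t)
{
    for (int r = 0; r < RES_COUNT; r++)
        if (utilization[r] > 0.9 && t->node->delta[r] < 0)
            return true;
    return false;
}

/* Pick the next thread: prefer ones expected to release a scarce
 * resource; otherwise fall back to FIFO order. */
struct thread *pick_next(struct thread **runnable, int n)
{
    for (int i = 0; i < n; i++)
        if (releases_scarce_resource(runnable[i]))
            return runnable[i];
    return n > 0 ? runnable[0] : 0;
}
```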
Resources • CPU • Memory (malloc) • File descriptors (open, close)
Pitfalls • Tricky to determine the maximum capacity of a resource • The thrashing point depends on the workload • The disk can handle more requests when they are sequential rather than random • Resources interact • VM vs. disk • Applications may manage memory themselves
Yield Profiling • User-level threading is problematic if a thread fails to yield • Such cases are easy to detect, since their running times are orders of magnitude longer than normal • Yield profiling identifies places where programs fail to yield sufficiently often
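A rough sketch of the mechanism: timestamp every yield and report code locations that ran far too long between yields. The 10 ms threshold, the profile_yield() name, and the reporting format are illustrative choices.

```c
/* Warn when a thread ran too long without yielding. */
#include <stdio.h>
#include <time.h>

static struct timespec last_yield;
static int have_last;

static long elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000;
}

/* Called by the thread package at every yield point; 'where' identifies
 * the location (e.g., __FILE__ and __LINE__). */
void profile_yield(const char *where)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    if (have_last) {
        long us = elapsed_us(last_yield, now);
        if (us > 10000)   /* ran > 10 ms without yielding: report it */
            fprintf(stderr, "yield profiler: %s ran %ld us without yielding\n",
                    where, us);
    }
    last_yield = now;
    have_last = 1;
}
```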
Web Server Performance • 4x500 MHz Pentium server • 2GB memory • Intel e1000 Gigabit Ethernet card • Linux 2.4.20 • Workload: requests for 3.2 GB of static file data
Web Server Performance • Request frequencies match those of the SPECweb99 benchmark • Each client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses • Apache's performance improved by 15% with Capriccio
Resource-Aware Admission Control • Producer-consumer application • Producers loop, allocating memory, adding it to a pool, and randomly touching pages • Consumers loop, removing memory from the pool and freeing it • A fast producer may exhaust the virtual address space
Resource-Aware Admission Control • Touching pages too quickly causes thrashing • Capriccio quickly detects the overload condition and limits the number of producers
Programming Models for High Concurrency • Event-based: application-specific optimization • Thread-based: efficient thread runtimes
User-Level Threads • Capriccio is unique • Blocking graph • Resource-aware scheduling • Targets a large number of blocking threads • POSIX compliant
Application-Specific Optimization • Most approaches require programmers to tailor their application to manage resources • Nonstandard APIs, less portable
Stack Management • Capriccio's linked stacks require no garbage collection
Future Work • Multi-CPU machines • Profiling tools for system tuning