Capriccio: Scalable Threads for Internet Services
Introduction • Internet services have ever-increasing scalability demands • Current hardware is meeting these demands • Software has lagged behind • Recent approaches are event-based • Pipeline stages of events
Drawbacks of Events • Event systems hide the control flow • Difficult to understand and debug • Eventually evolve into call-and-return event pairs • Programmers need to match related events • Need to save/restore state across events • Capriccio: instead of the event-based model, fix the thread-based model
Goals of Capriccio • Support for existing thread API • Few changes to existing applications • Scalability to thousands of threads • One thread per connection • Flexibility to address application-specific needs • [Figure: ease of programming vs. performance, with threads and events each falling short of the ideal]
Thread Design Principles • Kernel-level threads are for true concurrency • User-level threads provide a clean programming model with useful invariants and semantics • Decouple user-level threads from kernel-level threads • More portable
Capriccio • Thread package • All thread operations are O(1) • Linked stacks • Address the problem of stack allocation for large numbers of threads • Combination of compile-time and run-time analysis • Resource-aware scheduler
Thread Design and Scalability • POSIX API • Backward compatible
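As a rough illustration of what POSIX compatibility buys, ordinary pthread code like the sketch below is meant to run unchanged when relinked against a Capriccio-style library; the thread-per-connection structure and thread count here are illustrative, not taken from the paper.

```c
/* Ordinary POSIX thread code; the point is that it needs no source
 * changes to run on a user-level package that exports the pthread API. */
#include <pthread.h>
#include <stdio.h>

static void *handle_connection(void *arg)
{
    /* Per-connection work would go here: one thread per connection. */
    printf("handling connection %ld\n", (long)arg);
    return NULL;
}

int main(void)
{
    enum { NTHREADS = 1000 };
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, handle_connection, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```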
User-Level Threads • Advantages: performance, flexibility • Disadvantages: complex preemption, bad interaction with the kernel scheduler
Flexibility • Decoupling user and kernel threads allows faster innovation • Can use new kernel thread features without changing application code • Scheduler tailored for applications • Lightweight
Performance • Reduce the overhead of thread synchronization • No kernel crossing for preemptive threading • More efficient memory management at user level
Disadvantages • Blocking calls must be replaced with nonblocking ones so the user-level scheduler keeps the CPU • Translation overhead • Problems with multiple processors • Synchronization becomes more expensive
Context Switches • Built on top of Edgar Toernig’s coroutine library • Fast context switches when threads voluntarily yield
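Capriccio's switches come from Toernig's coroutine library; the sketch below instead uses POSIX ucontext, purely to illustrate how a voluntary user-level yield switches contexts without any kernel crossing.

```c
/* Illustrative only: a cooperative user-level "thread" that yields back
 * to a trivial scheduler via swapcontext(). */
#include <ucontext.h>
#include <stdio.h>
#include <stdlib.h>

static ucontext_t main_ctx, thr_ctx;

static void thread_body(void)
{
    for (int i = 0; i < 3; i++) {
        printf("user thread: step %d\n", i);
        swapcontext(&thr_ctx, &main_ctx);   /* voluntary yield */
    }
}

int main(void)
{
    char *stack = malloc(64 * 1024);
    getcontext(&thr_ctx);
    thr_ctx.uc_stack.ss_sp = stack;
    thr_ctx.uc_stack.ss_size = 64 * 1024;
    thr_ctx.uc_link = &main_ctx;            /* return here when done */
    makecontext(&thr_ctx, thread_body, 0);

    for (int i = 0; i < 3; i++) {
        swapcontext(&main_ctx, &thr_ctx);   /* "schedule" the thread */
        printf("scheduler: thread yielded\n");
    }
    free(stack);
    return 0;
}
```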
I/O • Capriccio intercepts blocking I/O calls • Uses epoll for asynchronous I/O
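The following is a simplified sketch of the interception idea, not Capriccio's actual code: a read() wrapper makes the descriptor nonblocking and waits for readiness with epoll. In the real package the wrapper would hand the descriptor to the scheduler and yield to another user thread instead of calling epoll_wait() directly; the epfd argument is assumed to come from epoll_create1().

```c
/* Sketch of intercepting a blocking read() with nonblocking I/O + epoll. */
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

ssize_t capriccio_style_read(int epfd, int fd, void *buf, size_t count)
{
    /* Make the descriptor nonblocking so the call never blocks the CPU. */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    for (;;) {
        ssize_t n = read(fd, buf, count);
        if (n >= 0 || errno != EAGAIN)
            return n;                        /* data ready, or a real error */

        /* Would block: register interest and wait for readiness.  A real
         * user-level scheduler would run other threads here instead. */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
        epoll_wait(epfd, &ev, 1, -1);
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &ev);
    }
}
```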
Scheduling • Very much like an event-driven application • Events are hidden from programmers
Synchronization • Supports cooperative threading on single-CPU machines • Requires only Boolean checks
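A minimal sketch of why this is cheap: under cooperative scheduling on a single CPU, a thread cannot be preempted between the test and the set, so a lock is just a Boolean flag with no atomic instructions. The thread_yield() call stands in for the package's yield primitive and is an assumption here.

```c
/* Cooperative mutex: a plain Boolean check suffices on one CPU. */
typedef struct { int locked; } coop_mutex_t;

extern void thread_yield(void);   /* assumed: yields to the user-level scheduler */

void coop_mutex_lock(coop_mutex_t *m)
{
    while (m->locked)     /* safe: no preemption between the test and the set */
        thread_yield();
    m->locked = 1;
}

void coop_mutex_unlock(coop_mutex_t *m)
{
    m->locked = 0;
}
```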
Threading Microbenchmarks • SMP, two 2.4 GHz Xeon processors • 1 GB memory • two 10 K RPM SCSI Ultra II hard drives • Linux 2.5.70 • Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux
Thread Scalability • Producer-consumer microbenchmark • LinuxThreads begins to degrade after 20 threads • NPTL degrades after 100 • Capriccio scales to 32K producers and consumers (64K threads total)
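For concreteness, a pthread producer-consumer microbenchmark in this spirit might look like the sketch below; the queue size, iteration counts, and command-line handling are illustrative choices, not the paper's actual setup.

```c
/* N producers and N consumers sharing one bounded queue. */
#include <pthread.h>
#include <stdlib.h>

#define QSIZE 128
static int queue[QSIZE], head, tail, count;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&qlock);
        while (count == QSIZE)
            pthread_cond_wait(&not_full, &qlock);
        queue[tail] = i; tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&qlock);
    }
    return arg;
}

static void *consumer(void *arg)
{
    for (int i = 0; i < 10000; i++) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &qlock);
        (void)queue[head]; head = (head + 1) % QSIZE; count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&qlock);
    }
    return arg;
}

int main(int argc, char **argv)
{
    int n = argc > 1 ? atoi(argv[1]) : 100;   /* threads per side */
    pthread_t *tids = malloc(2 * n * sizeof(*tids));

    for (int i = 0; i < n; i++) {
        pthread_create(&tids[i], NULL, producer, NULL);
        pthread_create(&tids[n + i], NULL, consumer, NULL);
    }
    for (int i = 0; i < 2 * n; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return 0;
}
```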
I/O Performance • Network performance • Token passing among pipes • Simulates the effect of slow client links • 10% overhead compared to epoll • Twice as fast as both LinuxThreads and NPTL with more than 1,000 threads • Disk I/O comparable to kernel threads
Linked Stack Management • LinuxThreads allocates 2 MB per stack • 1 GB of VM holds only 500 threads • [Figure: fixed-size stacks]
Linked Stack Management • But most threads consume only a few KB of stack space at any given time • Dynamic stack allocation can significantly reduce virtual memory usage • [Figure: linked stack]
Compiler Analysis and Linked Stacks • Whole-program analysis • Based on the call graph • Problematic for recursion • Static estimation may be too conservative
Compiler Analysis and Linked Stacks • Grow and shrink the stack on demand • Insert checkpoints that decide whether more stack must be allocated before the next checkpoint is reached • Results in noncontiguous stacks
Placing Checkpoints • At least one checkpoint in every cycle of the call graph • Bound the stack growth between checkpoints by the deepest call path
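Conceptually, each compiler-inserted checkpoint performs a check like the sketch below. The stack_checkpoint() name, the chunk layout, and the switch_to_chunk()/stack_bytes_left() helpers are assumptions for illustration; the actual stack-pointer switch would be a few instructions of assembly.

```c
/* Conceptual run-time check behind a compiler-inserted stack checkpoint. */
#include <stddef.h>
#include <stdlib.h>

struct stack_chunk {
    char  *base;                /* lowest address of this chunk        */
    size_t size;                /* chunk size in bytes                 */
    struct stack_chunk *prev;   /* chunk to return to when unwinding   */
};

extern void   switch_to_chunk(struct stack_chunk *c);  /* assumed, asm */
extern size_t stack_bytes_left(void);                  /* assumed, asm */

static struct stack_chunk *current;

/* Inserted before call paths that may need up to 'needed' bytes before
 * the next checkpoint is reached. */
void stack_checkpoint(size_t needed)
{
    if (stack_bytes_left() >= needed)
        return;                              /* common, cheap case */

    struct stack_chunk *c = malloc(sizeof(*c));
    c->size = needed > 4096 ? needed : 4096; /* illustrative minimum chunk */
    c->base = malloc(c->size);
    c->prev = current;
    current = c;
    switch_to_chunk(c);                      /* continue on the new chunk */
}
```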
Dealing with Special Cases • Function pointers • The target procedure is unknown at compile time • A set of potential targets can be determined
Dealing with Special Cases • External functions • Allow programmers to annotate external library functions with trusted stack bounds • Allow larger stack chunks to be linked for external functions
Tuning the Algorithm • Stack space can be wasted • Internal and external fragmentation • Tradeoff between the number of stack link operations and external fragmentation
Memory Benefits • Tuning can be application-specific • No preallocation of large stacks • Less memory needed to run a large number of threads • Better paging behavior • Stack chunks are used in LIFO order
Case Study: Apache 2.0.44 • Maximum stack allocation chunk: 2KB • Apache under SPECweb99 • Overall slowdown is about 3% • Dynamic allocation 0.1% • Link to large chunks for external functions 0.5% • Stack removal 10%
Resource-Aware Scheduling • Advantage of event-based scheduling: it can be tailored to the application through its event handlers • Events provide two important pieces of information for scheduling • Whether a task is close to completion • Whether the system is overloaded
Resource-Aware Scheduling • Thread-based equivalent • View the application as a sequence of stages, separated by blocking calls • Analogous to an event-based scheduler
Blocking Graph • Node: a location in the program where a thread blocked • Edge: connects two nodes that were consecutive blocking points • Generated at run time
Resource-Aware Scheduling 1. Keep track of resource utilization 2. Annotate each node and its outgoing edges with the resources used 3. Dynamically prioritize nodes • Prefer nodes that release resources when the system nears its limits
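One way to picture steps 1-3 is the sketch below; the structures, the 90% threshold, and the fallback to FIFO order are illustrative assumptions, not Capriccio's internals.

```c
/* Favor runnable threads whose next blocking-graph step is expected to
 * release a resource that is close to its limit. */
#include <stdbool.h>

enum resource { RES_CPU, RES_MEMORY, RES_FDS, RES_COUNT };

struct bg_node {
    double delta[RES_COUNT];   /* avg. resource change on outgoing edges
                                  (negative means the step releases it) */
};

struct thread {
    struct bg_node *node;      /* node where this thread last blocked */
};

/* Current utilization per resource, 0.0 .. 1.0, maintained elsewhere. */
extern double utilization[RES_COUNT];

static bool releases_scarce_resource(const struct thread *t)
{
    for (int r = 0; r < RES_COUNT; r++)
        if (utilization[r] > 0.9 && t->node->delta[r] < 0)
            return true;
    return false;
}

/* Pick the next thread: prefer ones expected to release a scarce
 * resource; otherwise fall back to FIFO order. */
struct thread *pick_next(struct thread **runnable, int n)
{
    for (int i = 0; i < n; i++)
        if (releases_scarce_resource(runnable[i]))
            return runnable[i];
    return n > 0 ? runnable[0] : 0;
}
```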
Resources • CPU • Memory (malloc) • File descriptors (open, close)
Pitfalls • Tricky to determine the maximum capacity of a resource • The thrashing point depends on the workload • The disk can handle more requests when they are sequential rather than random • Resources interact • VM vs. disk • Applications may manage memory themselves
Yield Profiling • User-level threading is problematic if a thread fails to yield • Such cases are easy to detect, since their running times are orders of magnitude longer than normal • Yield profiling identifies places where programs fail to yield sufficiently often
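A rough sketch of the mechanism: timestamp every yield and report code locations that ran far too long between yields. The 10 ms threshold, the profile_yield() name, and the reporting format are illustrative choices.

```c
/* Warn when a thread ran too long without yielding. */
#include <stdio.h>
#include <time.h>

static struct timespec last_yield;
static int have_last;

static long elapsed_us(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000;
}

/* Called by the thread package at every yield point; 'where' identifies
 * the location (e.g., __FILE__ and __LINE__). */
void profile_yield(const char *where)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);

    if (have_last) {
        long us = elapsed_us(last_yield, now);
        if (us > 10000)   /* ran > 10 ms without yielding: report it */
            fprintf(stderr, "yield profiler: %s ran %ld us without yielding\n",
                    where, us);
    }
    last_yield = now;
    have_last = 1;
}
```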
Web Server Performance • 4x500 MHz Pentium server • 2GB memory • Intel e1000 Gigabit Ethernet card • Linux 2.4.20 • Workload: requests for 3.2 GB of static file data
Web Server Performance • Request frequencies match those of the SPECweb99 benchmark • Each client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses • Apache's performance improved by 15% with Capriccio
Resource-Aware Admission Control • Producer-consumer application • Producers loop, allocating memory, adding it to a pool, and randomly touching pages • Consumers loop, removing memory from the pool and freeing it • A fast producer may exhaust the virtual address space
Resource-Aware Admission Control • Touching pages too quickly causes thrashing • Capriccio quickly detects the overload condition and limits the number of producers
Programming Models for High Concurrency • Event-based: application-specific optimization • Thread-based: efficient thread runtimes
User-Level Threads • Capriccio is unique • Blocking graph • Resource-aware scheduling • Targets a large number of blocking threads • POSIX compliant
Application-Specific Optimization • Most approaches require programmers to tailor their application to manage resources • Nonstandard APIs, less portable
Stack Management • Capriccio's linked stacks require no garbage collection
Future Work • Multi-CPU machines • Profiling tools for system tuning