
Capriccio: Scalable Threads for Internet Service

Capriccio introduces a thread-based model to address the scalability demands of internet services, focusing on ease of programming, performance, and flexibility. The design principles and an in-depth study of its advantages and challenges are discussed.



  1. Capriccio: Scalable Threads for Internet Service

  2. Introduction • Internet services have ever-increasing scalability demands • Current hardware is meeting these demands • Software has lagged behind • Recent approaches are event-based • Services are structured as pipelines of event-handling stages

  3. Drawbacks of Events • Event systems hide the control flow • Difficult to understand and debug • Code eventually evolves into matched call-and-return event pairs • Programmers need to match related events • Need to save and restore state by hand • Capriccio: instead of the event-based model, fix the thread-based model

  4. Goals of Capriccio • Support for existing thread API • Few changes to existing applications • Scalability to thousands of threads • One thread per connection • Flexibility to address application-specific needs • [Figure: threads vs. events plotted on ease-of-programming and performance axes; the goal is the "Ideal" corner]

  5. Thread Design Principles • Kernel-level threads are for true concurrency • User-level threads provide a clean programming model with useful invariants and semantics • Decouple user from kernel level threads • More portable

  6. Capriccio • Thread package • All thread operations are O(1) • Linked stacks • Address the problem of stack allocation for large numbers of threads • Combination of compile-time and run-time analysis • Resource-aware scheduler

  7. Thread Design and Scalability • POSIX API • Backward compatible

  8. User-Level Threads • Advantages: performance, flexibility • Disadvantages: complex preemption, bad interaction with the kernel scheduler

  9. Flexibility • Decoupling user and kernel threads allows faster innovation • Can use new kernel thread features without changing application code • Scheduler tailored for applications • Lightweight

  10. Performance • Reduce the overhead of thread synchronization • No kernel crossing for preemptive threading • More efficient memory management at user level

  11. Disadvantages • Blocking calls must be replaced with nonblocking ones so a blocked thread does not hold the CPU • Translation overhead • Problems with multiple processors • Synchronization becomes more expensive

  12. Context Switches • Built on top of Edgar Toernig’s coroutine library • Fast context switches when threads voluntarily yield
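The voluntary-yield switching that the slide describes can be sketched with POSIX ucontext; Capriccio itself builds on Edgar Toernig's coroutine library, so `run_cooperative`, `thread_body`, and the fixed 64 KB stack here are purely illustrative:

```c
#include <ucontext.h>

static ucontext_t main_ctx, th_ctx;   /* "scheduler" and one user thread */
static int steps = 0;
static char th_stack[64 * 1024];      /* illustrative fixed stack */

static void thread_body(void) {
    for (int i = 0; i < 3; i++) {
        steps++;
        /* voluntary yield: save our context, resume the scheduler */
        swapcontext(&th_ctx, &main_ctx);
    }
}

/* Drive the "thread" through three yields, switching back and forth. */
int run_cooperative(void) {
    getcontext(&th_ctx);
    th_ctx.uc_stack.ss_sp = th_stack;
    th_ctx.uc_stack.ss_size = sizeof th_stack;
    th_ctx.uc_link = &main_ctx;           /* return here if the body finishes */
    makecontext(&th_ctx, thread_body, 0);
    while (steps < 3)
        swapcontext(&main_ctx, &th_ctx);  /* "schedule" the thread */
    return steps;
}
```

Because both sides cooperate, a switch is just saving and restoring registers in user space, with no kernel crossing.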

  13. I/O • Capriccio intercepts blocking I/O calls • Uses epoll for asynchronous I/O
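The interception idea can be sketched as follows: make the descriptor nonblocking and, when a read would block, wait for readiness via epoll before retrying. The name `coop_read` is hypothetical, and a real Capriccio thread would yield to the scheduler at that point rather than block in `epoll_wait`:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Illustrative replacement for a blocking read(). */
ssize_t coop_read(int fd, void *buf, size_t len) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0 || errno != EAGAIN)
            return n;                  /* data arrived, or a real error */
        /* Would block: park the fd in epoll and wait for readiness.
         * Capriccio would switch to another runnable thread here. */
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
        epoll_wait(ep, &ev, 1, -1);
        close(ep);
    }
}
```

The application still sees the ordinary blocking-call semantics; only the library knows the wait was asynchronous.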

  14. Scheduling • Very much like an event-driven application • Events are hidden from programmers

  15. Synchronization • Supports cooperative threading on single-CPU machines • Requires only Boolean checks
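The "only Boolean checks" point can be made concrete: on a uniprocessor with cooperative scheduling, a thread runs until it yields, so acquiring a lock needs no atomic instructions, just a flag test. `coop_mutex` and its functions are illustrative names, not Capriccio's API:

```c
/* A lock under cooperative, single-CPU threading: no thread can be
 * preempted between the check and the set, so a plain flag suffices. */
typedef struct { int held; } coop_mutex;

int coop_trylock(coop_mutex *m) {
    if (m->held)
        return 0;    /* owner must yield and release before we retry */
    m->held = 1;     /* no race: we cannot be preempted here */
    return 1;
}

void coop_unlock(coop_mutex *m) { m->held = 0; }
```

On multiprocessors this no longer holds, which is one reason synchronization becomes more expensive there.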

  16. Threading Microbenchmarks • SMP, two 2.4 GHz Xeon processors • 1 GB memory • two 10 K RPM SCSI Ultra II hard drives • Linux 2.5.70 • Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux

  17. Latencies of Thread Primitives

  18. Thread Scalability • Producer-consumer microbenchmark • LinuxThreads begin to degrade after 20 threads • NPTL degrades after 100 • Capriccio scales to 32K producers and consumers (64K threads total)

  19. Thread Scalability

  20. I/O Performance • Network performance • Token passing among pipes • Simulates the effect of slow client links • 10% overhead compared to epoll • Twice as fast as both LinuxThreads and NPTL with more than 1000 threads • Disk I/O comparable to kernel threads

  21. Linked Stack Management • LinuxThreads allocates 2 MB per stack • 1 GB of VM holds only 500 threads • [Figure: fixed stacks]

  22. Linked Stack Management • But most threads consume only a few KB of stack space at a given time • Dynamic stack allocation can significantly reduce VM usage • [Figure: linked stacks]

  23. Compiler Analysis and Linked Stacks • Whole-program analysis • Based on the call graph • Problematic for recursion • Static estimation may be too conservative

  24. Compiler Analysis and Linked Stacks • Grow and shrink the stack size on demand • Insert checkpoints to determine whether we need to allocate more before the next checkpoint • Results in noncontiguous stacks
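A compiler-inserted checkpoint can be sketched like this: if fewer than `needed` bytes remain in the current chunk before the next checkpoint, link in a new chunk. All names, the 4 KB chunk size, and the downward-growing stack assumption are illustrative, not Capriccio's actual code:

```c
#include <stdlib.h>

struct stack_chunk {
    char *base, *limit;          /* chunk bounds; stack grows down toward base */
    struct stack_chunk *prev;    /* link back to the caller's chunk */
};

static struct stack_chunk *current;

/* Returns 1 if a new chunk was linked in, 0 if the current one suffices. */
int checkpoint(char *sp, size_t needed) {
    if ((size_t)(sp - current->base) >= needed)
        return 0;                        /* enough room: no allocation */
    struct stack_chunk *c = malloc(sizeof *c);
    c->base = malloc(4096);
    c->limit = c->base + 4096;
    c->prev = current;
    current = c;                         /* execution would move SP here */
    return 1;
}
```

The common case is the cheap comparison; allocation happens only when a deep call path is actually taken.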

  25. Placing Checkpoints • One checkpoint in every cycle in the call graph • Bound the size between checkpoints with the deepest call path

  26. Dealing with Special Cases • Function pointers • Don’t know what procedure to call at compile time • Can find a potential set of procedures

  27. Dealing with Special Cases • External functions • Allow programmers to annotate external library functions with trusted stack bounds • Allow larger stack chunks to be linked for external functions

  28. Tuning the Algorithm • Stack space can be wasted • Internal and external fragmentation • Tradeoffs • Number of stack linkings • External fragmentation

  29. Memory Benefits • Tuning can be application-specific • No preallocation of large stacks • Less VM needed to run a large number of threads • Better paging behavior • Stack chunks are used LIFO, improving locality

  30. Case Study: Apache 2.0.44 • Maximum stack allocation chunk: 2KB • Apache under SPECweb99 • Overall slowdown is about 3% • Dynamic allocation 0.1% • Link to large chunks for external functions 0.5% • Stack removal 10%

  31. Resource-Aware Scheduling • Advantages of event-based scheduling • Tailored for applications • With event handlers • Events provide two important pieces of information for scheduling • Whether a process is close to completion • Whether a system is overloaded

  32. Resource-Aware Scheduling • Thread-based • View applications as sequence of stages, separated by blocking calls • Analogous to event-based scheduler

  33. Blocking Graph • Node: A location in the program that blocked • Edge: between two nodes if they were consecutive blocking points • Generated at runtime
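The runtime bookkeeping the slide describes can be sketched as follows; the fixed-size arrays and the function name `bg_block` are illustrative simplifications:

```c
#define MAX_NODES 64

/* A node is a program location that blocked; edges[j] counts how often
 * node j was the very next blocking point after this one. */
struct bg_node {
    void *location;              /* e.g. return address at the blocking call */
    int   edges[MAX_NODES];
};

static struct bg_node graph[MAX_NODES];
static int nnodes;
static int last_node = -1;

/* Called whenever a thread blocks; returns the node index. */
int bg_block(void *loc) {
    int i;
    for (i = 0; i < nnodes; i++)
        if (graph[i].location == loc)
            break;
    if (i == nnodes)
        graph[nnodes++].location = loc;          /* first visit: new node */
    if (last_node >= 0)
        graph[last_node].edges[i]++;             /* consecutive blocking points */
    last_node = i;
    return i;
}
```

Because the graph is built from observed behavior, it needs no compiler support and adapts to the actual workload.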

  34. Resource-Aware Scheduling 1. Keep track of resource utilization 2. Annotate each node with resource used and its outgoing edges 3. Dynamically prioritize nodes • Prefer nodes that release resources
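The prioritization step might look like the following sketch: each node carries the net resource change observed along its outgoing edges, and under pressure the scheduler boosts nodes that free the scarce resource. The field names and weights are made up for illustration:

```c
/* Per-node annotation: net resource change past this blocking point. */
struct node_ann {
    long mem_delta;   /* bytes allocated (+) or freed (-) */
    int  fd_delta;    /* descriptors opened (+) or closed (-) */
};

/* Higher score = schedule threads at this node sooner. */
int node_priority(const struct node_ann *n, int memory_pressure) {
    int score = 0;
    if (memory_pressure && n->mem_delta < 0)
        score += 10;      /* releasing memory helps when near the limit */
    if (n->fd_delta < 0)
        score += 1;       /* closing descriptors moves work toward completion */
    return score;
}
```

Preferring resource-releasing nodes pushes nearly-finished requests to completion instead of admitting new work that would deepen an overload.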

  35. Resources • CPU • Memory (malloc) • File descriptors (open, close)

  36. Pitfalls • Tricky to determine the maximum capacity of a resource • Thrashing depends on the workload • The disk can handle more requests when they are sequential rather than random • Resources interact • VM vs. disk • Applications may manage memory themselves

  37. Yield Profiling • User-level threads are problematic if a thread fails to yield • Such cases are easy to detect, since the offending running times are orders of magnitude longer than normal • Yield profiling identifies places where programs fail to yield sufficiently often
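The detection itself can be sketched by timestamping every yield and counting stretches that far exceed a threshold; the 100 ms limit and the names `note_yield`/`missed_yields` are illustrative:

```c
#define YIELD_LIMIT_US 100000L   /* illustrative threshold: 100 ms */

static long last_yield_us;
static int  missed_yields;

/* Called at every yield with a monotonic timestamp in microseconds. */
void note_yield(long now_us) {
    if (last_yield_us != 0 && now_us - last_yield_us > YIELD_LIMIT_US)
        missed_yields++;   /* a real profiler records where this happened */
    last_yield_us = now_us;
}
```

Because well-behaved threads yield every few microseconds, the offending runs stand out by orders of magnitude and are cheap to flag.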

  38. Web Server Performance • 4x500 MHz Pentium server • 2GB memory • Intel e1000 Gigabit Ethernet card • Linux 2.4.20 • Workload: requests for 3.2 GB of static file data

  39. Web Server Performance • Request frequencies match those of SPECweb99 • A client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses • Apache’s performance improved by 15% with Capriccio

  40. Resource-Aware Admission Control • Producer-consumer application • The producer loops, allocating memory and randomly touching pages • The consumer loops, removing memory from the pool and freeing it • A fast producer may run out of virtual address space

  41. Resource-Aware Admission Control • Touching pages too quickly will cause thrashing • Capriccio can quickly detect the overload conditions and limit the number of producers

  42. Programming Models for High Concurrency • Event • Application-specific optimization • Thread • Efficient thread runtimes

  43. User-Level Threads • Capriccio is unique • Blocking graph • Resource-aware scheduling • Targets large numbers of blocking threads • POSIX compliant

  44. Application-Specific Optimization • Most approaches require programmers to tailor their application to manage resources • Nonstandard APIs, less portable

  45. Stack Management • No garbage collection

  46. Future Work • Multi-CPU machines • Profiling tools for system tuning
