Capriccio: Scalable Threads for Internet Services (von Behren) Kenneth Chiu
Background • Non-blocking I/O, async I/O • Non-blocking (NB) I/O • Usually doesn’t work well for disks. • Async I/O • Issue a request, get a completion. • epoll()/poll() • convoy: tendency for threads to “bunch up” • priority inversion • call graph • average, weighted moving average • capriccio: improvisatory style, free form
The Problem • Web “transactions” involve a number of steps which must be performed in sequence. • For high throughput, we want to service many of these requests concurrently. • When does concurrency help? When does it not? • If we use a single thread per request, we will have too many threads. • If we multiplex requests on a small set of threads, the program is more difficult to write.
Read two numbers and add
Event-driven version (state is kept explicitly per connection):
while (true) {
    fd = get_read_ready();
    state = lookup(fd);
    if (state.step == READING_FIRST) {
        c = read(fd, …, bytes_left);
        if (have enough) {
            state.step = READING_SECOND;
        }
    } else if (state.step == READING_SECOND) {
        …
    }
}
Threaded version (state is implicit in the control flow):
while (true) {
    int n1, n2;
    readexact(fd, &n1, 4);
    readexact(fd, &n2, 4);
    printf("%d\n", n1 + n2);
}
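The threaded version relies on a readexact() helper that the slide does not define. A minimal sketch of what such a helper might look like, assuming POSIX read() and an invented return convention (0 on success, -1 on error or early EOF):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: the slide calls readexact() but never defines it.
 * Reads exactly n bytes from fd into buf, retrying after short reads.
 * Returns 0 on success, -1 on an I/O error or if EOF arrives early. */
int readexact(int fd, void *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    char *p = buf;
    while (n > 0) {
        ssize_t got = read(fd, p, n);
        if (got < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real I/O error */
        }
        if (got == 0)
            return -1;          /* EOF before all n bytes arrived */
        p += got;
        n -= (size_t)got;
    }
    return 0;
}
```

The loop is the point: a blocking thread can simply wait inside read() until the bytes arrive, so the "which step am I on" bookkeeping of the event-driven version disappears into the program counter.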
The Case for User-Level Threads • Flexibility • Level of indirection between applications and the kernel, which helps decouple the two. • Kernel-level thread scheduling must handle all applications. User-level scheduling can be tailored. • Lightweight, which means we can use zillions of them. • Performance • Cooperative scheduling is nearly free. • Uncontended locks do not require a kernel crossing. (Why do contended locks require kernel crossings?) • Disadvantages • Non-blocking I/O requires an additional system call. (Why?) • SMPs
Implementation • Context switches • Built on coroutine library. • I/O • Intercept blocking system calls, use epoll() and AIO for disk. • Can be less efficient • Scheduling • Main scheduling loop looks very much like an event-driven application. (What is an EDA?) • Makes it relatively easy to switch schedulers. • Synchronization • Cooperative threading on UP. • Efficiency • All O(1), except sleep queue.
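To make the "context switches built on a coroutine library" point concrete, here is a minimal sketch of a cooperative switch between a main context and one user-level thread. It uses glibc's ucontext calls as a stand-in; this is not Capriccio's actual coroutine library, and the names worker, run_demo, and the 64 KB stack size are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <ucontext.h>

#define STACK_BYTES (64 * 1024)   /* arbitrary stack size for this sketch */

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[STACK_BYTES];
static char trace[16];            /* records the order in which code ran */
static size_t trace_len;

static void record(char c)
{
    assert(trace_len < sizeof trace - 1);
    trace[trace_len++] = c;
}

/* Body of the user-level thread: runs, yields, then runs again. */
static void worker(void)
{
    record('a');
    swapcontext(&worker_ctx, &main_ctx);   /* cooperative yield */
    record('b');
    /* returning here resumes uc_link, i.e. main_ctx */
}

/* Interleaves the caller with the worker and returns the trace. */
const char *run_demo(void)
{
    memset(trace, 0, sizeof trace);
    trace_len = 0;

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    record('1');
    swapcontext(&main_ctx, &worker_ctx);   /* start the worker */
    record('2');
    swapcontext(&main_ctx, &worker_ctx);   /* resume it after its yield */
    record('3');
    return trace;
}
```

Control only changes hands at the explicit swapcontext() calls, which is why cooperative threading on a uniprocessor needs no locking around the shared trace buffer.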
Benchmarks • 2 X 2.4 GHz Xeon, 1 GB memory, 2 X 10K RPM SCSI, GigE. • 2 X 1.2 GHz US III • Linux 2.5.70, epoll(), AIO. • Solaris 8 • Capriccio, LinuxThreads, NPTL
Thread Scalability • Producer-consumer
Thread Scalability • The drop between 100 and 1000 threads is due to cache footprint.
I/O Performance • pipetest • Pass a number of tokens among a set of pipes. • Disk scheduling • A number of threads perform random 4 KB reads from a 1 GB file. • Disk I/O through buffer cache • 200 threads reading with a fixed miss rate.
I/O out of buffer. • Performance is lower due to AIO.
Thread Stacks • If a lot of threads, the cumulative stack space can be quite large. • Solution: Use a dynamic allocation policy and allocate on demand. Link stack chunks together. • Problem: How do you link stack chunks together? How do you know when to link a new one?
Weighted Call Graph • Use static analysis to create a weighted call graph. • Each node is weighted by the maximum stack space that the function might consume. (Why is it maximum, and not exact?) • Now what?
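One thing a weighted call graph makes possible, when the graph has no cycles, is a worst-case stack bound: a function's bound is its own frame plus the deepest bound among its callees. The sketch below illustrates that idea only; the struct layout, MAX_FUNCS, and stack_bound are hypothetical names, not taken from Capriccio.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUNCS 8   /* illustrative limit on graph size */

struct call_graph {
    size_t n;                          /* number of functions (nodes) */
    size_t frame[MAX_FUNCS];           /* max frame size of each function, bytes */
    int calls[MAX_FUNCS][MAX_FUNCS];   /* calls[i][j] != 0: i may call j */
};

/* Worst-case stack use starting at function f: its own frame plus the
 * deepest callee path. Assumes the graph is acyclic (no recursion);
 * with a cycle this would never terminate. */
size_t stack_bound(const struct call_graph *g, size_t f)
{
    assert(f < g->n);
    size_t deepest = 0;
    for (size_t j = 0; j < g->n; j++) {
        if (g->calls[f][j]) {
            size_t b = stack_bound(g, j);
            if (b > deepest)
                deepest = b;
        }
    }
    return g->frame[f] + deepest;
}
```

For a graph where function 0 (100 bytes) calls functions 1 (200) and 2 (50), and both call function 3 (300), the bound at function 0 is 100 + 200 + 300 = 600 bytes.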
Bounds • Most real-world programs use recursion. • Even without recursion, a static bound wastes too much space. • Instead, insert checkpoints at key places to link in new stack chunks. • Chunks are switched right before arguments are pushed.
Placing Checkpoints • Make sure one checkpoint in every cycle by inserting in back edges. (How?) (Is this efficient?) • Then make sure each path (sum) is not too long.
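One standard way to answer the "(How?)" for back edges is a depth-first search that colors nodes: an edge pointing at a node still on the DFS stack closes a cycle. The sketch below marks such edges as checkpoint sites. It is a textbook technique offered as an illustration, not a description of Capriccio's analysis, and the names cg, dfs, and mark_back_edges are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUNCS 8   /* illustrative limit on graph size */

enum color { WHITE, GRAY, BLACK };   /* unvisited, on DFS stack, finished */

struct cg {
    size_t n;                               /* number of functions */
    int calls[MAX_FUNCS][MAX_FUNCS];        /* calls[i][j] != 0: i may call j */
    int checkpoint[MAX_FUNCS][MAX_FUNCS];   /* output: edge gets a checkpoint */
    enum color color[MAX_FUNCS];
};

static void dfs(struct cg *g, size_t u)
{
    g->color[u] = GRAY;
    for (size_t v = 0; v < g->n; v++) {
        if (!g->calls[u][v])
            continue;
        if (g->color[v] == GRAY)
            g->checkpoint[u][v] = 1;   /* back edge: it closes a cycle */
        else if (g->color[v] == WHITE)
            dfs(g, v);
    }
    g->color[u] = BLACK;
}

/* Marks every back edge reachable from root and returns how many there are. */
size_t mark_back_edges(struct cg *g, size_t root)
{
    assert(root < g->n);
    for (size_t i = 0; i < g->n; i++) {
        g->color[i] = WHITE;
        for (size_t j = 0; j < g->n; j++)
            g->checkpoint[i][j] = 0;
    }
    dfs(g, root);

    size_t count = 0;
    for (size_t i = 0; i < g->n; i++)
        for (size_t j = 0; j < g->n; j++)
            count += (size_t)g->checkpoint[i][j];
    return count;
}
```

Every cycle reachable from the root contains at least one back edge, so each such cycle gets at least one checkpoint. The second step on the slide (bounding the summed frame sizes along each remaining path) is not covered by this sketch.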
Function B is executing. • Function D, both ways. • Recursion.
Special Cases • Function pointers • Difficult, but they try to analyze. • External functions • Allow annotations. • Alternatively, link in a large chunk. • Variable length arrays • C99
Question • What kind of a problem is this? • Is it being solved at the right level?
Admission Control • We’ve seen many graphs where performance degrades as some variable increases. • The goal of scheduling in Capriccio is to keep performance in the “good” part of the curve.
Blocking Graph • Each node is a location where the program blocked. • Location is call chain. • Generated at run time. • Annotate with resource usage: • Average running time (with exponentially-weighted “moving” average), memory, stack, sockets, etc. • Maintain a run queue for each node. Admit threads till resources reach maximum capacity.
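The exponentially weighted moving average used to annotate nodes can be sketched in a few lines: each new sample is blended into the running value with weight alpha. The struct, the function names, and the choice to seed the average with the first sample are assumptions for illustration; the slides do not give Capriccio's actual parameters.

```c
#include <assert.h>

/* Exponentially weighted moving average of one per-node measurement,
 * e.g. running time between two blocking points. */
struct ewma {
    double alpha;    /* weight given to the newest sample, in (0, 1] */
    double value;    /* current average */
    int primed;      /* 0 until the first sample has been seen */
};

void ewma_init(struct ewma *e, double alpha)
{
    assert(alpha > 0.0 && alpha <= 1.0);
    e->alpha = alpha;
    e->value = 0.0;
    e->primed = 0;
}

/* Folds one sample into the average and returns the updated value. */
double ewma_update(struct ewma *e, double sample)
{
    if (!e->primed) {
        e->value = sample;   /* seed with the first sample (an assumption) */
        e->primed = 1;
    } else {
        e->value = e->alpha * sample + (1.0 - e->alpha) * e->value;
    }
    return e->value;
}
```

With alpha = 0.5, the samples 10, 20, 5 give averages 10, 15, 10. A larger alpha tracks recent behavior more closely; a smaller one smooths out noise.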
Pitfalls • Too many non-linear effects to predict. • One solution is to use some kind of instrumentation, plus feedback control. • But even detecting that is hard.
Summary • Control flow maintains state. Control flow can be swapped for explicit maintenance. • Threads perform two functions: • Maintain state (logical threads of the programming model) • Allow concurrency (kernel) • We should separate the two, since the overhead of concurrency is not necessary when we just want to maintain state. • Cooperative multitasking has been denigrated before, but can be good.