18-845 Internet Services
Flash: An Efficient and Portable Web Server
Authors • Vivek S. Pai • Peter Druschel • Willy Zwaenepoel
Presenter • Anuwat Jongpairat
Outline • Background • Server Architectures • Performance Characteristics • Cost/Benefits of Optimizations and Features • Flash Implementation • Performance Evaluation • Conclusion
Background • Basic HTTP request processing steps: Accept Conn → Read Request → Find File → Send Header → Read File → Send Data (Start to End). Find File and Read File may block on disk; Read Request, Send Header, and Send Data may block on the network.
Server Architectures • Iterative • Multi-process (MP) • Multi-thread (MT) • Single-process event-driven (SPED) • Asymmetric multi-process event-driven (AMPED)
Iterative Architecture • Serve one request at a time • Very simple to implement • Inefficient: does not interleave the processing steps of concurrent requests
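A minimal sketch in C of the processing steps above, as an iterative server would run them for each client in turn; request parsing and error handling are elided, and the hardcoded path is purely illustrative.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

void serve_iterative(int listen_fd) {
    char buf[4096];
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);       /* Accept Conn  */
        ssize_t n = read(conn, buf, sizeof(buf));       /* Read Request */
        (void)n;                                        /* parsing elided */
        int file = open("index.html", O_RDONLY);        /* Find File (illustrative path) */
        const char *hdr = "HTTP/1.0 200 OK\r\n\r\n";
        write(conn, hdr, strlen(hdr));                  /* Send Header  */
        while ((n = read(file, buf, sizeof(buf))) > 0)  /* Read File... */
            write(conn, buf, n);                        /* ...Send Data */
        close(file);
        close(conn);    /* the next client is not served until here */
    }
}
```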
Multi-process (MP) Architecture • One process per client • Disk accesses, CPU processing, and network communication overlap naturally • Cache optimization on global information is difficult • Many concurrent requests → heavy context-switching overhead
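A sketch of the MP model under the same assumptions; handle_client() is a hypothetical stand-in for the request-processing steps above.

```c
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

void handle_client(int conn);   /* hypothetical: runs the processing steps */

void serve_mp(int listen_fd) {
    signal(SIGCHLD, SIG_IGN);   /* let the kernel reap finished children */
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        if (fork() == 0) {      /* child: owns this one client */
            close(listen_fd);
            handle_client(conn);    /* may block on disk or network */
            _exit(0);
        }
        close(conn);            /* parent: immediately back to accept() */
    }
}
```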
Multi-thread (MT) Architecture • One thread per client • Threads share a single address space • Cache optimization on shared information is easy • Requires synchronization of shared global variables • Less context switching than MP • Requires kernel thread support
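A sketch of the MT model: one detached thread per client; handle_client() is again a hypothetical placeholder, and the mutex marks where a shared cache would need locking.

```c
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

void handle_client(int conn);   /* hypothetical request-processing routine */

pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;  /* guards any shared cache */

static void *client_thread(void *arg) {
    handle_client((int)(intptr_t)arg);   /* lock cache_lock around shared state */
    return NULL;
}

void serve_mt(int listen_fd) {
    for (;;) {
        int conn = accept(listen_fd, NULL, NULL);
        pthread_t t;
        pthread_create(&t, NULL, client_thread, (void *)(intptr_t)conn);
        pthread_detach(t);      /* no join; the thread cleans up after itself */
    }
}
```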
Single-process Event-Driven (SPED) Architecture • [Diagram: an Event Dispatcher drives the Accept Conn, Read Request, Find File, Send Header, Read File, and Send Data steps] • A single event-driven process performs all client processing as well as disk and network activity • Single address space → no synchronization needed and low resource usage • Network I/O is non-blocking, but disk reads still block the main process
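A sketch of a SPED-style event loop using select(); the connection bookkeeping is elided, and select() stands in for whatever event mechanism the server actually uses.

```c
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

void serve_sped(int listen_fd) {
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(listen_fd, &readable);
        /* ...also FD_SET() every connection awaiting its next step... */
        select(FD_SETSIZE, &readable, NULL, NULL, NULL);   /* event dispatcher */
        if (FD_ISSET(listen_fd, &readable)) {
            int conn = accept(listen_fd, NULL, NULL);
            fcntl(conn, F_SETFL, O_NONBLOCK);   /* network I/O never blocks... */
            /* ...but a read() on a disk file still can, stalling the server */
        }
        /* ...for each ready connection, advance it one non-blocking step... */
    }
}
```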
Asymmetric Multi-process Event-driven (AMPED) • [Diagram: as in SPED, an Event Dispatcher drives the Accept Conn, Read Request, Find File, Send Header, Read File, and Send Data steps, but helper processes Helper 1 … Helper k handle the blocking disk operations]
Asymmetric Multi-process Event-driven (AMPED) [2] • Multiple helper processes (or threads) • Main process performs request processing while... • Helpers perform the potentially blocking (synchronous) disk I/O operations, using only primitives supported on all Unix systems
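A sketch of the helper side under assumed conventions: the main event loop writes a small request message into a pipe, the helper blocks on ordinary reads (faulting the file into the OS cache), and a one-byte completion notice lets the event loop resume that connection. The message format is hypothetical, not Flash's actual IPC.

```c
#include <fcntl.h>
#include <unistd.h>

struct disk_req { char path[256]; };   /* hypothetical message format */

void helper_loop(int req_fd, int done_fd) {
    struct disk_req r;
    char buf[4096];
    while (read(req_fd, &r, sizeof(r)) == sizeof(r)) {  /* wait for work */
        int fd = open(r.path, O_RDONLY);
        while (read(fd, buf, sizeof(buf)) > 0)  /* blocks here, warming the  */
            ;                                   /* file cache, instead of    */
        close(fd);                              /* blocking the main process */
        write(done_fd, "1", 1);                 /* notify the event dispatcher */
    }
}
```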
Performance Characteristics • Disk Operations • Memory Effects • Disk Utilization
Disk Operations • In MP and MT, only the process or thread doing disk I/O sleeps; the others keep running • SPED's single process blocks on disk I/O operations • AMPED's main process keeps running while helpers perform the disk I/O operations
Memory Effects • Memory used by server processes reduces the space available for the file cache • MT and MP memory consumption grows as the number of clients increases • SPED uses only a small amount of memory • AMPED's helpers add some overhead but consume little memory
Disk Utilization • Concurrent disk requests can benefit from multiple disks and disk head scheduling • MP, MT, and AMPED can issue concurrent disk requests, gaining those benefits • SPED issues at most one disk request at a time
Cost and Benefits of Optimizations and Features • Information Gathering • Application-level Caching • Long-lived Connections
Information Gathering • Gathering information about requests, for accounting purposes or to improve performance • The MP model must use interprocess communication to combine per-process data • The MT model needs synchronization • SPED and AMPED need neither IPC nor synchronization
Application-level Caching • Can cache response headers for files, or memory-mapped file contents, at the application level • A per-process cache in the MP architecture wastes memory • In the MT architecture, a single cache needs synchronization • SPED and AMPED can use a single cache without synchronization
Long-lived Connections • Caused by slow links, HTTP/1.1 persistent connections, or WAN latencies • May leave a large number of simultaneous connections open, tying up server resources for extended periods • SPED and AMPED pay only a file descriptor and kernel state per connection • MP and MT pay an entire process or thread per connection, adding memory and scheduling overhead
Implementation: Flash • Based on AMPED architecture • Optimizations • Aggressive Caching • Pathname translation caching • Response header caching • Memory-mapped files • “Gather writes” (writev) • Dynamic Content Generation • Memory Residency Test
Pathname Translation Caching • Maintains mappings between requested URLs and actual paths, e.g., /~beavis → /home/users/beavis/public_html/index.html • Reduces the number of calls to helper processes for pathname translation • Fewer helper processes needed → saves memory
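A sketch of the lookup-before-helper pattern; find_in_cache(), insert_in_cache(), and translate_via_helper() are hypothetical placeholders, not Flash's actual function names.

```c
#include <stddef.h>

const char *find_in_cache(const char *url);               /* hypothetical */
void insert_in_cache(const char *url, const char *path);  /* hypothetical */
const char *translate_via_helper(const char *url);        /* hypothetical, blocking */

const char *translate(const char *url) {
    const char *path = find_in_cache(url);     /* e.g. a hash table lookup */
    if (path == NULL) {
        path = translate_via_helper(url);      /* done by a helper process */
        insert_in_cache(url, path);
    }
    return path;   /* e.g. /~beavis -> /home/users/beavis/public_html/index.html */
}
```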
Response Header Caching • Response header containing file information is sent along with file content • Cache the response headers of frequently requested files
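A sketch of building the header once per file and caching it; hdr_cache_put() is a hypothetical bookkeeping helper, and the header fields shown are a minimal subset of a real response.

```c
#include <stdio.h>
#include <sys/stat.h>

void hdr_cache_put(const char *path, const char *hdr, int len);  /* hypothetical */

void cache_header(const char *path, const struct stat *st) {
    char hdr[256];
    int len = snprintf(hdr, sizeof(hdr),
                       "HTTP/1.0 200 OK\r\nContent-Length: %lld\r\n\r\n",
                       (long long)st->st_size);
    hdr_cache_put(path, hdr, len);   /* reused verbatim by later requests */
}
```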
Memory-mapped File Caching • Reduces the number of unnecessary map/unmap operations on frequently requested files • Use LRU algorithm to unmap inactive “chunks” of files • mincore() is used to test memory residency
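A sketch of chunked mapping under an assumed chunk size; the LRU bookkeeping itself is elided, and only the map/unmap endpoints are shown.

```c
#include <sys/mman.h>
#include <sys/types.h>

#define CHUNK_SIZE (128 * 1024)   /* assumed chunk granularity, a page multiple */

void *map_chunk(int fd, off_t offset) {   /* offset must be page-aligned */
    return mmap(NULL, CHUNK_SIZE, PROT_READ, MAP_SHARED, fd, offset);
}

void evict_chunk(void *addr) {            /* called when the LRU list evicts */
    munmap(addr, CHUNK_SIZE);
}
```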
Caching in Flash • [Diagram: the request pipeline from Start to End, with the Pathname Translation Cache backing Find File, the Response Header Cache backing Send Header, and Mapped File Caching backing Read File; helper processes handle cache misses]
Cache Optimization Contribution • Pathname translation caching contributes the most, followed by memory-mapped file caching, then response header caching
Gather Writes • The writev() system call writes data from multiple discontiguous buffers to a file or socket in one operation • Flash uses it to send the response header and file content together
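A sketch of the gather write: the cached header and the mapped file body go out in one writev() call, with no copy into a combined buffer.

```c
#include <sys/types.h>
#include <sys/uio.h>

ssize_t send_response(int conn, const char *hdr, size_t hdr_len,
                      const void *body, size_t body_len) {
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;    iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *)body;   iov[1].iov_len = body_len;
    return writev(conn, iov, 2);      /* header + content in a single syscall */
}
```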
Dynamic Content Generation • Server forks CGI processes as necessary • Server allows CGI processes to be persistent, reducing the cost of forking the same application for each new request
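A sketch of per-script process reuse; lookup_cgi() and remember_cgi() are hypothetical bookkeeping helpers, and real CGI environment setup is elided.

```c
#include <sys/socket.h>
#include <unistd.h>

int lookup_cgi(const char *script);               /* hypothetical: fd of a live process, or -1 */
void remember_cgi(const char *script, int fd);    /* hypothetical */

int cgi_fd_for(const char *script) {
    int fd = lookup_cgi(script);
    if (fd < 0) {                                 /* no live instance: fork one */
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        if (fork() == 0) {                        /* child becomes the CGI app */
            dup2(sv[1], 0);
            dup2(sv[1], 1);
            close(sv[0]);
            execl(script, script, (char *)NULL);
            _exit(1);                             /* exec failed */
        }
        close(sv[1]);
        fd = sv[0];
        remember_cgi(script, fd);                 /* reuse on later requests */
    }
    return fd;   /* main process forwards requests over this fd */
}
```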
Memory Residency Test • On most modern UNIX systems, mincore() can check whether a mapped file's pages are in memory • If mincore() is not available, mlock() can be used to lock memory pages
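A sketch of the residency test with mincore() (Linux signature shown); addr must be page-aligned, as mmap() returns, and the low bit of each status byte indicates an in-core page.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int chunk_is_resident(void *addr, size_t len) {   /* addr must be page-aligned */
    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (len + page - 1) / page;
    unsigned char *vec = malloc(pages);           /* one status byte per page */
    int resident = 0;
    if (vec && mincore(addr, len, vec) == 0) {
        resident = 1;
        for (size_t i = 0; i < pages; i++)
            if (!(vec[i] & 1)) { resident = 0; break; }   /* page not in core */
    }
    free(vec);
    return resident;   /* 1: safe to send without faulting; 0: hand to a helper */
}
```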
Performance Evaluation • Compare Flash (AMPED) to • Apache 1.3.1 (MP) • Zeus 1.30 (SPED) • Flash-MP • Flash-MT • Flash-SPED
Synthetic Workload • Clients request the same file repeatedly. • Servers should perform at their best since file content is always in cache. • For cached workload, the choice of architecture has little impact on performance
Trace-based Experiment • More realistic workload • Traces from Rice University's CS department web server and from the Owlnet server, which hosts personal web pages • Flash achieves the highest throughput
ECE Trace Experiment • Evaluate server performance as the dataset size varies from 15 to 150 MB • Use truncated traces from the ECE Department web server • Clients replay the truncated traces in a loop to generate requests for a given dataset size
ECE Trace Experiment [2] • Throughput decreases as dataset size increases • Significant drop when working set size exceeds effective memory cache size due to disk I/O operations • Flash has good performance on both cached workload and disk-bound workload
ECE Trace Experiment [3] • The results confirm that the SPED architecture performs well on cached workloads but poorly on disk-bound workloads • On a disk-bound workload, Flash has the highest throughput, since it incurs less context switching and is more memory-efficient than MP
Performance under WAN conditions • Use persistent connections to simulate long-lived connections over a WAN • MP performs poorly due to per-process overhead • Throughput of Flash, SPED, and MT grows initially as connections are added • MT then declines due to per-thread switching and memory overhead
Conclusion • Concurrent server architectures: SPED is good on cached workloads; MT and MP are good on disk-bound workloads • AMPED matches or exceeds the performance of SPED, MT, and MP on both types of workload • Flash: an AMPED implementation with aggressive caching and optimizations • Flash exceeds Zeus by up to 30% and Apache by up to 50%