
MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS

Garba Ya’u Isa. Master’s Thesis Oral Defense, Computer Engineering, King Fahd University of Petroleum & Minerals. Saturday, 7th June 2003.


Presentation Transcript


  1. MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS • Garba Ya’u Isa • Master’s Thesis Oral Defense • Computer Engineering • King Fahd University of Petroleum & Minerals • Saturday, 7th June 2003

  2. Outline • Introduction • Problem Statement • Analysis of Memory Accesses • Measurement Based Performance Evaluation • Design and Implementation of Prototype • Contributions • Conclusions • Future Work

  3. Introduction • Processor and memory performance discrepancy • Growing network bandwidth • Data rates in Terabits per second possible • Gigabit per second LANs already deployed • High throughput servers in network infrastructure • Streaming media servers • Web servers • Software Routers

  4. Dealing with Performance Gap • Hierarchical memory architecture • temporal locality • spatial locality • Constraints • Characteristics of network payload data: • Large → won’t fit into cache • Hardly reusable → poor temporal locality

  5. Problem Statement • Network servers should: • Deliver high throughput • Respond to requests with low latency • Respond to a large number of clients • Our goal • Identify the specific conditions under which server memory becomes a bottleneck • Includes: cache, main memory, and virtual memory • Benefits • Better server design that alleviates memory bottlenecks • Optimal performance can be achieved • Constraints • Large amount of data flowing through CPU and memory • Writing code to optimize memory utilization is a challenge

  6. Analysis of Memory Accesses: Data Flow Analysis Four data transfer paths: • Memory-CPU • Memory-memory • Memory-I/O • Memory-network

  7. Latency Model and Memory Overhead • Each transaction involves: • CPU cycles • Data transfers: one or more of the four identified types • Transaction latency: • Ttrans = Tcpu + n1·Tm-c + n2·Tm-m + n3·Tm-disk + n4·Tm-net • Tcpu: total CPU time needed for the transaction • Tm-c: time to transfer an entire PDU from memory to CPU for processing • Tm-m: latency of a memory-memory copy of a PDU • Tm-disk: latency of a memory-I/O read/write of a block of data • Tm-net: latency of a memory-network read/write of a PDU • ni: number of data movement operations of each type
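
The latency model on slide 7 can be sketched as a small function that adds the CPU time to the four weighted transfer terms. The function name, the dictionary keys, and all numbers in the usage lines are assumptions made for illustration; none of them are values from the thesis.

```python
def transaction_latency(t_cpu, counts, latencies):
    """Ttrans = Tcpu + n1*Tm-c + n2*Tm-m + n3*Tm-disk + n4*Tm-net.

    All times are in microseconds. `counts` maps a path name to n_i and
    `latencies` maps the same path name to the latency of one transfer.
    """
    paths = ("m-c", "m-m", "m-disk", "m-net")
    return t_cpu + sum(counts.get(p, 0) * latencies.get(p, 0.0) for p in paths)


# Hypothetical transaction: one CPU pass over the PDU, two memory copies,
# one disk read, and one network send.
example = transaction_latency(
    t_cpu=5.0,
    counts={"m-c": 1, "m-m": 2, "m-disk": 1, "m-net": 1},
    latencies={"m-c": 1.0, "m-m": 3.0, "m-disk": 12.0, "m-net": 12.0},
)
print(example)  # 5 + 1 + 6 + 12 + 12 = 36.0
```

Keeping the counts separate from the per-transfer latencies makes it easy to compare applications that differ only in how many copies they perform.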

  8. Memory-CPU Transfers • PDU Processing • checksum computation and header updating • Typically, one-way data flow (memory to CPU via cache) • Memory stall cycles • Number of memory stall cycles = (IC)(AR)(MR)(MP) • Cache miss rate • Worst case: MR = 1 (not as bad!) • Best case: MR = 0 (trivial)

  9. Memory-CPU Transfers cont. • Cache overhead in various cases: • Worst case: MR = 1, MP = 10 and (MR)(MP) = 10 • Best case: MR = 0 → trivial • Average case: MR = 0.1, MP = 10 and (MR)(MP) = 1 • Memory-CPU latency dependent on internal bus bandwidth • Tm-c = S/32Bi usec where S is the PDU size and Bi is the internal bus bandwidth in MB/s
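
The stall-cycle formula from slides 8 and 9 can be evaluated for the three cases as follows. This is a minimal sketch: the slides do not expand IC and AR, so the code assumes the usual textbook reading (instruction count and memory accesses per instruction), and the values chosen for them are hypothetical. Only MR and MP = 10 come from the slides.

```python
def memory_stall_cycles(ic, ar, mr, mp):
    """Stall cycles = IC * AR * MR * MP.

    ic: instruction count (assumed meaning of IC)
    ar: memory accesses per instruction (assumed meaning of AR)
    mr: cache miss rate, between 0 and 1
    mp: miss penalty in cycles
    """
    if not 0.0 <= mr <= 1.0:
        raise ValueError("miss rate must be between 0 and 1")
    return ic * ar * mr * mp


IC, AR, MP = 1_000_000, 0.5, 10  # IC and AR are hypothetical; MP = 10 is from the slide

worst = memory_stall_cycles(IC, AR, 1.0, MP)    # every access misses
average = memory_stall_cycles(IC, AR, 0.1, MP)  # one access in ten misses
best = memory_stall_cycles(IC, AR, 0.0, MP)     # no misses, no stalls
```

With these inputs the worst case costs ten times as many stall cycles as the average case, matching the (MR)(MP) products of 10 and 1 on the slide.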

  10. Memory-Memory Transfers • Memory-memory transfer: • Due to memory copy of PDU between protocol layers • Transfers through caches and CPU • Stride = 1 (contiguous) • Transfer involves memory → cache → CPU → cache → memory data movement • Latency: • Dependent on internal (system) bus bandwidth • Tm-m = 2S/Bi usec

  11. Memory-I/O and Memory-Network Transfers • Memory-network transfers: • Passes over the I/O bus • DMA can be used • Again, stride = 1 (contiguous) • Latency: • Limiting factor is the I/O bus bandwidth • Tm-net = S/Be usec
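
The two bandwidth-bound latencies from slides 10 and 11 can be checked with a short calculation. The sketch assumes S is given in bytes and that 1 MB/s means 10^6 bytes per second, so that bytes divided by MB/s comes out in microseconds. It also assumes Be denotes the I/O bus bandwidth in MB/s, which the slide implies but does not state. The PDU size and bandwidth figures are hypothetical.

```python
def memory_copy_latency_us(pdu_bytes, internal_bw_mb_s):
    """Tm-m = 2S/Bi: the PDU crosses the internal bus twice (read, then write)."""
    return 2 * pdu_bytes / internal_bw_mb_s


def memory_network_latency_us(pdu_bytes, io_bw_mb_s):
    """Tm-net = S/Be: the PDU crosses the I/O bus once, typically by DMA."""
    return pdu_bytes / io_bw_mb_s


S = 1500      # hypothetical PDU size in bytes
B_I = 1000.0  # hypothetical internal bus bandwidth in MB/s
B_E = 125.0   # hypothetical I/O bus bandwidth in MB/s

t_mm = memory_copy_latency_us(S, B_I)      # 3.0 usec
t_net = memory_network_latency_us(S, B_E)  # 12.0 usec
```

Under these assumed numbers the single pass over the slower I/O bus costs more than the double pass over the internal bus, which is the situation the slide describes when it names the I/O bus as the limiting factor.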

  12. Latency of Reference Applications • RTP transaction latency (equation 1) • HTTP transaction latency (equation 2) • IP transaction latency (equation 3) • [The equation bodies are not preserved in this transcript]

  13. Peak Throughputs • Assumptions • CPU usage latency is negligible compared to data transfer latency and can be ignored • Bus contention from multiple simultaneously executed transactions does not result in any additional overhead • Server Throughput = S/T • S = size of transaction data • T = latency of a transaction given by equations 1, 2 and 3
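
The relation Server Throughput = S/T on slide 13 can be turned into a unit-aware helper. This is a sketch only: equations 1 to 3 are not available in the transcript, so the latency passed in below is a hypothetical figure, and the choice of bytes and microseconds as input units is an assumption.

```python
def peak_throughput_mbps(data_bytes, latency_us):
    """Throughput = S/T, with S in bytes and T in microseconds.

    Bits per microsecond is numerically equal to megabits per second.
    """
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return data_bytes * 8 / latency_us


# Hypothetical transaction: 1500 bytes delivered with 15 usec of transfer latency.
throughput = peak_throughput_mbps(1500, 15.0)
print(throughput)  # 800.0
```

Because CPU time and bus contention are ignored by assumption, a value computed this way is an upper bound on throughput, not a prediction of measured performance.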

  14. Peak Throughputs cont.

  15. Measurement Based Performance Evaluation • Experimental Testbed • Dual boot server (Pentium IV 2.0 GHz) • 256 MB RAM • 1.0 Gbps NIC • Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch) • Tools • Intel VTune • Windows Performance Monitor • Netstat • Linux tools: vmstat, sar, iostat

  16. Platforms and Applications • Platforms • Linux (kernel 2.4.7-10) • Windows 2000 • Applications • Streaming media servers • Darwin Streaming Server • Windows Media Server • Web servers • Apache web server • Microsoft Internet Information Server (IIS) • Software router • Linux kernel IP forwarding

  17. Analysis of Operating System Role • Memory Throughput Test • ECT (extended copy transfer) – memperf • Locality of reference: • temporal locality – varying working set size (block size) • spatial locality – varying access pattern (strides)
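
The two locality experiments on slide 17 (varying working set size and varying stride) have the shape sketched below. This is not memperf and does not reproduce its ECT method; the function name and the chosen sizes and strides are assumptions. Interpreter overhead dominates in Python, so the absolute numbers say little about the cache hierarchy, and a real measurement would use a compiled tool.

```python
import time
from array import array


def strided_read_mb_s(working_set_bytes, stride, repeats=3):
    """Read every `stride`-th 8-byte element of a buffer; return MB/s of data touched."""
    n = max(working_set_bytes // 8, 1)
    data = array("d", bytes(8 * n))  # n zero-initialised doubles
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        sum(data[::stride])  # strided pass over the working set
        best = min(best, time.perf_counter() - start)
    touched_bytes = len(range(0, n, stride)) * 8
    return touched_bytes / max(best, 1e-9) / 1e6


# Sweep working set size (temporal locality) and stride (spatial locality).
results = {}
for working_set in (4 * 1024, 64 * 1024, 1024 * 1024):
    for stride in (1, 4, 16):
        results[(working_set, stride)] = strided_read_mb_s(working_set, stride)
```

Taking the best of several repeats reduces the effect of scheduling noise on each point of the sweep.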

  18. Analysis of Operating System Role cont. • Context switching overhead

  19. Streaming Media Servers Experimental Design • Factors • Number of streams (streaming clients) • Media encoding rate (56kbps and 300kbps) • Stream distribution (unique and multiple media) • Metrics • Cache miss (L1 and L2 cache) • Page fault rate • Throughput • Benchmarking Tools • DSS - streaming load tool • WMS – media load simulator
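
The experimental design on slide 19 amounts to a full factorial sweep over the three factors, which can be enumerated as below. The two encoding rates and the two distributions come from the slide; the list of stream counts is hypothetical, since the transcript does not give the levels used.

```python
from itertools import product

stream_counts = [1, 10, 100]            # hypothetical levels, not from the slides
encoding_rates_kbps = [56, 300]         # from the slide
distributions = ["unique", "multiple"]  # from the slide

# One run per combination of factor levels.
runs = [
    {"streams": n, "rate_kbps": rate, "distribution": dist}
    for n, rate, dist in product(stream_counts, encoding_rates_kbps, distributions)
]

for run in runs:
    print(run)
```

Each run would then be measured for the listed metrics (cache misses, page fault rate, throughput).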

  20. Cache Performance • L1 cache misses (56kbps)

  21. Cache Performance cont. • L1 cache misses (300 kbps)

  22. Memory Performance • Page fault (300kbps)

  23. Throughput • Throughput (300kbps)

  24. Summary: Streaming Media Server Memory Performance • Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300kbps with multiple multimedia objects. • When clients demand unique media objects, the page fault rate is constant. However, if the requests are for multiple objects, the page fault rate increases with the number of clients. • Throughput increases with the number of clients. The higher encoding rate (300kbps) also yields higher throughput. Darwin streaming server has lower throughput than Windows media server.

  25. Web Servers Experimental Design • Factors • Number of web clients • Document size • Metrics • Cache miss (L1 and L2 cache) • Page fault rate • Throughput • Transactions/sec (connection rate) • Average latency • Benchmarking Tool • Webstone

  26. Transactions

  27. L1 Cache Miss

  28. Page Fault

  29. Throughput

  30. Summary: Web Server Memory Performance Evaluation • Comparing Apache and IIS for an average file size of 10K

  31. Software Router • Experimental Design • Factors • Routing configurations • TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes) • Metrics • Throughput • Number of context switches • Number of active pages • Benchmarking Tool • Netperf

  32. Software Router Throughput

  33. CPU Utilization

  34. Context Switching

  35. Active Page

  36. Summary: Software Router Performance Evaluation • Maximum throughput of 449 Mbps for configuration number 2 (full duplex one-to-one communication). • Highest CPU utilization was 84%. • Highest context switching rate was 5378/sec. • The number of active pages is fairly uniformly distributed, which indicates low memory activity.

  37. Design, Implementation and Evaluation of Prototype DB-RTP Server Architecture • Implementation • Linux platform (C) • Our implementation of RTSP/RTP (why?)

  38. Double Buffering and Synchronization • Buffer read • Buffer write
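
The double-buffering idea named on slide 38 can be sketched with two buffers that alternate between a filling thread and a draining thread. This is an illustration in Python, not the thesis's C implementation: the function name, the queue-based synchronization, and the `send` callback are all assumptions.

```python
import queue
import threading


def double_buffered_copy(chunks, send):
    """Fill one buffer while the other is being sent, using exactly two buffers."""
    free = queue.Queue()    # buffers ready to be filled
    filled = queue.Queue()  # buffers ready to be sent
    for _ in range(2):
        free.put(bytearray())

    def reader():
        for chunk in chunks:
            buf = free.get()  # blocks until a buffer has been drained
            buf[:] = chunk    # "buffer write": fill it with the next block
            filled.put(buf)
        filled.put(None)      # sentinel: no more data

    worker = threading.Thread(target=reader, daemon=True)
    worker.start()
    while True:
        buf = filled.get()    # blocks until a buffer has been filled
        if buf is None:
            break
        send(bytes(buf))      # "buffer read": drain it to the consumer
        free.put(buf)         # hand the buffer back for refilling
    worker.join()


sent = []
double_buffered_copy([b"frame-1", b"frame-2", b"frame-3"], sent.append)
```

The two blocking queues provide the synchronization: the reader can never overwrite a buffer that is still being sent, and the sender never reads a buffer that is only partly filled.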

  39. RTP Server Throughput

  40. Jitter

  41. Summary: DB-RTP Server Performance Evaluation • Throughput • DB-RTP server – 63.85 Mbps • RTP server – 59 Mbps. • Both servers exhibit steady jitter, but DB-RTP has relatively lower jitter compared to RTP server.
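
The slides report jitter for both servers, but the transcript does not say how it was computed. One common definition, shown here only as an assumption, is the interarrival jitter estimator from the RTP specification (RFC 3550), which smooths the change in packet transit time with a gain of 1/16. The function name and the sample timestamps are hypothetical.

```python
def interarrival_jitter(send_times, recv_times):
    """Running jitter estimate: J += (|D| - J) / 16.

    D is the difference in transit time between consecutive packets.
    Both timestamp lists must use the same time unit.
    """
    jitter = 0.0
    prev_transit = None
    for sent_at, received_at in zip(send_times, recv_times):
        transit = received_at - sent_at
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Hypothetical timestamps in milliseconds.
steady = interarrival_jitter([0, 20, 40], [5, 25, 45])  # constant transit time
bumpy = interarrival_jitter([0, 20], [5, 41])           # transit jumps by 16 ms
```

A stream with constant transit time yields zero jitter, while a single 16 ms change moves the estimate by 1 ms because of the 1/16 smoothing.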

  42. Contributions • Cache overhead analysis. • Memory latency and bandwidth analysis • Measurement-based performance evaluation • Design, implementation, and evaluation of a prototype streaming server - Double Buffer RTP (DB-RTP) server.

  43. Conclusions • High throughput is possible with server design enhancement. • Server throughput is significantly degraded by excessive cache misses and page faults. • Latency hiding with pre-fetching and buffering can improve throughput and jitter performance

  44. Future Work • Server Development • hybrid = multiplexing + multithreading • Special Architectures (Network processors & ASICs) • resource scheduling • investigation of the role of I/O • use of IRAM (intelligent RAM) architectures • integrated network infrastructure server

  45. Thank you

  46. Array restructuring • Array padding • Loop nest transformation

  47. Testbeds • Software router testbed • Streaming media/web server testbed

  48. Communication Configurations

  49. Backup slides

  50. Memory Performance • Page fault (300 kbps and 56 kbps)
