Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
http://www.cs.princeton.edu/~yruan
Motivation • Network servers • Throughput matters • Hardware intensive • Simultaneous Multithreading (SMT) • Processor support for high throughput • Simulated since mid-90s • Now - Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5 available
How Does SMT Work? • Simultaneous execution of multiple jobs • Higher utilization of functional units [Figure: pipeline occupancy over cycles for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 together on one SMT processor; colored blocks are functional units currently in use]
SMT Architecture • Appears as multiprocessors to the OS and applications • Duplicated resources: architectural state registers #1 and #2 • Shared resources: pipeline, execution units, cache hierarchy, system bus, main memory
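Because an SMT package shows up as two logical CPUs, software can steer work onto the sibling contexts with ordinary affinity calls. Below is a minimal sketch (Linux/GNU pthreads, not from the talk); treating logical CPUs 0 and 1 as siblings of one physical package is an assumption, since the real sibling mapping is platform-specific and reported via /proc/cpuinfo.

```c
/* Minimal sketch: pin two threads to two logical (SMT) CPUs.
 * ASSUMPTION: logical CPUs 0 and 1 are hyper-threads of the same
 * physical package; verify against /proc/cpuinfo on a real system. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    long cpu = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)cpu, &set);
    /* Bind this thread to one logical CPU exposed by SMT. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... server work would run here ... */
    printf("thread bound to logical CPU %ld\n", cpu);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)0L);
    pthread_create(&t2, NULL, worker, (void *)1L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```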
Contributions • Detailed analysis of multiple real hardware platforms and server packages • Includes previously ignored OS overheads • Micro-architectural performance analysis • Demonstrates dominance of memory hierarchy • Comparison with simulation studies • Explains why SMT provides relatively small benefits on real hardware • Overly aggressive memory simulation yielded higher expected benefits
Outline • Background • Measurement methodology • Throughput & improvement • Micro-architectural performance • Discussion
Measurements Overview • Metrics • Server throughput • Throughput improvements (relative speedups) • Architectural features (CPI, miss ratio, etc.) • Multiple configurations • Hardware platforms (clock speed, cache, etc.) • Server software (Apache, Flash, TUX, etc.) • Kernel configuration (uniprocessor and multiprocessor)
Hardware Platforms • Three models of Xeon processors, differing in clock rate and cache: 2.0GHz, 3.06GHz, and 3.06GHz with L3
Web Servers • 5 Web server packages • Apache-MP: multi-process • Apache-MT: multi-thread • Flash: event-driven • TUX: in-kernel • Haboob: Java server, staged multi-thread model • Benchmark • SPECweb96 and SPECweb99
System Configuration • 5 configuration labels based on # CPUs, SMT on/off, and kernel type (T – # threads, P – # processors): • 1P-UP: 1 CPU, SMT off, uniprocessor kernel • 1P-MP: 1 CPU, SMT off, multiprocessor kernel • 2T: 1 CPU, SMT on (2 threads), multiprocessor kernel • 2P: 2 CPUs, SMT off, multiprocessor kernel • 4T: 2 CPUs, SMT on (4 threads), multiprocessor kernel
Outline • Background • Measurement methodology • Throughput & improvement • Single processor • Dual-processor • Micro-architectural performance • Discussion
Throughput Evaluation [Chart: Apache-MP at 3.06GHz; throughput (Mb/s, 0-1200) for 1P-UP, 1P-MP, and 2T w/ SMT (single processor) and 2P and 4T w/ SMT (dual-processor); comparisons marked: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P]
Improvement on Single Processor: 2T vs. 1P-MP [Chart: throughput improvement (%, -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors] 2T: 2 threads, multiprocessor kernel; 1P-MP: 1 thread, multiprocessor kernel
Improvement on Single Processor: 2T vs. 1P-UP [Chart: throughput improvement (%, -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors; kernel overhead annotated] 2T: 2 threads, multiprocessor kernel; 1P-UP: 1 thread, uniprocessor kernel
Improvement on Dual-processor: 4T vs. 2P [Chart: throughput improvement (%, -20 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors] 4T: 4 threads (2 processors, 2T/processor); 2P: 2 physical processors (SMT disabled) • 2.0GHz & 3.06GHz with L3 are better • Memory is still the bottleneck
Micro-architectural Analysis • Uses OProfile • In-house patch to measure extra events • About 25 performance events • Cache miss/hit • TLB miss/hit • Branches • Pipeline stall, clear, etc. • Bus utilization
L1 Instruction Cache Miss Rate [Chart: miss rate (0-20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT)]
L2 Cache Miss Rate [Chart: miss rate (0-10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT)] • Instruction & data unified • Lower rate in SMT due to higher L1 misses (see the sketch below)
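The second bullet is a denominator effect: the L2 miss rate is L2 misses over L2 accesses, and L2 accesses are roughly the L1 misses, which SMT inflates. A minimal sketch with made-up counts (not measurements from the paper):

```c
/* Sketch of why the L2 miss *ratio* can fall under SMT even if L2
 * misses barely change: the denominator (L2 accesses ~= L1 misses)
 * grows. Counts below are illustrative assumptions only. */
#include <stdio.h>

static double l2_miss_ratio(double l1_misses, double l2_misses) {
    return l2_misses / l1_misses;   /* L2 accesses ~= L1 misses */
}

int main(void) {
    /* hypothetical event counts per 1M instructions */
    double base = l2_miss_ratio(50e3, 4e3);   /* no SMT          */
    double smt  = l2_miss_ratio(90e3, 4.5e3); /* SMT: more L1 misses */
    printf("L2 miss ratio: base %.1f%%, SMT %.1f%%\n",
           100 * base, 100 * smt);            /* 8.0 vs 5.0      */
    return 0;
}
```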
Putting Events Together: Cycles per Instruction (CPI) [Chart: Apache-MP; stacked CPI (0-16) under 1P-UP, 1P-MP, 2T, 2P, and 4T, decomposed into work, L1 miss, L2 miss, ITLB, DTLB, branch, clear, buffer, and others]
Non-overlapped CPI • L1/L2 miss penalty dominates
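One way to read the stacked bars: each event's non-overlapped CPI contribution is its count times an assumed stall penalty, divided by instructions retired, with the remainder booked as useful work and other events. The counts and penalties below are illustrative assumptions, not the paper's calibrated values.

```c
/* Sketch of the stacked-CPI accounting. ASSUMPTION: the event
 * counts and per-event stall penalties are invented for
 * illustration; only the arithmetic mirrors the decomposition. */
#include <stdio.h>

int main(void) {
    double instructions = 1e9;
    double total_cycles = 4e9;                 /* measured CPI = 4.0 */
    double l1_misses = 60e6, l1_penalty = 18;  /* cycles, assumed    */
    double l2_misses = 6e6,  l2_penalty = 350; /* ~memory latency    */

    double cpi       = total_cycles / instructions;
    double cpi_l1    = l1_misses * l1_penalty / instructions; /* 1.08 */
    double cpi_l2    = l2_misses * l2_penalty / instructions; /* 2.10 */
    double cpi_other = cpi - cpi_l1 - cpi_l2;  /* work + other events */

    printf("CPI %.2f = L1 %.2f + L2 %.2f + work/other %.2f\n",
           cpi, cpi_l1, cpi_l2, cpi_other);
    return 0;
}
```

With numbers in this range, the L1 and L2 miss terms account for most of the CPI, which is the sense in which the miss penalty dominates.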
Measuring Bus Utilization • Event: FSB_DATA_ACTIVITY • CPU cycles when the bus is busy • Normalized to CPU speed • Comparable across all CPU clock rates
Bus Utilization Results [Chart: Apache-MP; bus utilization (0-20%) for 1P-UP, 1P-MP, 2T, 2P, and 4T on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors] • 2.0GHz & 3.06GHz L3 have fewer data-transfer cycles • Lower memory latency on 2.0GHz & 3.06GHz with L3 • Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95 (see the sketch below)
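A correlation coefficient like this can be computed with the standard Pearson definition; a minimal sketch follows, using hypothetical (bus utilization, speedup) pairs rather than the paper's data.

```c
/* Pearson's r between bus utilization and SMT speedup.
 * ASSUMPTION: the five data points are invented placeholders,
 * not the paper's measurements. Build with -lm. */
#include <math.h>
#include <stdio.h>

static double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    return cov / sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

int main(void) {
    double bus[]     = { 8, 11, 14, 17, 20 };  /* utilization, % */
    double speedup[] = { 3,  9,  5, 12, 14 };  /* improvement, % */
    printf("r = %.2f\n", pearson(bus, speedup, 5)); /* ~0.86 */
    return 0;
}
```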
Outline • Background • Measurement methodology • Throughput & improvement • Micro-architectural performance • Discussion • Compare to simulation • Other Web workloads
SMT Performance on Web Servers [Chart: throughput improvement (-10% to 100%); simulation results vs. measured results with uniprocessor kernel, multiprocessor kernel, and dual processor]
Compare to Simulation
Processor Development Trend • Simulated models vs. actual processors over time: • 1996: simulated 62-cycle mem, 32 KB L1, 256 KB L2; actual 74-cycle mem, 16 KB L1, 256 KB L2 • 2000: simulated 90-cycle mem, 128 KB L1, 16384 KB L2; actual 94-cycle mem, 16 KB L1, 512 KB L2 • 2003: simulated 90-cycle mem, 64 KB L1, 16384 KB L2; actual 350-cycle mem, 8-12 KB L1, 512 KB L2
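A back-of-the-envelope view of why the simulated parameters flatter SMT: estimated memory-stall CPI is miss rate times miss latency, so the 2003 simulated model hides far less stall than the real Xeon. The miss rate below is an assumed illustration value; the latencies are the slide's 2003 figures.

```c
/* Memory-stall CPI estimate under simulated vs. actual 2003
 * parameters. ASSUMPTION: the 1% miss rate is illustrative. */
#include <stdio.h>

int main(void) {
    double misses_per_insn   = 0.01; /* assumed memory-miss rate      */
    double simulated_latency = 90;   /* cycles (2003 simulated model) */
    double actual_latency    = 350;  /* cycles (2003 Xeon)            */
    printf("stall CPI: simulated %.2f, actual %.2f\n",
           misses_per_insn * simulated_latency,  /* 0.90 */
           misses_per_insn * actual_latency);    /* 3.50 */
    return 0;
}
```

When memory stalls dominate CPI to this degree, a second hardware context contends for the same saturated bus rather than filling idle functional units, which is consistent with the smaller measured speedups.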
SMT on SPECweb99 • SPECweb99 results in paper • Dynamic + static • Multiple programs • CGI requests, user profile logging, etc. • Speedup very close to static-only workloads • No more negative speedups in Flash • May be due to better sharing of resources among different programs
Summary • More realistic speedup evaluation of SMT • 3 processors, 5 servers, 2 kernels • Exposed factors not previously examined • 5~15% speedup in our best cases • Detailed analysis of memory hierarchy impact on SMT performance • All other architecture overheads secondary • Reasons why simulation results were overly optimistic
Thank you http://www.cs.princeton.edu/~yruan
Future Work • Ways of improving Simultaneous Multithreading performance • Server performance on POWER5 • Using execution-driven simulation for deeper understanding • Study Chip Multiprocessor (CMP) • Intel, AMD, and IBM
Pipeline Clears (per Byte) [Chart: pipeline clears per byte (0.00-0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T, 2P, and 4T] • Conditions when the whole pipeline needs to be flushed