Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware
Yaoping Ruan, Princeton University
Vivek Pai, Princeton University
Erich Nahum, IBM T.J. Watson
John Tracey, IBM T.J. Watson
http://www.cs.princeton.edu/~yruan
Motivation • Network servers • Throughput matters • Hardware intensive • Simultaneous Multithreading (SMT) • Processor support for high throughput • Simulated since mid-90s • Now - Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5 available
How Does SMT Work? • Simultaneous execution of multiple jobs • Higher utilization of functional units [Figure: pipeline occupancy over cycles for Job 1 on Processor 1, Job 2 on Processor 2, and Jobs 1 & 2 together on one SMT processor; colored blocks are functional units currently in use]
SMT Architecture • Appears as multiprocessors to the OS and applications • Duplicated resources: architectural state registers #1 and #2 • Shared resources: pipeline, execution units, cache hierarchy, system bus, main memory
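Because an SMT package shows up as two logical CPUs, software can steer work onto the sibling contexts with ordinary affinity calls. Below is a minimal sketch (Linux/GNU pthreads, not from the talk); treating logical CPUs 0 and 1 as siblings of one physical package is an assumption, since the real sibling mapping is platform-specific and reported via /proc/cpuinfo.

```c
/* Minimal sketch: pin two threads to two logical (SMT) CPUs.
 * ASSUMPTION: logical CPUs 0 and 1 are hyper-threads of the same
 * physical package; verify against /proc/cpuinfo on a real system. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg) {
    long cpu = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)cpu, &set);
    /* Bind this thread to one logical CPU exposed by SMT. */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... server work would run here ... */
    printf("thread bound to logical CPU %ld\n", cpu);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)0L);
    pthread_create(&t2, NULL, worker, (void *)1L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```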
Contributions • Detailed analysis of multiple real hardware platforms and server packages • Includes previously ignored OS overheads • Micro-architectural performance analysis • Demonstrates dominance of memory hierarchy • Comparison with simulation studies • Explains why SMT provides relatively small benefits on real hardware • Overly aggressive memory simulation yielded higher expected benefits
Outline • Background • Measurement methodology • Throughput & improvement • Micro-architectural performance • Discussion
Measurements Overview • Metrics • Server throughput • Throughput improvements (relative speedups) • Architectural features (CPI, miss ratio, etc.) • Multiple configurations • Hardware platforms (clock speed, cache, etc.) • Server software (Apache, Flash, TUX, etc.) • Kernel configuration (uniprocessor and multiprocessor)
Hardware Platforms • Three models of Xeon processors, differing in clock rate and cache: 2.0GHz, 3.06GHz, and 3.06GHz with L3
Web Servers • 5 Web server packages • Apache-MP: multi-process • Apache-MT: multi-thread • Flash: event-driven • TUX: in-kernel • Haboob: Java server, staged multi-thread model • Benchmark • SPECweb96 and SPECweb99
System Configuration • 5 configuration labels based on # CPUs, SMT on/off, and kernel type (T – # threads, P – # processors): • 1P-UP: 1 CPU, SMT off, uniprocessor kernel • 1P-MP: 1 CPU, SMT off, multiprocessor kernel • 2T: 1 CPU, SMT on (2 threads), multiprocessor kernel • 2P: 2 CPUs, SMT off, multiprocessor kernel • 4T: 2 CPUs, SMT on (4 threads), multiprocessor kernel
Outline • Background • Measurement methodology • Throughput & improvement • Single processor • Dual-processor • Micro-architectural performance • Discussion
Throughput Evaluation [Chart: Apache-MP at 3.06GHz; throughput (Mb/s, 0-1200) for 1P-UP, 1P-MP, and 2T w/ SMT (single processor) and 2P and 4T w/ SMT (dual-processor); comparisons marked: 2T vs. 1P-UP, 2T vs. 1P-MP, 4T vs. 2P]
Improvement on Single Processor: 2T vs. 1P-MP [Chart: throughput improvement (%, -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors] 2T: 2 threads, multiprocessor kernel; 1P-MP: 1 thread, multiprocessor kernel
Improvement on Single Processor: 2T vs. 1P-UP [Chart: throughput improvement (%, -10 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors; kernel overhead annotated] 2T: 2 threads, multiprocessor kernel; 1P-UP: 1 thread, uniprocessor kernel
Improvement on Dual-processor: 4T vs. 2P [Chart: throughput improvement (%, -20 to 40) for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors] 4T: 4 threads (2 processors, 2T/processor); 2P: 2 physical processors (SMT disabled) • 2.0GHz & 3.06GHz with L3 are better • Memory is still the bottleneck
Micro-architectural Analysis • Uses OProfile • In-house patch to measure extra events • About 25 performance events • Cache miss/hit • TLB miss/hit • Branches • Pipeline stall, clear, etc. • Bus utilization
L1 Instruction Cache Miss Rate [Chart: miss rate (0-20%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT)]
L2 Cache Miss Rate [Chart: miss rate (0-10%) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT)] • Instruction & data unified • Lower rate in SMT due to higher L1 misses (see the sketch below)
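The second bullet is a denominator effect: the L2 miss rate is L2 misses over L2 accesses, and L2 accesses are roughly the L1 misses, which SMT inflates. A minimal sketch with made-up counts (not measurements from the paper):

```c
/* Sketch of why the L2 miss *ratio* can fall under SMT even if L2
 * misses barely change: the denominator (L2 accesses ~= L1 misses)
 * grows. Counts below are illustrative assumptions only. */
#include <stdio.h>

static double l2_miss_ratio(double l1_misses, double l2_misses) {
    return l2_misses / l1_misses;   /* L2 accesses ~= L1 misses */
}

int main(void) {
    /* hypothetical event counts per 1M instructions */
    double base = l2_miss_ratio(50e3, 4e3);   /* no SMT          */
    double smt  = l2_miss_ratio(90e3, 4.5e3); /* SMT: more L1 misses */
    printf("L2 miss ratio: base %.1f%%, SMT %.1f%%\n",
           100 * base, 100 * smt);            /* 8.0 vs 5.0      */
    return 0;
}
```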
Putting Events Together: Cycles per Instruction (CPI) [Chart: Apache-MP; stacked CPI (0-16) under 1P-UP, 1P-MP, 2T, 2P, and 4T, decomposed into work, L1 miss, L2 miss, ITLB, DTLB, branch, clear, buffer, and others]
Non-overlapped CPI • L1/L2 miss penalty dominates
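One way to read the stacked bars: each event's non-overlapped CPI contribution is its count times an assumed stall penalty, divided by instructions retired, with the remainder booked as useful work and other events. The counts and penalties below are illustrative assumptions, not the paper's calibrated values.

```c
/* Sketch of the stacked-CPI accounting. ASSUMPTION: the event
 * counts and per-event stall penalties are invented for
 * illustration; only the arithmetic mirrors the decomposition. */
#include <stdio.h>

int main(void) {
    double instructions = 1e9;
    double total_cycles = 4e9;                 /* measured CPI = 4.0 */
    double l1_misses = 60e6, l1_penalty = 18;  /* cycles, assumed    */
    double l2_misses = 6e6,  l2_penalty = 350; /* ~memory latency    */

    double cpi       = total_cycles / instructions;
    double cpi_l1    = l1_misses * l1_penalty / instructions; /* 1.08 */
    double cpi_l2    = l2_misses * l2_penalty / instructions; /* 2.10 */
    double cpi_other = cpi - cpi_l1 - cpi_l2;  /* work + other events */

    printf("CPI %.2f = L1 %.2f + L2 %.2f + work/other %.2f\n",
           cpi, cpi_l1, cpi_l2, cpi_other);
    return 0;
}
```

With numbers in this range, the L1 and L2 miss terms account for most of the CPI, which is the sense in which the miss penalty dominates.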
Measuring Bus Utilization • Event: FSB_DATA_ACTIVITY • CPU cycles when the bus is busy • Normalized to CPU speed • Comparable across all CPU clock rates
Bus Utilization Results [Chart: Apache-MP; bus utilization (0-20%) for 1P-UP, 1P-MP, 2T, 2P, and 4T on the 2.0GHz, 3.06GHz, and 3.06GHz L3 processors] • 2.0GHz & 3.06GHz L3 have fewer data-transfer cycles • Lower memory latency on 2.0GHz & 3.06GHz with L3 • Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95 (see the sketch below)
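A correlation coefficient like this can be computed with the standard Pearson definition; a minimal sketch follows, using hypothetical (bus utilization, speedup) pairs rather than the paper's data.

```c
/* Pearson's r between bus utilization and SMT speedup.
 * ASSUMPTION: the five data points are invented placeholders,
 * not the paper's measurements. Build with -lm. */
#include <math.h>
#include <stdio.h>

static double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    return cov / sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
}

int main(void) {
    double bus[]     = { 8, 11, 14, 17, 20 };  /* utilization, % */
    double speedup[] = { 3,  9,  5, 12, 14 };  /* improvement, % */
    printf("r = %.2f\n", pearson(bus, speedup, 5)); /* ~0.86 */
    return 0;
}
```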
Outline • Background • Measurement methodology • Throughput & improvement • Micro-architectural performance • Discussion • Compare to simulation • Other Web workloads
SMT Performance on Web Servers [Chart: throughput improvement (-10% to 100%); simulation results vs. measured results with uniprocessor kernel, multiprocessor kernel, and dual processor]
Compare to Simulation
Processor Development Trend • Simulated models vs. actual processors over time: • 1996: simulated 62-cycle mem, 32 KB L1, 256 KB L2; actual 74-cycle mem, 16 KB L1, 256 KB L2 • 2000: simulated 90-cycle mem, 128 KB L1, 16384 KB L2; actual 94-cycle mem, 16 KB L1, 512 KB L2 • 2003: simulated 90-cycle mem, 64 KB L1, 16384 KB L2; actual 350-cycle mem, 8-12 KB L1, 512 KB L2
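A back-of-the-envelope view of why the simulated parameters flatter SMT: estimated memory-stall CPI is miss rate times miss latency, so the 2003 simulated model hides far less stall than the real Xeon. The miss rate below is an assumed illustration value; the latencies are the slide's 2003 figures.

```c
/* Memory-stall CPI estimate under simulated vs. actual 2003
 * parameters. ASSUMPTION: the 1% miss rate is illustrative. */
#include <stdio.h>

int main(void) {
    double misses_per_insn   = 0.01; /* assumed memory-miss rate      */
    double simulated_latency = 90;   /* cycles (2003 simulated model) */
    double actual_latency    = 350;  /* cycles (2003 Xeon)            */
    printf("stall CPI: simulated %.2f, actual %.2f\n",
           misses_per_insn * simulated_latency,  /* 0.90 */
           misses_per_insn * actual_latency);    /* 3.50 */
    return 0;
}
```

When memory stalls dominate CPI to this degree, a second hardware context contends for the same saturated bus rather than filling idle functional units, which is consistent with the smaller measured speedups.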
SMT on SPECweb99 • SPECweb99 results in paper • Dynamic + static • Multiple programs • CGI requests, user profile logging, etc. • Speedup very close to static-only workloads • No more negative speedups in Flash • May be due to better sharing of resources among different programs
Summary • More realistic speedup evaluation of SMT • 3 processors, 5 servers, 2 kernels • Exposed factors not previously examined • 5~15% speedup in our best cases • Detailed analysis of memory hierarchy impact on SMT performance • All other architecture overheads secondary • Reasons why simulation results were overly optimistic
Thank you http://www.cs.princeton.edu/~yruan
Future Work • Ways of improving Simultaneous Multithreading performance • Server performance on POWER5 • Using execution-driven simulation for deeper understanding • Study Chip Multiprocessor (CMP) • Intel, AMD, and IBM
Pipeline Clears (per Byte) [Chart: pipeline clears per byte (0.00-0.30) for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T, 2P, and 4T] • Conditions when the whole pipeline needs to be flushed