Design and Evaluation of Architectures for Commercial Applications
Part III: architecture studies
Luiz André Barroso
UPC, February 1999
Overview (3)
• Day III: architecture studies
  • Memory system characterization
  • Impact of out-of-order processors
  • Simultaneous multithreading
  • Final remarks
Memory system performance studies
• Collaboration with Kourosh Gharachorloo and Edouard Bugnion
• Presented at ISCA'98
Motivations
• Market shift for high-performance systems
  • yesterday: technical/numerical applications
  • today: databases, Web servers, e-mail services, etc.
• Bottleneck shift in commercial applications
  • yesterday: I/O
  • today: memory system
• Lack of data on the behavior of commercial workloads
• Need to re-evaluate memory system design trade-offs
Bottleneck Shift
• Just a few years ago [Thakkar & Sweiger 90], I/O was the only important bottleneck
• Since then, several improvements:
  • better DB engines can tolerate I/O latencies
  • better OSes do more efficient I/O operations and are more scalable
  • greater parallelism in the disk subsystem (RAID) provides more bandwidth
• ... and memory keeps getting "slower":
  • faster processors
  • bigger machines
• Result: the memory system is a primary factor today
Workloads
• OLTP (on-line transaction processing)
  • modeled after TPC-B, using the Oracle7 DB engine
  • short transactions, intense process communication & context switching
  • multiple transactions in transit
• DSS (decision support systems)
  • modeled after TPC-D, using Oracle7
  • long-running transactions, low process communication
  • parallelized queries
• AltaVista
  • Web index search application using a custom threads package
  • medium-sized transactions, low process communication
  • multiple transactions in transit
Methodology: Platform
• AlphaServer 4100 5/300
  • 4x 300 MHz processors (8KB/8KB I/D caches, 96KB on-chip L2 cache)
  • 2MB board-level cache
  • 2GB main memory
  • latencies: 1:7:21:80/125 cycles (L1 : L2 : board cache : memory / dirty miss)
• 3-channel HSZ disk array controller
• Digital Unix 4.0B
Methodology: Tools
• Monitoring tools:
  • IPROBE
  • DCPI
  • ATOM
• Simulation tools:
  • tracing: preliminary user-level studies
  • SimOS-Alpha: full-system simulation, including the OS
Scaling
• Workload sizes make these applications difficult to study
• Scaling down the problem size is critical
• Validation criterion: memory system behavior similar to that of larger runs
• Requires a good understanding of the workload:
  • make sure the system is well tuned
  • keep the SGA many times larger than the hardware caches (1GB)
  • use the same number of server processes per processor as audit-sized runs (4-8 per CPU)
CPU Cycle Breakdown
• Very high CPI for OLTP
• Instruction- and data-related stalls are equally important
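To see how stall components inflate CPI, a back-of-the-envelope model can weight each miss level by its incremental latency. The C sketch below uses the platform latencies from the Methodology slide; the per-instruction miss rates are hypothetical placeholders, not numbers from the study.

```c
#include <stdio.h>

/* Back-of-the-envelope CPI model using the AlphaServer 4100 latencies
 * quoted earlier (L1: 1, L2: 7, B-cache: 21, memory: 80, dirty miss:
 * 125 cycles). The miss rates below are HYPOTHETICAL placeholders,
 * not measurements from the study. */
int main(void) {
    const double L1 = 1, L2 = 7, BC = 21, MEM = 80, DIRTY = 125;

    double base_cpi = 1.0;     /* ideal CPI with all hits in L1            */
    double m_l1     = 0.040;   /* L1 misses per instruction (hypothetical) */
    double m_l2     = 0.015;   /* on-chip L2 misses per instruction        */
    double m_bc     = 0.008;   /* board-cache misses per instruction       */
    double dirty_fr = 0.30;    /* fraction of B-cache misses that are dirty */

    /* Each level adds only its incremental latency beyond the level above. */
    double cpi = base_cpi
               + m_l1 * (L2 - L1)
               + m_l2 * (BC - L2)
               + m_bc * ((1 - dirty_fr) * (MEM - BC)
                         + dirty_fr * (DIRTY - BC));

    printf("estimated CPI = %.2f\n", cpi);   /* ~2.0 with these numbers */
    return 0;
}
```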
Cache behavior
Stall Cycle Breakdown
• OLTP is dominated by non-primary cache and memory stalls
• DSS and AltaVista stalls are mostly Scache (on-chip L2) hits
Impact of On-Chip Cache Size
(P=4; 2MB, 2-way off-chip cache)
• 64KB on-chip caches are enough for DSS
OLTP: Effect of Off-Chip Cache Organization
(P=4)
• Significant benefits from large off-chip caches (up to 8MB)
OLTP: Impact of System Size
(P=4; 2MB, 2-way off-chip cache)
• Communication misses become dominant for larger systems
OLTP: Contribution of Dirty Misses
(P=4, 8MB Bcache)
• Shared metadata is the important region:
  • 80% of off-chip misses
  • 95% of dirty misses
• The fraction of dirty misses increases with cache and system size
OLTP: Impact of Off-Chip Cache Line Size
(P=4; 2MB, 2-way off-chip cache)
• Good spatial locality in communication for OLTP
• Very little false sharing in Oracle itself
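As background on the false-sharing point, the generic C sketch below (not Oracle code) shows the classic pattern and the padding fix.

```c
/* Illustrative only -- not Oracle code. Two per-thread counters that
 * share a cache line cause coherence traffic ("false sharing") even
 * though the threads never touch each other's data. */
struct counters_bad {
    long a;   /* written by thread 0                                   */
    long b;   /* written by thread 1: same line as 'a' on most machines */
};

/* Padding each counter to its own line removes the false sharing.
 * 64 bytes is a common line size today; the study's off-chip caches
 * used larger lines, where the same effect is amplified. */
struct counters_good {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};
```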
Summary of Results
• On-chip cache:
  • 64KB I/D is sufficient for DSS & AltaVista
• Off-chip cache:
  • OLTP benefits from larger caches (up to 8MB)
• Dirty misses:
  • can become dominant for OLTP
Conclusion
• The memory system is the current challenge in DB performance
• Careful scaling enables detailed studies
• The combination of monitoring and simulation is very powerful
• Diverging memory system designs:
  • OLTP benefits from large off-chip caches and fast communication
  • DSS & AltaVista may perform better without an off-chip cache
Impact of out-of-order processors
• Collaboration with:
  • Kourosh Gharachorloo (Compaq)
  • Parthasarathy Ranganathan and Sarita Adve (Rice)
• Presented at ASPLOS'98
Motivation
• Databases are the fastest-growing market for shared-memory servers
  • online transaction processing (OLTP)
  • decision-support systems (DSS)
• But current systems are optimized for engineering/scientific workloads
  • aggressive use of instruction-level parallelism (ILP)
  • multiple issue, out-of-order issue, non-blocking loads, speculative execution
• Need to re-evaluate system design for database workloads
Contributions
• Detailed simulation study of Oracle with ILP processors
• Is ILP design complexity warranted for database workloads?
  • ILP techniques improve performance (1.5X OLTP, 2.6X DSS)
  • and reduce the performance gap between consistency models
• How can we improve performance for OLTP workloads?
  • OLTP is limited by instruction and migratory data misses
  • a small stream buffer comes close to a perfect instruction cache
  • prefetching/flush appear promising
Simulation Environment - Workloads
• Oracle 7.3.2 commercial DBMS engine
• Database workloads:
  • online transaction processing (OLTP), TPC-B-like
    • day-to-day business operations
  • decision-support system (DSS), TPC-D Query 6
    • offline business analysis
Simulation Environment - Methodology
• Used RSIM, the Rice Simulator for ILP Multiprocessors
  • detailed simulation of processor, memory, and network
• But simulating a commercial-grade database engine is hard
  • some simplifications
  • similar to Lo et al. and Barroso et al., ISCA'98
Simulation Methodology - Simplifications
• Trace-driven simulation
• OS/system-call simulation
  • the OS is not a large component
  • model only key effects: page mapping, TLB misses, process scheduling
  • system-call and I/O time-dilation effects
• Multiple processes per processor to hide I/O latency
• Database scaling
Simulated Environment - Hardware
• 4-processor shared-memory system, 8 processes per processor
• Directory-based MESI protocol with invalidations
• Next-generation processing nodes:
  • aggressive ILP processor
  • 128KB, 2-way separate instruction and data L1 caches
  • 8MB, 4-way unified L2 cache
  • representative miss latencies
Outline
• Motivation
• Simulation Environment
• Impact of ILP on Database Workloads
  • multiple issue and OOO issue for OLTP
  • multiple outstanding misses for OLTP
  • ILP techniques for DSS
  • ILP-enabled consistency optimizations
• Improving Performance of OLTP
• Conclusions
Multiple Issue and OOO Issue for OLTP
[Chart: normalized execution time. In-order processors: 100.0, 92.1, 90.1, 88.8; out-of-order processors: 86.8, 74.3, 68.4, 67.8]
• Multiple issue and OOO issue improve performance by 1.5X
• But 4-way issue and a 64-element window are enough
• Instruction misses and dirty misses are the key bottlenecks
Multiple Outstanding Misses for OLTP
[Chart: normalized execution time vs. number of outstanding misses: 100.0, 83.2, 79.4, 79.4]
• Support for two distinct outstanding misses is enough
• Limited by data-dependent computation
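The data-dependence point can be made concrete with a generic sketch: pointer chasing serializes misses, so extra outstanding-miss slots go unused, while independent accesses can overlap. This is an illustration of the general effect, not code from the study.

```c
/* Illustrative sketch of why data-dependent loads limit memory-level
 * parallelism. Not code from the study. */

struct node { struct node *next; long val; };

/* Pointer chasing: each load's address depends on the previous load,
 * so misses are serialized; at most one outstanding miss at a time. */
long sum_list(struct node *p) {
    long s = 0;
    while (p) { s += p->val; p = p->next; }
    return s;
}

/* Independent accesses: addresses are known up front, so an OOO core
 * with non-blocking caches can overlap several misses. */
long sum_array(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) s += a[i];
    return s;
}
```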
Impact of ILP Techniques for DSS
[Chart: normalized execution time. In-order processors: 100.0, 89.2, 74.1, 68.1; out-of-order processors: 68.4, 52.1, 39.7, 39.0]
• Multiple issue and OOO issue improve performance by 2.6X
• 4-way issue, a 64-element window, and 4 outstanding misses are enough
• Memory is not a bottleneck
ILP-Enabled Consistency Optimizations
• The memory consistency model of a shared-memory system
  • specifies the ordering and overlap of memory operations
  • performance/programmability tradeoff:
    • sequential consistency (SC)
    • processor consistency (PC)
    • release consistency (RC)
• ILP-enabled consistency optimizations:
  • hardware prefetching, speculative loads
• Impact on database workloads?
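As a concrete illustration of the tradeoff (not part of the original talk, which concerns hardware-level models), C11 atomics expose an analogous software-visible choice between sequentially consistent and release/acquire orderings:

```c
#include <stdatomic.h>

/* Illustrative sketch of the consistency tradeoff using C11 atomics.
 * Stronger ordering (SC) is easiest to reason about; weaker ordering
 * (RC-like release/acquire) permits more overlap and reordering. */
atomic_int data = 0, flag = 0;

void producer_sc(void) {
    /* SC-like: all operations appear in one total order */
    atomic_store(&data, 42);
    atomic_store(&flag, 1);
}

void producer_rc(void) {
    /* RC-like: only the release edge orders the data write before
     * the flag write; other accesses may be freely overlapped */
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release);
}

int consumer_rc(void) {
    if (atomic_load_explicit(&flag, memory_order_acquire))
        return atomic_load_explicit(&data, memory_order_relaxed); /* sees 42 */
    return -1;
}
```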
ILP-Enabled Consistency Optimizations (results)
[Chart: normalized execution time for SC, PC, and RC, without and with ILP-enabled optimizations: 100, 88, 74, 72, 68, 68]
• SC: sequential consistency; PC: processor consistency; RC: release consistency
• With ILP-enabled optimizations:
  • OLTP: RC only 1.1X better than SC (was 1.4X)
  • DSS: RC only 1.18X better than SC (was 1.85X)
• The choice of consistency model in hardware is less important
Outline
• Motivation
• Simulation Environment
• Impact of ILP on Database Workloads
• Improving Performance of OLTP
  • improving OLTP - instruction misses
  • improving OLTP - dirty misses
• Conclusions
Improving OLTP - Instruction Misses
[Chart: normalized execution time: 100, 83, 71]
• 4-element instruction cache stream buffer
  • hardware prefetching of instructions
• 1.21X performance improvement
• Simple and effective for database servers
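The sketch below illustrates the idea of a small instruction stream buffer in simulator-style C; the structure and names are ours, not taken from the study's simulator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hedged sketch of a 4-entry instruction stream buffer. On an I-cache
 * miss to block B, the buffer holds prefetches of B+1, B+2, ...; a
 * later miss that hits the buffer is serviced there instead of going
 * to the next cache level. Block size is assumed, not from the study. */
#define SB_ENTRIES  4
#define BLOCK_SHIFT 6          /* 64-byte blocks (assumed) */

typedef struct {
    uint64_t block[SB_ENTRIES];
    bool     valid[SB_ENTRIES];
    int      head;             /* oldest entry in the FIFO          */
    uint64_t next_fetch;       /* next sequential block to prefetch */
} stream_buf;

/* Called on an I-cache miss; returns true if the stream buffer hits. */
bool sb_lookup(stream_buf *sb, uint64_t pc) {
    uint64_t blk = pc >> BLOCK_SHIFT;
    if (sb->valid[sb->head] && sb->block[sb->head] == blk) {
        /* Hit: pop the head, enqueue the next sequential prefetch. */
        sb->block[sb->head] = sb->next_fetch++;
        sb->head = (sb->head + 1) % SB_ENTRIES;
        return true;
    }
    /* Miss: flush and restart the sequential stream at blk + 1. */
    sb->next_fetch = blk + 1;
    for (int i = 0; i < SB_ENTRIES; i++) {
        sb->block[i] = sb->next_fetch++;
        sb->valid[i] = true;
    }
    sb->head = 0;
    return false;
}
```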
Improving OLTP - Dirty Misses
• Dirty misses:
  • mostly to migratory data
  • due to the few instructions executed in critical sections
• Solutions for migratory reads:
  • software prefetching + producer-initiated flushes
  • preliminary results, without access to source code
  • 1.14X performance improvement
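A sketch of how the prefetch-plus-flush idea might look at the software level. __builtin_prefetch is a real GCC intrinsic; flush_block(), acquire(), and release() are hypothetical stand-ins for the hardware flush mechanism and locking primitives.

```c
/* Hedged sketch of prefetch + producer-initiated flush for migratory
 * data. flush_block() is a HYPOTHETICAL primitive standing in for a
 * mechanism that pushes a dirty line out of the producer's cache so
 * the next reader takes a clean miss instead of a dirty one. */
typedef struct { int lock; long balance; } account;

extern void flush_block(void *addr);   /* hypothetical, see above */
extern void acquire(int *l);           /* hypothetical lock ops   */
extern void release(int *l);

void update(account *a, long delta) {
    __builtin_prefetch(a, 1);   /* prefetch for write: hide the
                                   read-for-ownership miss          */
    acquire(&a->lock);
    a->balance += delta;        /* short critical section:
                                   classic migratory-data pattern   */
    release(&a->lock);
    flush_block(a);             /* producer pushes the line out early */
}
```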
Summary
• Detailed simulation study of Oracle with out-of-order processors
• Impact of ILP techniques on database workloads:
  • improve performance (1.5X OLTP, 2.6X DSS)
  • reduce the performance gap between consistency models
• Improving the performance of OLTP:
  • OLTP is limited by instruction and migratory data misses
  • a small stream buffer comes close to a perfect instruction cache
  • prefetching/flush appear promising
Simultaneous Multithreading (SMT)
• Collaboration with:
  • Kourosh Gharachorloo (Compaq)
  • Jack Lo, Susan Eggers, Hank Levy, Sujay Parekh (U. Washington)
• Exploit the multithreaded nature of commercial applications
  • an aggressive wide-issue OOO superscalar saturates at 4 issue slots
  • potential to increase utilization of issue slots
  • potential to exploit parallelism in the memory system
SMT: what is it?
• SMT enables multiple threads to issue instructions to multiple functional units in a single cycle
• SMT exploits both instruction-level and thread-level parallelism:
  • hides long latencies
  • increases resource utilization and instruction throughput
[Diagram: issue-slot utilization for a superscalar, fine-grain multithreading, and SMT, with four threads]
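One way to picture the issue policy is a per-cycle loop that drains ready instructions from all hardware contexts until the issue width is exhausted; the C sketch below is purely illustrative, with invented data structures.

```c
/* Illustrative sketch of SMT issue-slot filling; not from any real
 * simulator. Unlike fine-grain multithreading (one thread per cycle)
 * or a superscalar (one thread, period), SMT draws from ALL contexts
 * in the same cycle. */
#define NCTX        4
#define ISSUE_WIDTH 8

typedef struct { int ready; /* ready instructions queued */ } context;

int smt_issue_cycle(context ctx[NCTX]) {
    int issued = 0;
    for (int t = 0; t < NCTX && issued < ISSUE_WIDTH; t++) {
        while (ctx[t].ready > 0 && issued < ISSUE_WIDTH) {
            ctx[t].ready--;     /* issue one instruction from thread t */
            issued++;
        }
    }
    return issued;              /* slots filled this cycle */
}
```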
SMT and database workloads
• Pro: SMT is a good match; these workloads can take advantage of SMT's multithreading HW
  • low throughput
  • high cache miss rates
• Con:
  • fine-grain interleaving can cause cache interference
• What software techniques can help avoid interference?
SMT studies: methodology
• Trace-driven simulation
  • same traces used in the previous ILP study
  • new front-end to the SMT simulator
• Used the OLTP and DSS workloads
SMT Configuration
• 21264-like superscalar base, augmented with:
  • up to 8 hardware contexts
  • 8-wide superscalar issue
  • 128KB, 2-way I and D L1 caches, 2-cycle access
  • 16MB, direct-mapped L2 cache, 12-cycle access
  • 80-cycle memory latency
  • 10 functional units (6 integer, 4 of them ld/st; 4 FP)
  • 100 additional integer & FP renaming registers
  • integer and FP instruction queues, 32 entries each
OLTP Characterization
• Memory behavior (1 context, 16 server processes)
• High miss rates & large footprints
Cache interference (16 server processes)
• With an 8-context SMT, many conflict misses
• The DSS data set fits in the L2$
Where are the misses?
[Chart: percent of L2 cache references by miss type (PGA, instructions, metadata, buffer cache) for OLTP and DSS; 16 server processes, 8-context SMT]
• L1 and L2 misses are dominated by PGA references
• Misses result from unnecessary address conflicts
L2$ conflicts: page mapping
• Page coloring can be augmented with a random first seed
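A minimal sketch of the idea, with invented names (this is not Digital Unix code): classic page coloring makes identically laid-out processes collide in a direct-mapped, physically-indexed L2$; a random per-process first seed shifts whole address spaces relative to each other.

```c
#include <stdint.h>

/* Hedged sketch of page-mapping policies for a physically-indexed
 * cache. num_colors = cache size / page size for a direct-mapped
 * cache; all function names are ours. */

/* Classic page coloring: the physical color tracks the virtual page
 * number, so two processes with identical layouts map to the same
 * L2$ sets and conflict. */
unsigned color_page_coloring(uint64_t vpn, unsigned num_colors) {
    return (unsigned)(vpn % num_colors);
}

/* Page coloring with a random first seed: each address space gets a
 * random starting color, spreading processes across the cache while
 * preserving sequential layout within a process. */
unsigned color_with_seed(uint64_t vpn, unsigned seed, unsigned num_colors) {
    return (unsigned)((vpn + seed) % num_colors);
}

/* Bin hopping, for comparison: hand out colors round-robin per
 * address space as pages are first touched. */
unsigned color_bin_hopping(unsigned *next_color, unsigned num_colors) {
    unsigned c = *next_color;
    *next_color = (c + 1) % num_colors;
    return c;
}
```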
Results for different page mapping schemes
[Chart: global L2 cache miss rate (0-10) vs. number of contexts (1, 2, 4, 8) for OLTP and DSS, comparing bin hopping, page coloring, and page coloring with seed; 16MB, direct-mapped L2 cache, 16 server processes]
Why the steady L2$ miss rates?
• Not all of the footprint has temporal locality
• The critical working sets are being cached:
  • 87% of instruction refs are to 31% of the I-footprint
  • 41% of metadata refs are to 26KB
• SMT and superscalar cache misses are comparable
  • SMT changes the interleaving, not the total footprint
• With proper global policies, the working sets still fit in the caches: SMT is effective
L1$ conflicts: application-level offsetting
• The base of each thread's PGA is at the same virtual address
  • causes unnecessary conflicts in a virtually-indexed cache
• Address offsets can avoid the interference
  • offset by thread id * 8KB
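A minimal sketch of the fix, under stated assumptions: the names and the per-thread PGA size are invented, while the 8KB stride matches the offset on the slide.

```c
#include <stdlib.h>

/* Hedged sketch of application-level offsetting; names are ours. If
 * every thread's PGA starts at the same virtual address, a virtually-
 * indexed L1$ maps the hot first pages of all threads onto the same
 * sets. Staggering each base by thread_id * 8KB spreads them out. */
#define PGA_SIZE    (4u * 1024 * 1024)  /* assumed per-thread PGA size */
#define OFFSET_STEP (8u * 1024)         /* 8KB stride, as on the slide */

void *alloc_pga(unsigned thread_id) {
    /* Over-allocate so the staggered base still leaves PGA_SIZE usable. */
    char *raw = malloc(PGA_SIZE + thread_id * OFFSET_STEP);
    if (!raw) return NULL;
    return raw + thread_id * OFFSET_STEP;   /* staggered base */
}
```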
Offsetting results
[Chart: L1 data cache miss rate (0-30) vs. number of contexts (1, 2, 4, 8) for OLTP and DSS, comparing bin hopping without and with offsetting; 128KB, 2-way set-associative L1 cache]
SMT: constructive interference
• Cache interference can also be beneficial
• The instruction segment is shared
• SMT exploits instruction sharing:
  • improves I-cache locality
  • reduces the I-cache miss rate (OLTP): 14% with superscalar, 9% with 8-context SMT