On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers
Per Stenström
Department of Computer Engineering, Chalmers, Göteborg, Sweden
http://www.ce.chalmers.se/~pers
Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson
In collaboration with Sun Microsystems and Ericsson Research
Motivation
• Database applications dominate (32%)
• Yet, the major focus is on scientific/eng. apps (16%)
Project Objective
• Design principles for high-performance memory systems for emerging applications
• Systems considered:
  • high-performance compute nodes
  • SMP and DSM systems built out of them
• Applications considered:
  • Decision support and on-line transaction processing
  • Emerging applications
    • computer graphics
    • video/sound coding/decoding
    • handwriting recognition
    • ...
Outline
• Experimental platform
• Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
• Concluding remarks
Experimental Platform
[Figure: simulation platform with the application and operating system (Linux) running on single- and multiprocessor system models (SPARC V8 CPU, cache ($), memory (M), and devices: interrupt, Ethernet, TTY, SCSI)]
• Platform enables
  • Analysis of comm. workloads
  • Analysis of OS effects
  • Tracking architectural events to OS or appl. level
Outline
• Experimental platform
• Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
• Concluding remarks
Decision-Support Systems (DSS)
[Figure: query plan tree with a scan node at level 1 and join/scan pairs at levels 2 through i]
• Compile a list of matching entries in several database relations
• Will moderately sized caches suffice for huge databases?
Our Findings
[Figure: cache miss rate vs. cache size, with knees at MWS, DWS1, DWS2, ..., DWSi]
• MWS: footprint of instructions and private data needed to access a single tuple
  • typically small (< 1 Mbyte) and not affected by database size
• DWS: footprint of database data (tuples) accessed across consecutive invocations of the same scan node
  • typically small impact (~0.1%) on overall miss rate
Methodological Approach
Challenges:
• Not feasible to simulate huge databases
• Need source code: we used PostgreSQL and MySQL
Approach:
• Analytical model using
  • parameters that describe the query
  • parameters measured on downscaled query executions
  • system parameters
Footprints and Reuse Characteristics in DSS
[Figure: footprints per tuple access in the query plan tree, from MWS and DWS1 at the level-1 scan node up to MWS and DWSi at the level-i join/scan pair]
• MWS: instructions, private data, and metadata
  • can be measured on a downscaled simulation
• DWS: all tuples accessed at lower levels
  • can be computed based on query composition and probability of match
Analytical Model: An Overview
• Goal: predict the miss rate versus cache size for fully associative caches with an LRU replacement policy on single-processor systems
• Number of cold misses: size of footprint / block size
  • |MWS| is measured
  • |DWSi| is computed from parameters describing the query (size of relations, probability of matching a search criterion, index versus sequential scan, etc.)
• Number of capacity misses per tuple access at level i (C = cache size):
  • CM0 * (1 - (C - C0) / (|MWS| - C0))  if C0 < C < |MWS|
  • size of tuple / block size  if |MWS| <= C < |MWS| + |DWSi|
• Number of accesses per tuple: measured
• Total number of misses and accesses: computed
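To make the piecewise model concrete, below is a minimal C sketch of the capacity-miss term, assuming the reconstruction of the slide's formula given above; the function and parameter names (capacity_misses_per_tuple, cm0, c0, mws_size, dws_size) are illustrative and not taken from the original study.

/* Sketch of the capacity-miss term of the model above, per tuple access
 * at level i, for a fully associative LRU cache of size c (bytes).
 * Assumed reconstruction of the slide's piecewise expression; cold
 * misses are simply footprint / block_size and are added separately. */
double capacity_misses_per_tuple(double c,          /* cache size */
                                 double c0,         /* smallest (downscaled) cache size */
                                 double cm0,        /* capacity misses measured at c0 */
                                 double mws_size,   /* |MWS|, measured */
                                 double dws_size,   /* |DWSi|, computed from the query */
                                 double tuple_size,
                                 double block_size)
{
    if (c > c0 && c < mws_size)                      /* C0 < C < |MWS| */
        return cm0 * (1.0 - (c - c0) / (mws_size - c0));
    if (c >= mws_size && c < mws_size + dws_size)    /* |MWS| <= C < |MWS| + |DWSi| */
        return tuple_size / block_size;
    return 0.0;                                      /* the whole footprint fits */
}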
Model Validation
[Figure: miss ratio vs. cache size (Kbyte) for Q3 on PostgreSQL: 3 levels, 1 sequential scan, 2 index scans, 2 nested-loop joins]
Goal:
• Prediction accuracy for queries with different compositions
  • Q3, Q6, and Q10 from TPC-D
• Prediction accuracy when scaling up the database
  • parameters measured at 5 Mbytes used to predict 200 Mbyte databases
• Robustness across database engines
  • Two engines: PostgreSQL and MySQL
Model Predictions: Miss Rates for Huge Databases
• The miss-rate component from instructions, private data, and metadata decays rapidly (by 128 Kbytes)
• The miss-rate component for database data is small
• What's in the tail?
Outline
• Experimental platform
• Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
• Concluding remarks
Cache Issues for Linked Data Structures
• Pointer-chasing shows up in many interesting applications:
  • 35% of the misses in OLTP (TPC-B)
  • 32% of the misses in an expert system
  • 21% of the misses in Raytrace
• Traversal of lists may exhibit poor temporal locality
• Results in chains of data-dependent loads, called pointer-chasing
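As an illustration (a sketch, not code from the slides), the pointer-chasing pattern in C: the address of each node becomes known only after the previous node's next pointer has been loaded, so the misses form a serial chain of full memory latencies. The node layout and the payload field are hypothetical.

#include <stddef.h>

/* Hypothetical list node; field names are illustrative only. */
struct node {
    int          payload;
    struct node *next;
};

/* Plain traversal: each iteration's load depends on the pointer loaded
 * in the previous iteration, so cache misses cannot be overlapped.
 * This serial dependence is what the slide calls pointer-chasing. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;      /* data-dependent load chain */
    return sum;
}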
SW Prefetch Techniques to Attack Pointer-Chasing
• Greedy Prefetching (G)
  • falls short when the computation per node < memory latency
• Jump Pointer Prefetching (J)
  • falls short for short lists or when the traversal is not known a priori
• Prefetch Arrays (P.S / P.H)
  • a generalization of G and J that addresses the above shortcomings
  • trade memory space and bandwidth for more latency tolerance
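To show how the three schemes differ, here are minimal C sketches using the GCC/Clang __builtin_prefetch intrinsic; the node layout, the jump field, and the pf_array field are assumptions for illustration, not the authors' implementation.

/* Node with an extra jump pointer several elements ahead (assumed layout). */
struct pnode {
    int           payload;
    struct pnode *next;
    struct pnode *jump;             /* points a few nodes ahead in the list */
};

/* Greedy prefetching (G): prefetch the immediate successor while the
 * current node is processed; hides latency only if the per-node work
 * is at least as long as the memory latency. */
long sum_greedy(const struct pnode *head)
{
    long sum = 0;
    for (const struct pnode *n = head; n; n = n->next) {
        __builtin_prefetch(n->next);
        sum += n->payload;
    }
    return sum;
}

/* Jump-pointer prefetching (J): prefetch via the jump pointer so the
 * prefetch distance can match the latency; the first few nodes of the
 * list have no jump pointers aimed at them and are not covered. */
long sum_jump(const struct pnode *head)
{
    long sum = 0;
    for (const struct pnode *n = head; n; n = n->next) {
        __builtin_prefetch(n->jump);
        sum += n->payload;
    }
    return sum;
}

/* Software prefetch array (P.S): a small array of pointers to the first
 * nodes is kept with the list head and prefetched up front, covering
 * exactly the nodes that jump pointers miss. */
struct plist {
    struct pnode *head;
    struct pnode *pf_array[4];      /* pointers to the first few nodes */
};

long sum_prefetch_array(const struct plist *l)
{
    for (int i = 0; i < 4; i++)
        __builtin_prefetch(l->pf_array[i]);
    return sum_jump(l->head);       /* jump pointers cover the rest */
}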
Results: Hash Tables and Lists in Olden
[Figure: results for HEALTH and MST with bars for the baseline (B), greedy (G), jump pointers (J), software prefetch arrays (P.S), and hardware prefetch arrays (P.H)]
Prefetch arrays do better because:
• MST has short lists and little computation per node
• They prefetch data for the first nodes in HEALTH, unlike jump-pointer prefetching
Results: Tree Traversals in OLTP and Olden
[Figure: results for DB.tree and Tree.add with bars for B, G, J, P.S, and P.H]
Hardware-based prefetch arrays do better because:
• The traversal path is not known in DB.tree (depth-first search)
• Data for the first nodes is prefetched in Tree.add
Other Results in Brief
• Impact of longer memory latencies:
  • Robust for lists
  • For trees, prefetch arrays may cause severe cache pollution
• Impact of memory bandwidth:
  • Performance improvements are sustained at bandwidths typical of high-end servers (2.4 Gbytes/s)
  • Prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
• Node insertion and deletion for jump pointers and prefetch arrays:
  • Results in instruction overhead (-); however, insertion/deletion is itself sped up by prefetching (+)
Outline
• Experimental platform
• Memory system issues studied
  • Working set size in DSS workloads
  • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
  • Coherence issues in OLTP workloads
• Concluding remarks
Coherence Issues in OLTP
[Figure: a DSM system built from SMP nodes, each node containing processors (P) with caches ($) and memories (M)]
• Favorite protocol: write-invalidate
• Ownership overhead: invalidations cause write stalls and invalidation traffic
Ownership Overhead in OLTP
Simulation setup:
• CC-NUMA with 4 nodes
• MySQL, TPC-B, 600 MB database
Finding: 40% of all ownership transactions stem from load/store sequences
Techniques to Attack Ownership Overhead
• Dynamic detection of migratory sharing
  • detects two load/store sequences by different processors
  • covers only a subset of all load/store sequences (~40% in OLTP)
• Static detection of load/store sequences
  • compiler algorithms that tag a load followed by a store and bring the block into the cache in exclusive state
  • poses problems in TPC-B
New Protocol Extension
Criterion: on a load miss from processor i followed by a global store from i, tag the block as Load/Store
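For concreteness, a hedged C sketch (not from the study) of the kind of load/store sequence these techniques target: a read-modify-write of a shared record. Under plain write-invalidate, the load miss fetches the block in shared state and the subsequent store stalls on an ownership request; a block tagged Load/Store (or a compiler-inserted read-exclusive hint) is fetched in exclusive state already at the load miss, removing that second transaction. The account structure and field names are illustrative assumptions.

#include <stdatomic.h>

/* Hypothetical shared record; names are illustrative only. */
struct account {
    atomic_flag lock;
    long        balance;
};

/* Classic load/store sequence on the same cache block:
 *   1. the load of 'balance' misses and brings the block in shared state
 *   2. the store then needs ownership, adding an invalidation
 *      transaction and a write stall under plain write-invalidate.
 * If the protocol has tagged the block as Load/Store, or a compiler
 * has marked the load as read-exclusive, step 1 already obtains the
 * block in exclusive state and step 2 hits locally. */
void credit(struct account *a, long amount)
{
    while (atomic_flag_test_and_set(&a->lock))
        ;                               /* spin until the lock is free */
    a->balance = a->balance + amount;   /* load ... then store */
    atomic_flag_clear(&a->lock);
}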
Concluding Remarks
• The focus on DSS and OLTP has revealed challenges not exposed by traditional applications:
  • Pointer-chasing
  • Load/store optimizations
• Application scaling is not fully understood
  • Our work on combining simulation with analytical modeling shows some promise