Analytical Evaluation of Shared-Memory Systems with Commercial Workloads Jichuan Chang <chang@cs.wisc.edu> CS747
Outline
• A Case for Analytical Models
• Existing Models and Their Limitations
• What Kind of Tools do We Need

Background
• Shared-memory multiprocessor servers
  • Important: the computing infrastructure of our society
  • Complex systems (ILP processors + caches + interconnection)
• Commercial workloads
  • Important: 80% of the server market, supporting our daily business
  • Behave differently from scientific workloads
    • Large code size and data set, different cache behaviors
    • Lots of OS interactions (context switches), higher I/O rate
  • Hard to study (complex, hard to set up, no source code, moving target)

A Motivating Example
• Bob is designing a next-generation multiprocessor server for commercial workloads. Assume that the largest benchmark he can set up now is a 10G database.
  • How can Bob predict the performance (IPC or tpm) of running the TPC-D benchmark with a 100G database on the future machine?
  • What is the ideal cache hierarchy design for this workload, given his predictions of future technology constants?
• We need tools to characterize the workloads!
• We need tools to prune the vast design space!

Performance Evaluation Tools
• Hardware monitors, binary instrumentation tools
  + Realistic, dynamic information
  - Only work for existing systems; aggregated information
• Program analysis tools (e.g. compilers)
  + Can do global analysis; work well for arrays/loops
  - Little dynamic information; not good for irregular (pointer-based) programs; need source code
• (Full-system) architecture simulators
  + Detailed simulation, realistic results, can simulate future HW
  - Slow (can't extrapolate), complex, can't simulate future SW
• Analytical models
  + Fast, give insights, can predict for future SW/HW combinations
  - Need to create models of multiprocessors with new workloads

Sorin et al. MVA for ILP Multiprocessors
• Application input parameters
  • CV, fM, fsync-write, Pread, Pwrite, …
• Iterate between 2 submodels
  • SB (fraction of time the CPU stalls due to synchronization operations)
  • MB (fraction of time the CPU stalls due to limited MSHR size)
• Surrogate service time inflation
[Figure: ILP processor (L1$, L2$, MSHR) and the rest of the system (bus, NI, switches, DRAM, directories); label: "when MSHR not full"]

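For readers unfamiliar with mean value analysis, the following is a minimal sketch of the textbook exact MVA recursion for a closed, single-class queueing network. It is not the customized model of Sorin et al.: it has no SB/MB submodels and no surrogate service time inflation. The function name `exact_mva`, the mapping of customers to processors, and all numeric values are assumptions made for illustration only.

```python
def exact_mva(demands, think_time, customers):
    """Textbook exact MVA for a closed, single-class queueing network.

    demands    -- service demand at each queueing center (e.g. bus, memory)
    think_time -- time a customer spends outside the centers between requests
    customers  -- number of circulating customers (e.g. processors)
    Returns (throughput, residence times, mean queue lengths).
    """
    queue = [0.0] * len(demands)   # mean queue lengths with 0 customers
    residence = list(demands)
    throughput = 0.0
    for n in range(1, customers + 1):
        # An arriving customer sees the queue left by a network of n-1 customers.
        residence = [d * (1.0 + q) for d, q in zip(demands, queue)]
        # Little's law applied to the whole network.
        throughput = n / (think_time + sum(residence))
        # Little's law applied to each center.
        queue = [throughput * r for r in residence]
    return throughput, residence, queue


# Hypothetical numbers: 4 processors, 50 cycles of computation between
# misses, 4 cycles of bus demand and 10 cycles of memory demand per miss.
x, r, q = exact_mva([4.0, 10.0], think_time=50.0, customers=4)
print(f"throughput = {x:.4f} requests/cycle, residence times = {r}")
```

The recursion adds one customer at a time, so the cost grows linearly with the number of customers and centers, which is part of why such models are fast compared with simulation.
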
Sorin et al. MVA Model
+ Targets system design; answers questions about
  + MSHR size, directory organization, NI latencies, etc.
+ Insight into application behavior
  + Miss rate, burstiness (CV), degree of parallelism (fM)
– Some application parameters (miss rate, fM, fsync-write) depend on architectural parameters
  • Most parameters are insensitive to changes outside the CPU/cache
  • Input parameters are needed for each CPU/cache configuration
  • Caches also interact with the system design (e.g. update protocol)
– Fixed problem size; does not characterize the workload
• Can we break the processor/cache black box into two submodels, one for the processor and one for the cache?
• What would be the application input parameters?

Cache Models (1)
• Stack distance model
  • Estimates capacity misses, based on one access trace
  • Works for inclusive, fully associative caches
  • Has extensions for direct-mapped and set-associative caches
[Figure: a typical access trace, A B B A C A]

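The stack distance idea can be sketched on the trace from the slide. The code below is a minimal illustration, assuming LRU replacement and a fully associative cache; the function names `stack_distances` and `misses` are made up for this example. One pass over the trace yields the distances, and the miss count for any cache size then follows without re-running the trace.

```python
def stack_distances(trace):
    """Return the LRU stack distance of each access (None for a first touch)."""
    stack = []       # LRU stack; the most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Number of distinct blocks touched since the previous access
            # to this block, counting the block itself.
            distances.append(len(stack) - pos)
            stack.pop(pos)
        else:
            distances.append(None)   # cold (compulsory) miss
        stack.append(block)
    return distances


def misses(distances, capacity):
    """Misses in a fully associative LRU cache that holds `capacity` blocks."""
    return sum(1 for d in distances if d is None or d > capacity)


trace = list("ABBACA")
dists = stack_distances(trace)
print(dists)                      # [None, None, 1, 2, None, 2]
for size in (1, 2, 3):
    print(size, misses(dists, size))
```

For this trace a one-block cache misses 5 times, while caches of two or three blocks miss only on the 3 cold accesses, so the 2 extra misses at size one are capacity misses.
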
Cache Models (2)
• Agarwal et al. 1989
  • Models cache block size, working-set transitions, conflict misses, and multiprogramming interference
• Data Reference Model (Tsai/Agarwal 1993)
  • Configuration-independent model for multiprocessors
    • Problem size, number of processors, and block size as parameters
  • Models the sharing pattern for each shared block
  • Assumes a certain data distribution for data-dependent applications (e.g. parallel quicksort)
  • Limitation: simple, iterative programs; well-known algorithms; no significant synchronization

Cache Models (3)
• Mathematical Cache Miss Equations
  • Compiler-generated equations for loop-based array accesses
  • Model reuse along array dimensions with a "reuse vector"
• Extended to model pointer data structures
  • Singly linked lists and binary trees on a uniprocessor
  • Must understand the malloc() implementation
  • Ultimate aim is to model B-trees for databases

Architects’ Workload Characterization
• Observe for different configurations
  • Busy/stall time breakdown
  • Kernel/user time breakdown
  • Miss breakdown (4C)
  • Last-touch prediction
• Observe for different problem sizes
  • Working set and working-set transitions
  • Sharing degree (producer-consumer, migratory)

What Tools do We Need
• Application models for commercial workloads
  • What to model? (working set, sharing, communication, etc.)
  • Include problem size as an input parameter
  • Configuration independent (or less dependent)
  • Algorithm-based (needs source code)
  • Or observation-based (on simulations)
• Architectural models
  • Separate processor core and caches
  • Separate CPU and the rest of the system [Sorin et al]
• Model vs. simulation
  • Analytical models to simplify simulator design [CAECW 01]
  • Simulators to ease the acquisition of model parameters

Configuration Independent Analysis
• What to characterize? [Abandah/Davidson]
  • General characteristics
  • Working set (access age, footprint)
  • Concurrency (serial / imbalance / contention / busy)
  • Communication pattern (sharing degree / invalidation degree)
  • Communication phases and locality, sharing behavior
• Possible parameters for workload characterization
• An example: working-set sizes in DSS systems
  • Application parameters (for each node i in the query plan)
    • Ni = number of tuples in a scan; Hi = probability that a tuple matches
    • QD = depth of the query tree
    • DB_REi = fraction of a relation accessed
  • Model the reuse after working-set transitions (instructions, private, metadata, index, tuple locks, tuples)

A (simplistic?) Model for TPC-C
• Use the stack distance curve to derive miss rates
• L1 cache accesses are totally overlapped with execution
• M/G/1 queue to model bus/memory contention
• Things not being modeled
  • Query algorithms
  • Communication misses
  • Overlap between computation and memory access
• The paper reports <10% error. [Zhang et al 99]

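As a rough illustration of the M/G/1 step, the sketch below evaluates the standard Pollaczek-Khinchine mean waiting time, W = ρ·E[S]·(1 + CV²) / (2·(1 − ρ)) with ρ = λ·E[S]. It is not taken from Zhang et al.; the function name `mg1_mean_wait` and all numeric values (miss arrival rate, service time, coefficient of variation) are assumptions chosen only to show the shape of the calculation.

```python
def mg1_mean_wait(arrival_rate, mean_service, cv_service):
    """Mean time a request waits in an M/G/1 queue before service starts.

    arrival_rate -- Poisson arrival rate of requests (per cycle)
    mean_service -- mean service time of the bus/memory (cycles)
    cv_service   -- coefficient of variation of the service time
    """
    rho = arrival_rate * mean_service          # utilization
    if not 0.0 <= rho < 1.0:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    # Pollaczek-Khinchine formula for the mean waiting time.
    return rho * mean_service * (1.0 + cv_service ** 2) / (2.0 * (1.0 - rho))


# Hypothetical numbers: one miss every 20 cycles on average and a
# 10-cycle memory service time, so the memory is 50% utilized.
for cv in (0.0, 1.0):
    wait = mg1_mean_wait(arrival_rate=0.05, mean_service=10.0, cv_service=cv)
    print(f"CV = {cv}: wait = {wait:.2f} cycles, latency = {10.0 + wait:.2f} cycles")
```

In a model of this kind, the arrival rate would come from the miss rates derived from the stack distance curve, and the resulting latency would feed back into the execution time estimate.
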
Conclusion
• Analytical models are needed to
  • Characterize commercial workloads
  • Predict their performance on multiprocessors
• We need models that
  • Perform configuration-independent analysis
  • Can use the output from workload models

Thank You! Questions?

Backup Slides
• References
• Acknowledgement

References
• Cache Models
  • An Analytical Cache Model, Agarwal et al., ACM Transactions on Computer Systems, 1989
  • Analyzing Multiprocessor Cache Behavior Through Data Reference Modeling, Tsai and Agarwal, SIGMETRICS 93
  • An Analytical Model for Designing Memory Hierarchies, Jacob et al., IEEE Transactions on Computers, 1996
  • Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behavior, Ghosh et al., ACM Transactions on Programming Languages and Systems, 1999
  • A Mathematical Cache Miss Analysis for Pointer Data Structures, Zhang and Martonosi, SIAM
• Commercial Workloads Overview
  • Trends in Shared Memory Multiprocessing, Stenstrom et al., IEEE Computer 97
  • Memory System Characterization of Commercial Workloads, Barroso et al., ISCA 98

References (cont.)
• Configuration Independent Analysis
  • Configuration Independent Analysis for Characterizing Shared-memory Applications, Abandah and Davidson, UMich TR 1997
• Shared Memory Multiprocessor Models
  • Analytical Evaluation of Shared-memory Systems with ILP Processors, Sorin et al., ISCA 98
  • A Customized MVA Model for Shared-memory Systems with Heterogeneous Applications, Sorin et al., UWisc TR, 2000
• Commercial Workload Specific Models
  • An Analytical Model of the Working-set Sizes in Decision-Support Systems, Karlsson et al., SIGMETRICS 2000
  • Analysis of Commercial Workload on SMP Multiprocessors, Zhang et al., Proceedings of Performance 99
• Evaluation of Commercial Workloads
  • A Processor Queueing Simulation Model for Multiprocessor System Performance Analysis, Tsuei and Yamamoto, CAECW 2001
  • Evaluating the Non-determinism in Commercial Workloads, Multifacet group, CAECW 2001