Workload Characteristics and Representative Workloads
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
kaeli@ece.neu.edu
Overview
• When we collect profiles to be used in the design of a next-generation computing system, we must be careful to capture a representative sample in our profile
• Workload characteristics allow us to better understand the content of the samples we collect
• We need to select programs to study that represent the class of applications that will eventually run on the target system
Examples of Workload Characteristics
• Instruction mix (see the sketch below)
• Static vs. dynamic instruction count
• Working set behavior
• Control flow behavior
• Inter-reference gap model (temporal locality)
• Database size
• Address and value predictability
• Application/library/OS breakdown
• Heap vs. stack allocation
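To make two of these characteristics concrete, here is a minimal Python sketch that computes an instruction mix and a working-set estimate from a trace. The (opcode class, data address) trace format and the 64-byte block granularity are assumptions for illustration, not any particular tool's output.

```python
# Hypothetical sketch: computing two workload characteristics (instruction
# mix and working-set size) from a simple trace. The trace format and the
# 64-byte block size are assumptions, not a real tool's output.
from collections import Counter

def characterize(trace):
    """trace: iterable of (opcode_class, data_address) tuples."""
    mix = Counter()
    working_set = set()
    for op_class, addr in trace:
        mix[op_class] += 1
        if addr is not None:             # only loads/stores touch memory
            working_set.add(addr // 64)  # track unique 64-byte blocks
    total = sum(mix.values())
    for op_class, count in mix.most_common():
        print(f"{op_class:8s} {100.0 * count / total:5.1f}%")
    print(f"working set: {len(working_set)} blocks")

characterize([("load", 0x1000), ("add", None), ("store", 0x1040),
              ("branch", None), ("load", 0x1000)])
```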
Benchmarks
• Real or synthetic programs/applications used to exercise a hardware system, generally representing a range of behaviors found in a particular class of applications
• Benchmark classes include:
  • Toy benchmarks: Towers of Hanoi, qsort, fibo
  • Synthetic benchmarks: Dhrystone, Whetstone
  • Embedded: EEMBC, UTDSP
  • CPU benchmarks: SPECint/fp
  • Internet benchmarks: SPECjAppServer, SPECjvm
  • Commercial benchmarks: TPC, SAP, SPECjAppServer
  • Supercomputing: Perfect Club, SPLASH, Livermore Loops
Is there a way to reduce the runtime of SPEC while maintaining representativeness?
• MinneSPEC (U. of Minnesota) – using gprof statistics about the runtime of SPEC and various SimpleScalar simulation results (instruction mix, cache misses, etc.), we can capture statistically similar, though significantly shorter, runs of the programs
• Provides three input sets that will run in:
  • A few minutes
  • A few hours
  • A few days
• Described in an IEEE Computer Architecture Letters paper
How many programs do we need?
• Does the set of applications capture enough variation to be “representative” of the entire workload class?
• Should we consider using multiple benchmark suites to factor out similarities in programming styles?
• Can we utilize workload characteristics to identify the particular programs of interest?
Example of capturing program behavior: “Quantifying Behavioral Differences Between C and C++ Programs,” Calder, Grunwald and Zorn, 1995
• C++ is a programming language growing in popularity
• We need to design tomorrow’s computer architectures based on tomorrow’s software paradigms
• How do workload characteristics change as we move to new programming paradigms?
Example of capturing program behavior: “Quantifying Behavioral Differences Between C and C++ Programs,” Calder, Grunwald and Zorn, 1995
• First problem – find a set of representative programs from both the procedural and object-oriented (OO) domains
  • Difficult for OO in 1995
• Differences between programming models:
  • OO relies heavily on messages and methods
  • Data locality will change due to the bundling together of data structures in objects
  • The size of functions will be reduced in OO
  • Polymorphism allows for indirect function invocation and runtime method selection
  • OO programs will manage a larger number of dynamically allocated objects
Address and Value Profiling
• Lipasti observed that profiled instructions tend to repeat their behavior:
  • Many addresses are nearly constant
  • Many values do not change from one execution of an instruction to the next
• Can we use profiles to better understand some of these behaviors, and then utilize this knowledge to optimize execution?
Address Profiling
• If an address remains unchanged, can we issue loads and stores early (similar to prefetching)?
• Do we even have to issue the load or store if we have not modified memory?
• What are the implications if indirect addressing is used?
• Can we detect patterns (i.e., strides) in the address values? (See the sketch below.)
• Can we do anything smart when we detect pointer chasing?
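As an illustration of the stride question, here is a minimal Python sketch that scans the address stream of a single static load/store and reports a stride only after seeing it repeat. The two-in-a-row confidence rule is an assumption for illustration, not a specific published predictor.

```python
# A minimal sketch of stride detection over one instruction's profiled
# address stream. The confidence scheme (same stride twice in a row) is
# an illustrative assumption.
def detect_strides(addresses):
    """addresses: successive data addresses issued by one static load/store."""
    last_addr, last_stride, confident = None, None, False
    for addr in addresses:
        if last_addr is not None:
            stride = addr - last_addr
            confident = (stride == last_stride)  # stride repeated?
            last_stride = stride
        last_addr = addr
    return last_stride if confident else None

print(detect_strides([0x100, 0x108, 0x110, 0x118]))  # 8 (steady 8-byte stride)
print(detect_strides([0x100, 0x108, 0x200]))         # None (no stable stride)
```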
Data Value Profiling
• When we see that particular data values do not change, how can we take advantage of this?
• Lipasti noticed that a large percentage of store instructions overwrite memory with the value already stored there (measured in the sketch below)
• Can we avoid computing new results if we notice that our input operands have not changed?
• What can we do if a particular operand only takes on a small set of values?
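A hedged sketch of measuring this “silent store” behavior, i.e., stores that overwrite memory with the value already there. The trace format (a list of (address, value) pairs for dynamic stores) is an assumption for illustration.

```python
# Sketch: fraction of stores that write back the value already in memory.
# The store-trace format is an illustrative assumption.
def silent_store_fraction(store_trace):
    memory = {}          # simulated memory image: address -> last stored value
    silent = 0
    for addr, value in store_trace:
        if memory.get(addr) == value:
            silent += 1  # store is "silent": memory is unchanged
        memory[addr] = value
    return silent / len(store_trace) if store_trace else 0.0

trace = [(0x10, 7), (0x10, 7), (0x20, 3), (0x10, 8), (0x10, 8)]
print(f"silent stores: {silent_store_fraction(trace):.0%}")  # 40%
```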
Parameter Value Profiling
• Profile the parameter values passed to functions
• If these parameters are predictable, we can exploit this fact during compilation
• We can study this on an individual-function basis or a call-site basis
• Compiler optimizations such as code specialization and function cloning can be used
Parameter Value Profiling
• We have profiled a set of MS Windows NT 4.0 desktop applications:
  • Word97
  • FoxPro 6.0
  • SQL Server 7.0
  • VC++ 6.0
  • Excel97
  • PowerPoint97
  • Access97
• We measured the value predictability of parameter values for all non-pointer-based parameters
Parameter Value Profiling
• We look for the predictability of parameters using two metrics (sketched below):
  • Invariance 1 – probability that the most frequent value is passed
  • Invariance 4 – probability that one of the 4 most frequent values is passed
• Parameter values are more predictable on a call-site basis than on a function basis (e.g., for Word97, 8% of functions pass highly predictable parameters, whereas over 16% of individual call sites do)
• “Highly predictable” means that the same value is observed on over 90% of the calls
• We will discuss how to clone and specialize procedures when we discuss profile-guided data transformations
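A minimal sketch of the Invariance 1 / Invariance 4 metrics, computed for one call site from a log of observed parameter values; the log format is an assumption for illustration.

```python
# Sketch: Invariance-k = fraction of calls whose parameter value is among
# the k most frequent values seen at this call site. The value log below
# is illustrative.
from collections import Counter

def invariance(values, top_k):
    counts = Counter(values)
    top = sum(count for _, count in counts.most_common(top_k))
    return top / len(values)

observed = [0, 0, 0, 1, 0, 2, 0, 0, 3, 0]   # values passed at one call site
inv1 = invariance(observed, 1)   # 0.7 -> most frequent value (0)
inv4 = invariance(observed, 4)   # 1.0 -> one of the 4 most frequent values
print(inv1, inv4, "highly predictable" if inv1 >= 0.9 else "not")
```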
How can we reduce the runtime of a single benchmark and still maintain accuracy?
• SimPoint – attempts to collect a set of trace samples that best represents the whole execution of the program
  • Identifies phase behavior in programs
  • Considers a metric that captures the difference between two samples
  • Computes this difference for each pair of intervals
  • Selects the interval that is closest to all other intervals
SimPoint (Calder ASPLOS 2002)
• Utilize basic block frequencies to build basic block vectors (bbf_0, bbf_1, …, bbf_{n-1})
  • Each frequency is weighted by the length (instruction count) of its basic block
  • The entire vector is normalized by dividing by the total number of basic blocks executed
• Take fixed-length samples (100M instructions)
• Compare BBVs using the following metrics (both sketched below):
  • Euclidean distance: ED(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
  • Manhattan distance: MD(a, b) = \sum_{i=1}^{n} |a_i - b_i|
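A small Python sketch of the two distance metrics, assuming each basic block vector has already been weighted by block length and normalized so its elements sum to 1.

```python
# Sketch of the two BBV distance metrics; the example vectors are
# illustrative stand-ins for normalized BBVs of two intervals.
import numpy as np

def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

bbv_x = np.array([0.50, 0.30, 0.20])
bbv_y = np.array([0.40, 0.40, 0.20])
print(euclidean_distance(bbv_x, bbv_y))  # ~0.141
print(manhattan_distance(bbv_x, bbv_y))  # 0.2
```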
SimPoint (Calder ASPLOS 2002)
• Manhattan distance is used to build a similarity matrix
  • N x N matrix, where N is the number of sampling intervals in the program
  • Element SM(x, y) is the Manhattan distance between the BBVs of the 100M-instruction intervals at sample offsets x and y
• Plot the similarity matrix as an upper triangle (see the sketch below)
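A minimal numpy sketch of building the N x N similarity matrix from per-interval BBVs; only the upper triangle is kept, matching the plots described above.

```python
# Sketch: pairwise Manhattan distances between all interval BBVs via
# broadcasting; SM[x, y] = sum_i |bbv_x[i] - bbv_y[i]|. Example data is
# an illustrative stand-in.
import numpy as np

def similarity_matrix(bbvs):
    """bbvs: (N, n) array, one normalized basic block vector per interval."""
    return np.abs(bbvs[:, None, :] - bbvs[None, :, :]).sum(axis=-1)

bbvs = np.array([[0.5, 0.3, 0.2],
                 [0.4, 0.4, 0.2],
                 [0.1, 0.1, 0.8]])
sm = similarity_matrix(bbvs)
print(np.triu(sm))   # upper triangle: distance between each interval pair
```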
SimPoint (Calder ASPLOS 2002)
• Basic block vectors can adequately capture the necessary representative characteristics of a program
• Distance metrics can help to identify the most representative samples
• Cluster analysis (k-means) can improve representativeness by selecting multiple samples
SimPoint (Calder ISPASS 2004)
• Newer work on SimPoint considers using register def-use behavior on an interval basis
• Also, tracking loop behavior and procedure call frequencies provides accuracy similar to using basic block vectors
SimPoint (Calder ASPLOS 2002)
• Algorithm overview:
  • Profile the program by dividing it into fixed-size intervals (e.g., 1M, 10M, 100M instructions)
  • Collect frequency vectors (e.g., BBVs, def-use, etc.) and compute normalized frequencies
  • Run the k-means clustering algorithm to divide the set of intervals into k partitions/sets, for values of k from 1 to K
  • Compute a “goodness of fit” of the data for each value of k
  • Select the clustering that keeps k small yet provides a reasonable “goodness-of-fit” result
• The result is a selection of representative simulation points that best “fit” the entire application execution (see the sketch below)
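A hedged sketch of the selection step for one chosen k: cluster the per-interval vectors with k-means and pick, from each cluster, the interval nearest the centroid as a simulation point, weighted by the cluster's share of execution. The search over k with a goodness-of-fit score is omitted for brevity, and scikit-learn's KMeans stands in for the paper's implementation.

```python
# Sketch: pick one representative simulation point per k-means cluster.
# The random vectors at the bottom are stand-ins for normalized BBVs.
import numpy as np
from sklearn.cluster import KMeans

def pick_simpoints(vectors, k):
    """vectors: (N, n) array of normalized per-interval frequency vectors."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    simpoints = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[members] - km.cluster_centers_[c],
                               axis=1)
        # representative interval plus its weight (cluster share of execution)
        simpoints.append((int(members[np.argmin(dists)]),
                          len(members) / len(vectors)))
    return simpoints

rng = np.random.default_rng(0)
vectors = rng.random((40, 16))
print(pick_simpoints(vectors, k=4))   # (interval index, weight) pairs
```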
Improvements to SimPoint (KDD 2005, ISPASS 2006)
• Utilize a mixture of multinomials instead of k-means
  • Assumes the data is generated by a mixture of K component density functions
• We utilize Expectation-Maximization (EM) to find a local maximum likelihood for the parameters of the density functions – iterate on the E and M steps until convergence (see the sketch below)
• The number of clusters is selected using the Bayesian Information Criterion (BIC) to judge “goodness of fit”
• “A multinomial clustering model for fast simulation of computer architecture designs,” K. Sanghai et al., Proc. of KDD 2005, Chicago, IL, pp. 808-813
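A hedged numpy sketch of EM for a mixture of multinomials over raw BBV count vectors, scored with a BIC-style criterion as described above; the initialization, smoothing constant, and synthetic data are illustrative assumptions.

```python
# Sketch: EM for a K-component multinomial mixture plus a BIC score.
# Hyperparameters, smoothing, and the synthetic "phases" are assumptions.
import numpy as np

def fit_multinomial_mixture(X, K, iters=200, eps=1e-9):
    """X: (N, n) nonnegative count matrix (e.g. raw basic block counts)."""
    N, n = X.shape
    rng = np.random.default_rng(0)
    theta = rng.random((K, n)) + eps            # component multinomials
    theta /= theta.sum(axis=1, keepdims=True)
    pi = np.full(K, 1.0 / K)                    # mixing weights
    for _ in range(iters):
        # E-step: responsibilities from per-component log-likelihoods
        # (the multinomial coefficient is parameter-independent and
        # cancels in the responsibilities, so it is dropped)
        log_p = X @ np.log(theta).T + np.log(pi)     # (N, K)
        m = log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p - m)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and multinomial parameters
        pi = resp.mean(axis=0) + eps
        pi /= pi.sum()
        theta = resp.T @ X + eps
        theta /= theta.sum(axis=1, keepdims=True)
    # log-likelihood under the final parameters, then a BIC-style score
    log_p = X @ np.log(theta).T + np.log(pi)
    m = log_p.max(axis=1, keepdims=True)
    ll = float(np.sum(m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1))))
    free = (K - 1) + K * (n - 1)                # free parameter count
    bic = ll - 0.5 * free * np.log(N)           # higher is better
    return log_p.argmax(axis=1), bic

# synthetic counts from two distinct "phases"; pick K with the best BIC
rng = np.random.default_rng(1)
X = np.vstack([rng.poisson([9, 1, 1, 1, 9, 1, 1, 1], (20, 8)),
               rng.poisson([1, 9, 1, 9, 1, 1, 9, 1], (20, 8))])
for K in (1, 2, 3):
    labels, bic = fit_multinomial_mixture(X, K)
    print(K, round(bic, 1))
```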
Summary
• Before you begin studying a new architecture, have a clear understanding of the target workloads for the system
• Perform a significant amount of workload characterization before you begin profiling work
• Benchmarks are very useful tools, but must be used properly to obtain meaningful results
• Value profiling is a rich area for future research
• SimPoint can be used to reduce the runtime of simulation while maintaining simulation fidelity