1 / 24

Workload Characteristics and Representative Workloads

Workload Characteristics and Representative Workloads. David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu. Overview.

tierra
Download Presentation

Workload Characteristics and Representative Workloads

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Workload Characteristicsand Representative Workloads David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu

  2. Overview • When we want to collect profiles to be used in the design of a next-generation computing system, we need to be very careful that we capture a representative sample in our profile • Workload characteristics allow us to better understand the content of the samples which we collect • We need to select programs to study which represent the class of applications we will eventually run on the target system

  3. Examples of Workload Characteristics • Instruction mix • Static vs. dynamic instruction count • Working sets • Control flow behavior • Working set behavior • Inter-reference gap model (temporal locality) • Database size • Address and value predictability • Application/library/OS breakdown • Heap vs. stack allocation

  4. Benchmarks • Real or synthetic programs/applications used to exercise a hardware system, and generally representing a range of behaviors found in a particular class of applications • Benchmarks classes include: • Toy benchmarks: tower of hanoi, qsort, fibo • Synthetic benchmarks: dhrystone, whetstone • Embedded: EEMBC, UTDSP • CPU benchmarks: SPECint/fp • Internet benchmark: SPECjAppServ, SPECjvm • Commerical benchmark: TPC, SAP, SPECjAppServ • Supercomputing: Perfect Club, Splash, Livermore Loops

  5. SPEC Benchmarks Presentation

  6. Is there a way to reduce the runtime of SPEC while maintaining representativeness? • MinneSPEC (U. of Minn) – Using gprof statistics about the runtime of SPEC and various Simplescalar simulation results (I-mix, cache misses, etc), we can capture statistically similar, though significantly shorter, runs of the programs • Provides three input sets will run in: • A few minutes • A few hours • A few days • IEEE Computer Arch Letters paper

  7. How many programs do we need? • Does the set of application capture enough variation to be “representative” of the entire class of workload? • Should we consider using multiple benchmark suites to factor out similarities in programming styles? • Can we utilize workload characteristics to identify the particular programs of interest?

  8. Example of capturing program behavior: “Quantifying Behavioral Differences Between C and C++ Programs” Calder, Grunwald and Zorn, 1995 • C++ is a programming language growing in popularity • We need to design tomorrow’s computer architectures based on tomorrow’s software paradigms • How do workload characteristics changes as we move to new programming paradigms?

  9. Example of capturing program behavior: “Quantifying Behavioral Differences Between C and C++ Programs” Calder, Grunwald and Zorn, 1995 • First problem – Find a set of representative programs from both FO and OO domains • Difficult for OO in 1995 • Differences between programming models • OO relies heavily on messages and methods • Data locality will change due to the bundling together of data structures in Objects • Size of functions will be reduced in OO Polymorphism allows for indirect function invocation and runtime method selection • OO programs will manage a larger number of dynamically allocated objects

  10. Address and Value Profiling • Lipasti observed that profiled instructions tend to repeat their behavior • Many addresses are nearly constant • Many values do not change between instruction execution • Can we use profiles to better understand some of these behaviors, and the until this knowledge to optimize execution?

  11. Address Profiling • If an address remains unchanged, can we issue loads and store early (similar to prefetching)? • Do we even have to issue the load or store if we have not modified memory? • What are the implications if indirect addressing is used? • Can we detect patterns (i.e., strides) in the address values? • Can we do anything smart when we detect pointer chasing??

  12. Data Value Profiling • When we see that particular data values do not change, how can we take advantage of this? • Lipasti noticed that a large percentage of store instructions overwrite memory with the value already stored there • Can we avoid computing new results if we notice that our input operand have not changed? • What can we do if we a particular operand only takes on a small set of values?

  13. Parameter Value Profiling • Profile the parameter values passed to functions • If these parameters are predictable, we can exploit this fact during compilation • We can study this on an individual function basis or a call site basis • Compiler optimizations such as code specialization and function cloning can be used

  14. Parameter Value Profiling • We have profiled a set of MS Windows NT 4.0 desktop applications • Word97 • Foxpro 6.0 • SQLserver 7.0 • VC++ 6.0 • Excel97 • Powerpoint97 • Access97 • We measured the value predictability of parameter values for all non-pointer based parameters

  15. Parameter Value Profiling • We look for the predictability of parameters using: • Invariance 1 – probability that the most frequent value is passed • Invariance 4 – probability that one of the 4 most frequent values is passed • Parameter values are more predictable on a call site basis than on a function basis (e.g., for Word97, 8% of the functions pass highly predictable parms, where as when computed on individual call sites, over 16% of the call sites pass highly predictable parms) • Highly predictable means that on over 90% of the calls the same value is observed • We will discuss how to clone and specialize procedures when we discuss profile guided data transformations

  16. How can we reduce the runtime of a single benchmark and still maintain accuracy? • Simpoint – attempt to collect a set of trace samples that best represents the whole execution of the program • Identifies phase behavior in programs • Considers a metric that captures the differences between two samples • Computes the difference between these two intervals • Selects the interval that is closest to all other intervals

  17. Simpoint (Calder ASPLOS 2002) • Utilize basic block frequencies to build basic block vectors (bbf0, bbf1….bbfn-1) • Each frequency is weighted by its length • Entire vector normalized by dividing by total number of basic blocks executed • Take fixed-length samples (100M instructions) • Compare BBVs using: • Euclidean Distance: • ED(a, b) = sqrt(sum(i->1,n) (ai-bi)2) • Manhattan Distance: • MD(a, b) = sum(i->1,n)(|ai-bi|)

  18. Simpoint (Calder ASPLOS 2002) • Manhattan Distance is used to build a similarity matrix • N x N matrix, where N is the number of sampling intervals in the program • Element SM(x, y) is the Manhattan Distance between two 100M element BBV at sample offsets x and y • Plot the Similarity Matrix as an upper triangle

  19. Simpoint (Calder ASPLOS 2002) • Basic Block Vectors can adequately capture the necessary representative characteristics of a program • Distance metrics can help to identify the most representative samples • Cluster analysis (k-means) can improve representativeness by selecting multiple samples

  20. Simpoint (Calder ISPASS 2004) • Newer work on Simpoint considers using register def-use behavior on an interval basis • Also, tracking of loop behavior and procedure calls frequencies provides similar accuracy as using basis block vectors

  21. Simpoint (Calder ASPLOS 2002) • Algorithm overview • Profile program by dividing into fixed sized intervals (e.g., 1M, 10M, 100M insts) • Collect frequency vectors (e.g., BBVs, def-use, etc.) – compute normalized frequencies • Run k-means clustering algorithm to divide the set of intervals into k partitions/sets, for values of k from 1 to K • Compute a “goodness-of-fit” of the data for each value of k • Select the clustering the reduces small k and provides a reasonable “goodness-of-fit” result • The result is a selection of representative simulation points that best “fit” the entire application execution

  22. Simpoint Paper

  23. Improvements to Simpoints (KDD05, ISPASS06) • Utilize a Mixture of Multinomials instead of K-means • Assumes data is generated by a mixture of K-component density functions • We utilize Expectation-Maximization (EM) to find a local maximum likelihood for the parameters of the density function – iterate on E and M steps until convergence • The number of clusters is selected using the Bayesian Information Criteria (BIC) approach to judge “goodness of fit” • “A multinomial clustering model for fast simulation of computer architecture designs”, K. Sanghai et al., Proc. of KDD 2005, Chicago, IL., pp. 808-813.

  24. Summary • Before you begin studying a new architecture, have a clear understanding of the target workloads for this system • Perform a significant amount of workload characterization before you begin profiling work • Benchmarks are very useful tools, but must be used properly to obtain meaningful results • Value profiling is a rich area for future research • Simpoints can be used to reduce the runtime of simulation and still maintain simulation fidelity

More Related