
Using Sampled and Incomplete Profiles


Presentation Transcript


  1. Using Sampled and Incomplete Profiles David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu

  2. Overview • Trace-based simulation is expensive (caches are getting larger, CPUs and networks are getting faster) • Approximate results are often sufficient • How can we utilize a reduced or sampled profile and still obtain accurate modeling/simulation results?

  3. How does sampling affect our Metrics for Evaluating Trace Collection Methodologies? • Speed – sampled profiles reduce speed requirements • Memory – sampled profiles may take less space • Accuracy – sampled profiles are less accurate • Intrusiveness – sampling is less intrusive • Completeness – no change • Granularity – may affect our ability to capture fast events (Nyquist) • Flexibility – no change • Portability – clock speed may affect sampling accuracy • Capacity – sampling should reduce this • Cost – potentially less cost and less time

  4. Application Areas: • Memory system performance – cache simulation, working set models, temporal and spatial locality • CPU pipeline modeling – instruction frequencies, instruction sequences (2-at-a-time, 3-at-a-time, pipeline snapshots) • Network simulation – input traffic distribution, queue lengths, throughput, burstiness

  5. Memory Systems • Temporal locality – addresses will be referenced close in time • Spatial locality – addresses close by will be referenced next • Working set models • Belady (1966) – Virtual memory page replacement algorithms (Optimal replacement defined) • Denning (1980) – The pattern of page access over the execution of the program • Thiebaut and Stone (1986) – The number of misses incurred due to task switches can be modeled as a binomial distribution
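Denning's working-set idea can be made concrete with a short sketch (illustrative only; the function name and the toy page trace are invented, not from the slides): the working set at time t is the set of distinct pages touched in the last tau references.

```python
def working_set_sizes(trace, tau):
    """Size of the working set W(t, tau): the number of distinct
    pages referenced in the window of the last tau references
    ending at time t."""
    sizes = []
    for t in range(len(trace)):
        window = trace[max(0, t - tau + 1): t + 1]
        sizes.append(len(set(window)))
    return sizes

# Toy page-reference trace (hypothetical):
trace = [1, 2, 1, 3, 1, 2, 4, 4, 1]
print(working_set_sizes(trace, tau=3))  # [1, 2, 2, 3, 2, 3, 3, 2, 2]
```

A program whose working-set size stays small and stable over such windows is a good candidate for sampled profiling, since short samples capture most of its active pages.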

  6. Memory Systems • Cold start – How do we model behavior when a program starts execution for the first time? • Warm start – How do we model behavior when a program resumes execution? • How does memory organization (e.g., set associativity) and typical program behavior affect our ability to utilize sampled profiles effectively? • Do we sample in time or can we also sample in space (e.g., a set)?

  7. Memory Systems • If we only capture a subset of the important addresses, can we reproduce the full trace? • Abstract Execution (Larus 1990) – basic block traces • Trace Reduction (Smith 1977, Puzak 1985) – generate a reduced trace that contains the exact same number of misses and writebacks as the original trace (a type of filtering is performed, similar to Agarwal’s Block Filter) • Trace compaction (Samples 1989) – perform a diff on sequential addresses and only capture important diffs
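The diff-based compaction idea can be sketched as follows (a hypothetical minimal encoder, not the published format): small diffs between sequential addresses are stored as deltas, and only the "important" diffs, i.e. large jumps, keep the full address.

```python
def compact(trace, threshold=8):
    """Keep the full address only when the delta from the previous
    address exceeds the threshold; otherwise store the small delta."""
    out = []
    prev = None
    for addr in trace:
        if prev is None or abs(addr - prev) > threshold:
            out.append(('A', addr))         # full address record
        else:
            out.append(('D', addr - prev))  # compact delta record
        prev = addr
    return out

def expand(records):
    """Reconstruct the original trace from the compacted records."""
    trace, prev = [], None
    for kind, val in records:
        prev = val if kind == 'A' else prev + val
        trace.append(prev)
    return trace

trace = [100, 104, 108, 200, 204, 203]
assert expand(compact(trace)) == trace  # lossless round trip
```

Since sequential code and stride-1 data accesses dominate many traces, most records end up as small deltas that compress well.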

  8. Can we do something simple? Laha 1988, Fu 1994 [figure: a trace divided into alternating "sample n … ignore … sample n+1" windows, labeling the sampling interval and the sample size] sampling ratio = sample size / sampling interval • Two types of errors • sampling errors – is the ratio optimal? • accurately predicting the effects of the ignored portions • But this only suggests when to sample, not what to sample…
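The periodic scheme above can be sketched directly (function and parameter names invented): keep the first `size` references out of every `interval` references and discard the rest.

```python
def periodic_samples(trace, interval, size):
    """Periodic sampling: retain the first `size` references of
    every `interval` references; ignore the remainder.
    sampling ratio = size / interval."""
    return [x for i, x in enumerate(trace) if i % interval < size]

trace = list(range(20))
print(periodic_samples(trace, interval=10, size=3))  # [0, 1, 2, 10, 11, 12]
# sampling ratio = 3 / 10 = 0.3
```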

  9. Sampling Rate: What is the Nyquist Frequency for a Program? • We must sample a program at a rate of at least twice the frequency of the event of interest • Half the sampling frequency is termed the Nyquist frequency • Sampling at less than twice the frequency of the event of interest can cause aliasing and distortion • The biggest problem is that not all events of interest in a program exhibit a nice periodic pattern
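A toy illustration of the aliasing point (synthetic event stream, not from the slides): an event that fires every 4th instruction is still visible when sampled every 2 instructions, but sampling every 8 instructions lands on the event every time and makes it look constant.

```python
# Event of interest fires every 4th instruction (frequency 1/4).
signal = [1 if i % 4 == 0 else 0 for i in range(32)]

# Sampling every 2 instructions (rate 1/2 >= 2 * 1/4): the periodic
# pattern is still visible in the samples.
ok = [signal[i] for i in range(0, 32, 2)]

# Sampling every 8 instructions aliases: every sample hits an event,
# so we would wrongly conclude the event happens all the time.
aliased = [signal[i] for i in range(0, 32, 8)]

print(ok)       # [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(aliased)  # [1, 1, 1, 1]
```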

  10. Example: gprof() • Produces 3 things • A listing of the total execution times and call counts for each of the functions in the program, sorted by decreasing time • The functions sorted according to the time they represent, including the time of their call-graph descendents • The total execution time within each cycle and the members of that cycle (a cycle arises from a back edge in the call graph) • gprof samples a program’s execution • Obtains exact call statistics • Does not obtain exact time measurements • Accuracy is obtained through statistical sampling • Sampling reduces the associated overhead

  11. When to sample: Cold start vs. Warm start

  12. Sampling dimensions • We can sample in both time and space • Time (periodic sampling) • Filtered sampling (e.g., using address ranges) • Time • Periodic • Random • #misses, #instructions, #loads/stores • Space • Address ranges – may not be representative of all ranges • Cache sets – may limit the utility of the sample • Statically tagged events – focuses on particular instructions and data of interest
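Sampling in space over cache sets might look like this minimal direct-mapped sketch (cache parameters, names, and the linear scale-up are invented for illustration): simulate only the chosen sets, then scale the sampled miss count by the inverse of the fraction of sets simulated.

```python
def set_sampled_misses(trace, num_sets, sampled_sets, block=16):
    """Direct-mapped cache, simulating only the chosen sets; the
    sampled miss count is scaled up to estimate total misses."""
    cache = {}      # set index -> resident block number
    misses = 0
    for addr in trace:
        blk = addr // block
        s = blk % num_sets
        if s not in sampled_sets:
            continue            # reference maps to an unsampled set
        if cache.get(s) != blk:
            cache[s] = blk
            misses += 1
    return misses * num_sets / len(sampled_sets)

trace = [0, 16, 0, 32, 16]
full = set_sampled_misses(trace, 2, {0, 1})  # all sets: exact, 3.0
est  = set_sampled_misses(trace, 2, {0})     # one set, scaled: 4.0
```

As the slide warns, if the sampled sets are not representative, the scaled estimate (4 vs. the true 3 here) can be biased.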

  13. How do we account for unknown references? • Assume that this behavior does not affect past/future behavior • Assume that some percentage of past behavior is overwritten by unknown reference behavior • Decay model • Footprints in the cache • MRU model • Assume that all of the past behavior is overwritten by the unknown reference behavior
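The two boundary assumptions above — unknown references have no effect vs. they overwrite all past behavior — bracket the true miss count. A toy sketch (invented, using an unbounded fully associative cache for simplicity):

```python
def simulate(samples, flush_between):
    """Count misses over the sampled windows of a trace.
    flush_between=False models 'unknown references have no effect';
    flush_between=True models 'all past behavior is overwritten'."""
    cache, misses = set(), 0
    for sample in samples:
        if flush_between:
            cache = set()       # ignored gap wipes out all history
        for addr in sample:
            if addr not in cache:
                cache.add(addr)
                misses += 1
    return misses

samples = [[1, 2, 3], [2, 3, 4]]   # two sampled windows (hypothetical)
lo = simulate(samples, flush_between=False)  # optimistic bound: 4
hi = simulate(samples, flush_between=True)   # pessimistic bound: 6
```

The decay, footprint, and MRU models above all aim to land between these two bounds by estimating how much history the ignored portion actually destroys.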

  14. How do we model the effects of multiprogramming? • Flush all tables (caches, branch predictors, TLBs, load buffers, etc.) • Estimate interference using a model • Invalidate some % of all entries based on the relative time since last execution • Utilize working set models to estimate the effect of the interference • Allow aliasing to occur where appropriate (branch predictors, but not caches)
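The "invalidate some %" option can be sketched as follows (names and the fixed fraction are invented; in practice the fraction would be derived from the time since the task last ran):

```python
import random

def partial_invalidate(cache, fraction, rng):
    """Invalidate a random `fraction` of cache entries to model
    interference from other tasks during a context-switch gap."""
    victims = rng.sample(sorted(cache), int(len(cache) * fraction))
    for v in victims:
        cache.discard(v)
    return cache

rng = random.Random(0)          # seeded for reproducibility
cache = set(range(100))         # 100 resident entries (hypothetical)
partial_invalidate(cache, 0.25, rng)
print(len(cache))               # 75
```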

  15. Sampled Instruction Execution Profiles • Instruction frequencies • For SPEC92int programs on a IA32 CPU • 43% ALU, 22% loads, 12% stores, 23% control flow (H&P AQA) • Instruction sequences • Top pairs • Top triples • Continue up until average BB size • Branches in the pipeline • Sampled versus modeled

  16. Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • We can capture distributions of the distance between events • We can then compute the probability of different events occurring in a pipeline of length n (think of n as a window of execution) • We can also compute the conditional probabilities of multiple events occurring in the pipeline of length n • We can then assign weights to each of these multiple events to compute the throughput (IPC) of a pipeline • Our model uses a random marked point process (time between events) and produces very accurate estimates of pipeline throughput
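Under the model's simplifying constraints (IID inter-branch times, constant taken-branch delay), a back-of-the-envelope CPI estimate follows from the mean inter-taken-branch distance. This tiny sketch is illustrative only (invented names and numbers), not the paper's full marked-point-process model:

```python
def expected_cpi(distances, branch_penalty):
    """distances: empirical inter-taken-branch distances (instructions
    between successive taken branches). With one instruction per cycle
    plus a constant stall per taken branch:
        CPI ~= 1 + branch_penalty / E[distance]."""
    mean = sum(distances) / len(distances)
    return 1 + branch_penalty / mean

# Hypothetical measured distances and a 3-cycle taken-branch delay:
print(expected_cpi([4, 6, 5, 5], branch_penalty=3))  # 1.6
```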

  17. Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • Some constraints are assumed for the sake of simplicity: • Inter-branch times are independent • Inter-branch times are identically distributed • Delay due to taken branches is constant

  18. Analytical Models of Workload BehaviorSquillante and Kaeli, 1997

  19. Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • An analytical formula is used to compute approximate CPI and speedup measures for an n-stage pipelined processor • Traces are captured from benchmark execution • The distribution of the number of instructions between successive taken branches is computed using a “window of execution” filter

  20. Analytical Models of Workload BehaviorSquillante and Kaeli, 1997

  21. Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • For a pipeline length of 8, the obtained results are quite precise for three out of the five benchmarks • For bubblesort and prime, the IID assumption is violated, thus introducing some inaccuracies in our model • Future work looks at incorporating multi-level conditional probabilities into our model to handle these inaccuracies

  22. Capturing n-length Instruction Sequences • Sampling over n sequential instructions • Capturing the most frequently executed sequences • Utilizing these sequences to drive pipeline design • Capturing longer profiles may help us design hardware trace caches • Gonzalez, Tubella and Molina describe a mechanism for profiling both instructions and operand values
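Capturing the most frequent n-instruction sequences can be sketched with a sliding window over a linear trace (illustrative trace and names, not from the slides):

```python
from collections import Counter

def top_sequences(instrs, n, k):
    """Return the k most frequent n-instruction sequences in a trace."""
    grams = Counter(tuple(instrs[i:i + n])
                    for i in range(len(instrs) - n + 1))
    return grams.most_common(k)

# Hypothetical instruction-class trace:
trace = ['ld', 'add', 'st', 'ld', 'add', 'st', 'br']
print(top_sequences(trace, 3, 1))  # [(('ld', 'add', 'st'), 2)]
```

Feeding the top pairs and triples from such counts into a pipeline model gives the "2-at-a-time, 3-at-a-time" statistics mentioned on slide 15.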
