230 likes | 336 Views
Using Sampled and Incomplete Profiles. David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu. Overview. Trace-based simulation is expensive (caches are getting larger, CPUs and networks are getting faster)
E N D
Using Sampled and Incomplete Profiles David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli@ece.neu.edu
Overview • Trace-based simulation is expensive (caches are getting larger, CPUs and networks are getting faster) • Approximate results are many times sufficient • How can we utilize a reduced or sampled profile and still obtain accurate modeling/simulation results?
How does sampling affect our Metrics for Evaluating Trace Collection Methodologies • Speed – sampled profiles reduce speed requirements • Memory – sampled profiles may take less space • Accuracy – sampled profiles are less accurate • Intrusiveness – sampling is less intrusive • Completeness – no change • Granularity – may affect our ability to capture fast events (nyquist) • Flexibility – no change • Portability – clock speed may affect sampling accuracy • Capacity – sampling should reduce this • Cost – potentially less cost and less time
Application Areas: • Memory system performance – cache simulation, working set models, temporal and spatial locality • CPU pipeline modeling – instruction frequencies, instruction sequences (2-at-a-time, 3-at-a time, pipeline snapshots) • Network simulation – input traffic distribution, queue lengths, throughput, burstiness
Memory Systems • Temporal locality – addresses will be referenced close in time • Spatial locality – addresses close by will be referenced next • Working set models • Belady (1966) – Virtual memory page replacement algorithms (Optimal replacement defined) • Denning (1980) – The pattern of page access over the execution of the program • Thiebaut and Stone (1986) – The number of misses incurred due to task switches can be modeled as a binomial distribution
Memory Systems • Cold start – How do we model behavior when a program starts execution for the first time? • Warm start – How do we model behavior when a program resumes execution? • How does memory organization (e.g., set associativity)and typical program behavior affect our ability to utilize sampled profiles effectively? • Do we sample in time or can we also sample in space (e.g., a set)?
Memory Systems • If we only capture a subset of the important addresses, can we reproduce the full trace? • Abstract Execution (Larus 1990) – basic block traces • Trace Reduction (Smith 1977, Puzak 1985) – generate a reduced trace that contains the exact same number of misses and writebacks as the original trace (a type of filtering is performed, similar to Agarwal’s Block Filter) • Trace compaction (Samples 1989) – perform a diff on sequential addresses and only capture important diffs
Can we do something simple?Laha 1988 Fu 1994 sample n sample n+1 ignore ignore ignore sampling interval sample size sampling ratio = sample size/sample interval • Two types of errors • sampling errors – is the ratio optimal? • accurately predicting effects of ignored portions • But this only suggests when to sample, not what to sample…
Sampling Rate:What is the Nyquist Frequency for an Program? • We must sample a program at a rate of twice the frequency of the event of interest • Half the sampling frequency is termed the Nyquist frequency • Sampling at lower rates than twice the Nyquist frequency can cause aliasing and distortion • The biggest problem is that not all events of interest in a program exhibit a nice periodic pattern
Example: gprof( ) • Produces 3 things • A listing of the total execution times and call counts for each of the functions in the program, sorted by decreasing time • The functions sorted according to the time they represent, including the time of their call graph descendents • Total execution in a cycle and the members in that cycle (a cycle is a back edge) • gprof samples a program’s execution • Obtains exact call statistics • Does not obtain exact time measurements • Accuracy is obtained through statistical sampling • Sampling reduces the associated overhead
Sampling dimensions • We can sample in both time and space • Time (periodic sampling) • Filtered sampling (e.g., using addresses ranges) • Time • Periodic • Random • #misses, #instructions, #loads/stores • Space • Address ranges – may not be representative of all ranges • Cache sets – may limit the utility of the sample • Statically tagged events – focuses in on particular instructions and data of interest
How do we account for unknown references? • Assume that this behavior does not affect past/future behavior • Assume that some percentage of past behavior is overwritten by unknown reference behavior • Decay model • Footprints in the cache • MRU model • Assume that all of the past behavior is overwritten by the unknown reference behavior
How do we model the effects of multiprogramming?? • Flush all tables (caches, branch predictors, TLBs, load buffers, etc.) • Estimate interference using a model • Invalidate some % of all entries based on the relative time since last execution • Utilize working set models and utilize these to estimate the effect of the interference • Allow aliasing to occur where appropriate (branch predictors, but not caches)
Sampled Instruction Execution Profiles • Instruction frequencies • For SPEC92int programs on a IA32 CPU • 43% ALU, 22% loads, 12% stores, 23% control flow (H&P AQA) • Instruction sequences • Top pairs • Top triples • Continue up until average BB size • Branches in the pipeline • Sampled versus modeled
Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • We an capture distributions of the distance between events • We can then compute the probability of different events occurring in a pipeline of length n (think of n as a window of execution) • We can also compute the conditional probabilities of multiple events occurring in the pipeline of length n • We can then assign weights to each of these multiple events to compute the throughput (IPC) of a pipeline • Our model uses random marked point process (time between events) and produces very accurate estimates of pipeline throughput
Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • Some constraints are assumed in the sake of simplicity: • Inter-branch times are independent • Inter-branch times are identically distributed • Delay due to taken branches is constant
Analytical Models of Workload BehaviorSquillante and Kaeli, 1997
Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • A analytical formula is used to compute approximate CPI and speedup measures for an n-stage pipelined processor • Traces are captured from benchmark execution • The distribution of the number of instructions between successive taken branches is computed using a “window of execution” filter
Analytical Models of Workload BehaviorSquillante and Kaeli, 1997
Analytical Models of Workload BehaviorSquillante and Kaeli, 1997 • For a pipeline length of 8, the obtained results are quite precise for three out of the five benchmarks • For bubblesort and prime, the IID assumption is violated, thus introducing some inaccuracies in our model • Future work looks at handling inaccuracies that provide multi-level conditional probabilities into our model
Capturing n-length Instruction Sequences • Sampling over n sequential instructions • Capturing the most frequently executed sequences • Utilizing these sequences to drive pipeline design • Capturing longer profiles may allow us to predict design hardware trace caches • Gonzalez, Tubella and Molina describe a mechanism for both profiled instructions and operand values