Using Machine Learning to Guide Architecture Simulation Greg Hamerly (Baylor University), Erez Perelman, Jeremy Lau, Brad Calder (UCSD), Timothy Sherwood (UCSB) Journal of Machine Learning Research 7 (2006) http://cseweb.ucsd.edu/~calder/papers/JMLR-06-SimPoint.pdf Presented by: John Tully Dept of Computer & Information Sciences University of Delaware
Simulation is Critical! • Allows engineers to understand cycle-level behavior of processor before fabrication • Can play with design options cheaply. How are performance, complexity, area, power affected when I make modification X, and remove feature Y?
But... Simulation is SLOW • Modelling at cycle level is very slow • SimpleScalar in cycle-accurate mode: a few hundred million cycles per hour • Modelling at gate level is very, very, very slow • ETI cutting-edge emulation technology: 5,000 cycles/second (24 hours = ~1 second of Cyclops-64 instructions).
Demands are increasing • Size of benchmarks: applications can be quite large. • Number of programs: industry-standard benchmarks are large suites, many focused on variety (e.g. SPEC – 26 programs that stress ALUs, FPUs, memory, cache, etc.) • Iterations required: experimenting with just one feature (e.g. cache size) can take hundreds of thousands of benchmark runs.
‘Current’ Remedies • Simulate programs for N instructions (whatever your timeframe allows), and just stop. • Similarly, fast-forward through the initialization portion, and then simulate N instructions. • Simulate N instructions from only the “important” (most computationally intensive) portions of a program. • None of these work well, and at their worst they are embarrassing: error rates of almost 4,000%!
SimPoint to the Rescue • 1. As a program executes, its behavior changes. The changes aren’t random – they’re structured as sequences of recurring behavior (termed phases). • 2. If repetitive, structured behavior can be identified, then we only need to sample each unique behavior of a program (not the whole thing) to get a picture of its execution profile. • 3. How can we identify repetitive, structured behavior? Use machine learning! • Now only a small set of samples is needed. Collect a point from each phase (the simulation points) and weight them – this accurately depicts execution of the entire program (see the sketch below).
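How the weighted simulation points turn into a whole-program estimate – a minimal Python sketch in the spirit of the paper (not the SimPoint tool itself); the CPIs and phase weights below are made-up illustrative values:

def estimate_whole_program_cpi(point_cpis, weights):
    """Weighted average of per-simulation-point CPIs.

    point_cpis[i] -- CPI from detailed simulation of point i
    weights[i]    -- fraction of all intervals belonging to point i's phase
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "phase weights should sum to 1"
    return sum(cpi * w for cpi, w in zip(point_cpis, weights))

# Hypothetical numbers: three phases covering 50% / 30% / 20% of execution
cpis = [1.8, 0.9, 2.4]
weights = [0.5, 0.3, 0.2]
print("estimated whole-program CPI:", estimate_whole_program_cpi(cpis, weights))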
Defining Phase Behavior • Seems pretty easy at first... let's just collect hardware-based statistics, and classify phases accordingly • CPI (performance) • Cache Miss Rates • Branch Statistics (Frequency, Prediction Rate) • FPU instructions per cycle • But what's the problem here?
Defining Phase Behavior • Problem: if we use hardware-based stats, we're tying phases to the architectural configuration! • Every time we tweak the architecture, we must re-define phases! • Underlying methodology: identify phases without relying on architectural metrics. Then we can find a set of samples that can be used across our entire design space. • But what can we use that's independent of hardware-based statistics, yet still relates to fundamental changes in what the hardware is doing?
Defining Phase Behavior • Basic Block Vector (BBV): a structure designed to capture how a program changes behavior over time. • A distribution of how many times each basic block is executed over an interval (can be stored as a 1-D array). • Each entry is weighted by the # of instructions in the BB (so all instructions have equal weight) – see the sketch below. • Subsets of information in BBVs can also be extracted • Register usage vectors • Loop / branch execution frequencies
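What building one BBV might look like – a hedged Python sketch, where the basic-block trace and per-block instruction counts are hypothetical stand-ins for a real profiler's output:

from collections import Counter

def build_bbv(trace, bb_sizes, num_blocks):
    """One normalized BBV for one interval of execution."""
    counts = Counter(trace)                                      # executions per basic block
    bbv = [counts[b] * bb_sizes[b] for b in range(num_blocks)]   # weight by BB instruction count
    total = sum(bbv)
    return [x / total for x in bbv] if total else bbv            # normalize to a distribution

# Toy example: 4 basic blocks and a tiny interval trace
bb_sizes = {0: 5, 1: 12, 2: 3, 3: 8}        # instructions per basic block
trace = [0, 1, 1, 2, 1, 3, 0, 1]            # basic blocks executed in this interval
print(build_bbv(trace, bb_sizes, num_blocks=4))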
Defining Phase Behavior • Now we can use BBVs to find patterns in the program. But can we show they're useful? • Detailed study by Lau et al.: very strong correlation between the following: • 1) Difference between the BBV of an interval and the BBV of the whole program (code changes) • 2) CPI of the interval (performance) • Graphic on next slide... • Things are looking really good now – we can create a set of phases (and therefore points to simulate) by ONLY looking at executed code.
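The flavor of that check as a hedged Python sketch – the per-interval BBVs and CPIs are random stand-ins (so the printed correlation is meaningless); only the computation itself (Manhattan distance to the whole-program BBV vs. each interval's CPI deviation) mirrors the study:

import numpy as np

rng = np.random.default_rng(0)
interval_bbvs = rng.dirichlet(np.ones(50), size=200)       # 200 intervals x 50 basic blocks (stand-in)
interval_cpis = rng.uniform(0.5, 3.0, size=200)            # stand-in per-interval CPIs
program_bbv = interval_bbvs.mean(axis=0)                   # whole-program BBV

code_distance = np.abs(interval_bbvs - program_bbv).sum(axis=1)   # Manhattan distance in code space
cpi_deviation = np.abs(interval_cpis - interval_cpis.mean())      # how unusual each interval's CPI is

print("correlation:", np.corrcoef(code_distance, cpi_deviation)[0, 1])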
Extracting Phases • Next step: how do I actually turn my BBVs into phases? • Create a function to compare two BBVs: how similar are they? • Use machine-learning data clustering algorithms to group similar BBVs. Each cluster (set of similar points) = a phase! • SimPoint is the implementation of this • Profiles programs (divides them into intervals and creates a BBV for each). • Uses the k-means clustering algorithm. Input includes the granularity of clusters – that dictates the size and abundance of phases! (See the clustering sketch below.)
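A toy sketch of that clustering step – scikit-learn's KMeans stands in for SimPoint's own k-means implementation, the BBV matrix is randomly generated toy data, and the real tool additionally reduces BBV dimensionality with a random projection before clustering:

import numpy as np
from sklearn.cluster import KMeans

def cluster_bbvs(bbvs, k):
    """Group intervals with similar code signatures into k phases."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(bbvs)
    return km.labels_, km.cluster_centers_      # phase ID per interval, phase centroids

bbvs = np.random.default_rng(0).dirichlet(np.ones(30), size=100)   # 100 intervals x 30 BBs (toy data)
labels, centers = cluster_bbvs(bbvs, k=4)
print("intervals per phase:", np.bincount(labels))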
Choosing Simulation Pts • Final step: choose simulation points. From each phase, SimPoint chooses one representative interval that will be simulated (in full detail) to represent the whole phase. • All points in the phase are (theoretically) similar in performance statistics – so we can extrapolate. • Machine learning is also used to pick the representative point of a cluster (the interval to use from a phase). • Points are weighted based on interval size (and phase size, of course). • This only needs to be done once per program + input combination – remember why?
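One reasonable way to implement that selection, as a hedged sketch (not necessarily SimPoint's exact rule): pick the interval whose BBV is closest to the phase centroid and weight it by the phase's share of intervals. The tiny BBV / label / centroid arrays are made up for illustration:

import numpy as np

def choose_simulation_points(bbvs, labels, centers):
    """Return (interval index, weight) for one simulation point per phase."""
    points = []
    for phase, center in enumerate(centers):
        members = np.where(labels == phase)[0]                    # intervals in this phase
        dists = np.linalg.norm(bbvs[members] - center, axis=1)    # distance to phase centroid
        representative = int(members[np.argmin(dists)])           # closest interval stands for the phase
        weight = len(members) / len(labels)                       # share of execution it represents
        points.append((representative, weight))
    return points

# Toy data: 6 intervals over 2 basic blocks, already clustered into 2 phases
bbvs = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
                 [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
labels = np.array([0, 0, 0, 1, 1, 1])
centers = np.array([[0.85, 0.15], [0.15, 0.85]])
print(choose_simulation_points(bbvs, labels, centers))    # e.g. [(2, 0.5), (5, 0.5)]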
Choosing Simulation Pts • User can tweak interval length, # of clusters, etc. – a tradeoff between accuracy and simulation time (more, finer-grained points capture behavior better but take longer to simulate).
Experimental Framework • Test Programs: SPEC Benchmarks (26 applications, about half integer, half FP; designed to stress all aspects of a processor.) • Simulation: SimpleScalar, Alpha architecture. • Metrics: accuracy of simulation measured in CPI prediction error
Million Dollar Question... • How does phase classification do? • SPEC2000, 100 million instruction intervals, no more than 10 simulation points • Gzip, Gcc: only 4 and 8 phases found, respectively
Million Dollar Question... • How accurate is this thing? • A lot better than “current” methods.....
Million Dollar Question... • How much time are we saving? • In the previous result, SimPoint simulates only 400-800 million instructions in detail. According to the SPEC benchmark data sheets, the 'reference' input configurations run 50 billion and 80 billion instructions, respectively. • So the baseline simulation needed to execute ~100 times more instructions for this configuration – it took several months! • Imagine if we needed to run a few thousand combinations of cache size, memory latency, etc.... • Intel / Microsoft use it – must be pretty good.
Putting it all together • First implementation of machine learning techniques to perform program phase analysis. • Main thing to take away: applications (even complex ones) only exhibit a few unique behaviors – they're simply interleaved with each other over time. • Using machine learning, we can find these behaviors with methods that are independent of architectural metrics. • By doing so, we only need to simulate a few carefully chosen intervals, which greatly reduces simulation time.
Related / Future Work • Other clustering algorithms with the same data (multinomial clustering, regression trees) – k-means appears to do the best. • “Un-tie” simulation points from the binary – how could we do this? • Map behavior back to the source level after detecting it • Now we can use the same simulation points for different compilations / inputs of a program • Accuracy is just about as good as with fixed intervals (Lau et al.)