
Presented by: John Tully Dept of Computer & Information Sciences University of Delaware

Using Machine Learning to Guide Architecture Simulation. Greg Hamerly (Baylor University); Erez Perelman, Jeremy Lau, Brad Calder (UCSD); Timothy Sherwood (UCSB). Journal of Machine Learning Research 7 (2006). http://cseweb.ucsd.edu/~calder/papers/JMLR-06-SimPoint.pdf


Presentation Transcript


  1. Using Machine Learning to Guide Architecture Simulation Greg Hamerly (Baylor University) Erez Perelman, Jeremy Lau, Brad Calder (UCSD) Timothy Sherwood (UCSB) Journal of Machine Learning Research 7 (2006) http://cseweb.ucsd.edu/~calder/papers/JMLR-06-SimPoint.pdf Presented by: John Tully Dept of Computer & Information Sciences University of Delaware

  2. Simulation is Critical! • Allows engineers to understand cycle-level behavior of processor before fabrication • Can play with design options cheaply. How are performance, complexity, area, power affected when I make modification X, and remove feature Y?

  3. But... Simulation is SLOW • Modelling at cycle level is very slow • Simplescalar in cycle-accurate mode: a few hundred million cycles per hour • Modelling at gate level is very, very, very slow • ETI cutting-edge emulation technology: 5,000 cycles/second (24 hours = ~1 second of Cyclops-64 instructions).
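The "24 hours = ~1 second" figure on this slide can be checked with a few lines of arithmetic. This is only a sketch: the 5,000 cycles/second rate comes from the slide, while the 500 MHz clock is an illustrative assumption (the slide does not state the Cyclops-64 clock rate).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Emulation throughput quoted on the slide.
emulated_cycles_per_second = 5_000

# Cycles the emulator gets through in one day of wall-clock time.
cycles_per_day = emulated_cycles_per_second * SECONDS_PER_DAY


def target_seconds(cycles, clock_hz):
    """Time the real chip would need to execute `cycles` cycles."""
    return cycles / clock_hz


# ASSUMPTION: illustrative clock rate, not a figure taken from the slides.
assumed_clock_hz = 500e6

print(cycles_per_day)                                    # 432000000
print(target_seconds(cycles_per_day, assumed_clock_hz))  # 0.864
```

Under that assumed clock, a full day of emulation covers a bit under one second of real execution, which agrees with the slide's rough figure.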

  4. Demands are increasing • Size of benchmarks: applications can be quite large. • Number of programs: industry-standard benchmarks are large suites. Many focus on variety (e.g. SPEC: 26 programs that stress ALUs, FPUs, memory, cache, etc.) • Iterations required: just experimenting with one feature (cache size) can take hundreds of thousands of benchmark runs

  5. ‘Current’ Remedies • Simulate programs for N instructions (whatever your timeframe allows), and just stop. • Similarly, fast-forward through the initialization portion, and then simulate N instructions. • Simulate N instructions from only the “important” (most computationally intensive) portions of a program. • None of these works well, and at their worst they are embarrassing: error rates of almost 4,000%!

  6. SimPoint to the Rescue • 1. As a program executes, its behavior changes. The changes aren’t random: they’re structured as sequences of recurring behavior (termed phases). • 2. If repetitive and structured behavior can be identified, then we only need to sample each unique behavior of a program (and not the whole thing) to get an idea of its execution profile. • 3. How can we identify repetitive, structured behavior? Use machine learning! • Now, only a small set of samples is needed. Collect points from each phase (simulation points), and weight them; this accurately depicts execution of the entire program.

  7. Defining Phase Behavior • Seems pretty easy at first... let's just collect hardware-based statistics, and classify phases accordingly • CPI (performance) • Cache Miss Rates • Branch Statistics (Frequency, Prediction Rate) • FPU instructions per cycle • But what's the problem here?

  8. Defining Phase Behavior • Problem: if we use hardware-based stats, we're tying phases to architectural configuration! • Every time we tweak the architecture, we must re-define phases! • Underlying methodology: identify phases without relying on architectural metrics. Then, we can find a set of samples that can be used across our entire design space. • But what can we use that's independent of hardware-based statistics, but still relates to fundamental changes in what the hardware is doing?

  9. Defining Phase Behavior • Basic Block Vector (BBV): a structure designed to capture how a program changes behavior over time. • A distribution of how many times each basic block is executed over an interval (can use a 1D-array) • Each entry weighted by # of instructions in the BB (so all instructions have equal weight). • Subsets of information in BBVs can also be extracted • Register usage vectors • Loop / branch execution frequencies
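The BBV construction described on this slide can be sketched as follows. This is a minimal illustration, not SimPoint's profiler: the function name, the trace format (a sequence of basic-block IDs), and the normalization to fractions are all assumptions made for the example.

```python
from collections import Counter


def basic_block_vectors(trace, block_sizes, interval_len):
    """Split a trace into intervals and build one normalized BBV per interval.

    trace        -- basic-block IDs in execution order (hypothetical format)
    block_sizes  -- dict: block ID -> number of instructions in that block
    interval_len -- approximate number of instructions per interval
    """
    ids = sorted(block_sizes)  # fixed ordering of the vector dimensions
    bbvs, counts, executed = [], Counter(), 0
    for block in trace:
        # Weight each execution by the block's instruction count,
        # so every instruction contributes equally.
        counts[block] += block_sizes[block]
        executed += block_sizes[block]
        if executed >= interval_len:
            bbvs.append([counts[b] / executed for b in ids])
            counts, executed = Counter(), 0
    if executed:  # trailing partial interval
        bbvs.append([counts[b] / executed for b in ids])
    return bbvs


# Toy trace: a loop over blocks A and B, followed by a loop over block C.
example_sizes = {"A": 2, "B": 3, "C": 5}
example_trace = ["A", "B"] * 3 + ["C"] * 4
example_bbvs = basic_block_vectors(example_trace, example_sizes, interval_len=10)
for vector in example_bbvs:
    print(vector)
```

In the toy trace, the first interval is dominated by A and B and the last ones by C, which is the kind of shift in executed code that a phase boundary corresponds to.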

  10. Defining Phase Behavior • Now, we can use BBVs to find patterns in the program. But can we prove they're useful? • Detailed study by Lau et al.: very strong correlation between the following: • 1) Difference between the BBV of the interval and the BBV of the whole program (code changes) • 2) CPI of the interval (performance) • Graphic on next slide... • Things are looking really good now: we can create a set of phases (and therefore, points to simulate) by ONLY looking at executed code.

  11. Defining Phase Behavior

  12. Extracting Phases • Next step: how do I actually turn my BBVs into phases? • Create a function to compare two BBVs: how similar are they? • Use machine learning data clustering algorithms to group similar BBVs. Each cluster (set of similar points) = a phase! • SimPoint is the implementation of this • Profiles programs (divides them into intervals, and creates BBVs for each). • Uses the k-means clustering algorithm. Input includes the granularity of clusters, which dictates the size and abundance of phases!
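The two steps on this slide (a comparison function, then clustering) can be sketched in plain Python. The choices below are assumptions for illustration: Manhattan distance as the comparison function, squared Euclidean distance inside a textbook Lloyd's k-means, and a fixed random seed. The slides do not specify these details, and this is not SimPoint's own code.

```python
import random


def manhattan(a, b):
    """Comparison function: 0 means identical BBVs, 2 means disjoint code."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means. Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each interval joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Six toy BBVs: intervals 0, 1, 4 run mostly block 0; intervals 2, 3, 5 mostly block 2.
example_bbvs = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9], [0.85, 0.15, 0.0], [0.0, 0.0, 1.0],
]
labels, centroids = kmeans(example_bbvs, k=2)
print(labels)
```

Here `k` plays the role of the cluster-granularity input mentioned on the slide: a larger `k` yields more, smaller phases.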

  13. Choosing Simulation Pts • Final Step: choose simulation points. From each phase, SimPoint chooses one representative interval that will be simulated (in full detail) to represent the whole phase. • All points in the phase are (theoretically) similar in performance statistics, so we can extrapolate. • Machine learning is also used to pick the representative point of a cluster (the interval to use from a phase). • Points are weighted based on interval size (and phase size, of course) • Only needs to be done once per program+input combination. Remember why?
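One plausible reading of this slide, sketched below: take the interval closest to each cluster centroid as that phase's simulation point, weight it by the fraction of intervals in the phase, and combine the per-point CPIs as a weighted average. Fixed-length intervals are assumed (so weights reduce to interval counts), and the function names and CPI values are hypothetical.

```python
def pick_simulation_points(bbvs, labels, centroids):
    """Return {cluster: (representative_interval_index, weight)}."""
    total = len(bbvs)
    points = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        # Representative: the member interval nearest the centroid.
        rep = min(
            members,
            key=lambda i: sum((x - y) ** 2 for x, y in zip(bbvs[i], centroid)),
        )
        # Weight: share of the program's intervals that fall in this phase.
        points[c] = (rep, len(members) / total)
    return points


def estimate_cpi(points, simulated_cpi):
    """Weighted average CPI; simulated_cpi maps interval index -> measured CPI."""
    return sum(weight * simulated_cpi[i] for i, weight in points.values())


# Toy input: three intervals in phase 0, one interval in phase 1.
bbvs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
labels = [0, 0, 0, 1]
centroids = [[0.9, 0.1], [0.0, 1.0]]

sim_points = pick_simulation_points(bbvs, labels, centroids)
print(sim_points)  # {0: (1, 0.75), 1: (3, 0.25)}

# Hypothetical CPIs obtained by simulating only intervals 1 and 3 in detail.
print(estimate_cpi(sim_points, {1: 2.0, 3: 4.0}))  # 2.5
```

Because the points are derived from executed code only, the same `sim_points` can be reused for every architecture configuration; only the detailed simulation of those few intervals has to be repeated.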

  14. Choosing Simulation Pts • User can tweak interval length, # clusters, etc. This trades accuracy (more simulated points) against simulation time.

  15. Experimental Framework • Test Programs: SPEC Benchmarks (26 applications, about half integer, half FP; designed to stress all aspects of a processor.) • Simulation: SimpleScalar, Alpha architecture. • Metrics: accuracy of simulation measured in CPI prediction error
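The slide does not define "CPI prediction error" precisely. A common reading, assumed here, is the relative difference between the CPI estimated from simulation points and the CPI of a full detailed simulation:

```python
def cpi_error(estimated_cpi, true_cpi):
    """Relative CPI prediction error (assumed definition), as a fraction."""
    return abs(estimated_cpi - true_cpi) / true_cpi


# Hypothetical numbers: estimate of 2.5 CPI against a full-run CPI of 2.0.
print(f"{cpi_error(2.5, 2.0):.0%}")  # 25%
```

Under this definition, an error of almost 4,000% means the estimate was roughly 40 times away from the true CPI.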

  16. Million Dollar Question... • How does phase classification do? • SPEC2000, 100 million instruction intervals, no more than 10 simulation points • Gzip, Gcc: only 4 and 8 phases found, respectively

  17. Million Dollar Question...

  18. Million Dollar Question... • How accurate is this thing? • A lot better than “current” methods.....

  19. Million Dollar Question...

  20. Million Dollar Question... • How much time are we saving? • In the previous result, we're only simulating 400-800 million instructions for the SimPoint results. According to the SPEC benchmark data sheet, the 'reference' input configurations are 50 billion and 80 billion instructions, respectively. • So, the baseline simulation needed to execute ~100 times more instructions for this configuration, and took several months! • Imagine if we needed to run on a few thousand combinations of cache size, memory latency, etc.... • Intel / Microsoft use it - must be pretty good.
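The "~100 times" figure follows from the numbers on slides 16 and 20. The sketch below assumes one 100-million-instruction simulation point per phase, and that the 4/8 phases and the 50/80 billion instruction counts pair up with gzip and gcc in that order.

```python
INTERVAL = 100_000_000  # instructions per interval (slide 16)

# (phases found, instructions in the full 'reference' run) per benchmark.
runs = {
    "gzip": (4, 50_000_000_000),
    "gcc": (8, 80_000_000_000),
}


def reduction_factor(phases, full_run, interval=INTERVAL):
    """How many times fewer instructions are simulated in detail."""
    return full_run / (phases * interval)


for name, (phases, full_run) in runs.items():
    simulated = phases * INTERVAL
    print(name, simulated, reduction_factor(phases, full_run))
# gzip 400000000 125.0
# gcc 800000000 100.0
```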

  21. Putting it all together • First implementation of machine learning techniques to perform program phase analysis. • Main thing to take away: applications (even complex ones) only exhibit a few unique behaviors – they're simply interleaved with each other over time. • Using machine learning, we can find these behaviors with methods that are independent of architectural metrics. • By doing so, we only need to simulate a few carefully chosen intervals, which greatly reduces simulation time.

  22. Related / Future Work • Other clustering algorithms with the same data (multinomial clustering, regression trees): k-means appears to do the best. • “Un-tie” simulation points from the binary. How could we do this? • Map behavior back to source level after detecting it • Now, we can use the same simulation points for different compilations / inputs of a program • Accuracy is just about as good as with fixed intervals (Lau et al.)
