
Cycle Accurate Performance Measurement

This project introduces a module that captures cycle-accurate profiles of hardware events while programs run on real systems. It aims to identify program bottlenecks such as memory accesses and ISA decoding, using a statistics module (StatsMod) deployed on an FPGA. The architecture associates counters with events and methods at run time, which scales better than dedicating a counter to every combination. Results show data from the Linpack benchmark running on the FPGA. The talk also discusses integrating the StatsMod into a Linux OS for runtime introspection and optimization.

Presentation Transcript


  1. Cycle Accurate Performance Measurement • Richard Hough • Phillip Jones, Scott Friedman, Roger Chamberlain, Jason Fritts, John Lockwood, and Ron Cytron • rh3@wustl.edu • http://liquid.arl.wustl.edu/ • Funded by NSF Grant ITR-0313203

  2. Outline • Introduction • Motivation • Background • Architecture • Usage • Results • Future Work • Related Work • Conclusion

  3. Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems

  4. Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems [Diagram labels: Statistics Module]

  5. Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems [Diagram labels: Statistics Module; Program Bottlenecks; Program Runtime]

  6. Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems [Diagram labels: Statistics Module; Program Bottlenecks; Program Runtime; Memory Accesses; ISA Decoding; Cache Hits]

  7. Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems [Diagram labels: Statistics Module; Program Bottlenecks; Program Runtime; Memory Accesses; ISA Decoding; Cache Hits]

  8. Background – FPX • Designed and implemented on the FPX platform • The FPX platform: • Is designed for developing pluggable network circuits • Contains a Virtex 2000E FPGA for design deployment • Possesses a smaller FPGA used as a network interface device • Can potentially operate at gigabit line rates

  9. Background – LEON2 • Developed by Gaisler Research • Implements the SPARC V8 ISA • Open-source VHDL • Widely used • European Space Agency, etc. • Second in popularity only to the MicroBlaze

  10. Motivation – Why Not Use Software? • Software Profiling Is: • Inaccurate • Many data points estimated • Time slices not absolute • Profiling affects results • Inefficient • Unreasonable for real-system deployment • Ineffective • Difficult to separate OS overhead

  11. Motivation – Why Not Use Simulation? • Simulation is: • Slow • A simple simulation could require 100X more time than running the program • Bound by the quality of the model • The model used may be inaccurate • Processors often tweaked without updating the documentation [Larus]

  12. Motivation – Why Use FPGAs? • ASICs are expensive • FPGAs provide a good blend of cost and accuracy • Software simulation of processors is incredibly slow • FPGAs allow for easy prototyping • Test new caching methods, tweak the ISA, etc.

  13. Motivation – Why Put StatsMod In An FPGA? • The Statistics Module Allows You To: • Pull event signals from anywhere • Evaluate both software and hardware optimizations • Tweak the architecture • Integrate hardware-accelerated modules into software solutions • Adjust the software algorithm • Gather repeatable and reliable results

  14. Architecture – Naïve Solution • Interested in 10 events and 10 methods • Naïve solution implements a counter for each possibility (every event paired with every method) • 10 × 10 = 100 counters! • Not scalable for large systems

  15. Architecture – Our Solution • Better Approach • Associate counters with events and methods at run time • Covers the problem area, but uses less chip space
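The run-time association described on this slide can be sketched in software. The following Python model is only an illustration, not the authors' hardware design: the class and function names are hypothetical, and it assumes that a "method" corresponds to an address range in the program (the Usage slide mentions selecting address ranges and events). A small pool of counters is bound to (event, address range) pairs on demand, instead of building one counter per pair.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    event: str      # name of the event signal this counter listens to
    lo: int         # first address of the watched range (inclusive)
    hi: int         # last address of the watched range (inclusive)
    count: int = 0  # number of cycles on which the event fired in range


class StatsModSketch:
    """Hypothetical software model of a fixed pool of configurable counters."""

    def __init__(self, num_counters):
        self.num_counters = num_counters
        self.counters = []

    def bind(self, event, lo, hi):
        """Associate a free counter with an event and an address range."""
        if len(self.counters) >= self.num_counters:
            raise RuntimeError("no free counters left in the pool")
        counter = Counter(event, lo, hi)
        self.counters.append(counter)
        return counter

    def tick(self, pc, active_events):
        """Model one clock cycle: pc is the program counter and
        active_events is the set of event signals asserted this cycle."""
        for counter in self.counters:
            if counter.event in active_events and counter.lo <= pc <= counter.hi:
                counter.count += 1


def naive_counter_count(num_events, num_ranges):
    """Counters needed when every (event, range) pair gets its own counter."""
    return num_events * num_ranges


# Illustrative usage with made-up addresses and event names.
mod = StatsModSketch(num_counters=4)
misses = mod.bind("cache_miss", 0x4000, 0x40FF)
cycles = mod.bind("cycle", 0x4000, 0x40FF)
trace = [
    (0x4000, {"cycle"}),
    (0x4004, {"cycle", "cache_miss"}),
    (0x5000, {"cycle", "cache_miss"}),  # outside the watched range
]
for pc, events in trace:
    mod.tick(pc, events)
print(naive_counter_count(10, 10), misses.count, cycles.count)  # 100 1 2
```

In this model the pool size is independent of how many events and ranges exist; only the pairs a user actually asks about consume a counter.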

  16. Architecture – An In Depth Look

  17. Architecture – Scalability Naïve Approach Address Range Registers Counters Events

  18. Usage

  19. Results – What do we get? • The next few slides contain data from the Linpack benchmark running on the FPGA • Linpack is an FPU-intensive benchmark • While the following slides focus on runtime, it is important to remember that the graphs could in principle be of *any* event

  20. Results 323,686,726 Clock Cycles

  21. Results

  22. Results

  23. Results

  24. Future Work – Where can we go? • As of a week ago, the StatsMod was successfully integrated into a Linux 2.6.11 OS running on the LEON • Changes have been made to allow a clear separation between process IDs • OS, background tasks, threads • A device driver allows any program, including the program being profiled, to gather the statistics

  25. Future Work – Where can we go? • Programs could now potentially collect statistics on themselves and perform runtime introspection • Adjust operation to conserve power, memory accesses, etc. • Deeper integration could occur at the kernel level to affect scheduler decisions • Adds a new dimension for slicing resources • Network activity, device activity, page faults, etc.
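The idea of a program profiling itself and adjusting its own operation can be sketched as a simple feedback loop. The slides do not describe the device driver's interface, so `read_counter` below is a hypothetical stand-in for whatever call would return a hardware counter value, and `FakeCounter` merely simulates one so the example runs anywhere. The policy shown (switch to a cheaper mode after an expensive batch) is one possible choice, not something taken from the talk.

```python
def run_with_introspection(batches, work, read_counter, budget):
    """Run work(batch, frugal) on each batch, measuring an event counter
    around every call and switching to a frugal mode for the next batch
    whenever the last batch used more than `budget` events."""
    frugal = False
    log = []
    for batch in batches:
        before = read_counter()          # counter value before the batch
        work(batch, frugal)
        used = read_counter() - before   # events attributed to this batch
        log.append((frugal, used))
        frugal = used > budget           # decide the mode of the next batch
    return log


class FakeCounter:
    """Simulated counter; a real system would read the hardware instead."""

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value


counter = FakeCounter()


def work(batch, frugal):
    # Pretend the frugal code path causes a quarter of the events.
    counter.value += batch * (1 if frugal else 4)


log = run_with_introspection([10, 10, 10], work, counter.read, budget=30)
print(log)  # [(False, 40), (True, 10), (False, 40)]
```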

  26. Related Work • SnoopP • Developed by Lesley Shannon and Paul Chow at the University of Toronto • Collects timing characteristics of programs running on a MicroBlaze processor • Focuses on clock cycles only • Integrated into the EDK

  27. Conclusion In closing, I would like to thank: • Phillip Jones for his hard work and support • Ron Cytron for his mentoring and persistence • Scott Friedman for his work on the web interface • The rest of the Liquid Architecture team • And WISA for the invitation to present

  28. Questions?

  29. Background – Liquid

  30. Usage • Connect to a secure web server controlling the FPGA hardware • Upload the desired binary executable, associated map file, and desired programming bitfile • A Perl script parses the map file and provides a graphical interface for selecting the desired address ranges and events • Statistics are tabulated at the end of the program’s execution
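The map-file step above can be illustrated with a short sketch. The project uses a Perl script whose details are not given in the slides; the Python below is a stand-in, and it assumes an `nm`-style map format of `address type name` per line, which may differ from the real map file. The symbol names in the example are illustrative. Each code symbol's address range is taken to run up to the byte before the next code symbol.

```python
def parse_map(text):
    """Turn 'address type name' lines into {name: (start, end)} ranges.

    Only text (code) symbols, marked 'T' or 't', are kept. The last code
    symbol is omitted because its end address cannot be inferred from the
    symbol list alone.
    """
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue                      # skip lines in another layout
        addr, kind, name = parts
        if kind.upper() != "T":
            continue                      # skip data and other symbols
        try:
            symbols.append((int(addr, 16), name))
        except ValueError:
            continue                      # skip non-hexadecimal addresses
    symbols.sort()
    ranges = {}
    for (start, name), (next_start, _) in zip(symbols, symbols[1:]):
        ranges[name] = (start, next_start - 1)
    return ranges


example_map = """\
40001000 T main
40001080 T dgefa
40001200 T dgesl
40002000 D data_start
"""
for name, (start, end) in parse_map(example_map).items():
    print(f"{name}: 0x{start:08x}-0x{end:08x}")
```

The resulting ranges are the kind of input a selection interface could offer to the user when choosing which parts of the program to profile.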
