Cycle Accurate Performance Measurement Richard Hough Phillip Jones, Scott Friedman, Roger Chamberlain, Jason Fritts, John Lockwood, and Ron Cytron rh3@wustl.edu http://liquid.arl.wustl.edu/ Funded by NSF Grant ITR-0313203
Outline • Introduction • Motivation • Background • Architecture • Usage • Results • Future Work • Related Work • Conclusion
Introduction – What Are We Doing? • Creating a module for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems • [Figure: Statistics Module, with labels Program Runtime, Program Bottlenecks, Memory Accesses, ISA Decoding, Cache Hits]
Background – FPX • Designed and implemented on the FPX platform • The FPX platform: • Is designed for developing pluggable network circuits • Contains a Virtex 2000E FPGA for design deployment • Includes a smaller FPGA used as a network interface device • Can potentially operate at gigabit line rates
Background – LEON2 • Developed by Gaisler Research • SPARC V8 • Open-source VHDL • Widely used • European Space Agency, etc. • Second in popularity only to the MicroBlaze
Motivation – Why Not Use Software? • Software Profiling Is: • Inaccurate • Many data points estimated • Time slices not absolute • Profiling affects results • Inefficient • Unreasonable for real-system deployment • Ineffective • Difficult to separate OS overhead
Motivation – Why Not Use Simulation? • Simulation is: • Slow • A simple simulation could require 100X more time than running the program • Bound by the quality of the model • The model used may be inaccurate • Processors often tweaked without updating the documentation [Larus]
Motivation – Why Use FPGAs? • ASICs are expensive • FPGAs provide a good blend of cost and accuracy • Software simulation of processors is incredibly slow • Allows for easy prototyping • Test new caching methods, tweak the ISA, etc.
Motivation – Why Put StatsMod In An FPGA? • The Statistics Module Allows You To: • Pull event signals from anywhere • Evaluate both software and hardware optimizations • Tweak the architecture • Integrate hardware-accelerated modules into software solutions • Adjust the software algorithm • Gather repeatable and reliable results
Architecture – Naïve Solution • Interested in 10 events and 10 methods • Naïve solution implements a counter for each event-method pair • 100 counters! • Not scalable for large systems
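One way to read the 100-counter figure is as 10 events multiplied by 10 profiled code regions. The short sketch below works through that arithmetic under that assumption; the function name and the larger sizes are illustrative and do not come from the slides.

```python
def naive_counter_count(num_events, num_regions):
    """Counters needed when every (event, code region) pair gets its own counter."""
    return num_events * num_regions


# The 10 x 10 case from the slide, then larger hypothetical systems.
for n in (10, 20, 40):
    print(f"{n} events x {n} regions -> {naive_counter_count(n, n)} counters")
```

Doubling both dimensions quadruples the counter count, which is why a dedicated counter per pair stops being practical on a resource-limited FPGA.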
Architecture – Our Solution • Better Approach • Associate counters to events and methods at run time • Covers the problem area, but uses less chip space
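The run-time association can be pictured with a small software model. This is only a sketch under stated assumptions, not the actual hardware design: the class names, the event names, the pool size, and the addresses are all hypothetical, and a "method" is modeled as an inclusive address range checked against the program counter.

```python
from dataclasses import dataclass


@dataclass
class Counter:
    """One counter bound at run time to an (event, address range) pair."""
    event: str
    lo: int      # inclusive start address of the profiled method
    hi: int      # inclusive end address of the profiled method
    count: int = 0


class StatsModel:
    """Software model of a small pool of run-time configurable counters."""

    def __init__(self, num_counters):
        self.num_counters = num_counters
        self.counters = []

    def bind(self, event, lo, hi):
        """Associate a free counter with an event and an address range."""
        if len(self.counters) >= self.num_counters:
            raise RuntimeError("no free counters left in the pool")
        counter = Counter(event, lo, hi)
        self.counters.append(counter)
        return counter

    def clock(self, pc, active_events):
        """Advance one cycle: bump each counter whose event fired inside its range."""
        for c in self.counters:
            if c.event in active_events and c.lo <= pc <= c.hi:
                c.count += 1


# Hypothetical addresses and event names, for illustration only.
model = StatsModel(num_counters=4)
cycles = model.bind("cycle", 0x4000, 0x40FF)
misses = model.bind("cache_miss", 0x4000, 0x40FF)

trace = [
    (0x4000, {"cycle"}),
    (0x4004, {"cycle", "cache_miss"}),
    (0x5000, {"cycle"}),            # outside the profiled range
    (0x4008, {"cycle"}),
]
for pc, events in trace:
    model.clock(pc, events)

print(cycles.count, misses.count)
```

Only the pairs that are actually bound consume a counter, so the pool can be much smaller than the full events-by-methods grid.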
Architecture – Scalability • [Figure: comparison with the naïve approach; labels: Address Range Registers, Counters, Events]
Results – What do we get? • The next few slides contain data from the Linpack benchmark running on the FPGA • Linpack is an FPU-intensive benchmark • While the following slides focus on runtime, it is important to remember that the graphs could in principle be of *any* event
Results 323,686,726 Clock Cycles
Future Work – Where can we go? • As of a week ago, the StatsMod was successfully integrated into a Linux 2.6.11 OS running on Leon • Changes have been made to allow a clear separation between Process IDs • OS, background tasks, threads • A device driver allows any program, including the program being profiled, to gather the statistics
Future Work – Where can we go? • Programs could now potentially collect statistics on themselves and perform runtime introspection • Adjust operation to conserve power, memory accesses, etc. • Deeper integration could occur at the kernel level to affect scheduler decisions • Adds a new dimension for slicing resources • Network activity, device activity, page faults, etc.
Related Work • SnoopP • Developed by Lesley Shannon and Paul Chow at the University of Toronto • Collects timing characteristics of programs running on a MicroBlaze processor • Focuses on clock cycles only • Integrated into the EDK
Conclusion In closing, I would like to thank: • Phillip Jones for his hard work and support • Ron Cytron for his mentoring and persistence • Scott Friedman for his work on the web interface • The rest of the Liquid Architecture team • And WISA for the invitation to present
Usage • Connect to a secure web server controlling the FPGA hardware • Upload the desired binary executable, associated map file, and desired programming bitfile • A Perl script parses the map file and provides a graphical interface for selecting the desired address ranges and events • Statistics are tabulated at the end of the program’s execution
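The map-file step can be illustrated with a short sketch. The slides do not show the map-file format or the Perl script, so this example is written in Python and assumes a hypothetical three-column layout (hex start address, hex size, symbol name); the sample symbols and addresses are invented for illustration.

```python
def parse_map(text):
    """Return {symbol: (start, end)} from lines of the form '<hex start> <hex size> <name>'."""
    ranges = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unrecognised lines
        try:
            start = int(fields[0], 16)
            size = int(fields[1], 16)
        except ValueError:
            continue  # skip lines whose first two fields are not hex numbers
        ranges[fields[2]] = (start, start + size - 1)
    return ranges


# Hypothetical map-file contents, not taken from a real linker.
SAMPLE_MAP = """\
40001000 00000040 main
40001040 00000120 daxpy
"""

ranges = parse_map(SAMPLE_MAP)
for name, (lo, hi) in ranges.items():
    print(f"{name}: 0x{lo:08x}-0x{hi:08x}")
```

Each resulting range is the kind of value a user would pick in the interface as the address range for a profiled method.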