This project introduces a module to capture cycle-accurate hardware event profiles during program runtimes on real systems. It aims to identify bottlenecks like memory accesses and ISA decoding, using FPGA stats modules. The architecture associates counters with events and methods, providing scalability. Results show data from the Linpack benchmark running on FPGA. The study discusses the future integration of the StatsMod into Linux OS for runtime introspection and optimization.
Cycle Accurate Performance Measurement • Richard Hough, Phillip Jones, Scott Friedman, Roger Chamberlain, Jason Fritts, John Lockwood, and Ron Cytron • rh3@wustl.edu • http://liquid.arl.wustl.edu/ • Funded by NSF Grant ITR-0313203
Outline • Introduction • Motivation • Background • Architecture • Usage • Results • Future Work • Related Work • Conclusion
Introduction – What Are We Doing? • Creating a module (the Statistics Module) for capturing cycle-accurate profiles of hardware events during the runtime of programs on real systems • [Diagram labels: Program Runtime, Program Bottlenecks, Memory Accesses, ISA Decoding, Cache Hits]
Background - FPX • Designed and implemented on the FPX platform • The FPX platform: • Is designed for developing pluggable network circuits • Contains a Virtex 2000e FPGA for design deployment • Possesses a smaller FPGA used as a network interface device • Can potentially operate at gigabit line rates
Background - LEON2 • Developed by Gaisler Research • Implements the SPARC V8 architecture • Open-source VHDL • Widely used • European Space Agency, etc. • Second in popularity only to the MicroBlaze
Motivation – Why Not Use Software? • Software Profiling Is: • Inaccurate • Many data points estimated • Time slices not absolute • Profiling affects results • Inefficient • Unreasonable for real-system deployment • Ineffective • Difficult to separate OS overhead
Motivation – Why Not Use Simulation? • Simulation is: • Slow • A simple simulation could require 100X more time than running the program • Bound by the quality of the model • The model used may be inaccurate • Processors often tweaked without updating the documentation [Larus]
Motivation – Why Use FPGAs? • ASICs are expensive • FPGAs provide a good blend of cost and accuracy • Software simulation of processors is incredibly slow • FPGAs allow for easy prototyping • Test new caching methods, tweak the ISA, etc.
Motivation – Why Put StatsMod In An FPGA? • The Statistics Module Allows You To: • Pull event signals from anywhere • Evaluate both software and hardware optimizations • Tweak the architecture • Integrate hardware-accelerated modules into software solutions • Adjust the software algorithm • Gather repeatable and reliable results
Architecture – Naïve Solution • Interested in 10 events and 10 methods • Naïve solution implements a counter for each possible event/method pair • 10 × 10 = 100 counters! • Not scalable for large systems
Architecture – Our Solution • Better Approach • Associate counters with events and methods at run time • Covers the same problem space, but uses less chip area
Architecture – Scalability • [Diagram comparing the naïve approach with the StatsMod design; labels: Address Range Registers, Counters, Events]
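The run-time association described on the architecture slides can be illustrated with a small software model. This is only a sketch under stated assumptions, not the StatsMod hardware design: the names `StatsModel`, `Counter`, `bind`, and `clock`, the event names, and the addresses are all hypothetical, and a "method" is assumed to be identified by an inclusive address range matched against the program counter.

```python
from dataclasses import dataclass


def naive_counter_count(num_events, num_methods):
    """Counters needed when every event/method pair gets its own counter."""
    return num_events * num_methods


@dataclass
class Counter:
    event: str      # hypothetical name of the event signal being watched
    lo: int         # first address of the method's range (inclusive)
    hi: int         # last address of the method's range (inclusive)
    count: int = 0


class StatsModel:
    """Software model of a fixed pool of run-time configurable counters."""

    def __init__(self, num_counters):
        self.capacity = num_counters
        self.counters = []

    def bind(self, event, lo, hi):
        """Associate a free counter with one event and one address range."""
        if len(self.counters) >= self.capacity:
            raise RuntimeError("no free counters")
        counter = Counter(event, lo, hi)
        self.counters.append(counter)
        return counter

    def clock(self, pc, active_events):
        """Model one cycle: bump counters whose event fired while pc is in range."""
        for c in self.counters:
            if c.event in active_events and c.lo <= pc <= c.hi:
                c.count += 1


# Usage: a pool of 4 counters, two of them bound to the same (assumed) method.
stats = StatsModel(num_counters=4)
cycles = stats.bind("cycle", 0x4000, 0x40FF)
misses = stats.bind("dcache_miss", 0x4000, 0x40FF)

# Made-up trace of (program counter, events asserted during that cycle).
trace = [
    (0x4000, {"cycle"}),
    (0x4004, {"cycle", "dcache_miss"}),
    (0x5000, {"cycle"}),  # outside the bound range, so not counted
]
for pc, events in trace:
    stats.clock(pc, events)

print(naive_counter_count(10, 10))  # 100
print(cycles.count, misses.count)   # 2 1
```

The point of the sketch is the sizing argument: the pool only needs as many counters as there are event/range pairs of interest in a given run, rather than one per possible pair.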
Results – What do we get? • The next few slides contain data from the Linpack benchmark running on the FPGA • Linpack is an FPU-intensive benchmark • While the following slides focus on runtime, it is important to remember that the graphs could in principle be of *any* event
Results 323,686,726 Clock Cycles
Future Work – Where can we go? • As of a week ago, the StatsMod was successfully integrated into a Linux 2.6.11 OS running on the LEON2 • Changes have been made to allow a clear separation between process IDs • OS, background tasks, threads • A device driver allows any program, including the program being profiled, to gather the statistics
Future Work – Where can we go? • Programs could now potentially collect statistics on themselves and perform runtime introspection • Adjust operation to conserve power, reduce memory accesses, etc. • Deeper integration could occur at the kernel level to affect scheduler decisions • Adds a new dimension for slicing resources • Network activity, device activity, page faults, etc.
Related Work • SnoopP • Developed by Lesley Shannon and Paul Chow at the University of Toronto • Collects timing characteristics of programs running on a MicroBlaze processor • Focuses on clock cycles only • Integrated into the EDK
Conclusion In closing, I would like to thank: • Phillip Jones for his hard work and support • Ron Cytron for his mentoring and persistence • Scott Friedman for his work on the web interface • The rest of the Liquid Architecture team • And WISA for the invitation to present
Usage • Connect to a secure web server controlling the FPGA hardware • Upload the desired binary executable, its associated map file, and the desired programming bitfile • A Perl script parses the map file and provides a graphical interface for selecting the desired address ranges and events • Statistics are tabulated at the end of the program’s execution
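The map-file step can be sketched roughly as follows. The slides do not show the actual map file layout or the Perl script, so everything here is an assumption for illustration: the three-column `start size symbol` format, the `parse_map` helper, and the example symbols and addresses are hypothetical. The sketch is in Python rather than Perl.

```python
def parse_map(text):
    """Return {symbol: (start, end)} with inclusive end addresses.

    Assumes a simplified, hypothetical layout of "<hex start> <hex size> <symbol>"
    per line; lines that do not fit are skipped.
    """
    ranges = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        start_s, size_s, name = fields
        try:
            start = int(start_s, 16)
            size = int(size_s, 16)
        except ValueError:
            continue  # not a hex address/size pair
        if size > 0:
            ranges[name] = (start, start + size - 1)
    return ranges


# Made-up map contents; a real linker map file will look different.
EXAMPLE_MAP = """\
# start     size  symbol   (hypothetical layout)
0x40000000  0x80  main
0x40000080  0x140 daxpy
"""

ranges = parse_map(EXAMPLE_MAP)
for name, (start, end) in ranges.items():
    print(f"{name}: 0x{start:08X}-0x{end:08X}")
```

Each resulting range is the kind of value a user would select in the interface and pair with an event before the run starts.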