Project F2: Application Performance Analysis

John Curreri, Seth Koehler, Rafael Garcia


Presentation Transcript


  1. Project F2: Application Performance Analysis • John Curreri • Seth Koehler • Rafael Garcia

  2. Outline • Introduction • Application mappers • Historical background • Performance analysis today • HLL runtime performance analysis tool • Motivation • Instrumentation • Framework • Visualization • Case study • Molecular Dynamics • Conclusions & References

  3. Application Mappers • Translates C code to HDL • Higher level of abstraction • Usually a subset of ANSI C • No pointers • No standard C libraries for FPGA • HDL is generated as a project file for Xilinx or Altera tools • Built-in communication • Separate C source files are made for the CPU & FPGA • Similar communication function calls between CPU & FPGA

  4. Application Mappers (continued) • Computational parallelism • Pipelining of loops • for(), while(), etc. • Use of library functions • HDL-coded functions called from the HLL • FFT, floating-point operations • Replication of functions defined in hardware • Types of communication • DMA transfers • Efficient transfer of large chunks of data • Stream transfers • Steady flow of data • Buffered for transfer rate changes
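
The "buffered for transfer rate changes" point can be illustrated with a toy cycle-by-cycle model. This is only a sketch under stated assumptions: the function name `count_producer_stalls`, the producer rate (one item per cycle), and the consumer pattern (idle for 4 cycles, then up to 2 items per cycle for 4 cycles) are all hypothetical and are not taken from any application mapper's API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of a buffered stream between a producer and a bursty consumer.
 * Hypothetical illustration only; this is not an application mapper API.
 *
 * Each cycle:
 *   - the consumer is idle for 4 cycles, then drains up to 2 items per
 *     cycle for the next 4 cycles (average rate: 1 item per cycle);
 *   - the producer then tries to push exactly 1 item.
 * A push attempted while the buffer is full counts as one stall cycle.
 */
uint32_t count_producer_stalls(uint32_t capacity, uint32_t cycles)
{
    assert(capacity > 0);

    uint32_t fill = 0;   /* items currently held in the stream buffer */
    uint32_t stalls = 0; /* cycles in which the producer could not push */

    for (uint32_t c = 0; c < cycles; c++) {
        /* Consumer phase: drain only during the second half of each
         * 8-cycle period. */
        if ((c % 8) >= 4) {
            uint32_t take = fill < 2 ? fill : 2;
            fill -= take;
        }

        /* Producer phase: push one item, or stall if the buffer is full. */
        if (fill < capacity)
            fill++;
        else
            stalls++;
    }
    return stalls;
}
```

In this model both sides average one item per cycle, so stalls come only from the mismatch in timing. A buffer deep enough to cover the consumer's idle phase absorbs the mismatch and removes the stalls.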

  5. Introduction to the F2 project • Goals for performance analysis in RC • Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs) • Motivations • Complex systems are difficult to analyze by hand • Manual instrumentation is unwieldy • Difficult to make sense of large volume of raw data • Tools can help quickly locate performance problems • Collect and view performance data with little effort • Analyze performance data to indicate potential bottlenecks • Staple in HPC, limited in HPEC, and virtually non-existent in RC • Challenges • How do we expand notion of software performance analysis into software-hardware realm of RC? • What are common bottlenecks for dual-paradigm applications? • What techniques are necessary to detect performance bottlenecks? • How do we analyze and present these bottlenecks to a user?

  6. Historical Background • gettimeofday() and printf() • Very cumbersome, repetitive, manual, not optimized for speed • Profilers date back to the 1970s with “prof” (gprof, 1982) • Provide user with information about application behavior • Percentage of time spent in a function • How often a function calls another function • Simulators / Emulators • Too slow or too inaccurate • Require significant development time • PAPI (Performance Application Programming Interface) • Portable interface to hardware performance counters on modern CPUs • Provides information about caches, CPU functional units, main memory, and more * Source: Wikipedia
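
For reference, the manual gettimeofday-and-printf approach looks roughly like the sketch below. `gettimeofday` and `printf` are the standard POSIX/C calls; `do_work`, `elapsed_us`, and `time_do_work` are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>

/* Elapsed wall-clock time between two gettimeofday() samples, in
 * microseconds. */
long long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    assert(start != NULL && end != NULL);
    return (long long)(end->tv_sec - start->tv_sec) * 1000000LL
         + (long long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical workload standing in for the region being measured. */
double do_work(int n)
{
    double sum = 0.0;
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;
    return sum;
}

/* Hand instrumentation: sample the clock before and after the region,
 * then print the difference. */
long long time_do_work(int n)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    double result = do_work(n);
    gettimeofday(&t1, NULL);

    long long us = elapsed_us(&t0, &t1);
    printf("do_work(%d) = %f took %lld us\n", n, result, us);
    return us;
}
```

Every region of interest needs its own pair of clock samples and its own print statement, which is why the slide calls this approach cumbersome and repetitive.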

  7. Performance Analysis Today • What does performance analysis look like today? • Goals • Low impact on application behavior • High-fidelity performance data • Flexible • Portable • Automated • Concise Visualization • Techniques • Event-based, sample-based • Profile, Trace • Above all, we want to understand application behavior in order to locate performance problems!

  8. Related Research and Tools: Parallel Performance Wizard (PPW) • Open-source tool developed by UPC Group at University of Florida • Performance analysis and optimization (PGAS* systems and MPI support) • Performance data can be analyzed for bottlenecks • Offers several ways of exploring performance data • Graphs and charts to quickly view high-level performance information at a glance [right, top] • In-depth execution statistics for identifying communication and computational bottlenecks • Interacts with popular trace viewers (e.g. Jumpshot [right, bottom]) for detailed analysis of trace data • Comprehensive support for correlating performance back to original source code * Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.

  9. Motivation for RC Performance Analysis • Dual-paradigm applications gaining more traction in HPC and HPEC • Design flexibility allows best use of FPGAs and traditional processors • Drawback: More challenging to design applications for dual-paradigm systems • Parallel application tuning and FPGA core debugging are hard enough! • [Figure: difficulty scale, from less to more: debug sequential performance, debug parallel performance, debug dual-paradigm performance] • No existing holistic solutions for analyzing dual-paradigm applications • Software-only views leave out low-level details • Hardware-only views provide incomplete performance information • Need complete system view for effective tuning of entire application

  10. Motivation for RC Performance Analysis • Q: Is my runtime load-balancing strategy working? • A: ??? • [Figure: ChipScope waveform]

  11. Motivation for RC Performance Analysis • Q: How well is my core’s pipelining strategy working? • A: ???

      Flat profile: Each sample counts as 0.01 seconds.

        %   cumulative   self              self     total
       time   seconds   seconds    calls  ms/call  ms/call  name
       51.52      2.55     2.55        5   510.04   510.04  USURP_Reg_poll
       29.41      4.01     1.46       34    42.82    42.82  USURP_DMA_write
       11.97      4.60     0.59       14    42.31    42.31  USURP_DMA_read
        4.06      4.80     0.20        1   200.80   200.80  USURP_Finalize
        2.23      4.91     0.11        5    22.09    22.09  localp
        1.22      4.97     0.06        5    12.05    12.05  USURP_Load
        0.00      4.97     0.00       10     0.00     0.00  USURP_Reg_write
        0.00      4.97     0.00        5     0.00     0.00  USURP_Set_clk
        0.00      4.97     0.00        5     0.00   931.73  rcwork
        0.00      4.97     0.00        1     0.00     0.00  USURP_Init

      gprof output (×N, one for each node!)

  12. Instrumentation Level • High-level language (HLL) • Requires HLL timing functions • Application mapping disturbed by instrumentation • Hardware Description Language (HDL) • Portable across HLLs and FPGA families • Selected level for instrumentation • FPGA bitstream • Requires targeting a specific FPGA family • Instrument in minutes

  13. Instrumentation Selection • Automated - Computation • State machines • Used for preserving execution order in C functions • Used to control state of pipelines • Control and status signals • Used by library function • Automated - Communication • Control and status signals • Used for streaming communication • Used for DMA transfers • Application specific • Monitoring variables for meaningful values

  14. Measurement Techniques • Profiling • Counters • Records number of occurrences of event • Low overhead • Normally uses registers • Block RAM can be used for state machines • Tracing • Timestamps • Indicating when event occurred • Data • Associated with each event • Greater overhead • Uses memory to store timestamps and data • Greater fidelity • Reconstruction of sequence of events • [Figure: Jumpshot timeline of events on CPUs 0 to 3 over time *] * Zaki, O., Lusk, E., Gropp, W., and Swider, D. 1999. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13, 3 (Aug. 1999), 277-288.
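
The trade-off between the two techniques can be sketched as a small software model. This is an illustration under assumptions, not the tool's hardware design: the `monitor` structure, its field names, and the fixed `TRACE_CAPACITY` are hypothetical. Profiling keeps a single counter (a register in hardware), while tracing stores a timestamp and data word per event in a bounded memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_CAPACITY 8 /* hypothetical trace memory depth, in records */

/* One trace record: when the event happened and the data that came with it. */
typedef struct {
    uint64_t timestamp; /* cycle count at which the event occurred */
    uint32_t data;      /* value associated with the event */
} trace_record;

/* Software model of a monitor that profiles and traces the same event. */
typedef struct {
    uint32_t event_count;               /* profiling: one counter */
    trace_record trace[TRACE_CAPACITY]; /* tracing: bounded memory */
    size_t trace_len;                   /* records stored so far */
    uint32_t dropped;                   /* events lost once memory is full */
} monitor;

void monitor_init(monitor *m)
{
    assert(m != NULL);
    m->event_count = 0;
    m->trace_len = 0;
    m->dropped = 0;
}

/* Record one occurrence of the monitored event. */
void monitor_event(monitor *m, uint64_t cycle, uint32_t data)
{
    assert(m != NULL);

    /* Profiling only counts occurrences, so it never runs out of space. */
    m->event_count++;

    /* Tracing keeps the full record, so it can overflow its memory. */
    if (m->trace_len < TRACE_CAPACITY) {
        m->trace[m->trace_len].timestamp = cycle;
        m->trace[m->trace_len].data = data;
        m->trace_len++;
    } else {
        m->dropped++;
    }
}
```

The counter answers "how many times", while the trace answers "when, and with what data", at the cost of memory that must be sized or drained to avoid dropped records.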

  15. Hardware Measurement Module

  16. Adding Instrumentation & Measurement • [Figure: HLL tool flow diagram. Labels: CPU(s), FPGA(s), application (C source), application (HDL), loopback (C source), loopback (HDL), HLL API wrapper, HLL hardware wrapper, hardware measurement module, measurement extraction process/thread, instrumented signals, software-hardware mapping, compile software, implement hardware] • Stages shown: uninstrumented project; instrumentation added to C source; C source for FPGA mapped to HDL; instrumentation added to HDL; implement hardware; finished design

  17. Reverse Mapping & Analysis • Mapping of HDL data back to HLL • Variable name-matching • Observing scope and other patterns • Bottleneck detection • Load-balancing of replicated functions • Monitoring for pipeline stalls • Detecting streaming communication stalls • Finding shared-memory contention
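
Two of the bottleneck checks listed above reduce to simple arithmetic on profile counters. The sketch below assumes the counters have already been read back from the hardware; the function names and the choice of metrics (busiest-to-mean ratio, stalled-cycle fraction) are illustrative assumptions, not the tool's actual analysis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Load balance of replicated functions: ratio of the busiest replica's
 * work count to the mean work count. 1.0 means perfectly balanced and
 * larger values mean one replica is doing more than its share.
 * Returns 0.0 if no work was recorded at all.
 */
double load_imbalance(const uint32_t *work, size_t n)
{
    assert(work != NULL || n == 0);

    uint64_t sum = 0;
    uint32_t max = 0;
    for (size_t i = 0; i < n; i++) {
        sum += work[i];
        if (work[i] > max)
            max = work[i];
    }
    if (sum == 0)
        return 0.0;

    double mean = (double)sum / (double)n;
    return (double)max / mean;
}

/*
 * Pipeline or stream stalls: fraction of observed cycles spent stalled.
 * Returns 0.0 if no cycles were observed.
 */
double stall_fraction(uint64_t stall_cycles, uint64_t total_cycles)
{
    assert(stall_cycles <= total_cycles);

    if (total_cycles == 0)
        return 0.0;
    return (double)stall_cycles / (double)total_cycles;
}
```

A tool could flag a replica set whose imbalance exceeds some threshold, or a stream whose stall fraction is high, though suitable thresholds would depend on the application.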

  18. Example RC Visualization • Need unified visualizations that accentuate important statistics • Must be scalable to many nodes

  19. Molecular Dynamics • Simulation • Interactions between atoms and molecules • Discrete time intervals • Models forces • Newtonian physics • Van der Waals forces • Other interactions • Tracks each molecule's position and velocity • X, Y and Z directions • Source: http://en.wikipedia.org/wiki/Molecular_dynamics

  20. Case Study Setup • Impulse C v2.2 • XD1000 platform • Opteron 2.2 GHz • XD1000 module with Altera Stratix-II EP2S180 FPGA in second processor socket • MD communication architecture • Chunks of MD data are read from SRAM • Data is streamed to multiple MD kernels that are pipelined • Results are stored back to SRAM

  21. Impulse-C Profile Percentages • Output stream of Molecular Dynamics kernel is a bottleneck.

  22. Stream buffer size was increased 32-fold, raising application speedup over the serial baseline from 6.2× to 7.8×.
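
As a worked check on these two figures: both speedups are measured against the same serial baseline, so their ratio gives the runtime change attributable to the larger buffer. The symbols below (S for speedup, T for runtime, with "before" and "after" referring to the buffer change) are introduced here for illustration and do not appear in the slides.

```latex
\[
\frac{S_{\text{after}}}{S_{\text{before}}}
= \frac{T_{\text{serial}} / T_{\text{after}}}{T_{\text{serial}} / T_{\text{before}}}
= \frac{T_{\text{before}}}{T_{\text{after}}}
= \frac{7.8}{6.2} \approx 1.26,
\qquad
1 - \frac{T_{\text{after}}}{T_{\text{before}}}
= 1 - \frac{6.2}{7.8} \approx 0.21
\]
```

That is, the tuned version runs roughly 1.26 times faster than the untuned FPGA version, which corresponds to about a 21% reduction in runtime.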

  23. Performance Analysis Overhead • Additional FPGA resource usage • Less than 4% • Frequency reduction • Less than 3%

  24. Conclusions • Developed prototype HLL-oriented RC performance analysis tool • First such runtime performance analysis tool framework (per extensive literature review) • Tracing & profiling available • Automated instrumentation in progress • Application case study performed • Observed minimal overhead from tool • Speedup achieved due to performance analysis • Future work • SRC support, automated instrumentation and analysis, integration with software PAT, further case studies

  25. References • Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society. • Sameer S. Shende and Allen D. Malony. The Tau parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006. • C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society. • Adam Leko and Max Billingsley, III. Parallel performance wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007. • S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007. • J. Curreri, S. Koehler, B. Holland, and A. George, "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.
