Project F2: Application Performance Analysis

John Curreri, Seth Koehler, Rafael Garcia

Outline • Introduction • Application mappers • Historical background • Performance analysis today • HLL runtime performance analysis tool • Motivation • Instrumentation • Framework • Visualization • Case study • Molecular Dynamics • Conclusions & References

Application Mappers • Translates C code to HDL • Higher level of abstraction • Usually a subset of ANSI C • No pointers • No standard C libraries for FPGA • HDL is generated as a project file for Xilinx or Altera tools • Built-in communication • Separate C source files are made for the CPU & FPGA • Similar communication function calls between CPU & FPGA

Application Mappers (continued) • Computational parallelism • Pipelining of loops • for(), while(), etc. • Use of library functions • HDL-coded functions called from the HLL • FFT, floating-point operations • Replication of functions defined in hardware • Types of communication • DMA transfers • Efficient transfer of large chunks of data • Stream transfers • Steady flow of data • Buffered to absorb transfer-rate changes

Introduction to the F2 project • Goals for performance analysis in RC • Productively identify and remedy performance bottlenecks in RC applications (CPUs and FPGAs) • Motivations • Complex systems are difficult to analyze by hand • Manual instrumentation is unwieldy • Difficult to make sense of large volume of raw data • Tools can help quickly locate performance problems • Collect and view performance data with little effort • Analyze performance data to indicate potential bottlenecks • Staple in HPC, limited in HPEC, and virtually non-existent in RC • Challenges • How do we expand notion of software performance analysis into software-hardware realm of RC? • What are common bottlenecks for dual-paradigm applications? • What techniques are necessary to detect performance bottlenecks? • How do we analyze and present these bottlenecks to a user?

Historical Background • gettimeofday and printf • Very cumbersome, repetitive, and manual; not optimized for speed • Profilers date back to the 1970s with “prof” (gprof, 1982) • Provide user with information about application behavior • Percentage of time spent in a function • How often a function calls another function • Simulators / Emulators • Too slow or too inaccurate • Require significant development time • PAPI (Performance Application Programming Interface) • Portable interface to hardware performance counters on modern CPUs • Provides information about caches, CPU functional units, main memory, and more * Source: Wikipedia

Performance Analysis Today • What does performance analysis look like today? • Goals • Low impact on application behavior • High-fidelity performance data • Flexible • Portable • Automated • Concise Visualization • Techniques • Event-based, sample-based • Profile, Trace • Above all, we want to understand application behavior in order to locate performance problems!

Related Research and Tools: Parallel Performance Wizard (PPW) • Open-source tool developed by UPC Group at University of Florida • Performance analysis and optimization (PGAS* systems and MPI support) • Performance data can be analyzed for bottlenecks • Offers several ways of exploring performance data • Graphs and charts to quickly view high-level performance information at a glance [right, top] • In-depth execution statistics for identifying communication and computational bottlenecks • Interacts with popular trace viewers (e.g. Jumpshot [right, bottom]) for detailed analysis of trace data • Comprehensive support for correlating performance back to original source code * Partitioned Global Address Space languages allow partitioned memory to be treated as global shared memory by software.

Motivation for RC Performance Analysis • Dual-paradigm applications gaining more traction in HPC and HPEC • Design flexibility allows best use of FPGAs and traditional processors • Drawback: More challenging to design applications for dual-paradigm systems • Parallel application tuning and FPGA core debugging are hard enough! • No existing holistic solutions for analyzing dual-paradigm applications • Software-only views leave out low-level details • Hardware-only views provide incomplete performance information • Need complete system view for effective tuning of entire application
[Figure: difficulty level, from less to more: debug sequential performance, debug parallel performance, debug dual-paradigm performance]

Motivation for RC Performance Analysis • Q: Is my runtime load-balancing strategy working? • A: ??? [Figure: ChipScope waveform]

Motivation for RC Performance Analysis • Q: How well is my core’s pipelining strategy working? • A: ???

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 51.52       2.55     2.55        5   510.04   510.04  USURP_Reg_poll
 29.41       4.01     1.46       34    42.82    42.82  USURP_DMA_write
 11.97       4.60     0.59       14    42.31    42.31  USURP_DMA_read
  4.06       4.80     0.20        1   200.80   200.80  USURP_Finalize
  2.23       4.91     0.11        5    22.09    22.09  localp
  1.22       4.97     0.06        5    12.05    12.05  USURP_Load
  0.00       4.97     0.00       10     0.00     0.00  USURP_Reg_write
  0.00       4.97     0.00        5     0.00     0.00  USURP_Set_clk
  0.00       4.97     0.00        5     0.00   931.73  rcwork
  0.00       4.97     0.00        1     0.00     0.00  USURP_Init

gprof output (×N, one for each node!)

Instrumentation Level • High-level language (HLL) • Requires HLL timing functions • Application mapping disturbed by instrumentation • Hardware Description Language (HDL) • Portable across HLLs and types of FPGA families • Selected level for instrumentation • FPGA bit stream • Requires targeting specific FPGA family • Instrument in minutes

Instrumentation Selection • Automated - Computation • State machines • Used for preserving execution order in C functions • Used to control state of pipelines • Control and status signals • Used by library functions • Automated - Communication • Control and status signals • Used for streaming communication • Used for DMA transfers • Application specific • Monitoring variables for meaningful values

Measurement Techniques • Profiling • Counters • Records number of occurrences of event • Low overhead • Normally uses registers • Block RAM can be used for state machines • Tracing • Timestamps • Indicating when event occurred • Data • Associated with each event • Greater overhead • Uses memory to store timestamps and data • Greater fidelity • Reconstruction of sequence of events
[Figure*: trace timeline with CPUs 0-3 on one axis and time on the other]
* Zaki, O., Lusk, E., Gropp, W., and Swider, D. 1999. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl. 13, 3 (Aug. 1999), 277-288.
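The contrast between profiling and tracing can be illustrated with a minimal software sketch. This is not the hardware measurement module from the slides; the event names, cycle numbers, and data values below are made up for illustration. Profiling keeps one counter per event, while tracing stores a timestamped record for every occurrence, which costs more storage but lets the sequence of events be reconstructed.

```python
from collections import Counter

# Hypothetical event stream: (cycle, event name, data value).
# All names and numbers here are illustrative assumptions.
EVENTS = [
    (0, "pipeline_start", None),
    (3, "stream_stall", 12),
    (4, "stream_stall", 12),
    (9, "dma_write", 256),
    (15, "stream_stall", 40),
]


def profile(events):
    """Profiling: count occurrences per event (one counter per event)."""
    counts = Counter()
    for _cycle, name, _data in events:
        counts[name] += 1
    return counts


def trace(events):
    """Tracing: keep a timestamped record, with data, for every occurrence."""
    return [{"cycle": c, "event": n, "data": d} for c, n, d in events]


counts = profile(EVENTS)
records = trace(EVENTS)
stall_cycles = [r["cycle"] for r in records if r["event"] == "stream_stall"]

print(counts["stream_stall"])  # number of stalls only
print(stall_cycles)            # when each stall happened
```

The profile answers "how often", using storage that does not grow with run length; the trace also answers "when" and "with what data", at the cost of one record per event.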

Hardware Measurement Module

Adding Instrumentation & Measurement
[Figure: HLL tool flow across CPU(s) and FPGA(s). Labeled elements: Application (C source), Application (HDL), Loopback (C source), Loopback (HDL), HLL API Wrapper, HLL Hardware Wrapper, Hardware Measurement Module, Measurement Extraction Process/Thread, Instrumented Signals, software-hardware mapping, compile software, implement hardware. Stage captions: uninstrumented project; instrumentation added to C source; C source for FPGA mapped to HDL; instrumentation added to HDL; implement hardware; finished design.]

Reverse Mapping & Analysis • Mapping of HDL data back to HLL • Variable name-matching • Observing scope and other patterns • Bottleneck detection • Load-balancing of replicated functions • Monitoring for pipeline stalls • Detecting streaming communication stalls • Finding shared-memory contention
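As a rough illustration of two of the bottleneck checks listed above, the sketch below works on profile counters that are assumed to have already been extracted and mapped back to the HLL. The counter values and the 1.1 threshold are hypothetical and are not taken from the tool described in the slides.

```python
def load_imbalance(busy_cycles):
    """Ratio of the busiest replica to the mean (1.0 means perfectly balanced)."""
    mean = sum(busy_cycles) / len(busy_cycles)
    return max(busy_cycles) / mean


def stall_fraction(stall_cycles, total_cycles):
    """Fraction of cycles a pipeline or stream spent stalled."""
    return stall_cycles / total_cycles


def is_imbalanced(busy_cycles, threshold=1.1):
    """Flag replicated functions whose imbalance exceeds a chosen threshold.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return load_imbalance(busy_cycles) > threshold


# Hypothetical busy-cycle counters for four replicas of one function.
busy = [900, 880, 910, 310]
imbalance = load_imbalance(busy)  # mean is 750, busiest replica is 910

print(round(imbalance, 3))
print(is_imbalanced(busy))
print(stall_fraction(250, 1000))
```

A ratio near 1.0 suggests the replicas share work evenly; a large stall fraction on a stream points at a communication bottleneck rather than a computational one.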

Example RC Visualization • Need unified visualizations that accentuate important statistics • Must be scalable to many nodes

Molecular Dynamics • Simulation • Interactions between atoms and molecules • Discrete time intervals • Models forces • Newtonian physics • Van der Waals forces • Other interactions • Tracks each molecule's position and velocity • X, Y and Z directions http://en.wikipedia.org/wiki/Molecular_dynamics

Case Study Setup • Impulse C v2.2 • XD1000 platform • Opteron 2.2 GHz • XD1000 module with Altera Stratix-II EP2S180 FPGA in second processor socket • MD communication architecture • Chunks of MD data are read from SRAM • Data is streamed to multiple MD kernels that are pipelined • Results are stored back to SRAM

Impulse-C Profile Percentages • The output stream of the Molecular Dynamics kernel is a bottleneck.

Increasing the stream buffer size 32-fold raised the application speedup over the serial baseline from 6.2 to 7.8.
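To put the two speedup figures in perspective, the short calculation below derives the relative gain between the two FPGA versions. It assumes both speedups are measured against the same serial baseline, as the slide states; the variable names are illustrative.

```python
speedup_before = 6.2  # speedup vs. serial baseline, original buffer size
speedup_after = 7.8   # speedup vs. serial baseline, 32x larger buffer

# Runtime is proportional to 1 / speedup for a fixed serial baseline.
relative_gain = speedup_after / speedup_before          # new version vs. old
runtime_reduction = 1 - speedup_before / speedup_after  # share of old runtime saved

print(f"{relative_gain:.3f}")      # 1.258
print(f"{runtime_reduction:.1%}")  # 20.5%
```

In other words, the tuned version runs about 1.26 times faster than the untuned FPGA version, which corresponds to roughly a fifth less execution time.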

Performance Analysis Overhead • Additional FPGA resource usage • Less than 4% • Frequency reduction • Less than 3%

Conclusions • Developed prototype HLL-oriented RC performance analysis tool • First such runtime performance analysis tool framework (per extensive literature review) • Tracing & profiling available • Automated instrumentation in progress • Application case study performed • Observed minimal overhead from tool • Speedup achieved due to performance analysis • Future work • SRC support, automated instrumentation and analysis, integration with software PAT, further case studies

References • Paul Graham, Brent Nelson, and Brad Hutchings. Instrumenting bitstreams for debugging FPGA circuits. In Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 41-50, Washington, DC, USA, Apr. 2001. IEEE Computer Society. • Sameer S. Shende and Allen D. Malony. The Tau parallel performance system. International Journal of High Performance Computing Applications (HPCA), 20(2):287-311, May 2006. • C. Eric Wu, Anthony Bolmarcich, Marc Snir, David Wootton, Farid Parpia, Anthony Chan, Ewing Lusk, and William Gropp. From trace generation to visualization: a performance framework for distributed parallel systems. In Proc. of the 2000 ACM/IEEE conference on Supercomputing (CDROM) (SC), page 50, Washington, DC, USA, Nov. 2000. IEEE Computer Society. • Adam Leko and Max Billingsley, III. Parallel performance wizard user manual. http://ppw.hcs.ufl.edu/docs/pdf/manual.pdf, 2007. • S. Koehler, J. Curreri, and A. George, "Challenges for Performance Analysis in High-Performance Reconfigurable Computing," Proc. of Reconfigurable Systems Summer Institute 2007 (RSSI), Urbana, IL, July 17-20, 2007. • J. Curreri, S. Koehler, B. Holland, and A. George, "Performance Analysis with High-Level Languages for High-Performance Reconfigurable Computing," Proc. of 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Palo Alto, CA, Apr. 14-15, 2008.
