Debugging and Optimizing RC Applications

Seth Koehler, John Curreri

Presentation Transcript


  1. Seth Koehler John Curreri Debugging and Optimizing RC Applications

  2. Presentation Overview • Introduction • Background • Reconfigurable computing (RC) applications • Debug • Performance analysis • Project overview • Project details • ReCAP framework & tool • Special features • HLL-based debug & performance analysis • Case studies • Conclusions

  3. Introduction • Debugging and optimization are an integral part of application development • Typically at end of development cycle (after formulation and design phases) • Designers often spend longer debugging the application than designing it! * • Optimization is often just left for a later version, if ever • Every optimization made has to re-pass through debug phase • To improve productivity in application design, it is critical to address debug and optimization • [Figure: development phases: Formulation, Design, Translation, Execution] * Debugging FPGA systems - ftp://ftp.altera.com/outgoing/download/education/events/highspeed/Tek_ALTERAFPGADEBUG_IPIntegration_final.pdf

  4. Background – RC Applications • Why reconfigurable computing (RC)? • General-purpose architectures can be wasteful in terms of performance and power • Impractical to have an ASIC for every application • RC ~= FPGAs (Field-Programmable Gate Arrays) • Application-specific hardware and parallelism • Retain flexibility and programmability • RC applications typically employ CPUs and FPGAs • Leverage strengths of both types of processors • Potential for higher performance using less power • FPGA is programmed using Hardware Description Languages (HDLs) or High-Level Languages (HLLs) • CPU is programmed with whatever conventional HLL is desired (C, C++, MPI, UPC, etc.) • System and application complexity can make it difficult to achieve a correct, well-performing application

  5. Background - Debug • Debug: to detect and remove errors from a program * • Debugging methods • Stare at code • At least it helps you "wrap your mind around your code" • Insert printf statements • Requires some good guessing, can be tedious if more than a few printf's • Use debugger (e.g., gdb) • Much better – instant access to all data and support for indicating where/why a program crashed • Use simulator • Can provide more flexibility and information than debugger, but simulators can be inaccurate and slow, not to mention hard to make • Write assertions • Best – application designer documents situations that are impossible • Formal and dynamic verification methods check whether assertions hold * http://dictionary.reference.com

  6. Background – Performance Analysis • Performance analysis – investigate program behavior using information gathered during execution * • Aids designer in locating and remedying application bottlenecks, reducing guesswork in optimization • Replaces tedious, error-prone manual analysis methods (timing routines and printf statements) * http://en.wikipedia.org/wiki/Performance_analysis

  7. Project Overview • RC systems and applications are even more complex than in HPC • Heterogeneous components • Hierarchy of parallelism among components • Lack of visibility inside RC devices • Optimizing applications is crucial for effective use of these systems • Debug and performance tools are relied on heavily in HPC to productively verify and optimize applications • Debug and performance tools are even more essential in RC due to additional system and application complexity, and yet research is lacking • Objective: expand the notion and benefits of software debugging and performance analysis into the software-hardware realm of RC

  8. ReCAP Framework • Reconfigurable-computing application performance (ReCAP) framework • Adds assertion-based verification and performance analysis capabilities to FPGA portion of application • Builds upon existing assertions in HLLs and on Parallel Performance Wizard (PPW) for performance analysis of CPU portion of application • Three main components • HDL Instrumenter • Hardware Measurement Module (HMM) • RC-enhanced version of PPW (PPW+RC) • Backend (instrumentation and measurement) • Frontend (analysis and visualization)

  9. ReCAP: HDL Instrumenter • Modifies HDL design files to monitor application data at runtime • User can define "events" that are of interest • e.g., buffer full, cycles spent in a state • User can define "monitors" that determine what to record when event occurs • e.g., summary statistics, full trace • User can enable a number of automatic analyses • e.g., decision coverage, assertions, profiling, automatic bottleneck detection • [Figure: instrumentation process, HDL Instrumenter stage]

  10. ReCAP: Hardware Measurement Module • Hardware necessary to record, store, and retrieve data at runtime • Profiling, tracing, and sampling • Cycle counter and other module statistics (trace records dropped, counter overflow, etc.) • Buffers for storing trace data • Module control for performance data retrieval and miscellaneous control (e.g., clear and stop) • [Figure: instrumentation process, Hardware Measurement Module (HMM) stage]

  11. ReCAP: PPW+RC • PPW+RC backend adds thread to software to query HMM at runtime • Requires lock (since we now have shared FPGA access) • Handles FPGA performance data storage and migration to PPW data structures • Monitors FPGA API calls in addition to normal PPW software performance monitoring • PPW+RC frontend analyzes and presents measured data for CPUs / FPGAs • Table and chart views across multiple experiments • Export to Jumpshot for timeline views • [Figure: instrumentation process and PPW+RC front-end]

  12. ReCAP Tool-Flow • HDL source files are instrumented, then synthesized/implemented normally • HLL source files are instrumented during compilation • Use ppwcc instead of gcc or ppwupcc instead of upcc • Program is executed normally on system • Performance data file produced can be viewed and analyzed with PPW+RC

  13. Common RC Bottleneck Detection • Automatically search for common RC bottlenecks • Reduces time and knowledge needed to find bottlenecks • Requires some information from user • We attempt to minimize the amount of information requested • Currently produces text file containing • All detected bottlenecks • Potential optimization strategies for each • Peak/ideal speedup if bottleneck resolved

  14. Architecture-Aware Visualization • Architecture-aware visualization • Visualization within application & system context, with integrated common-bottleneck data • Must be scalable to large systems • Allow user to experiment with different optimization scenarios to see what provides best performance • [Figures: system-level view and node-level view]

  15. HLL Performance Analysis • High-level languages • Impulse C and Carte C • Convert subset of C to HDL • Employ DMA and streaming communication • Speedup gained by pipelining loops, library functions, and replicated functions • Impulse C: pipelining of loops determined by pragmas in code • Carte (SRC): automatic pipelining of innermost loop; library functions called as C functions but coded in HDL • Automated instrumentation • Computation: state machines (used for preserving execution order in C functions and for controlling pipelines); control and status signals used by library functions • Communication: control and status signals; streaming communication; DMA transfers • User-assisted instrumentation • Application-specific variables • Monitor meaningful values selected by user • Measurement • Employ HMM from HDL framework

  16. HLL Instrumentation & Measurement • [Figure: HLL tool flow from uninstrumented project to finished design. Labels: application (C source); software-hardware mapping; C source for FPGA mapped to HDL; instrumentation added to C source (HLL API wrapper, measurement extraction process/thread, loopback); instrumentation added to HDL (HLL hardware wrapper, Hardware Measurement Module, instrumented signals, loopback); compile software for CPU(s); implement hardware for FPGA(s)]

  17. HLL Analysis & Visualizations • Bottleneck detection (currently user-assisted) • Load-balancing of replicated functions • Monitoring for pipeline stalls • Detecting streaming communication stalls • Finding shared-memory contention • Integration with performance analysis tool • Profiling data • Pie charts showing time utilization • Tree view of CPU and FPGA timing • [Figure: HDL state machine mapped to C source. Labels: main MD loop, input stream, pipeline, output stream; state transitions b4s0, b4s1, b4s2, b4s3, b4s4, b6s0, b6s1]

  18. HLL Assertion Debugging • Based on the ANSI C assert macro:

      int num = 0, i = 0, x[10] = {0};
      while(num==0) {
        num=x[i++];
        assert(i<10);
      }

  • Failure will halt application, displaying an error: test.c:7: main: Assertion `i<10' failed. • Assertions can be disabled via #define NDEBUG • Most HLLs do not synthesize standard C library functions on the FPGA • Convert assertion function to if statement (renamed via Perl script) • Send line number of failed assertions on the FPGA to the CPU • Communication stream created and routed between hardware functions with assertion statements and software function • Perform failure actions via a software function (added via Perl script)

  19. Case Study: N-Queens • Overview • Find number of distinct ways n queens can be placed on an n×n board without attacking each other (via backtracking algorithm) • Multi-CPU/FPGA application (UPC/VHDL) • Overhead • <= 6% area (sixteen 32-bit profile counters for state machines) • <= 2% memory (96-bit-wide trace buffer for core finish time) • Negligible frequency degradation observed

  20. Case Study: 2D-PDF Estimation * • Application • Estimate a 2D probability density function (i.e., nearly smooth histogram) given set of (x, y) coordinate data • 3.2GHz Xeon, Virtex-4 LX100 FPGA, PCI-X • Results • Automatic bottleneck detection showed problematic communication and control • Based on tool suggestions, buffer sizes were increased and control logic was restructured within a day, providing up to a 5.5x speedup for the 10-core design • [Figure legend: Software functions, FPGA Write, FPGA Read] * 2D-PDF code written by Karthik Nagarajan

  21. Case Study: Molecular Dynamics (HLL) • Molecular Dynamics • Simulates interaction of molecules over discrete time steps • Impulse C version 2.2 • XD1000 platform • Dual-processor motherboard • Opteron 2.2GHz • Stratix-II EP2S180 XD1000 module • MD communication architecture • Chunks of MD data read from SRAM • Data streamed to multiple pipelined MD kernels • Results stored back to SRAM • Stream buffer • Increased buffer size by 32 times • Speedup change • 6.2x vs. serial baseline before enhancements • 7.8x vs. serial baseline after enhancements

  22. HLL Debug Case Study • Impulse C performs 32-bit comparison with 64-bit values • Impulse C code:

      void Logcontrol (… {
        …
        co_int64 big, test, update;
        small_1=321; small_2=123;
        big=5000000000; test=1073741824;
        IF_SIM(printf("HW big:%lld\n",big);)
        IF_SIM(printf("HW test:%lld\n",test);)
        i=0;
        while(big<test) {
          co_stream_write(small_stream, &small_1, sizeof(co_int32));
          IF_SIM(printf("HW if passed\n");)
          small_1=big&4294967295;
          small_2=big>>32;
          i++;
          assert(i<10);
        }

  • Generated VHDL compares only the lower 32 bits: ni192_suif_tmp <= … & cmp_less_s(r_big(31 downto 0), r_test(31 downto 0)); • big = 5000000000 = 100101010000001011111001000000000 (binary); its lower 32 bits are 705032704 • test = 1073741824 = 1000000000000000000000000000000 (binary)

  23. HLL Debug Case Study (cont) • Results • In simulation, loop does not execute and assertion is never called • In hardware, loop executes infinitely • In hardware with assert, loop executes and assertion fails • Overhead • Streaming overhead generated per process • Additional FPGA resource usage < 0.1% • Simulation output:

      C:\hwr\test4-assert>memtest.exe
      Small stream Open
      HW big:5000000000
      HW test:1073741824
      Big stream Open
      Small lower read:321
      Small upper read:123
      …

  • Hardware execution output:

      [root@xd1000-3 test4]# ./run_sw
      Small stream Open
      Big stream Open
      memtest_hw.c:31: Assertion 'i<10' failed.
      Small lower read:705032704
      Small upper read:1
      …

  24. Conclusions • Debug and performance analysis of RC applications are critical for improving productivity in obtaining a correctly functioning, well-performing application • ReCAP framework/tool aids designers with verification and performance analysis • Records and monitors application data on CPU and FPGA at runtime while minimizing overhead and user effort • Can perform a number of automated analyses including common bottleneck detection, decision coverage, and assertion monitoring • Provides analysis and presentation of CPU/FPGA debug and performance data • ReCAP represents the first RC application performance framework and tool (per extensive literature review) • Debug capabilities are also not currently found in other tools
