Seth Koehler John Curreri Debugging and Optimizing RC Applications
Presentation Overview • Introduction • Background • Reconfigurable computing (RC) applications • Debug • Performance analysis • Project overview • Project details • ReCAP framework & tool • Special features • HLL-based debug & performance analysis • Case studies • Conclusions
Introduction • Debugging and optimization are an integral part of application development • Typically occur at the end of the development cycle (after the formulation and design phases) • Designers often spend longer debugging an application than designing it! * • Optimization is often left for a later version, if it happens at all • Every optimization made has to pass through the debug phase again • To improve productivity in application design, it is critical to address debug and optimization • [Figure: development phases: Formulation, Design, Translation, Execution] * Debugging FPGA systems - ftp://ftp.altera.com/outgoing/download/education/events/highspeed/Tek_ALTERAFPGADEBUG_IPIntegration_final.pdf
Background – RC Applications • Why reconfigurable computing (RC)? • General-purpose architectures can be wasteful in terms of performance and power • Impractical to have an ASIC for every application • RC ~= FPGAs (Field-Programmable Gate Arrays) • Application-specific hardware and parallelism • Retain flexibility and programmability • RC applications typically employ CPUs and FPGAs • Leverage strengths of both types of processors • Potential for higher performance using less power • FPGA is programmed using Hardware Description Languages (HDLs) or High-Level Languages (HLLs) • CPU is programmed with whatever conventional HLL is desired (C, C++, MPI, UPC, etc.) • System and application complexity can make it difficult to achieve a correct, well-performing application
Background - Debug • Debug: to detect and remove errors from a program * • Debugging methods • Stare at code • At least it helps you "wrap your mind around your code" • Insert printf statements • Requires some good guessing, can be tedious if more than a few printf's • Use debugger (e.g., gdb) • Much better – instant access to all data and support for indicating where/why a program crashed • Use simulator • Can provide more flexibility and information than debugger, but simulators can be inaccurate and slow, not to mention hard to make • Write assertions • Best – application designer documents situations that are impossible • Formal and dynamic verification methods check whether assertions hold * http://dictionary.reference.com
Background – Performance Analysis • Performance analysis: investigate program behavior using information gathered during execution * • Aids the designer in locating and remedying application bottlenecks, reducing guesswork in optimization • Replaces tedious, error-prone manual analysis methods (timing routines and printf statements) * http://en.wikipedia.org/wiki/Performance_analysis
Project Overview • RC systems and applications are even more complex than those in HPC • Heterogeneous components • Hierarchy of parallelism among components • Lack of visibility inside RC devices • Optimizing applications is crucial for effective use of these systems • Debug and performance tools are relied on heavily in HPC to productively verify and optimize applications • Debug and performance tools are even more essential in RC due to the additional system and application complexity, and yet research is lacking • Objective: expand the notion and benefits of software debugging and performance analysis into the software-hardware realm of RC
ReCAP Framework • Reconfigurable-computing application performance (ReCAP) framework • Adds assertion-based verification and performance analysis capabilities to the FPGA portion of the application • Builds upon existing HLL assertions and upon Parallel Performance Wizard (PPW) for performance analysis of the CPU portion of the application • Three main components • HDL Instrumenter • Hardware Measurement Module (HMM) • RC-enhanced version of PPW (PPW+RC), with a backend (instrumentation and measurement) and a frontend (analysis and visualization)
ReCAP: HDL Instrumenter • Modifies HDL design files to monitor application data at runtime • User can define "events" that are of interest • e.g., buffer full, cycles spent in a state • User can define "monitors" that determine what to record when an event occurs • e.g., summary statistics, full trace • User can enable a number of automatic analyses • e.g., decision coverage, assertions, profiling, automatic bottleneck detection • [Figure: HDL Instrumenter within the instrumentation process]
ReCAP: Hardware Measurement Module • Hardware necessary to record, store, and retrieve data at runtime • Profiling, tracing, and sampling • Cycle counter and other module statistics (trace records dropped, counter overflow, etc.) • Buffers for storing trace data • Module control for performance data retrieval and miscellaneous control (e.g., clear and stop) • [Figure: Hardware Measurement Module (HMM) within the instrumentation process]
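The bookkeeping described on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions: the real HMM is hardware, and every name here (`hmm_model`, `hmm_event`, `NUM_EVENTS`, `TRACE_DEPTH`) is hypothetical rather than taken from ReCAP. It shows a cycle counter, per-event profile counters, a fixed-size trace buffer, and a "trace records dropped" statistic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_EVENTS 16   /* hypothetical number of profile counters */
#define TRACE_DEPTH 4   /* hypothetical trace buffer depth */

/* One trace record: which event fired and at what cycle. */
typedef struct {
    uint64_t cycle;
    uint32_t event_id;
} trace_record;

/* Software model of the measurement state (not the real hardware). */
typedef struct {
    uint64_t cycle_counter;               /* free-running cycle count */
    uint32_t profile_count[NUM_EVENTS];   /* profiling: events seen */
    trace_record trace[TRACE_DEPTH];      /* tracing: bounded buffer */
    size_t trace_used;
    uint32_t records_dropped;             /* statistic: buffer was full */
} hmm_model;

/* Module control: clear all counters and buffers. */
void hmm_clear(hmm_model *m) { memset(m, 0, sizeof *m); }

/* Advance the cycle counter by one clock cycle. */
void hmm_tick(hmm_model *m) { m->cycle_counter++; }

/* Record an event: always profile it, trace it only if space remains. */
void hmm_event(hmm_model *m, uint32_t event_id) {
    assert(event_id < NUM_EVENTS);
    m->profile_count[event_id]++;
    if (m->trace_used < TRACE_DEPTH) {
        m->trace[m->trace_used].cycle = m->cycle_counter;
        m->trace[m->trace_used].event_id = event_id;
        m->trace_used++;
    } else {
        m->records_dropped++;
    }
}
```

In this model, profiling keeps only a count per event while tracing keeps a timestamped record per occurrence, which is why the trace buffer can fill and drop records while the profile counters keep counting.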
ReCAP: PPW+RC • PPW+RC backend adds a thread to the software to query the HMM at runtime • Requires a lock (since FPGA access is now shared) • Handles FPGA performance data storage and migration to PPW data structures • Monitors FPGA API calls in addition to normal PPW software performance monitoring • PPW+RC frontend analyzes and presents measured data for CPUs / FPGAs • Table and chart views across multiple experiments • Export to Jumpshot for timeline views • [Figure: PPW+RC front-end within the instrumentation process]
ReCAP Tool-Flow • HDL source files are instrumented, then synthesized/implemented normally • HLL source files are instrumented during compilation • Use ppwcc instead of gcc or ppwupcc instead of upcc • Program is executed normally on system • Performance data file produced can be viewed and analyzed with PPW+RC
Common RC Bottleneck Detection • Automatically search for common RC bottlenecks • Reduces time and knowledge needed to find bottlenecks • Requires some information from the user • We attempt to minimize the amount of information requested • Currently produces a text file containing • All detected bottlenecks • Potential optimization strategies for each • Peak/ideal speedup if the bottleneck is resolved
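The slides do not say how the peak/ideal speedup figure is computed. A minimal sketch, assuming that resolving a bottleneck removes all of the execution time attributed to it, is the usual upper bound below. The function name `ideal_speedup` and its parameters are hypothetical.

```c
#include <assert.h>

/*
 * Upper bound on speedup if the time attributed to one bottleneck
 * were eliminated entirely (an assumption, not the tool's formula).
 *   total_time      : measured execution time of the application
 *   bottleneck_time : portion of total_time spent in the bottleneck
 */
double ideal_speedup(double total_time, double bottleneck_time) {
    assert(total_time > 0.0);
    assert(bottleneck_time >= 0.0 && bottleneck_time < total_time);
    return total_time / (total_time - bottleneck_time);
}
```

For example, under this assumption an application that runs for 10 s and spends 5 s stalled on one bottleneck could gain at most a 2x speedup from resolving it.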
Architecture-Aware Visualization • Architecture-aware visualization • Visualization within application & system context, with integrated common-bottleneck data • Must be scalable to large systems • Allow the user to experiment with different optimization scenarios to see what provides the best performance • [Figure: system-level and node-level views]
HLL Performance Analysis • High-level languages: Impulse C and Carte C • Convert a subset of C to HDL • Employ DMA and streaming communication • Speedup gained by pipelining loops, library functions, and replicated functions • Impulse C: pipelining of loops is determined by pragmas in the code • Carte (SRC): automatic pipelining of the innermost loop; library functions are HDL coded and called as C functions • Automated instrumentation • Computation: state machines (used for preserving execution order in C functions and to control pipelines); control and status signals used by library functions • Communication: control and status signals for streaming communication and DMA transfers • User-assisted instrumentation • Application-specific variables: monitor meaningful values selected by the user • Measurement • Employ the HMM from the HDL framework
HLL Instrumentation & Measurement • [Figure: HLL tool flow from an uninstrumented project to a finished design. Labeled steps: instrumentation added to C source; software-hardware mapping; C source for FPGA mapped to HDL; instrumentation added to HDL; compile software; implement hardware. Labeled components: CPU(s), FPGA(s), application (C source and HDL), loopback (C source and HDL), HLL API wrapper, HLL hardware wrapper, measurement extraction process/thread, instrumented signals, Hardware Measurement Module]
HLL Analysis & Visualizations • Bottleneck detection (currently user-assisted) • Load-balancing of replicated functions • Monitoring for pipeline stalls • Detecting streaming communication stalls • Finding shared-memory contention • Integration with performance analysis tool • Profiling data • Pie charts showing time utilization • Tree view of CPU and FPGA timing • [Figure: C source of the main MD loop alongside its HDL state machine, showing the input stream, pipeline transition, states b4s0 to b4s4 and b6s0 to b6s1, and output stream]
HLL Assertion Debugging • Based on the ANSI C assert function • Example: int num = 0, i = 0, x[10] = {0}; while (num == 0) { num = x[i++]; assert(i < 10); } • A failure halts the application and displays an error, e.g. test.c:7: main: Assertion `i<10' failed. • Assertions can be disabled via #define NDEBUG • Most HLLs do not synthesize standard C library functions on the FPGA • Convert the assertion function to an if statement (renamed via Perl script) • Send the line number of failed assertions on the FPGA to the CPU • A communication stream is created and routed between hardware functions with assertion statements and a software function • Perform failure actions via a software function (added via Perl script)
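The conversion described above can be sketched in plain C. This is an illustration only, not the Impulse C API or the output of the Perl script: `assert_stream`, `HW_ASSERT`, `hw_scan`, and `sw_report` are hypothetical names, and a small array stands in for the communication stream between the FPGA and the CPU.

```c
#include <assert.h>
#include <stdio.h>

#define STREAM_DEPTH 8   /* hypothetical stream capacity */

/* Stand-in for the stream that carries failed line numbers to the CPU. */
typedef struct {
    int line[STREAM_DEPTH];
    int count;
} assert_stream;

/* "Hardware" side: assert() rewritten as an if statement that sends
 * the line number of the failed assertion instead of calling libc. */
#define HW_ASSERT(stream, cond)                                  \
    do {                                                         \
        if (!(cond) && (stream)->count < STREAM_DEPTH)           \
            (stream)->line[(stream)->count++] = __LINE__;        \
    } while (0)

/* Example hardware function: scan x until a nonzero value is found.
 * The loop is bounded by len so this model never reads out of range. */
int hw_scan(assert_stream *s, const int *x, int len) {
    int num = 0, i = 0;
    while (num == 0 && i < len) {
        num = x[i++];
        HW_ASSERT(s, i < 10);
    }
    return i;
}

/* "Software" side: read the stream and perform the failure action
 * (here, just printing a message). Returns the number of failures. */
int sw_report(const assert_stream *s, const char *file, FILE *out) {
    for (int k = 0; k < s->count; k++)
        fprintf(out, "%s:%d: hardware assertion failed\n", file, s->line[k]);
    return s->count;
}
```

Unlike the standard assert, this sketch records failures and keeps running; a software-side handler could instead halt the application on the first reported line number, as the slide describes.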
Case Study: N-Queens • Overview • Find the number of distinct ways n queens can be placed on an n×n board without attacking each other (via backtracking algorithm) • Multi-CPU/FPGA application (UPC/VHDL) • Overhead • <= 6% area (sixteen 32-bit profile counters for state machines) • <= 2% memory (96-bit-wide trace buffer for core finish time) • Negligible frequency degradation observed
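For reference, the counting problem itself can be sketched as a serial backtracking search in C. This is a generic textbook formulation, not the UPC/VHDL implementation used in the case study; the names `count_queens`, `place`, and `MAX_N` are hypothetical, and all placements are counted without removing symmetric ones.

```c
#include <assert.h>

#define MAX_N 16   /* keeps the diagonal bitmasks within 32 bits */

/* Place one queen per row; cols, diag1, and diag2 are bitmasks of the
 * columns and diagonals already attacked by queens in earlier rows. */
static unsigned long place(int n, int row, unsigned cols,
                           unsigned diag1, unsigned diag2) {
    if (row == n)
        return 1;   /* all n queens placed: one complete solution */
    unsigned long count = 0;
    for (int c = 0; c < n; c++) {
        unsigned col_bit = 1u << c;
        unsigned d1_bit = 1u << (row + c);           /* "/" diagonal */
        unsigned d2_bit = 1u << (row - c + n - 1);   /* "\" diagonal */
        if ((cols & col_bit) || (diag1 & d1_bit) || (diag2 & d2_bit))
            continue;   /* square is attacked: backtrack */
        count += place(n, row + 1, cols | col_bit,
                       diag1 | d1_bit, diag2 | d2_bit);
    }
    return count;
}

/* Number of ways to place n non-attacking queens on an n x n board. */
unsigned long count_queens(int n) {
    assert(n >= 1 && n <= MAX_N);
    return place(n, 0, 0u, 0u, 0u);
}
```

Calling `count_queens(8)` returns 92, the well-known count for the standard chessboard.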
Case study: 2D-PDF estimation* • Application • Estimate a 2D probability density function (i.e., nearly smooth histogram) given a set of (x, y) coordinate data • 3.2GHz Xeon, Virtex-4 LX100 FPGA, PCI-X • Results • Automatic bottleneck detection showed problematic communication and control • Based on the tool's suggestions, buffer sizes were increased and control logic was restructured within a day, providing up to a 5.5x speedup for the 10-core design • [Figure: software functions, FPGA write, FPGA read] * 2D-PDF code written by Karthik Nagarajan
Case Study: Molecular Dynamics • Molecular Dynamics • Simulates interaction of molecules over discrete time steps • Impulse C version 2.2 • XD1000 platform • Dual-processor motherboard • Opteron 2.2GHz • Stratix-II EP2S180 XD1000 module • MD communication architecture • Chunks of MD data read from SRAM • Data streamed to multiple pipelined MD kernels • Results stored back to SRAM • Stream buffer • Increased buffer size by 32 times • Speedup change • 6.2 vs. serial baseline before enhancements • 7.8 vs. serial baseline after enhancements
HLL Debug Case Study • Impulse C performs a 32-bit comparison on 64-bit values
Impulse C code:
    void Logcontrol (… { …
        co_int64 big, test, update;
        small_1=321; small_2=123;
        big=5000000000; test=1073741824;
        IF_SIM(printf("HW big:%lld\n",big);)
        IF_SIM(printf("HW test:%lld\n",test);)
        i=0;
        while(big<test) {
            co_stream_write(small_stream, &small_1, sizeof(co_int32));
            IF_SIM(printf("HW if passed\n");)
            small_1=big&4294967295; small_2=big>>32;
            i++; assert(i<10);
        }
Generated VHDL:
    ni192_suif_tmp <= … & cmp_less_s(r_big(31 downto 0), r_test(31 downto 0));
[Figure: big = 5000000000 (binary 100101010000001011111001000000000) and test = 1073741824 (binary 1000000000000000000000000000000); only the lower 32 bits are compared, so big is seen as 705032704]
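The effect of comparing only bits 31 downto 0 can be reproduced in ordinary C. The sketch below models the truncated signed comparison seen in the generated VHDL; it is not Impulse C code, and the function names `less_64`, `low32_signed`, and `less_low32` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Intended behavior: full 64-bit signed comparison. */
int less_64(int64_t a, int64_t b) { return a < b; }

/* Keep only bits 31..0 of v and interpret them as a signed
 * 32-bit two's-complement value. */
int64_t low32_signed(int64_t v) {
    uint32_t lo = (uint32_t)((uint64_t)v & 0xFFFFFFFFu);
    return (lo & 0x80000000u) ? (int64_t)lo - 4294967296LL : (int64_t)lo;
}

/* Model of the generated hardware: signed compare on the low 32 bits. */
int less_low32(int64_t a, int64_t b) {
    return low32_signed(a) < low32_signed(b);
}
```

With big = 5000000000 and test = 1073741824, `less_64` is false, but `less_low32` is true because the low 32 bits of big are 705032704. That difference is why the loop is skipped in simulation yet entered in hardware.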
HLL Debug Case Study (cont) • Results • In simulation, the loop does not execute and the assertion is never called • In hardware, the loop executes infinitely • In hardware with assert, the loop executes and the assertion fails • Overhead • Streaming overhead generated per process • Additional FPGA resource usage < 0.1%
Simulation:
    C:\hwr\test4-assert>memtest.exe
    Small stream Open
    HW big:5000000000
    HW test:1073741824
    Big stream Open
    Small lower read:321
    Small upper read:123
    …
Hardware execution:
    [root@xd1000-3 test4]# ./run_sw
    Small stream Open
    Big stream Open
    memtest_hw.c:31: Assertion 'i<10' failed.
    Small lower read:705032704
    Small upper read:1
    …
Conclusions • Debug and performance analysis of RC applications are critical for improving productivity in obtaining a correctly functioning, well-performing application • ReCAP framework/tool aids designers with verification and performance analysis • Records and monitors application data on CPU and FPGA at runtime while minimizing overhead and user effort • Can perform a number of automated analyses including common bottleneck detection, decision coverage, and assertion monitoring • Provides analysis and presentation of CPU/FPGA debug and performance data • ReCAP represents the first RC application performance framework and tool (per extensive literature review) • Its debug capabilities are also not currently found in other tools