Brian Holland, Mauricio Vacas, Vikas Aggarwal, Ryan DeVille, Ian Troxel, and Alan D. George High-performance Computing and Simulation (HCS) Research Lab Department of Electrical and Computer Engineering University of Florida.
Survey of C-based Application Mapping Tools for Reconfigurable Computing
Outline • Introduction • General Survey • Ten C-based Application Mappers • Benchmarking & Results • Finite-Impulse Response (FIR) • N-Queens • Radix Sort • Lessons Learned • Conclusions • Acknowledgements • References
Motivation for Application Mappers • HDL programming has shortcomings • Limited applicability to application developers • More involved development process (vs. software) • Requires training beyond application level • Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity? • Can we bring RC performance benefits to application developers? • Would this be practical/possible in traditional HDL? • HDL is well below the level of traditional application programming • Consequently, we need to move to a higher level of abstraction
Introduction • Selecting a Higher Level of Abstraction • CAD tools: Visually appealing, but tedious for large projects • New language: Optimal, but requires complete retraining • Traditional or Object-Oriented languages: Which? How? • Ideally, use pure ANSI-C, “The Universal Language” • Requires no additional knowledge or special training • Port existing C programs into hardware implementations (HDL) • Translation can be handled by a hardware compiler • Programmer concentrates on algorithmic functionality
Commonalities • General characteristics of C-based application mappers: • Companies create proprietary ANSI C-based language • Languages do not have all ANSI C features • Extra pragmas are included for corresponding compilers • Additional libraries of functions/macros for further extensions • Must adhere to specific programming “style” for maximum optimization • Emphasis on both hardware generation and I/O interfaces
Catapult C (Mentor Graphics) [2-3] • Algorithmic synthesis tool for RTL generation • RTL from “pure” untimed C++ • No extensions, pragmas, etc. • Compiler uses “wrappers” around algorithmic code • External: manages I/O interface • Internal: constrains synthesis to optimize for chosen interface • Explicit architectural constraints and optimization • Output: RTL netlists in VHDL, Verilog, and SystemC
Carte (SRC Computers) [1] • C/Fortran FPGA environment • Direct mapping of C/Fortran code to configuration level • Software emulation and simulation of compiled code for debugging • Capable of multiprocessor and multi-FPGA computational definitions • Allows explicit data flow control within memory hierarchy • Targets SRC’s MAP processor • Produces “Unified Executables” for HW or SW processor execution • Runtime libraries handle required interfacing and management
Handel C (Celoxica) [5] • Environment for cycle-accurate application development • All operations occur in one deterministic clock cycle • Makes it cycle-accurate, but clock frequency is reduced to that of the slowest operation • Decisions/Loops are “penalty-free” but can significantly impact timing • Language has pragmas for explicitly defined parallelism • Compiler can analyze, optimize, and rewrite code • Output: VHDL/Verilog, SystemC, or targeted EDIFs
DIME-C (Nallatech) [4] • FPGA prototyping tool • Designs are not cycle-accurate • Allows application synthesis for a higher clock speed • Compilation/Optimization • Pipeline/parallelize where possible • Included IEEE-754 FP cores • Dedicated (integer) multipliers • Currently in beta, expected release: 4Q05 • Output: synthesizable VHDL and DIMEtalk components
Mitrion C (Mitrion) [7] • “Softcore” processor tactic • “Processor” creates abstraction layer between C code and FPGA • Compilation • C code is mapped to a generic “API” of possible functions • Processor instantiated on FPGA, tailored to specific application • Custom instruction bit-widths, specific cache and buffer sizes • Currently in beta, expected release: 4Q05 • Output: a VHDL IP core for target architectures
Impulse C (Impulse Accelerated Technologies) [6] • Language/compiler for modeling sequential apps. • Processes: independent, potentially concurrent, computing blocks • Streams: communicate and synchronize processes • Uses Streams-C methodology • However, focuses on compatibility with C development environments • Compilation • Each process implemented as separate state machine • Output: Generic or FPGA-specific VHDL
Napa C (National Semiconductor) [8] • Language/compiler for RISC/FPGA hybrid processor • Capitalize on single-cycle interconnect instead of I/O bus • Datapath Synthesis Technique • Hand-optimized pre-placed, pre-routed module generators • Compiler generates hardware pipelines from C loops • Targets NS NAPA1000 hybrid processor • Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP) • ALP also compiles to RTL VHDL, structural VHDL, structural Verilog
SA-C (Colorado State University) [9-12] • High-level, expression-oriented, machine-independent, single-assignment language • Designed to implicitly express data-parallel operations • Image and signal processing • Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.) • Loop optimizations • Structural transforms • Execution block placement • Target Platforms • UC Irvine Morphosys; Annapolis WildForce, StarFire, WildFire
SystemC (Open SystemC Initiative, OSCI) [15-16] • Open-source extension of C++ for HW/SW modeling • Core language, modules & ports for defining structure, and interfaces & channels • Supports functional modeling • Hierarchical decomposition of a system into modules • Structural connectivity between modules using ports/exports • Scheduling and synchronization of concurrent processes using events • Event-driven simulator • Events are basic dynamic/static process synchronization objects
Streams C (Los Alamos National Laboratory) [12-14] • Stream-oriented sequential process modeling • Essentially, data elements moving through discrete functional blocks • Compiler • Generates multi-threaded processor executables and multiple FPGA bitstreams • Allows parallel C program translation into a parallel arch. • Includes functional-level simulation environment • Output: synthesizable RTL
About the Benchmarks • Three classic algorithms used for benchmarking • Finite-Impulse Response (FIR) • Simple 51-tap FIR filter for standard DSP applications • Compare compiler solutions and analyze their usage metrics • N-Queens • Classic embarrassingly parallel HPC backtracking search problem • Showcases the potential of optimized implementations • Radix Sort • Sorts using ‘binary bins’, minimizing resources • Illustrates resource metrics in RAM-intensive applications • Implementation Details • DIME-C, Handel C, Impulse C, VHDL, and ANSI-C (for baseline timing) • Experiments performed on Nallatech BenNUEY-PCI card with VirtexII-6000 FPGA • Resource utilization based on post place-and-route data • Runtime represents communication time (setup and verification I/O are excluded) • Handel C and Impulse C require VHDL wrappers, which can increase resource usage
Finite-Impulse Response • FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6]) • The application-mapper languages do not share a consistent I/O interface • Could not create a consistent streaming channel with requisite blocking in every tool • Instead, FIR algorithm operates on values stored in a block RAM • Obtains speedup through parallel multiplication, efficient memory accesses • The 51 coefficients and working variables are kept in local variables • Additional performance boosts are possible in multi-channel DSP processing
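As a rough software reference for the filter described on this slide, the sketch below computes a 51-tap FIR over a buffer of 16-bit samples, mirroring the block-RAM arrangement rather than a streaming interface. The function name `fir_filter`, the 64-bit accumulator, and the zero-padding of samples before the start of the buffer are assumptions made for illustration; they are not taken from the benchmark code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51 /* tap count stated on the slide */

/* Direct-form FIR: y[n] = sum over k of coeff[k] * x[n - k].
 * Assumption: samples before x[0] are treated as zero.
 * Assumption: a 64-bit accumulator is used so that 51 products of
 * 16-bit values cannot overflow in this software model. */
void fir_filter(const int16_t coeff[NUM_TAPS], const int16_t *x,
                int64_t *y, size_t n_samples)
{
    assert(coeff != NULL && x != NULL && y != NULL);

    for (size_t n = 0; n < n_samples; n++) {
        int64_t acc = 0;
        /* The 51 multiplications are independent of each other, which is
         * what a hardware mapper can execute in parallel. */
        for (size_t k = 0; k < NUM_TAPS && k <= n; k++) {
            acc += (int64_t)coeff[k] * (int64_t)x[n - k];
        }
        y[n] = acc;
    }
}
```

Feeding a unit impulse through `fir_filter` returns the coefficients in order, which is a convenient sanity check before porting the loop to any mapper.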
N-Queens • Represents a purely computational algorithm; virtually no communication overhead • Algorithm contains several parallelizable code segments, exploitable for speedup • Implementations are based upon same baseline C code • Every available technique and compiler optimization is employed to boost performance • Notes: • Handel C N-Queens is a benchmark from our MAPLD’04 paper with additional refinements • VHDL N-Queens is culmination of a semester-long endeavor into algorithm’s parallelism • DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
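For orientation, here is a minimal ANSI-C sketch of a backtracking N-Queens solution counter. It is not the baseline code used in the study; the bitmask formulation and the names `place` and `n_queens` are assumptions chosen for brevity. The outer loop over first-row columns shows where independent subtrees arise, which is the kind of parallelizable segment the slide refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Count completions of a partial board. Each mask has one bit per column:
 * cols   = columns already occupied,
 * diag_l = squares attacked along one diagonal direction in this row,
 * diag_r = squares attacked along the other diagonal direction. */
static uint64_t place(uint32_t all, uint32_t cols,
                      uint32_t diag_l, uint32_t diag_r)
{
    if (cols == all) {
        return 1; /* every column holds a queen */
    }
    uint64_t count = 0;
    uint32_t free_sq = all & ~(cols | diag_l | diag_r);
    while (free_sq != 0) {
        uint32_t bit = free_sq & (~free_sq + 1u); /* lowest free square */
        free_sq &= free_sq - 1u;                  /* clear that square */
        count += place(all, cols | bit,
                       ((diag_l | bit) << 1) & all,
                       (diag_r | bit) >> 1);
    }
    return count;
}

/* Total number of solutions for an n x n board (1 <= n <= 16 here). */
uint64_t n_queens(int n)
{
    assert(n >= 1 && n <= 16);
    uint32_t all = (1u << n) - 1u;
    uint64_t total = 0;
    /* Each first-row placement starts an independent search subtree,
     * so these iterations could run on separate functional units. */
    for (int col = 0; col < n; col++) {
        uint32_t bit = 1u << col;
        total += place(all, bit, (bit << 1) & all, bit >> 1);
    }
    return total;
}
```

Calling `n_queens(8)` returns 92, the well-known solution count for the standard chessboard.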
Radix Sort • Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time) • Represents a “worst-case” legacy algorithm, containing no functional-level parallelism • Every element in every iteration depends on every previous element in every iteration • Ideal for software processor with fast cache, challenging in FPGA hardware • Speedup comes through efficient RAM usage and compiler optimizations/pipelining • Reduce quantity and addressing complexity of RAM accesses whenever possible • Metrics are based on sorting 600 32-bit integers contained within a block RAM
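The one-bit-at-a-time scheme can be sketched in plain C as follows. This is an illustrative least-significant-bit-first version, not the benchmarked implementation; the function name `radix_sort_binary`, the caller-supplied scratch buffer, and the use of unsigned values are assumptions (the slide says only "32-bit integers").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Binary radix sort: 32 passes, each splitting the data into a "0 bin"
 * and a "1 bin" on the current bit while preserving relative order.
 * scratch must hold at least n elements. */
void radix_sort_binary(uint32_t *data, uint32_t *scratch, size_t n)
{
    assert(data != NULL && scratch != NULL);

    for (unsigned bit = 0; bit < 32; bit++) {
        /* First pass: count zeros so the 1 bin can start right after them. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if (((data[i] >> bit) & 1u) == 0) {
                zeros++;
            }
        }
        /* Second pass: stable scatter into the two bins. */
        size_t zero_pos = 0;
        size_t one_pos = zeros;
        for (size_t i = 0; i < n; i++) {
            if (((data[i] >> bit) & 1u) == 0) {
                scratch[zero_pos++] = data[i];
            } else {
                scratch[one_pos++] = data[i];
            }
        }
        memcpy(data, scratch, n * sizeof *data);
    }
}
```

Every pass reads and writes the whole array, which makes the RAM-access pattern, rather than arithmetic, the dominant cost, consistent with the slide's emphasis on reducing RAM accesses.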
Some Optimization Techniques • Keep expensive computational operations to a minimum • Multiplication, division, modulo, greater/less than, and floating point are *slow* • Minimize reliance on arrays • Watch for combinable statements • Exploit functional level parallelism • Reduce bit-widths to minimal size
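Two of these techniques, avoiding modulo and multiplication, can be illustrated with a small C sketch. The function names and the constants (a 64-entry buffer, scaling by 10) are hypothetical, and whether a particular mapper performs such rewrites automatically is not stated in the source.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_SIZE 64u /* hypothetical buffer size; must be a power of two */

/* Baseline versions using the "slow" operators named on the slide. */
uint32_t next_index_slow(uint32_t i) { return (i + 1u) % BUF_SIZE; }
uint32_t scale_slow(uint32_t x)      { return x * 10u; }

/* Modulo by a power of two reduces to a bit mask. */
uint32_t next_index_fast(uint32_t i)
{
    /* The mask is only equivalent when BUF_SIZE is a power of two. */
    assert((BUF_SIZE & (BUF_SIZE - 1u)) == 0u);
    return (i + 1u) & (BUF_SIZE - 1u);
}

/* Multiplication by a constant reduces to shifts and one add:
 * 10 * x = 8 * x + 2 * x. */
uint32_t scale_fast(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```

Both rewrites give identical results for unsigned operands, including on wraparound, so they can be checked against the baseline versions in software before hardware compilation.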
Case Study: Dot Product
(On the original slide the code is color-coded: green for computation, blue for communication, orange for pragmas.)

DIME-C

void Kernel(int a[50], int b[50], int answer) {
    int i, temp = 0;
    /* Multiply-accumulate over the 50 element pairs */
    for (i = 0; i < 50; i++) {
        temp += a[i] * b[i];
    }
    answer = temp;
}

void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int answer) {
    int answer1, answer2;
    /* Each pragma requests a separate hardware instance of Kernel */
    #pragma genusc instance Kernel1
    Kernel(a1, b1, answer1);
    #pragma genusc instance Kernel2
    Kernel(a2, b2, answer2);
    answer = answer1 + answer2;
}

IMPULSE C

void Kernel1(co_stream a1, co_stream b1, co_stream z1) {
    int i, a[50], b[50], answer = 0;
    co_stream_open(a1, O_RDONLY, INT_TYPE(32)); /*etc*/
    /* Read both operand vectors from their input streams */
    for (i = 0; i < 50; i++) {
        co_stream_read(a1, &a[i], sizeof(int32));
        co_stream_read(b1, &b[i], sizeof(int32));
    }
    for (i = 0; i < 50; i++) {
        #pragma CO UNROLL
        answer += a[i] * b[i];
    }
    co_stream_write(z1, &answer, sizeof(int32));
    co_stream_close(a1); /*etc*/
}

void Kernel2(co_stream a2, co_stream b2, co_stream z2) {
    /* SAME AS IN Kernel1 */
}

void dot_product(co_stream z1, co_stream z2, co_stream ans) {
    int answer1, answer2, answer;
    co_stream_open(z1, O_RDONLY, INT_TYPE(32)); /*etc*/
    /* Collect the two partial results and combine them */
    co_stream_read(z1, &answer1, sizeof(int32));
    co_stream_read(z2, &answer2, sizeof(int32));
    answer = answer1 + answer2;
    co_stream_write(ans, &answer, sizeof(int32));
    co_stream_close(z1); /*etc*/
}

HANDEL C

int 32 Kernel1(int 32 a[50], int 32 b[50]) {
    static int 32 i, temp[50], answer;
    /* Replicated par: all 50 multiplications in parallel */
    par (i = 0; i < 50; i++) {
        temp[i] = a[i] * b[i];
    }
    /* Sequential accumulation of the products */
    for (i = 0; i < 50; i++) {
        answer += temp[i];
    }
    return answer;
}

int 32 Kernel2(int 32 a[50], int 32 b[50]) {
    /* SAME AS IN Kernel1 */
}

void main() // dot_product
{
    int 32 a1[50];
    int 32 b1[50];
    int 32 a2[50];
    int 32 b2[50];
    int 32 ans1, ans2;
    int 32 answer;
    interface bus_out() OutputResult(answer);
    /* Run both kernels concurrently */
    par {
        ans1 = Kernel1(a1, b1);
        ans2 = Kernel2(a2, b2);
    }
    answer = ans1 + ans2;
}

*Not all implementations are perfectly optimized. Your mileage will vary.*
Lessons Learned • Tools are not near point of automatic translation • Programs still require some tweaking for hardware compilation [17] • Optimized Software C ≠ Optimized Hardware C • However, generating VHDL is significantly easier • Learning basics of a C-based mapper is straightforward • At least two major challenges remain: • Input/output interfaces become a limiting factor • Moving generic VHDL to unsupported platforms requires VHDL knowledge • However, once a generic I/O wrapper is generated, it should be reusable • True hardware debugging remains a challenge • Another level of abstraction means another layer for mistranslation • With no knowledge of internal VHDL signals, tracing becomes difficult
Conclusions • Advantages of C-based application mappers • Far broader audience of potential RC users with high-level languages • Required HDL knowledge is significantly reduced or eliminated • Time to preliminary results is much less than manual HDL • Software-to-hardware porting is considerably easier • Visualization of C hardware is far easier for scientific community • Disadvantages • Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower • Mappers can parallelize and pipeline C code, but they generally cannot automatically instantiate multiple functional units • Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to the algorithm’s structure • Reduced development time can come at cost of performance
Acknowledgements • We thank the following vendors for application mapping tools, information, and technical support: • Celoxica (Handel C) • Impulse Accelerated Technologies (Impulse C) • Nallatech (DIME-C) • Mitrion (Mitrion C) • We thank the following vendors for providing tools and/or hardware that made this study possible: • Aldec (Active-HDL & Riviera EDA tools) • Intel (Xeon servers) • Nallatech (FUSE & DIMEtalk tools, RC boards) • Xilinx (ISE, RC boards, FPGAs)
References
[1] http://www.srccomp.com
[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/
[3] K. Morris, “Catapult C: Mentor Announces Architectural Synthesis,” fpgajournal.com, June 1, 2004.
[4] Nallatech, Inc., “DIME-C User Guide,” Reference Manual, United Kingdom, 2005.
[5] Celoxica, Ltd., “Using Handel-C with DK,” Training Manual, United Kingdom, 2005.
[6] D. Pellerin and S. Thibault, “Practical FPGA Programming in C,” Pearson Education, Inc., Upper Saddle River, NJ, 2005.
[7] Mitrionics AB, Inc., “The Mitrion Processor,” Product Overview, Sweden, 2005.
[8] M. Gokhale, J. Stone and E. Gomersall, “Co-Synthesis to a Hybrid RISC/FPGA Architecture,” Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.
[9] J. Hammes and W. Böhm, “The SA-C Language,” Reference Manual, Colorado State University, 2001.
[10] J. Hammes, M. Chawathe and W. Böhm, “The SA-C Compiler,” Reference Manual, Colorado State University, 2001.
[11] Colorado State Univ., “Cameron Poster for ACS PI Meeting,” Arlington, VA, March 7, 2002.
[12] I. Troxel, “CARMA: An Infrastructure for Reconfigurable High-Performance Computing,” Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.
[13] R. Goering, “Open-source C compiler targets FPGAs,” Embedded.com, October 18, 2002.
[14] J. Frigo, M. Gokhale and D. Lavenier, “Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective,” Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
[15] http://www.systemc.org
[16] OSCI, “SystemC 2.0.1 Language Reference Manual,” Reference Manual, San Jose, CA, 2003.
[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, “The DARPA boolean equation benchmark on a reconfigurable computer,” Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[18] V. Aggarwal, I. Troxel, and A. George, “Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI,” Proc. MAPLD, Washington, DC, September 8-10, 2004.
[19] J. Jussel, “The future of programmable SoC design is C-based,” Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.