Implementing Algorithms in FPGA-Based Reconfigurable Computers Using C-Based Synthesis Doug Johnson, Technical Marketing Manager NCSA/OSC Reconfigurable Systems Summer Institute Urbana, Illinois, July 11-13 2005
Celoxica
• UK-based system design company
• Provider of design tools, IP & services for Digital Imaging & Signal Processing
  • Image Processing
  • Video Processing
  • Sonar/Radar signal processing
  • Biometrics
  • Massively parallel data mining and matching
• Complete solutions for Electronic System Level (ESL) Design
  • System/algorithm acceleration
  • Co-design partitioning
  • Co-simulation & co-verification (C/C++/SystemC/Handel-C/Matlab/VHDL/Verilog)
  • Hardware compilation & C synthesis to reconfigurable architectures
• Consulting and professional services
  • Systems analysis and design strategy
  • System implementation capability
Presentation Objectives
• Prerequisites
  • Motivations for using FPGAs in RC and HPC
  • HPC and RC FPGA systems hardware and infrastructure
• Objectives
  • HPC algorithms and considerations for Reconfigurable Computing (RC)
  • Share a perspective on the state of the art for C-based HW design
  • Describe the C to FPGA flow
  • Illustrate with code examples …
  • Look forward to some critical debate…
Agenda
• Reconfigurable Computing
  • Considerations, core algorithm relationships, commercial applications
• C-based design
  • The solution space (its place in EDA)
  • Nature of C for HW design
• The Design Flow
• Summary
• JPEG2000 Design Example
Reconfigurable Computing (RC)
• “RC = Using FPGAs for (algorithmic) computation”
  • 1. Embedded: Well established, with a body of knowledge/experience
  • 2. Enterprise: Some
  • 3. HPC: Starting out
Reconfigurable Computing
[Timeline figure, 1980 / 1990 / 2000 / 20X0?, with labels roughly earliest to latest: FPGAs; First RC Successes; Partitioning Frameworks; Closely Coupled Systems; Commercial C-to-FPGA tools; Advanced Compilers; Intimately Coupled Systems]
• Promised Opportunities
  • Algorithm Acceleration
    • Exploit parallelism to increase performance with custom HW implementation
  • Algorithm Offload
    • Free CPU resource by offloading bottleneck processes
• BIG Challenges
  • Development complexity
    • Design framework and methods, deployment and integration/middleware
  • Coupling to coprocessor/data bandwidth
  • Price/Performance/Power!
  • Choosing the right applications!
FPGA Computing and Methodology
• High Performance Embedded and Reconfigurable Computing
• Why FPGA Computing?
  • Moore’s Law showing signs of strain
  • Ability to parallelize in HW
  • Price/GOPS coming down rapidly
  • Hard IP blocks: excellent density
• Example: Floating Point Performance
  • Maximum for Virtex-4: 50 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
  • Maximum for Virtex-2: 17.5 GFLOPS (Courtesy of Dave Bennett, Xilinx Labs)
  • “Can fit 10’s of FPUs on 2 Xilinx Virtex-4’s” (Courtesy of Justin Tripp, LANL)
  • Use of hard macros for functions is mandatory (example: DSP48 on Virtex-4)
• C-based design for FPGAs
  • Several offerings on the commercial marketplace or in research
  • Commercial: Celoxica, Mentor Graphics, Impulse Technologies, Mitrion…
  • Research: Sandia, UC Riverside, LANL
  • RTL/HDL is the most widely used route to FPGAs but is not usable by SW engineers
Conventional Wisdom for RC
• 1. Small data objects
  • Data transfer overhead to coprocessor, high operation to byte ratio
• 2. Modest arithmetic
  • Difficult to design and implement complex algorithms in HW
  • Integer/fixed precision calculations
  • Floating point too resource expensive
• 3. Data-parallelism
  • Parallelism essential: FPGA clocks are an order of magnitude slower than CPUs
  • Fine grain: wide data widths
  • Medium grain: operation/function routine
  • Coarse grain: multiple instantiations of application processes
• 4. Pipeline-ability
  • Streaming applications are the most successful
• 5. Simple Control
  • Difficult to design complex scheduling schemes in parallel HW
[Slide callouts for 2005: Closely coupled systems; C-based design; High Density Devices; Soft Cores/C-based design; Essential; Fewer Issues with Latency in HPC]
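Note on item 2: the slide only says that integer/fixed precision arithmetic is preferred over floating point on FPGAs. As a rough illustration of what "fixed precision" means in a C model, here is a minimal sketch of a signed fixed-point multiply. The Q16.16 format and the names `q16_16`, `q_from_int` and `q_mul` are assumptions chosen for this example; they do not come from the presentation or from any Celoxica library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed format for this sketch: Q16.16, i.e. 16 integer bits and
 * 16 fractional bits stored in a 32-bit signed integer. */
typedef int32_t q16_16;

#define Q_ONE 65536 /* 1.0 in Q16.16 */

/* Convert a small integer to Q16.16. */
q16_16 q_from_int(int32_t x)
{
    assert(x >= -32768 && x <= 32767);
    return (q16_16)((int64_t)x * Q_ONE);
}

/* Multiply two Q16.16 values. The 64-bit intermediate holds the full
 * product; dividing by Q_ONE rescales it (truncating toward zero). */
q16_16 q_mul(q16_16 a, q16_16 b)
{
    int64_t p = ((int64_t)a * (int64_t)b) / Q_ONE;
    assert(p >= INT32_MIN && p <= INT32_MAX); /* result must fit */
    return (q16_16)p;
}
```

For example, 1.5 is stored as 98304 and 2.5 as 163840, and `q_mul(98304, 163840)` gives 245760, which is 3.75 in this format. In hardware the same operation is one integer multiplier plus a fixed wire shift, which is why it is far cheaper than a floating-point unit.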
Further Considerations
• 6. Exploiting “Soft” programmable HW
  • Configurable Applications
    • Schedule and load HW content prior to HW execution
  • Reconfigurable Applications
    • Dynamically change HW content during HW execution
[Slide callout: Few Compelling Examples in HPC]
Commercial RC Applications …using C-based design
[Slide header labels: Defense & Security; Consumer; Automotive & Industrial]
• Well established in embedded systems:
• Digital Video Technology and Image Processing
  • “PROCESSING AT THE SENSOR” versus local and/or remote processing
  • 3D LCD display development and test
  • Real-time verification of HDTV image processing algorithms
  • Robust image matching: product tracking and production line control
• Digital Signal Processing
  • Engine control unit for 3-phase motors
  • Radar and sonar beamforming and spatial filtering
  • Computer aided tomography security system
• Communications and Networking
  • Internet reconfigurable multimedia terminal, MP3, VoIP etc.
  • Ground traffic simulation testbed for broadband satellite network communications
  • Satellite based Internet data tracking system
• Rapid Systems Prototyping
  • Automotive safety system incorporating sensor fusion
  • Robotic vision system for object detection and robot guidance
Commercial RC Applications …using C-based design
• Enterprise Computing
  • Content processing solutions
    • XML parsing, virus checking
    • Packet/Pattern Matching/Filtering
    • Compression/decompression
    • Security/Encryption: DES/3-DES, SHA, MD5, AES/Rijndael
• High Performance Computing
  • Image processing
    • CT scan analysis, 3D modeling, Ray Tracing
  • Finite element analysis and simulation
  • Custom Vector Engines
  • Genome calculations
  • Seismic data processing
Core Algorithm Relationships in HPC
[Figure: map with “Basic Algorithms & Numerical Methods” at the centre, surrounded by core methods such as PDE, ODE, Monte Carlo, Fourier Methods, Graph Theoretic, n-body, Transport, Discrete Events, Fields, Raster Graphics, Pattern Matching and Symbolic Processing, which link outward to HPC application areas such as Weather and Climate, CFD, Seismic Processing, Genome Processing, Cryptography, Molecular Modeling, Data Mining and many others. Source: Rick Stevens - ANL]
Core Algorithm Relationships in HPC
[Figure repeated with overlay. Source: Rick Stevens - ANL]
• How do we map out the right Apps?
Exploiting FPGA in HPC
• Hardware:
  • “Enterprise Quality” co-processor system products (Cray XD1, SGI RASC)
  • Robust PCI/PCI-X/VME-based FPGA card solutions for development
• A software design methodology is essential:
  • SW dominated application sector
  • Target developers have a SW background
  • Register Transfer Level (RTL) and Hardware Description Languages (HDL) are foreign
• Complete designs can be specified in a C environment
  • Porting to HW implementations simplified
  • Platform abstractions through APIs and libraries
  • Simplified specification, development, deployment
[Slide callout: How do we select and benchmark?]
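Note on selecting candidates: the slide asks how to select and benchmark applications but gives no method. One common first-pass estimate, offered here as an assumption rather than as the presenter's approach, is Amdahl's law: if a fraction `f` of the profiled runtime is moved to the FPGA and runs `s` times faster there, the whole application speeds up by 1 / ((1 - f) + f / s). The function name `offload_speedup` is hypothetical, and the sketch ignores data transfer overhead to the coprocessor.

```c
#include <assert.h>

/* Amdahl's-law estimate of whole-application speedup.
 * f: fraction of original runtime spent in the offloaded kernel (0..1)
 * s: speedup of that kernel on the coprocessor (> 0)
 * Transfer overhead is deliberately left out of this sketch. */
double offload_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With half of the runtime offloaded and a 2x kernel speedup, `offload_speedup(0.5, 2.0)` is about 1.33, which shows why profiling for a dominant bottleneck matters before any porting effort.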
Agenda
• Reconfigurable Computing
  • Considerations, core algorithm relationships, commercial applications
• C-based design
  • The solution space (its place in EDA, Electronic Design Automation)
  • Nature of C for HW design
• The Design Flow
• Summary
• JPEG2000 Design Example
C to FPGA/SoPC within Embedded Hardware (HW) Design
[Figure: EDA solution-space map running Specification → Function → Architecture → Implementation → Physical Design. Function: Algorithm Design, Block Design, Fixed Point extraction, DSP IP. Architecture: Architecture Exploration, Implementation IP Models, API’s/Libraries, TLM Frameworks, Design Analysis, Fast Mixed Simulation, HW Accelerated Simulation. Implementation: Custom Processors, HLL Synthesis, C-Based Synthesis, Interface Synthesis, FPGA/SoPC, Reconfigurable Prototypes, Implementation IP, Emulation Platforms, RTL Verification, RTL.]
C to FPGA Accelerated System
[Figure: design flow. Labels: Specification Model; Software Model; Design (HW, SW); C/C++ Testbench (AL); C for HW (CA); COMMS; BSP. Function & Architecture stages: Algorithm Design, System Model, Partitioning, Architecture Exploration, API’s/Libraries, Mixed Simulation, Design Analysis, Optimization. Implementation stages: C-Based Synthesis producing RTL, EDIF and OBJ; Synthesis; P&R; targets FPGA and Processor.]
Challenges for C-based synthesis
• Concurrency (Parallelism)
  • Compiler-determined (behavioral synthesis)
  • Explicit
• Timing
  • Constraints
  • Explicit
  • Rules-based
• Data Types
  • Annotations, additional or C++
• Communication
  • Additional or C-like
Two Approaches to C-based Design
SystemC: SoC (System-on-a-Chip) Prototyping/Verification
• Core Libraries: SCV, TLM, Master/Slave …
• Standard Channels for Various MOC: Kahn Process Networks, Static Dataflow…
• Primitive Channels: Signal, Timer, Mutex, Semaphore, FIFO, etc
• Core Language: Modules, Ports, Processes, Events, Interfaces, Channels, Event Driven Sim Kernel
• Data Types: 4-valued logic/vectors, Bits and bit-vectors, Arbitrary width integers, Fixed-point, C++ user-defined types
• Built on the ANSI/ISO C++ Language Standard
Handel-C: C Algorithm to FPGA
• Core Libraries: TLM (PAL/DSM), Fixed/Floating point …
• Core Language: par{…}, seq{…}, Interfaces, Channels, Bit Manipulation, RAM & ROM, Single cycle assignment
• Data Types: Bits and bit-vectors, Arbitrary width integers, Signals
• Built on the ANSI/ISO C Language Standard
System Design Refinement
[Figure residue: block diagrams of processes A, B, C, D at each stage; level labels AL C/C++, CP Handel-C, CA Handel-C, then EDIF/RTL]
Function
• System function
• Coarse-grain parallelism
    par {
        processA(…);
        processB(…);
        processC(…);
        processD(…);
    }
Parallel algorithm (CP Handel-C)
• Parallel algorithm design
• Fine-grain parallelism
• Bit/cycle true processes
• Algorithm testbench
    void processD(…) {
        unsigned 9 a, b, c;
        par {
            a = 1;
            b = 2;
        }
        c = 3;
    };
Architecture (CA Handel-C)
• Add interfaces
• Signal/cycle accurate test
    void main() {
        interface port_in…
        interface port_out…
        …
    }
Systems Integration: Implementation
[Figure residue: blocks A, B, C, D with CLK, RST and Data pins; blocks B and D supplied as RTL from HDL IP; output EDIF/RTL]
• Complete system design
• Interface to pins
• Multi-clock domain
• IP integration
• EDIF (Electronic Design Interchange Format)
    set clock = external "CLK";
    set reset = external "RST";
    interface Data(…)…
    void main() {
        par {
            processA(…);
            processB(…);
            processC(…);
            processD(…);
        }
    }
    { interface processB(…)…};
    { interface processD(…)…};
Algorithm Design: Parallel Debug in C environment [screenshot]
Architecture Exploration: Resource Usage/Speed Estimations [screenshot]
Optimizations: FPGA Support, Technology mapping [screenshot]
Pipelined Handel-C Template Multiplier

    /* W-stage shift-and-add multiplier; statements inside par execute in parallel */
    void Multiply(unsigned W A, unsigned W B, unsigned W *C)
    {
        static unsigned W a[W], b[W], c[W];
        par {
            a[0] = A;
            b[0] = B;
            c[0] = a[0][0] == 0 ? 0 : b[0];    /* partial product for bit 0 */
            par (i = 1; i < W; i++) {          /* replicated pipeline stages */
                a[i] = a[i-1] >> 1;
                b[i] = b[i-1] << 1;
                c[i] = c[i-1] + (a[i][0] == 0 ? 0 : b[i]);
            }
            *C = c[W-1];
        }
    }

    set clock = external "clk";

    void main()
    {
        …
        while(1) par {
            …
            process();
        }
    }

    void process()
    {
        unsigned W A, B, C;
        while(1) par {
            …
            Multiply(A, B, &C);
            …
        }
    }
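Note: the multiplier above builds its result by adding a left-shifted copy of B for every set bit of A, one bit per pipeline stage. A plain C reference model is handy for checking such a design in the software testbench. The sketch below is an assumption about how one might write that model, not code from the presentation: it is sequential rather than pipelined, the name `shift_add_multiply` is hypothetical, and the width is passed at run time instead of being a compile-time `W`.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential reference model of a w-bit shift-and-add multiplier.
 * Returns the low w bits of a * b, as a w-bit hardware result would. */
uint32_t shift_add_multiply(uint32_t a, uint32_t b, unsigned w)
{
    assert(w >= 1 && w <= 32);
    uint32_t mask = (w == 32) ? 0xFFFFFFFFu : ((1u << w) - 1u);
    uint32_t sa = a & mask; /* multiplier, shifted right each step  */
    uint32_t sb = b & mask; /* multiplicand, shifted left each step */
    uint32_t acc = 0;       /* running sum of partial products      */

    for (unsigned i = 0; i < w; i++) {
        if (sa & 1u)
            acc = (acc + sb) & mask;
        sa >>= 1;
        sb = (sb << 1) & mask;
    }
    return acc;
}
```

Each loop iteration corresponds to one stage of the hardware pipeline. In the Handel-C version all stages run concurrently on different operands, so a new product can emerge every clock cycle once the pipeline is full.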
Summary
• Commercial C-based design is a reality
• For the HPC and RC communities it offers:
  • Fastest route to accelerating SW designs in FPGA
  • Lower barrier to adoption than RTL technologies
  • Greater customization and productivity than block based approaches
  • Complete integration with RTL/block based approaches for “Power users”
  • Deterministic and quality results
• State of the art tools used by embedded systems designers
• RC platforms for rapid prototyping
• Simple migration, development to deployment with full library support
Design Example JPEG2000 Image Compression Algorithm
Example Design: JPEG 2000 Compressor
Five Steps to HW Platform:
• 1. Specification Model
  • Algorithm Profiling
• 2. Functional System Model
  • System Estimations
• 3. Architecture and Communication Model
  • Optimization
• 4. Implementation Model
  • Direct Synthesis C to EDIF
• 5. HW Platform
  • Board level integration
[Block diagram: Original Image → Pre processing → RGB to YUV conversion → DWT → Quantization → Tier-1 Encoder → Tier-2 Encoder → Coded Image, with a Rate Control block]
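Note on the "RGB to YUV conversion" block: the slides do not say which colour transform the example design uses. JPEG 2000 Part 1 defines both a reversible and an irreversible transform; the sketch below shows the reversible one (RCT), chosen here as an assumption because it needs only integer adds and shifts. The type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { int r, g, b; } rgb_t;
typedef struct { int y, u, v; } yuv_t;

/* Integer division by a positive divisor, rounding toward minus infinity. */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && (a < 0))
        q--;
    return q;
}

/* Forward reversible colour transform:
 * Y = floor((R + 2G + B) / 4), U = B - G, V = R - G. */
yuv_t rct_forward(rgb_t in)
{
    assert(in.r >= 0 && in.g >= 0 && in.b >= 0);
    yuv_t out;
    out.y = floor_div(in.r + 2 * in.g + in.b, 4);
    out.u = in.b - in.g;
    out.v = in.r - in.g;
    return out;
}

/* Inverse transform: G = Y - floor((U + V) / 4), R = V + G, B = U + G. */
rgb_t rct_inverse(yuv_t in)
{
    rgb_t out;
    out.g = in.y - floor_div(in.u + in.v, 4);
    out.r = in.v + out.g;
    out.b = in.u + out.g;
    return out;
}
```

Because the round trip is exact, a software model like this can serve as a bit-true reference when the same block is later written for hardware.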
1. Specification Model
[Flow position: Specification Model, Software Model, Testbench; Function & Architecture; C/C++ AL]
• 22 *.c and *.h files
• 1468 lines of code
• Algorithm Profiling
  • Memory
  • Processing Time
  • Data Flow
• DWT/Tier1 are the compute intensive blocks
[Block diagram: Original Image → Pre processing → RGB to YUV conversion → DWT → Quantization → Tier-1 Encoder → Tier-2 Encoder → Coded Image, with a Rate Control block]
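Note on the DWT block: the slides identify the DWT as compute intensive but do not show its arithmetic. As an illustration only, here is a one-level, one-dimensional forward and inverse reversible 5/3 lifting transform of the kind defined by JPEG 2000 Part 1. Whether the example design uses the 5/3 or the 9/7 filter is not stated, so the filter choice, the even-length restriction and all names are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Integer division by a positive divisor, rounding toward minus infinity. */
static int floor_div(int a, int b)
{
    int q = a / b;
    if ((a % b != 0) && (a < 0))
        q--;
    return q;
}

/* Forward 5/3 lifting on an even-length signal x[0..n-1].
 * high[i] receives detail coefficients, low[i] the smoothed samples.
 * Boundaries use symmetric extension: x[n] -> x[n-2], high[-1] -> high[0]. */
void dwt53_forward(const int *x, size_t n, int *low, int *high)
{
    assert(n >= 2 && n % 2 == 0);
    size_t half = n / 2;

    for (size_t i = 0; i < half; i++) {           /* predict step */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        high[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    for (size_t i = 0; i < half; i++) {           /* update step */
        int prev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + floor_div(prev + high[i] + 2, 4);
    }
}

/* Inverse transform: undo the update step, then the predict step. */
void dwt53_inverse(const int *low, const int *high, size_t n, int *x)
{
    assert(n >= 2 && n % 2 == 0);
    size_t half = n / 2;

    for (size_t i = 0; i < half; i++) {           /* recover even samples */
        int prev = (i > 0) ? high[i - 1] : high[0];
        x[2 * i] = low[i] - floor_div(prev + high[i] + 2, 4);
    }
    for (size_t i = 0; i < half; i++) {           /* recover odd samples */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[n - 2];
        x[2 * i + 1] = high[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Every output needs only a few adds and shifts on neighbouring samples, which is the kind of regular, streaming arithmetic that maps well onto a pipelined FPGA datapath.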
2. Functional System Model
[Flow position: System Model, Partitioning; design split into HW (Handel-C, CA) and SW (C/C++, AL) with a shared testbench]
• Estimates available at this stage: cycles/speed/area…
    /* Handel-C */
    extern "C" sw_block(…);
    void main(void) {
        while(1) par {
            sw_block(…);
            hw_block(…);
        }
    }
    void hw_block(…) { … }

    /* C */
    void sw_block(…) { … }
[Block diagram: Original Image → Pre processing → RGB to YUV conversion → DWT → Quantization → Tier-1 Encoder → Tier-2 Encoder → Coded Image, with a Rate Control block]
3. Architecture and Communication Model
[Flow position: Function & Architecture; SW in C/C++ (AL), HW in Handel-C (CA)]
• HW and SW sides communicate through FIFOs: DsmRead(…), DsmWrite(…), DsmFlush(…)
• Port shown on the diagram: DsmPortH2S
• Estimates available at this stage: dataflow/cycles/speed/area…
[Block diagram: Original Image → Pre processing → RGB to YUV conversion → DWT → Quantization → Tier-1 Encoder → Tier-2 Encoder → Coded Image, with a Rate Control block]
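Note on the FIFO links: the slide names DsmRead, DsmWrite and DsmFlush but does not give their signatures, so they are not reproduced here. To show the general idea of a bounded FIFO between a producer (for example the software side) and a consumer (for example the hardware block), here is a minimal ring-buffer sketch in plain C. The type `fifo_t`, the depth of 16 words and the function names are all hypothetical and are not the DSM API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 16 /* hypothetical depth for this sketch */

typedef struct {
    uint32_t data[FIFO_DEPTH];
    size_t head;  /* index of the oldest stored word */
    size_t count; /* number of words currently stored */
} fifo_t;

void fifo_init(fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Append one word; returns false (and stores nothing) when full. */
bool fifo_write(fifo_t *f, uint32_t word)
{
    assert(f != NULL);
    if (f->count == FIFO_DEPTH)
        return false;
    f->data[(f->head + f->count) % FIFO_DEPTH] = word;
    f->count++;
    return true;
}

/* Remove the oldest word; returns false when empty. */
bool fifo_read(fifo_t *f, uint32_t *word)
{
    assert(f != NULL && word != NULL);
    if (f->count == 0)
        return false;
    *word = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

The full and empty return values stand in for the back-pressure a real hardware/software channel would apply; a fixed-depth buffer like this is also what a FIFO typically becomes in FPGA block RAM.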
4. Implementation Model
[Figure residue: blocks A, B, C, D; flow from Implementation through RTL/EDIF to a device family]
    void main() {
        interface port_in…
        interface port_out…
        …
    }
• Output: EDIF or RTL for the chosen device family
Estimations from Synthesis
• DWT ~ 6% VII1000
5. Hardware Platform
• From P&R report for VII1000-4, DWT:
  • Slices: 758
  • Device utilization: 7%
  • Speed (MHz): 151
  • Lines of code: 395
• Implementation Model estimation: DWT ~6%
• Board Level Integration
  • Specific I/O implementations
  • Pin location constraints
• Flow: Implementation → EDIF → P&R → FPGA
• Target platforms
  • Microblaze + Xilinx FPGA
  • Nios + Altera FPGA
  • Xilinx V2Pro
  • Toshiba MeP + FPGA
  • PowerPC + PLB + FPGA
  • PC + FPGA PCI Card
  • …etc
[Figure residue: board diagrams combining uP, HW and RAM blocks]
JPEG2000 DWT Implementation
• Example taken from a “Xilinx Design Challenge”
• Comparison made with HDL approach
• See article in Xcell Volume 46: http://www.xilinx.com/publications/xcellonline/xcell_46/xc_celoxica46.htm

                        C 1st pass   C 2nd pass   C final    HDL
  Slices                646          546          758        800
  Device utilization    6%           5%           7%         7%
  Speed (MHz)           110          130          151        128
  Lines of code         386          386          395        435
  Design time (days)    6            7 (6+1)      7 (6+1)    20*
  Simulation time       5 mins       5 mins       20 mins    +6 hours

• Observations: Comparable; Using C faster; Using C quicker; Expert vs Novice
* Doesn’t include partitioning spec. development
* Lena used as testbench throughout, input bit width 12, max 1K image width
JPEG2000 MQ coder Implementation

                                    Celoxica 1st pass   Celoxica final   HDL
  Slices                            1,347               1,999            620
  Device utilization                12%                 18%              6%
  Speed (MHz)                       89.5                115.5            76
  Lines of code                     310                 330              800
  Design time (days)                10                  12 (10+2)        30*
  Simulation time for Lena jpeg     5 mins              5 mins           Hours

• Observations: HDL smaller; HC faster; HC quicker; Expert vs Novice
* Doesn’t include partitioning spec. development
• A common language base eased porting the MQ coder source to hardware, and DSM allowed partitioning, co-verification and moving data between hardware and software
• Optimizations included adding parallelism, replacing for() loops with while() loops, and simplifying loop control
• Design developed in a unified design environment