OCCBIO 2007 Tutorial on FPGA-Acceleration Processors

The Promise of FPGA-Acceleration Processors to Bioinformatics Research • Anthony D. Johnson • The University of Toledo

FPGA Acceleration Processors - Why? • Accelerating Bioinformatics beyond microprocessors, • Inherent limitations of microprocessors, • already at the upper limit on clock speed, • limited degree of parallelism in hardware, • fixed hardware architecture not adapted to: • FASTA, • BLAST. • Acceleration Processors already developed using: • high parallelism in hardware, • mature technologies.

Computational tasks in Bioinformatics • At the top abstraction level, most Bioinformatics tasks can be described as partial matching of character strings, or string patterns, • Bioinformatics research tasks characterized by: • large data sets, • minimal dependency between data elements, • huge numbers of simple processing steps.
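
The partial matching described above can be sketched as a sliding-window comparison. This is only a minimal illustration, assuming a simple mismatch-count (Hamming) criterion; the function names and the sample strings are hypothetical, and FASTA and BLAST use considerably more elaborate heuristics.

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def partial_matches(query, database, max_mismatches):
    """Return start offsets where query matches database within the mismatch budget."""
    m = len(query)
    hits = []
    for start in range(len(database) - m + 1):
        # Each window is scored independently of every other window,
        # which is the "minimal dependency between data elements" property.
        if hamming(query, database[start:start + m]) <= max_mismatches:
            hits.append(start)
    return hits


# Hypothetical toy data: one exact hit and one hit with a single mismatch.
print(partial_matches("ACGT", "TTACGTACCTAA", max_mismatches=1))  # [2, 6]
```

Because no window depends on the result of another, all windows could in principle be scored at the same time, which is what makes this class of task attractive for parallel hardware.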

Computer Architectures • Jargons are living, changing, confusing: • Von Neumann architecture is defined by basic hardware subsystems and their interaction,

Computer Architectures - continued • in the current jargon, architecture is defined by the instruction set of the processor, • x86 = IA-32: Pentium, Athlon, • IA-64: Itanium, 2001, • PowerPC: PPC620, IBM, • SPARC, SUN, • PA-RISC: PA8700, Hewlett-Packard; • microarchitecture determines how the processor executes instructions.

Microprocessor Architectures: Pipelines • Still basic Von Neumann architecture, • architectures of subsystems have undergone extensive development, • first to achieve one instruction per clock cycle, exploiting instruction-level parallelism (ILP), • pipelining: executes different parts of multiple instructions in parallel on different parts of the pipelined processor's hardware.

Microprocessor Architectures: Threading • Execution of multiple instructions per clock cycle, exploiting parallelism at the level of independent code sequences by: • threading, which uses multiple pipelines to execute multiple instruction sequences concurrently, • pipelining and threading are not always clear winners, • pipelining speedup is eroded by hazards which stall execution of an instruction: • structural hazards, a consequence of lack of hardware parallelism, • data hazards, a consequence of instruction dependencies, • control hazards, a consequence of branches which change the program counter (PC).
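
The erosion of pipelining speedup by hazards can be illustrated with a rough cycle-count model. It is a sketch under stated assumptions: every stage takes one clock cycle, an unpipelined processor needs one cycle per stage per instruction, and each hazard simply adds stall cycles. The function names and the numbers in the example are illustrative only.

```python
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles to finish n instructions on an n_stages-deep pipeline.

    The first instruction needs n_stages cycles to fill the pipeline;
    each further instruction completes one cycle later, plus any stalls.
    """
    return n_stages + (n_instructions - 1) + stall_cycles


def pipeline_speedup(n_instructions, n_stages, stall_cycles=0):
    """Speedup over an unpipelined processor (assumed n_stages cycles per instruction)."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipeline_cycles(n_instructions, n_stages, stall_cycles)


# Assumed example: 1000 instructions, 5 stages, with and without 300 stall cycles.
print(round(pipeline_speedup(1000, 5), 2))
print(round(pipeline_speedup(1000, 5, stall_cycles=300), 2))
```

Without stalls the speedup approaches the number of stages; every stall cycle caused by a structural, data, or control hazard pulls it below that limit.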

Microprocessor Architectures-Summary In the end: • fixed architectures, • compromise between demands of different applications, • programming using HLLs, • compilers adapt processing to the instruction set, • speed is limited, and paid for, by power dissipation. • powering supercomputers takes up to 50MW.

Multiple platform computing systems • offer hardware parallelism in the form of: • computer clusters, • multi-core processors; • slow data transfer between: • processors and shared memory, • processors and distributed memory; • powering supercomputers is costly (up to 50MW).

Field Programmable Gate Arrays • Very Large Scale Integration (VLSI) electronic components, • manufactured using cutting edge VLSI technologies, • architectures completely application-independent, • configurable functionality: • after the manufacturing of the FPGA has been completed, • on the lowest functional module level, • "on the fly" - while running an application, • vast configurable interconnect resources, • massive hardware parallelism, • power consumption significantly below microprocessors.

FPGA Architectures • Arrays of identical configurable Logic Circuit Modules (LCMs), • vendors refer to proprietary LCMs by different names: • Logic Module (Actel), • Logic Array Block (Altera), • Configurable Logic Block (Xilinx); • Architectures loosely classified by the size of LCMs: • fine-grain, • coarse-grain, • mixed-grain.

Virtex-2 LCM • Includes: • four slices, • one switch matrix.

Virtex-2 Half-Slice architecture

Virtex-2 slice Configurations • Selection: • Look Up Table, • Shift Register, • Memory.
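
The three configurations listed above can be pictured with a small software model of one function generator. This is a simplified sketch, assuming a 4-input look-up table backed by 16 configuration bits; the class and method names are hypothetical and do not correspond to any vendor API.

```python
class Lut4:
    """Toy model of a 4-input look-up table backed by 16 configuration bits."""

    def __init__(self, init_bits=0):
        if not 0 <= init_bits < (1 << 16):
            raise ValueError("init_bits must fit in 16 bits")
        self.bits = init_bits

    @staticmethod
    def _address(a0, a1, a2, a3):
        # The four inputs together select one of the 16 stored bits.
        return a0 | (a1 << 1) | (a2 << 2) | (a3 << 3)

    def read(self, a0, a1, a2, a3):
        """Look-up table mode: the inputs address a stored truth-table bit."""
        return (self.bits >> self._address(a0, a1, a2, a3)) & 1

    def write(self, a0, a1, a2, a3, bit):
        """Memory mode: overwrite the addressed bit."""
        addr = self._address(a0, a1, a2, a3)
        self.bits = (self.bits & ~(1 << addr)) | ((bit & 1) << addr)

    def shift_in(self, bit):
        """Shift register mode: push a new bit in and drop the oldest one."""
        self.bits = ((self.bits << 1) | (bit & 1)) & 0xFFFF


# Configure the table as a 4-input XOR (parity) function.
xor_init = sum((bin(addr).count("1") & 1) << addr for addr in range(16))
xor_lut = Lut4(xor_init)
print(xor_lut.read(1, 0, 1, 1))  # 1, because three inputs are set
```

The same 16 storage bits serve all three roles; only the way they are addressed and updated changes, which is the sense in which the slice is configurable.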

FPGA Architectures - continued • modern FPGAs feature a variety of specialized modules: • adder circuitry, • multiplier arrays, • memory blocks, • whole microprocessors, • blocks of fast I/O ports, • Delay-Locked Loops (DLLs) for clock skew and frequency management.

Virtex-2 Memory and Multipliers • Close coupling between: memory and multiplier blocks.

FPGA Characteristics • variable number of LCMs - 4k to 200k, • variable number of I/O connection pads - up to 1400, • speed grades - several, • package types – all usual and specialized packages.

High-Level Programming Languages (HLLs) • Every researcher is fluent in one HLL, • one HLL is included in all Bachelor Degree curricula, • HLLs’ characteristics: • semantics and syntax are oriented to the application for which the language was originally developed, • hundreds have been developed with some goal in mind, but only a small percentage is still in use.

HLLs - continued • algorithms described by consecutive operations, • require translation to a lower level language, • translation is platform/compiler dependent, • source code is nominally portable.

Hardware Description Languages (HDLs) • Only a few engineering students learn an HDL, • HDLs’ characteristics: • designed to describe hardware, • intended originally for hardware simulation, • describe algorithms by concurrent operations on electrical signals.
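
The difference between consecutive operations in an HLL and concurrent operations on signals in an HDL can be sketched with a two-register exchange. This is a loose software analogy, assuming clocked signal assignments whose right-hand sides all read the values from before the clock edge; the function names are hypothetical.

```python
def sequential_step(state):
    """HLL-style: statements execute one after another."""
    s = dict(state)
    s["a"] = s["b"]  # a takes the old value of b
    s["b"] = s["a"]  # b reads the NEW a, so the old value of a is lost
    return s


def concurrent_step(state):
    """HDL-style: both assignments read the old values and update together."""
    return {"a": state["b"], "b": state["a"]}


start = {"a": 1, "b": 2}
print(sequential_step(start))  # {'a': 2, 'b': 2}
print(concurrent_step(start))  # {'a': 2, 'b': 1}
```

The same two assignments swap the registers in the concurrent model but not in the sequential one, which is one reason a transition from sequential to parallel thinking is needed.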

HDLs - continued • there are two standardized HDLs: • VHDL developed under a grant from DOD, • VHDL = VHSIC HDL, • VHSIC = Very High Speed Integrated Circuits, • Verilog is a product of a single company.

Configuring the FPGAs • After the HDL code has been extensively simulated: 1. a synthesis tool converts the code into net-list form, • due to the original purpose of HDLs, NOT all HDL constructs are synthesizable! 2. vendor tools map the net-list onto the FPGA's hardware, 3. the designer specifies mapping constraints to meet the timing requirements, 4. extensive verification follows each step in the process.

Hybrid HLL/HDL Languages • Incorporate the hardware component into an HLL, • tend to bear resemblance to the C language, • their names predominantly contain the character C, • aim to imply that the power of acceleration processors is at the fingertips of researchers, • re-education is still needed to transition from sequential to parallel programming.

Hybrid HLL/HDL - continued • development of libraries of common functions is crucial for adoption, • some vendors provide tools for conversion from Fortran and C to Hybrid-C; • After the Hybrid-C code has been obtained: • all steps listed for HDLs are necessary, • vendors are working on automating the steps, • automation implies artificial intelligence solutions.

A sample of Hybrids and tools for converting C code to HDL • C2H, • ChiMPS, • Catapult-C, • Handel-C, • HARWEST, • Carte, • Mitrion-C, • SystemC, • Chapel, • ROCCC, • Impulse-C, • DIME-C.

Computing platforms with FPGA acceleration processors • General trend: platforms with hybrid processing resources. • Means for connecting FPGAs to microprocessors: • main processor bus, • memory slot, • I/O slot, • HTX expansion bus slot.

A Shared Hybrid Architecture • Two Opteron processor sockets on a board, • sockets linked by a high-speed, HyperTransport (HTX) bus, • one socket contains an Opteron – for control tasks, • the other holds an FPGA card – for intensive data-processing, • examples: • XtremeData’s XD1000 FPGA Coprocessor Module, • DRC’s Reconfigurable Processing Unit, RPU110-L200.

Other Interesting Hybrid Architectures • Systems using two processors and one HTX slot: • IBM x3455, • HP DL145 Server. • Other architectures: • Cray XD1 cluster supercomputer with proprietary extension modules, • first couple: Opteron - FPGA - for processing, • second couple: Opteron - FPGA - for communication; • Celoxica’s RCHTX high-performance computing board plugs into an HTX slot, • SRC Computers plug MAP processor into a memory slot.

Other significant Hybrid Platforms • Mitrion Virtual Processor: • runs on Virtex-4 FPGAs on the SGI RASC RC100 compute blades, • runs commercially available BLAST application; • SGI Altix family servers are equipped with • SGI RASC RC100 computation blades, • Nallatech System uses: • BenONE FPGA-based computing card on H100 Series platform, • Cray XT Super Computers: • will replace the XD1 line, • will use a DRC’s extension module with Virtex-4 FPGAs.

Benchmarks of Bioinformatics Algorithms • FASTA running on Cray XD1 Hybrid System, • benchmarking results show unprecedented FPGA speedups [2]. • Demonstrating substantial benefits to the users generates the traction for a breakthrough into mainstream HPC markets.

Micro-RNA Comparison benchmark • query sequences: 3685 sequences (~20 characters) • database file: the first of all 24 human genome chromosomes • Platform 1: Cray XD1, using one Virtex2 Pro 50 FPGA • speedup vs. Opteron: 10X

Bacillus anthracis DNA Comparison • query sequences: AE017024 through AE017041 (300K characters per sequence) • database file: AE016879 (more than 5M characters) • Platform 1: Cray XD1, using one Virtex2 Pro 50 FPGA • speedup vs. Opteron: 50X • Platform 2: Cray XD1, using one Virtex4 LX160 FPGA • speedup vs. Opteron: 100X (from 8 hours down to 5 minutes) • Platform 3: Cray XD1, using five Virtex4 LX160 FPGAs • speedup vs. Opteron: 500X [3] • Speedup scales linearly with the number of FPGAs!
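
The quoted figures can be checked with a few lines of arithmetic. This is a sketch that assumes the run times "8 hours" and "5 minutes" are exact and that scaling with the number of FPGAs is perfectly linear, as the slide states; the function names are illustrative.

```python
def speedup(baseline_minutes, accelerated_minutes):
    """Ratio of the baseline run time to the accelerated run time."""
    return baseline_minutes / accelerated_minutes


def linear_scaling(per_device_speedup, n_devices):
    """Aggregate speedup if every added device contributes equally."""
    return per_device_speedup * n_devices


single_fpga = speedup(8 * 60, 5)
print(single_fpga)             # 96.0, quoted on the slide as roughly 100X
print(linear_scaling(100, 5))  # 500, matching the five-FPGA figure
```

The 8 hours to 5 minutes reduction works out to 96X, consistent with the rounded 100X figure, and five such FPGAs under linear scaling give the 500X reported for Platform 3.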

Amino Acid Search • query sequences (OpenFPGA): ras (60 characters), myc (189 characters), sec (351 characters) • database file: 24 human genome chromosomes translated into amino acids • Platform 1: Cray XD1, using one Virtex-II Pro 50 • speedup vs. Opteron: 20X to 50X • Speedup increases with the sequence length.

Other Benchmark Outcomes • After the 100X and 500X speedups by FPGA-acceleration processors on XD1: • all other reported speedups are less significant, • few researchers are likely to settle for 10X to 20X, • all vendors will scramble to catch up with Cray, • Bioinformatics should hope for a much better HPC landscape in the near future.

Recent References • [1] Tripp, J.L., Gokhale, M.B., Peterson, K.D.: Trident: From High-Level Language to Hardware Circuitry, IEEE Computer, March 2007, pp. 28-37. • [2] Storaasli, O., Yu, W., Strenski, D., Maltby, J.: Evaluation of FPGA-Based Biological Applications, CUG 2007. • [3] Storaasli, O., ORNL’s Future Technologies Group: personal communication. • [4] Lazou, C.: FPGAs in HPC Landscape, EnterTheGrid - PrimeurMonthly, 2007.

Acknowledgment • Research supported by the LDRD Program of ORNL for the U.S. Department of Energy under Contract DE-AC05-00OR22725. • Results courtesy of:

Other Candidates for Acceleration Processors • General Purpose Graphics Processing Units (GPGPUs) • stream processors (stream = set of records which require similar computation) • GPUs enhanced for supporting some other FP applications, • extreme parallelism of the GPU pipeline makes them suitable for applications with: • large data sets, • high parallelism, • minimal dependency between data elements, • may lack reliability needed in scientific HPC.
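
The stream-processing idea above, a set of records that all require similar computation, can be sketched as one kernel applied to every record. This is a minimal illustration; the kernel (counting G and C bases) and the sample records are assumptions chosen for the example, not part of any GPU programming interface.

```python
def gc_count(record):
    """Example kernel: count G and C characters in one record."""
    return sum(base in "GC" for base in record)


def process_stream(records, kernel):
    """Apply the same kernel to every record; no record depends on another."""
    for record in records:
        yield kernel(record)


# Hypothetical stream of three short sequences.
stream = ["ACGT", "GGCC", "ATAT"]
print(list(process_stream(stream, gc_count)))  # [2, 4, 0]
```

On a GPU the loop body would run on many records at once; the sequential loop here only shows why such workloads parallelize well.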

Other Candidates for Acceleration Processors • Cell Processor by IBM, Toshiba & Sony • designed for PlayStation’s enhanced graphics processing, • contains one scalar and eight vector processors. • Vector processors: • pipeline both the instruction execution and the data I/O, • have a degree of superscalar implementation, i.e. parallelism in hardware.

Other Candidates for Acceleration Processors • Array Processor by ClearSpeed • Multithreaded Array Processor for floating point operations: • 64 processing elements in an 8x8 array, • FP unit, • local memory, 384KB SRAM, • I/O ports; • programmable only in C language.
