Efficient Implementation of a String Matching Algorithm for SRC and Cray Reconfigurable Computers
Esam El-Araby (1), Mohamed Taher (1), Tarek El-Ghazawi (1), Mohamed Abouellail (1), Nandakishore Sastry (2), and Kris Gaj (2)
(1) The George Washington University, (2) George Mason University
Outline
• Introduction
• SRC Hardware & Software
• Cray XD1 Hardware & Software
• String Matching Algorithms
• Implementation Methodology
• Results and Comparisons
• Conclusions
Introduction
[Figure: block diagrams of a Microprocessor System (processors P, each with P memory) and a Reconfigurable Processor System (FPGAs, each with FPGA memory); each system has an Interface and I/O]
SRC Architecture (Hi-Bar™ Based Systems)
[Figure: SRC-6 system. An SRC Hi-Bar Switch connects Common Memory nodes, MAP® processors (with chaining GPIO) and microprocessors (P) with memory through SNAP™ interfaces; PCI-X, Gig Ethernet, etc. link to customers' existing networks (Wide Area Network, Local Area Network, Storage Area Network, disk)]
• Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
• Up to 256 input and 256 output ports with two tiers of switch
• Common Memory (CM) has a controller with DMA capability
• Controller can perform other functions such as scatter/gather
• Up to 8 GB DDR SDRAM supported per CM node
SRC Programming Environment
[Figure: HLL (C) targets the µP system; HDL (VHDL) targets the FPGA system]
SRC Programming Environment (cont'd)
[Figure: SRC compilation flow. Labels: Application sources (.c or .f files); User Macro sources (HDL sources, .vhd or .v files); µP Compiler; MAP Compiler; Object files (.o files); Logic synthesis (.v files, .edf files); Place & Route (.bin files); Linker; Configuration bitstreams; Application executable]
SRC Programming Environment (cont'd)
[Figure: a program in C or Fortran. The main program calls Function_1(a, d, e) and Function_2(d, e, f). Function_1 contains Macro_1(a, b, c), Macro_2(b, d) and Macro_2(c, e); Function_2 contains Macro_3(s, t), Macro_1(n, b) and Macro_4(t, k). A second panel shows the FPGA contents after the Function_1 call: Macro_1 takes a and produces b and c, which feed two Macro_2 instances producing d and e]
Cray XD1 System Architecture (One Chassis)
[Figure: RapidArray components in a Cray XD1 chassis]
• Compute
  • 12 AMD Opteron 32/64-bit x86 processors
  • High Performance Linux
• RapidArray Interconnect
  • 12 communications processors
  • 1 Tb/s switch fabric
• Active Management
  • Dedicated processor
• Application Acceleration
  • 6 co-processors
FPGA and 2nd RAP are on the Expansion Module
Cray XD1 Application Acceleration Interfaces
[Figure: Virtex-II Pro FPGA containing User Logic, the RapidArray Transport Core (TX/RX links to the RAP) and the QDR RAM Interface Core (four QDR II SRAM ports, each with ADDR(20:0), D(35:0) and Q(35:0))]
• XC2VP30-50 running at up to 200 MHz
• 4 QDR II RAM with over 400 HSTL-I I/O at 200 MHz DDR (400 MTransfers/s)
• 16-bit simplified HyperTransport I/F at 400 MHz DDR (800 MTransfers/s)
• QDR and HT I/F take up <20% of the XC2VP30; the rest is available for user applications
Cray XD1 Development Flow
[Figure: hardware flow (standard hardware flow) alongside software flow]
Cray XD1 Hardware Development Flow
[Figure: standard flow plus additional high-level tools]
Design Methodology using Cray XD1
• Write the application in C for the system microprocessor
• Identify computation-intensive routine(s)
• Generate a bitstream using Cray Cores (RT & QDR II) and the language of choice
  • Create module in HDL (Verilog, VHDL)
  • Create module using High Level Language tools
  • Validate module
  • Synthesize using XST, Leonardo, or Synplify Pro
  • Create bitstream using Xilinx place & route tools
• Replace routines with Cray API calls
• Run application
String Matching - Introduction
• String matching: detecting the occurrence of a particular substring, called the pattern, in another string, called the text
• Types of string matching:
  • Exact string matching
  • Approximate string matching
• Exact string matching:
  • The pattern must occur in the text in full: unbroken, with no irrelevant data between any of its letters
  • Numerous applications: NIDS, text editing, etc.
• Approximate string matching:
  • The pattern rarely matches the text completely
  • Finds application in computational biology (DNA matching), image detection, handwriting recognition, etc.
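The distinction can be illustrated with a minimal Python sketch. The function names are hypothetical and the brute-force scan is only an assumption for illustration; counting mismatches (Hamming distance) is used here as a simplified stand-in for approximate matching, which in the rest of this talk means gapped alignment.

```python
def find_exact(pattern, text):
    """Return every start index where pattern occurs unbroken in text."""
    if not pattern:
        return []
    width = len(pattern)
    return [start for start in range(len(text) - width + 1)
            if text[start:start + width] == pattern]


def find_with_mismatches(pattern, text, max_mismatches):
    """Return start indices where pattern matches with at most
    max_mismatches substituted characters (no gaps allowed)."""
    if not pattern:
        return []
    width = len(pattern)
    hits = []
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        mismatches = sum(1 for p, t in zip(pattern, window) if p != t)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    print(find_exact("ATA", "CATATAC"))
    print(find_with_mismatches("ATA", "CATACAGA", 1))
```

With zero allowed mismatches the second function reduces to the first, which is one way to see exact matching as a special case of approximate matching.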
DNA Matching Basics
Why align two protein or DNA sequences?
• Determine whether they are descended from a common ancestor (homologous)
• Infer a common function
• Locate functional elements
• Infer protein structure, if the structure of one of the sequences is known
Problem: find the best pairwise alignment of GAATC and CATAC. Candidate alignments:
• GAATC / CATAC
• GAAT-C / C-ATAC
• -GAAT-C / C-A-TAC
• GAATC- / CA-TAC
• GAAT-C / CA-TAC
• GA-ATC / CATA-C
• We need a way to measure the quality of a candidate alignment
• Alignment scores consist of two parts:
  • substitution matrix
  • gap penalty
DNA Matching Basics (cont'd)
A hypothetical substitution matrix:
      A    C    G    T
 A   10   -5    0   -5
 C   -5   10   -5    0
 G    0   -5   10   -5
 T   -5    0   -5   10
Scoring aligned bases
• Transversion (expensive), transition (cheap)
• GAAT-C / CA-TAC: -5 + 10 + ? + 10 + ? + 10 = ?
Scoring gaps
• Linear gap penalty: every gap receives a score of d
  • With d = -4, GAAT-C / CA-TAC scores -5 + 10 + -4 + 10 + -4 + 10 = 17
• Affine gap penalty: opening a gap receives a score of d; extending a gap receives a score of e
  • With d = -4 and e = -1, G--AATC / CATA--C scores -5 + -4 + -1 + 10 + -4 + -1 + 10 = 5
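The two worked scores on this slide can be reproduced with a short Python sketch. The matrix values and the penalties d = -4, e = -1 come from the slide; the function name `score_alignment` and the dictionary layout are assumptions made for illustration.

```python
# Substitution matrix copied from the slide (hypothetical values).
SUBST = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}


def score_alignment(top, bottom, gap_open, gap_extend=None):
    """Score two equal-length aligned strings in which '-' marks a gap.

    If gap_extend is omitted, every gap column costs gap_open (linear
    penalty). Otherwise the first column of a gap run costs gap_open and
    each following column of the same run costs gap_extend (affine).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    if gap_extend is None:
        gap_extend = gap_open
    total = 0
    previous_gap_row = None  # row that held a gap in the previous column
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            total += gap_extend if previous_gap_row == row else gap_open
            previous_gap_row = row
        else:
            total += SUBST[a][b]
            previous_gap_row = None
    return total


if __name__ == "__main__":
    print(score_alignment("GAAT-C", "CA-TAC", gap_open=-4))
    print(score_alignment("G--AATC", "CATA--C", gap_open=-4, gap_extend=-1))
```

The same function can be applied to any of the candidate alignments of GAATC and CATAC to compare them under one scoring scheme.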
Approximate String Matching Algorithm (Smith-Waterman Algorithm)
[Flowchart, in order:]
• Read sequences A & B into two arrays
• Set the traceback and similarity matrices to size (A+1) × (B+1)
• Set the first row and first column of the similarity matrix to 0
• Initialize the traceback arrays by setting them to -1 (default value)
• Compute similarity matrix cell [i][j]
• Update the traceback array
• Similarity matrix complete? If no, compute the next cell; if yes, continue
• Trace back for the best alignments
NOTE: the traceback array carries the coordinates of one of the three cells involved in the calculation of cell [i][j] in the similarity matrix
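The flowchart steps can be sketched in software as follows. This is a minimal Python illustration, not the hardware design described in this talk: it assumes the hypothetical substitution matrix and linear gap penalty d = -4 from the earlier slides, uses `None` where the flowchart uses -1 as the traceback default, and reports only one best local alignment.

```python
SUBST = {
    "A": {"A": 10, "C": -5, "G": 0, "T": -5},
    "C": {"A": -5, "C": 10, "G": -5, "T": 0},
    "G": {"A": 0, "C": -5, "G": 10, "T": -5},
    "T": {"A": -5, "C": 0, "G": -5, "T": 10},
}


def smith_waterman(a, b, subst=SUBST, gap=-4):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Similarity matrix: first row and first column stay 0.
    sim = [[0] * cols for _ in range(rows)]
    # Traceback matrix: coordinates of the cell each score came from.
    trace = [[None] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            candidates = [
                (sim[i - 1][j - 1] + subst[a[i - 1]][b[j - 1]], (i - 1, j - 1)),
                (sim[i - 1][j] + gap, (i - 1, j)),  # gap in b
                (sim[i][j - 1] + gap, (i, j - 1)),  # gap in a
            ]
            score, origin = max(candidates, key=lambda c: c[0])
            if score > 0:  # local alignment: negative scores reset to 0
                sim[i][j] = score
                trace[i][j] = origin
            if sim[i][j] > best:
                best, best_pos = sim[i][j], (i, j)
    # Follow the stored coordinates back until a cell with no origin.
    top, bottom = [], []
    i, j = best_pos
    while trace[i][j] is not None:
        pi, pj = trace[i][j]
        top.append(a[i - 1] if pi == i - 1 else "-")
        bottom.append(b[j - 1] if pj == j - 1 else "-")
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("GAATC", "CATAC"))
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on one anti-diagonal are independent of each other; that dependency pattern is what makes the algorithm a natural fit for arrays of processing elements.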
Implementation Schemes in SRC
[Figure: three implementation schemes]
• Software-only implementation: C function for the µP (µP system)
• Software/hardware implementation: C function for MAP plus VHDL macro
• Hardware-only implementation: VHDL (FPGA system)
Operational Scenarios for Cray XD1
[Figure: operational environment with three transfer scenarios: FPGA-initiated transfers, write-only transfers, and µP-initiated transfers]
Performance Results
• Rate = (FPGA freq.) × (cells/cycle per SWPE) × (# SWPEs)
• Opteron implementation (SSEARCH34)*
  • 100 million cell updates per second (CUPS)
• Cray Inc. implementation*
  • Current unoptimized design
    • 80 MHz × 1 × 32 = 2.56 billion CUPS (GCUPS)
  • With optimization
    • 100 MHz × 1 × 50 = 5.0 GCUPS
  • With future Virtex 4 FPGA
    • 100 MHz × 1 × 150 = 15 GCUPS
  • 25x speedup vs. Opteron
• Our implementation
  • SRC-6
    • Current unoptimized design
      • 100 MHz × 1 × (16×16) = 25.6 GCUPS
    • 10x speedup vs. Cray
    • 256x speedup vs. Opteron
  • Cray XD1
    • Current unoptimized design
      • 200 MHz × 1 × (16×16) = 51.2 GCUPS
    • 20x speedup vs. Cray
    • 512x speedup vs. Opteron
*CUG'05, New Mexico, May 2005
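The throughput and speedup figures follow directly from the rate formula. A minimal Python sketch of the arithmetic, using only numbers from this slide (the function and variable names are hypothetical, and the baseline for "vs. Cray" is assumed to be the 2.56 GCUPS unoptimized design, which is what the quoted ratios imply):

```python
def cups(freq_hz, cells_per_cycle_per_pe, num_pes):
    """Cell updates per second for an array of processing elements."""
    return freq_hz * cells_per_cycle_per_pe * num_pes


OPTERON_CUPS = 100_000_000                 # SSEARCH34 on Opteron
CRAY_INC_CUPS = cups(80_000_000, 1, 32)    # Cray Inc. unoptimized design

designs = {
    "SRC-6": cups(100_000_000, 1, 16 * 16),
    "Cray XD1": cups(200_000_000, 1, 16 * 16),
}

if __name__ == "__main__":
    for name, rate in designs.items():
        print(name,
              rate / 1e9, "GCUPS,",
              rate / CRAY_INC_CUPS, "x vs. Cray,",
              rate / OPTERON_CUPS, "x vs. Opteron")
```

The formula gives peak rates: it assumes every processing element completes one cell update on every clock cycle.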
Conclusions
• The Smith-Waterman sequence alignment algorithm has been implemented on both SRC-6 and Cray XD1 systems
• Similarities and differences are highlighted with regard to:
  • System hardware architecture
  • Ease of programming
    • Programming model
    • Development time
    • Hardware/software libraries
  • Performance
• The speedup vs. microprocessor is reported
• Primary bottlenecks limiting the performance of both systems are recognized
• The capability to share and port applications between the SRC and Cray systems is explored