Case Studies of Porting Two Algorithms to Reconfigurable Processors
Reconfigurable Systems Summer Institute
Wednesday, July 13, 2005
Craig Steffen, National Center for Supercomputing Applications
Reminder: FPGA Computational Strengths
• Parallel elements
• Integer processing means small footprint
• Processor cache structure: data re-use
• Simple problems: maximum utilization
• Simple problems to describe
Example Algorithm: Matrix Multiply • Popular and much-used algorithm • Well-known API in place • Advantages: Simple, dividable, inherently parallel, data re-use • Disadvantage: floating point
Matrix Multiply Algorithm
• Parallel computations
• Multiple data uses
• Lends well to MAC units

    [ a b c d ]   [ k ]   [ q ]
    [ e f g h ] × [ m ] = [ r ]
                  [ n ]
                  [ p ]

    q = ak + bm + cn + dp
    r = ek + fm + gn + hp
Matrix Multiply Algorithm (matrix dimensions)

    A (α × θ)  ×  B (θ × δ)  =  C (α × δ)

    q = ak + bm + cn + dp
    r = ek + fm + gn + hp

(Diagram: the same 2 × 4 example, with A labeled α × θ, B labeled θ × δ, and the result labeled α × δ.)
Matrix Multiply Implementation in C

    for(i=0; i<α; i++){
      for(j=0; j<δ; j++){
        C[i][j] = 0.0;
        for(k=0; k<θ; k++){
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
Naïve Implementation Performance
• Generated a version that timed matrix multiply with inputs α, δ, θ, and N (iterations)
• Going from (40, 800, 250, 40) to (400, 800, 250, 40) caused a 2.5× slowdown (cache issues; data re-use rears its head)
• Speed was 500M MACs per second, i.e. 1B floating-point operations per second, on a 2.8 GHz CPU
• A real optimized library would run about 6× faster
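The original timing harness is not shown in the slides; the following is a hypothetical sketch of a benchmark of that shape, with the parameters named alpha, delta, theta, and iters to mirror the (α, δ, θ, N) inputs in the text:

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical harness: multiply an alpha x theta matrix by a theta x delta
   matrix iters times and report the achieved MAC rate (MACs per second). */
double time_matmul(int alpha, int delta, int theta, int iters) {
    double *A = calloc((size_t)alpha * theta, sizeof *A);
    double *B = calloc((size_t)theta * delta, sizeof *B);
    double *C = calloc((size_t)alpha * delta, sizeof *C);
    clock_t t0 = clock();
    for (int n = 0; n < iters; n++)
        for (int i = 0; i < alpha; i++)
            for (int j = 0; j < delta; j++) {
                double s = 0.0;
                for (int k = 0; k < theta; k++)
                    s += A[(size_t)i * theta + k] * B[(size_t)k * delta + j];
                C[(size_t)i * delta + j] = s;
            }
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(A); free(B); free(C);
    /* total MACs performed = iters * alpha * delta * theta */
    return (double)iters * alpha * delta * theta / secs;
}
```

Rerunning such a harness with α grown from 40 to 400 is exactly the experiment that exposes the cache effects noted above.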
Matrix Multiply: block-wise divisible
• Any block of elements may be multiplied as a unit
• As long as the general rules are followed, the final result is the same as it would have been
• This can be exploited to take advantage of specialized units with preferred operand sizes

(Diagram: A × B = C computed block-wise; each block of C is the sum of products of sub-blocks of A and B.)
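The block-wise decomposition above can be sketched in C. This is a minimal illustration, not the actual port: the matrices are assumed square, and the block size NB is a placeholder for whatever operand size the hardware units prefer.

```c
#define N  8   /* matrix dimension (assumed square for simplicity) */
#define NB 4   /* block size; would match the hardware's preferred operand size */

/* Block-wise matrix multiply: C += A*B, computed one NB x NB tile at a time.
   Each tile product is an independent small matrix multiply, so tiles could
   be dispatched to parallel MAC units; partial products for the same C tile
   are accumulated, which is why the result matches the plain triple loop. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i0 = 0; i0 < N; i0 += NB)
        for (int j0 = 0; j0 < N; j0 += NB)
            for (int k0 = 0; k0 < N; k0 += NB)
                /* multiply tile A[i0..][k0..] by tile B[k0..][j0..] */
                for (int i = i0; i < i0 + NB; i++)
                    for (int j = j0; j < j0 + NB; j++)
                        for (int k = k0; k < k0 + NB; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Because each tile product touches only NB × NB operands, the working set can be sized to fit whatever local memory is closest to the compute units.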
64-bit Floating-Point FPGA Matrix Multiplication • Yong Dou, S. Vassiliadis, G. K. Kuzmanov, G. N. Gaydadjiev • FPGA ’05, February 20-22, 2005, Monterey, California, USA • Contains a 12-stage pipelined MAC block design • Multi-FPGA master-slave multiple simultaneous execution design
Plan for Implementation on SRC MAP Processor
• MAP runs at a 100 MHz clock speed
• Assuming fully pipelined logical units (MACs in this case, one result per clock), matching the CPU's 500M MACs/s requires 500M / 100M = 5 MACs running in parallel (disregarding transfer latencies)
Single Data Use: MAP Starves
• The RAM-to-MAP data pipe can feed 2 MACs
• 6 are required to make it worthwhile

(Diagram: MAP processor; two 64-bit input pipes, each split into two 32-bit streams, feed the FPGA directly.)
Using Caching: Now Equals CPU Speed
• Data re-use in OBM

(Diagram: MAP processor; six On-Board Memory banks buffer operands, with two 64-bit pipes, each split into two 32-bit streams, feeding the FPGA.)
More Speedup: Requires Data Re-use in FPGA Block RAM

(Diagram: as before, On-Board Memory banks feed the FPGA, but operands are additionally cached in the FPGA's Block RAM.)
Matrix Multiply Status
• On hold for the moment
• Need to understand programming and access issues for Block RAM
BLAST: DNA Comparison Code
• Basic Local Alignment Search Tool
• Biology code for comparing DNA and protein sequences
• Multiple modes: DNA–DNA, DNA–protein, protein–DNA, protein–protein
• Answers the question "does A occur within B?", where A is short and B is very long
• Not an exact match: takes into account DNA combinatorics and protein substitution rules
blastp: compare DNA to Protein
• First, translate DNA to protein
• Translate all three forward frames
• Add the reverse-strand frames for the complete "6-frame translation"

(Diagram, built up over four slides: a DNA sequence and the six amino-acid sequences derived from it, one per reading frame.)
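A small piece of the 6-frame translation can be sketched in C: frames 4–6 are simply frames 1–3 of the reverse complement of the DNA. The function name is illustrative, not from the original code.

```c
#include <string.h>

/* Complement one DNA base (A<->T, G<->C); anything else becomes 'N'. */
static char comp(char b) {
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'G': return 'C';
    case 'C': return 'G';
    default:  return 'N';
    }
}

/* Write the reverse complement of dna into out (same length, NUL-terminated).
   Translating this string in frames 1-3 yields frames 4-6 of the original. */
void revcomp(const char *dna, char *out) {
    size_t n = strlen(dna);
    for (size_t i = 0; i < n; i++)
        out[i] = comp(dna[n - 1 - i]);
    out[n] = '\0';
}
```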
BLAST Method (per Frame)
• Finds small local matches
• Tries to expand matches to improve them by changing matches or inserting gaps, each of which has a weight tied to the probability of one combination mutating into another
• Each change alters the "goodness of match" score
• Many combinations are attempted until the goodness value peaks
• This method is very ill-suited to an FPGA: each step depends on the previous one, with one small comparison loop carrying all the weight

(Diagram: an initial two-element match extended step by step; scores labeled "two elements", "2 elements – 1 gap", "4 elements – 1 gap".)
BLAST: Solve the same problem differently
• Component-by-component comparison at multiple offsets
• For each offset, record a score based on the number and/or arrangement of matches
• After all comparisons are finished, then (perhaps) do iterative matching

(Diagram: the short sequence slid across the long one; each offset position gets a score, e.g. 0, 2, 0, 2, 0.)
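The offset-scoring idea can be sketched as plain C. This uses a simple match count as the per-offset score; the actual scoring (number and/or arrangement of matches) may differ, and the names here are illustrative.

```c
#include <string.h>

/* For every alignment offset of the short query against the long subject,
   count element-wise matches and record one score per offset. Each offset,
   and each element comparison within it, is independent of all the others,
   which is what makes this formulation FPGA-friendly. */
void score_offsets(const char *query, const char *subject, int *scores) {
    size_t qlen = strlen(query), slen = strlen(subject);
    for (size_t off = 0; off + qlen <= slen; off++) {
        int s = 0;
        for (size_t i = 0; i < qlen; i++)   /* independent comparisons */
            if (query[i] == subject[off + i])
                s++;
        scores[off] = s;                    /* one score per offset */
    }
}
```

Only after this embarrassingly parallel pass would the sequential, BLAST-style iterative matching (perhaps) run, and only on the few high-scoring offsets.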
Performance
• Problem provided by Matt Hudson of Crop Sciences: find a protein encoded within a particular plant chromosome
• The protein is ~1100 amino acids; the chromosome is 31 million bases
• BLAST detects two very strong hits and two weak ones, taking 3 CPU-seconds
• My algorithm, coded naively, takes 20 minutes
• With reasonable speed-ups, it takes 19 to 40 seconds
FPGA Advantages of this Algorithm
• Each offset is independent
• Each element's comparison is independent
• The data re-use factor is the length of the short sequence
• 6-fold parallelism due to 6-frame translation
Implementation
• Do comparisons in parallel
• Shift the DNA sequence, push a new element into the pipe
• Repeat
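One step of that shift-compare-repeat loop can be sketched in software as follows. On the FPGA the per-element comparisons and the final sum would be a parallel compare array and an adder tree; here sequential loops stand in for both, and all names are hypothetical.

```c
#define QLEN 4   /* length of the resident query window (illustrative) */

/* One pipeline step: shift the incoming subject stream by one element,
   then score the window against the resident query. */
int step(char window[QLEN], char next, const char *query) {
    /* shift: drop the oldest element, append the incoming one */
    for (int i = 0; i < QLEN - 1; i++)
        window[i] = window[i + 1];
    window[QLEN - 1] = next;

    /* parallel compare + adder tree (modeled here as a sequential sum) */
    int score = 0;
    for (int i = 0; i < QLEN; i++)
        score += (window[i] == query[i]);
    return score;
}
```

Each call corresponds to one clock of the pipeline: one new element in, one offset score out.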
Current Status
• Element-wise shift is not trivially expressible in MAP-C; defined a Perl-expanded macro for it
• Must create the parallel comparison and adder tree to finish

(Diagram: an adder tree combining per-element comparison results into a total offset score.)
Conclusion
• Tools are gaining in usefulness and sophistication
• The programmer must explicitly deal with memory architectures and data movement
• Some things just don't work the way you're used to thinking about them