Implementing Data Parallel Algorithms for Bioinformatics

Christopher Mueller, Mehmet Dalkilic, Andrew Lumsdaine. SIAM Conference on Computational Science and Engineering, February 14, 2005.


Presentation Transcript


  1. Implementing Data Parallel Algorithms for Bioinformatics
     Christopher Mueller, Mehmet Dalkilic, Andrew Lumsdaine
     SIAM Conference on Computational Science and Engineering, February 14, 2005

  2. Introduction
     • Goal
       • Implement a well-known bioinformatics algorithm for a data parallel system (Altivec)
     • Motivation
       • Current implementations do not scale well to support full genomes
       • Vector processors are common, even on commodity hardware
       • New supercomputing architectures are including vector (DSP) units (again)

  3. SIMD Overview
     Single Instruction, Multiple Data: perform the same operation on many data items at once.
     Vector registers can be divided according to the data type. The Altivec registers in the G5 are 128 bits wide.
     [Figure: a normal add computes 3 + 2 = 5; a SIMD add computes (3, 2, 1, 4) + (2, 4, 5, 9) = (5, 6, 6, 13) in a single operation. Image from http://developer.apple.com/hardware/ve]
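The lane-wise add in the figure can be sketched in portable C++. This is only an emulation under stated assumptions: `Vec4`, `scalar_add`, `simd_add`, and `lanes_per_register` are hypothetical names, and a plain loop stands in for what the hardware does in one instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one 128-bit register viewed as four 32-bit lanes.
using Vec4 = std::array<std::int32_t, 4>;

// Normal add: one operation yields one result.
inline std::int32_t scalar_add(std::int32_t a, std::int32_t b) { return a + b; }

// SIMD-style add: the same operation is applied to every lane.
// Real vector hardware does this in one instruction; the loop only emulates it.
inline Vec4 simd_add(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t lane = 0; lane < r.size(); ++lane) {
        r[lane] = a[lane] + b[lane];
    }
    return r;
}

// Number of elements of type T that fit in a 128-bit (16-byte) register.
template <typename T>
constexpr std::size_t lanes_per_register() {
    return 16 / sizeof(T);
}

// Checks the numbers shown in the slide's figure.
inline void check_figure_example() {
    assert(scalar_add(3, 2) == 5);
    assert((simd_add(Vec4{3, 2, 1, 4}, Vec4{2, 4, 5, 9}) == Vec4{5, 6, 6, 13}));
}
```

Dividing the same 16 bytes by a smaller element type gives more lanes per operation, which is why the choice of data type matters later in the talk.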

  4. General Issues
     • Altivec code is one step removed from assembly
     • Programmer manages load/store operations
     • Debugging and maintenance are a challenge
     • Compiler optimizations are not available
     • But, the compiler handles register assignments and can insert load/store operations
     • For maximum performance, the processor must be fed continuously

  5. Application: Dot Plot
     qseq, sseq = sequences
     win = number of elements to compare for each point
     strig = number of matches required for a point

     for each q in qseq:
         for each s in sseq:
             if CompareWindow(qseq[q:q+win], sseq[s:s+win], strig):
                 AddDot(q, s)

     [Figure: dot plot comparing the human and fly mitochondrial genomes (generated by DOTTER)]
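A minimal runnable sketch of the windowed comparison above, assuming the simplest scoring (one point per exact character match, no scoring matrix). The names `CountMatches` and `NaiveDotPlot` are hypothetical, and a dot is recorded when the match count is at least `strig`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Counts positions where the two win-length windows agree
// (assumed scoring: 1 per exact match).
inline std::size_t CountMatches(const std::string& qseq, std::size_t q,
                                const std::string& sseq, std::size_t s,
                                std::size_t win) {
    std::size_t matches = 0;
    for (std::size_t k = 0; k < win; ++k) {
        if (qseq[q + k] == sseq[s + k]) ++matches;
    }
    return matches;
}

// Returns (q, s) window start positions whose match count reaches strig.
inline std::vector<std::pair<std::size_t, std::size_t>> NaiveDotPlot(
        const std::string& qseq, const std::string& sseq,
        std::size_t win, std::size_t strig) {
    assert(win > 0);
    std::vector<std::pair<std::size_t, std::size_t>> dots;
    if (qseq.size() < win || sseq.size() < win) return dots;
    for (std::size_t q = 0; q + win <= qseq.size(); ++q) {
        for (std::size_t s = 0; s + win <= sseq.size(); ++s) {
            if (CountMatches(qseq, q, sseq, s, win) >= strig) {
                dots.emplace_back(q, s);
            }
        }
    }
    return dots;
}
```

Every window pair is compared from scratch here, so the cost grows with `len(qseq) * len(sseq) * win`; the running-score formulation on the next slide removes the factor of `win`.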

  6. The Standard Algorithm
     DOTPLOT(qScores, s, win, strig):
         dotvec = zeros(len(q))
         for each char c in s:
             dotvec = shift(dotvec, 1)
             dotvec += qScores[c]
             if index(c) > win:
                 delchar = s[index(c) - win]
                 dotvec -= shift(qScores[delchar], win)
             for each dot in dotvec > strig:
                 display(dot)
             end for each dot
         end for each c
     end DOTPLOT
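The running-score update can be sketched as follows. This is an illustration under assumptions, not the DOTTER source: scores are 1 for a match and 0 otherwise, indices are 0-based (so the delete step starts once `j >= win`), a dot is reported when the score is at least `strig`, and `Dot` and `DotPlot` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Dot { std::size_t q; std::size_t s; int score; };  // window start positions

inline std::vector<Dot> DotPlot(const std::string& q, const std::string& s,
                                std::size_t win, int strig) {
    assert(win > 0);
    const std::size_t n = q.size();
    std::vector<Dot> dots;
    if (n < win || s.size() < win) return dots;

    // qScores[c][i] is the score of character c against q[i] (assumed 1/0).
    std::vector<std::vector<int>> qScores(256, std::vector<int>(n, 0));
    for (std::size_t i = 0; i < n; ++i) {
        qScores[static_cast<unsigned char>(q[i])][i] = 1;
    }

    std::vector<int> dotvec(n, 0);
    for (std::size_t j = 0; j < s.size(); ++j) {
        // shift(dotvec, 1): every running score moves one step down its diagonal.
        for (std::size_t i = n; i-- > 1;) dotvec[i] = dotvec[i - 1];
        dotvec[0] = 0;
        // Add the scores for the character entering the window.
        const std::vector<int>& add = qScores[static_cast<unsigned char>(s[j])];
        for (std::size_t i = 0; i < n; ++i) dotvec[i] += add[i];
        // Subtract the scores for the character leaving the window.
        if (j >= win) {
            const std::vector<int>& del =
                qScores[static_cast<unsigned char>(s[j - win])];
            for (std::size_t i = win; i < n; ++i) dotvec[i] -= del[i - win];
        }
        // Report complete windows that reach the threshold.
        if (j + 1 >= win) {
            for (std::size_t i = win - 1; i < n; ++i) {
                if (dotvec[i] >= strig) {
                    dots.push_back({i + 1 - win, j + 1 - win, dotvec[i]});
                }
            }
        }
    }
    return dots;
}
```

Each step costs one add and one subtract per query position regardless of `win`, which is the property the vector version exploits.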

  7. Vector Dot Plot
     VECTORDOTPLOT(qScores, s, win, strig):
         for each vector diagonal D:
             runningScore = vector(0)
             for each char c in s:
                 score = VecLoad(qScores[c])
                 runningScore = VecAdd(score, runningScore)
                 if index(c) > win:
                     delChar = s[index(c) - win]
                     delscore = VecLoad(qScores[delChar])
                     runningScore = VecSub(runningScore, delscore)
                 if VecAnyElementGte(runningScore, strig):
                     scores = VectorUnpack(runningScore)
                     for each score in scores > strig:
                         Output(row(c), col(score), score)
                     end for each score
                 end if VecAnyElementGte()
             end for each c
         end for each D
     end VECTORDOTPLOT
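A portable emulation of the diagonal-vector idea, for illustration only. Assumptions: four `int` lanes stand in for a vector register (Altivec would hold 16 `unsigned char` lanes), scores are 1/0 for match/mismatch, score rows are zero-padded so that loads near the matrix edges stay in bounds, and `Vec`, `Hit`, `VecLoad`, and `VectorDotPlot` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumption: one lane per adjacent diagonal
using Vec = std::array<int, kLanes>;
struct Hit { std::size_t q; std::size_t s; int score; };  // window start positions

// Emulates an unaligned vector load of kLanes consecutive scores.
inline Vec VecLoad(const std::vector<int>& row, std::size_t offset) {
    assert(offset + kLanes <= row.size());
    Vec v{};
    for (std::size_t l = 0; l < kLanes; ++l) v[l] = row[offset + l];
    return v;
}

inline std::vector<Hit> VectorDotPlot(const std::string& q, const std::string& s,
                                      std::size_t win, int strig) {
    assert(win > 0);
    const std::size_t n = q.size(), m = s.size();
    std::vector<Hit> hits;
    if (n < win || m < win) return hits;

    // Score rows padded with m zeros on the left and m + kLanes on the right.
    const std::size_t pad = m;
    std::vector<std::vector<int>> qScores(
        256, std::vector<int>(2 * m + n + kLanes, 0));
    for (std::size_t i = 0; i < n; ++i) {
        qScores[static_cast<unsigned char>(q[i])][pad + i] = 1;
    }

    // Diagonal d = i - j runs from -(m-1) to n-1; base = d + m for lane 0.
    for (std::size_t base = 1; base < m + n; base += kLanes) {
        Vec running{};
        for (std::size_t j = 0; j < m; ++j) {
            const Vec add = VecLoad(qScores[static_cast<unsigned char>(s[j])], base + j);
            for (std::size_t l = 0; l < kLanes; ++l) running[l] += add[l];
            if (j >= win) {
                const Vec del = VecLoad(
                    qScores[static_cast<unsigned char>(s[j - win])], base + j - win);
                for (std::size_t l = 0; l < kLanes; ++l) running[l] -= del[l];
            }
            if (j + 1 < win) continue;  // window not yet complete
            for (std::size_t l = 0; l < kLanes; ++l) {
                const std::size_t padded = base + j + l;  // query index i plus pad
                if (padded < pad + win - 1 || padded >= pad + n) continue;
                if (running[l] >= strig) {
                    hits.push_back({padded - pad + 1 - win, j + 1 - win, running[l]});
                }
            }
        }
    }
    return hits;
}
```

Because each lane follows its own diagonal, the load offset advances by one element per character of `s`, so most loads are unaligned. That is what the load/permute sequence on the Pipeline Management slide addresses.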

  8. Expectations

  9. Data Types
     • DNA
       • unsigned char
       • Window size is generally 16-40, max score 40 with no scoring matrix
     • Protein
       • short
       • Window size is smaller
       • Scoring matrices can lead to negative scores and scores > 127
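The trade-off between element width and lane count can be checked with a small sketch. It assumes a 16-byte register and a 16-bit `short`; `Lanes`, `Fits`, and `NarrowScore` are hypothetical helper names, and the range `-50..200` is just an example of scores that are both negative and above 127.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

constexpr std::size_t kRegisterBytes = 16;  // 128-bit vector register

static_assert(sizeof(short) == 2, "sketch assumes a 16-bit short");

// Number of elements of type T processed per vector operation.
template <typename T>
constexpr std::size_t Lanes() {
    return kRegisterBytes / sizeof(T);
}

// True if every score in [lo, hi] is representable in T.
template <typename T>
constexpr bool Fits(long lo, long hi) {
    return lo >= static_cast<long>(std::numeric_limits<T>::min()) &&
           hi <= static_cast<long>(std::numeric_limits<T>::max());
}

// Converts a score to the lane type, checking that it is representable.
template <typename T>
inline T NarrowScore(long score) {
    assert(Fits<T>(score, score));
    return static_cast<T>(score);
}

// DNA: scores 0..40 fit in unsigned char, giving 16 lanes.
static_assert(Fits<unsigned char>(0, 40), "DNA scores fit in unsigned char");
// Protein: negative scores and scores above 127 need a wider signed type.
static_assert(!Fits<signed char>(-50, 200), "signed char is too narrow");
static_assert(Fits<short>(-50, 200), "short covers the example range");
```

Moving from `unsigned char` to `short` halves the number of lanes from 16 to 8, so the protein case does half as much work per vector operation.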

  10. Stream Management
     • A single stream pointer is similar to indexing, but a little slower
     • Each of the four score streams is indexed only about 1/4 of the time, so maintaining four pointers costs more than an indexed lookup

     // S-sequence is one stream pointer
     s++;

     // Q-sequence is four streams
     // Option 1: Four Pointers
     // Keep pointers to the current position in the score vectors
     qScore[0]++; qScore[1]++; qScore[2]++; qScore[3]++;
     score = *qScore[*s];

     // Option 2: Index
     // Index the score vectors with a counter
     i++;
     score = qScore[*s][i];
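The two options can be compared side by side in a scalar sketch. It assumes the subject sequence is already encoded as nucleotide codes 0..3 and that each score stream is at least as long as the sequence; `Streams`, `SumWithPointers`, and `SumWithIndex` are hypothetical names, and summing the scores is only a way to show that both variants read the same values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// One score stream per nucleotide code 0..3 (assumed encoding).
using Streams = std::array<std::vector<int>, 4>;

// Option 1: keep four pointers and advance all of them every step.
inline long SumWithPointers(const Streams& qScore,
                            const std::vector<unsigned char>& s) {
    std::array<const int*, 4> p{};
    for (std::size_t k = 0; k < 4; ++k) {
        assert(qScore[k].size() >= s.size());
        p[k] = qScore[k].data();
    }
    long total = 0;
    for (unsigned char code : s) {
        assert(code < 4);
        total += *p[code];           // only one stream is read...
        for (auto& ptr : p) ++ptr;   // ...but all four pointers are updated
    }
    return total;
}

// Option 2: keep one counter and index the selected stream.
inline long SumWithIndex(const Streams& qScore,
                         const std::vector<unsigned char>& s) {
    long total = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        assert(s[i] < 4 && qScore[s[i]].size() >= s.size());
        total += qScore[s[i]][i];
    }
    return total;
}
```

Option 1 performs four pointer increments for every score it reads, while Option 2 performs one counter increment, which matches the slide's observation that pointer maintenance costs more than the lookup.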

  11. Pipeline Management
     Sequence of Vector Operations

     // score = VecLoad(qScores[c])
     score1 = vec_ld(0, ptemp);     // unaligned
     score2 = vec_ld(16, ptemp);    // loads
     vperm = vec_lvsl(0, ptemp);
     score = vec_perm(score1, score2, vperm);
     runningScore = vec_add(score, runningScore);

     // delscore = VecLoad(qScores[delChar])
     score1 = vec_ld(0, ptemp);
     score2 = vec_ld(16, ptemp);
     vperm = vec_lvsl(0, ptemp);
     delscore = vec_perm(score1, score2, vperm);
     runningScore = vec_sub(runningScore, delscore);

     // scores is a 16-byte-aligned buffer that receives the lane values
     if (vec_any_ge(runningScore, strig))
         vec_st(runningScore, 0, scores);

     [Figure: cycle-accurate plots of the instructions in flight. The left plot shows a series of add/delete steps with no dots generated. The bottom plot shows the pipeline being interrupted when a dot is generated.]
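The unaligned-load idiom above (two aligned loads, a shift-derived permute vector, then a permute) can be emulated in portable C++ to show what each step contributes. This is a sketch, not Altivec code: `mem` is a byte buffer whose index 0 is assumed to be 16-byte aligned, and `AlignedLoad`, `Lvsl`, `Perm`, and `UnalignedLoad` are hypothetical names modelled on `vec_ld`, `vec_lvsl`, and `vec_perm`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes16 = std::array<std::uint8_t, 16>;

// Like vec_ld: the low four address bits are ignored, so the load always
// starts at the enclosing 16-byte boundary.
inline Bytes16 AlignedLoad(const std::vector<std::uint8_t>& mem, std::size_t addr) {
    const std::size_t base = addr & ~static_cast<std::size_t>(15);
    assert(base + 16 <= mem.size());
    Bytes16 v{};
    for (std::size_t i = 0; i < 16; ++i) v[i] = mem[base + i];
    return v;
}

// Like vec_lvsl: builds the selector {sh, sh+1, ..., sh+15}, sh = addr mod 16.
inline Bytes16 Lvsl(std::size_t addr) {
    const std::size_t sh = addr & 15;
    Bytes16 v{};
    for (std::size_t i = 0; i < 16; ++i) v[i] = static_cast<std::uint8_t>(sh + i);
    return v;
}

// Like vec_perm: each selector byte picks one byte from the 32-byte
// concatenation of a and b.
inline Bytes16 Perm(const Bytes16& a, const Bytes16& b, const Bytes16& sel) {
    Bytes16 r{};
    for (std::size_t i = 0; i < 16; ++i) {
        const std::size_t idx = sel[i] & 31;
        r[i] = idx < 16 ? a[idx] : b[idx - 16];
    }
    return r;
}

// Combines the three steps into one load from an arbitrary offset.
inline Bytes16 UnalignedLoad(const std::vector<std::uint8_t>& mem, std::size_t addr) {
    return Perm(AlignedLoad(mem, addr), AlignedLoad(mem, addr + 16), Lvsl(addr));
}
```

Every unaligned load therefore costs two loads plus a permute, and the second load touches the following 16-byte block even when the address happens to be aligned, so the buffer needs at least that much padding.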

  12. Dot Matrix Structure
     std::vector sparse matrix:
     struct Dot { int col; int value; };
     struct Row { int num; vector<Dot> cols; };
     typedef vector<Row*> DotMatrixVec;

     Memory mapped array:
     struct RowDot { int row; int col; int value; };
     RowDot *out = (RowDot*)mmap(…);

     [Figure: performance in Mops of sparse matrix formats based on data location]
     • An ‘op’ is one complete dot comparison
     • Base is a direct port of the DOTTER algorithm
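The two output layouts can be sketched as follows. Two substitutions are assumptions made for the sake of a self-contained example: `std::unique_ptr<Row>` replaces the raw `Row*` so ownership is explicit, and a `std::vector<RowDot>` stands in for the memory-mapped array. `AddDotSparse` and `AddDotFlat` are hypothetical names, and dots are assumed to arrive in non-decreasing row order.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Layout 1: sparse matrix of rows, each holding its own list of dots.
struct Dot { int col; int value; };
struct Row { int num; std::vector<Dot> cols; };
using DotMatrixVec = std::vector<std::unique_ptr<Row>>;  // Row* on the slide

// Layout 2: one flat record per dot (written to an mmap'd region on the slide).
struct RowDot { int row; int col; int value; };

// Appends a dot to the sparse matrix, opening a new Row when the row changes.
inline void AddDotSparse(DotMatrixVec& m, int row, int col, int value) {
    assert(m.empty() || m.back()->num <= row);  // rows must arrive in order
    if (m.empty() || m.back()->num != row) {
        auto r = std::make_unique<Row>();
        r->num = row;
        m.push_back(std::move(r));
    }
    m.back()->cols.push_back(Dot{col, value});
}

// Appends a dot to the flat array: a single sequential write, no allocation
// per row.
inline void AddDotFlat(std::vector<RowDot>& out, int row, int col, int value) {
    out.push_back(RowDot{row, col, value});
}
```

The flat layout stores the row number with every dot and so uses more space per dot, but each write goes to the next sequential location, whereas the sparse layout may allocate a new `Row` and grow a separate vector per row.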

  13. Traditional Optimizations
     • Prefetch
       • G5 hardware prefetch is very good
       • Attempts to optimize had a negative impact
     • Blocking
       • Slight negative impact due to burps in the stream
     • Unrolling
       • Complicated the code very quickly
       • No measurable improvement

  14. Acknowledgements
     • Jeremiah Willcock helped develop the initial prototype
     References
     • Apple Developer’s Connection, Velocity Engine and Xcode, Apple Developer Connection, Cupertino, CA, 2004. http://developer.apple.com/hardware/ve http://developer.apple.com/tools/xcode
     • A. J. Gibbs and G. A. McIntyre, The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences, Eur J Biochem, 16 (1970), pp. 1-11.
     • E. L. L. Sonnhammer and R. Durbin, A Dot-Matrix Program with Dynamic Threshold Control Suited for Genomic DNA and Protein-Sequence Analysis, Gene-Combis, 167 (1995), pp. 1-10.
