
ClawHMMer



Presentation Transcript


  1. ClawHMMer: Streaming protein search. Daniel Horn, Mike Houston, Pat Hanrahan. Stanford University

  2. Data proliferation in biology • Protein databases • SWISS-PROT (200,000 annotated proteins) • NCBI Nonredundant DB (2.5 million proteins) • UniprotKB/TrEMBL (2.3 million proteins) • DNA sequences • DDBJ Japan (42 million genes) • NCBI GenBank (10 million sequence records)

  3. Protein matching • Problem • Different proteins, same function • Common amino acid patterns • Solution • Fuzzy string matching (Figure: similar amino acid sequences placed side by side, showing the shared patterns)

  4. Current methods • BLAST [Altschul ‘90] • + Fast • - Ad hoc gap penalties: matches depend on the penalty • HMMer [Krogh ‘93] • + Robust probabilistic model • - Slow

  5. New architectures • IBM Cell • Graphics hardware

  6. New architectures • Traditional CPUs • P4 3.0 GHz = 12 GFlops • G5 2.5 GHz = 10 GFlops • New architectures offer more compute power • ATI X1800XT = 120 GFlops • NVIDIA G70 = 176 GFlops • Cell = 250 GFlops • However, new architectures are parallel

  7. Fragment Processor (Diagram of one fragment processor: inputs are 128+ constant values, 8 interpolated values, and texture memory (256-512 MB); 32 temporary R/W registers; up to 16 floats written sequentially to texture memory eventually)

  8. Fragment Processors • GPUs have 16-24 fragment processors • Need ~10,000 threads executing in parallel • Hides memory latency • Shared Program Counter (PC) • Newer hardware has more (1 per 4 processors) (Diagram: many fragment processors, each with private R/W registers, sharing a PC, 128+ constant values, interpolants, and texture memory (256-512 MB); data written sequentially to texture memory)

  9. Brook for GPUs • Abstraction for streaming architectures [Buck ‘04] • Encourages data-parallel applications • Streams • Arrays of data • Kernel • Small function that can run on a fragment processor • No access to global variables • Fixed number of stream inputs, outputs, lookup tables • Mapping operation • Runs the kernel on each element of the input data stream • Writes to an output stream (see the sketch below)
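
A minimal Python sketch of the same streaming model (illustrative only, not Brook syntax; the kernel and stream names are made up): a kernel is a pure per-element function with no global state, and the map operation applies it to every element of the input streams.

      # Hypothetical Python illustration of Brook's stream/kernel/map model.
      def saxpy_kernel(x, y, a=2.0):
          # A "kernel": pure function, runs once per stream element,
          # no access to globals, fixed inputs and one output.
          return a * x + y

      def stream_map(kernel, *input_streams):
          # The mapping operation: run the kernel on each element of the
          # input streams, writing results sequentially to an output stream.
          return [kernel(*elems) for elems in zip(*input_streams)]

      xs = [1.0, 2.0, 3.0]                     # streams are arrays of data
      ys = [10.0, 20.0, 30.0]
      out = stream_map(saxpy_kernel, xs, ys)   # [12.0, 24.0, 36.0]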

  10. Challenges • Current search algorithm doesn’t map well • GPUs require lots of fine-grained parallelism • 10,000 kernel invocations in parallel • High arithmetic intensity • 8:1 compute:fetch ratio • Fragment processors have limited resources • Kernels need to be small

  11. Hidden Markov Model • State machine represents unobservable world state (Diagram: a two-state machine with hidden states Rainy and Sunny)

  12. Transition probabilities (Diagram: P(Rainy→Rainy) = .7, P(Rainy→Sunny) = .3, P(Sunny→Sunny) = .6, P(Sunny→Rainy) = .4)

  13. Observations • Observable clues as to the world state • Probabilities of observation given state: • P(coat|Rainy)=.4, P(nothing|Rainy)=.1, P(umbrella|Rainy)=.5 • P(coat|Sunny)=.3, P(nothing|Sunny)=.6, P(umbrella|Sunny)=.1 (Diagram: the Rainy/Sunny machine of slide 12, annotated with these emission probabilities)
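
For concreteness, the toy weather model written out as Python data (the dictionary layout is an editorial choice; the probabilities are the slides’). Later sketches reuse these names.

      # The Rainy/Sunny HMM of slides 12-13 as plain Python data.
      states = ["Rainy", "Sunny"]
      trans = {("Rainy", "Rainy"): 0.7, ("Rainy", "Sunny"): 0.3,
               ("Sunny", "Rainy"): 0.4, ("Sunny", "Sunny"): 0.6}
      emit = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
              "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}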

  14. Viterbi algorithm • Given an HMM and an observation sequence, Viterbi finds • The most likely path through world states • The probability of this path • Dynamic programming • Fills a probability table • Per observation • Per state • Max over incoming transitions • Each table entry: the probability of the state machine taking the most likely path to this state while emitting the observation sequence so far (see the recurrence below)
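
Written as a recurrence (standard Viterbi notation, added for reference): with V_t(s) the table entry for observation o_t and state s,

      V_0(s) = initial probability of state s
      V_t(s) = P(o_t | s) · max over s' of [ V_{t-1}(s') · P(s' → s) ]

Traceback records, for each entry, which predecessor s' achieved the max.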

  15. Viterbi Example
      • Observations (Monday, Tuesday, Wednesday): Umbrella, Coat, Nothing
      • Initialization (Sunday): Rainy 0.0, Sunny 1.0
      • Monday (Umbrella):
        • Rainy: p(Umbrella|Rainy) · max(0·.7, 1·.4) = .5 · .4 = .2
        • Sunny: p(Umbrella|Sunny) · max(0·.3, 1·.6) = .1 · .6 = .06

  16. Viterbi Traceback Example
      • Observations: Umbrella, Coat, Nothing
      • Filled table (Sunday, Monday, Tuesday, Wednesday):
        • Rainy: 0.0, .20, .056, .00392
        • Sunny: 1.0, .06, .018, .01008
      • Traceback from the larger final entry (.01008 > .00392, so end at Sunny):
        • Wednesday’s Sunny came from Rainy (.056·.3 > .018·.6)
        • Tuesday’s Rainy came from Rainy (.2·.7 > .06·.4)
        • Monday’s Rainy came from Sunny (1.0·.4 > 0.0·.7)
      • Viterbi path: "Sunny, Rainy, Rainy, Sunny"

  17. HMM Training • Propose an alignment • Supply training proteins to HMMer • Compute probabilities of insertions and deletions (Figure: proteins in a family aligned column by column against the proposed alignment; insertion and deletion positions in the region of interest yield the probabilistic model)

  18. Probabilistic model: HMM • State machine represents unobservable world state • For HMMer, this is the string-match progress • Observations are amino acids (Diagram: a left-to-right chain of match states from Start to End, each emitting one amino acid, e.g. F, R, N, T, P, with Junk (insert) states between them and Del (delete) states that emit nothing)

  19. HMMer Searches a Database • Each query sequence from the database is scored against the probabilistic model • If the probability score is high: perform traceback for the alignment (Diagram: database sequences fed as queries through the profile HMM)

  20. Viterbi Algorithm Code

      for obs in sequence:                    // observation loop
          for s in states:                    // next state loop
              for predecessor in states:      // transition loop
                  if max likelihood path to s came from predecessor:
                      fill table[obs][s]
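
A runnable Python version of this triple loop, reusing the toy weather HMM data given after slide 13 (an editorial sketch, not ClawHMMer’s code):

      def viterbi(observations, states, trans, emit, init):
          table = [dict(init)]   # table[0] holds the initial probabilities
          back = []              # back-pointers for the traceback
          for t, obs in enumerate(observations, 1):   # observation loop
              table.append({})
              back.append({})
              for s in states:                        # next state loop
                  # transition loop: pick the most likely predecessor
                  pred = max(states, key=lambda p: table[t - 1][p] * trans[(p, s)])
                  back[-1][s] = pred
                  table[t][s] = emit[s][obs] * table[t - 1][pred] * trans[(pred, s)]
          path = [max(states, key=lambda s: table[-1][s])]   # best final state
          for bp in reversed(back):                          # walk pointers back
              path.append(bp[path[-1]])
          path.reverse()
          return table[-1][path[-1]], path

      # Reproduces slides 15-16 and 41-43:
      # viterbi(["umbrella", "coat", "nothing"], states, trans, emit,
      #         {"Rainy": 0.0, "Sunny": 1.0})
      # -> (~0.01008, ["Sunny", "Rainy", "Rainy", "Sunny"])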

  21. Parallelizing Viterbi Over a Database

      parallel_for observ_seq in database:        // DB loop
          for obs in observ_seq:                  // observation loop
              for s in states:                    // state loop
                  for predecessor in states:      // transition loop
                      if max likelihood path to s came from predecessor:
                          fill table[observ_seq][obs][s]

  22. Flexible parallelism • Pull the database loop inside • Reduces the size/state of the inner-loop code • More flexibility in choosing • Coarse-grained parallelism • Data parallel • Fine-grained parallelism • SIMD instructions

  23. Loop Reordering

      for i = 1 to length(longest observ_seq):     // observation loop
          for s in states:                         // next state loop
              parallel_for observ_seq in database: // DB loop
                  for pred in states:              // transition loop
                      if max likelihood path to s came from pred:
                          fill table[observ_seq][i][s]

      (The body of the parallel_for is the boxed “Kernel”.)

  24. Loop Reordering

      for i = 1 to length(longest observ_seq):     // observation loop
          for s in states:                         // next state loop
              parallel_for observ_seq in database: // DB loop
                  for pred in states:              // transition loop
                      if max likelihood path to s came from pred:
                          fill table[observ_seq][i][s]

      (The body of the parallel_for is the boxed “Kernel”.)
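
A sketch of the reordered structure in Python, with numpy playing the role of the data-parallel map over the database (editorial illustration; sequences are assumed padded to the longest length, matching the loop bound above):

      import numpy as np

      def step_all_sequences(prev, trans, emit, obs_column):
          # prev: (n_seqs, n_states) table row for observation i-1
          # trans: (n_states, n_states), trans[p, s] = P(p -> s)
          # emit: (n_states, alphabet_size); obs_column: (n_seqs,) ints,
          #       the i-th amino acid of every database sequence
          n_seqs, n_states = prev.shape
          cur = np.empty_like(prev)
          for s in range(n_states):                      # next state loop
              # data-parallel over the whole database (the kernel's role):
              best = np.max(prev * trans[:, s], axis=1)  # transition loop
              cur[:, s] = emit[s, obs_column] * best
          return cur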

  25. Further Optimizations • Unroll the transition loop completely • Unroll the state loop by 3 • Gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once (Diagram: the profile HMM with Insert and Del states from Start to End)

  26. Further State Loop Unrolling Increases Arithmetic Intensity • Neighboring probability computations (multi-triplet) • Read similar inputs • Use each other’s intermediate results • Fewer fetches + same math = higher arithmetic intensity (Diagram: neighboring triplets of the profile HMM; see the sketch below)
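
A scalar sketch of the fetch-sharing idea (a hypothetical simplified chain topology where state s is reachable only from s-1 and itself; the real profile-HMM topology has more edges, but the reuse pattern is the same):

      def step_unrolled_by_3(prev, stay, advance, emit):
          # prev[s]: previous observation's probability for state s
          # stay[s] = P(s -> s); advance[s] = P(s -> s+1)
          n = len(prev)                 # assumes (n - 1) divisible by 3
          cur = [emit[0] * prev[0] * stay[0]] + [0.0] * (n - 1)
          for s in range(1, n - 2, 3):
              # 4 fetches feed 3 states; the rolled loop fetches 2 per state
              a, b, c, d = prev[s - 1], prev[s], prev[s + 1], prev[s + 2]
              cur[s]     = emit[s]     * max(b * stay[s],     a * advance[s - 1])
              cur[s + 1] = emit[s + 1] * max(c * stay[s + 1], b * advance[s])
              cur[s + 2] = emit[s + 2] * max(d * stay[s + 2], c * advance[s + 1])
          return cur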

  27. Implementation on Graphics Hardware (GPU) • Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck ‘04] • Downloaded as shader programs • Amino acid sequences → read-only textures • State probabilities → read/write textures • Transition probabilities → constant registers • Emission probabilities → small lookup table • If necessary, traceback performed on the CPU

  28. Experimental Methodology • Tested systems • Sean Eddy’s HMMer 2.3.2 [Eddy ‘03] • Pentium IV 3.0 GHz • Erik Lindahl’s AltiVec v2 implementation [Lindahl ‘05] • 2.4 GHz G5 PowerPC • Our Brook implementation • ATI X1800 XT • NVIDIA 7800GT • Data set • NCBI Nonredundant database with 2.5 million proteins • Representative HMMs from the Pfam database

  29. Adenovirus Performance • ATI hardware outperforms the PowerPC by a factor of 2+ • Graphics hardware performs 10 to 25 times better than HMMer on x86 • Careful optimization may improve x86 in the future

  30. Performance • ATI hardware outperforms the PowerPC by a factor of 2+ • Graphics hardware performs 10 to 25 times better than HMMer on x86 • Careful optimization may improve x86 in the future

  31. Performance analysis • Comparison [ATI X800: 17.5 GB/s, 8.3 Gops] • 1-triplet: bandwidth limited • 4-triplet: instruction-issue limited • Conclusions • The 4-triplet kernel achieves 90% of peak performance • Unrolling the kernel is important
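
A back-of-envelope check on those figures (editorial arithmetic, assuming 4-byte floats): 17.5 GB/s of bandwidth delivers about 4.4 billion floats per second against an 8.3 Gops issue rate, so the balance point is

      8.3 Gops ÷ 4.4 Gfloats/s ≈ 1.9 ops per fetched float

Kernels below that ratio (1-triplet) sit on the bandwidth-limited side; kernels above it (4-triplet) are instruction-issue limited.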

  32. Scales Linearly • Results on • 16-node cluster • Gig-E interconnect • ATI Radeon 9800 • Dual CPUs • Split the 2.5 million protein database across nodes • Linear scalability • Need ~10,000 proteins per node to keep each GPU busy

  33. HMMer-Pfam • Solves the inverse problem of HMMer-search • Given a protein of interest • Search a database of HMMs • Problem for HMMer-Pfam • Requires a traceback for every HMM • Not enough memory on the GPU • The GPU requires 10,000 elements in parallel • Requires storage for 10,000 × num_states × sequence_length floating-point probabilities • Need an architecture with finer parallelism
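
To see why this overflows the GPU (illustrative numbers, not from the slides): with 10,000 parallel elements, a 300-state HMM, and sequences of length 400, the traceback storage alone is

      10,000 × 300 × 400 × 4 bytes ≈ 4.8 GB

versus the 256-512 MB of texture memory noted on slide 7.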

  34. Summary • Streaming version of HMMer • Implemented on GPUs • Performs well • 2-3x a PowerPC G5 • 20-30x a Pentium 4 • Well suited for other parallel architectures • Cell

  35. Acknowledgements • Funding • Sheila Vaidya & John Johnstone @ LLNL • People • Erik Lindahl • ATI folks • NVIDIA folks • Sean Eddy • Ian & Brook Team

  36. Questions? • danielrh @ graphics.stanford.edu • mhouston @ graphics.stanford.edu

  37. Importance of Streaming HMMer Algorithm • Scales to many types of parallel architectures • By adjusting the width of the parallel_for loops • Cell, multicore, clusters … • Cell • Addressable read/write memory • Fine-grained parallelism • Pfam possibility • Tens, not thousands, of parallel DB queries • On Cell, 16-64 database entries could be processed • Each kernel could process all states • Only returns a probability

  38. General HMM Performance • Arbitrary transition matrix • Exactly like repeated matrix-vector multiply • Each matrix is the transition matrix • Scaled by the emission probability • max() supplants the add() operation in the matrix-vector analogy • The Brook for GPUs paper shows • Matrix-vector multiply • Good for streaming architectures • Each element is touched once • Streaming over large quantities of memory • GPU memory controller designed for this mode
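
The analogy in a few lines of numpy (editorial sketch): max replaces the sum of an ordinary matrix-vector product, and the emission probabilities scale the result.

      import numpy as np

      def max_product_step(v, trans, emit_col):
          # ordinary matvec:   (trans.T @ v)[s] = sum_p trans[p, s] * v[p]
          # max-product here:  out[s] = emit_col[s] * max_p trans[p, s] * v[p]
          return emit_col * np.max(trans * v[:, None], axis=0)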

  39. Filling Probability Table (Diagram: the probability table in memory, with the states A B C D repeated per sequence and example probabilities .8 .2 .3 .1; the sequence database below holds Sequence A “F R N F …”, Sequence B “F R N T …”, Sequence C “N T N F …”, Sequence D “T G P F …”)

  40. Competing with CPU Cache • The probability at each state requires a max() over • Incoming state probabilities in memory on graphics hardware • Incoming state probabilities in L1 cache on the CPU • Only the final probability must be written to memory • Only tiny database entries must be read • Luckily the 12-way version is instruction limited, not bandwidth limited, on the GPU • The CPU is also instruction limited

  41. Viterbi Example
      • Observations: Umbrella, Coat, Nothing
      • Monday (Umbrella), starting from Rainy 0, Sunny 1.0:
        • Rainy: p_emit(Rainy, Umbrella) · max(0·.7, 1·.4) = .5 · max(0, .4) = .2
        • Sunny: p_emit(Sunny, Umbrella) · max(0·.3, 1·.6) = .1 · max(0, .6) = .06

  42. Viterbi Example
      • Observations: Umbrella, Coat, Nothing
      • Tuesday (Coat):
        • Rainy: p_emit(Rainy, Coat) · max(.2·.7, .06·.4) = .4 · max(.14, .024) = .056
        • Sunny: p_emit(Sunny, Coat) · max(.2·.3, .06·.6) = .3 · max(.06, .036) = .018

  43. Viterbi Example
      • Observations: Umbrella, Coat, Nothing
      • Wednesday (Nothing):
        • Rainy: p_emit(Rainy, Nothing) · max(.056·.7, .018·.4) = .1 · max(.0392, .0072) = .00392
        • Sunny: p_emit(Sunny, Nothing) · max(.056·.3, .018·.6) = .6 · max(.0168, .0108) = .01008

  44. Performance • ATI hardware outperforms the PowerPC by a factor of 2+ • Graphics hardware performs 10 to 25 times better than HMMer on x86 • Careful optimization may improve x86 in the future

  45. Transition Probabilities • Represent the chance of the world changing state • Trained from example data (Diagram: the profile HMM annotated with transitions learned from proteins in a family; at the illustrated position, .25 chance of a correct match, .5 chance of an insertion, .25 chance of a deletion)

  46. HMMer Scores a Protein • Returns the probable alignment of the protein to the model (Diagram: a database protein fed through the probabilistic model; HMMer outputs a probability score plus the most likely alignment to the model)

  47. Probabilistic Model: HMM • Specialized layout of states • Trained to randomly generate amino-acid chains • Chains likely to belong to a desired family • States drawn with a circle emit no observation (Diagram: the profile HMM with match, Insert, and Del states from Start to End, contrasted with the simple Rainy/Sunny machine)

  48. GPU Limitations • Requires high-granularity parallelism of 10,000 elements • No room to store the entire state table • Download of the database • Readback of results • Few registers • The only mechanism for fast read/write storage • Limits how many state triplets a kernel may process • Limited number of kernel outputs

  49. Viterbi Example with Traceback
      • Observations: Umbrella, Coat, Nothing
      • Monday’s emissions: p(Umbrella|Rainy) = .5 gives .2; p(Umbrella|Sunny) = .1 gives .06
      • Table (Sunday, Monday, Tuesday, Wednesday):
        • Rainy: 0.0, .2, .056, .00392
        • Sunny: 1.0, .06, .018, .01008
      • Traceback comparisons: 1.0·.4 > 0.0·.7 (Monday), .06·.4 < .2·.7 (Tuesday), .018·.6 < .056·.3 (Wednesday), .01008 > .00392 (final)
      • Viterbi path: "Sunny, Rainy, Rainy, Sunny"
