ClawHMMer: Streaming protein search. Daniel Horn, Mike Houston, Pat Hanrahan. Stanford University
Data proliferation in biology • Protein databases • SWISS-PROT (200,000 annotated proteins) • NCBI Nonredundant DB (2.5 million proteins) • UniprotKB/TrEMBL (2.3 million proteins) • DNA sequences • DDBJ Japan (42 million genes) • NCBI GenBank (10 million sequence records)
Protein matching • Problem: different proteins with the same function share common amino acid patterns • Solution: fuzzy string matching [figure: side-by-side placement of two similar amino-acid sequences]
Current methods • BLAST [Altschul ‘90]: + fast; - ad hoc gap penalties, so matches depend on the penalty • HMMer [Krogh ‘93]: + robust probabilistic model; - slow
New architectures [images: IBM Cell, graphics hardware]
New architectures • Traditional CPUs • P4 3.0 GHz = 12 GFlops • G5 2.5 GHz = 10 GFlops • New architectures offer more compute power • ATI X1800XT = 120 GFlops • NVIDIA G70 = 176 GFlops • Cell = 250 GFlops • However, new architectures are parallel
Fragment Processor [diagram] • Texture memory (256-512 MB) • 128+ constant values • 8 interpolated values • 32 temporary R/W registers • 16 floats written sequentially to texture memory
Fragment Processors • Have 16-24 fragment processors • Need 10,000 threads executing in parallel to hide memory latency • Shared program counter (PC); newer hardware has more (1 per 4 processors) [diagram: fragment processors with R/W registers, constant values, interpolants, and texture memory (256-512 MB); data written sequentially to texture memory]
Brook for GPUs • Abstraction for streaming architectures [Buck ‘04] • Encourages data-parallel applications • Streams: arrays of data • Kernel: a small function that can run on a fragment processor, with no access to global variables and a fixed number of stream inputs, outputs, and lookup tables • Mapping operation: runs the kernel on each element of the input stream and writes to the output stream (see the sketch below)
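To make the streaming abstraction concrete, here is a minimal Python analog of a Brook-style kernel and stream map (Brook itself is a C-like language; the names below are illustrative, not the Brook API):

    # A stream is just an array of data; a kernel is a small pure function
    # applied to each element, with no access to global variables.
    def scale_kernel(element, scale):            # kernel: fixed inputs, one output
        return element * scale

    def stream_map(kernel, input_stream, *constants):
        # The runtime applies the kernel independently to every element of the
        # input stream, which is what lets the GPU run thousands of kernel
        # instances in parallel.
        return [kernel(x, *constants) for x in input_stream]

    output_stream = stream_map(scale_kernel, [1.0, 2.0, 3.0], 0.5)   # -> [0.5, 1.0, 1.5]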
Challenges • The current search algorithm doesn’t map well to these architectures • They require lots of fine-grained parallelism: 10,000 kernel invocations in parallel • They require high arithmetic intensity: an 8:1 compute:fetch ratio • Fragment processors have limited resources, so kernels need to be small
Hidden Markov Model • State machine represents unobservable world state [diagram: two states, Rainy and Sunny]
Transition probabilities [diagram: P(Rainy→Rainy)=.7, P(Rainy→Sunny)=.3, P(Sunny→Rainy)=.4, P(Sunny→Sunny)=.6]
Observations • Observable clues as to the world state • Probabilities of observation: P(coat|Rainy)=.4, P(nothing|Rainy)=.1, P(umbrella|Rainy)=.5; P(coat|Sunny)=.3, P(nothing|Sunny)=.6, P(umbrella|Sunny)=.1
Viterbi algorithm • Given an HMM and an observation sequence, Viterbi finds the most likely path through the world states and the probability of that path • Dynamic programming: fills a probability table, per observation and per state, taking the max over incoming transitions • Each table entry is the probability of the state machine taking the most likely path to that state while emitting the observation sequence so far
Viterbi Example • Observations: Umbrella (Monday), Coat (Tuesday), Nothing (Wednesday) • Start (Sunday): Rainy 0.0, Sunny 1.0 • First step (Umbrella): Rainy = P(Umbrella|Rainy)·max(0·.7, 1.0·.4) = .5·.4 = .2; Sunny = P(Umbrella|Sunny)·max(0·.3, 1.0·.6) = .1·.6 = .06
Viterbi Traceback Example • Observations: Umbrella, Coat, Nothing • Table (Monday-Wednesday): Rainy = .2, .056, .00392; Sunny = .06, .018, .01008 • Traceback: the final state is Sunny (.01008 > .00392); it came from Rainy (.056·.3 > .018·.6); that Rainy came from Rainy (.2·.7 > .06·.4); that Rainy came from Sunny (1.0·.4 > 0·.7) • Viterbi path: “Sunny, Rainy, Rainy, Sunny”
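A minimal, runnable Python sketch of the Viterbi recurrence on this weather HMM (the probabilities and observations come from the slides above; the variable and function names are ours):

    states = ["Rainy", "Sunny"]
    # transition[prev][next], from the transition-probability slide
    transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
                  "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
    emission = {"Rainy": {"umbrella": 0.5, "coat": 0.4, "nothing": 0.1},
                "Sunny": {"umbrella": 0.1, "coat": 0.3, "nothing": 0.6}}
    start = {"Rainy": 0.0, "Sunny": 1.0}            # Sunday is known to be Sunny
    observations = ["umbrella", "coat", "nothing"]  # Monday, Tuesday, Wednesday

    def viterbi(observations):
        prob, back = [start], []
        for obs in observations:                    # observation loop
            col, ptr = {}, {}
            for s in states:                        # next-state loop
                # max over incoming transitions (the dynamic-programming step)
                pred = max(states, key=lambda p: prob[-1][p] * transition[p][s])
                col[s] = emission[s][obs] * prob[-1][pred] * transition[pred][s]
                ptr[s] = pred
            prob.append(col)
            back.append(ptr)
        path = [max(states, key=lambda s: prob[-1][s])]   # traceback
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return prob[-1], list(reversed(path))

    final, path = viterbi(observations)
    # final ≈ {'Rainy': 0.00392, 'Sunny': 0.01008}
    # path == ['Sunny', 'Rainy', 'Rainy', 'Sunny']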
HMM Training • Propose an alignment of the proteins in the family • Supply the training proteins to HMMer • Compute the probabilities of insertions and deletions to obtain the probabilistic model [figure: proposed alignment of family members with a region of interest, an insertion, and a deletion marked]
Probabilistic model: HMM • State machine represents unobservable world state • For HMMer, this is the string-match progress • Observations are amino acids [diagram: profile HMM with Start and End, match states emitting F, R, N, T, P, Junk (insert) states, and Del (delete) states; Del states emit nothing]
HMMer Searches a Database • Each database protein is scored against the probabilistic model (the query) • If the probability score is high: perform traceback for the alignment [figure: database of sequences streamed through the profile HMM]
Viterbi Algorithm Code

for obs in sequence:                     // observation loop
    for s in states:                     // next-state loop
        for predecessor in states:       // transition loop
            if max likelihood path to s came from predecessor:
                fill table[obs][s]
Parallelizing Viterbi Over a Database

parallel_for observ_seq in database:         // DB loop
    for obs in observ_seq:                   // observation loop
        for s in states:                     // state loop
            for predecessor in states:       // transition loop
                if max likelihood path to s came from predecessor:
                    fill table[observ_seq][obs][s]
Flexible parallelism • Pull database loop inside • Reduces size/state of inner loop code • More flexibility in choosing • Coarse grained parallelism • Data Parallel • Fine Grained Parallelism • SIMD Instructions
Loop Reordering (the kernel body is marked; a NumPy sketch follows below)

for i = 1 to length(longest observ_seq):     // observation loop
    for s in states:                         // next-state loop
        parallel_for observ_seq in database: // DB loop
            // --- kernel ---
            for pred in states:              // transition loop
                if max likelihood path to s came from pred:
                    fill table[observ_seq][i][s]
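A hedged NumPy sketch of what the reordering buys (illustrative only; the real kernels are Brook shaders working on state triplets): with the database loop innermost, one pass advances every sequence by one observation, so the per-sequence work becomes a data-parallel array operation.

    import numpy as np

    def viterbi_pass(prob, obs_column, transition, emission):
        # prob[d, s]      : current Viterbi probability of state s for sequence d
        # obs_column[d]   : the i-th amino-acid index of database sequence d (padded)
        # transition[p, s]: probability of moving from state p to state s
        # emission[s, a]  : probability of state s emitting amino acid a
        new_prob = np.empty_like(prob)
        for s in range(prob.shape[1]):                       # next-state loop (on the host)
            best = np.max(prob * transition[:, s], axis=1)   # max over incoming transitions,
            new_prob[:, s] = emission[s, obs_column] * best  # for all sequences at once
        return new_prob

The body of the state loop plays the role of the kernel; the leading (database) axis supplies the thousands of independent elements the GPU needs.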
Further Optimizations • Unroll the transition loop completely • Unroll the state loop by 3 • Gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once [diagram: profile HMM with Start/End, match states F R N T P, Insert and Del states, and a Junk state]
Further State Loop Unrolling Increases Arithmetic Intensity • Neighboring probability computations (multi-triplet) read similar inputs and use each other’s intermediate results • Fewer fetches + same math = higher arithmetic intensity [diagram: same profile HMM]
Implementation on Graphics Hardware (GPU) • Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck ‘04], downloaded as shader programs • Amino acid sequences: read-only textures • State probabilities: read/write textures • Transition probabilities: constant registers • Emission probabilities: small lookup table • If necessary, traceback is performed on the CPU
Experimental Methodology • Tested systems • Sean Eddy’s HMMer 2.3.2 [Eddy ‘03] on a Pentium IV 3.0 GHz • Erik Lindahl’s AltiVec v2 implementation [Lindahl ‘05] on a 2.4 GHz G5 PowerPC • Our Brook implementation on an ATI X1800 XT and an NVIDIA 7800GT • Data set • NCBI Nonredundant database with 2.5 million proteins • Representative HMMs from the Pfam database
Adenovirus Performance • ATI hardware outperforms the PowerPC by a factor of 2+ • Graphics hardware performs 10-25 times better than HMMer on x86 • Careful optimization may improve x86 in the future
Performance • ATI hardware outperforms the PowerPC by a factor of 2+ • Graphics hardware performs 10-25 times better than HMMer on x86 • Careful optimization may improve x86 in the future
Performance analysis • Comparison [ATI x800; 17.5GB/s; 8.3Gops] • 1-Triplet: Bandwidth limited • 4-Triplet: Instruction Issue limited • Conclusions • 4-triplet kernel achieves 90% of peak performance • Unrolling kernel important
Scales Linearly • Results on a 16-node cluster (Gig-E interconnect; ATI Radeon 9800 and dual CPUs per node) • Split the 2.5 million protein database across nodes • Linear scalability • Need 10,000 proteins per cluster to keep the GPUs busy
HMMer-Pfam • Solves the inverse problem of HMMer-search: given a protein of interest, search a database of HMMs • Problem for HMMer-pfam: it requires a traceback for every HMM, and there is not enough memory on the GPU • The GPU requires 10,000 elements in parallel, which requires storage for 10,000 × num_states × sequence_length floating-point probabilities (see the sizing sketch below) • Need an architecture with finer-grained parallelism
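A back-of-the-envelope sizing, with assumed (not measured) model and sequence sizes, shows why the per-HMM traceback state does not fit in 256-512 MB of texture memory:

    # All sizes below are illustrative assumptions, not numbers from the talk.
    parallel_elements = 10_000      # elements the GPU needs in flight
    num_states = 600                # e.g. a ~200-position profile HMM (match/insert/delete)
    sequence_length = 400           # a typical protein length
    bytes_per_float = 4

    table_bytes = parallel_elements * num_states * sequence_length * bytes_per_float
    print(table_bytes / 2**30, "GiB")   # roughly 9 GiB of traceback state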
Summary • Streaming version of HMMer • Implemented on GPUs • Performs well • 2-3x a PowerPC G5 • 20-30x a Pentium 4 • Well suited for other parallel architectures • Cell
Acknowledgements • Funding • Sheila Vaidya & John Johnstone @ LLNL • People • Erik Lindahl • ATI folks • NVIDIA folks • Sean Eddy • Ian & Brook Team
Questions? • danielrh @ graphics.stanford.edu • mhouston @ graphics.stanford.edu
Importance of Streaming HMMer Algorithm • Scales to many types of parallel architectures by adjusting the width of the parallel_for loops: Cell, multicore, clusters … • Cell: addressable read/write memory, fine-grained parallelism • Pfam possibility: tens, not thousands, of parallel DB queries • On Cell, 16-64 database entries could be processed; each kernel could process all states and return only a probability
General HMM Performance • With an arbitrary transition matrix, Viterbi is exactly like a repeated matrix-vector multiply: each matrix is the transition matrix scaled by the emission probability, and max() supplants the add() operation in the matrix-vector analogy (see the sketch below) • The Brook for GPUs paper shows that matrix-vector multiply is good for streaming architectures: each element is touched once while streaming over large quantities of memory, and the GPU memory controller is designed for this access mode
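As a small sketch of that analogy (our notation, not code from the paper), one Viterbi step is a matrix-vector product in which max() replaces the sum:

    import numpy as np

    def viterbi_step(prob, transition, emit_col):
        # Ordinary matrix-vector multiply: new[s] = sum_p prob[p] * transition[p, s]
        # Viterbi step: the sum becomes a max, then scale by the emission probability
        return emit_col * np.max(prob[:, None] * transition, axis=0)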
Filling Probability Table [figure: the probability table holds a column of states (A B C D, repeated per observation) for every sequence in the database; example database sequences F R N F…, F R N T…, N T N F…, T G P F…]
Competing with CPU Cache • The probability at each state requires a max() over incoming state probabilities: these live in memory on graphics hardware but in L1 cache on the CPU • Only the final probability must be written to memory • Only tiny database entries must be read • Luckily the 12-way version is instruction limited, not bandwidth limited, on the GPU • The CPU is also instruction limited
Viterbi Example, step 1 (Umbrella) • Rainy: pemit(Rainy,Umbrella)·max(0·.7, 1.0·.4) = .5·max(0, .4) = .2 • Sunny: pemit(Sunny,Umbrella)·max(0·.3, 1.0·.6) = .1·max(0, .6) = .06
Viterbi Example, step 2 (Coat) • Rainy: pemit(Rainy,Coat)·max(.2·.7, .06·.4) = .4·max(.14, .024) = .056 • Sunny: pemit(Sunny,Coat)·max(.2·.3, .06·.6) = .3·max(.06, .036) = .018
Viterbi Example, step 3 (Nothing) • Rainy: pemit(Rainy,Nothing)·max(.056·.7, .018·.4) = .1·max(.0392, .0072) = .00392 • Sunny: pemit(Sunny,Nothing)·max(.056·.3, .018·.6) = .6·max(.0168, .0108) = .01008
Transition Probabilities • Represent the chance of the world changing state • Trained from example data [figure: at one position, the proteins in the family give a 25% chance of a correct match, a 50% chance of an insertion, and a 25% chance of a deletion (.25/.5/.25)]
HMMer Scores a Protein • Returns a probability score and the probable alignment of the protein to the model [figure: a database protein and the probabilistic model are fed to HMMer, which produces a probability score and the most likely alignment to the model]
Probabilistic Model: HMM • Specialized layout of states • Trained to randomly generate amino-acid chains that are likely to belong to a desired family • States drawn with a circle emit no observation [diagram: profile HMM with Start/End, match states F R N T P, Insert and Del states, and a Junk state]
GPU Limitations • Requires very wide parallelism: 10,000 elements • No room to store the entire state table • Download of the database and readback of results • Few registers: the only mechanism for fast read/write storage, which limits how many state triplets a kernel may process • Limited number of kernel outputs
Viterbi Example with Traceback • Observations: Umbrella, Coat, Nothing • Probabilities: Sunday Rainy 0 / Sunny 1.0; Monday .2 / .06; Tuesday .056 / .018; Wednesday .00392 / .01008 • Viterbi path: “Sunny, Rainy, Rainy, Sunny”