From PSI-BLAST to HMMer

1. From PSI-BLAST to HMMer. Professor Mark Pallen. Credits: Stephanie Minnema (University of Calgary), David Wishart (University of Alberta).

2. Advanced BLAST Methods. The NCBI BLAST pages offer several advanced BLAST methods: PSI-BLAST, PHI-BLAST and RPS-BLAST. All are powerful methods based on protein similarities.

3. Position-Specific Iterated BLAST (PSI-BLAST). Intuition: substitution matrices should be specific to a particular site, e.g. penalize an alanine-to-glycine substitution more in a helix. Idea: use BLAST with high stringency to get a set of closely related sequences; align those sequences to create a new substitution matrix for each position; then use that matrix to find additional sequences. It is a cycling/iterative method. It gives increased sensitivity for detecting distantly related proteins and can give insight into functional relationships. Very refined statistical methods. Fast: still based on BLAST methods. Simple to use.

4. PSI-BLAST Principle. (1) First, a standard blastp is performed. (2) The highest scoring hits are used to generate a multiple alignment. (3) A PSSM (position-specific scoring matrix) is generated from the multiple alignment: highly conserved residues get high scores, less conserved residues get lower scores. (4) Another similarity search is performed, this time using the new PSSM. Steps 2-4 can be repeated until convergence, i.e. until no new sequences appear after an iteration.
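The PSSM step can be sketched in a few lines of Python. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency and a flat pseudocount, and it ignores gaps and sequence weighting. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters are simply skipped.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[res] for col, res in zip(pssm, segment))

# Toy alignment (made up): column 1 is fully conserved, column 2 is variable.
toy_alignment = ["HIGH", "HIGH", "HLGH", "HVGN"]
pssm = build_pssm(toy_alignment)
print(round(pssm[0]["H"], 2), round(pssm[1]["I"], 2))
```

The fully conserved histidine in the first column scores higher than the most common residue in the variable second column, which is the behaviour the slide describes.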

5. Example: Aminoacyl tRNA Synthetases. 20 enzymes for 20 amino acids. Each is very different: big, small, monomers, tetramers... All bind to their appropriate tRNAs and amino acids with high specificity. TrpRS and TyrRS share only 13% sequence identity, BUT the overall structures of TrpRS and TyrRS are similar. Structure-function relationship.

7. So is there sequence similarity between TyrRS and TrpRS? Given the structural similarities, we would expect to find sequence similarity... BUT! A blastp search of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an E-value cutoff of 10.

9. Try Using PSI-BLAST... PSI-BLAST is available from the BLAST main page. The query form is just like the one for blastp, BUT one extra formatting option must be used: "Format for PSI-BLAST" (activate the tick box!). A second E-value cutoff, the "Threshold for inclusion", determines which alignments will be used for the PSSM build. First search using TyrRS as query: Db = SwissProt; limit = Bacteria [ORGN]; Threshold for inclusion = 0.005.

12. After A Few Iterations...

14. Power of PSI-BLAST. We knew TyrRS and TrpRS were similar, functionally and structurally. BLASTP gave no indication of this. PSI-BLAST was able to detect their weak sequence similarity. Words of caution: be sure to inspect and think about the results included in the PSSM build; include/exclude sequences on the basis of biological knowledge: you are in the driving seat! PSI-BLAST performance varies according to the choice of matrix, filter, statistics etc., just like BLASTP.

15. Why (not) PSI-BLAST? If the sequences used to construct the position-specific scoring matrices (PSSMs) are all homologous, the sensitivity at a given specificity improves significantly. However, if non-homologous sequences are included in the PSSMs, they are "corrupted": they then pull in more non-homologous sequences and become worse than generic substitution matrices.

17. PSI-BLAST caveats. The increased ability to find distant homologues comes at the cost of the additional care required to prevent non-homologous sequences from being included in the PSSM calculation. When in doubt, leave it out! Examine sequences with moderate similarity carefully. Be particularly cautious about matches to sequences with highly biased amino acid content: low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology. Screen them out of your query sequences!

18. PSI-BLAST on the command line. As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power and opens up additional options, e.g. PSI-BLASTing over nucleotide databases; automating the number of iterations; trying out lots of different settings in parallel; inputting multiple sequences.
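As a rough sketch of what scripting a command-line search might look like, the Python below assembles argument lists for the `psiblast` program from the current NCBI BLAST+ suite (an assumption: the slides may have used an older command-line tool). The query file name, database name and output names are hypothetical placeholders, and nothing is executed here.

```python
import shlex

def psiblast_command(query, db, iterations=3, inclusion_ethresh=0.005,
                     out="psiblast_hits.txt"):
    """Build an argument list for BLAST+ psiblast (file and db names are placeholders)."""
    return [
        "psiblast",
        "-query", query,                               # FASTA file with the query protein
        "-db", db,                                     # name of a formatted protein database
        "-num_iterations", str(iterations),            # automate the number of iterations
        "-inclusion_ethresh", str(inclusion_ethresh),  # threshold for inclusion in the PSSM
        "-out", out,
    ]

# One command per inclusion threshold, e.g. to compare settings side by side.
thresholds = (0.001, 0.005, 0.01)
commands = [
    psiblast_command("tyrrs.fasta", "swissprot",
                     inclusion_ethresh=t, out=f"hits_{t}.txt")
    for t in thresholds
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` once BLAST+ and a suitable database are installed locally.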

19. PHI-BLAST: Pattern-Hit Initiated BLAST. PHI-BLAST principle: the same method as PSI-BLAST, except that the first search starts with the query sequence plus a pattern for a motif in the query. PHI-BLAST finds sequences that contain the motif and have significant sequence similarity in the vicinity of the motif occurrence. Highly specific.

20. Example: TyrRS. TyrRS contains the aaRS class-I signature. We want to find sequences that contain that motif and show regional similarity to TyrRS. First: get the Prosite pattern for the class-I signature (Prosite = database of protein families and domains).

21. http://ca.expasy.org/prosite

22. P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
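To see what such a pattern matches, it can be translated into an ordinary regular expression. The converter below is a minimal sketch covering the common PROSITE syntax elements (`x`, repeat counts, allowed and forbidden residue sets, terminus anchors); the function name and the test sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    regex = pattern.replace(" ", "").rstrip(".")
    regex = regex.replace("-", "")                          # element separators
    regex = regex.replace("x", ".")                         # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")      # {ABC} = any residue except A, B, C
    regex = regex.replace("(", "{").replace(")", "}")       # x(0,2) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")       # N- and C-terminus anchors
    return regex

CLASS_I = ("P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-"
           "[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]")
regex = prosite_to_regex(CLASS_I)
print(regex)

# Hypothetical toy sequence constructed so that it contains one motif occurrence.
toy_sequence = "MKTPGDALHLGHLAAA"
match = re.search(regex, toy_sequence)
print(match.start(), match.group())
```

PHI-BLAST goes further than this: it requires the motif and also scores the sequence similarity around each occurrence, which a plain regular expression search does not do.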

24. PHI-BLAST Results. After the first search, PHI-BLAST functions the same as PSI-BLAST: the result page is the same and you can iterate in the same way. Try it later if you like...

25. The Key to PHI- and PSI-BLAST. Generating the multiple alignments to create PSSMs refines scoring in searches. Annotated collections of multiple alignments defining domains exist: the Conserved Domain Database (CDD) contains 18039 alignments (10013 last year). The CDD can be searched using CD-Search, which uses RPS-BLAST.

26. RPS-BLAST: Reverse Position-Specific BLAST. The opposite of PSI-BLAST. CDD multiple alignments are converted to PSSMs. The PSSMs are processed and turned into a searchable database. Queries are searched against the PSSMs using RPS-BLAST. The output indicates conserved domains within the query sequence.

27. Example: CRADD protein

29. Profile Hidden Markov Models. Statistical models of multiple sequence alignments. They capture position-specific information about how conserved each column of the alignment is and which residues are likely. They use position-specific scores for amino acids (or nucleotides) and position-specific penalties for opening and extending an insertion or deletion.

30. Advantages of using HMMs. HMMs have a formal probabilistic basis: probability theory guides how all the scoring parameters should be set. They can do things that more heuristic methods cannot do easily; for example, a profile HMM can be trained from unaligned sequences if a trusted alignment isn't yet known. HMMs have a consistent theory behind gap and insertion scores.

31. Advantages of using HMMs. In most details, profile HMMs are a slight improvement over a carefully constructed profile, but less skill and manual intervention are necessary to use profile HMMs. HMMs can produce true global alignments, unlike BLAST.

32. Limitations of HMMs. They do not capture any higher-order correlations: an HMM assumes that the identity of a particular position is independent of the identity of all other positions. They make poor models of RNAs because an HMM cannot describe base pairs. Compare protein "threading" methods, which usually include scoring terms for nearby amino acids in a three-dimensional protein structure. Slower and less user-friendly than PSI-BLAST.
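The independence assumption can be illustrated with a deliberately stripped-down model: per-column residue frequencies only, with no insert or delete states (so this is not a full profile HMM). The four two-base "sequences" are invented to mimic perfectly complementary RNA base pairs.

```python
import math
from collections import Counter

def column_model(sequences):
    """Per-column residue frequencies: every column is modelled independently."""
    n = len(sequences)
    return [
        {res: count / n for res, count in Counter(column).items()}
        for column in zip(*sequences)
    ]

def probability(model, seq):
    """Probability of seq under the model: a plain product over positions."""
    return math.prod(col.get(res, 0.0) for col, res in zip(model, seq))

# Training data in which the two positions always form a Watson-Crick pair.
paired = ["AU", "UA", "GC", "CG"]
model = column_model(paired)

print(probability(model, "AU"))  # a valid base pair
print(probability(model, "AA"))  # not a base pair
```

Both calls return the same value, because each column on its own is uniform over the four bases; the pairing information lives only in the correlation between columns, which a position-independent model cannot represent.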

33. Applications of profile HMMs. Database searching for weak homologies (an alternative to PSI-BLAST). Automated annotation of the domain structure of proteins.

34. Applications of profile HMMs. Useful for organizing sequences into evolutionarily related families. Databases like Pfam are constructed by distinguishing between a stable, curated "seed" alignment of a small number of representative sequences and "full" alignments of all detectable homologs. HMMER is used to make a model of the seed, search the database for homologs, and automatically produce the full alignment by aligning every sequence to the seed consensus.

35. Constructing a profile HMM. A multiple sequence alignment is made of known members of a given protein family; the quality of the alignment and the number and diversity of the sequences are crucial for success. A profile HMM of the family is built from the alignment: the model-building program uses the alignment together with its prior knowledge of the general nature of proteins. A model-scoring program is then used to assign a score, with respect to the model, to any sequence of interest; the better the score, the higher the chance that the query sequence is homologous to the protein family in the model. Each sequence in a database is scored to find the members of the family present in the database.

36. Profile HMM programs: HMMER. Developed by Sean Eddy. Freely available under the GNU General Public License. Includes model-building and model-scoring programs relevant to homology detection. Contains a program that calibrates a model by scoring it against a set of random sequences and fitting an extreme value distribution to the resultant raw scores; the parameters of this distribution are then used to calculate accurate E-values for sequences of interest.
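The calibration idea can be sketched as follows: fit a Gumbel (type I extreme value) distribution to scores of random sequences, then convert the score of a real hit into an E-value. The method-of-moments fit used here is a simplification chosen for brevity and is not claimed to be HMMER's own fitting procedure; the "random sequence scores" are simulated with made-up parameters, and all names are illustrative.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location mu and rate lam."""
    lam = math.pi / (statistics.stdev(scores) * math.sqrt(6))
    mu = statistics.mean(scores) - EULER_GAMMA / lam
    return mu, lam

def e_value(score, mu, lam, db_size):
    """Expected number of random sequences in the database scoring >= score."""
    p_value = 1.0 - math.exp(-math.exp(-lam * (score - mu)))
    return db_size * p_value

# Simulated raw scores of a model against random sequences (assumed parameters).
random.seed(0)
TRUE_MU, TRUE_LAM = -20.0, 0.7
random_scores = [
    TRUE_MU - math.log(random.expovariate(1.0)) / TRUE_LAM
    for _ in range(5000)
]

mu, lam = fit_gumbel(random_scores)
for raw_score in (-15.0, 0.0, 15.0):
    print(raw_score, e_value(raw_score, mu, lam, db_size=100_000))
```

Higher raw scores map to smaller E-values, and the fitted parameters land close to the values used to simulate the scores.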

37. Programs in the HMMER 2 package
hmmalign: Align sequences to an existing model.
hmmbuild: Build a model from a multiple sequence alignment.
hmmcalibrate: Take an HMM and empirically determine parameters that make searches more sensitive by calculating more accurate E-values.
hmmconvert: Convert a model file into different formats, including a compact HMMER 2 binary format and "best effort" emulation of GCG profiles.
hmmemit: Emit sequences probabilistically from a profile HMM.
hmmfetch: Get a single model from an HMM database.
hmmindex: Index an HMM database.
hmmpfam: Search an HMM database for matches to a query sequence.
hmmsearch: Search a sequence database for matches to an HMM.
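A typical build, calibrate and search sequence with these programs might be scripted as below. This is a sketch assuming the HMMER 2 argument order (`hmmbuild <hmmfile> <alignment>`, `hmmcalibrate <hmmfile>`, `hmmsearch <hmmfile> <sequence file>`); later HMMER releases changed the tool set, so check the user guide for the installed version. All file names are hypothetical, and the commands are only printed, not run.

```python
import shlex

def hmmer2_pipeline(alignment, model, database):
    """Return the three HMMER 2 commands for build -> calibrate -> search."""
    return [
        ["hmmbuild", model, alignment],   # build a model from the seed alignment
        ["hmmcalibrate", model],          # calibrate E-value parameters in place
        ["hmmsearch", model, database],   # score every database sequence
    ]

steps = hmmer2_pipeline("family_seed.msf", "family.hmm", "proteins.fasta")
for cmd in steps:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order; calibration has to finish before the search so that the reported E-values use the fitted parameters.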

38. Profile HMM programs: SAM. Developed by the bioinformatics group at the University of California, Santa Cruz. Not open source, but free for academic use. Does not include a model-calibration program: the model-scoring program calculates E-values directly, using a theoretical function that takes as its argument the difference between the raw scores of the query sequence and its reverse. An important component is the target99 script, which generates a multiple sequence alignment suitable for model building.

39. Clash of the Titans: PSI-BLAST v. HMMer v. SAM! Nucleic Acids Research, 2002, Vol. 30 No. 19 4321. SAM consistently produces better models than HMMER; the relative performance of the model-scoring components varies. HMMER is 1-3 times faster than SAM with large databases; SAM is faster with small ones. Both methods have effective low complexity and repeat sequence masking, and the accuracy of their E-values was comparable. The SAM T99 iterative database search procedure outperforms PSI-BLAST, BUT scoring of PSI-BLAST profiles is more than 30 times faster than scoring of SAM models.

41. Summary
PSI-BLAST. Input: SEQUENCE. Database: SEQUENCES. Algorithm: constructs a PSSM from an initial pass and uses this in the next pass. Output: distantly related sequences. +sensitive, -specific.
PHI-BLAST. Input: PROFILE + SEQUENCE. Database: SEQUENCES. Algorithm: same as PSI-BLAST except that it starts with a profile. Output: sequences containing the domain and that are similar in the domain region. +sensitive, -> -specific.
RPS-BLAST. Input: SEQUENCE. Database: DOMAINS. Output: domains found in the sequence. +sensitive, +specific.
HMMs. More sensitive, but less user-friendly than PSI-BLAST and slower.
