1. From PSI-BLAST to HMMer Professor Mark Pallen
Credits: Stephanie Minnema
University of Calgary
David Wishart
University of Alberta
2. Advanced BLAST Methods The NCBI BLAST pages have several advanced BLAST methods available
PSI-BLAST
PHI-BLAST
RPS-BLAST
All are powerful methods based on protein similarities
3. Position-Specific-Iterated-BLAST Intuition
substitution matrices should be specific to a particular site.
e.g. penalize alanine→glycine more in a helix
Idea
Use BLAST with high stringency to get a set of closely related sequences.
Align those sequences to create a new substitution matrix for each position.
Then use that matrix to find additional sequences
Cycling/iterative method
Gives increased sensitivity for detecting distantly related proteins
Can give insight into functional relationships
Very refined statistical methods
Fast – still based on BLAST methods
Simple to use
4. PSI-BLAST Principle First, a standard blastp is performed
The highest scoring hits are used to generate a multiple alignment
A PSSM is generated from the multiple alignment.
Highly conserved residues get high scores
Less conserved residues get lower scores
Another similarity search is performed, this time using the new PSSM
Steps 2-4 can be repeated until convergence
No new sequences appear after iteration
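The PSSM step above can be sketched in a few lines of Python. This is a minimal illustration only, not PSI-BLAST's actual procedure: it assumes an ungapped toy alignment, a uniform background frequency and a flat pseudocount, whereas PSI-BLAST itself applies sequence weighting and more refined pseudocounts. The names `build_pssm` and `score` are made up for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0, background=None):
    """Return one {residue: log-odds score} dict per alignment column."""
    if background is None:
        # Assumption: uniform background frequencies (real tools use observed ones).
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        column = {}
        for aa, bg in background.items():
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / bg)  # log-odds versus background
        pssm.append(column)
    return pssm


def score(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Toy alignment, invented for illustration.
toy_alignment = ["GHLG", "GHIG", "GHVG", "GTLG"]
toy_pssm = build_pssm(toy_alignment)
```

In the toy alignment the fully conserved G in column 1 receives a higher score than the partly conserved L in column 3, which in turn outscores a residue never seen in that column.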
5. Example: Aminoacyl tRNA Synthetases 20 enzymes for 20 amino acids
Each is very different
Big, small, monomers, tetramers…
All bind to their appropriate tRNAs and amino acids, with high specificity
TrpRS and TyrRS share only 13% sequence identity
BUT, overall structures of TrpRS and TyrRS are similar
Structure ↔ Function relationship
7. So is there sequence similarity between TyrRS and TrpRS? Given structural similarities, we would expect to find sequence similarity…
BUT!
blastp of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an e-value cutoff of 10
9. Try Using PSI-BLAST… PSI-BLAST available from BLAST main page
Query form just like for blastp
BUT: one extra formatting option must be used
“Format for PSI-BLAST” – activate the tick box!
Second e-value cutoff used to determine which alignments will be used for PSSM build… “Threshold for inclusion”
First search using TyrRS as query
Db = SwissProt; limit = Bacteria [ORGN]
Threshold for inclusion = 0.005
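The two e-value cutoffs play different roles: one decides which hits are reported, the other which hits feed the PSSM. A small Python sketch of that distinction follows. The function name `split_hits`, the hit names and the e-values are all invented for illustration and are not results from the search described here.

```python
def split_hits(hits, report_evalue=10.0, inclusion_evalue=0.005):
    """Separate reported hits from those admitted to the PSSM build.

    hits: list of (name, e_value) pairs.
    """
    reported = [(name, e) for name, e in hits if e <= report_evalue]
    included = [name for name, e in reported if e <= inclusion_evalue]
    return reported, included


# Made-up hits: (name, e-value). Not real search output.
example_hits = [("hit_A", 1e-50), ("hit_B", 0.004), ("hit_C", 0.8), ("hit_D", 25.0)]
reported, included = split_hits(example_hits)
```

Here `hit_C` is shown in the results but kept out of the PSSM, and `hit_D` is not reported at all.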
12. After A Few Iterations…
14. Power of PSI-BLAST We knew TyrRS and TrpRS were similar
Functionally and structurally
BLASTP gave no indication
PSI-BLAST was able to detect their weak sequence similarity
Words of caution:
be sure to inspect and think about the results included in the PSSM build
include/exclude sequences on basis of biological knowledge: you are in the driving seat!
PSI-BLAST performance varies according to choice of matrix, filter, statistics etc just like BLASTP
15. Why (not) PSI-BLAST If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are all homologous, the sensitivity at a given specificity improves significantly
However, if non-homologous sequences are included in the PSSMs, they are “corrupted.” They then pull in more non-homologous sequences, and perform worse than generic substitution matrices
17. PSI-BLAST caveats Increased ability to find distant homologues
Cost of additional required care to prevent non-homologous sequences from being included in the PSSM calculation
When in doubt, leave it out!
Examine sequences with moderate similarity carefully.
Be particularly cautious about matches to sequences with highly biased amino acid content
Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology
Screen them out of your query sequences!
18. PSI-BLAST on the command line As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power
opens up additional options, e.g.
PSI-BLASTing over nucleotide databases
automating number of iterations
trying out lots of different settings in parallel
inputting multiple sequences
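As a rough sketch of automating iterations and trying several settings, the Python below only builds argument lists. It assumes the BLAST+ `psiblast` program is the command-line tool in use (the slides do not say which one) and uses its `-query`, `-db`, `-num_iterations`, `-inclusion_ethresh`, `-evalue` and `-out` options. The file names, the database name and the helper names `psiblast_command` and `parameter_sweep` are hypothetical.

```python
def psiblast_command(query_fasta, database, iterations=5,
                     inclusion_ethresh=0.005, evalue=10.0,
                     out_file="psiblast_results.txt"):
    """Build an argument list for a BLAST+ `psiblast` run (nothing is executed)."""
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),            # fixed number of rounds
        "-inclusion_ethresh", str(inclusion_ethresh),  # threshold for PSSM inclusion
        "-evalue", str(evalue),                        # reporting cutoff
        "-out", out_file,
    ]


def parameter_sweep(query_fasta, database, thresholds):
    """One command per inclusion threshold, each writing to its own output file."""
    return [
        psiblast_command(query_fasta, database, inclusion_ethresh=t,
                         out_file=f"psiblast_ethresh_{t}.txt")
        for t in thresholds
    ]


# Hypothetical query file and database name.
commands = parameter_sweep("tyrRS.fasta", "swissprot", [0.001, 0.005, 0.01])
```

Each list can be handed to `subprocess.run(cmd, check=True)`, and the runs are independent, so they can be launched in parallel.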
19. PHI-BLAST Pattern Hit Initiated – BLAST
PHI-BLAST principle:
Same method as PSI-BLAST
Starts first search with query sequence + pattern for a motif in the query
PHI-BLAST finds sequences containing the motif and having significant sequence similarity in the vicinity of the motif occurrence
Highly specific
20. Example: TyrRS TyrRS contains the aaRS class-I signature
Want to find sequences containing that motif, and regional similarity to TyrRS
First: get the Prosite pattern for the class-I signature
Prosite = db of protein families and domains
21. http://ca.expasy.org/prosite
22. P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
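To see what the pattern syntax means, it can be translated into a regular expression. The Python sketch below covers only the common PROSITE elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` and `>` anchors). The function name `prosite_to_regex` and the test sequence are invented for illustration; the sequence is a toy string, not a real TyrRS sequence.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+(?:,\d+)?)\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, repeat, end = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if repeat:
            core += "{" + repeat + "}"        # x(0,2) -> .{0,2}
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


CLASS_I_SIGNATURE = (
    "P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]"
)
signature_regex = prosite_to_regex(CLASS_I_SIGNATURE)

# Toy sequence constructed to contain the motif (not a real protein).
hit = re.search(signature_regex, "MKPGDALHLGHLAA")
```

PHI-BLAST goes further than a plain pattern match: it also requires significant similarity to the query around each motif occurrence.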
24. PHI-BLAST Results After first search, PHI-BLAST functions same as PSI-BLAST
Result page is the same
Can iterate in same way
Try it later if you like…
25. The Key to PHI- and PSI-BLAST Generating the multiple alignments to create PSSMs
Refines scoring in searches
Annotated collections of multiple alignments defining domains exist
Conserved domain database (CDD)
Contains 18039 alignments (10013 last year)
Can search the CDD using CD search
Uses RPS-BLAST
26. RPS-BLAST Reverse Position Specific – BLAST
Opposite of PSI-BLAST
CDD multiple alignments converted to PSSMs
PSSMs are processed and turned into a searchable database
Queries are searched against PSSMs using RPS-BLAST
Output indicates conserved domains within the query sequence
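The "reverse" direction can be illustrated with a toy scan of one query against a small library of PSSMs. This is a simplified sketch, not the RPS-BLAST algorithm: it slides each matrix along the query without gaps and uses a raw score threshold instead of E-values. The domain names, scores, `default` penalty and function names are all invented.

```python
def best_window(pssm, query, default=-1.0):
    """Slide an ungapped PSSM along the query; return (best score, start index)."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues missing from a column get the default (penalty) score.
        total = sum(col.get(aa, default) for col, aa in zip(pssm, window))
        best = max(best, (total, start))
    return best


def scan_domains(query, pssm_library, threshold):
    """Report (domain name, start, score) for every PSSM scoring above threshold."""
    hits = []
    for name, pssm in pssm_library.items():
        total, start = best_window(pssm, query)
        if total >= threshold:
            hits.append((name, start, total))
    return sorted(hits, key=lambda h: -h[2])


# Toy library: each PSSM is a list of {residue: score} columns (made-up values).
toy_library = {
    "toy_domain_A": [{"G": 2, "A": 1}, {"H": 2}, {"L": 2, "I": 1}],
    "toy_domain_B": [{"W": 2}, {"W": 2}, {"C": 2}],
}
domain_hits = scan_domains("MKGHLAA", toy_library, threshold=4)
```

The query is the sequence and the database is the set of domain models, so the output is a list of domains found in the query.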
27. Example: CRADD protein
29. Profile Hidden Markov Models statistical models of multiple sequence alignments
capture position-specific information about
how conserved each column of the alignment is
which residues are likely
use position-specific scores for amino acids (or nucleotides)
position specific penalties for opening and extending an insertion or deletion.
30. Advantages of using HMMs HMMs have a formal probabilistic basis
use probability theory to guide how all the scoring parameters should be set
can do things that more heuristic methods cannot do easily
For example, a profile HMM can be trained from unaligned sequences, if a trusted alignment isn’t yet known
HMMs have a consistent theory behind gap and insertion scores
31. Advantages of using HMMs In most details, profile HMMs are a slight improvement over a carefully constructed profile
but less skill and manual intervention are necessary to use profile HMMs
HMMs can produce true global alignments, unlike BLAST
32. Limitations of HMMs do not capture any higher-order correlations
assume that the identity of a particular position is independent of the identity of all other positions
make poor models of RNAs because an HMM cannot describe base pairs.
cf protein “threading” methods
which usually include scoring terms for nearby amino acids in a three-dimensional protein structure.
slower than and less user-friendly than PSI-BLAST
33. Applications of profile HMMs Database searching for weak homologies
Alternative to PSI-BLAST
Automated annotation of the domain structure of proteins
34. Applications of profile HMMs Useful for organizing sequences into evolutionarily related families
Databases like Pfam constructed by distinguishing between
a stable curated “seed” alignment of a small number of representative sequences
“full” alignments of all detectable homologs
HMMER used to
make a model of the seed
search the database for homologs
automatically produce the full alignment by aligning every sequence to the seed consensus
35. Constructing a profile HMM multiple sequence alignment is made of known members of a given protein family
quality of alignment, number and diversity of the sequences crucial for success
profile HMM of family built from the alignment
model-building program uses the alignment together with its prior knowledge of the general nature of proteins
model-scoring program used to assign a score with respect to the model to any sequence of interest
the better the score, the higher the chance that the query sequence is homologous to the protein family in the model.
each sequence in a database scored to find the members of the family present in the database.
36. Profile HMM programs HMMER developed by Sean Eddy
freely available under GNU General Public License
includes model-building and model-scoring programs relevant to homology detection
contains a program that calibrates a model by
scoring it against a set of random sequences
fitting an extreme value distribution to the resultant raw scores
parameters of this distribution then used to calculate accurate E-values for sequences of interest.
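The calibration idea can be sketched as follows. This is an illustration under stated assumptions, not HMMER's code: the fit uses the method of moments rather than whatever fitting procedure the calibration program actually applies, and the "random sequence scores" are simulated draws from a Gumbel distribution with made-up parameters. The function names are invented.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel (extreme value) distribution."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))  # scale parameter lambda
    mu = mean - EULER_GAMMA / lam          # location parameter mu
    return mu, lam


def e_value(score, mu, lam, database_size):
    """Expected number of random hits scoring at least `score`."""
    # P(S >= x) = 1 - exp(-exp(-lambda * (x - mu))), computed stably.
    p = -math.expm1(-math.exp(-lam * (score - mu)))
    return database_size * p


# Stand-in for scores of the model against random sequences:
# simulated Gumbel draws with mu = -20 and lambda = 0.7 (invented values).
rng = random.Random(0)
random_scores = [-20.0 - math.log(rng.expovariate(1.0)) / 0.7 for _ in range(5000)]
mu_hat, lam_hat = fit_gumbel(random_scores)
```

With the fitted parameters, `e_value(raw_score, mu_hat, lam_hat, database_size)` turns a raw score into an E-value that scales with the number of sequences searched.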
37. Programs in the HMMER 2 package hmmalign
Align sequences to existing model
hmmbuild
Build a model from multiple sequence alignment.
hmmcalibrate
Takes an HMM and empirically determines parameters used to make searches more sensitive by calculating more accurate E-values
hmmconvert
Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.
hmmemit
Emit sequences probabilistically from a profile HMM.
hmmfetch
Get a single model from an HMM database.
hmmindex
Index an HMM database.
hmmpfam
Search an HMM database for matches to a query sequence.
hmmsearch
Search a sequence database for matches to an HMM.
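A typical build, calibrate, search sequence with these programs might be scripted as below. The sketch assumes the HMMER 2 argument order (`hmmbuild hmmfile alignfile`, `hmmcalibrate hmmfile`, `hmmsearch hmmfile seqfile`); all file names and the helper names `hmmer2_workflow` and `run_all` are hypothetical, and nothing is executed unless `run_all` is called.

```python
import subprocess


def hmmer2_workflow(alignment_file, model_file, sequence_db):
    """Return the three HMMER 2 commands for build -> calibrate -> search."""
    return [
        ["hmmbuild", model_file, alignment_file],  # model from the alignment
        ["hmmcalibrate", model_file],              # fit E-value parameters
        ["hmmsearch", model_file, sequence_db],    # score every database sequence
    ]


def run_all(commands):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Hypothetical file names.
steps = hmmer2_workflow("family.msf", "family.hmm", "proteins.fasta")
```

The order matters: calibration writes its parameters into the model file, so it has to come after `hmmbuild` and before `hmmsearch`.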
38. Profile HMM programs SAM Developed by the bioinformatics group at the University of California, Santa Cruz
not open source, but free for academic use
does not include a model-calibration program
model-scoring program calculates E-values directly using a theoretical function that takes as its argument the difference between raw scores of the query sequence and its reverse
important component is target99 script, which generates a multiple sequence alignment suitable for model building
39. Clash of the Titans: PSI-BLAST v. HMMer v. SAM! Nucleic Acids Research, 2002, Vol. 30 No. 19 4321
SAM consistently produces better models than HMMER
relative performance of the model-scoring components varies
HMMER is 1-3× faster than SAM with large databases
SAM faster with small ones
both methods have effective low complexity and repeat sequence masking
accuracy of their E-values was comparable.
SAM T99 iterative database search procedure outperforms PSI-BLAST
BUT scoring of PSI-BLAST profiles is more than 30× faster than scoring of SAM models.
41. Summary PSI-BLAST
Input: SEQUENCE
Database: SEQUENCES
Algorithm: Constructs a PSSM from an initial pass and uses this in the next pass
Output: Distantly related sequences
+ sensitive, -specific
PHI-BLAST
Input: PROFILE + SEQUENCE
Database: SEQUENCES
Algorithm: Same as PSI-BLAST except start with a profile
Output: Sequences containing the domain and that are similar in the domain region
+sensitive, -> -specific
RPS-BLAST
Input: SEQUENCE
Database: DOMAINS
Output: Domains found in the sequence
+sensitive, +specific
HMMs
More sensitive
But less user-friendly than PSI-BLAST and slower