
Sequence Comparison – Identification of remote homologues



Presentation Transcript


  1. Sequence Comparison – Identification of remote homologues Amir Harel Moran Yassour

  2. Overview • Homologous proteins • Protein sequence comparison • BLAST and its improvements • PSI-BLAST

  3. Homologous Proteins • Proteins that share a common ancestor are called homologous. • Common three-dimensional folding structure

  4. Homologous Proteins • Homology refers to a similarity that spans an entire folding domain. • The difficulty in defining homology

  5. Why is homology important? • Prediction of a protein’s properties • Classification of proteins into families • Evolutionary trees

  6. How to identify homology? • Using sequence similarities • Aligning two proteins • Giving a score to the alignment

  7. Global & Local Alignments • Global alignment – alignment of the entire sequence • Local alignment – alignment of a segment of the sequence

  8. How to score an alignment • Substitution Matrix – Sij = a value proportional to the probability that amino acid i mutated into amino acid j
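As a hedged illustration of how such a score can be computed: widely used matrices such as PAM and BLOSUM store log-odds scores, i.e. the logarithm of how often residues i and j are observed aligned, relative to how often they would pair by chance. The function name, the frequencies, and the half-bit scaling below are assumptions chosen for the example, not values from the slides.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds substitution score.

    q_ij  : observed frequency of i aligned with j (illustrative input)
    p_i   : background frequency of residue i
    p_j   : background frequency of residue j
    scale : 2.0 gives half-bit units (an assumption for this sketch)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# A pair seen 4x more often than chance gets a positive score,
# a pair seen 4x less often than chance gets a negative one.
print(log_odds_score(0.04, 0.1, 0.1))
print(log_odds_score(0.0025, 0.1, 0.1))
```

Positive entries mark substitutions that are more common among related proteins than chance would predict; negative entries mark unlikely ones.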

  9. Types of Substitution Matrices • PAM – derived from comparisons of closely related sequences • BLOSUM – derived from multiple alignments of distantly related sequences

  10. Substitution Matrices • Different matrices reflect different evolutionary distances: • 1 PAM represents the evolutionary distance of 1 amino acid substitution per 100 amino acids. • BLOSUM X: sequences more than X% identical were clustered and counted as a single sequence

  11. Gap costs • The most widely used gap score is -(a + bk) for a gap of length k. • Long gaps do not cost much more than short ones, since a single mutation may cause a large gap.
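The gap score above can be sketched in a few lines. The default values for `a` and `b` are illustrative assumptions, not values given in the slides.

```python
def gap_score(k, a=11, b=1):
    """Affine gap score -(a + b*k) for a gap of length k.

    a : cost of opening a gap (illustrative default)
    b : cost per gapped position (illustrative default)
    """
    if k <= 0:
        return 0
    return -(a + b * k)

# One gap of length 10 is penalised less than two gaps of length 5.
print(gap_score(10), 2 * gap_score(5))
```

Because the opening cost `a` is paid once per gap, a single long gap is cheaper than several short ones of the same total length, which matches the intuition that one mutation event can insert or delete a long stretch.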

  12. Basic Sequence Comparison • Smith & Waterman (1981) – dynamic programming for sequence comparison • O(mn), where m and n are the dimensions of the comparison matrix
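A minimal sketch of the Smith-Waterman recurrence follows. To keep it short it assumes a simple match/mismatch score and a linear gap penalty rather than a substitution matrix with affine gaps; the parameter values are illustrative only.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Fills an (m+1) x (n+1) table, so time and space are both O(mn).
    Scores are toy values; a real search would use a substitution matrix.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best

print(smith_waterman("XXACDEYY", "ZZACDEWW"))
```

The `0` option in the recurrence is what makes the alignment local: a path may start anywhere in the matrix, and the answer is the largest cell rather than the bottom-right one.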

  13. Complexity issue • When DBs become larger, m grows • Time complexity • Space complexity

  14. Intuition to Solution • Go over less than the whole m × n matrix • Put the spotlight on segments that can be a part of the best path and extend them. • The best path is close to a diagonal • Less than O(mn)

  15. Heuristic procedures • Heuristic: an algorithm that usually, but not always, works, or that gives nearly the right answer. • There is no guarantee of finding the best match.

  16. BLAST – Basic Local Alignment Search Tool • BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. The scan takes O(n) time. • Each hit is extended in both directions as long as the score has not dropped too much.
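The two steps can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it compares words by brute force instead of using a precomputed word lookup (which is what makes the real scan linear), it uses a toy match/mismatch score instead of a substitution matrix, and the names `find_hits`, `extend_hit`, and `x_drop` are assumptions made for the example.

```python
def score_pair(x, y, match=5, mismatch=-4):
    """Toy residue score; a real search would use a substitution matrix."""
    return match if x == y else mismatch

def word_score(query, subject, i, j, w):
    """Score of the length-w words starting at query[i] and subject[j]."""
    return sum(score_pair(query[i + k], subject[j + k]) for k in range(w))

def find_hits(query, subject, w=3, T=10):
    """All (i, j) word pairs scoring at least T (brute force for clarity)."""
    hits = []
    for i in range(len(query) - w + 1):
        for j in range(len(subject) - w + 1):
            if word_score(query, subject, i, j, w) >= T:
                hits.append((i, j))
    return hits

def extend_hit(query, subject, i, j, w=3, x_drop=10):
    """Ungapped extension in both directions along the hit's diagonal.

    Extension in a direction stops once the running score has fallen
    more than x_drop below the best score seen so far.
    """
    best = word_score(query, subject, i, j, w)
    for step in (1, -1):
        qi, sj = (i + w, j + w) if step == 1 else (i - 1, j - 1)
        run = best
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            run += score_pair(query[qi], subject[sj])
            if run > best:
                best = run
            elif best - run > x_drop:
                break
            qi += step
            sj += step
    return best

query, subject = "MKTAYIAK", "GGMKTAYIAKGG"
hits = find_hits(query, subject, T=15)
print(hits[0], extend_hit(query, subject, *hits[0]))
```

Lowering `T` in `find_hits` admits words that contain a mismatch, so more hits are produced and more extensions must be attempted.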

  17. BLAST [Figure: comparison matrix in which hits are marked with x]

  18. A word about the parameter T • Small T: greater sensitivity, more hits to expand • Large T: lower sensitivity, fewer hits to expand

  19. Gapped BLAST • The original BLAST was ungapped • Soon after came gapped BLAST

  20. BLAST - Results • P value – the probability of an alignment with score S or better occurring by chance. • E value – expectation value: the number of different alignments with scores S or better that are expected to occur in this DB search by chance. • Lower E value –> more significant score.
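For reference, the standard Karlin-Altschul form of these statistics is E = K·m·n·e^(−λS) and P = 1 − e^(−E), where m and n are the query and database lengths. The sketch below evaluates them; the default values of `K` and `lam` are placeholders chosen for illustration, since the real constants depend on the scoring system in use.

```python
import math

def e_value(S, m, n, K=0.1, lam=0.3):
    """Expected number of chance alignments scoring >= S.

    m, n   : query length and total database length
    K, lam : statistical parameters of the scoring system
             (placeholder values, assumed for this sketch)
    """
    return K * m * n * math.exp(-lam * S)

def p_value(E):
    """Probability of at least one chance alignment, given expectation E."""
    return -math.expm1(-E)  # equals 1 - exp(-E), accurate for small E

E = e_value(S=80, m=250, n=1_000_000)
print(E, p_value(E))
```

Two properties are visible in the formula: E falls exponentially as the score rises, and it grows linearly with database size, so the same score is less significant in a larger database. For small E, the P value and the E value are nearly equal.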

  21. E-value and Homology • A non-significant score does not necessarily imply non-homology:

  22. E-value and Homology

  23. Use it wisely • Choose your Substitution Matrix • Choose your DB

  24. Example 1 – remote homology • Frequently, identification of a remote homology will require several database searches. • The glutathione transferase family

  25. Remote homology

  26. Remote homology • Testing the possibility that elongation factors share homology with glutathione S-transferases: • There is a clear relationship between this elongation factor and the class-theta glutathione transferases.

  27. Example 2 - mapping • Three different families of G-protein coupled receptors: • the R family (the largest) • the C/S family • the G receptor family

  28. Finding links between families

  29. Finding links between families

  30. Building a protein tree

  31. Conclusions • Searching with high-scoring sequences, whether related or unrelated, is a very important tool. • Homology is a transitive relation…

  32. BLAST – Pros & Cons • Pros: • It works • Cons: • Statistical rather than biological evaluation • Convergent evolution • Weak but biologically relevant similarities may be overlooked (PSI-BLAST improves on this)

  33. BLAST improvements • Running time improvements: • Two-hit method • Seed extension • PSI-BLAST

  34. The two-hit method • The extension step accounts for more than 90% of BLAST’s execution time • Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
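The trigger rule can be sketched as below. It assumes, as in the published gapped BLAST description, that the two hits must lie on the same diagonal; the function name, the window `A`, and the handling of overlapping hits are choices made for this illustration.

```python
def two_hit_triggers(hits, w=3, A=40):
    """Return (first_i, second_i, diagonal) for each pair that triggers extension.

    hits : iterable of (i, j) word-hit positions (query index, subject index)
    w    : word length; hits closer than w on a diagonal overlap
    A    : maximum distance between the two hits (illustrative default)
    """
    last_on_diag = {}  # diagonal -> query position of the most recent hit
    triggers = []
    for i, j in sorted(hits):
        d = j - i
        prev = last_on_diag.get(d)
        if prev is not None:
            distance = i - prev
            if distance < w:
                continue  # overlaps the earlier hit: ignore it
            if distance <= A:
                triggers.append((prev, i, d))
        last_on_diag[d] = i
    return triggers

hits = [(0, 2), (1, 3), (5, 7), (100, 102), (10, 50)]
print(two_hit_triggers(hits))
```

Only one dictionary lookup per hit is needed, which is far cheaper than an extension. Because requiring two hits is stricter than requiring one, the word threshold T can be lowered to keep sensitivity while still running fewer extensions.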

  35. The two-hit method [Figure: hit matrix annotated with the first hit, the second hit, and the two-hit extension]

  36. Seed Extension

  37. PSI-BLAST • Evolutionary pressure • Needle in a haystack • PSI-BLAST comes to solve this problem

  38. Evolution reveals itself • Giving more significance to the conserved areas and ignoring the background noise • PSI-BLAST = Position-Specific Iterated BLAST; it shifts our view to these areas using the Position-Specific Score Matrix (PSSM)

  39. Position-Specific Score Matrix - PSSM • Pij = a value proportional to the probability of finding the ith amino acid in the jth position in these sequences

  40. PSSM • Represents the distribution of the amino acids in each position in a collection of sequences
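A minimal sketch of building such a matrix from a set of aligned, equal-length sequences is shown below. It is deliberately simplified: it uses a uniform background, a flat pseudocount, and no sequence weighting, all of which are assumptions for illustration and differ from what PSI-BLAST actually does. The matrix is stored as one dictionary per position, keyed by amino acid.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds PSSM (in bits) from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # uniform background (simplification)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        pssm.append(column)
    return pssm

def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the PSSM, position by position."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

pssm = build_pssm(["ACD", "ACE", "ACD", "GCD"])
print(score_with_pssm(pssm, "ACD") > score_with_pssm(pssm, "WWW"))
```

Unlike a general substitution matrix, the score for a residue depends on where it falls: the fully conserved second column rewards `C` strongly and penalises everything else, while a variable column discriminates much less.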

  41. Steps in PSI-BLAST • Initiation: • Running gapped BLAST on the query, outputting a collection of matching sequences • Iteration: • Constructing the PSSM based on the best sequences in this collection • The PSSM is compared to the protein DB, again, seeking alignments
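The control flow of these steps can be sketched as a loop. Everything here is hypothetical scaffolding: the three callables stand in for gapped BLAST, PSSM construction, and the profile search, and accumulating hits across rounds and stopping when no new sequence appears are simplifying assumptions of this sketch.

```python
def psi_blast_sketch(query, database, search_with_query,
                     build_profile, search_with_profile, max_iterations=5):
    """Iterate profile searches until no new sequences are found.

    search_with_query(query, database)     -> iterable of significant hits
    build_profile(query, included)         -> profile (for example a PSSM)
    search_with_profile(profile, database) -> iterable of significant hits
    All three callables are placeholders supplied by the caller.
    """
    # Initiation: an ordinary search with the query alone.
    included = set(search_with_query(query, database))
    for _ in range(max_iterations):
        # Iteration: rebuild the profile, then search the database with it.
        profile = build_profile(query, included)
        hits = set(search_with_profile(profile, database))
        if hits <= included:
            break  # converged: the profile found nothing new
        included |= hits
    return included
```

Each round can only pull in sequences that resemble something already included, so a distant relative is reached through a chain of intermediates. The same mechanism means that one wrongly included sequence can drag in its own relatives in later rounds.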

  42. PSI-BLAST Example • We start with an uncharacterized protein – MJ0414 • When submitting the query we set the E-value threshold to 0.01 (higher than usual)

  43. Result of initial gapped BLAST

  44. First iteration • Iterating the search using the derived profile uncovers DNA ligase II with an E-value of 0.005

  45. Second iteration

  46. Interpretation of the results • Including a high-scoring but unrelated protein will shift the PSSM in its direction • E-values retrieved in later iterations should not be taken as automatic proof of homology

  47. Was the ligase the right choice?

  48. PSI-BLAST Conclusions • Uncovers protein relationships missed by single-pass database-search methods • Errors are easily amplified by iterations. • PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret

  49. Running time evaluation • Running time can be highly influenced by modifying parameters

  50. Future Improvements • Accepting PSSM as input from other programs • Realignment – improve the alignment before going over the DB • Automatic domain recognition
