Sequence Comparison – Identification of remote homologues Amir Harel Moran Yassour
Overview • Homologous proteins • Protein sequence comparison • BLAST and its improvements • PSI-BLAST
Homologous Proteins • Proteins that share a common ancestor are called homologous. • Common three-dimensional folding structure
Homologous Proteins • Homology refers to a similarity that spans an entire folding domain. • The difficulty in defining homology
Why is homology important? • Prediction of a protein’s properties • Classification of proteins into families • Evolutionary tree
How to identify homology? • Using sequence similarities • Aligning two proteins • Giving a score to the alignment
Global & Local Alignments • Global alignment – alignment of the entire sequence • Local alignment – alignment of a segment of the sequence
How to score an alignment • Substitution Matrix – Sij = a value proportional to the probability that amino acid i mutated into amino acid j
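Substitution scores in matrices such as PAM and BLOSUM are usually stored as rounded log-odds values rather than raw probabilities. The sketch below illustrates that idea only; the function name, the `scale` factor and the example frequencies are assumptions, not values taken from any published matrix.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score for aligning residues i and j.

    q_ij : observed frequency of the pair (i, j) in trusted alignments
    p_i, p_j : background frequencies of the two residues
    scale : assumed scaling factor (2.0 gives half-bit units)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# Illustrative (made-up) frequencies:
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen more often than chance: positive
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen less often than chance: negative
```

A positive score means the pair occurs more often in related sequences than chance would predict; a negative score means it occurs less often.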
Types of Substitution Matrices • PAM – derived from comparisons of closely related sequences • BLOSUM – derived from multiple alignments of more distantly related sequences
Substitution Matrices • Different matrices reflect different evolutionary distances: • 1 PAM represents the evolutionary distance of 1 amino acid substitution per 100 amino acids. • BLOSUM X: sequences more than X% identical were clustered and counted as a single sequence
Gap costs • The most widely used gap score is -(a + bk) for a gap of length k. • Long gaps do not cost much more than short ones, since a single mutation may cause a large gap.
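A small sketch of the gap score -(a + bk) shows why one long gap is cheaper than many short ones. The default values of `a` and `b` are illustrative assumptions, not parameters given in the slides.

```python
def gap_score(k, a=11, b=1):
    """Score of a single gap of length k under the -(a + b*k) model.

    a : gap-opening cost (assumed value)
    b : per-residue gap-extension cost (assumed value)
    """
    if k <= 0:
        return 0
    return -(a + b * k)

one_long_gap = gap_score(10)          # a single gap of length 10
ten_short_gaps = 10 * gap_score(1)    # ten separate gaps of length 1
print(one_long_gap, ten_short_gaps)
```

With these values, the single gap of length 10 scores -21 while ten separate one-residue gaps score -120.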
Basic Sequence Comparison • Smith & Waterman (1981) – dynamic programming for sequence comparison • O(mn), where n and m are the lengths of the two sequences being compared
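The O(mn) cost comes from filling one matrix cell per pair of positions. The following is a minimal sketch of Smith-Waterman local alignment scoring under simplifying assumptions: a match/mismatch score instead of a substitution matrix, and a linear gap penalty instead of -(a + bk). All parameter values are illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # (m+1) x (n+1) score matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                     # start a new local alignment here
                H[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,     # gap in b
                H[i][j - 1] + gap,     # gap in a
            )
            best = max(best, H[i][j])
    return best

print(smith_waterman("TTACGTTT", "GGACGTCC"))  # shared segment "ACGT"
```

The two nested loops visit every cell once, which is exactly the O(mn) time (and, as written, O(mn) space) that becomes a problem for large databases.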
Complexity issue • When DBs become larger, m grows • Time complexity • Space complexity
Intuition to Solution • Go over less than the whole matrix • Put the spotlight on segments that can be a part of the best path and extend them. • The best path is close to a diagonal • Less than O(mn)
Heuristic procedures • Heuristic: An algorithm that usually, but not always, works, or that gives nearly the right answer. • There is no guarantee of finding the best match.
BLAST – Basic Local Alignment Search Tool • BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. O(n) • Each hit is extended in both directions as long as the score has not dropped too far below the best score found so far.
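The two steps above can be sketched as follows. This is a toy illustration, not BLAST's implementation: it compares every word pair by brute force (BLAST uses a precomputed word list), scores with a simple match/mismatch rule rather than a substitution matrix, and all names and parameter values (`w`, `T`, `x_drop`) are assumptions.

```python
def word_score(u, v, match=5, mismatch=-1):
    """Ungapped score of two equal-length words."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))

def find_hits(query, subject, w=3, T=9):
    """All (i, j) where query[i:i+w] vs subject[j:j+w] scores at least T."""
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in range(len(subject) - w + 1)
            if word_score(query[i:i + w], subject[j:j + w]) >= T]

def extend(query, subject, qi, sj, step, x_drop, match=5, mismatch=-1):
    """Best score gained by extending from (qi, sj) in direction step (+1/-1)."""
    score = best = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        best = max(best, score)
        if best - score > x_drop:  # score dropped too far: stop extending
            break
        qi += step
        sj += step
    return best

def extend_hit(query, subject, i, j, w=3, x_drop=4):
    """Score of the ungapped segment obtained by extending a hit both ways."""
    seed = word_score(query[i:i + w], subject[j:j + w])
    left = extend(query, subject, i - 1, j - 1, -1, x_drop)
    right = extend(query, subject, i + w, j + w, +1, x_drop)
    return seed + left + right

hits = find_hits("KVLA", "AKVLA", w=3, T=15)
print(hits, [extend_hit("KVLA", "AKVLA", i, j) for i, j in hits])
```

Lowering `T` in `find_hits` can only add hits, never remove them, which is the sensitivity versus running time trade-off controlled by T.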
BLAST • [Figure: matrix of word hits, marked x, between the query and a database sequence]
A word about the parameter T • Small T: greater sensitivity, more hits to extend • Large T: lower sensitivity, fewer hits to extend
Gapped BLAST • The original BLAST was ungapped • Soon after came gapped BLAST
BLAST – Results • P value – The probability of an alignment with score S or better occurring by chance. • E value – Expectation value. The number of different alignments with scores S or better that are expected to occur in this DB search by chance. • Lower E value → more significant score.
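For reference, the E value and P value are related through the Karlin-Altschul statistics that BLAST uses: E = K·m·n·e^(-λS) and P = 1 - e^(-E). The sketch below only shows the shape of these formulas; the default `lam` and `K` are placeholder assumptions, since the real values depend on the scoring matrix and gap costs.

```python
import math

def e_value(score, query_len, db_len, lam=0.3, K=0.1):
    """Expected number of chance alignments with score >= `score`.

    lam, K : statistical parameters (placeholder values, assumed here)
    """
    return K * query_len * db_len * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment, given expectation e."""
    return -math.expm1(-e)  # 1 - exp(-e), computed accurately for small e

e = e_value(score=50, query_len=100, db_len=1000)
print(e, p_value(e))
```

Two properties follow directly: a higher score gives an exponentially smaller E value, and doubling the database size doubles the E value for the same score. For small E, the P value and E value are nearly equal.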
E-value and Homology • A non-significant score does not necessarily imply non-homology.
Use it wisely • Choose your Substitution Matrix • Choose your DB
Example 1 – remote homology • Frequently, identification of a remote homology will require several database searches. • The glutathione transferase family
Remote homology • Testing the possibility that elongation factors share homology with glutathione S-transferases: • There is a clear relationship between this elongation factor and the class-theta glutathione transferases.
Example 2 - mapping • Three different families of G-protein coupled receptors: • the R family (the largest) • the C/S family • the G receptor family
Conclusions • Searching with high-scoring sequences, related or unrelated, is a very important tool. • Homology is a transitive relation…
BLAST – Pros & Cons • Pros: • It works • Cons: • Statistical evaluation rather than a biological one. • Convergent evolution • Weak but biologically relevant similarities may be overlooked (PSI-BLAST addresses this issue)
BLAST improvements • Running time improvements: • Two-hit method • Seed extension • PSI-BLAST
The two-hit method • The extension step accounts for more than 90% of BLAST’s execution time • Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
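A minimal sketch of the two-hit trigger follows. It assumes, as in the published gapped BLAST description, that the two hits must lie on the same diagonal; the function name, the word length `w` and the `window` distance are illustrative assumptions.

```python
def two_hit_triggers(hits, w=3, window=40):
    """Return the hits (i, j) that should trigger an extension.

    A hit triggers only if an earlier, non-overlapping hit exists on the
    same diagonal (j - i) at most `window` positions before it.
    """
    last_on_diag = {}   # diagonal -> query position of the most recent hit
    triggers = []
    for i, j in sorted(hits):
        d = j - i
        prev = last_on_diag.get(d)
        if prev is not None:
            dist = i - prev
            if dist < w:
                continue            # overlaps the previous hit: ignore it
            if dist <= window:
                triggers.append((i, j))
        last_on_diag[d] = i
    return triggers

hits = [(0, 5), (1, 6), (10, 15), (100, 105), (3, 50)]
print(two_hit_triggers(hits))
```

In this example only (10, 15) triggers an extension: (1, 6) overlaps the first hit, (100, 105) is too far from its predecessor, and (3, 50) is alone on its diagonal. Skipping the other hits is what saves the extension time.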
The two-hit method • [Figure: matrix of word hits, marked x, with labels: first hit, second hit, two-hit extension]
PSI-BLAST • Evolutionary pressure • Needle in a haystack • PSI-BLAST aims to solve this problem
Evolution reveals itself • Giving more significance to the conserved areas and ignoring the background noise • PSI-BLAST = Position-Specific Iterated BLAST; it shifts our view to these areas using the Position-Specific Score Matrix (PSSM)
Position-Specific Score Matrix – PSSM • Pij = a value proportional to the probability of finding the ith amino acid in the jth position in these sequences
PSSM • Represents the distribution of the amino acids in each position in a collection of sequences
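As a rough illustration of how such a matrix can be built, the sketch below counts amino acids per column in a set of aligned sequences and converts the frequencies to log-odds scores. It makes several simplifying assumptions that PSI-BLAST itself does not: equal-length ungapped sequences, a uniform background distribution, a fixed pseudocount, and no sequence weighting. All names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)      # assumed uniform background
    pssm = []
    for pos in range(len(aligned[0])):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_with_pssm(pssm, segment):
    """Ungapped score of a segment against the profile."""
    return sum(col[aa] for col, aa in zip(pssm, segment))

pssm = build_pssm(["ACD", "ACE", "ACD", "GCD"])
print(score_with_pssm(pssm, "ACD"), score_with_pssm(pssm, "WWW"))
```

The fully conserved second column (all C) gives C a high score there and every other residue a negative one, which is how a PSSM puts more weight on conserved positions.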
Steps in PSI-BLAST • Initiation: • Running gapped BLAST on the query, outputting a collection of matching sequences • Iteration: • Constructing the PSSM based on the best sequences in this collection • The PSSM is compared to the protein DB, again seeking alignments
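The initiation and iteration steps can be sketched as a generic loop that stops when no new sequences are found. This is only a toy model of the control flow: the three search and profile functions are hypothetical stand-ins (equal-length sequences, no gaps, no E-values), not PSI-BLAST's actual procedures.

```python
def iterate_search(query, database, initial_search, build_profile,
                   profile_search, max_iterations=5):
    """Iterate profile construction and search until convergence."""
    included = set(initial_search(query, database))      # initiation
    for _ in range(max_iterations):                      # iteration
        profile = build_profile(query, sorted(included))
        found = set(profile_search(profile, database))
        if found <= included:    # nothing new was found: converged
            break
        included |= found
    return included

# Toy stand-ins (hypothetical), used only to exercise the loop.
def toy_initial_search(query, database, min_identical=3):
    return [s for s in database
            if sum(a == b for a, b in zip(query, s)) >= min_identical]

def toy_build_profile(query, members):
    # One set of observed letters per position.
    return [set(column) for column in zip(query, *members)]

def toy_profile_search(profile, database, min_matching=3):
    return [s for s in database
            if sum(c in allowed for c, allowed in zip(s, profile)) >= min_matching]

database = ["AAAB", "AABB", "ABBB", "CCCC"]
result = iterate_search("AAAA", database, toy_initial_search,
                        toy_build_profile, toy_profile_search)
print(sorted(result))
```

In this toy run the first pass finds only AAAB, yet the iterations go on to pick up AABB and then ABBB, which shares a single position with the query. The same mechanism also explains why one wrongly included sequence can pull the profile in its direction.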
PSI-BLAST Example • We start with an uncharacterized protein – MJ0414 • When submitting the query we set the E-value threshold to 0.01 (higher than usual)
First iteration • Iterating the search using the derived profile uncovers DNA ligase II with an E-value of 0.005
Interpretation of the results • Including a high-scoring but unrelated protein will shift the PSSM in its direction • E-values retrieved in later iterations should not be taken as automatic proof of homology
PSI-BLAST Conclusions • Uncovers protein relationships missed by single-pass database-search methods • Errors are easily amplified by iterations. • PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret
Running time evaluation • Running time can be highly influenced by modifying parameters
Future Improvements • Accepting PSSM as input from other programs • Realignment – improve the alignment before going over the DB • Automatic domain recognition