Sequence Comparison

Sequence Comparison – Identification of remote homologues Amir Harel Moran Yassour

Overview • Homologous proteins • Protein sequence comparison • BLAST and its improvements • PSI-BLAST

Homologous Proteins • Proteins that share a common ancestor are called homologous. • Common three-dimensional folding structure

Homologous Proteins • Homology refers to a similarity that spans an entire folding domain. • The difficulty in defining homology

Why is homology important? • Prediction of a protein’s properties • Classification of proteins into families • Evolutionary tree

How to identify homology? • Using sequence similarities • Aligning two proteins • Giving a score to the alignment

Global & Local Alignments • Global alignment – alignment of the entire sequence • Local alignment – alignment of a segment of the sequence

How to score an alignment • Substitution Matrix – Sij = a value proportional to the probability that amino acid i mutated into amino acid j
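
As a rough illustration of where such a score can come from, the sketch below uses the log-odds form commonly used for PAM and BLOSUM style matrices: the observed frequency of the aligned pair i, j divided by the frequency expected by chance. The function name, the scale factor, and all frequencies are assumptions chosen for illustration; none of them come from the slides.

```python
import math

def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score in units of 1/scale bits.

    q_ij: observed frequency of residues i and j aligned together
    p_i, p_j: background frequencies of the two residues
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))

# Hypothetical frequencies, for illustration only.
# A pair seen more often than chance gets a positive score.
conserved = log_odds_score(q_ij=0.02, p_i=0.05, p_j=0.05)
# A pair seen less often than chance gets a negative score.
rare = log_odds_score(q_ij=0.001, p_i=0.05, p_j=0.08)

print(conserved, rare)
```

A positive entry therefore marks a substitution that is tolerated more often than random pairing would predict, and a negative entry marks one that is selected against.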

Types of Substitution Matrices • PAM – comparison of closely related sequences • BLOSUM – multiple alignments of distantly related sequences

Substitution Matrices • Different matrices reflect different evolutionary distances: • 1 PAM represents the evolutionary distance of 1 amino acid substitution per 100 amino acids. • BLOSUM X: sequences more than X% identical were clustered and counted as a single sequence

Gap costs • The most widely used gap score is -(a + bk) for a gap of length k. • Long gaps do not cost much more than short ones, since a single mutation may cause a large gap.
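
The gap formula above can be sketched in a few lines. The values a=11 and b=1 and the match/mismatch scores are illustrative assumptions, as are the function names; a real search would take substitution scores from a PAM or BLOSUM matrix.

```python
def gap_score(k, a=11, b=1):
    """Score of one gap of length k under the affine model -(a + b*k)."""
    if k <= 0:
        return 0
    return -(a + b * k)

def toy_subst(x, y):
    # Toy substitution score (assumption), standing in for a real matrix.
    return 5 if x == y else -4

def score_alignment(top, bottom, subst=toy_subst, a=11, b=1):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    assert len(top) == len(bottom)
    total, run, gap_row = 0, 0, None
    for x, y in zip(top, bottom):
        if x == '-' or y == '-':
            row = 0 if x == '-' else 1
            if run and row != gap_row:
                # A gap switching rows starts a new gap: close the old one.
                total += gap_score(run, a, b)
                run = 0
            gap_row = row
            run += 1
        else:
            total += gap_score(run, a, b)  # close any open gap (0 if none)
            run = 0
            total += subst(x, y)
    return total + gap_score(run, a, b)

one_long_gap = score_alignment("HEAGAWGHEE", "HEA---GHEE")
print(gap_score(1), gap_score(10), one_long_gap)
```

With these values a gap of length 10 costs 21 while a gap of length 1 already costs 12, which is the sense in which long gaps are not much more expensive than short ones.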

Basic Sequence Comparison • Smith & Waterman (1981) – dynamic programming for sequence comparison • O(mn) for sequences of lengths n and m
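
A minimal sketch of the Smith & Waterman recurrence follows. It returns only the best local score, uses a linear gap penalty for brevity instead of the affine one, and the match, mismatch, and gap values are assumptions made for the example.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between s and t.

    Fills the full (len(s)+1) x (len(t)+1) matrix row by row, so the
    time is O(mn); only two rows are kept in memory at any moment.
    """
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman("XXACGTXX", "YYACGTYY"))
```

Recovering the alignment itself, and not just its score, would require keeping the whole matrix or traceback pointers.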

Complexity issue • When DBs become larger, m grows • Time complexity • Space complexity

Intuition to Solution • Go over less than the whole matrix • Put the spotlight on segments that can be a part of the best path and extend them. • The best path is close to a diagonal • Less than O(mn)

Heuristic procedures • Heuristic: an algorithm that usually, but not always, works, or that gives nearly the right answer. • There is no guarantee of finding the best match.

BLAST – Basic Local Alignment Search Tool • BLAST first scans the DB for words that score at least T when aligned with some word within the query sequence; these are called hits. The scan is O(n). • Each hit is extended in both directions as long as the score hasn’t dropped too much.
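
The two steps above, finding hits and then extending them, can be sketched as follows. This is a toy version under stated assumptions: the scoring function, the word length w=3, the threshold T=7, and the `drop` cutoff are all illustrative, and the function names are hypothetical.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def pair_score(x, y):
    # Toy substitution score (assumption); real searches use PAM/BLOSUM.
    return 4 if x == y else -1

def build_word_table(query, w=3, T=7):
    """Map every word scoring >= T against some query word to its query offsets."""
    table = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in product(ALPHABET, repeat=w):
            if sum(pair_score(a, b) for a, b in zip(qword, cand)) >= T:
                table.setdefault("".join(cand), []).append(i)
    return table

def find_hits(query, subject, w=3, T=7):
    """Scan the subject once; each lookup yields (query_pos, subject_pos) hits."""
    table = build_word_table(query, w, T)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query, subject, i, j, w=3, drop=5):
    """Ungapped extension in both directions; each direction stops once
    the running score falls `drop` or more below the best score seen."""
    run = sum(pair_score(query[i + k], subject[j + k]) for k in range(w))
    best = run
    qi, sj = i + w, j + w                      # extend to the right
    while qi < len(query) and sj < len(subject) and best - run < drop:
        run += pair_score(query[qi], subject[sj])
        best = max(best, run)
        qi += 1
        sj += 1
    run = best                                 # extend to the left
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0 and best - run < drop:
        run += pair_score(query[qi], subject[sj])
        best = max(best, run)
        qi -= 1
        sj -= 1
    return best

query = "MKTAYIAKQR"
subject = "GGMKTAYIAKQRGG"
hits = find_hits(query, subject)
print(extend_hit(query, subject, 3, 5))
```

The lookup table is built once per query, so the database scan itself is a single pass with one dictionary lookup per position.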

BLAST • [Figure: dot matrix of the query against a database sequence; word hits are marked with x]

A word about the parameter T • Small T: greater sensitivity, more hits to expand • Large T: lower sensitivity, fewer hits to expand

Gapped BLAST • The original BLAST was ungapped • Soon after came gapped BLAST

BLAST - Results • P value – the probability that an alignment with score S or better occurs by chance. • E value – expectation value: the number of different alignments with score S or better that are expected to occur by chance in this DB search. • A lower E value means a more significant score.
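
The slides do not give a formula for these values. A common way to compute them is the Karlin-Altschul form E = K·m·n·e^(-λS), with P = 1 - e^(-E); the sketch below assumes that form. The values of `lam` and `K` are placeholders for illustration, not parameters of any particular matrix.

```python
import math

def e_value(score, m, n, lam=0.3, K=0.1):
    """Expected number of chance alignments scoring >= score.

    m: query length, n: total database length.
    lam and K are illustrative placeholders (assumption).
    """
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one chance alignment, given E."""
    return -math.expm1(-e)  # equals 1 - exp(-e), stable for small e

e_small_db = e_value(score=60, m=250, n=1_000_000)
e_large_db = e_value(score=60, m=250, n=2_000_000)
print(e_small_db, e_large_db, p_value(e_small_db))
```

Doubling the database doubles E for the same score, which is one reason the choice of DB matters when judging significance. For small E the P value and the E value are nearly equal.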

E-value and Homology • A non-significant score does not necessarily imply non-homology.

E-value and Homology

Use it wisely • Choose your Substitution Matrix • Choose your DB

Example 1 – remote homology • Frequently, identification of a remote homology will require several database searches. • The glutathione transferase family

Remote homology

Remote homology • Testing the possibility that elongation factors share homology with glutathione S-transferases: • There is a clear relationship between this elongation factor and the class-theta glutathione transferases.

Example 2 - mapping • Three different families of G-protein coupled receptors: • the R family (the largest) • the C/S family • the G receptor family

Finding links between families

Building a Protein Tree

Conclusions • Searching with high-scoring sequences, related or unrelated, is a very important tool. • Homology is a transitive relation…

BLAST – Pros & Cons • Pros: • It works • Cons: • Evaluation is statistical rather than biological. • Convergent evolution • Weak but biologically relevant similarities may be overlooked (PSI-BLAST improves on this)

BLAST improvements • Running time improvements: • Two-hit method • Seed extension • PSI-BLAST

The two-hit method • The extension step accounts for more than 90% of BLAST’s execution time • Invoke an extension only when two non-overlapping hits are found within a certain distance of one another
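
A sketch of the two-hit filter follows. It assumes that the two hits must lie on the same diagonal, as in the published description of gapped BLAST; the slide itself only says "within a certain distance". The window size `A`, the word length `w`, and the function name are illustrative.

```python
def two_hit_triggers(hits, w=3, A=40):
    """Return the second hits that should trigger an extension.

    hits: iterable of (query_pos, subject_pos) word hits.
    A hit triggers when an earlier, non-overlapping hit lies on the same
    diagonal (subject_pos - query_pos) no more than A positions before it.
    """
    last_on_diag = {}   # diagonal -> subject position of the latest hit
    triggers = []
    for q, s in sorted(hits, key=lambda h: h[1]):
        diag = s - q
        prev = last_on_diag.get(diag)
        if prev is not None:
            distance = s - prev
            if distance < w:
                continue            # overlaps the previous hit: ignore it
            if distance <= A:
                triggers.append((q, s))
        last_on_diag[diag] = s
    return triggers

hits = [(0, 10), (1, 11), (5, 15), (0, 100), (20, 25)]
print(two_hit_triggers(hits))
```

Because most isolated hits never find a partner, far fewer extensions are started, which targets the step that dominates execution time.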

The two-hit method • [Figure: dot matrix of word hits, with a first hit, a second hit, and the resulting two-hit extension marked]

Seed Extension

PSI-BLAST • Evolutionary pressure • Needle in a haystack • PSI-BLAST aims to solve this problem

Evolution reveals itself • Give more weight to the conserved regions and ignore the background noise • PSI-BLAST (Position-Specific Iterated BLAST) shifts our view to these regions using the Position-Specific Score Matrix (PSSM)

Position-Specific Score Matrix - PSSM • Pij = a value proportional to the probability of finding the ith amino acid in the jth position in these sequences

PSSM • Represents the distribution of the amino acids in each position in a collection of sequences
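
A simplified way to build such a matrix from a set of aligned, equal-length sequences is sketched below. It assumes a uniform background and a flat pseudocount, which is cruder than the weighting PSI-BLAST itself applies; the function names and the example sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, background=None, pseudocount=1.0):
    """Return one dict per column mapping residue -> log-odds score in bits.

    Indexing is pssm[j][residue], i.e. position j first, then amino acid.
    """
    if background is None:
        # Uniform background frequencies (assumption).
        background = {a: 1.0 / len(ALPHABET) for a in ALPHABET}
    length = len(aligned[0])
    assert all(len(s) == length for s in aligned)
    pssm = []
    for j in range(length):
        counts = Counter(s[j] for s in aligned if s[j] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in background
        })
    return pssm

def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(col[res] for col, res in zip(pssm, segment))

pssm = build_pssm(["ACDE", "ACDF", "ACEE", "GCDE"])
print(score_segment(pssm, "ACDE"), score_segment(pssm, "WWWW"))
```

Unlike a substitution matrix, the score for a residue here depends on the column it falls in, so a fully conserved position is rewarded far more than a variable one.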

Steps in the PSI-BLAST • Initiation: • Running gapped BLAST on the query, outputting a collection of matching sequences • Iteration: • Constructing the PSSM based on the best sequences in this collection • The PSSM is compared to the protein DB, again, seeking alignments
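
The loop structure of these steps can be sketched on a toy problem. Everything here is an assumption made to keep the example self-contained: sequences are ungapped and of equal length, the first round scores against a profile built from the query alone (standing in for the initial gapped BLAST), and a raw score threshold replaces the E-value cutoff.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def profile_from(seqs, pseudocount=1.0):
    """Column-wise log-odds profile (bits) against a uniform background."""
    bg = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*seqs):
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            a: math.log2((column.count(a) + pseudocount) / total / bg)
            for a in ALPHABET
        })
    return profile

def iterate_search(query, database, threshold, max_rounds=5):
    """Toy iterated profile search; returns the sequences finally included."""
    included = [query]
    for _ in range(max_rounds):
        profile = profile_from(included)
        hits = [s for s in database
                if sum(col[r] for col, r in zip(profile, s)) >= threshold]
        new_included = [query] + [s for s in hits if s != query]
        if set(new_included) == set(included):
            break                   # converged: no new sequences found
        included = new_included
    return included

database = ["ACDEW", "ACKLW", "WWWWW"]
one_round = iterate_search("ACDEF", database, threshold=2.5, max_rounds=1)
converged = iterate_search("ACDEF", database, threshold=2.5)
print(one_round, converged)
```

In this toy run the sequence ACKLW scores below the threshold against the query alone and is picked up only after ACDEW has been folded into the profile. The same mechanism means that one wrongly included sequence would pull later rounds toward its own relatives.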

PSI-BLAST Example • We start with an uncharacterized protein – MJ0414 • When submitting the query we set the E-value threshold to 0.01 (higher than usual)

Result of initial gapped BLAST

First iteration – • Iterating the search using the derived profile uncovers DNA ligase II with an E-value of 0.005

Second iteration –

Interpretation of the results • Including a high-scoring but unrelated protein will shift the PSSM in its direction • E-values retrieved in later iterations should not be taken as automatic proof of homology

Was the ligase the right choice?

PSI-BLAST Conclusions • Uncovers protein relationships missed by single-pass database-search methods • Errors are easily amplified by iterations. • PSI-BLAST increases rather than removes the need for expertise, because there is more to interpret

Running time evaluation • Running time can be highly influenced by modifying parameters

Future Improvements • Accepting PSSM as input from other programs • Realignment – improve the alignment before going over the DB • Automatic domain recognition
