Searching for Similarities in Genetic and Proteomic Sequences Dr. Jaume Bacardit, Prof. N. Krasnogor. Based on a Lecture by Gary Benson, Departments of Computer Science and Biology, Boston University, given at the 2003 ISSCB. Examples from P.A. Pevzner’s “Computational Molecular Biology”, D.A. Krane & M.L. Raymer’s “Fundamental Concepts of Bioinformatics” and from C. Gibas & P. Jambeck’s “Developing Bioinformatics Computer Skills”
Topic 1 Outline Similarity and Alignment • Define homology, similarity by descent and similarity by convergence • Common mutations and their mathematical models • Alignments • Scoring Alignments • Gap penalty functions • Computing the best scoring alignment – the Longest Common Subsequence (LCS) problem • Sequence Alignment • Multiple Sequence Alignment
Similarity and Biomolecules Similarity is expected among biomolecules that are descended from a common ancestor. Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function, while important catalytic, regulatory or structural regions remain similar.
Similarity and Evolution Evolution has duplicated and shuffled bits and pieces of molecules to produce new linear arrangements that combine function in novel ways. Regions of similarity often suggest an evolutionary tie and/or common functional properties between very different molecules.
An alignment between two or more genetic or proteomic sequences represents an explicit hypothesis about their evolutionary histories. • Thus comparison of related gene/protein sequences has been instrumental in shedding light on the information content of these sequences and their biological functions. • Hence, comparing and aligning gene/protein sequences is a cornerstone of bioinformatics.
Three common similarity problems • Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. • Start with a small set of sequences and identify similarities and differences among them. • In many sequences or very long sequences, detect commonly occurring patterns.
Morphology Morphology is the form and structure of an organism. Should shared morphology mean similarity?
Shared morphology Shared morphology does not necessarily imply common ancestry. The animals with hands have all evolved from a common ancestor with a hand. The ichthyosaur, shark and porpoise each evolved sea life adaptations independently. The beauty of evolution: eyes and ears, for example, have been rediscovered independently in many species, such as octopuses, flies, bees and mammals. When similarity is due to common ancestry, we call it homology.
How homology helps Given molecular sequences X and Y: X ~ Y AND INFO(Y) ==> INFO(X) (“ ~ ” means similar) X,Y could be gene sequences, protein sequences
Are the Sequences Similar • How similar? • What parts are the most similar? Remember, the common ancestor of the two sequences may have existed millions of years ago.
How can we tell if the two sequences are similar? Similarity judgements should be based on: • The types of changes or mutations that occur within sequences. • Characteristics of those different types of mutations. • The frequency of those mutations.
Common mutations in DNA
Substitution: A C G T T G A C → A C G A T G A C
Deletion: A C G T T G A C → A C G A C
Insertion: A C G T T G A C → A C G C A A T T G A C
Common mutations
Duplication: A C G T T G A C → A C G T T G A T T G A C
Inversion (double stranded DNA shown):
A C G T T G A C  →  A C T C A A C C
T G C A A C T G  →  T G A G T T G G
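The five mutation types on the last two slides can be sketched as small string operations. This is only an illustration: the function names and the position/length arguments are assumptions made for the example, and the inversion follows the double-stranded picture by replacing a segment with its reverse complement.

```python
# Minimal sketch of the mutation types as operations on a DNA string.
# All function names and arguments are illustrative, not from the lecture.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def substitute(seq, pos, base):
    """Replace the base at index pos with another base."""
    return seq[:pos] + base + seq[pos + 1:]

def delete(seq, start, length):
    """Remove `length` bases starting at index start."""
    return seq[:start] + seq[start + length:]

def insert(seq, pos, fragment):
    """Insert a fragment before index pos."""
    return seq[:pos] + fragment + seq[pos:]

def duplicate(seq, start, length):
    """Tandem duplication: copy a segment and place the copy right after it."""
    end = start + length
    return seq[:end] + seq[start:end] + seq[end:]

def invert(seq, start, length):
    """Inversion: replace a segment by its reverse complement."""
    end = start + length
    flipped = "".join(COMPLEMENT[b] for b in reversed(seq[start:end]))
    return seq[:start] + flipped + seq[end:]

seq = "ACGTTGAC"
print(substitute(seq, 3, "A"))  # ACGATGAC
print(delete(seq, 3, 3))        # ACGAC
print(insert(seq, 3, "CAA"))    # ACGCAATTGAC
print(duplicate(seq, 3, 4))     # ACGTTGATTGAC
print(invert(seq, 2, 5))        # ACTCAACC
```

The printed strings reproduce the mutated sequences shown on the slides, starting from A C G T T G A C.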
Frequency of mutations Substitution > Insertion, Deletion >> Duplication > Inversion
Dot Plots • One of the most obvious ways to visualize similarity • A dot plot is a scatter plot with X coordinate associated to one sequence and Y coordinate associated to the second one • If for a position (i, j) the sequences coincide then a dot appears
Dot plot continued • For large sequences the interpretation of the plot becomes difficult • A window size and a cut-off parameter are introduced to filter out spurious matches
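A dot plot with windowing can be sketched in a few lines. The sketch below assumes one common reading of the two parameters: a dot is placed at (i, j) when the windows starting at those positions agree in at least `cutoff` of their `window` positions. The names `dot_plot` and `render` are made up for the example.

```python
def dot_plot(seq_x, seq_y, window=1, cutoff=1):
    """Return a grid of booleans; grid[j][i] is True when the windows
    starting at seq_x[i] and seq_y[j] match in at least `cutoff` positions."""
    if not 1 <= cutoff <= window:
        raise ValueError("cutoff must be between 1 and window")
    cols = len(seq_x) - window + 1
    rows = len(seq_y) - window + 1
    grid = []
    for j in range(rows):
        row = []
        for i in range(cols):
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append(matches >= cutoff)
        grid.append(row)
    return grid

def render(grid):
    """Draw the grid as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in grid)

print(render(dot_plot("ACGT", "ACGT")))
```

With `window=1` and `cutoff=1` this reduces to the plain dot plot of the previous slide; larger windows with a stricter cut-off thin out isolated dots and leave the diagonal runs that indicate similar regions.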
Alignments There are many ways to align two sequences. We just saw one way:
T T A C G T A C A G A T T A
T - - G G A A C A - - - T A
Here is another:
T T A C G T - A C A G A T T A
T - - - G G A A C - - A T - A
Which is better? Remember, we cannot choose based on the evolutionary history, because that is unknown.
Alignments: Definitions • Gap: a break in the alignment, in either one of the sequences. For nucleotides, a gap is the consequence of an insertion or deletion mutation; for proteins, it is more difficult to say. • Regions of matching residues: these indicate parts of a sequence that are well conserved. • Mismatched residues: for nucleotides, a consequence of a substitution mutation; they indicate less conserved regions.
Finding the Best Alignment: Ranking Alignments by Score Score an alignment by • Partitioning it into columns • Assigning a weight to each column • Summing the column weights
Distance Scoring Distance scoring: • Alignment gets a non-negative score. • Alignment of identical sequences scores zero, all others > zero. • Best alignment has smallest score. Typical scoring functions are: • d(a,a) = 0; identity • d(a,b) = d(b,a) > 0; a ≠ b; substitution • g = d(a, – ) > 0; indel (gap)
Similarity Scoring Similarity scoring: • Alignment scores may be positive, zero, or negative. • More similar means larger positive score. • The best alignment has largest score. Typical scoring functions are: • s(a,b) is { > 0 if a and b are similar in one or more characteristics or are observed to substitute frequently for each other; ≤ 0 otherwise }; substitution • g = s(a, – ) < 0; indel (gap)
Gap penalty functions • Single character gap penalty: g(a, -) = c (c a constant or a value dependent on a) • Affine (linear) gap penalty: g(k) = α + βk (α is a gap opening penalty, β is a gap extension penalty, k is the gap length) • Concave gap penalty: g(k) = α + β·m(k), where m(k) is a function like log(k) which grows more slowly as k increases.
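To compare how the three penalty functions grow with the gap length k, here is a small sketch. The default parameter values are arbitrary illustrations, and the single-character penalty is assumed to be charged once per gap character, so a gap of length k costs c·k.

```python
import math

def single_char_gap(k, c=4.0):
    """Each gap character costs the same constant c (assumed: k characters cost c*k)."""
    return c * k

def affine_gap(k, opening=4.0, extension=2.0):
    """g(k) = alpha + beta * k, as written on the slide."""
    return opening + extension * k

def concave_gap(k, opening=4.0, extension=2.0):
    """g(k) = alpha + beta * m(k), here with m(k) = log(k)."""
    return opening + extension * math.log(k)

for k in (1, 2, 5, 10, 50):
    print(k, single_char_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

The table this prints shows the intended effect: under the concave penalty, one long gap is much cheaper than under the other two schemes.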
Distance Scoring Alignment parameters: d(a, a) = 0; d(a, b) = +2; g = +4
A - G C C G T A T
A C G A - - T - T
0 4 0 2 4 4 0 4 0   total = 18
Similarity Scoring Scoring parameters: s(a, a) = +5, s(a, b) = -3, g = -8
 A  -  G  C  C  G  T  A  T
 A  C  G  A  -  -  T  -  T
+5 -8 +5 -3 -8 -8 +5 -8 +5   total = -15
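Both worked examples follow the same recipe: split the alignment into columns, weight each column, and sum. A minimal sketch, assuming the two rows are given as equal-length strings with "-" for a gap (the function name `score_alignment` is made up for the example):

```python
def score_alignment(row_a, row_b, match, mismatch, gap):
    """Sum per-column weights of a pairwise alignment written with '-' gaps."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # indel column
        elif a == b:
            total += match      # identity column
        else:
            total += mismatch   # substitution column
    return total

top, bottom = "A-GCCGTAT", "ACGA--T-T"
print(score_alignment(top, bottom, match=0, mismatch=2, gap=4))    # 18  (distance)
print(score_alignment(top, bottom, match=5, mismatch=-3, gap=-8))  # -15 (similarity)
```

The same function serves both scoring schemes; only the parameters and the direction of "best" (smallest distance, largest similarity) change.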
Similarity scoring with affine gap Alignment parameters: s(a, a) = +5, s(a, b) = -3, g(k) = α + β(k-1), α = -4, β = -2. That is, a gap of length k costs -4 for opening plus -2 for each of the k-1 extension positions (k ≥ 2).
 A  -  G  C  C  G  T  A  T
 A  C  G  A  -  -  T  -  T
+5 -4 +5 -3 -4 -2 +5 -4 +5   total = 3
The gap opening penalty is counted only once per gap: in an extension column it has already been accounted for in the previous position.
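The affine example can be checked with a short sketch that remembers which row held a gap in the previous column, so that the opening penalty is charged only at the first column of each gap. It assumes equal-length rows, "-" as the gap symbol, and no column with a gap in both rows; the function name is illustrative.

```python
def affine_score(row_a, row_b, match=5, mismatch=-3, gap_open=-4, gap_extend=-2):
    """Score an alignment with g(k) = gap_open + gap_extend * (k - 1)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    previous_gap_row = None  # 'a', 'b', or None if the previous column had no gap
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            gap_row = "a" if a == "-" else "b"
            # Extending a gap in the same row is cheaper than opening a new one.
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            previous_gap_row = None
    return total

print(affine_score("A-GCCGTAT", "ACGA--T-T"))  # 3
```

Compared with the flat penalty of -8 per gap character, the two-character gap now costs -6 instead of -16, which is why the same alignment scores 3 rather than -15.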
Scoring Matrices • Given that not all types of indels and mutations are equally likely, and • Given that not all of the changes are equally severe, • We might want to penalize differently according to which nucleotides/amino acids are mismatched with each other • Example: • Consider two protein sequences, one of which has an alanine in a given position. A substitution to another small, hydrophobic amino acid, e.g. valine, will not be as bad as a substitution to a bulky and charged residue like lysine. • Thus we might want to score an alignment of alanine-valine more favourably than alanine-lysine.
The relative scores are captured in scoring matrices • For nucleotides these are quite simple: • BLAST uses: +5 if the two aligned nucleotides are identical, -4 if they are not • Others: the Identity Matrix and the Transition Transversion Matrix, in which transitions (purine-purine or pyrimidine-pyrimidine) are mildly penalized while transversions (purine-pyrimidine or pyrimidine-purine) are heavily penalized
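The nucleotide matrices described above are small enough to build directly. In this sketch the +5/-4 values are the ones quoted on the slide, while the transition and transversion penalties (-1 and -5) are arbitrary placeholders chosen only to show "mild" versus "heavy"; all function names are made up for the example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def blast_style(a, b):
    """+5 for identical nucleotides, -4 otherwise (values from the slide)."""
    return 5 if a == b else -4

def transition_transversion(a, b, match=1, transition=-1, transversion=-5):
    """Mild penalty within a chemical class, heavy penalty across classes.
    The numeric defaults are illustrative assumptions."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion

def build_matrix(score):
    """Tabulate a scoring function over all nucleotide pairs."""
    return {(a, b): score(a, b) for a in "ACGT" for b in "ACGT"}

matrix = build_matrix(transition_transversion)
for a in "ACGT":
    print(a, [matrix[(a, b)] for b in "ACGT"])
```

A matrix built this way can replace the fixed match/mismatch weights in any column-by-column scoring scheme.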
Scoring Matrices for Proteins • Designing scoring matrices (SM) for protein sequences is more complicated. Two main approaches: • SM based on chemical-physical properties • SM based on observed substitution frequencies
Physico-chemical similarity scores Examples: • Pairing two amino acids with aromatic functional groups should result in a good score, while pairing amino acids where one is non-polar and the other is charged should not. • SM have been devised based on hydrophobicity, charge, electronegativity and size. • The genetic code has also been used: a pair of amino acids is scored according to the minimum number of nucleotide substitutions necessary to convert a codon from one residue to the other.
Observed substitution frequency scores • Observe actual substitution rates in nature • If a substitution between amino acids i and j is observed frequently, then positions where these two are aligned are scored favourably. • Likewise, if i and j are seldom observed to be interchanged in nature, then they are penalized in any alignment. • Example: Asp, Glu, Ser are the most mutable amino acids while Cys & Trp are the least mutable
Point Accepted Mutation (PAM) Matrix • Computed by observing substitutions in similar sequences • An alignment is computed with sequences having very high (> 85%) similarity • The relative mutability Mj of each amino acid j is computed by counting the number of times j is substituted by, i.e. aligned with, amino acids other than j. • Aij is computed by counting how many times amino acid i has been substituted by j, for all i,j pairs. • For example, Acm is the number of times methionine residues were replaced with cysteine in any pair of aligned sequences. • The Aij are then divided by the relative mutability values and normalized by the frequency of occurrence of each amino acid; the log of each resulting value is the entry Rij of the PAM-1 matrix • Also called a log odds matrix
Point Accepted Mutation (PAM) Matrix The PAM matrix represents substitution probabilities over a fixed unit of evolutionary change. In PAM-1 this unit is 1 substitution (or accepted point mutation) per 100 residues. The probabilities in the PAM-1 matrix answer the following question: “Suppose I start with a given polypeptide sequence M at time t, and observe the evolutionary changes in the sequence until 1% of all amino acid residues have undergone substitutions at time t+n. Let the new sequence at time t+n be called M’. What is the probability that a residue of type j in M will be replaced by i in M’?”
Point Accepted Mutation (PAM) Matrix The entries Rij in PAM-1 give the answer to that question. Multiplying PAM-1 by itself gives the answer for multiple PAM units. The exact PAM-k matrix to use depends on both the length of the sequences to be compared and how evolutionarily close they are believed to be. If two sequences are believed to be close evolutionary relatives, a low PAM is appropriate. A PAM-1000 should be used for distant relationships. PAM-250 is normally a good compromise.
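The step "multiplying PAM-1 by itself" can be illustrated on a toy example. The sketch below assumes the multiplication is applied to the matrix of substitution probabilities (each column summing to 1) and uses an invented two-letter alphabet with made-up numbers, not real Dayhoff data; `pam_k` and `mat_mul` are names chosen for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_k(pam1, k):
    """Raise a PAM-1 substitution-probability matrix to the k-th power."""
    if k < 1:
        raise ValueError("k must be at least 1")
    result = pam1
    for _ in range(k - 1):
        result = mat_mul(result, pam1)
    return result

# Toy, hypothetical PAM-1 over a two-letter alphabet: entry [i][j] is the
# probability that residue j is replaced by residue i after one PAM unit.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

for k in (1, 2, 250):
    print(k, [[round(p, 4) for p in row] for row in pam_k(TOY_PAM1, k)])
```

As k grows, the off-diagonal probabilities increase, which matches the advice to use higher PAM numbers for more distant relationships.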
Metrics of similarity: BLOSUM matrices • Created later in time, and from larger volumes of proteins • Designed to perform better in distant relationships • BLOSUM = BLOcks SUbstitution Matrix • Computed from regions of closely-related proteins that are alignable without gaps (related to local alignments, explained later in the lecture) • A parameter specifies the maximum % of sequence identity in the alignments used to compute the matrix, to avoid overweighting closely related sequences • BLOSUM62 (maximum of 62% sequence identity) is the standard substitution matrix in most alignment programs
Computing the Optimal Alignment: The LCS Problem as Prototype The Longest Common Subsequence (LCS) problem is a method for comparing sequences. Although the solution does not produce an alignment, it illustrates a method of dynamic programming that is very similar to that used by alignment algorithms.
Longest Common Subsequence Problem Let X be a string of characters. A subsequence X’ of X is formed by discarding zero or more letters of X. Note that the letters in X’ maintain their same relative order as in X. Let X and Y be two strings. A common subsequence Z is a subsequence of both. A longest common subsequence (LCS) is the longest such Z. Examples:
X = a b c d e b a, Y = b e b d c e a c d, Z = b d e a
X = a b c d e b a, X’ = a b d b
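The definition of a subsequence translates into a short check: scan X from left to right and consume one letter of the candidate each time it is found. The function name `is_subsequence` is chosen for this sketch.

```python
def is_subsequence(candidate, text):
    """True if candidate can be obtained from text by discarding letters
    while keeping the remaining letters in their original order."""
    remaining = iter(text)
    # `letter in remaining` advances the iterator past the first match,
    # so later letters must be found further to the right.
    return all(letter in remaining for letter in candidate)

x = "abcdeba"
y = "bebdceacd"
print(is_subsequence("abdb", x))                              # True
print(is_subsequence("bdea", x) and is_subsequence("bdea", y))  # True: a common subsequence
print(is_subsequence("ba", "ab"))                             # False: order matters
```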
Longest Common Subsequence Problem Find the longest subsequence common to two strings Input: Two strings (i.e. sequences) X and Y. Output: The longest common subsequence between X & Y. A divide and conquer solution can be developed by looking at what happens to the last letters in each sequence. That is, are they part of the LCS solution or not?
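The question about the last letters leads to a table of subproblem solutions. The sketch below is one standard bottom-up dynamic programming formulation, not necessarily the exact presentation used in the lecture: `table[i][j]` holds the LCS length of the first i letters of X and the first j letters of Y, and a traceback from the bottom-right corner recovers one LCS.

```python
def lcs(x, y):
    """Return one longest common subsequence of strings x and y."""
    n, m = len(x), len(y)
    # table[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # The last letters match and can both be part of the LCS.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Drop the last letter of x or of y, whichever is better.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Traceback: walk from the corner back to the origin, collecting matches.
    letters = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))

print(len(lcs("ATCTGAT", "TGCATA")))  # 4
```

There can be several longest common subsequences; which one is returned depends on how ties are broken in the traceback, so only the length is unique.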
Alignment paths & gap placement [Figure: three possible steps of a path through the alignment array for sequences S and T, labelled "No gap", "gap in seq. T" and "gap in seq. S"]
Alignments and Paths through the Alignment Array
t a c g - c a a - -
- a c g t g a a t t
Alignments and Paths: An Alternate Alignment
t - - a c g c a - - a
a c g t g - - a a t t
LCS recursion i’s are rows, j’s are columns
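The formula itself did not survive on this slide. The standard LCS recurrence, written here with assumed notation (s for the array, x and y for the two strings, i indexing rows and j indexing columns), is:

```latex
s_{i,0} = s_{0,j} = 0, \qquad
s_{i,j} = \max
\begin{cases}
  s_{i-1,j} \\
  s_{i,j-1} \\
  s_{i-1,j-1} + 1 & \text{if } x_i = y_j
\end{cases}
```

The first two cases skip a letter of one string (a gap in the corresponding alignment), and the third case extends the common subsequence by a matching pair of letters. The value in the last row and column is the LCS length.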
Completed LCS array [Figure: filled LCS array, not reproduced here] Alignment:
- T G C A T - A -
A T - C - T G A T