Bioinformatics: Alignment of biological sequences. UL, 2019, Juris Viksna
Topics
• Short review about sequence comparison:
  • biological motivation to compare sequences
  • sequence similarity criteria
  • basic DP algorithm for distance computation between sequences
• Global and local sequence comparisons
  • similarity matrices and gap penalties
  • modified algorithms that use gap penalties
  • local sequence comparison
• Similarity matrices
  • how to obtain them
  • relations between similarity matrices and sequence evolution
  • suitability of matrices for specific sequences
Comparison of biological sequences
• Two sequence comparisons (pairwise alignment):
  • the formulation of the problem
  • DP algorithm (match = 1, mismatch = -1, gap = -2)
  • global and local comparisons
  • affine gap penalties
  • similarity matrices
• Multiple alignment
  • the formulation of the problem (SOP, sum of pairs)
  • star alignment
  • relation with phylogenetic trees, progressive alignment
• Sequence classification: profiles and motifs
  • profile matrices
  • HMM (Hidden Markov Models)
Why do we need to compare sequences?
• The genome is already sequenced (assume...)
• There are methods that predict DNA coding regions (genes)
• What are the biological functions of these genes?
  • We can find out what protein (sequence) a gene encodes
  • But we still do not know what this protein does...
  • However, we can search for known proteins with similar sequences whose functions are known
• We want to find out something about proteins in humans
  • The best approach is “experimental”, but that is tricky with humans...
  • But we can use a similar protein (e.g. in mice) and start our experiments with it
Basic assumptions
• We will consider proteins and RNA/DNA simply as sequences over 20-letter and 4-letter alphabets, respectively
• Aims of comparison:
  • to find out how similar the sequences are (some similarity measure)
  • to find a “common motif” of the sequences (alignment)
• Regarding algorithmic complexity, there are two distinct cases:
  • comparison of two sequences (relatively easy)
  • simultaneous comparison of n sequences (complexity grows exponentially with n)
• In this lecture we will consider the problem of comparing two sequences
Nucleotides and DNA [Watson, Crick 1953]. For us, DNA is a sequence over a 4-letter alphabet. [Adapted from Y. Guo]
Proteins. For our purposes we will treat proteins as sequences over a 20-symbol alphabet. [Adapted from R. Shamir]
From DNA to proteins
• Each codon consists of 3 nucleotides
• Mutations:
  • Substitution: changes a single aa
  • Insertion/deletion: “frame shift” (changes all subsequent aa)
  • NB! An insertion/deletion might be a multiple of 3...
  • “Silent mutation”: DNA changed, but not the aa
  • “Nonsense mutation”: creates a “stop” codon
Genetic code. Completely worked out by 1966.
Evolution of sequences
• Mutations are a natural process of DNA evolution
• DNA replication errors:
  • substitutions
  • insertions and deletions (together: indels)
• Similarity between sequences:
  • indicates their common ancestral origin
  • indicates similarity of biological functions
• Well, this is of course a simplification: the change in protein function determines whether the organism will have offspring and whether the changed gene will survive
• Protein sequence similarity is closely associated with similarity of DNA coding regions
Sequence evolution
• Each codon consists of 3 nucleotides
• Mutations:
  • Substitution: changes a single aa
  • Insertion/deletion: “frame shift” (changes all subsequent aa)
  • NB! An insertion/deletion might be a multiple of 3...
  • “Silent mutation”: DNA changed, but not the aa
  • “Nonsense mutation”: creates a “stop” codon
Sequence evolution
[Figure: related DNA sequences derived from one another by mutations: ggcatt, agcatt, agcata, agccta, aggatt, agcatg, gacatt]
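Sequence homology
Homologs: sequences that evolved from a common ancestor
Orthologs: homologs in different organisms (separated by speciation), typically with the same function
Paralogs: homologs within the same organism (arising from gene duplication), typically with similar functions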
Orthologs vs paralogs [Adapted from R.Shamir]
How to compare sequences?
Given two proteins:
>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR
How to assess their similarity?
Sequence alignment - scores
• Sequence similarity/identity (%): well-defined for the aligned parts of the sequences
• “Score”: usually very method-specific in absolute value
• p-value: probability that an alignment with the given score or higher is found by chance. Normally the reported values are only approximations
• Expect (E)-value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The lower the better, but short similar sequences might have comparatively high values (the E-value decreases exponentially with the score)
• Z-score: number of standard deviations from the mean value
How to align two sequences - BLAST
• Find two regions of exact similarity (usually 4 aa each)
• Try to join and extend these matches until the score falls below a threshold
• Anyway, how should we do this “correctly”?
The “Manhattan Tourist” problem Visit as many sights as possible starting from top-left corner and moving just down or right
Longest common subsequence
Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such a C does not need to be unique).
Example:
A = GGATATCGGGCGAT
B = ATTCCCCCGCCCTA
C = ATTCGCA or ATTCGCT
How can we find it?
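LCS - dynamic programming solution
$A = a_1 a_2 \dots a_n$, $B = b_1 b_2 \dots b_m$
$c(i,k)$: length of the LCS of $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$

$$
c(i,k) =
\begin{cases}
0, & \text{if } i = 0 \text{ or } k = 0 \\
c(i-1,k-1) + 1, & \text{if } i,k > 0 \text{ and } a_i = b_k \\
\max\{c(i,k-1),\ c(i-1,k)\}, & \text{if } i,k > 0 \text{ and } a_i \neq b_k
\end{cases}
$$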
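The recurrence for c(i,k) can be turned into code almost literally. The sketch below is one possible implementation; the function name `lcs` and the tie-breaking rule in the traceback are assumptions, so it returns one longest common subsequence out of possibly several.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    # c[i][k] = length of the LCS of a[:i] and b[:k]
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])

    # Trace back from c[n][m] to recover the subsequence itself.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i -= 1
            k -= 1
        elif c[i - 1][k] >= c[i][k - 1]:
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


result = lcs("GGATATCGGGCGAT", "ATTCCCCCGCCCTA")
print(result, len(result))
```

Which of the equally long subsequences is returned depends only on how ties are broken during the traceback; the length does not.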
LCS - example
A = GADTAMAWGRAMMA
B = GAGAWKIAMM
LCS - example
LCS: GAAWAMM
Alignment:
GA-DTAMAW--GRAMMA
GAG----AWKI--AMM-
Edit distance
• Levenshtein 1966
• Minimal number of operations that transform one sequence into another
  • insert, delete, substitute (one symbol each)
• Edit distance is 0 (sequences are identical) or positive
• For example, “AIMS” and “AMOS” have distance 2, reached e.g. by two substitutions (AIMS / AMOS) or by one deletion and one insertion (AIM-S / A-MOS)
[Adapted from D. Gilbert]
Edit distance
Given two sequences A and B, find the smallest possible number of insertion, deletion and substitution operations that changes A to B.
Example:
A = GGATATCGGGCGAT
B = ATTCCCCCGCCCTA
[G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A]
ED = 12?
How can we find it?
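Edit distance
$A = a_1 a_2 \dots a_n$, $B = b_1 b_2 \dots b_m$
$e(i,k)$: edit distance between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$

$$
e(i,k) =
\begin{cases}
i, & \text{if } k = 0 \\
k, & \text{if } i = 0 \\
e(i-1,k-1), & \text{if } i,k > 0 \text{ and } a_i = b_k \\
\min\{e(i-1,k-1),\ e(i,k-1),\ e(i-1,k)\} + 1, & \text{if } i,k > 0 \text{ and } a_i \neq b_k
\end{cases}
$$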
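A direct transcription of the recurrence for e(i,k) might look as follows. This is a sketch only; the function name `edit_distance` is an assumption, and the full (n+1) x (m+1) table is kept for clarity rather than efficiency.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    n, m = len(a), len(b)
    # e[i][k] = edit distance between a[:i] and b[:k]
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i          # delete all i symbols
    for k in range(m + 1):
        e[0][k] = k          # insert all k symbols
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1],   # substitute
                                  e[i][k - 1],       # insert
                                  e[i - 1][k])       # delete
    return e[n][m]


print(edit_distance("AIMS", "AMOS"))
print(edit_distance("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```

Running it on the two 14-letter example sequences is a quick way to check whether a hand-constructed edit script is actually minimal.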
ED - modifications

$$
e(i,0) = i \cdot t, \qquad e(0,j) = j \cdot t
$$

$$
e(i,j) = \min
\begin{cases}
e(i-1,j) + t \\
e(i,j-1) + t \\
e(i-1,j-1) + t(a_i,b_j)
\end{cases}
$$

If you are interested in the result only «up to a sign», it does not matter whether min or max is used. min is more natural for ED, max for LCS. max is also the usual choice for sequence comparison.
$t(a_i,b_j)$: reflects the probability that aa $a_i$ changes to aa $b_j$
For ED: $t = 1$; $t(a_i,b_j) = 0$ if $a_i = b_j$; $t(a_i,b_j) = 1$ if $a_i \neq b_j$
For «inverse» LCS (with max in place of min): $t = 0$; $t(a_i,b_j) = 1$ if $a_i = b_j$; $t(a_i,b_j) = 0$ if $a_i \neq b_j$
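The generalised recurrence can be sketched as a single function parameterised by the gap cost t, the pair score t(a_i, b_j) and the choice of min or max. The names `dp_score`, `gap`, `subst` and `choose` are hypothetical, and the boundary cells are initialised to i·t and j·t, which is an assumption of this sketch (it gives i and j for ED and 0 for LCS).

```python
def dp_score(a, b, gap, subst, choose=min):
    """Fill the table e(i, j) for the generalised recurrence.

    gap    : cost/score t of aligning a symbol with a gap
    subst  : function giving t(x, y) for a pair of symbols
    choose : min (distance-like) or max (similarity-like)
    """
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e[i][0] = i * gap
    for j in range(1, m + 1):
        e[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = choose(e[i - 1][j] + gap,
                             e[i][j - 1] + gap,
                             e[i - 1][j - 1] + subst(a[i - 1], b[j - 1]))
    return e[n][m]


def edit_distance(a, b):
    # ED: t = 1, t(x, y) = 0 for a match and 1 for a mismatch, minimise.
    return dp_score(a, b, gap=1,
                    subst=lambda x, y: 0 if x == y else 1, choose=min)


def lcs_length(a, b):
    # LCS: t = 0, t(x, y) = 1 for a match and 0 for a mismatch, maximise.
    return dp_score(a, b, gap=0,
                    subst=lambda x, y: 1 if x == y else 0, choose=max)


print(edit_distance("AIMS", "AMOS"), lcs_length("GADTAMAWGRAMMA", "GAGAWKIAMM"))
```

Replacing `subst` with a lookup in a substitution matrix and `gap` with a negative penalty, while keeping `choose=max`, gives the usual global similarity score.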
Substitution (similarity) matrices

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3
C   0  -3  -3  -3   8  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3
H  -2   0   1  -1  -3   0   0  -2   7  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   6  -1  -1  -4  -3  -2
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  10   2  -3
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   6  -1
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4

• Similarity matrix
• Most popular:
  • PAM
  • BLOSUM
  • Gonnet
• The one shown is BLOSUM62 (almost :)
"Traditional assumption": the substitution score is > 0 for substitutions that are more frequent than expected at random, and < 0 for those that are less frequent than random.
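The "traditional assumption" is commonly formalised as a log-odds score. The formula below is a sketch in notation that is not taken from the slides: $p_{ab}$ is assumed to be the observed frequency with which residues $a$ and $b$ are aligned in trusted alignments, $q_a$ and $q_b$ are assumed background frequencies of the two residues, and $\lambda$ is an assumed scaling constant.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{p_{ab}}{q_a\, q_b}
```

With this definition $s(a,b) > 0$ exactly when the pair occurs more often than expected by chance ($p_{ab} > q_a q_b$), and $s(a,b) < 0$ when it occurs less often.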
Sequence similarity as the longest path problem. We can treat the matrix as a graph with weighted edges. The problem then translates to finding the path with the largest/smallest weight in a directed acyclic graph.
Complexity of similarity computation
Size of matrix: $n \cdot m$
Computing the value of each cell: constant time
Total time: $O(nm)$
Total memory: $O(nm)$
Notice that if we want just the score, only two rows are needed. In this case the required memory is $O(\min(n, m))$.
However, what if we also need the alignment (and we usually do)?
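The two-row observation can be illustrated with edit distance as the score. This is a minimal sketch; the function name `edit_distance_two_rows` is an assumption, and the shorter sequence is placed along the row so that the extra memory is proportional to min(n, m).

```python
def edit_distance_two_rows(a, b):
    """Edit distance using only two rows of the DP table."""
    # Edit distance is symmetric, so the shorter string can index the row.
    if len(b) > len(a):
        a, b = b, a
    prev = list(range(len(b) + 1))           # row i - 1 (starts as row 0)
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)             # row i, first cell is e(i, 0) = i
        for k in range(1, len(b) + 1):
            if a[i - 1] == b[k - 1]:
                cur[k] = prev[k - 1]
            else:
                cur[k] = 1 + min(prev[k - 1], cur[k - 1], prev[k])
        prev = cur                           # the older row is no longer needed
    return prev[len(b)]


print(edit_distance_two_rows("AIMS", "AMOS"))
```

The price of discarding earlier rows is that the traceback information is gone, which is exactly the difficulty raised by the closing question.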
Interpretation of comparison results Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).
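To make the "alignment as a path" view concrete, the sketch below fills an edit distance table and then walks back from (n, m) to (0, 0), recording both the visited grid nodes and the gapped alignment. The function name `align` and the order in which ties are resolved (diagonal first, then up, then left) are assumptions of this example.

```python
def align(a, b):
    """Return (gapped a, gapped b, path) for one optimal edit-distance alignment."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i
    for k in range(m + 1):
        e[0][k] = k
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            e[i][k] = min(e[i - 1][k - 1] + (a[i - 1] != b[k - 1]),
                          e[i - 1][k] + 1,
                          e[i][k - 1] + 1)

    # Walk back through the grid; each step is one column of the alignment.
    top, bottom = [], []
    i, k = n, m
    path = [(i, k)]
    while i > 0 or k > 0:
        if i > 0 and k > 0 and e[i][k] == e[i - 1][k - 1] + (a[i - 1] != b[k - 1]):
            top.append(a[i - 1]); bottom.append(b[k - 1])   # diagonal step
            i -= 1; k -= 1
        elif i > 0 and e[i][k] == e[i - 1][k] + 1:
            top.append(a[i - 1]); bottom.append("-")        # vertical step
            i -= 1
        else:
            top.append("-"); bottom.append(b[k - 1])        # horizontal step
            k -= 1
        path.append((i, k))
    path.reverse()
    return "".join(reversed(top)), "".join(reversed(bottom)), path


top, bottom, path = align("AIMS", "AMOS")
print(top)
print(bottom)
print(path)
```

Diagonal steps correspond to match or substitution columns, vertical and horizontal steps to gaps in the second and first sequence respectively.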