
Bioinformatics: Alignment of biological sequences

UL, 2019, Juris Viksna.


Presentation Transcript


  1. Bioinformatics: Alignment of biological sequences. UL, 2019, Juris Viksna

  2. Topics
     • Short review of sequence comparison:
       • biological motivation for comparing sequences
       • sequence similarity criteria
       • basic DP algorithm for computing the distance between sequences
     • Global and local sequence comparisons:
       • similarity matrices and gap penalties
       • modified algorithms that use gap penalties
       • local sequence comparison
     • Similarity matrices:
       • how to obtain them
       • relations between similarity matrices and sequence evolution
       • suitability of matrices for specific sequences

  3. Comparison of biological sequences
     • Comparison of two sequences (pairwise alignment):
       • the formulation of the problem
       • DP algorithm (match = 1, mismatch = −1, gap = −2)
       • global and local comparisons
       • affine gap penalties
       • similarity matrices
     • Multiple alignment:
       • the formulation of the problem (SOP)
       • star alignment
       • relation to phylogenetic trees, progressive alignment
     • Sequence classification: profiles and motifs
       • profile matrices
       • HMM (Hidden Markov Models)

  4. Why do we need to compare sequences?
     • The genome is already sequenced (assume...)
     • There are methods that predict DNA coding regions (genes)
     • What are the biological functions of these genes?
     • We can find out which protein (sequence) a gene encodes
     • But we still do not know what this protein does...
     • However, we can search for proteins with similar sequences whose functions are already known
     • We want to find out something about proteins in humans
     • The best approach is “experimental”, but that is tricky with humans...
     • But we can take a similar protein (e.g. in mice) and start our experiments with it

  5. Basic assumptions
     • We will consider proteins and RNA/DNA simply as sequences over 20-letter and 4-letter alphabets, respectively
     • Aims of comparison:
       • to find out how similar the sequences are (some similarity measure)
       • to find a “common motif” of the sequences (alignment)
     • Regarding algorithmic complexity, there are two distinct cases:
       • comparison of two sequences (relatively easy)
       • simultaneous comparison of n sequences (complexity grows exponentially with n)
     • In this lecture we will consider the problem of comparing two sequences

  6. Nucleotides and DNA [Watson, Crick 1953]. For us, DNA is a sequence over a 4-letter alphabet. [Adapted from Y. Guo]

  7. Proteins. For our purposes we will treat proteins as sequences over a 20-symbol alphabet. [Adapted from R. Shamir]

  8. From DNA to proteins
     • Each codon consists of 3 nucleotides
     • Mutations:
       • Substitution: changes a single aa
       • Insertion/deletion: “frame shift” (changes all subsequent aa)
       • NB! The length of an insertion/deletion might be a multiple of 3...
       • “Silent mutation”: DNA changed, but not the aa
       • “Nonsense mutation”: creates a “stop” codon

  9. Genetic code. Completely worked out in 1962
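The mutation types listed on the “From DNA to proteins” slide can be tried out with a small translation routine. The sketch below assumes the standard genetic code (codons enumerated in T, C, A, G order); the DNA strings and the `translate` helper are made up for illustration and do not come from the slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate coding DNA codon by codon; '*' marks a stop codon.

    A trailing incomplete codon is ignored.
    """
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Hypothetical example sequences.
original = "ATGGCCTTTAAA"    # ATG GCC TTT AAA
silent = "ATGGCATTTAAA"      # GCC -> GCA: DNA changed, amino acid unchanged
substituted = "ATGGCCTTAAAA" # TTT -> TTA: a single amino acid changes
nonsense = "ATGGCCTTTTAA"    # AAA -> TAA: creates a stop codon
frameshift = "ATGCCTTTAAA"   # one G deleted: all later codons are shifted

for name, seq in [("original", original), ("silent", silent),
                  ("substituted", substituted), ("nonsense", nonsense),
                  ("frameshift", frameshift)]:
    print(f"{name:12s} {seq:13s} {translate(seq)}")
```

Deleting three nucleotides instead of one would remove a whole codon and leave the rest of the reading frame intact, which is the point of the “multiple of 3” remark.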

  10. Evolution of sequences
      • Mutations are a natural process of DNA evolution
      • DNA replication errors:
        • substitutions
        • insertions
        • deletions
        (insertions and deletions together are called indels)
      • Similarity between sequences:
        • indicates their common ancestral origin
        • indicates similarity of biological functions
      • Well, this is of course a simplification: the change in protein function will determine whether the organism has offspring and whether the changed gene survives
      • Protein sequence similarity is closely associated with similarity of DNA coding regions

  11. Sequence evolution
      • Each codon consists of 3 nucleotides
      • Mutations:
        • Substitution: changes a single aa
        • Insertion/deletion: “frame shift” (changes all subsequent aa)
        • NB! The length of an insertion/deletion might be a multiple of 3...
        • “Silent mutation”: DNA changed, but not the aa
        • “Nonsense mutation”: creates a “stop” codon

  12. Sequence evolution [Figure: diagram of related sequences ggcatt, agcatt, agcata, agccta, aggatt, agcatg, gacatt]

  13. Sequence homology
      • Homologs: evolved from a common ancestor
      • Orthologs: the same function in different organisms
      • Paralogs: similar function in the same organism

  14. Orthologs vs paralogs [Adapted from R.Shamir]

  15. How to compare sequences? Given two proteins:
      >sp|P69905|HBA_HUMAN Hemoglobin
      VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
      TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
      PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
      AEFTPAVHASLDKFLASVSTVLTSKYR
      >tr|Q61287|Q61287_MOUSE Hemoglobin
      MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
      TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
      LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
      PADFTPAVHASLDKFLASVSTVLTSKYR
      How to assess their similarity?

  16. Sequence alignment - BLAST

  17. Sequence alignment - BLAST

  18. Sequence alignment – the results we expect

  19. Sequence alignment - SSEARCH

  20. Sequence alignment - SSEARCH

  21. Sequence alignment - scores
      • Sequence similarity/identity (%): well-defined for the aligned parts of the sequences
      • “Score”: usually very method-specific in absolute value
      • p-value: the probability that an alignment with the given score or higher is found by chance. Normally the reported values are only approximations
      • Expect (E) value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The lower the better, but short similar sequences might have comparatively high values (the E-value decreases exponentially with the score)
      • Z-score: the number of standard deviations from the mean value
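To make the remark about exponential decay concrete, here is a minimal sketch assuming the Karlin-Altschul form E = K·m·n·e^(−λS), which is the usual model behind BLAST-style E-values but is not stated on the slide. The default values of `lam` and `k` are illustrative assumptions only; real values depend on the scoring matrix and gap penalties.

```python
import math


def e_value(score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance hits with at least this score.

    Assumes E = K * m * n * exp(-lambda * S); lam and k are illustrative.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, assuming hits are Poisson-distributed."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e


if __name__ == "__main__":
    for s in (30, 40, 50, 60):
        e = e_value(s, query_len=100, db_len=10**6)
        print(f"score={s}  E={e:.3g}  p={p_value(e):.3g}")
```

In this model a short query (small `query_len`) lowers E for a fixed score, but short sequences also cannot reach high scores, which is why their E-values tend to stay comparatively high.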

  22. Z-score

  23. Z-score
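One common way to estimate a Z-score is to compare the observed score with the scores obtained after shuffling one of the sequences. The sketch below assumes that procedure and, to stay self-contained, uses the LCS length as a stand-in score; the function names, the number of shuffles and the seed are all illustrative choices rather than anything prescribed by the slides.

```python
import random
import statistics


def lcs_length(a, b):
    """Length of the longest common subsequence (used here as a simple score)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def z_score(a, b, score=lcs_length, shuffles=200, seed=0):
    """(observed - mean) / standard deviation over scores against shuffled b."""
    rng = random.Random(seed)
    observed = score(a, b)
    letters = list(b)
    background = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        background.append(score(a, "".join(letters)))
    sd = statistics.pstdev(background)
    if sd == 0:
        return 0.0  # no spread in the background: the Z-score is uninformative
    return (observed - statistics.mean(background)) / sd


if __name__ == "__main__":
    print(z_score("GADTAMAWGRAMMA", "GAGAWKIAMM"))
```

Shuffling keeps the length and letter composition of the sequence, so a high Z-score suggests that the order of the residues, not just their frequencies, explains the score.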

  24. How to align two sequences - BLAST
      • Find two regions of exact similarity (usually 4 aa each)
      • Try to join and extend these matches until the score falls below a threshold
      • But how should we do this “correctly”?
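The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified illustration, not BLAST's actual implementation: it uses single exact k-mer seeds, identity scoring (+1/−1) instead of a substitution matrix, and ungapped extension with a hypothetical `x_drop` cut-off. The two fragments in the usage example are the first residues of the hemoglobin sequences shown earlier.

```python
from collections import defaultdict


def find_seeds(query, subject, k=4):
    """Return all (i, j) with query[i:i+k] == subject[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


def extend(query, subject, i, j, k=4, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Each direction stops once the running score falls more than x_drop
    below the best score seen so far; only the best-scoring part is kept.
    Returns (score, query_start, subject_start, length).
    """
    def walk(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right, right_len = walk(i + k, j + k, 1)
    left, left_len = walk(i - 1, j - 1, -1)
    return (k * match + left + right, i - left_len, j - left_len,
            k + left_len + right_len)


if __name__ == "__main__":
    q = "VLSPADKTNVKAAWGKV"
    s = "MVLSGEDKSNIKAAWGKI"
    for i, j in find_seeds(q, s):
        print((i, j), extend(q, s, i, j))
```

Such a heuristic is fast because it only looks near the seeds, but it gives no guarantee of finding the best alignment, which motivates the dynamic programming approach that follows.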

  25. The “Manhattan Tourist” problem. Visit as many sights as possible, starting from the top-left corner and moving only down or right

  26. Longest common subsequence. Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such a C need not be unique).
      Example:
      A = GGATATCGGGCGAT
      B = ATTCCCCCGCCCTA
      C = ATTCGCA or ATTCGCT
      How can we find it?

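  27. LCS – dynamic programming solution
      A = a_1 a_2 … a_n, B = b_1 b_2 … b_m
      c(i,k): length of the LCS of a_1 a_2 … a_i and b_1 b_2 … b_k
      \[
      c(i,k) =
      \begin{cases}
      0, & \text{if } i = 0 \text{ or } k = 0 \\
      c(i-1,k-1) + 1, & \text{if } i, k > 0 \text{ and } a_i = b_k \\
      \max\{c(i,k-1),\, c(i-1,k)\}, & \text{if } i, k > 0 \text{ and } a_i \neq b_k
      \end{cases}
      \]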
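The recurrence for c(i,k) translates almost directly into code. The following is a minimal sketch that fills the table and then traces back from c(n,m) to recover one LCS; the traceback and its tie-breaking rule are choices made here, since the slide only defines the lengths.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # c[i][k] = length of the LCS of a[:i] and b[:k]
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])

    # Trace back from the bottom-right cell, collecting matched symbols.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i -= 1
            k -= 1
        elif c[i - 1][k] >= c[i][k - 1]:
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
    print(lcs("GADTAMAWGRAMMA", "GAGAWKIAMM"))
```

When several subsequences of maximal length exist, a different tie-breaking rule in the traceback may return a different one of them.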

  28. LCS – example
      A = GADTAMAWGRAMMA
      B = GAGAWKIAMM

  29. LCS - example

  30. LCS - example

  31. LCS - example

  32. LCS - example

  33. LCS - example

  34. LCS - example

  35. LCS - example

  36. LCS - example

  37. LCS - example

  38. LCS - example

  39. LCS - example

  40. LCS - example

  41. LCS - example
      LCS: GAAWAMM
      Alignment:
      GA-DTAMAW--GRAMMA
      GAG----AWKI--AMM-

  42. Edit distance
      • Levenshtein 1966
      • The minimal number of operations that transforms one sequence into another
      • Operations: insert, delete, substitute (one symbol at a time)
      • The edit distance is 0 (the sequences are identical) or positive
      • For example, “AIMS” and “AMOS”: distance = 2 for all three solutions shown on the slide, e.g. two substitutions, or one deletion plus one insertion:
            AIMS      AIM-S
            AMOS      A-MOS
      [Adapted from D. Gilbert]

  43. Edit distance. Given two sequences A and B, find the smallest possible number of insertion, deletion and substitution operations that changes A into B.
      Example:
      A = GGATATCGGGCGAT
      B = ATTCCCCCGCCCTA
      [G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A]
      ED = 12?
      How can we find it?

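  44. Edit distance
      A = a_1 a_2 … a_n, B = b_1 b_2 … b_m
      e(i,k): edit distance between a_1 a_2 … a_i and b_1 b_2 … b_k
      \[
      e(i,k) =
      \begin{cases}
      i, & \text{if } k = 0 \\
      k, & \text{if } i = 0 \\
      e(i-1,k-1), & \text{if } i, k > 0 \text{ and } a_i = b_k \\
      \min\{e(i-1,k-1),\, e(i,k-1),\, e(i-1,k)\} + 1, & \text{if } i, k > 0 \text{ and } a_i \neq b_k
      \end{cases}
      \]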
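As a sketch, the recurrence for e(i,k) can be implemented with a full (n+1) × (m+1) table. The function name is an illustrative choice, and “kitten”/“sitting” is a standard textbook pair added here next to the AIMS/AMOS example from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, following the recurrence for e(i, k)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i                      # delete all i symbols of a[:i]
    for k in range(m + 1):
        e[0][k] = k                      # insert all k symbols of b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1],   # substitute
                                  e[i][k - 1],       # insert
                                  e[i - 1][k])       # delete
    return e[n][m]


if __name__ == "__main__":
    print(edit_distance("AIMS", "AMOS"))
    print(edit_distance("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```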

  45. ED - modifications
      \[ e(i,0) = i \cdot t, \qquad e(0,j) = j \cdot t \]
      \[
      e(i,j) = \min
      \begin{cases}
      e(i-1,j) + t \\
      e(i,j-1) + t \\
      e(i-1,j-1) + t(a_i,b_j)
      \end{cases}
      \]
      If you are interested in the result only «up to a sign», it does not matter whether min or max is used. min is more natural for ED, max for LCS. max is also the usual choice for sequence comparison.
      In sequence comparison, t(a_i,b_j) reflects the probability that amino acid a_i changes to amino acid b_j.
      For ED: t = 1; t(a_i,b_j) = 0 if a_i = b_j; t(a_i,b_j) = 1 if a_i ≠ b_j
      For «inverse» LCS: t = 0; t(a_i,b_j) = 1 if a_i = b_j; t(a_i,b_j) = 0 if a_i ≠ b_j
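The parametrised recurrence can be sketched as one function in which the gap cost t, the pair score t(a_i,b_j) and the choice of min or max are arguments. The boundary values are taken as i·t and j·t so that both special cases come out right. The third usage example (match +1, mismatch −1, gap −2) uses assumed, illustrative similarity scores.

```python
def dp_score(a, b, t, t_pair, choose=min):
    """Generic DP score: t is the gap cost, t_pair(x, y) the cost/score of
    pairing x with y, and choose is min (distance) or max (similarity)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e[i][0] = i * t
    for j in range(1, m + 1):
        e[0][j] = j * t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = choose(e[i - 1][j] + t,
                             e[i][j - 1] + t,
                             e[i - 1][j - 1] + t_pair(a[i - 1], b[j - 1]))
    return e[n][m]


def edit_distance(a, b):
    return dp_score(a, b, 1, lambda x, y: 0 if x == y else 1, min)


def lcs_length(a, b):
    return dp_score(a, b, 0, lambda x, y: 1 if x == y else 0, max)


def similarity(a, b):
    # Illustrative similarity scoring: match +1, mismatch -1, gap -2.
    return dp_score(a, b, -2, lambda x, y: 1 if x == y else -1, max)


if __name__ == "__main__":
    print(edit_distance("AIMS", "AMOS"))
    print(lcs_length("GADTAMAWGRAMMA", "GAGAWKIAMM"))
    print(similarity("AIMS", "AMOS"))
```

Replacing `t_pair` by a lookup in a substitution matrix turns the same routine into a global alignment scorer for proteins.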

  46. Substitution (similarity) matrices

         A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
      A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
      R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
      N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
      D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
      C  0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
      Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
      E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
      G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
      H -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
      I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
      L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
      K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
      M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
      F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
      P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
      S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
      T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
      W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
      Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
      V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

      • Similarity matrix
      • Most popular:
        • PAM
        • BLOSUM
        • Gonnet
      • The one shown is BLOSUM62 (almost :)
      "Traditional assumption": the substitution score is > 0 for substitutions that are more frequent than expected by chance, and < 0 for those that are less frequent than expected by chance.
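The “traditional assumption” corresponds to a log-odds score: the logarithm of the observed frequency of an aligned pair divided by the frequency expected if the two residues were paired at random. The sketch below assumes that form; the frequencies in the usage example are hypothetical, and the scale factor 2 (half-bit units) is one common convention rather than something given on the slide.

```python
import math


def log_odds_score(q_ab, p_a, p_b, scale=2.0):
    """scale * log2(observed pair frequency / frequency expected by chance).

    q_ab is the observed frequency of the aligned pair (a, b);
    p_a and p_b are the background frequencies of a and b.
    """
    return scale * math.log2(q_ab / (p_a * p_b))


if __name__ == "__main__":
    # Hypothetical frequencies, for illustration only.
    print(log_odds_score(0.02, 0.1, 0.1))    # pair seen twice as often as chance
    print(log_odds_score(0.005, 0.1, 0.1))   # pair seen half as often as chance
```

Published matrices round such values to integers, so a positive entry means "more frequent than chance" and a negative entry "less frequent than chance".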

  47. Sequence similarity as the longest path problem. We can treat the matrix as a graph with weighted edges. The problem then translates to finding the path with the largest/smallest weight in a directed acyclic graph.

  48. Complexity of similarity computation
      • Size of the matrix: nm
      • Computing the value of each cell: constant time
      • Total time: Θ(nm)
      • Total memory: Θ(nm)
      Notice that if we want just the score, only two rows are needed. In this case the required memory is Θ(min(n, m)). However, what if we also need the alignment (and we usually do)?
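A minimal sketch of the two-row idea, using edit distance as the score: only the previous and the current row are kept, and the shorter sequence is placed along the row so that the extra memory is proportional to the shorter length. The function name is an illustrative choice.

```python
def edit_distance_two_rows(a, b):
    """Edit distance using only two rows of the DP table."""
    if len(b) > len(a):
        a, b = b, a                  # edit distance is symmetric; keep rows short
    prev = list(range(len(b) + 1))   # row 0: e(0, k) = k
    for i, x in enumerate(a, 1):
        cur = [i]                    # e(i, 0) = i
        for k, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[k - 1])
            else:
                cur.append(1 + min(prev[k - 1], cur[k - 1], prev[k]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance_two_rows("AIMS", "AMOS"))
```

The price is that the earlier rows are discarded, so the path through the table, and with it the alignment, can no longer be traced back directly.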

  49. Edit distance in linear space?

  50. Interpretation of comparison results Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).
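The correspondence between paths and alignments can be made explicit with a short sketch. It assumes a path is given as a list of grid points from (0,0) to (n,m), where a diagonal step pairs two symbols, a step down consumes a symbol of the first sequence only, and a step right consumes a symbol of the second sequence only. The function name and this orientation are illustrative choices.

```python
def path_to_alignment(a, b, path):
    """Convert a path in the edit graph into the two rows of an alignment."""
    top, bottom = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):       # diagonal: a[i0] aligned with b[j0]
            top.append(a[i0])
            bottom.append(b[j0])
        elif step == (1, 0):     # down: a[i0] aligned with a gap
            top.append(a[i0])
            bottom.append("-")
        elif step == (0, 1):     # right: gap aligned with b[j0]
            top.append("-")
            bottom.append(b[j0])
        else:
            raise ValueError(f"invalid step {step}")
    return "".join(top), "".join(bottom)


if __name__ == "__main__":
    path = [(0, 0), (1, 1), (2, 1), (3, 2), (3, 3), (4, 4)]
    for row in path_to_alignment("AIMS", "AMOS", path):
        print(row)
```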
