1 / 43

Sequence Similarity

Sequence Similarity. Why sequence similarity. structural similarity >25% sequence identity  similar structure evolutionary relationship all proteins come from < 2000 (super) families related functional role similar structure  similar function functional modules are often preserved.

katy
Download Presentation

Sequence Similarity

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence Similarity

  2. Why sequence similarity • structural similarity >25% sequence identity  similar structure • evolutionary relationship all proteins come from < 2000 (super)families • related functional role similar structure similar function functional modules are often preserved

  3. Muscle cells and contraction

  4. Actin and myosin during muscle movement

  5. Actin structure

  6. Actin sequence • Actin is ancient and abundant • Most abundant protein in cells • 1-2 actin genes in bacteria, yeasts, amoebas • Humans: 6 actin genes • -actin in muscles; -actin, -actin in non-muscle cells • ~4 amino acids different between each version MUSCLE ACTIN Amino Acid Sequence 1 EEEQTALVCD NGSGLVKAGF AGDDAPRAVF PSIVRPRHQG VMVGMGQKDS YVGDEAQSKR 61 GILTLKYPIE HGIITNWDDM EKIWHHTFYN ELRVAPEEHP VLLTEAPLNP KANREKMTQI 121 MFETFNVPAM YVAIQAVLSL YASGRTTGIV LDSGDGVSHN VPIYEGYALP HAIMRLDLAG 181 RDLTDYLMKI LTERGYSFVT TAEREIVRDI KEKLCYVALD FEQEMATAAS SSSLEKSYEL 241 PDGQVITIGN ERFRGPETMF QPSFIGMESS GVHETTYNSI MKCDIDIRKD LYANNVLSGG 301 TTMYPGIADR MQKEITALAP STMKIKIIAP PERKYSVWIG GSILASLSTF QQMWITKQEY 361 DESGPSIVHR KCF

  7. A related protein in bacteria

  8. Relation between sequence and structure

  9. A multiple alignment of actins

  10. DNA transcription RNA translation Protein Gene expression CCTGAGCCAACTATTGATGAA CCUGAGCCAACUAUUGAUGAA PEPTIDE

  11. Biomolecules as Strings • Macromolecules are the chemical building blocks of cells • Proteins • 20 amino acids • Nucleic acids • 4 nucleotides {A, C, G, ,T} • Polysaccharides

  12. The information is in the sequence • Sequence  Structure  Function • Sequence similarity  Structural and/or Functional similarity • Nucleic acids and proteins are related by molecular evolution • Orthologs: two proteins in animals X and Y that evolved from one protein in immediate ancestor animal Z • Paralogs: two proteins that evolved from one protein through duplication in some ancestor • Homologs: orthologs or paralogs that exhibit sequence similarity

  13. Protein Phylogenies • Proteins evolve by both duplication and species divergence duplication orthologs paralogs

  14. Evolution

  15. Evolution at the DNA level Deletion Mutation …ACGGTGCAGTTACCA… SEQUENCE EDITS …AC - - - - CAGTCACCA… REARRANGEMENTS Inversion Translocation Duplication

  16. Evolutionary Rates next generation OK OK OK Changes in non-functional sites are OK, so will be propagated X X Still OK? Most changes in functional sites are deleterious and will be rejected

  17. Sequence conservation implies function Proteins between humans and rodents are on average 85% identical

  18. Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Definition Given two strings x = x1x2...xM, y = y1y2…yN, an alignment is an assignment of gaps to positions 0,…, M in x, and 0,…, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence

  19. What is a good alignment? Alignment: The “best” way to match the letters of one sequence with those of the other How do we define “best”? Alignment: A hypothesis that the two sequences come from a common ancestor through sequence edits Parsimonious explanation: Find the minimum number of edits that transform one sequence into the other

  20. Scoring Function • Sequence edits: AGGCCTC • Mutations AGGACTC • Insertions AGGGCCTC • Deletions AGG–CTC Scoring Function: Match: + m Mismatch: – s Gap: – d Score F = (# matches)  m – (# mismatches)  s – (#gaps)  d

  21. How do we compute the best alignment? AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA Too many possible alignments: O( 2M+N) AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC

  22. Alignment is additive Observation: The score of aligning x1……xM y1……yN is additive Say that x1…xi xi+1…xM aligns to y1…yj yj+1…yN The two scores add up: F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N]) Key property: optimal solution to the entire problem is composed of optimal solutions to subproblems – Dynamic Programming

  23. Dynamic Programming Construct a DP matrix F: MxN: Suppose we wish to align x1……xM y1……yN Let F(i, j) = optimal score of aligning x1……xi y1……yj

  24. Dynamic Programming (cont’d) Notice three possible cases: • xi aligns to yj x1……xi-1 xi y1……yj-1 yj 2. xi aligns to a gap x1……xi-1 xi y1……yj - • yj aligns to a gap x1……xi - y1……yj-1 yj m, if xi = yj F(i, j) = F(i-1, j-1) + -s, if not F(i, j) = F(i-1, j) – d F(i, j) = F(i, j-1) – d

  25. Dynamic Programming (cont’d) • How do we know which case is correct? Inductive assumption: F(i, j – 1), F(i – 1, j), F(i – 1, j – 1) are optimal Then, F(i – 1, j – 1) + s(xi, yj) F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d Where s(xi, yj) = m, if xi = yj; -s, if not

  26. Example x = AGTA m = 1 y = ATA s = -1 d = -1 F(i,j) i = 0 1 2 3 4 Optimal Alignment: F(4, 3) = 2 AGTA A - TA j = 0 1 2 3

  27. The Needleman-Wunsch Algorithm • Initialization. • F(0, 0) = 0 • F(0, j) = - j  d • F(i, 0) = - i  d • Main Iteration. Filling-in partial alignments • For each i = 1……M For each j = 1……N F(i-1,j-1) + s(xi, yj) [case 1] F(i, j) = max F(i-1, j) – d [case 2] F(i, j-1) – d [case 3] DIAG, if [case 1] Ptr(i,j) = LEFT, if [case 2] UP, if [case 3] • Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment

  28. Performance • Time: O(NM) • Space: O(NM) • Possible to reduce space to O(N+M) using Hirschberg’s divide & conquer algorithm

  29. Substitutions of Amino Acids Mutation rates between amino acids have dramatic differences! How can we quantify the differences in rates by which one amino acid replaces another across related proteins?

  30. Substitution Matrices BLOSUM matrices: • Start from BLOCKS database (curated, gap-free alignments) • Cluster sequences according to > X% identity • Calculate Aab: # of aligned a-b in distinct clusters, correcting by 1/mn, where m, n are the two cluster sizes • Estimate P(a) = (b Aab)/(c≤d Acd); P(a, b) = Aab/(c≤d Acd)

  31. Gaps are not inserted uniformly

  32. A state model for alignment M (+1,+1) Alignments correspond 1-to-1 with sequences of states M, I, J I (+1, 0) J (0, +1) -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

  33. Let’s score the transitions s(xi, yj) M (+1,+1) s(xi, yj) s(xi, yj) Alignments correspond 1-to-1 with sequences of states M, I, J -d -d I (+1, 0) J (0, +1) -e -e -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

  34. A probabilistic model for alignment • Assign probabilities to every transition (arrow), and emission (pair of letters or gaps) • Probabilities of mutation reflect amino acid similarities • Different probabilities for opening and extending gap M (+1,+1) I (+1, 0) J (0, +1) -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC-GGTCGATTTGCCCGACC IMMJMMMMMMMJJMMMMMMJMMMMMMMIIMMMMMIII

  35. A Pair HMM for alignments log(1 – 2) M P(xi, yj) log Prob(xi, yj) log(1 – ) log(1 – ) log  log  log  log  I P(xi) J P(yj) Highest scoring path corresponds to the most likely alignment!

  36. How do we find the highest scoring path? • Compute the following matrices (DP) • M(i, j): most likely alignment of x1…xi with y1…yj ending in state M • I(i, j): most likely alignment of x1…xi with y1…yj ending in state I • J(i, j): most likely alignment of x1…xi with y1…yj ending in state J M(i, j) = log( Prob(xi, yj) ) + max{ M(i-1, j-1) + log(1-2), I(i-1, j) + log(1-), J(i, j-1) + log(1-) } I(i, j) = max{ M(i-1, j) + log , I(i-1, j) + log  } log(1 – 2) M P(xi, yj) log Prob(xi, yj) log(1 – ) log(1 – ) log  log  I P(xi) J P(yj) log  log 

  37. The Viterbi algorithm for alignment • For each i = 1, …, M • For each j = 1, …, N M(i, j) = log( Prob(xi, yj) ) + max { M(i-1, j-1) + log(1-2), I(i-1, j) + log(1-), J(i, j-1) + log(1-) } I(i, j) = max { M(i-1, j) + log , I(i-1, j) + log  } J(i, j) = max { M(i-1, j) + log , I(i-1, j) + log  } When matrices are filled, we can trace back from (M, N) the likeliest alignment

  38. One way to view the state paths – State M …… y1 yn x1 …… xm

  39. State I …… y1 yn x1 …… xm

  40. State J …… y1 yn x1 …… xm

  41. Putting it all together States I(i, j) are connected with states J and M (i-1, j) States J(i, j) are connected with states I and M (i-1, j) States M(i, j) are connected with states J and I (i-1, j-1) …… y1 yn x1 …… xm

  42. Putting it all together States I(i, j) are connected with states J and M (i-1, j) States J(i, j) are connected with states I and M (i-1, j) States M(i, j) are connected with states J and I (i-1, j-1) Optimal solution is the best scoring path from top-left to bottom-right corner This gives the likeliest alignment according to our HMM …… y1 yn x1 …… xm

  43. Yet another way to represent this model Ix Ix BEGIN END Iy Iy Mx1 Mxm Sequence X We are aligning, or threading, sequence Y through sequence X Every time yj lands in state xi, we get substitution score s(xi, yj) Every time yj is gapped, or some xi is skipped, we pay gap penalty

More Related