
ALIGNMENT



  1. ALIGNMENT: How do we tell whether two macromolecules are similar? Why? SEQUENCE, STRUCTURE, FUNCTION. Chuck Staben

  2. Alignments • DNA:DNA • polypeptide:polypeptide

  3. Alignments • One-to-One • One-to-Database • Many-to-Many

  4. Origins of Sequence Similarity
     • Homology: common evolutionary descent (History)
     • Similarity in function: convergence (Necessity)
     • Chance (Serendipity)

  5. Similarity
       GAACAAT
       |||||||    7/7, or 100%
       GAACAAT
     MISMATCH
       GAACAAT
       ||| |||    6/7, or 86%
       GAATAAT
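
The percentages on this slide are plain identity counts over the aligned columns. A minimal sketch of that calculation follows; the function name `percent_identity` is ours, not from the slides. Note that 6/7 works out to about 85.7%.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of aligned columns where both sequences carry the same residue."""
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


print(round(percent_identity("GAACAAT", "GAACAAT"), 1))  # 100.0
print(round(percent_identity("GAACAAT", "GAATAAT"), 1))  # 85.7
```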

  6. Mismatches
       GAACAAT
       ||| |||    6/7, or 86%
       GAATAAT
     Same??
       GAACAAT
       ||| |||    6/7, or 86%
       GAAGAAT

  7. Terminal Mismatch: count this?
            GAACAATttttt
            ||| |||         6/7, or 86%
       aaaccGAATAAT

  8. INDELS
       GAAgCAAT
       ||| ||||   7/7, or 100%
       GAA*CAAT
     (alignment-challenged?)

  9. Indels, cont’d
       GAAgCAAT
       ||| ||||
       GAA*CAAT
     vs.
       GAAggggCAAT
       |||    ||||
       GAA****CAAT

  10. Similarity Scoring (DNA defaults, BESTFIT)
      • Terminal mismatches (0)
      • Match score (10)
      • Mismatch penalty (-9)
      • Gap penalty (50)
      • Gap extension penalty (3)

  11. DNA Scoring
       GGGGGGGGGG
       |||||*****            5(10) - 5(9) = 5
       GGGGGAAAAAGGGGG

       GGGGG*****GGGGG
       |||||     |||||       10(10) - 50 - 5(3) = 35
       GGGGGAAAAAGGGGG
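
The arithmetic on this slide can be reproduced with a small scoring routine. This is a sketch under our reading of the slide: each gap costs the gap penalty once plus the extension penalty for every gapped position, and unaligned terminal overhang is simply left out of the input (terminal mismatches score 0). The names `alignment_score`, `GAP_OPEN` and `GAP_EXTEND` are ours.

```python
MATCH, MISMATCH = 10, -9      # match score and mismatch penalty from slide 10
GAP_OPEN, GAP_EXTEND = 50, 3  # gap penalty and gap extension penalty


def alignment_score(top: str, bottom: str, gap: str = "*") -> int:
    """Score an already-aligned pair of equal-length rows, column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    open_in = None  # which row currently has an open gap, if any
    for x, y in zip(top, bottom):
        if x == gap or y == gap:
            side = "top" if x == gap else "bottom"
            if side != open_in:      # a new gap starts here
                score -= GAP_OPEN
                open_in = side
            score -= GAP_EXTEND      # charged for every gapped position
        else:
            score += MATCH if x == y else MISMATCH
            open_in = None
    return score


# Ungapped: only the 10 overlapping columns are scored.
print(alignment_score("GGGGGGGGGG", "GGGGGAAAAA"))            # 5
# Gapped: 10 matches, one gap of length 5.
print(alignment_score("GGGGG*****GGGGG", "GGGGGAAAAAGGGGG"))  # 35
```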

  12. Absurdity of Low Gap Penalty
       GATCGCTACGCTCAGC
        A.C.C..C..T
      Perfect similarity, Every time!
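
One way to see the point of this slide: if gaps cost nothing, a short sequence "aligns perfectly" whenever its letters occur in order somewhere in the longer one, which is just a subsequence test. A small sketch, with a function name of our own choosing:

```python
def aligns_perfectly_with_free_gaps(short: str, long: str) -> bool:
    """True if `short` can be spread over `long` using only matches and free gaps."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator up to the first match,
    # so letters must be found in left-to-right order.
    return all(ch in remaining for ch in short)


print(aligns_perfectly_with_free_gaps("ACCCT", "GATCGCTACGCTCAGC"))  # True
```

Short queries pass this test against almost any long sequence, so a perfect score obtained this way carries no information.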

  13. Algorithms: Optimal Score = Optimal Alignment
      • Needleman-Wunsch: dynamic programming
      • Smith-Waterman: optimal local alignment
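
Both algorithms fill the same dynamic-programming table and differ mainly in how the edges are initialised and whether scores may drop below zero. The sketch below returns only the optimal score, with no traceback, and assumes a single linear gap penalty for brevity rather than the gap-plus-extension scheme of slide 10. The value -12 is an arbitrary illustration.

```python
def optimal_score(a: str, b: str, match: int = 10, mismatch: int = -9,
                  gap: int = -12, local: bool = False) -> int:
    """Needleman-Wunsch (global) or Smith-Waterman (local) score, linear gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment starts free.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            table[i][j] = cell
            best = max(best, cell)
    return best if local else table[-1][-1]


print(optimal_score("ggggg", "ttgggggtt", local=True))   # 50
print(optimal_score("ggggg", "ttgggggtt", local=False))  # 2
```

The same five matching bases score 50 locally but only 2 end-to-end, because the global alignment must pay for four gap positions.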

  14. Programs
      • BESTFIT
        • Smith-Waterman
        • SINGLE BEST SIMILARITY
      • GAP
        • Needleman-Wunsch
        • End-to-end ALWAYS
      • COMPARE/DOTPLOT
        • COMPLETE surface of comparison

  15. BESTFIT vs GAP
      BESTFIT
          1 ggggg 5
            |||||
          3 ggggg 7
      GAP
          1 ...gggggaaaaaggggccccc 19
               ||        ||||   ||
          1 gggggttttttttggggtttcc 22

  16. Statistical Significance: RaNdOmIzE
      Quality: 50   Length: 5   Similarity: 100.000   Identity: 100.000
      Average quality, 20 randomizations: 34.2 +/- 9.4
      Quality > RANDOM + 2(σ)
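
A randomization test of this kind can be sketched as follows. We assume one sequence is shuffled while the other is held fixed, and we read the slide's cutoff as "mean of the randomized scores plus two standard deviations". The ungapped scoring function is a stand-in for the real aligner, and all names here are ours.

```python
import random
import statistics


def best_ungapped_score(a: str, b: str, match: int = 10, mismatch: int = -9) -> int:
    """Best-scoring ungapped local segment over every diagonal (stand-in aligner)."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        run = 0
        for i in range(max(0, offset), min(len(a), len(b) + offset)):
            run = max(0, run + (match if a[i] == b[i - offset] else mismatch))
            best = max(best, run)
    return best


def is_significant(quality: float, mean: float, sd: float, k: float = 2.0) -> bool:
    """Slide 16 rule of thumb: quality must exceed the random mean by k SDs."""
    return quality > mean + k * sd


def shuffle_test(a: str, b: str, n: int = 20, seed: int = 0):
    """Compare the observed score with scores against n shuffled copies of b."""
    rng = random.Random(seed)
    observed = best_ungapped_score(a, b)
    scores = []
    for _ in range(n):
        letters = list(b)
        rng.shuffle(letters)  # same composition, scrambled order
        scores.append(best_ungapped_score(a, "".join(letters)))
    mean, sd = statistics.mean(scores), statistics.stdev(scores)
    return observed, mean, sd, is_significant(observed, mean, sd)


# With the numbers on the slide: 34.2 + 2 * 9.4 = 53.0, and 50 does not exceed it.
print(is_significant(50, 34.2, 9.4))  # False
```

By this rule the five-base perfect match on the slide (quality 50) falls just short of the cutoff of 53.0.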

  17. Program Limitations (Memory) • BESTFIT: 1000 vs 10,000 • GAP: 1000 vs 1000 • COMPARE: 1000 vs 1000

  18. Protein Similarity • Identity-Easy WEAK Alignments • Chemical Similarity • L vs I, K vs R… • Evolutionary Similarity

  19. Single-Base Evolution
      CAU=H, after one base change: CAC=H, CGU=R, UAU=Y, CAA=Q, CCU=P, GAU=D, CAG=Q, CUU=L, AAU=N
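
The nine codons on this slide are exactly the single-base neighbors of CAU. A short sketch that enumerates them from the standard genetic code follows; the compact table construction and the function name are our own.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def single_base_neighbors(codon: str):
    """All codons one substitution away, paired with the amino acid they encode."""
    neighbors = []
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                neighbors.append((mutant, CODON_TABLE[mutant]))
    return neighbors


for mutant, aa in single_base_neighbors("CAU"):
    print(mutant, aa)
```

Only one of the nine changes (CAC) leaves histidine unchanged, which is part of why some amino acid substitutions are far more common than others.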

  20. Substitution Matrices • PAM-Dayhoff • BLOSUM-Henikoff

  21. PAM-Dayhoff
      • Related proteins, substitutions constrained by evolution and function
      • “accepted” by evolution (point accepted mutation)
      • 1 PAM :: 1% divergence
        • PAM120 = closely related proteins
        • PAM250 = divergent proteins
      • Log/odds approach

  22. BLOSUM-Henikoff & Henikoff
      • Align “BLOCKS”
      • Merge blocks at given % similar to one sequence
      • Calculate “target” frequencies
      • BLOSUM62 = 62% similar blocks
        • good general purpose
      • BLOSUM30
        • weak similarities
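
Both matrix families turn observed pair frequencies into log-odds scores: how much more often two residues are seen aligned (the "target" frequency) than expected from their background frequencies alone. A minimal sketch follows. The default scale and log base are assumptions, since the two families use different conventions, and the input frequencies in the example are invented.

```python
import math


def log_odds(q_ab: float, p_a: float, p_b: float,
             scale: float = 2.0, base: float = 2.0) -> float:
    """scale * log_base(q_ab / (p_a * p_b)); positive means 'seen more than chance'."""
    return scale * math.log(q_ab / (p_a * p_b), base)


# Invented frequencies: the pair is observed twice as often as chance predicts.
print(round(log_odds(0.02, 0.1, 0.1), 3))
# Invented frequencies: the pair is observed half as often as chance predicts.
print(round(log_odds(0.005, 0.1, 0.1), 3))
```

In practice the scores are rounded to integers before being written into a matrix.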

  23. BLOSUM62

  24. BLOSUM62-2
      Glu  Asp  Gln  Lys  Arg  His  Gly  Ala
      GAA  GAU  CAA  AAA  AGA  CAU  GGA  GCA
      GAG  GAC  CAG  AAG  AGG  CAC  GGG  GCG

  25. Gaps • No general theory!! • G+L(n) • indel mutations rare • variation in length “easy”

  26. Alignment Statistics
      • Ungapped, local alignments (HSPs)
        • extreme value, not normal distribution
      • S (observed score) vs expected distribution: p
      • E = expected number of chance alignments
      • K, λ: distribution parameters
      “chance of finding a needle in a haystack depends on the size of the haystack”
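
For ungapped local alignments, the Karlin-Altschul expectation takes the form E = K·m·n·exp(-λS), where m and n are the query and database lengths. The sketch below uses placeholder values for K and λ. Real values depend on the scoring matrix and residue composition, so the numbers here are illustrative only.

```python
import math


def expected_chance_hits(score: float, m: int, n: int, K: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S): chance alignments scoring at least S."""
    return K * m * n * math.exp(-lam * score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    return 1.0 - math.exp(-e_value)


K, LAM = 0.1, 0.3  # placeholder parameters, not taken from any real matrix
small = expected_chance_hits(50, m=100, n=1_000_000, K=K, lam=LAM)
large = expected_chance_hits(50, m=100, n=2_000_000, K=K, lam=LAM)
print(large / small)  # the haystack doubled, so E doubles
```

This is the needle-in-a-haystack remark in numbers: the same score becomes less surprising as the database grows.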

  27. “Real” Alignments • Multiple HSPs • Karlin-Altschul Sum Statistics • Heuristic qualities • alignments proceed end-to-end ????

  28. Real Alignments Protein-Protein Close-Distant DNA-DNA

  29. Phylogeny GCG Myoglobin

  30. Cow-to-Pig 88% identical

  31. Cow-to-Pig cDNA 80% Identity (88% at aa!)

  32. DNA similarity reflects polypeptide similarity

  33. Coding vs Non-coding Regions 90% in Coding 74% in Non-coding

  34. Third Base of Codon Hypervariable: 28 third base, 11 second, 8 first
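
A tally like the one on this slide can be produced by counting mismatches per codon position. The sketch assumes two coding sequences that are already aligned in frame with no gaps. The example sequences are toy input built from the synonymous codon pairs of slide 24, not the cow and pig data.

```python
def mismatches_by_codon_position(a: str, b: str):
    """Return [first, second, third] mismatch counts for in-frame, ungapped rows."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    counts = [0, 0, 0]
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            counts[i % 3] += 1  # index 0, 1, 2 = codon position 1, 2, 3
    return counts


# GAA/GAG (Glu), CAU/CAC (His), GGA/GGG (Gly): every change is at the third base.
print(mismatches_by_codon_position("GAACAUGGA", "GAGCACGGG"))  # [0, 0, 3]
```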

  35. Cow-to-Fish Protein 42% identity 51% similarity

  36. Cow-to-Fish DNA 48% similarity Significant

  37. Protein vs DNA Alignments • Polypeptide similarity > DNAs • Coding DNA > Non-coding • 3rd base of codon hypervariable • Moderate distance → poor DNA similarity
