Bioinformatics • “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) • “Nothing in bioinformatics makes sense except in the light of Biology”
Evolution Three requirements: • Template structure providing stability (DNA) • Copying mechanism (meiosis) • Mechanism providing variation (mutations; insertions and deletions; crossing-over; etc.)
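Evolution
Ancestral sequence: ABCD
Descendants: ACCD (B → C, mutation) and ABD (C → ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD
The first alignment (AB─D) is the true alignment: its gap sits at the position of the deleted C.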
Example: Pairwise sequence alignment needs a sense of evolution • Global dynamic programming compares MDAGSTVILCFVG with MDAASTILCGS (related by evolution) in a search matrix • Scoring uses an amino acid exchange matrix and gap penalties (open, extension) • Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
Sequence alignment: History • 1970 Needleman-Wunsch: global pair-wise alignment • 1981 Smith-Waterman: local pair-wise alignment • 1984 Hogeweg-Hesper: progressive multiple alignment • 1989 Lipman-Altschul-Kececioglu: simultaneous multiple alignment • 1994 Hidden Markov Models (HMM) for multiple alignment • 1996 Iterative strategies for progressive multiple alignment revived • 1997 PSI-Blast (PSSM)
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion • 1 gap in 1 sequence: n+1 possibilities • 2 gaps in 1 sequence: (n+1)n • 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~$10^{88}$ alignments • 2 sequences of 1000 a.a.: ~$10^{600}$ alignments!
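The growth described on this slide can be checked numerically. The sketch below assumes the count is the central binomial coefficient C(2n, n) and that the approximation is 2^(2n) / sqrt(pi n); the function names are illustrative and not part of the lecture.

```python
import math

def count_alignments(n):
    """Exact central binomial coefficient C(2n, n) (assumed alignment count)."""
    return math.comb(2 * n, n)

def log10_approximation(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        exact_log10 = math.log10(count_alignments(n))
        # Print the exact and approximate orders of magnitude side by side
        print(n, round(exact_log10, 2), round(log10_approximation(n), 2))
```

Working with logarithms avoids floating-point overflow for large n, while `math.comb` keeps the exact count as an arbitrary-precision integer.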
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
attt---ggcggatcg-cctctacgggcc----
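Dynamic programming: Scoring alignments
$S_{a,b} = \sum_{l} s(a_l, b_l) + \sum_{\text{gaps}} gp(k)$
$gp(k) = p_i + k\,p_e$ (affine gap penalties)
The first sum runs over the aligned residue pairs and the second over the gaps, each of length k. $p_i$ and $p_e$ are the penalties for gap initialisation and extension, respectively.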
Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino acid exchange matrix (20 × 20) • Affine gap penalties (open 10, extension 1)
Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) + Po + 2Px + s(L,I) + s(K,K)
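The score on this slide can be worked through in a few lines. This is a minimal sketch under stated assumptions: the exchange values are the PAM250 entries for the pairs that occur in the example, the gap penalties are taken as open = 10 and extension = 1, and penalties are subtracted from the score. The names `exchange` and `score_alignment` are hypothetical.

```python
# PAM250 exchange values for the residue pairs used in the example alignment
PAM250_SUBSET = {
    ("T", "T"): 3, ("D", "D"): 4, ("W", "W"): 17,
    ("V", "L"): 2, ("L", "I"): 2, ("K", "K"): 5,
}

def exchange(a, b):
    """Look up a symmetric exchange value for residues a and b."""
    if (a, b) in PAM250_SUBSET:
        return PAM250_SUBSET[(a, b)]
    return PAM250_SUBSET[(b, a)]

def score_alignment(top, bottom, p_open=10, p_ext=1):
    """Sum exchange values and subtract p_open + k * p_ext per gap of length k.

    Simplification: adjacent gap columns are merged into one gap even if
    they switch from one sequence to the other.
    """
    total, gap_len = 0, 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_len += 1
            continue
        if gap_len:
            total -= p_open + gap_len * p_ext
            gap_len = 0
        total += exchange(a, b)
    if gap_len:  # gap running to the end of the alignment
        total -= p_open + gap_len * p_ext
    return total

if __name__ == "__main__":
    print(score_alignment("TDWVTALK", "TDWL--IK"))
```

With these assumed values the residue pairs contribute 3 + 4 + 17 + 2 + 2 + 5 = 33 and the single gap of length 2 costs 10 + 2 × 1 = 12.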
Amino acid exchange matrices (20 × 20) • How do we get one? • And how do we get associated gap penalties? • First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1978): Atlas of Protein Sequence and Structure
PAM250 matrix: amino acid exchange matrix (log odds)
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4  -4  -5  12
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
Positive exchange values denote mutations that are more likely than randomly expected, while negative values correspond to mutations that occur less often than randomly expected.
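The sign convention of a log-odds matrix can be illustrated with a small sketch. It assumes a score of the form scale × log10(observed / expected), with scale = 10 and expected = product of the two background frequencies. The frequencies in the usage lines are made-up illustrative numbers, not Dayhoff's data, and `log_odds` is a hypothetical helper.

```python
import math

def log_odds(observed_pair_freq, freq_a, freq_b, scale=10):
    """Rounded log-odds score: positive if the pair is seen more often than chance."""
    expected = freq_a * freq_b  # chance frequency of the pair if residues were independent
    return round(scale * math.log10(observed_pair_freq / expected))

if __name__ == "__main__":
    # Hypothetical frequencies, chosen only to show the sign behaviour
    print(log_odds(0.02, 0.1, 0.1))   # pair observed twice as often as expected
    print(log_odds(0.005, 0.1, 0.1))  # pair observed half as often as expected
```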
Pairwise sequence alignment • Global dynamic programming compares MDAGSTVILCFVG with MDAASTILCGS (related by evolution) in a search matrix • Scoring uses an amino acid exchange matrix and gap penalties (open, extension) • Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
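Global dynamic programming
$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)\,P_x \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)\,P_x \} \end{cases}$$
Here $s_{i,j}$ is the exchange value for residues i and j, and $P_i$ and $P_x$ are the gap initialisation and extension penalties (the subscripts are labels, not matrix indices). Cell (i, j) is reached either from the diagonal neighbour (i-1, j-1) or from an earlier cell in row i-1 or column j-1, at the cost of a gap.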
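The recurrence can be sketched as a score-only routine. This is an illustration under assumptions, not the lecture's implementation: a virtual start cell S[0][0] = 0 and a virtual end cell are added so that leading and trailing gaps are penalised like internal ones, and a simple match/mismatch function stands in for the amino acid exchange matrix in the usage lines.

```python
NEG_INF = float("-inf")

def global_score(a, b, score, p_open, p_ext):
    """Best global alignment score; S[i][j] ends with a[i-1] paired to b[j-1]."""
    n, m = len(a), len(b)
    S = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0  # virtual start cell (assumption, not on the slide)

    def best_previous(i, j):
        # Diagonal neighbour: no gap before the pair (i, j)
        best = S[i - 1][j - 1]
        # Earlier cell in column j-1: residues x+1 .. i-1 of a face a gap
        for x in range(0, i - 1):
            best = max(best, S[x][j - 1] - p_open - (i - x - 1) * p_ext)
        # Earlier cell in row i-1: residues y+1 .. j-1 of b face a gap
        for y in range(0, j - 1):
            best = max(best, S[i - 1][y] - p_open - (j - y - 1) * p_ext)
        return best

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = score(a[i - 1], b[j - 1]) + best_previous(i, j)
    # Virtual end cell (n+1, m+1) with exchange value 0 closes trailing gaps
    return best_previous(n + 1, m + 1)

def identity(p, q):
    """Stand-in exchange function (assumption): +2 for a match, -1 otherwise."""
    return 2 if p == q else -1

if __name__ == "__main__":
    print(global_score("ABCD", "ABD", identity, p_open=2, p_ext=1))
```

Scanning a whole row and column for every cell makes this cubic in sequence length; it is written this way only to mirror the recurrence term by term.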
Pairwise alignment • Global alignment: all gaps are penalised • Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised
MSTGAVLIY--TS-----
---GGILLFHRTSGTSNS
(the leading gap in the second sequence and the trailing gap in the first sequence are end-gaps)
Local dynamic programming (Smith & Waterman, 1981) • Sequences LCFVMLAGSTVIVGTR and EDASTILCGS are compared in a search matrix • Scoring uses an amino acid exchange matrix containing negative numbers and gap penalties (open, extension) • Resulting local alignment:
AGSTVIVG
A-STILCG
Local dynamic programming (Smith & Waterman, 1981)
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)\,P_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)\,P_x \} \\ 0 \end{cases}$$
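A score-only sketch of the local recurrence follows. It assumes row 0 and column 0 of the search matrix are zero and, in the usage lines, uses a simple match/mismatch function in place of an amino acid exchange matrix; the function names are illustrative.

```python
def local_score(a, b, score, p_open, p_ext):
    """Best local alignment score: highest cell value, with cells floored at 0."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Diagonal neighbour: no gap before the pair (i, j)
            prev = S[i - 1][j - 1]
            # Earlier cells in column j-1 (gap opposite residues of a)
            for x in range(1, i - 1):
                prev = max(prev, S[x][j - 1] - p_open - (i - x - 1) * p_ext)
            # Earlier cells in row i-1 (gap opposite residues of b)
            for y in range(1, j - 1):
                prev = max(prev, S[i - 1][y] - p_open - (j - y - 1) * p_ext)
            # The 0 option lets a new local alignment start at any cell
            S[i][j] = max(0, score(a[i - 1], b[j - 1]) + prev)
            best = max(best, S[i][j])
    return best

def identity(p, q):
    """Stand-in exchange function (assumption): +2 for a match, -1 otherwise."""
    return 2 if p == q else -1

if __name__ == "__main__":
    print(local_score("XXABCXX", "YYABCYY", identity, p_open=2, p_ext=1))
    print(local_score("ABCDEF", "ABCXDEF", identity, p_open=2, p_ext=1))
```

The floor at 0 only helps if the scoring function can return negative values; with an all-positive matrix every cell would simply keep growing.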
Dot plots • A way of representing (visualising) sequence similarity without doing dynamic programming (DP) • Build the same search matrix, but represent sequence similarity locally by averaging over a window • See Lesk’s book, pp. 167-171
Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of finding similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.
Dot plots can be filtered by window approaches (to calculate running averages) and by applying a threshold. They can identify insertions, deletions and inversions.