Outline • The fundamental importance of Alignment and Statistics • The Basic Sequence Similarity Algorithm • Heuristics: BLAST, FASTA, SIM4 • ESTomics
A story The GOLD-BUG By Edgar A. Poe
The Gold-Bug, By Edgar A. Poe Mr. William Legrand left New Orleans and took up residence on Sullivan’s Island, near Charleston, South Carolina. His servant was Jupiter, an old negro. He calls Mr. Legrand “Massa Will.” One day, Massa Will found a bug, a scarabaeus, which he believed to be totally new.
Jupiter describes the bug in his language: “…de bug is a gole-bug, solid, ebery bit of him, inside and all, sep him wing -- neber feel half so hebby a bug in my life.” The design on the bug’s back resembled a death’s-head …. The story continues with the search for a great treasure hidden by the famous pirate Captain Kidd.
Captain Kidd’s Code 53||^305))6*;4826)4|.)4|);806* ;48^8%60))85;1|(;:|*8^83(88)5*^ ;46(;88*96*?;8)*|(;485);5*^2:*| (;4956*2(5*_4)8%8*;4069285);) 6^8)4||;1(|9;48081;8:8|1;48^85;4) 485^528806*81(|9;48;(88;4(|?34 ;48)4|;161;:188;|?;
E. A. Poe Circumstances, and a certain bias of mind, have led me to take interest in such riddles, and it may well be doubted whether human ingenuity can construct an enigma of the kind which human ingenuity may not, by proper application, resolve.
What Language ? “In the present case -- indeed in all cases of secret writing -- the first question regards the language of the cipher; for the principles of solution, so far, especially, as the more simple ciphers are concerned, depend upon, and are varied by, the genius of a particular idiom.
E. A. Poe … In general, there is no alternative but experiment (directed by probabilities) of every tongue known to him who attempts the solution, until the true one be attained. … But for this consideration, I should have begun my attempts with the Spanish and French, as the tongues in which a secret of this kind would most naturally have been written by a pirate of the Spanish main. As it was, I assumed the cryptograph to be English.”
Statistics
No division between words.
Statistics of the characters:
Character   Count
8           33
;           26
4           19
| and )     16 each
*           13
5           12
6           11
^ and 1     8 each
0           6
9 and 2     5 each
: and 3     4 each
?           3
%           2
_           1
We found our first letter! In English the letter which most frequently occurs is e. Afterwards, the succession is: a o i d h n r s t u y c f g l m w b k p q x z
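The counting step behind this slide can be sketched in a few lines of Python. This is only an illustration: the function name `symbol_frequencies` is an assumption, and the sample string is the first row of the cipher as transcribed on the slide, so its counts need not match the full table.

```python
from collections import Counter


def symbol_frequencies(ciphertext):
    """Return (symbol, count) pairs, most frequent first, ignoring whitespace."""
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    return counts.most_common()


# First row of the cipher as it appears on the slide (illustrative input only).
first_row = "53||^305))6*;4826)4|.)4|);806*"

for symbol, count in symbol_frequencies(first_row):
    print(symbol, count)
```

Run on the whole cryptogram, the symbol at the top of the list is the natural candidate for the most frequent English letter, e.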
Captain Kidd’s Code: 8 is “e” 53||^305))6*;4826)4|.)4|);806* ;48^8%60))85;1|(;:|*8^83(88)5*^ ;46(;88*96*?;8)*|(;485);5*^2:*| (;4956*2(5*_4)8%8*;4069285);) 6^8)4||;1(|9;48081;8:8|1;48^85;4) 485^528806*81(|9;48;(88;4(|?34 ;48)4|;161;:188;|?; 88 occurs in English in words like: speed, seen, been, agree 8 is e
Captain Kidd’s Code: ; is t and 4 is h 53||^305))6*;4826)4|.)4|);806* ;48^8%60))85;1|(;:|*8^83(88)5*^ ;46(;88*96*?;8)*|(;485);5*^2:*| (;4956*2(5*_4)8%8*;4069285);) 6^8)4||;1(|9;48081;8:8|1;48^85;4) 485^528806*81(|9;48;(88;4(|?34 ;48)4|;161;:188;|?; The frequently repeated triple ;48 must be “the”, the most frequent word.
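The search for a repeated triple such as ;48 can be sketched the same way, by counting every run of three consecutive symbols. The function name `most_common_ngrams` and its parameters are assumptions made for this illustration, and the sample is a made-up fragment rather than the full cryptogram.

```python
from collections import Counter


def most_common_ngrams(text, n=3, top=3):
    """Count every run of n consecutive symbols and return the `top` most frequent."""
    symbols = "".join(text.split())  # the cipher has no word divisions; drop line breaks
    grams = Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))
    return grams.most_common(top)


# Toy fragment in which the triple ;48 is repeated.
print(most_common_ngrams(";48x;48y;48", n=3, top=1))
```

A triple that ends in the symbol already identified as e and recurs many times is a strong candidate for "the".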
The Solution “A good glass in the bishop’s hostel in the devil’s seat forty-one degrees and thirteen minutes northeast and by north main branch seventh limb east side shoot from the left eye of the death’s head a bee-line from the tree through the shot fifty feet out.”
THE ALGORITHMS of GENOMICS The Programming Language of Genomics is BLASTALL
Sequence Comparison Biomolecular sequences • DNA sequences (strings over the 4-letter alphabet {A, C, G, T}) • RNA sequences (strings over the 4-letter alphabet {A, C, G, U}) • Protein sequences (strings over the 20-letter alphabet of amino acids) Sequence similarity helps in the discovery of genes, and the prediction of structure and function of proteins.
The Basic Similarity Analysis Algorithm • Global Similarity • Scoring Schemes • Edit Graphs • Alignment = Path in the Edit Graph • The Principle of Optimality • The Dynamic Programming Algorithm • The Traceback
Sequence Alignment
Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
• GCGCATTTGAGCGA
• TGCGTTAGGGTGACCA
A possible alignment:
-GCGCATTTGAGCGA--
TGCG--TTAGGGTGACC
Each column is a match, a mismatch, or an indel.
Consider two sequences X and Y over the same alphabet.
Scoring Schemes
Unit-score
     A  C  G  T  -
A    1  0  0  0  0
C    0  1  0  0  0
G    0  0  1  0  0
T    0  0  0  1  0
-    0  0  0  0  0
ALIGNMENT
ACG
|||
AGG
A is aligned with A, C is aligned with G, G is aligned with G.
Unit-score: Score = (A,A) + (C,G) + (G,G) = 1 + 0 + 1 = 2
GAPS
“-” is the gap symbol
Without gaps, SCORE 7:
ACATGGAAT
ACAGGAAAT
With gaps, SCORE 8:
ACATGG-AAT
ACA-GGAAAT
Without gaps, SCORE 0:
AAAGGG
GGGAAA
With gaps, SCORE 3:
---AAAGGG
GGGAAA---
The gapped versions are the OPTIMAL ALIGNMENTS.
(x,y) = the score for aligning x with y (x,-) = the score for aligning x with - (-,y) = the score for aligning - with y
Alignment
A-CG-G
ATCGTG
Score = (A,A) + (-,T) + (C,C) + (G,G) + (-,T) + (G,G)
THE SUM OF THE SCORES OF THE PAIRWISE ALIGNED SYMBOLS
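A minimal sketch of this sum under the unit-score scheme, assuming the alignment is given as two rows of equal length with "-" as the gap symbol. The names `unit_score` and `alignment_score` are chosen for this illustration only.

```python
def unit_score(x, y):
    """Unit-score scheme: 1 for two identical letters, 0 otherwise (gaps score 0)."""
    return 1 if x == y and x != "-" else 0


def alignment_score(row_x, row_y, score=unit_score):
    """Sum the scores of the pairwise aligned symbols, column by column."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    return sum(score(x, y) for x, y in zip(row_x, row_y))


print(alignment_score("ACG", "AGG"))                # 1 + 0 + 1 = 2
print(alignment_score("ACATGG-AAT", "ACA-GGAAAT"))  # 8
```

Any other scoring scheme can be passed in through the `score` parameter without changing the summation.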
Scoring Scheme
Dayhoff score
Columns: - A R N D C Q E G H I L K M F P S T W Y V
Row -: -8 against every symbol
Row A: -8 3 -3 0 0 -3 -1 0 1 -3 -1 -3 -2 -2 -4 1 1 1 -7 -4 0
Rows R, N, D, ...: only partly shown on the slide (6 4 ...)
PTIPLSRLFDNAMLRAHRLHQ
SAIENQRLFNIAVSRVQHLHL
Partial alignment for Monkey and Trout somatotropin proteins
Scoring Functions Mutations = Substitutions, Insertions, Deletions. Scoring function = a sum of terms, one for each pair of aligned residues and one for each gap. The meaning = log of the relative likelihood that the sequences are related, compared to being unrelated. Identities and conservative substitutions are positive terms. Non-conservative substitutions are negative terms.
The Edit Graph Suppose that we want to align AGT with AT. We are going to construct a graph where alignments between the two sequences correspond to paths between the Begin and End nodes of the graph. This is the Edit Graph.
The sequence AGT has length 3 (positions 0, 1, 2, 3). The sequence AT has length 2 (positions 0, 1, 2). The Edit graph has (3+1)*(2+1) = 12 nodes.
AGT indexes the columns (0 to 3), and AT indexes the rows (0 to 2) of this “table”. Begin is the node at column 0, row 0; End is the node at column 3, row 2.
The Graph is directed. The nodes (i,j) will hold values.
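Directed edges get as labels pairs of aligned letters. Horizontal edges pair a letter of AGT with a gap: (A,-), (G,-), (T,-). Vertical edges pair a gap with a letter of AT: (-,A), (-,T). Diagonal edges pair a letter of AGT with a letter of AT: (A,A), (G,A), (T,A), (A,T), (G,T), (T,T).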
Alignment = Path in the Edit Graph
AGT
A-T
Every path from Begin to End corresponds to an alignment. Every alignment corresponds to a path between Begin and End.
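Because alignments and Begin-to-End paths are in one-to-one correspondence, the number of alignments can be counted by counting paths. The sketch below assumes each node is reached by a horizontal, a vertical, or a diagonal edge, as in the edit graph of AGT and AT; the function name `count_alignments` is hypothetical.

```python
def count_alignments(m, n):
    """Count Begin-to-End paths in the edit graph of two sequences of length m and n."""
    # paths[i][j] = number of directed paths from Begin (0,0) to node (i,j).
    # Nodes on the border are reached in exactly one way (gap-only edges).
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1] + paths[i - 1][j - 1]
    return paths[m][n]


print(count_alignments(3, 2))  # AGT versus AT: 25 paths
```

The count grows quickly with sequence length, which is why the optimal path is found by dynamic programming rather than by enumeration.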
The Principle of Optimality The optimal answer to a problem is expressed in terms of the optimal answers to its subproblems.
Dynamic Programming Given: Two sequences X and Y Find: An optimal alignment of X with Y Part 1: Compute first the optimal alignment score Part 2: Construct optimal alignment We are looking for the optimal alignment = maximal score path in the Edit Graph from the Begin vertex to the End vertex
The DP Matrix S(i,j): one entry for each node (i,j) of the edit graph of AGT and AT, for example S(1,0) and S(2,1).
The DP Matrix S = [S(i,j)] S(i,j) = the score of the maximal-score path from the Begin vertex to the vertex (i,j). The optimal path to (i,j) must pass through one of the vertices (i-1,j-1), (i-1,j), (i,j-1).
Optimal path through (i-1,j): score S(i-1,j) + (xi, -) = optimal path to (i-1,j) followed by the edge labeled (xi, -).
Optimal path through (i-1,j-1): score S(i-1,j-1) + (xi, yj) = optimal path to (i-1,j-1) followed by the edge labeled (xi, yj).
Optimal path through (i,j-1): score S(i,j-1) + (-, yj) = optimal path to (i,j-1) followed by the edge labeled (-, yj).
The Basic ALGORITHM
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + (x_i, y_j) \\ S(i-1,j) + (x_i, -) \\ S(i,j-1) + (-, y_j) \end{cases}$$
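The recurrence can be sketched directly as a table fill. This is a minimal illustration under the unit-score scheme (match 1, everything else 0); the names `unit_score` and `global_score` are assumptions, and the first row and column are initialised with gap-only paths.

```python
def unit_score(x, y):
    """Unit-score scheme: 1 for two identical letters, 0 otherwise (gaps score 0)."""
    return 1 if x == y and x != "-" else 0


def global_score(X, Y, score=unit_score):
    """Fill the DP matrix S and return S(m, n), the optimal global alignment score."""
    m, n = len(X), len(Y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):            # paths that only align letters of X with gaps
        S[i][0] = S[i - 1][0] + score(X[i - 1], "-")
    for j in range(1, n + 1):            # paths that only align gaps with letters of Y
        S[0][j] = S[0][j - 1] + score("-", Y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(X[i - 1], Y[j - 1]),  # (x_i, y_j)
                S[i - 1][j] + score(X[i - 1], "-"),           # (x_i, -)
                S[i][j - 1] + score("-", Y[j - 1]),           # (-, y_j)
            )
    return S[m][n]


print(global_score("AGT", "AT"))              # 2
print(global_score("ACATGGAAT", "ACAGGAAAT")) # 8
```

The table has (m+1)*(n+1) entries and each is computed from three neighbours, so the running time is proportional to the product of the two sequence lengths.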
OPTIMAL ALIGNMENT and TRACEBACK
DP matrix under the unit score (AGT indexes the columns, AT indexes the rows):
        A  G  T
     0  0  0  0
A    0  1  1  1
T    0  1  1  2
Optimal Alignment:
AGT
A-T
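A traceback can be sketched by walking back from the End node and, at each step, choosing a predecessor whose score explains the current entry. The code below is an illustration under the unit-score scheme, not the slides' own implementation; the function name `align` and the tie-breaking order (diagonal first, then a gap in Y, then a gap in X) are assumptions.

```python
def unit_score(x, y):
    """Unit-score scheme: 1 for two identical letters, 0 otherwise (gaps score 0)."""
    return 1 if x == y and x != "-" else 0


def align(X, Y, score=unit_score):
    """Return (optimal score, aligned row for X, aligned row for Y)."""
    m, n = len(X), len(Y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = S[i - 1][0] + score(X[i - 1], "-")
    for j in range(1, n + 1):
        S[0][j] = S[0][j - 1] + score("-", Y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + score(X[i - 1], Y[j - 1]),
                S[i - 1][j] + score(X[i - 1], "-"),
                S[i][j - 1] + score("-", Y[j - 1]),
            )
    # Traceback: walk from End (m, n) back to Begin (0, 0).
    i, j = m, n
    row_x, row_y = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + score(X[i - 1], Y[j - 1]):
            row_x.append(X[i - 1])
            row_y.append(Y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] + score(X[i - 1], "-"):
            row_x.append(X[i - 1])
            row_y.append("-")
            i -= 1
        else:
            row_x.append("-")
            row_y.append(Y[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


print(align("AGT", "AT"))  # (2, 'AGT', 'A-T')
```

When several predecessors tie, a different tie-breaking order yields a different alignment with the same optimal score.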
The Basic ALGORITHM: Local Similarity
$$S(i,j) = \max \begin{cases} 0 \\ S(i-1,j-1) + (x_i, y_j) \\ S(i-1,j) + (x_i, -) \\ S(i,j-1) + (-, y_j) \end{cases}$$
We add the 0 term.
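The effect of the added 0 only shows when some scores are negative, so the sketch below assumes a simple scheme that is not taken from the slides: +1 for a match, -1 for a mismatch, -1 for a gap. The names `simple_score` and `local_score` are hypothetical. The best local score is the largest entry anywhere in the matrix.

```python
def simple_score(x, y):
    """Assumed scheme for illustration: +1 match, -1 mismatch, -1 gap."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -1


def local_score(X, Y, score=simple_score):
    """Return the best local alignment score between X and Y."""
    m, n = len(X), len(Y)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(
                0,                                            # start a new local alignment here
                S[i - 1][j - 1] + score(X[i - 1], Y[j - 1]),  # (x_i, y_j)
                S[i - 1][j] + score(X[i - 1], "-"),           # (x_i, -)
                S[i][j - 1] + score("-", Y[j - 1]),           # (-, y_j)
            )
            best = max(best, S[i][j])
    return best


print(local_score("ACGT", "TTACGTT"))  # 4: the shared segment ACGT
```

The 0 lets an alignment restart wherever the running score would drop below zero, so poorly matching flanks do not penalise a well matching segment.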
General Scoring Schemes Assumptions 1. Independence of mutations at different sites: additive scoring scheme 2. Gaps of any length are considered one mutation. All of the efficient alignment algorithms employing the dynamic programming method are based fundamentally on the fact that the scoring function is additive.
Substitution Matrices Consider an ungapped alignment of two equal-length sequences. • Compute the probability that the two sequences are related • Compute the probability that the two sequences are not related • Compute the ratio of the two probabilities
Random Model R: every letter z occurs independently with probability $q_z$.
Match Model M: aligned pairs of residues (a, b) occur with joint probability $p_{ab}$.
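Log-odds ratio for an ungapped alignment of sequences X and Y:
$$\log \frac{P(X, Y \mid M)}{P(X, Y \mid R)} = \log \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} = \sum_i s(x_i, y_i), \quad \text{where } s(a,b) = \log \frac{p_{ab}}{q_a q_b}$$
s(a,b) = the substitution matrix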
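As a sketch of how such a matrix entry is computed, the code below takes the log-odds of the match model against the random model. The two-letter alphabet and all probabilities are made-up values for illustration only, and the function names are hypothetical.

```python
import math

# Made-up example values (not real residue statistics).
q = {"A": 0.5, "B": 0.5}                    # random model: background frequencies
p = {("A", "A"): 0.4, ("B", "B"): 0.4,      # match model: joint pair probabilities
     ("A", "B"): 0.1, ("B", "A"): 0.1}


def log_odds_score(a, b, p, q):
    """s(a,b) = log of p_ab / (q_a * q_b)."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def ungapped_log_odds(x, y, p, q):
    """Log-odds ratio of an ungapped alignment: the sum of s over aligned pairs."""
    if len(x) != len(y):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(log_odds_score(a, b, p, q) for a, b in zip(x, y))


print(log_odds_score("A", "A", p, q))  # positive: the pair is more likely under M than R
print(log_odds_score("A", "B", p, q))  # negative: the pair is less likely under M than R
```

Pairs seen more often in related sequences than chance predicts get positive scores, which is why identities and conservative substitutions contribute positive terms.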
BLAST (Basic Local Alignment Search Tool) • A suite of sequence comparison algorithms optimized for speed, used to search sequence databases for optimal local alignments to a protein or nucleotide query. Altschul, Gish, Miller, Myers, Lipman, “Basic Local Alignment Search Tool”, J. Mol. Biol. 215(3):403-10 (1990). Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, NAR 25(17):3389-402 (1997) (and references therein).