270 likes | 354 Views
Computational Genomics and Proteomics. Pairwise a lignment. Genetic code. Searching for similarities. What is the function of the new gene? The “lazy” investigation: Find the set of similar proteins Identify similarities and differences For long proteins: identify domains.
E N D
Computational Genomics and Proteomics Pairwise alignment
Searching for similarities What is the function of the new gene? The “lazy” investigation: Find the set of similar proteins Identify similarities and differences For long proteins: identifydomains
Is similarity really interesting? Common ancestry is more interesting Makes it more likely that genes share the same function Homology: sharing a common ancestor a binary property (yes/no) It’s a nice tool:When (a known gene) G is homologous to (an unknown) X it means that we gain a lot of information on X Z X Y
How to determine similarity Frequent evolutionary events: Substitution Insertion, deletion Duplication Inversion Evolution at work Common ancestor, usually extinct Z X Y available We’ll use only these
Alignment - the problem Operations: substitution, Insertion deletion GTACT--C GGA-TGAC Algorithmsforgenomes ||||||| algorithms4genomes algorithmsforgenomes ||||||||||algorithms4genomes algorithmsforgenomes ||||||||||? |||||||algorithms4--genomes
Many ways to align 2) 1) 3) GTA--- GTA GT-A ---GCA GCA G-CA • Which alignment is better? • Reflecting evolution ~= • Most probable ~= • Simplest
Scoring Should give reasonable alignments And have to assign scores to: Substitution (or match/mismatch) Gap penalty Linear: g(k)=ak Affine: g(k)=b+ak Concave, e.g.: g(k)=log(k) The score for an alignment is the sum of scores of all columns GTACT--C GGA-TGAC GTA-- G-T-A --ATG -ATG-
Substitution matrices Define a score for match/mismatch of letters DNA - Simple: - Used in genome alignments
Substitution matrices for aa Amino acids are not equal: Some are easily substituted, similar: biochemical properties structure Some mutations occur more often due to similar codons The two above give us substitution matrices http://www.people.virginia.edu/~rjh9u/aminacid.html orange: nonpolar and hydrophobic. green: polar and hydrophilic magenta box are acidic light blue box are basic http://www.cimr.cam.ac.uk/links/codon.htm
BLOSUM 62 substitution matrix http://www.carverlab.org/testing/epp.html
Constant vs. affine gap scoring Gap Scoring intro extension Constant -2 -2 Affine -3 -1 …and +1 for match Seq1 G T A - - G - T - ASeq2 - - A T G - A T G - Const -2 –2 1 –2 –2 (SUM = -7) -2 –2 1 -2 –2 (SUM = -7) Affine –4 1 -4 (SUM = -7) -3 –3 1 -3 –3 (SUM = -11)
Global alignment.The Needleman-Wunsch algorithm Goal: find the maximal scoring alignment Scores: m match, -s mismatch, -g for insertion/deletion Dynamic programming Solve smaller subproblem(s) Iteratively extend the solution The best alignment for X[1…i] and Y[1…j]is called M[i, j] X1 … Xi-1 -- Xi Y1 … - Yj-1 Yj -
The algorithm • Goal: find the maximal scoring alignment • Scores: m match, -s mismatch, -g for insertion/deletion • The best alignment for X[1…i]and Y[1…j]is called M[i, j] • 3 ways to extend the alignment X[1…i-1]X[i] X[1…i] - X[1…i-1] X[i] Y[1…j-1]Y[j] Y[1…j-1]Y[j] Y[1…j] - + m M[i,j] = M[i-1, j-1] - s M[i, j-1] - g M[i-1, j] - g
The algorithm – final equation Corresponds to: * X1…Xi-1 XiY1…Yj-1 Yj M[i-1, j-1] + score(X[i],Y[j]) M[i, j] = max M[i, j-1] – g X1…Xi -Y1…Yj-1 Yj M[i-1, j] – g X1…Xi-1 XiY1…Yj - j-1 j i-1 * -g i -g
Example: global alignment of two sequences Align two DNA sequences: GAGTGA GAGGCGA (note the length difference) Parameters of the algorithm: Match:score(A,A) = 1 Mismatch: score(A,T) = -1 Gap: g = -2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2
The algorithm. Step 1: init Create the matrix Initiation 0 at [0,0] Apply the equation… j 0 1 2 3 4 5 6 i - G A G T G A 0 - 0 1 G 2 A 3 G 4 G 5 C 6 G 7 A M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2
The algorithm. Step 1: init Initiation of the matrix: 0 at pos [0,0] Fill in the first row using the “” rule Fill in the first column using “” - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 A -4 G -6 G -8 C -10 G -12 A -14 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i
The algorithm. Step 2: fill in Continue filling in of the matrix, remembering from which cell the result comes (arrows) - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 A -4 -1 2 G -6 G -8 C -10 G -12 A -14 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i
The algorithm. Step 2: fill in We are done… Where’s the result? - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i
The algorithm. Step 3:traceback Start at the last cell of the matrix Go against the direction of arrows Sometimes the value may be obtained from more than one cell (which one?) - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 M[i-1, j-1] ± 1 M[i, j] = max M[i, j-1] – 2 M[i-1, j] – 2 j i
The algorithm. Step 3: traceback Extract the alignments - G A G T G A - 0 -2 -4 -6 -8 -10 -12 G -2 1 -1 -3 -5 -7 -9 A -4 -1 2 0 -2 -4 -6 G -6 -3 0 3 1 -1 -3 G -8 -5 -2 1 2 2 0 C -10 -7 -4 -1 0 1 1 G -12 -9 -6 -3 -2 1 0 A -14 -11 -8 -5 -4 -1 2 j a a) b GAGT-GA GAGGCGA b) i GA-GTGA GAGGCGA
Affine gaps X1…Xi -Y1…Yj-1 Yj • Penalties: • go - gap opening (e.g. -8) • ge - gap extension (e.g. -2) M[i-1, j-1] X1…XiY1…Yj M[i, j] = max score(X[i], Y[j]) + Ix[i-1, j-1] Iy[i-1, j-1] M[i, j-1] + go X1…-Y1… Yj Ix[i, j] = max Ix[i, j-1] + ge X1… - -Y1…Yj-1 Yj M[i-1, j] + go Iy[i, j] = max Iy[i-1, j] + gx • @ home: think of boundary values M[*,0], I[*,0] etc.
Variation on global alignment Globalalignment: previous algorithms is called global alignment, because it uses all letters from both sequences.CAGCACTTGGATTCTCGG CAGC-----G-T----GG Semi-globalalignment: don’t penalize for start/end gaps (omit the start/end gaps of sequences).CAGCA-CTTGGATTCTCGG ---CAGCGTGG-------- Applications of semi-global: Finding a gene in genome Placing marker onto a chromosome One sequence much longer than the other Danger! – really bad alignments for divergent seqs seq X: seq Y:
Semi-global alignment Ignore start gaps First row/column set to 0 Ignore end gaps Read the result from last row/column - G A G T G - 0 0 0 0 0 0 A 0 -1 1 -1 -1 -1 G 0 1 -2 2 0 -2 T 0 -1 0 0 3 1
a heuristics Heuristics: A rule of thumb that often helps in solving a certain class of problems, but makes no guarantees.Perkins, DN (1981) The Mind's Best Work