1 / 33

Sequence Comparison

I519 Introduction to Bioinformatics, Fall 2012. Sequence Comparison. Why we compare sequences. Find sequential similarity between protein/DNA sequences To infer functional similarity To infer evolutionary history Find important residues that are important for a protein ’ s function

myra
Download Presentation

Sequence Comparison

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. I519 Introduction to Bioinformatics, Fall 2012 Sequence Comparison

  2. Why we compare sequences • Find sequential similarity between protein/DNA sequences • To infer functional similarity • To infer evolutionary history • Find important residues that are important for a protein’s function • Functional sites of a protein • DNA elements (e.g., transcription factor binding sites)

  3. Comparison of sequences at various levels • We may look at sequences differently • Whole genome comparison (will be covered later) • Whole DNA/protein sequence • Protein domains • Motifs (protein motifs & motifs at DNA level) • For protein-coding genes, comparison at amino acid level instead of nucleotides to achieve higher sensitivity & specificity (20 letters versus 4 letters)

  4. B E . . . Protein domains Domain: structurally/functionally/evolutionally independent units Domain combination: two domains appearing in a protein A C A B D C B C A: all-β regulatory domain B: α/β-substrate binding domain C: α/β-nucleotide binding domain

  5. PROSITE patterns • Described by regular grammars • Programs that allow to search databases for PROSITE patterns (e.g., ScanProsite) • We have seen the ATP-binding motif, [AG]-x{4}-G-K-[ST]); another example [EDQH]-x-K-x-[DN]-G-x-R-[GACV] • Rules: • Each position is separated by a hyphen • One character denotes residuum at a given position • […] denoted a set of allowed residues • (n) denotes repeat of n • (n,m) denoted repeat between n and m inclusive • Ex. ATP/GTP binding motive [SG]=X(4)-G-K-[DT]

  6. Three principle methods of pairwise sequence alignment • Dot matrix analysis • The dynamic programming algorithm • Word or k-tuple methods, such as used by FASTA and BLAST.

  7. The dot matrix method

  8. T A G T C C A T T G A A T Pairwise alignment (of biological molecules) Given 2 DNA sequences v and w: v : m = 7 w : n = 6 matches mismatches insertions deletions 4 1 2 match 2 mismatch V W deletion indels insertion

  9. A simple string comparison solution: Hamming distance • The most often used distance on strings in computer science: the number of positions at which the corresponding symbols are different • Hamming distance always compares i-th letter of v with i-th letter of w V = ATATATAT W= TATATATA Hamming distance:d(v, w)=8 Computing Hamming distance is a trivial task

  10. Hamming distance is easy to compute, but… • This makes some sense on comparing DNA sequences in some cases. But there are other mutations • Substitution ACAGT  ACGGT • Insertion/deletion (indel) ACAGT  ACGT • Inversion ACA……GT  AG……ACT • Translocation AC……AG…TAA  AG…TC……AAA • Duplication • We only consider the first two mutations for now. • There are algorithms for the other mutations…

  11. Comparing two strings: Edit distance • Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v,w) = MIN number of elementary operations to transform v w

  12. Edit distance & Hamming distance Edit distance may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT Just one shift Make it all line up W= TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)=8d(v, w)=2 Computing Hamming distance Computing edit distance is a trivial task is a non-trivial task How to find what j goes with what i ???

  13. Edit Distance: Example TGCATAT  ATCCGAT in 5 steps TGCATAT  (delete last T) TGCATA  (delete last A) TGCAT  (insert A at front) ATGCAT  (substitute C for 3rdG) ATCCAT  (insert G before last A) ATCCGAT (Done) Edit distance = 5 But how? Dynamic programming

  14. Giving a population history • If we watched every generation, we could annotate the tree with exactly where mutations have happened. Generation 0: AGGATTA x Generation 129:AGGATA x Generation 172:AGGCCCATTA Gen. 245:AGATA x Gen. 295:GGCCCATTA x x Gen 280: CGATA Current: CGATA Current:GGCCCATTA This would give a history to the current sequences.

  15. Edit distance v.s. ancestral reconstruction • Edit distance simpler than ancestral reconstruction • Orders of the edit operations do not matter. • If two events overlap or even cancel each other in the evolution, they cannot be seen at edit distance. • It is a distance metric. • Identity: d(x,y)=0 iff x=y • Symmetry: d(x,y) = d(y,x) • Triangular Inequality: d(x,z) <= d(x,y) + d(y,z)

  16. Alignment • Hard to visually show the edit distance: • E.g. CT@4, insert C@6, delete@9 • Alignment is much nicer: • ATGCA-TTTA||| | || |ATGTACTT-A match =0, mismatch = -1, indel = -1. Score = the total score of each position of the alignment. • Then computing edit distance is equivalent to finding the optimal (maximum scoring) alignment.

  17. “Optimal” alignment • The word “optimal” alignment is somewhat misleading. Ideally we want to find the “real” alignment of the sequences according to the real evolution instead. • Here we try tofind the “optimal” alignment. • “optimal” solution is not necessary the correct solution. It all depends on how good the score function is. • The identity scoring scheme is not a very accurate one. • For example, transitions and transversions have the same score. • Along this alignment topic, we will refine the score functions.

  18. Scoring sequence alignment • How to score an alignment? • Simplest scoring scheme: • 0 = match • -1 = mismatch • -1 = indel • This is called “linear gap penalty” because the cost of a gap (consecutive indels) is proportional to its length. (We could have each gap position cost g, for some negative constant g.) • Let’s see some examples

  19. Given alignment, it is trivial to compute alignment score AATGCGA-TTTT || | |||G-TG--ACTTTC 6 matches: 0 2 mismatches: -2 4 indels: -4 edit distance (alignment score) = -6 AATG-CGATTTT || | || G-TGAC-TTTC- 5 matches: 0 3 mismatches: -3 4 indels: -4 edit distance (alignment score) = -7

  20. Alignment with DP • The question is how alignment can be computed with a computer? • Dynamic Programming • Requires the subsolution of an optimal solution is also optimal.

  21. A T C G T A C w 1 2 3 4 5 6 7 0 0 v 1 A 2 T 3 G 4 T 5 T A 6 7 T Alignment as a path in the edit graph Every path in the edit graph corresponds to an alignment: AT-GTTA-T ATCG-TAC-

  22. Recursive definition

  23. Dynamic programming algorithm S[0,0] = 0 S[i,0] = S[i-1,0] + g S[0,j] = S[0,j-1] + g for i from 1 to M for j from 1 to N S[i,j] = max{S[i-1,j-1]+s(x[i],y[j]), S[i-1,j]+g,S[i,j-1]+g} Output S[M,N]

  24. Fill up the dynamic programming matrix Scoring function: missmatch = -1 indel = -1 seq[1]=PELICAN seq[2]=CWELACANTH DP Matrix # C W E L A C A N T H # 0 -1 -2 -3 -4 -5 -6 -7 -8 -9-10 P -1 -1 -2 -3 -4 -5 -6 -7 -8 -9-10 E -2 -2 -2 -2 -3 -4 -5 -6 -7 -8 -9 L -3 -3 -3 -3 -2 -3 -4 -5 -6 -7 -8 I -4 -4 -4 -4 -3 -3 -4 -5 -6 -7 -8 C -5 -4 -5 -5 -4 -4 -3 -4 -5 -6 -7 A -6 -5 -5 -6 -5 -4 -4 -3 -4 -5 -6 N -7 -6 -6 -6 -6 -5 -5 -4 -3 -4 -5 A bottom-up calculation to get the optimal score (only!)

  25. Traceback to get the actual alignment # C W E L A C A N T H # 0 -1 -2 -3 -4 -5 -6 -7 -8 -9-10 P -1 -1 -2 -3 -4 -5 -6 -7 -8 -9-10 E -2 -2 -2 -2 -3 -4 -5 -6 -7 -8 -9 L -3 -3 -3 -3 -2 -3 -4 -5 -6 -7 -8 I -4 -4 -4 -4 -3 -3 -4 -5 -6 -7 -8 C -5 -4 -5 -5 -4 -4 -3 -4 -5 -6 -7 A -6 -5 -5 -6 -5 -4 -4 -3 -4 -5 -6 N -7 -6 -6 -6 -6 -5 -5 -4 -3 -4 -5 CWELACANTH -PELICAN-- No need to physically record the green arrows Instead, we will trace back: following the red arrows!

  26. More formal backtracking Idea: We go from upper left to lower right. Backtrack the optimal path! Start in lower right: let i = m, j = n Until i = 0, j = 0: Figure out which of the three terms gave rise to M[i,j] by picking the largest. M[i-1,j]+indel, M[i,j-1]+indel, M[i-1,j-1]+f(s[i],t[j]) Move to the right place (reduce i, reduce j, or reduce both), and write down the configuration of the current column.

  27. How similar biology and informatics are

  28. Space, time requirements • The algorithm runs in O(nm) time: Each step requires only 3 checks to other points in the matrix. • We also need O(nm) space, to store the matrix. • If we only want to know the score of the optimal alignment, we can do that in O(min(m,n)) space. • Reconstructing the alignment also requires only O(m+n) space.

  29. Alignments are scored • Need to score alignments. • The alignment that has highest score may not be the one that actually matches evolutionary history. • So you should never trust that an alignment must be right. It just optimizes the score. • When we move to multiple alignments, things get worse: no guarantee of the optimal score, even.

  30. A related problem: Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Source * * * * * * * * * * * * Sink Goal: Finding a longest path in Gfrom “source” to “sink”

  31. Longest Path in DAG Problem • Goal: Find a longest path between two vertices in a weighted DAG • Input: A weighted DAG G with source and sink vertices • Output: A longest path in G from source to sink

  32. “Edit distance problem” Runtime • It takes O(nm) time to fill in the dynamic programming matrix. • Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up the dynamic programming matrix.

  33. Reading: • Chapter 4 (Producing and Analyzing Sequence Alignments) • Next time we will talk about global and local pairwise sequence alignment, focus on • How alignment of biological sequences is different from comparison of two strings (scoring matrix + indel penalties) • Global versus local

More Related