Dynamic Programming: Edit Distance

justus

Presentation Transcript


  1. Dynamic Programming: Edit Distance

  2. Outline • DNA Sequence Comparison: First Success Stories • Longest Paths in Graphs • Sequence Alignment • Edit Distance • Longest Common Subsequence Problem • Dot Matrices

  3. DNA Sequence Comparison: First Success Story • Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene’s function • In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene

  4. Cystic Fibrosis • Cystic fibrosis (CF) is a chronic and frequently fatal genetic disease of the body's mucus glands (abnormally high level of mucus in glands). CF primarily affects the respiratory system in children. • Mucus is a slimy material that coats many epithelial surfaces and is secreted into fluids such as saliva • In the early 1980s biologists hypothesized that CF is an autosomal recessive disorder caused by mutations in a gene that remained unknown until 1989 • Heterozygous carriers are asymptomatic • A person must be homozygous recessive for this gene in order to be diagnosed with CF

  5. Finding Similarities between the Cystic Fibrosis Gene and ATP binding proteins • ATP binding proteins are present on the cell membrane and act as transport channels • In 1989 biologists found similarity between the cystic fibrosis gene and ATP binding proteins • A plausible function for the cystic fibrosis gene, given that CF involves sweat secretion with abnormally high sodium levels

  6. Cystic Fibrosis: Mutation Analysis • If a high percentage of cystic fibrosis (CF) patients have a certain mutation in the gene and healthy individuals do not, then that mutation could be an indicator related to CF • A certain mutation was found in 70% of CF patients, convincing evidence that it is a predominant genetic diagnostic marker for CF

  7. Cystic Fibrosis and CFTR Gene

  8. Cystic Fibrosis and the CFTR Protein • The CFTR (Cystic Fibrosis Transmembrane conductance Regulator) protein acts in the cell membrane of epithelial cells that secrete mucus • These cells line the airways of the nose, the lungs, the stomach wall, etc.

  9. Mechanism of Cystic Fibrosis • The CFTR protein (1480 amino acids) regulates a chloride ion channel • It adjusts the “wateriness” of fluids secreted by the cell • Those with cystic fibrosis are missing a single amino acid in their CFTR • Mucus ends up being too thick, affecting many organs

  10. Bring in the Bioinformaticians • Gene similarities between two genes with known and unknown function alert biologists to some possibilities • Computing a similarity score between two genes tells how likely it is that they have similar functions

  11. Strings • Definition: An alphabet $\Sigma$ is a finite nonempty set. The elements of an alphabet are called symbols or letters. A string $S$ over an alphabet $\Sigma$ is a (finite) concatenation of symbols from $\Sigma$. The length of a string $S$ is the number of symbols in $S$, denoted by $|S|$. The set of strings of length $n$ over $\Sigma$ is denoted by $\Sigma^n$. • For DNA sequences, $\Sigma = \{A, G, C, T\}$. • Known notions: concatenation, substring, prefix, suffix • Let $\Sigma$ be an alphabet and let $S = s_1 \ldots s_n$ with $s_i \in \Sigma$. For all $i, j \in \{1, \ldots, n\}$, $i < j$, we denote the substring $s_i \ldots s_j$ by $S[i, j]$.

  12. Sequence formats • Single FASTA format: • The first line of the file starts with the symbol “>”, followed by an identifier for the sequence. • The sequence itself follows on the subsequent lines. Empty lines are not read. • Example:
      >gi|139617|sp|P09184|VSR_ECOLI PATCH REPAIR PROTEIN
      MADVHDKATRSKNMRAIATDTAIEKRLASLLTGQGLAFRVQDASLPGRPDFVVDEYRCVI
      FTHGCFWHHHHCYLFKVPATRTEFWLEKIGKNVERDRRDISRLQELGWRVLIVWECALR
      GREKLTDEALTERLEEWICGEGASAQIDTQGIHLLA

  13. Sequence formats • Multiple FASTA format: • A file can consist of several sequences, which are written as a series of FASTA entries as described above. • Example:
      >SEQUENCE 1
      ACTCFTGCGCGCGC.....
      >SEQUENCE 2
      ACGCGCGCTCFTGCGCGCGC.....
      >SEQUENCE 3
      GTTGGGACTCFTGCGCGCGC.....
      >SEQUENCE 4
      TGCGCACTCFTGCGCGCGC.....
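
The FASTA rules on the two slides above (a “>” line carries the identifier, the following lines carry the sequence, empty lines are skipped) can be sketched as a small reader. This is a minimal illustration, not a full FASTA implementation: the function name `parse_fasta` and the short sample records are made up for this example, and the whole text after “>” is taken as the identifier.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {identifier: sequence} mapping."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # empty lines are not read
        if line.startswith(">"):
            # A new entry starts; everything after ">" is the identifier.
            current_id = line[1:].strip()
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated onto the current entry.
            records[current_id] += line
    return records


# Hypothetical two-entry file, used only to exercise the parser.
example = """>SEQUENCE 1
ACGT
TTGA

>SEQUENCE 2
GGCC
""".splitlines()

records = parse_fasta(example)
for identifier, sequence in records.items():
    print(identifier, len(sequence), sequence)
```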

  14. Dot matrix sequence comparison • An (n×m) matrix relating two sequences of length n and m respectively is produced by placing a dot at each cell for which the corresponding symbols match. Here is an example for the two sequences IMISSMISSISSIPPI and MYMISSISAHIPPIE:

  15. Dot plot • Definition: Let $S = s_1 s_2 \ldots s_n$ and $T = t_1 \ldots t_m$ be two strings of length $n$ and $m$ respectively. Let $M$ be an $n \times m$ matrix. Then $M$ is a dot plot if for all $i, j$ with $1 \le i \le n$, $1 \le j \le m$: $M[i, j] = 1$ if $s_i = t_j$ and $M[i, j] = 0$ otherwise. • Note: The longest common substring within the two strings S and T is then the longest matrix subdiagonal containing only 1s. However, rather than drawing the digit 1 we draw a dot, and instead of a 0 we leave the cell blank. • Some of the properties of a dot plot are: • the visualization is easy to understand • it is easy to find common substrings, they appear as contiguous dots along a diagonal • it is easy to find reversed substrings (see assignment) • it is easy to discover displacements • it is easy to find repeats
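
The dot plot definition above translates almost directly into code. The sketch below builds the 0/1 matrix and renders it with dots and blanks; the names `dot_plot` and `render` are chosen for this example, and indices are 0-based rather than the 1-based indices of the definition.

```python
from typing import List


def dot_plot(s: str, t: str) -> List[List[int]]:
    """Return the n x m matrix M with M[i][j] = 1 if s[i] == t[j], else 0."""
    return [[1 if a == b else 0 for b in t] for a in s]


def render(matrix: List[List[int]]) -> str:
    """Draw a dot for every 1 and leave a blank for every 0."""
    return "\n".join("".join("." if cell else " " for cell in row) for row in matrix)


# The two example strings from the dot matrix slide.
M = dot_plot("IMISSMISSISSIPPI", "MYMISSISAHIPPIE")
print(render(M))
```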

  16. Dot matrix • Example: DNA sequences which encode the Bacteriophage lambda and Bacteriophage P22 repressor proteins:

  17. A window size and a stringency • Real dot plots of biological sequences will contain a lot of dots, many of which are considered noise. • To reduce the noise, a window size w and a stringency s are used, and a dot is only drawn at point (x, y) if in the next w positions at least s characters are equal. • For the example above: w = 1, s = 1; w = 11, s = 7; w = 23, s = 15
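
A minimal sketch of the window and stringency filter described above. It assumes that a window starting at (x, y) compares `s[x+k]` with `t[y+k]` for k = 0, ..., w-1, and that windows running past the end of either sequence are skipped; the slide does not specify the boundary behaviour, so that choice is an assumption. The function name `filtered_dot_plot` is made up for this example.

```python
from typing import List


def filtered_dot_plot(s: str, t: str, w: int, stringency: int) -> List[List[int]]:
    """Dot at (x, y) only if at least `stringency` of the next w positions match."""
    n, m = len(s), len(t)
    plot = [[0] * m for _ in range(n)]
    # Assumption: only windows that fit entirely inside both sequences are scored.
    for x in range(n - w + 1):
        for y in range(m - w + 1):
            matches = sum(1 for k in range(w) if s[x + k] == t[y + k])
            if matches >= stringency:
                plot[x][y] = 1
    return plot


# With w = 1 and stringency 1 the filter reduces to the plain dot plot.
plain = filtered_dot_plot("IMISSMISSISSIPPI", "MYMISSISAHIPPIE", 1, 1)
strict = filtered_dot_plot("IMISSMISSISSIPPI", "MYMISSISAHIPPIE", 4, 4)
print(sum(map(sum, plain)), sum(map(sum, strict)))
```

Raising w and the stringency keeps only dots that start a run of mostly matching characters, which is what removes isolated noise dots.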

  18. Dot matrix repeat detection • Dot matrix analysis of human LDL receptor against itself (protein sequence): w = 1, s = 1 w =?, s =?

  19. Sequence alignment • The procedure of comparing two (pair-wise alignment) or more (multiple alignment) sequences by searching for a series of individual characters or character patterns that are in the same order in both sequences. • Two sequences are aligned by writing them in two rows. Identical or similar characters are placed in the same column, whereas non-identical characters are either placed in the same column as a mismatch or are opposite a gap in the other sequence. • Two strings: IMISSMISSISSIPPI and MYMISSISAHIPPIE • Alignment:
      I-MISSMISSISIPPI-
        |||| ||   ||||
      MYMISS-ISAH-IPPIE

  20. String alignment • Given two strings X and Y, an alignment A of X and Y is obtained by inserting dashes (‘-’) so that the resulting strings X’ and Y’ have equal length and can be written one above the other in such a way that each character in the one string is opposite a unique character in the other string. • Usually, we require that no two dashes are aligned in this way. • Example:
      X = Y E - S T E R D A Y
      Y = - E A S T E R S - -
    • We need a scoring system

  21. Distance • Definition: A set $X$ of elements $x, y, \ldots \in X$ is called a metric space if for each pair $x, y \in X$ there exists a real number $d(x, y)$ with: • $d(x, y) \ge 0$, and $d(x, y) = 0 \Leftrightarrow x = y$ • $d(x, y) = d(y, x)$ • $d(x, y) \le d(x, z) + d(z, y)$ for all $z \in X$ • $d(x, y)$ is called the distance of $x$ and $y$.

  22. Minkowski metric • Definition: Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be two elements of an n-dimensional space $X$. Then $d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$ is called the Minkowski distance with parameter $p$. • Note: For $p = 1$ the distance is also called the Manhattan (or city-block) distance, for $p = 2$ we have the well-known Euclidean distance.
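
As a quick check of the two special cases named in the note, the standard Minkowski formula can be evaluated directly. The function name `minkowski_distance` and the sample points are chosen for this illustration only.

```python
from typing import Sequence


def minkowski_distance(x: Sequence[float], y: Sequence[float], p: float) -> float:
    """Minkowski distance with parameter p: (sum |x_i - y_i|^p)^(1/p)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical points in the plane.
manhattan = minkowski_distance((0, 0), (3, 4), 1)  # p = 1: city-block distance
euclidean = minkowski_distance((0, 0), (3, 4), 2)  # p = 2: Euclidean distance
print(manhattan, euclidean)
```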

  23. Hamming metric • Definition: Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be two strings of length $n$ over an alphabet $\Sigma$. Then $d_H(x, y) = |\{\, i \in \{1, \ldots, n\} : x_i \ne y_i \,\}|$ is called the Hamming distance. • Example: The Hamming distance of the two sequences X = A T A T A T A T and Y = T A T A T A T A is equal to $d_H(X, Y)$ = __.
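
The Hamming definition counts the positions at which two equal-length strings differ. A minimal sketch, with the function name `hamming_distance` chosen for this example, lets the reader check the exercise above:

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions i at which x[i] != y[i]; requires equal lengths."""
    if len(x) != len(y):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


# The two sequences from the example on the slide.
print(hamming_distance("ATATATAT", "TATATATA"))
```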

  24. Levenshtein or edit distance • The number of editing operations needed to transform one string into the other • Definition: The Levenshtein distance or edit distance $d_L$ between two strings X and Y is the minimum number of edit operations of type {Replacement, Insertion, Deletion} that one needs to transform string X into string Y: $d_L(X, Y) = \min\{R(X, Y) + I(X, Y) + D(X, Y)\}$. • The two strings need not be of equal length. • Using M for match, an edit transcript is a string over the alphabet {I, D, R, M} that describes a transformation of X to Y.

  25. Edit distance • Example: Given two strings X = YESTERDAY and Y = EASTERS, the edit distance is equal to 5, which can be easily seen from the minimum edit transcript:
      Edit transcript = D M I M M M M R D D
      X               = Y E - S T E R D A Y
      Y               = - E A S T E R S - -
    • As we see from this example, edit transcripts and alignments are mathematically equivalent ways of describing a relationship between two strings. • However, an edit transcript implies a set of putative mutational events, whereas an alignment presents a static picture of the relationship.
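
The edit distance and one minimum edit transcript can be computed with the usual dynamic programming table, where `d[i][j]` is the distance between the first i characters of X and the first j characters of Y. This is a sketch of the standard textbook algorithm, not code from the slides; the function name `edit_distance` is made up here, and when several optimal transcripts exist the one returned may differ from the transcript shown on the slide.

```python
from typing import List, Tuple


def edit_distance(x: str, y: str) -> Tuple[int, str]:
    """Return (Levenshtein distance, one optimal edit transcript over I, D, R, M)."""
    n, m = len(x), len(y)
    d: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # delete all of x[:i]
    for j in range(1, m + 1):
        d[0][j] = j  # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or replacement
    # Trace back from (n, m) to (0, 0) to recover one transcript.
    ops: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1):
            ops.append("M" if x[i - 1] == y[j - 1] else "R")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return d[n][m], "".join(reversed(ops))


distance, transcript = edit_distance("YESTERDAY", "EASTERS")
print(distance, transcript)
```

For the slide's example the computed distance agrees with the value 5 stated above.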

  26. Manhattan Tourist Problem (MTP) • Imagine seeking a path (from source to sink), traveling only eastward and southward, that passes the largest number of attractions (*) in the Manhattan grid • [Figure: Manhattan grid with source, sink, and attractions marked *]

  28. Manhattan Tourist Problem: Formulation • Goal: Find the longest path in a weighted grid. • Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink” • Output: A longest path in G from “source” to “sink”

  29. MTP: An Example • [Figure: weighted grid with i and j coordinates 0 to 4, source and sink vertices, edge weights, and path scores]

  30. MTP: Greedy Algorithm Is Not Optimal • Promising start, but leads to bad choices! • [Figure: weighted grid from source to sink comparing the greedy path with a better path]

  31. MTP: Simple Recursive Program
      MT(n, m)
        if n = 0 and m = 0 return 0
        x ← −∞; y ← −∞
        if n > 0: x ← MT(n−1, m) + length of the edge from (n−1, m) to (n, m)
        if m > 0: y ← MT(n, m−1) + length of the edge from (n, m−1) to (n, m)
        return max{x, y}

  32. MTP: Dynamic Programming • Calculate the optimal path score for each vertex in the graph • Each vertex’s score is the maximum, over its prior vertices, of the prior vertex’s score plus the weight of the edge in between • [Figure: grid with edge weights; scores computed in this step: $s_{0,1} = 1$, $s_{1,0} = 5$]

  33. MTP: Dynamic Programming (cont’d) • [Figure: scores computed in this step: $s_{0,2} = 3$, $s_{1,1} = 4$, $s_{2,0} = 8$]

  34. MTP: Dynamic Programming (cont’d) • [Figure: scores computed in this step: $s_{0,3} = 8$, $s_{1,2} = 13$, $s_{2,1} = 9$, $s_{3,0} = 8$]

  35. MTP: Dynamic Programming (cont’d) • [Figure: scores computed in this step: $s_{1,3} = 8$, $s_{2,2} = 12$, $s_{3,1} = 9$] • Greedy algorithm fails!

  36. MTP: Dynamic Programming (cont’d) • [Figure: scores computed in this step: $s_{2,3} = 15$, $s_{3,2} = 9$]

  37. MTP: Dynamic Programming (cont’d) • Done! (showing all back-traces) • $s_{3,3} = 16$ • Final scores $s_{i,j}$ (rows i = 0..3, columns j = 0..3):
      i = 0:   0   1   3   8
      i = 1:   5   4  13   8
      i = 2:   8   9  12  15
      i = 3:   8   9   9  16

  38. MTP: Recurrence • Computing the score for a point (i, j) by the recurrence relation: $s_{i,j} = \max \{\, s_{i-1,j} + \text{weight of the edge between } (i-1, j) \text{ and } (i, j),\ \ s_{i,j-1} + \text{weight of the edge between } (i, j-1) \text{ and } (i, j) \,\}$ • The running time is $O(nm)$ for an n by m grid (n = # of rows, m = # of columns)
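
The recurrence can be filled in row by row. The sketch below assumes the edge weights are given as two arrays, `down[i][j]` for the edge from (i, j) to (i+1, j) and `right[i][j]` for the edge from (i, j) to (i, j+1); this representation, the function name `manhattan_tourist`, and the small example grid are all made up for illustration and are not the grid from the slides.

```python
from typing import List


def manhattan_tourist(down: List[List[int]], right: List[List[int]]) -> List[List[int]]:
    """Return the score table s, where s[i][j] is the longest path weight from (0, 0) to (i, j)."""
    n = len(down)      # number of vertical steps (rows of vertices minus 1)
    m = len(right[0])  # number of horizontal steps (columns of vertices minus 1)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # The first column and first row can only be reached along the border.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    # Every other vertex takes the better of its northern and western neighbour.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


# Hypothetical 3 x 3 vertex grid (2 steps down, 2 steps right).
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [2, 4],
         [0, 2]]
scores = manhattan_tourist(down, right)
print(scores[-1][-1])  # score at the sink
```

Each vertex is visited once and does constant work, which matches the $O(nm)$ running time stated on the slide.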

  39. Manhattan Is Not A Perfect Grid • What about diagonals? • The score at point B, with incoming edges from A1, A2, and A3, is given by: $s_B = \max \{\, s_{A_1} + \text{weight of the edge } (A_1, B),\ \ s_{A_2} + \text{weight of the edge } (A_2, B),\ \ s_{A_3} + \text{weight of the edge } (A_3, B) \,\}$

  40. Manhattan Is Not A Perfect Grid (cont’d) • Computing the score for point x is given by the recurrence relation: $s_x = \max_{y \in \text{Predecessors}(x)} \{\, s_y + \text{weight of edge } (y, x) \,\}$ • Predecessors(x): the set of vertices that have edges leading to x • The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once

  41. Traveling in the Grid • The only hitch is that one must decide on the order in which to visit the vertices. • By the time the vertex x is analyzed, the values $s_y$ for all its predecessors y should be computed; otherwise we are in trouble. • We need to traverse the vertices in some order. • Try to find such an order for a directed cycle. ???

  42. DAG: Directed Acyclic Graph • Since Manhattan is not a perfect regular grid, we represent it as a DAG • DAG for Dressing in the morning problem

  43. Topological Ordering • A numbering of the vertices of the graph is called a topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label • In other words, if vertices are positioned on a line in increasing order of labels, then all edges go from left to right.

  44. Topological Ordering • 2 different topological orderings of the DAG

  45. Longest Path in DAG Problem • Goal: Find a longest path between two vertices in a weighted DAG • Input: A weighted DAG G with source and sink vertices • Output: A longest path in G from source to sink

  46. Longest Path in DAG: Dynamic Programming • Suppose vertex v has indegree 3 and predecessors {u1, u2, u3} • The longest path to v from source is: $s_v = \max \{\, s_{u_1} + \text{weight of edge from } u_1 \text{ to } v,\ \ s_{u_2} + \text{weight of edge from } u_2 \text{ to } v,\ \ s_{u_3} + \text{weight of edge from } u_3 \text{ to } v \,\}$ • In general: $s_v = \max_u \,(s_u + \text{weight of edge from } u \text{ to } v)$
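
Putting the topological ordering and the recurrence together gives a longest-path procedure for any weighted DAG. The sketch below is one possible realization under stated assumptions: the ordering is produced with Kahn's algorithm (the slides do not name a specific method), scores are pushed forward along outgoing edges, which yields the same maximum as taking the max over predecessors, and the function name `longest_path_dag` and the example graph are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, List, Optional, Tuple

Edge = Tuple[Hashable, Hashable, float]


def longest_path_dag(edges: List[Edge], source: Hashable,
                     sink: Hashable) -> Optional[Tuple[float, List[Hashable]]]:
    """Return (score, path) of a longest source-to-sink path, or None if sink is unreachable."""
    succ: Dict[Hashable, List[Tuple[Hashable, float]]] = defaultdict(list)
    indegree: Dict[Hashable, int] = defaultdict(int)
    vertices = {source, sink}
    for u, v, w in edges:
        succ[u].append((v, w))
        indegree[v] += 1
        vertices.update((u, v))
    # Kahn's algorithm: repeatedly remove vertices that have no unprocessed predecessor.
    queue = deque(v for v in vertices if indegree[v] == 0)
    order: List[Hashable] = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(vertices):
        raise ValueError("graph contains a directed cycle; no topological ordering exists")
    # Process vertices in topological order so every predecessor is final first.
    score = {v: float("-inf") for v in vertices}
    score[source] = 0
    back: Dict[Hashable, Hashable] = {}
    for u in order:
        if score[u] == float("-inf"):
            continue  # u is not reachable from the source
        for v, w in succ[u]:
            if score[u] + w > score[v]:
                score[v] = score[u] + w
                back[v] = u
    if score[sink] == float("-inf"):
        return None
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return score[sink], path[::-1]


# Hypothetical weighted DAG.
example_edges = [("s", "a", 1), ("s", "b", 5), ("a", "t", 10), ("b", "t", 2), ("a", "b", 3)]
result = longest_path_dag(example_edges, "s", "t")
print(result)
```

Each edge is examined once in the ordering step and once in the scoring step, in line with the O(E) bound mentioned earlier. A directed cycle is reported as an error because no valid visiting order exists for it.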

  47. Traversing the Manhattan Grid • 3 different strategies: • a) Column by column • b) Row by row • c) Along diagonals

  48. Alignment: 2 row representation • Given 2 DNA sequences v and w: v = ATCTGAT (m = 7), w = TGCATA (n = 6) • Alignment: a $2 \times k$ matrix ($k \ge m, n$)
      letters of v:  A T - G T T A T -
      letters of w:  A T C G T - A - C
    • 4 matches, 2 insertions, 2 deletions

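  49. Aligning DNA Sequences • V = ATCTGATG (n = 8) • W = TGCATAC (m = 7) • 4 matches, 1 mismatch, 2 insertions, 2 deletions • [Figure: alignment of V and W with match, mismatch, insertion, and deletion columns labeled; insertions and deletions together are called indels]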

  50. Longest Common Subsequence (LCS) – Alignment without Mismatches • Given two sequences $v = v_1 v_2 \ldots v_m$ and $w = w_1 w_2 \ldots w_n$ • The LCS of v and w is a sequence of positions in v: $1 \le i_1 < i_2 < \ldots < i_t \le m$ and a sequence of positions in w: $1 \le j_1 < j_2 < \ldots < j_t \le n$ such that the $i_k$-th letter of v equals the $j_k$-th letter of w for every $k = 1, \ldots, t$, and t is maximal
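
The LCS definition can be turned into the standard dynamic programming table, where `s[i][j]` is the LCS length of the first i letters of v and the first j letters of w. This is a sketch of the textbook algorithm rather than code from the slides; the function name `lcs` is made up, and when several longest common subsequences exist only one of them is returned.

```python
from typing import List


def lcs(v: str, w: str) -> str:
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1  # extend a common subsequence by a match
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])  # skip a letter of v or of w
    # Trace back from (n, m) to collect the matched letters.
    letters: List[str] = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


common = lcs("ATCTGAT", "TGCATA")
print(len(common), common)
```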
