
Bioinformatics Algorithms and Data Structures

Bioinformatics Algorithms and Data Structures. Chapter 11.8: Gaps Lecturer: Dr. Rose Slides by: Dr. Rose February 8, 2007. Gaps. Our investigation of alignment has focused on: Matches Mismatches Spaces An important concept is that of gaps .



  1. Bioinformatics Algorithms and Data Structures • Chapter 11.8: Gaps • Lecturer: Dr. Rose • Slides by: Dr. Rose • February 8, 2007

  2. Gaps • Our investigation of alignment has focused on: • Matches • Mismatches • Spaces • An important concept is that of gaps. • Defn. A gap is a maximal consecutive run of spaces in a single string of a given alignment. • Q: Can a single space be a gap? • A: Yes, if there are no adjacent spaces.

  3. Gaps • Gaps can occur: • Before the first character of a string • After the last character of a string • Inside a string • Example:
       c t g c g g g - - - g g t a a a t
       - - g c g g - a g a g g - a a a -
     • Q: How many gaps are there? • A: 5 (one in the first string, four in the second)
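The definition of a gap as a maximal run of spaces can be checked mechanically. The sketch below assumes the example on this slide splits into two rows of 17 columns each and that a space is written as `-`; the function name `count_gaps` is made up for illustration.

```python
import re

def count_gaps(row: str) -> int:
    """Count the maximal consecutive runs of '-' in one row of an alignment."""
    return len(re.findall(r"-+", row))

# The two rows of the slide's example, with blanks between characters removed
# (assumed reading of the flattened example).
s1 = "ctgcggg---ggtaaat"
s2 = "--gcgg-agagg-aaa-"

total_gaps = count_gaps(s1) + count_gaps(s2)
print(total_gaps)
```

A single isolated space counts as one gap, so `count_gaps("a-b")` is 1.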

  4. Gaps • Q: Other than our recognition of gaps, did the preceding example show anything new? • A: No. • Q: Then what motivates the introduction of this concept? • A: We can include a gap term in the objective function for computing alignment. • So??? • So we can influence the distribution of gaps.

  5. Gaps • Analogy: specifying the location of the hole is critical to donut making; otherwise you'll end up with a berliner. • Example: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - kW_g$ • In this objective fcn, each gap contributes the constant weight Wg irrespective of the gap length. • The variable k indicates the number of gaps in the alignment.

  6. Gaps • Recall, a space in an alignment corresponds to an insertion or deletion in the edit transcript. • A gap corresponds to an atomic insertion or deletion of an entire substring. • Biologically, mutations are such atomic events. • A single mutation can create a gap • The size of the gap can vary over a large range with equal likelihood.

  7. Gaps Sources of mutation mentioned by the textbook: • Unequal cross-over in meiosis → insertion in one string and corresponding deletion in the other. • http://www4.ncsu.edu:8030/unity/users/b/bnchorle/www/ • DNA slippage: slippage during replication, resulting in the repetition of a substring • Retrovirus insertions • Translocations of DNA between chromosomes

  8. Gaps • Common gaps in aligned strings can be used to deduce evolutionary history • Mutations at the single character level are frequent. • Does anybody know what these are called? • → This makes it difficult to determine evolutionary relationships at the DNA sequence level. • Large gaps occur less frequently. → Gap features can be used to recognize similarity over long periods of time. • See Figure 11.6 for an example of a gap as an alignment feature

  9. Gaps • Consider: • An alignment should reflect the cost of mutational events transforming one string into the other. • A single mutation can produce a gap of more than one space • Consequently: • Distribution of spaces into gaps should follow a plausible model • Gap weights should be modeled to reflect biological meaning

  10. Motivation: cDNA Matching • Preliminaries: • A single gene is composed of exons and introns • Exons are the coding parts of the gene • Introns are the noncoding parts between exons • Gene expression: • RNA is transcribed from DNA • DNA:A → RNA:U (uracil) • DNA:C → RNA:G • DNA:G → RNA:C • DNA:T → RNA:A
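The base mapping listed on this slide can be written as a small lookup table. This is a minimal sketch: the names `DNA_TO_RNA` and `transcribe` are illustrative, and the input is assumed to be the DNA strand being copied, read position by position with no strand reversal.

```python
# DNA base -> RNA base, exactly as listed on the slide.
DNA_TO_RNA = {"A": "U", "C": "G", "G": "C", "T": "A"}

def transcribe(dna: str) -> str:
    """Map each DNA base to its complementary RNA base, position by position."""
    return "".join(DNA_TO_RNA[base] for base in dna.upper())

rna = transcribe("ACGT")
print(rna)
```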

  11. Motivation: cDNA Matching • Gene expression continued: • RNA is transformed into mRNA (messenger RNA) • The introns are excised • The remaining exons are concatenated • The resulting mRNA leaves the cell nucleus • A ribosome: • Translates the mRNA into the corresponding protein by • parsing the mRNA into codons • assembling amino acids in the order specified by the codons. • The resulting sequence of amino acids is the protein

  12. Motivation: cDNA Matching • Imagine that you have the mRNA for a protein and want to find the corresponding gene. • (Wet biology) Take the mRNA and create complementary DNA (cDNA). • Map mRNA:U → cDNA:A • Note: cDNA differs from DNA in several respects • cDNA does not contain intron substrings • The nucleotides in cDNA complement the nucleotides in the corresponding DNA, i.e., A ↔ T and C ↔ G

  13. Motivation: cDNA Matching • (Wet biology) Hybridize the cDNA with the DNA • In hybridization, complementary nucleotides try to match up, i.e., A ↔ T and C ↔ G • Sections of the cDNA will hybridize with the corresponding sections of DNA. • The non-hybridizing segments are gaps • Possibly corresponding to introns

  14. Motivation: cDNA Matching • Now imagine that you have the mRNA sequence for a protein and want to find the corresponding gene without doing wet biology. • Take the mRNA and create complementary DNA (cDNA). • Map mRNA:U → cDNA:A with a computer • While we are at it, compile a library of each cDNA string that we create for future use.
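Doing the mRNA-to-cDNA step "with a computer" amounts to a per-character complement. The sketch below is one way to do it under stated assumptions: only U → A comes from the slide, the other three pairs follow the usual complement rule, strand orientation is ignored, and `mrna_to_cdna` and `cdna_library` are made-up names.

```python
# mRNA base -> complementary cDNA base (U -> A as on the slide).
RNA_TO_CDNA = {"A": "T", "C": "G", "G": "C", "U": "A"}

def mrna_to_cdna(mrna: str) -> str:
    """Complement an mRNA string base by base to obtain a cDNA string."""
    return "".join(RNA_TO_CDNA[base] for base in mrna.upper())

# A tiny in-memory "library" of cDNA strings keyed by a label, for later reuse.
cdna_library: dict[str, str] = {}
cdna_library["example"] = mrna_to_cdna("UGCA")
print(cdna_library)
```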

  15. Motivation: cDNA Matching • Align (hybridize) the cDNA with the DNA • We assume that the relevant genome has been sequenced. • We have a short string (cDNA) and a very long string, the genome. • Align complementary nucleotides in the two strings, i.e., A ↔ T and C ↔ G • Sections of the cDNA will align (hybridize) with the corresponding sections of the genome. • The non-aligning (non-hybridizing) segments are gaps • Possibly corresponding to introns

  16. Motivation: cDNA Matching • Q: What kind of objective fcn do we need to align cDNA with DNA? • Features: • Small penalties for spaces • Q: Why does this matter? • A: Large penalties would force the cDNA to bunch up, not allowing gaps for introns

  17. Motivation: cDNA Matching • Features: continued • Large penalties for mismatches • Some mismatches are unavoidable (sequencing error) • Long sequences of mismatches must be avoided • Positive values for matches • We want to reward exon matches • Gap penalties

  18. Motivation: cDNA Matching • Gap penalties • Q: Assume match + (reward), mismatch −− (large penalty), space − (small penalty); what happens if there is no gap penalty? • A: The alignment would be the longest common subsequence → a match of ALL characters in the cDNA string • Match of cDNA with noncoding DNA • Tells us nothing about the position of the exons

  19. Motivation: cDNA Matching • Gap penalties continued • Soln: augment objective fcn with a gap term • Complication: pseudogenes

  20. Motivation: Pseudogenes • Pseudogenes • Nonworking inexact copy of a gene • Conceptually: • a trial gene not ready for prime time or • a failed gene mutation • The pseudogene may be very far from the actual gene

  21. Motivation: Pseudogenes • Pseudogenes: processed pseudogenes • Contain only exon substrings • Introns have been removed & exons concatenated • Theory: mRNA that is transcribed back into DNA and inserted into a random position. • Problem: • Assume the DNA might contain the pseudogene & the working gene • How can processed pseudogenes be located?

  22. Gap Weights • Q: What types of gap weight can we choose from? • A: The textbook lists four general types: • Constant gap weight • Affine gap weight • Convex gap weight • Arbitrary gap weight

  23. Gap Weights • Constant gap weight: simplest • No cost for an individual space • Gaps are assigned a constant weight Wg • Operator-weight objective fcn: $W_m(\#\text{matches}) - W_{ms}(\#\text{mismatches}) - W_g(\#\text{gaps})$, where Wm = match weight & Wms = mismatch weight • Alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - W_g(\#\text{gaps})$. Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet.

  24. Gap Weights • Affine gap weight • Extend the constant gap weight with a charge for each space, Ws. • Wg is the gap initiation charge • Ws is the gap extension charge • Gap weight is given by the affine function $W_g + qW_s$, where q is the number of spaces in the gap. • Operator-weight objective fcn: $W_m(\#\text{matches}) - W_{ms}(\#\text{mismatches}) - W_g(\#\text{gaps}) - W_s(\#\text{spaces})$
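To make the operator-weight objective concrete, the sketch below scores a given alignment by counting matches, mismatches, gaps, and spaces. The function name `score_alignment`, the use of `-` for a space, and the sample weights are assumptions for illustration; the sample rows are the slide 3 example read as two rows of 17 columns. Setting `ws=0` gives the constant gap weight objective.

```python
import re

def score_alignment(row1: str, row2: str, wm: float, wms: float,
                    wg: float, ws: float) -> float:
    """Wm(#matches) - Wms(#mismatches) - Wg(#gaps) - Ws(#spaces)."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue                      # a column containing a space
        if a == b:
            matches += 1
        else:
            mismatches += 1
    spaces = row1.count("-") + row2.count("-")
    # A gap is a maximal run of spaces within a single row.
    gaps = len(re.findall(r"-+", row1)) + len(re.findall(r"-+", row2))
    return wm * matches - wms * mismatches - wg * gaps - ws * spaces

s1 = "ctgcggg---ggtaaat"
s2 = "--gcgg-agagg-aaa-"
affine = score_alignment(s1, s2, wm=1, wms=1, wg=2, ws=1)
constant = score_alignment(s1, s2, wm=1, wms=1, wg=2, ws=0)
print(affine, constant)
```

With these illustrative weights the alignment has 9 matches, 0 mismatches, 5 gaps, and 8 spaces.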

  25. Gap Weights • Affine alphabet-weight objective fcn: $\sum_{i=1}^{l} s(S'_1(i), S'_2(i)) - W_g(\#\text{gaps}) - W_s(\#\text{spaces})$. Here s(x,_) = s(_,x) = 0 for every letter x in the alphabet. • An important question is what the values of Wg and Ws should be. • Obviously, this is related to the similarity matrix s. • The textbook says FASTA uses Wg = 10 and Ws = 2 for protein sequences

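  26. Gap Weights • Convex gap weight • Idea: each additional space contributes less than the one before • Example: $W_g + \log_e q$, where q is the number of spaces in the gap • Longer gaps are still penalized, but the penalty grows slowly with length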

  27. Arbitrary Gap Weight • Arbitrary gap weight • The gap weight is an arbitrary function, w(q), of the gap length. • Obviously, the preceding weight types are subcases of the arbitrary gap weight model
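The constant, affine, and convex models are all instances of a function w(q) of the gap length q. A minimal sketch, with made-up factory names and illustrative weights, shows the three side by side and compares how much one additional space costs in each.

```python
import math
from typing import Callable

GapWeight = Callable[[int], float]

def constant_gap(wg: float) -> GapWeight:
    return lambda q: wg                    # same cost for any gap length

def affine_gap(wg: float, ws: float) -> GapWeight:
    return lambda q: wg + q * ws           # initiation + per-space extension

def convex_gap(wg: float) -> GapWeight:
    return lambda q: wg + math.log(q)      # the slide's example, Wg + log_e q

def marginal_cost(w: GapWeight, q: int) -> float:
    """Extra cost of growing a gap from q to q + 1 spaces."""
    return w(q + 1) - w(q)

for name, w in [("constant", constant_gap(10)),
                ("affine", affine_gap(10, 2)),
                ("convex", convex_gap(10))]:
    print(name, [round(marginal_cost(w, q), 3) for q in (1, 2, 3)])
```

The marginal cost is zero for the constant model, fixed at Ws for the affine model, and shrinking for the convex model.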

  28. Arbitrary Gap Weight Arbitrary gap weight recurrence: three types of alignments for S1[1..i] and S2[1..j]: • Case 1: S1(i) aligns to the left of S2(j) → S1 ends with a gap. Let E(i, j) be the maximal value for alignment case 1. • Case 2: S1(i) aligns to the right of S2(j) → S2 ends with a gap. Let F(i, j) be the maximal value for alignment case 2. • Case 3: S1(i) coaligns with S2(j). Let G(i, j) be the maximal value for alignment case 3. • Let V(i, j) be the maximum of E(i, j), F(i, j), & G(i, j).

  29. Arbitrary Gap Weight We have the following recurrences: • $V(i, j) = \max[E(i, j), F(i, j), G(i, j)]$ • $G(i, j) = V(i-1, j-1) + s(S_1(i), S_2(j))$, where S1(i) and S2(j) are co-aligned. • $E(i, j) = \max_{0 \le k \le j-1}[V(i, k) - w(j-k)]$ → S1 ends with a gap. • $F(i, j) = \max_{0 \le l \le i-1}[V(l, j) - w(i-l)]$ → S2 ends with a gap.

  30. Arbitrary Gap Weight The base conditions are: $V(i, 0) = -w(i)$, $V(0, j) = -w(j)$, $E(i, 0) = -w(i)$, $F(0, j) = -w(j)$, $G(0, 0) = 0$, but G(i, j) is undefined if exactly one of i or j is 0. If end spaces are free then end gaps are free and: $V(i, 0) = 0$, $V(0, j) = 0$
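The recurrences and base conditions for an arbitrary gap weight can be sketched as a direct table fill. This is an illustrative implementation, not the textbook's code: the function name, the scoring function `s`, and the example weights are assumptions, and only the value V(n, m) is returned (no traceback). End gaps are charged.

```python
from typing import Callable

def align_arbitrary_gaps(s1: str, s2: str,
                         s: Callable[[str, str], float],
                         w: Callable[[int], float]) -> float:
    """Return V(n, m) for gap weight w(q), charging end gaps."""
    n, m = len(s1), len(s2)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -w(i)                   # S1[1..i] against one gap
    for j in range(1, m + 1):
        V[0][j] = -w(j)                   # S2[1..j] against one gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # G: S1(i) and S2(j) are co-aligned.
            G = V[i - 1][j - 1] + s(s1[i - 1], s2[j - 1])
            # E: S1 ends with a gap; scan every earlier cell in row i.
            E = max(V[i][k] - w(j - k) for k in range(j))
            # F: S2 ends with a gap; scan every earlier cell in column j.
            F = max(V[l][j] - w(i - l) for l in range(i))
            V[i][j] = max(E, F, G)
    return V[n][m]

match_score = lambda a, b: 1 if a == b else -1
value = align_arbitrary_gaps("ACGTACGT", "ACGT", match_score, lambda q: 2 + q)
print(value)
```

In the usage line, the best alignment matches all four characters of the shorter string and pays for a single gap of four spaces.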

  31. Arbitrary Gap Weight Up until this point all dynamic programming examples have had complexity $O(nm)$. Q: What is the complexity of computing V(i, j)? A: $O(nm^2 + n^2m)$. Q: Why does the consideration of gaps require $O(nm^2 + n^2m)$? A: Previous computations depended only on the 3 adjacent cells. Considering gaps entails considering all preceding cells in the row and column.

  32. Arbitrary Gap Weight Thm. If |S1| = n and |S2| = m, the recurrences can be solved in $O(nm^2 + n^2m)$ time. Proof. $(n+1)(m+1)$ cells in the table are filled. To fill cell (i, j): • E(i, j) examines j cells of row i, $\max_{0 \le k \le j-1}[V(i, k) - w(j-k)]$. A row entails $m(m+1)/2 = O(m^2)$ cell examinations to evaluate E for that row. • F(i, j) examines i cells of column j, $\max_{0 \le l \le i-1}[V(l, j) - w(i-l)]$. A column entails $n(n+1)/2 = O(n^2)$ cell examinations to evaluate F for that column. • G(i, j) examines one other cell. • Since there are n rows and m columns, the total is $O(nm^2 + n^2m)$.
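The count in the proof can be written out as a single sum over the interior cells, where cell (i, j) examines j cells for E, i cells for F, and one cell for G:

$$
\sum_{i=1}^{n}\sum_{j=1}^{m} (j + i + 1)
= n\cdot\frac{m(m+1)}{2} + m\cdot\frac{n(n+1)}{2} + nm
= O(nm^2 + n^2m).
$$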

  33. Affine Gap Weight • $O(nm^2 + n^2m)$ is expensive. • The affine gap weight model supports $O(nm)$ computation. • Recall, we want to maximize the operator-weight objective fcn: $W_m(\#\text{matches}) - W_{ms}(\#\text{mismatches}) - W_g(\#\text{gaps}) - W_s(\#\text{spaces})$ • As before, three types of alignments: • S1(i) aligns to the left of S2(j) → S1 ends with a gap. • S1(i) aligns to the right of S2(j) → S2 ends with a gap. • S1(i) coaligns with S2(j). • We will use E(i, j), F(i, j), G(i, j) & V(i, j), but we will modify the gap weight

  34. Affine Gap Weight Q: How can the cost be reduced from $O(nm^2 + n^2m)$ to $O(nm)$? A: The affine model sets a constant cost per space. Q: How does this help? A: It is not necessary to do row ($O(m^2)$) & column ($O(n^2)$) searches → It doesn't matter where a gap began, only whether the alignment already ends in one: each additional space costs Ws.

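  35. Affine Gap Weight The base conditions where end gaps are included are: $V(i, 0) = F(i, 0) = -W_g - iW_s$, $V(0, j) = E(0, j) = -W_g - jW_s$. E(i, 0) and F(0, j) are undefined, since S1 cannot end with a gap when j = 0 and S2 cannot end with a gap when i = 0 (by the definitions of E and F). If end spaces are free then end gaps are free and: $V(i, 0) = V(0, j) = 0$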

  36. Affine Gap Weight We have the following recurrences: • $V(i, j) = \max[E(i, j), F(i, j), G(i, j)]$ • $G(i, j) = V(i-1, j-1) + W_m$ if $S_1(i) = S_2(j)$ • $G(i, j) = V(i-1, j-1) - W_{ms}$ if $S_1(i) \ne S_2(j)$ • $E(i, j) = \max[E(i, j-1),\ V(i, j-1) - W_g] - W_s$ → S1 ends with a gap. • $F(i, j) = \max[F(i-1, j),\ V(i-1, j) - W_g] - W_s$ → S2 ends with a gap. • Notice that each recurrence examines only a constant number of cells.

  37. Affine Gap Weight The textbook explains E(i, j) in detail. Let's consider $F(i, j) = \max[F(i-1, j),\ V(i-1, j) - W_g] - W_s$ • F(i, j) is the case where S2 ends with a gap. The recurrence considers two cases: • S2(j) is exactly one place to the left of S1(i): a new gap starts opposite S1(i), so $F(i, j) = V(i-1, j) - W_g - W_s$ • S2(j) is to the left of S1(i - 1): the same gap that is aligned with S1(i - 1) extends to S1(i), so $F(i, j) = F(i-1, j) - W_s$
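The affine recurrences can be sketched as an O(nm) table fill in which each cell looks at a constant number of neighbours. The function name and sample weights are illustrative. One choice made in this sketch, stated as an assumption: E(i, 0) and F(0, j) are initialised to −∞, because no alignment of S1[1..i] with the empty string can end with a gap in S1 (and symmetrically for F), so a gap in one string is never "extended" into the other. End gaps are charged, and only V(n, m) is returned.

```python
NEG_INF = float("-inf")

def align_affine(s1: str, s2: str, wm: float, wms: float,
                 wg: float, ws: float) -> float:
    """Return V(n, m) under the affine gap weight Wg + q*Ws."""
    n, m = len(s1), len(s2)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S1 ends with a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # S2 ends with a gap
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = -wg - i * ws   # S1[1..i] against one gap in S2
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = -wg - j * ws   # S2[1..j] against one gap in S1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G = V[i - 1][j - 1] + (wm if s1[i - 1] == s2[j - 1] else -wms)
            # Extend an existing gap (pay Ws) or open a new one (pay Wg + Ws).
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - wg) - ws
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - wg) - ws
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]

value = align_affine("ACGTACGT", "ACGT", wm=1, wms=1, wg=2, ws=1)
print(value)
```

Keeping E and F as full tables is the simplest layout; since each row depends only on the previous one, a space-saving variant could keep two rows instead.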
