1 / 39

Estimating seed sensitivity on homogeneous alignments

Estimating seed sensitivity on homogeneous alignments. BIBE 2004 Taichung - May 20th, 2004 Gregory Kucherov 1 , Laurent Noé 1 , Yann Ponty 2 1 LORIA, Nancy 2 LRI, Paris, France. Detected seeds. Detected alignment. Seed paradigm ( FASTA, BLAST, PatternHunter, YASS, … ).

ipo
Download Presentation

Estimating seed sensitivity on homogeneous alignments

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Estimating seed sensitivity on homogeneous alignments BIBE2004 Taichung - May 20th, 2004 Gregory Kucherov1, Laurent Noé1, Yann Ponty2 1LORIA, Nancy 2LRI, Paris, France

  2. Detected seeds Detected alignment Seed paradigm (FASTA, BLAST,PatternHunter, YASS, … ) • Start with small conserved and easily detected fragments (seeds). • Then extend the seeds and build possible alignments Dot plot ctcgactcgggctcacgctcgcaccgggttacagcggtcgattgcat aggcctcgggctcgcgctcgcgcgctagacaccgggttacagcgt

  3. Spaced Seed Model [Ma & al. 02] • Seed Pattern : ###--#-## ‘#’ : obligatory match position ‘-’ : joker position (“don’t care” position) Weight : 6 [number of #] Span : 9 [number of all symbols] • Example : ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA

  4. How to describe Selectivity and Sensitivity • Selectivity seed weight: number of random occurrences ~ 4-weight . • Sensitivity probability for the seed to detect an interesting similarity. To be specified: • What set of similarities do we want to detect? • What is the probability of each similarity?

  5. What is a good seed?Sensitivity/Selectivity balance • Seed of relativelylargeweight : • Few random seed matches (high selectivity) • Possible loss of similarities (low sensitivity) • Seed of relativelysmallweight : • Detect almost all possible similarities (high sensitivity) • Many random seed matches (low selectivity)

  6. binary word Similarity: notation • Ungapped similarities only (no indels) CTACGATGAGCTGCT |||:||:||||:||| CTATGACGAGCGGCT • All matches are equiprobable, all mismatches are equiprobable (simplification)

  7. { , }n Similarities to be detected [Bulher et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …] • The set: • all strings in (given n) • The probabilities: • Bernoulli model; • Markov models;

  8. Similarities to be detected [Bulher et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …] Advantage: • natural probability model; • DP algorithms to compute sensitivity; Disadvantage: • Uninteresting similarities are included in the set.

  9. Similarities to be detected and their probabilities: our approach Only “true” similarities to be considered Scoring Scheme Score = 12r – 3p +r+r+r-p+r+r-p+r+r+r+r-p+r+r+r CTACGATGAGCTGCT |||:||:||||:||| CTATGACGAGCGGCT

  10. Similarities to be detected and their probabilities: our approach Homogeneous similarities : • do not contain sub-alignment of higher score : • (cf. Maximum Scoring Pairs) • all prefixes and suffixes of the similarity have non-negative score.

  11. Homogeneous alignment Homogeneous similarities (Prefix) Score Alignment homogeneous similarities occur entirely inside shaded area:

  12. Negative suffix Prefix of higher score Suffix of higher score Negative prefix Non homogeneous similarities Score Alignment Score Alignment

  13. Our Model • The set: Homogeneous similarities of • given length n and • given score S • The probabilities : all similarities of the set have same probability.

  14. Problem Statement • Given: 1) a seed  of weight w and span l, 2) integer scoring scheme {r, p}, 3) similarity length n, 4) score S. • Compute: the probability for the seed  to match a random homogeneous similarity of length n and score S

  15. Computation of Seed Sensitivity (homogeneous case) To be computed: probability for a seed to detect a homogeneous similarity, i.e. Two steps of computation • Preprocessing: counting number Nhom of all homogeneous similarities of given length n and score S. • DP algorithm: Space ant time complexity: • Seed sensitivity measure: counting number of homogeneous similarities detected by seed . • DP algorithm: (similar to Keich03).

  16. (k,y) Number of Homogeneous Similarities: Reduction to Graph Path problem (n,S) • Vertices: {(k, y)}, where k is a length of similarity, y is its score. Vertex (k, y) corresponds to the set of all similarities of score y and length k. • 2 edges from each (k, y): • (k+1, y+r) - for match at position (k+1) ; • (k+1, y-p) - for mismatch at position (k+1); • Homogeneous Similarities  Paths from (0,0) to (n,S) inside nxS grid Score (0,0) Alignment

  17. (n,S) (k,y) (0,0) Number of Homogeneous Similarities:Recursive equation Score S fixed: number of possible paths from (0,0) to (n,S). D(y, k) = D(y+r, k+1) + D(y-p, k+1) Taking into account border effects

  18. Time and complexity • Space Complexity • Time Complexity

  19. Computer experiments:Homogeneous vs. All similarities • Compare the sensitivity of seeds on both models • Fixed score S according to the scoring scheme (r=+1; p=-3) • Similarity length varies from 20 to 120 • Two sets of similarities: (1) all similarities of given length and score; (2) only homogeneous similarities of given length and score; • Comparison plots • x axis : similarity length • y axis : sensitivity (probability that the seed matches a similarity)

  20. Experiments (score 16) contiguous seed (weight 11) ###########

  21. Experiments (score 16) spaced seed (weight 11) ###-#--#-#--##-###

  22. Optimal seeds Optimal seeds are different Optimal seeds are the same

  23. Summary • We have proposed • a new definition of seed sensitivity based on the notion of homogeneous similarity; • a DP algorithm to compute the sensitivity of a given seed. • Sensitivity of a seed on homogeneous similarities is usually substantially larger than on all similarities • Optimal seed on homogeneous similarities may be not optimal on all similarities and vice versa.

  24. Extensions • Combining homogeneity constraint with properties of DNA sequences • Distinguishing different mismatches (transitions/transversions): YASS http://www.loria.fr/projects/YASS/

  25. Estimating seed sensitivity on homogeneous alignments Gregory Kucherov1, Laurent Noé1, Yann Ponty2 1LORIA (Laboratoire lorrain de recherche en informatique et ses applications), Nancy,France 2LRI (Laboratoire de recherche en informatique), Paris, France

  26. Collaborators Thanks !! Mikhail Roytberg (Institute of Mathematical Problem of Biology, Russia) for his comments and helpful discussion during the preparation of this work. Alain Denise (Laboratoire de Recherche en Informatique, France) for his help on culminating paths.

  27. Thank you for your attention ! ????

  28. Number of Detected Homogeneous Similarities: Reduction to Graph Path problem(after Keich03) Vertexes:{(k, y, t, u)} , where k is a length of similarity, y is its score; 1≤t≤ l ; u isa binary word of length l-w. {(k, y, t, u)} corresponds to set H (k, y, t, u) consists of homogeneous similarities of length kand score y; • t is a length of maximal prefix of the seed , matching an end of a word v from H(k, y, t, u); • u is a word, consisting of symbols corresponding to the joker position within the matching. • t and uare same for all similarities from H

  29. Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 k = 17, y = 8, t= 8, u = k’ = k + 1, y’ = y + r, t’= t + 1, u’ = u u t Number of Detected Homogeneous Similarities. (k,y) ###--#-##

  30. Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 k’ = 17 + 1, y’ = 8 + r, t’= t + 1, u’ = u k’ = 18, y’ = 9, t’= 9, u’ = u Number of Detected Homogeneous Similarities. (k’,y’) ###--#-## t+1

  31. Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 k = 17, y = 8, t= 8, u = k’ = 17 + 1, y’ = 8 - p, t’= ?, u’ = ? u t Number of Detected Homogeneous Similarities. (k,y) ###--#-##

  32. Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 ? (k’,y’) k’ = 18, y’ = 6, t’= 5, u’ = ###--#-## u’ t’ Number of Detected Homogeneous Similarities. k’ = 17 + 1, y’ = 8 - p, t’= ?, u’ = ? (t’,u’) = F (t,u);

  33. Number of Detected Homogeneous Similarities: Reduction to Graph Path problem(after Keich03) Edges: • 2 edges from “each” (k, y, t, u): • (k+1, y+r, t+1, u) - for match at (k+1)-th position • (k+1, y-p, t’, u’) - for mismatch at (k+1)-th position; t’, u’ can be pre-computed.

  34. Experiments (score 32) contiguous seed (weight 11) ###########

  35. Experiments (score 32) spaced seed (weight 11) ###-#--#-#--##-###

  36. detected alignment Example 1 Score Alignment ###-###

  37. detected alignment Example 2 Score Alignment ###-###

  38. detected alignment Example 3 Score Alignment ###-###

More Related