390 likes | 542 Views
Estimating seed sensitivity on homogeneous alignments. BIBE 2004 Taichung - May 20th, 2004 Gregory Kucherov 1 , Laurent Noé 1 , Yann Ponty 2 1 LORIA, Nancy 2 LRI, Paris, France. Detected seeds. Detected alignment. Seed paradigm ( FASTA, BLAST, PatternHunter, YASS, … ).
E N D
Estimating seed sensitivity on homogeneous alignments BIBE2004 Taichung - May 20th, 2004 Gregory Kucherov1, Laurent Noé1, Yann Ponty2 1LORIA, Nancy 2LRI, Paris, France
Detected seeds Detected alignment Seed paradigm (FASTA, BLAST,PatternHunter, YASS, … ) • Start with small conserved and easily detected fragments (seeds). • Then extend the seeds and build possible alignments Dot plot ctcgactcgggctcacgctcgcaccgggttacagcggtcgattgcat aggcctcgggctcgcgctcgcgcgctagacaccgggttacagcgt
Spaced Seed Model [Ma & al. 02] • Seed Pattern : ###--#-## ‘#’ : obligatory match position ‘-’ : joker position (“don’t care” position) Weight : 6 [number of #] Span : 9 [number of all symbols] • Example : ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA ###--#-## ATCAGTGCAATGCTCAAGA |||||:||:||||:||||| ATCAGCGCGATGCGCAAGA
How to describe Selectivity and Sensitivity • Selectivity seed weight: number of random occurrences ~ 4-weight . • Sensitivity probability for the seed to detect an interesting similarity. To be specified: • What set of similarities do we want to detect? • What is the probability of each similarity?
What is a good seed?Sensitivity/Selectivity balance • Seed of relativelylargeweight : • Few random seed matches (high selectivity) • Possible loss of similarities (low sensitivity) • Seed of relativelysmallweight : • Detect almost all possible similarities (high sensitivity) • Many random seed matches (low selectivity)
binary word Similarity: notation • Ungapped similarities only (no indels) CTACGATGAGCTGCT |||:||:||||:||| CTATGACGAGCGGCT • All matches are equiprobable, all mismatches are equiprobable (simplification)
{ , }n Similarities to be detected [Bulher et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …] • The set: • all strings in (given n) • The probabilities: • Bernoulli model; • Markov models;
Similarities to be detected [Bulher et al 2003, Brejova et al 2003, Choi et al 2003, Keich et al 2004, Ma et al 2001, 2003, …] Advantage: • natural probability model; • DP algorithms to compute sensitivity; Disadvantage: • Uninteresting similarities are included in the set.
Similarities to be detected and their probabilities: our approach Only “true” similarities to be considered Scoring Scheme Score = 12r – 3p +r+r+r-p+r+r-p+r+r+r+r-p+r+r+r CTACGATGAGCTGCT |||:||:||||:||| CTATGACGAGCGGCT
Similarities to be detected and their probabilities: our approach Homogeneous similarities : • do not contain sub-alignment of higher score : • (cf. Maximum Scoring Pairs) • all prefixes and suffixes of the similarity have non-negative score.
Homogeneous alignment Homogeneous similarities (Prefix) Score Alignment homogeneous similarities occur entirely inside shaded area:
Negative suffix Prefix of higher score Suffix of higher score Negative prefix Non homogeneous similarities Score Alignment Score Alignment
Our Model • The set: Homogeneous similarities of • given length n and • given score S • The probabilities : all similarities of the set have same probability.
Problem Statement • Given: 1) a seed of weight w and span l, 2) integer scoring scheme {r, p}, 3) similarity length n, 4) score S. • Compute: the probability for the seed to match a random homogeneous similarity of length n and score S
Computation of Seed Sensitivity (homogeneous case) To be computed: probability for a seed to detect a homogeneous similarity, i.e. Two steps of computation • Preprocessing: counting number Nhom of all homogeneous similarities of given length n and score S. • DP algorithm: Space ant time complexity: • Seed sensitivity measure: counting number of homogeneous similarities detected by seed . • DP algorithm: (similar to Keich03).
(k,y) Number of Homogeneous Similarities: Reduction to Graph Path problem (n,S) • Vertices: {(k, y)}, where k is a length of similarity, y is its score. Vertex (k, y) corresponds to the set of all similarities of score y and length k. • 2 edges from each (k, y): • (k+1, y+r) - for match at position (k+1) ; • (k+1, y-p) - for mismatch at position (k+1); • Homogeneous Similarities Paths from (0,0) to (n,S) inside nxS grid Score (0,0) Alignment
(n,S) (k,y) (0,0) Number of Homogeneous Similarities:Recursive equation Score S fixed: number of possible paths from (0,0) to (n,S). D(y, k) = D(y+r, k+1) + D(y-p, k+1) Taking into account border effects
Time and complexity • Space Complexity • Time Complexity
Computer experiments:Homogeneous vs. All similarities • Compare the sensitivity of seeds on both models • Fixed score S according to the scoring scheme (r=+1; p=-3) • Similarity length varies from 20 to 120 • Two sets of similarities: (1) all similarities of given length and score; (2) only homogeneous similarities of given length and score; • Comparison plots • x axis : similarity length • y axis : sensitivity (probability that the seed matches a similarity)
Experiments (score 16) contiguous seed (weight 11) ###########
Experiments (score 16) spaced seed (weight 11) ###-#--#-#--##-###
Optimal seeds Optimal seeds are different Optimal seeds are the same
Summary • We have proposed • a new definition of seed sensitivity based on the notion of homogeneous similarity; • a DP algorithm to compute the sensitivity of a given seed. • Sensitivity of a seed on homogeneous similarities is usually substantially larger than on all similarities • Optimal seed on homogeneous similarities may be not optimal on all similarities and vice versa.
Extensions • Combining homogeneity constraint with properties of DNA sequences • Distinguishing different mismatches (transitions/transversions): YASS http://www.loria.fr/projects/YASS/
Estimating seed sensitivity on homogeneous alignments Gregory Kucherov1, Laurent Noé1, Yann Ponty2 1LORIA (Laboratoire lorrain de recherche en informatique et ses applications), Nancy,France 2LRI (Laboratoire de recherche en informatique), Paris, France
Collaborators Thanks !! Mikhail Roytberg (Institute of Mathematical Problem of Biology, Russia) for his comments and helpful discussion during the preparation of this work. Alain Denise (Laboratoire de Recherche en Informatique, France) for his help on culminating paths.
Number of Detected Homogeneous Similarities: Reduction to Graph Path problem(after Keich03) Vertexes:{(k, y, t, u)} , where k is a length of similarity, y is its score; 1≤t≤ l ; u isa binary word of length l-w. {(k, y, t, u)} corresponds to set H (k, y, t, u) consists of homogeneous similarities of length kand score y; • t is a length of maximal prefix of the seed , matching an end of a word v from H(k, y, t, u); • u is a word, consisting of symbols corresponding to the joker position within the matching. • t and uare same for all similarities from H
Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 k = 17, y = 8, t= 8, u = k’ = k + 1, y’ = y + r, t’= t + 1, u’ = u u t Number of Detected Homogeneous Similarities. (k,y) ###--#-##
Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 k’ = 17 + 1, y’ = 8 + r, t’= t + 1, u’ = u k’ = 18, y’ = 9, t’= 9, u’ = u Number of Detected Homogeneous Similarities. (k’,y’) ###--#-## t+1
Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 k = 17, y = 8, t= 8, u = k’ = 17 + 1, y’ = 8 - p, t’= ?, u’ = ? u t Number of Detected Homogeneous Similarities. (k,y) ###--#-##
Seed:###--#-##Scoring Scheme :r = +1 ; p = -2 ? (k’,y’) k’ = 18, y’ = 6, t’= 5, u’ = ###--#-## u’ t’ Number of Detected Homogeneous Similarities. k’ = 17 + 1, y’ = 8 - p, t’= ?, u’ = ? (t’,u’) = F (t,u);
Number of Detected Homogeneous Similarities: Reduction to Graph Path problem(after Keich03) Edges: • 2 edges from “each” (k, y, t, u): • (k+1, y+r, t+1, u) - for match at (k+1)-th position • (k+1, y-p, t’, u’) - for mismatch at (k+1)-th position; t’, u’ can be pre-computed.
Experiments (score 32) contiguous seed (weight 11) ###########
Experiments (score 32) spaced seed (weight 11) ###-#--#-#--##-###
detected alignment Example 1 Score Alignment ###-###
detected alignment Example 2 Score Alignment ###-###
detected alignment Example 3 Score Alignment ###-###