Better Filtering with Gapped q-grams
S. Burkhardt, J. Kärkkäinen
Center for Bioinformatics, Saarbrücken
Max-Planck Institut f. Informatik, Saarbrücken
Outline
• Motivation
• The "classic" q-gram Lemma
• q-shapes
• Measuring Filter quality/speed
• Experimental Results
• Conclusion
The Problem
The k-mismatches problem: for a pattern P, a string S and a value k, find all occurrences of P in S with at most k character replacements.
A common Approach
Filter Algorithms
• Filtration stage: examine S with a filter criterion and return the areas with potential matches
• Verification stage: verify which areas contain true matches
Problem Definition
String S: GCATTCGATGGACTGGACTAGTGATTGAGT
Pattern P: ACTC
k = 1
• Find occurrences of P with at most k errors
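The problem statement can be checked directly with a brute-force scan. The following is a minimal Python sketch, not the filter approach discussed in the talk; the function name `k_mismatch_positions` is hypothetical. It uses the string, pattern and k from the slide and reports 0-based start positions.

```python
def k_mismatch_positions(S, P, k):
    """Return all 0-based start positions i where S[i:i+len(P)]
    differs from P in at most k characters (replacements only)."""
    m = len(P)
    hits = []
    for i in range(len(S) - m + 1):
        # Count character replacements between the window and the pattern.
        mismatches = sum(1 for a, b in zip(S[i:i + m], P) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits


S = "GCATTCGATGGACTGGACTAGTGATTGAGT"
P = "ACTC"
print(k_mismatch_positions(S, P, 1))  # [2, 11, 16]
```

This scan costs on the order of |S| * |P| character comparisons, which is the work a filter stage tries to avoid for most of S.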
q-gram Lemma
The q-gram Lemma: for a pattern P, a string S and a value k, every match of P in S with at most k errors shares at least t = |P| - q + 1 - kq substrings of length q (q-grams) with P. P has |P| - q + 1 q-grams, and each error can destroy at most q of them.
q-gram Lemma
Example: q = 3, k = 1
Pattern P: TCGATTAC, |P| = 8
q-grams of P: TCG, CGA, GAT, ATT, TTA, TAC
Number of q-grams: |P| - q + 1 = 6
Threshold: t = |P| - q + 1 - qk = 8 - 3 + 1 - 3 = 3
String S: GCATTCGATGGACTGGACTAGTGAATCAGT
For error number k, a match has at least t = |P| - q + 1 - qk common q-grams within |P| letters.
Some Definitions
In the DP matrix, one can count the number of matching q-grams per diagonal.
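Counting per diagonal can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `qgram_filter` and the example strings are hypothetical, and a diagonal is identified by d = i - j for a q-gram starting at position i in S and position j in P. Diagonals whose count reaches the q-gram lemma threshold are returned as candidates.

```python
from collections import defaultdict


def qgram_filter(S, P, q, k):
    """Return the diagonals d = i - j that collect at least t matching q-grams."""
    t = len(P) - q + 1 - k * q           # threshold from the q-gram lemma
    index = defaultdict(list)            # q-gram of P -> start positions j in P
    for j in range(len(P) - q + 1):
        index[P[j:j + q]].append(j)

    per_diagonal = defaultdict(int)      # diagonal -> number of matching q-grams
    for i in range(len(S) - q + 1):
        for j in index.get(S[i:i + q], ()):
            per_diagonal[i - j] += 1

    # Only diagonals with at least one hit are considered here.
    return sorted(d for d, count in per_diagonal.items() if count >= t)


# P = TCGATTAC occurs at position 4 with one replacement (A -> T).
print(qgram_filter("GGGGTCGTTTACGGGG", "TCGATTAC", 3, 1))  # [4]
```

A candidate diagonal still has to be verified, since reaching the threshold is necessary but not sufficient for a match.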
q-shapes
General idea:
• Use substrings with gaps (q-shapes)
• Compute the correct threshold t
• The total length s of a shape is called its span
Example: |Q| = 11, k = 3 (O = match, X = mismatch)
3-gram ### (s = 3, no gap): t = 0, no filter!
  OOXOOXOOXOO
  OOX OXO XOO OOX OXO XOO OOX OXO XOO
3-shape ##.# (s = 4, 1 gap): t = 1
  OOOXXOOXOOO
  OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
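For small parameters the threshold of a shape can be checked by exhaustive search. The sketch below is an assumption-laden illustration, not a method from the talk: it tries every placement of k mismatches in a pattern of length m and counts the shape windows that contain no mismatch at a `#` position. The function name `shape_threshold` is hypothetical, and the running time grows with the number of mismatch placements, so it is only usable for short patterns.

```python
from itertools import combinations


def shape_threshold(shape, m, k):
    """Minimum number of mismatch-free shape windows over all
    placements of k mismatches in a pattern of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    windows = range(m - len(shape) + 1)  # start positions of the shape
    best = len(windows)
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        surviving = sum(
            1 for w in windows if all(w + o not in bad for o in offsets)
        )
        best = min(best, surviving)
    return best


print(shape_threshold("###", 11, 3))   # 0
print(shape_threshold("##.#", 11, 3))  # 1
```

For a contiguous shape the result agrees with the classic formula whenever that formula is non-negative, for example `shape_threshold("###", 8, 1)` gives 3.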
q-shapes
Judging the quality of q-shapes I
We developed a DP-based approach for computing the threshold t given a q-shape and a query length |P|.
Observation: the threshold t is not the only factor that influences the behaviour of a q-shape.
q-shapes
Judging the quality of q-shapes II
We define the minimum coverage as the minimum number of matching characters for any arrangement of t matching q-shapes in P and a substring of length |P| in S.
  ##.#
   ##.#
  -----
For t = 2 and the 3-shape ##.#, the minimum coverage is 5.
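The definition can be illustrated by brute force for small cases. This is a sketch under assumptions: it considers every choice of t distinct start positions of the shape inside a pattern of length m and takes the smallest number of distinct positions covered by `#` characters. The name `minimum_coverage` and the pattern length 11 in the example are hypothetical choices, and t must not exceed the number of start positions.

```python
from itertools import combinations


def minimum_coverage(shape, m, t):
    """Smallest number of pattern positions covered by t shape
    occurrences placed at distinct start positions in a pattern of length m."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    return min(
        len({s + o for s in chosen for o in offsets})
        for chosen in combinations(starts, t)
    )


print(minimum_coverage("##.#", 11, 2))  # 5
```

For ##.# any two shifted copies share at most one `#` position, so two copies always cover at least 3 + 3 - 1 = 5 characters.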
q-shapes
Judging the quality of q-shapes III
The value q (i.e. the number of matching characters in a shape) determines the expected number of occurrences in a random string S.
3-shape: ##.#
Alphabet: Σ = {A,C,G,T}
Expected number of occurrences of a single 3-shape in S: $\mathrm{occ} = |S| \cdot \frac{1}{|\Sigma|^{q}}$
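As a quick numeric illustration of this estimate, the sketch below evaluates the expected count for a uniformly random text. The function name `expected_occurrences` is hypothetical, and the text length of 50 million characters and the listed values of q are assumed values chosen for illustration only.

```python
def expected_occurrences(text_length, alphabet_size, q):
    """Expected number of occurrences of one fixed q-shape in a
    uniformly random text: text_length / alphabet_size ** q."""
    return text_length / alphabet_size ** q


# DNA alphabet (4 letters), text of 50 million characters.
for q in (3, 6, 9, 12):
    print(q, expected_occurrences(50_000_000, 4, q))
```

Each additional matching character divides the expected count by the alphabet size, which is why larger q means less work in the filter phase.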
q-shapes
Judging the quality of q-shapes IV
The speed of the filter step is influenced by the expected number of matching q-shapes in S.
The efficiency of the filtration correlates closely with the minimum coverage.
• Speed: value of q
• Efficiency: minimum coverage
q-shapes
Judging the quality of q-shapes V
Shapes with maximal minimum coverage for |Q| = 50, k = 5:
  q = 6:  ##......#..#..#.#
  q = 9:  ###..#..#.#...#.##
  q = 10: ###..#..#.#..###.#
  q = 11: #######.##.##
  q = 12: ###.#..###.#..###.#
Good shapes are not necessarily regular or predictable in their form.
Experiments
Evaluating q-shapes
• Experimental setup for q-shapes:
  • 50-million-character random (Bernoulli) string S
  • 1000 random queries of length 500
  • queries have no approximate matches in S
  • threshold computed for |Q| = 50
  • actual value of |Q| is 500 (to reduce the runtime of the tests)
• Experiments show a 10x reduced filter efficiency; the relative performance between shapes is unaffected
Experiments
Evaluating q-shapes
What we measured for every shape and all queries:
A) The total number of occurrences of all shapes: a good indicator of the total work for the filter phase
B) The number of diagonals containing at least t shapes: a good indicator of the filter efficiency
The experiments show a good correlation between A and the predicted values, as well as between B and the minimum coverage.
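The two measurements can be sketched for a single query as follows. This is a hypothetical illustration, not the experimental code behind the reported results: `measure_shape` and the example strings are assumptions, shape occurrences are found through a dictionary built from the query, and the threshold t is passed in rather than computed.

```python
from collections import Counter, defaultdict


def measure_shape(S, P, shape, t):
    """Return (A, B): A = total shape hits between P and S,
    B = number of diagonals with at least t hits."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)

    def key(text, i):
        # Characters at the '#' positions of the shape placed at i.
        return "".join(text[i + o] for o in offsets)

    index = defaultdict(list)            # shape key of P -> start positions j
    for j in range(len(P) - span + 1):
        index[key(P, j)].append(j)

    per_diagonal = Counter()             # diagonal i - j -> number of hits
    for i in range(len(S) - span + 1):
        for j in index.get(key(S, i), ()):
            per_diagonal[i - j] += 1

    total_hits = sum(per_diagonal.values())
    candidate_diagonals = sum(1 for c in per_diagonal.values() if c >= t)
    return total_hits, candidate_diagonals


print(measure_shape("GGGGTCGTTTACGGGG", "TCGATTAC", "##.#", 2))  # (2, 1)
```

Summing both values over all queries gives per-shape totals of the kind described as A and B.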
Conclusion
Our work:
• An analysis of q-grams with gaps (q-shapes)
• Results include:
  • experimental evidence for their superiority when compared to standard q-grams
  • a method to roughly judge their quality, the minimum coverage
  • a way to calculate the parameters required to use them in a filter algorithm
Conclusion
Todo:
• an algorithm to predict the best shapes
• improve the quality measure for q-grams
• extension to the k-differences problem (with insertions and deletions)
• a thorough analysis of filter behaviour for > k differences (use as a heuristic filter)