
Tuning Algorithms for Jumbled Matching

Explore jumbled matching algorithms, Parikh vectors, approximate permutation matching, and efficient tuning for DNA and protein data. Developments include ABAM and BAM2 to enhance accuracy and speed. Experimental results show significant improvements.


Presentation Transcript


  1. Tuning Algorithms for Jumbled Matching • Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio

  2. Jumbled matching • An interesting variation of string matching. • Goal: find all substrings of T that are permutations of P. • For example, P = abcb occurs in T = aababcaabc as the substring babc.

  3. Jumbled matching • Parikh vector: the pattern can be described by its Parikh vector, • the vector of multiplicities of its characters. • p(S) is (1,2,1,0) for S = abcb over the alphabet {a,b,c,d}.
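The definitions on slides 2 and 3 can be sketched in a few lines. This is a minimal, quadratic-time illustration, not one of the algorithms from the talk; the function names `parikh_vector` and `jumbled_occurrences_naive` are hypothetical.

```python
from collections import Counter

def parikh_vector(s, alphabet):
    """Return the multiplicities of each alphabet character in s, in alphabet order."""
    counts = Counter(s)
    return tuple(counts[c] for c in alphabet)

def jumbled_occurrences_naive(text, pattern, alphabet):
    """List start positions of windows whose Parikh vector equals the pattern's.

    Recomputes the vector for every window, so it is only a reference
    implementation for checking faster methods.
    """
    m = len(pattern)
    target = parikh_vector(pattern, alphabet)
    return [i for i in range(len(text) - m + 1)
            if parikh_vector(text[i:i + m], alphabet) == target]

print(parikh_vector("abcb", "abcd"))                            # (1, 2, 1, 0)
print(jumbled_occurrences_naive("aababcaabc", "abcb", "abcd"))  # [2]
```

The single occurrence at position 2 (counting from 0) is the substring babc.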

  4. Approximate Permutation Matching • A string P´ is a k-approximate permutation of P, 0 <= k < m, where |P´| = |P| = m holds. • set(P´) is the set of characters in P´, and cc(u,c) is the number of occurrences of a character c in a string u.

  5. Motivation • Alignment of strings • SNP discovery • Discovery of repeated patterns • Interpretation of mass spectrometry data

  6. Previous Algorithms • Key Idea- scan the text forward while maintaining counts of characters. • Work in linear time. • These algorithms were developed as filtration methods for online approximate string matching.

  7. Previous Algorithms • Grossi & Luccio’s (Information Processing Letters 1989) and Navarro’s (Proc. WSP 1997) solutions are based on the frequency of characters. • Navarro’s counting algorithm - sliding window approach.
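The sliding-window counting idea can be sketched as follows. This is a hedged reconstruction of a counting filter in the spirit of the approach named above, not code from the cited papers. The name `counting_matcher` and the acceptance test `matched >= m - k` are assumptions; with k = 0 it reports exact jumbled matches.

```python
from collections import Counter

def counting_matcher(text, pattern, k=0):
    """Report window starts where at least m - k characters are covered by the pattern."""
    m = len(pattern)
    remaining = Counter(pattern)  # pattern count minus window count; may go negative
    matched = 0                   # sum over characters of min(pattern count, window count)
    hits = []
    for j, c in enumerate(text):
        # Character entering the window.
        if remaining[c] > 0:
            matched += 1
        remaining[c] -= 1
        # Character leaving the window once it is longer than m.
        if j >= m:
            out = text[j - m]
            remaining[out] += 1
            if remaining[out] > 0:
                matched -= 1
        if j >= m - 1 and matched >= m - k:
            hits.append(j - m + 1)
    return hits

print(counting_matcher("aababcaabc", "abcb"))       # [2]
print(counting_matcher("aababcaabc", "abcb", k=1))  # [1, 2, 3, 4, 5, 6]
```

Each text character is handled a constant number of times, which matches the linear running time stated for the forward algorithms.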

  8. Previous Algorithms • Grossi and Luccio’s (Information Processing Letters 1989) solution maintains a queue of characters. • The queue grows as acceptable characters are read. • Navarro presented Mcount, a counting algorithm for multiple patterns (Proc. WSP 1997).

  9. Previous Algorithms • Cantone and Faro (Proc. PSC 2014) presented the BAM algorithm (Bit-parallel Abelian Matcher). • BAM associates a counter (bin) with each distinct character in P • and a single 1-bit counter with all the remaining characters of the alphabet.

  10. Previous Algorithms • At the start of processing a window, every overflow bit is zero. • The 1-bit counter reserved for the characters not occurring in P is initially zero. • It is set as soon as any character not in P is encountered in the text window. • At that point it is clear that the text window cannot be a permutation of the pattern P.

  11. Bit-parallel simulation • P = abbccc • [Figure: bins for c, b, a, and all other characters]

  12. Initialization for state vector • P = abbccc • [Figure: bins for c, b, a, and all other characters]

  13. Forward Processing

  14. Backward Processing

  15. New solutions • Solutions for both exact and approximate jumbled matching. • We present two algorithms that are modifications of BAM. • ABAM (approximate BAM). • BAM2 (enhanced BAM with 2-grams).

  16. Key Idea: Counters • We use bit fields to store counters: • one for each character that appears in the pattern, • and one for all other characters. • The highest bit of each field is an overflow indicator. • Each field has space to represent the number of times the character appears in the pattern plus the maximum error count k.

  17. State Vector D • Counters are stored in the state vector D. • If they do not fit in one word, we can put several different characters in one field, • but then we must verify matches. • Initial values of D are fetched from a precomputed word. • Each text character tj is processed using the array M[tj], which has a one in the field for tj. • The value of D is updated by D ← D + M[tj].
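The packed-counter update D ← D + M[tj] can be illustrated with Python integers standing in for machine words. This is a simplified sketch, not the authors' implementation: it restarts D for every window instead of using the forward and backward processing of BAM. The initial field value 2^(w-1) - count - 1 is an assumption, chosen so that the overflow bit is set exactly when a character occurs more often than in P. The names `build_tables` and `bitfield_matcher` are hypothetical.

```python
def build_tables(pattern):
    """Build the initial word, the overflow mask, and the increment table M."""
    counts = {}
    for c in pattern:
        counts[c] = counts.get(c, 0) + 1
    init, overflow, shift = 0, 0, 0
    M = {}
    for c, cnt in counts.items():
        width = cnt.bit_length() + 1                 # counter bits plus one overflow bit
        init |= ((1 << (width - 1)) - cnt - 1) << shift
        overflow |= 1 << (shift + width - 1)         # highest bit of this field
        M[c] = 1 << shift                            # a one in the field for c
        shift += width
    other = 1 << shift      # 1-bit field shared by all characters not in the pattern
    overflow |= other
    return init, overflow, M, other

def bitfield_matcher(text, pattern):
    """Report exact jumbled matches by testing overflow bits after each addition."""
    m = len(pattern)
    init, overflow, M, other = build_tables(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        d = init
        for c in text[i:i + m]:
            d += M.get(c, other)     # D <- D + M[tj]
            if d & overflow:         # some character occurred too often
                break
        else:
            hits.append(i)           # m characters and no overflow: a permutation
    return hits

print(bitfield_matcher("aababcaabc", "abcb"))  # [2]
```

Stopping at the first overflow keeps every field within its width, so no carry spills into a neighbouring field.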

  18. Initialization for state vector D and M[ ] for pattern P = abbccc • [Figure: fields for all other characters (x), c, b, and a, showing the initial word I and the masks M[a], M[b], M[c], M[x]]

  19. Variations of BAM • BAMs: • Some bins are shared if necessary. • If bins are shared, each match candidate needs to be verified. • BAM2: • Handles two text characters (a 2-gram) at a time. • Has separate loops for patterns of even and odd length. • Reads four characters before testing D for the first time. • Hence the minimum width of a field is four bits instead of two.

  20. ABAM • ABAM : Approximate BAM. • C is the error counter. • F[tj] is mask for testing overflow bits.

  21. EBL (Exact Backward for Large alphabets) • EBL is based on SBNDM2. • Instead of representing occurrence vectors, array B states whether a character is present in the pattern. • When the alignment window contains only acceptable characters, the window is a match candidate. • Acceptable: characters that appear in the pattern. • The update step is simply D = D & B[ti+j-1].
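The filtering idea on this slide can be sketched with a one-bit state. This is a deliberately simplified illustration and not EBL itself: it does not use the 2-gram processing of SBNDM2, and the verification step with `Counter` as well as the name `backward_filter_matcher` are assumptions made for the example.

```python
from collections import Counter

def backward_filter_matcher(text, pattern):
    """Scan each window backward, skip past unacceptable characters, verify candidates."""
    m, n = len(pattern), len(text)
    acceptable = set(pattern)          # plays the role of array B
    target = Counter(pattern)
    hits = []
    i = 0
    while i + m <= n:
        j, d = m, 1
        while j > 0:
            d &= int(text[i + j - 1] in acceptable)   # D = D & B[t_(i+j-1)]
            if not d:
                break
            j -= 1
        if d:
            # Only acceptable characters: the window is a match candidate.
            if Counter(text[i:i + m]) == target:
                hits.append(i)
            i += 1
        else:
            # No window containing position i + j - 1 can match; jump past it.
            i += j
    return hits

print(backward_filter_matcher("xxbabcxabcb", "abcb"))  # [2, 7]
```

The skip is most effective on large alphabets, where many text characters do not occur in the pattern at all.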

  22. EFS (Exact Forward for Small alphabets) and AFL (Approximate Forward for Large alphabets) • EFS: the update step is D ← D + M[t_i] - M[t_(i-m)]. • AFL is a modification of Mcount tuned for a single pattern. • It uses a different initial value of the counter.
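The EFS update step adds the mask of the entering character and subtracts the mask of the leaving one. The sketch below shows that step on a packed word of counters. The field width, the shared field for characters outside the pattern, and the match test against the packed Parikh vector of P are assumptions for illustration; the name `packed_sliding_matcher` is hypothetical.

```python
def packed_sliding_matcher(text, pattern):
    """Slide a window over text, keeping all character counts packed in one integer."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    width = m.bit_length()                 # each field holds a count from 0 to m
    chars = sorted(set(pattern))
    M = {c: 1 << (i * width) for i, c in enumerate(chars)}
    other = 1 << (len(chars) * width)      # shared field for all other characters
    target = sum(M[c] for c in pattern)    # packed Parikh vector of the pattern
    d = sum(M.get(c, other) for c in text[:m])
    hits = [0] if d == target else []
    for i in range(m, n):
        # D <- D + M[t_i] - M[t_(i-m)], applied as one step
        d += M.get(text[i], other) - M.get(text[i - m], other)
        if d == target:
            hits.append(i - m + 1)
    return hits

print(packed_sliding_matcher("aababcaabc", "abcb"))  # [2]
```

One addition and one comparison per text character replace a loop over the alphabet, which is why packing pays off when all counters fit in a word.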

  23. ABS (Approximate Backward for Small Alphabets) • The error count C is updated without conditional code by shifting the corresponding overflow bit to the lowest bit and then masking it. • The shift uses array o[ ], which contains the positions of the overflow bits.
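The branch-free error update can be sketched as C += (D >> o[c]) & 1. The code below is an illustration under stated assumptions, not ABS itself: it processes each window forward from scratch, it makes every field wide enough for a whole window (a simplification of the sizing by pattern count plus k), and it treats a window as a k-approximate match when the number of excess character occurrences is at most k. The name `excess_count_matcher` is hypothetical.

```python
def excess_count_matcher(text, pattern, k):
    """Count excess occurrences per window using overflow bits, without branching."""
    m = len(pattern)
    counts = {}
    for c in pattern:
        counts[c] = counts.get(c, 0) + 1
    width = m.bit_length() + 1        # assumed width: room for m additions plus overflow bit
    half = 1 << (width - 1)           # value at which the overflow bit becomes set
    fields = {}                       # c -> (M[c], o[c]): increment and overflow-bit position
    init, shift = 0, 0
    for c, cnt in counts.items():
        init |= (half - cnt - 1) << shift
        fields[c] = (1 << shift, shift + width - 1)
        shift += width
    init |= (half - 1) << shift       # shared field: any other character is an error
    other = (1 << shift, shift + width - 1)
    hits = []
    for i in range(len(text) - m + 1):
        d, errors = init, 0
        for c in text[i:i + m]:
            inc, pos = fields.get(c, other)
            d += inc
            errors += (d >> pos) & 1  # shift overflow bit to the lowest bit, then mask
        if errors <= k:
            hits.append(i)
    return hits

print(excess_count_matcher("aababcaabc", "abcb", 0))  # [2]
print(excess_count_matcher("aababcaabc", "abcb", 1))  # [1, 2, 3, 4, 5, 6]
```

Once a field has overflowed, its overflow bit stays set for the rest of the window, so every further occurrence of that character adds one to the error count.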

  24. Execution times of algorithms (in seconds) for English data

  25. Execution times of algorithms (in seconds) for DNA data

  26. Execution times of algorithms (in seconds) for protein data

  27. Experimental Results • English data: BAM2a runs more than twice as fast as the previous algorithms. • DNA data: EFS runs twice as fast as the previous algorithms. • Protein data: BAM2a is the fastest and takes less than half the time of the previous algorithms.

  28. Concluding remarks • We introduced new variations of jumbled matching algorithms. • All the forward algorithms are clearly linear. • The speed of AFL does not depend on the value of k. • The technique of shared bins proved useful for jumbled matching.

  29. THANK YOU
