Tuning Algorithms for Jumbled Matching

Tuning Algorithms for Jumbled Matching • Tamanna Chhabra, Sukhpal Singh Ghuman, Jorma Tarhio

Jumbled matching • An interesting variation of string matching. • The task is to find the substrings of T that are permutations of P. • For example, P = abcb occurs in T = aababcaabc as the substring babc.

Jumbled matching • Parikh vector: the pattern can be described by its Parikh vector, • the vector of multiplicities of the characters. • For example, p(S) = (1,2,1,0) for S = abcb over the alphabet {a,b,c,d}.
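As a small illustration of the definition above, a Parikh vector can be computed by counting each alphabet character. The names `parikh_vector` and `is_jumbled_match` and the explicit `alphabet` argument are assumptions made for this sketch; they do not come from the slides.

```python
def parikh_vector(s, alphabet):
    """Return the multiplicities of the alphabet characters in s, in alphabet order."""
    return tuple(s.count(c) for c in alphabet)


def is_jumbled_match(u, pattern, alphabet):
    """Compare Parikh vectors.

    Assumes both strings use only characters from `alphabet`; under that
    assumption u is a permutation of pattern exactly when the vectors agree.
    """
    return parikh_vector(u, alphabet) == parikh_vector(pattern, alphabet)


print(parikh_vector("abcb", "abcd"))
print(is_jumbled_match("babc", "abcb", "abcd"))
```

Comparing whole vectors for every window costs time proportional to the alphabet size, which is why the algorithms on the following slides update counts incrementally instead.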

Approximate Permutation Matching • The string P´ is a k-approximate permutation of P, 0 ≤ k < m, if |P´| = |P| = m holds together with a condition stated in terms of set(P´) and cc, • where set(P´) is the set of characters in P´ and cc(u,c) is the number of occurrences of a character c in a string u.

Motivation • Alignment of strings • SNP discovery • Discovery of repeated patterns • Interpretation of mass spectrometry data

Previous Algorithms • Key idea: scan the text forward while maintaining counts of characters. • These algorithms work in linear time. • They were developed as filtration methods for online approximate string matching.
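The key idea above can be sketched as a forward scan that keeps, for every character, the difference between its count in the current window and its count in the pattern. This is a generic illustration of the idea, not any of the cited algorithms; the function name and the `nonzero` bookkeeping are assumptions of the sketch.

```python
from collections import defaultdict


def jumbled_occurrences(pattern, text):
    """Return start positions of windows of text that are permutations of pattern."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    diff = defaultdict(int)      # window count minus pattern count, per character
    for c in pattern:
        diff[c] -= 1
    nonzero = len(diff)          # number of characters whose difference is not zero

    def bump(c, delta):
        nonlocal nonzero
        before = diff[c]
        diff[c] = before + delta
        if before == 0:
            nonzero += 1         # a balanced character became unbalanced
        elif diff[c] == 0:
            nonzero -= 1         # an unbalanced character became balanced

    hits = []
    for j, c in enumerate(text):
        bump(c, +1)                      # character entering the window
        if j >= m:
            bump(text[j - m], -1)        # character leaving the window
        if j >= m - 1 and nonzero == 0:  # all counts agree: a match
            hits.append(j - m + 1)
    return hits


print(jumbled_occurrences("abcb", "aababcaabc"))
```

Each text character causes a constant number of counter updates, so the scan is linear in the text length.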

Previous Algorithms • Grossi & Luccio’s (Information Processing Letters 1989) and Navarro’s (Proc. WSP 1997) solutions are based on the frequency of characters. • Navarro’s counting algorithm - sliding window approach.
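A minimal sketch of a counting filter with a sliding window follows. It is written from the general description of the technique, not taken from Navarro's paper. The function name, the variable names, and the acceptance test `matched >= m - k` are assumptions: the sketch reports a window when at most k of its characters cannot be paired with pattern characters.

```python
from collections import defaultdict


def counting_filter(pattern, text, k):
    """Return start positions of windows with at least m - k characters
    that can be paired with pattern characters (assumed acceptance rule)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    remaining = defaultdict(int)   # pattern count minus window count, per character
    for c in pattern:
        remaining[c] += 1
    matched = 0                    # window characters currently paired with the pattern
    hits = []
    for j, c in enumerate(text):
        if remaining[c] > 0:       # the entering character is still needed
            matched += 1
        remaining[c] -= 1
        if j >= m:
            out = text[j - m]
            remaining[out] += 1
            if remaining[out] > 0: # the leaving character had been paired
                matched -= 1
        if j >= m - 1 and matched >= m - k:
            hits.append(j - m + 1)
    return hits


print(counting_filter("abcb", "aababcaabc", 0))
print(counting_filter("abcb", "aababcaabc", 1))
```

With k = 0 the filter reports exactly the jumbled matches, since a window of length m with m paired characters is a permutation of the pattern.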

Previous Algorithms • Grossi and Luccio’s (Information Processing Letters 1989) solution maintains a queue of characters. • The queue grows as acceptable characters are read. • Navarro presented Mcount, a variant for multiple patterns (Proc. WSP 1997).

Previous Algorithms • Cantone and Faro (Proc. PSC 2014) presented the BAM algorithm (Bit-parallel Abelian Matcher). • It associates a counter (bin) with each distinct character in P. • A single 1-bit counter serves the remaining characters of the alphabet.

Previous Algorithms • At the start of processing a window, every overflow bit is zero. • The 1-bit counter reserved for all the characters not occurring in P is initially zero. • It is set as soon as any character not in P is encountered in the text window. • At that point it is clear that the text window cannot be a permutation of the pattern P.
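The window logic described above can be sketched with plain counters before any bit-parallelism is involved. The sketch assumes that a window is read from right to left and that the next window starts just after the position where an overflow was detected; this scanning order and shift rule are assumptions of the sketch, and the function name is hypothetical.

```python
from collections import Counter


def backward_window_scan(pattern, text):
    """Exact jumbled matching with per-window counters and overflow detection."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    budget = Counter(pattern)    # allowed occurrences; 0 for characters not in pattern
    hits = []
    i = 0                        # start of the current window
    while i + m <= n:
        seen = Counter()         # every counter starts from zero in a new window
        j = i + m - 1
        while j >= i:
            c = text[j]
            seen[c] += 1
            if seen[c] > budget[c]:   # overflow: too many c, or c not in pattern
                break
            j -= 1
        if j < i:
            hits.append(i)       # whole window read without overflow: a match
            i += 1
        else:
            i = j + 1            # no window containing text[j .. i+m-1] can match
    return hits


print(backward_window_scan("abcb", "aababcaabc"))
```

The shift is safe because every window that starts at or before position j and reaches the old window end contains the overflowing suffix.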

Bit-parallel simulation • P = abbccc • Fields: c, b, a, other characters

Initialization for state vector • P = abbccc • Fields: c, b, a, all other characters

Forward Processing

Backward Processing

New solutions • Solutions for both exact and approximate jumbled matching. • We present two algorithms that are modifications of BAM. • ABAM (approximate BAM). • BAM2 (enhanced BAM with 2-grams).

Key Idea: Counters • We use bit fields to store the counters: • one for each character that appears in the pattern, • one for all other characters. • The highest bit of a field is an overflow indicator. • Each field has space to represent the number of times the character appears in the pattern plus the maximum error count k.
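One possible field layout consistent with the bullets above is sketched here. The exact widths and initial values used by the authors are not given on the slide, so the choices below are assumptions: a field for a character with multiplicity q gets `bit_length(q + k)` value bits plus one overflow bit, and starts at a value that makes the overflow bit turn on at occurrence number q + k + 1. The names `build_fields` and `OTHER` are hypothetical.

```python
from collections import Counter

OTHER = None  # hypothetical key standing for "all other characters"


def build_fields(pattern, k=0):
    """Return (M, init, overflow_mask, total_bits) for a packed counter word.

    M[key] has a single one in the lowest bit of the field for key.
    """
    counts = Counter(pattern)
    M, init, overflow_mask, pos = {}, 0, 0, 0
    for key in sorted(counts) + [OTHER]:
        limit = counts.get(key, 0) + k          # occurrences tolerated without overflow
        width = limit.bit_length() + 1          # value bits plus the overflow bit
        M[key] = 1 << pos
        # Start so that `limit` increments leave the overflow bit clear
        # and one more increment sets it.
        init |= ((1 << (width - 1)) - limit - 1) << pos
        overflow_mask |= 1 << (pos + width - 1)
        pos += width
    return M, init, overflow_mask, pos


M, init, overflow_mask, total_bits = build_fields("abbccc")
print(total_bits, bin(init), bin(overflow_mask))
```

For P = abbccc and k = 0 this layout needs 9 bits: 2 for a, 3 each for b and c, and a single bit for all other characters.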

State Vector D • Counters are stored in state vector D. • If they do not fit in one word, several different characters can share one field, • but then matches must be verified. • Initial values of D are fetched from a precomputed word. • Each text character tj is processed with the array M, where M[tj] has a one in the field for tj. • The value of D is updated by D ← D + M[tj].
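The update D ← D + M[tj] can be tried out in a small exact matcher. This is a sketch under stated assumptions, not the authors' code: the field widths and initial values are chosen here so that an overflow bit turns on when a character occurs once too often, windows are read from right to left, and the function name is hypothetical. Python integers are unbounded, so the question of fitting D into one machine word does not arise in this sketch.

```python
from collections import Counter


def bam_like_search(pattern, text):
    """Exact jumbled matching with counters packed into one integer D."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    counts = Counter(pattern)
    M, init, overflow, pos = {}, 0, 0, 0
    for c in sorted(counts):
        q = counts[c]
        width = q.bit_length() + 1                  # value bits plus overflow bit
        M[c] = 1 << pos                             # a one in the field for c
        init |= ((1 << (width - 1)) - q - 1) << pos # overflow after q + 1 occurrences
        overflow |= 1 << (pos + width - 1)
        pos += width
    m_other = 1 << pos                              # 1-bit field for all other characters
    overflow |= m_other

    hits = []
    i = 0
    while i + m <= n:
        D = init                                    # precomputed initial word
        j = i + m - 1
        while j >= i:
            D += M.get(text[j], m_other)            # D <- D + M[tj]
            if D & overflow:                        # some counter overflowed
                break
            j -= 1
        if j < i:
            hits.append(i)
            i += 1
        else:
            i = j + 1
    return hits


print(bam_like_search("abcb", "aababcaabc"))
```

A window of m characters that causes no overflow contains each pattern character at most as often as the pattern does and no other character, so it must be a permutation of the pattern.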

Initialization for state vector D and M[ ] for pattern P = abbccc • Columns: x (all other characters), c, b, a • Rows: I (initial value), M[a], M[b], M[c], M[x]

Variations of BAM • BAMs: some bins are shared if necessary; if bins are shared, each match candidate needs to be verified. • BAM2: handles two text characters (a 2-gram) at a time; uses separate loops for patterns of even and odd length; reads four characters before testing D for the first time, hence the minimum width of a field is four bits instead of two.

ABAM • ABAM: Approximate BAM. • C is the error counter. • F[tj] is the mask for testing overflow bits.

EBL (Exact Backward for Large alphabets) • EBL is based on SBNDM2. • Instead of representing occurrence vectors, array B states whether a character is present in the pattern. • When the alignment window contains only acceptable characters, the window is a match candidate. • Acceptable: characters that appear in the pattern. • The update step is simply D = D & B[ti+j-1].
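The filtering step D = D & B[ti+j-1] can be sketched as follows. The 2-gram reading of SBNDM2 is left out, verification is done by comparing character counts, and the shift rule (restart just after an unacceptable character) is an assumption of the sketch; the function name is hypothetical.

```python
from collections import Counter


def ebl_like_search(pattern, text):
    """Filter windows by acceptable characters, then verify candidates."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    target = Counter(pattern)
    B = {c: 1 for c in target}      # 1 for acceptable characters, 0 otherwise (via get)
    hits = []
    i = 0                           # window start (0-based)
    while i + m <= n:
        D = 1
        j = m                       # 1-based position inside the window, read backward
        while j > 0 and D:
            D &= B.get(text[i + j - 1], 0)   # D = D & B[t_{i+j-1}]
            if D:
                j -= 1
        if D:
            # Only acceptable characters: a match candidate that must be verified.
            if Counter(text[i:i + m]) == target:
                hits.append(i)
            i += 1
        else:
            i += j                  # skip past the unacceptable character
    return hits


print(ebl_like_search("abc", "abxbcab"))
```

On a large alphabet most windows contain an unacceptable character near their right end, so long shifts are common and few candidates reach verification.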

EFS (Exact Forward for Small alphabets) and AFL (Approximate Forward for Large alphabets) • EFS: the update step is D ← D + M[ti] − M[ti−m]. • AFL is a modification of Mcount tuned for a single pattern. • It uses a different initial value of the counter.
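A forward matcher built around the update D ← D + M[ti] − M[ti−m] might look like the sketch below. The slide does not say how a match is detected, so the test `D == target` (the packed Parikh vector of the pattern) is an assumption, as are the uniform field width and the function name.

```python
def efs_like_search(pattern, text):
    """Exact forward jumbled matching with packed window counts."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    width = m.bit_length()          # a window holds at most m copies of one character
    M = {c: 1 << (idx * width) for idx, c in enumerate(sorted(set(pattern)))}
    m_other = 1 << (len(M) * width) # topmost field counts all other characters
    target = sum(M[c] for c in pattern)                 # packed counts of the pattern
    D = sum(M.get(c, m_other) for c in text[:m])        # packed counts of the first window
    hits = [0] if D == target else []
    for i in range(m, n):
        # One character enters the window and one leaves it.
        D += M.get(text[i], m_other) - M.get(text[i - m], m_other)
        if D == target:
            hits.append(i - m + 1)
    return hits


print(efs_like_search("abcb", "aababcaabc"))
```

The loop body has no inner loop and no branch besides the match test, which fits the remark that the forward algorithms are linear.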

ABS (Approximate Backward for Small Alphabets) • The error count C is updated without conditional code by shifting the corresponding overflow bit to the lowest bit and then masking it. • The shift uses array o[ ], which contains the positions of the overflow bits.
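The branch-free update of C can be sketched as `C += (D >> o[t]) & 1`. Everything around that line is an assumption of the sketch: the field widths, the backward scan with a restart after the position where C exceeds k, and the reading of "k-approximate" as "at most k window characters in excess of the pattern's multiplicities". The function name is hypothetical.

```python
from collections import Counter


def abs_like_search(pattern, text, k):
    """Approximate jumbled matching; excess characters are counted in C."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    counts = Counter(pattern)
    M, o, init, pos = {}, {}, 0, 0
    for key in sorted(counts) + [None]:      # None stands for all other characters
        q = counts.get(key, 0)
        width = (q + k + 1).bit_length() + 1 # headroom: overflow bit stays set for k + 1 excesses
        M[key] = 1 << pos
        o[key] = pos + width - 1             # position of the overflow bit
        init |= ((1 << (width - 1)) - q - 1) << pos
        pos += width

    hits = []
    i = 0
    while i + m <= n:
        D, C = init, 0
        j = i + m - 1
        while j >= i:
            key = text[j] if text[j] in counts else None
            D += M[key]
            C += (D >> o[key]) & 1           # add the overflow bit of this field, no branch
            if C > k:
                break
            j -= 1
        if j < i:
            hits.append(i)                   # at most k excess characters in the window
            i += 1
        else:
            i = j + 1                        # the suffix read so far already has too many errors
    return hits


print(abs_like_search("abcb", "aababcaabc", 0))
print(abs_like_search("abcb", "aababcaabc", 1))
```

Once a field has overflowed, its overflow bit stays set for the following occurrences of the same character, so each excess occurrence adds exactly one to C.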

Execution times of algorithms (in seconds) for English data

Execution times of algorithms (in seconds) for DNA data

Execution times of algorithms (in seconds) for protein data

Experimental Results • English data: BAM2a runs more than twice as fast as the previous algorithms. • DNA data: EFS runs at double the speed of the previous algorithms. • Protein data: BAM2a is the fastest and takes less than half the time of the previous algorithms.

Concluding remarks • We introduced new variations of jumbled matching algorithms. • All the forward algorithms are clearly linear. • The speed of AFL does not depend on the value of k. • The technique of shared bins proved useful for jumbled matching.

THANK YOU
