Explore algorithms for pattern matching in weighted sequences with applications in bioinformatics and data analysis. Learn to handle position weight matrices efficiently using convolution techniques.
Pattern Matching in Weighted Sequences. Oren Kapah, Bar-Ilan University. Joint work with: Amihood Amir, Costas S. Iliopoulos, Ely Porat
Weighted Sequences • A weighted sequence T, of length n, over an alphabet Σ is a |Σ|×n matrix containing, for each position, the probability of each symbol appearing there. • Also known as a Position Weight Matrix (PWM)
Pattern Matching in Weighted Sequences • Problem Definition: Given a threshold probability ε, find all occurrences of the pattern P (|P| = m) in the weighted sequence T (|T| = n), i.e. all positions j where the occurrence probability ∏ T[P[i], j+i] (product over i = 0 … m-1) is at least ε. • By applying the logarithm, the product of probabilities becomes a sum of log-probabilities
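As a quick sanity check, the threshold test and its logarithmic form can be compared on the deck's running example (the function names are mine, ε is the assumed threshold symbol, and 0.2 is an illustrative threshold value):

```python
import math

# Toy instance matching the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"
epsilon = 0.2  # illustrative threshold probability

def occurs_at(j):
    """Direct test: product of aligned probabilities >= epsilon."""
    prob = 1.0
    for i, c in enumerate(P):
        prob *= T[c][j + i]
    return prob >= epsilon

def occurs_at_log(j):
    """Equivalent test after taking logarithms: the product becomes a sum,
    which is what lets later steps use convolution."""
    total = 0.0
    for i, c in enumerate(P):
        p = T[c][j + i]
        if p == 0.0:
            return False  # log(0) = -infinity: no match possible here
        total += math.log10(p)
    return total >= math.log10(epsilon)
```

Both tests accept positions 0 and 2 here (probabilities 0.21 and 0.36), matching the R values -0.67 and -0.45 in the example slide.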
Naïve Algorithm – Bounded Alphabet Size • For each σ in Σ: • Construct a vector Pσ such that Pσ[i] = 1 if σ occurs at position i in P, and Pσ[i] = 0 otherwise. • Calculate the sum of (log-)probabilities by convolving the row of σ in T with Pσ. • For each text position, sum the per-symbol results. • Time: O(n|Σ| log m)
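The per-symbol scheme can be sketched as follows. For clarity this uses a direct correlation instead of the FFT-based convolution that gives the O(n|Σ| log m) bound; the data and names are illustrative:

```python
import math

# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"

def weighted_matches(T, P, epsilon):
    """For each symbol sigma: build the 0/1 indicator vector of sigma in P,
    correlate it with sigma's row of log-probabilities, and accumulate the
    per-symbol results at each text position."""
    n = len(next(iter(T.values())))
    m = len(P)
    scores = [0.0] * (n - m + 1)
    for sigma, row in T.items():
        ind = [1 if c == sigma else 0 for c in P]  # indicator vector for sigma
        log_row = [math.log10(p) if p > 0 else float('-inf') for p in row]
        for j in range(n - m + 1):  # direct correlation; FFT would give log m
            scores[j] += sum(log_row[j + i] for i in range(m) if ind[i])
    return [j for j, s in enumerate(scores) if s >= math.log10(epsilon)]
```

With threshold 0.2 this reports positions 0 and 2, as in the example slide.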
Matching in Weighted Sequences – Unbounded Alphabet Size • Input: Triplets (C, I, P) – one for each entry with probability P ≠ 0. s = # of triplets. • Applying the naïve algorithm in this case results in an O(n|Σ| log m) = O(nm log m) algorithm. • This is worse than the trivial algorithm.
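The sparse triplet representation might look like this on the running example (the helper name is mine):

```python
# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}

def to_triplets(T):
    """One (letter, position, probability) triplet per non-zero entry;
    s = len(to_triplets(T)) is the sparse input size."""
    return [(c, i, p) for c, row in T.items()
                      for i, p in enumerate(row) if p > 0]
```

On this instance s = 10, matching the ten triplets shown in the example slide.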
Example
• T (probabilities, as (letter, probability) per position): 0: (a, 0.5) (b, 0.5) | 1: (a, 0.2) (b, 0.7) (c, 0.1) | 2: (a, 0.4) (c, 0.6) | 3: (b, 1.0) | 4: (a, 0.1) (c, 0.9)
• P: a b c
• T (log-probabilities): 0: (a, -0.3) (b, -0.3) | 1: (a, -0.7) (b, -0.15) (c, -1.0) | 2: (a, -0.4) (c, -0.22) | 3: (b, 0.0) | 4: (a, -1.0) (c, -0.05)
• R (summed log-probabilities per alignment): -0.67, -, -0.45, -, -
Step 1: Subset Matching • Observation 1: A weighted match can only appear at positions where a subset match is found. • Step 1a: Build a new text Ts where each text position holds the set of all letters with non-zero probability there. • Step 1b: Mark all the positions where a subset match is found. • Time: O(s log² s) (Cole & Hariharan, STOC 2002).
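Step 1 can be illustrated with a naive scan over the letter sets; the actual algorithm is Cole & Hariharan's O(s log² s) subset matching, which is far more involved (function name is mine):

```python
# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"

def subset_match_positions(T, P):
    """Build T_s, the text of non-zero-probability letter sets, and report
    every position where each pattern letter lies in its aligned set."""
    n = len(next(iter(T.values())))
    Ts = [{c for c, row in T.items() if row[i] > 0} for i in range(n)]
    m = len(P)
    return [j for j in range(n - m + 1)
            if all(P[i] in Ts[j + i] for i in range(m))]
```

On this instance T_s = {a,b}, {a,b,c}, {a,c}, {b}, {a,c} and the subset match positions are 0 and 2, as in the next slide.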
Example
• T (log-probabilities): 0: (a, -0.3) (b, -0.3) | 1: (a, -0.7) (b, -0.15) (c, -1.0) | 2: (a, -0.4) (c, -0.22) | 3: (b, 0.0) | 4: (a, -1.0) (c, -0.05)
• P: a b c
• T’: {a,b}, {a,b,c}, {a,c}, {b}, {a,c}
• P’: {a}, {b}, {c}
• Subset match positions: 0, 2
Step 2: Main Idea • Linearize the input into vectors T’ and P’ of size O(s), such that T’ contains the (log-)probabilities and P’ contains 1’s and 0’s. • Sum the probabilities using convolution. • The linearization is done by shifting, where each symbol is assigned a different shift. • The same shifts are used in both the text and the pattern.
Example
• Shifts: a → 0, b → 3, c → 1 (each triplet (letter, position, log-probability) moves to slot position + shift(letter))
• T’ (slots 0-6): 0: (a, -0.3) | 1: (a, -0.7) | 2: (a, -0.4) (c, -1.0) | 3: (b, -0.3) (c, -0.22) | 4: (a, -1.0) (b, -0.15) | 5: (c, -0.05) | 6: (b, 0.0)
• P’ (slots 0-6): a _ _ c b _ _
Step 2: Linearization • Definition: singleton – a position assigned exactly one triplet; multiple – a position assigned more than one triplet. • Text: replace every singleton with the (log-)probability of its triplet; empty and multiple positions become 0. • Pattern: replace every singleton with 1; empty and multiple positions become 0.
Example
• T’: 0: (a, -0.3) | 1: (a, -0.7) | 2: (a, -0.4) (c, -1.0) | 3: (b, -0.3) (c, -0.22) | 4: (a, -1.0) (b, -0.15) | 5: (c, -0.05) | 6: (b, 0.0)
• P’: a _ _ c b _ _
• T’’: -0.3 -0.7 0 0 0 -0.05 0
• P’’: 1 0 0 1 1 0 0
• This allows us to sum the probabilities using convolution.
• Question: Are we summing the right values?
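The shifting, singleton filtering, and final summation can be sketched together. This follows the example's shifts (a→0, b→3, c→1) and uses a direct correlation in place of the FFT; names are mine:

```python
# Log-probability triplets from the slides' example: (letter, position, log-prob).
triplets = [('a', 0, -0.3), ('b', 0, -0.3), ('a', 1, -0.7), ('b', 1, -0.15),
            ('c', 1, -1.0), ('a', 2, -0.4), ('c', 2, -0.22), ('b', 3, 0.0),
            ('a', 4, -1.0), ('c', 4, -0.05)]

def shifted_sum(triplets, pattern, shifts, n):
    """Place every triplet at slot position + shift(letter). In T'' singleton
    slots keep their log-probability; in P'' singleton slots become 1; empty
    and multiple slots become 0 in both. One correlation then sums the
    aligned singleton contributions per alignment."""
    m = len(pattern)
    size = max(n, m) + max(shifts.values()) + 1
    t_slots = {}
    for letter, pos, logp in triplets:
        t_slots.setdefault(pos + shifts[letter], []).append(logp)
    T2 = [0.0] * size
    for slot, vals in t_slots.items():
        if len(vals) == 1:          # singleton keeps its log-probability
            T2[slot] = vals[0]
    p_slots = {}
    for i, c in enumerate(pattern):
        p_slots.setdefault(i + shifts[c], []).append(c)
    P2 = [0] * size
    for slot, vals in p_slots.items():
        if len(vals) == 1:          # singleton becomes 1
            P2[slot] = 1
    return [sum(T2[j + i] * P2[i] for i in range(size - j))
            for j in range(n - m + 1)]

scores = shifted_sum(triplets, "abc", {'a': 0, 'b': 3, 'c': 1}, 5)
```

Note the scores (-0.3, -0.75, -0.05) cover only the singleton slots: multiples contributed nothing, which is exactly the completeness problem the next slides address with additional shifting sets.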
Step 2: Correctness • Lemma: At any position where a subset match exists, two aligned singletons must originate from the same letter. • Proof idea: Assume there is a subset match at position i of the text, and that there are two aligned singletons T’(i+j) and P’(j). The subset match guarantees that the letter producing P’(j) also has non-zero probability at the aligned text position, so it contributes a triplet to slot i+j of the text; since T’(i+j) is a singleton, that is its only triplet, so both singletons come from the same letter.
Step 2: Completeness • Problem: We did not sum all the probabilities – multiple positions contributed nothing! • Solution: Use a set of O(log s) such shifting sets. • Problem: Using several shifting sets can cause a probability to be added more than once! • Solution: Zero the probability of a triplet after the first time it appears as a singleton. • Time: O(s log² s)
Hamming Distance – Text Errors: Bounded Alphabet Size • Problem Definition: Given a threshold probability ε, find for each text position the minimal number of probabilities which, when changed to 1, bring the occurrence probability of P up to at least ε. • In the case of errors in the text, a match can always be found. • This does not apply to the case of errors in the pattern.
Hamming Distance – Text Errors: Algorithm Outline • Sort the probabilities in the weighted sequence. • Divide the sorted list of probabilities into blocks of size (n|Σ|)^0.5. • Calculate the sum of probabilities for each block. • For each text location: • Keep adding blocks until the sum goes below the threshold. • Then add probabilities one by one from the last block until the sum goes below the threshold. • Time:
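The quantity being computed can be illustrated per position, ignoring the block decomposition that makes it fast: changing a probability to 1 removes its log term, so the smallest log-probabilities are dropped first. A sketch, with ε assumed as the threshold symbol and names of my own:

```python
import math

# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"

def min_changes(T, P, epsilon, j):
    """Minimal number of aligned probabilities that must be raised to 1 so
    that the match probability at text position j reaches epsilon. Greedy:
    repeatedly drop the smallest (most damaging) log-probability."""
    log_eps = math.log10(epsilon)
    logs = sorted(math.log10(T[c][j + i]) if T[c][j + i] > 0 else float('-inf')
                  for i, c in enumerate(P))
    changes = 0
    while sum(logs) < log_eps:
        logs.pop(0)   # raising the least likely symbol's probability to 1
        changes += 1
    return changes
```

On this instance with ε = 0.2, positions 0 and 2 already match (0 changes) and position 1 needs its two zero-probability entries changed.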
Unbounded Alphabet Size: Algorithm 1 • Divide the list of probabilities into blocks of size s^0.5. • For each block, calculate the sum of probabilities (shifting). • For each text position and each block: • If a subset match exists, use the shifting algorithm's result. • Else – use brute force. • Time: (k is the number of blocks per text position where there is no subset match.)
Unbounded Alphabet Size: Algorithm 2 • Sort the probabilities in the weighted sequence. • Divide the list of probabilities into blocks of size s^(2/3). • For each block: • Calculate the sum of the non-frequent letters' probabilities. O(s m^(2/3)) • Calculate the sum of the frequent letters' probabilities. O(s^(1/3) m^(1/3) n log m) • Continue as in the previous algorithm. • Time: O(s m^(2/3) + s^(1/3) m^(1/3) n log m)
Unbounded Alphabet Size: Combined Algorithm • Start with the first algorithm. • If k is small – complete the first algorithm. • Else – apply the second algorithm.