Explore algorithms for pattern matching in weighted sequences with applications in bioinformatics and data analysis. Learn to handle position weight matrices efficiently using convolution techniques.
Pattern Matching in Weighted Sequences. Oren Kapah, Bar-Ilan University. Joint work with: Amihood Amir, Costas S. Iliopoulos, Ely Porat
Weighted Sequences • A weighted sequence T, of length n, over an alphabet Σ is a |Σ|×n matrix containing, for each position, the probability of each symbol appearing there. • Also known as a Position Weight Matrix (PWM)
Pattern Matching in Weighted Sequences • Problem Definition: Given a threshold probability ε, find all occurrences of the pattern P (|P| = m) in the weighted sequence T (|T| = n), i.e. all positions j where the occurrence probability ∏ T[P[i], j+i] (product over i = 0 … m-1) is at least ε. • By applying the logarithm, the product of probabilities becomes a sum of log-probabilities
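As a quick sanity check, the threshold test and its logarithmic form can be compared on the deck's running example (the function names are mine, ε is the assumed threshold symbol, and 0.2 is an illustrative threshold value):

```python
import math

# Toy instance matching the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"
epsilon = 0.2  # illustrative threshold probability

def occurs_at(j):
    """Direct test: product of aligned probabilities >= epsilon."""
    prob = 1.0
    for i, c in enumerate(P):
        prob *= T[c][j + i]
    return prob >= epsilon

def occurs_at_log(j):
    """Equivalent test after taking logarithms: the product becomes a sum,
    which is what lets later steps use convolution."""
    total = 0.0
    for i, c in enumerate(P):
        p = T[c][j + i]
        if p == 0.0:
            return False  # log(0) = -infinity: no match possible here
        total += math.log10(p)
    return total >= math.log10(epsilon)
```

Both tests accept positions 0 and 2 here (probabilities 0.21 and 0.36), matching the R values -0.67 and -0.45 in the example slide.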
Naïve Algorithm – Bounded Alphabet Size • For each σ in Σ: • Construct a vector Pσ such that Pσ[i] = 1 if σ occurs at position i in P, and Pσ[i] = 0 otherwise. • Calculate the sum of (log-)probabilities by convolving the row of σ in T with Pσ. • For each text position, sum the per-symbol results. • Time: O(n|Σ| log m)
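The per-symbol scheme can be sketched as follows. For clarity this uses a direct correlation instead of the FFT-based convolution that gives the O(n|Σ| log m) bound; the data and names are illustrative:

```python
import math

# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"

def weighted_matches(T, P, epsilon):
    """For each symbol sigma: build the 0/1 indicator vector of sigma in P,
    correlate it with sigma's row of log-probabilities, and accumulate the
    per-symbol results at each text position."""
    n = len(next(iter(T.values())))
    m = len(P)
    scores = [0.0] * (n - m + 1)
    for sigma, row in T.items():
        ind = [1 if c == sigma else 0 for c in P]  # indicator vector for sigma
        log_row = [math.log10(p) if p > 0 else float('-inf') for p in row]
        for j in range(n - m + 1):  # direct correlation; FFT would give log m
            scores[j] += sum(log_row[j + i] for i in range(m) if ind[i])
    return [j for j, s in enumerate(scores) if s >= math.log10(epsilon)]
```

With threshold 0.2 this reports positions 0 and 2, as in the example slide.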
Matching in Weighted Sequences – Unbounded Alphabet Size • Input: Triplets (C, I, P) – one for each entry with probability P ≠ 0. s = # of triplets. • Applying the naïve algorithm in this case results in an O(n|Σ| log m) = O(nm log m) algorithm. • This is worse than the trivial algorithm.
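The sparse triplet representation might look like this on the running example (the helper name is mine):

```python
# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}

def to_triplets(T):
    """One (letter, position, probability) triplet per non-zero entry;
    s = len(to_triplets(T)) is the sparse input size."""
    return [(c, i, p) for c, row in T.items()
                      for i, p in enumerate(row) if p > 0]
```

On this instance s = 10, matching the ten triplets shown in the example slide.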
Example
• T (probabilities, as (letter, probability) per position): 0: (a, 0.5) (b, 0.5) | 1: (a, 0.2) (b, 0.7) (c, 0.1) | 2: (a, 0.4) (c, 0.6) | 3: (b, 1.0) | 4: (a, 0.1) (c, 0.9)
• P: a b c
• T (log-probabilities): 0: (a, -0.3) (b, -0.3) | 1: (a, -0.7) (b, -0.15) (c, -1.0) | 2: (a, -0.4) (c, -0.22) | 3: (b, 0.0) | 4: (a, -1.0) (c, -0.05)
• R (summed log-probabilities per alignment): -0.67, -, -0.45, -, -
Step 1: Subset Matching • Observation 1: A weighted match can only appear at positions where a subset match is found. • Step 1a: Build a new text Ts where each text position holds the set of all letters with non-zero probability there. • Step 1b: Mark all the positions where a subset match is found. • Time: O(s log² s) (Cole & Hariharan, STOC 2002).
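Step 1 can be illustrated with a naive scan over the letter sets; the actual algorithm is Cole & Hariharan's O(s log² s) subset matching, which is far more involved (function name is mine):

```python
# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"

def subset_match_positions(T, P):
    """Build T_s, the text of non-zero-probability letter sets, and report
    every position where each pattern letter lies in its aligned set."""
    n = len(next(iter(T.values())))
    Ts = [{c for c, row in T.items() if row[i] > 0} for i in range(n)]
    m = len(P)
    return [j for j in range(n - m + 1)
            if all(P[i] in Ts[j + i] for i in range(m))]
```

On this instance T_s = {a,b}, {a,b,c}, {a,c}, {b}, {a,c} and the subset match positions are 0 and 2, as in the next slide.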
Example
• T (log-probabilities): 0: (a, -0.3) (b, -0.3) | 1: (a, -0.7) (b, -0.15) (c, -1.0) | 2: (a, -0.4) (c, -0.22) | 3: (b, 0.0) | 4: (a, -1.0) (c, -0.05)
• P: a b c
• T’: {a,b}, {a,b,c}, {a,c}, {b}, {a,c}
• P’: {a}, {b}, {c}
• Subset match positions: 0, 2
Step 2: Main Idea • Linearize the input into vectors T’ and P’ of size O(s), such that T’ contains the (log-)probabilities and P’ contains 1’s and 0’s. • Sum the probabilities using convolution. • The linearization is done by shifting, where each symbol is assigned a different shift. • The same shifts are used in both the text and the pattern.
Example
• Shifts: a → 0, b → 3, c → 1 (each triplet (letter, position, log-probability) moves to slot position + shift(letter))
• T’ (slots 0-6): 0: (a, -0.3) | 1: (a, -0.7) | 2: (a, -0.4) (c, -1.0) | 3: (b, -0.3) (c, -0.22) | 4: (a, -1.0) (b, -0.15) | 5: (c, -0.05) | 6: (b, 0.0)
• P’ (slots 0-6): a _ _ c b _ _
Step 2: Linearization • Definition: singleton – a position assigned exactly one triplet; multiple – a position assigned more than one triplet. • Text: replace every singleton with the (log-)probability of its triplet; empty and multiple positions become 0. • Pattern: replace every singleton with 1; empty and multiple positions become 0.
Example
• T’: 0: (a, -0.3) | 1: (a, -0.7) | 2: (a, -0.4) (c, -1.0) | 3: (b, -0.3) (c, -0.22) | 4: (a, -1.0) (b, -0.15) | 5: (c, -0.05) | 6: (b, 0.0)
• P’: a _ _ c b _ _
• T’’: -0.3 -0.7 0 0 0 -0.05 0
• P’’: 1 0 0 1 1 0 0
• This allows us to sum the probabilities using convolution.
• Question: Are we summing the right values?
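The shifting, singleton filtering, and final summation can be sketched together. This follows the example's shifts (a→0, b→3, c→1) and uses a direct correlation in place of the FFT; names are mine:

```python
# Log-probability triplets from the slides' example: (letter, position, log-prob).
triplets = [('a', 0, -0.3), ('b', 0, -0.3), ('a', 1, -0.7), ('b', 1, -0.15),
            ('c', 1, -1.0), ('a', 2, -0.4), ('c', 2, -0.22), ('b', 3, 0.0),
            ('a', 4, -1.0), ('c', 4, -0.05)]

def shifted_sum(triplets, pattern, shifts, n):
    """Place every triplet at slot position + shift(letter). In T'' singleton
    slots keep their log-probability; in P'' singleton slots become 1; empty
    and multiple slots become 0 in both. One correlation then sums the
    aligned singleton contributions per alignment."""
    m = len(pattern)
    size = max(n, m) + max(shifts.values()) + 1
    t_slots = {}
    for letter, pos, logp in triplets:
        t_slots.setdefault(pos + shifts[letter], []).append(logp)
    T2 = [0.0] * size
    for slot, vals in t_slots.items():
        if len(vals) == 1:          # singleton keeps its log-probability
            T2[slot] = vals[0]
    p_slots = {}
    for i, c in enumerate(pattern):
        p_slots.setdefault(i + shifts[c], []).append(c)
    P2 = [0] * size
    for slot, vals in p_slots.items():
        if len(vals) == 1:          # singleton becomes 1
            P2[slot] = 1
    return [sum(T2[j + i] * P2[i] for i in range(size - j))
            for j in range(n - m + 1)]

scores = shifted_sum(triplets, "abc", {'a': 0, 'b': 3, 'c': 1}, 5)
```

Note the scores (-0.3, -0.75, -0.05) cover only the singleton slots: multiples contributed nothing, which is exactly the completeness problem the next slides address with additional shifting sets.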
Step 2: Correctness • Lemma: At any position where a subset match exists, two aligned singletons must originate from the same letter. • Proof idea: Assume there is a subset match at position i of the text, and that there are two aligned singletons T’(i+j) and P’(j). The subset match guarantees that the letter producing P’(j) also has non-zero probability at the aligned text position, so it contributes a triplet to slot i+j of the text; since T’(i+j) is a singleton, that is its only triplet, so both singletons come from the same letter.
Step 2: Completeness • Problem: We did not sum all the probabilities – multiple positions contributed nothing! • Solution: Use a set of O(log s) such shifting sets. • Problem: Using several shifting sets can cause a probability to be added more than once! • Solution: Zero the probability of a triplet after the first time it appears as a singleton. • Time: O(s log² s)
Hamming Distance – Text Errors: Bounded Alphabet Size • Problem Definition: Given a threshold probability ε, find for each text position the minimal number of probabilities which, when changed to 1, bring the occurrence probability of P up to at least ε. • In the case of errors in the text, a match can always be found. • This does not apply to the case of errors in the pattern.
Hamming Distance – Text Errors: Algorithm Outline • Sort the probabilities in the weighted sequence. • Divide the sorted list of probabilities into blocks of size (n|Σ|)^0.5. • Calculate the sum of probabilities for each block. • For each text location: • Keep adding blocks until the sum goes below the threshold. • Then add probabilities one by one from the last block until the sum goes below the threshold. • Time:
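The quantity being computed can be illustrated per position, ignoring the block decomposition that makes it fast: changing a probability to 1 removes its log term, so the smallest log-probabilities are dropped first. A sketch, with ε assumed as the threshold symbol and names of my own:

```python
import math

# Toy instance from the slides' example: T[c][i] = Pr(symbol c at position i).
T = {'a': [0.5, 0.2, 0.4, 0.0, 0.1],
     'b': [0.5, 0.7, 0.0, 1.0, 0.0],
     'c': [0.0, 0.1, 0.6, 0.0, 0.9]}
P = "abc"

def min_changes(T, P, epsilon, j):
    """Minimal number of aligned probabilities that must be raised to 1 so
    that the match probability at text position j reaches epsilon. Greedy:
    repeatedly drop the smallest (most damaging) log-probability."""
    log_eps = math.log10(epsilon)
    logs = sorted(math.log10(T[c][j + i]) if T[c][j + i] > 0 else float('-inf')
                  for i, c in enumerate(P))
    changes = 0
    while sum(logs) < log_eps:
        logs.pop(0)   # raising the least likely symbol's probability to 1
        changes += 1
    return changes
```

On this instance with ε = 0.2, positions 0 and 2 already match (0 changes) and position 1 needs its two zero-probability entries changed.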
Unbounded Alphabet Size: Algorithm 1 • Divide the list of probabilities into blocks of size s^0.5. • For each block, calculate the sum of probabilities (shifting). • For each text position and each block: • If a subset match exists, use the shifting algorithm's result. • Else – use brute force. • Time: (k is the number of blocks per text position where there is no subset match.)
Unbounded Alphabet Size: Algorithm 2 • Sort the probabilities in the weighted sequence. • Divide the list of probabilities into blocks of size s^(2/3). • For each block: • Calculate the sum of the non-frequent letters' probabilities. O(s m^(2/3)) • Calculate the sum of the frequent letters' probabilities. O(s^(1/3) m^(1/3) n log m) • Continue as in the previous algorithm. • Time: O(s m^(2/3) + s^(1/3) m^(1/3) n log m)
Unbounded Alphabet Size: Combined Algorithm • Start with the first algorithm. • If k is small – complete the first algorithm. • Else – apply the second algorithm.