
Pattern Matching in Weighted Sequences

Explore algorithms for pattern matching in weighted sequences with applications in bioinformatics and data analysis. Learn to handle position weight matrices efficiently using convolution techniques.



Presentation Transcript


  1. Pattern Matching in Weighted Sequences • Oren Kapah, Bar-Ilan University • Joint work with: Amihood Amir, Costas S. Iliopoulos, Ely Porat

  2. Weighted Sequences • A weighted sequence T, of length n, over an alphabet Σ is a |Σ|×n matrix which contains, for each position, the probability of each symbol appearing there. • Also known as a Position Weight Matrix (PWM).
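As a concrete illustration (not on the slide), such a matrix can be stored directly as a |Σ|×n array; the values below are the ones used in the example of slide 6. A minimal sketch in Python/NumPy:

```python
import numpy as np

# Position weight matrix over the alphabet {a, b, c} for a text of
# length n = 5: entry [s, i] is the probability of symbol s at position i.
SIGMA = {"a": 0, "b": 1, "c": 2}
T = np.array([
    [0.5, 0.2, 0.4, 0.0, 0.1],  # row for 'a'
    [0.5, 0.7, 0.0, 1.0, 0.0],  # row for 'b'
    [0.0, 0.1, 0.6, 0.0, 0.9],  # row for 'c'
])
assert np.allclose(T.sum(axis=0), 1.0)  # each column is a distribution
```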

  3. Pattern Matching in Weighted Sequences • Problem Definition: Given a threshold probability ε, find all occurrences of the pattern P (|P|=m) in the weighted sequence T (|T|=n), i.e. all text positions j where: ∏_{i=0}^{m-1} T[P[i], j+i] ≥ ε • By applying the logarithm, the product becomes a sum: ∑_{i=0}^{m-1} log T[P[i], j+i] ≥ log ε
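A minimal sketch of this test, assuming the matrix layout from the previous sketch; `occurs_at` is an illustrative helper, not a routine from the paper:

```python
import numpy as np

def occurs_at(T, sigma_index, P, j, eps):
    """Does pattern P occur at text position j with probability >= eps?
    Uses the log-sum form of the test."""
    rows = [sigma_index[c] for c in P]
    probs = T[rows, np.arange(j, j + len(P))]
    if np.any(probs == 0.0):       # a zero probability rules out a match
        return False
    return np.sum(np.log10(probs)) >= np.log10(eps)
```

With the slide-6 matrix and ε = 0.2, `occurs_at(T, SIGMA, "abc", 0, 0.2)` returns True (−0.68 ≥ log₁₀ 0.2 ≈ −0.70), and so does position 2, where the product is 0.36.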

  4. Naïve Algorithm – Bounded Alphabet Size • For each σ in Σ: • Construct a vector P_σ such that P_σ[i]=1 if σ occurs at position i in P, and P_σ[i]=0 otherwise. • Calculate the sum of (log-)probabilities by convolving the row of σ in T with P_σ. • For each text position, sum the results over all symbols. • Time: O(n|Σ| log m)
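A sketch of this algorithm, assuming zero probabilities are encoded as a large negative constant rather than −∞ so the convolution stays finite; `np.convolve` is quadratic, and substituting an FFT-based convolution gives the stated O(n|Σ| log m) bound:

```python
import numpy as np

def naive_weighted_matching(logT, sigma_index, P, log_eps):
    """Naive bounded-alphabet algorithm: one convolution per symbol.
    logT is the |Sigma| x n matrix of log10-probabilities."""
    n, m = logT.shape[1], len(P)
    R = np.zeros(n - m + 1)
    for sym, row in sigma_index.items():
        # Indicator vector: P_sym[i] = 1 iff P[i] == sym.
        P_sym = np.array([1.0 if c == sym else 0.0 for c in P])
        # Convolving with the reversed indicator realizes the correlation
        # R[j] += sum_i logT[row, j+i] * P_sym[i].
        R += np.convolve(logT[row], P_sym[::-1], mode="valid")
    return np.flatnonzero(R >= log_eps)

# Usage with the matrix from the first sketch:
# logT = np.where(T > 0, np.log10(np.where(T > 0, T, 1.0)), -1e9)
# naive_weighted_matching(logT, SIGMA, "abc", np.log10(0.2))  # -> array([0, 2])
```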

  5. Matching in Weighted Sequences – Unbounded Alphabet Size • Input: triplets (σ, i, p) for every symbol σ with probability p > 0 at text position i. s = # of triplets. • Applying the naïve algorithm in this case results in an O(n|Σ| log m) = O(nm log m) algorithm, since the effective alphabet can be as large as m. • This is worse than the trivial algorithm.

  6. Example
  T (symbol, position, probability):
  position 0: (a, 0, 0.5), (b, 0, 0.5)
  position 1: (a, 1, 0.2), (b, 1, 0.7), (c, 1, 0.1)
  position 2: (a, 2, 0.4), (c, 2, 0.6)
  position 3: (b, 3, 1.0)
  position 4: (a, 4, 0.1), (c, 4, 0.9)
  P: a b c
  After taking log₁₀ of each probability:
  position 0: (a, -0.3), (b, -0.3)
  position 1: (a, -0.7), (b, -0.15), (c, -1.0)
  position 2: (a, -0.4), (c, -0.22)
  position 3: (b, 0.0)
  position 4: (a, -1.0), (c, -0.05)
  R (sum of aligned log-probabilities): R[0] = -0.67, R[2] = -0.45; no match at position 1, where b has probability 0 at text position 2 (positions 3 and 4 cannot host the full pattern).

  7. Step 1: Subset Matching • Observation 1: A weighted match can only appear at positions where a subset match can be found. • Step 1a: Build a new text Ts where each text position holds the set of all letters with non-zero probability there. • Step 1b: Mark all the positions where a subset match is found. • Time: O(s log² s) (Cole & Hariharan, STOC 2002).
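For illustration only, a brute-force stand-in for this step (the slide's bound comes from Cole & Hariharan's subset matching algorithm, which is far more involved):

```python
def subset_match_positions(T_sets, P_sets):
    """Report every text position where each pattern set is contained in
    the aligned text set (naive O(n*m) check)."""
    n, m = len(T_sets), len(P_sets)
    return [i for i in range(n - m + 1)
            if all(P_sets[j] <= T_sets[i + j] for j in range(m))]

# The example of the next slide:
T_sets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "c"}]
P_sets = [{"a"}, {"b"}, {"c"}]
assert subset_match_positions(T_sets, P_sets) == [0, 2]
```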

  8. Example
  T (log-probabilities):
  position 0: (a, -0.3), (b, -0.3)
  position 1: (a, -0.7), (b, -0.15), (c, -1.0)
  position 2: (a, -0.4), (c, -0.22)
  position 3: (b, 0.0)
  position 4: (a, -1.0), (c, -0.05)
  P: a b c
  T’: {a,b}, {a,b,c}, {a,c}, {b}, {a,c}
  P’: {a}, {b}, {c}
  • Subset match positions: 0, 2

  9. Step 2: Main Idea • Linearize the input into flat vectors T’ and P’ of size O(s), such that T’ contains the (log-)probabilities and P’ contains 1s and 0s. • Sum the probabilities using convolution. • The linearization is done by shifting: each symbol is assigned a different shift. • The same shifts are used in both the text and the pattern.

  10. Example • Shifts: a→0, b→3, c→1; each triplet (σ, i, p) moves to index i + shift(σ).
  T’ by index:
  0: (a, -0.3)
  1: (a, -0.7)
  2: (a, -0.4), (c, -1.0)
  3: (b, -0.3), (c, -0.22)
  4: (a, -1.0), (b, -0.15)
  5: (c, -0.05)
  6: (b, 0.0)
  P’ by index (0–6): a _ _ c b _ _

  11. Step 2: Linearization • Definition: singleton – a position to which exactly one triplet is assigned; multiple – a position to which more than one triplet is assigned. • Text: replace every singleton with the (log-)probability of its triplet; empty and multiple positions become 0. • Pattern: replace every singleton with 1; empty and multiple positions become 0.
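A sketch of the linearization under one shift assignment, using the slide-10 data; the vector sizes (text/pattern length plus the largest shift) are a choice of this sketch:

```python
import numpy as np

def linearize(triplets, pattern, shifts, t_size, p_size):
    """Build the vectors T'' and P'' for one shift assignment.
    triplets: (letter, position, log-probability); each one lands at
    index position + shifts[letter]."""
    t_buckets = [[] for _ in range(t_size)]
    for sym, pos, logp in triplets:
        t_buckets[pos + shifts[sym]].append(logp)
    # Text: singletons keep their log-probability; empty/multiple -> 0.
    T2 = np.array([b[0] if len(b) == 1 else 0.0 for b in t_buckets])

    p_buckets = [[] for _ in range(p_size)]
    for j, sym in enumerate(pattern):
        p_buckets[j + shifts[sym]].append(sym)
    # Pattern: singletons become 1; empty/multiple -> 0.
    P2 = np.array([1.0 if len(b) == 1 else 0.0 for b in p_buckets])
    return T2, P2

# Slide-10 data with shifts a->0, b->3, c->1:
triplets = [("a", 0, -0.3), ("b", 0, -0.3), ("a", 1, -0.7), ("b", 1, -0.15),
            ("c", 1, -1.0), ("a", 2, -0.4), ("c", 2, -0.22), ("b", 3, 0.0),
            ("a", 4, -1.0), ("c", 4, -0.05)]
T2, P2 = linearize(triplets, "abc", {"a": 0, "b": 3, "c": 1}, t_size=8, p_size=6)
# T2 = [-0.3, -0.7, 0, 0, 0, -0.05, 0, 0]; P2 = [1, 0, 0, 1, 1, 0]
```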

  12. Example • T’ and P’ as on the previous slide. • T’’: -0.3, -0.7, 0, 0, 0, -0.05, 0 • P’’: 1, 0, 0, 1, 1, 0, 0 • This allows us to sum the probabilities using convolution. • Question: are we summing the right values?

  13. Step 2: Correctness • Lemma: At any position where a subset match exists, two aligned singletons must originate from the same letter. • Proof: Assume a subset match at text position i, and two aligned singletons T’(i+j) and P’(j), coming from letters σ₁ and σ₂ respectively. The pattern singleton stems from pattern position j − shift(σ₂); the subset match guarantees σ₂ also has non-zero probability at the aligned text position i + j − shift(σ₂), and that text triplet maps to index (i + j − shift(σ₂)) + shift(σ₂) = i + j in T’. Since T’(i+j) is a singleton, this triplet is the one that produced it, so σ₁ = σ₂.

  14. Step 2: Completeness • Problem: We did not sum all the probabilities! (Triplets falling on multiple positions contribute nothing.) • Solution: Use a set of O(log s) such shifting sets. • Problem: Using several shifting sets can cause a probability to be added more than once! • Solution: Zero the probability of a triplet after the first time it appears as a singleton. • Time: O(s log² s)
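A hedged sketch of the resulting loop, reusing the `linearize` helper above; the construction of the O(log s) shifting sets themselves is taken as given here:

```python
import numpy as np
from collections import Counter

def sum_with_shift_sets(triplets, pattern, shift_sets, t_size, p_size):
    """Correlate T'' with P'' once per shift set, zeroing each triplet's
    log-probability after the first time it appears as a singleton so
    that no probability is added twice."""
    remaining = list(triplets)
    R = np.zeros(t_size - p_size + 1)
    for shifts in shift_sets:
        T2, P2 = linearize(remaining, pattern, shifts, t_size, p_size)
        # Correlation via convolution with the reversed pattern vector.
        R += np.convolve(T2, P2[::-1], mode="valid")
        # Zero every triplet that was a singleton under this shift set
        # (it keeps its place for singleton/multiple bookkeeping).
        counts = Counter(pos + shifts[sym] for sym, pos, _ in remaining)
        remaining = [(sym, pos, 0.0 if counts[pos + shifts[sym]] == 1 else lp)
                     for sym, pos, lp in remaining]
    return R
```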

  15. Hamming Distance – Text Errors, Bounded Alphabet Size • Problem Definition: Given a threshold probability ε, find for each text position j the minimal number of probabilities which, by changing them to 1, make the match condition hold: ∑_{i=0}^{m-1} log T[P[i], j+i] ≥ log ε • In the case of errors in the text, a match can always be found. • This does not apply to the case of errors in the pattern.

  16. Hamming Distance – Text Errors, Algorithm Outline • Sort the probabilities in the weighted sequence. • Divide the sorted list of probabilities into blocks of size (n|Σ|)^(1/2). • Calculate the sum of probabilities for each block. • For each text location: • start adding whole blocks until the sum goes below the threshold; • then start adding individual probabilities from the last block until the sum goes below the threshold. • Time: each text location scans at most (n|Σ|)^(1/2) blocks plus at most one block's worth of individual probabilities, i.e. O((n|Σ|)^(1/2)) per location.
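A per-location sketch of the greedy count (illustrative names; the block structure above amortizes this scan across locations):

```python
import numpy as np

def min_text_errors(log_probs, log_eps):
    """Minimal number of probabilities to raise to 1 (log-probability 0)
    so that the total log-sum for one alignment reaches the threshold.
    Greedy: fix the most negative log-probabilities first."""
    order = np.sort(log_probs)            # most negative first
    total = order.sum()
    errors = 0
    while total < log_eps and errors < len(order):
        total -= order[errors]            # changing p to 1 removes log p
        errors += 1
    return errors
```

For text errors a match can always be reached (changing every probability to 1 drives the sum up to 0), which mirrors the remark on the previous slide.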

  17. Unbounded Alphabet Size – Algorithm 1 • Divide the list of probabilities into blocks of size s^(1/2). • For each block, calculate the sum of probabilities (by shifting). • For each text position and each block: • if a subset match exists, use the shifting algorithm's result; • else, use brute force. • Time: depends on k, the number of blocks per text position for which there is no subset match.

  18. Unbounded Alphabet Size – Algorithm 2 • Sort the probabilities in the weighted sequence. • Divide the list of probabilities into blocks of size s^(2/3). • For each block: • calculate the sum of the non-frequent letters' probabilities: O(s·m^(2/3)); • calculate the sum of the frequent letters' probabilities: O(s^(1/3)·m^(1/3)·n·log m). • Continue as in the previous algorithm. • Time: O(s·m^(2/3) + s^(1/3)·m^(1/3)·n·log m)

  19. Unbounded Alphabet SizeCombined Algorithm • Start with the first algorithm. • If k is small – Complete the first algorithm. • Else – Apply the second algorithm.
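As a sketch of this dispatch, with `algorithm_1` and `algorithm_2` passed in as callables standing for the routines of the two previous slides (all names here are placeholders):

```python
def combined_matching(triplets, pattern, algorithm_1, algorithm_2, k_threshold):
    """Slide-19 dispatch: algorithm_1 must also report k, the number of
    blocks per text position without a subset match; fall back to
    algorithm_2 only when k is large."""
    result, k = algorithm_1(triplets, pattern)
    if k <= k_threshold:
        return result                      # k small: keep algorithm 1's answer
    return algorithm_2(triplets, pattern)  # k large: use algorithm 2
```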
