
Property Matching and Weighted Matching

Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang.


Presentation Transcript


  1. Property Matching and Weighted Matching. Amihood Amir, Eran Chencinski, Costas Iliopoulos, Tsvi Kopelowitz and Hui Zhang

  2. Results: Weighted Matching, General Reduction, Property Matching, Property Indexing, Pattern Matching

  3. Property Matching. Def: A property π of a string T = t_1, …, t_n is a set of intervals π = {(s_1, f_1), (s_2, f_2), …, (s_t, f_t)}, s.t. s_i, f_i ∈ {1, …, n} and s_i ≤ f_i. Property Matching Problem: Given a text T with property π and a pattern P, find all locations where P matches T and the match is fully contained in an interval in π.

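  4. Property Matching - Example: Property Swap Matching Problem. [Figure: a text and a pattern over the characters A, B, D with property intervals; the layout is not recoverable from the transcript.]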

  5. Solving the Property Matching Problem: • Solve the regular pattern matching problem. • Eliminate results that are not contained in a property interval. • Eliminating results can be done in linear time. • If the regular problem takes Ω(n) time, then property matching time = regular problem time.
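The two-step recipe on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `property_match` is hypothetical, intervals are assumed to be 1-based and inclusive, and Python's `str.find` stands in for whatever regular pattern matching algorithm is used.

```python
def property_match(text, pattern, intervals):
    """Return 1-based starts of pattern occurrences that lie inside some interval."""
    n, m = len(text), len(pattern)
    # reach[i] = largest interval end f over all intervals (s, f) with s <= i.
    reach = [0] * (n + 2)
    for s, f in intervals:
        reach[s] = max(reach[s], f)
    for i in range(1, n + 1):
        reach[i] = max(reach[i], reach[i - 1])
    hits = []
    # Step 1: regular pattern matching (str.find is only a stand-in here).
    pos = text.find(pattern)
    while pos != -1:
        # Step 2: keep the occurrence [pos+1, pos+m] only if an interval covers it.
        if pos + m <= reach[pos + 1]:
            hits.append(pos + 1)
        pos = text.find(pattern, pos + 1)
    return hits


print(property_match("ABABAB", "AB", [(1, 3), (5, 6)]))
```

The elimination step costs O(n) for the `reach` table plus O(1) per candidate, which is the linear-time filtering the slide refers to. In the call above, the occurrence at location 3 is dropped because it ends at location 4, outside both intervals.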

  6. Property Indexing Problem: • Preprocess T s.t., given a pattern P, we can find the occurrences of P in T that are contained in a property interval. • Time: proportional to |P| and tocc, the number of occurrences reported. • Our solution: query time O(|P| log|Σ| + tocc), preprocessing time O(n log|Σ| + n log log n).

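  7. Weighted Sequence. Def 1: A weighted sequence T = t_1, …, t_n over alphabet Σ is a sequence of sets t_i of pairs <s_j, π_i(s_j)>, where s_j ∈ Σ and π_i(s_j) is the probability of having symbol s_j at location i. Example (one bracket per location): [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<A,1/4> <C,3/4>] [<D,1>] [<B,1/2> <C,1/2>] [<B,1/9> <C,8/9>]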

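  8. Weighted Sequence. Def 2: Given a probability ε, P = p_1, …, p_m occurs at location i of a weighted text T w.p. at least ε if π_i(p_1) · π_{i+1}(p_2) · … · π_{i+m-1}(p_m) ≥ ε, where π_k(σ) is the probability of symbol σ at location k of T.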

  9. Weighted Sequence. Example (one bracket per location): [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<A,1/4> <C,3/4>] [<D,1>] [<B,1/2> <C,1/2>] [<B,1/9> <C,8/9>]. Pattern shown: A D C C.
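The occurrence probability from Def 2 is a product of per-location probabilities. Here is a small sketch under stated assumptions: the weighted text is stored as a list of dictionaries mapping symbols to exact `Fraction` probabilities, and the names `T` and `occurrence_probability` are hypothetical.

```python
from fractions import Fraction as F

# The six-location example text from the slides.
T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 2), "C": F(1, 2)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def occurrence_probability(T, P, i):
    """Probability that string P occurs at 1-based location i of weighted text T."""
    if i < 1 or i + len(P) - 1 > len(T):
        return F(0)
    prob = F(1)
    for j, ch in enumerate(P):
        # A symbol that is absent from a location has probability 0 there.
        prob *= T[i - 1 + j].get(ch, F(0))
    return prob


print(occurrence_probability(T, "ADCC", 3))  # 1/4 * 1 * 1/2 * 8/9
```

For the pattern A D C C, location 3 is the only place where every character has nonzero probability, and the product there is 1/9, so the pattern occurs at location 3 for any ε ≤ 1/9.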

  10. Goal: • Weighted matching problems are pattern matching problems with a weighted text. • Goal: find a general reduction for solving weighted matching problems using regular pattern matching algorithms.

  11. Naive Algorithm (Algorithm A): • Find all possible patterns appearing in the weighted text. • Concatenate all patterns to create a new text. • Run a regular pattern matching algorithm on the new regular text. • Check whether each pattern found has prob. ≥ ε.
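The four steps of Algorithm A can be sketched directly. This is an illustration only, assuming a list-of-dictionaries representation for the weighted text, a separator `$` that is not in the alphabet, and `str.find` as the regular matcher; the name `naive_weighted_match` is hypothetical.

```python
from fractions import Fraction as F
from itertools import product

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 2), "C": F(1, 2)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def naive_weighted_match(T, P, eps):
    """Return 1-based locations where non-empty P occurs w.p. >= eps."""
    n, m = len(T), len(P)
    # Step 1: every string the weighted text can generate (exponentially many).
    strings = ["".join(c) for c in product(*[sorted(t) for t in T])]
    # Step 2: concatenate them; "$" is assumed not to be in the alphabet.
    new_text = "$".join(strings)
    hits = set()
    # Step 3: regular pattern matching on the new text.
    pos = new_text.find(P)
    while pos != -1:
        start = pos % (n + 1)  # 0-based location inside the original text
        # Step 4: keep the location only if its probability is at least eps.
        prob = F(1)
        for j in range(m):
            prob *= T[start + j][P[j]]
        if prob >= eps:
            hits.add(start + 1)
        pos = new_text.find(P, pos + 1)
    return sorted(hits)


print(naive_weighted_match(T, "DC", F(1, 4)))
```

Even this six-location example expands to 3·3·2·1·2·2 = 72 strings, which illustrates why the approach does not scale.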

  12. Naive Algorithm. Example (one bracket per location): [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<A,1/4> <C,3/4>] [<D,1>] [<B,1/2> <C,1/2>] [<B,1/9> <C,8/9>]. New text, as the concatenation of all possible patterns: AAADBB AAADBC AAADCB AAADCC AACDBB AACDBC AACDCB AACDCC ABADBB …

  13. Naive Algorithm: • Clearly this algorithm is inefficient and can be exponential even for |Σ| = 2. • Notice that there is a lot of waste: • Many patterns share the same substrings. • Given ε, we can ignore patterns that appear w.p. < ε.

  14. Maximal Factor. Def 3: Given ε and a weighted text T, a string X is a maximal factor of T at location i if: (a) X appears at location i w.p. ≥ ε, and (b) extending X by one character to the right or to the left makes the probability drop below ε.
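Def 3 can be checked mechanically for a given string and location. The sketch below assumes a list-of-dictionaries text representation and treats the two ends of the text as positions where no extension is possible; that boundary rule and the function names are assumptions, not taken from the slides.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 2), "C": F(1, 2)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def occurrence_probability(T, X, i):
    """Probability that X occurs at 1-based location i (0 if it does not fit)."""
    if i < 1 or i + len(X) - 1 > len(T):
        return F(0)
    prob = F(1)
    for j, ch in enumerate(X):
        prob *= T[i - 1 + j].get(ch, F(0))
    return prob


def is_maximal_factor(T, X, i, eps):
    """Check conditions (a) and (b) of Def 3 for X at 1-based location i."""
    p = occurrence_probability(T, X, i)
    if p < eps:  # condition (a) fails
        return False
    # Condition (b), left side: no character at location i-1 keeps prob >= eps.
    if i > 1 and any(p * q >= eps for q in T[i - 2].values()):
        return False
    # Condition (b), right side: no character after the end keeps prob >= eps.
    end = i + len(X) - 1
    if end < len(T) and any(p * q >= eps for q in T[end].values()):
        return False
    return True


print(is_maximal_factor(T, "CDCC", 3, F(1, 4)))
```

With ε = 1/4, the string CDCC at location 3 has probability 1/3, cannot be extended to the left (that would give 1/9) and ends at the last location, so it is maximal. CDC at the same location is not maximal, because appending C still leaves probability 1/3.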

  15. Maximal Factor. Example (one bracket per location): [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<A,1/4> <C,3/4>] [<D,1>] [<B,1/2> <C,1/2>] [<B,1/9> <C,8/9>]. Highlighted characters on the slide: D B A C.

  16. Algorithm B: • Find all maximal factors in the text. • Concatenate the factors to create a new text. • Run a regular pattern matching algorithm on the new regular text. Note: A pattern appearing in the new text has prob. of appearance ≥ ε.

  17. Total Length of Maximal Factors. What is the total length of all maximal factors? Consider the following case (one bracket per location): [<A,1-δ> <B,δ>] [<A,1-δ> <B,δ>] [<C,1>] [<C,1>] [<A,1-δ> <B,δ>] [<A,1-δ> <B,δ>] such that (1-δ)^(n/3) = ε. • n/3 maximal factors of length (2/3)·n. • Total length of all maximal factors is Ω(n²).
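The counting behind the Ω(n²) bound can be written out as follows. This is one reading of the slide, assuming each of the three blocks (A/B locations, C locations, A/B locations) has n/3 locations; the factor names X_k are introduced here only for illustration.

```latex
% Assumption: three blocks of n/3 locations each, with (1-\delta)^{n/3} = \varepsilon.
% For k = 1, \dots, n/3, let X_k be the all-A/all-C string starting at location k.
\begin{align*}
X_k &= A^{\,n/3-k+1}\; C^{\,n/3}\; A^{\,k-1}, \\
\Pr[X_k \text{ at location } k] &= (1-\delta)^{(n/3-k+1)+(k-1)} = (1-\delta)^{n/3} = \varepsilon, \\
|X_k| &= \tfrac{n}{3} + \tfrac{n}{3} = \tfrac{2n}{3}, \\
\sum_{k=1}^{n/3} |X_k| &= \tfrac{n}{3}\cdot\tfrac{2n}{3} = \tfrac{2n^2}{9} = \Omega(n^2).
\end{align*}
```

Each X_k has probability exactly ε, so any one-character extension multiplies it by 1-δ or by δ and drops it below ε. That makes every X_k a maximal factor.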

  18. Classifying Text Locations. Given ε, we classify each location i of the weighted text into one of 3 categories: • Solid positions: one character w.p. exactly 1. • Leading positions: at least one character w.p. greater than 1-ε (and less than 1). • Branching positions: all characters have probability of appearance at most 1-ε.

  19. Classifying Text Locations. Example (one bracket per location): [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<A,1/4> <C,3/4>] [<D,1>] [<B,1/3> <C,2/3>] [<B,1/9> <C,8/9>]. If ε ≤ 1/2, there is at most 1 “eligible” character at a leading position.
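The three categories depend only on the largest probability at a location, so the classification is a short function. The value ε = 3/10 used below is an assumption chosen for illustration (the slides do not state which ε the example uses), and the names are hypothetical.

```python
from fractions import Fraction as F

# Example text from this slide (location 5 is <B,1/3> <C,2/3>).
T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def classify(t, eps):
    """Classify one location as 'solid', 'leading' or 'branching'."""
    top = max(t.values())
    if top == 1:
        return "solid"      # one character with probability exactly 1
    if top > 1 - eps:
        return "leading"    # some character with probability in (1 - eps, 1)
    return "branching"      # every character has probability <= 1 - eps


EPS = F(3, 10)  # assumed value, not given on the slide
print([classify(t, EPS) for t in T])
```

With this ε, locations 3 and 6 come out as leading (3/4 and 8/9 both exceed 7/10), location 4 is solid, and the rest are branching.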

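  20. LST Transformation. Def 4: The Leading to Solid Transformation of a weighted text T = t_1, …, t_n is LST(T) = t'_1, …, t'_n, where t'_i = <σ,1> if i is a leading position with leading character σ, and t'_i = t_i otherwise. Here the leading character has prob. of appearance ≥ max{1-ε, ε}.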

  21. LST Transformation. Example (one bracket per location). T: [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<A,1/4> <C,3/4>] [<D,1>] [<B,1/3> <C,2/3>] [<B,1/9> <C,8/9>]. LST(T): [<A,1/2> <B,3/8> <C,1/8>] [<A,1/3> <B,1/3> <D,1/3>] [<C,1>] [<D,1>] [<B,1/3> <C,2/3>] [<C,1>].
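The transformation in this example can be reproduced with a few lines. This is a sketch under stated assumptions: ε = 3/10 is assumed (any value in (1/4, 1/3] gives the same result for this text), a leading character is taken to have probability strictly between 1-ε and 1 and at least ε, and the function name `lst` is hypothetical.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def lst(T, eps):
    """Leading to Solid Transformation: leading positions become solid."""
    out = []
    for t in T:
        sym, p = max(t.items(), key=lambda kv: kv[1])
        if p < 1 and p > 1 - eps and p >= eps:
            out.append({sym: F(1)})  # keep only the leading character, w.p. 1
        else:
            out.append(dict(t))      # solid and branching positions are unchanged
    return out


EPS = F(3, 10)  # assumed value, not given on the slide
for location, t in enumerate(lst(T, EPS), start=1):
    print(location, t)
```

Locations 3 and 6 collapse to the single character C with probability 1, which matches the LST(T) row of the example; all other locations are copied unchanged.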

  22. Extended Maximal Factor. Def 5: X is an extended maximal factor of T if X is a maximal factor of LST(T). Example (one bracket per location). T: [<A,1-δ> <B,δ>] [<A,1-δ> <B,δ>] [<C,1>] [<C,1>] [<A,1-δ> <B,δ>] [<A,1-δ> <B,δ>]. LST(T): [<A,1>] [<A,1>] [<C,1>] [<C,1>] [<A,1>] [<A,1>].

  23. Lemma 1: The total length of all extended maximal factors is at most O(n∙(1/ε)² log(1/ε)). Corollary: For constant ε, the total length of all extended maximal factors is linear.

  24. Lemma 1. Why can we assume constant ε? • In practice we want patterns that appear with noticeable probabilities, e.g. 90%, 50% or 20%. • Finding patterns w.p. at least 20% => 1/ε = 5. • A smaller percentage means a smaller ε, which is rarely needed in practice.

  25. Proof of Lemma 1. Case 1: ε > 1/2, i.e. we search for patterns w.p. > 50%. Observation: At each location at most 1 character has prob. > 50%. • Total length of all factors is ≤ n. For the rest of the proof we assume ε ≤ 1/2.

  26. Proof of Lemma 1. Claim 1: An (extended) maximal factor passes by at most O((1/ε)∙log(1/ε)) branching positions. Proof: Denote by lb the max. number of branching positions passed. In a branching position all characters have prob. of appearance ≤ 1-ε, so a factor whose probability stays ≥ ε can pass only a bounded number of them.
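The formula that followed on the slide is missing from the transcript. One way to complete the argument is sketched below, using natural logarithms and writing l_b for lb; this is a reconstruction, not the authors' exact derivation.

```latex
% X is a factor with probability >= \varepsilon that passes l_b branching positions;
% each branching position contributes a factor of at most 1-\varepsilon.
\begin{align*}
\varepsilon \;\le\; \Pr[X] \;\le\; (1-\varepsilon)^{l_b}
&\;\Longrightarrow\; l_b \;\le\; \frac{\ln(1/\varepsilon)}{\ln\!\bigl(1/(1-\varepsilon)\bigr)} \\
&\;\le\; \frac{1}{\varepsilon}\,\ln\frac{1}{\varepsilon}
\qquad\text{since } \ln\frac{1}{1-\varepsilon} \ge \varepsilon \text{ for } 0<\varepsilon<1 .
\end{align*}
```

This gives l_b = O((1/ε) log(1/ε)), which is the bound stated in Claim 1.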

  27. Proof of Lemma 1. Claim 2: At most 1/ε extended maximal factors start at each location. Intuition (one bracket per location): [<A_1,ε> <A_2,ε> … <A_{1/ε},ε>] [<B,1>] [<C,1>] and [<A_1,1/2> <A_2,1/2>] [<B_1,2ε> <B_2,2ε> … <B_{1/(2ε)},2ε>] [<C,1>].

  28. Proof of Lemma 1. Claim 1: An (extended) maximal factor passes by ≤ O((1/ε) log(1/ε)) branching positions. Claim 2: At most 1/ε extended maximal factors start at each location. Corollary: each location is in ≤ O((1/ε)² log(1/ε)) extended maximal factors.

  29. Proof of Lemma 1. There are lb starting locations, and from each location there are ≤ 1/ε extended maximal factors. Corollary: each location is in ≤ O((1/ε)² log(1/ε)) extended maximal factors.

  30. Finding Extended Maximal Factors. Algorithm for finding extended maximal factors: • Transform T to LST(T). • Find all maximal factors in LST(T) by: (a) at each starting location, try to extend until the prob. drops below ε; (b) backtrack to the previous branching position, try to extend the factor from there, and so on. Run time: linear in the output length.
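The extend-and-backtrack procedure can be sketched as a depth-first search over LST(T). This is an illustration under several assumptions: ε = 3/10 and the six-location example text are chosen for the demonstration, all function names are hypothetical, and this simple version also explores strings that later fail the left-extension check, so it does not achieve the output-linear running time claimed on the slide.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]


def lst(T, eps):
    """Leading to Solid Transformation (leading character gets probability 1)."""
    out = []
    for t in T:
        sym, p = max(t.items(), key=lambda kv: kv[1])
        out.append({sym: F(1)} if 1 - eps < p < 1 and p >= eps else dict(t))
    return out


def maximal_factors(T, eps):
    """Return (1-based start, string) for every maximal factor of T."""
    n, out = len(T), []

    def extend(start, pos, prefix, prob):
        extended = False
        if pos < n:
            for sym, q in sorted(T[pos].items()):
                if prob * q >= eps:  # (a) keep extending while prob >= eps
                    extended = True
                    extend(start, pos + 1, prefix + sym, prob * q)
        # Returning from the recursion is the backtracking step (b).
        if not extended and prefix:
            # Right-maximal; report it only if it is also left-maximal.
            if start == 0 or all(prob * q < eps for q in T[start - 1].values()):
                out.append((start + 1, prefix))

    for i in range(n):
        extend(i, i, "", F(1))
    return out


EPS = F(3, 10)  # assumed value
print(maximal_factors(lst(T, EPS), EPS))
```

Because probabilities are measured in LST(T), a reported factor such as ACD at location 2 may have a true probability in T below ε (here 1/4), which is why the framework later needs a property to filter matches.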

  31. Framework for Solving Weighted Matching Problems: • Find all extended maximal factors of T. • Concatenate the factors (adding $'s between them) to get T'. • Compute the property π of T' by extending probabilities until they drop below ε. • Run the property matching algorithm on the text T' with property π.
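The framework can be sketched end to end for exact matching. Everything here is a simplified illustration: the list `FACTORS` holds the extended maximal factors of the six-location example text for an assumed ε = 3/10, the interval computation is quadratic per factor rather than optimized, `str.find` stands in for the regular matcher, and all names are hypothetical.

```python
from fractions import Fraction as F

T = [
    {"A": F(1, 2), "B": F(3, 8), "C": F(1, 8)},
    {"A": F(1, 3), "B": F(1, 3), "D": F(1, 3)},
    {"A": F(1, 4), "C": F(3, 4)},
    {"D": F(1)},
    {"B": F(1, 3), "C": F(2, 3)},
    {"B": F(1, 9), "C": F(8, 9)},
]
EPS = F(3, 10)  # assumed value
# Extended maximal factors of T for EPS, as (1-based location, string).
FACTORS = [(1, "A"), (1, "B"), (2, "ACD"), (2, "BCD"), (2, "DCD"),
           (3, "CDBC"), (3, "CDCC")]


def build_property_text(T, factors, eps):
    """Concatenate factors with '$' and compute property intervals on the result."""
    text, intervals, origin = "", [], []
    for loc, X in factors:
        offset = len(text)
        probs = [T[loc - 1 + j][ch] for j, ch in enumerate(X)]  # true probs in T
        for a in range(len(X)):
            p, b = F(1), a
            while b < len(X) and p * probs[b] >= eps:  # extend until below eps
                p *= probs[b]
                b += 1
            if b > a:
                intervals.append((offset + a + 1, offset + b))  # 1-based, inclusive
        origin += [loc + j for j in range(len(X))] + [None]  # None marks '$'
        text += X + "$"
    return text, intervals, origin


def property_match(text, pattern, intervals):
    """1-based starts of occurrences fully contained in some interval."""
    n, m = len(text), len(pattern)
    reach = [0] * (n + 2)
    for s, f in intervals:
        reach[s] = max(reach[s], f)
    for i in range(1, n + 1):
        reach[i] = max(reach[i], reach[i - 1])
    hits, pos = [], text.find(pattern)
    while pos != -1:
        if pos + m <= reach[pos + 1]:
            hits.append(pos + 1)
        pos = text.find(pattern, pos + 1)
    return hits


def weighted_match(T, P, factors, eps):
    """Locations of T where P occurs w.p. >= eps, via property matching on T'."""
    text, intervals, origin = build_property_text(T, factors, eps)
    return sorted({origin[i - 1] for i in property_match(text, P, intervals)})


print(weighted_match(T, "CD", FACTORS, EPS))
```

The pattern ACD illustrates why the property is needed: it appears in T' inside the factor ACD, but its true probability in T is 1/4 < ε, so no interval covers it and the match is discarded.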

  32. Conclusions. • Our framework yields: • Solutions to unsolved weighted matching problems (scaled, swapped, parameterized matching, indexing). • Efficient solutions to others (exact and approximate). • For constant ε: • Weighted matching problems can be solved in the same running times as regular pattern matching. • Weighted indexing can be solved in the same times except for O(n log log n) preprocessing.
