
Dictionary Matching with One Gap

Dictionary Matching with One Gap. Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom. CPM 2014, Moscow. Outline: the DMG (Dictionary Matching with one Gap) problem; motivation; previous work; bidirectional suffix trees solution; lookup table addition.


Presentation Transcript


  1. Dictionary Matching with One Gap. Amihood Amir, Avivit Levy, Ely Porat and B. Riva Shalom. CPM 2014

  2. CPM 2014 - Moscow

  3. MIND THE GAP!

  4. Outline
     • The DMG (Dictionary Matching with one Gap) Problem
     • Motivation
     • Previous Work
     • Bidirectional Suffix Trees Solution
     • Lookup Table addition
     • Open Problems

  5. The DMG Problem
     A gapped pattern is a pattern P of the form: P_1 {α_1,β_1} P_2 {α_2,β_2} … P_{k-1} {α_{k-1},β_{k-1}} P_k
     Each P_j is over alphabet Σ; {α_j,β_j} is a sequence of at least α_j and at most β_j don't cares (written @).
     Example: aba{3,6}cbb stands for
       aba@@@cbb
       aba@@@@cbb
       aba@@@@@cbb
       aba@@@@@@cbb
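The expansion in the example above can be reproduced mechanically. The following is a minimal sketch for the single-gap case only; the helper names `parse_gapped` and `expand` are illustrative and do not come from the talk.

```python
import re

def parse_gapped(pattern: str):
    """Split a single-gap pattern such as 'aba{3,6}cbb' into (first, lo, hi, second)."""
    m = re.fullmatch(r"(\w+)\{(\d+),(\d+)\}(\w+)", pattern)
    if m is None:
        raise ValueError(f"not a single-gap pattern: {pattern!r}")
    first, lo, hi, second = m.groups()
    return first, int(lo), int(hi), second

def expand(pattern: str, dont_care: str = "@"):
    """List every concrete instance of the pattern, one per allowed gap length."""
    first, lo, hi, second = parse_gapped(pattern)
    return [first + dont_care * gap + second for gap in range(lo, hi + 1)]

for instance in expand("aba{3,6}cbb"):
    print(instance)
```

Run as is, this prints the four instances listed on the slide, from `aba@@@cbb` to `aba@@@@@@cbb`.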

  6. The DMG Problem
     The DMG problem is:
     • Preprocess: A dictionary D of d gapped patterns P_1, …, P_d over alphabet Σ.
     • Query: A text T of length n over alphabet Σ.
     • Output: all locations in T where a dictionary gapped pattern ends.
     We focus on DMG with a single gap.

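  7. Example
     Dictionary: P_1 = aba{3,6}cbb, P_2 = ab{3,6}bbac, P_3 = aa{3,6}ac
     Query text:
       position: 1 2 3 4 5 6 7 8 9 10 11
       text:     a b a a b a c b b a  c
     [Figure: occurrences of the subpatterns P_{1,1}, P_{2,1}, P_{3,1} and P_{1,2}, P_{2,2}, P_{3,2} marked on the text.]
     First = ∪_{1≤i≤d} {P_{i,1}}, Second = ∪_{1≤i≤d} {P_{i,2}}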
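To make the expected output of this example concrete, here is a brute-force baseline. It is a sketch only, not the algorithm of the talk: it tries every start of the first subpattern and every allowed gap length. The function name `naive_dmg` and the tuple layout of the dictionary are assumptions made for illustration.

```python
def naive_dmg(dictionary, text):
    """Return sorted (end_position, pattern_index) pairs, both 1-indexed.

    dictionary: list of (first, lo, hi, second) tuples, one per gapped pattern.
    """
    hits = set()
    for idx, (first, lo, hi, second) in enumerate(dictionary, start=1):
        for s in range(len(text) - len(first) + 1):   # 0-based start of first part
            if not text.startswith(first, s):
                continue
            for gap in range(lo, hi + 1):
                t = s + len(first) + gap              # 0-based start of second part
                if text.startswith(second, t):
                    hits.add((t + len(second), idx))  # 1-based end position
    return sorted(hits)

dictionary = [("aba", 3, 6, "cbb"), ("ab", 3, 6, "bbac"), ("aa", 3, 6, "ac")]
print(naive_dmg(dictionary, "abaabacbbac"))  # [(9, 1), (11, 2), (11, 3)]
```

On the slide's text, P_1 ends at location 9 and both P_2 and P_3 end at location 11.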

  8. Motivation
     • Computational biology.
     • Renewed interest due to cyber security.
     • Network intrusion detection systems perform protocol analysis, content searching and content matching to detect harmful software.
     • Malware may appear in several packets!

  9. Previous Work
     • The gapped pattern matching problem has been studied for a few decades, e.g. [Myers, JACM 1992], [Navarro & Raffinot, Algorithmica 2004], [Bille & Thorup, ICALP 2009], [Bille & Thorup, SODA 2010], [Morgante et al., JCB 2005], [Rahman et al., COCOON 2006], [Bille et al., TCS 2012].
     • The DMG problem has not been studied enough! [Kucherov & Rusinowitch, TCS 1997], [Zhang et al., IPL 2010]: no bounds on the length of the gap.

  10. Bi-directional suffix trees algorithm
      Gapped pattern: ab{3,6}bbac
      Query: abaabacbbac

  11. Bi-directional suffix trees algorithm
      Idea: as in [Amir et al., JAL 2000].
      Gapped patterns: P_1 = aba{3,6}abac, P_2 = aba{3,6}bba, P_3 = ab{3,6}baa
      Query: abaabacbbac
      Use the suffix tree T_FR of First^R (the reversed first subpatterns).
      Use the suffix tree T_S of Second.

  12. Bi-directional suffix trees algorithm
      For each text location ℓ:
        Insert t_ℓ t_{ℓ+1} … t_n into T_S (reaching node h) to find the labels on the path to h. This finds each P_{i,2} starting at location ℓ.
        For f = ℓ − α − 1 to ℓ − β − 1:
          Insert t_f t_{f−1} … t_1 into T_FR (reaching node g) to find the labels on the path to g. This finds each P_{i,1} ending at location f.
          Output the intersection (for end locations).
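The loop structure above can be sketched in a few lines. This sketch makes several assumptions that are not in the talk: plain dictionary-based tries stand in for the suffix trees T_S and T_FR, the intersection is an ordinary set intersection rather than range queries or a lookup table, and all patterns share one gap bound {lo, hi}. All function names are illustrative.

```python
def build_trie(words):
    """Trie as nested dicts; the key None holds the ids of words ending at a node."""
    root = {}
    for idx, word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(None, set()).add(idx)
    return root

def ids_on_path(trie, text, start, step):
    """Walk the text from `start` in direction `step`; map pattern id -> matched length."""
    found, node, pos, depth = {}, trie, start, 0
    while 0 <= pos < len(text) and text[pos] in node:
        node = node[text[pos]]
        pos += step
        depth += 1
        for idx in node.get(None, ()):
            found[idx] = depth
    return found

def dmg_bidirectional(dictionary, lo, hi, text):
    """dictionary: list of (first, second) pairs sharing the gap bounds {lo, hi}."""
    t_s = build_trie((i, sec) for i, (_, sec) in enumerate(dictionary, 1))
    t_fr = build_trie((i, fst[::-1]) for i, (fst, _) in enumerate(dictionary, 1))
    hits = set()
    for l in range(len(text)):                    # 0-based start of a second part
        seconds = ids_on_path(t_s, text, l, +1)   # forward walk in T_S
        if not seconds:
            continue
        for f in range(l - hi - 1, l - lo):       # f = l - hi - 1 .. l - lo - 1
            if f < 0:
                continue
            firsts = ids_on_path(t_fr, text, f, -1)    # backward walk in T_FR
            for idx in firsts.keys() & seconds.keys():
                hits.add((l + seconds[idx], idx))      # 1-based end position
    return sorted(hits)

print(dmg_bidirectional([("aba", "cbb"), ("ab", "bbac"), ("aa", "ac")], 3, 6, "abaabacbbac"))
```

The set intersection in the innermost loop is the step that the range-query and lookup-table variants of the talk speed up.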

  13. Bi-directional suffix trees algorithm - Intersection
      Patterns: {(1,4),(2,9),(3,7),…,(6,5),…}
      [Figure: trees T_FR and T_S with numbered nodes and marked nodes g and h; ranges [1,9] and [2,7].]

  14. Bi-directional suffix trees algorithm (continued)
      Intersection via range queries.
      [Figure: grid with the points (1,4), (2,9), (3,7), (6,5), (8,8) and the ranges [1,9] and [2,7].]
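The grid intersection can be illustrated with a simple orthogonal range report. This is only a sketch: a sorted list with binary search on one coordinate and a linear filter on the other stands in for the structure of [Chan et al., SoCG 2011], and the assignment of the range [1,9] to the first coordinate and [2,7] to the second is an assumption, since the transcript does not say which axis each range belongs to. The class name `PointSet` is illustrative.

```python
from bisect import bisect_left, bisect_right

class PointSet:
    """Points (x, y), one per pattern: x is a node number in T_FR, y one in T_S."""

    def __init__(self, points):
        self.points = sorted(points)
        self.xs = [x for x, _ in self.points]

    def report(self, x_range, y_range):
        """Return all points with x in x_range and y in y_range (both inclusive)."""
        (x1, x2), (y1, y2) = x_range, y_range
        lo = bisect_left(self.xs, x1)    # first point with x >= x1
        hi = bisect_right(self.xs, x2)   # one past the last point with x <= x2
        return [p for p in self.points[lo:hi] if y1 <= p[1] <= y2]

grid = PointSet([(1, 4), (2, 9), (3, 7), (6, 5), (8, 8)])
print(grid.report((1, 9), (2, 7)))  # [(1, 4), (3, 7), (6, 5)]
```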

  15. Time & Space
      • Preprocessing time: suffix tree and reverse suffix tree of the dictionary segments: O(|D|). Preprocessing the grid for range queries: O(d log d) [Chan et al., SoCG 2011].
      • Preprocessing space: suffix tree and reverse suffix tree of the dictionary segments: O(|D|). Space for the grid: O(d log d) [Chan et al., SoCG 2011].

  16. Time & Space
      • Query time: for each end text location, we try every gap size: a factor of (β − α).
        The number of range queries is the number of vertical paths in a given path: O(log² min{d, log |D|}).
        A range query costs O(log log d + occ) [Chan et al., SoCG 2011].
        Total: O(n(β − α) log log d log² min{d, log |D|} + occ).
      [Figure: a root-to-g path through nodes 1, 3, 6, 9.]
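The total above is the product of the factors listed on the slide. Written out as a worked equation (a restatement of the slide's bounds, with the output size occ counted once overall):

```latex
\[
\underbrace{n}_{\text{text locations}}
\cdot \underbrace{(\beta-\alpha)}_{\text{gap sizes}}
\cdot \underbrace{O\!\left(\log^{2}\min\{d,\log|D|\}\right)}_{\text{range queries per pair}}
\cdot \underbrace{O(\log\log d)}_{\text{cost per query}}
\;+\; occ
\;=\; O\!\left(n(\beta-\alpha)\,\log\log d\,\log^{2}\min\{d,\log|D|\} + occ\right).
\]
```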

  17. Lookup Table algorithm
      Idea: instead of using range queries in a grid to compute the intersection, we use a pre-computed lookup table. This enables intersection in O(occ) time.
      • Total query time becomes O(n(β − α) + occ).

  18. Lookup Table algorithm
      • Inter[g,h] = all i s.t. P_{i,1}^R appears on the path from the root of T_FR to node g and P_{i,2} appears on the path from the root of T_S to node h.
      • P_1=(1,4), P_2=(2,9), P_3=(3,7), P_4=(3,2), …, P_6=(6,5), P_7=(9,6)
      Inter[3,5] = {4}
      [Figure: trees T_FR and T_S with the current nodes g and h marked.]

  19. Lookup Table algorithm
      • Inter[g,h] = all i s.t. P_{i,1}^R appears on the path from the root of T_FR to node g and P_{i,2} appears on the path from the root of T_S to node h.
      • P_1=(1,4), P_2=(2,9), P_3=(3,7), P_4=(3,2), …, P_6=(6,5), P_7=(9,6)
      Inter[3,5] = {4}, Inter[3,7] = {3,4}
      [Figure: trees T_FR and T_S with the current nodes g and h marked.]

  20. Lookup Table algorithm
      • Inter[g,h] = all i s.t. P_{i,1}^R appears on the path from the root of T_FR to node g and P_{i,2} appears on the path from the root of T_S to node h.
      • P_1=(1,4), P_2=(2,9), P_3=(3,7), P_4=(3,2), …, P_6=(6,5), P_7=(9,6)
      Inter[3,5] = {4}, Inter[3,7] = {3,4}, Inter[6,7] = {3,4,6}
      [Figure: trees T_FR and T_S with the current nodes g and h marked.]

  21. Lookup Table algorithm
      • Inter[g,h] = all i s.t. P_{i,1}^R appears on the path from the root of T_FR to node g and P_{i,2} appears on the path from the root of T_S to node h.
      • P_1=(1,4), P_2=(2,9), P_3=(3,7), P_4=(3,2), …, P_6=(6,5), P_7=(9,6)
      Inter[3,5] = {4}, Inter[3,7] = {3,4}, Inter[6,7] = {3,4,6}, Inter[9,7] = {3,4,6}
      [Figure: trees T_FR and T_S with the current nodes g and h marked.]
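The Inter values on these slides can be checked directly against the definition. The sketch below assumes a tree shape that the transcript does not give: in T_FR the nodes 1, 3, 6, 9 lie on one root-to-leaf path, in T_S the nodes 2, 5, 7 lie on one path, and every other node hangs directly off the root. The parent maps, the function names, and the omission of the elided pattern P_5 are all choices made for illustration; the table is filled by brute force, not by the dynamic program mentioned in the talk.

```python
def path_to(node, parent):
    """Nodes from the top of the tree down to `node` (inclusive), via a parent map."""
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path[::-1]

def build_inter(patterns, parent_fr, parent_s):
    """Inter[g, h] = ids i whose (first node, second node) lie on the paths to g and h."""
    inter = {}
    for g in parent_fr:
        anc_g = set(path_to(g, parent_fr))
        for h in parent_s:
            anc_h = set(path_to(h, parent_s))
            inter[g, h] = {i for i, (x, y) in patterns.items()
                           if x in anc_g and y in anc_h}
    return inter

# Pattern i -> (node of P_{i,1}^R in T_FR, node of P_{i,2} in T_S), as on the slide.
patterns = {1: (1, 4), 2: (2, 9), 3: (3, 7), 4: (3, 2), 6: (6, 5), 7: (9, 6)}
# Hypothetical tree shapes; None marks a node directly below the root.
parent_fr = {1: None, 2: None, 3: 1, 6: 3, 9: 6}
parent_s = {2: None, 4: None, 5: 2, 6: None, 7: 5, 9: None}

inter = build_inter(patterns, parent_fr, parent_s)
for key in [(3, 5), (3, 7), (6, 7), (9, 7)]:
    print(key, sorted(inter[key]))
```

Under these assumed shapes the four entries come out as {4}, {3,4}, {3,4,6} and {3,4,6}, matching the slides. At query time a single table access per (g, h) pair then replaces the range queries.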

  22. Lookup Table algorithm
      • P_1=(1,4), P_2=(2,9), P_3=(3,7), P_4=(3,2), …, P_6=(6,5), P_7=(9,6), …
      Inter[3,5] = {4}, Inter[3,7] = {3,4}, Inter[6,7] = {3,4,6}
      [Figure: the lookup table Inter shown next to the trees T_FR and T_S.]

  23. Lookup Table algorithm
      • Preprocessing:
        • Time: the table can be computed using DP in time O(d² · ovr + |D|), where ovr is the number of subpatterns that include another subpattern as a prefix or suffix.
        • Space: O(d² + |D|).
      • Query time: O(n(β − α) + occ).

  24. Our Results
      • Bi-directional suffix trees & range queries: preprocessing time O(d log d + |D|); space O(d log d + |D|); query time O(n(β − α) log log d log² min{d, log |D|} + occ).
      • Bi-directional suffix trees & lookup table: preprocessing time O(d² · ovr + |D|); space O(d² + |D|); query time O(n(β − α) + occ).

  25. Open Problems
      • Generalizing to k gaps
      • Reducing the dependency on the gap size (β − α)
      • Scalability to different gap bounds in the dictionary
      • Online algorithm

  26. Thank You!