Space-Efficient String Mining under Frequency Constraints



  1. Space-Efficient String Mining under Frequency Constraints
     Johannes Fischer (Ludwig-Maximilians-Universität München)
     Veli Mäkinen and Niko Välimäki (University of Helsinki)

  2. Frequent string mining: optimal time
     • "frequent" is most frequent but does not make a difference...
     • "I" differentiates DB1 from DB2
     • "We are" differentiates DB2 from DB1
     • String mining under several kinds of frequency constraints can be done in optimal linear time using suffix array techniques [FHK06].
     DB1: "I am frequent", "I am also frequent", "Am I also making a difference"
     DB2: "We are frequent", "We are also frequent", "We are all frequent"
     Workshop on Compression, Santiago, Chile
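
To make the three observations on this slide concrete, here is a brute-force sketch that counts, for every substring, the number of documents of a database that contain it. It is not the linear-time method of [FHK06]; the function names and the threshold parameters `min_pos` and `max_neg` are illustrative assumptions.

```python
def substrings(s):
    # All distinct non-empty substrings of s (quadratic, fine for toy inputs).
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def doc_frequency(db):
    # Map each substring to the number of documents in db that contain it.
    freq = {}
    for doc in db:
        for sub in substrings(doc):
            freq[sub] = freq.get(sub, 0) + 1
    return freq

def mine(f_pos, f_neg, min_pos, max_neg):
    # Substrings frequent in one database and infrequent in the other.
    return {s for s, c in f_pos.items()
            if c >= min_pos and f_neg.get(s, 0) <= max_neg}

db1 = ["I am frequent", "I am also frequent", "Am I also making a difference"]
db2 = ["We are frequent", "We are also frequent", "We are all frequent"]
f1, f2 = doc_frequency(db1), doc_frequency(db2)

print(f1["frequent"], f2["frequent"])   # frequent in both databases
print("I" in mine(f1, f2, 3, 0))        # in every DB1 document, absent from DB2
print("We are" in mine(f2, f1, 3, 0))   # in every DB2 document, absent from DB1
```

With these thresholds, "frequent" is never reported because it is frequent on both sides, while "I" and "We are" separate the two databases.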

  3. Frequent string mining: optimal space?
     • "frequent" is most frequent but does not make a difference...
     • "I" differentiates DB1 from DB2
     • "We are" differentiates DB2 from DB1
     • Problem: Can string mining be done using asymptotically the same space as is needed for storing the string collection?
     DB1: "I am frequent", "I am also frequent", "Am I also making a difference"
     DB2: "We are frequent", "We are also frequent", "We are all frequent"

  4. Our result: Space-efficient string mining
     • Given a collection C of d documents with overall length n = ||C|| = ∑_{T ∈ C} |T|, where T ∈ Σ* for every T ∈ C.
     • We give a string mining algorithm that uses
       • O(n log |Σ| + d log n) bits of working space and
       • O(n log n) time.
     • Since usually d << n, the solution is significantly more space-efficient than previous ones that use O(n log n) bits of working space.
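
As a rough feel for the gap between the two bounds, the sketch below evaluates both expressions with all hidden constants set to 1. The constants and the sample values of n, |Σ| and d are assumptions chosen only for illustration.

```python
import math

def bits_space_efficient(n, sigma, d):
    # n log|Sigma| + d log n, constants ignored.
    return n * math.log2(sigma) + d * math.log2(n)

def bits_previous(n):
    # n log n, constants ignored.
    return n * math.log2(n)

n, sigma, d = 2**30, 4, 1024   # hypothetical: 1 Gi symbols over a DNA-sized alphabet, 1024 documents
ours = bits_space_efficient(n, sigma, d)
prev = bits_previous(n)
print(ours, prev, prev / ours)
```

For these values the second term d log n is negligible next to n log |Σ|, which is the point of the remark that usually d << n.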

  5. High-level description
     • Tight integration of the algorithm of Kasai et al. [Kasetal01] for visiting all branching substrings of a text with Hui's [Hui92] color set size technique.
     • Toolbox: compressed suffix array, compressed LCP values, range minimum queries, searchable partial sums.

  6. Overview without compressed structures
     Example text T (four documents, each terminated by #) with its suffix array SA and LCP array:
     i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
     T:    a  a  b  a  #  a  b  a  a  a  b  #  b  b  a  b  b  #  a  b  b  a  #
     SA:   5 12 18 23  4 22  8  9  1 10  2  6 15 19 11 17  3 21  7 14 16 20 13
     LCP:  0  0  0  0  0  1  1  2  3  1  2  3  2  3  0  1  1  2  2  2  1  2  3
     RMQ(LCP,8,14)=1
     [Figure: the sorted suffixes written as columns below the arrays.]
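
The arrays of this slide can be reproduced with a naive construction. The sketch below assumes that every # is a distinct terminator, smaller than any letter and ordered by text position, which is the convention consistent with the LCP values shown; `key`, `common_prefix` and `rmq` are illustrative helper names, and the sort is quadratic rather than the linear-time machinery the talk relies on.

```python
T = "aaba#abaaab#bbabb#abba#"   # slide 6 text; '#' terminates each document
n = len(T)

def key(i):
    # Suffix starting at 0-based position i as a list of comparable symbols.
    # Assumption: each '#' is unique, ordered by position, smaller than letters.
    return [(0, j) if T[j] == "#" else (1, ord(T[j])) for j in range(i, n)]

def common_prefix(a, b):
    # Length of the longest common prefix of two symbol lists.
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

order = sorted(range(n), key=key)
SA = [i + 1 for i in order]     # 1-based text positions, as on the slide
LCP = [0] + [common_prefix(key(order[r - 1]), key(order[r])) for r in range(1, n)]

def rmq(values, lo, hi):
    # Naive range minimum over 1-based inclusive ranks lo..hi.
    return min(values[lo - 1:hi])

print(SA)
print(LCP)
print(rmq(LCP, 8, 14))          # prints 1, matching RMQ(LCP,8,14)=1
```

The sort places suffix 17 at rank 16, directly after suffix 11; both start with b#. The query RMQ(LCP,8,14) is the length of the longest common prefix of the suffixes at ranks 7 and 14 (aaab... and abba...), namely the single letter a.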

  7. Right-most path of suffix tree
     [Figure: SA and LCP arrays as on slide 6, with the right-most path of the suffix tree drawn above them; edge labels a, a, b, a#.]

  8. Suffixes-insertion algorithm
     [Figure: SA and LCP arrays as on slide 6, with the partially built suffix tree drawn above them; labels a, b#, a, b, a#.]

  9. Maintain only the right-most path
     • Once a node is popped, its subtree is complete, and all statistics for the substring ending at the node can be reported.
     [Figure: SA and LCP arrays as on slide 6, with only the right-most path kept; labels a, b#, a, b, a#.]
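
The pop-and-report idea can be sketched with the usual stack simulation over the LCP array, in the spirit of [Kasetal01]. This version uses a plain Python list as the stack and 0-based ranks, reports each branching node as (string depth, leftmost rank, rightmost rank), and leaves out the root; those choices are assumptions of the sketch.

```python
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]

def branching_nodes(lcp):
    # Visit suffixes in SA order, keeping only the right-most path on a stack.
    reported = []
    stack = [(0, 0)]                 # (string depth, leftmost rank); root at depth 0
    n = len(lcp)
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else 0  # sentinel 0 flushes the path at the end
        left = i - 1
        while stack[-1][0] > cur:
            depth, l = stack.pop()    # subtree of this node is now complete
            reported.append((depth, l, i - 1))
            left = l
        if stack[-1][0] < cur:
            stack.append((cur, left))  # new node extends the right-most path
    return reported

nodes = branching_nodes(LCP)
for depth, l, r in nodes:
    print(depth, l, r, "leaves:", r - l + 1)
```

On the example, the node for the substring aa is reported as (2, 6, 8): string depth 2, covering three suffixes.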

  10. Hui's algorithm
     • Store at each node v of the suffix tree the values:
       • S[v]: number of leaves in the subtree of v, and
       • C[v]: number of duplicate occurrences of the substring ending at node v.
     • S[v]-C[v] tells how many different documents there are in the subtree of v; in other words, S[v]-C[v] is the frequency of the substring ending at node v.
     Example: the node for the substring aa has S[v]=3 and C[v]=1.
     D (document number of each suffix, in SA order):
     D:    0  1  2  3  0  3  1  1  0  1  0  1  2  3  1  2  0  3  1  2  2  3  2
     [Figure: D, SA and LCP arrays with the sorted suffixes below; SA and LCP as on slide 6.]
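
The two counters can be checked directly on the D array for any node given as a range of ranks. The sketch below recounts duplicates by scanning the range, which is far slower than Hui's technique [Hui92] and serves only to confirm the numbers on the slide; ranks are 0-based and `hui_counts` is an illustrative name.

```python
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]

def hui_counts(left, right):
    # Returns (S, C, S - C) for the node covering 0-based ranks left..right.
    seen = set()
    duplicates = 0
    for doc in D[left:right + 1]:
        if doc in seen:
            duplicates += 1   # this leaf repeats a document already in the subtree
        seen.add(doc)
    leaves = right - left + 1
    return leaves, duplicates, leaves - duplicates

print(hui_counts(6, 8))    # node "aa": (3, 1, 2)
print(hui_counts(4, 13))   # node "a":  (10, 6, 4)
```

The node for aa covers the suffixes at 1-based ranks 7..9, which belong to documents 1, 1 and 0, so aa occurs in two different documents.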

  11. Making it all space-efficient [1/5]
     • The right-most path is kept in a special stack:
       • Relative string depths are coded using Elias codes.
       • Takes O(n) bits.
       • Allows constant-time pop/push.
     [Figure: SA and LCP arrays as on slide 6, with the right-most path; edge labels a, a, b, a#.]
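
A minimal sketch of such a stack, assuming Elias gamma codes and a layout in which each code is stored reversed so that it can be decoded from the top of the stack. The slide does not say which Elias code or layout is used, so both are assumptions, and a Python list of bits stands in for a packed bit array (so the operations here are not constant time).

```python
class GammaStack:
    # Stack of strictly increasing string depths, stored as gamma-coded gaps.

    def __init__(self):
        self.bits = []   # stand-in for a packed bit array
        self.top = 0     # depth on top of the stack (0 = root)

    def push(self, depth):
        gap = depth - self.top
        assert gap >= 1, "depths on the right-most path strictly increase"
        binary = [int(b) for b in bin(gap)[2:]]
        gamma = [0] * (len(binary) - 1) + binary   # Elias gamma code of gap
        self.bits.extend(reversed(gamma))          # reversed: decodable from the end
        self.top = depth

    def pop(self):
        assert self.bits, "stack is empty"
        zeros = 0
        while self.bits[-1] == 0:                  # unary length prefix
            self.bits.pop()
            zeros += 1
        gap = 0
        for _ in range(zeros + 1):                 # binary part, most significant first
            gap = (gap << 1) | self.bits.pop()
        depth = self.top
        self.top -= gap
        return depth

stack = GammaStack()
for depth in (1, 2, 7):
    stack.push(depth)
print(len(stack.bits))                             # 1 + 1 + 5 bits
print(stack.pop(), stack.pop(), stack.pop())
```

A gamma code for a gap g takes 2⌊log g⌋+1 ≤ 2g−1 bits, and the gaps on the stack sum to the top depth, which is at most n. Under these assumptions the stack therefore never holds more than 2n bits.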

  12. Making it all space-efficient [2/5]
     • Preliminary counter values S[v] along the right-most path are encoded in the same way as the stack.
     • Once a node v is popped, its S[v] value is final and is added to the value of its parent.
     • O(n) bits with constant-time updates.
     [Figure: as on slide 10; the node for aa is annotated with S[v]=3, C[v]=1.]

  13. Making it all space-efficient [3/5]
     • Preliminary counter values C[v] along the right-most path are encoded using a dynamic searchable partial sums structure.
     • Once a node v is popped, its C[v] value is final and is added to the value of its parent.
     • O(n) bits with O(log n)-time updates.
     [Figure: as on slide 10; the node for aa is annotated with S[v]=3, C[v]=1.]

  14. Making it all space-efficient [4/5]
     • Table D stores, for each suffix in lexicographic order, the number of the document it belongs to.
     • A predecessor query on D gives the previous suffix (in lexicographic order) that belongs to the same document.
     • An RMQ on LCP between the two positions gives the string depth at which the C[v] counter should be incremented.
     [Figure: D, SA and LCP arrays as on slide 10; the node for aa is annotated with S[v]=3, C[v]=1, and three queries are marked RMQ=0, RMQ=2, RMQ=1.]
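
The predecessor-plus-RMQ step can be sketched as follows, with a dictionary playing the role of the predecessor information and a naive minimum in place of a succinct RMQ structure. Ranks are 0-based, and `duplicate_depths` and `node_C` are illustrative names rather than anything from the talk.

```python
D   = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]

def duplicate_depths(D, LCP):
    # For each rank i with an earlier suffix of the same document at rank p,
    # record (i, p, depth), where depth = min LCP[p+1..i] is the string depth
    # of the node whose C counter receives the increment.
    pred = {}                            # document -> last rank seen so far
    events = []
    for i, doc in enumerate(D):
        if doc in pred:
            p = pred[doc]
            depth = min(LCP[p + 1:i + 1])   # naive RMQ
            events.append((i, p, depth))
        pred[doc] = i
    return events

def node_C(events, left, right):
    # C[v] for the node covering ranks left..right: increments whose two
    # suffixes both lie inside the subtree.
    return sum(1 for i, p, _ in events if left <= p and i <= right)

events = duplicate_depths(D, LCP)
print([e for e in events if 6 <= e[0] <= 8])   # depths 0, 2, 1 for ranks 6, 7, 8
print(node_C(events, 6, 8))                    # node "aa"
```

For the three suffixes below the aa node the computed depths are 0, 2 and 1, the same values as the RMQ labels in the figure; only the increment at depth 2 falls inside the subtree, giving C[v]=1.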

  15. Making it all space-efficient [5/5]
     • Table D does not need to be stored, as predecessors can be updated "on-the-fly" using an array pred[1..d].
     • The compressed suffix array supports access in O(log^ε n) time and takes O(n log |Σ|) bits.
     • A bit-vector B[1,n] marks the document boundaries in the text, so that rank(B,SA[i])=D[i].
     • LCP and RMQ structures each take 2n(1+o(1)) bits [HS02,FH07].
     [Figure: D, SA and LCP arrays as on slide 10; the node for aa is annotated with S[v]=3, C[v]=1, and three queries are marked RMQ=0, RMQ=2, RMQ=1.]
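
The identity rank(B,SA[i])=D[i] can be checked on the running example. The sketch assumes one particular convention, namely that B has a 1 at the first position of every document except the first and that rank counts the 1s in B[1..j] inclusive; a prefix-sum list stands in for a succinct rank structure.

```python
T  = "aaba#abaaab#bbabb#abba#"
SA = [5, 12, 18, 23, 4, 22, 8, 9, 1, 10, 2, 6, 15, 19, 11, 17, 3, 21, 7, 14, 16, 20, 13]
n  = len(T)

# B[j] = 1 iff 1-based text position j starts a new document (assumed convention).
B = [0] * (n + 1)                      # index 0 unused
for j in range(2, n + 1):
    if T[j - 2] == "#":                # position j-1 holds a terminator
        B[j] = 1

# prefix[j] = number of 1s in B[1..j]; a real index would use o(n) extra bits.
prefix = [0] * (n + 1)
for j in range(1, n + 1):
    prefix[j] = prefix[j - 1] + B[j]

def rank(j):
    return prefix[j]

D = [rank(s) for s in SA]
print(D)
```

The list printed here equals the D row of the slides, so D never has to be materialised: one SA access and one rank query recover each entry.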

  16. Extensions
     • This presentation only sketched how to compute the frequency values inside one document collection. In addition,
       • the computation is easy to adjust to report patterns occurring frequently in one document collection and infrequently in the other;
       • the computation gives a space-efficient construction algorithm for Sadakane's scheme of storing the frequency values [Sad07]; and
       • other compressed text indexes can be plugged in to obtain other space/time tradeoffs.

  17. Epilogue
     • Thanks to the discussions with Luis Russo after the workshop, we were able to improve the space from O(n log d) to O(d log n).
     • The presentation has been changed accordingly.

  18. References
     [FHK06] Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining under Frequency Constraints. In Proc. PKDD'06, LNAI 4213, pages 139-150, 2006.
     [FH07] Johannes Fischer, Volker Heun: A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
     [FMV07] Johannes Fischer, Veli Mäkinen, Niko Välimäki: Space-efficient String Mining under Frequency Constraints. Submitted.
     [HS02] Wing-Kai Hon, Kunihiko Sadakane: Space-Economical Algorithms for Finding Maximal Unique Matches. In Proc. CPM 2002, LNCS 2373, pages 144-152, 2002.
     [Hui92] Lucas Hui: Color Set Size Problem with Application to String Matching. In Proc. CPM 1992, LNCS 644, pages 230-243, 1992.
     [Kasetal01] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. CPM 2001, LNCS 2089, pages 181-192, 2001.
     [Sad07] Kunihiko Sadakane: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), pages 12-22, 2007.
