Space-Efficient String Mining under Frequency Constraints
Johannes Fischer (Ludwig-Maximilians-Universität München)
Veli Mäkinen and Niko Välimäki (University of Helsinki)
Frequent string mining: optimal time
• "frequent" is most frequent but does not make a difference...
• "I" differentiates DB1 from DB2
• "We are" differentiates DB2 from DB1
• String mining under several kinds of frequency constraints can be done in optimal linear time using suffix array techniques [FHK06].
DB1                            | DB2
I am frequent                  | We are frequent
I am also frequent             | We are also frequent
Am I also making a difference  | We are all frequent
Workshop on Compression, Santiago, Chile
Frequent string mining: optimal space?
• "frequent" is most frequent but does not make a difference...
• "I" differentiates DB1 from DB2
• "We are" differentiates DB2 from DB1
• Problem: Can string mining be done using asymptotically the same space as is needed for storing the string collection?
DB1                            | DB2
I am frequent                  | We are frequent
I am also frequent             | We are also frequent
Am I also making a difference  | We are all frequent
Our result: Space-efficient string mining
• Given a collection C of d documents with overall length n = ||C|| = ∑_{T ∈ C} |T|, where T ∈ Σ* for every T ∈ C.
• We give a string mining algorithm that uses
  • O(n log |Σ| + d log n) bits of working space and
  • O(n log n) time.
• Since usually d << n, the solution is significantly more space-efficient than previous ones, which use O(n log n) bits of working space.
High-level description
• Tight integration of the algorithm of Kasai et al. [Kasetal01] for visiting all branching substrings of a text with Hui's [Hui92] color set size technique.
• Toolbox: compressed suffix array, compressed LCP values, range minimum queries, searchable partial sums.
Overview without compressed structures
[Figure: the text T (four documents, each terminated by #) with its suffix array SA and LCP array; the sorted suffixes were drawn column-wise beneath the arrays.]
i:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T:    a  a  b  a  #  a  b  a  a  a  b  #  b  b  a  b  b  #  a  b  b  a  #
SA:   5 12 18 23  4 22  8  9  1 10  2  6 15 19 11 17  3 21  7 14 16 20 13
LCP:  0  0  0  0  0  1  1  2  3  1  2  3  2  3  0  1  1  2  2  2  1  2  3
RMQ(LCP,8,14)=1
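The arrays of the running example can be reproduced with a naive, uncompressed construction. The following is only a sketch for checking the numbers: the function name is made up for illustration, and it assumes that every separator # is treated as a distinct symbol smaller than all letters, ordered by position, which is what the LCP values of 0 between the #-suffixes suggest.

```python
def suffix_array_and_lcp(text, sep="#"):
    """Naive SA/LCP construction (quadratic or worse; for illustration only)."""
    # Assumption: each separator gets its own rank (by position), so no
    # common prefix ever extends across a document boundary.
    n_sep = text.count(sep)
    codes, seen = [], 0
    for ch in text:
        if ch == sep:
            codes.append(seen)
            seen += 1
        else:
            codes.append(n_sep + ord(ch))  # letters rank above all separators
    n = len(codes)
    # Sort suffix start positions by the content of the suffix.
    sa = sorted(range(n), key=lambda i: codes[i:])
    # lcp[k] = length of the longest common prefix of suffixes sa[k-1], sa[k].
    lcp = [0] * n
    for k in range(1, n):
        a, b = sa[k - 1], sa[k]
        h = 0
        while a + h < n and b + h < n and codes[a + h] == codes[b + h]:
            h += 1
        lcp[k] = h
    return [i + 1 for i in sa], lcp  # 1-based text positions

T = "aaba#abaaab#bbabb#abba#"
SA, LCP = suffix_array_and_lcp(T)
# Naive range minimum over LCP[8..14] (1-based), i.e. list slice 7:14.
rmq_value = min(LCP[7:14])
```

Under these assumptions the 16th suffix array entry comes out as 17 (the suffix b# of the third document), and the range minimum over LCP[8..14] is 1.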
Right-most path of suffix tree
[Figure: right-most path of the partially built suffix tree, with edge labels a, a, b, a#, drawn above the SA and LCP arrays of the running example.]
Suffix-insertion algorithm
[Figure: the next suffix in SA order is inserted; the new right-most path has edge labels a, b#, shown next to the previous branch a, b, a#, above the SA and LCP arrays of the running example.]
Maintain only the right-most path
• Once a node is popped, its subtree is complete, and all statistics for the substring ending at that node can be reported.
[Figure: the right-most path (edge labels a, b#) with the popped branch (a, b, a#), above the SA and LCP arrays of the running example.]
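The idea of keeping only the right-most path can be sketched as a stack-driven scan over the LCP array, in the spirit of the traversal attributed to [Kasetal01]. This is a plain, uncompressed illustration: the function name and the (depth, left, right) tuple format are assumptions made for this example, and indices are 0-based.

```python
def branching_nodes(lcp):
    """Report every internal suffix tree node as (string depth, left, right),
    where [left, right] is its 0-based suffix array interval."""
    n = len(lcp)
    stack = [(0, 0)]  # right-most path: (string depth, left boundary); root first
    reported = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1  # sentinel -1 flushes the stack at the end
        left = i - 1
        # Nodes deeper than the new LCP value leave the right-most path:
        # their subtrees are complete, so they can be reported.
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()
            reported.append((depth, left, i - 1))
        # A strictly larger LCP value opens a new node on the path.
        if cur >= 0 and (not stack or stack[-1][0] < cur):
            stack.append((cur, left))
    return reported

LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
nodes = branching_nodes(LCP)
```

On the running example the node for the substring aa is reported as (2, 6, 8), i.e. string depth 2 and suffix array positions 7..9 in 1-based numbering; the root is reported last.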
Hui's algorithm
• Store at each node v of the suffix tree the values:
  • S[v]: number of leaves in the subtree of v, and
  • C[v]: number of duplicate occurrences of the substring ending at node v.
• S[v]-C[v] tells how many different documents there are in the subtree of v. In other words, S[v]-C[v] is the frequency of the substring ending at node v.
[Figure: the suffix tree node reached by a, a, annotated with S[v]=3 and C[v]=1, above the D, SA and LCP arrays of the running example.]
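The meaning of the two counters can be checked directly on the document array D of the running example. The sketch below simply counts within a suffix array interval; it is not Hui's algorithm itself, and the function name is hypothetical.

```python
# Document number of each suffix in lexicographic order (running example).
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]

def counters_for_interval(D, left, right):
    """Return (S, C, frequency) for the node whose subtree covers the
    0-based suffix array interval [left, right]."""
    docs = D[left:right + 1]
    S = len(docs)             # number of leaves below the node
    C = S - len(set(docs))    # leaves whose document was already seen
    return S, C, S - C        # S - C = number of distinct documents

# The node for "aa" covers suffix array positions 7..9 (1-based).
result = counters_for_interval(D, 6, 8)
```

For the node aa this gives S=3, C=1 and a frequency of 2, matching the annotation on the slide.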
Making it all space-efficient [1/5]
• The right-most path is kept in a special stack:
  • Relative string depths are coded using Elias codes.
  • Takes O(n) bits.
  • Allows constant-time pop/push.
[Figure: right-most path with edge labels a, a, b, a#, above the SA and LCP arrays of the running example.]
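How a stack of relative string depths might be stored with Elias gamma codes is sketched below. Everything here is an assumption made for illustration: the class name, the choice of the gamma code among the Elias codes, and the trick of writing each code reversed so that it can be decoded from the top of the stack. The bits live in a Python list and pop scans bit by bit, so this shows the encoding only, not the constant-time operations or the actual bit-packed layout.

```python
class GammaStack:
    """Stack of strictly increasing depths, stored as gamma-coded gaps."""

    def __init__(self):
        self.bits = []   # encoded gaps, one list element per bit
        self.top = -1    # depth on top of the stack; -1 means empty

    def push(self, depth):
        gap = depth - self.top
        if gap < 1:
            raise ValueError("depths must be strictly increasing")
        binary = [int(b) for b in bin(gap)[2:]]
        # Reversed gamma code: binary digits reversed, then len-1 zeros.
        # The digits end with the leading 1, so the trailing zeros can be
        # counted unambiguously when reading from the top.
        self.bits.extend(reversed(binary))
        self.bits.extend([0] * (len(binary) - 1))
        self.top = depth

    def pop(self):
        if self.top < 0:
            raise IndexError("pop from empty stack")
        zeros = 0
        while self.bits[-1] == 0:   # unary part: length of the gap minus 1
            self.bits.pop()
            zeros += 1
        gap = 0
        for _ in range(zeros + 1):  # binary part, most significant bit first
            gap = (gap << 1) | self.bits.pop()
        depth = self.top
        self.top -= gap
        return depth

stack = GammaStack()
for depth in [0, 2, 3, 7]:
    stack.push(depth)
bits_used = len(stack.bits)
popped = [stack.pop() for _ in range(4)]
```

A gap g costs 2⌊log2 g⌋+1 bits, and the gaps along one path sum to at most the deepest string depth, which is the intuition behind the O(n)-bit bound stated on the slide.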
Making it all space-efficient [2/5]
• Preliminary S[v] counter values along the right-most path are encoded in the same way as the stack.
• Once a node v is popped, its S[v] value is final, and this value is added to its parent.
• O(n) bits with constant-time updates.
[Figure: node annotated with S[v]=3 and C[v]=1, above the SA and LCP arrays of the running example.]
Making it all space-efficient [3/5]
• Preliminary C[v] counter values along the right-most path are encoded using a dynamic searchable partial sums structure.
• Once a node v is popped, its C[v] value is final, and this value is added to its parent.
• O(n) bits with O(log n) time updates.
[Figure: node annotated with S[v]=3 and C[v]=1, above the SA and LCP arrays of the running example.]
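The two operations that give a searchable partial sums structure its name, sum and search, can be illustrated with a Fenwick tree. This is a generic, fixed-size stand-in chosen for brevity: it is neither succinct nor dynamic in the sense required on the slide, and the class and method names are assumptions.

```python
class FenwickTree:
    """Partial sums over n non-negative values with O(log n) operations."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)  # 1-based internal array

    def add(self, i, delta):
        """Add delta to the value at 1-based index i."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of the values at indices 1..i."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, target):
        """Smallest index i with prefix_sum(i) >= target, or n+1 if none."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos + 1

# Example values, e.g. gaps between consecutive depths on a path.
sums = FenwickTree(4)
for index, value in enumerate([2, 1, 3, 1], start=1):
    sums.add(index, value)
```

With the values 2, 1, 3, 1 the prefix sums are 2, 3, 6, 7, so search(4) returns 3: the third entry is the first one whose prefix sum reaches 4.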
Making it all space-efficient [4/5]
• Table D stores, in lexicographic order of the suffixes, the number of the document each suffix belongs to.
• A predecessor query on D gives the previous occurrence inside the same document.
• An RMQ between the two occurrences gives the string depth at which the C[v] counter should be incremented.
[Figure: node annotated with S[v]=3 and C[v]=1, above the D, SA and LCP arrays of the running example, with example range minimum queries annotated RMQ=0, RMQ=2 and RMQ=1.]
D:    0  1  2  3  0  3  1  1  0  1  0  1  2  3  1  2  0  3  1  2  2  3  2
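The interplay of predecessor queries and range minimum queries can be sketched as a single left-to-right pass. The sketch assumes 0-based indices, answers the RMQ naively with min over a slice, and keeps the predecessors in a small array indexed by document; the function name is hypothetical.

```python
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]

def duplicate_events(D, LCP, num_docs):
    """For every suffix whose document occurred before in suffix array order,
    return (previous position, current position, string depth at which the
    duplicate counter C would be incremented)."""
    pred = [None] * num_docs  # last suffix array position seen per document
    events = []
    for i, doc in enumerate(D):
        p = pred[doc]
        if p is not None:
            # Naive RMQ: the minimum LCP between the two occurrences is the
            # string depth of their lowest common ancestor.
            depth = min(LCP[p + 1:i + 1])
            events.append((p, i, depth))
        pred[doc] = i
    return events

events = duplicate_events(D, LCP, num_docs=4)
```

On the running example the suffixes at positions 7 and 8 (1-based) both belong to document 1 and share the prefix aa, which shows up as the event (6, 7, 2): the counter of the node at string depth 2 is incremented once.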
Making it all space-efficient [5/5]
• Table D does not need to be stored, as predecessors can be updated "on-the-fly" using an array pred[1..d].
• The compressed suffix array supports access in O(log^ε n) time and takes O(n log |Σ|) bits.
• A bit-vector B[1,n] marks the document boundaries in the text, so that rank(B,SA[i])=D[i].
• The LCP and RMQ structures each take 2n(1+o(1)) bits [HS02,FH07].
[Figure: node annotated with S[v]=3 and C[v]=1, above the D, SA and LCP arrays of the running example, with example range minimum queries annotated RMQ=0, RMQ=2 and RMQ=1.]
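The relation rank(B,SA[i])=D[i] can be checked on the running example. The sketch assumes one particular convention, which the slide does not spell out: B has a 1 at the first position of every document except the first, rank counts the 1s up to and including the queried position, and documents are numbered from 0. Rank is answered from a plain prefix-sum table rather than a succinct rank structure.

```python
from itertools import accumulate

T = "aaba#abaaab#bbabb#abba#"
SA = [5, 12, 18, 23, 4, 22, 8, 9, 1, 10, 2, 6, 15, 19, 11, 17, 3, 21, 7, 14, 16, 20, 13]

def document_array(text, sa, sep="#"):
    """Recover D from the boundary bit-vector B and the suffix array."""
    n = len(text)
    B = [0] * n
    for j in range(1, n):
        if text[j - 1] == sep:
            B[j] = 1  # a new document starts right after each separator
    ranks = list(accumulate(B))  # ranks[j] = number of 1s in B[0..j]
    # SA holds 1-based text positions, so position p maps to index p - 1.
    return [ranks[p - 1] for p in sa]

D = document_array(T, SA)
```

Under this convention the result is the D array of the running example, so D itself never has to be stored alongside B and the suffix array.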
Extensions
• This presentation only sketched how to compute the frequency values inside one document collection. In addition,
  • the computation is easy to adjust to report patterns occurring frequently in one document collection and infrequently in the other;
  • the computation gives a space-efficient construction algorithm for Sadakane's scheme of storing the frequency values [Sad07]; and
  • other compressed text indexes can be plugged in to obtain other space/time tradeoffs.
Epilogue
• Thanks to the discussions with Luis Russo after the workshop, we were able to improve the space from O(n log d) to O(d log n).
• The presentation has been changed accordingly.
References
[FHK06] Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining under Frequency Constraints. In Proc. PKDD'06, LNAI 4213, pages 139-150, 2006.
[FH07] Johannes Fischer, Volker Heun: A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
[FMV07] Johannes Fischer, Veli Mäkinen, Niko Välimäki: Space-efficient String Mining under Frequency Constraints. Submitted.
[HS02] Wing-Kai Hon, Kunihiko Sadakane: Space-Economical Algorithms for Finding Maximal Unique Matches. In Proc. CPM 2002, LNCS 2373, pages 144-152, 2002.
[Hui92] Lucas Hui: Color Set Size Problem with Application to String Matching. In Proc. CPM 1992, LNCS 644, pages 230-243, 1992.
[Kasetal01] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. CPM 2001, LNCS 2089, pages 181-192, 2001.
[Sad07] Kunihiko Sadakane: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), pages 12-22, 2007.