ESA2008@Universitat Karlsruhe, Sep 15, 2008. An Online Algorithm for Finding the Longest Previous Factors. Daisuke Okanohara University of Tokyo. Kunihiko Sadakane Kyushu University. Problem: Finding the longest previous factors (matching). Input : A text T[0…n-1]
Problem: Finding the longest previous factors (matching) • Input: A text T[0…n-1] • At each position k, report the longest substring T[k-len+1…k] that also occurs at a previous position (in the history): T[pos…pos+len-1] = T[k-len+1…k] • c.f. LZ77, LZ-factorization • Example reports: (pos, len) = (0, 4), (pos, len) = (5, 2)
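To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the method of the talk. It assumes that the earlier occurrence must end before position k (overlap with the current match is allowed) and reports (-1, 0) when nothing matches; the function name `longest_previous_factors` is made up for this illustration.

```python
def longest_previous_factors(text):
    """Brute force: for each k, the longest suffix of text[0..k] that also
    ends at some earlier position j < k. Returns a list of (pos, len)."""
    result = []
    for k in range(len(text)):
        best_pos, best_len = -1, 0          # (-1, 0) means "no match" (assumed convention)
        for j in range(k):                  # j = end of a candidate earlier occurrence
            length = 0
            # extend the match backwards from j and k simultaneously
            while length <= j and text[j - length] == text[k - length]:
                length += 1
            if length > best_len:
                best_pos, best_len = j - length + 1, length
        result.append((best_pos, best_len))
    return result


if __name__ == "__main__":
    for k, report in enumerate(longest_previous_factors("abaababaa")):
        print(k, report)
```

This takes quadratic time or worse, which is the cost the index-based approaches try to avoid.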
Applications • Data Compression • LZ77, Prediction by Partial Matching • Pattern Analysis • Log analysis • Data Mining
Previous approach • Sequential search on the fly • O(n^2) time for a text of length n • Offline index approach • Read the whole text beforehand, and build an index (suffix array/tree) for it • Search for the match using the index [Chen 07] [Chen 08] [Crochemore 08] [Kolpakov 01] [Larsson 99] • 6n bytes and O(n log n) time [Chen 08]: suffix arrays with range minimum queries
New Problem: Online finding the longest previous factors • Report match information just after reading each character • A case where we don’t know the length of data beforehand, e.g. streaming data • Previous approaches cannot deal with this problem
Our approach for the new problem • Online construction of enhanced prefix arrays • Update the index just after reading each character • Although many methods used in LZ77 cannot report the longest match, our method can • Succinct data structures • Keep all information very compactly, using about the same space as the original text
Prefix arrays • We keep prefix arrays (PA), NOT suffix arrays (SA) • because when a character is appended to the end of a text, the SA may undergo Ω(n) changes, but the PA does not • In a PA, prefixes are sorted in reverse-lexicographic order • Example: T = aaaa, Tnew = aaaaz • SA for T: 0 $, 1 a$, 2 aa$, 3 aaa$, 4 aaaa$ • SA for Tnew: 0 $, 4 aaaaz$, 3 aaaz$, 2 aaz$, 1 az$, 5 z$ • PA for T: 0 $, 1 $a, 2 $aa, 3 $aaa, 4 $aaaa • PA for Tnew: 0 $, 1 $a, 2 $aa, 3 $aaa, 4 $aaaa, 5 $aaaaz
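The aaaa / aaaaz example can be reproduced with a few lines of Python. This is only an illustrative sketch using naive sorting. The helper names `suffix_order` and `prefix_order` are invented here, and suffixes are identified by their start position and prefixes by their end position, which need not match the numbering on the slide.

```python
def suffix_order(text):
    """Start positions of the suffixes of text + '$' in lexicographic order."""
    t = text + "$"
    return sorted(range(len(t)), key=lambda i: t[i:])


def prefix_order(text):
    """End positions of the prefixes of '$' + text, sorted by comparing the
    prefixes from their last character backwards (reverse-lexicographic)."""
    t = "$" + text
    return sorted(range(len(t)), key=lambda i: t[i::-1])


if __name__ == "__main__":
    print("SA old:", suffix_order("aaaa"))
    print("SA new:", suffix_order("aaaaz"))   # relative order of old suffixes flips
    print("PA old:", prefix_order("aaaa"))
    print("PA new:", prefix_order("aaaaz"))   # old order kept, one entry inserted
```

Appending a character changes every suffix but leaves every existing prefix untouched, so the prefix order only gains one new entry.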
Our idea • Weiner’s suffix tree construction algorithm • Inserts the suffixes from the shortest ones • Modify it to insert prefixes from the shortest ones • A similar idea is used for the incremental construction of compressed suffix arrays [Chan et al. 2007], [Lippert 2005] • We extend this work to the succinct version • Our algorithm reports matching information as a by-product of construction • It does not require a tree representation; we just use array information
Preliminary: Dynamic Rank/Select Dictionary (DRSD) • For a text T[0…n-1], a DRSD supports: • rank(T, c, i): return the number of occurrences of c in T[0…i] • select(T, c, i): return the position of the i-th c in T • insert(T, c, i): insert c at T[i] • delete(T, i): delete T[i] • These operations can be supported in O((1 + log σ / log log n) log n) time (O(log n) time if σ < log n), using n log σ + o(n log σ) bits of space, where σ is the alphabet size [Lee et al. 07]
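The four operations can be sketched on top of a plain Python list. This only illustrates the interface: every operation here costs O(n), not the bounds quoted above. The class name `NaiveRankSelect` is hypothetical, and the sketch assumes that rank is inclusive of position i and that select counts occurrences from 1.

```python
class NaiveRankSelect:
    """List-backed stand-in for a dynamic rank/select dictionary (O(n) per call)."""

    def __init__(self, symbols=""):
        self.data = list(symbols)

    def rank(self, c, i):
        """Number of occurrences of c in data[0..i] (inclusive); 0 if i < 0."""
        return self.data[: i + 1].count(c)

    def select(self, c, i):
        """Position of the i-th occurrence of c, counting from 1 (assumed convention)."""
        seen = 0
        for pos, sym in enumerate(self.data):
            if sym == c:
                seen += 1
                if seen == i:
                    return pos
        raise ValueError("fewer than i occurrences of c")

    def insert(self, c, i):
        """Insert c so that it becomes data[i]."""
        self.data.insert(i, c)

    def delete(self, i):
        """Remove data[i]."""
        del self.data[i]


if __name__ == "__main__":
    d = NaiveRankSelect("abba")
    print(d.rank("a", 3), d.select("b", 2))
    d.insert("$", 1)
    print("".join(d.data))
```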
Preliminary: Range Minimum Query (RMQ) • Given an array E[0…n-1] of elements from a totally ordered set, rmq(E, l, r) returns the index of the smallest element in E[l…r] • i.e. rmq(E, l, r) = argmin_{k∈[l, r]} E[k] • returns the leftmost such element in case of a tie • In the static case, RMQ can be supported in O(1) time using 2n+o(n) bits of space [Fischer, 2007] • In the dynamic case, RMQ/insert/delete can be supported in O(T log n) time using O(n) bits if the lookup cost of E[i] is O(T)
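As a reference for the definition only, a linear-scan rmq can be written in one line of Python. It has none of the succinct or dynamic properties mentioned above; it just fixes the semantics (inclusive range, leftmost index on ties).

```python
def rmq(E, l, r):
    """Index of the smallest element in E[l..r] (inclusive), leftmost on ties.
    Python's min() returns the first minimal item, which gives the tie rule."""
    return min(range(l, r + 1), key=lambda k: E[k])


if __name__ == "__main__":
    E = [3, 1, 2, 1]
    print(rmq(E, 0, 3), rmq(E, 2, 3))
```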
Data structures • Keep the following data structures for T[0…k] • Assume T[0]=$, where $ is the unique smallest character • B[0…k]: (Prefix-) BW-transformed text • B[i] = T[PA[i]+1], or B[i] = $ if PA[i]=k • H[0…k]: Height array • will be explained in the next slides • C[0…σ-1]: Cumulative array • C[c] = the total number of characters c’ in T s.t. c’ < c • s: The position where the next prefix will be inserted
T = $abaababa • PA stores the end position of each prefix (we will omit this) • Prefix stores the prefixes sorted in reverse-lexicographic order (neither PA nor Prefix is stored explicitly) • We can examine PA[i] with the SAlookup operation in O(log^2 n) time, as in the FM-index [Ferragina 00]
B stores the next character for each prefix (the Burrows-Wheeler transform for prefix arrays) • T = $abaababa
H stores the length of the longest common suffix between adjacent prefixes T = $abaababa
T = $abaababa • s = 4 • s denotes the position of $ in B, which is where the longest prefix is placed.
T = $abaababa • C[c] = the number of characters c’ in T (= B) that are smaller than c • C[$]=0, C[a]=1, C[b]=6
T = $abaababaa • The next character 'a' arrives!
T = $abaababaa • Replace $ in B[s] with a (because $ marks the position of the longest prefix)
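The arrays for T = $abaababa can be recomputed naively to check the definitions. The sketch below is an assumption-laden illustration: it stores PA explicitly (the talk does not), takes H[i] to be the longest common suffix of the i-th and (i+1)-th sorted prefixes with a trailing 0, and uses 0-based indices computed by the code itself, which may not coincide with the numbering in the original figures. The name `build_index` is invented.

```python
def build_index(T):
    """Naively build PA, B, H, C and s for a text T that starts with '$'."""
    n = len(T)
    # PA: end positions of the prefixes in reverse-lexicographic order
    PA = sorted(range(n), key=lambda i: T[i::-1])
    # B: the character following each prefix, '$' for the longest prefix
    B = ["$" if e == n - 1 else T[e + 1] for e in PA]

    def common_suffix(i, j):
        """Length of the longest common suffix of T[0..i] and T[0..j], i != j."""
        length = 0
        while length <= min(i, j) and T[i - length] == T[j - length]:
            length += 1
        return length

    # H[r]: common suffix length of neighbours r and r+1 (assumed convention)
    H = [common_suffix(PA[r], PA[r + 1]) for r in range(n - 1)] + [0]
    # C[c]: number of characters in T smaller than c
    C = {c: sum(1 for x in T if x < c) for c in sorted(set(T))}
    s = PA.index(n - 1)  # row of the longest prefix, i.e. where B holds '$'
    return PA, B, H, C, s


if __name__ == "__main__":
    PA, B, H, C, s = build_index("$abaababa")
    for row in range(len(PA)):
        print(row, PA[row], B[row], H[row])
    print(C)
```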
T = $abaababaa Find the position for the new prefix $abaababaa Count the number of a in B[0…s-1] = rank(B, a, s-1) = 2
T = $abaababaa • Insert $abaababaa at the 3rd position: C[a] + rank(B, a, s-1) = 3 • s := C[a] + rank(B, a, s-1), insert(B, $, s)
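The replace-then-insert step on B can be sketched as a small Python function. It is a naive illustration that recomputes C[c] by scanning B (valid because B is a permutation of T) and uses list operations instead of a dynamic rank/select dictionary. The function name `append_char` and the starting row s = 5 come from this sketch's own 0-based layout of B, not from the slides.

```python
def append_char(B, s, c):
    """Update B in place for the new character c and return the new s.
    B is a list of characters with B[s] == '$'; c must be larger than '$'."""
    smaller = sum(1 for x in B if x < c)   # C[c]; the '$' entry is counted here
    B[s] = c                               # the longest prefix is now followed by c
    new_s = smaller + B[:s].count(c)       # C[c] + rank(B, c, s-1)
    B.insert(new_s, "$")                   # the new longest prefix has no successor yet
    return new_s


if __name__ == "__main__":
    B = list("abbab$aaa")                  # B for T = $abaababa in this sketch's layout
    s = append_char(B, 5, "a")
    print(s, "".join(B))
```

Computing `smaller` before overwriting B[s] matters: afterwards the '$' entry would no longer be counted.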
T = $abaababaa Update H This is actually the length of the longest match in the history
T = $abaababa • Recall that the neighbors of the new prefix, $abaa and $aba, extend prefixes of the previous step whose B is 'a' • These positions can be found by using rank and select • c.f. succ(T, c, s) = select(T, c, rank(T, c, s))
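The predecessor and successor lookups can be written out with naive rank and select. The exact offsets depend on conventions the slide does not spell out; this sketch assumes rank is inclusive of position i and select counts occurrences from 1, which puts a +1 in succ. The function names and the example row s = 5 are choices of this illustration.

```python
def rank(T, c, i):
    """Number of occurrences of c in T[0..i] (inclusive); 0 if i < 0."""
    return T[: i + 1].count(c)


def select(T, c, i):
    """Position of the i-th occurrence of c (1-based), or None if there is none."""
    seen = 0
    for pos, sym in enumerate(T):
        if sym == c:
            seen += 1
            if seen == i:
                return pos
    return None


def pred(T, c, s):
    """Largest position p < s with T[p] == c, or None."""
    r = rank(T, c, s - 1)
    return select(T, c, r) if r > 0 else None


def succ(T, c, s):
    """Smallest position q > s with T[q] == c, or None."""
    return select(T, c, rank(T, c, s) + 1)


if __name__ == "__main__":
    B = "abbab$aaa"   # B for T = $abaababa in this sketch's 0-based layout
    print(pred(B, "a", 5), succ(B, "a", 5))
```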
T = $abaababa • rmq(H, 4, 6) = 5, H[5] = 0 • Therefore H[rmq(H, 4, 6)] + 1 = 1 is the new value for the next H entry
T = $abaababa • rmq(H, 3, 3) = 3, H[3] = 3 • Therefore H[rmq(H, 3, 3)] + 1 = 4 is the new value for the next H entry
T = $abaababaa • H[rmq(H, 3, 3)] + 1 = 4 • H[rmq(H, 4, 6)] + 1 = 1
T = $abaababaa • Report max(4, 1) = 4 as the length of the longest factor, and report the position of $abaa as SAlookup[2] - len = 0 • Report (pos=0, len=4) as the maximum match
Overall algorithm All operations are rank, select, RMQ
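Putting the steps together, the whole update can be sketched with plain Python lists. This is a naive stand-in, not the succinct implementation of the talk: rank, select and rmq are replaced by linear scans, PA is stored explicitly instead of being recovered by SAlookup, and H[i] is assumed to hold the common suffix length of rows i and i+1. The class name `OnlineLPF` and the (-1, 0) "no match" report are invented for this illustration.

```python
class OnlineLPF:
    """Naive online longest-previous-factor index over T = '$' + text."""

    def __init__(self):
        self.B = ["$"]            # next character of each sorted prefix
        self.H = [0]              # common suffix length of rows i and i+1
        self.PA = [0]             # end position of each sorted prefix (kept explicitly)
        self.counts = {"$": 1}    # character frequencies, used to derive C[c]
        self.s = 0                # row of the longest prefix

    def append(self, c):
        """Read one character c (> '$') and return (pos, len) in text coordinates."""
        B, H, PA, s = self.B, self.H, self.PA, self.s
        B[s] = c
        # nearest rows around s whose B is c (rank/select in the real structure)
        pred = max((p for p in range(s) if B[p] == c), default=None)
        succ = min((q for q in range(s + 1, len(B)) if B[q] == c), default=None)
        # range minima over H give the common suffix with row s; +1 for c itself
        left = min(H[pred:s]) + 1 if pred is not None else 0
        right = min(H[s:succ]) + 1 if succ is not None else 0
        # new row: C[c] + rank(B, c, s-1)
        smaller = sum(n for ch, n in self.counts.items() if ch < c)
        new_s = smaller + B[:s].count(c)
        B.insert(new_s, "$")
        H.insert(new_s, right)
        H[new_s - 1] = left
        PA.insert(new_s, len(PA))          # the new prefix ends at the new last index
        self.counts[c] = self.counts.get(c, 0) + 1
        self.s = new_s
        if left == 0 and right == 0:
            return (-1, 0)
        if left >= right:
            length, end = left, PA[new_s - 1]
        else:
            length, end = right, PA[new_s + 1]
        return (end - length, length)      # '$' occupies T[0], so text pos = end - len


if __name__ == "__main__":
    index = OnlineLPF()
    for ch in "abaababaa":
        print(ch, index.append(ch))
```

Because the longest common suffix with any earlier prefix is attained at one of the two neighbors in sorted order, taking the maximum of the two values is enough.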
Overall Analysis • H is stored in 2n bits [Sadakane, SODA 02] • a naïve representation requires O(n log n) bits • requires one SAlookup operation to decode • B is stored in n log σ + o(n log σ) bits • by using a dynamic rank/select dictionary • The bottleneck of our algorithm is rmq(H, l, r), which requires O(log^3 n) time • SAlookup requires O(log^2 n) time
Overall Analysis (cont.) • We can solve the online longest previous factor problem in O(log^3 n) time per character, using n log_2 σ + o(n log σ) + O(n) bits of space • where σ is the alphabet size and n is the length of the text
Simulating window buffer • If the working space is limited, we often discard the history starting from the oldest characters • We can simulate this by using almost the same operations as in the insertion operation • We do not actually discard a character but ignore it • If we actually discarded the oldest character, it might cause Ω(n) changes in B and H • The effect of a discarded character remains (prefixes are still sorted according to the discarded characters) • But this does not cause a problem if we only report the matching information up to the history size
Experiments • In the experiments, we used a simpler data structure (the algorithm is the same) • B and H are stored in a balanced binary tree • Each leaf stores a small block of B and H • We call this implementation OS • Compare OS with other offline algorithms • These require reading the whole text beforehand • CPSa, CPSd: SA+LCP with stack [Chen et al. 07] • CPS6n: SA with RMQ [Chen et al. 08] • kk-lz: mreps, specialized for σ=4 [Kolpakov 01]
Peak memory usage in bytes per input symbol • The space usage of OS is the smallest on many real data sets, especially when the values in H are small
Runtime in milliseconds for searching the longest previous factors • OS is about 2 to 10 times slower than the fastest ones because of the dynamic operations
Conclusion • We solve the online longest matching problem by using enhanced prefix arrays • Simple and easy to implement • Requires about 3 to 6 times the space of the input text • This is actually a by-product of the construction of compressed suffix trees, c.f. Weiner’s algorithm • Simple, with much room for improvement • by using better rank/select/rmq implementations
Future work • Construction of compressed suffix trees • Update the parenthesis tree efficiently • Actually, the time complexity for this is smaller • Practical improvements • Currently, dynamic succinct data structures are not efficient due to cache misses and memory fragmentation • Approximate version of the longest matching problem; enough for many applications • Thank you for your attention!
Weiner’s suffix tree construction algorithm • [Figure: step-by-step construction over the prefixes of $abracadabr]