Lecture on Information Knowledge Network "Information retrieval and pattern matching"

Laboratory of Information Knowledge Network, Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
Takuya KIDA

The 6th: Pattern matching on compressed text
• About data compression
• Motivation and aim of this study
• Pattern matching on Huffman encoded text
• Pattern matching on LZW compressed text
• Unified framework: Collage system
• Aspect of speeding-up of pattern matching by text compression: BPE compression

About data compression
• Lossless compression
  • Non-universal encoding
    • Entropy encoding: Huffman encoding, Arithmetic encoding
    • Run-length
  • Universal encoding
    • Statistical: PPM
    • Grammar-based: Sequitur
    • Dictionary-based: LZ77, LZ78, LZW, BPE
    • Sort-based: BWT
• Lossy compression (used for image and voice data): JPEG, MPEG, MP3
※ Reference: Managing Gigabytes: Compressing and Indexing Documents and Images, I. H. Witten, A. Moffat, T. C. Bell, Morgan Kaufmann Pub, 1999.

Aim of this study
• Ordinary approach: compressed text → decompress → original text → ordinary pattern matching machine
• Our aim: compressed text → pattern matching machine for compressed texts

Example of application
• We want to pack as much data as possible into a small computer such as a mobile phone or PDA!
• Because memory is small, constructing an extra index structure is not a good solution!
• However, we want to retrieve at high speed!
• Example data: e-books, personal databases, short memos, e-mails, business cards, directories, KOJIEN, schedule tables, E2J/J2E dictionaries
※ The photos show the Sharp mi110 and the Toshiba V601T.

Today, when hard disks and memory have grown large enough, people who use a computer personally rarely compress text data just to save space. Windows provides a function to compress each folder, but I have never used it. Compressing multimedia data such as images and voice is natural, but compressing text data may seem to bring only drawbacks and no advantage. However, compression is a good policy when you want to keep, for instance, a large amount of log files or past mail data without deleting them.

Difficulty of PM on compressed texts
Document files → compressed document files, for example:
011110000111100111111101011010001010101001111010001011100110101111011000111011111101001101011111001101001110011011000001111110101101011111111100000101001001010011010
• The starting position of each codeword is invisible
• Representation of each string is not unique
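The first difficulty (codeword boundaries are invisible in the bit stream) can be illustrated with a minimal sketch. The prefix code below (a=0, b=10, c=11) and the helper names are hypothetical, chosen only for illustration; they do not come from the lecture. A naive bit-level search reports a match that begins in the middle of a codeword.

```python
# Hypothetical prefix code chosen for illustration only (not from the lecture).
CODE = {"a": "0", "b": "10", "c": "11"}

def encode(text):
    """Concatenate the codeword of every character."""
    return "".join(CODE[ch] for ch in text)

def naive_bit_search(text, pattern):
    """Bit offsets where the encoded pattern occurs in the encoded text."""
    bits, pat = encode(text), encode(pattern)
    return [i for i in range(len(bits) - len(pat) + 1) if bits.startswith(pat, i)]

def codeword_starts(text):
    """Bit offsets at which a codeword really begins."""
    starts, pos = [], 0
    for ch in text:
        starts.append(pos)
        pos += len(CODE[ch])
    return starts

# "ca" encodes to 110 and "b" to 10: the bits match, but the text contains no "b".
hits = naive_bit_search("ca", "b")
false_hits = [i for i in hits if i not in codeword_starts("ca")]
```

Here every hit is a false one, because it starts inside the codeword of "c". A matcher for encoded text therefore has to track codeword boundaries in some way.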

Our goal
Goal: Do pattern matching faster than the above!
• Search-on-the-fly method
• Decompress-then-search method
• Search-without-decompress method
※ The illustration above is by Prof. Masayuki Takeda.

Lempel-Ziv-Welch (LZW) compression
T. A. Welch: A technique for high performance data compression, IEEE Comput., Vol.17, pp.8-19, 1984.
Text T: a b ab ab ba b c aba bc abab
Compressed text E(T): 1 2 4 4 5 2 3 6 9 11
[Figure: dictionary trie with root 0 and nodes 1 to 12, edges labeled a, b, c]
• Let D be the set of strings entered in the dictionary trie: D = {a, b, c, ab, ba, bc, ca, aba, abb, bab, bca, abab}
• D is constructed adaptively
• |D| = O(compressed text length)
※ LZW is used in the UNIX compress command, the GIF image format, and so on.
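The LZW example can be reproduced with a minimal sketch. The function name `lzw_compress` and the use of a Python dict in place of the dictionary trie are assumptions made for illustration; codes start at 1 to agree with the slide.

```python
def lzw_compress(text, alphabet):
    """Return (codes, dictionary); the dict plays the role of the dictionary trie."""
    dictionary = {ch: i + 1 for i, ch in enumerate(alphabet)}  # a=1, b=2, c=3
    codes = []
    w = ""
    for ch in text:
        if w + ch in dictionary:
            w += ch                                   # keep extending the current match
        else:
            codes.append(dictionary[w])               # emit the longest match
            dictionary[w + ch] = len(dictionary) + 1  # enter a new string into D
            w = ch
    if w:
        codes.append(dictionary[w])                   # flush the last match
    return codes, dictionary

# The text of the slide: a b ab ab ba b c aba bc abab
codes, dictionary = lzw_compress("abababbabcababcabab", "abc")
```

Each emitted code adds at most one string to D, which is why |D| grows linearly with the compressed text length.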

Move of Aho-Corasick PM machine
AC machine for pattern set Π = {aba, ababb, abca, bb}
[Figure: AC machine with states 0 to 9, showing the goto function, the failure function, and the outputs {aba}, {ababb, bb}, {abca}, {bb}]
Current state: 0
Text: abababba
Output: aba, aba, ababb, bb
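The move of the AC machine on this example can be traced with a minimal sketch. The table layout (lists of dicts and sets) and the function names are assumptions made for illustration, not the lecture's implementation.

```python
from collections import deque

def build_ac(patterns):
    """Build the goto, failure and output tables of an Aho-Corasick machine."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({}); fail.append(0); out.append(set())
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        out[q].add(p)
    queue = deque(goto[0].values())          # depth-1 states fail to the root
    while queue:
        q = queue.popleft()
        for ch, nxt in goto[q].items():
            f = fail[q]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]       # inherit outputs along the failure link
            queue.append(nxt)
    return goto, fail, out

def ac_search(patterns, text):
    """Return (end position, pattern) pairs, with 1-based end positions."""
    goto, fail, out = build_ac(patterns)
    q, found = 0, []
    for pos, ch in enumerate(text, start=1):
        while q and ch not in goto[q]:
            q = fail[q]
        q = goto[q].get(ch, 0)
        found.extend((pos, p) for p in sorted(out[q]))
    return found

occurrences = ac_search(["aba", "ababb", "abca", "bb"], "abababba")
```

For this input, `occurrences` lists aba ending at positions 3 and 5, and both ababb and bb ending at position 7.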

Idea for doing pattern matching on LZW texts
T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text, Proc. Data Compression Conference, pp. 103-112, IEEE Computer Society, Mar. 1998.
• To simulate the move of the AC machine on LZW compressed texts
[Figure: the same AC machine for Π = {aba, ababb, abca, bb}, with the goto function, the failure function, and the outputs]
Current state: 0
Text: abababba
Comp. text: 1 2 4 4 5
Output: aba, aba, ababb, bb

Core functions: Jump & Output
Can we compute the two functions Jump and Output well?
• function Jump(q, u):
  • simulates the consecutive transitions caused by string u in O(1) time.
  • The domain is Q×D.
  • returns the state number of the AC machine.
• function Output(q, u):
  • reports the occurrences within the string obtained by concatenating the string corresponding to state q and string u, in O(r) time.
  • The domain is Q×D.
  • returns the set of pattern IDs.
Each function needs O(m|D|) space in a naïve way, but it can be realized in O(m^2+|D|) space!

function Jump
Let δ(q, u) be the (extended) state transition function※1 of the AC machine.

$$\mathrm{Jump}(q,u) = \begin{cases} \delta(q,u) & \text{if } u \text{ is a factor of some pattern} \quad (O(m^3) \text{ space}),\\ \delta(\varepsilon,u) & \text{otherwise} \quad (O(|D|) \text{ space}). \end{cases}$$

The first case can be realized in O(m^2) space※2:

$$\mathrm{Jump}(q,u) = \begin{cases} \mathrm{Ancestor}(N'_1(q,u'),\ |u'|-|u|) & \text{if } u \text{ is a factor of some pattern} \quad (O(m^2) \text{ space}),\\ \delta(\varepsilon,u) & \text{otherwise} \quad (O(|D|) \text{ space}). \end{cases}$$

※1 δ(q, u) returns the state reached after making transitions from state q by string u.
※2 u' is the string corresponding to the nearest ancestor node of u that is also explicit on the generalized suffix trie for P.

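function Output
$\tilde{u}$: the longest prefix of u that is also a suffix of a pattern.

$$A(u) = \{\, \langle i,p\rangle \mid p\in\Pi,\ |\tilde{u}| < i \le |u|,\ |p| \le i,\ \text{and } u[i-|p|+1..i] = p \,\}$$

$$\mathrm{Output}(q,u) = \mathrm{Output}(q,\tilde{u}) \cup A(u)$$

• Output(q, ũ) takes O(m^2) space and A(u) takes O(|D|) space.
• Note that state q corresponds to a prefix of some pattern.
[Figure: the string for state q followed by u, with occurrences p1 and p2 marked; occurrences ending inside ũ may start in q, and the remaining ones lie entirely inside u]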

Pseudo code of Kida, et al. [1998]'s algorithm
PMonLZW(E(T) = u_1 u_2 … u_n, Π: pattern set)
  Construct the AC machine and the generalized suffix trie for Π;
  Initialize the dictionary trie for E(T);
  Preprocess Jump(q, u) and Output(q, u) for any state q and any u that is a factor of a pattern π∈Π;
  l ← 0;
  q ← q_0;
  for i ← 1…n do
    for each 〈d, π〉∈Output(q, u_i) do
      report pattern π occurs at position l+d;
    end of for
    q ← Jump(q, u_i);
    l ← l + |u_i|;
    Update the dictionary trie; /* enter the string for node u_{i+1} into D */
    Update variables for Jump(q, u_{i+1}) and Output(q, u_{i+1}); /* compute δ(ε, u_{i+1}), A(u_{i+1}), u_{i+1}', and |u_{i+1}| by using its parent info. */
  end of for
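The main loop above can be sketched in runnable form. This is only the naïve variant: it stores every dictionary string explicitly and memoizes Jump and Output per (state, code) pair, so it corresponds to the O(m|D|)-space table rather than the O(m^2+|D|) construction. The names `pm_on_lzw` and `jump_output` are assumptions for illustration.

```python
from collections import deque
from functools import lru_cache

def build_ac(patterns):
    """Goto, failure and output tables of an Aho-Corasick machine."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        q = 0
        for ch in p:
            if ch not in goto[q]:
                goto.append({}); fail.append(0); out.append(set())
                goto[q][ch] = len(goto) - 1
            q = goto[q][ch]
        out[q].add(p)
    queue = deque(goto[0].values())
    while queue:
        q = queue.popleft()
        for ch, nxt in goto[q].items():
            f = fail[q]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def pm_on_lzw(codes, alphabet, patterns):
    """Report (end position, pattern) pairs while reading LZW codes, one jump per code."""
    goto, fail, out = build_ac(patterns)
    strings = {i + 1: ch for i, ch in enumerate(alphabet)}  # code -> string in D

    @lru_cache(maxsize=None)
    def jump_output(q, code):
        """Naive Jump and Output for (q, u): run the AC machine over u once, then reuse."""
        hits = []
        for d, ch in enumerate(strings[code], start=1):
            while q and ch not in goto[q]:
                q = fail[q]
            q = goto[q].get(ch, 0)
            hits.extend((d, p) for p in sorted(out[q]))
        return q, tuple(hits)

    q, l, prev, found = 0, 0, None, []
    for code in codes:
        if prev is not None:
            # Update the dictionary: previous string plus first character of the current one.
            first = strings[code][0] if code in strings else strings[prev][0]
            strings[len(strings) + 1] = strings[prev] + first
        q_next, hits = jump_output(q, code)
        found.extend((l + d, p) for d, p in hits)
        q, l, prev = q_next, l + len(strings[code]), code
    return found

found = pm_on_lzw([1, 2, 4, 4, 5], "abc", ["aba", "ababb", "abca", "bb"])
```

On the compressed text 1 2 4 4 5 (that is, abababba) this reports the same occurrences as the AC machine run on the original text. The algorithm of the lecture obtains the same result without storing the strings of D, using only parent information in the dictionary trie.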

The result of Kida, et al. [1998]
• The original idea is from A. Amir, G. Benson, and M. Farach: Let sleeping files lie: Pattern matching in Z-compressed files, J. Computer and System Sciences, Vol.52, pp.299-307, 1996.
  • It simulates KMP on LZW compressed texts.
• By simulating the Aho-Corasick (AC) pattern matching machine, we can do multiple pattern matching.
• It takes O(m^2+|D|) time and space for preprocessing.
• It scans compressed texts in O(n+r) time with O(m^2+|D|) space for multiple patterns, and reports all the occurrences.
※ This first appeared in "T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple pattern matching in LZW compressed text, Proc. Data Compression Conference, pp. 103-112, IEEE Computer Society, Mar. 1998." The journal edition appeared in "T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa: Multiple Pattern Matching in LZW Compressed Text, Journal of Discrete Algorithms, 1(1), pp. 133-158, Hermes Science Publishing, Dec. 2000."

Idea for applying bit-parallel technique
T. Kida, M. Takeda, A. Shinohara, and S. Arikawa: Shift-And approach to pattern matching in LZW compressed text, Proc. CPM'99, LNCS1645, pp. 1-13, Springer-Verlag, Jul. 1999.
Pattern P := aabac
Text T := aabaacaabacab
[Figure: mask bits of P for the characters a, b, c, and the Shift-And state bit vectors along T; the state is updated by a jump ("Jump!") over several characters at once]

Extended state updating function f'
For any a∈Σ, u∈Σ*, and S⊆{1,…,m}, we define the following, where S ⊕ k denotes {i+k | i∈S and i+k ≤ m}:
• $M(a) = \{\, 1 \le i \le m \mid P[i] = a \,\}$
• $f(S, a) = ((S \oplus 1) \cup \{1\}) \cap M(a)$
• $f'(S, \varepsilon) = S$ and $f'(S, ua) = f(f'(S, u), a)$
• $M'(u) = f'(\{1,\dots,m\}, u)$ (O(|D|) time and space)
Then, for any u∈Σ* and S⊆{1,…,m},
• $f'(S, u) = ((S \oplus |u|) \cup \{1,\dots,|u|\}) \cap M'(u)$ (O(1) time)
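The closed form for f' can be checked with a minimal sketch that stores each state set as an integer bitset (bit i-1 stands for state i). The function names are assumptions for illustration, and M'(u) is recomputed on every call here, whereas the algorithm keeps it per dictionary string.

```python
def make_masks(pattern):
    """M(a) as a bitset: bit i-1 is set iff P[i] = a (1-based i)."""
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    return masks

def f(S, a, masks, m):
    """One Shift-And step: ((S shifted by 1) united with {1}) intersected with M(a)."""
    full = (1 << m) - 1
    return ((S << 1) | 1) & full & masks.get(a, 0)

def f_ext(S, u, masks, m):
    """f'(S, u) by definition: apply f character by character."""
    for a in u:
        S = f(S, a, masks, m)
    return S

def jump(S, u, masks, m):
    """f'(S, u) by the closed form, using M'(u) = f'({1..m}, u)."""
    full = (1 << m) - 1
    m_prime = f_ext(full, u, masks, m)   # would be precomputed per dictionary string
    k = len(u)
    return ((S << k) | ((1 << k) - 1)) & full & m_prime

P, T = "aabac", "aabaacaabacab"
m, masks = len(P), make_masks(P)

# Plain Shift-And over T: an occurrence ends wherever state m is active.
state, ends = 0, []
for pos, ch in enumerate(T, start=1):
    state = f(state, ch, masks, m)
    if (state >> (m - 1)) & 1:
        ends.append(pos)
```

With M'(u) available, `jump` costs a constant number of word operations regardless of |u|, which is the source of the speed-up when m fits in a machine word.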

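function Output (bit-parallel type)
Definition:
• $\mathrm{Output}(S, u) = \{\, 1 \le i \le |u| \mid m \in f'(S, u[1..i]) \,\}$
• $U(u) = \{\, 1 \le i \le |u| \mid i < m \text{ and } u[1..i] = P[m-i+1..m] \,\}$ (O(|D|) time and space)
• $A(u) = \{\, 1 \le i \le |u| \mid m \le i \text{ and } u[i-m+1..i] = P \,\}$ (O(|D|) time and space)
Then $\mathrm{Output}(S, u) = ((m \ominus S) \cap U(u)) \cup A(u)$, where $m \ominus S = \{\, m - i \mid i \in S \,\}$.
[Figure: the string for state q followed by u; occurrences of P that start before u are found by (m ⊖ S)∩U(u), and occurrences inside u by A(u)]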
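The identity Output(S, u) = ((m ⊖ S) ∩ U(u)) ∪ A(u) can be checked by brute force. The sketch below uses plain Python sets instead of bit vectors, so it shows correctness rather than speed; the function names are assumptions for illustration.

```python
def output_direct(S, u, P):
    """Positions i in u at which state m is active after reading u[1..i] from state set S."""
    m, hits = len(P), set()
    for i, a in enumerate(u, start=1):
        # Shift-And step on sets: extend every active prefix (and the empty one) by a.
        S = {j + 1 for j in set(S) | {0} if j < m and P[j] == a}
        if m in S:
            hits.add(i)
    return hits

def output_formula(S, u, P):
    """The same set computed as ((m - S) intersected with U(u)) united with A(u)."""
    m = len(P)
    U = {i for i in range(1, len(u) + 1) if i < m and u[:i] == P[m - i:]}
    A = {i for i in range(1, len(u) + 1) if m <= i and u[i - m:i] == P}
    shifted = {m - j for j in S}
    return (shifted & U) | A

# State 2 (prefix "aa" matched) followed by u = "bac" completes P = "aabac" at position 3.
example = output_formula({2}, "bac", "aabac")
```

U(u) and A(u) depend only on u, so they can be stored once per dictionary string; only the intersection with m ⊖ S depends on the current state.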

The result of Kida, et al. [1999]
• We applied the bit-parallel technique based on the Shift-And method to the processing of functions Jump and Output to speed them up.
• It uses O(m+|Σ|) time and space for preprocessing.
• For a given pattern, it scans a given compressed text in O(n+r) time and O(m+|D|) space, and it reports all the occurrences.
• Like the Shift-And method, it is highly extensible:
  • pattern matching for a generalized pattern
  • pattern matching allowing k mismatches
  • multiple pattern matching

Achievement of our aim!
[Figure: CPU time (sec., 0 to 1.4) versus pattern length (5 to 30) on Genbank (DNA base sequence), 17.1 Mbyte; AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F. Search-on-the-fly methods: compress(LZW)+KMP and gunzip(LZ77)+KMP. Search-without-decompress methods: T. Kida, et al. [1998] and speeding-up by bit-parallelism [1999].]

Take a breath
[Photo: RG 1/1 Gundam at Higashi-Shizuoka Park, 2010.12.24]

A new goal! (Goal 2)
• Why do you need compressed PM? We have enough storage space now.
• Why do you compress small data like text documents?
If (the time for doing pattern matching on the original text) > (the time for doing compressed pattern matching), the answer changes from NO to YES!

A new goal! (Goal 2)
[Figure: CPU time (sec., 0 to 1.4) versus pattern length (5 to 30) on Genbank (DNA base sequence), 17.1 Mbyte; AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F. Curves: matching by KMP on the original text; search-on-the-fly methods compress(LZW)+KMP and gunzip(LZ77)+KMP; search-without-decompress methods T. Kida, et al. [1998] and speeding-up by bit-parallelism [1999]. Annotation: "Overwhelmingly faster!"]

Byte Pair Encoding (BPE) method
Text (18 characters): ABABCDEBDEFABDEABC
• G → AB: GGCDEBDEFGDEGC
• H → DE: GGCHBHFGHGC
• I → GC: GIHBHFGHI
After substitutions: 9 characters
Dictionary: G → AB, H → DE, I → GC (size: 256 = 1 byte)
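The substitutions of this example can be reproduced with a minimal greedy sketch. The names `bpe_compress` and `bpe_expand` are assumptions for illustration; the sketch counts overlapping pairs and breaks ties arbitrarily, which does not matter for this input.

```python
from collections import Counter

def bpe_compress(text, new_symbols):
    """Repeatedly replace the most frequent adjacent pair by the next unused symbol."""
    rules = {}
    for sym in new_symbols:
        pairs = Counter(zip(text, text[1:]))
        if not pairs:
            break
        (x, y), count = pairs.most_common(1)[0]
        if count < 2:
            break                       # no pair repeats, so substitution cannot help
        rules[sym] = x + y
        text = text.replace(x + y, sym)
    return text, rules

def bpe_expand(text, rules):
    """Undo the substitutions in reverse order of their creation."""
    for sym in reversed(list(rules)):
        text = text.replace(sym, rules[sym])
    return text

compressed, rules = bpe_compress("ABABCDEBDEFABDEABC", "GHI")
restored = bpe_expand(compressed, rules)
```

Because every output symbol is a fixed-size code standing for a fixed string, codeword boundaries stay visible in the compressed text, unlike in a bit-oriented code.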

Achievement of Goal 2
[Figure: CPU time (sec., 0.0 to 0.8) versus pattern length (5 to 30) on Medline (English text), 60.3 Mbyte; AlphaStation XP1000 (Alpha21264: 667MHz), Tru64 UNIX V4.0F. Curves: matching by KMP on the original text; Agrep on the original text (the fastest previously); search-without-decompress methods of Shibata, et al. (2000): compressed PM on BPE (KMP type) and compressed PM on BPE (BM type).]

Summary of the above
[Figure: ordinary pattern matching and compressed pattern matching for LZSS, LZW, and BPE racing toward the goal, ranked 1 to 4]
• The original uncompressed text
• Text compressed by LZSS: high compression
• Text compressed by LZW: medium compression
• Text compressed by BPE: low compression …but it's the most suitable for PM!

Paradigm shift 1
• Before: develop pattern matching algorithms for each compression method.
• Choosing a suitable compression enables us to accelerate pattern matching!
• After: develop a novel compression method which is suitable for pattern matching!

Data compression methods for PM
• Dense coding type
  • [ETDC] Nieves R. Brisaboa, Eva Lorenzo Iglesias, Gonzalo Navarro, and Jose R. Parama: An efficient compression code for text databases, In ECIR2003, pp. 468-481, 2003.
  • [SCDC] Nieves R. Brisaboa, Antonio Farina, Gonzalo Navarro, and Maria F. Esteller: (s, c)-dense coding: An optimized compression code for natural language text databases, In SPIRE2003, pp. 122-136, 2003.
  • [FibC] Shmuel Tomi Klein and Miri Kopel Ben-Nissan: Using fibonacci compression codes as alternatives to dense codes, In DCC2008, pp. 472-481, 2008.
  • [SVVC] Nieves R. Brisaboa, Antonio Farina, Juan-Ramon Lopez, Gonzalo Navarro, and Eduardo R. Lopez: A new searchable variable-to-variable compressor, In DCC2010, pp. 199-208, 2010.
• VF coding type (including grammar-based compressions)
  • [BPEX] Shirou Maruyama, Yohei Tanaka, Hiroshi Sakamoto, and Masayuki Takeda: Context-sensitive grammar transform: Compression and pattern matching, In SPIRE2008, LNCS5280, pp. 27-38, Nov. 2008.
  • [DynC] Shmuel T. Klein and Dana Shapira: Improved variable-to-fixed length codes, In SPIRE2008, pp. 39-50, 2009.
  • [STVF] Takashi Uemura, Satoshi Yoshida, Takuya Kida, Tatsuya Asai, and Seishi Okamoto: Training parse trees for efficient VF coding, In SPIRE2010, pp. 179-184, 2010.

Paradigm shift 2
• Before: we use the data compression technology to reduce the cost of storing and transferring the data.
• We can speed up pattern matching by compressing the data.
• After: break the difficulties of various kinds of processing by using the compression technology!

Doing something by using compression
• Speeding up the calculation of similarity between two long strings by compression technique.
  • "A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices", M. Crochemore, G. M. Landau, and M. Ziv-Ukelson, Proceedings of the 13th Symposium on Discrete Algorithms, pp.679-688, 2002.
• Processing a very large graph structure in memory at high speed by compression technique.
  • Shinichi Nakano (Gunma University), "Graph compression with query support". Their method can represent a triangulated planar graph in 2m+o(n) bits and moreover can support some queries on it.
• Speeding up the query processing for XML data by compression technique.
  • Tetsuya Maita and Hiroshi Sakamoto (Kyushu Institute of Technology)

The 6th summary
• Pattern matching algorithms on compressed texts
  • Pattern matching on Huffman encoded text → automaton with synchronization
  • Pattern matching on LZW compressed text → simulating the move of KMP (AC) on the compressed text
• Unified framework: Collage system
  • A formal system to represent a text compressed by a dictionary-based compression method
  • We have clarified what kind of compression methods are suitable for pattern matching.
• Aspect of speeding up pattern matching by compression
  • BPE compression: it has a low compression ratio, but it can speed up pattern matching.
  • Our experimental results showed that we can do pattern matching faster than on the original texts.
• A big paradigm shift occurred
  • The data compression technology can be used for purposes other than reducing the data size.
• The next theme (which is the final topic of "Information retrieval and pattern matching")
  • Various topics I did not cover
