Transform Based and Search Aware Text Compression Schemes and Compressed Domain Text Retrieval

Transform Based and Search Aware Text Compression Schemes and Compressed Domain Text Retrieval Nan Zhang PhD Candidate School of Computer Science University of Central Florida Spring, 2005

Acknowledgement • Chair of Committee: • Dr. Amar Mukherjee • Committee Member • Dr. Mostafa Bassiouni • Dr. Sheau-Dong Lang • Dr. Huaxin You

Outline • Motivation • Background and Review • Transform Based Text Compression • Compressed Pattern Matching • Search Aware Text Compression and Retrieval • Conclusion and Future Works

Growth of Data • According to Berkeley’s 2003 report: • Print, film, magnetic, and optical storage media produced about 5 exabytes of new information in 2002. 92% of the new information was stored on magnetic media, mostly on hard disks. • 1 exabyte = 1,000,000,000,000,000,000 bytes, or 10^18 bytes. • The amount of new information stored on paper, film, magnetic, and optical media about doubled from 2000 to 2003. • Information flowing through electronic channels (telephone, radio, TV, and the Internet) contained almost 18 exabytes of new information in 2002, 3.5 times more than was recorded in storage media. 98% of this total was information sent and received in telephone calls, including both voice and data on both fixed lines and wireless.

Motivation • Decrease storage requirements • Effective use of communication bandwidth • Increased data security • Multimedia data on information superhighway • Growing number of “killer” applications. • Find information from the “sleeping” files.

Brief History of Data Compression • 1838: Morse Code: uses shorter codes for more commonly used letters. • Late 1940’s: birth of Information Theory. • 1948: Claude Shannon and Robert Fano use probabilities of blocks. • 1953: Huffman Coding. • 1977: Arithmetic Coding. • 1977, 1978 (LZ77, LZ78): Abraham Lempel and Jacob Ziv assign codewords to repeating patterns in the text. • 1984 (LZW): Terry Welch, from LZ78. • 1987 (DMC): Cormack and Horspool, Dynamic Markov Model. • 1989 (PPM): Cleary and Witten. • 1994 (BWT): Burrows and Wheeler, block sorting. • Others: Run-Length Coding (RLE), Move-to-Front (MTF), Vector Quantization (VQ), Discrete Cosine Transform (DCT), Wavelet methods, etc.

Our Contribution (I) • Star family compression algorithms: • First invented by Dr. Mukherjee in 1995. Joint work of the M5 Group. • A reversible transformation that can be applied to a source text to improve an existing algorithm's ability to compress it. • We use a static dictionary to convert English words into predefined symbol sequences. • Creates additional context information that is superior to that of the original text. • Uses a ternary tree data structure to expedite the transformation.

Our Contribution (II) • Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) text. • Compressed Pattern Matching (CPM) domain. • Extract auxiliary arrays to facilitate matching. • K-mismatching. • K-approximate matching. • CPM up to MTF stage in BWT based compression.

Our Contribution (III) • Modified LZW algorithm that allows random access and partial decoding for compressed text retrieval. • Search aware text compression. • Random access and partial decoding. • Parallel processing. • Balance compression ratio and access speed.

Background • Basic Concepts • Entropy. • Lossy and Lossless Compression. • Transform based compression. • Compressed (domain) Pattern Matching. • Suffix Tree, Suffix Array. • Ternary Tree. • Our research is about Lossless Text Compression and Searching.

Lossless Text Compression • Basic techniques • Run-Length • Move-to-Front • Distance Coding • Statistical methods: variable-sized codes based on the statistics of the symbols or groups of symbols. • Huffman, Arithmetic • PPM (PPMC, PPMD, PPMD+, PPMZ …) • Dictionary methods: encode a sequence of symbols with another token • LZ (LZ77, LZ78, LZW, LZH …) • Transform based methods: • BWT

Pattern Matching • Exact pattern matching: • Given a string P and a text T regarded as a longer string, find all the occurrences, if any, of pattern P in text T. • Applications: word processors, text information retrieval, internet search engines such as Google and Yahoo, online electronic publications, biological pattern finding such as in DNA and RNA sequences stored as strings, etc. • Methods: Boyer-Moore, Knuth-Morris-Pratt, Aho-Corasick, Karp-Rabin, … They are O(n) algorithms.

Pattern Matching • Subproblems: • suffix tree, suffix array, lowest common ancestor, longest common substring • When we search in the compressed text, we try to build such structures on the fly by (partially) decompressing the compressed text.

Approximate Pattern Matching • Looking for similar patterns involves: • a definition of distance/similarity between the patterns • the degree of error allowed • Dynamic programming algorithms have been used to solve approximate pattern matching problems.
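The dynamic programming idea can be sketched as follows. This is a minimal illustration that assumes edit distance (insertions, deletions, substitutions) as the similarity measure and k as the allowed degree of error; the function name approx_find is hypothetical and not taken from the presentation.

```python
def approx_find(pattern, text, k):
    """Return the end positions (1-based) in text where some substring
    ending there is within edit distance k of pattern."""
    m = len(pattern)
    # prev[i] = smallest edit distance between pattern[:i] and any
    # substring of the text ending at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        # cur[0] = 0 because a match is allowed to start anywhere in text.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # missing character in text
        if cur[m] <= k:
            hits.append(j)
        prev = cur
    return hits
```

The table is filled column by column, so only two columns of size |P|+1 are kept in memory; the running time is proportional to |P|·|T|.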

Text Information Retrieval • Many data are in compressed form: .gz, .Z, .bz, … • Approaches • Decompress + Searching. • Search directly on the compressed file • High efficiency, but not possible for all compressed forms. • Tradeoff: decompress as little as possible and use the information to do searching. • Search Aware Compression

Data Compression Stages • Preliminary compression steps. (LIPT, BWT) • Organization by context. (MTF,DC) • Probability estimation. (PPM, DMC) • Length-reducing code. (Huffman, Arithmetic)

Star Transform Model • Performs a lossless, reversible transformation to a source file prior to applying an existing compression algorithm • The basic philosophy of the compression algorithm is to transform the text into some intermediate form, which can be compressed with better efficiency.

Star Transform Model • Compression: Original text: This is a test. -> Transform encoding -> Transformed text: ***a^ ** * ***b. -> Data compression -> Compressed text: (binary code) • Decompression: Compressed text: (binary code) -> Data decompression -> Transformed text: ***a^ ** * ***b. -> Transform decoding -> Original text: This is a test.

Dictionary • All the transforms use an English language dictionary D that has about 60,000 words (the current dictionary is about 500 KB). • Shared by both the compression and decompression ends. • This English dictionary D is partitioned into disjoint dictionaries Di, each containing words of length i, where i = 1, 2, …, n. • Each dictionary Di is partially sorted according to the frequency of words in the English language. • A mapping is used to generate the encoding for all words in each dictionary Di. Di[j] denotes the jth word in dictionary Di. • If a word in the input text is not in the English dictionary (viz. a new word in the lexicon), it is passed to the transformed text unaltered. Special characters and capitalization are also handled by escape characters.

Star Family Transforms • *-Encoding replaces each character in the input word by a special placeholder character ‘*’ and retains at most two characters from the original word. • It preserves the length of the original word in the encoding. • e.g. the one letter word ‘a’ will be encoded as ‘*’, ‘am’ will be encoded as ‘*a’, the word ‘there’ will be encoded as ‘*****’, and ‘which’ will be encoded as ‘a****’.

Star Family Transforms • LPT (Length Preserved Transform) encodes the input word as *pc1c2c3, where c1c2c3 give the dictionary offset of the corresponding word (c1 cycles through z-a, c2 cycles through A-Z, c3 cycles through a-z), and p is the padding sequence (a suffix of the string ‘…nopqrstuvw’) used to equalize the length of the encoded and original input word. • The number of dictionary offset characters (at the end) is fixed for all words. • e.g. the first word of length nine in the original dictionary will be encoded as *stuvwzAa

Star Family Transforms • RLPT (Reversed Length Preserved Transform) also encodes the input word as *pc1c2c3 but uses the reversed string for p, i.e. taken from the string ‘wvutsrqpon…’. • e.g. the first word of length nine in the original dictionary will be encoded as *wvutszAa

Star Family Transforms • SCLPT (Shortened Context) transform does not preserve the original length of the input word. • Encodes the input word as *pc1c2c3 but discards all the characters except the first one in the LPT padding sequence p. This character is used to determine the sequence of characters that follow, up to the last character ‘w’ in LPT. • e.g. the first word of length nine in the original dictionary will be encoded as *szAa.
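The three padding schemes (LPT, RLPT, SCLPT) can be compared side by side with a short sketch. It assumes that the padding string ‘…nopqrstuvw’ simply continues the alphabet back to ‘a’, and it takes the three offset characters as a ready-made argument, because the exact offset-to-character mapping is not spelled out here. All function names are hypothetical.

```python
import string

# Assumption: the padding source '...nopqrstuvw' is the alphabet 'a'..'w'.
PAD_SOURCE = string.ascii_lowercase[:23]

def lpt_padding(word_length, offset_len=3):
    """Padding p that makes '*' + p + offset exactly word_length long."""
    pad_len = word_length - 1 - offset_len
    if pad_len < 0 or pad_len > len(PAD_SOURCE):
        raise ValueError("word length not supported by this sketch")
    # Take the last pad_len characters, i.e. a suffix ending in 'w'.
    return PAD_SOURCE[len(PAD_SOURCE) - pad_len:]

def lpt(word_length, offset):
    return "*" + lpt_padding(word_length, len(offset)) + offset

def rlpt(word_length, offset):
    # Same padding as LPT, written in reverse order.
    return "*" + lpt_padding(word_length, len(offset))[::-1] + offset

def sclpt(word_length, offset):
    # Keep only the first padding character; the rest is implied by it.
    return "*" + lpt_padding(word_length, len(offset))[:1] + offset

print(lpt(9, "zAa"), rlpt(9, "zAa"), sclpt(9, "zAa"))
```

For a word of length nine with offset characters zAa, the three functions reproduce the codewords quoted on the slides: *stuvwzAa, *wvutszAa and *szAa.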

Star Family Transforms • LIPT (Length Index Preserving Transform) introduces the idea of using a character to denote the length of the original word. The encoding is *cl[c1][c2][c3], where cl is a letter in [a-zA-Z] denoting the length of the input word (a denotes length 1, b length 2, …, z length 26, A length 27, …, Z length 52). The number of trailing characters giving the dictionary offset (with respect to the start of each length block) is variable, not fixed as in the previous three transforms. Each of c1, c2, c3 cycles through [a-zA-Z]. • e.g. the first word of length ten in the original dictionary will be encoded as *j, the second word as *ja, the fifty-fourth as *jaa, and the 2758th word as *jaaa.
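The LIPT examples (*j, *ja, *jaa, *jaaa) are consistent with a numbering in which the first word of a length block gets no offset characters, the next 52 words get one, the next 52^2 get two, and so on. A minimal sketch of that numbering follows; the function name and the choice of which offset character varies fastest are assumptions.

```python
import string

# 'a'..'z' followed by 'A'..'Z': 52 symbols for lengths and offsets.
LETTERS = string.ascii_lowercase + string.ascii_uppercase

def lipt_codeword(word_length, rank):
    """Codeword for the rank-th (1-based) word of the given length."""
    if not 1 <= word_length <= 52:
        raise ValueError("word length must be between 1 and 52")
    if rank < 1:
        raise ValueError("rank is 1-based")
    offset = rank - 1
    digits = 0
    block = 1  # number of codewords that use exactly `digits` characters
    while offset >= block:
        offset -= block
        digits += 1
        block = 52 ** digits
    if digits > 3:
        raise ValueError("at most three offset characters are available")
    # Write the remaining offset as a fixed-width base-52 number.
    chars = []
    for _ in range(digits):
        offset, r = divmod(offset, 52)
        chars.append(LETTERS[r])
    return "*" + LETTERS[word_length - 1] + "".join(reversed(chars))
```

Because frequent words sit at the front of each length block, they receive the shortest codewords, which is where the size reduction of LIPT relative to the length-preserving transforms comes from.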

Star Family Transforms • Original Text (83 bytes): It is truly a magnificent piece of work. constitutionalizing internationalizations • *- (84 bytes): r*~ q* v*D** * OC********* Vl*** y* Q***. b****************** *********************

Star Transforms Examples • RLPT (84 bytes): r*~ q* *wtbE * *wvutsrqwzD *wvaM y* *zbR. *wvutsrqponmlkjizaC *wvutsrqponmlkjihgzaA • SCLPT (48 bytes): *s~ *r *wtbE * *qwzD *wvaM *z *zbR. *izaC *gzaA • LIPT (44 bytes): *br~ *bq *eazD *a *kYC *eZl *by *dQ. *sb *u

Test Corpus Data Compression

Results

Dictionary Overhead • The size of the English dictionary is 500 KB uncompressed and 197 KB when compressed with Bzip2. • Let the uncompressed size of the data to be transmitted be F and the uncompressed dictionary size be D. Then for Bzip2 with LIPT: $F \times 2.16 + D \times 2.38 \le F \times 2.38$, which gives $F \ge 5.4$ MB. • To break even on the overhead associated with the dictionary, 5.4 MB of data has to be transmitted. If the normal file size for a transmission is 1 MB, then the dictionary overhead will break even after about 5.4 transmissions. All transmissions above this number contribute towards the gain achieved by LIPT. • Similarly, for PPMD with LIPT, $F \ge 4.14$ MB. • With increasing dictionary size, this threshold will go up, but in a scenario where thousands of files are transmitted, the amortized cost will be negligible.
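The break-even figure for Bzip2 with LIPT can be checked with a few lines of arithmetic. The sketch assumes D = 0.5 MB (the 500 KB uncompressed dictionary) and treats 2.38 and 2.16 as the compressed-size rates without and with LIPT, as they appear in the inequality; the function name is hypothetical.

```python
def break_even_size(dict_size_mb, rate_plain, rate_transformed):
    """Smallest F (in MB) for which
    F * rate_transformed + D * rate_plain <= F * rate_plain."""
    return dict_size_mb * rate_plain / (rate_plain - rate_transformed)

# Bzip2 with LIPT, using the figures quoted on the slide.
print(round(break_even_size(0.5, 2.38, 2.16), 2))
```

Rearranging the inequality gives F ≥ D × 2.38 / (2.38 − 2.16) = 0.5 × 2.38 / 0.22 ≈ 5.4 MB, which matches the stated threshold.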

Fast Implementation • Hash table • Binary tree • Digital search tries • Ternary search trees • Searching for a string of length k in a ternary search tree with n strings requires at most O(log n + k) character comparisons.
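A ternary search tree stores one character per node with three children: smaller, equal (next character of the key), and larger. The sketch below is a generic illustration of the data structure and not the implementation used in the transform; class and method names are hypothetical, and the stored values stand in for dictionary codewords.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "value")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.value = None  # set only on the last character of a key


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, key, value):
        if not key:
            raise ValueError("empty keys are not supported")
        self.root = self._insert(self.root, key, 0, value)

    def _insert(self, node, key, i, value):
        ch = key[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, key, i, value)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, key, i, value)
        elif i + 1 < len(key):
            node.eq = self._insert(node.eq, key, i + 1, value)
        else:
            node.value = value
        return node

    def get(self, key):
        """Return the stored value, or None if key is absent."""
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(key):
                node, i = node.eq, i + 1
            else:
                return node.value
        return None
```

Each lookup advances through the key only on "equal" branches, so it touches at most k such branches plus the left/right steps; the O(log n + k) bound assumes the tree is reasonably balanced, which depends on insertion order.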

Timing Performance -- Compared with LIPT • Encoding: 76.3% • Decoding: 84.9%

Timing Performance -- Compared with Backend Compressor • Encoding: Bzip2 -9: 28.1%, Gzip -9: 50.4%, PPMD (k=5): 21.2% • Decoding: 18.6%; some minimal increase • In many applications, encoding is performed offline. The decoding speed is more important.

Compressed Pattern Matching • Motivation: • Given the sheer volume of data provided to users, files are increasingly stored in compressed formats: .gz, .tar.gz, etc. • Storage & transmission • Information retrieval • Speed

Pattern Matching Algorithms • Naïve: O(mn), where |P|=n and |T|=m • BM: O(n+m) • KMP: O(n+m) • Aho-Corasick: O(n+m) • Shift-And: O(mn) bit operations • Suffix tree: O(m+n) • Space reduction: suffix array, which can be searched with binary search
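The suffix array option can be illustrated with a small sketch: sort the suffix start positions once, then locate a pattern with two binary searches. The construction below is deliberately naive (it sorts full suffixes) and is only meant to show the search step; function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_all(text, sa, pattern):
    """All start positions of pattern in text, found by binary search."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "bananas"
sa = build_suffix_array(text)
print(find_all(text, sa, "ana"))
```

All suffixes that begin with the pattern form one contiguous block in the sorted order, which is why two boundary searches are enough.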

Approximate Pattern Matching • Edit distance problem • K-mismatch problem • K-approximate problem
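For the k-mismatch problem, a match is a text window of the same length as the pattern that differs from it in at most k positions. A minimal brute-force sketch, with a hypothetical function name, is:

```python
def k_mismatch(pattern, text, k):
    """Start positions where pattern matches text with at most k
    character substitutions (no insertions or deletions)."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits
```

This takes time proportional to |P|·|T| in the worst case; it is shown only to pin down the problem definition, not as the method used on BWT text.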

Compressed Pattern Matching • Definition: Given T a text string, P a pattern, and Z the compressed representation of T, locate the occurrences of P in T with minimal (or no) decompression of Z. • Approach • Decompress + Searching • Search directly on the compressed file • High efficiency, but not possible for all compressed forms • Tradeoff: decompress as little as possible and use the information to do searching.

Compressed Pattern Matching • Amir et al. search for a pattern in LZ-77 compressed text in time, where m=|P|, u=|T|, and n=|Z|. • Bunke et al. search for patterns in run-length encoded files in time, where is the length of the pattern when it is compressed. • Moura et al. proposed algorithms on Huffman-encoded files. • Special compression schemes facilitate later pattern matching (Manber, Shibata, Kida, etc.).

Compressed Pattern Matching • Little has been done on searching directly on text compressed with the Burrows-Wheeler Transform (BWT). • Compression Ratio: BWT is significantly superior to the popular LZ-based methods such as gzip and compress, and only second to PPM. • Running Time: much faster than PPM, and comparable with LZ-based algorithms.

Exact Matching in BWT Text • BWT performs a permutation of the characters in the text, such that characters in lexically similar contexts will be near to each other.

Burrows-Wheeler Transform • Forward Transform: Given an input text T of length u, • Form u permutations of T by cyclic rotations of the characters in T, forming a u×u matrix M’, with each row representing one rotation of T. • Sort the rows of M’ lexicographically to form another matrix M, which includes T as one of its rows. • Record L, the last column of M, and id, the row number where T is in M. • Output of BWT is (L, id).

Burrows-Wheeler Transform • Forward Transform of T = bananas • M’ (cyclic rotations): bananas, sbanana, asbanan, nasbana, anasban, nanasba, ananasb • M (sorted rows): ananasb, anasban, asbanan, bananas, nanasba, nasbana, sbanana • Output (L, id): (bnnsaaa, 4) • Stable sort: preserves the original order of records with equal keys • M contains all the suffixes of the text and they are sorted!
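The forward transform can be written directly from its definition. The sketch below builds the full rotation matrix, which is fine for a short example but far too expensive for real inputs; the function name is hypothetical and the row number is 1-based to match the slides.

```python
def bwt_forward(text):
    """Return (L, id): the last column of the sorted rotation matrix
    and the 1-based row at which the original text appears."""
    if not text:
        raise ValueError("text must be non-empty")
    u = len(text)
    # All u cyclic rotations of the text (the rows of M').
    rotations = [text[i:] + text[:i] for i in range(u)]
    # Sorting the rows lexicographically gives M.
    matrix = sorted(rotations)
    last_column = "".join(row[-1] for row in matrix)
    row_id = matrix.index(text) + 1
    return last_column, row_id

print(bwt_forward("bananas"))  # ('bnnsaaa', 4)
```

The rotations are generated in a different order than on the slide, but sorting makes the resulting matrix M identical.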

Burrows-Wheeler Transform • Inverse Transform: Given only the (L, id) pair: • Sort L to produce F, the array of first characters. • Compute V, which provides a 1-1 mapping between the elements of L and F: F[V[j]] = L[j]. • Generate the original text T: the symbol L[j] cyclically precedes the symbol F[j] in T, that is, L[V[j]] cyclically precedes the symbol L[j] in T.

Burrows-Wheeler Transform • (F, L) pairs, row by row, for bananas: (a, b), (a, n), (a, n), (b, s) [id = 4], (n, a), (n, a), (s, a)
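The inverse transform can be sketched with the F, L and V arrays: a stable sort of L yields F together with the mapping V, and following V from row id recovers the text from its last character backwards. The function name is hypothetical; id is 1-based as on the slides.

```python
def bwt_inverse(last_column, row_id):
    """Rebuild the original text from (L, id), with id 1-based."""
    u = len(last_column)
    # Stable sort of the positions of L by symbol: order[r] is the
    # position in L of the symbol that ends up at row r of F.
    order = sorted(range(u), key=lambda j: last_column[j])
    # V[j] = row of F holding the same symbol occurrence as L[j],
    # so that F[V[j]] == L[j].
    v = [0] * u
    for f_row, l_pos in enumerate(order):
        v[l_pos] = f_row
    # L[id] is the last character of T; L[V[j]] cyclically precedes
    # L[j], so the text is collected back to front.
    out = []
    j = row_id - 1
    for _ in range(u):
        out.append(last_column[j])
        j = v[j]
    return "".join(reversed(out))

print(bwt_inverse("bnnsaaa", 4))  # bananas
```

Python's sorted is stable, which is what keeps equal symbols in the same relative order in F and L, as the transform requires.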

BWT Based Compression • BWT • Move-to-Front (MTF) • Run-Length Encoding (RLE) • Variable Length Encoding (VLE) • Huffman or • Arithmetic • Pipeline: Input -> BWT -> MTF -> RLE -> VLE -> Output
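The two middle stages of the pipeline can be illustrated on the BWT output of bananas. This is a textbook-style sketch of Move-to-Front and a simple (value, run length) RLE; production compressors such as bzip2 use their own RLE variants, and the function names and the initial symbol table order are assumptions.

```python
def mtf_encode(text, alphabet):
    """Move-to-Front: emit each symbol's current index in the table,
    then move that symbol to the front of the table."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

codes = mtf_encode("bnnsaaa", "abns")
print(codes)              # [1, 2, 0, 3, 3, 0, 0]
print(rle_encode(codes))  # [(1, 1), (2, 1), (0, 1), (3, 2), (0, 2)]
```

Runs of identical symbols in the BWT output become runs of zeros after MTF, which is what makes the subsequent RLE and variable length coding effective.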

Motivations • BWT provides a lexicographic ordering of the input text as part of its inverse transform process. • It may be possible to perform an initial match on a symbol and its context, and then decode only that part of the text.

Derivation of Auxiliary Arrays • Start from the F, L and V arrays, the output of BWT before the MTF and further encoding. • Given F and L, obtain a set of bi-grams from text T and pattern P.
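Since L[j] cyclically precedes F[j] in T, pairing the two columns row by row yields every bi-gram of the text without decoding it. The sketch below uses that fact as a quick filter for a pattern; how the bi-grams are used beyond this point is not stated on the slide, so the filtering step and all function names are assumptions.

```python
def bigrams_from_bwt(last_column):
    """Bi-grams of the (cyclic) text, read off as L[j] + F[j]."""
    first_column = sorted(last_column)  # F is L in sorted order
    return {l + f for l, f in zip(last_column, first_column)}

def pattern_bigrams(pattern):
    """All bi-grams of consecutive characters in the pattern."""
    return {pattern[i:i + 2] for i in range(len(pattern) - 1)}

def may_occur(pattern, last_column):
    """Necessary (not sufficient) condition for pattern to occur:
    every bi-gram of the pattern must be a bi-gram of the text."""
    return pattern_bigrams(pattern) <= bigrams_from_bwt(last_column)

print(sorted(bigrams_from_bwt("bnnsaaa")))
```

For bananas the set also contains the wrap-around bi-gram sb, which comes from the cyclic rotation rather than from the text itself, so a pattern that passes the filter still has to be verified.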
