280 likes | 521 Views
Combinatorial Pattern Matching 2007. A Lempel-Ziv text index on secondary storage. Diego Arroyuelo and Gonzalo Navarro. Index. T. P. P. P. P. Introduction. The full-text searching problem :
E N D
Combinatorial Pattern Matching 2007 A Lempel-Ziv text index on secondary storage Diego Arroyuelo and Gonzalo Navarro
Index T P P P P Introduction The full-text searching problem: • to find all the occ occurrences of a pattern P[1..m] in a text T[1..u] (both over an alphabetSof sizes) We are interested in indexed text searching: • an index on T allows us to find quickly the pattern occurrences In our work the index • replaces the text (self-indexing) • is compressed (LZ) (compression+search)
Applications and goals Main applications of text searching: • Computational Biology (DNA and protein sequences) • Oriental language texts (Japanese, Chinese, Korean, etc.) • “Natural language” texts (English, Spanish, etc.) • Music (MIDI pitch sequences) • Program code • Etc. Compressed self-indexes: • Reduce the space requirement (not storing the text + compressing) • Are useful in cases where accessing the text is expensive (for example, web search engines)
Motivations • The use of a compressed self-index may totally remove the need to use the disk • However… Huge texts Sequential text searching + compression improves disk performance Compressed self-indexes More disk accesses but smaller seek time
Motivations By reducing the space of the index we aim at: • Saving disk space(important for storage media of limited size) • Reducing the seek time when searching(because the index is smaller)
Model of computation We assume a model of computation where: • A disk page of size B can be transferred to main memory in a single disk access • We can hold a constant number of disk pages in main memory • We count every disk access • The text is static
Related Works • String B-trees [FG, JACM 1999]: 3 – 4 times text size • Compact Pat Trees [CM, SODA 1996]: 5 – 6 times text size • Compressed Suffix Arrays [MNS, ISAAC 2003] • About 0.25 – 0.5 times text size • 2(1 + m · logBu) accesses for counting • O(log u) extra accesses per occurrence! Can we define a small an efficient index on secondary storage?
RevTrie LZTrie Searching LZ78 compressed texts: the LZ-index Different types of occurrences… LZ78 parses the text into phrases
Shortest possible LZ78 phrases containing P LZTrie P P P Occurrences of Type 1 Occurrences contained in a single phrase By LZ78, P is a suffix of such phrases Subtrees containing ocurrences of type 1
LZTrie RevTrie Pr P P P Occurrences of Type 1 • As P is a suffix of such phrases, Pr is a prefix of the corresponding reverse phrases • We need the Reverse Trie (RevTrie) to solve this problem Occurrences contained in a single phrase navigation between tries!
k-1 k P2 P1 RevTrie P LZTrie Pr1 P2 k-1 k Node RNode Occurrences of Type 2 Occurrences spanning two consecutive phrases Phrases endingwith P1 Phrases startingwith P2
Occurrences of Type 3 O(m2) occurrences of type 3 in the worst case O(m2) random accesses in the worst case Occurrences spanning more than two consecutive phrases
The LZ-index • A compressed full-text self-index based on the LZTrie[Navarro, JDA 2004] • Four data structures compose the LZ-index • LZTrie: the trie formed by all the LZ78 phrases B0,…,Bn • RevTrie: the trie formed by all the reverse LZ78 phrases Br0,…,Brn • Node: a mapping from phrase identifiers to their node in LZTrie • RNode: a mapping from phrase identifiers to their node in RevTrie • Overall: the LZ-index requires 4nlogn(1+o(1)) = 4uHk + o(ulogs) bits, for k = o(logsu) • We don’t need to store the text!
The LZ-index on secondary storage • The LZ-index was originally designed for main memory • It has a non-regular pattern of access to the index components • We define a version of LZ-index for secondary storage • We divide the problem as follows: • Solving the Basic Trie Operations • Reducing the Navigation Between Structures
Solving the basic trie operations • We cut the tries into disjoint blocks of size at most B, using the Clark and Munro Strategy • Every block stores a subtree of the whole trie • We arrange these blocks in a tree by adding inter-block pointers • We are able to compute • parent(x) • child(x, a) • depth(x) • subtreesize(x) • preorder(x) • ancestor(x, y) With one extra disk access in the worst case
LZTrie RevTrie Pr P P P Reducing the navigation between structures We avoid random accesses to report only one occurrence We would need a data structure able of finding all these subtrees without random accesses Occurrences contained in a single phrase For counting...
k k-1 P2 P1 RevTrie LZTrie Pr1 P2 y k-1 k y’ Reducing the navigation between structures Occurrences spanning two consecutive phrases LRmapping
Reducing the navigation between structures • We add some redundancy to reduce the number of accesses between index components • Many random accesses now become a single access + sequential scanning (please read the paper for other technical details) • The overall space requirement is 8uHk + o(ulogs) bits, for any k = o(logsu) • The space can be dropped to6uHk + o(ulogs) bits if we only need to count pattern occurrences
Experimental results • We indexed: • XMLfilefrom Pizza&Chili Corpus(200 megabytes) (http://pizzachili.dcc.uchile.cl) • We searched for 5,000 random patterns • count and locate queries • We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits)
Experimental results • We compared against • Suffix Arrays for secondary storage: • The two-level hierarchy of[BYBZ, 1996] • String B-trees: • We use the model provided in[FG, 1996] • Compact Pat Trees (CPT) [CM, 1996]
Experimental results (count) String B-trees LZ-index Suffix Array CPT 3.3 times smaller than String B-trees
Experimental results (count) LZ-index String B-trees Suffix Array CPT
Experimental results (locate) CPT Suffix Array String B-trees LZ-index • Average number of accesses to report the first occurrence • LZ-index 11 • String B-trees 12 2.6 times smaller than String B-trees
Experimental results (locate) CPT String B-trees Suffix Array LZ-index
Conclusions • The LZ-index can be adapted to work on secondary storage • Requiring up to 8uHk + o(ulogs) bits, for any k = o(logsu) • Our index is significantly smaller than any other practical secondary-memory data structure • LZ-index requires more disk accesses • But a smaller index would have a smaller seek time
Future work • We assumed a constant main-memory space, but… • To implement our index in a real practical setting • Handling dynamism(String B-trees require 13.5 times the text size!) • Direct construction on secondary storage • adapting [AN, ISAAC 2005] to work on disk
Questions? Contact darroyue@dcc.uchile.cl gnavarro@dcc.uchile.cl
Thanks! Contact darroyue@dcc.uchile.cl gnavarro@dcc.uchile.cl