
A Lempel-Ziv text index on secondary storage

Combinatorial Pattern Matching 2007. Diego Arroyuelo and Gonzalo Navarro.


Presentation Transcript


  1. A Lempel-Ziv text index on secondary storage
     Diego Arroyuelo and Gonzalo Navarro
     Combinatorial Pattern Matching 2007

  2. Introduction
     The full-text searching problem:
     • to find all the occ occurrences of a pattern P[1..m] in a text T[1..u] (both over an alphabet Σ of size σ)
     We are interested in indexed text searching:
     • an index on T allows us to find the pattern occurrences quickly
     In our work the index
     • replaces the text (self-indexing)
     • is compressed (LZ) (compression + search)
     [Figure: an index built on the text T, with the occurrences of P marked]
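
The problem statement on this slide can be made concrete with a minimal, index-free sketch. The function name `find_occurrences` is illustrative and not part of the presentation; it scans the text sequentially, which is exactly the work an index is meant to avoid.

```python
def find_occurrences(text, pattern):
    """Return the 1-based starting positions of every occurrence of pattern
    in text (overlapping ones included) by scanning the text sequentially."""
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i + 1)        # the slides index the text as T[1..u]
        i = text.find(pattern, i + 1)
    return positions


print(find_occurrences("abracadabra", "abra"))   # [1, 8]
```

Here occ is the length of the returned list. A self-index would answer the same query without keeping `text` around.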

  3. Applications and goals
     Main applications of text searching:
     • Computational Biology (DNA and protein sequences)
     • Oriental language texts (Japanese, Chinese, Korean, etc.)
     • “Natural language” texts (English, Spanish, etc.)
     • Music (MIDI pitch sequences)
     • Program code
     • Etc.
     Compressed self-indexes:
     • Reduce the space requirement (not storing the text + compressing)
     • Are useful in cases where accessing the text is expensive (for example, web search engines)

  4. Motivations
     • The use of a compressed self-index may totally remove the need to use the disk
     • However, for huge texts:
       • Sequential text searching: compression improves disk performance
       • Compressed self-indexes: more disk accesses, but smaller seek time

  5. Motivations
     By reducing the space of the index we aim at:
     • Saving disk space (important for storage media of limited size)
     • Reducing the seek time when searching (because the index is smaller)

  6. Model of computation
     We assume a model of computation where:
     • A disk page of size B can be transferred to main memory in a single disk access
     • We can hold a constant number of disk pages in main memory
     • We count every disk access
     • The text is static
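
A toy simulation can make this cost model tangible. The class below is a sketch under stated assumptions: the name `PagedDisk`, the `cached_pages` parameter and the least-recently-used eviction policy are all illustrative choices, not details given in the presentation.

```python
class PagedDisk:
    """Toy secondary storage: an array of items cut into pages of page_size
    items. Reading an item whose page is not in memory costs one disk access."""

    def __init__(self, data, page_size, cached_pages=1):
        self.data = list(data)
        self.page_size = page_size
        self.cached_pages = cached_pages   # constant number of pages kept in memory
        self.cache = []                    # page numbers, least recently used first
        self.accesses = 0                  # every disk access is counted

    def read(self, index):
        page = index // self.page_size
        if page in self.cache:
            self.cache.remove(page)        # hit: refresh its position below
        else:
            self.accesses += 1             # miss: one disk access
            if len(self.cache) >= self.cached_pages:
                self.cache.pop(0)          # evict the least recently used page
        self.cache.append(page)
        return self.data[index]


sequential = PagedDisk(range(100), page_size=10)
for i in range(30):
    sequential.read(i)
print(sequential.accesses)                 # 3

scattered = PagedDisk(range(100), page_size=10)
for i in [0, 50, 1, 51]:
    scattered.read(i)
print(scattered.accesses)                  # 4
```

Thirty sequential reads cost three accesses, while four scattered reads cost four. This gap is why the access pattern of an index matters as much as its size in this model.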

  7. Related work
     • String B-trees [FG, JACM 1999]: 3 to 4 times the text size
     • Compact Pat Trees [CM, SODA 1996]: 5 to 6 times the text size
     • Compressed Suffix Arrays [MNS, ISAAC 2003]:
       • About 0.25 to 0.5 times the text size
       • 2(1 + m · log_B u) accesses for counting
       • O(log u) extra accesses per occurrence!
     Can we define a small and efficient index on secondary storage?

  8. Searching LZ78 compressed texts: the LZ-index
     • LZ78 parses the text into phrases
     • Different types of occurrences…
     [Figure: LZTrie and RevTrie]
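
The LZ78 parsing mentioned on this slide can be sketched in a few lines. This is a minimal illustration, assuming a dictionary keyed by (parent phrase, character) as a stand-in for the LZTrie; the function name `lz78_parse` is hypothetical.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases: each phrase is the longest previously
    seen phrase extended by one new character."""
    trie = {}          # (parent_id, char) -> phrase_id; id 0 is the empty phrase
    phrases = []       # phrases[i] holds the text of phrase i + 1
    node, start = 0, 0
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is not None:
            node = child                       # keep extending the current match
        else:
            trie[(node, ch)] = len(phrases) + 1
            phrases.append(text[start:i + 1])  # matched phrase plus one new char
            node, start = 0, i + 1
    if start < len(text):
        # The text ended in the middle of a match, so the last phrase
        # repeats an earlier one.
        phrases.append(text[start:])
    return phrases


print(lz78_parse("abracadabra"))   # ['a', 'b', 'r', 'ac', 'ad', 'ab', 'ra']
```

Because every phrase is an earlier phrase plus one character, the set of phrases is prefix-closed and can be stored as a trie, which is what makes the LZTrie possible.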

  9. Occurrences of Type 1
     Occurrences contained in a single phrase
     • Consider the shortest possible LZ78 phrases containing P
     • By LZ78, P is a suffix of such phrases
     [Figure: LZTrie, with the subtrees containing occurrences of type 1]

  10. Occurrences of Type 1
      Occurrences contained in a single phrase
      • As P is a suffix of such phrases, P^r (the reverse of P) is a prefix of the corresponding reverse phrases
      • We need the Reverse Trie (RevTrie) to solve this problem
      • This requires navigation between tries!
      [Figure: P^r searched in RevTrie, occurrences of P in LZTrie]
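
The role of the RevTrie can be illustrated with a small in-memory sketch. It assumes a plain dictionary-based trie and hypothetical function names; the real structure is compressed and stored on disk. The phrase list is the LZ78 parse of "abracadabra".

```python
def build_rev_trie(phrases):
    """Insert every reversed phrase into a dict-based trie. Each node records
    the ids (1-based) of the phrases whose reverse ends exactly at that node."""
    root = {"children": {}, "ids": []}
    for pid, phrase in enumerate(phrases, start=1):
        node = root
        for ch in reversed(phrase):
            node = node["children"].setdefault(ch, {"children": {}, "ids": []})
        node["ids"].append(pid)
    return root


def phrases_ending_with(rev_trie, pattern):
    """Descend by the reversed pattern, then collect every phrase id found
    in the subtree: those are the phrases that have pattern as a suffix."""
    node = rev_trie
    for ch in reversed(pattern):
        node = node["children"].get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        found.extend(cur["ids"])
        stack.extend(cur["children"].values())
    return sorted(found)


phrases = ["a", "b", "r", "ac", "ad", "ab", "ra"]
rev_trie = build_rev_trie(phrases)
print(phrases_ending_with(rev_trie, "a"))   # [1, 7]
print(phrases_ending_with(rev_trie, "b"))   # [2, 6]
```

Each phrase id found this way must then be mapped to its node in the LZTrie. That jump between structures is the navigation between tries the slide points out.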

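  11. Occurrences of Type 2
      Occurrences spanning two consecutive phrases, k-1 and k, with P = P1 P2
      • Phrases ending with P1
      • Phrases starting with P2
      [Figure: P1^r searched in RevTrie and P2 searched in LZTrie; Node and RNode relate phrases k-1 and k across the two tries]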
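
The condition for a type 2 occurrence can be written down directly. The sketch below is a brute-force check over the phrase list, not the trie-based search of the LZ-index; the function name and the (k, i) result format are illustrative assumptions.

```python
def type2_occurrences(phrases, pattern):
    """Return (k, i) pairs such that phrase k-1 ends with pattern[:i] and
    phrase k starts with pattern[i:]. Phrase ids are 1-based."""
    hits = []
    for i in range(1, len(pattern)):               # every split P = P1 P2
        p1, p2 = pattern[:i], pattern[i:]
        for k in range(2, len(phrases) + 1):       # id of the second phrase
            prev, cur = phrases[k - 2], phrases[k - 1]
            if prev.endswith(p1) and cur.startswith(p2):
                hits.append((k, i))
    return hits


phrases = ["a", "b", "r", "ac", "ad", "ab", "ra"]   # LZ78 parse of "abracadabra"
print(type2_occurrences(phrases, "bra"))   # [(7, 1)]
print(type2_occurrences(phrases, "ca"))    # [(5, 1)]
```

The other occurrence of "bra" in "abracadabra", at the start of the text, does not show up here because it spans three phrases.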

  12. Occurrences of Type 3
      Occurrences spanning more than two consecutive phrases
      • O(m^2) occurrences of type 3 in the worst case
      • O(m^2) random accesses in the worst case
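
To see the three occurrence types side by side, a brute-force classifier is enough. This sketch assumes the phrase list is available in memory and that the pattern is non-empty; `classify_occurrences` is an illustrative name, not a routine from the presentation.

```python
from bisect import bisect_right


def classify_occurrences(phrases, pattern):
    """Map each 0-based occurrence position to its type: 1 if it lies inside
    one phrase, 2 if it spans two phrases, 3 if it spans more than two."""
    text = "".join(phrases)
    starts, pos = [], 0
    for phrase in phrases:                 # starting position of every phrase
        starts.append(pos)
        pos += len(phrase)
    result = {}
    i = text.find(pattern)
    while i != -1:
        first = bisect_right(starts, i) - 1                    # phrase of first char
        last = bisect_right(starts, i + len(pattern) - 1) - 1  # phrase of last char
        result[i] = min(last - first + 1, 3)
        i = text.find(pattern, i + 1)
    return result


phrases = ["a", "b", "r", "ac", "ad", "ab", "ra"]   # LZ78 parse of "abracadabra"
print(classify_occurrences(phrases, "bra"))   # {1: 3, 8: 2}
print(classify_occurrences(phrases, "ab"))    # {0: 2, 7: 1}
```

The same pattern can therefore have occurrences of different types, which is why the index needs a separate search procedure for each type.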

  13. The LZ-index
      • A compressed full-text self-index based on the LZTrie [Navarro, JDA 2004]
      • Four data structures compose the LZ-index:
        • LZTrie: the trie formed by all the LZ78 phrases B_0,…,B_n
        • RevTrie: the trie formed by all the reverse LZ78 phrases B_0^r,…,B_n^r
        • Node: a mapping from phrase identifiers to their node in LZTrie
        • RNode: a mapping from phrase identifiers to their node in RevTrie
      • Overall: the LZ-index requires 4n log n (1 + o(1)) = 4uH_k + o(u log σ) bits, for k = o(log_σ u)
      • We don’t need to store the text!

  14. The LZ-index on secondary storage
      • The LZ-index was originally designed for main memory
      • It has a non-regular pattern of access to the index components
      • We define a version of the LZ-index for secondary storage
      • We divide the problem as follows:
        • Solving the basic trie operations
        • Reducing the navigation between structures

  15. Solving the basic trie operations
      • We cut the tries into disjoint blocks of size at most B, using the Clark and Munro strategy
      • Every block stores a subtree of the whole trie
      • We arrange these blocks in a tree by adding inter-block pointers
      • We are able to compute, with one extra disk access in the worst case:
        • parent(x)
        • child(x, a)
        • depth(x)
        • subtreesize(x)
        • preorder(x)
        • ancestor(x, y)
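
The idea of cutting a trie into blocks of bounded size can be sketched with a simple greedy, bottom-up heuristic. This is an assumption-laden illustration and not necessarily the exact Clark and Munro strategy: the function name, the `children` dictionary representation and the rule of closing the largest child component first are all choices made here. It assumes `block_size` is at least 1.

```python
def cut_into_blocks(children, root, block_size):
    """Greedy bottom-up partition of a tree into connected blocks of at most
    block_size nodes. Returns a dict: node -> root of the block holding it."""
    block_root = {}
    open_nodes = {}      # node -> nodes of its still-open component (node first)

    # Preorder listing; its reverse visits every node after all its descendants.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    for node in reversed(order):
        parts = sorted((open_nodes.pop(c) for c in children.get(node, [])), key=len)
        total = 1 + sum(len(p) for p in parts)
        while total > block_size:
            closed = parts.pop()             # close the largest child component
            total -= len(closed)
            for n in closed:
                block_root[n] = closed[0]    # closed[0] is that component's root
        open_nodes[node] = [node] + [n for p in parts for n in p]

    for n in open_nodes[root]:               # whatever remains forms the top block
        block_root[n] = root
    return block_root


tree = {0: [1, 2], 1: [3, 4], 2: [5]}
print(cut_into_blocks(tree, 0, 3) == {0: 0, 2: 0, 5: 0, 1: 1, 3: 1, 4: 1})   # True
```

In this example the six nodes end up in two blocks of three nodes each, rooted at nodes 0 and 1. The edge from node 0 to node 1 would become an inter-block pointer.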

  16. Reducing the navigation between structures
      Occurrences contained in a single phrase
      • For counting…
      • We would need a data structure capable of finding all these subtrees without random accesses
      • We avoid random accesses to report only one occurrence
      [Figure: P^r in RevTrie and the occurrences of P in LZTrie]

  17. Reducing the navigation between structures
      Occurrences spanning two consecutive phrases
      • LRmapping
      [Figure: P1^r in RevTrie and P2 in LZTrie for phrases k-1 and k, with nodes y and y’]

  18. Reducing the navigation between structures
      • We add some redundancy to reduce the number of accesses between index components
      • Many random accesses now become a single access + sequential scanning (please read the paper for other technical details)
      • The overall space requirement is 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
      • The space can be reduced to 6uH_k + o(u log σ) bits if we only need to count pattern occurrences

  19. Experimental results
      • We indexed:
        • An XML file from the Pizza&Chili Corpus (200 megabytes) (http://pizzachili.dcc.uchile.cl)
      • We searched for 5,000 random patterns
        • count and locate queries
      • We assume a disk page of 32 kilobytes (i.e., 8,192 integers of 32 bits)

  20. Experimental results
      • We compared against:
        • Suffix Arrays for secondary storage:
          • The two-level hierarchy of [BYBZ, 1996]
        • String B-trees:
          • We use the model provided in [FG, 1996]
        • Compact Pat Trees (CPT) [CM, 1996]

  21. Experimental results (count)
      [Plot comparing String B-trees, LZ-index, Suffix Array and CPT; annotation: 3.3 times smaller than String B-trees]

  22. Experimental results (count)
      [Plot comparing LZ-index, String B-trees, Suffix Array and CPT]

  23. Experimental results (locate)
      [Plot comparing CPT, Suffix Array, String B-trees and LZ-index; annotation: 2.6 times smaller than String B-trees]
      • Average number of accesses to report the first occurrence:
        • LZ-index: 11
        • String B-trees: 12

  24. Experimental results (locate)
      [Plot comparing CPT, String B-trees, Suffix Array and LZ-index]

  25. Conclusions
      • The LZ-index can be adapted to work on secondary storage
        • Requiring up to 8uH_k + o(u log σ) bits, for any k = o(log_σ u)
      • Our index is significantly smaller than any other practical secondary-memory data structure
      • The LZ-index requires more disk accesses
        • But a smaller index would have a smaller seek time

  26. Future work
      • We assumed a constant main-memory space, but…
        • To implement our index in a real practical setting
      • Handling dynamism (String B-trees require 13.5 times the text size!)
      • Direct construction on secondary storage
        • adapting [AN, ISAAC 2005] to work on disk

  27. Questions? Contact darroyue@dcc.uchile.cl gnavarro@dcc.uchile.cl

  28. Thanks! Contact darroyue@dcc.uchile.cl gnavarro@dcc.uchile.cl
