Compressed Index for Dictionary Matching • WK Hon (NTHU), TW Lam (HKU), R Shah (LSU), SL Tam (HKU), JS Vitter (Purdue)
Outline • Dictionary Matching Problem • Summary of Results • Description of Our Solution (Brief): Based on (I) Suffix Tree (II) A Simple Sampling Idea (III) Handling Irregularities • Open Problems
Dictionary Matching • Input: A set of d short patterns { P_1, P_2, …, P_d } of total length n • Problem: Preprocess the patterns and create an index so that, on receiving any text T, we can report for each P_j all positions in T where it occurs
Dictionary Matching • Relevant parameters to measure the index's performance: • d = # of patterns • n = total length of patterns • |T| = length of T • σ = size of alphabet of T and patterns • occ = total occurrences in search result
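As a point of reference for the problem statement, the following is a minimal brute-force sketch that needs no index at all. The function name and the sample text "chathave" are illustrative assumptions and are not from the slides; positions are 0-based.

```python
def naive_dictionary_match(patterns, text):
    """Report, for each pattern, every 0-based position where it occurs in text.

    Brute force: each pattern is searched independently, so this only
    specifies the expected output; it is not the indexed solution.
    """
    result = {}
    for p in patterns:
        positions = []
        start = text.find(p)
        while start != -1:
            positions.append(start)
            # Restart one character later so overlapping occurrences are found.
            start = text.find(p, start + 1)
        result[p] = positions
    return result


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
matches = naive_dictionary_match(patterns, "chathave")
```

Here `matches["chat"]` is `[0]`, `matches["hat"]` is `[1]`, and `matches["have"]` is `[4]`; the other patterns do not occur.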
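To make the parameters concrete, the sketch below computes them for the six-pattern dictionary used in the later figures. The helper name and the sample text "chathave" are assumptions for illustration only.

```python
def dictionary_parameters(patterns, text):
    """Compute d, n, |T|, the alphabet size, and occ for a dictionary and a text."""
    d = len(patterns)
    n = sum(len(p) for p in patterns)
    # Alphabet of T and the patterns together.
    alphabet = set(text).union(*(set(p) for p in patterns))
    # occ: total number of (pattern, position) pairs where the pattern occurs.
    occ = sum(
        1
        for p in patterns
        for k in range(len(text) - len(p) + 1)
        if text.startswith(p, k)
    )
    return {"d": d, "n": n, "T": len(text), "sigma": len(alphabet), "occ": occ}


params = dictionary_parameters(
    ["ate", "chair", "chat", "hat", "have", "vet"], "chathave"
)
```

For this input the result is d = 6, n = 22, |T| = 8, an alphabet of 8 characters, and occ = 3.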
Summary of Results • [Results table not preserved in this transcript; surviving annotations: "optimal"; ε = constant in (0,1); |patterns| + o(n log σ)]
Existing Solution I: Patricia Trie • Compact trie storing all d patterns • [Figure: Patricia trie for { ate, chair, chat, hat, have, vet }]
Existing Solution I: Patricia Trie • Advantage: Space: |patterns| + O(d log n) bits • Very small overhead in addition to the input patterns
Existing Solution I: Patricia Trie • Searching Strategy: for each position k in T • Match T from the root starting at k • Report occurrence of any P_j found • Disadvantage: Searching: worst-case O(|T|n + occ) time
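The searching strategy above can be sketched as follows. This is a simplification under stated assumptions: it uses a plain (uncompacted) trie built from nested dicts rather than a Patricia trie, and the function names and the `None` end-marker key are choices made here, not part of the slides. Restarting from the root at every position k is what leads to the poor worst-case search time.

```python
def build_trie(patterns):
    """Build a plain trie; the key None holds indices of patterns ending at a node."""
    root = {}
    for j, p in enumerate(patterns):
        node = root
        for ch in p:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(j)
    return root


def search_from_every_position(root, text):
    """For each position k in text, walk down from the root and report matches."""
    hits = []  # (k, j): pattern j occurs in text starting at 0-based position k
    for k in range(len(text)):
        node = root
        i = k
        while True:
            for j in node.get(None, ()):
                hits.append((k, j))
            if i == len(text) or text[i] not in node:
                break
            node = node[text[i]]
            i += 1
    return hits


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
hits = search_from_every_position(build_trie(patterns), "chathave")
```

With the sample text "chathave", `hits` contains `(0, 2)` for chat, `(1, 3)` for hat, and `(4, 4)` for have.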
Existing Solution II: Suffix Tree • Compact trie storing all suffixes of all d patterns • [Figure: suffix tree for { ate, chair, chat, hat, have, vet }]
Existing Solution II: Suffix Tree • Same Searching Strategy: for each position k in T • Match T from the root starting at k • Report occurrence of any P_j found • Matching time = O(|T|) • Searching: worst-case O(|T| + occ) time
Existing Solution II: Suffix Tree • Disadvantage: Space: O(n log n) bits • Could be much larger than O(n log σ), the space for |patterns| (σ = alphabet size)
Our Solution • No suffixes: poor searching • All suffixes: poor space • Some suffixes: good space + searching
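A rough sense of the gap between the two space bounds can be had by plugging in numbers. The sketch below ignores the constants hidden in the O-notation and rounds logarithms up to whole bits; the sizes n = 2^20 and an alphabet of 4 characters are illustrative assumptions.

```python
def space_estimates(n, alphabet_size):
    """Return (n * ceil(log2 n), n * ceil(log2 alphabet_size)) in bits.

    Constants hidden by the O-notation are ignored, so only the ratio
    between the two figures is meaningful.
    """
    bits_per_pointer = (n - 1).bit_length()             # ceil(log2 n) for n >= 2
    bits_per_symbol = (alphabet_size - 1).bit_length()  # ceil(log2 alphabet_size)
    return n * bits_per_pointer, n * bits_per_symbol


suffix_tree_bits, pattern_bits = space_estimates(2**20, 4)
ratio = suffix_tree_bits // pattern_bits
```

Under these assumptions the suffix tree bound comes to 20 bits per character against 2 bits per character for the patterns themselves, a factor of 10.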
Our Solution: Sampling • Store one suffix for every α suffixes • [Figure: sampled suffix tree with α = 2 for { ate, chair, chat, hat, have, vet }]
Our Solution: Sampling • Store one suffix for every α suffixes • [Figure: sampled suffix tree with α = 2 for { ate, chair, chat, hat, have, vet }, with the irregularities marked]
Our Solution: Sampling • Same Searching Strategy: for each position k in T • Match T from the root starting at k • Report occurrence of any P_j found • Need to handle irregularities • Matching time = O(|T|) despite irregularities
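The sampling idea can be illustrated by listing which suffixes would be kept. The sketch assumes that sampling starts at the first character of each pattern and keeps every α-th starting offset; the slides do not spell out the exact sampling rule, so this is one plausible reading, and the function name is hypothetical.

```python
def sampled_suffixes(patterns, alpha):
    """Return the distinct suffixes kept when one in every alpha is sampled.

    Assumption: offsets 0, alpha, 2*alpha, ... of each pattern are kept.
    alpha = 1 keeps all suffixes, as in the full suffix tree.
    """
    samples = set()
    for p in patterns:
        for start in range(0, len(p), alpha):
            samples.add(p[start:])
    return sorted(samples)


patterns = ["ate", "chair", "chat", "hat", "have", "vet"]
all_suffixes = sampled_suffixes(patterns, 1)
half_suffixes = sampled_suffixes(patterns, 2)
```

For this dictionary there are 17 distinct suffixes in total, and 12 remain with α = 2; every whole pattern is still present because offset 0 is always sampled.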
When sampling factor α = log_σ n • Handling irregularities: predecessor search in a set of (log n)-bit integers (Y-fast trie) • Search: O(|T| log log n + occ) time • Space: O(n log σ) bits
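Predecessor search, the operation named above, returns the largest stored key that does not exceed a query. The sketch below shows only the operation's semantics using binary search over a sorted list from Python's standard `bisect` module; it is a stand-in and not a Y-fast trie, so it takes time logarithmic in the number of keys rather than doubly logarithmic in the key universe. The function name and sample keys are assumptions.

```python
from bisect import bisect_right


def predecessor(sorted_keys, x):
    """Return the largest key <= x in sorted_keys, or None if there is none.

    Stand-in for a Y-fast trie: same answer, but found by binary search.
    """
    i = bisect_right(sorted_keys, x)
    return sorted_keys[i - 1] if i > 0 else None


keys = [3, 8, 15, 42]
```

With these keys, `predecessor(keys, 10)` returns 8, `predecessor(keys, 3)` returns 3, and `predecessor(keys, 2)` returns `None`.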
When sampling factor α = (log^(1+ε) n) / log σ • Handling irregularities: predecessor search in a set of (log^(1+ε) n)-bit strings (String B-tree) • Search: O(|T| (log^ε n + log d) + occ) time • Space: |patterns| + o(n log σ) bits
When sampling factor α = (log^(1+ε) n) / log σ • Handling irregularities: predecessor search in a set of (log^(1+ε) n)-bit strings (String B-tree) • Search: O(|T| (log^ε n + log d) + occ) time • Space: nH_k + o(n log σ) bits [FerVen 07]
Open Problems • Compressed + Dynamic Version: Can an index support updates to the set of patterns? Target: achieve an nH_k-type space bound • External Memory Version: Can an index operate in external memory and still support fast searching?