Dictionary Matching and Indexing with Edits and Don’t Cares
Richard Cole (NYU), Lee-Ad Gottlieb (NYU), Moshe Lewenstein (Bar-Ilan)
Pattern Matching
• Various problems of the following flavor:
• Preprocess a text $t$, or a collection of strings $d_1, \dots, d_x$, so that given a query string $p$, all matches with the text can be found quickly.
• Problem variants: indexing, dictionary queries, dictionary matching, all-to-all matching.
Pattern Matching • Dictionary queries. Dictionary: Bate, Beat, Boat, Boot. Query: Beta.
Pattern Matching • Dictionary matching. Dictionary: Bate, Beat, Boat, Boot. Text: The fish beat my boot.
Pattern Matching • Text indexing. Text: abracadabra. Query: ra (two occurrences).
Pattern Matching • All-to-all matching. First collection: Bate, Beat, Boat, Boot. Second collection: bat, boots, be.
Previous Work • Dictionary Queries. [Figure: trie over the dictionary Bate, Beat, Boat, Boot, searched with the query Beta.]
Suffix Tree • Text Indexing. [Figure: suffix tree for the text oogog, whose suffixes are oogog, ogog, gog, og, g.]
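Text indexing can be illustrated with a minimal Python sketch. It does not build a suffix tree: as a simpler stand-in it keeps all suffixes in sorted order and binary-searches them, which answers the same queries but uses quadratic space. The function names are illustrative and not from the talk.

```python
from bisect import bisect_left


def build_suffix_index(text):
    """Sorted list of (suffix, start position) pairs; a stand-in for a suffix tree."""
    return sorted((text[i:], i) for i in range(len(text)))


def occurrences(index, pattern):
    """Start positions of pattern in the indexed text."""
    # All suffixes that begin with pattern are contiguous in sorted order,
    # starting at the first suffix that is >= pattern.
    lo = bisect_left(index, (pattern, -1))
    hits = []
    for suffix, start in index[lo:]:
        if not suffix.startswith(pattern):
            break
        hits.append(start)
    return sorted(hits)


index = build_suffix_index("abracadabra")
print(occurrences(index, "ra"))  # prints [2, 9]
```

A real suffix tree stores the same suffixes in linear space by sharing common prefixes along tree paths.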
Approximate Matches
• Wildcards (don’t cares): Boat matches Bo*t
• Substitutions: Boat vs. Boot
• Edits (insertions and deletions): Boat vs. B_at
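The three notions of approximate match can be made concrete with a small Python sketch. The function names are illustrative and not from the talk, and `*` is assumed to be the wildcard symbol, as in Bo*t.

```python
def wildcard_match(pattern, word, wildcard="*"):
    """True if word matches pattern; each wildcard stands for exactly one character."""
    return len(pattern) == len(word) and all(
        p == wildcard or p == w for p, w in zip(pattern, word)
    )


def substitutions(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("substitution distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute x by y, or keep it
            ))
        prev = cur
    return prev[-1]


print(wildcard_match("Bo*t", "Boat"))  # prints True
print(substitutions("Boat", "Boot"))   # prints 1
print(edit_distance("Boat", "Bat"))    # prints 1
```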
Previous Work – Best Results
• Indexing and Dictionary Matching (edits): Buchsbaum, Goodrich, Westbrook. $k = 1$; $p \log\log n + occ$ query time; $n \log n$ space.
• Dictionary Queries (substitutions): Brodal, Gasieniec. $k = 1$; $p + occ$ query time; $n$ space.
Previous Work – Basic Intuition
• Text: abracadabra. Query: abrac*dabra
• Build a suffix tree for: abracadabra, abracadab, abracada, abracad, abraca, abrac, abra, abr, ab, a
• And for: a, ar, arb, arba, arbad, arbada, arbadac, arbadaca, arbadacar
New Results
• Indexing, Dictionary Queries, Dictionary Matches
• Substitutions, $k < \log n$: query time $p + \frac{(c_1 \log n)^k \log\log n}{k!} + occ$; space $\frac{n (c_2 \log n)^k}{k!}$
• Edits, $k < \log n$: query time $p + \frac{(c_3 \log n)^k \log\log n}{k!} + 3^k\, occ$; space $\frac{n (c_4 \log n)^k}{k!}$
• Wildcards in pattern, $k < \log n$: query time $p + \frac{2^k \log\log n}{k!} + occ$; space $n + \frac{(k + \log n)^k}{k!}$
Dictionary Wildcard Queries
Three data structures for dictionary wildcard queries:
• Naïve: $O(n)$ space, kp query time
• Less-naïve: $O(n^{1+k})$ space, $p$ query time
• New data structure: $O(n \log^k n)$ space, $2^k p$ query time
Naïve Approach • Query string: *it • [Figure: dictionary trie with levels f p s / a i a i i / r t y n t t, searched step by step with the query.] • Query time: kp
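A minimal Python sketch of the naïve approach: a plain trie with no extra structure, so a wildcard in the query forces the search into every child. The dictionary (far, fit, pay, pin, pit, sit) is an assumption read off the letters of the trie figure, and the function names are illustrative.

```python
END = ""  # key marking the end of a word; never equal to a pattern character

# Assumed dictionary, reconstructed from the trie figure.
WORDS = ["far", "fit", "pay", "pin", "pit", "sit"]


def build_trie(words):
    """Nested-dict trie; node[END] holds the word that ends at that node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def naive_wildcard_query(node, pattern, i=0):
    """All dictionary words matching pattern, where '*' matches any one character."""
    if i == len(pattern):
        return [node[END]] if END in node else []
    ch = pattern[i]
    if ch == "*":
        # No precomputed help: try every child of the current node.
        matches = []
        for label, child in node.items():
            if label != END:
                matches.extend(naive_wildcard_query(child, pattern, i + 1))
        return matches
    child = node.get(ch)
    return naive_wildcard_query(child, pattern, i + 1) if child is not None else []


trie = build_trie(WORDS)
print(sorted(naive_wildcard_query(trie, "*it")))  # prints ['fit', 'pit', 'sit']
```

The space is that of the trie alone; the cost is paid at query time, where each wildcard multiplies the number of paths being followed.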
Less-Naïve Approach • Query string: *it • [Figure: the dictionary trie (first level f, p, s) with a wildcard (*) subtree added at each node.] • Query time: $p$ • Space: $O(n^{1+k})$
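The less-naïve structure can be sketched in Python as follows. Every node gets an extra `*` edge leading to a trie over all of its children's subtrees merged together, nested up to k wildcards deep, so a query walks a single path. The dictionary is the one assumed from the trie figure, and the function names are illustrative.

```python
END = ""        # key under which a node lists the words ending there
WILDCARD = "*"  # assumed wildcard symbol

# Assumed dictionary, reconstructed from the trie figure.
WORDS = ["far", "fit", "pay", "pin", "pit", "sit"]


def _build(items, k):
    """items: (word, remaining suffix) pairs; k: wildcards still allowed below here."""
    node, groups = {}, {}
    for word, rest in items:
        if rest:
            groups.setdefault(rest[0], []).append((word, rest[1:]))
        else:
            node.setdefault(END, []).append(word)
    for ch, group in groups.items():
        node[ch] = _build(group, k)
    if k > 0 and groups:
        # The wildcard subtree skips one character of every string below this node.
        merged = [item for group in groups.values() for item in group]
        node[WILDCARD] = _build(merged, k - 1)
    return node


def build_wildcard_trie(words, k):
    """Trie supporting query patterns with at most k wildcards."""
    return _build([(w, w) for w in words], k)


def query(root, pattern):
    """Follow one path; a '*' in the pattern simply takes the '*' edge."""
    node = root
    for ch in pattern:
        node = node.get(ch)
        if node is None:
            return []
    return list(node.get(END, []))


root = build_wildcard_trie(WORDS, k=1)
print(sorted(query(root, "*it")))  # prints ['fit', 'pit', 'sit']
```

The query never branches, but each wildcard subtree duplicates everything below its node, which is where the $O(n^{1+k})$ space comes from. Patterns with more than k wildcards find no `*` edge and return no matches in this sketch.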
New Approach • Query string: *it • [Figure: the dictionary trie (first level f, p, s) with a smaller wildcard (*) subtree added, searched step by step with the query.] • Query time: $2^k p$
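A minimal Python sketch of the new structure, assuming "heaviest" means the child with the most dictionary words below it. The wildcard subtree at a node merges every child except the heaviest one, so a wildcard in the query splits the search in two: into the heaviest child and into the wildcard subtree. The dictionary is the one assumed from the trie figure, and the class and function names are illustrative.

```python
WILDCARD = "*"  # assumed wildcard symbol

# Assumed dictionary, reconstructed from the trie figure.
WORDS = ["far", "fit", "pay", "pin", "pit", "sit"]


class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.words = []     # words ending at this node
        self.heavy = None   # label of the heaviest child
        self.wild = None    # trie over all non-heaviest children, merged


def _build(items, k):
    """items: (word, remaining suffix) pairs; k: wildcards still allowed below here."""
    node, groups = Node(), {}
    for word, rest in items:
        if rest:
            groups.setdefault(rest[0], []).append((word, rest[1:]))
        else:
            node.words.append(word)
    for ch, group in groups.items():
        node.children[ch] = _build(group, k)
    if groups:
        node.heavy = max(groups, key=lambda ch: len(groups[ch]))
        light = [it for ch, g in groups.items() if ch != node.heavy for it in g]
        if k > 0 and light:
            node.wild = _build(light, k - 1)
    return node


def build(words, k):
    """Structure supporting query patterns with at most k wildcards."""
    return _build([(w, w) for w in words], k)


def query(node, pattern, i=0):
    if node is None:
        return []
    if i == len(pattern):
        return list(node.words)
    ch = pattern[i]
    if ch != WILDCARD:
        return query(node.children.get(ch), pattern, i + 1)
    # A wildcard costs exactly two branches, however many children there are.
    heavy_child = node.children.get(node.heavy)
    return query(heavy_child, pattern, i + 1) + query(node.wild, pattern, i + 1)


root = build(WORDS, k=1)
print(sorted(query(root, "*it")))  # prints ['fit', 'pit', 'sit']
```

Each of the k wildcards doubles the number of paths, which matches the $2^k p$ query time on the slide. In this sketch, patterns with more than k wildcards are not supported and may miss matches.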
Space Analysis
• Create a wildcard subtree at each node in the original trie.
• The heaviest child is not in the wildcard tree.
• Look at any leaf of the trie.
• How many of its ancestors were not the heaviest child? At most $\log_2 n$.
• So it appears in at most $\log n$ wildcard trees.
• Space: $n \log n$ for one wildcard, $n \log^k n$ for $k$ wildcards.
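The counting argument can be checked on a small example. Each step into a non-heaviest child at least halves the number of words below, so no word can take more than $\log_2 n$ such steps. The Python sketch below counts those steps per word; the dictionary is the one assumed from the trie figure, and the function name is illustrative.

```python
import math

# Assumed dictionary, reconstructed from the trie figure.
WORDS = ["far", "fit", "pay", "pin", "pit", "sit"]


def light_depths(words):
    """Map each word to the number of non-heaviest edges on its root-to-leaf path."""
    depths = {}

    def walk(group, pos, light):
        # group: all words that share the same first `pos` characters
        buckets = {}
        for w in group:
            if len(w) == pos:
                depths[w] = light
            else:
                buckets.setdefault(w[pos], []).append(w)
        if not buckets:
            return
        heavy = max(buckets, key=lambda ch: len(buckets[ch]))
        for ch, bucket in buckets.items():
            walk(bucket, pos + 1, light + (ch != heavy))

    walk(sorted(set(words)), 0, 0)
    return depths


depths = light_depths(WORDS)
# Every word sits below at most log2(n) non-heaviest edges.
assert max(depths.values()) <= math.log2(len(depths))
```

With one wildcard, the count for a word is the number of wildcard trees it is copied into, so summing the counts over all words bounds the extra space by $n \log n$.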
Edit Distance
• Wildcard matching is (algorithmically) the simplest type of approximate search.
• What issues come up when dealing with substitutions, insertions and deletions?
Substitution Search • Query string: aab • [Figure: trie over strings of a's and b's, searched step by step with the query.]
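A substitution search over a plain trie can be sketched in Python as below. This is the baseline with no extra structure, not the substitution tree from the talk: at each position the search may follow the matching child for free or any other child at the cost of one substitution. The dictionary here is hypothetical, since the slide's trie cannot be recovered from the text, and the function names are illustrative.

```python
END = ""  # key marking the end of a word; never equal to a pattern character

# Hypothetical dictionary over the alphabet {a, b}.
WORDS = ["aaa", "aba", "abb", "baa", "bbb"]


def build_trie(words):
    """Nested-dict trie; node[END] holds the word that ends at that node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def substitution_query(node, pattern, k, i=0):
    """Words of the same length as pattern that differ from it in at most k positions."""
    if i == len(pattern):
        return [node[END]] if END in node else []
    matches = []
    for label, child in node.items():
        if label == END:
            continue
        cost = 0 if label == pattern[i] else 1  # following a different letter is a substitution
        if cost <= k:
            matches.extend(substitution_query(child, pattern, k - cost, i + 1))
    return matches


trie = build_trie(WORDS)
print(sorted(substitution_query(trie, "aab", k=1)))  # prints ['aaa', 'abb']
```

Unlike a wildcard, a substitution can occur at any position of the query, so the search has to consider branching at every step rather than only at marked positions.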
Substitution Tree • Query string: aab • [Figure: substitution tree over strings of a's and b's.]