
Efficient Approximate Search on String Collections Part II

This article presents various algorithms and methods for efficient approximate search on string collections, including inverted list-based algorithms, n-gram signature algorithms, length-normalized algorithms, selectivity estimation, and more. The focus is on optimizing search efficiency and accuracy.


Presentation Transcript


  1. Efficient Approximate Search on String Collections, Part II • Marios Hadjieleftheriou • Chen Li

  2. Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

  3. N-Gram Signatures • Use string signatures that upper bound similarity • Use signatures as filtering step • Properties: • Signature has to have small size • Signature verification must be fast • False positives/False negatives • Signatures have to be “indexable”

  4. Known signatures • Minhash • Jaccard, Edit distance • Prefix filter (CGK06) • Jaccard, Edit distance • PartEnum (AGK06) • Hamming, Jaccard, Edit distance • LSH (GIM99) • Jaccard, Edit distance • Mismatch filter (XWL08) • Edit distance

  5. Prefix Filter • Bit vectors: [figure: bit vectors of q and s over the n-gram universe 1 to 14] • Mismatch vector: s: matches 6, missing 2, extra 2 • If |s ∩ q| ≥ 6 then ∀ s’ ⊆ s s.t. |s’| ≥ 3, |s’ ∩ q| ≥ 1 • For at least k matches, |s’| = l - k + 1, where l = |s|

  6. Using Prefixes • Take a random permutation of n-gram universe: 11 14 8 2 3 4 5 10 12 6 9 1 7 13 • Take prefixes from both sets: • |s’| = |q’| = 3, if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅ [figure: q and s laid out over the permuted universe]
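
The prefix test on the two slides above can be sketched as follows. This is a minimal illustration, not the authors' code: the permutation is the one shown on the slide, while the sets q and s and the helper names `prefix` and `may_match` are assumptions made up for the example.

```python
# Global order of the n-gram universe (the permutation shown on the slide).
PERMUTATION = [11, 14, 8, 2, 3, 4, 5, 10, 12, 6, 9, 1, 7, 13]
RANK = {gram: pos for pos, gram in enumerate(PERMUTATION)}


def prefix(grams, k):
    """First len(grams) - k + 1 n-grams of the set under the global order."""
    ordered = sorted(grams, key=RANK.__getitem__)
    return ordered[: max(0, len(grams) - k + 1)]


def may_match(q, s, k):
    """Filter step: False means |q ∩ s| >= k is impossible."""
    return bool(set(prefix(q, k)) & set(prefix(s, k)))


# Hypothetical sets: 8 n-grams each, 6 of them shared.
q = {3, 4, 5, 7, 8, 9, 10, 12}
s = {1, 2, 3, 4, 5, 7, 8, 9}
q_prefix = prefix(q, 6)   # 8 - 6 + 1 = 3 n-grams
s_prefix = prefix(s, 6)
```

Because both prefixes are taken under the same global order, two sets that share at least k n-grams cannot have disjoint prefixes, so only strings whose prefix overlaps q’ need to be verified.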

  7. Prefix Filter for Weighted Sets • For example: • Order n-grams by weight (new coordinate space): w1 ≥ w2 ≥ … ≥ w14 • Query: w(q ∩ s) = Σ_{i ∈ q ∩ s} wi ≥ τ • Keep prefix s’ s.t. w(s’) ≥ w(s) - α • Best case: w(q/q’ ∩ s/s’) = α • Hence, we need w(q’ ∩ s’) ≥ τ - α [figure: weight vectors of q and s over t1, t2, t4, t6, t8, t11, t14, with s split into a prefix s’ of weight w(s) - α and a suffix s/s’ of weight α]
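
A small sketch of the weighted prefix, assuming n-grams are ordered by decreasing weight and the shortest prefix with w(s’) ≥ w(s) - α is kept. The weights, the sets, and the helper name `weighted_prefix` are hypothetical.

```python
def weighted_prefix(grams, weight, alpha):
    """Shortest prefix s' (by decreasing weight) with w(s') >= w(s) - alpha."""
    ordered = sorted(grams, key=lambda g: (-weight[g], g))
    total = sum(weight[g] for g in grams)
    kept, acc = [], 0
    for g in ordered:
        if acc >= total - alpha:
            break  # the remaining suffix weighs at most alpha
        kept.append(g)
        acc += weight[g]
    return kept


# Made-up integer weights for illustration only.
WEIGHT = {"t1": 8, "t2": 6, "t4": 5, "t6": 3, "t8": 2, "t11": 1, "t14": 1}
s = {"t1", "t2", "t4", "t6", "t8"}     # w(s) = 24
q = {"t1", "t2", "t4", "t11", "t14"}   # w(q) = 21
ALPHA, TAU = 5, 15

s_prefix = weighted_prefix(s, WEIGHT, ALPHA)
q_prefix = weighted_prefix(q, WEIGHT, ALPHA)
prefix_overlap = sum(WEIGHT[g] for g in set(s_prefix) & set(q_prefix))
is_candidate = prefix_overlap >= TAU - ALPHA
```

Here the discarded suffix of s weighs exactly α = 5, which is the best case mentioned on the slide.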

  8. Prefix Filter Properties • The larger we make α, the smaller the prefix • The larger we make α, the smaller the range of thresholds we can support: • Because τ ≥ α, otherwise τ-α is negative. • We need to pre-specify minimum τ • Can apply to Jaccard, Edit Distance, IDF

  9. Other Signatures • Minhash (still to come) • PartEnum: • Upper bounds Hamming • Select multiple subsets instead of one prefix • Larger signature, but stronger guarantee • LSH: • Probabilistic with guarantees • Based on hashing • Mismatch filter: • Use positional mismatching n-grams within the prefix to attain lower bound of Edit Distance

  10. Signature Indexing • Straightforward solution: • Create an inverted index on signature n-grams • Merge inverted lists to compute signature intersections • For a given string q: • Access only lists in q’ • Find strings s with w(q’ ∩ s’) ≥ τ - α
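
The straightforward solution above might look like the sketch below: an inverted index over signature n-grams, probed only with the lists in q’. The data, weights, and function names are assumptions for illustration.

```python
from collections import defaultdict


def build_signature_index(signatures):
    """Map each signature n-gram to the ids of strings whose signature has it."""
    index = defaultdict(list)
    for sid, sig in signatures.items():
        for gram in sig:
            index[gram].append(sid)
    return index


def candidates(index, q_sig, weight, tau, alpha):
    """Ids of strings s with w(q' ∩ s') >= tau - alpha."""
    overlap = defaultdict(int)
    for gram in q_sig:                      # access only lists in q'
        for sid in index.get(gram, ()):
            overlap[sid] += weight[gram]
    return {sid for sid, w in overlap.items() if w >= tau - alpha}


# Hypothetical prefix signatures and weights.
SIGNATURES = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"d"}}
WEIGHT = {"a": 5, "b": 3, "c": 2, "d": 1}
index = build_signature_index(SIGNATURES)
result = candidates(index, {"a", "b"}, WEIGHT, tau=10, alpha=3)
```

The surviving ids are only candidates; each still has to be verified against the full strings.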

  11. The Inverted Signature Hashtable (CCVX08) • Maintain a signature vector for every n-gram • Consider prefix signatures for simplicity: • s’1={ ‘tt ’, ‘t L’}, s’2={‘t&t’, ‘t L’}, s’3=… • co-occurrence lists: ‘t L’: ‘tt ’ → ‘t&t’ → … ‘&tt’: ‘t L’ → … • Hash all n-grams (h: n-gram → [0, m]) • Convert co-occurrence lists to bit-vectors of size m

  12. Example • Hash values: lab → 5, at& → 4, t&t → 5, t L → 1, la → 0, … • Signatures: s’1 = {at&, la}, s’2 = {t&t, at&}, s’3 = {t L, at&}, s’4 = {abo, t&t}, s’5 = {t&t, la}, … • Hashtable (n-gram → bit-vector): at& → 100011, t&t → 010101, lab → …, t L → …, la → …
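
A sketch that rebuilds the bit-vectors of this example from the signatures and hash values. The hash value for ‘abo’ does not appear on the slide; 2 is an assumption inferred from the bit-vector 010101 shown for ‘t&t’. Function names are hypothetical.

```python
M = 6  # bit-vector size; hash values in the example fall in 0..5

# Hash values from the slide; 'abo' -> 2 is an inferred assumption.
H = {"lab": 5, "at&": 4, "t&t": 5, "t L": 1, "la": 0, "abo": 2}

SIGNATURES = {
    "s1": {"at&", "la"},
    "s2": {"t&t", "at&"},
    "s3": {"t L", "at&"},
    "s4": {"abo", "t&t"},
    "s5": {"t&t", "la"},
}


def build_hashtable(signatures, h):
    """For every n-gram, OR together the hashed positions of its co-occurring n-grams."""
    table = {}
    for sig in signatures.values():
        for gram in sig:
            for other in sig - {gram}:
                table[gram] = table.get(gram, 0) | (1 << h[other])
    return table


def as_bits(vector, m=M):
    """Render a bit-vector with bit m-1 on the left and bit 0 on the right."""
    return format(vector, f"0{m}b")


table = build_hashtable(SIGNATURES, H)
```

With these five signatures, ‘at&’ co-occurs with n-grams hashing to 0, 1, and 5, which gives 100011 when bit 0 is written rightmost.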

  13. Using the Hashtable? • Let list ‘at&’ correspond to bit-vector 100011 • There exists string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5 • Given query q: • Construct query signature matrix: rows are the n-grams of q’ (‘at&’: 1 1 1 0 …; ‘lab’: 1 1 0 1 …), columns are the n-grams of q (‘at&’, ‘lab’, ‘t&t’, ‘res’, …) • Consider only solid sub-matrices P: r ⊆ q’, p ⊆ q • We need to look only at r ⊆ q’ such that w(r) ≥ τ-α and w(p) ≥ τ

  14. Verification • How do we find which strings correspond to a given sub-matrix? • Create an inverted index on string n-grams • Examine only lists in r and strings with w(s) ≥ τ • Remember that r ⊆ q’ • Can be used with other signatures as well

  15. Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

  16. Length Normalized Measures • What is normalization? • Normalize similarity scores by the length of the strings. • Can result in more meaningful matches. • Can use L0 (i.e., the length of the string), L1, L2, etc. • For example L2: • Let w2(s) = Σ_{t ∈ s} w(t)² • Weight can be IDF, unary, language model, etc. • ||s||2 = w2(s)^(1/2)

  17. The L2-Length Filter (HCKS08) • Why L2? • For almost exact matches. • Two strings match only if: • They have very similar n-gram sets, and hence L2 lengths • The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).

  18. Example • “AT&T Labs – Research” → L2=100 • “ATT Labs – Research” → L2=95 • “AT&T Labs” → L2=70 • If “Research” happened to be very popular and had small weight? • “The Dark Knight” → L2=75 • “Dark Night” → L2=72

  19. Why L2 (continued) • Tight L2-based length filtering will result in very efficient pruning. • L2 yields scores bounded within [0, 1]: • 1 means a truly perfect match. • Easier to interpret scores. • L0 and L1 do not have the same properties • Scores are bounded only by the largest string length in the database. • For L0 an exact match can have score smaller than a non-exact match!

  20. Example • q={‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} → L0=5 • s1={‘ATT’} → L0=1 • s2=q → L0=5 • S(q, s1) = Σw(q ∩ s1)/(||q||0 ||s1||0) = 10/5 = 2 • S(q, s2) = Σw(q ∩ s2)/(||q||0 ||s2||0) = 40/25 < 2

  21. Problems • L2 normalization poses challenges. • For example: • S(q, s) = w2(q ∩ s)/(||q||2 ||s||2) • Prefix filter cannot be applied. • Minimum prefix weight α? • Value depends both on ||s||2 and ||q||2. • But ||q||2 is unknown at index construction time

  22. Important L2 Properties • Length filtering: • For S(q, s) ≥ τ • τ||q||2 ≤ ||s||2 ≤ ||q||2 / τ • We are only looking for strings within these lengths. • Proof in paper • Monotonicity …
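
The length filter can be sketched in a few lines. The helper names are hypothetical; the usage below plugs in the L2 lengths from the earlier “AT&T Labs” example with an assumed threshold τ = 0.8.

```python
import math


def l2_length(grams, weight):
    """||s||2: square root of the sum of squared n-gram weights."""
    return math.sqrt(sum(weight[t] ** 2 for t in grams))


def length_bounds(q_len, tau):
    """Range of ||s||2 values that can still reach S(q, s) >= tau."""
    return tau * q_len, q_len / tau


def passes_length_filter(s_len, q_len, tau):
    lo, hi = length_bounds(q_len, tau)
    return lo <= s_len <= hi


# Query length 100 ("AT&T Labs – Research" on the example slide), tau assumed.
keeps_att_labs_research = passes_length_filter(95, 100, 0.8)   # "ATT Labs – Research"
keeps_att_labs = passes_length_filter(70, 100, 0.8)            # "AT&T Labs"
```

With τ = 0.8 only strings whose L2 length lies between roughly 80 and 125 need to be considered, so the near-exact variant survives and the truncated string is pruned without computing its score.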

  23. Monotonicity • Let s={t1, t2, …, tm}. • Let pw(s, t) = w(t) / ||s||2 (partial weight of s) • Then: S(q, s) = Σ_{t ∈ q ∩ s} w(t)² / (||q||2 ||s||2) = Σ_{t ∈ q ∩ s} pw(s, t) pw(q, t) • If pw(s, t) > pw(r, t): • w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2 • Hence, for any t’ ≠ t: • w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)
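
A numeric check of the decomposition and of monotonicity, under made-up weights chosen so that the L2 lengths come out as whole numbers. All names and values are assumptions for illustration.

```python
import math


def l2_length(grams, weight):
    return math.sqrt(sum(weight[t] ** 2 for t in grams))


def pw(grams, t, weight):
    """Partial weight of n-gram t in string s: w(t) / ||s||2."""
    return weight[t] / l2_length(grams, weight)


def score(q, s, weight):
    """S(q, s) as a sum of products of partial weights over shared n-grams."""
    return sum(pw(s, t, weight) * pw(q, t, weight) for t in q & s)


WEIGHT = {"a": 3.0, "b": 4.0, "c": 12.0}
s = {"a", "b"}        # ||s||2 = 5
r = {"a", "b", "c"}   # ||r||2 = 13

# ||s||2 < ||r||2, so s has the larger partial weight for every shared n-gram.
monotone = all(pw(s, t, WEIGHT) > pw(r, t, WEIGHT) for t in s & r)
```

Since the order of partial weights is the same for every n-gram, one ordering of strings by L2 length serves all inverted lists.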

  24. Indexing • Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static • [figure: inverted lists of string ids for the 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc] • Use inverted lists sorted by pw(): • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) ⇒ • ||0||2 < ||4||2 < ||1||2 < ||2||2

  25. L2 Length Filter • Given q and τ, and using length filtering: • We examine only a small fraction of the lists [figure: the sorted inverted lists for at, ch, ck, ic, ri, st, ta, ti, tu, uc with the examined portion highlighted]
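
The two slides above can be sketched together: lists sorted by L2 length, then a binary search for the slice allowed by the length filter. The L2 values below are hypothetical, chosen only to respect the stated order ||0||2 < ||4||2 < ||1||2 < ||2||2; where string 3 falls is an assumption.

```python
from bisect import bisect_left, bisect_right

STRINGS = {0: "rich", 1: "stick", 2: "stich", 3: "stuck", 4: "static"}
# Hypothetical L2 lengths (not taken from the slides).
L2 = {0: 1.0, 4: 2.0, 1: 3.0, 2: 4.0, 3: 5.0}


def two_grams(s):
    return {s[i:i + 2] for i in range(len(s) - 1)}


def build_index(strings, lengths):
    """Inverted lists of string ids, each sorted by increasing L2 length
    (equivalently, by decreasing partial weight)."""
    index = {}
    for sid, text in strings.items():
        for gram in two_grams(text):
            index.setdefault(gram, []).append(sid)
    for postings in index.values():
        postings.sort(key=lengths.__getitem__)
    return index


def candidates(postings, lengths, q_len, tau):
    """Slice of a sorted list with tau*||q||2 <= ||s||2 <= ||q||2/tau."""
    keys = [lengths[sid] for sid in postings]
    lo = bisect_left(keys, tau * q_len)
    hi = bisect_right(keys, q_len / tau)
    return postings[lo:hi]


index = build_index(STRINGS, L2)
window = candidates(index["st"], L2, q_len=3.0, tau=0.75)
```

For a query of length 3.0 and τ = 0.75 the allowed range is [2.25, 4.0], so only the middle of the ‘st’ list is touched.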

  26. Monotonicity • If I have seen 1 already, then 4 is not in the list [figure: the sorted inverted lists for at, ch, ck, ic, ri, st, ta, ti, tu, uc, with the scan position past string 1]

  27. Other Improvements • Use properties of weighting scheme • Scan high weight lists first • Prune according to string length and maximum potential score • Ignore low weight lists altogether

  28. Conclusion • Concepts can be extended easily for: • BM25 • Weighted Jaccard • DICE • IDF • Take away message: • Properties of similarity/distance function can play big role in designing very fast indexes. • L2 super fast for almost exact matches

  29. Outline • Motivation and preliminaries • Inverted list based algorithms • Gram signature algorithms • Length-normalized measures • Selectivity estimation • Conclusion and future directions

  30. The Problem • Estimate the number of strings with: • Edit distance smaller than k from query q • Cosine similarity higher than τ to query q • Jaccard, Hamming, etc… • Issues: • Estimation accuracy • Size of estimator • Cost of estimation

  31. Motivation • Query optimization: • Selectivity of query predicates • Need to support selectivity of approximate string predicates • Visualization/Querying: • Expected result set size helps with visualization • Result set size important for remote query processing

  32. Flavors • Edit distance: • Based on clustering (JL05) • Based on min-hash (MBKS07) • Based on wild-card n-grams (LNS07) • Cosine similarity: • Based on sampling (HYKS08)

  33. Selectivity Estimation for Edit Distance • Problem: • Given query string q • Estimate number of strings s ∈ D • Such that ed(q, s) ≤ δ

  34. Sepia - Clustering (JL05, JLV08) • Partition strings using clustering: • Enables pruning of whole clusters • Store per cluster histograms: • Number of strings within edit distance 0,1,…,δ from the cluster center • Compute global dataset statistics: • Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query

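  35. Edit Vectors • Edit distance is not discriminative: • Use Edit Vectors • 3D space vs 1D space [figure: cluster Ci with pivot pi and query q, showing the strings Lucia, Luciano (edit vector <2,0,0>, distance 2), Lukas (<1,1,1>, distance 3) and Lucas (<1,1,0>, distance 2)]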

  36. Visually ... • Per-cluster frequency tables (edit vector: # of strings): • F1 (cluster C1, pivot p1): <0, 0, 0>: 4; <0, 0, 1>: 12; <1, 0, 2>: 7; … • F2 (cluster C2, pivot p2): <0, 0, 0>: 3; <0, 1, 0>: 40; <1, 0, 1>: 6; … • Fn (cluster Cn, pivot pn): <0, 0, 0>: 2; <1, 0, 2>: 84; <1, 1, 1>: 1; … • Global Table (v(q,pi), v(pi,s), ed(q,s), #, %): • <1, 0, 1>, <0, 0, 1>, 1, 1, 14 • <1, 0, 1>, <0, 0, 1>, 2, 4, 57 • <1, 0, 1>, <0, 0, 1>, 3, 7, 100 • … • <1, 1, 0>, <1, 0, 2>, 3, 21, 25 • <1, 1, 0>, <1, 0, 2>, 4, 63, 75 • <1, 1, 0>, <1, 0, 2>, 5, 84, 100 • …

  37. Selectivity Estimation • Use triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri+δ disregard cluster Ci [figure: cluster with pivot pi and radius ri, query q at distance greater than ri+δ]

  38. Selectivity Estimation • Use triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri+δ disregard cluster Ci • For all entries in frequency table: • If |v(q,pi)| + |v(pi,s)| ≤ δ then ed(q,s) ≤ δ for all s • If ||v(q,pi)| - |v(pi,s)|| > δ ignore these strings • Else use global table: • Lookup entry <v(q,pi), v(pi,s), δ> in global table • Use the estimated fraction of strings

  39. Example • F1 (edit vector: # of strings): <0, 0, 0>: 4; <0, 0, 1>: 12; <1, 0, 2>: 7; … • Global Table (v(q,pi), v(pi,s), ed(q,s), #, %): • <1, 0, 1>, <0, 0, 1>, 1, 1, 14 • <1, 0, 1>, <0, 0, 1>, 2, 4, 57 • <1, 0, 1>, <0, 0, 1>, 3, 7, 100 • … • <1, 1, 0>, <1, 0, 2>, 3, 21, 25 • <1, 1, 0>, <1, 0, 2>, 4, 63, 75 • <1, 1, 0>, <1, 0, 2>, 5, 84, 100 • … • δ = 3 • v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2> • Global lookup: [<1,1,0>, <1,0,2>, 3] • Fraction is 25% x 7 = 1.75 • Iterate through F1, and add up contributions
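
The per-cluster loop of this example can be sketched as below. It assumes |v| is the sum of the components of an edit vector and that a missing global-table entry contributes nothing; both are assumptions, as are the function names. Only the three F1 entries listed on the slide are used.

```python
def magnitude(vector):
    """|v|: total number of edit operations in an edit vector (assumed)."""
    return sum(vector)


def estimate_cluster(v_q_p, freq_table, global_table, delta):
    """Estimated number of strings in one cluster within distance delta of q."""
    total = 0.0
    d_qp = magnitude(v_q_p)
    for v_p_s, count in freq_table.items():
        d_ps = magnitude(v_p_s)
        if d_qp + d_ps <= delta:
            total += count                      # all of these strings qualify
        elif abs(d_qp - d_ps) > delta:
            continue                            # none of them can qualify
        else:
            fraction = global_table.get((v_q_p, v_p_s, delta), 0.0)
            total += fraction * count           # fall back to the global table
    return total


F1 = {(0, 0, 0): 4, (0, 0, 1): 12, (1, 0, 2): 7}
GLOBAL = {
    ((1, 1, 0), (1, 0, 2), 3): 0.25,
    ((1, 1, 0), (1, 0, 2), 4): 0.75,
    ((1, 1, 0), (1, 0, 2), 5): 1.00,
}
estimate = estimate_cluster((1, 1, 0), F1, GLOBAL, delta=3)
```

Under these assumptions the first two entries are counted in full (4 and 12 strings), and the third contributes 25% x 7 = 1.75 as on the slide.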

  40. Cons • Hard to maintain if clusters start drifting • Hard to find good number of clusters • Space/Time tradeoffs • Needs training to construct good dataset statistics table

  41. VSol – minhash (MBKS07) • Solution based on minhash • minhash is used to: • Estimate the size of a set |s| • Estimate resemblance of two sets • I.e., estimating J = |s1 ∩ s2| / |s1 ∪ s2| • Estimate the size of the union |s1 ∪ s2| • Hence, estimating the size of the intersection • |s1 ∩ s2| ≈ J~(s1, s2) · U~(s1, s2), where J~ and U~ denote the estimated resemblance and union size

  42. Minhash • Given a set s = {t1, …, tm} • Use independent hash functions h1, …, hk: • hi: n-gram → [0, 1] • Hash elements of s, k times • Keep the k elements that hashed to the smallest value each time • We reduced set s, from m to k elements • Denote minhash signature with s’

  43. How to use minhash • Given two signatures q’, s’: • J(q, s) ≈ Σ_{1≤i≤k} I{q’[i]=s’[i]} / k • |s| ≈ ( k / Σ_{1≤i≤k} s’[i] ) - 1 • (q ∪ s)’ = q’ ∪ s’ = min_{1≤i≤k}(q’[i], s’[i]) • Hence: • |q ∪ s| ≈ (k / Σ_{1≤i≤k} (q ∪ s)’[i]) - 1
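
A sketch of these estimators. It stores the minimum hash value per function rather than the element itself, and it simulates the k independent hash functions by salting BLAKE2b with the function index; both choices, the sets, and the names are assumptions for illustration.

```python
import hashlib


def h(i, element):
    """i-th hash function: maps an element to a pseudo-random value in [0, 1]."""
    digest = hashlib.blake2b(f"{i}:{element}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def minhash(elements, k):
    """Signature: the smallest hash value under each of the k functions."""
    return [min(h(i, e) for e in elements) for i in range(k)]


def jaccard_estimate(a, b):
    """Fraction of coordinates on which two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def size_estimate(sig):
    """|s| ≈ k / (sum of the k minima) - 1."""
    return len(sig) / sum(sig) - 1


def union_signature(a, b):
    """Signature of the union: coordinate-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]


K = 200
q = set(range(300))
s = set(range(150, 450))          # true Jaccard = 150 / 450
q_sig, s_sig = minhash(q, K), minhash(s, K)
union_size = size_estimate(union_signature(q_sig, s_sig))
intersection_size = jaccard_estimate(q_sig, s_sig) * union_size
```

The estimates are approximate and tighten as k grows; the last line combines resemblance and union size to estimate the intersection size.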

  44. VSol Estimator • Construct one inverted list per n-gram in D • The lists are our sets • Compute a minhash signature for each list [figure: inverted lists t1 = {1, 5, …, 14}, t2 = {3, 5, …, 25}, …, t10 = {1, 8, …, 43}, each with its minhash signature]

  45. Selectivity Estimation • Use edit distance length filter: • If ed(q, s) ≤ δ, then q and s share at least L = |s| - 1 - n (δ-1) n-grams • Given query q = {t1, …, tm}: • Answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L) • We can estimate sizes of L-intersections using minhash signatures

  46. Example • q = t1 t2 … t10 • δ = 2, n = 3 ⇒ L = 6 • Look at all 6-intersections of inverted lists • A = |∪_{i1, ..., i6 ∈ [1,10]} (ti1 ∩ ti2 ∩ … ∩ ti6)| • There are (10 choose 6) such terms [figure: inverted lists t1 = {1, 5, …, 14}, t2 = {3, 5, …, 25}, …, t10 = {1, 8, …, 43}]
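
The quantity A can be computed exactly by brute force on small inputs, which makes clear what the minhash signatures are estimating. The sketch below applies the slide's formula for L with 10 n-grams and enumerates L-intersections over made-up inverted lists; it is not the VSol estimator itself.

```python
from itertools import combinations
from math import comb


def min_shared_grams(num_grams, n, delta):
    """L = |s| - 1 - n(delta - 1), as given on the slide."""
    return num_grams - 1 - n * (delta - 1)


def union_of_l_intersections(lists, L):
    """Exact set of ids that appear in at least one L-intersection."""
    result = set()
    for combo in combinations(lists, L):
        result |= set.intersection(*combo)
    return result


L = min_shared_grams(10, n=3, delta=2)
num_terms = comb(10, L)          # number of 6-intersections to consider

# Tiny hypothetical inverted lists (sets of string ids).
LISTS = [{1, 2, 3}, {1, 2}, {1, 3}, {1}]
answer = union_of_l_intersections(LISTS, 3)
```

The brute-force version visits every combination, which is why the slides turn to signatures for realistic values of m and L.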

  47. The m-L Similarity • Can be done efficiently using minhashes • Answer: • ρ = Σ_{1≤j≤k} I{ ∃ i1, …, iL: ti1’[j] = … = tiL’[j] } • A ≈ ρ · |t1 ∪ … ∪ tm| • Proof very similar to the proof for minhashes

  48. Cons • Will overestimate results • Many L-intersections will share strings • Edit distance length filter is loose

  49. OptEQ – wild-card n-grams (LNS07) • Use extended n-grams: • Introduce wild-card symbol ‘?’ • E.g., “ab?” can be: • “aba”, “abb”, “abc”, … • Build an extended n-gram table: • Extract all 1-grams, 2-grams, …, n-grams • Generalize to extended 2-grams, …, n-grams • Maintain an extended n-grams/frequency hashtable

  50. Example • Dataset (strings): abc, def, ghi, … • n-gram table (n-gram: frequency): ab: 10; bc: 15; de: 4; ef: 1; gh: 21; hi: 2; … ?b: 13; a?: 17; ?c: 23; … abc: 5; def: 2; …
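
A sketch of how such a table could be built. It assumes the frequency of an (extended) n-gram is the number of strings that contain it, and that at least one position of each n-gram stays concrete; neither detail is stated on the slides. The dataset and names are made up.

```python
from collections import Counter
from itertools import combinations


def extended_grams(gram, wildcard="?"):
    """The n-gram itself plus every variant with some, but not all,
    positions replaced by the wildcard."""
    out = set()
    for r in range(len(gram)):
        for positions in combinations(range(len(gram)), r):
            chars = list(gram)
            for p in positions:
                chars[p] = wildcard
            out.add("".join(chars))
    return out


def build_table(strings, max_n):
    """Frequency table over 1-grams .. max_n-grams and their extensions."""
    table = Counter()
    for s in strings:
        seen = set()
        for n in range(1, max_n + 1):
            for i in range(len(s) - n + 1):
                seen |= extended_grams(s[i:i + n])
        table.update(seen)       # count each string once per n-gram
    return table


table = build_table(["abc", "abd", "xbc"], max_n=2)
```

In this toy dataset ‘ab’ occurs in two strings while the generalized ‘?b’ matches all three, mirroring how wildcard entries on the slide have higher frequencies than their concrete counterparts.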
