This article presents various algorithms and methods for efficient approximate search on string collections, including inverted list-based algorithms, n-gram signature algorithms, length-normalized algorithms, selectivity estimation, and more. The focus is on optimizing search efficiency and accuracy.
Efficient Approximate Search on String Collections, Part II • Marios Hadjieleftheriou • Chen Li
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
N-Gram Signatures • Use string signatures that upper bound similarity • Use signatures as filtering step • Properties: • Signature has to have small size • Signature verification must be fast • False positives/False negatives • Signatures have to be “indexable”
Known signatures • Minhash • Jaccard, Edit distance • Prefix filter (CGK06) • Jaccard, Edit distance • PartEnum (AGK06) • Hamming, Jaccard, Edit distance • LSH (GIM99) • Jaccard, Edit distance • Mismatch filter (XWL08) • Edit distance
Prefix Filter • Bit vectors: [Figure: bit vectors of q and s over n-grams 1 to 14] • Mismatch vector: s: matches 6, missing 2, extra 2 • If |s ∩ q| ≥ 6 then ∀ s’ ⊆ s s.t. |s’| ≥ 3, |s’ ∩ q| ≥ 1 • For at least k matches, |s’| = l - k + 1, where l = |s| (here l = 8, k = 6)
Using Prefixes • Take a random permutation of n-gram universe: 11 14 8 2 3 4 5 10 12 6 9 1 7 13 • Take prefixes from both sets: • |s’| = |q’| = 3, if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅ • [Figure: bit vectors of q and s under the permuted order]
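The prefix idea can be sketched in a few lines of Python. This is a minimal illustration, assuming n-grams are integer ids and both sets are ranked by one shared random permutation; the names `prefix` and `may_match` and the two example sets are hypothetical, not taken from the slides.

```python
import random

def prefix(grams, order, k):
    """First |grams| - k + 1 elements of grams under the shared global order."""
    ranked = sorted(grams, key=order.__getitem__)
    return set(ranked[: max(len(grams) - k + 1, 0)])

def may_match(q, s, order, k):
    """Filter step: if |q ∩ s| >= k, the two prefixes must share an element."""
    return bool(prefix(q, order, k) & prefix(s, order, k))

universe = list(range(1, 15))      # n-gram ids 1..14
perm = universe[:]
random.Random(0).shuffle(perm)     # fixed seed only for reproducibility
order = {g: rank for rank, g in enumerate(perm)}

# Hypothetical sets of 8 n-grams each with |q ∩ s| = 6.
q = {3, 4, 5, 7, 8, 9, 10, 12}
s = {3, 4, 5, 7, 8, 9, 13, 14}

print(len(prefix(s, order, k=6)))  # 3, i.e. l - k + 1 with l = 8
print(may_match(q, s, order, k=6)) # True for any permutation
```

The filter can produce false positives (prefixes that intersect although the full overlap is below k), so surviving pairs still need verification.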
Prefix Filter for Weighted Sets • For example: • Order n-grams by weight (new coordinate space): w1 ≥ w2 ≥ … ≥ w14 • Query: w(q ∩ s) = Σi∈q∩s wi ≥ τ • Keep prefix s’ s.t. w(s’) ≥ w(s) - α • Best case: w(q/q’ ∩ s/s’) = α • Hence, we need w(q’ ∩ s’) ≥ τ - α • [Figure: weight vectors of q and s over n-grams t1, t2, t4, t6, t8, t11, t14; s is split into a prefix s’ of weight w(s) - α and a remainder s/s’ of weight α]
Prefix Filter Properties • The larger we make α, the smaller the prefix • The larger we make α, the smaller the range of thresholds we can support: • Because τ ≥ α, otherwise τ - α is negative. • We need to pre-specify minimum τ • Can apply to Jaccard, Edit Distance, IDF
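A small sketch of how a weighted prefix could be chosen, assuming n-grams are ordered by decreasing weight and the prefix is the shortest one with w(s’) ≥ w(s) - α. The function name and the weights are made up for illustration.

```python
def weighted_prefix(grams, weight, alpha):
    """Shortest prefix s' (heaviest n-grams first) with w(s') >= w(s) - alpha."""
    ranked = sorted(grams, key=lambda g: (-weight[g], g))  # global order by weight
    total = sum(weight[g] for g in grams)
    kept, acc = [], 0.0
    for g in ranked:
        if acc >= total - alpha:
            break
        kept.append(g)
        acc += weight[g]
    return kept

# Hypothetical weights; w(s) = 15.
weight = {"t1": 5.0, "t2": 4.0, "t4": 3.0, "t6": 2.0, "t8": 1.0}
s = ["t1", "t2", "t4", "t6", "t8"]

print(weighted_prefix(s, weight, alpha=3.0))  # ['t1', 't2', 't4']
print(weighted_prefix(s, weight, alpha=6.0))  # ['t1', 't2']
```

The two calls show the trade-off stated above: a larger α gives a shorter prefix, but α has to stay at or below the smallest threshold τ that will be queried.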
Other Signatures • Minhash (still to come) • PartEnum: • Upper bounds Hamming • Select multiple subsets instead of one prefix • Larger signature, but stronger guarantee • LSH: • Probabilistic with guarantees • Based on hashing • Mismatch filter: • Use positional mismatching n-grams within the prefix to attain lower bound of Edit Distance
Signature Indexing • Straightforward solution: • Create an inverted index on signature n-grams • Merge inverted lists to compute signature intersections • For a given string q: • Access only lists in q’ • Find strings s with w(q’ ∩ s’) ≥ τ - α
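The straightforward solution can be sketched as follows, assuming prefix signatures are stored as sets of n-grams and list merging is done with a simple score accumulator. The function names, weights, and thresholds are illustrative assumptions.

```python
from collections import defaultdict

def build_signature_index(signatures):
    """Map each signature n-gram to the ids of strings whose signature contains it."""
    index = defaultdict(list)
    for sid, sig in signatures.items():
        for g in sig:
            index[g].append(sid)
    return index

def candidates(index, q_sig, weight, tau, alpha):
    """Merge only the lists of q' and keep ids with w(q' ∩ s') >= tau - alpha."""
    score = defaultdict(float)
    for g in q_sig:
        for sid in index.get(g, ()):
            score[sid] += weight[g]
    return sorted(sid for sid, w in score.items() if w >= tau - alpha)

# Two toy prefix signatures and hypothetical weights.
signatures = {1: {"tt ", "t L"}, 2: {"t&t", "t L"}}
weight = {"tt ": 3.0, "t&t": 3.0, "t L": 2.0}
index = build_signature_index(signatures)

print(candidates(index, {"t&t", "t L"}, weight, tau=6.0, alpha=2.0))  # [2]
```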
The Inverted Signature Hashtable (CCVX08) • Maintain a signature vector for every n-gram • Consider prefix signatures for simplicity: • s’1={ ‘tt ’, ‘t L’}, s’2={‘t&t’, ‘t L’}, s’3=… • co-occurrence lists: ‘t L’: ‘tt ’ ‘t&t’ … ‘&tt’: ‘t L’ … • Hash all n-grams (h: n-gram → [0, m]) • Convert co-occurrence lists to bit-vectors of size m
Example • Hash: lab → 5, at& → 4, t&t → 5, t L → 1, la → 0, … • Signatures: s’1 = {at&, la}, s’2 = {t&t, at&}, s’3 = {t L, at&}, s’4 = {abo, t&t}, s’5 = {t&t, la}, … • Hashtable: at& → 100011, t&t → 010101, lab → …, t L → …, la → …
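The example can be reproduced with a short sketch, assuming bit i (counted from the right) is set in an n-gram's vector when some co-occurring n-gram hashes to i. The hash value of ‘abo’ is not shown on the slide; the value 2 used here is an assumption chosen because it is consistent with the vector for ‘t&t’.

```python
def build_signature_hashtable(signatures, h, m):
    """For each n-gram, OR together the hashed positions of its co-occurring n-grams."""
    table = {}
    for sig in signatures:
        for g in sig:
            for other in sig:
                if other != g:
                    table[g] = table.get(g, 0) | (1 << h[other])
    # Render each vector as an m-bit string, bit 0 on the right.
    return {g: format(bits, f"0{m}b") for g, bits in table.items()}

# Hash values from the example; 'abo' -> 2 is an assumption.
h = {"lab": 5, "at&": 4, "t&t": 5, "t L": 1, "la": 0, "abo": 2}
signatures = [
    {"at&", "la"}, {"t&t", "at&"}, {"t L", "at&"},
    {"abo", "t&t"}, {"t&t", "la"},
]
table = build_signature_hashtable(signatures, h, m=6)

print(table["at&"])  # 100011
print(table["t&t"])  # 010101
```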
Using the Hashtable? • Let list ‘at&’ correspond to bit-vector 100011 • There exists string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5 • Given query q: • Construct query signature matrix: [Figure: matrix with one row per n-gram of q’ (at&, lab, …) and one column per n-gram of q (at&, lab, t&t, res, …); r and p mark the rows and columns of a solid sub-matrix] • Consider only solid sub-matrices P: r ⊆ q’, p ⊆ q • We need to look only at r ⊆ q’ such that w(r) ≥ τ - α and w(p) ≥ τ
Verification • How do we find which strings correspond to a given sub-matrix? • Create an inverted index on string n-grams • Examine only lists in r and strings with w(s) ≥ τ • Remember that r ⊆ q’ • Can be used with other signatures as well
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
Length Normalized Measures • What is normalization? • Normalize similarity scores by the length of the strings. • Can result in more meaningful matches. • Can use L0 (i.e., the length of the string), L1, L2, etc. • For example L2: • Let w2(s) = Σt∈s w(t)^2 • Weight can be IDF, unary, language model, etc. • ||s||2 = w2(s)^(1/2)
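As a quick illustration, the L2 length is the square root of the summed squared n-gram weights. The sketch below assumes 3-grams and made-up weights; `ngrams` and `l2_length` are hypothetical helper names.

```python
import math

def ngrams(s, n=3):
    """All overlapping n-grams of s (no padding)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def l2_length(grams, weight):
    """||s||_2 = sqrt(sum of w(t)^2 over the distinct n-grams t of s)."""
    return math.sqrt(sum(weight[t] ** 2 for t in set(grams)))

weight = {"att": 3.0, "tt ": 4.0}          # made-up weights
print(ngrams("att "))                      # ['att', 'tt ']
print(l2_length(ngrams("att "), weight))   # 5.0
```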
The L2-Length Filter (HCKS08) • Why L2? • For almost exact matches. • Two strings match only if: • They have very similar n-gram sets, and hence L2 lengths • The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).
Example • “AT&T Labs – Research” L2=100 • “ATT Labs – Research” L2=95 • “AT&T Labs” L2=70 • If “Research” happened to be very popular and had small weight? • “The Dark Knight” L2=75 • “Dark Night” L2=72
Why L2 (continued) • Tight L2-based length filtering will result in very efficient pruning. • L2 yields scores bounded within [0, 1]: • 1 means a truly perfect match. • Easier to interpret scores. • L0 and L1 do not have the same properties • Scores are bounded only by the largest string length in the database. • For L0 an exact match can have score smaller than a non-exact match!
Example • q={‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} L0=5 • s1={‘ATT’} L0=1 • s2=q L0=5 • S(q, s1) = Σw(q ∩ s1)/(||q||0 ||s1||0) = 10/5 = 2 • S(q, s2) = Σw(q ∩ s2)/(||q||0 ||s2||0) = 40/25 < 2
Problems • L2 normalization poses challenges. • For example: • S(q, s) = w2(q ∩ s)/(||q||2 ||s||2) • Prefix filter cannot be applied. • Minimum prefix weight α? • Value depends both on ||s||2 and ||q||2. • But ||q||2 is unknown at index construction time.
Important L2 Properties • Length filtering: • For S(q, s) ≥ τ • τ ||q||2 ≤ ||s||2 ≤ ||q||2 / τ • We are only looking for strings within these lengths. • Proof in paper • Monotonicity …
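The length filter can be sketched as a simple range check. The L2 values below are the illustrative ones from the earlier example slide; the query length of 100 and τ = 0.9 are assumptions made for this sketch.

```python
def length_bounds(q_len, tau):
    """Strings with S(q, s) >= tau must satisfy tau*||q|| <= ||s|| <= ||q||/tau."""
    if not 0 < tau <= 1:
        raise ValueError("tau must be in (0, 1]")
    return tau * q_len, q_len / tau

# L2 lengths from the earlier example slide.
lengths = {
    "AT&T Labs - Research": 100,
    "ATT Labs - Research": 95,
    "AT&T Labs": 70,
    "The Dark Knight": 75,
    "Dark Night": 72,
}

lo, hi = length_bounds(q_len=100, tau=0.9)
survivors = [s for s, length in lengths.items() if lo <= length <= hi]
print(survivors)  # ['AT&T Labs - Research', 'ATT Labs - Research']
```

With lists sorted by length, the same bounds translate into a contiguous range of each inverted list, which is what makes the filter cheap to apply.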
Monotonicity • Let s={t1, t2, …, tm}. • Let pw(s, t) = w(t) / ||s||2 (partial weight of s) • Then: S(q, s) = Σt∈q∩s w(t)^2 / (||q||2 ||s||2) = Σt∈q∩s pw(s, t) pw(q, t) • If pw(s, t) > pw(r, t): • w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2 • Hence, for any t’ ≠ t: • w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)
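The decomposition of S(q, s) into partial weights can be checked numerically. The sketch below assumes unary weights (w(t) = 1) and 2-gram sets for "rich" and "stich"; the helper names are hypothetical.

```python
import math

def l2(grams, w):
    """||s||_2 for a set of n-grams."""
    return math.sqrt(sum(w[t] ** 2 for t in grams))

def pw(grams, t, w):
    """Partial weight of n-gram t in string s: w(t) / ||s||_2."""
    return w[t] / l2(grams, w)

def similarity(q, s, w):
    """S(q, s) = sum over t in q ∩ s of pw(s, t) * pw(q, t)."""
    return sum(pw(s, t, w) * pw(q, t, w) for t in q & s)

rich = {"ri", "ic", "ch"}
stich = {"st", "ti", "ic", "ch"}
w = {t: 1.0 for t in rich | stich}   # unary weights, an assumption

print(similarity(rich, stich, w))    # 2 / (sqrt(3) * 2)
print(similarity(rich, rich, w))     # a perfect match scores 1 (up to rounding)

# Monotonicity: the shorter string has the larger partial weight for every shared n-gram.
assert l2(rich, w) < l2(stich, w)
assert all(pw(rich, t, w) > pw(stich, t, w) for t in rich & stich)
```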
Indexing • Strings (id: string): 0: rich, 1: stick, 2: stich, 3: stuck, 4: static • 2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc • Use inverted lists sorted by pw(): • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) • ||0||2 < ||4||2 < ||1||2 < ||2||2 • [Figure: one inverted list of string ids per 2-gram, each sorted by decreasing pw()]
L2 Length Filter • Given q and τ, and using length filtering: • We examine only a small fraction of the lists • [Figure: inverted lists for 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc]
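Monotonicity • If I have seen 1 already, then 4 is not in the list (lists are sorted by increasing L2 length and ||4||2 < ||1||2, so 4 would have appeared before 1) • [Figure: inverted lists for 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc]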
Other Improvements • Use properties of weighting scheme • Scan high weight lists first • Prune according to string length and maximum potential score • Ignore low weight lists altogether
Conclusion • Concepts can be extended easily for: • BM25 • Weighted Jaccard • DICE • IDF • Take away message: • Properties of similarity/distance function can play big role in designing very fast indexes. • L2 super fast for almost exact matches
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram signature algorithms • Length-normalized measures • Selectivity estimation • Conclusion and future directions
The Problem • Estimate the number of strings with: • Edit distance smaller than k from query q • Cosine similarity higher than τ to query q • Jaccard, Hamming, etc… • Issues: • Estimation accuracy • Size of estimator • Cost of estimation
Motivation • Query optimization: • Selectivity of query predicates • Need to support selectivity of approximate string predicates • Visualization/Querying: • Expected result set size helps with visualization • Result set size important for remote query processing
Flavors • Edit distance: • Based on clustering (JL05) • Based on min-hash (MBKS07) • Based on wild-card n-grams (LNS07) • Cosine similarity: • Based on sampling (HYKS08)
Selectivity Estimation for Edit Distance • Problem: • Given query string q • Estimate number of strings s ∈ D • Such that ed(q, s) ≤ δ
Sepia - Clustering (JL05, JLV08) • Partition strings using clustering: • Enables pruning of whole clusters • Store per cluster histograms: • Number of strings within edit distance 0,1,…,δ from the cluster center • Compute global dataset statistics: • Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query
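Edit Vectors • Edit distance is not discriminative: • Use Edit Vectors • 3D space vs 1D space • [Figure: cluster Ci with center pi and query q; strings annotated with their edit vector and edit distance relative to Lucia: Luciano <2,0,0> 2, Lukas <1,1,1> 3, Lucas <1,1,0> 2]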
Visually ... • Per-cluster frequency tables (edit vector: #): • F1 (cluster C1, center p1): <0, 0, 0>: 4; <0, 0, 1>: 12; <1, 0, 2>: 7; … • F2 (cluster C2, center p2): <0, 0, 0>: 3; <0, 1, 0>: 40; <1, 0, 1>: 6; … • Fn (cluster Cn, center pn): <0, 0, 0>: 2; <1, 0, 2>: 84; <1, 1, 1>: 1; … • Global Table (v(q,pi), v(pi,s), ed(q,s), #, %): • <1, 0, 1>, <0, 0, 1>, 1, 1, 14 • <1, 0, 1>, <0, 0, 1>, 2, 4, 57 • <1, 0, 1>, <0, 0, 1>, 3, 7, 100 • … • <1, 1, 0>, <1, 0, 2>, 3, 21, 25 • <1, 1, 0>, <1, 0, 2>, 4, 63, 75 • <1, 1, 0>, <1, 0, 2>, 5, 84, 100 • …
Selectivity Estimation • Use triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri + δ disregard cluster Ci • [Figure: cluster Ci with center pi and radius ri, and query q with radius δ]
Selectivity Estimation • Use triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri + δ disregard cluster Ci • For all entries in frequency table: • If |v(q,pi)| + |v(pi,s)| ≤ δ then ed(q,s) ≤ δ for all s • If ||v(q,pi)| - |v(pi,s)|| > δ ignore these strings • Else use global table: • Lookup entry <v(q,pi), v(pi,s), δ> in global table • Use the estimated fraction of strings
Example • δ = 3 • v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2> • Global lookup: [<1,1,0>, <1,0,2>, 3] • Fraction is 25% x 7 = 1.75 • Iterate through F1, and add up contributions • F1 (edit vector: #): <0, 0, 0>: 4; <0, 0, 1>: 12; <1, 0, 2>: 7; … • Global Table (v(q,pi), v(pi,s), ed(q,s), #, %): • <1, 0, 1>, <0, 0, 1>, 1, 1, 14 • <1, 0, 1>, <0, 0, 1>, 2, 4, 57 • <1, 0, 1>, <0, 0, 1>, 3, 7, 100 • … • <1, 1, 0>, <1, 0, 2>, 3, 21, 25 • <1, 1, 0>, <1, 0, 2>, 4, 63, 75 • <1, 1, 0>, <1, 0, 2>, 5, 84, 100 • …
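The per-cluster estimation loop might look roughly like the sketch below. It uses only the three F1 rows and the single global-table fraction visible in the example; treating a missing global-table entry as 0 and the function names are assumptions of this sketch, not part of Sepia as published.

```python
def norm(v):
    """|v|: total number of edit operations in an edit vector."""
    return sum(v)

def estimate_cluster(v_qp, freq, global_table, delta):
    """Estimated number of strings in one cluster within edit distance delta of q."""
    total = 0.0
    for v_ps, count in freq.items():
        if norm(v_qp) + norm(v_ps) <= delta:
            total += count                      # triangle inequality: all qualify
        elif abs(norm(v_qp) - norm(v_ps)) > delta:
            continue                            # none of these strings can qualify
        else:
            # Fall back to the trained fraction; a missing entry counts as 0 (assumption).
            total += global_table.get((v_qp, v_ps, delta), 0.0) * count
    return total

F1 = {(0, 0, 0): 4, (0, 0, 1): 12, (1, 0, 2): 7}       # only the rows shown
global_table = {((1, 1, 0), (1, 0, 2), 3): 0.25}        # the 25% entry

print(estimate_cluster((1, 1, 0), F1, global_table, delta=3))  # 17.75
```

Here the first two rows are counted in full (4 + 12) because |v(q,p1)| + |v(p1,s)| ≤ 3, and the third row contributes the 25% x 7 = 1.75 from the global lookup.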
Cons • Hard to maintain if clusters start drifting • Hard to find good number of clusters • Space/Time tradeoffs • Needs training to construct good dataset statistics table
VSol – minhash (MBKS07) • Solution based on minhash • minhash is used to: • Estimate the size of a set |s| • Estimate resemblance of two sets • I.e., estimating the size of J = |s1 ∩ s2| / |s1 ∪ s2| • Estimate the size of the union |s1 ∪ s2| • Hence, estimating the size of the intersection • |s1 ∩ s2| ≈ J~(s1, s2) · ∪~(s1, s2), where J~ and ∪~ denote the estimated resemblance and the estimated union size
Minhash • Given a set s = {t1, …, tm} • Use independent hash functions h1, …, hk: • hi: n-gram → [0, 1] • Hash elements of s, k times • Keep the k elements that hashed to the smallest value each time • We reduced set s, from m to k elements • Denote minhash signature with s’
How to use minhash • Given two signatures q’, s’: • J(q, s) ≈ Σ1≤i≤k I{q’[i]=s’[i]} / k • |s| ≈ ( k / Σ1≤i≤k s’[i] ) - 1 • (q ∪ s)’ = q’ ∪ s’ = min1≤i≤k(q’[i], s’[i]) • Hence: • |q ∪ s| ≈ (k / Σ1≤i≤k (q ∪ s)’[i]) - 1
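The three estimators can be tried out with a short sketch. It assumes the signature stores the minimum hash value per hash function (rather than the element itself) and simulates the k independent hash functions by salting SHA-256 with the function index; both choices, and all names below, are assumptions made for illustration.

```python
import hashlib

def h(i, token):
    """i-th hash function: maps a token to a pseudo-random value in [0, 1]."""
    digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def minhash(tokens, k):
    """Signature s': the minimum hash value under each of the k hash functions."""
    return [min(h(i, t) for t in tokens) for i in range(k)]

def jaccard_estimate(a, b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def size_estimate(sig):
    """|s| ≈ k / (sum of the k minimum hash values) - 1."""
    return len(sig) / sum(sig) - 1

def union_signature(a, b):
    """Signature of the union: position-wise minimum."""
    return [min(x, y) for x, y in zip(a, b)]

k = 512
q = {f"g{i}" for i in range(0, 200)}
s = {f"g{i}" for i in range(100, 300)}      # true Jaccard = 100 / 300
q_sig, s_sig = minhash(q, k), minhash(s, k)

print(jaccard_estimate(q_sig, s_sig))                 # close to 0.33
print(size_estimate(q_sig))                           # close to 200
print(size_estimate(union_signature(q_sig, s_sig)))   # close to 300
```

The estimates tighten as k grows; the union signature, by contrast, is exact: it equals the signature computed directly on q ∪ s.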
VSol Estimator • Construct one inverted list per n-gram in D • The lists are our sets • Compute a minhash signature for each list • [Figure: inverted lists of string ids for n-grams t1, t2, …, t10, each with its minhash signature]
Selectivity Estimation • Use edit distance length filter: • If ed(q, s) ≤ δ, then q and s share at least L = |s| - 1 - n (δ-1) n-grams • Given query q = {t1, …, tm}: • Answer is the size of the union of all non-empty L-intersections (binomial coefficient: m choose L) • We can estimate sizes of L-intersections using minhash signatures
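To make the target quantity concrete, the sketch below computes it exactly (without minhash) on toy inverted lists. It enumerates all L-intersections and also shows an equivalent formulation: a string is in the union exactly when it appears in at least L of the lists. The list contents and function names are made up; the bound L uses the formula as stated above.

```python
from itertools import combinations

def shared_grams_bound(num_grams, n, delta):
    """L = |s| - 1 - n(δ - 1), as stated for the edit distance length filter."""
    return num_grams - 1 - n * (delta - 1)

def union_of_l_intersections(lists, L):
    """Exact size of the union of all L-way intersections of the given lists."""
    result = set()
    for combo in combinations(lists, L):
        result |= set.intersection(*combo)
    return len(result)

def at_least_l_lists(lists, L):
    """Same quantity, obtained by counting in how many lists each id occurs."""
    counts = {}
    for lst in lists:
        for sid in lst:
            counts[sid] = counts.get(sid, 0) + 1
    return sum(1 for c in counts.values() if c >= L)

print(shared_grams_bound(10, n=3, delta=2))   # 6

lists = [{1, 3}, {1, 5}, {1, 3, 5}, {5, 8}]   # toy inverted lists of string ids
print(union_of_l_intersections(lists, 2))     # 3
print(at_least_l_lists(lists, 2))             # 3
```

The brute-force enumeration touches (m choose L) combinations, which is why an estimator over signatures is attractive.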
Example • δ = 2, n = 3 ⇒ L = 6 • Look at all 6-intersections of inverted lists • A = |∪i1, ..., i6 ∈ [1,10] (ti1 ∩ ti2 ∩ … ∩ ti6)| • There are (10 choose 6) such terms • [Figure: inverted lists of string ids for the n-grams of q = t1 t2 … t10]
The m-L Similarity • Can be done efficiently using minhashes • Answer: • ρ = Σ1≤j≤k I{∃ i1, …, iL: ti1’[j] = … = tiL’[j]} • A ≈ ρ |t1 ∪ … ∪ tm| • Proof very similar to the proof for minhashes
Cons • Will overestimate results • Many L-intersections will share strings • Edit distance length filter is loose
OptEQ – wild-card n-grams (LNS07) • Use extended n-grams: • Introduce wild-card symbol ‘?’ • E.g., “ab?” can be: • “aba”, “abb”, “abc”, … • Build an extended n-gram table: • Extract all 1-grams, 2-grams, …, n-grams • Generalize to extended 2-grams, …, n-grams • Maintain an extended n-grams/frequency hashtable
Example • Dataset (string): abc, def, ghi, … • n-gram table (n-gram: frequency): ab: 10, bc: 15, de: 4, ef: 1, gh: 21, hi: 2, …, ?b: 13, a?: 17, ?c: 23, …, abc: 5, def: 2, …
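Building such a table could be sketched as below. The slides do not say whether frequencies count occurrences or distinct strings, nor whether all-wildcard patterns are stored; this sketch counts every occurrence and skips all-wildcard patterns, and both choices, like the function name, are assumptions.

```python
from collections import Counter
from itertools import product

def extended_ngram_table(strings, n, wildcard="?"):
    """Count every 1..n-gram plus its wildcard generalizations (2-grams and longer)."""
    table = Counter()
    for s in strings:
        for size in range(1, n + 1):
            for i in range(len(s) - size + 1):
                gram = s[i:i + size]
                table[gram] += 1
                if size < 2:
                    continue
                for mask in product((False, True), repeat=size):
                    # Skip the plain gram and the all-wildcard pattern.
                    if any(mask) and not all(mask):
                        ext = "".join(wildcard if m else c for c, m in zip(gram, mask))
                        table[ext] += 1
    return table

table = extended_ngram_table(["abc", "abd"], n=2)
print(table["ab"], table["a?"], table["?b"])  # 2 2 2
print(table["b?"], table["?c"], table["?d"])  # 2 1 1
```

As in the example table, a wildcard entry is never less frequent than any plain n-gram it generalizes.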