Efficient Approximate Search on String Collections, Part II Marios Hadjieleftheriou Chen Li
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
N-Gram Signatures • Use string signatures that upper bound similarity • Use signatures as a filtering step • Properties: • Signatures must have small size • Signature verification must be fast • False positives are allowed (removed during verification), false negatives are not • Signatures have to be “indexable”
Known signatures • Minhash • Jaccard, Edit distance • Prefix filter (CGK06) • Jaccard, Edit distance • PartEnum (AGK06) • Hamming, Jaccard, Edit distance • LSH (GIM99) • Jaccard, Edit distance • Mismatch filter (XWL08) • Edit distance
Prefix Filter • Bit vectors over the n-gram universe (figure: q and s share grams 3, 4, 5, 7, 8, 9, 10, 12, 13; grams 1, 2, 6, 11, 14 occur in only one of the two) • Mismatch vector: s has 6 matches, 2 missing, 2 extra • If |s ∩ q| ≥ 6 then every s’ ⊆ s with |s’| ≥ 3 satisfies |s’ ∩ q| ≥ 1 • In general, to guarantee at least k matches for a set of size l, take |s’| = l − k + 1
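The pigeonhole guarantee above can be checked with a small sketch. The gram IDs below are made up for illustration (not the ones from the figure): s has l = 8 grams, at least k = 6 of which match q, so any subset of size l − k + 1 = 3 must hit q.

```python
from itertools import combinations

# Illustrative sets: s has l = 8 n-grams, 6 of which also occur in q.
s = {3, 4, 5, 7, 8, 9, 6, 11}            # 6 matches plus extras 6, 11
q = {1, 2, 3, 4, 5, 7, 8, 9, 10, 12}

l, k = len(s), 6
prefix_len = l - k + 1                    # |s'| = l - k + 1 = 3

# If |s ∩ q| >= k, any subset s' of s with |s'| >= l - k + 1 must
# intersect q: s' can contain at most l - k = 2 non-matching grams.
assert len(s & q) >= k
for s_prime in combinations(s, prefix_len):
    assert set(s_prime) & q

# A shorter subset gives no guarantee: {6, 11} misses q entirely.
assert not ({6, 11} & q)
```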
Using Prefixes • Take a random permutation of the n-gram universe (figure: grams reordered as 11, 14, 8, 2, 3, 4, 5, 10, 12, 6, 9, 1, 7, 13) • Take prefixes from both sets: • |s’| = |q’| = 3; if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅
Prefix Filter for Weighted Sets • Order n-grams by weight w1 ≥ w2 ≥ … ≥ w14 (a new coordinate space; figure: q and s as weight-ordered vectors, with prefix s’ of weight at least w(s) − α) • Query: w(q ∩ s) = Σ_{i ∈ q∩s} wi ≥ τ • Keep a prefix s’ s.t. w(s’) ≥ w(s) − α • Best case: w((q ∖ q’) ∩ (s ∖ s’)) = α • Hence, we need w(q’ ∩ s’) ≥ τ − α
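A minimal sketch of the prefix construction, with invented n-grams and weights: sort the grams by descending weight and keep them until the dropped suffix weighs at most α.

```python
# Sketch of the weighted prefix filter: keep the heaviest n-grams of s
# until the weight left out is at most alpha, so w(s') >= w(s) - alpha.
# N-grams and weights below are illustrative assumptions.
def weighted_prefix(grams, w, alpha):
    """Return the shortest prefix s' (by descending weight) with
    w(s') >= w(s) - alpha, i.e. the dropped suffix weighs <= alpha."""
    order = sorted(grams, key=lambda t: w[t], reverse=True)
    total = sum(w[t] for t in grams)
    prefix, kept = [], 0.0
    for t in order:
        if kept >= total - alpha:
            break
        prefix.append(t)
        kept += w[t]
    return prefix

w = {'at&': 5.0, 't&t': 4.0, 't L': 3.0, 'lab': 2.0, 'abs': 1.0}
s = list(w)
p = weighted_prefix(s, w, alpha=3.0)     # drop at most weight 3
assert sum(w[t] for t in p) >= sum(w.values()) - 3.0
```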
Prefix Filter Properties • The larger we make α, the smaller the prefix • The larger we make α, the smaller the range of thresholds we can support: • We need τ ≥ α, otherwise τ − α is negative • We need to pre-specify a minimum τ • Can apply to Jaccard, Edit Distance, IDF
Other Signatures • Minhash (still to come) • PartEnum: • Upper bounds Hamming distance • Selects multiple subsets instead of one prefix • Larger signature, but stronger guarantee • LSH: • Probabilistic with guarantees • Based on hashing • Mismatch filter: • Uses positional mismatching n-grams within the prefix to obtain a lower bound on edit distance
Signature Indexing • Straightforward solution: • Create an inverted index on signature n-grams • Merge inverted lists to compute signature intersections • For a given string q: • Access only lists in q’ • Find strings s with w(q’ ∩ s’) ≥ τ - α
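The merge step above can be sketched with a toy signature index. String ids, n-grams, weights, τ and α are all illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical filtering step: index only the signature (prefix) n-grams,
# then merge the lists touched by the query prefix q' and keep strings
# whose accumulated shared prefix weight reaches tau - alpha.
def build_signature_index(prefixes):
    index = defaultdict(list)            # n-gram -> list of string ids
    for sid, s_prime in prefixes.items():
        for gram in s_prime:
            index[gram].append(sid)
    return index

def candidates(index, q_prime, w, tau, alpha):
    acc = defaultdict(float)             # string id -> w(q' ∩ s')
    for gram in q_prime:
        for sid in index.get(gram, ()):
            acc[sid] += w[gram]
    return {sid for sid, v in acc.items() if v >= tau - alpha}

w = {'at&': 5.0, 't&t': 4.0, 't L': 3.0}
index = build_signature_index({0: {'at&', 't&t'}, 1: {'t L'}, 2: {'at&'}})
# tau = 8, alpha = 2: survivors need shared prefix weight >= 6
assert candidates(index, {'at&', 't&t'}, w, tau=8.0, alpha=2.0) == {0}
```

The survivors are only candidates; their full n-gram sets still go through verification.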
The Inverted Signature Hashtable (CCVX08) • Maintain a signature vector for every n-gram • Consider prefix signatures for simplicity: • s’1 = {‘tt ’, ‘t L’}, s’2 = {‘t&t’, ‘t L’}, s’3 = … • Co-occurrence lists: ‘t L’ → ‘tt ’, ‘t&t’, …; ‘&tt’ → ‘t L’, … • Hash all n-grams (h: n-gram → [0, m]) • Convert the co-occurrence lists to bit-vectors of size m
Example (figure): each n-gram (lab, at&, t&t, t L, la, …) belongs to a prefix signature (s’1, …, s’5), has a hash value in [0, 5], and has a co-occurrence list (e.g. t&t and la); the resulting hashtable stores one bit-vector per n-gram, e.g. at& → 100011, t&t → 010101
Using the Hashtable • Let list ‘at&’ correspond to bit-vector 100011: then there exists a string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5 • Given query q: • Construct the query signature matrix (figure: one row per n-gram of q, one column per hash bucket; a solid sub-matrix has row set r and column set p) • Consider only solid sub-matrices P with r ⊆ q’ and p ⊆ q • We need to look only at r ⊆ q’ such that w(r) ≥ τ − α and w(p) ≥ τ
Verification • How do we find which strings correspond to a given sub-matrix? • Create an inverted index on string n-grams • Examine only lists in r, and only strings with w(s) ≥ τ • Remember that r ⊆ q’ • Can be used with other signatures as well
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
Length Normalized Measures • What is normalization? • Normalize similarity scores by the length of the strings • Can result in more meaningful matches • Can use L0 (i.e., the length of the string), L1, L2, etc. • For example L2: • Let w2(s) = Σ_{t∈s} w(t)² • Weight can be IDF, unary, language model, etc. • ||s||2 = w2(s)^1/2
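A minimal sketch of these two definitions, with made-up n-grams and weights:

```python
import math

# Sketch of the L2 length of an n-gram set: w2(s) = sum of squared
# weights, ||s||_2 = sqrt(w2(s)). Weights are illustrative IDF-like values.
def w2(grams, w):
    return sum(w[t] ** 2 for t in grams)

def l2_length(grams, w):
    return math.sqrt(w2(grams, w))

w = {'ATT': 3.0, 'TT ': 4.0}
assert l2_length({'ATT', 'TT '}, w) == 5.0   # sqrt(9 + 16)
```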
The L2-Length Filter (HCKS08) • Why L2? • For almost exact matches. • Two strings match only if: • They have very similar n-gram sets, and hence L2 lengths • The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).
Example • “AT&T Labs – Research” L2=100 • “ATT Labs – Research” L2=95 • “AT&T Labs” L2=70 • What if “Research” happened to be very popular and thus had a small weight? • “The Dark Knight” L2=75 • “Dark Night” L2=72
Why L2 (continued) • Tight L2-based length filtering results in very efficient pruning • L2 yields scores bounded within [0, 1]: • 1 means a truly perfect match • Easier to interpret scores • L0 and L1 do not have the same properties: • Scores are bounded only by the largest string length in the database • For L0, an exact match can have a score smaller than a non-exact match!
Example • q = {‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} L0=5 • s1 = {‘ATT’} L0=1 • s2 = q L0=5 • S(q, s1) = Σ w(q ∩ s1)/(||q||0 ||s1||0) = 10/5 = 2 • S(q, s2) = Σ w(q ∩ s2)/(||q||0 ||s2||0) = 40/25 < 2
Problems • L2 normalization poses challenges • For example: • S(q, s) = w2(q ∩ s)/(||q||2 ||s||2) • The prefix filter cannot be applied directly: • What minimum prefix weight α? • Its value depends on both ||s||2 and ||q||2 • But ||q||2 is unknown at index construction time
Important L2 Properties • Length filtering: • For S(q, s) ≥ τ: τ·||q||2 ≤ ||s||2 ≤ ||q||2 / τ • We are only looking for strings within these lengths • Proof in paper • Monotonicity …
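The length filter can be sketched directly from the bounds above (the lengths and τ below are illustrative):

```python
# Sketch of L2 length filtering: for S(q, s) >= tau it must hold that
# tau * ||q||_2 <= ||s||_2 <= ||q||_2 / tau, so strings outside this
# window can be skipped without scoring them.
def length_window(q_len, tau):
    return tau * q_len, q_len / tau

def survives(s_len, q_len, tau):
    lo, hi = length_window(q_len, tau)
    return lo <= s_len <= hi

q_len, tau = 10.0, 0.5
assert length_window(q_len, tau) == (5.0, 20.0)
assert survives(12.0, q_len, tau)
assert not survives(4.0, q_len, tau)   # too short to reach tau
assert not survives(25.0, q_len, tau)  # too long to reach tau
```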
Monotonicity • Let s = {t1, t2, …, tm} • Let pw(s, t) = w(t) / ||s||2 (the partial weight of t in s) • Then: S(q, s) = Σ_{t ∈ q∩s} w(t)² / (||q||2 ||s||2) = Σ_{t ∈ q∩s} pw(s, t) · pw(q, t) • If pw(s, t) > pw(r, t): • w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2 • Hence, for any other t’: • w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)
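The rewrite of S(q, s) in terms of partial weights can be checked numerically (n-grams and weights are made up):

```python
import math

# Sketch of the monotonicity rewrite: S(q, s) equals the sum over shared
# n-grams of pw(q, t) * pw(s, t), where pw(x, t) = w(t) / ||x||_2.
def l2(grams, w):
    return math.sqrt(sum(w[t] ** 2 for t in grams))

def pw(grams, w, t):
    return w[t] / l2(grams, w)

def score(q, s, w):
    return sum(pw(q, w, t) * pw(s, w, t) for t in q & s)

w = {'ric': 2.0, 'ich': 3.0, 'sti': 1.0}
q, s = {'ric', 'ich'}, {'ric', 'ich', 'sti'}
direct = sum(w[t] ** 2 for t in q & s) / (l2(q, w) * l2(s, w))
assert abs(score(q, s, w) - direct) < 1e-12
```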
Indexing • Dataset (figure): strings 0: rich, 1: stick, 2: stich, 3: stuck, 4: static, with 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc • Use inverted lists sorted by pw(): • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) • because ||0||2 < ||4||2 < ||1||2 < ||2||2
L2 Length Filter • Given q and τ, and using length filtering (figure: within each inverted list at, ch, …, uc, only the ids whose L2 length falls inside the allowed window are scanned) • We examine only a small fraction of the lists
Monotonicity • Lists are sorted by pw(), i.e., by increasing L2 length • So scans can stop early: if I have seen 1 already, then 4 is not in the list (||4||2 < ||1||2, so 4 would have appeared before 1)
Other Improvements • Use properties of weighting scheme • Scan high weight lists first • Prune according to string length and maximum potential score • Ignore low weight lists altogether
Conclusion • Concepts can be extended easily to: • BM25 • Weighted Jaccard • Dice • IDF • Take-away message: • Properties of the similarity/distance function can play a big role in designing very fast indexes • L2 is super fast for almost exact matches
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram signature algorithms • Length-normalized measures • Selectivity estimation • Conclusion and future directions
The Problem • Estimate the number of strings with: • Edit distance smaller than k from query q • Cosine similarity higher than τ to query q • Jaccard, Hamming, etc… • Issues: • Estimation accuracy • Size of estimator • Cost of estimation
Motivation • Query optimization: • Selectivity of query predicates • Need to support selectivity of approximate string predicates • Visualization/Querying: • Expected result set size helps with visualization • Result set size important for remote query processing
Flavors • Edit distance: • Based on clustering (JL05) • Based on min-hash (MBKS07) • Based on wild-card n-grams (LNS07) • Cosine similarity: • Based on sampling (HYKS08)
Selectivity Estimation for Edit Distance • Problem: • Given query string q • Estimate the number of strings s ∈ D • Such that ed(q, s) ≤ δ
Sepia - Clustering (JL05, JLV08) • Partition strings using clustering: • Enables pruning of whole clusters • Store per cluster histograms: • Number of strings within edit distance 0,1,…,δ from the cluster center • Compute global dataset statistics: • Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query
Edit Vectors • Edit distance alone is not discriminative: • Use edit vectors (one coordinate per edit-operation type) • A 3D space vs a 1D space (figure: cluster Ci with center pi and query q; members Lucas <1,1,0> with ed 2, Luciano <2,0,0> with ed 2, Lukas <1,1,1> with ed 3, and Lucia)
Visually (figure): each cluster Ci with center pi stores a frequency table Fi from edit vectors to counts (e.g. F1: <0,0,0> → 4, <0,0,1> → 12, <1,0,2> → 7, …; F2: <0,0,0> → 3, <0,1,0> → 40, <1,0,1> → 6, …; Fn: <0,0,0> → 2, <1,0,2> → 84, <1,1,1> → 1, …); a global table maps triples (v(q,pi), v(pi,s), ed(q,s)) to counts and cumulative percentages (e.g. <1,0,1>, <0,0,1>: ed 1 → 1 (14%), ed 2 → 4 (57%), ed 3 → 7 (100%); <1,1,0>, <1,0,2>: ed 3 → 21 (25%), ed 4 → 63 (75%), ed 5 → 84 (100%))
Selectivity Estimation • Use the triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri + δ, disregard cluster Ci (figure: query q at distance more than ri + δ from center pi)
Selectivity Estimation • Use the triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri + δ, disregard cluster Ci • For all entries in the frequency table: • If |v(q,pi)| + |v(pi,s)| ≤ δ then ed(q,s) ≤ δ for all such s • If | |v(q,pi)| − |v(pi,s)| | > δ, ignore these strings • Else use the global table: • Look up entry <v(q,pi), v(pi,s), δ> in the global table • Use the estimated fraction of strings
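The cluster-pruning test can be sketched with plain edit distances (cluster centers, radii, and strings below are invented; the actual method stores edit vectors, of which the scalar distance is the sum):

```python
# Sketch of the triangle-inequality pruning: skip any cluster whose
# center p_i satisfies ed(q, p_i) > r_i + delta, since no member can
# then be within delta of q.
def ed(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

clusters = [('lucas', 1), ('stick', 1)]   # (center p_i, radius r_i)
q, delta = 'lukas', 1
kept = [p for p, r in clusters if ed(q, p) <= r + delta]
assert kept == ['lucas']                   # 'stick' is safely pruned
```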
Example • δ = 3 • v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2> • Global table lookup: [<1,1,0>, <1,0,2>, 3] → 25% • F1 contains 7 strings with v(p1,s) = <1,0,2>, so the contribution is 25% × 7 = 1.75 • Iterate through F1 and add up the contributions
Cons • Hard to maintain if clusters start drifting • Hard to find good number of clusters • Space/Time tradeoffs • Needs training to construct good dataset statistics table
VSol – minhash (MBKS07) • Solution based on minhash • Minhash is used to: • Estimate the size of a set |s| • Estimate the resemblance of two sets • I.e., estimate J = |s1 ∩ s2| / |s1 ∪ s2| • Estimate the size of the union |s1 ∪ s2| • Hence, also the size of the intersection: • |s1 ∩ s2| ≈ J(s1, s2) · |s1 ∪ s2|
Minhash • Given a set s = {t1, …, tm} • Use independent hash functions h1, …, hk: • hi: n-gram → [0, 1] • Hash the elements of s with each of the k functions • For each function, keep the element that hashed to the smallest value • This reduces set s from m to k elements • Denote the minhash signature by s’
How to use minhash • Given two signatures q’, s’: • J(q, s) ≈ Σ_{1≤i≤k} I{q’[i] = s’[i]} / k • |s| ≈ ( k / Σ_{1≤i≤k} s’[i] ) − 1 • (q ∪ s)’ = min(q’, s’), i.e., (q ∪ s)’[i] = min(q’[i], s’[i]) for each i • Hence: • |q ∪ s| ≈ ( k / Σ_{1≤i≤k} (q ∪ s)’[i] ) − 1
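A sketch of these estimators, using seeded MD5 hashes as stand-ins for the independent hash functions hi (the 3-gram sets are illustrative; the tolerances are loose because the estimates are probabilistic):

```python
import hashlib

# Sketch of minhash signatures and the estimators above.
def h(i, gram):
    """Deterministic stand-in for hash function h_i: n-gram -> [0, 1)."""
    digest = hashlib.md5(f"{i}:{gram}".encode()).hexdigest()
    return int(digest, 16) / 16 ** 32

def signature(grams, k):
    # s'[i] = smallest hash value under h_i over the set
    return [min(h(i, g) for g in grams) for i in range(k)]

q = {'ric', 'ich', 'ch ', 'sti', 'tic'}   # made-up 3-gram sets
s = {'ric', 'ich', 'ch ', 'stu', 'tuc'}
k = 200
sq, ss = signature(q, k), signature(s, k)

# J(q, s) ~ fraction of coordinates where the signatures agree
est_j = sum(a == b for a, b in zip(sq, ss)) / k
assert abs(est_j - len(q & s) / len(q | s)) < 0.15

# (q ∪ s)' = coordinate-wise min; |q ∪ s| ~ k / sum - 1
union_sig = [min(a, b) for a, b in zip(sq, ss)]
est_union = k / sum(union_sig) - 1
assert abs(est_union - len(q | s)) < 2.5
```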
VSol Estimator • Construct one inverted list per n-gram in D (figure: lists for t1, t2, …, t10 holding string ids such as 1, 3, 5, 8, …, 14, 25, 43) • The lists are our sets • Compute a minhash signature for each list
Selectivity Estimation • Use the edit distance length filter: • If ed(q, s) ≤ δ, then q and s share at least L = |s| − 1 − n(δ − 1) n-grams • Given query q = {t1, …, tm}: • The answer is the size of the union of all non-empty L-intersections of the query grams’ lists (binomial coefficient: m choose L of them) • We can estimate the sizes of the L-intersections using the minhash signatures
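The bound L can be sketched directly from the formula, consistent with the example values δ = 2, n = 3, L = 6 used on the example slide:

```python
# Sketch of the edit-distance length filter used by VSol: strings within
# edit distance delta of q must share at least L = |s| - 1 - n*(delta - 1)
# n-grams, so only L-intersections of the query lists can contribute.
def min_shared_grams(size, n, delta):
    return size - 1 - n * (delta - 1)

assert min_shared_grams(10, 3, 2) == 6    # matches L = 6 in the example
```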
Example • δ = 2, n = 3 ⇒ L = 6 • Look at all 6-intersections of the inverted lists (figure: lists for q = t1, t2, …, t10) • A = |∪_{i1, …, i6 ∈ [1,10]} (t_i1 ∩ t_i2 ∩ … ∩ t_i6)| • There are (10 choose 6) such terms
The m-L Similarity • Can be computed efficiently using minhashes • Answer: • ρ = Σ_{1≤j≤k} I{∃ i1, …, iL: t_i1’[j] = … = t_iL’[j]} • A ≈ (ρ / k) · |t1 ∪ … ∪ tm| • Proof very similar to the proof for minhashes
Cons • Will overestimate results • Many L-intersections will share strings • Edit distance length filter is loose
OptEQ – wild-card n-grams (LNS07) • Use extended n-grams: • Introduce wild-card symbol ‘?’ • E.g., “ab?” can be: • “aba”, “abb”, “abc”, … • Build an extended n-gram table: • Extract all 1-grams, 2-grams, …, n-grams • Generalize to extended 2-grams, …, n-grams • Maintain an extended n-grams/frequency hashtable
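Extraction of extended 2-grams with single wild-card positions can be sketched as follows (matching entries like ?b and a? in the example table; only one ‘?’ per gram here):

```python
# Sketch of extended (wild-card) 2-gram extraction: every regular 2-gram
# plus variants with one position replaced by '?'.
def extended_2grams(s):
    grams = set()
    for i in range(len(s) - 1):
        g = s[i:i + 2]
        grams.add(g)                 # regular 2-gram
        grams.add('?' + g[1])        # wild-card in first position
        grams.add(g[0] + '?')        # wild-card in second position
    return grams

assert extended_2grams('abc') == {'ab', 'bc', '?b', 'a?', '?c', 'b?'}
```

Frequencies of these extended grams over the whole dataset would then be accumulated into the hashtable described above.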
Example • Dataset strings: abc, def, ghi, … • Extended n-gram table (n-gram → frequency): • ab → 10, bc → 15, de → 4, ef → 1, gh → 21, hi → 2, … • ?b → 13, a? → 17, ?c → 23, … • abc → 5, def → 2, …