Efficient Approximate Search on String Collections, Part II Marios Hadjieleftheriou Chen Li
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
N-Gram Signatures • Use string signatures that upper bound similarity • Use signatures as a filtering step • Properties: • Signatures must have small size • Signature verification must be fast • False positives are allowed (removed during verification), false negatives are not • Signatures have to be “indexable”
Known signatures • Minhash • Jaccard, Edit distance • Prefix filter (CGK06) • Jaccard, Edit distance • PartEnum (AGK06) • Hamming, Jaccard, Edit distance • LSH (GIM99) • Jaccard, Edit distance • Mismatch filter (XWL08) • Edit distance
Prefix Filter • Bit vectors over the n-gram universe (figure: q and s share grams 3, 4, 5, 7, 8, 9, 10, 12, 13; grams 1, 2, 6, 11, 14 occur in only one of the two) • Mismatch vector: s has 6 matches, 2 missing, 2 extra • If |s ∩ q| ≥ 6 then every s’ ⊆ s with |s’| ≥ 3 satisfies |s’ ∩ q| ≥ 1 • In general, to guarantee at least k matches for a set of size l, take |s’| = l − k + 1
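The pigeonhole guarantee above can be checked with a small sketch. The gram IDs below are made up for illustration (not the ones from the figure): s has l = 8 grams, at least k = 6 of which match q, so any subset of size l − k + 1 = 3 must hit q.

```python
from itertools import combinations

# Illustrative sets: s has l = 8 n-grams, 6 of which also occur in q.
s = {3, 4, 5, 7, 8, 9, 6, 11}            # 6 matches plus extras 6, 11
q = {1, 2, 3, 4, 5, 7, 8, 9, 10, 12}

l, k = len(s), 6
prefix_len = l - k + 1                    # |s'| = l - k + 1 = 3

# If |s ∩ q| >= k, any subset s' of s with |s'| >= l - k + 1 must
# intersect q: s' can contain at most l - k = 2 non-matching grams.
assert len(s & q) >= k
for s_prime in combinations(s, prefix_len):
    assert set(s_prime) & q

# A shorter subset gives no guarantee: {6, 11} misses q entirely.
assert not ({6, 11} & q)
```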
Using Prefixes • Take a random permutation of the n-gram universe (figure: grams reordered as 11, 14, 8, 2, 3, 4, 5, 10, 12, 6, 9, 1, 7, 13) • Take prefixes from both sets: • |s’| = |q’| = 3; if |s ∩ q| ≥ 6 then s’ ∩ q’ ≠ ∅
Prefix Filter for Weighted Sets • Order n-grams by weight w1 ≥ w2 ≥ … ≥ w14 (a new coordinate space; figure: q and s as weight-ordered vectors, with prefix s’ of weight at least w(s) − α) • Query: w(q ∩ s) = Σ_{i ∈ q∩s} wi ≥ τ • Keep a prefix s’ s.t. w(s’) ≥ w(s) − α • Best case: w((q ∖ q’) ∩ (s ∖ s’)) = α • Hence, we need w(q’ ∩ s’) ≥ τ − α
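A minimal sketch of the prefix construction, with invented n-grams and weights: sort the grams by descending weight and keep them until the dropped suffix weighs at most α.

```python
# Sketch of the weighted prefix filter: keep the heaviest n-grams of s
# until the weight left out is at most alpha, so w(s') >= w(s) - alpha.
# N-grams and weights below are illustrative assumptions.
def weighted_prefix(grams, w, alpha):
    """Return the shortest prefix s' (by descending weight) with
    w(s') >= w(s) - alpha, i.e. the dropped suffix weighs <= alpha."""
    order = sorted(grams, key=lambda t: w[t], reverse=True)
    total = sum(w[t] for t in grams)
    prefix, kept = [], 0.0
    for t in order:
        if kept >= total - alpha:
            break
        prefix.append(t)
        kept += w[t]
    return prefix

w = {'at&': 5.0, 't&t': 4.0, 't L': 3.0, 'lab': 2.0, 'abs': 1.0}
s = list(w)
p = weighted_prefix(s, w, alpha=3.0)     # drop at most weight 3
assert sum(w[t] for t in p) >= sum(w.values()) - 3.0
```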
Prefix Filter Properties • The larger we make α, the smaller the prefix • The larger we make α, the smaller the range of thresholds we can support: • We need τ ≥ α, otherwise τ − α is negative • We need to pre-specify a minimum τ • Can apply to Jaccard, Edit Distance, IDF
Other Signatures • Minhash (still to come) • PartEnum: • Upper bounds Hamming distance • Selects multiple subsets instead of one prefix • Larger signature, but stronger guarantee • LSH: • Probabilistic with guarantees • Based on hashing • Mismatch filter: • Uses positional mismatching n-grams within the prefix to obtain a lower bound on edit distance
Signature Indexing • Straightforward solution: • Create an inverted index on signature n-grams • Merge inverted lists to compute signature intersections • For a given string q: • Access only lists in q’ • Find strings s with w(q’ ∩ s’) ≥ τ - α
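The merge step above can be sketched with a toy signature index. String ids, n-grams, weights, τ and α are all illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical filtering step: index only the signature (prefix) n-grams,
# then merge the lists touched by the query prefix q' and keep strings
# whose accumulated shared prefix weight reaches tau - alpha.
def build_signature_index(prefixes):
    index = defaultdict(list)            # n-gram -> list of string ids
    for sid, s_prime in prefixes.items():
        for gram in s_prime:
            index[gram].append(sid)
    return index

def candidates(index, q_prime, w, tau, alpha):
    acc = defaultdict(float)             # string id -> w(q' ∩ s')
    for gram in q_prime:
        for sid in index.get(gram, ()):
            acc[sid] += w[gram]
    return {sid for sid, v in acc.items() if v >= tau - alpha}

w = {'at&': 5.0, 't&t': 4.0, 't L': 3.0}
index = build_signature_index({0: {'at&', 't&t'}, 1: {'t L'}, 2: {'at&'}})
# tau = 8, alpha = 2: survivors need shared prefix weight >= 6
assert candidates(index, {'at&', 't&t'}, w, tau=8.0, alpha=2.0) == {0}
```

The survivors are only candidates; their full n-gram sets still go through verification.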
The Inverted Signature Hashtable (CCVX08) • Maintain a signature vector for every n-gram • Consider prefix signatures for simplicity: • s’1 = {‘tt ’, ‘t L’}, s’2 = {‘t&t’, ‘t L’}, s’3 = … • Co-occurrence lists: ‘t L’ → ‘tt ’, ‘t&t’, …; ‘&tt’ → ‘t L’, … • Hash all n-grams (h: n-gram → [0, m]) • Convert the co-occurrence lists to bit-vectors of size m
Example (figure): each n-gram (lab, at&, t&t, t L, la, …) belongs to a prefix signature (s’1, …, s’5), has a hash value in [0, 5], and has a co-occurrence list (e.g. t&t and la); the resulting hashtable stores one bit-vector per n-gram, e.g. at& → 100011, t&t → 010101
Using the Hashtable • Let list ‘at&’ correspond to bit-vector 100011: then there exists a string s s.t. ‘at&’ ∈ s’ and s’ also contains some n-grams that hash to 0, 1, or 5 • Given query q: • Construct the query signature matrix (figure: one row per n-gram of q, one column per hash bucket; a solid sub-matrix has row set r and column set p) • Consider only solid sub-matrices P with r ⊆ q’ and p ⊆ q • We need to look only at r ⊆ q’ such that w(r) ≥ τ − α and w(p) ≥ τ
Verification • How do we find which strings correspond to a given sub-matrix? • Create an inverted index on string n-grams • Examine only lists in r, and only strings with w(s) ≥ τ • Remember that r ⊆ q’ • Can be used with other signatures as well
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram Signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
Length Normalized Measures • What is normalization? • Normalize similarity scores by the length of the strings • Can result in more meaningful matches • Can use L0 (i.e., the length of the string), L1, L2, etc. • For example L2: • Let w2(s) = Σ_{t∈s} w(t)² • Weight can be IDF, unary, language model, etc. • ||s||2 = w2(s)^1/2
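A minimal sketch of these two definitions, with made-up n-grams and weights:

```python
import math

# Sketch of the L2 length of an n-gram set: w2(s) = sum of squared
# weights, ||s||_2 = sqrt(w2(s)). Weights are illustrative IDF-like values.
def w2(grams, w):
    return sum(w[t] ** 2 for t in grams)

def l2_length(grams, w):
    return math.sqrt(w2(grams, w))

w = {'ATT': 3.0, 'TT ': 4.0}
assert l2_length({'ATT', 'TT '}, w) == 5.0   # sqrt(9 + 16)
```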
The L2-Length Filter (HCKS08) • Why L2? • For almost exact matches. • Two strings match only if: • They have very similar n-gram sets, and hence L2 lengths • The “extra” n-grams have truly insignificant weights in aggregate (hence, resulting in similar L2 lengths).
Example • “AT&T Labs – Research” L2=100 • “ATT Labs – Research” L2=95 • “AT&T Labs” L2=70 • What if “Research” happened to be very popular and thus had a small weight? • “The Dark Knight” L2=75 • “Dark Night” L2=72
Why L2 (continued) • Tight L2-based length filtering results in very efficient pruning • L2 yields scores bounded within [0, 1]: • 1 means a truly perfect match • Easier to interpret scores • L0 and L1 do not have the same properties: • Scores are bounded only by the largest string length in the database • For L0, an exact match can have a score smaller than a non-exact match!
Example • q = {‘ATT’, ‘TT ’, ‘T L’, ‘LAB’, ‘ABS’} L0=5 • s1 = {‘ATT’} L0=1 • s2 = q L0=5 • S(q, s1) = Σ w(q ∩ s1)/(||q||0 ||s1||0) = 10/5 = 2 • S(q, s2) = Σ w(q ∩ s2)/(||q||0 ||s2||0) = 40/25 < 2
Problems • L2 normalization poses challenges • For example: • S(q, s) = w2(q ∩ s)/(||q||2 ||s||2) • The prefix filter cannot be applied directly: • What minimum prefix weight α? • Its value depends on both ||s||2 and ||q||2 • But ||q||2 is unknown at index construction time
Important L2 Properties • Length filtering: • For S(q, s) ≥ τ: τ·||q||2 ≤ ||s||2 ≤ ||q||2 / τ • We are only looking for strings within these lengths • Proof in paper • Monotonicity …
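The length filter can be sketched directly from the bounds above (the lengths and τ below are illustrative):

```python
# Sketch of L2 length filtering: for S(q, s) >= tau it must hold that
# tau * ||q||_2 <= ||s||_2 <= ||q||_2 / tau, so strings outside this
# window can be skipped without scoring them.
def length_window(q_len, tau):
    return tau * q_len, q_len / tau

def survives(s_len, q_len, tau):
    lo, hi = length_window(q_len, tau)
    return lo <= s_len <= hi

q_len, tau = 10.0, 0.5
assert length_window(q_len, tau) == (5.0, 20.0)
assert survives(12.0, q_len, tau)
assert not survives(4.0, q_len, tau)   # too short to reach tau
assert not survives(25.0, q_len, tau)  # too long to reach tau
```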
Monotonicity • Let s = {t1, t2, …, tm} • Let pw(s, t) = w(t) / ||s||2 (the partial weight of t in s) • Then: S(q, s) = Σ_{t ∈ q∩s} w(t)² / (||q||2 ||s||2) = Σ_{t ∈ q∩s} pw(s, t) · pw(q, t) • If pw(s, t) > pw(r, t): • w(t)/||s||2 > w(t)/||r||2 ⇒ ||s||2 < ||r||2 • Hence, for any other t’: • w(t’)/||s||2 > w(t’)/||r||2 ⇒ pw(s, t’) > pw(r, t’)
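The rewrite of S(q, s) in terms of partial weights can be checked numerically (n-grams and weights are made up):

```python
import math

# Sketch of the monotonicity rewrite: S(q, s) equals the sum over shared
# n-grams of pw(q, t) * pw(s, t), where pw(x, t) = w(t) / ||x||_2.
def l2(grams, w):
    return math.sqrt(sum(w[t] ** 2 for t in grams))

def pw(grams, w, t):
    return w[t] / l2(grams, w)

def score(q, s, w):
    return sum(pw(q, w, t) * pw(s, w, t) for t in q & s)

w = {'ric': 2.0, 'ich': 3.0, 'sti': 1.0}
q, s = {'ric', 'ich'}, {'ric', 'ich', 'sti'}
direct = sum(w[t] ** 2 for t in q & s) / (l2(q, w) * l2(s, w))
assert abs(score(q, s, w) - direct) < 1e-12
```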
Indexing • Dataset (figure): strings 0: rich, 1: stick, 2: stich, 3: stuck, 4: static, with 2-grams at, ch, ck, ic, ri, st, ta, ti, tu, uc • Use inverted lists sorted by pw(): • pw(0, ic) > pw(4, ic) > pw(1, ic) > pw(2, ic) • because ||0||2 < ||4||2 < ||1||2 < ||2||2
L2 Length Filter • Given q and τ, and using length filtering (figure: within each inverted list at, ch, …, uc, only the ids whose L2 length falls inside the allowed window are scanned) • We examine only a small fraction of the lists
Monotonicity • Lists are sorted by pw(), i.e., by increasing L2 length • So scans can stop early: if I have seen 1 already, then 4 is not in the list (||4||2 < ||1||2, so 4 would have appeared before 1)
Other Improvements • Use properties of weighting scheme • Scan high weight lists first • Prune according to string length and maximum potential score • Ignore low weight lists altogether
Conclusion • Concepts can be extended easily to: • BM25 • Weighted Jaccard • Dice • IDF • Take-away message: • Properties of the similarity/distance function can play a big role in designing very fast indexes • L2 is super fast for almost exact matches
Outline • Motivation and preliminaries • Inverted list based algorithms • Gram signature algorithms • Length-normalized measures • Selectivity estimation • Conclusion and future directions
The Problem • Estimate the number of strings with: • Edit distance smaller than k from query q • Cosine similarity higher than τ to query q • Jaccard, Hamming, etc… • Issues: • Estimation accuracy • Size of estimator • Cost of estimation
Motivation • Query optimization: • Selectivity of query predicates • Need to support selectivity of approximate string predicates • Visualization/Querying: • Expected result set size helps with visualization • Result set size important for remote query processing
Flavors • Edit distance: • Based on clustering (JL05) • Based on min-hash (MBKS07) • Based on wild-card n-grams (LNS07) • Cosine similarity: • Based on sampling (HYKS08)
Selectivity Estimation for Edit Distance • Problem: • Given query string q • Estimate the number of strings s ∈ D • Such that ed(q, s) ≤ δ
Sepia - Clustering (JL05, JLV08) • Partition strings using clustering: • Enables pruning of whole clusters • Store per cluster histograms: • Number of strings within edit distance 0,1,…,δ from the cluster center • Compute global dataset statistics: • Use a training query set to compute frequency of strings within edit distance 0,1,…,δ from each query
Edit Vectors • Edit distance alone is not discriminative: • Use edit vectors (one coordinate per edit-operation type) • A 3D space vs a 1D space (figure: cluster Ci with center pi and query q; members Lucas <1,1,0> with ed 2, Luciano <2,0,0> with ed 2, Lukas <1,1,1> with ed 3, and Lucia)
Visually (figure): each cluster Ci with center pi stores a frequency table Fi from edit vectors to counts (e.g. F1: <0,0,0> → 4, <0,0,1> → 12, <1,0,2> → 7, …; F2: <0,0,0> → 3, <0,1,0> → 40, <1,0,1> → 6, …; Fn: <0,0,0> → 2, <1,0,2> → 84, <1,1,1> → 1, …); a global table maps triples (v(q,pi), v(pi,s), ed(q,s)) to counts and cumulative percentages (e.g. <1,0,1>, <0,0,1>: ed 1 → 1 (14%), ed 2 → 4 (57%), ed 3 → 7 (100%); <1,1,0>, <1,0,2>: ed 3 → 21 (25%), ed 4 → 63 (75%), ed 5 → 84 (100%))
Selectivity Estimation • Use the triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri + δ, disregard cluster Ci (figure: query q at distance more than ri + δ from center pi)
Selectivity Estimation • Use the triangle inequality: • Compute edit vector v(q,pi) for all clusters i • If |v(q,pi)| > ri + δ, disregard cluster Ci • For all entries in the frequency table: • If |v(q,pi)| + |v(pi,s)| ≤ δ then ed(q,s) ≤ δ for all such s • If | |v(q,pi)| − |v(pi,s)| | > δ, ignore these strings • Else use the global table: • Look up entry <v(q,pi), v(pi,s), δ> in the global table • Use the estimated fraction of strings
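The cluster-pruning test can be sketched with plain edit distances (cluster centers, radii, and strings below are invented; the actual method stores edit vectors, of which the scalar distance is the sum):

```python
# Sketch of the triangle-inequality pruning: skip any cluster whose
# center p_i satisfies ed(q, p_i) > r_i + delta, since no member can
# then be within delta of q.
def ed(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

clusters = [('lucas', 1), ('stick', 1)]   # (center p_i, radius r_i)
q, delta = 'lukas', 1
kept = [p for p, r in clusters if ed(q, p) <= r + delta]
assert kept == ['lucas']                   # 'stick' is safely pruned
```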
Example • δ = 3 • v(q,p1) = <1,1,0>, v(p1,s) = <1,0,2> • Global table lookup: [<1,1,0>, <1,0,2>, 3] → 25% • F1 contains 7 strings with v(p1,s) = <1,0,2>, so the contribution is 25% × 7 = 1.75 • Iterate through F1 and add up the contributions
Cons • Hard to maintain if clusters start drifting • Hard to find good number of clusters • Space/Time tradeoffs • Needs training to construct good dataset statistics table
VSol – minhash (MBKS07) • Solution based on minhash • Minhash is used to: • Estimate the size of a set |s| • Estimate the resemblance of two sets • I.e., estimate J = |s1 ∩ s2| / |s1 ∪ s2| • Estimate the size of the union |s1 ∪ s2| • Hence, also the size of the intersection: • |s1 ∩ s2| ≈ J(s1, s2) · |s1 ∪ s2|
Minhash • Given a set s = {t1, …, tm} • Use independent hash functions h1, …, hk: • hi: n-gram → [0, 1] • Hash the elements of s with each of the k functions • For each function, keep the element that hashed to the smallest value • This reduces set s from m to k elements • Denote the minhash signature by s’
How to use minhash • Given two signatures q’, s’: • J(q, s) ≈ Σ_{1≤i≤k} I{q’[i] = s’[i]} / k • |s| ≈ ( k / Σ_{1≤i≤k} s’[i] ) − 1 • (q ∪ s)’ = min(q’, s’), i.e., (q ∪ s)’[i] = min(q’[i], s’[i]) for each i • Hence: • |q ∪ s| ≈ ( k / Σ_{1≤i≤k} (q ∪ s)’[i] ) − 1
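A sketch of these estimators, using seeded MD5 hashes as stand-ins for the independent hash functions hi (the 3-gram sets are illustrative; the tolerances are loose because the estimates are probabilistic):

```python
import hashlib

# Sketch of minhash signatures and the estimators above.
def h(i, gram):
    """Deterministic stand-in for hash function h_i: n-gram -> [0, 1)."""
    digest = hashlib.md5(f"{i}:{gram}".encode()).hexdigest()
    return int(digest, 16) / 16 ** 32

def signature(grams, k):
    # s'[i] = smallest hash value under h_i over the set
    return [min(h(i, g) for g in grams) for i in range(k)]

q = {'ric', 'ich', 'ch ', 'sti', 'tic'}   # made-up 3-gram sets
s = {'ric', 'ich', 'ch ', 'stu', 'tuc'}
k = 200
sq, ss = signature(q, k), signature(s, k)

# J(q, s) ~ fraction of coordinates where the signatures agree
est_j = sum(a == b for a, b in zip(sq, ss)) / k
assert abs(est_j - len(q & s) / len(q | s)) < 0.15

# (q ∪ s)' = coordinate-wise min; |q ∪ s| ~ k / sum - 1
union_sig = [min(a, b) for a, b in zip(sq, ss)]
est_union = k / sum(union_sig) - 1
assert abs(est_union - len(q | s)) < 2.5
```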
VSol Estimator • Construct one inverted list per n-gram in D (figure: lists for t1, t2, …, t10 holding string ids such as 1, 3, 5, 8, …, 14, 25, 43) • The lists are our sets • Compute a minhash signature for each list
Selectivity Estimation • Use the edit distance length filter: • If ed(q, s) ≤ δ, then q and s share at least L = |s| − 1 − n(δ − 1) n-grams • Given query q = {t1, …, tm}: • The answer is the size of the union of all non-empty L-intersections of the query grams’ lists (binomial coefficient: m choose L of them) • We can estimate the sizes of the L-intersections using the minhash signatures
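The bound L can be sketched directly from the formula, consistent with the example values δ = 2, n = 3, L = 6 used on the example slide:

```python
# Sketch of the edit-distance length filter used by VSol: strings within
# edit distance delta of q must share at least L = |s| - 1 - n*(delta - 1)
# n-grams, so only L-intersections of the query lists can contribute.
def min_shared_grams(size, n, delta):
    return size - 1 - n * (delta - 1)

assert min_shared_grams(10, 3, 2) == 6    # matches L = 6 in the example
```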
Example • δ = 2, n = 3 ⇒ L = 6 • Look at all 6-intersections of the inverted lists (figure: lists for q = t1, t2, …, t10) • A = |∪_{i1, …, i6 ∈ [1,10]} (t_i1 ∩ t_i2 ∩ … ∩ t_i6)| • There are (10 choose 6) such terms
The m-L Similarity • Can be computed efficiently using minhashes • Answer: • ρ = Σ_{1≤j≤k} I{∃ i1, …, iL: t_i1’[j] = … = t_iL’[j]} • A ≈ (ρ / k) · |t1 ∪ … ∪ tm| • Proof very similar to the proof for minhashes
Cons • Will overestimate results • Many L-intersections will share strings • Edit distance length filter is loose
OptEQ – wild-card n-grams (LNS07) • Use extended n-grams: • Introduce wild-card symbol ‘?’ • E.g., “ab?” can be: • “aba”, “abb”, “abc”, … • Build an extended n-gram table: • Extract all 1-grams, 2-grams, …, n-grams • Generalize to extended 2-grams, …, n-grams • Maintain an extended n-grams/frequency hashtable
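Extraction of extended 2-grams with single wild-card positions can be sketched as follows (matching entries like ?b and a? in the example table; only one ‘?’ per gram here):

```python
# Sketch of extended (wild-card) 2-gram extraction: every regular 2-gram
# plus variants with one position replaced by '?'.
def extended_2grams(s):
    grams = set()
    for i in range(len(s) - 1):
        g = s[i:i + 2]
        grams.add(g)                 # regular 2-gram
        grams.add('?' + g[1])        # wild-card in first position
        grams.add(g[0] + '?')        # wild-card in second position
    return grams

assert extended_2grams('abc') == {'ab', 'bc', '?b', 'a?', '?c', 'b?'}
```

Frequencies of these extended grams over the whole dataset would then be accumulated into the hashtable described above.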
Example • Dataset strings: abc, def, ghi, … • Extended n-gram table (n-gram → frequency): • ab → 10, bc → 15, de → 4, ef → 1, gh → 21, hi → 2, … • ?b → 13, a? → 17, ?c → 23, … • abc → 5, def → 2, …