530 likes | 663 Views
Efficient Approximate Search on String Collections Part I. Marios Hadjieleftheriou. Chen Li. DBLP Author Search. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html. Try their names (good luck!). http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html. UCSD.
E N D
Efficient Approximate Search on String CollectionsPart I Marios Hadjieleftheriou Chen Li
DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html
Try their names (good luck!) http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html UCSD Case Western AT&T--Research Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou
Better system? http://dblp.ics.uci.edu/authors/
People Search at UC Irvine http://psearch.ics.uci.edu/
Web Search • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html
Data Cleaning R S
Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” • Performance is important! • 10 ms: 100 queries per second (QPS) • 5 ms: 200 QPS
Outline • Motivation • Preliminaries • Trie-based approach • Gram-based algorithms • Sketch-based algorithms • Compression • Selectivity estimation • Transformations/Synonyms • Conclusion Part I Part II
Next… Preliminaries
Similarity Functions • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance • Hamming distance • Jaccard similarity • Soundex • TF/IDF, BM25, DICE • See [KSS06] for an excellent survey
A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 13
Next… Gram-based algorithms • List-merging algorithms [LLL08] • Variable-length grams (VGRAM) [LWY07,YWL08]
“q-grams” of strings u n i v e r s a l 2-grams
Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l If ed(s1,s2) <= k, then their # of common grams >= (|s1|- q + 1) –k *q
id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 4 2 3 0 1 4 3 0 3 1 0 1 2 4 4 1 2 4 2 3 q-gram inverted lists
# of common grams >= 3 id strings at ch ck ic ri st ta ti tu uc 4 0 1 2 3 4 rich stick stich stuck static 2 0 1 3 0 1 2 4 0 4 2 3 1 4 1 2 4 3 3 Searching using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 sh ht ti ic ck ic ck ti 2-grams
T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T
Example T = 4 1 3 5 10 13 10 13 15 5 7 13 13 15 Result: 13
List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip
Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap
MergeOpt Algorithm [SK04] Binary search Long Lists: T-1 Short Lists
Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 13 15 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4
ScanCount String ids # of occurrences Increment by 1 1 2 3 … 1 0 1 3 5 10 13 10 13 15 5 7 13 13 15 0 1 0 13 4 0 Result! 14 0 15 2 0 Count threshold T≥ 4 25
List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip
MergeSkip algorithm [BK02, LLL08] Pop T-1 …… Min-heap Jump Greater or equals T-1
Example of MergeSkip 1 minHeap 5 10 13 15 1 3 5 10 10 15 5 7 13 15 13 13 Jump 17 17 15 15 Count threshold T≥ 4
DivideSkip Algorithm [LLL08] Binary search MergeSkip Long Lists Short Lists
Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19
Positional Filtering Ed(s,t) ≤ 2 s (ab,1) t (ab,12)
A filter tree Combine filters with list-merging algorithms [LLL08]
Next… Variable-length grams (VGRAM) [LWY07,YWL08]
ati ich ick ric sta sti stu tat tic tuc uck 4 id strings id strings id strings 2 0 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 1 0 4 2 1 3 4 2 1 4 3 3 2-grams -> 3-grams? • Query: “shtick”, ED(shtick, ?)≤1 sht hti tic ick ick tic # of common grams >= 1 3-grams
id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 1 4 2 3 0 1 4 3 0 3 0 1 2 4 4 1 2 4 2 3 Observation 1: dilemma of choosing “q” • Increasing “q” causing: • Longer grams Shorter lists • Smaller # of common grams of similar strings
Observation 2: skew distributions of gram frequencies • DBLP: 276,699 article titles • Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea • Grams with variable lengths (between qmin and qmax) • zebra • ze(123) • corrasion • co(5213), cor(859), corr(171) • Advantages • Reduce index size • Reducing running time • Adoptable by many algorithms
Challenges • Generatingvariable-length grams? • Constructing a high-quality gram dictionary? • Relationship between string similarity and their gram-set similarity? • Adopting VGRAM in existing algorithms?
[2,4]-gram dictionary ni ivr sal uni vers Challenge 1: String Variable-length grams? • Fixed-length 2-grams u n i v e r s a l • Variable-length grams u n i v e r s a l
Representing gram dictionary as a trie ni ivr sal uni vers
Step 2: Constructing a gram dictionary qmin=2 qmax=4 • Frequency-based [LYW07] • Cost-based [YLW08]
Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l
Deletion affects variable-length grams Not affected Not affected Affected i i-qmax+1 i+qmax- 1 Deletion
Main idea • For a string, for each position, compute the number of grams that could be destroyed by an operation at this position • Compute number of grams possibly destroyed by k operations • Store these numbers (for all data strings) as part of the index Vector of s = <2,4,6,8,9> • Use this number to do count filtering With 2 edit operations, at most 4 grams can be affected
Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: • String s grams • String s1, s2 such that ed(s1,s2) <= k min # of their common grams
Lower bound on # of common grams Fixed length (q) If ed(s1,s2) <= k, then their # of common grams >=: (|s1|- q + 1) –k *q u n i v e r s a l Variable lengths:# of grams of s1 – NAG(s1,k)
… ck ic ich … tic tick … id strings id strings id strings … ck ic … ti … 1 3 1 3 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 4 1 4 1 2 0 2 0 1 2 4 2 4 1 Example: algorithm using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 tick sh ht tick 2-grams 2-4 grams Lower bound = 3 Lower bound = 1
End of part I • Motivation • Preliminaries • Trie-based approach • Gram-based algorithms • Sketch-based algorithms • Compression • Selectivity estimation • Transformations/Synonyms • Conclusion Part I Part II