630 likes | 885 Views
Efficient Approximate Search on String Collections. Marios Hadjieleftheriou. Chen Li. Outline. Part 1: Motivation and preliminaries Inverted list based algorithms Part 2: Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions.
E N D
Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li
Outline Part 1: • Motivation and preliminaries • Inverted list based algorithms Part 2: • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
Web Search • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html
Record Linkage R S • Edit distance • Jaccard • Cosine • … Record linkage
Document Cleaning Should be “Niels Bohr” Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope
Demos • http://directory.uci.edu/ • http://psearch.ics.uci.edu/advanced/ • http://psearch.ics.uci.edu/
State-of-the-art: Oracle 10g and older • Supported by Oracle Text • CREATE TABLE engdict(word VARCHAR(20), len INT); • Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / • CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); • Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; • Limitation: cannot handle errors in the first letters: Katherine versus Catherine
Microsoft SQL Server [CGG+05] • Data cleaning tools available in SQL Server 2005 • Part of Integration Services • Supports fuzzy lookups • Uses data flow pipeline of transformations • Similarity function: tokens with TF/IDF scores 8
Lucene • Using Levenshtein Distance (Edit Distance). • Example: roam~0.8 • Prefix filtering followed by a scan (Efficiency?)
Problem Formulation Find strings similar to a given string • Performance is important! • 10 ms: 100 queries per second (QPS) • 5 ms: 200 QPS
Similarity Functions • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance • Hamming distance • Jaccard similarity • Soundex • TF/IDF, BM25, DICE • See [KSS06] for an excellent survey
A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 12
Outline • Motivation and preliminaries • Inverted list based algorithms • List-merging algorithms • VGRAM • List-compression techniques • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
“q-grams” of strings u n i v e r s a l 2-grams
Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l
id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 4 2 3 0 1 4 3 0 3 1 0 1 2 4 4 1 2 4 2 3 q-gram inverted lists
# of common grams >= 3 id strings at ch ck ic ri st ta ti tu uc 4 0 1 2 3 4 rich stick stich stuck static 2 0 1 3 0 1 2 4 0 4 2 3 1 4 1 2 4 3 3 Searching using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 sh ht ti ic ck ic ck ti 2-grams
T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T
Example T = 4 1 3 5 10 13 10 13 15 5 7 13 13 15 Result: 13
List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip
Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap
MergeOpt Algorithm [SK04] Binary search Long Lists: T-1 Short Lists
Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 13 15 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4
ScanCount String ids # of occurrences Increment by 1 1 2 3 … 1 0 1 3 5 10 13 10 13 15 5 7 13 13 15 0 1 0 13 4 0 Result! 14 0 15 2 0 Count threshold T≥ 4 24
List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip
MergeSkip algorithm [BK02, LLL08] Pop T-1 …… Min-heap Jump Greater or equals T-1
Example of MergeSkip 1 minHeap 5 10 13 15 1 3 5 10 10 15 5 7 13 15 13 13 Jump 17 17 15 15 Count threshold T≥ 4
DivideSkip Algorithm [LLL08] Binary search MergeSkip Long Lists Short Lists
How many lists are treated as long lists? Long Lists Short Lists Lookup Merge ? A good balance in the tradeoff: # of long lists = T / ( μ logM +1)
Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19
Positional Filtering Ed(s,t) ≤ 2 s (ab,1) t (ab,12)
root … 1 2 3 n … aa ab zy zz 1 2 m 5 12 17 28 44 Filter tree [LLL08] Length level Gram level … Position level Inverted list
Surprising experimental results (DBLP) Adding position filter could increase running time
Filters fragment inverts lists Merge Merge Merge Merge Applying filters Saving: reduce list size. Cost: - Tree traversal, - More merging
Outline • Motivation and preliminaries • Inverted list based algorithms • List-merging algorithms • VGRAM [LWY07,YWL08] • List-compression techniques • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions
ati ich ick ric sta sti stu tat tic tuc uck 4 id strings id strings id strings 2 0 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 1 0 4 2 1 3 4 2 1 4 3 3 2-grams -> 3-grams? • Query: “shtick”, ED(shtick, ?)≤1 sht hti tic ick ick tic # of common grams >= 1 3-grams
id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 1 4 2 3 0 1 4 3 0 3 0 1 2 4 4 1 2 4 2 3 Observation 1: dilemma of choosing “q” • Increasing “q” causing: • Longer grams Shorter lists • Smaller # of common grams of similar strings
Observation 2: skew distributions of gram frequencies • DBLP: 276,699 article titles • Popular 5-grams: ation (>114K times), tions, ystem, catio
VGRAM: Main idea • Grams with variable lengths (between qmin and qmax) • zebra • ze(123) • corrasion • co(5213), cor(859), corr(171) • Advantages • Reduce index size • Reducing running time • Adoptable by many algorithms
Challenges • Generatingvariable-length grams? • Constructing a high-quality gram dictionary? • Relationship between string similarity and their gram-set similarity? • Adopting VGRAM in existing algorithms?
[2,4]-gram dictionary ni ivr sal uni vers Challenge 1: String Variable-length grams? • Fixed-length 2-grams u n i v e r s a l • Variable-length grams u n i v e r s a l
Representing gram dictionary as a trie ni ivr sal uni vers
Challenge 2: Constructing gram dictionary Step 1: Collecting frequencies of grams with length in [qmin, qmax] st 0, 1, 3 sti 0, 1 stu3 stic 0, 1 stuc3 Gram trie with frequencies
Step 2: selecting grams • Pruning trie using a frequency threshold F (e.g., 2)
Step 2: selecting grams (cont) Threshold T = 2
Final gram dictionary A cost-based approach to choosing a gram dictionary [YWL08]
Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l
Deletion affects variable-length grams Not affected Not affected Affected i i-qmax+1 i+qmax- 1 Deletion
[2,4]-grams ni ivr sal uni vers Grams affected by a deletion Affected? i i-qmax+1 i+qmax- 1 Deletion Deletion u n i v e r s a l Affected?
Grams affected by a deletion (cont) Affected? i i-qmax+1 i+qmax- 1 Deletion Trie of grams Trie of reversed grams