Efficient Merging and Filtering Algorithms for Approximate String Searches
Jiaheng Lu, University of California, Irvine
Joint work with Chen Li, Yiming Lu
Example: a movie database. Find movies starring Schwarrzenger.
In general: Gap between Queries and Data
• Errors in the query
• The user doesn't remember a string exactly
• The user unintentionally types a wrong string
Query: Schwarrzenger. Data: Schwarzenegger …
Data may not be clean
• Errors in the database:
• Data is often not clean by itself; this is especially true in data integration and cleansing
(Figure: Relation R, Relation S)
Problem definition: approximate string searches Collection of strings s Star Search Keanu Reeves Samuel Jackson Query q Schwarzenegger Samuel Jackson … Output: strings s that satisfy Sim(q,s)≤δ
Example Similarity Function: Edit Distance
• A widely used metric to define string similarity
• ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
• Example: s1: Tom Hanks, s2: Ton Hank, ed(s1,s2) = 2
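The definition above can be checked with a short sketch. This is the textbook dynamic program for edit distance, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty prefix of s2
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m -> n, delete s
```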
Example: approximate string searches Collection of strings s Star Search Tom Hank Thomas Hanks Query q Ton Hank Tom J. Hanks Tom Hanks … Output: strings s that satisfy ed(q,s)≤2
Outline • Problem motivation • Preliminary • Grams • Inverted lists • Merge algorithms • Filtering technique • Conclusion
String Grams
• q-grams: the substrings of length q of a string
• For example, the 2-grams of "universal": (un),(ni),(iv),(ve),(er),(rs),(sa),(al)
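A minimal sketch of gram extraction, assuming plain sliding-window q-grams with no padding characters (the slides do not say whether padding is used); `qgrams` is an illustrative name.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return every substring of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
```

A string of length n yields n - q + 1 grams, so "universal" (length 9) yields 8 grams for q = 2.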
Inverted lists
• Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  inverted list (string ids)
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
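The conversion from strings to gram inverted lists can be sketched as follows. This is an illustrative in-memory version (the name `build_inverted_lists` is an assumption); string ids are simply the positions in the input list.

```python
from collections import defaultdict


def build_inverted_lists(strings, q=2):
    """Map each q-gram to the increasing list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        # A set keeps each id at most once per list even if a gram repeats in s.
        for gram in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[gram].append(sid)
    return dict(index)


index = build_inverted_lists(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list comes out sorted, which is what the merge algorithms later rely on.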
Main Example
• Query: stick, with ed(s,q) ≤ 1
• Grams of the query: (st,ti,ic,ck)
• Inverted lists of the query grams:
  st: 1, 2, 3, 4
  ti: 1, 2, 4
  ic: 0, 1, 2, 4
  ck: 1, 3
• Merge the lists and keep ids with count ≥ 2. Performance bottleneck!
• Candidate string ids: {1,2,3,4}
• Double check for the real edit distance
• Final answers: {1,2,3}
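The pipeline in this example (merge the lists, keep ids that occur often enough, then verify) can be sketched end to end. The slide only states the threshold "count ≥ 2"; the formula used below, (number of distinct query grams) minus k·q, is the usual count-filter bound and is an assumption here, as are all function names.

```python
from collections import Counter, defaultdict


def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def search(strings, query, k, q=2):
    """Return (candidate ids, verified answer ids) for ed(s, query) <= k."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index[g].append(sid)
    query_grams = set(grams(query, q))
    # Assumed bound: one edit operation can destroy at most q grams.
    threshold = len(query_grams) - k * q
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the count filter cannot prune
    else:
        counts = Counter()
        for g in query_grams:
            counts.update(index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    # Double check the real edit distance on the surviving candidates.
    answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]
    return candidates, answers


print(search(["rich", "stick", "stich", "stuck", "static"], "stick", k=1))
```

The counting step here uses a plain `Counter`; the merge algorithms in the rest of the talk are different ways of doing that one step faster.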
Sub-problem definition: Given multiple inverted lists with integer values in increasing order and a threshold T, find all values whose number of occurrences is ≥ T.
Example
• Count threshold: 4
• Lists:
  1 3 5 10 13
  10 13 15
  5 7 13
  13
  15
• Result: 13
Outline • Problem motivation • Preliminary • Merge algorithms • Two previous algorithms • Our proposed three algorithms • Filtering technique • Conclusion
Five Merge Algorithms
• Previous: HeapMerger [Sarawagi, SIGMOD 2004], MergeOpt [Sarawagi, SIGMOD 2004]
• New: ScanCount, MergeSkip, DivideSkip
Two previous algorithms (1): Heap-based Algorithm
• Push the first element of every list to a min-heap
• Count the # of occurrences of each element by popping equal elements from the heap
Example of HeapMerger [Sarawagi et al 2004]
• Lists:
  1 3 5 10 13
  10 13 15
  5 7 13
  13
  15
• Initial min-heap (heads of the lists): 1, 10, 5, 13, 15
• Count threshold ≥ 4
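A sketch of the heap-based merge, written from the description on the slide rather than from the cited paper. It reads the example as five lists (the initial heap shows five heads) and assumes each list is strictly increasing; `heap_merge` is an illustrative name.

```python
import heapq


def heap_merge(lists, threshold):
    """Return the values that occur in at least `threshold` of the sorted lists."""
    # Heap entries are (value, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every entry equal to the current minimum and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.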
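Two previous algorithms (2): MergeOpt Algorithm
• Long lists: the T-1 longest lists
• Short lists: the remaining lists, merged with a min-heap
• Binary search: look up each candidate from the short lists in the long lists
Example of MergeOpt [Sarawagi et al 2004]
• Long lists (3):
  1 3 5 10 13
  10 13 15
  5 7 13
• Short lists (2), merged with a min-heap:
  13
  15
• Count threshold ≥ 4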
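The split into long and short lists can be sketched as below. This is a simplified reading of the slide, not the cited implementation: the short lists are merged with `heapq.merge`, and every value found there is looked up in the T-1 long lists by binary search. It assumes `threshold >= 1`; the names are illustrative.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, value):
    """Binary search for value in an increasing list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def merge_opt(lists, threshold):
    by_length = sorted(lists, key=len, reverse=True)
    long_lists = by_length[:threshold - 1]   # the T-1 longest lists
    short_lists = by_length[threshold - 1:]
    result = []
    # A qualifying value must occur in at least one short list,
    # because there are only T-1 long lists.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        for lst in long_lists:
            if count >= threshold:
                break
            if contains(lst, value):
                count += 1
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, 4))  # [13]
```

In the example only the two one-element lists are scanned; the three long lists are touched by binary search alone.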
Our new algorithms (1) ScanCount Algorithm Use an array to record # of occurrences of each element
ScanCount Example
• Count array indexed by string id: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
• Lists:
  1 3 5 10 13
  10 13 15
  5 7 13
  13
  15
• Count threshold ≥ 4
• Result: 13
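ScanCount as described (one counter per string id, one pass over all lists) can be sketched like this. The parameter `max_id` and the function name are assumptions; the sketch also assumes an id appears at most once per list.

```python
def scan_count(lists, threshold, max_id):
    """Count occurrences of each id in an array and report ids reaching threshold."""
    counts = [0] * (max_id + 1)
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            # Report the id exactly once, at the moment it reaches the threshold.
            if counts[sid] == threshold:
                result.append(sid)
    return sorted(result)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, 4, max_id=15))  # [13]
```

No heap is involved, so the cost is a plain sequential scan of the lists plus the array of counters.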
Our new algorithms (2): MergeSkip algorithm
• Keep the head of every list in a min-heap
• If the top element occurs fewer than T times, pop T-1 elements from the heap
• Jump: move each popped list forward to its first element ≥ the new top of the heap
Example of MergeSkip (count threshold ≥ 4)
• Lists:
  1 3 5 10 13
  10 13 15
  5 7 13
  13
  15
• Step 1: minHeap = 1, 5, 10, 13, 15
• Step 2: pop 1, 5, 10; minHeap = 13, 15
• Step 3: the popped lists jump to their first element ≥ 13
• Step 4: minHeap = 13, 13, 13, 13, 15
• Result: 13
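The pop-and-jump steps can be sketched as follows. This is written from the slides, so details such as the early exit when the heap empties are assumptions of this sketch; `merge_skip` is an illustrative name.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, threshold):
    """Values occurring in >= threshold sorted lists, skipping hopeless elements."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= threshold:
            result.append(value)
            for i in popped:  # advance each popped list by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
            continue
        # Too few occurrences: pop until T-1 lists are out of the heap.
        while heap and len(popped) < threshold - 1:
            popped.append(heapq.heappop(heap)[1])
        if not heap:
            break  # fewer than T lists remain, so nothing else can qualify
        target = heap[0][0]
        # Anything smaller than target lies in at most T-1 lists: jump over it.
        for i in popped:
            pos[i] = bisect_left(lists[i], target, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, 4))  # [13]
```

On the example, the first round pops 1, 5 and 10, sees 13 on top of the heap, and moves the three popped lists straight to 13, so 3 and 7 are never read.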
Our new algorithms (3): DivideSkip Algorithm
• Short lists: merged with MergeSkip
• Long lists: checked with binary search
• The number of long lists is chosen dynamically
Size of long lists: How many lists are treated as long lists? Cost: MergeOpt, binary search. (Figure: long lists, short lists)
Size of long lists: How many lists are treated as long lists? Cost: MergeSkip, binary search. (Figure: long lists, short lists)
Decide L value. A good balance in the tradeoff: # of long lists L = T / (μ log M + 1)
Empirical verification: our formula for L achieves the best result compared with other options.
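A sketch of how DivideSkip could combine the pieces, under several assumptions that the slides do not spell out: M is taken to be the length of the longest list, the logarithm is base 2, μ is a tuning coefficient (the default 0.01 is an arbitrary placeholder), L is rounded down and capped at T-1, and the short lists are merged by MergeSkip with the reduced threshold T-L. All names are illustrative.

```python
import heapq
import math
from bisect import bisect_left


def merge_skip_counts(lists, threshold):
    """MergeSkip variant returning (value, count) pairs with count >= threshold."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    found = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= threshold:
            found.append((value, len(popped)))
            target = value + 1  # advance the popped lists by one element
        else:
            while heap and len(popped) < threshold - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]  # jump to the new top of the heap
        for i in popped:
            pos[i] = bisect_left(lists[i], target, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return found


def divide_skip(lists, threshold, mu=0.01):
    by_length = sorted(lists, key=len, reverse=True)
    longest = len(by_length[0]) if by_length else 0
    # L = T / (mu * log M + 1), with the assumptions listed in the lead-in.
    num_long = int(threshold / (mu * math.log2(longest) + 1)) if longest else 0
    num_long = max(0, min(num_long, threshold - 1, len(by_length)))
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]
    result = []
    for value, count in merge_skip_counts(short_lists, threshold - num_long):
        for lst in long_lists:  # binary search each long list
            i = bisect_left(lst, value)
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= threshold:
            result.append(value)
    return result


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, 4))  # [13]
```

A larger μ moves lists from the long side to the short side: more merging work, fewer binary searches.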
Experimental data sets: three real data sets with various string lengths and data sizes
• DBLP data
• IMDB data
• Google Web corpus
Performance (DBLP data): running time per query with various algorithms. DivideSkip is the best one.
Number of elements read (DBLP data): DivideSkip is the best one; it skips reading the most elements.
Outline • Problem motivation • Preliminary • Merge algorithms • Filtering technique • Length, positional filter [Gravano et al. VLDB 2001] • Filter tree • Conclusion and future work
Length Filtering
• s: length 10; t: length 19
• Required: ed(s,t) ≤ 2
• The pair can be pruned by length only!
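The length filter follows from the fact that each edit operation changes the length of a string by at most one, so two strings within edit distance k differ in length by at most k. A minimal sketch (the function name is illustrative):

```python
def passes_length_filter(s: str, t: str, k: int) -> bool:
    """False means ed(s, t) > k is already certain from the lengths alone."""
    return abs(len(s) - len(t)) <= k


# Lengths 10 and 19 with k = 2, as on the slide: pruned without computing ed.
print(passes_length_filter("a" * 10, "b" * 19, 2))  # False
```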
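Positional Filtering
• Positional gram: a gram together with its position in the string
• For example, string abcd: {(ab,1),(bc,2),(cd,3)}
• With ed(s,t) ≤ 2, the gram (ab,1) of s cannot be matched with the gram (ab,12) of t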
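A small sketch of positional grams and the position test, assuming 1-based positions as in the abcd example and the usual rule that two equal grams can only correspond when their positions differ by at most k. Function names are illustrative.

```python
def positional_grams(s: str, q: int = 2):
    """Return (gram, position) pairs with 1-based positions."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def grams_may_match(g1, g2, k: int) -> bool:
    """Equal grams count as shared only if their positions are within k."""
    (gram1, pos1), (gram2, pos2) = g1, g2
    return gram1 == gram2 and abs(pos1 - pos2) <= k


print(positional_grams("abcd"))
print(grams_may_match(("ab", 1), ("ab", 12), 2))  # False
```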
Filter tree
• root
• Length level: 1, 2, 3, …, n
• Gram level: aa, ab, …, zy, zz
• Position level: 1, 2, …, m
• Leaf: inverted list, e.g. 5 12 17 28 44
Surprising experimental results (DBLP): use filters wisely, more filters may be bad!
Conclusion
• Three new merge algorithms
• The new algorithms run faster than the previous ones
• Surprising experimental results: use filters wisely, more filters may be bad!
Backup : related work Approximate string matching [Navarro 2001] Fuzzy lookup in Varied length Grams [Li et al 2007]
Reference
• [Arasu 2006] A. Arasu, V. Ganti and R. Kaushik, "Efficient Exact Set-similarity Joins", in VLDB 2006
• [Chaudhuri 2003] S. Chaudhuri, K. Ganjam, V. Ganti and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning", in SIGMOD 2003
• [Gravano 2001] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan and D. Srivastava, "Approximate string joins in a database (almost) for free", in VLDB 2001