Space-Constrained Gram-Based Indexing for Efficient Approximate String Search
Alexander Behm¹, Shengyue Ji¹, Chen Li¹, Jiaheng Lu²
¹University of California, Irvine  ²Renmin University of China
Overview Motivation & Preliminaries Approach 1: Discarding Lists Approach 2: Combining Lists Experiments & Conclusion
Motivation: Data Cleaning
• Real-world data is dirty
• Typos (the slide shows a misspelled name that should clearly be "Niels Bohr")
• Inconsistent representations (PO Box vs. P.O. Box)
• Approximately check dirty values against a clean dictionary
Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008
Motivation: Record Linkage
• We want to link records belonging to the same entity
• No exact match! The same entity may have similar representations
• Arnold Schwarzeneger versus Arnold Schwarzenegger
• Forrest Whittaker versus Forest Whittacker
Motivation: Query Relaxation • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html
What is Approximate String Search?
String collection (people): Brad Pitt, Forest Whittacker, George Bush, Angelina Jolie, Arnold Schwarzeneger, …
Queries against the collection:
• Find all entries similar to "Forrest Whitaker"
• Find all entries similar to "Arnold Schwarzenegger"
• Find all entries similar to "Brittany Spears"
What do we mean by similar to?
• Edit Distance
• Jaccard Similarity
• Cosine Similarity
• Dice
• Etc.
The similar to predicate can help our described applications! How can we support these types of queries efficiently?
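To make the "similar to" predicate concrete, here is the standard dynamic-programming edit distance (this is textbook code, not an algorithm specific to the paper; the function name is my own):

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (Levenshtein distance)."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances for the empty prefix of s
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete s[i-1]
                         cur[j - 1] + 1,       # insert t[j-1]
                         prev[j - 1] + cost)   # substitute (or match)
        prev = cur
    return prev[n]

print(edit_distance("Forrest Whittaker", "Forest Whittacker"))
```

A query "find all entries similar to X with edit distance <= k" then means: return every string s in the collection with edit_distance(X, s) <= k.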
Approximate Query Answering
Main idea: use q-grams as signatures for a string
irvine → sliding window → 2-grams {ir, rv, vi, in, ne}
Intuition: similar strings share a certain number of grams
An inverted index on grams supports finding all data strings sharing enough grams with a query
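The sliding-window gram extraction above is a one-liner (function name is my own):

```python
def grams(s: str, q: int) -> list[str]:
    """All q-grams of s obtained by sliding a window of width q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

print(grams("irvine", 2))  # ['ir', 'rv', 'vi', 'in', 'ne']
```

A string of length |s| yields |s| − q + 1 grams, which is the quantity the index-size estimation later relies on.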
Approximate Query Example
Query: "irvine", edit distance 1
2-grams: {ir, rv, vi, in, ne}
Look up the grams in the inverted index (the slide shows the inverted lists of stringIDs for the 2-grams tf, vi, ir, ef, rv, ne, un, in, …)
Candidates = {1, 5, 9}
May have false positives → need to compute the real similarity for each candidate
Each edit operation can "destroy" at most q grams, so answers must share at least T = 5 − 1 × 2 = 3 grams with the query
T-occurrence problem: find elements occurring at least T = 3 times among the inverted lists. This is called list-merging; T is called the merging-threshold.
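A minimal ScanCount-style solution to the T-occurrence problem, using hypothetical inverted lists chosen so the candidates come out to {1, 5, 9} as on the slide (the actual lists in the figure are not fully recoverable from the text):

```python
from collections import Counter

def scan_count(lists: list[list[int]], T: int) -> list[int]:
    """T-occurrence: return the stringIDs that appear on at least T
    of the given inverted lists."""
    counts = Counter()
    for lst in lists:
        counts.update(lst)          # one increment per list occurrence
    return sorted(sid for sid, c in counts.items() if c >= T)

# Query "irvine", ed = 1, q = 2: T = 5 - 1*2 = 3.
# Hypothetical inverted lists for the grams ir, rv, vi, in, ne:
lists = [[1, 5, 9], [1, 3, 5], [1, 5, 6, 9], [2, 5, 9], [1, 4, 9]]
print(scan_count(lists, T=3))  # [1, 5, 9]
```

Each candidate still has to be verified with the real similarity function, since sharing T grams is necessary but not sufficient.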
Motivation: Compression
The inverted index can be very large compared to the source data, and may need to fit in memory for fast query processing. Can we compress the index to fit into a space budget?
Index-size estimation:
• Each string produces |s| − q + 1 grams
• For each gram we add one element (a 4-byte uint) to its inverted list
• With ASCII encoding, the index is ~4x as large as the original data!
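The back-of-the-envelope size estimate above can be written out directly (a sketch; function name and sample data are my own):

```python
def gram_index_bytes(strings: list[str], q: int, bytes_per_entry: int = 4) -> int:
    """Estimated inverted-index size: each string s contributes
    |s| - q + 1 list entries of bytes_per_entry bytes each."""
    return sum(max(len(s) - q + 1, 0) * bytes_per_entry for s in strings)

data = ["irvine", "shanghai"]
raw = sum(len(s) for s in data)       # ASCII bytes of the source strings
idx = gram_index_bytes(data, q=2)
print(idx, raw, idx / raw)            # index is several times the raw size
```

For realistic string lengths |s| >> q, the entry count per string is close to |s|, so 4-byte entries give roughly the 4x blowup claimed on the slide.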
Motivation: Related Work
The IR community developed many lossless compression algorithms for inverted lists (mostly in a disk-based setting), mainly using delta representation + packing. If the inverted lists are in memory, these techniques always impose decompression overhead, and the compression ratio is difficult to tune. How can we overcome these limitations in our setting?
This Paper
We developed two lossy compression techniques:
• We answer queries exactly
• The index can fit into a space budget (space constraint)
• Queries can become faster on the compressed indexes
• Flexibility to choose a space/time tradeoff
• Existing list-merging algorithms can be re-used (even with compression-specific optimizations)
Overview Motivation & Preliminaries Approach 1: Discarding Lists Approach 2: Combining Lists Experiments & Conclusion
Approach 1: Discarding Lists
BEFORE: inverted lists of stringIDs for the 2-grams tf, vi, ir, ef, rv, ne, un, in, …
AFTER: some of those lists are discarded, leaving "holes" in the index
(The slide shows the same inverted index before and after, with several gram lists removed.)
Effects on Queries
• Need to decrease the merging-threshold T
• Lower T → more false positives to post-process
• If T <= 0 we "panic": we must scan the entire collection and compute true similarities
• Surprisingly, query processing time can decrease because there are fewer lists to consider
Query "shanghai", edit distance 1
3-grams: {sha, han, ang, ngh, gha, hai} — some are hole grams (lists discarded), the rest are regular grams
Merging-threshold without holes: T = #grams − ed × q = 6 − 1 × 3 = 3
Basis: each edit operation can "destroy" at most q = 3 grams
Naïve new merging-threshold: T' = T − #holes = 0 → panic!
But can each edit operation really destroy at most q = 3 non-hole grams? In the example, deleting "a" or deleting "g" destroys at most 2 non-hole grams per edit operation.
New merging-threshold: T' = 1
We use dynamic programming to compute this tighter T'
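The two threshold bounds on this slide can be stated in a few lines. Note this sketch covers only the formula-based bounds; the paper's dynamic program for the tighter T' (which yields T' = 1 here) is not reproduced:

```python
def merging_threshold(num_grams: int, ed: int, q: int) -> int:
    """T = #grams - ed*q: each edit operation destroys at most q grams."""
    return num_grams - ed * q

def naive_threshold_with_holes(num_grams: int, ed: int, q: int,
                               num_holes: int) -> int:
    """Pessimistic bound T' = T - #holes: assumes every hole gram is lost."""
    return merging_threshold(num_grams, ed, q) - num_holes

# "shanghai": 6 3-grams, ed = 1, q = 3, and 3 hole grams:
T = merging_threshold(6, 1, 3)
Tn = naive_threshold_with_holes(6, 1, 3, 3)
print(T, Tn)  # 3 0  -> Tn <= 0 would force a panic under the naive bound
```

The gap between the naive bound (0, panic) and the DP bound (1, no panic) is exactly why the tighter computation is worth its cost.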
Choosing Lists to Discard
• One extreme: the query is entirely unaffected
• Other extreme: the query becomes a panic
• A good choice of lists depends on the query workload
• Many combinations of lists to discard satisfy the memory constraint; checking all of them is infeasible
• How can we make a "reasonable" choice efficiently?
Choosing Lists to Discard
Input: memory constraint, inverted lists L, query workload W
Output: lists to discard D

DiscardLists:
  while memory constraint not satisfied:
    for each list in L:
      Δt = estimateImpact(list, W)   // estimated query-time impact of discarding
      benefit = list.size()          // memory freed
    discard = use the Δt's and benefits to choose a list
    add discard to D; remove discard from L

How can we estimate the impact efficiently? Perhaps incrementally?
Times needed: list-merging time, post-processing time, panic time
What exactly should we minimize? benefit / cost? cost only? We could ignore benefit…
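A runnable sketch of this greedy loop, using freed bytes per unit of estimated impact as the selection score (the scoring rule and the estimate_impact callback are stand-ins for the paper's cost model, not its exact formulation):

```python
def discard_lists(lists, workload, budget, estimate_impact):
    """Greedy DiscardLists sketch: repeatedly drop the list with the best
    benefit (freed bytes) per unit of estimated query-time impact, until
    the index fits within budget bytes."""
    lists = dict(lists)                              # gram -> list of stringIDs
    discarded = []
    size = sum(len(l) for l in lists.values()) * 4   # 4-byte stringIDs
    while size > budget and lists:
        def score(gram):
            dt = estimate_impact(gram, workload)     # estimated slowdown
            benefit = len(lists[gram]) * 4           # bytes freed
            return benefit / dt if dt > 0 else float("inf")
        victim = max(lists, key=score)
        size -= len(lists[victim]) * 4
        discarded.append(victim)
        del lists[victim]
    return lists, discarded

# Hypothetical index; with uniform impact, the longest list goes first:
index = {"ir": [1, 2, 3], "rv": [1], "vi": [1, 2]}
remaining, dropped = discard_lists(index, workload=None, budget=12,
                                   estimate_impact=lambda g, w: 1.0)
print(dropped)  # ['ir']
```

Re-scoring every list on every iteration is quadratic; this is precisely what motivates the incremental estimation discussed next in the deck.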
Choosing Lists to Discard: Estimating Query Times With Holes
• List-merging time: a cost function whose parameters are decided offline with linear regression
• Post-processing time: #candidates × average compute-similarity time
• Panic time: #strings × average compute-similarity time
• #candidates depends on T, the data distribution, and the number of holes

Incremental-ScanCount algorithm:
Before discarding a list: T = 3, counts for stringIDs 0–9 are [2, 0, 3, 3, 2, 4, 0, 0, 1, 0] → #candidates = 3
Discarding the list {2, 3, 4, 8}: decrement the counts of those stringIDs
After discarding: T' = T − 1 = 2, counts are [2, 0, 2, 2, 1, 4, 0, 0, 0, 0] → #candidates = 4
There are many more ways to improve the speed of DiscardLists; this is just one example…
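The incremental step reproduces the slide's numbers exactly (a sketch; the function name is my own):

```python
def incremental_discard(counts, discard_list, T):
    """Re-estimate candidates after discarding one list: decrement the
    count of every stringID on the discarded list and lower T by 1,
    instead of re-running ScanCount from scratch."""
    counts = counts[:]                       # don't mutate the caller's array
    for sid in discard_list:
        counts[sid] -= 1
    T_new = T - 1
    candidates = [sid for sid, c in enumerate(counts) if c >= T_new]
    return counts, T_new, candidates

# Numbers from the slide:
counts = [2, 0, 3, 3, 2, 4, 0, 0, 1, 0]
before = [sid for sid, c in enumerate(counts) if c >= 3]
new_counts, T2, after = incremental_discard(counts, [2, 3, 4, 8], T=3)
print(before, after)  # [2, 3, 5] [0, 2, 3, 5]
```

The count update costs only the length of the discarded list, which is what makes evaluating many candidate lists per greedy iteration affordable.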
Overview Motivation & Preliminaries Approach 1: Discarding Lists Approach 2: Combining Lists Experiments & Conclusion
Approach 2: Combining Lists
BEFORE: separate inverted lists of stringIDs for the 2-grams tf, vi, ir, ef, rv, ne, un, in, …
AFTER: correlated lists are combined, so several grams share one physical list
(The slide shows the same inverted index before and after combining.)
Intuition: combine correlated lists.
Effects on Queries
• Merging-threshold T is unchanged (no new panics)
• Lists become longer:
  • More time to traverse lists
  • More false positives
List-merging optimization: for the query 3-grams {sha, han, ang, ngh, gha, hai}, when several query grams map to the same combined physical list, traverse that physical list once and increase each stringID's count by the refcount (2 or 3 in the slide's example) instead of by 1.
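A sketch of that refcount optimization (names and sample data are my own; physical lists are deduplicated by object identity here, which is just one way to model "several grams share one list"):

```python
from collections import Counter

def scan_count_combined(gram_to_list, query_grams, T):
    """List-merging over combined lists: each physical list is traversed
    once, and every stringID on it gets credited with the number of
    query grams referencing that list (the refcount)."""
    refcount = Counter(id(gram_to_list[g]) for g in query_grams)
    physical = {id(gram_to_list[g]): gram_to_list[g] for g in query_grams}
    counts = Counter()
    for key, lst in physical.items():
        for sid in lst:
            counts[sid] += refcount[key]     # bump by refcount, not by 1
    return sorted(sid for sid, c in counts.items() if c >= T)

shared = [1, 5]                              # one physical list for two grams
index = {"sha": shared, "han": shared, "ang": [2, 5]}
print(scan_count_combined(index, ["sha", "han", "ang"], T=3))  # [5]
```

The result is identical to merging the duplicated logical lists, but each physical list is scanned only once.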
Choosing Lists to Combine
• Discovering candidate gram pairs
  • Frequent (q+1)-grams → correlated adjacent q-grams
  • Using Locality-Sensitive Hashing (LSH)
• Selecting candidate pairs to combine
  • Based on estimated cost on the query workload
  • Similar to DiscardLists
  • Uses a different incremental ScanCount algorithm
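To make "correlated lists" concrete, here is a brute-force stand-in for the discovery step: flag gram pairs whose inverted lists have high Jaccard similarity. The paper uses frequent (q+1)-grams and LSH to avoid this quadratic scan; this sketch only illustrates the correlation criterion:

```python
def jaccard(a, b):
    """Jaccard similarity of two inverted lists viewed as ID sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_pairs(lists, threshold=0.8):
    """Brute-force discovery of combinable gram pairs: any two grams
    whose lists are nearly identical are good candidates to combine."""
    grams = sorted(lists)
    return [(g1, g2)
            for i, g1 in enumerate(grams)
            for g2 in grams[i + 1:]
            if jaccard(lists[g1], lists[g2]) >= threshold]

index = {"sha": [1, 2, 3], "han": [1, 2, 3], "ang": [4]}
print(candidate_pairs(index))  # [('han', 'sha')]
```

Combining near-identical lists loses little pruning power (almost the same stringIDs were on both), which is why correlation is the right signal.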
Overview Motivation & Preliminaries Approach 1: Discarding Lists Approach 2: Combining Lists Experiments & Conclusion
Experiments
• Datasets:
  • Google WebCorpus (word grams)
  • IMDB actors
• Queries: picked from the dataset, Zipf-distributed
• q = 3, edit distance = 2
• Overview:
  • Performance of flavors of DiscardLists & CombineLists
  • Scalability with increasing index size
  • Comparison with an IR compression technique
  • Comparison with VGRAM
  • What if the workload changes from the training workload
Experiments
(Charts: for both DiscardLists and CombineLists, runtime decreases as the index is compressed.)
Experiments
Comparison with an IR compression technique
(Charts compare query performance on the compressed vs. uncompressed indexes.)
Experiments
Comparison with the variable-length gram technique, VGRAM
(Charts compare query performance on the compressed vs. uncompressed indexes.)
Future Work
• DiscardLists, CombineLists and IR compression could be combined
• When considering a filter tree: global vs. local decisions
• How to minimize the impact on performance if the workload changes
Conclusion
We developed two lossy compression techniques:
• We answer queries exactly
• The index can fit into a space budget (space constraint)
• Queries can become faster on the compressed indexes
• Flexibility to choose a space/time tradeoff
• Existing list-merging algorithms can be re-used (even with compression-specific optimizations)
More Experiments What if the workload changes from the training workload?