36th ACM International Conference on Information Retrieval
Cache-Conscious Performance Optimization for Similarity Search
Maha Alabduljalil, Xun Tang, Tao Yang
Department of Computer Science, University of California at Santa Barbara
All Pairs Similarity Search (APSS)
• Definition: finding all pairs of objects whose similarity is above a given threshold: Sim(di, dj) = cos(di, dj) ≥ τ (a minimal sketch of this test follows below).
• Application examples:
  • Collaborative filtering.
  • Spam and near-duplicate detection.
  • Image search.
  • Query suggestions.
• Motivation: APSS is still time-consuming for large datasets.
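To make the definition concrete, here is a minimal Java sketch (not the paper's code) of the threshold test for sparse, unit-normalized vectors stored as termId→weight maps; the class and method names are hypothetical:

```java
import java.util.Map;

public class CosineSimilarity {
    // Returns true if cos(di, dj) >= tau. Assumes both vectors are
    // unit-normalized, so the cosine reduces to a sparse dot product.
    static boolean isSimilar(Map<Integer, Double> di,
                             Map<Integer, Double> dj, double tau) {
        // Iterate over the smaller vector to do fewer hash lookups.
        Map<Integer, Double> small = di.size() <= dj.size() ? di : dj;
        Map<Integer, Double> large = (small == di) ? dj : di;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot >= tau;
    }
}
```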
Previous Work
• Approaches to speed up APSS:
  • Exact APSS:
    • Dynamic computation filtering [Bayardo et al. WWW'07]
    • Inverted indexing [Arasu et al. VLDB'06]
    • Parallelization with MapReduce [Lin SIGIR'09]
    • Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
  • Approximate APSS via LSH: trades off precision against recall and adds redundant computation.
• Approaches that utilize the memory hierarchy:
  • General query processing [Manegold VLDB'02]
  • Other computing problems.
Baseline: Partition-based Similarity Search (PSS) [WSDM'13]
[Slide figure: partitioning with dissimilarity detection, followed by similarity comparison with parallel tasks]
PSS Task
Memory areas: S = vectors owned, B = other vectors, C = temporary scores.
Task steps (sketched in code below):
• Read the assigned partition into area S.
• Repeat:
  • Read some vectors vi from other partitions.
  • Compare vi with S.
  • Output similar vector pairs.
• Until all potentially similar vectors have been compared.
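A minimal sketch of this task loop under the stated memory-area layout (the types and the isSimilar helper from the earlier sketch are hypothetical; the real system runs many such tasks in parallel):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class PssTask {
    // S: the partition owned by this task (area S).
    // others: stream of vectors from other, potentially similar
    //         partitions, read incrementally into area B.
    static void run(List<Map<Integer, Double>> S,
                    Iterator<Map<Integer, Double>> others, double tau) {
        while (others.hasNext()) {
            Map<Integer, Double> vi = others.next();   // area B
            for (Map<Integer, Double> dj : S) {        // area S
                // Partial scores live in temporary area C; here we only
                // test the final score against the threshold.
                if (CosineSimilarity.isSimilar(vi, dj, tau)) {
                    System.out.println("similar pair");  // output the pair
                }
            }
        }
    }
}
```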
Focus and Contribution
• Contribution:
  • Analyze memory-hierarchy behavior in PSS tasks.
  • New data layout/traversal techniques for speedup:
    • Splitting data blocks to fit the cache.
    • Coalescing: read a block of vectors from other partitions and process them together.
• Algorithms:
  • Baseline: PSS [WSDM'13]
  • Cache-conscious designs: PSS1 & PSS2
Problem 1: PSS area S is too big to fit in cache
[Slide figure: area S (inverted index of vectors) is too long to fit in cache, alongside area B (other vectors) and area C (accumulator for S)]
PSS1: Cache-conscious data splitting
[Slide figure: after splitting, S becomes splits S1, S2, …, Sq, each compared against B with an accumulator C for the current split Si; what split size to choose?]
PSS1 Task
Read S and divide it into many splits
Read other vectors into B
For each split Sx:
  Compare(Sx, B)
Output similarity scores

Compare(Sx, B):
for di in Sx:
  for dj in B:
    for each shared feature t: sim(di, dj) += wi,t * wj,t
    if sim(di, dj) + maxwdi * sumdj < τ then skip dj (it cannot reach the threshold)
(A runnable sketch of Compare follows.)
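A runnable Java sketch of Compare(Sx, B) for a single vector dj from B, applying the dynamic-filtering test above (feature lists are assumed sorted by term id; sxMaxW and djSum are assumed precomputed, where maxwdi is the largest weight in di and sumdj is the sum of dj's weights):

```java
import java.util.List;

public class Pss1Compare {
    // Compare every di in split Sx against one vector dj from B.
    // sxTerms/sxWeights: sorted term ids and weights of each di in Sx.
    // sxMaxW[i]: the largest weight in di (maxwdi).
    // djSum: sum of dj's weights (sumdj).
    static void compare(List<int[]> sxTerms, List<double[]> sxWeights,
                        double[] sxMaxW,
                        int[] djTerms, double[] djWeights, double djSum,
                        double tau) {
        for (int i = 0; i < sxTerms.size(); i++) {
            int[] ti = sxTerms.get(i);
            double[] wi = sxWeights.get(i);
            double sim = 0.0, remaining = djSum; // dj weight not yet seen
            int a = 0, b = 0;
            while (a < ti.length && b < djTerms.length) {
                if (ti[a] == djTerms[b]) {            // shared feature t
                    sim += wi[a] * djWeights[b];
                    remaining -= djWeights[b];
                    a++; b++;
                } else if (ti[a] < djTerms[b]) {
                    a++;
                } else {
                    remaining -= djWeights[b];
                    b++;
                }
                // Filter: even if every unseen dj weight matched di's
                // largest weight, the score could not reach tau.
                if (sim + sxMaxW[i] * remaining < tau) { sim = -1; break; }
            }
            if (sim >= tau) System.out.println("pair (" + i + ", dj) passes");
        }
    }
}
```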
Modeling Memory/Cache Access of PSS1
[Slide figure: the inner comparison loop — sim(di,dj) += wi,t * wj,t with the filtering test sim(di,dj) + maxwdi * sumdj < τ — annotated with the memory areas it touches: Si, B, and the accumulator C]
Total number of data accesses: D0 = D0(Si) + D0(B) + D0(C)
Cache misses and data access time
Memory and cache access counts:
• D0: total memory data accesses
• D1: accesses missed at L1
• D2: accesses missed at L2
• D3: accesses missed at L3
Memory and cache access time:
• δi: access time at cache level i
• δmem: access time in memory
Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem
Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem
Each term weights accesses by where the data is found:
• Found in L1: ~2 cycles
• Found in L2: 6–10 cycles
• Found in L3: 30–40 cycles
• Found in memory: 100–300 cycles
(A numeric sketch of this model follows.)
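A quick numeric sketch of the access-time model; the per-level cycle costs are the slide's approximate figures, and the miss counts in main are invented purely for illustration:

```java
public class AccessTimeModel {
    // Total data access time =
    //   (D0-D1)*c1 + (D1-D2)*c2 + (D2-D3)*c3 + D3*cMem,
    // where Di is the number of accesses that miss cache level i.
    static double totalAccessCycles(long d0, long d1, long d2, long d3) {
        final double c1 = 2, c2 = 8, c3 = 35, cMem = 200; // approx. cycles
        return (d0 - d1) * c1 + (d1 - d2) * c2 + (d2 - d3) * c3 + d3 * cMem;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 1e9 accesses, 5% miss L1, 1% miss L2,
        // 0.2% miss L3. The memory term still dominates the total.
        System.out.println(totalAccessCycles(
                1_000_000_000L, 50_000_000L, 10_000_000L, 2_000_000L));
    }
}
```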
Actual vs. Predicted
Avg. task time ≈ #features × (lookup + multiply + add) + access_mem
[Slide chart: measured task times versus the model's predictions]
RECALL: Split size s
[Slide figure: S divided into splits S1, S2, …, Sq of size s, each compared against B with accumulator C for Si]
Ratio of Data Access to Computation
Avg. task time ≈ #features × (lookup + add + multiply) + access_mem
[Slide chart: the ratio of data-access time to computation time as a function of split size s]
PSS2: Vector coalescing
• Issues:
  • PSS1 focused on splitting S to fit into the cache.
  • PSS1 does not exploit cache reuse to improve temporal locality in memory areas B and C.
• Solution: coalesce multiple vectors in B and process them together (see the sketch after the next slide).
PSS2: Example for improved locality
[Slide figure: striped areas of Si, B, and C resident in cache while a coalesced block of B vectors is processed]
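A minimal sketch of the coalescing idea (the data structures are hypothetical, not the paper's code): a block of b vectors from B is driven together through the inverted index of a split, so each posting list is brought into cache once and reused b times, and the accumulator rows in C stay hot:

```java
import java.util.List;
import java.util.Map;

public class Pss2Coalesce {
    // Posting list for one term: the vectors of split Si containing it.
    static final class Posting {
        final int term; final int[] vecIds; final double[] weights;
        Posting(int term, int[] vecIds, double[] weights) {
            this.term = term; this.vecIds = vecIds; this.weights = weights;
        }
    }

    // Score a coalesced block of b vectors from B against split Si.
    // scores[j][v] accumulates sim(block[j], vector v of Si) in area C.
    static void scoreBlock(List<Posting> siIndex,
                           List<Map<Integer, Double>> block,
                           double[][] scores) {
        for (Posting p : siIndex) {                   // one pass over index
            for (int j = 0; j < block.size(); j++) {  // reuse cached posting
                Double wj = block.get(j).get(p.term);
                if (wj == null) continue;
                for (int k = 0; k < p.vecIds.length; k++) {
                    scores[j][p.vecIds[k]] += wj * p.weights[k];
                }
            }
        }
    }
}
```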
Evaluation
• Implementation: Hadoop MapReduce.
• Objectives:
  • Effectiveness of PSS1 and PSS2 over PSS.
  • Benefits of modeling.
• Datasets: Twitter, Clueweb, Enron emails, YahooMusic, Google news.
• Preprocessing:
  • Stopword removal + df-cut.
  • Static partitioning for dissimilarity detection.
RECALL: coalescing size b
[Slide figure: a coalesced block of b vectors from B compared against Si with accumulator C; avg. # of sharing = 2]
Overall performance
[Slide chart: overall performance on Clueweb]
Impact of split size s in PSS1
[Slide charts: impact of split size s on the Clueweb, Twitter, and Emails datasets]
RECALL: split size s & coalescing size b
[Slide figure: split size s within Si and coalescing size b within B, with accumulator C]
Conclusions
• Splitting hosted partitions to fit into cache reduces slow memory data accesses (PSS1).
• Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2).
• Cost modeling of memory-hierarchy access guides parameter setting.
• Experiments show the cache-conscious designs run up to 2.74x as fast as the cache-oblivious baseline.