36th ACM International Conference on Information Retrieval
Cache-Conscious Performance Optimization for Similarity Search
Maha Alabduljalil, Xun Tang, Tao Yang
Department of Computer Science, University of California at Santa Barbara
All Pairs Similarity Search (APSS)
• Definition: finding all pairs of objects whose similarity is above a given threshold: Sim(di, dj) = cos(di, dj) ≥ τ (a minimal sketch of this test follows below).
• Application examples:
  • Collaborative filtering.
  • Spam and near-duplicate detection.
  • Image search.
  • Query suggestions.
• Motivation: APSS is still time-consuming for large datasets.
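To make the definition concrete, here is a minimal Java sketch (not the paper's code) of the threshold test for sparse, unit-normalized vectors stored as termId→weight maps; the class and method names are hypothetical:

```java
import java.util.Map;

public class CosineSimilarity {
    // Returns true if cos(di, dj) >= tau. Assumes both vectors are
    // unit-normalized, so the cosine reduces to a sparse dot product.
    static boolean isSimilar(Map<Integer, Double> di,
                             Map<Integer, Double> dj, double tau) {
        // Iterate over the smaller vector to do fewer hash lookups.
        Map<Integer, Double> small = di.size() <= dj.size() ? di : dj;
        Map<Integer, Double> large = (small == di) ? dj : di;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        return dot >= tau;
    }
}
```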
Previous Work
• Approaches to speed up APSS:
  • Exact APSS:
    • Dynamic computation filtering [Bayardo et al. WWW'07]
    • Inverted indexing [Arasu et al. VLDB'06]
    • Parallelization with MapReduce [Lin SIGIR'09]
    • Partition-based similarity comparison [Alabduljalil et al. WSDM'13]
  • Approximate APSS via LSH: trades off precision against recall and adds redundant computation.
• Approaches that utilize the memory hierarchy:
  • General query processing [Manegold VLDB'02]
  • Other computing problems.
Baseline: Partition-based Similarity Search (PSS) [WSDM'13]
[Slide figure: partitioning with dissimilarity detection, followed by similarity comparison with parallel tasks]
PSS Task
Memory areas: S = vectors owned, B = other vectors, C = temporary scores.
Task steps (sketched in code below):
• Read the assigned partition into area S.
• Repeat:
  • Read some vectors vi from other partitions.
  • Compare vi with S.
  • Output similar vector pairs.
• Until all potentially similar vectors have been compared.
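A minimal sketch of this task loop under the stated memory-area layout (the types and the isSimilar helper from the earlier sketch are hypothetical; the real system runs many such tasks in parallel):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class PssTask {
    // S: the partition owned by this task (area S).
    // others: stream of vectors from other, potentially similar
    //         partitions, read incrementally into area B.
    static void run(List<Map<Integer, Double>> S,
                    Iterator<Map<Integer, Double>> others, double tau) {
        while (others.hasNext()) {
            Map<Integer, Double> vi = others.next();   // area B
            for (Map<Integer, Double> dj : S) {        // area S
                // Partial scores live in temporary area C; here we only
                // test the final score against the threshold.
                if (CosineSimilarity.isSimilar(vi, dj, tau)) {
                    System.out.println("similar pair");  // output the pair
                }
            }
        }
    }
}
```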
Focus and Contribution
• Contribution:
  • Analyze memory-hierarchy behavior in PSS tasks.
  • New data layout/traversal techniques for speedup:
    • Splitting data blocks to fit the cache.
    • Coalescing: read a block of vectors from other partitions and process them together.
• Algorithms:
  • Baseline: PSS [WSDM'13]
  • Cache-conscious designs: PSS1 & PSS2
Problem 1: PSS area S is too big to fit in cache
[Slide figure: area S (inverted index of vectors) is too long to fit in cache, alongside area B (other vectors) and area C (accumulator for S)]
PSS1: Cache-conscious data splitting
[Slide figure: after splitting, S becomes splits S1, S2, …, Sq, each compared against B with an accumulator C for the current split Si; what split size to choose?]
PSS1 Task
Read S and divide it into many splits
Read other vectors into B
For each split Sx:
  Compare(Sx, B)
Output similarity scores

Compare(Sx, B):
for di in Sx:
  for dj in B:
    for each shared feature t: sim(di, dj) += wi,t * wj,t
    if sim(di, dj) + maxwdi * sumdj < τ then skip dj (it cannot reach the threshold)
(A runnable sketch of Compare follows.)
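A runnable Java sketch of Compare(Sx, B) for a single vector dj from B, applying the dynamic-filtering test above (feature lists are assumed sorted by term id; sxMaxW and djSum are assumed precomputed, where maxwdi is the largest weight in di and sumdj is the sum of dj's weights):

```java
import java.util.List;

public class Pss1Compare {
    // Compare every di in split Sx against one vector dj from B.
    // sxTerms/sxWeights: sorted term ids and weights of each di in Sx.
    // sxMaxW[i]: the largest weight in di (maxwdi).
    // djSum: sum of dj's weights (sumdj).
    static void compare(List<int[]> sxTerms, List<double[]> sxWeights,
                        double[] sxMaxW,
                        int[] djTerms, double[] djWeights, double djSum,
                        double tau) {
        for (int i = 0; i < sxTerms.size(); i++) {
            int[] ti = sxTerms.get(i);
            double[] wi = sxWeights.get(i);
            double sim = 0.0, remaining = djSum; // dj weight not yet seen
            int a = 0, b = 0;
            while (a < ti.length && b < djTerms.length) {
                if (ti[a] == djTerms[b]) {            // shared feature t
                    sim += wi[a] * djWeights[b];
                    remaining -= djWeights[b];
                    a++; b++;
                } else if (ti[a] < djTerms[b]) {
                    a++;
                } else {
                    remaining -= djWeights[b];
                    b++;
                }
                // Filter: even if every unseen dj weight matched di's
                // largest weight, the score could not reach tau.
                if (sim + sxMaxW[i] * remaining < tau) { sim = -1; break; }
            }
            if (sim >= tau) System.out.println("pair (" + i + ", dj) passes");
        }
    }
}
```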
Modeling Memory/Cache Access of PSS1
[Slide figure: the inner comparison loop — sim(di,dj) += wi,t * wj,t with the filtering test sim(di,dj) + maxwdi * sumdj < τ — annotated with the memory areas it touches: Si, B, and the accumulator C]
Total number of data accesses: D0 = D0(Si) + D0(B) + D0(C)
Cache misses and data access time
Memory and cache access counts:
• D0: total memory data accesses
• D1: accesses missed at L1
• D2: accesses missed at L2
• D3: accesses missed at L3
Memory and cache access time:
• δi: access time at cache level i
• δmem: access time in memory
Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem
Total data access time = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3δmem
Each term weights accesses by where the data is found:
• Found in L1: ~2 cycles
• Found in L2: 6–10 cycles
• Found in L3: 30–40 cycles
• Found in memory: 100–300 cycles
(A numeric sketch of this model follows.)
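A quick numeric sketch of the access-time model; the per-level cycle costs are the slide's approximate figures, and the miss counts in main are invented purely for illustration:

```java
public class AccessTimeModel {
    // Total data access time =
    //   (D0-D1)*c1 + (D1-D2)*c2 + (D2-D3)*c3 + D3*cMem,
    // where Di is the number of accesses that miss cache level i.
    static double totalAccessCycles(long d0, long d1, long d2, long d3) {
        final double c1 = 2, c2 = 8, c3 = 35, cMem = 200; // approx. cycles
        return (d0 - d1) * c1 + (d1 - d2) * c2 + (d2 - d3) * c3 + d3 * cMem;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 1e9 accesses, 5% miss L1, 1% miss L2,
        // 0.2% miss L3. The memory term still dominates the total.
        System.out.println(totalAccessCycles(
                1_000_000_000L, 50_000_000L, 10_000_000L, 2_000_000L));
    }
}
```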
Actual vs. Predicted
Avg. task time ≈ #features × (lookup + multiply + add) + access_mem
[Slide chart: measured task times versus the model's predictions]
RECALL: Split size s
[Slide figure: S divided into splits S1, S2, …, Sq of size s, each compared against B with accumulator C for Si]
Ratio of Data Access to Computation
Avg. task time ≈ #features × (lookup + add + multiply) + access_mem
[Slide chart: the ratio of data-access time to computation time as a function of split size s]
PSS2: Vector coalescing
• Issues:
  • PSS1 focused on splitting S to fit into the cache.
  • PSS1 does not exploit cache reuse to improve temporal locality in memory areas B and C.
• Solution: coalesce multiple vectors in B and process them together (see the sketch after the next slide).
PSS2: Example for improved locality
[Slide figure: striped areas of Si, B, and C resident in cache while a coalesced block of B vectors is processed]
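A minimal sketch of the coalescing idea (the data structures are hypothetical, not the paper's code): a block of b vectors from B is driven together through the inverted index of a split, so each posting list is brought into cache once and reused b times, and the accumulator rows in C stay hot:

```java
import java.util.List;
import java.util.Map;

public class Pss2Coalesce {
    // Posting list for one term: the vectors of split Si containing it.
    static final class Posting {
        final int term; final int[] vecIds; final double[] weights;
        Posting(int term, int[] vecIds, double[] weights) {
            this.term = term; this.vecIds = vecIds; this.weights = weights;
        }
    }

    // Score a coalesced block of b vectors from B against split Si.
    // scores[j][v] accumulates sim(block[j], vector v of Si) in area C.
    static void scoreBlock(List<Posting> siIndex,
                           List<Map<Integer, Double>> block,
                           double[][] scores) {
        for (Posting p : siIndex) {                   // one pass over index
            for (int j = 0; j < block.size(); j++) {  // reuse cached posting
                Double wj = block.get(j).get(p.term);
                if (wj == null) continue;
                for (int k = 0; k < p.vecIds.length; k++) {
                    scores[j][p.vecIds[k]] += wj * p.weights[k];
                }
            }
        }
    }
}
```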
Evaluation
• Implementation: Hadoop MapReduce.
• Objectives:
  • Effectiveness of PSS1 and PSS2 over PSS.
  • Benefits of modeling.
• Datasets: Twitter, Clueweb, Enron emails, YahooMusic, Google news.
• Preprocessing:
  • Stopword removal + df-cut.
  • Static partitioning for dissimilarity detection.
RECALL: coalescing size b
[Slide figure: a coalesced block of b vectors from B compared against Si with accumulator C; avg. # of sharing = 2]
Overall performance
[Slide chart: overall performance on Clueweb]
Impact of split size s in PSS1
[Slide charts: impact of split size s on the Clueweb, Twitter, and Emails datasets]
RECALL: split size s & coalescing size b
[Slide figure: split size s within Si and coalescing size b within B, with accumulator C]
Conclusions
• Splitting hosted partitions to fit into cache reduces slow memory data accesses (PSS1).
• Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2).
• Cost modeling of memory-hierarchy access guides parameter setting.
• Experiments show the cache-conscious designs run up to 2.74x as fast as the cache-oblivious baseline.