This thesis explores efficient similarity search techniques for big data, covering partition-based symmetric comparison, load balancing, and cache-conscious traversal. Applications include document clustering, duplicate detection, and spam detection.
Efficient Similarity Search with Cache-Conscious Data Traversal
Xun Tang
Committee: Tao Yang (Chair), Divy Agrawal, Xifeng Yan
March 16, 2015
Roadmap
• Similarity search background
• Partition-based method background
• Three main components in my thesis:
  • Partition-based symmetric comparison and load balancing [SIGIR'14a]
  • Fast runtime execution considering memory hierarchy [SIGIR'13 + Journal]
  • Optimized search result ranking with cache-conscious traversal [SIGIR'14b]
• Conclusion
Similarity Search for Big Data
• Finding pairs of data objects with a similarity score above a threshold.
• Example applications:
  • Document clustering
  • Near-duplicate detection
  • Spam detection
  • Query suggestion
  • Advertisement fraud detection
  • Collaborative filtering & recommendation
• Data processing is very slow for large datasets. How can it be made fast and scalable?
Applications: Duplicate Detection & Clustering
Example vectors:
d1 = (1, 3, 5, 0, 0, 0, 4, 3, 2, 7)
d2 = (1, 2, 2, 0, 0, 0, 4, 3, 2, 7)
All-Pairs Similarity Search (APSS)
• Dataset: n normalized vectors d1, ..., dn.
• Cosine-based similarity: for normalized vectors, sim(di, dj) = di · dj.
• Given n normalized vectors, compute all pairs (di, dj) such that sim(di, dj) ≥ τ for a given threshold τ.
• Quadratic complexity: O(n²) pairs.
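To make the problem concrete, here is a minimal brute-force sketch (my illustration, not the thesis system); it assumes L2-normalized input so the dot product is the cosine score:

    import numpy as np

    def apss_bruteforce(vectors, tau):
        """vectors: n x d array of L2-normalized rows; returns (i, j, score) triples."""
        pairs = []
        n = vectors.shape[0]
        for i in range(n):                            # O(n^2) pair enumeration
            scores = vectors[i + 1:] @ vectors[i]     # dot products = cosine scores
            for off in np.nonzero(scores >= tau)[0]:
                pairs.append((i, i + 1 + off, float(scores[off])))
        return pairs

    # Example: the first two vectors are near-duplicates (cf. d1, d2 above).
    docs = np.array([[0.6, 0.8, 0.0],
                     [0.6, 0.8, 0.0],
                     [1.0, 0.0, 0.0]])
    print(apss_bruteforce(docs, 0.9))                 # -> [(0, 1, 1.0)]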
Big Data Challenges for Similarity Search
(Table: sequential processing time in hours; values marked * are estimated by sampling.)
• 4M tweets fit in memory, but take days to process.
• Approximated processing: df-limit [Lin SIGIR'09] removes features whose document frequency exceeds an upper limit.
Inverted Indexing and Parallel Score Accumulation for APSS [Lin SIGIR'09; Baraglia et al. ICDM'10]
(Figure: postings of features f1, f3, f5 from vectors such as d2 and d4 are mapped to partial results, which a reduce phase accumulates into scores such as sim(d2, d4), at the price of communication overhead between the map and reduce phases.)
Parallel Solutions for Exact APSS
• Parallel score accumulation [Lin SIGIR'09; Baraglia et al. ICDM'10]
• Partition-based Similarity Search (PSS) [Alabduljalil et al. WSDM'13]
Parallel Time Comparison: PSS vs. Parallel Score Accumulation
(Figure: on a Twitter dataset, PSS is about 25x faster than inverted indexing with partial-result parallelism.)
PSS: Partition-based Similarity Search
Key techniques:
• Partition-based symmetric comparison and load balancing [SIGIR'14a].
  • The challenge comes from the skewed distribution of data partition sizes and irregular dissimilarity relationships in large datasets.
  • Analysis of competitiveness to the optimum.
  • Scalable to large datasets on hundreds of cores.
• Fast runtime execution considering the memory hierarchy [SIGIR'13 + Journal].
Symmetry of Comparison
• Partition-level comparison is symmetric. Example: should Pi compare with Pj, or Pj compare with Pi?
• The choice of comparison direction impacts the communication cost and load of the corresponding tasks, and thus the load balance.
Similarity Graph vs. Comparison Graph
(Figure: the load assignment process transitions from the similarity graph to the comparison graph by assigning a direction to each edge.)
Load Balance Measurement & Examples
• Load balance metric: graph cost = max over tasks of (task cost).
• A task's cost is the sum of:
  • its self-comparison, including computation and I/O cost;
  • its comparisons with the partitions whose edges point to it.
Challenges of Optimal Load Balance
• Skewed distribution of node connectivity & node sizes, observed in empirical data.
Two-Stage Load Balance
Stage 1: Initial assignment of edge directions.
• Key idea: tasks with small partitions or low connectivity should absorb more load.
• Optimize a sequence of steps that balances the load.
Stage 2: Assignment refinement.
• Key idea: gradually shift load from heavy tasks to their lightest neighbors.
• Only reverse an edge direction if it is beneficial (a sketch of both stages follows below).
(Figure: the result of Stage 1, and one refinement step.)
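The following Python sketch is my simplification of the two-stage idea, not the SIGIR'14a algorithm verbatim; `sizes` carries each partition's self-comparison cost and `cost(i, j)` the cost of comparing Pi with Pj, charged to whichever task absorbs the edge:

    def two_stage_balance(sizes, edges, cost):
        load = dict(sizes)                       # start from self-comparison costs
        absorb = {}                              # edge -> task that does the work
        # Stage 1: heaviest comparisons first; the currently lighter endpoint absorbs.
        for (i, j) in sorted(edges, key=lambda e: cost(*e), reverse=True):
            tgt = i if load[i] <= load[j] else j
            absorb[(i, j)] = tgt
            load[tgt] += cost(i, j)
        # Stage 2: repeatedly shift an edge off the heaviest task to a lighter
        # neighbor, but only when the reversal lowers that task's load and keeps
        # the neighbor below the current maximum.
        improved = True
        while improved:
            improved = False
            heavy = max(load, key=load.get)
            for e, t in list(absorb.items()):
                if t != heavy:
                    continue
                other = e[0] if e[1] == heavy else e[1]
                if load[other] + cost(*e) < load[heavy]:   # reversal is beneficial
                    load[heavy] -= cost(*e)
                    load[other] += cost(*e)
                    absorb[e] = other
                    improved = True
                    break
        return absorb, load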
Competitive to Optimal Task Load Balancing
• Is this two-stage algorithm competitive with the optimum?
• Optimum = minimum over assignments of (maximum task cost).
• Result: two-stage solution ≤ (2 + δ) × optimum, where δ is the ratio of I/O and communication cost to computation cost. In our tested cases, δ ≈ 10%.
Competitive to an Optimum Runtime Scheduler
• Can the solution of the task assignment algorithm be competitive with the one produced by optimum runtime scheduling?
• PTopt = minimum parallel time on q cores.
• A greedy scheduler (e.g., Hadoop MapReduce) executes the tasks produced by the two-stage algorithm; the resulting schedule length is PTq (see the scheduling sketch below).
• Result: PTq is competitive with PTopt.
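For reference, here is a Graham-style greedy list scheduler of the kind assumed above (a sketch under my assumptions, not Hadoop's actual scheduler): each task goes to the currently least-loaded core, and the resulting makespan is PTq:

    import heapq

    def greedy_schedule(task_costs, q):
        cores = [0.0] * q                            # current finish time per core
        heapq.heapify(cores)
        for c in sorted(task_costs, reverse=True):   # longest processing time first
            heapq.heappush(cores, heapq.heappop(cores) + c)
        return max(cores)                            # schedule length PT_q

    print(greedy_schedule([5, 4, 3, 3, 2, 2, 1], q=3))   # -> 7 (total work 20 on 3 cores)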
Scalability: Parallel Time and Speedup
• Efficiency declines as I/O overhead among machines grows in larger clusters.
• The YMusic dataset is not large enough to use more cores while amortizing overhead.
Comparison with Circular Load Assignment [Alabduljalil et al. WSDM'13]
• Parallel time reduction:
  • Stage 1: up to 39%
  • Stage 2: up to a further 11%
(Figure: task cost improvement percentage.)
PSS: Partition-based Similarity Search
Key techniques:
• Partition-based symmetric comparison and load balancing [SIGIR'14a].
• Fast runtime execution considering the memory hierarchy [SIGIR'13 + Journal]:
  • Splitting hosted partitions to fit into cache reduces slow memory accesses (PSS1).
  • Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2).
  • Cost modeling of memory-hierarchy access guides parameter settings.
Memory-Hierarchy-Aware Execution in a PSS Task
• S = vectors of the partition this task owns
• B = vectors of other partitions to compare against
• C = temporary storage
Task steps:
1. Read the assigned partition into area S.
2. Repeat:
   • Read some vectors vi from other partitions into B.
   • Compare vi with S.
   • Output similar vector pairs.
   Until all other potentially similar vectors are compared.
Problem: PSS area S is too big to fit in cache
(Figure: S holds the inverted index of the owned vectors, C the accumulators for S, and B the other vectors; S is too long to fit in cache.)
PSS1: Cache-Conscious Data Splitting
(Figure: after splitting, S becomes splits S1, S2, ..., Sq, each with its own accumulator in C and compared against B. What split size should be used?)
PSS1 Task:
  Read S and divide it into many splits
  Read other vectors into B
  For each split Sx:
    Compare(Sx, B)
  Output similarity scores

Compare(Sx, B):
  for di in Sx:
    for dj in B:
      for each shared feature t:
        sim(di, dj) += wi,t * wj,t
      if sim(di, dj) + maxwdi * sumdj < τ then prune the pair (di, dj), since it can never reach the threshold
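A runnable Python rendering of the same logic under simplifying assumptions of mine: vectors are {feature: weight} dicts, and the pruning test is applied once up front using the bound sim(di, dj) ≤ maxw(di) · sumw(dj), rather than incrementally as in the real implementation:

    def pss1_task(S, B, tau, split_size):
        """S, B: lists of (doc_id, weights, maxw, sumw); yields (di, dj, score)."""
        for s in range(0, len(S), split_size):        # splits sized to stay in cache
            for di, wi, maxw_i, _ in S[s:s + split_size]:
                for dj, wj, _, sumw_j in B:
                    if maxw_i * sumw_j < tau:         # pair can never reach tau: prune
                        continue
                    sim = sum(w * wj.get(t, 0.0) for t, w in wi.items())
                    if sim >= tau:
                        yield (di, dj, sim)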
Modeling Memory/Cache Access of PSS1
• Each accumulation step sim(di, dj) += wi,t * wj,t and each pruning test touches area Si (split vectors), area B (other vectors), and area C (accumulators).
• Total number of data accesses: D0 = D0(Si) + D0(B) + D0(C)
Cache Misses and Data Access Time
Memory and cache access counts:
• D0: total memory data accesses
• D1: accesses missed at L1
• D2: accesses missed at L2
• D3: accesses missed at L3
Memory and cache access times:
• δi: access time at cache level i
• δmem: access time in memory
Total data access time = (D0 - D1)δ1 + (D1 - D2)δ2 + (D2 - D3)δ3 + D3δmem
Per-term latencies in the model (typical cycle counts):
• Data found in L1: δ1 ≈ 2 cycles
• Data found in L2: δ2 ≈ 6-10 cycles
• Data found in L3: δ3 ≈ 30-40 cycles
• Data found in memory: δmem ≈ 100-300 cycles
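Plugging illustrative numbers into the model (the miss counts and latencies below are hypothetical, chosen only to show how the formula is evaluated):

    def access_time(D0, D1, D2, D3, d1=2, d2=8, d3=35, dmem=200):
        """Total data access time in cycles, per the model above."""
        return (D0 - D1) * d1 + (D1 - D2) * d2 + (D2 - D3) * d3 + D3 * dmem

    # 1e9 accesses with 10% missing L1, 2% missing L2, 0.5% missing L3:
    print(access_time(1e9, 1e8, 2e7, 5e6))   # ~3.97e9 cycles; the 0.5% memory misses alone cost 1e9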
Time Comparison: PSS vs. PSS1
• Consider the case where a PSS1 split fits in the L2 cache.
• The L1 cache miss ratio is in practice > 10%.
• Memory access is two orders of magnitude slower than an L1 hit.
• Ideal speedup ratio: ~10x.
Actual vs. Predicted
Avg. task time ≈ #features × (lookup + multiply + add) + memory/cache access time
PSS2: Vector Coalescing
• Issues:
  • PSS1 focuses on splitting S to fit into cache.
  • PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C.
• Solution: coalesce multiple vectors in B.
PSS2: Example of Improved Locality
(Figure: striped areas of Si, B, and C stay in cache while a coalesced group of B vectors is processed.)
• Improves temporal locality in memory areas B and C.
• Amortizes the inverted-index lookup cost across the coalesced vectors (see the sketch below).
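A simplified sketch of the coalescing idea (my rendering, not the exact PSS2 implementation): probe a split's inverted index with a small block of B vectors at a time, so each posting list stays hot in cache across the whole block instead of being refetched per vector:

    def score_block(index_Sx, block, C):
        """index_Sx: {feature: [(doc_id, weight), ...]}; C: score accumulators."""
        for dj, wj in block:
            for t, wjt in wj.items():
                for di, wit in index_Sx.get(t, ()):   # posting reused across the block
                    C[(di, dj)] = C.get((di, dj), 0.0) + wit * wjt

    def pss2_scores(index_Sx, B, block_size):
        C = {}
        for b in range(0, len(B), block_size):        # coalesce block_size vectors
            score_block(index_Sx, B[b:b + block_size], C)
        return C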
Incorporating LSH with PSS
• LSH functions for signature generation:
  • MinHash, for Jaccard similarity
  • Random projection, for cosine similarity
LSH Pipeline
• LSH sub-steps (sketched below):
  • Projection generation
  • Signature generation
  • Bucket generation
• Benefits:
  • Great for parallelization
  • Scales to larger datasets
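A sketch of the three sub-steps for cosine similarity with random projections (my illustration; the pipeline equally supports MinHash for Jaccard similarity):

    import numpy as np

    def lsh_buckets(vectors, k, seed=0):
        rng = np.random.default_rng(seed)
        proj = rng.standard_normal((vectors.shape[1], k))  # 1. projection generation
        sigs = (vectors @ proj) >= 0                       # 2. k-bit sign signatures
        buckets = {}
        for i, bits in enumerate(sigs):                    # 3. bucket generation
            buckets.setdefault(bits.tobytes(), []).append(i)
        return buckets   # only vectors sharing a bucket need exact comparison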
Effectiveness of Our Method
• 100% precision; for comparison, 67% in [Ture et al. SIGIR'11].
• A guaranteed recall ratio for a given similarity threshold, using l rounds of k-bit signatures.
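The recall guarantee can be made concrete with the standard random-projection collision probability (a back-of-envelope of mine, with k and l as on the slide): two vectors at cosine similarity s agree on one signature bit with probability p = 1 - arccos(s)/π, so recall after l independent k-bit rounds is at least 1 - (1 - p^k)^l:

    import math

    def lsh_recall(s, k, l):
        p = 1.0 - math.acos(s) / math.pi       # per-bit agreement probability
        return 1.0 - (1.0 - p ** k) ** l       # caught in at least one round

    print(lsh_recall(0.95, k=10, l=20))        # -> ~0.9998, comfortably above 95%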
Efficiency – 20M Tweets
• >95% recall at 0.95 cosine similarity, on 50 cores.
• Tradeoff in choosing k:
  • Too high: partitions become too small.
  • Too low: not enough speedup from hashing.
Method Comparison – 20M Tweets
• LSH improves efficiency (speed) with a recall bound.
• PSS guarantees precision.
Efficiency – 40M ClueWeb
• 95% recall at 0.95 cosine similarity, on 300 cores.
• LSH+PSS beats pure LSH: precision rises to 100% at faster speed.
• LSH+PSS beats pure PSS: 71x speedup.
PSS with Incremental Updates
• New documents are appended to the end of a new partition.
• The new partition is compared with all the original partitions.
• Static partitions are then updated with the new documents (sketch below).
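A minimal sketch of that update flow (function and parameter names are mine):

    def incremental_update(partitions, new_docs, compare, tau):
        p_new = new_docs                           # new docs form one new partition
        results = []
        for p in partitions:                       # new partition vs. each original
            results += compare(p_new, p, tau)
        results += compare(p_new, p_new, tau)      # pairs within the new batch
        partitions.append(p_new)                   # it now joins the static set
        return results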
Result Ranking After Similarity-based Retrieval or Other Metrics
Motivation
• Machine-learnt ranking models are popular, e.g., ranking ensembles such as gradient boosted regression trees (GBRT).
• A large number of trees are used to improve accuracy: winning teams at the Yahoo! Learning-to-Rank Challenge used ensembles with 2K to 20K trees, or even 300K trees with bagging methods.
• Computing large ensembles is time-consuming:
  • Access to irregular document attributes impairs CPU cache reuse.
  • Unorchestrated slow memory accesses incur significant cost; memory access latency is ~200x slower than L1 cache.
  • Dynamic tree branching impairs instruction branch prediction.
Data traversal in existing solutions: Scorer-Ordered Traversal (SOT), which applies one tree to all documents before moving to the next tree.
Our starting point: Document-Ordered Traversal (DOT), which applies all trees to one document before moving to the next document.
Why Better?
• 2D blocking reduces the total slow memory accesses in score calculation; it can be up to s times faster, but s is capped by the cache size.
• A 2D block fully exploits cache capacity for better temporal locality (sketch below).
• Block-VPred: a combined solution that applies 2D blocking on top of VPred [Asadi et al. TKDE'13], converting control dependence into data dependence to reduce instruction branch misprediction.
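A sketch of 2D blocking for ensemble scoring (my simplification; trees are nested dicts and the block sizes are illustrative): a block of documents is walked against a block of trees, so the documents' feature vectors and the trees' nodes both stay cache-resident while they are being reused:

    def score_2d_blocked(docs, trees, d_block, t_block):
        scores = [0.0] * len(docs)
        for t0 in range(0, len(trees), t_block):          # block of trees
            tb = trees[t0:t0 + t_block]
            for d0 in range(0, len(docs), d_block):       # block of documents
                for i in range(d0, min(d0 + d_block, len(docs))):
                    x = docs[i]                           # reused across all trees in tb
                    for tree in tb:                       # nodes reused across the doc block
                        node = tree
                        while "leaf" not in node:         # descend to a leaf
                            node = node["left"] if x[node["feat"]] <= node["thr"] else node["right"]
                        scores[i] += node["leaf"]
        return scores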
Scoring Time per Document per Tree (in Nanoseconds)
• Query latency = scoring time per document per tree × n × m, for n documents ranked with an m-tree model (worked example below).
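A worked example with illustrative numbers (not measurements from the thesis):

    ns = 1e-9
    n, m = 1000, 3000                    # docs per query, trees in the ensemble
    print(100 * ns * n * m)              # 100 ns/doc/tree -> 0.3 s per query
    print(20 * ns * n * m)               # cutting it to 20 ns -> 0.06 s per query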