RankReduce – Processing K-Nearest Neighbors Queries on Top of MapReduce Aleksandar Stupar, Sebastian Michel, and Ralf Schenkel LSDS-IR, Genève 2010
Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook
Similarity Search • Given a document, find similar documents • e.g., given a photo, find similar ones • Used extensively in content-based retrieval
The Problem • Huge datasets • Number of digital cameras • Web 2.0 success: Facebook (FB), Flickr,… • 60+ million photos uploaded to FB weekly (2007) • Approximately 5000GB data • Similarity Search • How to? • Solution • Distributed • Reliable • Efficient
The RankReduce Approach • Framework for similarity search • Large scale data • Vector based (pictures, music, video,…) • Built on top of • Locality Sensitive Hashing • MapReduce framework
Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook
K-Nearest Neighbors • Feature vector representation • Similarity defined by distance measure • L1: $d_1(x, y) = \sum_{i} |x_i - y_i|$ • L2: $d_2(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$ • Exact solutions • Linear scan • Tree structures
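The linear-scan baseline named on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function names (`l1_distance`, `l2_distance`, `knn_linear_scan`) and the toy data are assumptions made for the example.

```python
import heapq
import math


def l1_distance(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_linear_scan(query, vectors, k, distance=l2_distance):
    """Exact k-NN: score every vector and keep the k smallest distances.

    Returns a list of (distance, index) pairs sorted by distance.
    """
    scored = ((distance(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Hypothetical toy data: four 2-dimensional feature vectors.
data = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0), (0.5, 0.0)]
nearest = knn_linear_scan((0.0, 0.0), data, k=2)
```

Every query touches every vector, which is the cost that the approximate approach on the following slides tries to avoid.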
Locality Sensitive Hashing (LSH) (1) • Efficient • High dimensional data • Approximate K-Nearest Neighbors • What does approximate mean? • Trade-off (precision, space requirements, processing time) • Is approximate good enough? • Sketches of the documents • E.g. color structure, Chroma features
Locality Sensitive Hashing (LSH) (2) • Feature vectors are hashed to buckets • Neighbors collide in the same bucket • With high probability • Query processing • Multi-probe [2]
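To give a feel for what probing more than one bucket means, here is a deliberately simplified sketch. It assumes a bucket is identified by a tuple of integer hash values and probes the query's own bucket plus every bucket whose key differs by one in a single coordinate. The multi-probe scheme cited as [2] chooses and orders its perturbations more carefully; `one_step_probes` is a hypothetical helper for illustration only.

```python
def one_step_probes(bucket_key):
    """Return the home bucket key followed by its one-step neighbors.

    A neighbor differs from the home key by +1 or -1 in exactly one
    coordinate (a simplification of multi-probe LSH).
    """
    probes = [tuple(bucket_key)]
    for i in range(len(bucket_key)):
        for delta in (-1, 1):
            neighbor = list(bucket_key)
            neighbor[i] += delta
            probes.append(tuple(neighbor))
    return probes


# Example key with four integer hash values (assumed layout).
probes = one_step_probes((0, 0, 1, 5))
```

Probing only `probes[0]` corresponds to a single probe; adding the neighbors trades extra reads for a better chance of finding true neighbors.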
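Locality Preserving Property • A family $\mathcal{H}$ of hash functions is $(r_1, r_2, p_1, p_2)$-sensitive if, for any two vectors $v, q$: • $d(v, q) \le r_1 \Rightarrow \Pr_{h \in \mathcal{H}}[h(v) = h(q)] \ge p_1$ • $d(v, q) > r_2 \Rightarrow \Pr_{h \in \mathcal{H}}[h(v) = h(q)] \le p_2$ • where $r_1 < r_2$ and $p_1 > p_2$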
LSH based on p-stable distribution • Vector projection LSH: $h_{a,B}(v) = \lfloor (a \cdot v + B) / W \rfloor$ • LSH parameters • Select the entries of a from a Normal distribution • Select B from a Uniform(0, W) distribution • W controls bucket size
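A minimal sketch of such a projection hash, using only the Python standard library. It follows the parameters listed on the slide (a drawn from a Normal distribution, B from Uniform(0, W)). Concatenating several hash values into one bucket key is an assumption here, suggested by the four-integer bucket names that appear later in the deck; all function names are hypothetical.

```python
import math
import random


def make_hash_function(dim, w, rng):
    """Build one hash h(v) = floor((a . v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # entries of a ~ N(0, 1)
    b = rng.uniform(0.0, w)                        # offset b ~ Uniform(0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


def make_bucket_key_function(dim, w, num_hashes, rng):
    """Concatenate several hashes into one tuple-valued bucket key."""
    hashes = [make_hash_function(dim, w, rng) for _ in range(num_hashes)]
    return lambda v: tuple(h(v) for h in hashes)


rng = random.Random(42)  # fixed seed so the example is reproducible
bucket_key = make_bucket_key_function(dim=4, w=4.0, num_hashes=4, rng=rng)
v = [1.0, 2.0, 3.0, 4.0]
key = bucket_key(v)
```

A larger `w` makes each bucket wider, so more vectors share a key; a smaller `w` gives finer buckets and fewer collisions.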
MapReduce • Large scale data processing • By Google [4] • Distributing • Data • Processing
MapReduce Properties • Cluster of commodity machines • Fault tolerant • Scalable • Implementation • Distributed File System (DFS) • MapReduce jobs
MapReduce Jobs • Data is pre-distributed • Calculations are done where data resides • Programming model • Map function • Reduce function • Job • Multiple map tasks • Sort and merge • One or multiple reduce tasks [Figure: Map tasks read from the DFS and feed a Reduce task]
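The programming model above can be imitated in memory to show how the three phases fit together. This is a single-process sketch with no distribution or fault tolerance; `run_job`, `map_words`, and `reduce_counts` are hypothetical names, and word counting is used only as a familiar example.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Run map, sort-and-merge, and reduce over an in-memory input."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Sort and merge: group all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per key, visited in sorted key order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


def map_words(line):
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    return sum(counts)


result = run_job(["map reduce map", "reduce map"], map_words, reduce_counts)
```

In a real deployment the map calls run on the machines that hold the input blocks, which is the "calculations are done where data resides" point of the slide.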
Perfect Marriage MapReduce + LSH = RankReduce
Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook
RankReduce (1) • LSH index is stored in Distributed File System • Hash Tables mapped to folders • Buckets mapped to files
Distributed File System:
/HashTable1: bucket_0_0_1_5, bucket_2_-3_3_1
/HashTable2: bucket_12_0_1_-1, bucket_8_-1_9_10
…
/HashTableN: bucket_0_0_0_-1, bucket_12_1_13_-9
…
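The folder and file naming shown on the slide can be reproduced with a small helper. This is a sketch that assumes a bucket key is a tuple of integers and that the file name is simply the key joined with underscores; `bucket_path` and its `root` parameter are hypothetical.

```python
import posixpath


def bucket_path(table_index, bucket_key, root="/"):
    """Map (hash table index, bucket key) to a DFS-style file path."""
    name = "bucket_" + "_".join(str(c) for c in bucket_key)
    return posixpath.join(root, f"HashTable{table_index}", name)


path = bucket_path(1, (2, -3, 3, 1))
```

With this layout, looking up a bucket at query time is a path computation followed by a sequential read of one file.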
RankReduce (2) • Benefits • Fast lookup at query time • Only probed data is read • Block-based sequential access • Downside • Potentially high number of files
Query Processing (1) • As a MapReduce Job • List of buckets to probe as input • Single probe [Figure: the query is submitted as an MR job over bucket files under /HashTable1 … /HashTableN in the Distributed File System]
Query Processing (2) • Map function calculates similarity • Reduce method sorts and emits K-nearest neighbors • Possible secondary sort [Figure: the query and the probed buckets feed the Map tasks; a Reduce task outputs the KNN]
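The division of work between map and reduce described on this slide can be sketched in memory as follows. This is an illustration, not the authors' Hadoop implementation: buckets are plain lists of (item id, vector) pairs, L2 distance is assumed as the similarity measure, and `map_bucket` and `reduce_knn` are hypothetical names.

```python
import heapq
import math


def map_bucket(query, bucket):
    """Map: emit (query_id, (distance, item_id)) for each vector in a bucket."""
    query_id, query_vector = query
    for item_id, vector in bucket:
        yield query_id, (math.dist(query_vector, vector), item_id)


def reduce_knn(query_id, scored_items, k):
    """Reduce: sort the candidates of one query and keep the k nearest.

    The same item can be found in buckets of several hash tables,
    so duplicates are dropped before selecting the top k.
    """
    return query_id, heapq.nsmallest(k, set(scored_items))


# Hypothetical probed buckets: item "a" appears in two hash tables.
probed_buckets = [
    [("a", (0.0, 0.0)), ("b", (3.0, 4.0))],
    [("a", (0.0, 0.0)), ("c", (1.0, 0.0))],
]
query = ("q1", (0.0, 0.0))

emitted = [pair for bucket in probed_buckets for pair in map_bucket(query, bucket)]
scored = [value for key, value in emitted if key == "q1"]
result = reduce_knn("q1", scored, k=2)
```

Keying the map output by query id is what would let one job serve several queries at once, with each reduce call handling the candidates of a single query.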
Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook
Datasets • Flickr dataset (~ 54 million photos) • 64 dimensions • Color structure • CoPhIR data collection • Synthetic dataset • 32 dimensions • Base vectors: each dimension drawn i.i.d. from N(0,1), scaled by 10 • Neighbors: slightly changed base vectors
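A sketch of how a synthetic dataset of this kind might be generated. The slide does not say how the neighbors were perturbed, so the additive Gaussian noise and its standard deviation (`noise=0.1`) are assumptions, as are the function names.

```python
import random


def make_base_vector(dim, rng, scale=10.0):
    """Base vector: each dimension i.i.d. N(0, 1), multiplied by scale."""
    return [scale * rng.gauss(0.0, 1.0) for _ in range(dim)]


def make_neighbor(base, rng, noise=0.1):
    """Neighbor: the base vector with small Gaussian noise added (assumed)."""
    return [x + rng.gauss(0.0, noise) for x in base]


rng = random.Random(0)  # fixed seed for reproducibility
base = make_base_vector(32, rng)
neighbors = [make_neighbor(base, rng) for _ in range(5)]
```

Because the perturbation is small relative to the scale of the base vectors, each generated neighbor stays much closer to its own base vector than to an unrelated one, which gives a known ground truth for measuring precision.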
Parameter Tuning • Too many files can degrade performance • (File Size < 64 KB) • LSH parameter tuning (precision, index size,…)
Experimental Evaluation • RankReduce approach vs. linear scan • Hadoop • Open source implementation of MapReduce • Single machine installation • One mapper per machine allowed • Measured • Map task execution time (approximately constant) • Number of map tasks per job
Results Interpretation • Precision • Real image data >85% • Synthetic data >70% • Runtime • Number of map tasks, compared to linear scan • Real image data 4-5 times better • Synthetic data 3-4 times better
Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook
Conclusion and Outlook • Similarity search on large datasets • Robust and scalable framework • Locality Sensitive Hashing & MapReduce • Experimental evaluation • Real computer cluster (several TBs) • Music retrieval • Chroma features • Time dimension
Thank you! Questions?