RankReduce – Processing K-Nearest Neighbors Queries on Top of MapReduce

Aleksandar Stupar, Sebastian Michel, and Ralf Schenkel • LSDS-IR, Geneva 2010

Talk Outline • Motivation • Background • RankReduce Framework • Experimental Evaluation • Conclusion and Outlook

Similarity Search • Given a document find similar documents • e.g., given a photo find similar ones • Used extensively in Content Based Retrieval

The Problem • Huge datasets • Number of digital cameras • Web 2.0 success: Facebook (FB), Flickr,… • 60+ million photos uploaded to FB weekly (2007) • Approximately 5000GB data • Similarity Search • How to? • Solution • Distributed • Reliable • Efficient

The RankReduce Approach • Framework for similarity search • Large scale data • Vector based (pictures, music, video,…) • Built on top of • Locality Sensitive Hashing • MapReduce framework

K-Nearest Neighbors • Feature vector representation • Similarity defined by a distance measure • L1: $d_1(x, y) = \sum_i |x_i - y_i|$ • L2: $d_2(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ • Exact solutions • Linear scan • Tree structures
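The exact linear-scan baseline can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the toy vectors are assumptions made for the example.

```python
import heapq
import math


def l1_distance(x, y):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_linear_scan(query, vectors, k, distance=l2_distance):
    """Exact K-nearest neighbors: compare the query against every vector."""
    scored = ((distance(query, v), i) for i, v in enumerate(vectors))
    # Returns the k smallest (distance, index) pairs, closest first.
    return heapq.nsmallest(k, scored)


# Toy data (hypothetical): four 2-dimensional feature vectors.
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
neighbors = knn_linear_scan((0.0, 0.0), vectors, k=2)
```

The scan touches every vector once per query, which is the cost the approximate LSH approach tries to avoid on large collections.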

Locality Sensitive Hashing (LSH) (1) • Efficient • High dimensional data • Approximate K-Nearest Neighbors • What does approximate mean? • Trade-off • (precision, space requirements, processing time) • Is approximate good enough? • Sketches of the documents • E.g. color structure, Chroma features

Locality Sensitive Hashing (LSH) (2) • Feature vectors are hashed to buckets • Neighbors collide in the same bucket • With high probability • Query processing • Multi-probe [2]

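Locality Preserving Property • A family of hash functions $H$ is $(r_1, r_2, p_1, p_2)$-sensitive if, for any two vectors $u$, $v$ and $h$ drawn uniformly from $H$: • $d(u, v) \le r_1 \Rightarrow \Pr[h(u) = h(v)] \ge p_1$ • $d(u, v) \ge r_2 \Rightarrow \Pr[h(u) = h(v)] \le p_2$ • with $r_1 < r_2$ and $p_1 > p_2$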

LSH based on p-stable distribution • Vector projection LSH: $h_{a,B}(v) = \lfloor (a \cdot v + B) / W \rfloor$ • LSH parameters • Select each component of a from a Normal distribution • Select B from the Uniform(0, W) distribution • W controls bucket size
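A minimal sketch of the vector projection hash, assuming the usual p-stable form h(v) = floor((a · v + B) / W) with a drawn from N(0, 1) and B from Uniform(0, W). The helper names, the seed, and the choice to concatenate four hash values into one bucket signature are assumptions for illustration (the four-component bucket names shown later in the talk suggest, but do not confirm, such a concatenation).

```python
import math
import random


def make_hash_function(dim, w, rng):
    """Draw one hash function h(v) = floor((a . v + b) / w)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # each component ~ Normal(0, 1)
    b = rng.uniform(0.0, w)                        # offset ~ Uniform(0, W)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


def make_signature_function(dim, w, num_hashes, rng):
    """Concatenate several hash functions into one bucket signature."""
    hashes = [make_hash_function(dim, w, rng) for _ in range(num_hashes)]
    return lambda v: tuple(h(v) for h in hashes)


rng = random.Random(42)  # fixed seed so the example is repeatable
signature = make_signature_function(dim=4, w=4.0, num_hashes=4, rng=rng)
sig = signature([1.0, 2.0, 3.0, 4.0])
```

A larger W widens each bucket, so more vectors share a signature; a smaller W narrows buckets and lowers the chance that true neighbors collide.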

MapReduce • Large scale data processing • By Google [4] • Distributing • Data • Processing

MapReduce Properties • Cluster of commodity machines • Fault tolerant • Scalable • Implementation • Distributed File System (DFS) • MapReduce jobs

MapReduce Jobs • Data is pre-distributed • Calculations are done where data resides • Programming model • Map function • Reduce function • Job • Multiple map tasks • Sort and merge • One or multiple reduce tasks
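The programming model can be illustrated with a single-process simulation. This is only a sketch of the map, sort-and-merge, and reduce phases; it is not Hadoop's API, and `run_job` and the word-count functions are hypothetical names chosen for the example.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Simulate one job in memory: map, group and sort by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase: one call per input record
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):             # sort and merge: keys in order
        output.extend(reduce_fn(key, groups[key]))
    return output


def count_map(word):
    """Map: emit (word, 1) for every input word."""
    yield word, 1


def count_reduce(word, counts):
    """Reduce: sum all counts collected for one word."""
    yield word, sum(counts)


result = run_job(["map", "reduce", "map"], count_map, count_reduce)
```

In a real cluster the map calls run as separate tasks on the machines that hold the input blocks, and the grouping step is done by the framework between the two phases.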

Perfect Marriage MapReduce + LSH = RankReduce

RankReduce (1) • LSH index is stored in the Distributed File System • Hash tables mapped to folders • Buckets mapped to files • Distributed File System layout: /HashTable1 (bucket_0_0_1_5, bucket_2_-3_3_1), /HashTable2 (bucket_12_0_1_-1, bucket_8_-1_9_10), …, /HashTableN (bucket_0_0_0_-1, bucket_12_1_13_-9, …)
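The folder and file naming shown on this slide can be reproduced with a small helper. The function name and the "/" separator between folder and file are assumptions; only the `/HashTableN` and `bucket_…` patterns come from the slide.

```python
def bucket_path(table_index, signature):
    """Build a DFS path for one bucket from its hash table index and signature."""
    # Signature components are joined with underscores; negative values keep their sign.
    name = "bucket_" + "_".join(str(component) for component in signature)
    return f"/HashTable{table_index}/{name}"


path = bucket_path(1, (0, 0, 1, 5))
```

With this layout, probing a bucket at query time amounts to opening one file whose path is computed directly from the query's hash signature.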

RankReduce (2) • Benefits • Fast lookup at query time • Only probed data is read • Block-based sequential access • Downside • Potentially high number of files

Query Processing (1) • As a MapReduce job • List of buckets to probe as input • Single probe • Diagram: the query's MapReduce job reads the probed bucket files from the Distributed File System (/HashTable1 … /HashTableN)

Query Processing (2) • Map function calculates similarity • Reduce function sorts and emits the K nearest neighbors • Possible secondary sort • Diagram: the probed buckets and the query feed the map tasks; a reduce task outputs the KNN
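The two functions of the query job might look roughly like this. It is a plain-Python sketch rather than Hadoop code, it assumes L2 distance, and the names, the duplicate handling, and the toy buckets are assumptions made for the example.

```python
import heapq
import math


def query_map(query_id, query, bucket_vectors):
    """Map: score every vector of one probed bucket against the query."""
    for vector_id, vector in bucket_vectors:
        distance = math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vector)))
        yield query_id, (distance, vector_id)


def query_reduce(query_id, scored, k):
    """Reduce: keep the k closest distinct candidates for one query."""
    unique = set(scored)  # the same vector can sit in buckets of several hash tables
    return query_id, heapq.nsmallest(k, unique)


# Hypothetical probed buckets, one per hash table; vector "a" appears in both.
buckets = [
    [("a", (0.0, 1.0)), ("b", (3.0, 4.0))],
    [("a", (0.0, 1.0)), ("c", (0.0, 2.0))],
]
query = (0.0, 0.0)
scored = [pair for bucket in buckets for _, pair in query_map("q1", query, bucket)]
knn = query_reduce("q1", scored, k=2)
```

Each map call sees only one bucket, so the work parallelizes over probed buckets, while the reduce step merges the per-bucket candidates into a single ranked list.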

Datasets • Flickr dataset (~54 million photos) • 64 dimensions • Color structure • CoPhIR data collection • Synthetic dataset • 32 dimensions • Base vectors: dimensions drawn i.i.d. from N(0,1), scaled by 10 • Neighbors: slightly perturbed base vectors
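A synthetic collection of this shape could be generated along the following lines. The slide does not say how the neighbors were perturbed, so the Gaussian noise, its magnitude, the counts, and the seed are all assumptions.

```python
import random


def make_synthetic(num_base, neighbors_per_base, dim=32, scale=10.0,
                   noise=0.1, seed=0):
    """Generate base vectors and slightly perturbed copies of each."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_base):
        # Base vector: i.i.d. N(0, 1) per dimension, scaled by 10.
        base = [scale * rng.gauss(0.0, 1.0) for _ in range(dim)]
        data.append(base)
        for _ in range(neighbors_per_base):
            # Neighbor: the base vector plus small noise (assumed Gaussian).
            data.append([x + rng.gauss(0.0, noise) for x in base])
    return data


dataset = make_synthetic(num_base=3, neighbors_per_base=2)
```

Because every neighbor stays close to its base vector, the true nearest neighbors are known by construction, which makes precision easy to measure.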

Parameter Tuning • Too many files can degrade performance • (File Size < 64 KB) • LSH parameter tuning (precision, index size,…)

Experimental Evaluation • RankReduce approach vs. linear scan • Hadoop • Open source implementation of MapReduce • Single machine installation • One mapper per machine allowed • Measured • Map task execution time (approximately constant) • Number of map tasks per job

Results

Results Interpretation • Precision • Real image data >85% • Synthetic data >70% • Runtime • Number of map tasks • Real image data 4-5 times better • Synthetic data 3-4 times better

Conclusion and Outlook • Similarity search on large datasets • Robust and scalable framework • Locality Sensitive Hashing & MapReduce • Experimental evaluation • Real computer cluster (several TBs) • Music retrieval • Chroma features • Time dimension

Thank you! Questions?
