Efficient Parallel kNN Joins for Large Data in MapReduce
Research problem • Run keyword query search over a graph • Return the top-K minimal subgraphs that contain all keywords in the query • Building on partition duplication, try to generate a better partition so that MapReduce needs only one pass to produce good-enough results
Efficient Parallel kNN Joins for Large Data in MapReduce • Summary: • Baseline method • Block nested loop kNN join with Hadoop MapReduce • In the map phase, partition R and S each into n equal-sized blocks (|R|/n and |S|/n records per block), and put each pair of one R block and one S block into one bucket. • In the reduce phase, run a block nested loop kNN join between the local R block and the local S block in each bucket. • A second MapReduce job is needed to compute the global kNNs of each record in R from the n local kNN lists produced in the first phase, a total of nk candidates.
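The two rounds of the baseline can be sketched on a single machine as follows. This is a minimal illustration, not the Hadoop implementation: the function names are hypothetical, points are assumed to be distinct tuples of numbers, and distance is assumed to be Euclidean.

```python
import heapq
import math
from collections import defaultdict


def split_blocks(records, n):
    """Split a list into n roughly equal-sized blocks."""
    size = math.ceil(len(records) / n)
    return [records[i * size:(i + 1) * size] for i in range(n)]


def local_knn(r_block, s_block, k):
    """Round 1 reduce step: block nested loop kNN join inside one bucket."""
    out = []
    for r in r_block:
        best = heapq.nsmallest(k, ((math.dist(r, s), s) for s in s_block))
        out.append((r, best))
    return out


def baseline_knn_join(R, S, k, n):
    """Simulate both rounds; assumes the points in R are distinct tuples."""
    r_blocks, s_blocks = split_blocks(R, n), split_blocks(S, n)
    # Round 1: one bucket per (R block, S block) pair, n * n buckets in total.
    candidates = defaultdict(list)
    for rb in r_blocks:
        for sb in s_blocks:
            for r, best in local_knn(rb, sb, k):
                candidates[r].extend(best)  # up to n * k candidates per r
    # Round 2: merge the local lists into the global k nearest neighbors.
    return {r: heapq.nsmallest(k, cands) for r, cands in candidates.items()}
```

The nested loop over block pairs makes the n^2 bucket count visible, which is the cost the rest of the deck tries to avoid.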
Z-Value Based Partition Join • Motivation: • The baseline method creates excessive communication (n^2 buckets) • The new method looks for an alternative with linear communication and computation costs • It uses a space-filling curve (the z-value curve)
Z-order curve • In mathematical analysis, a space-filling curve is a curve whose range contains the entire 2-dimensional unit square (or, more generally, an n-dimensional hypercube) (from Wikipedia) • In mathematical analysis and computer science, Z-order, Morton order, or Morton code is a function that maps multidimensional data to one dimension while preserving the locality of the data points.
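A z-value can be computed by interleaving the bits of the coordinates. The sketch below assumes non-negative integer coordinates that fit in `bits` bits; the function name `z_value` and the choice to make the first coordinate the least significant bit of each group are illustrative assumptions.

```python
def z_value(coords, bits=16):
    """Interleave the bits of non-negative integer coordinates into one Morton code."""
    code = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            # Bit `bit` of dimension d lands at position bit * dims + d.
            code |= ((c >> bit) & 1) << (bit * dims + d)
    return code


# Sorting the four cells of a 2 x 2 grid by z-value traces the "Z" shape.
cells = sorted([(0, 0), (1, 0), (0, 1), (1, 1)], key=z_value)
```

Points that are close in space often, but not always, get close z-values, which is why the join built on top of them is approximate.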
zkNN Algorithm • The zkNN algorithm runs on two datasets, R and S • Choose a small integer α and run the loop α times • In each iteration, for each entry in R, use one vector from an array of α shift vectors to find a candidate subset of S in which to look for the k nearest neighbors • The final candidate set is the union of all candidate subsets
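A centralized sketch of the loop described above follows. Several details are assumptions, since the slide does not spell them out: the first pass is unshifted and later passes use random shift vectors, each pass takes up to k z-order predecessors and k successors as candidates, points are tuples of non-negative integers below `max_coord`, and the names `zknn`, `shifted`, and `z_value` are hypothetical.

```python
import bisect
import heapq
import math
import random


def z_value(coords, bits=16):
    """Interleave coordinate bits into one Morton code."""
    code = 0
    dims = len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            code |= ((c >> bit) & 1) << (bit * dims + d)
    return code


def shifted(point, shift):
    """Translate a point by a shift vector."""
    return tuple(c + v for c, v in zip(point, shift))


def zknn(R, S, k, alpha=2, max_coord=1 << 10, seed=0):
    """Approximate kNN join; returns {r: [(distance, s), ...]} sorted by distance."""
    rng = random.Random(seed)
    dims = len(S[0])
    candidates = {r: set() for r in R}
    for i in range(alpha):
        # Assumption: pass 0 is unshifted, later passes use a random shift.
        shift = [0] * dims if i == 0 else [rng.randrange(max_coord) for _ in range(dims)]
        s_sorted = sorted((z_value(shifted(s, shift)), s) for s in S)
        keys = [z for z, _ in s_sorted]
        for r in R:
            pos = bisect.bisect_left(keys, z_value(shifted(r, shift)))
            # Assumption: up to k z-order predecessors and k successors.
            for _, s in s_sorted[max(0, pos - k):pos + k]:
                candidates[r].add(s)
    # Final answer: the k closest points in the union of all candidate subsets.
    return {r: heapq.nsmallest(k, ((math.dist(r, s), s) for s in cands))
            for r, cands in candidates.items()}
```

In one dimension the z-order is the natural order, so this sketch returns exact neighbors there; in higher dimensions the result is approximate and extra passes (larger α) give more chances to catch neighbors that one curve separates.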
zkNN Algorithm Based on MapReduce • Partition • All partitioning is based on z-values: the entries of R and of S are each laid out along a line in z-value order • In each iteration, using the same z-value function, two points with similar z-values are considered near each other • To find the nearest neighbors of a block of R in dataset S, find the corresponding block of S • To make sure that block has enough entries from S (at least k), duplication is needed
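The partition step can be pictured as a map function that routes every record to a block according to its z-value. The sketch below is a single-machine stand-in with hypothetical names (`assign_partition`, `map_phase`) and an assumed record layout of `(source, z_value, payload)`; it leaves out the duplication of boundary entries.

```python
import bisect
from collections import defaultdict


def assign_partition(z, boundaries):
    """Index of the z-value range that z falls into (boundaries are sorted)."""
    return bisect.bisect_right(boundaries, z)


def map_phase(records, boundaries):
    """Group records by partition; source is 'R' or 'S' (assumed layout)."""
    groups = defaultdict(lambda: {"R": [], "S": []})
    for source, z, payload in records:
        groups[assign_partition(z, boundaries)][source].append((z, payload))
    return groups
```

Because R and S share the same boundaries, the R block and the S block with the same index cover the same stretch of the curve and can be joined in one reducer.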
Partition Duplicate • To make sure all possible nearest neighbors are covered, we duplicate the nearest k points from the preceding block and the succeeding block if necessary • First challenge: balanced partitions • Partition the outer dataset R into balanced parts • Generate a sample of R with probability p = 1/(ε²N) for any ε in (0, 1), calculate the rank s(x) of x in the sample, and then estimate its rank in R as r(x) = s(x)/p • Compare the estimate r(x) obtained from the sample with the real rank of x in R by calculating its variance
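The sampling step can be sketched as below. It assumes N is the number of values being sampled, that each value is kept independently with probability p, and that s(x) counts sampled values smaller than x; the function names are hypothetical, and choosing sample quantiles as block boundaries is one plausible way to use the estimate, not a detail stated on the slide.

```python
import bisect
import random


def sample_ranks(values, eps, seed=0):
    """Keep each value independently with probability p = 1 / (eps^2 * N)."""
    rng = random.Random(seed)
    n = len(values)
    p = min(1.0, 1.0 / (eps * eps * n))
    sample = sorted(v for v in values if rng.random() < p)
    return sample, p


def estimated_rank(x, sample, p):
    """r(x) = s(x) / p, where s(x) counts sampled values smaller than x."""
    return bisect.bisect_left(sample, x) / p


def balanced_boundaries(sample, n_parts):
    """Assumption: use n_parts - 1 sample quantiles as partition boundaries."""
    if not sample:
        return []
    return [sample[len(sample) * i // n_parts] for i in range(1, n_parts)]
```

The expected sample size is pN = 1/ε², so it depends only on ε and not on the size of the dataset.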
Proof for partition R • Proves a bound on the standard deviation of the estimated rank of x relative to the real rank of x • See Theorem 2 and Lemma 2
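The slide only points to the paper for the proof. As a hedged sketch of why such a bound is plausible, assume every entry of R is sampled independently with probability p = 1/(ε²N), write r(x) for the true rank (the number of entries of R smaller than x) and r̂(x) = s(x)/p for its estimate. These are assumptions made here, not statements quoted from Theorem 2 or Lemma 2.

```latex
\begin{align*}
s(x) &\sim \mathrm{Binomial}\bigl(r(x),\, p\bigr), \qquad \hat r(x) = \frac{s(x)}{p},\\
\mathbb{E}\bigl[\hat r(x)\bigr] &= \frac{r(x)\,p}{p} = r(x),\\
\mathrm{Var}\bigl[\hat r(x)\bigr] &= \frac{r(x)\,p\,(1-p)}{p^{2}} = \frac{r(x)\,(1-p)}{p} \le \frac{N}{p} = \varepsilon^{2} N^{2},\\
\sigma\bigl[\hat r(x)\bigr] &\le \varepsilon N .
\end{align*}
```

Under these assumptions the estimate is unbiased and its standard deviation is at most εN, so a smaller ε buys tighter ranks at the price of a larger sample.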
Partition Duplicate (continued) • For dataset S, the partition boundaries are initially the same as those of R, but as discussed before, each block of S also needs to contain the nearest k points from the preceding block and from the succeeding block • Just as for the partition of R, generate a sample of S with probability p; the ⌈kp⌉-th (kp rounded up) element of the sample is taken as an estimate of the kth element of the real S
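One way to picture the widened S block is shown below. It works on sorted z-values, uses the ⌈kp⌉-th sample entry on each side of a block [lo, hi) as the new boundary, and falls back to an unbounded side when the sample runs out. The names `extended_range` and `s_block` and the closed-range convention are illustrative assumptions.

```python
import bisect
import math


def extended_range(sample_s, p, k, lo, hi):
    """Widen [lo, hi) so it should cover about k S-entries on each side."""
    steps = math.ceil(k * p)  # k-th entry of S ~ ceil(k * p)-th entry of the sample
    left = bisect.bisect_left(sample_s, lo)   # number of sample entries below lo
    right = bisect.bisect_left(sample_s, hi)  # number of sample entries below hi
    new_lo = sample_s[left - steps] if left - steps >= 0 else float("-inf")
    last = right + steps - 1
    new_hi = sample_s[last] if last < len(sample_s) else float("inf")
    return new_lo, new_hi


def s_block(s_values, new_lo, new_hi):
    """All S z-values falling in the widened, closed range."""
    return [z for z in s_values if new_lo <= z <= new_hi]
```

With p = 1 the sample is the whole of S and the widened range picks up exactly k predecessors and k successors; with a smaller p the boundaries are estimates, so the block may hold slightly more or fewer than k extra entries per side.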
Proof for partition S • See Theorem 3
Approximation quality • For each of the selected records, calculate its distance to the approximate kth-NN and its distance to the exact kth-NN; the ratio between the two distances is one measurement of the approximation quality • Other measurements are recall and precision • Confidence interval for randomly selected records • Additional information: all the experiments are conducted on a cluster with 16 slave nodes
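The measurements named above can be computed per record as in this small sketch. The function names are hypothetical, the kth-NN distance is taken as the largest distance in each result list, and the exact kth-NN distance is assumed to be non-zero.

```python
def approximation_ratio(approx_dists, exact_dists):
    """Distance to the approximate kth NN divided by distance to the exact kth NN."""
    # Assumes max(exact_dists) > 0; otherwise the ratio is undefined.
    return max(approx_dists) / max(exact_dists)


def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(set(exact_ids))


def precision(approx_ids, exact_ids):
    """Fraction of the returned neighbors that are exact neighbors."""
    return len(set(approx_ids) & set(exact_ids)) / len(set(approx_ids))
```

When both results contain exactly k distinct neighbors the two denominators are equal, so recall and precision coincide; a ratio of 1 means the approximate kth neighbor is as close as the exact one.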
Effect of ε and α • Run with different values of ε and compare the running time and standard deviation; see Figure 9 of the paper • Compare the approximation ratio and recall (precision) for different values of α; see Figure 15 of the paper