
Linear and Non Linear Dimensionality Reduction for Distributed Knowledge Discovery

This presentation explores dimensionality reduction algorithms and their applications in distributed knowledge discovery. It covers FEDRA, a fast and efficient dimensionality reduction algorithm; a framework for linear distributed dimensionality reduction; and D-Isomap, a distributed non-linear dimensionality reduction algorithm.


Presentation Transcript


  1. Linear and Non Linear Dimensionality Reduction for Distributed Knowledge Discovery Panagis Magdalinos Supervising Committee: Michalis Vazirgiannis, Emmanuel Yannakoudakis, Yannis Kotidis Athens University of Economics and Business Athens, 31st of May 2010

  2. Outline • Introduction – Motivation • Contributions • FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm • A new dimensionality reduction algorithm • Large scale data mining with FEDRA • A Framework for Linear Distributed Dimensionality Reduction • Distributed Non Linear Dimensionality Reduction • Distributed Isomap (D-Isomap) • Distributed Knowledge Discovery with the use of D-Isomap • An Extensible Suite for Dimensionality Reduction • Conclusions and Future Research Directions

  3. Motivation • Top 10 Challenges in Data Mining1 • Scaling Up for High Dimensional Data and High Speed Data Streams • Distributed Data Mining • Typical examples • Banks all around the world • World Wide Web • Network Management • More challenges are envisaged in the future • Novel distributed applications and trends • Peer-to-peer networks • Sensor networks • Ad-hoc mobile networks • Autonomic Networking • Commonality: high dimensional data in massive volumes. • 1. Q. Yang and X. Wu: “10 Challenging Problems in Data Mining Research”, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604

  4. The curses of dimensionality • Curse of dimensionality • Empty space phenomenon • Maximum and minimum distance of a dataset tend to be equal as dimensions grow (i.e., Dmax – Dmin ≈ 0) • Data mining becomes resource intensive • K-means and k-nn are typical examples • (Figure: the number of cells needed to cover the space doubles with every added dimension: R^1: 2^1, R^2: 2^2, R^3: 2^3, R^4: 2^4)
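
The empty space phenomenon is easy to reproduce numerically. The sketch below is an editorial illustration (not part of the original slides): it samples uniform random points and shows how the relative gap between the maximum and minimum pairwise distance shrinks as the dimensionality grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

def relative_contrast(n_points=300, dims=(2, 10, 100, 1000), seed=0):
    """Illustrate the empty space phenomenon: (Dmax - Dmin)/Dmin shrinks with dimensionality."""
    rng = np.random.default_rng(seed)
    for d in dims:
        X = rng.random((n_points, d))         # uniform points in [0, 1]^d
        dists = pdist(X)                      # condensed pairwise Euclidean distances
        dmax, dmin = dists.max(), dists.min()
        print(f"d={d:5d}  (Dmax - Dmin)/Dmin = {(dmax - dmin) / dmin:.3f}")

relative_contrast()
```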

  5. Solutions • Dimensionality reduction • MDS, PCA, SVD, FastMap, Random Projections… • Lower dimensional embeddings while enabling the subsequent addition of new points. • The curse of dimensionality • Significant reduction in the number of dimensions. • We can project from 500 dimensions to 10 while retaining cluster structure. • The empty space phenomenon • Meaningful results from distance functions • k-NN classification quality almost doubles when projecting from more than 20000 dimensions to 30. • Computational requirements • Distance based algorithms are significantly accelerated. • k-Means converges in less than 40 seconds, while it initially required almost 7 minutes.

  6. Classification • Problems • Hard Problems → Significant reduction • Soft Problems → Milder requirements • Visualization Problems • Methods • Linear and Non Linear • Exact and Approximate • Global and Local • Data Aware and Data Oblivious

  7. Quality Assessment • Distortion: • Provision of an upper and lower bound on the new pairwise distance. • The new distance is provided as a function of the initial distance: (1/c1)·D(a,b) ≤ D'(a,b) ≤ c2·D(a,b), with c1, c2 > 1 • Good method → min(c1·c2) • Stress • Distortion might be misleading • Stress quantifies the distance distortion on a particular example: Stress = sqrt( ∑(d(Xi,Xj) − d(X'i,X'j))² / ∑d(Xi,Xj)² ) • Task Related Metric • Clustering/Classification Quality • Pruning Power • Computational Cost • Visualization
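
As an illustration of the stress measure defined above, here is a minimal NumPy/SciPy sketch; the function and variable names are mine, not from the thesis.

```python
import numpy as np
from scipy.spatial.distance import pdist

def stress(X_original, X_embedded):
    """Stress = sqrt( sum_ij (d(Xi,Xj) - d(X'i,X'j))^2 / sum_ij d(Xi,Xj)^2 )."""
    d_orig = pdist(X_original)    # pairwise distances in the original space
    d_new = pdist(X_embedded)     # pairwise distances in the embedding
    return np.sqrt(((d_orig - d_new) ** 2).sum() / (d_orig ** 2).sum())

# Toy usage: 100 points in 50 dimensions pushed down to 5 with a random projection.
rng = np.random.default_rng(0)
X = rng.random((100, 50))
R = rng.normal(size=(50, 5)) / np.sqrt(5)
print(stress(X, X @ R))
```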

  8. Contributions • Definition of a new, global, linear, approximate dimensionality reduction algorithm • Fast and Efficient Dimensionality Reduction Algorithm (FEDRA) • Combination of low time and space requirements together with high quality results • Definition of a framework for the decentralization of any landmark based dimensionality reduction method • Motivated by low memory requirements of landmark based algorithms • Applicable in various network topologies • Definition of the first distributed, non linear, global approximate dimensionality reduction algorithm • Decentralized version of Isomap (D-Isomap) • Application on knowledge discovery from text collections • A prototype enabling the experimentation with dimensionality reduction methods (x-SDR) • Ideal for teaching and research in academia

  9. FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm • Based on: • P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "FEDRA: A Fast and Efficient Dimensionality Reduction Algorithm", In Proceedings of the SIAM International Conference on Data Mining (SDM'09), Sparks, Nevada, USA, May 2009. • P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "Enhancing Clustering Quality through Landmark Based Dimensionality Reduction", Accepted with revisions in the Transactions on Knowledge Discovery from Data, Special Issue on Large Scale Data Mining – Theory and Applications.

  10. The general idea • (Figure: points P1–P4 in the original space and their placement in the lower dimensional target space) • Instead of trying to map the whole dataset in the new space • Extract a small fraction of data and embed it in the new space • Create the “kernel” around which the whole dataset is going to be placed • Minimize the loss of information during the first part of the process. • Project each remaining point independently by taking into account only the initial set of sampled data. • The formulation of this idea into a coherent algorithm resulted in the definition of FEDRA (Fast and Efficient Dimensionality Reduction Algorithm) • A global, linear, approximate, landmark based method

  11. Our goal • Formulate a method which combines: • Results of high quality • Minimum space requirements • Minimum time requirements • Scalability in terms of cardinality and dimensionality • Application • Hard dimensionality reduction problems • Projecting from 500 dimensions to 10 while retaining inter-object relations • Enabling faster convergence of k-Means • Top 10 Challenge: Scaling up for high dimensional data

  12. The FEDRA Algorithm How do we select landmarks? Does this system of equations have a solution? Does this simplification come at a cost? Does the algorithm converge? Isn't it time consuming? These are the questions that we will answer in the next couple of slides

  13. The theory underlying FEDRA • Theorem 1: A set of k+1 points p_i, i = 1…k+1, described only by their pairwise distances, which have been defined with the use of a Minkowski distance metric p, can be embedded in R^k without distortion. Their coordinates can be derived in polynomial time through the following set of equations: • if j < i−1, then p'_{i,j} is given by the single root of |p'_{i,j}|^p − |p'_{i,j} − p'_{j+1,j}|^p + ∑_{f=1}^{j−1} |p'_{i,f}|^p − ∑_{f=1}^{j−1} |p'_{i,f} − p'_{j,f}|^p + d_p(p_{j+1}, p_i)^p − d_p(p_i, p_1)^p = 0 • if j = i−1, p'_{i,j} = (d_p(p_i, p_1)^p − ∑_{f=1}^{i−2} |p'_{i,f}|^p)^{1/p} • 0 otherwise • Theorem 2: Any equation of the form f(x) = |x|^p − |x − a|^p − d, where a ∈ R\{0}, d ∈ R, p ∈ N\{0}, has a single root in R. If −1 < v = d/|a|^p < 1, the root lies in (0, a); otherwise the root lies in (a, |v|·a). • The cost of embedding the k landmarks is ck²/2, where c is the cost of the Newton-Raphson method (for p = 2, c = 1)
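
Theorem 2 is what keeps each coordinate cheap to obtain: f(x) = |x|^p − |x − a|^p − d has a single root, and the theorem even supplies a bracketing interval for it. Below is a hedged sketch of such a root finder (my own illustration; the thesis uses Newton-Raphson, whereas this sketch uses the closed form for p = 2 and Brent's method otherwise).

```python
import numpy as np
from scipy.optimize import brentq

def single_root(a, d, p=2):
    """Root of f(x) = |x|^p - |x - a|^p - d (Theorem 2), assuming a != 0.

    For p = 2 the root is closed form: x^2 - (x - a)^2 = 2ax - a^2 = d.
    Otherwise the bracket suggested by Theorem 2 is handed to Brent's method.
    """
    if p == 2:
        return (a * a + d) / (2.0 * a)
    f = lambda x: abs(x) ** p - abs(x - a) ** p - d
    v = d / abs(a) ** p
    lo, hi = (0.0, a) if abs(v) < 1 else (a, abs(v) * a)
    lo, hi = min(lo, hi), max(lo, hi)      # edge cases (e.g. |v| = 1) not handled
    return brentq(f, lo, hi)

print(single_root(a=3.0, d=2.0, p=2))      # closed form: 11/6
print(single_root(a=3.0, d=2.0, p=3))      # numerical root in (0, 3)
```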

  14. Theorem 1 in practice (1/2) • No distortion requires that ||P'_i − P'_j||_p = ||P_i − P_j||_p, i, j = 1..4 • The first point is mapped to P'_1 = O = {0,0,0} • The second point is mapped to P'_2 = {||P2 − P1||_p, 0, 0} • The third point should simultaneously satisfy ||P'_3 − P'_1||_p = ||P3 − P1||_p and ||P'_3 − P'_2||_p = ||P3 − P2||_p • The solution is the intersection of the circles • (Figure: P1, P2 and P3 placed step by step in the target space)

  15. Theorem 1 in practice (2/2) • The fourth point should simultaneously satisfy ||P'_4 − P'_1||_p = ||P4 − P1||_p, ||P'_4 − P'_2||_p = ||P4 − P2||_p and ||P'_4 − P'_3||_p = ||P4 − P3||_p • Three intersecting spheres: the intersection of two spheres is a circle, so we search for the intersection of a circle with a sphere. • (Figure: P4 placed at the intersection of the three spheres)

  16. Reducing Time Complexity (1/2) • Simplified through the following iterative scheme • The embedding of Xi in R^k given the embeddings of Pj, j = 1..i−1:
|x'_{i,1}|^p + |x'_{i,2}|^p + |x'_{i,3}|^p + … + |x'_{i,i−1}|^p = ||P1 − Xi||^p
|x'_{i,1} − p'_{2,1}|^p + |x'_{i,2}|^p + |x'_{i,3}|^p + … + |x'_{i,i−1}|^p = ||P2 − Xi||^p
|x'_{i,1} − p'_{3,1}|^p + |x'_{i,2} − p'_{3,2}|^p + |x'_{i,3}|^p + … + |x'_{i,i−1}|^p = ||P3 − Xi||^p
……
|x'_{i,1} − p'_{i−1,1}|^p + |x'_{i,2} − p'_{i−1,2}|^p + |x'_{i,3} − p'_{i−1,3}|^p + … + |x'_{i,i−1}|^p = ||Pi−1 − Xi||^p
• Note that by subtracting the second equation from the first we derive |x'_{i,1}|^p − |x'_{i,1} − p'_{2,1}|^p − ||P1 − Xi||^p + ||P2 − Xi||^p = 0 • The equation has a single unknown and a single root x'_{i,1} • In general, the value of the i-th coordinate is derived by subtracting the (i+1)-th equation from the first.
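
For the Euclidean case (p = 2) each subtraction yields a linear equation in a single unknown, so the iterative scheme above translates almost directly into code. The sketch below is my own reading of that scheme (names are hypothetical); it assumes the landmarks have already been embedded as in Theorem 1, i.e. landmark 1 sits at the origin and landmark m has non-zero entries only in its first m−1 coordinates.

```python
import numpy as np

def fedra_project_point(x, landmarks, landmarks_emb):
    """Sketch of the point projection for p = 2.

    landmarks     : the k landmark points in the original space
    landmarks_emb : their k x k embedding per Theorem 1 (landmark 1 at the
                    origin, landmark m non-zero only in its first m-1 coords)
    """
    k = len(landmarks)
    D = np.array([np.linalg.norm(x - L) for L in landmarks])   # distances to the landmarks
    y = np.zeros(k)
    # Coordinate j is obtained by subtracting equation (j+2) from equation 1,
    # which for p = 2 is linear in the single unknown y[j].
    for j in range(k - 1):
        L = landmarks_emb[j + 1]
        y[j] = (D[0] ** 2 - D[j + 1] ** 2 + L @ L - 2.0 * (y[:j] @ L[:j])) / (2.0 * L[j])
    # The last coordinate comes from equation 1 itself (intersection with the
    # norm-sphere); the positive root is kept and the argument is clamped.
    y[k - 1] = np.sqrt(max(D[0] ** 2 - y[:k - 1] @ y[:k - 1], 0.0))
    return y
```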

  17. Reducing Time Complexity (2/2) • By subtracting the i-th equation from the first we essentially calculate the corresponding coordinate (i.e. a plane in R^3). • The intersection of the k−1 planes corresponds to a line. • The first equation is satisfied by points P1, P2 that correspond to the intersection of the line with the norm-sphere of R^3. • We lower time complexity from O(ck²) to O(ck), or even O(k) when p = 2 • What if the intersection of the line with the sphere does not exist? • (Figure: the planes X = a and Y = b intersect in the line X = a & Y = b, which meets the norm-sphere at P1 and P2)

  18. Existence of solution • Theorem 3: For any non-linear system of equations defined by FEDRA, there always exists at least one solution, provided that the triangle inequality holds in the original space. • No convergence occurs when ||O'A'|| + ||A'L'1|| < ||O'L'1|| • Theorem 1 guarantees that ||O'A'|| = ||OA||, ||A'L'1|| = ||AL1||, ||O'L'1|| = ||OL1|| • Hence the triangle inequality is not sustained in the original space • (Figure: the failing configuration shown in R^3 and R^2)

  19. The FEDRA Algorithm How do we select landmarks? Does this system of equations have a solution? Yes, always! Does this simplification come at a cost? Does the algorithm converge? Yes, always! Isn't it time consuming? No! In fact it is only O(k) per point! Still some questions remain…

  20. FEDRA requirements • FEDRA requirements in terms of time and space • Exhibits low memory requirements combined with low computational complexity • Memory: O(k²), k: lower dimensionality • Time: O(cdk), d: number of objects, c: constant • Addition of a new point: O(ck) • Achieved by relaxing the original requirements and requesting that every projected point retains unaltered k distances to other data points • Advantageous features • Operates on a similarity/dissimilarity matrix • Applicable with any Minkowski distance metric • FEDRA can provide a mapping from L^n_p to L^k_p where p ≥ 1

  21. Distortion • (Figure: points A, B, landmarks L1, L2 and their projections A', B', L'1, L'2) • Theorem 4: Using any two landmarks L1, L2, FEDRA can project any two points A, B while guaranteeing that their new distance A'B' will be bounded according to: AB² − 4·AAy·BBy ≤ A'B'² ≤ AB² + 4·AAy·BBy • Alternatively: A'B'² = AB² − 2·BL1·AL1·(cos(A'L'1B') − cos(AL1B)) • Distortion = sqrt( (AB² + 4·AAy·BBy) / (AB² − 4·AAy·BBy) ) • For any Minkowski distance metric p: AB^p − Δ ≤ A'B'^p ≤ AB^p + Δ, with Δ = 2·BBy·∑_{k=1}^{p} (AAy + BBy)^{p−k} (AAy − BBy)^{k−1} • Does this simplification come at a cost? The distance distortion is low and upper bounded

  22. Landmarks selection • Based on the former analysis it can be proved that the ideal landmark set should satisfy, for any two landmarks Li, Lj and any point A, one of the following relations: • LiA ≈ LjA – LiLj (or simply that LiLj ≈ 0) • LjA ≈ LiLj – LiA • LiA ≈ LjA – LiLj → requires the creation of a compact “kernel” where landmarks exhibit minimum distances from each other • LjA ≈ LiLj – LiA → requires that cluster centroids are chosen as the landmarks • So if random selection is not acceptable we use a set of k landmarks that exhibit minimum distance from each other. • How do we select landmarks? Either randomly or heuristically according to theory. • (Figure: the two landmark configurations)
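
One way to realize the "minimum distance from each other" heuristic is a greedy pass that keeps adding the point closest to the landmarks chosen so far. This is a sketch under my own naming, not necessarily the exact procedure used in the thesis.

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_compact_landmarks(X, k, seed=0):
    """Greedily build a compact landmark "kernel" whose members lie close to each other."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        d = cdist(X, X[chosen]).sum(axis=1)   # total distance to the current landmarks
        d[chosen] = np.inf                    # never re-pick an existing landmark
        chosen.append(int(d.argmin()))
    return np.array(chosen)                   # indices of the k landmarks
```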

  23. Ameliorating projection quality (I) • Depending on the properties of the selected landmark set, a single case of failure may arise1 • (Figure: clusters A and B collapse onto each other after projection with landmarks L1, L2) • 1. V. Athitsos, J. Alon, S. Sclaroff, G. Kollios, “BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval”, IEEE Transactions on PAMI, Vol. 30, No. 1, January 2008

  24. Ameliorating projection quality (II) • What if we sample an additional set of points and use it for enhancing projection quality? • Zero distortion from the landmark points and minimum distortion from another k points. • (Figure: with the additional sampled points, clusters A and B remain separated after projection) • Does this simplification come at a cost? The distance distortion is low and upper bounded. Moreover, the projection of a point can be determined using the already projected non-landmark points

  25. FEDRA Applications • The purpose of the experimental evaluation is to: • Highlight the efficiency and effectiveness of FEDRA on hard dimensionality reduction problems • Highlight FEDRA's scaling ability and applicability in large scale data mining • Showcase the enhancement of a typical data mining task like clustering due to the application of FEDRA

  26. Metrics • We assess the quality of FEDRA through the following metrics • Stress = sqrt( ∑(d(Xi,Xj) − d(X'i,X'j))² / ∑d(Xi,Xj)² ) • Clustering quality maintenance, defined as Quality in R^k / Quality in R^n • Clustering quality: Purity = (1/N) ∑_{j=1}^{a} max_i |Ci ∩ Sj| • Time requirements for each algorithm to produce the embedding • Time requirements for k-Means to converge • We compare FEDRA with landmark-based methods • Landmark MDS • Metric Map • Vantage Objects • As well as prominent methods such as • PCA • FastMap • Random Projection
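
For reference, purity as used above can be computed from the overlap between the produced clusters Ci and the true classes Sj; a small illustrative sketch (not the evaluation code of the thesis):

```python
import numpy as np

def purity(cluster_labels, class_labels):
    """Purity = (1/N) * sum over clusters of the size of the dominant class in the cluster."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    total = 0
    for c in np.unique(cluster_labels):
        members = class_labels[cluster_labels == c]
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                 # dominant class within cluster c
    return total / len(cluster_labels)

print(purity([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"]))   # 0.8
```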

  27. Stress evolution • (Figures: stress evolution on the segmentation and ionosphere datasets)

  28. Purity evolution • (Figures: purity evolution on the alpha and beta datasets) • Experimental analysis indicates: • FEDRA exhibits behavior similar to landmark based approaches and slightly ameliorates clustering quality

  29. Time Requirements • (Figures: embedding time on the alpha and beta datasets)

  30. k-Means Convergence • (Figures: k-Means convergence time on the alpha and beta datasets; annotated values 296 secs and 324 secs) • Experimental analysis indicates: • k-Means converges slower on the dataset produced by Vantage Objects • FEDRA reduces k-Means convergence requirements

  31. Summary • FEDRA is a viable solution for hard dimensionality reduction problems. • Quality of results comparable to PCA • Low time requirements, outperformed by Random Projection • Low stress values, sometimes lower than FastMap • Maintains or ameliorates the original clustering quality, with behavior similar to other methods • Enables faster convergence of k-Means

  32. Linear Distributed Dimensionality Reduction • Based on: • P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "K-Landmarks: Distributed Dimensionality Reduction for Clustering Quality Maintenance", In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), Berlin, Germany, September 2006. (Acceptance rate (full papers) 8.8%) • P. Magdalinos, C. Doulkeridis, M. Vazirgiannis, "Enhancing Clustering Quality through Landmark Based Dimensionality Reduction", Accepted with revisions in the Transactions on Knowledge Discovery from Data, Special Issue on Large Scale Data Mining – Theory and Applications.

  33. The general idea • All landmark based algorithms are applicable in distributed environments • The idea is to sample landmarks from all nodes and use them to define the global landmark set. • Then, communicate this set to all nodes. • (Figure: peers 1-7 contribute landmarks to the global landmark set, which is then communicated back to every peer)

  34. Our goal • Formulate a method which combines: • Minimum requirements in terms of network resources • Immunity to subsequent alterations of the dataset • Adaptability to network changes • Top 10 Challenge: Distributed Data Mining • Application • Hard dimensionality reduction problems • Projecting from 500 dimensions to 10 while retaining inter-object relations • Reduction of network resources consumption • State of the art: • Distributed PCA • Distributed FastMap

  35. Requirements and Candidates • Requirements: • There exists some kind of network organization scheme • Physical topology • Self-Organization • Each algorithm is composed of two parts • A centrally executed part • A decentralized part • Ideal Candidate: Any landmark based dimensionality reduction algorithm • Landmark selection process • Aggregation of landmarks in a central location • Derivation of the projection operator • Communication of the operator to all nodes • Projection of each point independently

  36. Distributed FEDRA • Applying the landmark based paradigm in a network environment • Select landmarks at peer level • Communicate all landmarks to the aggregator • O(nk) network load • Project landmarks and communicate the results • O(nkM + Mk²) network load • Each peer projects each point independently • Assuming a fixed number of |L| landmarks, the network requirements are upper bounded for each algorithm by O(n|L|M + M|L|k) • Landmark based algorithms are less demanding than distributed PCA • Distributed PCA: O(Mn² + nkM) • As long as |L| < n
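
The message pattern of the framework can be sketched end-to-end in a single-process simulation. Everything below is hypothetical scaffolding: random landmark proposals stand in for the peer-level selection, and a PCA basis of the pooled landmarks stands in for whatever projection operator the chosen landmark-based algorithm (FEDRA, Landmark MDS, ...) would actually produce.

```python
import numpy as np

def peer_propose_landmarks(local_X, n_prop, rng):
    """Peer-level step: sample a few local points as candidate landmarks (random here)."""
    idx = rng.choice(len(local_X), size=min(n_prop, len(local_X)), replace=False)
    return local_X[idx]

def aggregator_build_operator(landmarks, k):
    """Centrally executed step (stand-in): derive a k-dimensional linear projection
    operator from the pooled landmarks; any landmark-based method fits this slot."""
    centered = landmarks - landmarks.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k].T                                    # n x k projection matrix

# Simulated run over M peers (a single process standing in for the network).
rng = np.random.default_rng(0)
M, n, k = 5, 50, 3
peers = [rng.random((200, n)) for _ in range(M)]                        # local datasets
pooled = np.vstack([peer_propose_landmarks(X, 4, rng) for X in peers])  # sent to the aggregator
W = aggregator_build_operator(pooled, k)                                # broadcast back to every peer
projected = [X @ W for X in peers]                                      # projection is purely local
```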

  37. Selecting the landmark points • Each peer may select: • k points from the local dataset • Select k local points (randomly or heuristically) • Transmit them to the aggregator • The aggregator receives Mk points from all peers and selects the landmark set. • Network load is O(Mkn + Mk²) • k/M points from the local dataset • This implies that the aggregator will inform the peers about the size of the network • The landmark selection happens only once in the lifetime of the network; arrivals and departures have no effect. • Network load is O(kn + Mk²) • Zero points from the local set • The aggregator selects the k landmarks from its local dataset • Network load is O(Mk²)

  38. Application • Datasets from the Pascal Large Scale Challenge 2008 • 500-node network with random connections between elements • Nodes are connected with 5% probability • Distributed K-Means (P2P-Kmeans1) approach in order to assess the quality of the produced embedding • 1. S. Datta, C. Giannella, H. Kargupta: Approximate Distributed K-means Clustering over a P2P Network. IEEE TKDE, vol. 21, no. 10, October 2009

  39. (Figures: results on the alpha, beta, gamma and delta datasets)

  40. Network Requirements • Random Projection deviates from the framework • Random Projection: The aggregator identifies the projection matrix • Distributed clustering induces a network cost of more than 10GB • Hard dimensionality reduction preprocessing (requiring at most 200MB) reduces the cost to roughly 1GB.

  41. Summary • Landmark based dimensionality reduction algorithms provide a viable solution to distributed dimensionality reduction pre-processing • High quality results • Low network requirements • No special requirements in terms of network organization • Adaptability to potential failures • Results obtained in a network of 500 peers • Dimensionality reduction preprocessing and subsequent P2P-Kmeans application necessitates only 12% of the original P2P-Kmeans load • Clustering quality remains the same or is slightly ameliorated • Distributed FEDRA • Low network requirements combined with high quality results

  42. Distributed Non Linear Dimensionality Reduction • Based on: • P. Magdalinos, M. Vazirgiannis, D. Valsamou, "Distributed Knowledge Discovery with Non Linear Dimensionality Reduction", To appear in the Proceedings of the 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'10), Hyderabad, India, June 2010. (Acceptance rate (full papers) 10.2%) • P. Magdalinos, G. Tsatsaronis, M. Vazirgiannis, "Distributed Text Mining based on Non Linear Dimensionality Reduction", Submitted to the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2010), currently under review.

  43. Our goal • Top 10 Challenges: Distributed data mining of high dimensional data • Scaling Up for High Dimensional Data • Distributed Data Mining • Vector Space Model: • Each word defines an axis → each document is a vector residing in a high dimensional space • Numerous methods try to project data in a low dimensional space while assuming linear dependence between variables. • However, the latest experimental results show that this assumption is incorrect • Application • Hard dimensionality reduction and visualization problems • Unfolding a manifold distributed across a network of peers • Mining information from distributed text collections • State of the art: • None!

  44. The general idea • The idea is to replicate the original Isomap algorithm in a highly distributed environment and still get results of equal quality. • Distributed Isomap (D-Isomap): a three-phase approach • Distributed nearest-neighbour (NN) computation • Distributed shortest-path (SP) computation • Multidimensional Scaling • (Figure: centralized Isomap versus D-Isomap executed across peers 1-8)

  45. Indexing and k-NN retrieval (1/4) • Which LSH family to employ1? • Since we use the Euclidean distance we should use a Euclidean distance preserving mapping: h_{r,b}(x) = floor((x·r + b)/w), where x is the data point, r is a 1×n random vector, w ∈ N and b ∈ [0, w) • This family of functions guarantees that the probability of collision is analogous to the points' original distance. • Given f hash functions per table, we obtain an f-dimensional hash vector (hash1, hash2, …, hashf) • 1. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1) (2008)
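
A minimal sketch of this hash family (Euclidean, 2-stable LSH) with f functions per table; the helper names are mine.

```python
import numpy as np

def make_lsh_table(n_dims, f, w, rng):
    """One table of f hash functions h(x) = floor((x . r + b) / w)."""
    R = rng.normal(size=(f, n_dims))       # one random vector r per hash function
    b = rng.uniform(0.0, w, size=f)        # offsets b drawn from [0, w)
    def hash_vector(x):
        return np.floor((R @ x + b) / w).astype(int)   # the f-dimensional hash vector
    return hash_vector

rng = np.random.default_rng(0)
h = make_lsh_table(n_dims=100, f=8, w=4.0, rng=rng)
print(h(rng.random(100)))                  # eight bucket indices for one point
```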

  46. Indexing and k-NN retrieval (2/4) • Indexing and guaranteeing load balancing • Consider the l1 norm of the produced vector, ∑_{i=1}^{f} |h_i(x)| • The values are generated from the normal distribution N(f/2, f·μ·||x||/w)1 • Consider 2 standard deviations and split the range into M cells • For a given hash vector v, the peer that will index it is: peer_id = ( M·(||v||_1 − μ_l1 + 2σ_l1) / (4σ_l1) ) mod M • (Figure: hash vectors assigned to peers by their l1 norm, l1 = ∑|v_i|, over the range μ_l1 ± 2σ_l1) • 1. Haghani, P., Michel, S., Aberer, K.: Distributed similarity search in high dimensions using locality sensitive hashing. ACM EDBT, pp. 744-755 (2009)
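
The peer responsible for a hash vector then follows directly from the formula above; a small sketch assuming the mean and standard deviation of the l1 norm (mu_l1, sigma_l1) are known or have been estimated.

```python
import numpy as np

def responsible_peer(hash_vec, M, mu_l1, sigma_l1):
    """Map a hash vector to one of M peers by its l1 norm (formula from the slide)."""
    l1 = np.abs(hash_vec).sum()
    return int(M * (l1 - mu_l1 + 2 * sigma_l1) / (4 * sigma_l1)) % M
```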

  47. Indexing and k-NN retrieval (3/4) • How to effectively and efficiently search for the k-NN of each point? • Baseline: for each local point di, for each table T: find the peer that indexes it, retrieve all points from the corresponding bucket, retrieve the actual points, calculate the actual distances, rank them and retain the k-NNs • What if we could somehow identify a range and upper bound the difference δ = | ||h(x)||_1 − ||h(y)||_1 |? • Theorem 5: Given f hash functions h_i(x) = floor((r_i·x^T + b_i)/w), where r_i is a 1×n random vector, w ∈ N, b_i ∈ [0, w), i = 1...f, the difference δ of the l1 norms of the projections x_f, y_f of two points x, y ∈ R^n is upper bounded by (||A||·||x − y||)/w, where A = ∑_{i=1}^{f} |r_i| and ||x − y|| is the points' Euclidean distance. • Although the bound is rather large, it still reduces the required number of messages

  48. Indexing and k-NN retrieval (4/4) • k-NN Retrieval: Messages O(cskd), Time O(cskd_i), Memory O(cskn) • Indexing: Messages O(dT), Time O(d_iTfn), Memory O(fn) • (Figure: a peer hashes a point V, computes peer_id = f(hash(V)), forwards the hash vector and the point to the responsible peer, and later retrieves candidate neighbours through a request-reply exchange)

  49. Geodesic Distances (1/2) • At this step, each peer has identified the NN graphs of its points, G = ∪_{i=1}^{|Di|} G_i • The target is to identify the shortest paths (SPs) from each point to the rest of the dataset • Use best practices from computer networking: Distance Vector Routing or Distributed Bellman-Ford • Assume that each point is a network node and each calculated distance a link between the corresponding points/nodes • From a node's perspective, DVR replicates a ranged search, starting with one link and progressively augmenting it by 1

  50. Geodesic Distances (2/2) • Start at node 1 • Discover paths, 1 hop away • Discover paths, 2 hops away • Discover paths, 3 hops away • Peer 5 will never be reached! • The graph is not connected • Distance is ∞ → set the distance to 5·max(distance) • The graph is now connected • Messages: O(kNN·M·d²) • Space: O(d_i·d) • Time: O(M) • (Figure: hop-by-hop path discovery across peers 1-5)
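
The distance-vector idea can be simulated as synchronous Bellman-Ford rounds over the union of the peers' k-NN edges. The sketch below is a single-process simplification (not the peer-to-peer protocol itself) and includes the disconnected-graph fallback mentioned above.

```python
import numpy as np

def distributed_bellman_ford(n_points, edges, max_rounds):
    """Synchronous distance-vector / Bellman-Ford over a k-NN graph.

    edges: dict {(i, j): distance} holding the union of all peers' k-NN edges.
    """
    D = np.full((n_points, n_points), np.inf)
    np.fill_diagonal(D, 0.0)
    for (i, j), w in edges.items():                 # 1-hop paths
        D[i, j] = D[j, i] = min(D[i, j], w)
    for _ in range(max_rounds):                     # each round extends paths by one hop
        updated = False
        for (i, j), w in edges.items():
            for a, b in ((i, j), (j, i)):
                relaxed = D[a] + w                  # reach every node from b via the edge (b, a)
                better = relaxed < D[b]
                if better.any():
                    D[b][better] = relaxed[better]
                    updated = True
        if not updated:
            break
    # Disconnected components: replace infinite entries by 5 * max finite distance.
    finite = D[np.isfinite(D)]
    D[~np.isfinite(D)] = 5 * finite.max()
    return D

# Toy usage: a chain 0-1-2-3 plus an isolated node 4 (never reached, gets the fallback value).
print(distributed_bellman_ford(5, {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}, max_rounds=5))
```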
