Maleq Khan, Qin Ding, William Perrizo; NDSU k-Nearest Neighbor Classification on Spatial Data Streams Using P-trees
Introduction • We explored distance metric based computation using P-trees • Defined a new distance metric, called HOB distance • Revealed some useful properties of P-trees • A new method of nearest neighbor classification using P-tree - called Closed-KNN • A new algorithm for k-clustering using P-trees - efficient statistical computation from the P-trees
Overview • Data Mining • - classification and clustering • Various distance metrics • - Minkowski, Manhattan, Euclidean, Max, Canberra, chord, and HOB distance • - Neighborhoods and decision boundaries • P-trees and their properties • k-nearest neighbor classification • - Closed-KNN using Max and HOB distance • k-clustering • - overview of existing algorithms • - our new algorithm • - computation of mean and variance from the P-trees
Data Mining • Extracting knowledge from a large amount of data • Information pyramid: raw data at the base, useful information at the top (more data, less information) • Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, outlier analysis, evolution analysis
Classification
• Predicting the class of a data object; also called supervised learning
• Training data (class labels are known):
    Feature1  Feature2  Feature3  Class
    a1        b1        c1        A
    a2        b2        c2        A
    a3        b3        c3        B
• Sample with unknown class (a, b, c) → Classifier → predicted class of the sample
Types of Classifier • Eager classifier: builds a classifier model in advance, e.g. decision tree induction, neural network • Lazy classifier: uses the raw training data, e.g. k-nearest neighbor
Clustering • The process of grouping objects into classes, with the objective that the data objects are - similar to the objects in the same cluster - dissimilar to the objects in the other clusters • Clustering is often called unsupervised learning or unsupervised classification: the class labels of the data objects are unknown • [Figure: a two-dimensional space showing 3 clusters]
Distance Metric
• Measures the dissimilarity between two data points
• A distance metric is a function d of two n-dimensional points X and Y such that:
  - d(X, Y) is positive definite: $d(X, Y) > 0$ if $X \neq Y$, and $d(X, Y) = 0$ if $X = Y$
  - d(X, Y) is symmetric: $d(X, Y) = d(Y, X)$
  - d(X, Y) satisfies the triangle inequality: $d(X, Y) + d(Y, Z) \geq d(X, Z)$
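Various Distance Metrics
Let $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$.
• Minkowski distance or $L_p$ distance: $d_p(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
• Manhattan distance ($p = 1$): $d_1(X, Y) = \sum_{i=1}^{n} |x_i - y_i|$
• Euclidean distance ($p = 2$): $d_2(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Max distance ($p = \infty$): $d_\infty(X, Y) = \max_{1 \leq i \leq n} |x_i - y_i|$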
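An Example
• A two-dimensional space with X = (2, 1) and Y = (6, 4); Z is the right-angle corner of the triangle XZY, with XZ = 4 and ZY = 3
• Manhattan: $d_1(X, Y) = XZ + ZY = 4 + 3 = 7$
• Euclidean: $d_2(X, Y) = XY = 5$
• Max: $d_\infty(X, Y) = \max(XZ, ZY) = XZ = 4$
• $d_1 \geq d_2 \geq d_\infty$
• For any positive integer p, $d_p \geq d_{p+1}$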
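The example can be checked numerically. The following is a minimal sketch, not code from the presentation; the function name `minkowski` and the use of `math.inf` to select the Max distance are illustrative assumptions.

```python
import math

def minkowski(x, y, p):
    """L_p distance between two equal-length points; p = math.inf gives the Max distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

# Points from the slide: X = (2, 1), Y = (6, 4)
X, Y = (2, 1), (6, 4)
d1 = minkowski(X, Y, 1)            # Manhattan
d2 = minkowski(X, Y, 2)            # Euclidean
d_inf = minkowski(X, Y, math.inf)  # Max
print(d1, d2, d_inf)
```

The three values come out ordered as d1 >= d2 >= d_inf, matching the slide.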
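Some Other Distances
• Canberra distance: $d_c(X, Y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i + y_i}$
• Squared chord distance: $d_{sc}(X, Y) = \sum_{i=1}^{n} \left( \sqrt{x_i} - \sqrt{y_i} \right)^2$
• Squared chi-squared distance: $d_{chi}(X, Y) = \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{x_i + y_i}$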
HOB Similarity
• Higher Order Bit (HOB) similarity: $HOBS(A, B) = \max\{\, s : 0 \leq s \leq m,\ a_i = b_i \text{ for all } 1 \leq i \leq s \,\}$
• A, B: two scalars (integers)
• $a_i$, $b_i$: ith bit of A and B (left to right)
• m: number of bits
• Example:
    Bit position: 1 2 3 4 5 6 7 8
    x1:           0 1 1 0 1 0 0 1
    y1:           0 1 1 1 1 1 0 1
    HOBS(x1, y1) = 3

    Bit position: 1 2 3 4 5 6 7 8
    x2:           0 1 0 1 1 1 0 1
    y2:           0 1 0 1 0 0 0 0
    HOBS(x2, y2) = 4
HOB Distance
• The HOB distance between two scalar values A and B: $d_v(A, B) = m - HOBS(A, B)$
• The HOB distance between two points X and Y: $d_h(X, Y) = \max_{1 \leq i \leq n} d_v(x_i, y_i)$
• The previous example:
    Bit position: 1 2 3 4 5 6 7 8
    x1:           0 1 1 0 1 0 0 1
    y1:           0 1 1 1 1 1 0 1
    HOBS(x1, y1) = 3, so d_v(x1, y1) = 8 - 3 = 5

    Bit position: 1 2 3 4 5 6 7 8
    x2:           0 1 0 1 1 1 0 1
    y2:           0 1 0 1 0 0 0 0
    HOBS(x2, y2) = 4, so d_v(x2, y2) = 8 - 4 = 4
• In our example (considering 2-dimensional data): $d_h(X, Y) = \max(5, 4) = 5$
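A short sketch of how HOB similarity and HOB distance could be computed on integers. The function names are hypothetical. The XOR and `bit_length` shortcut is one possible implementation and is not taken from the slides: the position of the highest differing bit gives the scalar HOB distance directly.

```python
def hobs(a, b, m=8):
    """Number of consecutive equal bits of a and b, counted from the highest-order bit."""
    diff = (a ^ b) & ((1 << m) - 1)   # 1 bits mark positions where a and b differ
    return m - diff.bit_length()      # bit_length() locates the highest differing bit

def hob_distance_scalar(a, b, m=8):
    """d_v(A, B) = m - HOBS(A, B)."""
    return m - hobs(a, b, m)

def hob_distance(x, y, m=8):
    """d_h(X, Y) = maximum of the per-dimension scalar HOB distances."""
    return max(hob_distance_scalar(a, b, m) for a, b in zip(x, y))

# Values from the slide example
x1, y1 = 0b01101001, 0b01111101
x2, y2 = 0b01011101, 0b01010000
print(hobs(x1, y1), hobs(x2, y2))            # 3 4
print(hob_distance((x1, x2), (y1, y2)))      # 5
```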
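HOB Distance Is a Metric
• HOB distance is positive definite: $d_h(X, Y) = 0$ if $X = Y$, and $d_h(X, Y) > 0$ if $X \neq Y$
• HOB distance is symmetric: $d_h(X, Y) = d_h(Y, X)$
• HOB distance satisfies the triangle inequality: $d_h(X, Y) + d_h(Y, Z) \geq d_h(X, Z)$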
Neighborhood of a Point • The neighborhood of a target point T is a set of points S such that $X \in S$ if and only if $d(T, X) \leq r$ • If X is a point on the boundary, $d(T, X) = r$ • [Figure: neighborhoods of T, each of width 2r, under the Manhattan, Euclidean, Max, and HOB metrics, with a boundary point X]
Decision Boundary • The decision boundary between points A and B is the locus of the points X satisfying $d(A, X) = d(B, X)$ • Decision boundary for HOB distance: perpendicular to the axis that makes the maximum distance • [Figure: decision boundaries between A and B under the Manhattan, Euclidean, and Max metrics, for the > 45° and < 45° cases, with regions R1 and R2]
Remotely Sensed Imagery Data • An image is a collection of pixels • Each pixel represents a square area on the ground • Several attributes or bands are associated with each pixel, e.g. red, green, blue reflectance values, soil moisture, nitrate • Band Sequential (BSQ) file: one file for each band • Bit Sequential (bSQ) file: one file for each bit of each band • Bi,j is the bSQ file for the jth bit of the ith band
Peano Count Tree or P-tree
• We form one P-tree from each bSQ file
• Pi,j is the basic P-tree for bit j of band i
• Root of the P-tree is the count of 1 bits in the entire image
• Root has 4 children with the counts of the 4 quadrants
• Recursively divide the quadrants until there is only one bit in the quadrant, unless the node is pure0 or pure1
• Pure1 node: all bits are 1

Example 8 × 8 bit image:
    1 1 1 1 1 1 0 0
    1 1 1 1 1 0 0 0
    1 1 1 1 1 1 0 0
    1 1 1 1 1 1 1 0
    1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 1
    1 1 1 1 1 1 1 1
    0 1 1 1 1 1 1 1

Its P-tree (root count 55):
                       55
         ____________/ /  \ \____________
        /        ___/      \___          \
       16       8              15        16
             / / \ \         / / \ \
            3  0  4  1      4  4  3  4
          //|\    //|\          //|\
          1110    0010          1101
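The construction can be sketched as a recursive quadrant count. This is a minimal illustration under stated assumptions, not the authors' implementation: the image side is a power of 2, a node is a hypothetical `(count, children)` tuple, and quadrants are visited in the order upper-left, upper-right, lower-left, lower-right.

```python
# The 8 x 8 bit image from the slide
IMAGE = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]

def build_ptree(bits, row=0, col=0, size=None):
    """Return (count, children); children is None for pure or single-bit nodes."""
    if size is None:
        size = len(bits)
    if size == 1:
        return (bits[row][col], None)
    half = size // 2
    # Quadrant order: upper-left, upper-right, lower-left, lower-right
    children = [
        build_ptree(bits, row + dr, col + dc, half)
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))
    ]
    count = sum(child[0] for child in children)
    if count == 0 or count == size * size:
        return (count, None)          # pure0 or pure1: children are not stored
    return (count, children)

root = build_ptree(IMAGE)
print(root[0])                          # 55
print([c[0] for c in root[1]])          # [16, 8, 15, 16]
print([c[0] for c in root[1][1][1]])    # [3, 0, 4, 1]
```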
Peano Mask Tree (PMT)
• 0 represents a pure0 node
• 1 represents a pure1 node
• m represents a mixed node

The P-tree with root count 55 (quadrant counts 16, 8, 15, 16) has the PMT:
                        m
         ____________/ /  \ \____________
        /        ___/      \___          \
        1       m               m         1
             / / \ \         / / \ \
            m  0  1  m      1  1  m  1
          //|\    //|\          //|\
          1110    0010          1101
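P-tree ANDing
• Example (each root is mixed and is shown with its four children; every m child carries a subtree):
    operand 1:  m → (1, m, 0, m)   with Subtree1 and Subtree2 under its mixed children
    operand 2:  m → (m, 0, 0, m)   with Subtree3 and Subtree4 under its mixed children
    result:     m → (m, 0, 0, m)   with Subtree3 and Subtree5 under its mixed children
  (1 AND Subtree3 gives Subtree3; Subtree5 is Subtree2 AND Subtree4)
• ORing and COMPLEMENT operations are performed in a similar way
• There are also other P-tree structures (such as PVT) and ANDing algorithms that are beyond the scope of this presentation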
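The ANDing rule can be sketched recursively on mask trees. The representation is an assumption made for illustration: a node is `0` (pure0), `1` (pure1), or a list of four children (mixed). The subtree contents below are made up, since the slide does not show them.

```python
def pmt_and(a, b):
    """AND two Peano mask trees; a node is 0, 1, or a list of 4 child nodes (mixed)."""
    if a == 0 or b == 0:
        return 0                      # pure0 AND anything = pure0
    if a == 1:
        return b                      # pure1 AND b = b
    if b == 1:
        return a
    children = [pmt_and(x, y) for x, y in zip(a, b)]
    if all(c == 0 for c in children):
        return 0                      # mixed AND mixed may collapse to pure0
    return children

# Hypothetical leaf-level subtrees standing in for Subtree1..Subtree4
s1, s2 = [1, 0, 0, 1], [1, 1, 0, 0]
s3, s4 = [0, 1, 1, 0], [1, 0, 1, 0]

operand1 = [1, s1, 0, s2]
operand2 = [s3, 0, 0, s4]
result = pmt_and(operand1, operand2)
print(result)   # [[0, 1, 1, 0], 0, 0, [1, 0, 0, 0]]
```

The first child of the result is Subtree3 unchanged, and the last child is Subtree2 AND Subtree4, which plays the role of Subtree5 on the slide.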
Value & Interval P-tree
• The value P-tree Pi(v) represents the pixels that have value v for band i: there is a 1 in Pi(v) at a pixel location if that pixel has the value v for band i; otherwise there is a 0
• Let bj = jth bit of the value v, and Pi,j = the basic P-tree for band i, bit j
• Define Pti,j = Pi,j if bj = 1, and Pti,j = P′i,j (the complement of Pi,j) if bj = 0
• Then Pi(v) = Pti,1 AND Pti,2 AND Pti,3 AND … AND Pti,m
• The interval P-tree: Pi(v1, v2) = Pi(v1) OR Pi(v1+1) OR Pi(v1+2) OR … OR Pi(v2)
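The definitions above can be tried out with flat bit vectors standing in for P-trees. This is a simplification for illustration only: a Python integer holds one bit per pixel, so AND, OR, and root count behave as on P-trees, but without the quadrant compression. All function names and the sample band are hypothetical.

```python
def bit_planes(values, m=8):
    """planes[j-1] is the bit vector for bit j, where j = 1 is the highest-order bit."""
    planes = []
    for j in range(1, m + 1):
        plane = 0
        for pixel, v in enumerate(values):
            if (v >> (m - j)) & 1:
                plane |= 1 << pixel
        planes.append(plane)
    return planes

def rc(p):
    """Root count: number of 1 bits."""
    return bin(p).count("1")

def value_ptree(planes, v, n_pixels):
    """AND of each basic bit plane, or its complement where the bit of v is 0."""
    m = len(planes)
    full = (1 << n_pixels) - 1            # plays the role of the pure1-tree
    result = full
    for j in range(1, m + 1):
        plane = planes[j - 1]
        if not (v >> (m - j)) & 1:
            plane = full & ~plane         # complement
        result &= plane
    return result

def interval_ptree(planes, v1, v2, n_pixels):
    """OR of the value P-trees for v1, v1+1, ..., v2."""
    result = 0
    for v in range(v1, v2 + 1):
        result |= value_ptree(planes, v, n_pixels)
    return result

band = [105, 104, 107, 105, 200, 13, 106, 105]   # made-up band values
planes = bit_planes(band)
print(rc(value_ptree(planes, 105, len(band))))           # 3
print(rc(interval_ptree(planes, 104, 107, len(band))))   # 6
```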
Notations
• rc(P): root count of P-tree P
• N: number of pixels
• n: number of bands
• m: number of bits
• P1 & P2: P1 AND P2
• P1 | P2: P1 OR P2
• P′: COMPLEMENT of P
• Pi,j: basic P-tree for band i, bit j
• Pi(v): value P-tree for value v of band i
• Pi(v1, v2): interval P-tree for interval [v1, v2] of band i
• P0: pure0-tree, a P-tree whose root node is pure0
• P1: pure1-tree, a P-tree whose root node is pure1
Properties of P-trees
• 1.-3. [equations for properties 1 a)-b), 2 a)-d), and 3 a)-d) are missing in the source]
• 4. rc(P1 | P2) = 0 ⇒ rc(P1) = 0 and rc(P2) = 0
• 5. v1 ≠ v2 ⇒ rc{Pi(v1) & Pi(v2)} = 0
• 6. rc(P1 | P2) = rc(P1) + rc(P2) - rc(P1 & P2)
• 7. rc{Pi(v1) | Pi(v2)} = rc{Pi(v1)} + rc{Pi(v2)}, where v1 ≠ v2
Header of a P-tree file (to make a generalized P-tree structure)
    Field                          Size
    Format code                    1 word
    Fan-out                        2 words
    # of levels                    2 words
    Root count                     4 words
    Length of the body in bytes    4 words
The P-tree header is followed by the body of the P-tree.
k-Nearest Neighbor Classification
1) Select a suitable value for k
2) Determine a suitable distance metric
3) Find the k nearest neighbors of the sample using the selected metric
4) Find the plurality class of the nearest neighbors by voting on the class labels of the NNs
5) Assign the plurality class to the sample to be classified
Closed-KNN • T is the target pixel • With k = 3, to find the third nearest neighbor, traditional KNN arbitrarily selects one point from the boundary line of the neighborhood • Closed-KNN includes all points on the boundary • Closed-KNN yields higher classification accuracy than traditional KNN
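The difference from traditional KNN can be sketched with a plain list scan. This illustrates only the tie-handling idea and makes no use of P-trees. The training points, labels, and function names are made up, and the Max distance is chosen only for the example.

```python
from collections import Counter

def max_distance(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def closed_knn(train, target, k, dist):
    """train: list of (point, label). Keeps every point tied with the k-th nearest."""
    scored = sorted(((dist(p, target), label) for p, label in train),
                    key=lambda t: t[0])
    if len(scored) <= k:
        neighbors = scored
    else:
        radius = scored[k - 1][0]                  # distance of the k-th nearest point
        neighbors = [s for s in scored if s[0] <= radius]
    votes = Counter(label for _, label in neighbors)
    return neighbors, votes.most_common(1)[0][0]

train = [((5, 5), "A"), ((6, 5), "B"), ((5, 7), "A"),
         ((3, 5), "B"), ((7, 7), "B"), ((9, 9), "A")]
neighbors, label = closed_knn(train, (5, 5), 3, max_distance)
print(len(neighbors), label)   # 5 B
```

With k = 3, three points tie at distance 2 for third place. Cutting the list at exactly three neighbors would keep only one of them, chosen by list order, whereas the closed neighborhood keeps all three and the vote changes accordingly.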
Searching Nearest Neighbors • We begin searching by finding the exact matches • Let the target sample T = <v1, v2, v3, …, vn> • The initial neighborhood is the point T • We expand the neighborhood along each dimension: along dimension i, [vi] is expanded to the interval [vi – ai, vi + bi], for some positive integers ai and bi • Continue expansion until there are at least k points in the neighborhood
HOB Similarity Method for KNN
• In this method, we match bits of the target to the training data
• First we find matches in all 8 bits of each band (exact matching)
• Let bi,j = jth bit of the ith band of the target pixel
• Define Pti,j = Pi,j if bi,j = 1, and Pti,j = P′i,j (the complement of Pi,j) otherwise
• And Pvi,1-j = Pti,1 & Pti,2 & Pti,3 & … & Pti,j
• Pnn = Pv1,1-8 & Pv2,1-8 & Pv3,1-8 & … & Pvn,1-8
• If rc(Pnn) < k, update Pnn = Pv1,1-7 & Pv2,1-7 & Pv3,1-7 & … & Pvn,1-7, and so on, dropping one more low-order bit each time
An Analysis of HOB Method
• Let the ith band value of the target T be vi = 105 = 01101001b
• [01101001] = [105, 105]
• 1st expansion: [0110100-] = [01101000, 01101001] = [104, 105]
• 2nd expansion: [011010--] = [01101000, 01101011] = [104, 107]
• Does not expand evenly on both sides: target = 105, while the center of [104, 107] = (104 + 107) / 2 = 105.5
• Expands by powers of 2
• Computationally very cheap
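The intervals above follow from clearing and then setting the dropped low-order bits. A small sketch (the function name `hob_interval` is hypothetical):

```python
def hob_interval(v, dropped):
    """Interval of values that agree with v on all but the `dropped` lowest-order bits."""
    low = (v >> dropped) << dropped          # dropped bits cleared to 0
    high = low | ((1 << dropped) - 1)        # dropped bits set to 1
    return low, high

for dropped in range(4):
    low, high = hob_interval(105, dropped)
    print(dropped, (low, high), "center:", (low + high) / 2)
```

For the target 105 this gives [105, 105], [104, 105], [104, 107], and [104, 111]: the width doubles at each step and the target drifts away from the center.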
Perfect Centering Method
• The Max distance metric provides a better neighborhood by keeping the target in the center and expanding by 1 on both sides
• Initial neighborhood P-tree (exact matching): Pnn = P1(v1) & P2(v2) & P3(v3) & … & Pn(vn)
• If rc(Pnn) < k: Pnn = P1(v1-1, v1+1) & P2(v2-1, v2+1) & … & Pn(vn-1, vn+1)
• If rc(Pnn) < k: Pnn = P1(v1-2, v1+2) & P2(v2-2, v2+2) & … & Pn(vn-2, vn+2)
• Computationally costlier than the HOB similarity method, but gives slightly better classification accuracy
Finding the Plurality Class
• Let Pc(i) be the value P-tree for class i
• Plurality class = $\arg\max_i \; rc\left(P_{nn} \,\&\, P_c(i)\right)$
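The perfect centering expansion and the plurality vote can be sketched together. As a simplifying assumption, flat bit vectors (Python integers, one bit per pixel) stand in for interval and value P-trees, and the masks are built by scanning the values rather than by ANDing bit planes. The data and all names are made up.

```python
def interval_mask(values, lo, hi):
    """Bit vector with bit p set when lo <= values[p] <= hi (stand-in for an interval P-tree)."""
    mask = 0
    for p, v in enumerate(values):
        if lo <= v <= hi:
            mask |= 1 << p
    return mask

def rc(mask):
    return bin(mask).count("1")

def neighborhood(bands, target, k, max_value=255):
    """Grow a box centered on the target by 1 per side until it holds at least k pixels."""
    n_pixels = len(bands[0])
    r = 0
    while True:
        pnn = (1 << n_pixels) - 1
        for values, v in zip(bands, target):
            pnn &= interval_mask(values, v - r, v + r)
        if rc(pnn) >= k or r >= max_value:
            return pnn, r
        r += 1

def plurality_class(pnn, labels):
    """Class whose value mask has the largest root count inside the neighborhood."""
    counts = {c: rc(pnn & interval_mask(labels, c, c)) for c in set(labels)}
    return max(counts, key=counts.get), counts

bands = [[10, 11, 12, 30, 9],     # band 1 values of 5 training pixels
         [20, 20, 23, 40, 21]]    # band 2 values
labels = [1, 1, 2, 2, 1]
pnn, r = neighborhood(bands, target=(10, 20), k=3)
print(bin(pnn), r)                       # 0b10011 1
print(plurality_class(pnn, labels)[0])   # 1
```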
Performance • Experimented on two sets of aerial photographs of the Best Management Plot (BMP) of Oakes Irrigation Test Area (OITA), ND • Data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label) • Band values range from 0 to 255 (8 bits) • Considering 8 classes or levels of yield values: 0 to 7
Performance – Accuracy 1997 Dataset:
Performance - Accuracy (cont.) 1998 Dataset:
Performance - Time 1997 Dataset: both axes in logarithmic scale
Performance - Time (cont.) 1998 Dataset: both axes in logarithmic scale
k-Clustering
Partitioning data into k clusters $C_1, C_2, \ldots, C_k$ so as to minimize some criterion function, such as:
• the sum of squared Euclidean distances measured from the centroid of each cluster, or total variance: $\sum_{i=1}^{k} \sum_{p \in C_i} d^2(p, c_i)$, where $c_i$ is the centroid or mean of $C_i$
• or the sum of the pair-wise weights: $\sum_{i=1}^{k} \sum_{p, q \in C_i} w(p, q)$, where $w$ is the weight function, usually the distance between p and q
k-Means Algorithm
1. Arbitrarily select k initial cluster centers
2. Assign each data point to its nearest center
3. Update the centers by the means of the clusters
4. Repeat steps 2 & 3 until no change
• Good optimization, very slow
• Complexity O(nNkt), where n = # of dimensions, N = # of data points, k = # of clusters, t = # of iterations
• To solve the speed issue, some other algorithms have been proposed that sacrifice quality
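The four steps can be sketched as follows. This is a plain textbook version for reference, not the P-tree-based method of this presentation. The initial centers are passed in explicitly instead of being picked arbitrarily, so that the run is repeatable. The sample points are made up.

```python
def kmeans(points, centers, max_iter=100):
    """Plain k-means; returns (centers, clusters) once the centers stop changing."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (squared Euclidean distance)
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Step 3: move each center to the mean of its cluster (empty clusters keep theirs)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when nothing changes
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(1, 1), (8, 8)])
print(centers)
print(clusters)
```

Each iteration touches every point, every center, and every dimension, which is where the O(nNkt) cost on the slide comes from.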
Divisive Approach 1. Initially consider the whole space as one hyperbox 2. Select a hyperbox to split 3. Select an axis and cut-point 4. Split the selected hyperbox by a hyperplane perpendicular to the selected axis through the selected cut-point 5. Repeat step 2-4 until there are k hyperboxes, each hyperbox is a cluster Mean-split algorithm, variance-based algorithm and our proposed new algorithm follow the divisive approach They differ in the strategies for selecting the hyperbox, axis and cut-point.
Mean-Split Algorithm
• The initial hyperbox (the whole space) is assigned a number k; that is, k clusters will be formed from this hyperbox
• Let L = number of clusters assigned to a hyperbox
• Li clusters are assigned to the ith sub-hyperbox, where i = 1, 2 [formula missing in the source; it uses n = # of points, V = volume, and a parameter ranging from 0 to 1]
• Select a hyperbox with L > 1
• Select the axis with the largest spread of projected data
• Mean of the projected data is the cut-point
• Fast but poor optimization
Variance-Based Algorithm
1. Select the hyperbox with the largest variance
2. By checking each point on each dimension of the selected hyperbox, find the optimal cut-point, $t_{opt}$, that gives the maximum variance reduction on the projected data, i.e. that minimizes $w_1 \sigma_1^2 + w_2 \sigma_2^2$, where $w_i$ and $\sigma_i^2$ are the weight and variance of the ith interval (i = 1, 2)
• Still computationally costly, but optimization is closer to k-means
Our Algorithm
• When a new hyperbox is formed, find two means m1 and m2 for each dimension using the projected data:
  a. Arbitrarily select two values for m1 and m2 (m1 < m2)
  b. Update m1 = mean of the interval [0, (m1+m2)/2]
  c. Update m2 = mean of the interval [(m1+m2)/2, upper_limit]
  d. Repeat steps b & c until no change in m1 and m2
• Select the hyperbox and axis for which (m2 – m1) is largest
• Cut-point = (m1 + m2) / 2
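Steps a-d amount to a two-means iteration on one-dimensional projected data. The sketch below works on a plain list of values rather than on P-trees. It assumes that both means are updated from the same cut-point within an iteration and that a value equal to the cut-point goes to the lower interval. The slide leaves both points open.

```python
def two_means(values, m1, m2, max_iter=100):
    """Iterate two means on 1-D data; returns (m1, m2, cut_point)."""
    for _ in range(max_iter):
        cut = (m1 + m2) / 2
        left = [v for v in values if v <= cut]
        right = [v for v in values if v > cut]
        if not left or not right:
            break                                  # one side is empty: keep current means
        new_m1 = sum(left) / len(left)             # step b
        new_m2 = sum(right) / len(right)           # step c
        if (new_m1, new_m2) == (m1, m2):
            break                                  # step d: no change
        m1, m2 = new_m1, new_m2
    return m1, m2, (m1 + m2) / 2

projected = [1, 2, 3, 10, 11, 12]                  # made-up projected data
print(two_means(projected, m1=0, m2=5))            # (2.0, 11.0, 6.5)
```

The gap m2 - m1 returned for each hyperbox and axis is the quantity the slide uses to pick where to split.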
Our Algorithm (cont.)
• We represent each cluster by a P-tree; the initial cluster is the pure1-tree, P1
• Let PCi be the P-tree for cluster Ci
• The P-trees for the two new clusters after splitting along axis j: PCi1 = PCi & Pj(0, (m1+m2)/2) and PCi2 = PCi & Pj((m1+m2)/2, upper_limit)
• Note: Pj((m1+m2)/2, upper_limit) = complement of Pj(0, (m1+m2)/2)
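Computing Sum & Mean from P-trees
With bit j = 1 the highest-order bit of an m-bit value:
• For all points and for dimension or band i: $sum = \sum_{j=1}^{m} 2^{m-j} \, rc(P_{i,j})$ and $mean = \frac{sum}{N}$
• For the points in a cluster: $sum = \sum_{j=1}^{m} 2^{m-j} \, rc(P_t \,\&\, P_{i,j})$ and $mean = \frac{sum}{rc(P_t)}$
• Here the template P-tree, $P_t$ = P-tree representing the cluster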
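The idea of getting a sum from root counts alone can be checked on a small band. As an assumption for illustration, flat bit vectors (Python integers, one bit per pixel) stand in for the basic P-trees, with bit j = 1 taken as the highest-order bit. The names and data are hypothetical.

```python
def bit_planes(values, m=8):
    """planes[j-1] is the bit vector for bit j, where j = 1 is the highest-order bit."""
    planes = []
    for j in range(1, m + 1):
        plane = 0
        for pixel, v in enumerate(values):
            if (v >> (m - j)) & 1:
                plane |= 1 << pixel
        planes.append(plane)
    return planes

def rc(mask):
    return bin(mask).count("1")

def band_sum(planes, template=None):
    """Sum of the band values: each bit plane's root count weighted by its place value."""
    m = len(planes)
    total = 0
    for j, plane in enumerate(planes, start=1):
        if template is not None:
            plane &= template                    # restrict to the cluster's pixels
        total += (1 << (m - j)) * rc(plane)
    return total

band = [105, 104, 107, 105, 200, 13, 106, 105]   # made-up band values
planes = bit_planes(band)
print(band_sum(planes), sum(band))               # both 845
cluster = 0b00001111                             # template: pixels 0..3
print(band_sum(planes, cluster) / rc(cluster))   # cluster mean 105.25
```

Only m root counts are needed per band, so the pixel values themselves are never scanned.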
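Computing Variance from P-trees
$Variance = \frac{1}{N} \sum_{x} (x - mean)^2 = \frac{1}{N} \sum_{x} x^2 - mean^2$
• For all points in the space: $\sum_{x} x^2 = \sum_{j=1}^{m} \sum_{k=1}^{m} 2^{2m-j-k} \, rc(P_{i,j} \,\&\, P_{i,k})$
• For the points in a cluster: $\sum_{x} x^2 = \sum_{j=1}^{m} \sum_{k=1}^{m} 2^{2m-j-k} \, rc(P_t \,\&\, P_{i,j} \,\&\, P_{i,k})$, with N replaced by $rc(P_t)$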
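A sum of squares can be obtained from root counts of pairwise ANDs of bit planes, because squaring a value expands into products of its bits. The sketch below checks this numerically. It again uses flat bit vectors as stand-ins for P-trees, takes bit j = 1 as the highest-order bit, and requires a non-empty template. All names and data are hypothetical.

```python
def bit_planes(values, m=8):
    """planes[j-1] is the bit vector for bit j, where j = 1 is the highest-order bit."""
    planes = []
    for j in range(1, m + 1):
        plane = 0
        for pixel, v in enumerate(values):
            if (v >> (m - j)) & 1:
                plane |= 1 << pixel
        planes.append(plane)
    return planes

def rc(mask):
    return bin(mask).count("1")

def variance_from_planes(planes, n_pixels, template=None):
    """Population variance of a band (or of a cluster given by a non-empty template)."""
    m = len(planes)
    if template is None:
        template = (1 << n_pixels) - 1           # all pixels
    count = rc(template)
    total = 0
    total_sq = 0
    for j, pj in enumerate(planes, start=1):
        total += (1 << (m - j)) * rc(pj & template)
        for k, pk in enumerate(planes, start=1):
            # bit j AND bit k both set contributes 2^(m-j) * 2^(m-k) to x^2
            total_sq += (1 << (2 * m - j - k)) * rc(pj & pk & template)
    mean = total / count
    return total_sq / count - mean ** 2

band = [105, 104, 107, 105, 200, 13, 106, 105]   # made-up band values
planes = bit_planes(band)
mu = sum(band) / len(band)
direct = sum((v - mu) ** 2 for v in band) / len(band)
print(variance_from_planes(planes, len(band)), direct)
```

The two printed values agree up to floating-point rounding.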
Performance • Unlike the variance-based method, which checks each point on the axis, our method rapidly converges to the optimal cut-point, topt • Avoids scanning the database by computing sum and mean from the root counts of the P-trees • Much faster than the variance-based method, with optimization as good as the variance-based method
Conclusion • Analyzed the effect of various distance metrics • Used a new metric, HOB distance, for fast P-tree-based computation • Revealed useful properties of P-trees • Using P-trees, developed a fast new method of KNN, called Closed-KNN, giving higher classification accuracy • Designed a new fast k-clustering algorithm: computing sum, mean, and variance from P-trees without scanning databases