This presentation discusses index methods, the curse of dimensionality, and classifiers, with applications to classification and image retrieval. The focus is on dimension pruning and on using an RCE-network as the classifier. Experiments, including active learning with feedback, are also explored.
Index methods, Curse of Dimensionality, and classifiers Presented by: Ding-Ying Chiu Date: 2008/10/24
Outline • Motivation • Index methods • Estimate a lower bound • Classifier: RCE-network • Curse of Dimensionality • VA-file • Our method - dimension pruning • Experiments
Classification & Image Retrieval [16] • Active Learning • Feedback
Classification & Image Retrieval [16] • (Figure: a retrieval feedback loop over candidate photos, repeated until the user terminates.)
Related work: Index methods • Coordinate-based • Space foundation: K-D-tree[12], K-D-B-tree[3] • Data foundation: R-tree[12], SS-tree[5] • Distance-based • Multiple reference points[7][8], M-tree[9]
Coordinate-based: Space foundation [12] • Main idea: estimate a lower bound on the distance from the query to each grid block, and skip any block whose bound exceeds the best distance found so far • Example: finding the 1-nearest neighbor in a grid, 30 of 40 blocks are pruned, i.e. pruning rate = 30/40 = 0.75, saving the corresponding disk accesses.
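A minimal sketch of this block-pruning idea, assuming axis-aligned grid blocks and illustrative helper names (the slide itself gives no code):

```python
import math

def lower_bound_to_block(q, block_min, block_max):
    """Smallest possible distance from query q to any point in the block."""
    s = 0.0
    for qi, lo, hi in zip(q, block_min, block_max):
        if qi < lo:
            s += (lo - qi) ** 2
        elif qi > hi:
            s += (qi - hi) ** 2
        # if lo <= qi <= hi, this dimension contributes nothing
    return math.sqrt(s)

def grid_1nn(q, blocks):
    """blocks: list of (block_min, block_max, points) grid cells."""
    best, best_d = None, float("inf")
    # Visit blocks in order of their lower bound so pruning kicks in early.
    for bmin, bmax, points in sorted(
            blocks, key=lambda b: lower_bound_to_block(q, b[0], b[1])):
        if lower_bound_to_block(q, bmin, bmax) >= best_d:
            break  # every remaining block is at least this far: all pruned
        for p in points:
            d = math.dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d
```

Only the surviving blocks need to be read from disk, which is where the 0.75 pruning rate pays off.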
Coordinate-based: Data foundation [12] • Main idea: estimate a lower bound from each node's bounding region • The number of data contained in a node is between m and M, with m ≤ M/2 (e.g. m: 3, M: 6) • Example: finding the 1-nearest neighbor of a query.
Distance-based: Multiple reference points [7][8] • Main idea: estimate a lower bound • Precompute d(r, xi) for every datum xi against a reference point r; for a query q, the triangle inequality gives d(q, xi) ≥ |d(r, q) − d(r, xi)|, so xi can be pruned without computing its actual distance whenever this bound exceeds the current best • Example dimension = 30.
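A minimal sketch of this pruning rule, using a single reference point for brevity (the cited papers [7][8] use several; the helper names are illustrative):

```python
import math

def one_nn_with_reference(q, r, data, dist_to_r):
    """data[i] is a point; dist_to_r[i] = d(r, data[i]) was precomputed."""
    d_rq = math.dist(r, q)
    best, best_d = None, float("inf")
    for x, d_rx in zip(data, dist_to_r):
        # Triangle inequality: d(q, x) >= |d(r, x) - d(r, q)|.
        if abs(d_rx - d_rq) >= best_d:
            continue  # lower bound too large: prune without computing d(q, x)
        d = math.dist(q, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d
```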
Advantages & Disadvantages: R-tree & Grid – static data • Grid vs. R-tree (m: 2, M: 5).
Advantages & Disadvantages: R-tree & Grid – dynamic data • Grid vs. R-tree (m: 2, M: 5) • Application: moving objects such as cars.
The advantages of distance-based index: Unknown coordinates [11] • Distance-based index methods can be used in a space in which the coordinates are unknown • The dimension of a space F, to which the instances are projected, can be very high, possibly infinite.
The advantages of distance-based index: Characteristic of the space F [11] • The similarity between any two instances is measured by a kernel function, the only available operator: Kn(x1, x2) • The self-similarity Kn(x, x) = 1, so the instances lie on the surface of a unit hypersphere • No coordinates are available in F.
The advantages of distance-based index: Query processing [11] • Finding the 1-nearest neighbor using kernel evaluations alone.
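A minimal sketch of such a query: since Kn(x, x) = 1, the feature-space distance satisfies d(x1, x2)² = Kn(x1,x1) − 2 Kn(x1,x2) + Kn(x2,x2) = 2 − 2 Kn(x1,x2), so maximizing the kernel value is equivalent to minimizing the distance. The RBF kernel below is an illustrative choice; any normalized kernel works.

```python
import math

def rbf_kernel(x1, x2, gamma=0.5):
    """A normalized kernel: K(x, x) = 1 for every x."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def kernel_1nn(q, data, K=rbf_kernel):
    # No coordinates in F are needed: only kernel evaluations.
    best = max(data, key=lambda x: K(q, x))
    return best, math.sqrt(2 - 2 * K(q, best))
```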
Classifier and index methods • Neural network • SVM: two classes y = 1 and y = −1 are separated by a hyperplane f(x) with margin = |d+| + |d−|; the maximum-margin solution is found with Lagrange multipliers.
RCE-network: Constraints • The RCE-network uses circles to cover the training data with the following constraints: • (a) Every training datum must be covered by a circle • (b) The training data covered by a circle must all be in the same class.
RCE-network: Structure • Input layer, hidden layer, output layer • Each hidden node stores a circle center (w11, w21, …, wn1) and a radius r1, and feeds the output node of its class C1.
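A minimal sketch of how such a network classifies a test datum under the constraints above; the handling of uncovered or ambiguously covered data is an assumption, not specified on the slides:

```python
import math

def rce_classify(x, circles):
    """circles: list of (center, radius, label) satisfying constraints (a)-(b)."""
    covering = {label for center, radius, label in circles
                if math.dist(x, center) <= radius}
    if len(covering) == 1:
        return covering.pop()
    return None  # uncovered or ambiguous: reject / fall back (assumption)
```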
RCE-network algorithm: Rajan's algorithm [1] • Input: training data, an initial radius, and a radius reduction rate α (0 < α < 1) • Output: an RCE-network • Drawback: the algorithm needs to scan the data many times.
RCE-network algorithm: Mu's algorithm [2] • Input: training data and a radius reduction rate α (0 < α < 1), e.g. α = 0.5 • Output: an RCE-network • Drawback: the algorithm produces a large number of circles.
RCE-network algorithm: Our method – Radius Expansion algorithm (RE) • Built on 1-NN queries and range queries (figure: queries q1, q2 against data p1, p2).
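A hedged sketch of one plausible reading of the Radius Expansion idea, reconstructed from the slide's keywords rather than the authors' exact algorithm: each circle grows to just under the distance of the nearest differently-classed datum (a 1-NN query), and a range query then marks same-class data already covered.

```python
import math

def radius_expansion(data):
    """data: list of (point, label). Returns (center, radius, label) circles."""
    circles, covered = [], set()
    for i, (p, label) in enumerate(data):
        if i in covered:
            continue
        # 1-NN query over the other classes bounds how far the circle may grow.
        r = 0.99 * min((math.dist(p, q) for q, l in data if l != label),
                       default=float("inf"))
        # Range query: same-class data already inside need no circle of their own.
        for j, (q, l) in enumerate(data):
            if l == label and math.dist(p, q) <= r:
                covered.add(j)
        circles.append((p, r, label))
    return circles
```

By construction every datum is covered (constraint a) and no circle reaches a datum of another class (constraint b).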
Curse of Dimensionality (from Introduction to Data Mining) • When dimensionality increases, data become increasingly sparse in the space they occupy • Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful • Experiment on (Max−Min)/Min: randomly generate 500 points and compute the difference between the maximum and minimum distance over all pairs of points.
(Max−Min)/Min: 1D & 2D, 11 nodes • Randomly generate 11 nodes on a one-dimensional space [0, 10]: Max = 10, Min = 1, so Max − Min = 9 and (Max−Min)/Min = 9 • 2D: the maximum distance in the 10×10 square is 14.14; keeping the same ratio, (14.14 − x)/x = 9 gives x = 1.414 • A circle of that radius has area πr² = 3.14159 × 1.414² = 6.28128; 6.28128/4 ≈ 1.57, which is 1.57/(10×10) = 0.0157 of the square.
(Max−Min)/Min: 1D & 2D, 101 nodes • Randomly generate 101 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 0.1, Max − Min = 9.9, and 9.9/0.1 = 99 • 2D: (14.14 − x)/x = 99 gives x = 0.1414 • πr² = 3.14159 × 0.1414² = 0.0628128; 0.0628128/4 ≈ 0.0157, which is 0.0157/(10×10) = 0.000157 of the square.
(Max−Min)/Min: 1D & 3D • Randomly generate 11 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 1, Max − Min = 9 • 3D: the maximum distance is 17.32; (17.32 − x)/x = 9 gives x = 1.732 • Sphere volume (4/3)πr³ = (4/3) × 3.14159 × 1.732³ = 21.7637; 21.7637/8 ≈ 2.72, which is 2.72/(10×10×10) = 0.00272 of the cube • The fraction of space around each node thus shrinks rapidly as the dimension grows.
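A small experiment reproducing the (Max−Min)/Min observation empirically; the 500 points match the earlier slide, while the list of dimensions is an illustrative choice:

```python
import itertools
import math
import random

def max_min_ratio(dim, n=500):
    """(Max - Min) / Min over all pairwise distances of n random points."""
    pts = [[random.uniform(0, 10) for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(a, b) for a, b in itertools.combinations(pts, 2)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (1, 2, 10, 100):
    print(dim, round(max_min_ratio(dim), 3))  # the ratio decays toward 0
```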
(Max−Min)/Min → 0: Coordinate-based, data foundation • As (Max−Min)/Min approaches 0, the bounding regions of an R-tree can no longer separate near data from far data, so pruning fails.
(Max−Min)/Min → 0: Distance-based • As (Max−Min)/Min approaches 0, d(r, q) ≈ d(r, x1) for every datum, so the lower bound |d(r, q) − d(r, x1)| is near 0 and the reference point prunes nothing.
Related work • A Survey on High Dimensional Spaces and Indexing, COMP 530: Database Architecture and Implementation (The Hong Kong University of Science and Technology), by Wu Hai Liang, Lam Man Lung, Lo Ming Fun, Yuen Chi Kei, Ng Chun Bong.
Index method: Distance-based • 10,000 uniform data.
VA-file [13]: Main idea – Bit Vector • For every partitioning and clustering method there is a dimensionality d such that, on average, all blocks are accessed if the number of dimensions exceeds d • In high dimensions a linear scan is therefore competitive; the VA-file makes the scan cheap by scanning compact bit-vector approximations instead of the full vectors.
VA-file [13]: Bounds • The approximations are scanned linearly; distance bounds are computed from them without disk access • Example: for a query at (0.7, 0.3), a lower bound of 0.45 to a datum's cell is obtained from the bit vector alone.
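A minimal sketch of the VA-file bound computation, assuming data normalized to [0, 1) and illustrative helper names and cell resolution:

```python
import math

def quantize(p, bits=2):
    """Approximate a point by its grid cell: a few bits per dimension."""
    cells = 2 ** bits                  # e.g. 4 intervals per dimension
    return tuple(min(int(v * cells), cells - 1) for v in p)

def cell_lower_bound(q, approx, bits=2):
    """Smallest possible distance from q to any point inside the cell."""
    cells = 2 ** bits
    s = 0.0
    for qi, c in zip(q, approx):
        lo, hi = c / cells, (c + 1) / cells
        if qi < lo:
            s += (lo - qi) ** 2
        elif qi > hi:
            s += (qi - hi) ** 2
    return math.sqrt(s)

# Usage: scan all approximations, keep only data whose lower bound beats the
# current best, and fetch just those full vectors from disk.
```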
Our pruning method • Step 1: find an approximate nearest neighbor A of the query Q • Step 2: dimension pruning • The underlying task: find the 1-nearest neighbor of Q from the training data whose classes are different from that of A.
Dimension pruning: The number of the computed dimensions • Given a query Q and its approximate nearest neighbor A • When Σᵢ₌₁ᵏ (qᵢ − pᵢ)² ≥ DIS(Q, A)², where 1 ≤ k ≤ n, P cannot become the nearest datum of Q, since DIS(Q, A) is already no larger than DIS(Q, P) • The smallest such k is called the number of the computed dimensions (NCD) of P • Example: Q(1, 1, 1, 1, 1, 1, 1), A(2, 2, 2, 2, 2, 2, 2), DIS(Q, A) = √7, P(1, 3, 2, 1, 3, 5, 1): the partial sums are 0, 4, 5, 5, 9, …, first reaching 7 at k = 5, so the NCD of P is 5.
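A minimal sketch of this partial-sum computation, reproducing the slide's example:

```python
def ncd(Q, P, threshold_sq):
    """Number of computed dimensions before P is pruned against threshold_sq."""
    s = 0.0
    for k, (qi, pi) in enumerate(zip(Q, P), start=1):
        s += (qi - pi) ** 2
        if s >= threshold_sq:      # P can no longer beat A: prune here
            return k
    return len(Q)                  # fully computed: P is a true candidate

Q = (1, 1, 1, 1, 1, 1, 1)
A = (2, 2, 2, 2, 2, 2, 2)
P = (1, 3, 2, 1, 3, 5, 1)
print(ncd(Q, P, threshold_sq=7))   # DIS(Q, A)^2 = 7 -> prints 5
```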
Dimension pruning: ANCD • Given a data set X, the Average Number of the Computed Dimensions (ANCD) of X is the sum of the NCDs of the data in X divided by the number of data in X • Example: Q(1, 1, 1, 1, 1, 1, 1), A(2, 2, 2, 2, 2, 2, 2), DIS(Q, A) = √7 • P1(1, 3, 2, 1, 3, 5, 1): NCD 5; P2(5, 1, 4, 2, 7, 1, 2): NCD 1; P3(1, 3, 1, 3, 7, 1, 1): NCD 4; P4(3, 3, 1, 8, 1, 2, 5): NCD 2 • ANCD of X = (5 + 1 + 4 + 2)/4 = 3.
Dimension pruning • Property 4.3 • In an n-dimensional space, given a query Q and its approximate nearest neighbor A, suppose X and Y are two node sets: X = {x | x is a node and DIS(Q, x) = d}, Y = {y | y is a node and DIS(Q, y) = r·d} • The data of X are uniformly distributed on the surface of a hyper-sphere with center Q and radius d; the data of Y are uniformly distributed on the surface of a hyper-sphere with center Q and radius r·d • If the ANCD of X is m, with m < n, then the ANCD of Y is m/r².
Dimension pruning • Dimension = 1000, ANCD = 999.9 at the base radius d • On spheres of growing radius r·d the ANCD falls as m/r²: 250, 111.11, 62.5, 40, 27.78, 20.4 (matching r = 2, 3, 4, 5, 6, 7).
Dimension pruning: Average value of each dimension • Dimension = 6, d = 10: the squared distance d² = 100 is spread over the dimensions, so the average value per dimension is d²/n = 100/6 = 16.67 • Example decompositions of 100 over six dimensions: 30 + 30 + 10 + 10 + 10 + 10 and 30 + 20 + 20 + 15 + 10 + 5.
Dimension pruning: Concept • Dimension = 100 • (Figure: pruning behaviour compared for d = 9, d = 10, and d = 20; the larger a datum's distance relative to d, the fewer dimensions are computed before it is pruned.)
Dimension pruning: Experiments • In a 100-dimensional space, we generate a query Q and its approximate nearest datum A with DIS(Q, A) = 90 • We produce the data on the surfaces of five hyper-spheres, all centered at Q, with radii ranging from 100 to 500 • For each hyper-sphere, 100,000 uniform data are produced on its surface.
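A hedged sketch of this experiment (far fewer points than the slide's 100,000, to keep the run fast): generate uniform points on each hyper-sphere surface and measure the average number of dimensions computed before pruning.

```python
import math
import random

def random_on_sphere(center, radius, dim):
    """Uniform point on the surface of a hyper-sphere around center."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [c + radius * x / norm for c, x in zip(center, v)]

dim, d = 100, 90.0                       # DIS(Q, A) = 90, as on the slide
Q = [0.0] * dim
threshold_sq = d * d
for radius in (100, 200, 300, 400, 500):
    total = 0
    for _ in range(1000):
        P = random_on_sphere(Q, radius, dim)
        s, k = 0.0, dim
        for i in range(dim):
            s += (Q[i] - P[i]) ** 2
            if s >= threshold_sq:
                k = i + 1
                break
        total += k
    print(radius, total / 1000)          # ANCD shrinks roughly as 1/radius^2
```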
Dimension pruning: Variance • Variance affects dimension pruning • We can analyze the variance of each dimension and change the computation order of the Euclidean distance, as sketched below.
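A minimal sketch of this idea, assuming the natural reading that high-variance dimensions are accumulated first so the partial sum crosses the pruning threshold sooner on average:

```python
def variance_order(data):
    """Indices of dimensions sorted by variance over the data, highest first."""
    n, dim = len(data), len(data[0])
    means = [sum(p[i] for p in data) / n for i in range(dim)]
    var = [sum((p[i] - means[i]) ** 2 for p in data) / n for i in range(dim)]
    return sorted(range(dim), key=lambda i: -var[i])

def ncd_ordered(Q, P, threshold_sq, order):
    """NCD when dimensions are visited in the given order."""
    s = 0.0
    for k, i in enumerate(order, start=1):
        s += (Q[i] - P[i]) ** 2
        if s >= threshold_sq:
            return k
    return len(order)
```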
1-NN query: Approximate 1-NN, random selection • Example: from N = 6 ranked data, k = 3 are selected at random • Expected minimum rank = (1×10 + 2×6 + 3×3 + 4×1)/20 = 1.75.
1-NN query: Approximate 1-NN • In general, if k of N data are selected, the expected minimum rank is (N + 1)/(k + 1); e.g. (1×10 + 2×6 + 3×3 + 4×1)/20 = 1.75 = (6 + 1)/(3 + 1).
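A quick check of this formula by brute force: enumerate all C(6, 3) = 20 ways of sampling k = 3 of N = 6 ranked data and average the best (minimum) rank in each sample.

```python
from itertools import combinations

N, k = 6, 3
samples = list(combinations(range(1, N + 1), k))  # all 20 possible selections
avg = sum(min(s) for s in samples) / len(samples)
print(avg, (N + 1) / (k + 1))                     # both print 1.75
```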
1-NN query: Approximate 1-NN, greedy selection • (Figure: a reference point R and data A, B, C around the query Q; candidates are selected greedily using their distances to R.)
Approximate 1-NN: Experiment • The number of dimensions of USPS is 256, of MNIST 784, and of LETTER 16; USPS provides 7,291 training data • Based on Formula (16), the average minimum rank of the random selection method on 7,291 data with 200 selected data is (7291 + 1)/(200 + 1) ≈ 36.27, versus the measured 11.47.