This presentation discusses index methods, the curse of dimensionality, and classifiers, with applications to classification and image retrieval. The focus is on dimension pruning and on using an RCE-network as the classifier. Experiments, including active learning with feedback, are also explored.
Index methods, Curse of Dimensionality, and classifiers Presented by: Ding-Ying Chiu Date: 2008/10/24
Outline • Motivation • Index methods • Estimate a lower bound • Classifier: RCE-network • Curse of Dimensionality • VA-file • Our method - dimension pruning • Experiments
Classification & Image Retrieval [16] • Active Learning • Feedback
Classification & Image Retrieval [16] • (Figure: a retrieval feedback loop over candidate photos, repeated until the user terminates.)
Related work: Index methods • Coordinate-based • Space foundation: K-D-tree[12], K-D-B-tree[3] • Data foundation: R-tree[12], SS-tree[5] • Distance-based • Multiple reference points[7][8], M-tree[9]
Coordinate-based: Space foundation [12] • Main idea: estimate a lower bound on the distance from the query to each grid block, and skip any block whose bound exceeds the best distance found so far • Example: finding the 1-nearest neighbor in a grid, 30 of 40 blocks are pruned, i.e. pruning rate = 30/40 = 0.75, saving the corresponding disk accesses.
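A minimal sketch of this block-pruning idea, assuming axis-aligned grid blocks and illustrative helper names (the slide itself gives no code):

```python
import math

def lower_bound_to_block(q, block_min, block_max):
    """Smallest possible distance from query q to any point in the block."""
    s = 0.0
    for qi, lo, hi in zip(q, block_min, block_max):
        if qi < lo:
            s += (lo - qi) ** 2
        elif qi > hi:
            s += (qi - hi) ** 2
        # if lo <= qi <= hi, this dimension contributes nothing
    return math.sqrt(s)

def grid_1nn(q, blocks):
    """blocks: list of (block_min, block_max, points) grid cells."""
    best, best_d = None, float("inf")
    # Visit blocks in order of their lower bound so pruning kicks in early.
    for bmin, bmax, points in sorted(
            blocks, key=lambda b: lower_bound_to_block(q, b[0], b[1])):
        if lower_bound_to_block(q, bmin, bmax) >= best_d:
            break  # every remaining block is at least this far: all pruned
        for p in points:
            d = math.dist(q, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d
```

Only the surviving blocks need to be read from disk, which is where the 0.75 pruning rate pays off.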
Coordinate-based: Data foundation [12] • Main idea: estimate a lower bound from each node's bounding region • The number of data contained in a node is between m and M, with m ≤ M/2 (e.g. m: 3, M: 6) • Example: finding the 1-nearest neighbor of a query.
Distance-based: Multiple reference points [7][8] • Main idea: estimate a lower bound • Precompute d(r, xi) for every datum xi against a reference point r; for a query q, the triangle inequality gives d(q, xi) ≥ |d(r, q) − d(r, xi)|, so xi can be pruned without computing its actual distance whenever this bound exceeds the current best • Example dimension = 30.
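A minimal sketch of this pruning rule, using a single reference point for brevity (the cited papers [7][8] use several; the helper names are illustrative):

```python
import math

def one_nn_with_reference(q, r, data, dist_to_r):
    """data[i] is a point; dist_to_r[i] = d(r, data[i]) was precomputed."""
    d_rq = math.dist(r, q)
    best, best_d = None, float("inf")
    for x, d_rx in zip(data, dist_to_r):
        # Triangle inequality: d(q, x) >= |d(r, x) - d(r, q)|.
        if abs(d_rx - d_rq) >= best_d:
            continue  # lower bound too large: prune without computing d(q, x)
        d = math.dist(q, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d
```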
Advantages & Disadvantages: R-tree & Grid – static data • Grid vs. R-tree (m: 2, M: 5).
Advantages & Disadvantages: R-tree & Grid – dynamic data • Grid vs. R-tree (m: 2, M: 5) • Application: moving objects such as cars.
The advantages of distance-based index: Unknown coordinates [11] • Distance-based index methods can be used in a space in which the coordinates are unknown • The dimension of a space F, to which the instances are projected, can be very high, possibly infinite.
The advantages of distance-based index: Characteristic of the space F [11] • The similarity between any two instances is measured by a kernel function, the only available operator: Kn(x1, x2) • The self-similarity Kn(x, x) = 1, so the instances lie on the surface of a unit hypersphere • No coordinates are available in F.
The advantages of distance-based index: Query processing [11] • Finding the 1-nearest neighbor using kernel evaluations alone.
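A minimal sketch of such a query: since Kn(x, x) = 1, the feature-space distance satisfies d(x1, x2)² = Kn(x1,x1) − 2 Kn(x1,x2) + Kn(x2,x2) = 2 − 2 Kn(x1,x2), so maximizing the kernel value is equivalent to minimizing the distance. The RBF kernel below is an illustrative choice; any normalized kernel works.

```python
import math

def rbf_kernel(x1, x2, gamma=0.5):
    """A normalized kernel: K(x, x) = 1 for every x."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def kernel_1nn(q, data, K=rbf_kernel):
    # No coordinates in F are needed: only kernel evaluations.
    best = max(data, key=lambda x: K(q, x))
    return best, math.sqrt(2 - 2 * K(q, best))
```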
Classifier and index methods • Neural network • SVM: two classes y = 1 and y = −1 are separated by a hyperplane f(x) with margin = |d+| + |d−|; the maximum-margin solution is found with Lagrange multipliers.
RCE-network: Constraints • The RCE-network uses circles to cover the training data with the following constraints: • (a) Every training datum must be covered by a circle • (b) The training data covered by a circle must all be in the same class.
RCE-network: Structure • Input layer, hidden layer, output layer • Each hidden node stores a circle center (w11, w21, …, wn1) and a radius r1, and feeds the output node of its class C1.
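A minimal sketch of how such a network classifies a test datum under the constraints above; the handling of uncovered or ambiguously covered data is an assumption, not specified on the slides:

```python
import math

def rce_classify(x, circles):
    """circles: list of (center, radius, label) satisfying constraints (a)-(b)."""
    covering = {label for center, radius, label in circles
                if math.dist(x, center) <= radius}
    if len(covering) == 1:
        return covering.pop()
    return None  # uncovered or ambiguous: reject / fall back (assumption)
```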
RCE-network algorithm: Rajan's algorithm [1] • Input: training data, an initial radius, and a radius reduction rate α (0 < α < 1) • Output: an RCE-network • Drawback: the algorithm needs to scan the data many times.
RCE-network algorithm: Mu's algorithm [2] • Input: training data and a radius reduction rate α (0 < α < 1), e.g. α = 0.5 • Output: an RCE-network • Drawback: the algorithm produces a large number of circles.
RCE-network algorithm: Our method – Radius Expansion algorithm (RE) • Built on 1-NN queries and range queries (figure: queries q1, q2 against data p1, p2).
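A hedged sketch of one plausible reading of the Radius Expansion idea, reconstructed from the slide's keywords rather than the authors' exact algorithm: each circle grows to just under the distance of the nearest differently-classed datum (a 1-NN query), and a range query then marks same-class data already covered.

```python
import math

def radius_expansion(data):
    """data: list of (point, label). Returns (center, radius, label) circles."""
    circles, covered = [], set()
    for i, (p, label) in enumerate(data):
        if i in covered:
            continue
        # 1-NN query over the other classes bounds how far the circle may grow.
        r = 0.99 * min((math.dist(p, q) for q, l in data if l != label),
                       default=float("inf"))
        # Range query: same-class data already inside need no circle of their own.
        for j, (q, l) in enumerate(data):
            if l == label and math.dist(p, q) <= r:
                covered.add(j)
        circles.append((p, r, label))
    return circles
```

By construction every datum is covered (constraint a) and no circle reaches a datum of another class (constraint b).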
Curse of Dimensionality (from Introduction to Data Mining) • When dimensionality increases, data become increasingly sparse in the space they occupy • Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful • Experiment on (Max−Min)/Min: randomly generate 500 points and compute the difference between the maximum and minimum distance over all pairs of points.
(Max−Min)/Min: 1D & 2D, 11 nodes • Randomly generate 11 nodes on a one-dimensional space [0, 10]: Max = 10, Min = 1, so Max − Min = 9 and (Max−Min)/Min = 9 • 2D: the maximum distance in the 10×10 square is 14.14; keeping the same ratio, (14.14 − x)/x = 9 gives x = 1.414 • A circle of that radius has area πr² = 3.14159 × 1.414² = 6.28128; 6.28128/4 ≈ 1.57, which is 1.57/(10×10) = 0.0157 of the square.
(Max−Min)/Min: 1D & 2D, 101 nodes • Randomly generate 101 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 0.1, Max − Min = 9.9, and 9.9/0.1 = 99 • 2D: (14.14 − x)/x = 99 gives x = 0.1414 • πr² = 3.14159 × 0.1414² = 0.0628128; 0.0628128/4 ≈ 0.0157, which is 0.0157/(10×10) = 0.000157 of the square.
(Max−Min)/Min: 1D & 3D • Randomly generate 11 nodes on the one-dimensional space [0, 10]: Max = 10, Min = 1, Max − Min = 9 • 3D: the maximum distance is 17.32; (17.32 − x)/x = 9 gives x = 1.732 • Sphere volume (4/3)πr³ = (4/3) × 3.14159 × 1.732³ = 21.7637; 21.7637/8 ≈ 2.72, which is 2.72/(10×10×10) = 0.00272 of the cube • The fraction of space around each node thus shrinks rapidly as the dimension grows.
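A small experiment reproducing the (Max−Min)/Min observation empirically; the 500 points match the earlier slide, while the list of dimensions is an illustrative choice:

```python
import itertools
import math
import random

def max_min_ratio(dim, n=500):
    """(Max - Min) / Min over all pairwise distances of n random points."""
    pts = [[random.uniform(0, 10) for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(a, b) for a, b in itertools.combinations(pts, 2)]
    return (max(dists) - min(dists)) / min(dists)

for dim in (1, 2, 10, 100):
    print(dim, round(max_min_ratio(dim), 3))  # the ratio decays toward 0
```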
(Max−Min)/Min → 0: Coordinate-based, data foundation • As (Max−Min)/Min approaches 0, the bounding regions of an R-tree can no longer separate near data from far data, so pruning fails.
(Max−Min)/Min → 0: Distance-based • As (Max−Min)/Min approaches 0, d(r, q) ≈ d(r, x1) for every datum, so the lower bound |d(r, q) − d(r, x1)| is near 0 and the reference point prunes nothing.
Related work • A Survey on High Dimensional Spaces and Indexing, COMP 530: Database Architecture and Implementation (The Hong Kong University of Science and Technology), by Wu Hai Liang, Lam Man Lung, Lo Ming Fun, Yuen Chi Kei, Ng Chun Bong.
Index method: Distance-based • 10,000 uniform data.
VA-file [13]: Main idea – Bit Vector • For every partitioning and clustering method there is a dimensionality d such that, on average, all blocks are accessed if the number of dimensions exceeds d • In high dimensions a linear scan is therefore competitive; the VA-file makes the scan cheap by scanning compact bit-vector approximations instead of the full vectors.
VA-file [13]: Bounds • The approximations are scanned linearly; distance bounds are computed from them without disk access • Example: for a query at (0.7, 0.3), a lower bound of 0.45 to a datum's cell is obtained from the bit vector alone.
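A minimal sketch of the VA-file bound computation, assuming data normalized to [0, 1) and illustrative helper names and cell resolution:

```python
import math

def quantize(p, bits=2):
    """Approximate a point by its grid cell: a few bits per dimension."""
    cells = 2 ** bits                  # e.g. 4 intervals per dimension
    return tuple(min(int(v * cells), cells - 1) for v in p)

def cell_lower_bound(q, approx, bits=2):
    """Smallest possible distance from q to any point inside the cell."""
    cells = 2 ** bits
    s = 0.0
    for qi, c in zip(q, approx):
        lo, hi = c / cells, (c + 1) / cells
        if qi < lo:
            s += (lo - qi) ** 2
        elif qi > hi:
            s += (qi - hi) ** 2
    return math.sqrt(s)

# Usage: scan all approximations, keep only data whose lower bound beats the
# current best, and fetch just those full vectors from disk.
```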
Our pruning method • Step 1: find an approximate nearest neighbor A of the query Q • Step 2: dimension pruning • The underlying task: find the 1-nearest neighbor of Q from the training data whose classes are different from that of A.
Dimension pruning: The number of the computed dimensions • Given a query Q and its approximate nearest neighbor A • When Σᵢ₌₁ᵏ (qᵢ − pᵢ)² ≥ DIS(Q, A)², where 1 ≤ k ≤ n, P cannot become the nearest datum of Q, since DIS(Q, A) is already no larger than DIS(Q, P) • The smallest such k is called the number of the computed dimensions (NCD) of P • Example: Q(1, 1, 1, 1, 1, 1, 1), A(2, 2, 2, 2, 2, 2, 2), DIS(Q, A) = √7, P(1, 3, 2, 1, 3, 5, 1): the partial sums are 0, 4, 5, 5, 9, …, first reaching 7 at k = 5, so the NCD of P is 5.
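A minimal sketch of this partial-sum computation, reproducing the slide's example:

```python
def ncd(Q, P, threshold_sq):
    """Number of computed dimensions before P is pruned against threshold_sq."""
    s = 0.0
    for k, (qi, pi) in enumerate(zip(Q, P), start=1):
        s += (qi - pi) ** 2
        if s >= threshold_sq:      # P can no longer beat A: prune here
            return k
    return len(Q)                  # fully computed: P is a true candidate

Q = (1, 1, 1, 1, 1, 1, 1)
A = (2, 2, 2, 2, 2, 2, 2)
P = (1, 3, 2, 1, 3, 5, 1)
print(ncd(Q, P, threshold_sq=7))   # DIS(Q, A)^2 = 7 -> prints 5
```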
Dimension pruning: ANCD • Given a data set X, the Average Number of the Computed Dimensions (ANCD) of X is the sum of the NCDs of the data in X divided by the number of data in X • Example: Q(1, 1, 1, 1, 1, 1, 1), A(2, 2, 2, 2, 2, 2, 2), DIS(Q, A) = √7 • P1(1, 3, 2, 1, 3, 5, 1): NCD 5; P2(5, 1, 4, 2, 7, 1, 2): NCD 1; P3(1, 3, 1, 3, 7, 1, 1): NCD 4; P4(3, 3, 1, 8, 1, 2, 5): NCD 2 • ANCD of X = (5 + 1 + 4 + 2)/4 = 3.
Dimension pruning • Property 4.3 • In an n-dimensional space, given a query Q and its approximate nearest neighbor A, suppose X and Y are two node sets: X = {x | x is a node and DIS(Q, x) = d}, Y = {y | y is a node and DIS(Q, y) = r·d} • The data of X are uniformly distributed on the surface of a hyper-sphere with center Q and radius d; the data of Y are uniformly distributed on the surface of a hyper-sphere with center Q and radius r·d • If the ANCD of X is m, with m < n, then the ANCD of Y is m/r².
Dimension pruning • Dimension = 1000, ANCD = 999.9 at the base radius d • On spheres of growing radius r·d the ANCD falls as m/r²: 250, 111.11, 62.5, 40, 27.78, 20.4 (matching r = 2, 3, 4, 5, 6, 7).
Dimension pruning: Average value of each dimension • Dimension = 6, d = 10: the squared distance d² = 100 is spread over the dimensions, so the average value per dimension is d²/n = 100/6 = 16.67 • Example decompositions of 100 over six dimensions: 30 + 30 + 10 + 10 + 10 + 10 and 30 + 20 + 20 + 15 + 10 + 5.
Dimension pruning: Concept • Dimension = 100 • (Figure: pruning behaviour compared for d = 9, d = 10, and d = 20; the larger a datum's distance relative to d, the fewer dimensions are computed before it is pruned.)
Dimension pruning: Experiments • In a 100-dimensional space, we generate a query Q and its approximate nearest datum A with DIS(Q, A) = 90 • We produce the data on the surfaces of five hyper-spheres, all centered at Q, with radii ranging from 100 to 500 • For each hyper-sphere, 100,000 uniform data are produced on its surface.
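A hedged sketch of this experiment (far fewer points than the slide's 100,000, to keep the run fast): generate uniform points on each hyper-sphere surface and measure the average number of dimensions computed before pruning.

```python
import math
import random

def random_on_sphere(center, radius, dim):
    """Uniform point on the surface of a hyper-sphere around center."""
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [c + radius * x / norm for c, x in zip(center, v)]

dim, d = 100, 90.0                       # DIS(Q, A) = 90, as on the slide
Q = [0.0] * dim
threshold_sq = d * d
for radius in (100, 200, 300, 400, 500):
    total = 0
    for _ in range(1000):
        P = random_on_sphere(Q, radius, dim)
        s, k = 0.0, dim
        for i in range(dim):
            s += (Q[i] - P[i]) ** 2
            if s >= threshold_sq:
                k = i + 1
                break
        total += k
    print(radius, total / 1000)          # ANCD shrinks roughly as 1/radius^2
```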
Dimension pruning: Variance • Variance affects dimension pruning • We can analyze the variance of each dimension and change the computation order of the Euclidean distance, as sketched below.
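A minimal sketch of this idea, assuming the natural reading that high-variance dimensions are accumulated first so the partial sum crosses the pruning threshold sooner on average:

```python
def variance_order(data):
    """Indices of dimensions sorted by variance over the data, highest first."""
    n, dim = len(data), len(data[0])
    means = [sum(p[i] for p in data) / n for i in range(dim)]
    var = [sum((p[i] - means[i]) ** 2 for p in data) / n for i in range(dim)]
    return sorted(range(dim), key=lambda i: -var[i])

def ncd_ordered(Q, P, threshold_sq, order):
    """NCD when dimensions are visited in the given order."""
    s = 0.0
    for k, i in enumerate(order, start=1):
        s += (Q[i] - P[i]) ** 2
        if s >= threshold_sq:
            return k
    return len(order)
```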
1-NN query: Approximate 1-NN, random selection • Example: from N = 6 ranked data, k = 3 are selected at random • Expected minimum rank = (1×10 + 2×6 + 3×3 + 4×1)/20 = 1.75.
1-NN query: Approximate 1-NN • In general, if k of N data are selected, the expected minimum rank is (N + 1)/(k + 1); e.g. (1×10 + 2×6 + 3×3 + 4×1)/20 = 1.75 = (6 + 1)/(3 + 1).
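A quick check of this formula by brute force: enumerate all C(6, 3) = 20 ways of sampling k = 3 of N = 6 ranked data and average the best (minimum) rank in each sample.

```python
from itertools import combinations

N, k = 6, 3
samples = list(combinations(range(1, N + 1), k))  # all 20 possible selections
avg = sum(min(s) for s in samples) / len(samples)
print(avg, (N + 1) / (k + 1))                     # both print 1.75
```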
1-NN query: Approximate 1-NN, greedy selection • (Figure: a reference point R and data A, B, C around the query Q; candidates are selected greedily using their distances to R.)
Approximate 1-NN: Experiment • The number of dimensions of USPS is 256, of MNIST 784, and of LETTER 16; USPS provides 7,291 training data • Based on Formula (16), the average minimum rank of the random selection method on 7,291 data with 200 selected data is (7291 + 1)/(200 + 1) ≈ 36.27, versus the measured 11.47.