The Curse of Dimensionality Atul Santosh Tirkey Y7102
Curse of Dimensionality • A term coined by Richard E. Bellman. • It refers to the problems caused by the exponential increase in volume that comes with adding an extra dimension to a mathematical space. • The basic problems associated with an increase in dimensionality are: • There aren't enough observations to make good estimates. • Adding more features can increase the noise, and hence the error.
Examples • To sample a unit interval with a spacing of 0.01 between points, 100 evenly spaced points suffice. • But an equivalent sampling of the 10-dimensional unit hypercube with a lattice spacing of 0.01 between adjacent points requires $10^{20}$ sample points. • Comparison of the volume of the hypercube of side $2r$ and the sphere of radius $r$ in $d$ dimensions: the volume of the sphere is $V_{\text{sphere}}(d, r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$, while the volume of the cube is $V_{\text{cube}}(d, r) = (2r)^d$, so the sphere occupies a vanishing fraction of the cube as $d$ grows (see the sketch below).
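To make the volume comparison concrete, here is a minimal Python sketch (illustrative, not part of the slides) that computes the ratio of the volume of the $d$-dimensional ball of radius $r$ to that of its enclosing hypercube of side $2r$; the ratio collapses toward zero as $d$ grows.

```python
# Ratio of hypersphere volume to enclosing hypercube volume as d grows.
import math

def sphere_volume(d, r=1.0):
    # Volume of a d-dimensional ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d

def cube_volume(d, r=1.0):
    # Volume of the enclosing hypercube of side 2r
    return (2 * r) ** d

for d in (1, 2, 3, 5, 10, 20):
    ratio = sphere_volume(d) / cube_volume(d)
    print(f"d={d:2d}  sphere/cube volume ratio = {ratio:.6f}")
```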
Principal Component Analysis, invented by Karl Pearson in 1901 • The idea is to project the data onto the subspace that accounts for most of the variance. • The data is projected onto the eigenvectors of the covariance matrix associated with the largest eigenvalues. • Steps involved (see the sketch below): • Calculate the covariance matrix • Calculate the eigenvectors and eigenvalues of the covariance matrix • Choose the components and form the feature vector • Derive the new data set
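The four PCA steps listed above can be sketched in a few lines of NumPy; the toy data set, the choice of two retained components, and the variable names are illustrative assumptions.

```python
# A minimal sketch of the PCA steps above, on toy data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, 5 features (toy data)

# 1. Covariance matrix of the mean-centred data
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)

# 2. Eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Choose the components with the largest eigenvalues (here: the top 2)
order = np.argsort(eigvals)[::-1]
feature_vector = eigvecs[:, order[:2]]

# 4. Derive the new (projected) data set
X_reduced = X_centred @ feature_vector
print(X_reduced.shape)                 # (100, 2)
```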
Principal component analysis [figures: original data set; projection of the data along one eigenvector]
Fisher Linear Discriminant • Needed because the directions of maximum variance may be useless for classification.
Fisher Linear Discriminant • Main idea: find a projection onto a line such that samples from different classes are well separated.
Fisher Linear Discriminant – Methodology • Let $\tilde{\mu}_1$ and $\tilde{\mu}_2$ be the projected means of classes 1 and 2 on a particular line. • If $Z_1, Z_2, \ldots, Z_n$ are the samples of a class and $y_i = v^t Z_i$ their projections, the projected sample mean is $\tilde{\mu} = \frac{1}{n} \sum_{i=1}^{n} y_i$ and the scatter is defined as $\tilde{s}^2 = \sum_{i=1}^{n} (y_i - \tilde{\mu})^2$. • The target is to find the $v$ that makes $J(v) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$ large, to guarantee that the classes are well separated (see the sketch below).
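A minimal NumPy sketch of the Fisher criterion, assuming two toy Gaussian classes; the closed-form direction $v \propto S_W^{-1}(\mu_1 - \mu_2)$ used here is the standard maximizer of $J(v)$, not something stated explicitly on the slide.

```python
# Fisher linear discriminant on two toy Gaussian classes.
import numpy as np

rng = np.random.default_rng(1)
Z1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 1 samples
Z2 = rng.normal(loc=[3, 1], scale=1.0, size=(50, 2))   # class 2 samples

mu1, mu2 = Z1.mean(axis=0), Z2.mean(axis=0)

# Within-class scatter matrix S_W = S_1 + S_2
S1 = (Z1 - mu1).T @ (Z1 - mu1)
S2 = (Z2 - mu2).T @ (Z2 - mu2)
S_W = S1 + S2

# The v maximising J(v) is proportional to S_W^{-1} (mu1 - mu2)
v = np.linalg.solve(S_W, mu1 - mu2)
v /= np.linalg.norm(v)

# Projected means and scatters on the line defined by v
y1, y2 = Z1 @ v, Z2 @ v
J = (y1.mean() - y2.mean()) ** 2 / (y1.var() * len(y1) + y2.var() * len(y2))
print("Fisher criterion J(v) =", J)
```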
Taking on the curse of dimensionality in joint distributions using neural networks • Samy Bengio and Yoshua Bengio, IEEE Transactions on Neural Networks, Vol. 11, No. 3, May 2000 • In this paper they propose a new architecture for modeling high-dimensional data that requires comparatively fewer parameters, using a multilayer neural network to represent the joint distribution of the variables as a product of conditional distributions.
Proposed Architecture • One can see the network as a kind of autoencoder in which each variable is predicted from the previous variables. • The neural network represents the parameterized function $\log \hat{P}(Z_1 = z_1, \ldots, Z_n = z_n) = \sum_{i=1}^{n} \log \hat{P}(Z_i = z_i \mid Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1})$, i.e., the joint log probability is computed as the sum of the conditional log probabilities, where $g_i(z_1, \ldots, z_{i-1})$ is the vector-valued output of the $i$-th group of output units, and it gives the parameters of the distribution of $Z_i$ when $Z_1 = z_1, Z_2 = z_2, \ldots, Z_{i-1} = z_{i-1}$ (an illustrative sketch follows below).
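An illustrative sketch (not the authors' implementation) of the factorization above: the joint log probability of a discrete vector is accumulated as the sum of conditional log probabilities, with a stand-in function playing the role of the network output groups $g_i$.

```python
# Joint log probability as a sum of conditional log probabilities.
import numpy as np

def joint_log_prob(z, conditional_prob):
    """z: sequence of discrete values; conditional_prob(i, prefix) returns the
    probability vector over Z_i given Z_1..Z_{i-1} = prefix (the role of g_i)."""
    total = 0.0
    for i in range(len(z)):
        p_i = conditional_prob(i, z[:i])      # output group g_i of the network
        total += np.log(p_i[z[i]])            # log P(z_i | z_1, ..., z_{i-1})
    return total

# Toy stand-in for the network: a uniform conditional over 3 possible values
uniform = lambda i, prefix: np.full(3, 1.0 / 3.0)
print(joint_log_prob([0, 2, 1], uniform))     # equals 3 * log(1/3)
```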
Proposed Architecture • In the discrete case we have $\hat{P}(Z_i = i' \mid Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1}) = g_{i,i'}(z_1, \ldots, z_{i-1})$, where $g_{i,i'}$ is the $i'$-th output element of the vector $g_i$. • In this case a softmax output for the $i$-th group may be used to force these parameters to be positive and sum to one, i.e., $g_{i,i'} = \frac{e^{a_{i,i'}}}{\sum_{i''} e^{a_{i,i''}}}$, where the $a_{i,i'}$ are the pre-softmax activations of the $i$-th output group.
Proposed Architecture • The hidden unit activations may be computed as $h_{j,j'} = \tanh\!\big(c_{j,j'} + \sum_{k<j} \sum_{k'} v_{j,j',k,k'}\, z'_{k,k'}\big)$, where the $c$'s are biases, the $v_{j,j',k,k'}$'s are the weights of the hidden layer (from input unit $(k,k')$ to hidden unit $(j,j')$), and $z'_{k,k'}$ is the $k'$-th element of the vectorial input representation for the value $Z_k = z_k$.
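The following sketch puts these pieces together under stated assumptions: the toy layer sizes, the tanh hidden nonlinearity, the one-hot input encoding, and a single weight tensor shared across output groups are illustrative choices, not details taken from the paper. It only mimics the left-to-right connectivity (the units predicting $Z_i$ see only $z_1, \ldots, z_{i-1}$) and the per-group softmax outputs.

```python
# Hedged sketch of the conditional-distribution network (toy sizes, shared weights).
import numpy as np

n_vars, n_values, n_hidden = 4, 3, 8                             # assumed toy sizes

rng = np.random.default_rng(2)
V = rng.normal(scale=0.1, size=(n_hidden, n_vars, n_values))     # hidden-layer weights
c = np.zeros(n_hidden)                                           # hidden biases
W = rng.normal(scale=0.1, size=(n_vars, n_values, n_hidden))     # output-layer weights
b = np.zeros((n_vars, n_values))                                 # output biases

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def conditionals(z):
    """Return the softmax parameters of P(Z_i | Z_1..Z_{i-1}) for every i."""
    z_onehot = np.eye(n_values)[z]            # z'_{k,k'}: one-hot input encoding
    probs = []
    for i in range(n_vars):
        # Only the inputs z_1..z_{i-1} may feed the units that predict Z_i
        h = np.tanh(c + np.einsum('jkl,kl->j', V[:, :i, :], z_onehot[:i]))
        probs.append(softmax(b[i] + W[i] @ h))
    return probs

print(conditionals(np.array([0, 2, 1, 0]))[2])   # parameters of P(Z_3 | z_1, z_2)
```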
Proposed Architecture • To optimize the parameters they simply used gradient-based optimization methods, with conjugate or stochastic (on-line) gradient, to maximize a MAP (maximum a posteriori) criterion, that is, the sum of a log-prior and the total log-likelihood. • They used a "weight decay" log-prior, which gives a quadratic penalty $-\sum_i \gamma_i \theta_i^2$ to the parameters $\theta$; the inverse variances $\gamma_i$ are chosen proportional to the number of weights incoming into a neuron (a sketch of this criterion follows below).
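A minimal sketch of the MAP criterion described above: the total log-likelihood plus a weight-decay log-prior $-\sum_i \gamma_i \theta_i^2$, maximized by plain gradient ascent; the toy log-likelihood, learning rate, and $\gamma$ values are illustrative assumptions.

```python
# MAP criterion: log-likelihood plus a quadratic weight-decay log-prior.
import numpy as np

def map_objective(theta, log_likelihood, gamma):
    # Log-prior: quadratic (weight decay) penalty on the parameters
    return log_likelihood(theta) - np.sum(gamma * theta ** 2)

def gradient_step(theta, grad_log_likelihood, gamma, lr=0.01):
    # Gradient ascent on the MAP criterion
    grad = grad_log_likelihood(theta) - 2.0 * gamma * theta
    return theta + lr * grad

# Toy quadratic "log-likelihood" just to make the sketch runnable
ll = lambda th: -np.sum((th - 1.0) ** 2)
grad_ll = lambda th: -2.0 * (th - 1.0)

theta = np.zeros(3)
gamma = np.full(3, 0.1)          # inverse variances of the Gaussian prior
for _ in range(200):
    theta = gradient_step(theta, grad_ll, gamma)
print(theta, map_objective(theta, ll, gamma))
```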
Questions • What is PCA? • What is the Fisher linear discriminant? • How is FLD better than PCA for classification? • How are the weights calculated in the newly proposed architecture?