
The Curse of Dimensionality



Presentation Transcript


  1. The Curse of Dimensionality Atul Santosh Tirkey Y7102

  2. Curse of Dimensionality • A term coined by Richard E. Bellman. • It refers to the problems caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. • The basic problems associated with an increase in dimensionality are: • There aren't enough observations to make good estimates. • Adding more features can increase the noise, and hence the error.

  3. Examples • To sample a unit interval with points spaced 0.01 apart, 100 evenly spaced points suffice. • An equivalent sampling of the 10-dimensional unit hypercube with a lattice spacing of 0.01 between adjacent points would require 10^20 sample points. • Comparison of the volume of the hypercube of side 2r and the inscribed sphere of radius r in d dimensions: • Volume of the sphere is V_sphere = π^(d/2) r^d / Γ(d/2 + 1). • Volume of the cube is V_cube = (2r)^d. • As d grows, the ratio V_sphere / V_cube goes to zero, so almost all of the cube's volume lies in its corners.
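A minimal sketch (not part of the original slides) of the volume comparison above: it prints the ratio of the volume of the inscribed sphere of radius r to that of the hypercube of side 2r as the dimension grows, showing the ratio collapsing toward zero.

```python
import math

def sphere_volume(r, d):
    """Volume of a d-dimensional ball of radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def cube_volume(r, d):
    """Volume of a d-dimensional hypercube of side 2r."""
    return (2 * r) ** d

for d in (1, 2, 3, 5, 10, 20):
    ratio = sphere_volume(1.0, d) / cube_volume(1.0, d)
    print(f"d={d:2d}  sphere/cube volume ratio = {ratio:.2e}")
```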

  4. Dimensionality Reduction

  5. Principal Component Analysis, invented by Karl Pearson in 1901 • The idea is to project the data onto the subspace which accounts for most of the variance. • Data is projected onto the eigenvectors of the covariance matrix associated with the largest eigenvalues. • Steps involved: • Calculate the covariance matrix • Calculate the eigenvectors and eigenvalues of the covariance matrix • Choose the components and form the feature vector • Derive the new data set
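An illustrative sketch of the PCA steps listed above, using NumPy only; the function and variable names are illustrative and not from the slides.

```python
import numpy as np

def pca(data, n_components):
    """Project `data` (n_samples x n_features) onto its top principal components."""
    # 1. Centre the data and compute the covariance matrix.
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # 2. Eigen-decompose the (symmetric) covariance matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 3. Choose the components with the largest eigenvalues (the feature vector).
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, order]
    # 4. Derive the new data set by projecting onto those components.
    return centered @ components

# Example: reduce 5-dimensional data to 2 principal components.
X = np.random.randn(200, 5)
X_reduced = pca(X, n_components=2)
print(X_reduced.shape)  # (200, 2)
```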

  6. Principal component analysis. Figures: the original data set, and the projection of the data along one eigenvector.

  7. Fisher Linear Discriminant • Needed because the directions of maximum variance may be useless for classification.

  8. Fisher Linear Discriminant • Main idea: find a projection onto a line such that samples from different classes are well separated.

  9. Fisher Linear Discriminant – Methodology • Let µ1 and µ2 be the means of the projections of classes 1 and 2 onto a particular line. • If Z1, Z2, …, Zn are the samples and yi = v^T Zi are their projections, the sample mean of projected class c is µc = (1/nc) Σ_{yi in class c} yi. • The scatter of projected class c is defined as sc^2 = Σ_{yi in class c} (yi − µc)^2. • The target is to find the v which makes J(v) = (µ1 − µ2)^2 / (s1^2 + s2^2) large, to guarantee that the classes are well separated.

  10. After solving, the direction that maximizes J(v) is v = S_W^{-1} (m1 − m2), where m1 and m2 are the class means in the original space and S_W is the within-class scatter matrix.
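A small sketch of this solution (names are illustrative, not from the slides): the Fisher direction is the inverse within-class scatter matrix applied to the difference of the class means.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher linear discriminant direction for two classes (n_i x d arrays)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix S_W = S_1 + S_2.
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    # The v maximizing J(v) is S_W^{-1} (mu1 - mu2).
    v = np.linalg.solve(Sw, mu1 - mu2)
    return v / np.linalg.norm(v)

# Example: two Gaussian classes in 3-D with different means.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0, 0], size=(100, 3))
X2 = rng.normal(loc=[2, 1, 0], size=(100, 3))
print(fisher_direction(X1, X2))
```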

  11. Taking on the Curse of Dimensionality in Joint Distributions Using Neural Networks Samy Bengio and Yoshua Bengio, IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 3, MAY 2000 • In this paper they propose a new architecture for modeling high-dimensional data that requires comparatively fewer parameters, using a multilayer neural network to represent the joint distribution of the variables as a product of conditional distributions.

  12. Proposed Architecture • One can see the network as a kind of autoencoder in which each variable is predicted based on the previous variables. • The neural network represents the parameterized function log P(Z1 = z1, …, Zn = zn) = Σ_{i=1..n} log P(Zi = zi | Z1 = z1, …, Zi-1 = zi-1), i.e. the log-probability is computed as the sum of conditional log-probabilities. • Here gi(z1, …, zi-1) is the vector-valued output of the ith group of output units, and it gives the value of the parameters of the distribution of Zi when Z1 = z1, Z2 = z2, …, Zi-1 = zi-1.

  13. Proposed Architecture

  14. Proposed Architecture • In the discrete case, we have P(Zi = i' | Z1 = z1, …, Zi-1 = zi-1) = g_{i,i'}(z1, …, zi-1), where g_{i,i'} is the i'-th output element of the vector gi. • A softmax output for the ith group may be used to force these parameters to be positive and sum to one, i.e. g_{i,i'} = exp(a_{i,i'}) / Σ_{i''} exp(a_{i,i''}), where the a_{i,i'} are the pre-softmax activations of that output group.
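A tiny sketch of the softmax normalization mentioned above (names are illustrative): it maps arbitrary activations for one output group to outputs that are positive and sum to one.

```python
import numpy as np

def softmax_group(a):
    """Softmax over one output group: exp(a) / sum(exp(a)), computed stably."""
    e = np.exp(a - a.max())
    return e / e.sum()

g = softmax_group(np.array([1.5, -0.3, 0.8]))
print(g, g.sum())  # all entries positive, sum is 1.0
```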

  15. Proposed Architecture • Hidden unit activations may be computed as h_{j,j'} = tanh(c_{j,j'} + Σ_{k,k'} v_{j,j',k,k'} z'_{k,k'}), where the c's are biases, the v_{j,j',k,k'}'s are the weights of the hidden layer (from input unit (k,k') to hidden unit (j,j')), and z'_{k,k'} is the k'-th element of the vectorial input representation for the value Zk = zk.
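A much-simplified sketch of the kind of architecture the slides describe, written from the slide text alone: the layer sizes, the per-conditional input masking, and all names are assumptions, not the authors' code. Each discrete variable Zi gets a softmax output group that may only depend on the one-hot encodings of Z1, …, Zi-1, and the joint log-probability is the sum of the conditional log-probabilities.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def joint_log_prob(z, W_in, W_out, c, b, n_values):
    """log P(z) = sum_i log P(z_i | z_1..z_{i-1}) for one discrete sample z."""
    n = len(z)
    one_hot = np.zeros((n, n_values))
    one_hot[np.arange(n), z] = 1.0
    x = one_hot.reshape(-1)                    # concatenated input representation
    log_p = 0.0
    for i in range(n):
        # Mask the input so the i-th conditional sees only variables before i.
        mask = np.zeros_like(x)
        mask[: i * n_values] = 1.0
        h = np.tanh(c + W_in @ (x * mask))     # hidden activations (biases c)
        g_i = softmax(b[i] + W_out[i] @ h)     # parameters of P(Z_i | z_<i)
        log_p += np.log(g_i[z[i]])
    return log_p

# Toy usage: 4 binary variables, 8 hidden units.
rng = np.random.default_rng(0)
n, n_values, n_hidden = 4, 2, 8
W_in = rng.normal(scale=0.1, size=(n_hidden, n * n_values))
W_out = rng.normal(scale=0.1, size=(n, n_values, n_hidden))
c = np.zeros(n_hidden)
b = np.zeros((n, n_values))
print(joint_log_prob(np.array([1, 0, 1, 1]), W_in, W_out, c, b, n_values))
```

In the paper the conditional structure is enforced through the network's connectivity rather than by masking a shared input at each step; the masking here is only a compact way to get the same dependency pattern in a short sketch.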

  16. Proposed Architecture • To optimize the parameters they simply used gradient-based optimization methods, with conjugate or stochastic (on-line) gradient, to maximize a MAP (maximum a posteriori) criterion, that is, the sum of a log-prior and the total log-likelihood. • They used a "weight decay" log-prior, which gives a quadratic penalty to the parameters θ, of the form −Σ_i γ_i θ_i^2, where the inverse variances γ_i are chosen proportional to the number of weights incoming into a neuron.
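A brief sketch of this MAP criterion under the same caveats (hypothetical names; log_likelihood_fn stands for something like joint_log_prob from the previous sketch): the objective is the total log-likelihood minus the quadratic weight-decay penalty.

```python
import numpy as np

def map_objective(data, param_groups, gammas, log_likelihood_fn):
    """Return sum_z log P(z | params) - sum_i gamma_i * ||theta_i||^2."""
    total_ll = sum(log_likelihood_fn(z, param_groups) for z in data)
    penalty = sum(g * np.sum(theta ** 2) for g, theta in zip(gammas, param_groups))
    return total_ll - penalty
```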

  17. Questions • What is PCA? • What is the Fisher linear discriminant? • How is FLD better than PCA for classification? • How are weights calculated in the newly proposed architecture?
