KERNEL INDEPENDENT COMPONENT ANALYSIS BY FRANCIS BACH & MICHAEL JORDAN. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003. Presented by Nagesh Adluru
Goal of the Paper To perform Independent Component Analysis (ICA) in a novel way that is more accurate and more robust than existing techniques.
Concepts Involved • ICA – Independent Component Analysis • Mutual Information • F – Correlation • RKHS – Reproducing Kernel Hilbert Spaces • CCA – Canonical Correlation Analysis • KICA – Kernel ICA • KGV – Kernel Generalized Variance
ICA – Independent Component Analysis • ICA is unsupervised learning. • We have to estimate x given a set of observations of y, under the assumption that the components of x are independent. • So we have to estimate a de-mixing matrix W such that x = Wy.
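A minimal toy sketch of this mixing model (not the paper's algorithm; the source distributions and the mixing matrix A are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
# two independent, non-Gaussian sources: the unobserved x
x = np.vstack([rng.laplace(size=N), rng.uniform(-1, 1, size=N)])
# an arbitrary (and, in practice, unknown) mixing matrix A gives the observations y
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
y = A @ x
# ICA seeks a de-mixing matrix W with x ~ W y; here W = A^{-1}, but in general
# the sources are only recovered up to permutation and scaling
W = np.linalg.inv(A)
print(np.allclose(W @ y, x))   # True
```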
ICA – Independent Component Analysis • ICA is semi-parametric. • Because we do not know anything about the distribution of x, that part of the problem is non-parametric. • But we do know that y is a 'linear combination' of the components of x, which is the parametric part. • So the problem is semi-parametric, and kernels do well in such situations.
ICA – Independent Component Analysis • If we knew the distribution of x, we could work directly in 'x-space' and find W using a gradient or fixed-point algorithm. • But in practice we do not, so how do we proceed? • Since we are looking for independent components, we need to maximize independence, i.e., minimize mutual information.
Mutual Information • Mutual information is an abstract quantity used to describe dependence among variables. • Mutual information is smallest when the dependence is weakest; it is zero exactly when the variables are independent. • So it looks promising to explore. • Prior work has focused on approximations to this quantity, because it is hard to estimate for real-valued variables from finite samples. • Kernels offer better ways.
F – Correlation • The F-correlation is defined as ρF = max over f1, f2 in F of corr(f1(x1), f2(x2)), the maximal correlation over functions f1, f2 in a function class F. • If x1 and x2 are independent, then ρF is zero; but the converse is what matters here.
F – Correlation • Converse: if ρF is zero, then x1 and x2 are independent. • Is that true? • It holds only if the function class F is large enough. • But it also holds when F is restricted to the reproducing kernel Hilbert space (RKHS) of a Gaussian kernel.
F – Correlation • Since the converse holds even when F is restricted to an RKHS, a mutual-information-like contrast can be defined from ρF that is 0 if and only if the two variables are independent.
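For two variables, this construction can be summarized by the following pair of formulas (a reconstruction from the two-variable, Gaussian-kernel setting; the symbol I_ρF is used here only as a name for the contrast):

```latex
\rho_{\mathcal{F}} \;=\; \max_{f_1, f_2 \in \mathcal{F}}
  \operatorname{corr}\bigl(f_1(x_1),\, f_2(x_2)\bigr),
\qquad
I_{\rho_{\mathcal{F}}}(x_1, x_2) \;=\; -\tfrac{1}{2}\,\log\bigl(1 - \rho_{\mathcal{F}}^{2}\bigr).
```

The second expression mirrors the mutual information of a bivariate Gaussian with correlation ρ, and it is zero exactly when ρF = 0, i.e., when x1 and x2 are independent (given a rich enough F such as a Gaussian-kernel RKHS).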
RKHS – Reproducing Kernel Hilbert Spaces • Operations using kernels can be treated as operations in a Hilbert space of functions. • The reproducing property lets us carry out these Hilbert-space operations with ordinary computations on kernel evaluations in Euclidean space. • So the correlation between the functions f can be interpreted as a correlation between the feature maps Φ, which is exactly a canonical correlation between the Φs.
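A tiny illustration of the kernel trick behind this: a degree-2 polynomial kernel is used because its feature map is finite-dimensional (the Gaussian kernel used in the paper has an infinite-dimensional one):

```python
import numpy as np

# kernel trick: k(u, v) = (u . v)^2 equals an inner product of explicit features
def phi(u):
    """Explicit feature map of the degree-2 polynomial kernel for 2-D inputs."""
    return np.array([u[0] ** 2, u[1] ** 2, np.sqrt(2) * u[0] * u[1]])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((u @ v) ** 2, phi(u) @ phi(v))   # both print 1.0
```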
CCA – Canonical Correlation Analysis • CCA vs. PCA: • PCA maximizes the variance of the projection of a single random vector. • CCA maximizes the correlation between projections of two (or more) random vectors, based on the covariance blocks Cij = cov(xi, xj).
CCA – Canonical Correlation Analysis • While PCA leads to an eigenvector problem, CCA leads to a generalized eigenvector problem. (Eigenvector problem: Av = λv; generalized eigenvector problem: Av = λBv.) • CCA can easily be kernelized and also generalized to more than two random vectors. • So the maximal correlation between variables can be found efficiently, which is very nice.
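A sketch of CCA posed as a generalized eigenvalue problem and solved with SciPy; the function name, the small ridge term, and the toy data are illustrative assumptions, not part of the paper:

```python
import numpy as np
from scipy.linalg import eigh

def first_canonical_correlation(X1, X2, ridge=1e-3):
    """CCA between two data matrices (N x d1, N x d2) posed as a
    generalized eigenvalue problem A w = rho B w, solved with scipy."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    d1, d2 = X1.shape[1], X2.shape[1]
    C = np.cov(np.hstack([X1, X2]).T)           # joint covariance matrix
    C11, C12 = C[:d1, :d1], C[:d1, d1:]
    C21, C22 = C[d1:, :d1], C[d1:, d1:]
    # A holds the cross-covariances, B the (regularized) within-set covariances
    A = np.block([[np.zeros((d1, d1)), C12],
                  [C21, np.zeros((d2, d2))]])
    B = np.block([[C11 + ridge * np.eye(d1), np.zeros((d1, d2))],
                  [np.zeros((d2, d1)), C22 + ridge * np.eye(d2)]])
    evals = eigh(A, B, eigvals_only=True)       # generalized eigenvalues, ascending
    return evals[-1]                            # largest = first canonical correlation

# toy usage: two 2-D variables sharing a common latent factor
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X1 = np.hstack([z, rng.normal(size=(500, 1))])
X2 = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
print(first_canonical_correlation(X1, X2))
```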
CCA – Canonical Correlation Analysis • Although this kernelization of CCA helps us, the generalization to more than two variables is not a precise mutual-independence measure in terms of the F-correlation. • But that is not a limitation in practice, both because of empirical results and because mutual independence can be approached through pairwise dependence.
Kernel ICA • We saw that the F-correlation ρF yields a contrast function for independence. • We also saw that ρF can be computed using kernelized CCA. • So we now have Kernel ICA, not in the sense that a basic ICA algorithm is kernelized, but because the contrast is computed using kernelized CCA.
KICA – Kernel ICA Algorithm • Input: the observations y and an initial estimate of the de-mixing matrix W. • Procedure: • Estimate the sources as x = Wy. • Form the N×N Gram matrices K1, …, Km, one for each component of the estimated source vector, and minimize the contrast C(W) computed from them. (This is equivalent to generalized CCA, where each of the m vectors is a single-element vector.)
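A minimal sketch of the two-component kernelized-CCA contrast; the Gaussian kernel width sigma, the regularization constant kappa, and the specific regularized generalized eigenproblem below are illustrative assumptions, not the paper's full m-component implementation:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def gram_gaussian(x, sigma=1.0):
    """Centered Gaussian Gram matrix of a 1-D sample of length N."""
    x = x.reshape(-1, 1)
    K = np.exp(-cdist(x, x, "sqeuclidean") / (2 * sigma ** 2))
    N = len(x)
    H = np.eye(N) - np.ones((N, N)) / N         # centering matrix
    return H @ K @ H

def kcca_contrast(x1, x2, sigma=1.0, kappa=1e-2):
    """First kernel canonical correlation between two 1-D samples and the
    associated contrast -1/2 * log(1 - rho^2)."""
    K1, K2 = gram_gaussian(x1, sigma), gram_gaussian(x2, sigma)
    N = len(x1)
    R1 = K1 + (N * kappa / 2) * np.eye(N)       # regularized Gram matrices
    R2 = K2 + (N * kappa / 2) * np.eye(N)
    A = np.block([[np.zeros((N, N)), K1 @ K2],
                  [K2 @ K1, np.zeros((N, N))]])
    B = np.block([[R1 @ R1, np.zeros((N, N))],
                  [np.zeros((N, N)), R2 @ R2]])
    rho = eigh(A, B, eigvals_only=True)[-1]     # largest generalized eigenvalue
    rho = min(rho, 1 - 1e-12)                   # numerical safeguard
    return -0.5 * np.log(1 - rho ** 2)
```

For m > 2 components the same idea extends to an m × m block matrix of Gram-matrix products, which is also the structure the KGV slide below builds on.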
KICA – Kernel ICA • The computational complexity of finding the 'smallest' generalized eigenvalue of matrices of size mN × mN is O(N^3). (Note: the eigenvalues are not directly related to the entries of W.) • But this can be reduced to O(M^2 N), with M a constant < N, by exploiting special properties of the Gram matrix spectrum (its rapidly decaying range of eigenvalues).
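One common way to exploit that decaying spectrum is a pivoted (incomplete) Cholesky factorization of each Gram matrix; the sketch below is a generic implementation under that assumption, with tol and max_rank as illustrative parameters rather than values from the paper:

```python
import numpy as np

def incomplete_cholesky(K, tol=1e-6, max_rank=None):
    """Pivoted incomplete Cholesky of a PSD Gram matrix K (N x N).
    Returns G of shape (N, M) with K ~= G @ G.T and, ideally, M << N."""
    N = K.shape[0]
    max_rank = N if max_rank is None else max_rank
    d = np.diag(K).astype(float)           # residual diagonal, kept in pivoted order
    perm = np.arange(N)                    # perm[r] = original index of pivoted row r
    G = np.zeros((N, max_rank))
    M = max_rank
    for m in range(max_rank):
        i = m + int(np.argmax(d[m:]))      # pivot: largest remaining residual
        perm[[m, i]] = perm[[i, m]]
        d[[m, i]] = d[[i, m]]
        G[[m, i], :m] = G[[i, m], :m]
        if d[m] <= tol:                    # remaining energy is negligible: stop
            M = m
            break
        G[m, m] = np.sqrt(d[m])
        # fill column m for the rows below the pivot
        k_col = K[perm[m + 1:], perm[m]]
        G[m + 1:, m] = (k_col - G[m + 1:, :m] @ G[m, :m]) / G[m, m]
        d[m + 1:] -= G[m + 1:, m] ** 2
    G = G[:, :M]
    G_orig = np.empty_like(G)
    G_orig[perm] = G                       # undo the pivoting so rows match K
    return G_orig
```

Replacing each N × N Gram matrix K by G Gᵀ with G of size N × M is what brings the eigenvalue computations down from O(N^3) toward O(M^2 N).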
KICA – Kernel ICA • The next crucial job is to find the W that minimizes C(W); that W is the de-mixing matrix. • Preferably the data are first whitened (via PCA) and W is restricted to be orthogonal, since whitened independent sources are uncorrelated and nothing is lost by the restriction. • The search for W in this restricted space (a Stiefel manifold) can be carried out with respect to a Riemannian metric, suggesting gradient-type algorithms.
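A sketch of one gradient step on the orthogonal (Stiefel) manifold via tangent-space projection and a QR retraction; the step size and the sign-fix convention are illustrative choices, and the gradient of C(W) is assumed to be supplied by the caller:

```python
import numpy as np

def stiefel_gradient_step(W, euclid_grad, step=0.1):
    """One descent step for an orthogonal de-mixing matrix W.
    Projects the Euclidean gradient onto the tangent space of the
    orthogonal/Stiefel manifold at W, then retracts back via QR."""
    # Riemannian gradient: W times the skew-symmetric part of W^T grad
    A = W.T @ euclid_grad
    riem_grad = W @ (A - A.T) / 2.0
    W_new = W - step * riem_grad
    # retraction: project back onto the orthogonal group with a QR factorization
    Q, R = np.linalg.qr(W_new)
    Q = Q * np.sign(np.diag(R))   # sign fix; assumes R has nonzero diagonal
    return Q
```

In the ICA setting, W is updated repeatedly this way, with the gradient of the KCCA or KGV contrast C(W) supplied at each step.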
KICA – Kernel ICA • The problem of local minima can be mitigated by using heuristics (instead of random initialization) to select the initial W. • It has also been shown empirically that a modest number of restarts resolves this problem when a large number of samples is available.
KGV – Kernel Generalized Variance • The F-correlation uses only the 'smallest' generalized eigenvalue of the KCCA problem. • The idea behind KGV is to make use of the other eigenvalues as well. • The resulting mutual-information contrast function is C(W) = -1/2 log(det Kκ / det Dκ) = -1/2 Σi log λi, where Kκ is the block matrix built from the (regularized) Gram matrices, Dκ is its block-diagonal part, and the λi are the generalized eigenvalues of the pair (Kκ, Dκ).
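A sketch of the KGV contrast computed from centered Gram matrices, following the regularized block-matrix construction described above; kappa and the toy Gram matrices are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import block_diag, eigvalsh

def kgv_contrast(grams, kappa=1e-2):
    """KGV contrast from a list of centered N x N Gram matrices (one per
    estimated component): -1/2 * log(det K_kappa / det D_kappa)."""
    N = grams[0].shape[0]
    m = len(grams)
    reg = [K + (N * kappa / 2) * np.eye(N) for K in grams]   # regularized Grams
    # block matrix: (K_i + c I)^2 on the diagonal, K_i K_j off the diagonal
    blocks = [[reg[i] @ reg[i] if i == j else grams[i] @ grams[j]
               for j in range(m)] for i in range(m)]
    K_kappa = np.block(blocks)
    D_kappa = block_diag(*[R @ R for R in reg])
    lam = eigvalsh(K_kappa, D_kappa)            # generalized eigenvalues
    return -0.5 * np.sum(np.log(np.clip(lam, 1e-12, None)))

# toy usage with two random centered Gram matrices (stand-ins for kernel Grams)
rng = np.random.default_rng(0)
def toy_centered_gram(n=40):
    X = rng.normal(size=(n, 5))
    H = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    return H @ (X @ X.T) @ H
print(kgv_contrast([toy_centered_gram(), toy_centered_gram()]))
```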
Simulation Results • The results on simulated data showed that KICA compares favorably with other ICA algorithms such as FastICA, Jade, and Imax, especially as the number of components grows. • The simulated data were mixtures of a variety of source distributions: sub-Gaussian, super-Gaussian, and nearly Gaussian. • KICA is also robust to outliers.
Conclusions • This paper proposed novel kernel-based measures of independence. • The approach is flexible but computationally demanding (because of the additional eigenvalue computations in the search).