
Presentation Transcript


  1. Independent Component Analysis PART I CS679 Lecture Note by Gil-Jin Jang Computer Science Department KAIST

  2. REFERENCES • * A.J. Bell & T.J. Sejnowski. 1995. “An Information-Maximization Approach to Blind Separation and Blind Deconvolution,” Neural Computation 7: sections 1-4. • P. Comon. 1994. “Independent Component Analysis, a New Concept?,” Signal Processing 36: pp. 287-314. • K. Pope & R. Bogner. 1996. “Blind Signal Separation: Linear, Instantaneous Combinations,” Digital Signal Processing: pp. 5-16.

  3. ABSTRACT • New Self-Organizing Learning Algorithm • requires no knowledge of the input distributions • maximizes the information in the output of a neuron • the input is passed through non-linear units (sigmoid function, etc.) • Extra Properties of the Non-Linear Transfer Function • picks up higher-order moments of the input distributions • redundancy reduction between outputs • separates statistically independent components • a higher-order generalization of PCA • Simulations • blind separation & blind deconvolution • time-delayed source separation

  4. (sec.3) Background: Terminology • BSS (blind source separation) = redundancy reduction • linearly mixed, multiple signals: ICA = blind separation • whitening a single signal = blind deconvolution • nonlinearly mixed signals / general mixing structures for multiple signals: intractable so far • PCA = Karhunen-Loève transform (Duda & Hart)

  5. Blind Separation • the “cocktail-party” problem, with no delays • Problem: linear combination by a matrix A • Solution: a square matrix W = PDA⁻¹ • P: a permutation matrix, D: a diagonal matrix • found by minimizing the mutual information between output units • x = As, u = Wx = PDA⁻¹x = PDs • x: observed signals, s: source signals, u: estimated independent components
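
A minimal numerical sketch of this setup (not from the lecture; the mixing matrix, sources, and the particular P and D below are arbitrary choices): any W of the form PDA⁻¹ maps the mixtures back to the sources up to order and scale.

```python
# Sketch: two super-Gaussian sources mixed by a square matrix A; any
# W = P D A^{-1} (P a permutation, D diagonal) recovers them up to
# permutation and scaling. All concrete numbers here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))              # source signals s
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])                   # unknown mixing matrix
x = A @ s                                    # observed signals x = As

P = np.array([[0.0, 1.0],
              [1.0, 0.0]])                   # permutation matrix
D = np.diag([2.0, -1.0])                     # diagonal scaling (sign/amplitude)
W = P @ D @ np.linalg.inv(A)                 # one valid separating matrix
u = W @ x                                    # u = PDs: sources up to order/scale

assert np.allclose(u, P @ D @ s)             # check the identity numerically
```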

  6. Blind Deconvolution • Problem: a single signal, corrupted by an unknown filter • {a1, a2, …, aK}: a K-th-order causal filter • the “other sources” are time-delayed versions of the signal itself • Solution: a reverse filter {w1, w2, …, wL} • removing statistical dependencies across time • x(t): observed, corrupted signal • s(t): unknown source signal • u(t): recovered signal
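
In symbols (the indexing convention here is an assumption; the paper fixes the exact offsets), the corruption and the proposed inverse filter are:

```latex
x(t) = \sum_{k=1}^{K} a_k\, s(t-k+1),
\qquad
u(t) = \sum_{l=1}^{L} w_l\, x(t-l+1)
```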

  7. 1. Introduction • Information-Theoretic Unsupervised Learning Rules • applied to NNs with non-linear units: a single unit → an N→N mapping → a causal filter (blind deconvolution) → a time-delayed system • ‘flexible’ non-linearity: selection of the activation function

  8. (sec.4) Reducing Statistical Dependence via Information Maximization • goal: statistically independent outputs • using an information-theoretic approach • purpose: maximize the sum of the individual entropies and minimize the M.I. • practically, maximizing the joint output entropy H(y1, y2) tends to minimize the mutual information I(y1, y2) between the outputs • for super-Gaussian input signals (e.g. speech), maximizing the joint entropy in sigmoidal networks ≈ minimizing the M.I. between outputs (experimental results)
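
The identity behind that claim, for two output units y1 and y2, is the standard decomposition of the joint entropy:

```latex
H(y_1, y_2) \;=\; H(y_1) + H(y_2) - I(y_1, y_2)
```

so pushing up the joint entropy raises the individual output entropies, and further gains must come from reducing I(y1, y2).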

  9. 2. Information Maximization • [diagram: inputs x1, x2, …, xn mapped through weights W and non-linear units 1, 2, …, n to outputs y1, y2, …, yn, with additive noise N] • maximize the mutual information that the output Y of a neural network contains about its input X • G: an invertible, deterministic transformation • W: NN weights • g: activation function (sigmoid) • N: noise • H(Y): differential entropy of the output Y • H(Y|X): the entropy of the output that did not come from the input X • gradient ascent rule w.r.t. the weights W
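
The gradient ascent rests on the change-of-variables relation for a deterministic invertible map (single-unit form shown; the multivariate case replaces the derivative with the Jacobian determinant):

```latex
f_y(y) = \frac{f_x(x)}{\left|\,\partial y / \partial x\,\right|}
\quad\Rightarrow\quad
H(y) = E\!\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] + H(x)
```

Only the first term depends on the weights, so it alone is climbed by gradient ascent.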

  10. M.I. Minimization • basis: a stochastic gradient ascent rule w.r.t. W • as G is an invertible transformation, H(Y|X) = H(N), which does not depend on W
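
Putting the two bullets together (notation as on the previous slide):

```latex
I(Y;X) = H(Y) - H(Y|X) = H(Y) - H(N)
\quad\Rightarrow\quad
\Delta W \;\propto\; \frac{\partial I(Y;X)}{\partial W} = \frac{\partial H(Y)}{\partial W}
```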

  11. CASE 1: 1 Input and 1 Output • example: sigmoidal transfer function • stochastic gradient ascent learning rule • w0-rule: center the steepest part of the sigmoid on the peak of f(x), yielding the most informative bias • w-rule: scale the slope of the sigmoid to match the variance of f(x), yielding the most informative weight • a narrow pdf → a sharply-sloping sigmoid
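
For the logistic unit y = 1/(1 + e⁻ᵘ) with u = wx + w0, the resulting stochastic gradient rules take the form:

```latex
\Delta w \;\propto\; \frac{1}{w} + x\,(1 - 2y),
\qquad
\Delta w_0 \;\propto\; 1 - 2y
```

The 1/w term keeps the weight from collapsing to zero (an anti-decay term), while x(1 − 2y) acts as an anti-Hebbian term.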

  12. 1 Input and 1 Output • Infomax Principle (Laughlin 1981) • matching a neuron’s output function to its input distribution • inputs are passed through a sigmoid function • maximum information transmission: the high-density part of the pdf f(x) is lined up with the sloping part of the sigmoid g(x) • f_y(y) is then close to the flat uniform distribution - the maximum-entropy distribution for a variable bounded in (0, 1) • w0: centers the sigmoid on the peak of the distribution • w_opt: the scale that yields a flat output distribution
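
A small numerical illustration of the infomax picture (not from the lecture; the Gaussian input and the logistic-scale heuristic s ≈ 0.55σ are assumptions): a sigmoid whose slope and offset match the input density yields an output histogram close to uniform on (0, 1), i.e. close to maximum entropy, while a mismatched sigmoid does not.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=100_000)         # peaked input pdf f(x)

def logistic(x, w, w0):
    return 1.0 / (1.0 + np.exp(-(w * x + w0)))

def hist_entropy(y, bins=50):
    dens, _ = np.histogram(y, bins=bins, range=(0.0, 1.0), density=True)
    p = dens[dens > 0] / bins                             # bin probabilities
    return -np.sum(p * np.log(p))                         # max value is ln(bins)

w_opt = 1.0 / (0.55 * 0.5)                  # slope roughly matched to sigma
y_matched = logistic(x, w_opt, -2.0 * w_opt)              # centered on the peak
y_mismatched = logistic(x, 0.2, 0.0)                      # too shallow, off-center

print(hist_entropy(y_matched), ">", hist_entropy(y_mismatched))
```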

  13. CASE2: NN Network • Expansion of 1  1 unit mapping • Multi-dimensional Learning Rule • Refer to the paper for detail derivation of the learning rule

  14. CASE 3: A Causal Filter (Blind Deconvolution) • assume the single output signal is dependent on time-delayed versions of itself • transform the problem into the blind-separation domain • x(t): a time series of length M • w(t): a causal filter of length L (< M), {w1, w2, …, wL} • u(t): the output time series • X, Y, U: the corresponding vectors • W: an M×M lower-triangular matrix • a special case of blind separation
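
An illustrative example of the lower-triangular structure (the exact layout here is an assumption), for M = 4 samples and an L = 2 filter {w1, w2}, with w2 = wL as the leading weight on the diagonal:

```latex
U = W X, \qquad
W =
\begin{pmatrix}
w_2 & 0   & 0   & 0 \\
w_1 & w_2 & 0   & 0 \\
0   & w_1 & w_2 & 0 \\
0   & 0   & w_1 & w_2
\end{pmatrix}
```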

  15. Blind Deconvolution • [diagram: tapped delay line — x(t) passes through unit-delay operators z⁻¹, is weighted by w1, …, wL−1, wL, summed, and passed through g to give y(t)] • learning rules, when g is tanh() • wL: a ‘leading’ weight, playing the same role as the weight of a single unit • wL−j: weights multiplying the delayed inputs x(t−j) feeding y(t); their rule decorrelates the past input from the present output
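
For g = tanh(), the rules stated in the paper take the following form (shown per time step; the paper accumulates them over t):

```latex
\Delta w_{L} \;\propto\; \frac{1}{w_{L}} - 2\, x_{t}\tanh(u_{t}),
\qquad
\Delta w_{L-j} \;\propto\; -2\, x_{t-j}\tanh(u_{t}), \quad j = 1, \dots, L-1
```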

  16. CASE 4: Weights with Time Delays • assume the signal is dependent on a time-delayed version of itself • learning rule for a delay d, for g = tanh() • example: if y receives a mixture of sinusoids of the same frequency but different phases, d is adjusted until the same-frequency sinusoids have the same phase • applications: removing echo or reverberation

  17. CASE 5: Generalized Sigmoid Function • selection of the non-linear function g • it should be suited to the cumulative pdf of the network’s input (u) • ‘flexible’ sigmoid: an asymmetric generalized logistic function defined by the differential equation below • p, r > 1: very peaked (super-Gaussian) • p, r < 1: flat, uniform-like (sub-Gaussian) • p ≠ r: skewed distributions
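
The defining differential equation, which reduces to the ordinary logistic function when p = r = 1 (since the logistic satisfies dy/dx = y(1 − y)):

```latex
\frac{\partial y}{\partial x} \;=\; y^{\,p}\,(1-y)^{\,r}
```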

  18. (sec.4) Real-World Considerations • [figure: example waveforms — Gaussian white noise, speech, music] • the learning rules are altered by incorporating (p, r) • example 1: for a single unit (see the sketch below) • example 2: for an N→N network
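
For the single-unit case, repeating the entropy-gradient derivation with ∂y/∂u = yᵖ(1 − y)ʳ gives rules of the following form (a sketch; the paper's own notation may differ), which reduce to the logistic rules when p = r = 1:

```latex
\Delta w \;\propto\; \frac{1}{w} + x\left(p\,y^{\,p-1}(1-y)^{\,r} \;-\; r\,y^{\,p}(1-y)^{\,r-1}\right),
\qquad
\Delta w_0 \;\propto\; p\,y^{\,p-1}(1-y)^{\,r} \;-\; r\,y^{\,p}(1-y)^{\,r-1}
```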

  19. SUMMARY • Self-Organizing Learning Algorithm • based on the infomax principle • separates statistically independent components • blind separation, blind deconvolution, time-delayed signals • a higher-order generalization of PCA • Non-Linear Transfer Function • picks up higher-order moments • selected to best match the distributions of the network’s inputs

  20. Appendix: (sec.3) Higher-Order Statistics • 2nd-order decorrelation (Barlow & Földiák) • finds uncorrelated, linearly independent projections • BS: PCA, unsuitable for an asymmetric mixing matrix A • BD: autocorrelation, which uses only amplitude information (phase-blind) • Higher-Order Statistics • minimizing the M.I. (mutual information) between outputs • M.I. involves higher-order statistics - cumulants of all orders • explicit estimation requires intensive computation • alternative: static non-linear functions pick up higher-order statistics implicitly
