Independent Component Analysis, Part I
CS679 Lecture Note by Gil-Jin Jang
Computer Science Department, KAIST
REFERENCES
• A.J. Bell & T.J. Sejnowski. 1995. "An Information-Maximization Approach to Blind Separation and Blind Deconvolution," Neural Computation 7, sections 1-4.
• P. Comon. 1994. "Independent Component Analysis, a New Concept?," Signal Processing 36, pp. 287-314.
• K. Pope & R. Bogner. 1996. "Blind Signal Separation: Linear, Instantaneous Combinations," Digital Signal Processing, pp. 5-16.
ABSTRACT
• New self-organizing learning algorithm
  • requires no knowledge of the input distributions
  • maximizes the information in the output of a neuron
  • inputs are passed through non-linear units (e.g. the sigmoid function)
• Extra properties of the non-linear transfer function
  • picks up higher-order moments of the input distributions
  • reduces redundancy between outputs
  • separates statistically independent components
  • a higher-order generalization of PCA
• Simulations
  • blind separation & blind deconvolution
  • time-delayed source separation
(sec.3) Background: Terminology
• BSS (blind source separation) = redundancy reduction
  • linearly mixed, multiple signals → ICA = blind separation
  • linearly mixed with time-delayed versions of itself (whitening a single signal) → blind deconvolution
  • nonlinearly mixed, general mixing structures for multiple signals → intractable so far
• PCA = Karhunen-Loève transform (Duda & Hart)
Blind Separation
• the "cocktail-party" problem, with no propagation delays
• Problem: sources are linearly combined by an unknown matrix A: x = As
• Solution: a square matrix W = PDA⁻¹, so that u = Wx = PDA⁻¹x = PDs
  • P: a permutation matrix, D: a diagonal scaling matrix
• found by minimizing the mutual information between the output units
• x: observed signals, s: source signals, u: estimated independent components
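A minimal NumPy sketch of the mixing model and of why the solution is only defined up to a permutation P and scaling D (the sources, mixing matrix, and sizes here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent non-Gaussian sources, T samples each.
T = 10000
s = rng.uniform(-1, 1, size=(2, T))        # s : source signals

A = np.array([[1.0, 0.6],                  # A : unknown mixing matrix
              [0.4, 1.0]])
x = A @ s                                  # x = As : observed signals

# Any W of the form P D A^{-1} is a valid separating matrix:
P = np.array([[0.0, 1.0], [1.0, 0.0]])     # permutation
D = np.diag([2.0, -0.5])                   # arbitrary nonzero scaling
W = P @ D @ np.linalg.inv(A)

u = W @ x                                  # u = Wx = PDs : sources, permuted and rescaled
print(np.allclose(u, P @ D @ s))           # True
```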
Blind Deconvolution
• Problem: a single signal, corrupted by an unknown filter
  • {a_1, a_2, …, a_K}: a K-th order causal filter
  • the "other sources" are time-delayed versions of the signal itself
• Solution: an inverse filter {w_1, w_2, …, w_L}
  • removes statistical dependencies across time
• x(t): observed, corrupted signal; s(t): unknown source signal; u(t): recovered signal
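A sketch of the corruption model and of how a causal inverse filter undoes it. Here the corrupting filter is known so the inverse can be written down directly; the blind learning rule for finding it appears under CASE 3. The filter values and lengths are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=5000)                 # s(t): unknown white source

a = np.array([1.0, 0.5])                   # unknown causal corrupting filter {a_1, a_2}
x = np.convolve(s, a)[:len(s)]             # x(t): observed, corrupted signal

# Truncated series inverse of (1 + 0.5 z^{-1}): taps (-0.5)^j
L = 12
w = (-0.5) ** np.arange(L)                 # inverse filter {w_1, ..., w_L}
u = np.convolve(x, w)[:len(s)]             # u(t): recovered signal

print(np.max(np.abs(u - s)))               # small residual from truncating the inverse
```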
1. Introduction
• Information-theoretic unsupervised learning rules
• applied to networks of non-linear units, in five cases:
  • a single unit
  • an N→N mapping
  • a causal filter (blind deconvolution)
  • a time-delayed system
  • a 'flexible' non-linearity (selection of the activation function)
(sec.4) Reducing Statistical Dependence via Information Maximization
• goal: statistically independent outputs, via an information-theoretic approach
• purpose: maximize the sum of the individual output entropies while minimizing their mutual information
• in practice, maximizing the joint output entropy H(y_1, y_2) tends to minimize the mutual information I(y_1, y_2) (see the identity below)
• for super-Gaussian input signals (e.g. speech), maximizing the joint entropy in sigmoidal networks = minimizing the M.I. between outputs (experimental result)
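The identity behind this claim, written out (standard information theory; y_1, y_2 are two sigmoid-bounded outputs):

$$ I(y_1, y_2) = H(y_1) + H(y_2) - H(y_1, y_2) $$

Each y_i is bounded in (0,1), so each marginal entropy H(y_i) has a finite maximum; pushing H(y_1, y_2) up therefore tends to drive I(y_1, y_2) down, although the paper notes this is not guaranteed in every case.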
[Figure: a feedforward network mapping inputs x_1 … x_n through weights W to outputs y_1 … y_n, with added noise N]
2. Information Maximization
• maximize the mutual information between the input X and the output Y of a neural network
• G: an invertible, deterministic transformation; W: the network weights; g: the activation function (sigmoid); N: noise
• H(Y): differential entropy of the output Y; H(Y|X): the entropy of the output that did not come from the input X
• learned by a gradient ascent rule w.r.t. the weights W
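Spelled out, the decomposition that drives the whole method (section 2 of the paper):

$$ I(Y; X) = H(Y) - H(Y|X) $$

When the mapping from X to Y is deterministic and invertible, H(Y|X) is just the entropy of the noise N and does not depend on W, so

$$ \frac{\partial}{\partial W} I(Y;X) = \frac{\partial}{\partial W} H(Y) $$

and maximizing the output entropy alone maximizes the information transfer.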
M.I. Minimization
• basis: a stochastic gradient ascent rule on H(Y) w.r.t. W
• since G is an invertible, deterministic transformation, H(Y|X) = H(N) and does not depend on W; climbing H(Y) alone therefore maximizes I(Y;X) and, between the outputs, minimizes their M.I.
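For a single invertible unit y = g(wx + w_0), the entropy being climbed expands as follows (the change-of-variables step is standard):

$$ f_y(y) = \frac{f_x(x)}{\left|\partial y/\partial x\right|}, \qquad H(y) = -E[\ln f_y(y)] = E\!\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] + H(x) $$

H(x) does not depend on w, so the stochastic gradient rule reduces to Δw ∝ ∂/∂w ln|∂y/∂x|.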
CASE 1: 1 Input and 1 Output
• example: a sigmoidal transfer function y = g(wx + w_0)
• stochastic gradient ascent learning rules (a sketch follows below):
  • w_0-rule: centers the steepest part of the sigmoid on the peak of the input pdf f(x), yielding the most informative bias
  • w-rule: scales the slope of the sigmoid to match the variance of f(x), yielding the most informative weight; a narrow pdf calls for a sharply-sloping sigmoid
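A minimal sketch of the paper's single-unit rules for the logistic sigmoid, Δw ∝ 1/w + x(1 − 2y) and Δw_0 ∝ 1 − 2y; the input distribution, learning rate, and sample count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0 + 0.5 * rng.standard_normal(50000)     # input pdf f(x): assumed Gaussian example

w, w0 = 1.0, 0.0                               # weight and bias
lr = 0.01

for xt in x:
    y = 1.0 / (1.0 + np.exp(-(w * xt + w0)))   # logistic sigmoid output
    w  += lr * (1.0 / w + xt * (1.0 - 2.0 * y))  # w-rule: match slope to input variance
    w0 += lr * (1.0 - 2.0 * y)                   # w0-rule: center sigmoid on input peak

print(w, w0)   # sigmoid midpoint -w0/w should sit near the input mean (~2.0)
```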
1 Input and 1 Output
• Infomax principle (Laughlin 1981)
  • match a neuron's output function to its input distribution
  • inputs are passed through a sigmoid function
  • maximum information transmission: the high-density part of the input pdf f(x) is lined up with the sloping part of the sigmoid g(x)
  • the output density f_y(y) is then close to the flat uniform distribution, the maximum-entropy distribution for a variable bounded in (0,1)
• in the figure: w_0 aligns the sigmoid with the peak of the distribution; w_opt scales it so the output distribution is flat
CASE 2: N→N Network
• expansion of the 1→1 unit mapping to N inputs and N outputs
• multi-dimensional learning rule (a sketch follows below)
• refer to the paper for the detailed derivation of the learning rule
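A sketch of the paper's N→N infomax rule for logistic units, ΔW ∝ [Wᵀ]⁻¹ + (1 − 2y)xᵀ, applied to a 2×2 blind-separation problem. The mixing matrix, source distribution, learning rate, and sample count are assumptions; plain stochastic ascent like this can be sensitive to the learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
s = rng.laplace(size=(2, T))               # super-Gaussian sources (e.g. speech-like)
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # unknown mixing matrix
x = A @ s                                  # observed mixtures

W = np.eye(2)                              # unmixing weights
lr = 0.01
for t in range(T):
    xt = x[:, t:t+1]                       # one sample, kept as a column vector
    y = 1.0 / (1.0 + np.exp(-(W @ xt)))    # logistic outputs
    # infomax rule: dW ∝ [W^T]^{-1} + (1 - 2y) x^T
    W += lr * (np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ xt.T)

print(W @ A)   # should approach PD: one dominant entry per row and column
```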
CASE 3: A Causal Filter (Blind Deconvolution)
• assume the single output signal is dependent on itself across time
• transform the problem into the blind-separation domain, as a special case of blind separation:
  • x(t): a time series of length M
  • w(t): a causal filter of length L (< M), {w_1, w_2, …, w_L}
  • u(t): the output time series
  • X, Y, U: the corresponding vectors; W: an M×M, banded lower-triangular matrix (construction sketched below)
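A small sketch of how the causal convolution u_t = Σ_l w_l x_{t-(L-l)} becomes U = WX with a banded lower-triangular W; the tap values and the exact indexing convention are assumptions for illustration:

```python
import numpy as np

M, L = 8, 3
w_taps = np.array([0.2, -0.5, 1.0])      # causal filter {w_1, w_2, w_3}; w_L leads

# W is M x M, lower triangular, with w_L on the main diagonal and
# earlier taps on the sub-diagonals: u_t = sum_l w_l x_{t-(L-l)}
W = np.zeros((M, M))
for l in range(L):                        # 0-based l maps to tap w_{l+1}
    W += np.diag(np.full(M - (L - 1 - l), w_taps[l]), k=-(L - 1 - l))

x = np.arange(1.0, M + 1)                 # toy observed time series
u = W @ x                                 # identical to causal FIR filtering of x
print(W)
print(u)
```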
[Figure: tapped delay line - x(t) passes through unit-delay operators z⁻¹; taps w_1 … w_L feed the non-linearity g to produce y(t)]
Blind Deconvolution
• learning rules when g is tanh() (sketched below):
  • w_L: a 'leading' weight, playing the same role as in the single-unit case
  • w_{L−j}: weights on the delay lines from x_{t−j} to y_t; they decorrelate the past input from the present output
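A sketch of those tanh rules, Δw_L ∝ 1/w_L − 2 y_t x_t for the leading weight and Δw_{L−j} ∝ −2 y_t x_{t−j} for the delay lines; the corrupting filter, learning rate, and run length are assumptions, and a longer run or several passes may be needed for tight convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=20000)                  # white super-Gaussian source
x = np.convolve(s, [1.0, 0.5])[:len(s)]      # corrupted by an unknown causal filter

L = 8
w = np.zeros(L); w[-1] = 1.0                 # taps w_1..w_L; leading weight w_L = 1
lr = 1e-4

for t in range(L - 1, len(x)):
    past = x[t - L + 1 : t + 1]              # x_{t-L+1} .. x_t  (w_L pairs with x_t)
    y = np.tanh(w @ past)
    grad = -2.0 * y * past                   # delay-line terms: dw_{L-j} ∝ -2 y x_{t-j}
    grad[-1] += 1.0 / w[-1]                  # leading weight:  dw_L ∝ 1/w_L - 2 y x_t
    w += lr * grad

print(w)   # read right-to-left, taps should approach the inverse of (1 + 0.5 z^{-1})
```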
CASE 4: Weights with Time Delays
• assume the signal is dependent on a time-delayed version of itself
• learning rule for the delay d when g is tanh()
• example: if y receives a mixture of sinusoids of the same frequency but different phases, d is adjusted until the same-frequency sinusoids have the same phase (illustrated below)
• applications: removing echo or reverberation
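An illustration of the phase-alignment objective only; this brute-force search stands in for the paper's gradient rule on d, which it does not reproduce. The sample rate, frequency, and phase offset are assumptions:

```python
import numpy as np

fs, f = 1000, 5.0                             # sample rate (Hz), common frequency
t = np.arange(0, 2, 1 / fs)
x1 = np.sin(2 * np.pi * f * t)                # reference sinusoid
x2 = np.sin(2 * np.pi * f * t + 1.2)          # same frequency, different phase

# Search over one period for the delay d (in samples) that aligns the phases:
best_d = max(range(int(fs / f)),
             key=lambda d: np.dot(x1[d:], x2[:len(x2) - d]))
print(best_d / fs, 1.2 / (2 * np.pi * f))     # recovered vs. true delay (seconds)
```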
CASE 5: Generalized Sigmoid Function
• selection of the non-linear function g
  • g should match the cumulative pdf of the net input u
• 'flexible' sigmoid: an asymmetric generalized logistic function, defined by the differential equation dy/dx = yᵖ(1−y)ʳ
  • p, r > 1: very peaked (super-Gaussian)
  • p, r < 1: flat, uniform-like (sub-Gaussian)
  • p ≠ r: skewed distributions
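A small numerical sketch of that family, integrating dy/dx = yᵖ(1−y)ʳ forward from y(0) = 0.5; the grid and step size are assumptions:

```python
import numpy as np

def gen_logistic(p, r, x_max=8.0, n=2000):
    """Euler-integrate dy/dx = y**p * (1 - y)**r from y(0) = 0.5.

    p = r = 1 recovers the ordinary logistic; per the slide, p, r > 1 imply a
    peaked (super-Gaussian) matched density dy/dx, p, r < 1 a flat sub-Gaussian
    one, and p != r a skewed one."""
    xs = np.linspace(0.0, x_max, n)
    dx = xs[1] - xs[0]
    ys = np.empty(n); y = 0.5
    for i in range(n):
        ys[i] = y
        y = min(y + dx * (y ** p) * ((1.0 - y) ** r), 1.0 - 1e-12)
    return xs, ys

for p, r in [(1, 1), (3, 3), (0.5, 0.5), (4, 1)]:
    xs, ys = gen_logistic(p, r)
    print(p, r, ys[-1])      # how far y has saturated toward 1 by x = 8
```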
[Figure: example input pdfs - Gaussian white noise, speech, music]
(sec.4) Real-World Considerations
• the learning rules are altered by incorporating (p, r)
  • example 1) for the single unit
  • example 2) for the N→N network
  (a derivation for the single-unit case is sketched below)
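A sketch of how (p, r) enter the single-unit rule, derived here from the definition dy/dx ∝ yᵖ(1−y)ʳ above rather than quoted from the paper; consult the paper for its exact generalized rules. With u = wx + w_0:

$$ \ln\left|\frac{\partial y}{\partial x}\right| = \ln|w| + p\ln y + r\ln(1-y) $$

$$ \Delta w \propto \frac{\partial}{\partial w}\ln\left|\frac{\partial y}{\partial x}\right| = \frac{1}{w} + x\, y^{p-1}(1-y)^{r-1}\bigl(p(1-y) - ry\bigr) $$

which reduces to the CASE 1 rule Δw ∝ 1/w + x(1 − 2y) at p = r = 1.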
SUMMARY
• Self-organizing learning algorithm
  • based on the infomax principle
  • separates statistically independent components
  • handles blind separation, blind deconvolution, and time-delayed signals
  • a higher-order generalization of PCA
• Non-linear transfer function
  • picks up higher-order moments
  • selected to best match the network's output distributions (cf. CASE 5)
Appendix: (sec.3) Higher-Order Statistics
• 2nd-order decorrelation (Barlow & Földiák)
  • finds uncorrelated, linearly independent projections
  • for blind separation: PCA, unsuitable for an asymmetric mixing matrix A
  • for blind deconvolution: autocorrelation methods recover only the amplitude spectrum (phase-blind)
• Higher-order statistics
  • minimize the M.I. (mutual information) between outputs
  • M.I. involves higher-order statistics, cumulants of all orders
  • estimating them explicitly is computationally intensive
  • static non-linear functions capture them implicitly (see the expansion below)
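A one-line reason why a static non-linearity brings in higher-order statistics (the Taylor expansion is exact; its use in this argument follows the paper):

$$ \tanh(u) = u - \frac{u^3}{3} + \frac{2u^5}{15} - \cdots $$

so a learning term like E[x·tanh(u)] contains E[xu], E[xu³], E[xu⁵], …: statistics of all orders, without any explicit cumulant estimation.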