Independent Component Analysis, Part I
CS679 Lecture Note by Gil-Jin Jang
Computer Science Department, KAIST
REFERENCES
• A.J. Bell & T.J. Sejnowski. 1995. "An Information-Maximization Approach to Blind Separation and Blind Deconvolution," Neural Computation 7, sections 1-4.
• P. Comon. 1994. "Independent Component Analysis, a New Concept?," Signal Processing 36, pp. 287-314.
• K. Pope & R. Bogner. 1996. "Blind Signal Separation: Linear, Instantaneous Combinations," Digital Signal Processing, pp. 5-16.
ABSTRACT
• New self-organizing learning algorithm
  • requires no knowledge of the input distributions
  • maximizes the information in the output of a neuron
  • inputs are passed through non-linear units (e.g. the sigmoid function)
• Extra properties of the non-linear transfer function
  • picks up higher-order moments of the input distributions
  • reduces redundancy between outputs
  • separates statistically independent components
  • a higher-order generalization of PCA
• Simulations
  • blind separation & blind deconvolution
  • time-delayed source separation
(sec.3) Background: Terminology
• BSS (blind source separation) = redundancy reduction
  • linearly mixed, multiple signals → ICA = blind separation
  • linearly mixed with time-delayed versions of itself (whitening a single signal) → blind deconvolution
  • nonlinearly mixed, general mixing structures for multiple signals → intractable so far
• PCA = Karhunen-Loève transform (Duda & Hart)
Blind Separation
• the "cocktail-party" problem, with no propagation delays
• Problem: sources are linearly combined by an unknown matrix A: x = As
• Solution: a square matrix W = PDA⁻¹, so that u = Wx = PDA⁻¹x = PDs
  • P: a permutation matrix, D: a diagonal scaling matrix
• found by minimizing the mutual information between the output units
• x: observed signals, s: source signals, u: estimated independent components
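A minimal NumPy sketch of the mixing model and of why the solution is only defined up to a permutation P and scaling D (the sources, mixing matrix, and sizes here are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent non-Gaussian sources, T samples each.
T = 10000
s = rng.uniform(-1, 1, size=(2, T))        # s : source signals

A = np.array([[1.0, 0.6],                  # A : unknown mixing matrix
              [0.4, 1.0]])
x = A @ s                                  # x = As : observed signals

# Any W of the form P D A^{-1} is a valid separating matrix:
P = np.array([[0.0, 1.0], [1.0, 0.0]])     # permutation
D = np.diag([2.0, -0.5])                   # arbitrary nonzero scaling
W = P @ D @ np.linalg.inv(A)

u = W @ x                                  # u = Wx = PDs : sources, permuted and rescaled
print(np.allclose(u, P @ D @ s))           # True
```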
Blind Deconvolution
• Problem: a single signal, corrupted by an unknown filter
  • {a_1, a_2, …, a_K}: a K-th order causal filter
  • the "other sources" are time-delayed versions of the signal itself
• Solution: an inverse filter {w_1, w_2, …, w_L}
  • removes statistical dependencies across time
• x(t): observed, corrupted signal; s(t): unknown source signal; u(t): recovered signal
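A sketch of the corruption model and of how a causal inverse filter undoes it. Here the corrupting filter is known so the inverse can be written down directly; the blind learning rule for finding it appears under CASE 3. The filter values and lengths are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=5000)                 # s(t): unknown white source

a = np.array([1.0, 0.5])                   # unknown causal corrupting filter {a_1, a_2}
x = np.convolve(s, a)[:len(s)]             # x(t): observed, corrupted signal

# Truncated series inverse of (1 + 0.5 z^{-1}): taps (-0.5)^j
L = 12
w = (-0.5) ** np.arange(L)                 # inverse filter {w_1, ..., w_L}
u = np.convolve(x, w)[:len(s)]             # u(t): recovered signal

print(np.max(np.abs(u - s)))               # small residual from truncating the inverse
```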
1. Introduction
• Information-theoretic unsupervised learning rules
• applied to networks of non-linear units, in five cases:
  • a single unit
  • an N→N mapping
  • a causal filter (blind deconvolution)
  • a time-delayed system
  • a 'flexible' non-linearity (selection of the activation function)
(sec.4) Reducing Statistical Dependence via Information Maximization
• goal: statistically independent outputs, via an information-theoretic approach
• purpose: maximize the sum of the individual output entropies while minimizing their mutual information
• in practice, maximizing the joint output entropy H(y_1, y_2) tends to minimize the mutual information I(y_1, y_2) (see the identity below)
• for super-Gaussian input signals (e.g. speech), maximizing the joint entropy in sigmoidal networks = minimizing the M.I. between outputs (experimental result)
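The identity behind this claim, written out (standard information theory; y_1, y_2 are two sigmoid-bounded outputs):

$$ I(y_1, y_2) = H(y_1) + H(y_2) - H(y_1, y_2) $$

Each y_i is bounded in (0,1), so each marginal entropy H(y_i) has a finite maximum; pushing H(y_1, y_2) up therefore tends to drive I(y_1, y_2) down, although the paper notes this is not guaranteed in every case.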
[Figure: a feedforward network mapping inputs x_1 … x_n through weights W to outputs y_1 … y_n, with added noise N]
2. Information Maximization
• maximize the mutual information between the input X and the output Y of a neural network
• G: an invertible, deterministic transformation; W: the network weights; g: the activation function (sigmoid); N: noise
• H(Y): differential entropy of the output Y; H(Y|X): the entropy of the output that did not come from the input X
• learned by a gradient ascent rule w.r.t. the weights W
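Spelled out, the decomposition that drives the whole method (section 2 of the paper):

$$ I(Y; X) = H(Y) - H(Y|X) $$

When the mapping from X to Y is deterministic and invertible, H(Y|X) is just the entropy of the noise N and does not depend on W, so

$$ \frac{\partial}{\partial W} I(Y;X) = \frac{\partial}{\partial W} H(Y) $$

and maximizing the output entropy alone maximizes the information transfer.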
M.I. Minimization
• basis: a stochastic gradient ascent rule on H(Y) w.r.t. W
• since G is an invertible, deterministic transformation, H(Y|X) = H(N) and does not depend on W; climbing H(Y) alone therefore maximizes I(Y;X) and, between the outputs, minimizes their M.I.
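For a single invertible unit y = g(wx + w_0), the entropy being climbed expands as follows (the change-of-variables step is standard):

$$ f_y(y) = \frac{f_x(x)}{\left|\partial y/\partial x\right|}, \qquad H(y) = -E[\ln f_y(y)] = E\!\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] + H(x) $$

H(x) does not depend on w, so the stochastic gradient rule reduces to Δw ∝ ∂/∂w ln|∂y/∂x|.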
CASE 1: 1 Input and 1 Output
• example: a sigmoidal transfer function y = g(wx + w_0)
• stochastic gradient ascent learning rules (a sketch follows below):
  • w_0-rule: centers the steepest part of the sigmoid on the peak of the input pdf f(x), yielding the most informative bias
  • w-rule: scales the slope of the sigmoid to match the variance of f(x), yielding the most informative weight; a narrow pdf calls for a sharply-sloping sigmoid
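A minimal sketch of the paper's single-unit rules for the logistic sigmoid, Δw ∝ 1/w + x(1 − 2y) and Δw_0 ∝ 1 − 2y; the input distribution, learning rate, and sample count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = 2.0 + 0.5 * rng.standard_normal(50000)     # input pdf f(x): assumed Gaussian example

w, w0 = 1.0, 0.0                               # weight and bias
lr = 0.01

for xt in x:
    y = 1.0 / (1.0 + np.exp(-(w * xt + w0)))   # logistic sigmoid output
    w  += lr * (1.0 / w + xt * (1.0 - 2.0 * y))  # w-rule: match slope to input variance
    w0 += lr * (1.0 - 2.0 * y)                   # w0-rule: center sigmoid on input peak

print(w, w0)   # sigmoid midpoint -w0/w should sit near the input mean (~2.0)
```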
1 Input and 1 Output
• Infomax principle (Laughlin 1981)
  • match a neuron's output function to its input distribution
  • inputs are passed through a sigmoid function
  • maximum information transmission: the high-density part of the input pdf f(x) is lined up with the sloping part of the sigmoid g(x)
  • the output density f_y(y) is then close to the flat uniform distribution, the maximum-entropy distribution for a variable bounded in (0,1)
• in the figure: w_0 aligns the sigmoid with the peak of the distribution; w_opt scales it so the output distribution is flat
CASE 2: N→N Network
• expansion of the 1→1 unit mapping to N inputs and N outputs
• multi-dimensional learning rule (a sketch follows below)
• refer to the paper for the detailed derivation of the learning rule
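A sketch of the paper's N→N infomax rule for logistic units, ΔW ∝ [Wᵀ]⁻¹ + (1 − 2y)xᵀ, applied to a 2×2 blind-separation problem. The mixing matrix, source distribution, learning rate, and sample count are assumptions; plain stochastic ascent like this can be sensitive to the learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 20000
s = rng.laplace(size=(2, T))               # super-Gaussian sources (e.g. speech-like)
A = np.array([[1.0, 0.6], [0.4, 1.0]])     # unknown mixing matrix
x = A @ s                                  # observed mixtures

W = np.eye(2)                              # unmixing weights
lr = 0.01
for t in range(T):
    xt = x[:, t:t+1]                       # one sample, kept as a column vector
    y = 1.0 / (1.0 + np.exp(-(W @ xt)))    # logistic outputs
    # infomax rule: dW ∝ [W^T]^{-1} + (1 - 2y) x^T
    W += lr * (np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ xt.T)

print(W @ A)   # should approach PD: one dominant entry per row and column
```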
CASE 3: A Causal Filter (Blind Deconvolution)
• assume the single output signal is dependent on itself across time
• transform the problem into the blind-separation domain, as a special case of blind separation:
  • x(t): a time series of length M
  • w(t): a causal filter of length L (< M), {w_1, w_2, …, w_L}
  • u(t): the output time series
  • X, Y, U: the corresponding vectors; W: an M×M, banded lower-triangular matrix (construction sketched below)
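A small sketch of how the causal convolution u_t = Σ_l w_l x_{t-(L-l)} becomes U = WX with a banded lower-triangular W; the tap values and the exact indexing convention are assumptions for illustration:

```python
import numpy as np

M, L = 8, 3
w_taps = np.array([0.2, -0.5, 1.0])      # causal filter {w_1, w_2, w_3}; w_L leads

# W is M x M, lower triangular, with w_L on the main diagonal and
# earlier taps on the sub-diagonals: u_t = sum_l w_l x_{t-(L-l)}
W = np.zeros((M, M))
for l in range(L):                        # 0-based l maps to tap w_{l+1}
    W += np.diag(np.full(M - (L - 1 - l), w_taps[l]), k=-(L - 1 - l))

x = np.arange(1.0, M + 1)                 # toy observed time series
u = W @ x                                 # identical to causal FIR filtering of x
print(W)
print(u)
```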
[Figure: tapped delay line - x(t) passes through unit-delay operators z⁻¹; taps w_1 … w_L feed the non-linearity g to produce y(t)]
Blind Deconvolution
• learning rules when g is tanh() (sketched below):
  • w_L: a 'leading' weight, playing the same role as in the single-unit case
  • w_{L−j}: weights on the delay lines from x_{t−j} to y_t; they decorrelate the past input from the present output
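A sketch of those tanh rules, Δw_L ∝ 1/w_L − 2 y_t x_t for the leading weight and Δw_{L−j} ∝ −2 y_t x_{t−j} for the delay lines; the corrupting filter, learning rate, and run length are assumptions, and a longer run or several passes may be needed for tight convergence:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=20000)                  # white super-Gaussian source
x = np.convolve(s, [1.0, 0.5])[:len(s)]      # corrupted by an unknown causal filter

L = 8
w = np.zeros(L); w[-1] = 1.0                 # taps w_1..w_L; leading weight w_L = 1
lr = 1e-4

for t in range(L - 1, len(x)):
    past = x[t - L + 1 : t + 1]              # x_{t-L+1} .. x_t  (w_L pairs with x_t)
    y = np.tanh(w @ past)
    grad = -2.0 * y * past                   # delay-line terms: dw_{L-j} ∝ -2 y x_{t-j}
    grad[-1] += 1.0 / w[-1]                  # leading weight:  dw_L ∝ 1/w_L - 2 y x_t
    w += lr * grad

print(w)   # read right-to-left, taps should approach the inverse of (1 + 0.5 z^{-1})
```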
CASE 4: Weights with Time Delays
• assume the signal is dependent on a time-delayed version of itself
• learning rule for the delay d when g is tanh()
• example: if y receives a mixture of sinusoids of the same frequency but different phases, d is adjusted until the same-frequency sinusoids have the same phase (illustrated below)
• applications: removing echo or reverberation
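An illustration of the phase-alignment objective only; this brute-force search stands in for the paper's gradient rule on d, which it does not reproduce. The sample rate, frequency, and phase offset are assumptions:

```python
import numpy as np

fs, f = 1000, 5.0                             # sample rate (Hz), common frequency
t = np.arange(0, 2, 1 / fs)
x1 = np.sin(2 * np.pi * f * t)                # reference sinusoid
x2 = np.sin(2 * np.pi * f * t + 1.2)          # same frequency, different phase

# Search over one period for the delay d (in samples) that aligns the phases:
best_d = max(range(int(fs / f)),
             key=lambda d: np.dot(x1[d:], x2[:len(x2) - d]))
print(best_d / fs, 1.2 / (2 * np.pi * f))     # recovered vs. true delay (seconds)
```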
CASE 5: Generalized Sigmoid Function
• selection of the non-linear function g
  • g should match the cumulative pdf of the net input u
• 'flexible' sigmoid: an asymmetric generalized logistic function, defined by the differential equation dy/dx = yᵖ(1−y)ʳ
  • p, r > 1: very peaked (super-Gaussian)
  • p, r < 1: flat, uniform-like (sub-Gaussian)
  • p ≠ r: skewed distributions
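A small numerical sketch of that family, integrating dy/dx = yᵖ(1−y)ʳ forward from y(0) = 0.5; the grid and step size are assumptions:

```python
import numpy as np

def gen_logistic(p, r, x_max=8.0, n=2000):
    """Euler-integrate dy/dx = y**p * (1 - y)**r from y(0) = 0.5.

    p = r = 1 recovers the ordinary logistic; per the slide, p, r > 1 imply a
    peaked (super-Gaussian) matched density dy/dx, p, r < 1 a flat sub-Gaussian
    one, and p != r a skewed one."""
    xs = np.linspace(0.0, x_max, n)
    dx = xs[1] - xs[0]
    ys = np.empty(n); y = 0.5
    for i in range(n):
        ys[i] = y
        y = min(y + dx * (y ** p) * ((1.0 - y) ** r), 1.0 - 1e-12)
    return xs, ys

for p, r in [(1, 1), (3, 3), (0.5, 0.5), (4, 1)]:
    xs, ys = gen_logistic(p, r)
    print(p, r, ys[-1])      # how far y has saturated toward 1 by x = 8
```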
[Figure: example input pdfs - Gaussian white noise, speech, music]
(sec.4) Real-World Considerations
• the learning rules are altered by incorporating (p, r)
  • example 1) for the single unit
  • example 2) for the N→N network
  (a derivation for the single-unit case is sketched below)
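A sketch of how (p, r) enter the single-unit rule, derived here from the definition dy/dx ∝ yᵖ(1−y)ʳ above rather than quoted from the paper; consult the paper for its exact generalized rules. With u = wx + w_0:

$$ \ln\left|\frac{\partial y}{\partial x}\right| = \ln|w| + p\ln y + r\ln(1-y) $$

$$ \Delta w \propto \frac{\partial}{\partial w}\ln\left|\frac{\partial y}{\partial x}\right| = \frac{1}{w} + x\, y^{p-1}(1-y)^{r-1}\bigl(p(1-y) - ry\bigr) $$

which reduces to the CASE 1 rule Δw ∝ 1/w + x(1 − 2y) at p = r = 1.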
SUMMARY
• Self-organizing learning algorithm
  • based on the infomax principle
  • separates statistically independent components
  • handles blind separation, blind deconvolution, and time-delayed signals
  • a higher-order generalization of PCA
• Non-linear transfer function
  • picks up higher-order moments
  • selected to best match the network's output distributions (cf. CASE 5)
Appendix: (sec.3) Higher-Order Statistics
• 2nd-order decorrelation (Barlow & Földiák)
  • finds uncorrelated, linearly independent projections
  • for blind separation: PCA, unsuitable for an asymmetric mixing matrix A
  • for blind deconvolution: autocorrelation methods recover only the amplitude spectrum (phase-blind)
• Higher-order statistics
  • minimize the M.I. (mutual information) between outputs
  • M.I. involves higher-order statistics, cumulants of all orders
  • estimating them explicitly is computationally intensive
  • static non-linear functions capture them implicitly (see the expansion below)
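A one-line reason why a static non-linearity brings in higher-order statistics (the Taylor expansion is exact; its use in this argument follows the paper):

$$ \tanh(u) = u - \frac{u^3}{3} + \frac{2u^5}{15} - \cdots $$

so a learning term like E[x·tanh(u)] contains E[xu], E[xu³], E[xu⁵], …: statistics of all orders, without any explicit cumulant estimation.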