Information Theoretic Learning Finding structure in data ...

Jose Principe and Sudhir Rao, University of Florida. principe@cnel.ufl.edu, www.cnel.ufl.edu

Outline • Structure • Connection to Learning • Learning Structure – the old view • A new framework • Applications

Structure
Structure: patterns / regularities, interdependence between subsystems.
No structure: amorphous / chaos, white noise.

Connection to Learning

Types of Learning
• Supervised Learning: data plus a desired signal (teacher)
• Reinforcement Learning: data plus rewards/punishments
• Unsupervised Learning: only the data

Unsupervised Learning • What can be done with only the data?
First principle: preserve maximum information. Examples: auto-associative memory, ART, PCA, Linsker’s “infomax” rule.
First principle: extract independent features. Examples: Barlow’s minimum redundancy principle, ICA.
First principle: learn the probability distribution. Examples: Gaussian mixture models, EM algorithm, parametric density estimation.

Connection to Self Organization
“If cell 1 is one of the cells providing input to cell 2, and if cell 1’s activity tends to be “high” whenever cell 2’s activity is “high”, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase.” - Donald Hebb, 1949, neuropsychologist.
[Diagram: cells A and B provide input to cell C. Increase w_B in proportion to the activity of B and C.]
What is the purpose? “Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network’s functioning as an information processing system?” - Linsker, 1988

Linsker’s Infomax principle
[Diagram: linear network with inputs X1, X2, ..., XL, weights w1, ..., wL, additive noise, and output Y.]
Under Gaussian assumptions and uncorrelated noise, the rate for a linear network can be written in closed form.
Maximize rate = maximize the Shannon rate I(X,Y). This leads to a Hebbian rule!
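As a hedged illustration of why maximizing the rate leads to a Hebbian rule, consider the simplest case: a single output Y = w^T X + n, zero-mean Gaussian input with covariance \Sigma_X, and independent Gaussian noise with variance \sigma_n^2. The symbols \Sigma_X, \sigma_n, and y' are notation assumed here, not taken from the slide, and this is the textbook Gaussian-channel result rather than necessarily the exact expression shown in the talk.

```latex
I(X;Y) = \frac{1}{2}\log\frac{\sigma_Y^2}{\sigma_n^2}
       = \frac{1}{2}\log\!\left(1 + \frac{w^\top \Sigma_X w}{\sigma_n^2}\right),
\qquad
\frac{\partial I}{\partial w} = \frac{\Sigma_X w}{\sigma_Y^2}
       = \frac{E[x\,y']}{\sigma_Y^2},
\quad y' = w^\top x .
```

The gradient is proportional to the correlation between the input and the noiseless output, which is the Hebbian form. Without a constraint on w the rate grows without bound, so in practice some normalization of the weights is needed.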

Minimum Entropy Coding
[Diagram: stimuli 1, ..., M are mapped to features 1, ..., N.]
Barlow’s redundancy principle: independent features, no redundancy. ICA!
Converting an M-dimensional problem → N one-dimensional problems.
N conditional probabilities are required for an event V: P(V | feature i).
2^M conditional probabilities are required for an event V: P(V | stimuli).

Summary 1
Unsupervised Learning: discovering structure in data. Example: PCA.
Global Objective Function: extracting the desired signal from the data itself. Example: Infomax.
Self Organizing Rule: revealing the structure through interaction of the data points. Example: Hebbian rule.

Questions
• Can we go beyond these preprocessing stages?
• Can we create a global cost function which extracts “goal oriented structures” from the data?
• Can we derive a self organizing principle from such a cost function?
A big YES!

What is Information Theoretic Learning?
ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
The centerpiece is a non-parametric estimator for entropy that:
• Does not require an explicit estimation of the pdf
• Uses the Parzen window method, which is known to be consistent and efficient
• Is smooth
• Is readily integrated in conventional gradient descent learning
• Provides a link to kernel learning and SVMs
• Allows an extension to random processes

ITL is a different way of thinking about data quantification
Moment expansions, in particular second order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb’s postulate of learning) into 2nd order statistical equivalents.
ITL replaces 2nd order moments with a geometric statistical interpretation of data in probability spaces:
• Variance by entropy
• Correlation by correntropy
• Mean square error (MSE) by minimum error entropy (MEE)
• Distances in data space by distances in probability spaces

Information Theoretic Learning: Entropy
Not all random variables (r.v.) are equally random!
[Figure: two pdfs f_X(x) plotted for x from -5 to 5.]
• Entropy quantifies the degree of uncertainty in a r.v. For a continuous r.v. with pdf f_X, Shannon’s entropy is H_S(X) = -\int f_X(x) \log f_X(x) \, dx.
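A small numerical check of the statement that not all random variables are equally random. This sketch assumes the differential form of Shannon's entropy, H = -∫ f log f dx, and uses two zero-mean Gaussians whose widths (0.4 and 1.0) are arbitrary illustrative choices, not values from the slides.

```python
import numpy as np

def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian pdf with standard deviation sigma."""
    return np.exp(-x**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def shannon_entropy_numeric(pdf_values, dx):
    """Approximate -integral f log f dx with a Riemann sum (in nats)."""
    f = pdf_values[pdf_values > 0]          # skip zeros to avoid log(0)
    return -np.sum(f * np.log(f)) * dx

x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]

h_narrow = shannon_entropy_numeric(gaussian_pdf(x, 0.4), dx)
h_wide = shannon_entropy_numeric(gaussian_pdf(x, 1.0), dx)

# Closed form for a Gaussian: 0.5 * log(2 * pi * e * sigma^2)
h_narrow_exact = 0.5 * np.log(2.0 * np.pi * np.e * 0.4**2)
h_wide_exact = 0.5 * np.log(2.0 * np.pi * np.e * 1.0**2)
```

The narrower pdf has the smaller entropy, which matches the intuition that a sharply peaked density is less uncertain.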

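Information Theoretic Learning: Renyi’s Entropy
Renyi’s entropy of order α: H_\alpha(X) = \frac{1}{1-\alpha} \log \int f_X^\alpha(x) \, dx, with \alpha > 0, \alpha \neq 1.
• Norm of the pdf: V_\alpha = \int f_X^\alpha(x) \, dx is the α-norm of the pdf raised to the power α, so H_\alpha(X) = \frac{1}{1-\alpha} \log V_\alpha.
Renyi’s entropy equals Shannon’s as α → 1.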
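The limit α → 1 can be checked numerically. The sketch below uses the discrete analogue of Renyi's entropy (a sum over probabilities instead of an integral over a pdf), which is an assumption made to keep the example short. The probability vector is an arbitrary illustration.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Discrete Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p**alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    """Discrete Shannon entropy in nats (all entries of p must be positive)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.25, 0.125, 0.125])

h_shannon = shannon_entropy(p)
h_near_one = renyi_entropy(p, 1.0 + 1e-6)   # order very close to 1
h_two = renyi_entropy(p, 2.0)               # quadratic entropy
```

For this distribution the order-2 entropy is smaller than Shannon's, consistent with Renyi's entropy being non-increasing in α.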

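Information Theoretic Learning: Parzen windowing
Given only samples {x_1, ..., x_N} drawn from a distribution: \hat{f}_X(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa_\sigma(x - x_i), where \kappa_\sigma is the kernel function with size σ.
Convergence: E[\hat{f}_X(x)] = (f_X * \kappa_\sigma)(x), so for scalar data the estimate approaches f_X(x) as N → ∞ with σ → 0 and Nσ → ∞.
[Figure: Parzen estimates \hat{f}_X(x) for N = 10 and N = 1000 samples, plotted for x from -5 to 5, with the kernel function marked.]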
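A minimal sketch of Parzen windowing with a Gaussian kernel for scalar data, comparing N = 10 and N = 1000 samples as in the figure. The kernel size 0.5, the standard normal source, and the evaluation grid are assumptions chosen for illustration.

```python
import numpy as np

def parzen_estimate(x_eval, samples, sigma):
    """Parzen estimate (1/N) * sum_i G_sigma(x - x_i) for 1-D data."""
    diffs = x_eval[:, None] - samples[None, :]
    kernels = np.exp(-diffs**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
grid = np.linspace(-8.0, 8.0, 1601)
dx = grid[1] - grid[0]
true_pdf = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)

est_small = parzen_estimate(grid, rng.standard_normal(10), sigma=0.5)
est_large = parzen_estimate(grid, rng.standard_normal(1000), sigma=0.5)

# Integrated squared error against the true density, for comparison.
err_small = np.sum((est_small - true_pdf) ** 2) * dx
err_large = np.sum((est_large - true_pdf) ** 2) * dx
```

With a fixed kernel size the estimate converges to the true pdf convolved with the kernel, so the error for large N is dominated by the smoothing bias rather than by sampling noise.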

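Information Theoretic Learning: Renyi’s Quadratic Entropy
Order-2 entropy & Gaussian kernels: H_2(X) = -\log \int f_X^2(x) \, dx, estimated as \hat{H}_2(X) = -\log V_2(X) with V_2(X) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G_{\sigma\sqrt{2}}(x_i - x_j).
Pairwise interactions between samples: O(N^2).
The information potential V_2(X) provides a potential field over the space of the samples, parameterized by the kernel size σ.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.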
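The pairwise form of the information potential follows because the integral of a product of two Gaussians of size σ is a Gaussian of size σ√2 evaluated at the difference of their centers. The sketch below, for scalar data with illustrative sample values and kernel size, computes V_2 from the double sum and cross-checks it against a direct numerical integration of the squared Parzen estimate.

```python
import numpy as np

def information_potential(samples, sigma):
    """V_2(X) = (1/N^2) sum_i sum_j G_{sigma*sqrt(2)}(x_i - x_j), 1-D data."""
    s2 = 2.0 * sigma**2                      # variance of the sqrt(2)*sigma kernel
    diffs = samples[:, None] - samples[None, :]
    gauss = np.exp(-diffs**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return gauss.mean()

def quadratic_entropy(samples, sigma):
    """Renyi's quadratic entropy estimate, -log V_2(X)."""
    return -np.log(information_potential(samples, sigma))

samples = np.array([-1.0, 0.0, 0.5, 2.0])
sigma = 0.7
v2 = information_potential(samples, sigma)
h2 = quadratic_entropy(samples, sigma)

# Cross-check: integrate the squared Parzen estimate on a fine grid.
grid = np.linspace(-15.0, 15.0, 60001)
kern = np.exp(-(grid[:, None] - samples[None, :]) ** 2 / (2.0 * sigma**2))
parzen = kern.mean(axis=1) / (np.sqrt(2.0 * np.pi) * sigma)
v2_numeric = np.sum(parzen**2) * (grid[1] - grid[0])
```

No explicit pdf estimate is needed for the double sum, which is the practical appeal of the estimator. The cost is the O(N^2) pairwise evaluation.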

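Information Theoretic Learning: Information Force
[Figure: two samples x_i and x_j interacting.]
• In adaptation, samples become information particles that interact through information forces.
Information potential: V_2(X) = \frac{1}{N^2} \sum_i \sum_j G_{\sigma\sqrt{2}}(x_i - x_j)
Information force on sample x_i: F(x_i) = \frac{\partial V_2(X)}{\partial x_i} = \frac{1}{N^2 \sigma^2} \sum_j G_{\sigma\sqrt{2}}(x_i - x_j) (x_j - x_i)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000. Erdogmus, Principe, Hild, Natural Computing, 2002.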
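A sketch of the information force as the derivative of the information potential with respect to each sample, for scalar data with a Gaussian kernel. The normalization (1/(N^2 σ^2)) follows from differentiating the double-sum potential used here and may differ by a constant factor from other presentations. The sample values are illustrative.

```python
import numpy as np

def information_potential(x, sigma):
    """V_2(X) with a Gaussian kernel of size sigma*sqrt(2), 1-D data."""
    s2 = 2.0 * sigma**2
    d = x[:, None] - x[None, :]
    return np.mean(np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2))

def information_forces(x, sigma):
    """F_i = dV_2/dx_i = (1/(N^2 sigma^2)) sum_j G(x_i - x_j) (x_j - x_i)."""
    n = x.size
    s2 = 2.0 * sigma**2
    d = x[:, None] - x[None, :]              # d[i, j] = x_i - x_j
    g = np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return -(g * d).sum(axis=1) / (n**2 * sigma**2)

points = np.array([-1.0, 0.0, 0.5, 2.0])
forces = information_forces(points, 0.7)
```

The forces are attractive: following them increases the information potential and therefore decreases the quadratic entropy. The outermost samples are pulled inward, and the forces sum to zero because each pairwise interaction is antisymmetric.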

What will happen if we allow the particles to move under the influence of these forces? Information force within a dataset arising due to H(X)

Information Theoretic Learning: Backpropagation of Information Forces
[Block diagram: Input → Adaptive System → Output. The Output and the Desired signal feed the IT Criterion, whose Information Forces drive the Adjoint Network to produce the Weight Updates.]
Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.

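Information Theoretic Learning: Quadratic divergence measures
Kullback-Leibler divergence: D_{KL}(f \| g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx
Renyi’s divergence: D_\alpha(f \| g) = \frac{1}{\alpha - 1} \log \int f(x) \left( \frac{f(x)}{g(x)} \right)^{\alpha - 1} dx
Euclidean distance: D_{ED}(f, g) = \int (f(x) - g(x))^2 \, dx
Cauchy-Schwarz distance: D_{CS}(f, g) = -\log \frac{\left( \int f(x) g(x) \, dx \right)^2}{\int f^2(x) \, dx \int g^2(x) \, dx}
Mutual information is a special case (the divergence between the joint and the product of the marginals).
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.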

Information Theoretic Learning Unifying criterion for learning from samples

Training ADALINE sample by sample: Stochastic information gradient (SIG)
Theorem: The expected value of the stochastic information gradient (SIG) is the gradient of Shannon’s entropy estimated from the samples using Parzen windowing.
For the Gaussian kernel and M=1, the form is the same as for LMS, except that entropy learning works with differences in samples.
The SIG works implicitly with the L1 norm of the error.
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
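A hedged sketch of sample-by-sample training with the SIG. It assumes that M=1 means a Parzen window of one past sample, so that with a Gaussian kernel the entropy estimate at time k is (e_k - e_{k-1})^2 / (2σ^2) plus a constant. The plant weights, step size, and data are hypothetical choices for the demonstration, not values from the paper.

```python
import numpy as np

def train_adaline_sig(inputs, desired, step=0.02, sigma=1.0):
    """Minimize the one-sample error-entropy estimate for e = d - w^T x.

    The gradient of (e_k - e_{k-1})^2 / (2 sigma^2) with respect to w is
    -(e_k - e_{k-1}) (x_k - x_{k-1}) / sigma^2, so the update looks like LMS
    applied to differences of consecutive samples.
    """
    w = np.zeros(inputs.shape[1])
    for k in range(1, len(inputs)):
        dx = inputs[k] - inputs[k - 1]
        de = (desired[k] - desired[k - 1]) - w @ dx   # e_k - e_{k-1}
        w += step * de * dx / sigma**2                # descend the entropy estimate
    return w

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0])          # hypothetical plant to identify
x = rng.standard_normal((3000, 2))
d = x @ w_true                          # noise-free desired signal
w_hat = train_adaline_sig(x, d)
```

Because the criterion depends only on differences of errors, it is blind to a constant offset in the error. A bias term, if needed, has to be set separately, for example by matching the mean of the output to the mean of the desired signal.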

SIG and Hebbian updates
In a linear network y = w^T x, the Hebbian update is \Delta w_k = \eta \, y_k x_k.
The update maximizing Shannon output entropy with the SIG becomes \Delta w_k = \eta \, (y_k - y_{k-1}) (x_k - x_{k-1}), with the kernel size absorbed into \eta.
Which is more powerful and biologically plausible?
Experiment: 50 samples were generated from a 2D distribution in which the x axis is uniform, the y axis is Gaussian, and the sample covariance matrix is the identity. Hebbian updates would converge to any direction, but the SIG consistently found the 90 degree direction!
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.

ITL - Applications
System identification, feature extraction, clustering, blind source separation.
www.cnel.ufl.edu: ITL has examples and Matlab code.

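Renyi’s cross entropy
Let X and Y be two r.v.s with iid samples {x_i}, i = 1, ..., N_x, and {y_j}, j = 1, ..., N_y. Then Renyi’s cross entropy is given by H(X;Y) = -\log \int f_X(z) f_Y(z) \, dz.
Using the Parzen estimate for the pdfs gives \hat{H}(X;Y) = -\log \frac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} G_{\sigma\sqrt{2}}(x_i - y_j).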

“Cross” information potential and “cross” information force Force between particles of two datasets

Cross information force between two datasets arising due to H(X;Y)

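Cauchy-Schwarz Divergence
A measure of similarity between two datasets: D_{cs}(X;Y) = \log \frac{\int f_X^2(z) \, dz \int f_Y^2(z) \, dz}{\left( \int f_X(z) f_Y(z) \, dz \right)^2} = 2 H(X;Y) - H(X) - H(Y).
D_{cs}(X;Y) \geq 0, with D_{cs}(X;Y) = 0 ⇔ same probability density functions.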
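A sketch of how the Cauchy-Schwarz divergence between two datasets can be estimated from samples, by plugging Gaussian Parzen estimates into the three integrals. The data are scalar, and the kernel size and the two sample sets are illustrative assumptions.

```python
import numpy as np

def cross_information_potential(x, y, sigma):
    """V(X;Y) = (1/(Nx*Ny)) sum_i sum_j G_{sigma*sqrt(2)}(x_i - y_j), 1-D data."""
    s2 = 2.0 * sigma**2
    d = x[:, None] - y[None, :]
    return np.mean(np.exp(-d**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2))

def cauchy_schwarz_divergence(x, y, sigma):
    """D_cs(X;Y) = log( V(X) V(Y) / V(X;Y)^2 ) from Parzen estimates."""
    vxx = cross_information_potential(x, x, sigma)
    vyy = cross_information_potential(y, y, sigma)
    vxy = cross_information_potential(x, y, sigma)
    return np.log(vxx * vyy / vxy**2)

rng = np.random.default_rng(2)
a = rng.standard_normal(200)
b = rng.standard_normal(200) + 3.0      # same shape, shifted mean

d_same = cauchy_schwarz_divergence(a, a, sigma=0.5)
d_shift = cauchy_schwarz_divergence(a, b, sigma=0.5)
```

The estimate is zero when a dataset is compared with itself and positive otherwise, since the three potentials are exact integrals of the Parzen estimates and the Cauchy-Schwarz inequality applies to them directly.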

A New ITL Framework: Information Theoretic Mean Shift
STATEMENT: Consider a dataset Xo with iid samples. We wish to find a new dataset X which captures “interesting structures” of the original dataset Xo.
FORMULATION: Cost = redundancy reduction term + similarity measure term (a weighted combination).

Information Theoretic Mean Shift
Form 1: J(X) = \min_X \; H(X) + \lambda \, D_{cs}(X;Xo)
This cost looks like a reaction diffusion equation:
• The entropy term implements diffusion
• The Cauchy-Schwarz term implements attraction to the original data

Analogy
• The weighting parameter λ squeezes the information flow through a bottleneck, extracting different levels of structure in the data.
• We can also visualize λ as a slope parameter. The previous methods used only λ=1 or

Self organizing rule
Rewriting the cost function as J(X) = (1 - \lambda) H(X) + 2 \lambda H(X;Xo) - \lambda H(Xo), differentiating with respect to x_k, k = 1, 2, …, N, and rearranging gives a fixed point update for each x_k!
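One way to arrive at a fixed-point update is sketched below, under stated assumptions: the cost is J(X) = (1 - λ) H(X) + 2λ H(X;Xo) up to a constant, both entropies are estimated with Gaussian information potentials, and `sigma` denotes the kernel size inside those potentials. The function names and the coefficients c1 and c2 are notation introduced here. This is a derivation from the cost, not a transcription of the slide.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Unnormalized Gaussian kernel values between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def itms_step(x, x0, lam, sigma):
    """One fixed-point iteration obtained by setting dJ/dx_k = 0 for all k.

    Kernel normalization constants cancel, so unnormalized kernels are used.
    """
    kxx = gaussian_gram(x, x, sigma)        # interactions within X
    kx0 = gaussian_gram(x, x0, sigma)       # interactions between X and Xo
    c1 = (1.0 - lam) / kxx.sum()            # weight of the entropy term
    c2 = lam / kx0.sum()                    # weight of the cross-entropy term
    num = c1 * (kxx @ x) + c2 * (kx0 @ x0)
    den = c1 * kxx.sum(axis=1) + c2 * kx0.sum(axis=1)
    return num / den[:, None]

rng = np.random.default_rng(3)
x0 = rng.standard_normal((30, 2))                 # original data Xo
x = x0 + 0.1 * rng.standard_normal((30, 2))       # current estimate X

step_lam0 = itms_step(x, x0, lam=0.0, sigma=0.8)
step_lam1 = itms_step(x, x0, lam=1.0, sigma=0.8)
step_mid = itms_step(x, x0, lam=0.5, sigma=0.8)
```

With λ = 0 only the within-X term survives and each point moves to a kernel-weighted mean of X. With λ = 1 only the cross term survives and each point moves to a kernel-weighted mean of Xo. For λ > 1 the coefficient c1 becomes negative, so the denominator is no longer guaranteed to be positive and the iteration needs more care than this sketch provides.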

An Example Crescent shaped Dataset

Effect of λ

Summary 2
Starting with the data: λ = 0 gives a single point, λ = 1 gives the modes, and λ → ∞ gives back the data.

Applications - Clustering
Statement: Segment data into different groups such that samples belonging to the same group are “closer” to each other than samples of different groups.
The idea: mode finding ability → clustering.

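Mean Shift – a review
Modes are stationary points of the Parzen density estimate, \nabla \hat{f}(x) = 0, which for a Gaussian kernel rearranges to the fixed-point equation x = m(x) = \frac{\sum_i G_\sigma(x - x_i) \, x_i}{\sum_i G_\sigma(x - x_i)}.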

Two variants: GBMS and GMS
Gaussian Blurring Mean Shift (GBMS): single dataset X. Initialize X = Xo.
Gaussian Mean Shift (GMS): two datasets X and Xo. Initialize X = Xo.
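The difference between the two variants is which dataset the kernel-weighted mean is taken over. The sketch below is a minimal version of both, with a Gaussian kernel, a fixed number of iterations instead of a convergence test, and a synthetic two-cluster dataset. All of these are assumptions for illustration.

```python
import numpy as np

def kernel_weights(a, b, sigma):
    """Gaussian kernel values between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

def gms(x0, sigma, n_iter=200):
    """Gaussian Mean Shift: X moves, the original data Xo stays fixed."""
    x = x0.copy()
    for _ in range(n_iter):
        w = kernel_weights(x, x0, sigma)
        x = w @ x0 / w.sum(axis=1, keepdims=True)
    return x

def gbms(x0, sigma, n_iter=5):
    """Gaussian Blurring Mean Shift: the dataset itself is replaced each step."""
    x = x0.copy()
    for _ in range(n_iter):
        w = kernel_weights(x, x, sigma)
        x = w @ x / w.sum(axis=1, keepdims=True)
    return x

rng = np.random.default_rng(4)
cluster_a = rng.standard_normal((40, 2)) * 0.3 + np.array([-3.0, 0.0])
cluster_b = rng.standard_normal((40, 2)) * 0.3 + np.array([3.0, 0.0])
data = np.vstack([cluster_a, cluster_b])

modes = gms(data, sigma=0.5)
blurred = gbms(data, sigma=0.5)
```

In GMS every point climbs the fixed density estimate of Xo, so the result stays tied to the modes of the original data. In GBMS the data are blurred repeatedly, so the iteration count acts as a stopping criterion. Run long enough, each group of points contracts toward a single location.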

Connection to ITMS
λ = 1 gives GMS. λ = 0 gives GBMS.

Applications- Clustering 10 Random Gaussian Clusters and its pdf plot

GBMS result GMS result

Image segmentation

GBMS GMS

Applications - Principal Curves
• Nonlinear extension of PCA.
• “Self-consistent” smooth curves which pass through the “middle” of a d-dimensional probability distribution or data cloud.
A new definition (Erdogmus et al.): A point x is an element of the d-dimensional principal set, denoted by P^d, iff the gradient of the pdf at x is orthogonal to at least (n-d) eigenvectors of the Hessian of the pdf at x, and the pdf at x is a strict local maximum in the subspace spanned by these eigenvectors.

PC continued…
• P^0 is a 0-dimensional principal set corresponding to the modes of the data. P^1 is the 1-dimensional principal curve, P^2 is a 2-dimensional principal surface, and so on.
• Hierarchical structure: P^0 ⊂ P^1 ⊂ P^2 ⊂ …
• ITMS satisfies this definition (experimentally).
• Gives the principal curve for a suitable range of λ.

Principal curve of spiral data passing through the modes

Denoising Chain of Ring Dataset

Applications - Vector Quantization
• Limiting case of ITMS (λ → ∞).
• Dcs(X;Xo) can be seen as a distortion measure between X and Xo.
• Initialize X with far fewer points than Xo.

Comparison ITVQ LBG
