
Information Theoretic Learning Finding structure in data ...

Jose Principe and Sudhir Rao University of Florida principe@cnel.ufl.edu www.cnel.ufl.edu. Information Theoretic Learning Finding structure in data. Outline. Structure Connection to Learning Learning Structure – the old view A new framework Applications. Structure.





Presentation Transcript


  1. Information Theoretic Learning: Finding structure in data. Jose Principe and Sudhir Rao, University of Florida, principe@cnel.ufl.edu, www.cnel.ufl.edu

  2. Outline • Structure • Connection to Learning • Learning Structure – the old view • A new framework • Applications

  3. Structure • Patterns / Regularities • Amorphous / chaos • Interdependence between subsystems • White noise

  4. Connection to Learning

  5. Types of Learning • Supervised Learning: data plus a desired signal/teacher • Reinforcement Learning: data plus rewards/punishments • Unsupervised Learning: only the data

  6. Unsupervised Learning • What can be done with only the data? First principles, with examples: • Preserve maximum information: auto-associative memory, ART, PCA, Linsker’s “infomax” rule … • Extract independent features: Barlow’s minimum redundancy principle, ICA, etc. • Learn the probability distribution: Gaussian mixture models, EM algorithm, parametric density estimation

  7. Connection to Self Organization • “If cell 1 is one of the cells providing input to cell 2, and if cell 1’s activity tends to be “high” whenever cell 2’s activity is “high”, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase.” (Donald Hebb, 1949, neuropsychologist) • What is the purpose? “Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network’s functioning as an information processing system?” (Linsker, 1988) • [Diagram: cells A, B, and C; increase w_b in proportion to the activity of B and C.]

  8. Linsker’s Infomax principle • Linear network: inputs X1, X2, …, XL-1, XL with weights w1, …, wL, additive noise, and output Y. • Under Gaussian assumptions and uncorrelated noise the rate for a linear network is • Maximize Rate = Maximize Shannon Rate I(X,Y) • Hebbian Rule!!

  9. Minimum Entropy Coding • Barlow’s redundancy principle: Stimulus 1, …, Stimulus M are recoded into Feature 1, …, Feature N. • Independent features mean no redundancy: ICA!!! • Converting an M-dimensional problem → N one-dimensional problems. • N conditional probabilities required for an event V: P(V|Feature i). • 2^M conditional probabilities required for an event V: P(V|stimuli).

  10. Summary 1 • Global objective function (example: Infomax): extracting the desired signal from the data itself. • Self-organizing rule (example: Hebbian rule): revealing the structure through interaction of the data points. • Unsupervised learning (example: PCA): discovering structure in data.

  11. Questions • Can we go beyond these preprocessing stages? • Can we create a global cost function which extracts “goal oriented structures” from the data? • Can we derive a self-organizing principle from such a cost function? A big YES!!!

  12. What is Information Theoretic Learning? ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence. Its centerpiece is a non-parametric estimator for entropy that: • Does not require an explicit estimation of the pdf • Uses the Parzen window method, which is known to be consistent and efficient • Is smooth • Is readily integrated in conventional gradient descent learning • Provides a link to kernel learning and SVMs • Allows an extension to random processes

  13. ITL is a different way of thinking about data quantification. Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb’s postulate of learning) into 2nd-order statistical equivalents. ITL replaces 2nd-order moments with a geometric statistical interpretation of data in probability spaces: • Variance by entropy • Correlation by correntropy • Mean square error (MSE) by minimum error entropy (MEE) • Distances in data space by distances in probability spaces

  14. Information Theoretic Learning: Entropy • Not all random variables (r.v.) are equally random! [Figure: two example pdfs f_x(x) plotted over x from -5 to 5.] • Entropy quantifies the degree of uncertainty in a r.v. Claude Shannon defined entropy as $H_S(X) = -\int f_X(x)\,\log f_X(x)\,dx$

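  15. Information Theoretic Learning: Renyi’s Entropy • $H_\alpha(X) = \frac{1}{1-\alpha}\,\log\int f_X^\alpha(x)\,dx$ • Norm of the pdf: $\int f_X^\alpha(x)\,dx$ is the α-norm of the pdf raised to the power α. • Renyi’s entropy equals Shannon’s as $\alpha \to 1$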
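As a small numerical illustration of the limit stated on this slide, the sketch below evaluates Renyi's entropy for a discrete distribution and compares it with Shannon's entropy as α approaches 1. The discrete setting, the function names, and the example probabilities are assumptions made for illustration; the slides themselves work with continuous pdfs.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in nats) of a discrete distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

# Hypothetical example distribution.
p = [0.5, 0.25, 0.125, 0.125]
for alpha in (2.0, 1.5, 1.1, 1.001):
    print(alpha, renyi_entropy(p, alpha))
print("Shannon", shannon_entropy(p))
```

The printed values should approach the Shannon value as α gets closer to 1; for α = 2 the entropy depends only on the sum of squared probabilities, which is what makes the order-2 case convenient for sample-based estimation.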

  16. Information Theoretic Learning: Parzen windowing • Given only samples $x_1, \dots, x_N$ drawn from a distribution: $\hat f_X(x) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(x - x_i)$ • [Figure: the kernel function and the Parzen estimates f_x(x) for N = 10 and N = 1000, plotted over x from -5 to 5, illustrating convergence.]
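A minimal sketch of Parzen windowing with a Gaussian kernel, assuming one-dimensional data. The function names, the kernel size of 0.3, and the standard-normal test data are hypothetical choices for illustration, not taken from the slides.

```python
import math
import random

def gaussian_kernel(u, sigma):
    """Gaussian density with standard deviation sigma evaluated at u."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def parzen_pdf(x, samples, sigma):
    """Parzen estimate at x: the average of kernels centered on the samples."""
    return sum(gaussian_kernel(x - s, sigma) for s in samples) / len(samples)

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
estimate = parzen_pdf(0.0, samples, sigma=0.3)
print(estimate)
```

With a fixed kernel size the estimate approaches the true density smoothed by the kernel, so the kernel size trades bias against variance.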

  17. Information Theoretic Learning: Renyi’s Quadratic Entropy • Order-2 entropy with Gaussian kernels reduces to pairwise interactions between samples, O(N²). • The information potential, V_2(X), provides a potential field over the space of the samples, parameterized by the kernel size σ. Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
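The pairwise structure can be sketched as follows, assuming the usual form of the estimator: plugging a Gaussian Parzen estimate of size σ into the integral of the squared pdf leaves a double sum of Gaussians of size σ√2. Function names and the toy data are hypothetical.

```python
import math

def gaussian(u, s):
    """Gaussian density with standard deviation s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def information_potential(samples, sigma):
    """V_2(X): mean of all pairwise kernel interactions, O(N^2) evaluations.

    Two Gaussian kernels of size sigma convolve into one of size
    sigma * sqrt(2), which is why that size appears here.
    """
    n = len(samples)
    s = sigma * math.sqrt(2.0)
    total = sum(gaussian(xi - xj, s) for xi in samples for xj in samples)
    return total / (n * n)

def quadratic_entropy(samples, sigma):
    """Renyi's quadratic entropy estimate: minus the log of the potential."""
    return -math.log(information_potential(samples, sigma))

tight = [0.0, 0.1, 0.2]
spread = [0.0, 5.0, 10.0]
print(quadratic_entropy(tight, 0.5), quadratic_entropy(spread, 0.5))
```

Tightly packed samples interact strongly, giving a high potential and a low entropy; widely spread samples do the opposite.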

  18. Information Theoretic Learning: Information Force • In adaptation, samples become information particles that interact through information forces. • The information force on a sample x_i is the derivative of the information potential with respect to x_i; it is a sum of pairwise interactions between x_i and the other samples x_j. Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000. Erdogmus, Principe, Hild, Natural Computing, 2002.
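A sketch of the information force as the exact derivative of the sample information potential, assuming one-dimensional data and a Gaussian kernel. The normalization follows from differentiating the double sum directly and may differ by a constant from the convention in the cited references; all names are hypothetical.

```python
import math

def gaussian(u, s):
    """Gaussian density with standard deviation s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def information_potential(samples, sigma):
    """V_2(X) = (1/N^2) * sum_i sum_j G_{sigma*sqrt(2)}(x_i - x_j)."""
    n = len(samples)
    s = sigma * math.sqrt(2.0)
    return sum(gaussian(a - b, s) for a in samples for b in samples) / (n * n)

def information_forces(samples, sigma):
    """dV_2/dx_i for every sample.

    The factor 2 appears because x_i enters the symmetric double sum
    through both indices.
    """
    n = len(samples)
    s = sigma * math.sqrt(2.0)
    forces = []
    for xi in samples:
        pull = sum(gaussian(xi - xj, s) * (xj - xi) for xj in samples)
        forces.append(2.0 * pull / (n * n * s * s))
    return forces

print(information_forces([0.0, 0.4, 1.5], 0.5))
```

Each force points toward nearby samples, so moving along the forces increases the potential (lowers the entropy); the forces over a dataset sum to zero because the pairwise interactions are equal and opposite.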

  19. What will happen if we allow the particles to move under the influence of these forces? Information force within a dataset arising due to H(X)

  20. Information Theoretic Learning: Backpropagation of Information Forces • Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation. • [Block diagram labels: Input, Adaptive System, Output, Desired, IT Criterion, Information Forces, Adjoint Network, Weight Updates.]

  21. Information Theoretic Learning: Quadratic divergence measures • Kullback-Leibler divergence: • Renyi’s divergence: • Euclidean distance: • Cauchy-Schwarz distance: • Mutual information is a special case (divergence between the joint and the product of marginals). Principe, Fisher, Xu, Unsupervised Adaptive Filtering, (S. Haykin), Wiley, 2000.
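The formulas for these measures did not survive in the transcript. For orientation, the standard textbook definitions for two pdfs f and g are collected below; the notation is an assumption and may differ from the slide, and some texts define the Cauchy-Schwarz measure without the square inside the logarithm, which halves its value.

```latex
% Kullback-Leibler divergence
D_{KL}(f \,\|\, g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx

% Renyi's divergence of order \alpha
D_{\alpha}(f \,\|\, g) = \frac{1}{\alpha-1}\,\log\int f(x)\left(\frac{f(x)}{g(x)}\right)^{\alpha-1} dx

% Euclidean distance between pdfs
D_{ED}(f,g) = \int \bigl(f(x)-g(x)\bigr)^{2}\,dx

% Cauchy-Schwarz distance
D_{CS}(f,g) = -\log\frac{\left(\int f(x)\,g(x)\,dx\right)^{2}}{\int f^{2}(x)\,dx\,\int g^{2}(x)\,dx}

% Mutual information as a divergence D between joint and product of marginals
I_{D}(X;Y) = D\bigl(f_{XY},\, f_{X} f_{Y}\bigr)
```

The Euclidean and Cauchy-Schwarz measures involve only integrals of products of pdfs, which is why they pair naturally with the information potential estimator.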

  22. Information Theoretic Learning Unifying criterion for learning from samples

  23. Training ADALINE sample by sample: stochastic information gradient (SIG) • Theorem: the expected value of the stochastic information gradient (SIG) is the gradient of Shannon’s entropy estimated from the samples using Parzen windowing. • For the Gaussian kernel and M=1, the form is the same as for LMS except that entropy learning works with differences in samples. • The SIG works implicitly with the L1 norm of the error. Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
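One way to read the M = 1 case: with a Gaussian kernel, the instantaneous entropy estimate of the error is, up to a constant, (e_k − e_{k−1})² / (2σ²), so its gradient gives an LMS-like step on sample differences. The sketch below assumes a linear error e_k = d_k − wᵀx_k and minimization of the error entropy; the function names, step size, and toy data are hypothetical and not taken from the cited letter.

```python
import random

def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_adaline_sig(inputs, desired, sigma=1.0, eta=0.05, epochs=20):
    """Sample-by-sample training on differences of consecutive samples (M = 1)."""
    w = [0.0] * len(inputs[0])
    for _ in range(epochs):
        for k in range(1, len(inputs)):
            # Both errors are evaluated with the current weights (an assumption).
            e_now = desired[k] - dot(w, inputs[k])
            e_prev = desired[k - 1] - dot(w, inputs[k - 1])
            # Descent step on (e_now - e_prev)^2 / (2 sigma^2).
            scale = eta * (e_now - e_prev) / sigma ** 2
            w = [wi + scale * (a - b)
                 for wi, a, b in zip(w, inputs[k], inputs[k - 1])]
    return w

random.seed(1)
inputs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
# Hypothetical target: linear map plus a constant offset.
desired = [2.0 * x[0] - 1.0 * x[1] + 0.7 for x in inputs]
weights = train_adaline_sig(inputs, desired)
print(weights)
```

Because entropy is insensitive to the mean of the error, the constant offset cancels in the differences: the weights can be recovered, but a bias term would have to be set separately.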

  24. SIG Hebbian updates • In a linear network, the Hebbian update is compared with the update that maximizes Shannon output entropy using the SIG. Which is more powerful and biologically plausible? • Experiment: 50 samples were generated from a 2D distribution where the x axis is uniform, the y axis is Gaussian, and the sample covariance matrix is the identity. • Result: Hebbian updates would converge to any direction, but SIG consistently found the 90-degree direction! Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.

  25. ITL - Applications • System identification • Feature extraction • Clustering • Blind source separation • www.cnel.ufl.edu: the ITL page has examples and Matlab code.

  26. Renyi’s cross entropy • Let X and Y be two r.v.s with iid samples. Then Renyi’s cross entropy is given by • Using the Parzen estimate for the pdfs gives

  27. “Cross” information potential and “cross” information force Force between particles of two datasets

  28. Cross information force between two datasets arising due to H(X;Y)

  29. Cauchy-Schwarz Divergence • A measure of similarity between two datasets. • It is zero when the two datasets have the same probability density functions.
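A sketch of a sample-based Cauchy-Schwarz divergence, assuming Gaussian Parzen estimates of size σ for both datasets, so that each integral of a product of pdfs becomes a double sum of Gaussians of size σ√2 (the information potentials and the “cross” information potential). The exact form on the slide was not preserved; function names and data are hypothetical.

```python
import math

def gaussian(u, s):
    """Gaussian density with standard deviation s evaluated at u."""
    return math.exp(-0.5 * (u / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cross_information_potential(xs, ys, sigma):
    """V(X;Y): mean kernel interaction between samples of two datasets."""
    s = sigma * math.sqrt(2.0)
    total = sum(gaussian(x - y, s) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def cs_divergence(xs, ys, sigma):
    """D_CS(X;Y) = log( V(X) V(Y) / V(X;Y)^2 ), non-negative by Cauchy-Schwarz."""
    v_x = cross_information_potential(xs, xs, sigma)
    v_y = cross_information_potential(ys, ys, sigma)
    v_xy = cross_information_potential(xs, ys, sigma)
    return math.log(v_x * v_y / (v_xy * v_xy))

a = [0.0, 0.5, 1.0]
b = [3.0, 3.5, 4.0]
print(cs_divergence(a, a, 0.5), cs_divergence(a, b, 0.5))
```

The divergence vanishes for identical datasets and grows as the two sample sets move apart; minus the log of the cross potential plays the role of the cross entropy H(X;Y).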

  30. A New ITL Framework: Information Theoretic Mean Shift • STATEMENT: Consider a dataset Xo with iid samples. We wish to find a new dataset X which captures “interesting structures” of the original dataset Xo. • FORMULATION: Cost = Redundancy Reduction term + Similarity Measure term (weighted combination).

  31. Information Theoretic Mean Shift, Form 1 • This cost looks like a reaction-diffusion equation: • The entropy term implements diffusion. • The Cauchy-Schwarz term implements attraction to the original data.

  32. Analogy The weighting parameter λ squeezes the information flow through a bottleneck extracting different levels of structure in the data. • We can also visualize λ as a slope parameter. The previous methods used only λ=1 or

  33. Self-organizing rule • Rewriting the cost function as • Differentiating with respect to x_k, k = 1, 2, …, N, and rearranging gives • Fixed-point update!!
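The slide's equations were lost in the transcript. The derivation below is a reconstruction under stated assumptions, not the authors' exact expression: it takes the cost to be H(X) + λ·D_CS(X;Xo), uses Gaussian-kernel sample estimators with N points in X and N_0 points in Xo, and writes s = σ√2. The constants c_1 and c_2 follow from this derivation and may differ from the published form by a convention-dependent factor.

```latex
% Sample estimators
V(X) = \frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{s}(x_i - x_j), \qquad
V(X;X_0) = \frac{1}{N N_0}\sum_{i=1}^{N}\sum_{j=1}^{N_0} G_{s}(x_i - x_{0j})

% Cost, with H(X) = -\log V(X) and D_{CS}(X;X_0) = \log\frac{V(X)\,V(X_0)}{V(X;X_0)^{2}}
J(X) = H(X) + \lambda\,D_{CS}(X;X_0)
     = -(1-\lambda)\log V(X) - 2\lambda\log V(X;X_0) + \lambda\log V(X_0)

% Stationarity \partial J/\partial x_k = 0, using
% \partial G_{s}(x_k - y)/\partial x_k = G_{s}(x_k - y)\,(y - x_k)/s^{2}
c_1\sum_{j=1}^{N} G_{s}(x_k - x_j)\,(x_j - x_k)
 + c_2\sum_{j=1}^{N_0} G_{s}(x_k - x_{0j})\,(x_{0j} - x_k) = 0,
\qquad c_1 = \frac{1-\lambda}{N\,V(X)}, \quad c_2 = \frac{\lambda}{N_0\,V(X;X_0)}

% Fixed-point iteration
x_k \leftarrow
\frac{c_1\sum_{j=1}^{N} G_{s}(x_k - x_j)\,x_j + c_2\sum_{j=1}^{N_0} G_{s}(x_k - x_{0j})\,x_{0j}}
     {c_1\sum_{j=1}^{N} G_{s}(x_k - x_j) + c_2\sum_{j=1}^{N_0} G_{s}(x_k - x_{0j})}
```

Setting λ = 0 leaves only the first sums (each point moves to a kernel-weighted mean of the current dataset), and λ = 1 leaves only the second (each point moves to a kernel-weighted mean of the original data).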

  34. An Example Crescent shaped Dataset

  35. Effect of λ

  36. Summary 2 • Starting with the data: • λ = 0: single point • λ = 1: modes • λ → ∞: back to the data

  37. Applications - Clustering • Statement: segment data into different groups such that samples belonging to the same group are “closer” to each other than samples of different groups. • The idea: mode-finding ability → clustering.

  38. Mean Shift – a review Modes are stationary points of the equation,

  39. Two variants: GBMS and GMS • Gaussian Blurring Mean Shift (GBMS): single dataset X; initialize X = Xo. • Gaussian Mean Shift (GMS): two datasets X and Xo; initialize X = Xo.
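The difference between the two variants can be sketched in a few lines, assuming one-dimensional data and a Gaussian kernel: GMS always averages over the fixed original data Xo, while GBMS averages over the dataset as it moves. Function names, the kernel size, the iteration count, and the toy data are hypothetical.

```python
import math

def gaussian_weight(u, sigma):
    """Unnormalized Gaussian weight; the normalizer cancels in the ratio."""
    return math.exp(-0.5 * (u / sigma) ** 2)

def mean_shift_step(points, reference, sigma):
    """Move every point to the kernel-weighted mean of the reference set."""
    moved = []
    for x in points:
        weights = [gaussian_weight(x - r, sigma) for r in reference]
        moved.append(sum(w * r for w, r in zip(weights, reference)) / sum(weights))
    return moved

def gms(data, sigma, iterations=100):
    """Gaussian Mean Shift: X moves, the original data Xo stays fixed."""
    x = list(data)
    for _ in range(iterations):
        x = mean_shift_step(x, data, sigma)
    return x

def gbms(data, sigma, iterations=100):
    """Gaussian Blurring Mean Shift: the dataset is its own reference."""
    x = list(data)
    for _ in range(iterations):
        x = mean_shift_step(x, x, sigma)
    return x

data = [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1]
print(gms(data, 0.5))
print(gbms(data, 0.5))
```

GMS drives each point to a mode of the kernel density estimate of Xo. GBMS keeps shrinking the dataset and, run long enough, tends toward a single point, so in practice it needs a stopping rule; with well-separated clusters, as here, the merge between clusters is extremely slow.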

  40. Connection to ITMS • λ = 1: GMS • λ = 0: GBMS

  41. Applications - Clustering: 10 random Gaussian clusters and their pdf plot

  42. GBMS result GMS result

  43. Image segmentation

  44. GBMS GMS

  45. Applications - Principal Curves • Nonlinear extension of PCA. • “Self-consistent” smooth curves which pass through the “middle” of a d-dimensional probability distribution or data cloud. A new definition (Erdogmus et al.): A point is an element of the d-dimensional principal set, denoted by iff is orthonormal to at least (n-d) eigenvectors of and is a strict local maximum in the subspace spanned by these eigenvectors.

  46. PC continued… • is a 0-dimensional principal set corresponding to modes of the data. is the 1-dimensional principal curve, is a 2-dimensional principal surface and so on … • Hierarchical structure, . . • ITMS satisfies this definition (experimentally). • Gives principal curve for .

  47. Principal curve of spiral data passing through the modes

  48. Denoising Chain of Ring Dataset

  49. Applications - Vector Quantization • Limiting case of ITMS (λ → ∞). • D_CS(X;Xo) can be seen as a distortion measure between X and Xo. • Initialize X with far fewer points than Xo.

  50. Comparison: ITVQ vs. LBG
