Information Theoretic Learning Jose C. Principe Yiwen Wang Computational NeuroEngineering Laboratory Electrical and Computer Engineering Department University of Florida www.cnel.ufl.edu principe@cnel.ufl.edu
Acknowledgments • Dr. Deniz Erdogmus • My students: Puskal Pokharel, Weifeng Liu, Jianwu Xu, Kyu-Hwa Jeong, Sudhir Rao, Seungju Han • NSF ECS-0300340 and ECS-0601271 (Neuroengineering program)
Resources • CNEL Website www.cnel.ufl.edu • Front page, go to ITL resources • (tutorial, examples, MATLAB code) • Publications
Information Filtering • Deniz Erdogmus and Jose Principe, “From Linear Adaptive Filtering to Nonlinear Information Processing,” IEEE Signal Processing Magazine, November 2006.
Outline • Motivation • Renyi’s entropy definition • A sample by sample estimator for entropy • Projections based on mutual information • Applications • Optimal Filtering • Classification • Clustering • Conclusions
Information • Data is everywhere! • Wireless communications • Remote sensing • Speech processing • Biomedical applications • Sensor arrays
From Data to Models • Optimal adaptive models: y = f(x, w) • (Block diagram: input data x feeds an adaptive system producing output y; the error e = d − y between the output and the desired data d drives the cost function and the learning algorithm.)
From Linear to Nonlinear Mappings • Wiener showed us how to compute optimal linear projections. The LMS/RLS algorithms showed us how to find the Wiener solution sample by sample. • Neural networks brought us the ability to work non-parametrically with nonlinear function approximators. • Linear regression → nonlinear regression • Optimum linear filtering → TLFNs • Linear projections (PCA) → principal curves • Linear discriminant analysis → MLPs
Adapting Linear and Nonlinear Models • The goal of learning is to optimize the performance of the parametric mapper according to some cost function. • In classification, the goal is to minimize the probability of error; in regression, to minimize the error in the fit. • The most widely used cost function has been the mean square error (MSE). It provides the maximum likelihood solution when the error is Gaussian distributed. • In NONLINEAR systems this is hardly ever the case.
Beyond Second-Order Statistics • We submit that the goal of learning should be to transfer as much information as possible from the inputs to the weights of the system (no matter if unsupervised or supervised). • As such, the learning criterion should be based on entropy (single data source) or divergence (multiple data sources). • Hence the challenge is to find nonparametric, sample-by-sample estimators for these quantities.
ITL: Unifying Learning Scheme • Normally supervised and unsupervised learning are treated differently, but there is no need to do so. One can define a general class of cost functions based on information theory that applies to both learning schemes. • Cost function (minimize, maximize, or nullify): 1. Entropy (a single group of RVs) 2. Divergence (two or more groups of RVs)
ITL: Unifying Learning Scheme • Function Approximation • Minimize Error Entropy • Classification • Minimize Error Entropy • Maximize Mutual Information between class labels and outputs • Jaynes’ MaxEnt • Maximize output entropy • Linsker’s Maximum Information Transfer • Maximize MI between input and output • Optimal Feature Extraction • Maximize MI between desired and output • Independent Component Analysis • Maximize output entropy • Minimize Mutual Information among outputs
Information Theory • A probabilistic description of random variables that quantifies the very essence of the communication process. • It has been instrumental in the design and quantification of communication systems. • Information theory provides a quantitative and consistent framework to describe processes with partial knowledge (uncertainty).
Information Theory • Not all random events are equally random! How can we quantify this fact? • Shannon proposed the concept of ENTROPY.
Formulation of Shannon’s Entropy • Hartley information (1928): $I(x_k) = \log_2 \frac{1}{p_k}$ • Large probability → small information • Small probability → large information • Two identical channels should have twice the capacity of one • The base-2 logarithm is a natural measure that provides this additivity
Formulation of Shannon’s Entropy • Entropy is the expected value of the Hartley information: $H_S(X) = E[I(X)] = -\sum_k p_k \log_2 p_k$ • Communications – ultimate data compression (H: channel capacity for asymptotically error-free communication) • A measure of (relative) uncertainty • Shannon used a principled, axiomatic approach to define entropy
Review of Information Theory • Shannon entropy: $H_S(X) = -\sum_k p(x_k)\log p(x_k)$ • Mutual information: $I_S(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$ • Kullback-Leibler divergence: $D_{KL}(p\,\|\,q) = \sum_k p(x_k)\log\frac{p(x_k)}{q(x_k)}$
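As a quick illustration of these three discrete-variable definitions, here is a minimal sketch (not from the slides; the function names and toy distributions are my own) that computes them with NumPy:

```python
# A minimal sketch computing Shannon entropy, KL divergence, and mutual
# information for discrete distributions.
import numpy as np

def shannon_entropy(p):
    """H(X) = -sum_k p_k log2 p_k, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken as 0
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_k p_k log2(p_k / q_k); assumes q_k > 0 where p_k > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def mutual_information(pxy):
    """I(X;Y) = D_KL( p(x,y) || p(x)p(y) ) for a joint probability table."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl_divergence(pxy.ravel(), (px * py).ravel())

# Example: a fair coin has 1 bit of entropy; independent variables have I = 0.
print(shannon_entropy([0.5, 0.5]))                           # 1.0
print(mutual_information(np.outer([0.5, 0.5], [0.3, 0.7])))  # ~0.0
```

Mutual information is computed here as the KL divergence between the joint table and the product of its marginals, which anticipates the relation used later in the talk.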
Properties of Shannon’s Entropy • Discrete RVs • H(X) ≥ 0 • H(X) ≤ log N, with equality iff X is uniform • H(Y|X) ≤ H(Y), with equality iff X, Y are independent • H(X,Y) = H(X) + H(Y|X) • Continuous RVs • Replace the summation with an integral (differential entropy) • Minimum entropy: the pdf collapses to a sum of delta functions • Maximum entropy: fixed variance → Gaussian; fixed upper/lower limits → uniform
Properties of Mutual Information • I_S(X;Y) = H_S(X) + H_S(Y) − H_S(X,Y) = H_S(X) − H_S(X|Y) = H_S(Y) − H_S(Y|X) • I_S(X;Y) = I_S(Y;X) • I_S(X;X) = H_S(X) • (Venn diagram relating H_S(X), H_S(Y), H_S(X,Y), H_S(X|Y), H_S(Y|X), and I_S(X;Y))
A Different View of Entropy • Shannon’s entropy: $H_S(X) = -\int p(x)\log p(x)\,dx$ • Renyi’s entropy: $H_\alpha(X) = \frac{1}{1-\alpha}\log\int p^\alpha(x)\,dx$ • Fisher’s entropy (local) • Renyi’s entropy becomes Shannon’s as $\alpha \to 1$
Renyi’s Entropy • α-norm of the pdf: $V_\alpha(X) = \int p^\alpha(x)\,dx = E\!\left[p^{\alpha-1}(X)\right]$ • Entropy in terms of V: $H_\alpha(X) = \frac{1}{1-\alpha}\log V_\alpha(X)$
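A quick worked example (my addition, not on the original slide): for the uniform density p(x) = 1/a on [0, a],

$$ V_\alpha(X) = \int_0^a a^{-\alpha}\,dx = a^{1-\alpha}, \qquad H_\alpha(X) = \frac{1}{1-\alpha}\log a^{1-\alpha} = \log a \quad \text{for every } \alpha > 0,\ \alpha \neq 1, $$

which agrees with the Shannon differential entropy of the uniform density, log a, recovered in the limit α → 1.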
Properties of Renyi’s Entropy • (a) It is a continuous function of all the probabilities • (b) It is permutationally symmetric • (c) H(1/n, …, 1/n) is an increasing function of n • (d) Recursivity • (e) Additivity: if p and q are independent, $H_\alpha(pq) = H_\alpha(p) + H_\alpha(q)$
Properties of Renyi’s Entropy • Renyi’s entropy provides both an upper and a lower bound for the probability of error in classification, unlike Shannon’s entropy, which provides only a lower bound (Fano’s inequality, the tightest such bound).
Nonparametric Entropy Estimators (only continuous variables are interesting…) • Plug-in estimates • Integral estimates • Resubstitution estimates • Splitting-data estimates • Cross-validation estimates • Sample-spacing estimates • Nearest-neighbor distances
Parzen Window Method • Place a kernel over each sample, add, and normalize: $\hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x - x_i)$. The entropy then becomes a function of a continuous RV. • A kernel is a positive function that integrates to 1 and peaks at the sample location (e.g., the Gaussian).
Parzen Windows • (Figure: example kernel shapes, such as the Laplacian and the uniform.)
Parzen Windows • Smooth estimator • Arbitrarily close fit as N → ∞ and σ → 0 • Curse of dimensionality: the previous pictures are for d = 1 dimension; for a linear increase in d, an exponential increase in N is required for an equally “good” approximation. • In ITL we use Parzen windows not to estimate the pdf itself but to estimate the 2-norm of the pdf, which corresponds to the first moment of the pdf.
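To make the estimator concrete, here is a minimal 1-D Parzen-window sketch (my own illustration, not the CNEL MATLAB code; the function name and toy data are assumptions):

```python
# A minimal Parzen-window sketch: place a Gaussian kernel of width sigma on
# every sample and average.
import numpy as np

def parzen_pdf(x_eval, samples, sigma):
    """Estimate p(x) at the points x_eval from 1-D samples."""
    x_eval = np.atleast_1d(x_eval)[:, None]        # shape (M, 1)
    samples = np.asarray(samples)[None, :]         # shape (1, N)
    gauss = np.exp(-0.5 * ((x_eval - samples) / sigma) ** 2) \
            / (np.sqrt(2 * np.pi) * sigma)
    return gauss.mean(axis=1)                      # average over the N kernels

# Example: samples from a standard Gaussian; the estimate should peak near 0.
rng = np.random.default_rng(0)
samples = rng.standard_normal(500)
print(parzen_pdf([0.0, 2.0], samples, sigma=0.3))
```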
Renyi’s Quadratic Entropy Estimation • Quadratic entropy (α = 2): $H_2(X) = -\log\int p^2(x)\,dx = -\log V_2(X)$ • $V_2(X)$ is called the information potential • Use Parzen-window pdf estimation with a (symmetric) Gaussian kernel • Information potential: think of the samples as particles (as in a gravitational or electrostatic field) that interact with the others according to a law given by the kernel shape.
IP as an Estimator of Quadratic Entropy • Information potential (IP): $\hat{V}_2(X) = \frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N} G_{\sigma\sqrt{2}}(x_j - x_i)$, and $\hat{H}_2(X) = -\log \hat{V}_2(X)$
IP as an Estimator of Quadratic Entropy • There is NO approximation in computing the information potential for α = 2 beyond the choice of the kernel. • This result is the kernel trick used in support vector machines. • It means that we never explicitly estimate the pdf, which greatly improves the applicability of the method.
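A minimal sketch of this estimator in Python (my own illustration under the Gaussian-kernel assumption above; the function names are mine):

```python
# Quadratic information potential: pairwise Gaussian interactions of width
# sigma*sqrt(2), averaged over all N^2 pairs; H2 = -log V2.
import numpy as np

def information_potential(x, sigma):
    """V2(X) = (1/N^2) sum_j sum_i G_{sigma*sqrt(2)}(x_j - x_i) for 1-D samples."""
    x = np.asarray(x, float)
    diffs = x[:, None] - x[None, :]                # all pairwise differences
    s2 = sigma * np.sqrt(2.0)                      # width of the convolved kernel
    g = np.exp(-0.5 * (diffs / s2) ** 2) / (np.sqrt(2 * np.pi) * s2)
    return g.mean()                                # (1/N^2) * sum of interactions

def renyi_quadratic_entropy(x, sigma):
    """H2(X) = -log V2(X)."""
    return -np.log(information_potential(x, sigma))

# Example: a wider (higher-entropy) sample set yields a smaller IP, larger H2.
rng = np.random.default_rng(1)
print(renyi_quadratic_entropy(rng.standard_normal(300), sigma=0.5))
print(renyi_quadratic_entropy(3 * rng.standard_normal(300), sigma=0.5))
```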
Information Force (IF) • Between two information particles (IPTs): $F(x_j; x_i) = \frac{\partial}{\partial x_j} G_{\sigma\sqrt{2}}(x_j - x_i)$ • Overall force on $x_j$: $F(x_j) = \frac{\partial \hat{V}_2(X)}{\partial x_j}$, the net force that all the other particles exert on $x_j$.
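A sketch of the overall forces in 1-D (my own illustration; some references carry an extra constant factor from differentiating the symmetric double sum, so treat the overall scale as a convention):

```python
# Information forces: derivative of the Gaussian-kernel interactions with
# respect to each sample location (1-D, kernel width sigma*sqrt(2)).
import numpy as np

def information_forces(x, sigma):
    """Net 'pull' exerted on each sample x_j by all the other samples."""
    x = np.asarray(x, float)
    diffs = x[:, None] - x[None, :]                # x_j - x_i
    s2 = sigma * np.sqrt(2.0)
    g = np.exp(-0.5 * (diffs / s2) ** 2) / (np.sqrt(2 * np.pi) * s2)
    # derivative of the kernel wrt x_j, summed over i and normalized by N^2
    # (up to a constant factor, depending on the convention used)
    return (-diffs / s2**2 * g).sum(axis=1) / x.size**2

# A sample far to the left of the bulk of the data is pulled right
# (positive force), i.e. toward higher information potential.
print(information_forces(np.array([-3.0, 0.0, 0.1, 0.2]), sigma=0.5))
```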
Central “Moments” • Mean: $E[X]$ • Variance: $E\!\left[(X - E[X])^2\right]$ • Entropy (quadratic): $H_2(X) = -\log E[p(X)]$, i.e., a function of the first moment of the pdf itself
Moment Estimation • Mean: $\hat{m} = \frac{1}{N}\sum_i x_i$ • Variance: $\hat{\sigma}^2 = \frac{1}{N}\sum_i (x_i - \hat{m})^2$ • Entropy: $\hat{H}_2(X) = -\log\left[\frac{1}{N^2}\sum_j\sum_i G_{\sigma\sqrt{2}}(x_j - x_i)\right]$
Which of the Two Extremes? • Must the pdf estimate be accurate for ITL to work in practice? • Or does ITL (minimization/maximization) not require an accurate pdf estimate at all? • Neither of the above, but the question is still not fully characterized.
How to Select the Kernel Size • Different values of σ produce different entropy estimates. We suggest using 3σ ≈ 10% of the dynamic range (interaction among roughly 10 samples). • Or use Silverman’s rule: $\sigma = 0.9\,A\,N^{-1/5}$, where A stands for the minimum of the empirical data standard deviation and the data interquartile range scaled by 1.34. • The kernel size is just a scale parameter.
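A sketch of this rule of thumb in Python (the 0.9 constant is the common form of Silverman's rule; it did not appear explicitly on the extracted slide, so treat it as an assumption):

```python
# Silverman's rule of thumb: sigma = 0.9 * A * N^(-1/5), with A the smaller of
# the sample standard deviation and the interquartile range divided by 1.34.
import numpy as np

def silverman_bandwidth(x):
    x = np.asarray(x, float)
    n = x.size
    std = x.std(ddof=1)
    iqr = np.subtract(*np.percentile(x, [75, 25]))   # 75th minus 25th percentile
    a = min(std, iqr / 1.34)
    return 0.9 * a * n ** (-1 / 5)

rng = np.random.default_rng(2)
print(silverman_bandwidth(rng.standard_normal(1000)))   # roughly 0.22 for N = 1000
```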
Extension to Any Kernel • We do not need to use Gaussian kernels in the Parzen estimator. • We can use any kernel that is symmetric and differentiable (κ(0) > 0, κ′(0) = 0 and κ″(0) < 0). • We normally work with kernels scaled from a unit-size kernel: $\kappa_\sigma(x) = \frac{1}{\sigma}\,\kappa\!\left(\frac{x}{\sigma}\right)$
Extension to Any α • Redefine the information potential as $V_\alpha(X) = \int p^\alpha(x)\,dx = E\!\left[p^{\alpha-1}(X)\right]$ • Using the Parzen estimator we obtain $\hat{V}_\alpha(X) = \frac{1}{N}\sum_{j=1}^{N}\left[\frac{1}{N}\sum_{i=1}^{N}\kappa_\sigma(x_j - x_i)\right]^{\alpha-1}$ • For α = 2 this corresponds exactly to the quadratic estimator with the proper kernel width σ.
Extension to Any α, Any Kernel • The α-information potential and the α-information force can be expressed in terms of the quadratic quantities; in particular the α-IF is proportional to $\hat{p}^{\alpha-2}(x_j)\,F_2(x_j)$, where F2(X) is the quadratic IF. • Hence we see that the “fundamental” definitions are the quadratic IP and IF, and the “natural” kernel is the Gaussian.
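A sketch of the plug-in α-information potential written on the previous slide (my own Python illustration; the function names are assumptions):

```python
# Alpha-information potential: sample mean of the Parzen density estimate
# raised to the (alpha - 1) power; H_alpha = log(V_alpha) / (1 - alpha).
import numpy as np

def alpha_information_potential(x, sigma, alpha):
    x = np.asarray(x, float)
    diffs = x[:, None] - x[None, :]
    kernel = np.exp(-0.5 * (diffs / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    p_hat = kernel.mean(axis=1)                 # Parzen estimate at each sample
    return np.mean(p_hat ** (alpha - 1))

def renyi_entropy(x, sigma, alpha):
    """H_alpha(X) = 1/(1 - alpha) * log V_alpha(X)."""
    return np.log(alpha_information_potential(x, sigma, alpha)) / (1 - alpha)

rng = np.random.default_rng(3)
x = rng.standard_normal(500)
print(renyi_entropy(x, sigma=0.4, alpha=2.0))   # quadratic case (up to the kernel-width convention)
print(renyi_entropy(x, sigma=0.4, alpha=1.5))
```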
Kullback-Leibler Divergence • The KL divergence measures the “distance” between pdfs (Csiszar and Amari): $D_{KL}(p\,\|\,q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$ • Also called: relative entropy, cross entropy, information for discrimination
Mutual Information & KL Divergence • Shannon’s mutual information is the KL divergence between the joint pdf and the product of the marginals: $I_S(X;Y) = D_{KL}\big(p(x,y)\,\|\,p(x)p(y)\big)$ • Statistical independence: $p(x,y) = p(x)\,p(y) \iff I_S(X;Y) = 0$
KL Divergence is NOT a Distance • Ideally, a distance should be: non-negative; null only if the pdfs are equal; symmetric; and satisfy the triangular inequality. • In reality, the KL divergence satisfies only the first two: it is not symmetric and does not obey the triangular inequality. • (Figure: three pdfs f1, f2, f3 illustrating this.)
New Divergences and Quadratic Mutual Information • Euclidean distance between pdfs (quadratic mutual information, ED-QMI): $I_{ED}(X;Y) = \iint \big(p(x,y) - p(x)p(y)\big)^2\,dx\,dy$ • Cauchy-Schwarz divergence and CS-QMI: $D_{CS}(f,g) = -\log\frac{\left(\int f(x)\,g(x)\,dx\right)^2}{\int f^2(x)\,dx\,\int g^2(x)\,dx}$, with $I_{CS}(X;Y) = D_{CS}\big(p(x,y),\,p(x)p(y)\big)$
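A minimal plug-in sketch of the CS divergence between two sample sets, built from the information potentials introduced earlier (my own illustration; the Gaussian kernel and function names are assumptions):

```python
# Cauchy-Schwarz divergence estimate: D_CS = -log[ V_c^2 / (V_f * V_g) ],
# where V_f, V_g are the within-set IPs and V_c is the cross-IP.
import numpy as np

def _ip(a, b, sigma):
    """(Cross) information potential between 1-D sample sets a and b."""
    diffs = np.asarray(a, float)[:, None] - np.asarray(b, float)[None, :]
    s2 = sigma * np.sqrt(2.0)
    g = np.exp(-0.5 * (diffs / s2) ** 2) / (np.sqrt(2 * np.pi) * s2)
    return g.mean()

def cs_divergence(a, b, sigma):
    vf, vg, vc = _ip(a, a, sigma), _ip(b, b, sigma), _ip(a, b, sigma)
    return -np.log(vc ** 2 / (vf * vg))

rng = np.random.default_rng(4)
print(cs_divergence(rng.standard_normal(300), rng.standard_normal(300), 0.5))      # close to 0
print(cs_divergence(rng.standard_normal(300), 3 + rng.standard_normal(300), 0.5))  # clearly > 0
```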
One Example (figure)