Computational Intelligence: Methods and Applications Lecture 25 Kernel methods Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch
Kernels! Kernel trick: if the vectors are transformed by some function Φ (usually non-linear) into a high-dimensional space, separation of the data may be easier to achieve. Replace X by Φ(X): this leads to the same problem formulation, except that X is replaced everywhere by Φ(X); in particular the Lagrangian contains only scalar products of the transformed vectors. These scalar products are calculated between vectors in the transformed space; instead of calculating them directly it is sufficient to define a kernel function K(X,Y) = Φ(X)·Φ(Y). What kind of functions correspond to scalar products in Hilbert spaces? They should be symmetric; the formal conditions were established in mathematical analysis by Mercer, and they may influence convergence.
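As a quick illustration (not from the slides), a minimal numpy sketch of this idea: the dual problem touches the data only through pairwise scalar products, so it is enough to supply a Gram matrix computed with any kernel function K(X,Y); the linear and Gaussian kernels below are just illustrative choices.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Kernel (Gram) matrix K_ij = K(X_i, X_j) for a set of row vectors X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Linear kernel = ordinary scalar product; any Mercer kernel can be swapped in.
linear = lambda x, y: np.dot(x, y)
rbf = lambda x, y, s=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * s ** 2))

X = np.random.randn(5, 2)          # 5 toy vectors in 2-D
K_lin = gram_matrix(X, linear)     # what the linear SVM dual uses
K_rbf = gram_matrix(X, rbf)        # same dual, implicit high-dimensional space
```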
Kernel example The simplest choice is the polynomial kernel; for example, a quadratic kernel in 2-D, K(X,Y) = (1 + X·Y)². Use of this kernel is equivalent to working in a 5-dimensional feature space: a hyperplane found in 5-D by the linear SVM corresponds to a quadratic function in 2-D. Try to show that a quadratic border in the (X1, X2) space becomes a hyperplane in the kernel space. The selection of the kernel may strongly influence the results.
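A small check of this equivalence, assuming the inhomogeneous quadratic kernel K(X,Y) = (1 + X·Y)² (the slide's own formula is not reproduced in the text): the explicit map has one constant plus five genuine features, and its scalar product reproduces the kernel value exactly.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (1 + x.y)^2 in 2-D:
    one constant plus 5 genuine features."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.5])

k_direct = (1.0 + np.dot(x, y)) ** 2   # kernel evaluated directly in 2-D
k_mapped = np.dot(phi(x), phi(y))      # scalar product in the mapped space
assert np.isclose(k_direct, k_mapped)  # identical values
```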
Other popular kernels Some popular kernels working as scalar products: Gaussian, K(X,Y) = exp(−||X−Y||²/2σ²); sigmoidal, K(X,Y) = tanh(a X·Y + b); distance, K(X,Y) = ||X−Y||^b. Dimensionality of the Φ space: the number of independent polynomial products, or the number of training vectors. For the distance kernel with b=2 (squared Euclidean distance) the linear case is recovered. In complex cases (e.g. protein comparison) the kernel is a similarity function designed especially for the problem.
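A sketch of these kernels in Python, using the standard textbook parameterizations (the exact forms on the slide are not shown, so the parameters σ, a, b below are assumptions):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoidal_kernel(x, y, a=1.0, b=0.0):
    # Not positive definite for all a, b -- a kernel only in a loose sense.
    return np.tanh(a * np.dot(x, y) + b)

def distance_kernel(x, y, b=2.0):
    # For b=2 this reduces to the linear (squared Euclidean distance) case.
    return np.linalg.norm(x - y) ** b
```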
Examples SMO, the Sequential Minimal Optimization algorithm for SVM with polynomial kernels, is implemented in WEKA/Yale and GM 1.5. The only user-adjustable parameters for the polynomial version are the margin-related parameter C and the degree of the polynomial kernel. In GM the optimal value of C may be found automatically by cross-validation training (but this may be costly). Note that the data should be standardized to avoid convergence problems. For other kernel functions – Gaussian, sigmoidal and exponential – additional kernel parameters may be adjusted by the user (in GM); a scikit-learn sketch of this setup is given below. Example 1: Gaussian data clusters with some kernels Example 2: Cleveland Heart data Example 3: Ljubljana cancer data
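For readers without WEKA or GM at hand, a hedged equivalent in scikit-learn (whose SVC also uses an SMO-type solver): standardize, fit a polynomial-kernel SVM, and let cross-validation pick C. The toy data and parameter grid are placeholders, not the datasets from the examples.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy two-class data standing in for the datasets named above.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
y = np.array([0] * 50 + [1] * 50)

# Standardize first (avoids convergence problems), then a polynomial-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2, C=1.0))

# C can be chosen by cross-validation, mirroring the automatic search in GM.
search = GridSearchCV(model, {"svc__C": [0.1, 1, 10, 100]}, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```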
Example 1: Gaussian mixtures Gaussian kernels work quite well here, giving close to the optimal Bayesian error (which can be computed only because we know the true distributions; it is not exact, since only a finite number of points is given). A 4th-degree polynomial kernel gives results very similar to a Gaussian kernel, C=1.
Example 2: Cleveland heart data Left: 2D MDS features, linear SVM, C=1, accuracy 81.9%. Right: support vectors removed; the margin is clear, and all vectors inside it are support vectors. Gaussian kernel, C=10000, 10xCV: 100% train, 79.3±7.8% test. Gaussian kernel, C=1, 10xCV: 93.8% train, 82.6±8.0% test. Auto C=32 and Gaussian dispersion 0.004: about 84.4±5.1% on test.
Example 3: Ljubljana cancer recurrence 286 events: 85 recurrence (29.7%) and 201 no recurrence (70.3%); 9 features: tumor-size, inv-nodes, deg-malig, etc. Linear kernel, C=1 (C=10 similar, C=100 hard to converge): whole data 75 errors, i.e. 73.8% accuracy; 10xCV: training 73.7±1.0%, test 71.1±8.3%. Linear kernel, C=0.01, 10xCV: training 70.6±0.7%, test 70.3±1.4% (base rate!). Polynomial kernel k=3, C=10 (opt), 10xCV: training 89.8±0.6%, test 74.2±7.9% (best for polynomial kernel). Gaussian kernel, opt C=1 and σ=1/4, 10xCV: training 88.0±3.4%, test 74.8±6.5% (best for Gaussian kernel). But a simple rule: Involved Nodes > 0 & Degree_malig = 3 has 77.1% accuracy!
Some applications SVM has found many applications, see the list at: http://www.clopinet.com/isabelle/Projects/SVM/applist.html • A few interesting applications, with highly competitive results: • On-line Handwriting Recognition, zip codes • 3D object recognition • Stock forecasting • Intrusion Detection Systems (IDSs) • Image classification • Detecting Steganography in digital images • Medical applications: diagnostics, survival rates ... • Technical: Combustion Engine Knock Detection • Elementary Particle Identification in High Energy Physics • Bioinformatics: protein properties, genomics, microarrays • Information retrieval, text categorization
Get kernelized! Discriminant function – just replace the dot product by the kernel: g(X) = Σi αi K(X(i),X) + b, where now αi may be negative, to avoid writing Yi explicitly. The number of support vectors in the separable case is small, but in the non-separable case it may get large – all vectors between the margins plus the errors. Kernels may be used in many discriminant methods, for example Kernel PCA or Kernel Fisher Discriminant Analysis. Covariance matrix after the transformation: C = (1/n) Σi Φ(X(i)) Φ(X(i))^T, where Φ(X) is a d-dimensional vector and Φ is the d×n matrix of mapped training vectors.
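A minimal sketch of the kernelized discriminant function described above; the support vectors, coefficients and bias here are made-up values for illustration, not the result of training.

```python
import numpy as np

def discriminant(x, SV, alpha, b, kernel):
    """g(x) = sum_i alpha_i K(SV_i, x) + b, with signed alpha_i
    (the class labels Y_i are folded into the coefficients)."""
    return sum(a * kernel(sv, x) for a, sv in zip(alpha, SV)) + b

rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))

# Illustrative support vectors and (signed) coefficients, not fitted values.
SV = np.array([[0.0, 1.0], [2.0, -1.0], [1.0, 1.5]])
alpha = np.array([0.7, -0.4, 0.9])
b = -0.1

x_new = np.array([1.0, 0.0])
print(np.sign(discriminant(x_new, SV, alpha, b, rbf)))   # predicted class
```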
Kernel PCA Eigenvalues and eigenvectors of the covariance matrix: the eigenvectors Z are combinations of the training vectors in the Φ space, Z = Φa, so the problem reduces to an eigenequation for the coefficients a. Φ is d×n, the coefficients a are n×d, and K is the n×n matrix of scalar products or kernel values, Kij=K(X(i),X(j)). Kernel PCA coefficients are obtained from this eigenequation, and the Z vectors are obtained as linear combinations; d is the dimension of the Φ space! We thus have non-linear PCA using linear methods! Good for classification, dimensionality reduction and visualization.
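A from-scratch sketch of kernel PCA under these assumptions (standard feature-space centring of K and the n×n eigenproblem for the expansion coefficients a); only the training-set projections are computed.

```python
import numpy as np

def kernel_pca(X, kernel, n_components=2):
    """Minimal kernel PCA: build K, centre it in feature space,
    solve the n x n eigenproblem and return training-set projections."""
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centring in Phi space
    eigval, eigvec = np.linalg.eigh(Kc)                   # ascending order
    idx = np.argsort(eigval)[::-1][:n_components]
    eigval, eigvec = eigval[idx], eigvec[:, idx]
    # Projection of training vector i onto component k is proportional to
    # sqrt(lambda_k) * a_ki once a_k is normalised in feature space.
    return eigvec * np.sqrt(np.maximum(eigval, 0.0))

rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
X = np.random.randn(30, 2)
Z = kernel_pca(X, rbf, n_components=2)   # non-linear "principal components"
```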
From linear to non-linear PCA The ||X−Y||^b kernel was used to illustrate how the lines of constant value of the first 2 components (features) change from straight lines for linear PCA (b=2) to non-linear ones for b=1.5, 1 and 0.5. (Figure: 1st and 2nd PCA components; from Schölkopf and Smola, Learning with Kernels, MIT Press 2002.)
Kernel PCA for Gaussians 2-D data with 3 Gaussians; kernel PCA was performed with Gaussian kernels. Lines of constant feature values follow the cluster densities! The first two kernel PCs separate the data nicely (made with B. Schölkopf's Matlab program). Linear PCA has only 2 components here, but kernel PCA has more, since the Φ space dimension is usually large – here each Gaussian centered at X(i), i=1..n, is one function. Example: use this Matlab program.
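The same kind of experiment can be sketched in Python with scikit-learn's KernelPCA (the three synthetic clusters and the gamma value are assumptions, not Schölkopf's original settings):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Three 2-D Gaussian clusters, in the spirit of the example above.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(50, 2) + c for c in ([0, 0], [4, 0], [2, 4])])

kpca = KernelPCA(n_components=4, kernel="rbf", gamma=0.5)
Z = kpca.fit_transform(X)   # more than 2 components are available,
                            # unlike linear PCA on 2-D data
print(Z.shape)              # (150, 4)
```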
Kernel FDA The Fisher criterion for linear discrimination was used for classification and visualization; for two classes, project X onto the Φ space and work with the matrix of scalar products or kernel values, Kij=K(X(i),X(j)). The scatter matrices are now calculated in the Φ space, and the kernel Fisher criterion has the same form, but the matrices are much bigger; a faster numerical solution is found via a quadratic optimization formulation.
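A sketch of the two-class kernel Fisher discriminant in the standard formulation (maximize a^T M a / a^T N a over the expansion coefficients a); the regularization term and the direct linear solve below are common choices, not necessarily the quadratic-optimization variant mentioned above.

```python
import numpy as np

def kernel_fda(X, y, kernel, reg=1e-3):
    """Two-class kernel Fisher discriminant: solution a ~ (N + reg*I)^-1 (M1 - M2)."""
    n = X.shape[0]
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    N = reg * np.eye(n)
    M_diff = np.zeros(n)
    for c, sign in zip(np.unique(y), (+1, -1)):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                     # n x n_c block of kernel columns
        M_diff += sign * Kc.mean(axis=1)   # class mean in the expansion
        nc = len(idx)
        N += Kc @ (np.eye(nc) - np.ones((nc, nc)) / nc) @ Kc.T
    alpha = np.linalg.solve(N, M_diff)
    # 1-D projection of any vector x: sum_j alpha_j K(X_j, x)
    project = lambda x: sum(a * kernel(xj, x) for a, xj in zip(alpha, X))
    return alpha, project

rbf = lambda u, v, s=1.0: np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 2])
y = np.array([0] * 20 + [1] * 20)
alpha, project = kernel_fda(X, y, rbf)
print(project(X[0]), project(X[-1]))   # projections of two training points
```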
Further reading Kernel methods were introduced a long time ago: M. Aizerman, E. Braverman, and L. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control 25, 821-837, 1964. Quadratic optimization was also used in the Adatron algorithm developed in statistical mechanics; large-margin classifiers, slack variables and other ideas also date back to the 1960s... Modern revival of these ideas: V.N. Vapnik, Statistical Learning Theory. Wiley, New York, 1998. B. Schölkopf, A.J. Smola, Learning with Kernels. MIT Press, 2002. Best source of tutorials, software and links: http://www.kernel-machines.org/
Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch
Density estimation Knowledge of the joint probability density P(C,X), or just P(X), allows one to do much more than discrimination! • Local maxima of probability density functions (PDFs) correspond to combinations of features defining objects in feature spaces. • By estimating PDFs we may create adaptive systems that learn from data with or without supervision. They are useful for: • Auto-association and hetero-association. • Completion of unknown parts of the input vector (content-addressable memory), prediction of missing values. • Extraction of logical rules, classical and probabilistic (or fuzzy). • Finding prototypes for objects or categories in feature spaces. • Using density functions as heuristics in the solution of complex problems, learning from partial information and solving complex problems.
Cognitive inspirations How do we recognize objects? Nobody really knows... Objects are characterized by features, combinations of features, or rather distributions of feature values in Feature Spaces (FS). A single object is a point in the FS; similar objects create a category, or a concept: for example a happy or sad face, corresponding to some area of the feature space. P(Angry|Face features) will have a maximum around one of the corners. In cognitive psychology FS are called "psychological spaces". The shape of the P(X|C) distribution may be quite complex, and is estimated using known samples to create a fuzzy prototype.
Object recognition A population of neural columns, each acting as a weak classifier recognizing some features, working in chorus – similar to "stacking". Second-order similarity in a low-dimensional (<300) space is sufficient. A face = a fuzzy point in the FS. The shape of the distribution P(Features|Face) is rather complex. Although neural processes are much more complex, the results of neurodynamics may be approximated by PDFs.
Missing features Suppose that one of the features X = (X1, X2, ..., Xd), for example X1, is missing. What is the most likely value for this feature? Frequently an average value E(X1) is used, but is this a reasonable idea? The average may fall in an area where there is no data! (Fig. 2.22, Duda, Hart & Stork, 2000.) In this case, if X2 is known, the best answer is the value corresponding to the maximum of the density at ω2. Recover missing values by searching for the maximum of the density!
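A short sketch of this recipe: estimate the density from complete cases with a kernel density estimator, then recover the missing X1 by scanning for the density maximum along the known X2; the clusters, bandwidth and grid are arbitrary illustration values.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Fit a density estimate on complete cases, then recover a missing X1 for a
# vector whose X2 is known by scanning for the maximum of the joint density.
rng = np.random.RandomState(0)
complete = np.vstack([rng.randn(100, 2) + [-2, 0], rng.randn(100, 2) + [2, 3]])

kde = KernelDensity(bandwidth=0.5).fit(complete)

x2_known = 3.0
candidates = np.linspace(-5, 5, 201)
grid = np.column_stack([candidates, np.full_like(candidates, x2_known)])
best_x1 = candidates[np.argmax(kde.score_samples(grid))]

print(best_x1)                  # near 2, where the density peak lies,
print(complete[:, 0].mean())    # while the plain average is near 0
```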
Maximum likelihood Suppose that the density P(X;θ) is approximated using a combination of some parameterized functions. Given a set of observations (data samples) D={X(i)}, i=1..n, what parameters should one choose? The parameters θ may also include missing values, as part of the model. A reasonable assumption is that the observed data D should have a high chance of being generated by the model P(D;θ). Assuming that the data vectors X(i) are independent, the likelihood of obtaining the dataset D is P(D;θ) = Πi P(X(i);θ). The most probable parameters of the model (including missing values) maximize the likelihood. To avoid products, take the logarithm and minimize −L(θ) = −Σi ln P(X(i);θ).
Solution The maximum is found by setting the derivative of the log-likelihood to zero, ∂L(θ)/∂θ = 0. Depending on the parameterization this can sometimes be solved analytically, but for almost all interesting functions (including Gaussian mixtures) iterative numerical minimization methods are used. Many local minima of the negative log-likelihood are expected, so the minimization problem may be difficult. Likelihood estimation may be carried out for samples from a given class, P(X|ω;θ), assuming that the likelihood of generating n such samples is the product of P(X(i)|ω;θ) over the samples, with the a priori class probabilities estimated from their frequencies. Such parametric models are called "generative" models.
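A minimal numerical example of this procedure, fitting a single 1-D Gaussian by minimizing the negative log-likelihood with scipy (here the analytic solution exists, which makes it easy to check the optimizer):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
D = rng.normal(loc=1.5, scale=2.0, size=500)

def neg_log_likelihood(params, data):
    mu, log_sigma = params                  # log-sigma keeps sigma positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum((data - mu) ** 2) / sigma ** 2 \
        + len(data) * (log_sigma + 0.5 * np.log(2 * np.pi))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(D,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                    # close to D.mean(), D.std()
```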
Example An example from "Maximum likelihood from incomplete data via the EM algorithm", Dempster, Laird & Rubin 1977, with data by Rao from population genetics. There are 197 observations of 4 types of bugs: n1=125 from species (class) ω1, n2=18 from class ω2, n3=20 from class ω3, and n4=34 from class ω4. An expert provided parametric expressions for the probabilities of finding these bugs: p1 = 1/2 + θ/4, p2 = p3 = (1−θ)/4, p4 = θ/4. Find the value of the parameter θ that maximizes the likelihood L(θ) ∝ p1^n1 p2^n2 p3^n3 p4^n4; the multiplicative constant n!/(n1! n2! n3! n4!) is not important here.
Solution Log-likelihood: L(θ) = n1 ln(1/2 + θ/4) + (n2+n3) ln((1−θ)/4) + n4 ln(θ/4). Setting the derivative n1/(2+θ) − (n2+n3)/(1−θ) + n4/θ = 0 gives a quadratic equation for θ that allows an analytical solution: θ = 0.6268; the model may then provide estimates of the expected frequencies. For all 4 classes, expected (observed) numbers: <n1>=129 (125), <n2>=18 (18), <n3>=18 (20), <n4>=31 (34). In practice analytic solutions are rarely possible.
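The same numbers can be reproduced numerically; the cell probabilities below are the standard Rao linkage model used by Dempster, Laird & Rubin (an assumption, since the slide's expressions are not reproduced in the text):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Observed counts from the example above.
n = np.array([125, 18, 20, 34])

def probs(theta):
    return np.array([0.5 + theta / 4, (1 - theta) / 4,
                     (1 - theta) / 4, theta / 4])

neg_log_lik = lambda t: -np.sum(n * np.log(probs(t)))
theta = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6),
                        method="bounded").x
print(round(theta, 4))                   # 0.6268
print(np.round(n.sum() * probs(theta)))  # expected counts ~ [129, 18, 18, 31]
```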
General formulation Given data vectors D={X(i)}, i=1..n, and a parametric function P(X|θ) that models the density of the data P(X), the best parameters should minimize the negative log-likelihood over all data samples. P(X|θ) is frequently a Gaussian mixture; for a single Gaussian the standard solution gives the usual formulas for the mean and variance. Assume now that X is not complete – features, or whole parts of the vector, are missing. Let Z=(X,Y) be the complete vector; the joint density is P(Z|θ) = P(X,Y|θ) = P(Y|X,θ) P(X|θ). An initial joint density may be formed by analyzing cases without missing values; the idea is to maximize the complete-data likelihood.
What to expect? E-step. The original likelihood function L(θ|X) is based on incomplete information, and since Y is unknown it may be treated as a random variable that should be estimated. The complete-data likelihood function L(θ|Z)=L(θ|X,Y) may be evaluated by calculating the expectation of the incomplete likelihood over Y. This is done iteratively: starting from the initial estimate θ^(i-1), a new estimate θ^(i) of the parameters and missing values is generated from Q(θ; θ^(i-1)) = E_Y[ ln L(θ|X,Y) | X, θ^(i-1) ], where X and θ^(i-1) are fixed, θ is a free variable, and the conditional expectation is calculated using the joint distribution of the (X,Y) variables with X fixed. See the detailed ML discussion in Duda, Hart & Stork, Chapter 3.
Computational Intelligence: Methods and Applications Lecture 27 Expectation Maximization algorithm, density modeling Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch
General formulation Given data vectors D={X(i)}, i=1..n, and a parametric function P(X|θ) that models the density of the data P(X), the best parameters should minimize the negative log-likelihood over all data samples. P(X|θ) is frequently a Gaussian mixture; for a single Gaussian the standard solution gives the usual formulas for the mean and variance. Assume now that X is not complete – some features or parts of the vector are missing. Let Z=(X,Y) be the complete vector; the joint density is P(Z|θ) = P(Y|X,θ) P(X|θ). An initial joint density may be formed by analyzing cases without missing values; the idea is to maximize the complete-data likelihood.
What to expect? E-step. The original likelihood function L(θ|X) is based on incomplete information, and since Y is unknown it may be treated as a random variable that should be estimated. The complete-data likelihood function L(θ|Z)=L(θ|X,Y) may be evaluated by calculating the expectation of the incomplete likelihood over Y. This is done iteratively: starting from the initial estimate θ^(i-1), a new estimate θ^(i) of the parameters and missing values is generated from Q(θ; θ^(i-1)) = E_Y[ ln L(θ|X,Y) | X, θ^(i-1) ], where X and θ^(i-1) are fixed, θ is a free variable, and the conditional expectation is calculated using the joint distribution of the (X,Y) variables with X fixed.
EM algorithm First step: calculate the expectation over the unknown variables to get the function Q(θ; θ^(i-1)). Second step: maximization – find new values of the parameters, θ^(i) = argmax_θ Q(θ; θ^(i-1)). Repeat until convergence, ||θ^(i) − θ^(i-1)|| < ε. The EM algorithm converges to a local maximum, since during the iterations the sequence of likelihoods is monotonically increasing and bounded. The EM algorithm is sensitive to initial conditions. A linear combination of k Gaussian distributions may be treated efficiently with the EM algorithm if the hidden variable v = 1..k that is estimated represents the number of the Gaussian from which each data point comes.
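A compact sketch of EM for a 1-D mixture of k Gaussians, with the responsibilities for the hidden component label v playing the role of the expectation step; the data and initialization are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k=2, n_iter=100, seed=0):
    """Plain EM for a 1-D mixture of k Gaussians (weights, means, sigmas)."""
    rng = np.random.RandomState(seed)
    w = np.full(k, 1.0 / k)
    mu = rng.choice(x, k, replace=False)
    sigma = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: responsibility of component j for point i
        r = w * norm.pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the responsibilities
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma

rng = np.random.RandomState(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
# Weights near 0.6/0.4 and means near -2 and 3 (component order depends on init).
print(em_gmm_1d(x, k=2))
```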
Example with missing data 4 data vectors, D = {X(1), ..., X(4)}; X^T = {(0,2), (1,0), (2,2), (?,4)}, where ? = missing. Data model: a Gaussian with a diagonal covariance matrix. The initial values of the parameters are improved by calculating the expectation over the missing value y = X1(4); let Xg = the known data. These functions are Gaussians; the first part does not depend on y, and the conditional distribution is P(y|x) = P(y,x)/P(x).
... missing data Conditional distribution: after some calculation, the maximum of Q gives θ^(1) = (0.75, 2.0, 0.938, 2.0)^T. EM converges in a few iterations here. (Fig. from Duda, Hart and Stork, Ch. 3.8.)
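A numeric check of this single EM step, assuming (as in the Duda, Hart & Stork example) an initial model with zero means and unit diagonal covariance:

```python
import numpy as np

# One EM step for the example above; the initial parameters are an assumption.
mu = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

# E-step: with a diagonal covariance, x1 is independent of x2, so the missing
# value has E[y] = mu[0] and E[y^2] = var[0] + mu[0]**2.
Ey, Ey2 = mu[0], var[0] + mu[0] ** 2

x1 = np.array([0.0, 1.0, 2.0, Ey])        # first feature, missing value replaced
x1_sq = np.array([0.0, 1.0, 4.0, Ey2])    # expected squares of the first feature
x2 = np.array([2.0, 0.0, 2.0, 4.0])       # second feature, fully observed

# M-step: new means and (diagonal) variances from the expected statistics.
mu_new = np.array([x1.mean(), x2.mean()])
var_new = np.array([(x1_sq - 2 * mu_new[0] * x1 + mu_new[0] ** 2).mean(),
                    ((x2 - mu_new[1]) ** 2).mean()])
print(mu_new, var_new)   # [0.75, 2.0] and [0.938, 2.0], as quoted above
```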
Some applications Reconstruction of missing values. Reconstruction of images; many medical applications. Reconstruction of signals in the presence of noise. Unsupervised learning – no information about classes is needed; more than clustering, a natural taxonomy. Modeling of data, estimation of hidden parameters in mixtures. Training of probabilistic models such as HMMs (Hidden Markov Models), useful in speech recognition, bioinformatics... Associative memory – finding the whole pattern (image) after seeing a fragment – although I have not yet seen it done with EM... Book: Geoffrey J. McLachlan, Thriyambakam Krishnan, The EM Algorithm and Extensions, Wiley 1996.
EM demos A few demonstrations of the EM algorithm for Gaussian mixtures may be found on the web: http://www-cse.ucsd.edu/users/ibayrakt/java/em/ http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html EM is also a basis for the "multiple imputation" approach to missing data. Each missing datum is replaced by m>1 simulated values, the m versions of the complete data are analyzed by standard methods, and the results are combined to produce inferential statements that incorporate missing-data uncertainty. Schafer, J.L. (1997) Analysis of Incomplete Multivariate Data, Chapman & Hall. Some demo software is available: http://www.stat.psu.edu/~jls/misoftwa.html Demonstration of EM in WEKA for clustering data.