Sparse Kernel Methods
Steve Gunn
Overview
• Part I: Introduction to Kernel Methods
• Part II: Sparse Kernel Methods
Part I
Introduction to Kernel Methods
Classification
• Consider the two-class problem
Optimal Separating Hyperplane
Separate the data with a hyperplane, such that the data is separated without error and the distance from the hyperplane to the closest vector is maximal.
Solution
The optimal hyperplane minimises
$\frac{1}{2}\|w\|^2$
subject to the constraints
$y_i(\langle w, x_i\rangle + b) \ge 1, \quad i = 1, \dots, l,$
and is obtained by finding the saddle point of the Lagrange functional
$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{l} \alpha_i\left[y_i(\langle w, x_i\rangle + b) - 1\right].$
Finding the OSH
Quadratic Programming Problem: maximise
$W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{l} \alpha_i y_i = 0$.
• Size is dependent upon training set size
• Unique global minimum
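This QP can be handed to any standard solver. As a quick illustration (not from the slides), scikit-learn's SVC with a very large C approximates the hard-margin OSH; the synthetic data and all parameter values are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (synthetic illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

# A very large C approximates the hard-margin optimal separating hyperplane.
svm = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", svm.coef_[0], " b =", svm.intercept_[0])
print("number of support vectors:", len(svm.support_))
```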
Support Vectors
• Information is contained in the support vectors
• The rest of the training data can be discarded
• SVs have non-zero Lagrange multipliers
Non-Separable Case
• Introduce slack variables $\xi_i \ge 0$
• Minimise $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i$ subject to $y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i$
• C is chosen a priori and determines the trade-off between margin maximisation and training error
Finding the GSH
Quadratic Programming Problem: the same dual as the separable case, with box constraints $0 \le \alpha_i \le C$.
• Size is dependent upon training set size
• Unique global minimum
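A sketch of the trade-off that C controls, again with scikit-learn on synthetic overlapping classes (all values are illustrative assumptions): smaller C tolerates more margin violations and typically retains more support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Overlapping classes: no hyperplane separates them without error.
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: support vectors={len(svm.support_):3d}, "
          f"training accuracy={svm.score(X, y):.2f}")
```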
Non-Linear SVM
• Map the input space to a high-dimensional feature space: $x \mapsto \phi(x)$
• Find the OSH or GSH in the feature space
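The point of the kernel trick is that the feature map never has to be computed explicitly. A small check (illustrative, not from the slides) that the degree-2 polynomial kernel equals an inner product under an explicit feature map:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(x) @ phi(y))        # inner product in feature space -> 25.0
print((x @ y + 1.0) ** 2)     # same value via the kernel, no explicit map
```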
Kernel Functions
Hilbert-Schmidt theory: $K(x, x')$ is a symmetric function that can be written as an inner product in some feature space, $K(x, x') = \langle \phi(x), \phi(x')\rangle$, provided it satisfies Mercer's conditions:
$\iint K(x, x')\, g(x)\, g(x')\, dx\, dx' \ge 0 \quad \text{for all } g \text{ with } \int g(x)^2\, dx < \infty.$
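Mercer's condition can be probed on a finite sample: the Gram matrix of a valid kernel must be positive semi-definite. A minimal sketch under that assumed setup (not from the slides):

```python
import numpy as np

def gram(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rbf = lambda x, y, s=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * s**2))

X = np.random.default_rng(2).normal(size=(30, 3))
eig = np.linalg.eigvalsh(gram(rbf, X))
print("smallest eigenvalue:", eig.min())  # >= 0 (up to rounding) for a Mercer kernel
```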
Acceptable Kernel Functions
• Polynomial: $K(x, x') = (\langle x, x'\rangle + 1)^d$
• Radial basis functions: $K(x, x') = \exp\left(-\|x - x'\|^2 / (2\sigma^2)\right)$
• Multi-layer perceptrons: $K(x, x') = \tanh(\kappa\langle x, x'\rangle - \delta)$ (satisfies Mercer's conditions only for certain $\kappa$, $\delta$)
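These three kernels are easy to write down directly; a short sketch (the parameter values are assumptions):

```python
import numpy as np

def polynomial(x, y, d=3):
    return (x @ y + 1.0) ** d          # Mercer kernel for any integer d >= 1

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))  # Mercer for sigma > 0

def mlp(x, y, kappa=1.0, delta=1.0):
    # tanh "MLP" kernel: a valid Mercer kernel only for some kappa, delta
    return np.tanh(kappa * (x @ y) - delta)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial(x, y), rbf(x, y), mlp(x, y))
```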
Generalisation
[Figure: generalisation error decomposed into estimation error and approximation error, plotted against model size.]
Regression
Approximate the data with a hyperplane, using a loss function, e.g. Vapnik's ε-insensitive loss
$L_\epsilon(y, f(x)) = \max\left(0,\; |y - f(x)| - \epsilon\right),$
and the SRM principle.
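The ε-insensitive loss is straightforward to implement; a minimal sketch (the ε value is an assumption):

```python
import numpy as np

def eps_insensitive(y, f, eps=0.1):
    """Vapnik's epsilon-insensitive loss: zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(y - f) - eps)

print(eps_insensitive(np.array([1.0, 1.5, 3.0]), np.array([1.05, 1.5, 2.0]), eps=0.1))
# -> [0.   0.   0.9]
```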
Solution
Introduce slack variables $\xi_i, \xi_i^*$ and minimise
$\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\xi_i + \xi_i^*)$
subject to the constraints
$y_i - \langle w, x_i\rangle - b \le \epsilon + \xi_i, \quad \langle w, x_i\rangle + b - y_i \le \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0.$
Finding the Solution
Quadratic Programming Problem
• Size is dependent upon training set size
• Unique global minimum
where the regression estimate takes the form
$f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) K(x_i, x) + b.$
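A quick illustration of support vector regression with scikit-learn's SVR (the RBF kernel, C, ε, and synthetic data are illustrative assumptions); only a subset of the training points end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 2 * np.pi, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# RBF support vector regression; epsilon sets the tube width, C the regularisation.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(svr.support_), "of", len(X), "training points")
```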
Part I : Summary • Unique Global Minimum • Addresses Curse of Dimensionality • Complexity dependent upon data set size • Information contained in Support Vectors
Part II
Sparse Kernel Methods
Cyclic Nature of Empirical Modelling
Design → Induce → Interpret → Validate → (and back to Design)
Induction
• SVMs have a strong theoretical foundation
• Good empirical performance
• Solution of the form
$f(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + b$
Interpretation
• Input selection
• Transparency
Additive Representation
• Additive structure: $f(x) = \sum_i f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j) + \dots$
• Transparent
• Rejection of redundant inputs
• Unique decomposition
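One way to obtain an additive model with kernel machinery is to use a kernel that is itself a sum of univariate kernels, one per input. A sketch under that assumption, using scikit-learn's precomputed-kernel interface (data, target, and parameters are all illustrative):

```python
import numpy as np
from sklearn.svm import SVR

def additive_rbf_gram(A, B, sigma=1.0):
    """Sum of univariate RBF kernels, one per input: an additive (transparent) model."""
    K = np.zeros((len(A), len(B)))
    for d in range(A.shape[1]):
        diff = A[:, d:d+1] - B[:, d:d+1].T
        K += np.exp(-diff**2 / (2 * sigma**2))
    return K

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (100, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2       # truly additive target; x3 is redundant

svr = SVR(kernel="precomputed", C=10.0, epsilon=0.05)
svr.fit(additive_rbf_gram(X, X), y)
print("training R^2:", svr.score(additive_rbf_gram(X, X), y))
```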
Sparse Kernel Regression
Previously: a single fixed kernel,
$f(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + b.$
Now: a weighted linear sum of sub-kernels,
$K(x, x') = \sum_k c_k K_k(x, x'), \quad c_k \ge 0.$
The Priors
• "Different priors for different parameters"
• Smoothness: controls overfitting
• Sparseness: enables input selection and controls overfitting
Sparse Kernel Model
Replace the kernel with a weighted linear sum of sub-kernels,
$K(x, x') = \sum_k c_k K_k(x, x'), \quad c_k \ge 0,$
and minimise the number of non-zero multipliers $c_k$, along with the standard support vector optimisation. The choice of penalty on the multipliers matters:
• $\ell_0$ (count of non-zeros): solution sparse, but optimisation hard
• $\ell_1$: optimisation easier (convex), solution sparse
• $\ell_2$: optimisation easier, but solution NOT sparse
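Relaxing the ℓ0 count to an ℓ1 penalty is what makes the optimisation convex while keeping the solution sparse. A minimal sketch of that idea (not the slides' own optimisation): a Lasso fit over Gram-matrix columns, so each column plays the role of one kernel basis function; the data and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, (60, 1)), axis=0)
y = np.sin(8 * X).ravel() + rng.normal(0, 0.05, 60)

# Each column of the Gram matrix is one kernel basis function K(x_i, .).
sigma = 0.1
K = np.exp(-(X - X.T) ** 2 / (2 * sigma**2))

# The l1 penalty stands in for counting non-zero multipliers (the l0 "norm"):
# the problem becomes convex and the solution stays sparse.
model = Lasso(alpha=1e-3, max_iter=50_000).fit(K, y)
print("non-zero multipliers:", np.sum(model.coef_ != 0), "of", len(y))
```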
Choosing the Sub-Kernels • Avoid additional parameters if possible • Sub-models should be flexible
Tensor Product Splines
The univariate spline which passes through the origin has a kernel of the form
$k(x, x') = xx' + xx'\min(x, x') - \frac{x + x'}{2}\min(x, x')^2 + \frac{\min(x, x')^3}{3},$
and the multivariate ANOVA kernel is given by
$K(\mathbf{x}, \mathbf{x}') = \prod_{i=1}^{n}\left(1 + k(x_i, x_i')\right).$
E.g. for a two-input problem the ANOVA kernel is given by
$K(\mathbf{x}, \mathbf{x}') = 1 + k(x_1, x_1') + k(x_2, x_2') + k(x_1, x_1')\,k(x_2, x_2').$
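A direct implementation of the two formulas above (the inputs are assumed to lie in [0, 1], as is usual for this spline kernel; the test points are illustrative):

```python
import numpy as np

def spline_origin(x, y):
    """First-order spline kernel through the origin (univariate, inputs in [0, 1])."""
    m = np.minimum(x, y)
    return x * y + x * y * m - (x + y) / 2 * m**2 + m**3 / 3

def anova_two_input(x, z):
    """Two-input ANOVA kernel: 1 + k(x1,z1) + k(x2,z2) + k(x1,z1)*k(x2,z2)."""
    k1 = spline_origin(x[0], z[0])
    k2 = spline_origin(x[1], z[1])
    return 1.0 + k1 + k2 + k1 * k2

print(anova_two_input(np.array([0.2, 0.7]), np.array([0.5, 0.3])))
```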
Sparse ANOVA Kernel
Introduce multipliers for each ANOVA term,
$K(\mathbf{x}, \mathbf{x}') = \sum_k c_k K_k(\mathbf{x}, \mathbf{x}'), \quad c_k \ge 0,$
and minimise the number of non-zero multipliers, along with the standard support vector optimisation.
Algorithm
Data → ANOVA Basis Selection → Sparse ANOVA Selection → Parameter Selection → Model
• 3+ stage technique with auto-selection of parameters
• Each stage consists of solving a convex, constrained optimisation problem (QP or LP)
• Capacity control parameter: chosen by cross-validation
• Sparseness parameter: chosen by validation error
Stage I: Sparse Basis Solution
• Quadratic loss function → Quadratic Program (QP)
• ε-insensitive loss function → Linear Program (LP)
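The ε-insensitive case really is a linear program. A minimal sketch of that LP flavour with scipy.optimize.linprog, using Gram-matrix columns as stand-in basis functions; the variable split c = c_plus - c_minus, the tiny data set, and all parameter values are assumptions for illustration, not the slides' implementation:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 1, (40, 1)), axis=0)
y = np.sin(6 * X).ravel() + rng.normal(0, 0.05, 40)

# Candidate basis: Gram-matrix columns (stand-ins for ANOVA terms).
sigma = 0.15
Phi = np.exp(-(X - X.T) ** 2 / (2 * sigma**2))
n, p = Phi.shape
eps, lam = 0.05, 0.1      # tube width and sparseness weight (illustrative)

# Variables z = [c_plus (p), c_minus (p), xi (n)], all >= 0.
# Minimise lam * ||c||_1 + sum(xi)  s.t.  |y - Phi c| <= eps + xi.
cost = np.concatenate([lam * np.ones(p), lam * np.ones(p), np.ones(n)])
A_ub = np.block([[ Phi, -Phi, -np.eye(n)],
                 [-Phi,  Phi, -np.eye(n)]])
b_ub = np.concatenate([y + eps, eps - y])

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
c = res.x[:p] - res.x[p:2*p]
print("LP status:", res.message)
print("non-zero coefficients:", np.sum(np.abs(c) > 1e-8), "of", p)
```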
AMPG Problem
• Predict automobile MPG (392 samples)
• Inputs: number of cylinders, displacement, horsepower, weight, acceleration, year
• Output: MPG
[Figure: estimated ANOVA terms plotted against horsepower, 50 to 230.]
Network transparency through the ANOVA representation.
SUPANOVA AMPG Results (ε = 2.5)

Estimated generalisation error (mean / variance):

Stage I loss     Stage III loss    Training        Testing         Linear Model
Quadratic        Quadratic         6.97 / 7.39     7.08 / 6.19     11.4 / 11.0
ε-insensitive    ε-insensitive     0.48 / 0.04     0.49 / 0.03     1.80 / 0.11
ε-insensitive    Quadratic         1.10 / 0.07     1.37 / 0.10     n/a
Quadratic        ε-insensitive     7.07 / 6.52     7.13 / 6.04     11.72 / 10.94
Summary • SUPANOVA is a global approach • Strong Basis (Kernel Methods) • Can control loss function and sparseness • Can impose limit on maximum variate terms • Generalisation + Transparency
Further Information
• http://www.isis.ecs.soton.ac.uk/isystems/kernel/
• SVM Technical Report
• MATLAB SVM Toolbox
• Sparse Kernel Paper
• These Slides