Kernel Methods: the Emergence of a Well-founded Machine Learning John Shawe-Taylor Centre for Computational Statistics and Machine Learning University College London CCKM'06
Overview • Celebration of 10 years of kernel methods: • what has been achieved and • what can we learn from the experience? • Some historical perspectives: • Theory or not? • Applicable or not? • Some emphases: • Role of theory – need for plurality of approaches • Importance of scalability CCKM'06
Caveats • Personal perspective with inevitable bias • One very small slice through what is now a very big field • Focus on theory with emphasis on frequentist analysis • There is no pro-forma for scientific research • But role of theory worth discussing • Needed to give firm foundation for proposed approaches? CCKM'06
Motivation behind kernel methods • Linear learning typically has nice properties • Unique optimal solutions • Fast learning algorithms • Better statistical analysis • But one big problem • Insufficient capacity CCKM'06
Historical perspective • Minsky and Papert highlighted the weakness in their book Perceptrons • Neural networks overcame the problem by gluing together many linear units with non-linear activation functions • Solved problem of capacity and led to very impressive extension of applicability of learning • But ran into training problems of speed and multiple local minima CCKM'06
Kernel methods approach • The kernel methods approach is to stick with linear functions but work in a high dimensional feature space: x ↦ φ(x), with functions of the form f(x) = ⟨w, φ(x)⟩ + b • The expectation is that the feature space has a much higher dimension than the input space. CCKM'06
Example • Consider the mapping φ: (x1, x2) ↦ (x1², x2², √2 x1 x2) • If we consider a linear equation in this feature space: w1 x1² + w2 x2² + √2 w3 x1 x2 = c • We actually have an ellipse – i.e. a non-linear shape in the input space. CCKM'06
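A quick numerical check of the ellipse claim (a minimal sketch, assuming the quadratic map written out above; the weights w and constant c below are made up for illustration):

```python
import numpy as np

def phi(x1, x2):
    """Quadratic feature map (as reconstructed above): (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

# A linear function <w, phi(x)> = c in the feature space ...
w, c = np.array([1.0, 4.0, 0.0]), 4.0

# ... corresponds to the ellipse x1^2 + 4*x2^2 = 4 in the input space:
for t in np.linspace(0, 2 * np.pi, 5):
    x1, x2 = 2 * np.cos(t), np.sin(t)          # parametrisation of that ellipse
    print(np.isclose(w @ phi(x1, x2), c))      # True for every point on the ellipse
```

Every point on the ellipse satisfies the linear equation in feature space, so a linear decision boundary there is a non-linear (conic) boundary in the input space.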
Capacity of feature spaces • The capacity is proportional to the dimension – linear threshold functions in d dimensions have VC dimension d + 1 • 2-dim: any 3 points in general position can be given all possible labellings by a line, but not every labelling of 4 points can be realised CCKM'06
Form of the functions • So kernel methods use linear functions in a feature space: f(x) = ⟨w, φ(x)⟩ + b • For regression this could be the regression function itself • For classification require thresholding: sign(f(x)) CCKM'06
Problems of high dimensions • Capacity may easily become too large and lead to over-fitting: being able to realise every labelling of the training data means the classifier is unlikely to generalise well • Computational costs involved in dealing with large feature vectors CCKM'06
Overview • Two theoretical approaches converged on very similar algorithms: • Frequentist led to Support Vector Machine • Bayesian approach led to Bayesian inference using Gaussian Processes • First we briefly discuss the Bayesian approach before mentioning some of the frequentist results CCKM'06
Bayesian approach • The Bayesian approach relies on a probabilistic analysis by positing • a noise model • a prior distribution over the function class • Inference involves updating the prior distribution with the likelihood of the data • Possible outputs: • MAP function • Bayesian posterior average CCKM'06
Bayesian approach • Avoids overfitting by • Controlling the prior distribution • Averaging over the posterior • For Gaussian noise model (for regression) and Gaussian process prior we obtain a ‘kernel’ method where • Kernel is covariance of the prior GP • Noise model translates into addition of ridge to kernel matrix • MAP and averaging give the same solution • Link with infinite hidden node limit of single hidden layer Neural Networks – see seminal paper Williams, Computation with infinite neural networks (1997) • Surprising fact: the covariance (kernel) function that arises from infinitely many sigmoidal hidden units is not a sigmoidal kernel – indeed the sigmoidal kernel is not positive semi-definite and so cannot arise as a covariance function! CCKM'06
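A minimal numerical sketch of the regression case just described, assuming an RBF covariance function and toy one-dimensional data (neither is from the talk): the Gaussian noise variance appears as a ridge added to the kernel matrix, and the posterior (MAP) mean is the resulting kernel predictor.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) covariance function between the row vectors of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)

# Illustrative 1-d regression data (assumed, not from the talk)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

sigma2 = 0.1**2                    # Gaussian noise variance -> ridge on the kernel matrix
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + sigma2 * np.eye(len(X)), y)   # dual coefficients

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
posterior_mean = rbf_kernel(X_test, X) @ alpha            # GP posterior mean = MAP prediction
print(posterior_mean)
```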
Bayesian approach • Subject to assumptions about noise model and prior distribution: • Can get error bars on the output • Compute evidence for the model and use for model selection • Approach developed for different noise models • eg classification • Typically requires approximate inference CCKM'06
Frequentist approach • Source of randomness is assumed to be a distribution that generates the training data i.i.d. – with the same distribution generating the test data • Different/weaker assumptions than the Bayesian approach – so more general but less analysis can typically be derived • Main focus is on generalisation error analysis CCKM'06
Capacity problem • What do we mean by generalisation? CCKM'06
Generalisation of a learner • The generalisation error of a learner is the expected loss of the function it outputs on a new example drawn from the same distribution that generated the training data CCKM'06
Example of Generalisation • We consider the Breast Cancer dataset from the UCI repository • Use the simple Parzen window classifier: weight vector is w = μ₊ − μ₋, where μ₊ (μ₋) is the average of the positive (negative) training examples. • Threshold is set so the hyperplane bisects the line joining these two points: b = ⟨w, (μ₊ + μ₋)/2⟩. CCKM'06
Example of Generalisation • By repeatedly drawing random training sets S of size m we estimate the distribution of the generalisation error, using the test set error as a proxy for the true generalisation • We plot the histogram and the average of the distribution for various sizes of training set: 648, 342, 273, 205, 137, 68, 34, 27, 20, 14, 7. CCKM'06
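A sketch of this experiment, with assumptions: scikit-learn's bundled breast cancer data stands in for the 9-feature UCI set used in the talk, and the training-set sizes and number of repetitions below are illustrative rather than those on the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

def train_parzen(X, y):
    """Mean-difference ('Parzen window') classifier: w = mean(+) - mean(-),
    threshold chosen so the hyperplane bisects the line joining the two means."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = w @ (mu_pos + mu_neg) / 2
    return w, b

def error_rate(w, b, X, y):
    pred = (X @ w - b > 0).astype(int)
    return np.mean(pred != y)

X, y = load_breast_cancer(return_X_y=True)   # stand-in for the UCI data in the slides
rng = np.random.default_rng(0)

for m in [400, 200, 100, 50, 20, 10]:        # illustrative training-set sizes
    errs = []
    for _ in range(200):                     # repeatedly draw random training sets of size m
        idx = rng.permutation(len(y))
        train, test = idx[:m], idx[m:]
        if len(np.unique(y[train])) < 2:
            continue                         # need both classes present to form the two means
        w, b = train_parzen(X[train], y[train])
        errs.append(error_rate(w, b, X[test], y[test]))  # test error as proxy for generalisation
    print(f"m={m:4d}  mean test error={np.mean(errs):.3f}")
```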
Example of Generalisation • Since the expected classifier is in all cases the same we do not expect large differences in the average of the distribution, though the non-linearity of the loss function means they won't be the same exactly. CCKM'06
Error distribution: full dataset CCKM'06
Observations • Things can get bad if number of training examples small compared to dimension (in this case input dimension is 9) • Mean can be bad predictor of true generalisation – i.e. things can look okay in expectation, but still go badly wrong • Key ingredient of learning – keep flexibility high while still ensuring good generalisation CCKM'06
Controlling generalisation • The critical method of controlling generalisation for classification is to force a large margin on the training data: the margin is the smallest distance of a training point from the separating hyperplane, γ = min_i y_i(⟨w, φ(x_i)⟩ + b)/‖w‖ CCKM'06
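A small sketch of the margin computation, assuming labels in {−1, +1}; the toy data and the candidate separator are made up for illustration:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Minimal signed distance of the labelled points (y in {-1,+1}) from the
    hyperplane <w, x> + b = 0; positive only if the data are correctly separated."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Illustrative separable 2-d data and a candidate separator (assumed values)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(geometric_margin(w, b, X, y))   # larger margin -> stronger generalisation guarantee
```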
Intuitive and rigorous explanations • Makes classification robust to uncertainties in inputs • Can randomly project into lower dimensional spaces and still have separation – so effectively low dimensional • Rigorous statistical analysis shows effective dimension • This is not structural risk minimisation over VC classes since hierarchy depends on the data: data-dependent structural risk minimisation see S-T, Bartlett, Williamson & Anthony (1996 and 1998) Surprising fact: structural risk minimisation over VC classes does not provide a bound on the generalisation of SVMs, except in the transductive setting – and then only if the margin is measured on training and test data! Indeed SVMs were a wake-up call that classical PAC analysis was not capturing critical factors in real-world applications! CCKM'06
Learning framework • Since there are lower bounds in terms of the VC dimension, the margin is detecting a favourable distribution/task alignment – the luckiness framework captures this idea • Now consider using an SVM on the same data and compare the distribution of generalisations • SVM distribution in red CCKM'06
Handling training errors • So far only considered case where data can be separated • For non-separable sets we can introduce a penalty proportional to the amount by which a point fails to meet the margin • These amounts are often referred to as slack variables from optimisation theory CCKM'06
Support Vector Machines • SVM optimisation: minimise ½‖w‖² + C Σ_i ξ_i over w, b, ξ subject to y_i(⟨w, φ(x_i)⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0 • Analysis of this case given using augmented space trick in S-T and Cristianini (1999 and 2002) On the generalisation of soft margin algorithms. CCKM'06
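A minimal sketch of the soft-margin trade-off using scikit-learn's SVC (not the solver discussed in the talk, and with made-up data): the parameter C weights the sum of slack variables against ½‖w‖², so small C tolerates many margin violations while large C approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative overlapping two-class data (assumed, not from the talk)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(+1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Soft-margin SVM: C controls the penalty on the slack variables
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)      # geometric margin of the learned hyperplane
    print(f"C={C:6.2f}  margin={margin:.3f}  support vectors={clf.n_support_.sum()}")
```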
Complexity problem • Let’s apply the quadratic example to a 20x30 image of 600 pixels – the quadratic feature space has dimension 600·601/2 = 180,300, i.e. approximately 180,000 dimensions! • Would be computationally infeasible to work explicitly in this space CCKM'06
Dual representation • Suppose weight vector is a linear combination of the training examples: w = Σ_i α_i φ(x_i) • can evaluate inner product with new example: ⟨w, φ(x)⟩ = Σ_i α_i ⟨φ(x_i), φ(x)⟩ CCKM'06
Learning the dual variables • The αi are known as dual variables • Since any component of the weight vector orthogonal to the space spanned by the training data has no effect on the training examples, there is a general result that optimal weight vectors have a dual representation: the representer theorem. • Hence, can reformulate algorithms to learn dual variables rather than the weight vector directly CCKM'06
Dual form of SVM • The dual form of the SVM can also be derived by taking the dual optimisation problem! This gives: maximise Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩ subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C • Note that the threshold b must be determined from border examples (support vectors with 0 < α_i < C) CCKM'06
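An illustrative check of the dual form, again using scikit-learn's SVC as a stand-in with made-up data: the fitted decision function can be reproduced from the stored dual variables (dual_coef_ holds α_i y_i for the support vectors) together with the threshold (intercept_).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Illustrative data (assumed, not from the talk)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The learned function involves the training data only through the dual variables:
# f(x) = sum_i (alpha_i * y_i) <x_i, x> + b, stored by SVC as dual_coef_ and intercept_
X_test = X[:5]
f_dual = X_test @ clf.support_vectors_.T @ clf.dual_coef_.ravel() + clf.intercept_
print(np.allclose(f_dual, clf.decision_function(X_test)))   # True: the two forms agree
```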
Using kernels • Critical observation is that again only inner products are used • Suppose that we now have a shortcut method of computing: κ(x, z) = ⟨φ(x), φ(z)⟩ • Then we do not need to explicitly compute the feature vectors either in training or testing CCKM'06
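A sketch of such a shortcut for the quadratic example, under the standard choice of feature map with √2-weighted cross terms: the kernel κ(x, z) = ⟨x, z⟩² is evaluated in the 600-dimensional input space instead of the ~180,300-dimensional feature space.

```python
import numpy as np
from itertools import combinations_with_replacement

def phi(x):
    """Explicit quadratic feature map: all monomials x_i * x_j, with sqrt(2)
    weights on the cross terms so that <phi(x), phi(z)> = <x, z>^2."""
    feats = []
    for i, j in combinations_with_replacement(range(len(x)), 2):
        coef = 1.0 if i == j else np.sqrt(2.0)
        feats.append(coef * x[i] * x[j])
    return np.array(feats)

rng = np.random.default_rng(0)
x, z = rng.standard_normal(600), rng.standard_normal(600)   # e.g. a 20x30 image as a vector

explicit = phi(x) @ phi(z)          # inner product of ~180,300-dimensional feature vectors
shortcut = (x @ z) ** 2             # kernel evaluation: O(600) instead of O(180,300)
print(np.allclose(explicit, shortcut))   # True
```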