Support Vector Classifiers for Land Cover Classification
Mahesh Pal, National Institute of Technology, Kurukshetra, India
Paul M. Mather, School of Geography, University of Nottingham, UK
1) What is a support vector classifier?
2) Data used
3) Results and comparison with NN and ML classifiers
4) Conclusions
Support vector classifiers (SVC) • Based on statistical learning theory. They minimise the probability of misclassifying unseen data drawn randomly from the underlying distribution (structural risk minimisation) rather than minimising the misclassification error on the training data (empirical risk minimisation). • In addition to the classifier parameters, this classifier provides a set of data points (called support vectors) that contain all the information about the classification problem. • Nonparametric in nature.
Empirical and structural risk minimisation In the case of two-class pattern recognition, the task of learning from examples can be formulated in the following way: given a set of decision functions $f_\alpha(\mathbf{x}): \mathbb{R}^N \rightarrow \{-1, 1\}$, where $\alpha \in \Lambda$ is a set of abstract parameters (Osuna et al., 1997), and a set of examples $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_k, y_k)$, $\mathbf{x}_i \in \mathbb{R}^N$, $y_i \in \{-1, 1\}$, drawn from an unknown distribution P(x, y), the aim is to find a function $f_\alpha$ that provides the smallest possible value for the average error committed on independent examples randomly drawn from the same distribution P(x, y), called the expected risk: $R(\alpha) = \int \tfrac{1}{2}\,|f_\alpha(\mathbf{x}) - y|\; dP(\mathbf{x}, y)$. As P(x, y) is unknown, R(α) cannot be calculated, so the empirical risk is computed instead: $R_{emp}(\alpha) = \frac{1}{2k} \sum_{i=1}^{k} |f_\alpha(\mathbf{x}_i) - y_i|$. The functions $f_\alpha$ are usually called hypotheses, and the set $\{f_\alpha : \alpha \in \Lambda\}$ is called the hypothesis space, denoted by H (where $f_\alpha$ can be a radial basis network, a polynomial function, etc.).
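As an illustration only (not from the original study), the empirical risk defined above can be computed for any hypothesis as a simple average over labelled examples. The sketch below assumes NumPy arrays; the toy data and the linear hypothesis are hypothetical placeholders.

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp = (1 / 2k) * sum_i |f(x_i) - y_i| for labels in {-1, +1}."""
    predictions = np.array([f(x) for x in X])
    return np.mean(np.abs(predictions - y)) / 2.0  # equals the misclassification rate

# Toy two-class data and a deliberately imperfect hypothesis (both hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
f = lambda x: 1 if x[0] > 0 else -1
print("Empirical risk:", empirical_risk(f, X, y))
```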
If the number of training patterns (k) used to train the classifier is limited, a low error value on the training set does not necessarily imply that the classifier has a high generalisation ability, and the empirical risk minimisation principle may be non-consistent. Vapnik and Chervonenkis (1971, 1991) showed that a necessary and sufficient condition for consistency of the empirical risk minimisation principle is the finiteness of the VC-dimension h of the hypothesis space H. Vapnik and Chervonenkis (1971) also provide a bound on the deviation of the empirical risk from the expected risk, holding with probability $1 - \eta$: $R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2k}{h} + 1\right) - \ln\frac{\eta}{4}}{k}}$. In order to implement the SRM principle, a nested structure of hypothesis spaces is introduced by dividing the entire class of functions into nested subsets $H_1 \subset H_2 \subset \ldots \subset H_n \subset \ldots$ with the property that h(n) ≤ h(n + 1), where h(n) is the VC-dimension of the set $H_n$. This can be achieved by training a set of machines, one for each subset, and choosing the trained machine whose sum of empirical risk and VC confidence is minimal (Osuna et al., 1997).
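A small illustrative addition (not from the slides): the VC confidence term of the bound above can be evaluated numerically for an assumed VC-dimension h and training-set size k, showing how it shrinks as k grows.

```python
import numpy as np

def vc_confidence(h, k, eta=0.05):
    """Confidence term sqrt((h*(ln(2k/h) + 1) - ln(eta/4)) / k), holding with probability 1 - eta."""
    return np.sqrt((h * (np.log(2 * k / h) + 1) - np.log(eta / 4)) / k)

# Hypothetical values of h and k, for illustration only.
for k in (100, 1000, 10000):
    print(f"k = {k:5d}:  VC confidence (h = 50) = {vc_confidence(50, k):.3f}")
```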
Linearly separable classes For a binary classification problem with data $\mathbf{x}_i \in \mathbb{R}^N$ (i = 1, …, k) and labels $y_i = \pm 1$, the training patterns are linearly separable if: $\mathbf{w} \cdot \mathbf{x}_i + b \geq +1$ for all $y_i = +1$ and $\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$ for all $y_i = -1$, where w determines the orientation of the discriminating plane and b determines its offset from the origin (weight and bias in NN terminology). The classification function (hypothesis space) for this is $f(\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x} + b)$.
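For illustration only, the linear decision function f(x) = sgn(w · x + b) can be written directly; the weight vector and bias below are arbitrary assumed values, not estimated parameters.

```python
import numpy as np

w = np.array([1.0, -2.0])   # orientation of the discriminating plane (assumed)
b = 0.5                     # offset from the origin (assumed)

def decision(x):
    """Hypothesis f(x) = sgn(w . x + b)."""
    return int(np.sign(np.dot(w, x) + b))

print(decision(np.array([3.0, 0.0])))   # falls on the +1 side of the plane
print(decision(np.array([0.0, 3.0])))   # falls on the -1 side of the plane
```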
The approach used to design a support vector classifier is to maximise the margin between two supporting planes. • A plane supports a class if all points in that class are on one side of that plane. • These two parallel planes are pushed apart until they bump into a small number of data points from each class. • These data points are called the support vectors.
As the SVC is designed to maximise the margin between the supporting planes, and the margin is defined as $2 / \|\mathbf{w}\|$, maximising the margin is equivalent to solving the following quadratic program: minimise $\|\mathbf{w}\|^2 / 2$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1$. This is solved by quadratic programming optimisation techniques using Lagrange multipliers $\alpha_i \geq 0$, and the optimisation problem becomes: $L(\mathbf{w}, b, \alpha) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{k} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$ (1)
Cont. • Eq. (1) can be minimised with respect to w and b, and the optimisation problem becomes the dual: maximise $W(\alpha) = \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ (2) subject to $\sum_{i=1}^{k} \alpha_i y_i = 0$ and $\alpha_i \geq 0$ for i = 1, …, k. The decision rule for the two-class case can then be written as: $f(\mathbf{x}) = \mathrm{sgn}\left( \sum_{i=1}^{k} y_i \alpha_i (\mathbf{x} \cdot \mathbf{x}_i) + b \right)$
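In practice the dual problem (2) is handed to a QP solver. The sketch below is a hedged example (not the study's code): it uses scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin case on synthetic separable data, then reports the support vectors and the recovered w and b.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (placeholders for real image data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=+2.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

print("Support vectors per class:", clf.n_support_)
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
```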
Non-separable data • Cortes and Vapnik (1995) suggested relaxing the restriction that all training vectors of a class lie on one side of the optimal hyperplane by introducing positive slack variables $\xi_i \geq 0$ and writing the equations of the separating planes as: • $\mathbf{w} \cdot \mathbf{x}_i + b \geq +1 - \xi_i$ for $y_i = +1$ and $\mathbf{w} \cdot \mathbf{x}_i + b \leq -1 + \xi_i$ for $y_i = -1$ • and writing the optimisation problem as: minimise $\frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{k} \xi_i$ • with $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ • C is a positive constant such that $0 \leq \alpha_i \leq C$ in the resulting dual problem.
Cont. • C is chosen by the user, and a large value of C means a higher penalty on errors. The final formulation for non-separable data is given below, where the $\mu_i$ are Lagrange multipliers introduced to enforce positivity of the $\xi_i$, and equation (1) becomes: $L(\mathbf{w}, b, \xi, \alpha, \mu) = \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{k} \xi_i - \sum_{i=1}^{k} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{k} \mu_i \xi_i$
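A hedged sketch of the role of C on overlapping (non-separable) data: a larger C penalises the slack variables more heavily. The synthetic data and the C values below are illustrative assumptions, not those of the study.

```python
import numpy as np
from sklearn.svm import SVC

# Deliberately overlapping two-class data (illustrative only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=-0.5, size=(100, 2)),
               rng.normal(loc=+0.5, size=(100, 2))])
y = np.array([-1] * 100 + [1] * 100)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:<6}  support vectors = {clf.support_vectors_.shape[0]:3d}  "
          f"training accuracy = {clf.score(X, y):.2f}")
```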
Nonlinear SVC • The set of linear hyperplanes is not flexible enough to provide low empirical risk for many real-life problems (Minsky and Papert, 1969). • There are two different ways to increase the flexibility of the set of functions: • 1. Use a set of functions that are superpositions of linear indicator functions (like sigmoid functions in NN). • 2. Map the input vectors into a high-dimensional space and construct a separating hyperplane in that space.
If it is not possible to have a decision surface defined by a linear equation, the technique proposed by Boser et al. (1992) is used. • Feature vectors are mapped into a very high-dimensional feature space via a nonlinear mapping. • In the higher-dimensional feature space the data are spread out, so that a linear hyperplane can be used as the discriminating surface. • The concept of a kernel function is used to reduce the computational demand of working in the feature space.
Cont. • For this case, equation (2) can be written as: $W(\alpha) = \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i=1}^{k} \sum_{j=1}^{k} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$ where the kernel K is defined as $K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$. A number of kernels can be used: the polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$ and the radial basis function $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$, where d and γ are user-defined.
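For illustration only (not from the slides), the two kernels above can be written out explicitly, and an RBF-kernel SVC can separate a toy problem that no linear hyperplane in the input space can; all parameter values here are assumed.

```python
import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(xi, xj, d=2):
    """Polynomial kernel (x_i . x_j + 1)^d."""
    return (np.dot(xi, xj) + 1.0) ** d

def rbf_kernel(xi, xj, gamma=2.0):
    """Radial basis function kernel exp(-gamma * ||x_i - x_j||^2)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

# Points inside vs. outside a circle: not linearly separable in the input space.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.5, 1, -1)

clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
print("Training accuracy with RBF kernel:", clf.score(X, y))
```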
Advantages/disadvantages • Uses quadratic programming (QP) optimisation, so there is no chance of getting trapped in local minima, unlike NN. • Uses the data points closest to the boundary, so only a small number of training samples (the support vectors) is actually needed. • Basically a two-class classifier, so different methods exist to create a multi-class classifier, and the choice affects performance (see the sketch after this list). • The choice of kernel and kernel-specific user-defined parameters may affect the final classification accuracy. • The choice of the parameter C affects the classification accuracy.
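A hedged sketch of two common multi-class constructions from binary SVCs, "one against one" and "one against rest", using scikit-learn on synthetic three-class data; none of this is the study's code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

# Three synthetic, well-separated classes (illustrative placeholders).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (-3.0, 0.0, 3.0)])
y = np.repeat([0, 1, 2], 50)

# scikit-learn's SVC builds multi-class models via one-against-one pairwise voting;
# OneVsRestClassifier wraps binary SVCs the other way.
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

print("Pairwise (one-against-one) classifiers per sample:",
      ovo.decision_function(X[:1]).shape[1])   # 3 classes -> 3 binary classifiers
print("One-against-rest accuracy:", ovr.score(X, y))
```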
Data used • ETM+ data (study area: Littleport, Cambridgeshire, UK, 2000) • Hyperspectral (DAIS) data (study area in Spain)
Analysis • Random sampling was used to select training and test data. • Different data sets were used for training and testing the classifiers. • 2700 training and 2037 test pixels with 7 classes were used with the ETM+ data. • 1600 training and 3800 test pixels with 8 classes were used with the DAIS data. • A total of 65 features (spectral bands) was used with the DAIS data, as seven features with severe striping were discarded. The initial number of features used was five, and the experiment was repeated with 10, 15, …, 65 features, giving a total of 13 experiments (the loop structure is sketched below).
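The feature-subset experiment can be written as a simple loop. The sketch below mirrors only the loop structure described above; random placeholder arrays stand in for the DAIS data, which are not reproduced here, so the printed accuracies are meaningless.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder arrays with the stated sizes (1600 training / 3800 test pixels,
# 8 classes, 65 retained bands); the real DAIS data are not included here.
rng = np.random.default_rng(5)
X_train, y_train = rng.normal(size=(1600, 65)), rng.integers(0, 8, size=1600)
X_test, y_test = rng.normal(size=(3800, 65)), rng.integers(0, 8, size=3800)

for n_features in range(5, 70, 5):   # 5, 10, ..., 65 bands -> 13 experiments
    clf = SVC(kernel="rbf").fit(X_train[:, :n_features], y_train)
    print(f"{n_features:2d} features: test accuracy = "
          f"{clf.score(X_test[:, :n_features], y_test):.3f}")
```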
Cont. • A standard back-propagation neural network classifier (NN) was used, with all user-defined parameters set as recommended by Kavzoglu (2001) and one hidden layer with 26 nodes. • A maximum likelihood (ML) classifier was also used. • Classification accuracy and the Kappa value were computed with the ETM+ data, while classification accuracy was computed with the DAIS data. • Like neural network classifiers, the performance of a support vector classifier depends on some user-defined parameters, such as the kernel type, kernel-specific parameters, the multi-class method and the parameter C. • For this study, the "one against one" multi-class method, C = 5000, a radial basis kernel and a kernel-specific parameter (γ) value of 2 were used (see the sketch below).
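A hedged sketch of an SVC configured with the settings listed on this slide: one-against-one multi-class strategy, radial basis kernel, C = 5000 and kernel parameter γ = 2. The training arrays are placeholders (the ETM+ data are not included, and the 7-feature size is assumed), so the output does not reproduce the reported accuracies.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training set with the stated size (2700 pixels, 7 classes) and
# an assumed 7 ETM+ features; the real image data are not available here.
rng = np.random.default_rng(6)
X_train, y_train = rng.normal(size=(2700, 7)), rng.integers(0, 7, size=2700)

# scikit-learn's SVC implements multi-class via one-against-one pairwise voting.
clf = SVC(kernel="rbf", C=5000, gamma=2.0, decision_function_shape="ovo")
clf.fit(X_train, y_train)
print("Number of support vectors used:", clf.support_vectors_.shape[0])
```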
Conclusions • The performance of SVC is better in comparison with the NN and ML classifiers. • Like NN, SVC is also affected by the choice of some user-defined parameters; this study concludes that these parameters are easier to set. • There is no problem of local minima in SVC, unlike NN classifiers. • Training time for SVC is quite small compared to the NN classifier (0.30 minutes for SVC compared to 58 minutes for the NN classifier on a SUN machine). • SVC performs very well with a small number of training samples, irrespective of the number of features used. • SV classifiers are almost unaffected by the Hughes (1968) phenomenon.