Support Vector Neural Training

Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland
School of Computer Engineering, Nanyang Technological University, Singapore
Google: Duch
ICANN, Warsaw, Sept. 2005

Plan
• Main idea.
• Support Vector Machines and active learning.
• Neural Networks and Support Vectors.
• Pedagogical example.
• Results on real data.

Main idea
• What data should be used for training?
Given conditional distributions P(X|C) for dengue fever for:
• World population.
• ASEAN countries.
• Singapore only.
• Choa Chu Kang only?
Which distribution should we use? If we know that X comes from Choa Chu Kang and P(X|C) is reliable, local knowledge should be used. If X comes from a region close to the decision borders, why use data from regions far away?

Learning
• MLP/RBF: fast MSE reduction at first, very slow later.
Typical MSE(t) learning curve: after 10 iterations almost all the work is done, but final convergence is achieved only after a very long process of about 1000 iterations. What is going on?

Learning trajectories
• Take weights Wi from iterations i=1..K; PCA on the Wi covariance matrix captures about 95% of the variance for most data, so the error function plotted in 2D shows realistic learning trajectories (papers by M. Kordos & W. Duch).
Instead of local minima, large flat valleys are seen. Why? Data far from the decision borders have almost no influence; the main reduction of MSE is achieved by increasing ||W||, which sharpens the sigmoidal functions.
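The effect of increasing ||W|| can be illustrated with a minimal sketch. It assumes a single sigmoidal unit and a hypothetical 1D data set (neither is from the slides): multiplying the weights by a growing factor leaves the decision border at x=0 but keeps lowering the MSE, and a vector far from the border contributes much less error than one close to it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse(scale, xs, ys, w=1.0, b=0.0):
    """MSE of one sigmoidal unit whose weight and bias are multiplied by `scale`."""
    return sum((y - sigmoid(scale * (w * x + b))) ** 2
               for x, y in zip(xs, ys)) / len(xs)

# Hypothetical 1D data: class 0 left of the border x=0, class 1 right of it.
xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
ys = [0, 0, 0, 1, 1, 1]

# The border stays at x=0 for every scale; only the slope of the sigmoid changes.
scales = [1, 2, 4, 8]
errors = [mse(s, xs, ys) for s in scales]
for s, e in zip(scales, errors):
    print(f"weight scale {s}: MSE = {e:.6f}")

# Squared error of a far vector (x=-3) and a near vector (x=-1) at scale 1.
far = (0 - sigmoid(-3.0)) ** 2
near = (0 - sigmoid(-1.0)) ** 2
print(f"far vector: {far:.4f}, near vector: {near:.4f}")
```

Since all vectors in this toy set are already on the correct side, scaling the weights only sharpens the sigmoid, which is the flat-valley behaviour described above.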

Support Vectors
SVM gradually focuses on the training vectors near the decision hyperplane. Can we do the same with MLP?

Selecting Support Vectors
Active learning: if a vector's contribution to the parameter change is negligible, remove it from the training set.
If the difference eW(X) between the target and the network output is sufficiently small, the pattern X will have negligible influence on the training process and may be removed from the training set.
Conclusion: select vectors with eW(X) > emin for training.
Two problems: possible oscillations and strong influence of outliers.
Solution: adjust emin dynamically to avoid oscillations; also remove vectors with eW(X) > 1-emin = emax.
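The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's code: the error of a vector is taken as |target - output| for a single output in [0, 1], the interval is treated as closed, and the function name and sample values are hypothetical.

```python
def select_support_vectors(errors, e_min):
    """Return indices of vectors whose error lies in [e_min, 1 - e_min].

    Vectors below e_min are far from the border (negligible influence);
    vectors above e_max = 1 - e_min are treated as outliers.
    """
    e_max = 1.0 - e_min
    return [i for i, e in enumerate(errors) if e_min <= e <= e_max]

# Hypothetical network outputs and targets for five training vectors.
outputs = [0.02, 0.45, 0.60, 0.97, 0.05]
targets = [0, 0, 1, 1, 1]
errors = [abs(t - o) for t, o in zip(targets, outputs)]

selected = select_support_vectors(errors, e_min=0.1)
print(selected)  # indices of the vectors kept for training
```

Here vectors 0 and 3 are dropped as well learned, vector 4 is dropped as a likely outlier, and vectors 1 and 2 near the border are kept.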

SVNT algorithm
Initialize the network parameters W, set De=0.01, emin=0, set SV=T.
Until no improvement is found in the last Nlast iterations do:
• Optimize network parameters for Nopt steps on SV data.
• Run feedforward step on T to determine overall accuracy and errors, take SV={X | e(X) ∈ [emin, 1-emin]}.
• If the accuracy increases: compare the current network with the previous best one, choose the better one as the current best.
• Increase emin=emin+De and make a forward step selecting SVs.
• If the number of support vectors |SV| increases: decrease emin=emin-De; decrease De=De/1.2 to avoid large changes.
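The loop above might be sketched as follows. Several things here are assumptions rather than the method from the slides: a single sigmoidal unit trained by plain batch gradient descent stands in for the MLP with SCG, the data are hypothetical Gaussian blobs, max_rounds is an added safety cap, "|SV| increases" is read as growth relative to the previous round, and an empty SV set is handled by backing off emin.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x):
    # w[0] is the bias, w[1:] are the input weights
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def optimize(w, data, n_opt, lr=0.5):
    """Stand-in optimizer (assumption): n_opt batch gradient steps on the MSE."""
    w = list(w)
    for _ in range(n_opt):
        grad = [0.0] * len(w)
        for x, y in data:
            f = forward(w, x)
            delta = (f - y) * f * (1.0 - f)
            grad[0] += delta
            for j, xj in enumerate(x):
                grad[j + 1] += delta * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def evaluate(w, T):
    """Feedforward pass on the full training set T: accuracy and per-vector errors."""
    errors = [abs(y - forward(w, x)) for x, y in T]
    accuracy = sum(e < 0.5 for e in errors) / len(T)
    return accuracy, errors

def select(T, errors, e_min):
    return [t for t, e in zip(T, errors) if e_min <= e <= 1.0 - e_min]

def svnt(T, n_opt=20, n_last=10, d_e=0.01, max_rounds=200):
    w = [0.0] * (len(T[0][0]) + 1)
    e_min, sv = 0.0, list(T)          # emin=0, SV=T
    best_w, best_acc, stale = list(w), 0.0, 0
    for _ in range(max_rounds):       # safety cap (assumption)
        if stale >= n_last:           # no improvement in the last n_last rounds
            break
        w = optimize(w, sv, n_opt)
        acc, errors = evaluate(w, T)
        if acc > best_acc:
            best_w, best_acc, stale = list(w), acc, 0
        else:
            stale += 1
        e_min += d_e
        new_sv = select(T, errors, e_min)
        if len(new_sv) > len(sv) or not new_sv:
            e_min -= d_e              # back off and take smaller steps
            d_e /= 1.2
            new_sv = select(T, errors, e_min) or sv
        sv = new_sv
    return best_w, best_acc, sv

# Hypothetical data: two overlapping 2D Gaussian classes.
random.seed(0)
T = [((random.gauss(-1, 1), random.gauss(-1, 1)), 0) for _ in range(100)]
T += [((random.gauss(1, 1), random.gauss(1, 1)), 1) for _ in range(100)]
best_w, best_acc, sv = svnt(T)
print(f"best training accuracy {best_acc:.3f}, {len(sv)} of {len(T)} vectors kept")
```

Accuracy is always measured on the full set T, while the optimizer only sees the current SV subset, which is the key design point of the algorithm.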

XOR solution

Satellite image data
Multi-spectral values of pixels in 3x3 neighborhoods in an 82x100 section of an image taken by the Landsat Multi-Spectral Scanner; intensities = 0-255; the training set has 4435 samples, the test set 2000 samples.
The central pixel in each neighborhood is red soil (1072), cotton crop (479), grey soil (961), damp grey soil (415), soil with vegetation stubble (470), or very damp grey soil (1038 training samples). Strong overlaps between some classes.

System and parameters              Train accuracy (%)   Test accuracy (%)
SVNT MLP, 36 nodes, a=0.5          96.5                 91.3
kNN, k=3, Manhattan                --                   90.9
SVM Gaussian kernel (optimized)    91.6                 88.4
RBF, Statlog result                88.9                 87.9
MLP, Statlog result                88.8                 86.1
C4.5 tree                          96.0                 85.0

Satellite image data – MDS outputs

Hypothyroid data
Two years of real medical screening tests for thyroid diseases: 3772 training cases, with 93 primary hypothyroid and 191 compensated hypothyroid; the remaining 3488 cases are healthy. The test set has 3428 cases with a similar class distribution.
21 attributes (15 binary, 6 continuous) are given, but only two of the binary attributes (on thyroxine, and thyroid surgery) contain useful information, so the number of attributes has been reduced to 8.

Method                       % train   % test
C-MLP2LN rules               99.89     99.36
MLP+SCG, 4 neurons           99.81     99.24
SVM Minkowski opt kernel     100.0     99.18
MLP+SCG, 4 neur, 67 SV       99.95     99.01
MLP+SCG, 4 neur, 45 SV       100.0     98.92
MLP+SCG, 12 neur.            100.0     98.83
Cascade correlation          100.0     98.5
MLP+backprop                 99.60     98.5
SVM Gaussian kernel          99.76     98.4

Hypothyroid data

Discussion
SVNT is very easy to implement; here only the batch version with SCG training was used. This is only a first step, but the results are promising.
Found smaller support vector sets than SVM; may be useful in one-class learning; speeds up training.
Problems:
• possible oscillations, and selection requires more careful analysis, but oscillations help to explore the MSE landscape;
• additional parameters, but they are rather easy to set.
More empirical tests are needed.

Thank you for lending your ears ...
