Nature Inspired Learning: Classification and Prediction Algorithms
Šarūnas Raudys, Computational Intelligence Group, Department of Informatics, Vilnius University, Lithuania
e-mail: sarunas@raudys.com
Juodkrante, 2009 05 22
Nature inspired learning.
Statics: accuracy and the relations between sample size and complexity, W = S^{-1}(M1 - M2).
Dynamics: learning rapidity of the perceptron becomes a very important issue.
Nature inspired learning. A non-linear single layer perceptron is a main element in ANN theory: the inputs x1, x2, …, xp feed a weighted sum, followed by a nonlinearity that produces the output y.
Nature inspired learning. Training the single layer perceptron: outline.
Three tasks: CLASSIFICATION, minimization of deviations (regression), and CLUSTERIZATION (if target2 = target1).
Figure: a plot of 300 bivariate vectors (dots and pluses) sampled from two Gaussian pattern classes, and the linear decision boundary.
Outline (two-category case; I will also speak about the multi-category case):
1. Cost function and training of the SLP used for classification.
2. When to stop training?
3. Seven types of classifiers obtained while training the SLP:
   1. Euclidean distance (only the means),
   2. Regularized,
   3. Fisher, or
   4. Fisher with pseudo-inversion of S,
   5. Robust,
   6. Minimal empirical error,
   7. Support vector (maximal margin).
How to train the SLP in the best way?
Nature inspired learning. Training the non-linear SLP. Training data: N vectors of inputs X = (x1, x2, …, xp) with outputs y. The weighted sum is passed through a nonlinearity: o = f(V^T X + v0), where f(net) is a non-linear activation function, e.g. the sigmoid f(net) = 1/(1 + e^{-net}), and v0, V^T = (v1, v2, …, vp) are the weights of the discriminant function.
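As a concrete illustration, a minimal sketch of this forward pass in Python; the function names slp_output and sigmoid are my own, not from the slides:

```python
import numpy as np

def sigmoid(net):
    """Sigmoid activation f(net) = 1 / (1 + exp(-net))."""
    return 1.0 / (1.0 + np.exp(-net))

def slp_output(X, V, v0):
    """Output of the single layer perceptron for input vector(s) X.

    X : array of shape (p,) or (N, p) -- input vector(s) x1..xp
    V : array of shape (p,)           -- weight vector
    v0: float                         -- bias weight
    """
    net = X @ V + v0          # weighted sum V^T X + v0
    return sigmoid(net)       # non-linearity
```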
Training the single layer perceptron based classifier. With o = f(V^T X + v0) as above, the cost function (Amari, 1967; Tsypkin, 1966) is C = (1/N) Σ_j (y_j − f(V^T X_j + v0))^2, where y_j is the training signal (desired output). Training rule: V_{t+1} = V_t − η × gradient, where η is a learning step parameter.
Figure: optimal stopping — the trajectory from V(0) to V(FINISH), the minimum of the training cost function, versus the true (unknown) minimum.
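A hedged sketch of this training rule, batch gradient descent on the sum-of-squares cost; the names train_slp, eta and n_epochs are illustrative assumptions:

```python
import numpy as np

def train_slp(X, y, eta=0.1, n_epochs=100):
    """X: (N, p) training inputs, y: (N,) desired outputs (targets)."""
    N, p = X.shape
    V, v0 = np.zeros(p), 0.0                 # start from zero weights
    for _ in range(n_epochs):
        net = X @ V + v0
        o = 1.0 / (1.0 + np.exp(-net))       # sigmoid outputs
        err = y - o                          # (y_j - o_j)
        # gradient of C w.r.t. V and v0 (sigmoid derivative is o * (1 - o))
        delta = err * o * (1.0 - o)
        grad_V = -(2.0 / N) * X.T @ delta
        grad_v0 = -(2.0 / N) * delta.sum()
        V -= eta * grad_V                    # V_{t+1} = V_t - eta * gradient
        v0 -= eta * grad_v0
    return V, v0
```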
Training the non-linear single layer perceptron on the training data with V_{t+1} = V_t − η × gradient.
Figure: the true cost landscape (minimum at V_ideal) versus the training-data landscape; optimal stopping lies before the training-data minimum is reached.
Early versus late stopping: a general principle (Raudys & Amari, 1998). While training with V_{t+1} = V_t − η × gradient, the optimally stopped weight vector can be written as V_opt = α_opt V_start + (1 − α_opt) V_finish. The majority, who stop too late, end up near V_finish, with lower accuracy.
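A minimal sketch of this interpolation principle; the value of alpha in the usage line is purely hypothetical:

```python
import numpy as np

def interpolate_weights(V_start, V_finish, alpha):
    """Weights 'stopped' between the starting and the fully trained ones:
    V_opt = alpha * V_start + (1 - alpha) * V_finish."""
    return alpha * np.asarray(V_start) + (1.0 - alpha) * np.asarray(V_finish)

# usage: alpha = 1 reproduces the start, alpha = 0 the late-stopping weights
V_opt = interpolate_weights([0.0, 0.0], [2.0, -1.5], alpha=0.3)
```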
Nature inspired learning. Where to use early stopping? Knowledge discovery in very large databases: train on Data Set 1, then Data Set 2, then Data Set 3, but stop training early each time in order to preserve the previously learned information.
Standard sum-of-squares cost function = standard regression: C = (1/N) Σ_j (y_j − f(V^T X_j + v0))^2. We assume that the data are normalized and that the correlations between the input variables x1, x2, …, xp are zero. Then, with the gradient descent training algorithm V_{t+1} = V_t − η × gradient, the components of the vector V become proportional to the correlations between x1, x2, …, xp and y, and we may obtain such a regression after the first iteration.
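The following sketch illustrates this claim numerically, assuming standardized, roughly uncorrelated inputs and (for simplicity) a linear output instead of the sigmoid; the data and parameter values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 5
X = rng.standard_normal((N, p))                  # ~uncorrelated, unit variance
y = X @ np.array([0.8, -0.3, 0.0, 0.5, 0.1]) + 0.1 * rng.standard_normal(N)

eta = 0.5
V0 = np.zeros(p)
grad = -(2.0 / N) * X.T @ (y - X @ V0)           # gradient of the SSE cost at V = 0
V1 = V0 - eta * grad                             # weights after the first iteration

corr = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(p)])
print(V1 / corr)                                 # roughly a constant vector
```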
SLP AS SIX REGRESSIONS
Nature inspired learning. Robust regression. In order to obtain a robust regression, instead of the square function (y_j − V^T X_j)^2 we have to use a "robust function" of the residual y_j − V^T X_j.
Š. Raudys (2000). Evolution and generalization of a single neurone. III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks 13(3/4): 507-523.
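As one possible instance of such a "robust function" (the slides do not fix a specific one), here is a sketch using a Huber-type loss:

```python
import numpy as np

def huber_loss(e, c=1.0):
    """Quadratic for small residuals e = y - V^T X, linear for large ones (outliers)."""
    e = np.asarray(e, dtype=float)
    small = np.abs(e) <= c
    return np.where(small, 0.5 * e ** 2, c * (np.abs(e) - 0.5 * c))

def huber_grad(e, c=1.0):
    """d loss / d e: large deviations get a bounded (clipped) influence."""
    e = np.asarray(e, dtype=float)
    return np.clip(e, -c, c)
```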
A real-world problem: use of robust regression to distinguish a very weak baby (fetal) signal from the mother's ECG. Robust regression pays attention to the smallest deviations, not to the largest ones, which are treated as outliers. Input: mother and fetus ("baby") ECG, two signals. Result: the fetal signal.
Nature inspired learning. Standard and regularized regression. Use "statistical methods" to perform diverse whitening data transformations, in which the input variables x1, x2, …, xp are decorrelated and scaled to have the same variances. Then, while training the perceptron in the transformed feature space (with V_start = 0), we can obtain the standard regression after the very first iteration, which speeds up the calculations (convergence): X_new = T X_old, T = Λ^{-1/2} Φ^T, where S_XX = Φ Λ Φ^T is the eigen (singular value) decomposition of the covariance matrix S_XX. If S_XX is replaced by S_XX + λI, we obtain regularized regression. Moreover, we can equalize the eigenvalues and speed up the training process.
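A sketch of this whitening transformation, with the optional λI term giving the regularized variant; the function name and defaults are assumptions of mine:

```python
import numpy as np

def whitening_transform(X, lam=0.0):
    """Return decorrelated, unit-variance data X_new = X_centred @ T.T and the transform T."""
    Xc = X - X.mean(axis=0)                       # centre the data
    S = np.cov(Xc, rowvar=False) + lam * np.eye(X.shape[1])   # S_XX (+ lambda * I)
    eigval, Phi = np.linalg.eigh(S)               # S = Phi diag(eigval) Phi^T
    T = np.diag(1.0 / np.sqrt(eigval)) @ Phi.T    # T = Lambda^{-1/2} Phi^T
    return Xc @ T.T, T
```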
SLP AS SEVEN STATISTICAL CLASSIFIERS: from the simplest classifier (small weights) towards large weights.
Nature inspired learning. Conditions to obtain the Euclidean distance classifier just after the first iteration. When we train further, we have regularized discriminant analysis (RDA): V_{t+1} = (λ I + S)^{-1} (M1 − M2), where λ = 2/((t − 1) η) is the regularization parameter; λ → 0 with an increase in the number of training iterations, giving the Fisher classifier, or the Fisher classifier with pseudo-inverse of the covariance matrix.
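A sketch of the resulting family of linear classifiers as a function of the regularization parameter λ; the helper name rda_weights is mine:

```python
import numpy as np

def rda_weights(M1, M2, S, lam):
    """Weight vector V = (lam * I + S)^(-1) (M1 - M2):
    very large lam -> (scaled) Euclidean distance classifier,
    lam -> 0       -> Fisher classifier (use a pseudo-inverse if S is singular)."""
    p = len(M1)
    return np.linalg.solve(lam * np.eye(p) + S, np.asarray(M1) - np.asarray(M2))
```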
Nature inspired learning. Standard approach: use the diversity of "statistical methods and multivariate models" to obtain an efficient estimate of the covariance matrix. Then perform the whitening data transformation, in which the input variables are decorrelated and scaled to have the same variances. While training the perceptron in the transformed feature space, we can obtain the Euclidean distance classifier after the first iteration. In the original feature space this corresponds to the Fisher classifier, or to a modification of the Fisher classifier (depending on the method used to estimate the covariance matrix).
Figure: untransformed data, Fisher classifier; transformed data, Euclidean classifier (= Fisher in the original space).
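A small numerical check (my own illustration, with synthetic Gaussian data) that the Euclidean distance classifier computed in the whitened space reproduces the Fisher weight vector of the original space:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], size=200)
X2 = rng.multivariate_normal([1, 1], [[2.0, 0.8], [0.8, 1.0]], size=200)

S = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))   # pooled covariance
eigval, Phi = np.linalg.eigh(S)
T = np.diag(1.0 / np.sqrt(eigval)) @ Phi.T        # whitening transform

V_fisher = np.linalg.solve(S, X1.mean(0) - X2.mean(0))            # Fisher weights
V_euclid_whitened = T @ X1.mean(0) - T @ X2.mean(0)               # Euclidean weights, whitened space
# mapping the whitened-space weights back to the original space: V = T^T V_euclid_whitened
print(np.allclose(T.T @ V_euclid_whitened, V_fisher))             # True
```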
Nature inspired learning. Generalisation errors of the EDC, Fisher and quadratic classifiers.
A real-world problem: dozens of ways are used to estimate the covariance matrix and perform the whitening data transformation. This is "additional information" (if correct) that can be useful in SLP training; here, 196-dimensional data.
S. Raudys, M. Iwamura. Structures of covariance matrix in handwritten character recognition. Lecture Notes in Computer Science, 3138, pp. 725-733, 2004.
S. Raudys, A. Saudargiene. First-order tree-type dependence between variables and classification performance. IEEE Trans. on Pattern Analysis and Machine Intelligence, PAMI-23(2), pp. 233-239, 2001.
When the covariance matrices are different: decision boundaries of the EDC, LDF, QDF and the Anderson-Bahadur (AB) linear DF. The AB and Fisher boundaries differ; if we started from the AB decision boundary rather than from the Fisher one, the result would be better. Hence, we have proposed a special method of input data transformation.
S. Raudys (2004). Integration of statistical and neural methods to design classifiers in case of unequal covariance matrices. Lecture Notes in Artificial Intelligence, Springer-Verlag, Vol. 3238, pp. 270-280.
Non-linear discrimination: similarity features (LNCS 3686, pp. 136-145, 2005).
Figure: 100+100 2D two-class training vectors (pluses and circles) and decision boundaries of kernel discriminant analysis (a), the SV classifier (b), and the SLP trained in the 200D dissimilarity feature space (c). Learning curve: generalization error of the SLP classifier as a function of the number of training epochs, with the optimal stopping point marked (d).
Nature inspired learning. A noise injection. A "coloured" noise is used to form a pseudo-validation set: we add noise in the directions of the closest training vectors, so we almost do not distort the "geometry of the data". In this technique we use "additional information": the space between neighbouring points in the multidimensional feature space is not empty; it is filled by vectors of the same class. The pseudo-validation data set is then used to realize early stopping.
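One way such a coloured-noise injection could be sketched; the neighbourhood size k and the scale are illustrative assumptions, not values from the slides:

```python
import numpy as np

def pseudo_validation_set(X, k=3, scale=0.5, rng=None):
    """Shift each (single-class) training vector towards one of its k nearest
    neighbours by a random fraction, producing a pseudo-validation set."""
    rng = rng or np.random.default_rng()
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(d, np.inf)
    X_new = X.copy()
    for i in range(len(X)):
        j = rng.choice(np.argsort(d[i])[:k])                     # a near neighbour
        X_new[i] += scale * rng.random() * (X[j] - X[i])         # noise along that direction
    return X_new
```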
Nature inspired learning. Multi-category cases: pair-wise classifiers, i.e. optimally stopped (+ noise injection) SLPs, combined by H-T fusion. We need to obtain a classifier (SLP) of optimal complexity: early stopping. A sketch of the pair-wise scheme is given below.
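A simplified sketch of the pair-wise scheme, reusing the train_slp and slp_output sketches from above and using a plain majority vote in place of the H-T fusion; this is illustrative only, not the authors' implementation:

```python
import numpy as np
from itertools import combinations

def train_pairwise(X, labels, classes):
    """One SLP per pair of classes (optimally stopped SLPs in practice)."""
    models = {}
    for a, b in combinations(classes, 2):
        mask = np.isin(labels, [a, b])
        y = (labels[mask] == a).astype(float)      # class a -> 1, class b -> 0
        models[(a, b)] = train_slp(X[mask], y)
    return models

def predict_pairwise(x, models, classes):
    """Majority-vote fusion of the pair-wise decisions for one input vector x."""
    votes = {c: 0 for c in classes}
    for (a, b), (V, v0) in models.items():
        votes[a if slp_output(x, V, v0) > 0.5 else b] += 1
    return max(votes, key=votes.get)
```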
Learning rapidity: two pattern recognition (PR) tasks. The time to learn the second task is restricted, say to 300 training epochs. Parameters that affect learning rapidity: η, the learning step; the weights' growth, controlled by s = target1 − target2; regularization, via (a) a weight decay term, (b) a noise injection into the input vectors, (c) a corruption of the targets; and the scaling of the starting weights, W_start := w × W_start, so w also controls learning rapidity. In short: η, s, and w.
Figure: optimal values of the learning parameters (η, the learning step; s = target1 − target2; and w) as a function of the number of epochs.
Collective learning: a lengthy sequence of diverse PR tasks. The angle and/or the time between two changes vary all the time.
The multi-agent system is composed of adaptive agents, the single layer perceptrons, combining genetic learning with adaptive learning. In order to survive, the agents should learn rapidly; unsuccessful agents are replaced by newborn ones. Inside the group the agents help each other, and in a case of emergency they help the weakest groups. A moral: a single agent (SLP) cannot learn a very long sequence of PR tasks successfully.
The power of the PR task changes and the parameter s = target1 − target2 as a function of time: s follows the variation of the power of the changes. I tried to learn: s, "emotions", "altruism", the noise intensity, the length of the learning set, etc.
Integrating Statistical Methods and Neural Networks: nature inspired learning.
Regression: Neural Networks, 13(3/4), pp. 507-523, 2000.
The theory for the equal covariance matrix case; the theory for unequal covariance matrices and multi-category cases:
LNCS 4432, pp. 1-10, 2007; LNCS 4472, pp. 62-71, 2007; LNCS 4142, pp. 47-56, 2006; LNAI 3238, pp. 270-280, 2004; JMLR; ICNC'08.