340 likes | 537 Views
Basic Models in Theoretical Neuroscience. Oren Shriki 2010. Supervised Learning. Supervised Learning. The learning is supervised by a ‘teacher’. The network is exposed to samples and presents its output for each input. The teacher presents the desired outputs.
E N D
Basic Models in Theoretical Neuroscience Oren Shriki 2010 Supervised Learning
Supervised Learning • The learning is supervised by a ‘teacher’. • The network is exposed to samples and presents its output for each input. • The teacher presents the desired outputs. • The target of learning is to minimize the difference between the network output and the desired output. • Usually, we define an error function and search for the set of connections that give minimal error.
Error Function • A popular approach is to choose a quadratic error function:Error = (desired output – network output)2 • Any deviation results in a positive error. The error is zero only when there is no deviation.
Linear Network We are given P labeled samples: The number of components in each sample is denoted by N. For simplicity, we first consider 2 dimensional inputs:
Linear Network The quadratic error: We are interested in the weights that give minimal error. For a linear network we can solve directly (no need for a learning process).
Linear Network Graphically, the quadratic error has the following form:
Linear Network Input correlations Correlation between input and output
Linear Network • By solving the equations we obtain the optimal weights. • How will the network perform with new examples? What will be its ability to generalize? • We expect the generalization ability to grow with the number of samples in the training set. • When there are not enough samples (# of samples < # of parameters) there are more variables than equations and there are infinitely many solutions.
Linear Network For N=2 and P=1, the error has the following form:
Generalization vs. Training Error • We define the following types of errors: • Training error:The mean error on the training set. • Generalization error:The mean error over all possible samples.
# of training samples to # of learned parameters ratio Generalization vs. Training Error Typically, the graphs have the following form: Generalization error for the optimal set from the training phase Training error
The learned parameter How to construct a learning algorithm to reduce the error? • Suppose the error as a function of the learned parameter has the following form: The idea: we update the parameter in the direction opposite to the sign of the derivative – gradient descent.
Types of Learning from Samples • Batch LearningThe weights are changed only after seeing a batch of samples. The samples can be recycled. • On-line LearningA learning step is performed after seeing each sample. Samples cannot be recycled. This type is more similar to natural learning.
Online Learning in the Linear Network The gradient descent learning rule:
Online Learning in the Linear Network • The parameter η (eta) represents the learning rate (LR). • When the LR is large, the learning process is fast but inaccurate. • When the LR is small, the learning process is typically more accurate but slow.
When the LR is small the learning is smooth Error Learned parameter
Large LR makes the learning noisy and can cause jumps from one minimum to another Error Learned parameter
Online Learning in the Linear Network Example: MATLAB Demo
Effect of LR in a Simple System • We next analyze a simple example for the effect of the LR on the convergence of the learning process. • The input samples are numbers drawn from a Gaussian random number generator. • For instance:9.5674, 8.3344, 10.1253, 10.2877, 8.8535, 11.1909, 11.1892, 9.9624, 10.3273, 10.1746 …
Effect of LR in a Simple System • Target: predict the next number. • Input: x. • Learned parameter: W.
Effect of LR in a Simple System • The quadratic error: • The gradient descent learning rule:
Effect of LR in a Simple System MATLAB Demo
Effect of LR in a Simple System • How can the results be explained? • We shall perform “convergence in the mean” analysis. • (Board work)
Effect of LR in a Simple System Conclusions from the analysis: • At critical values of the LR there are qualitative transitions in the nature of the learning process. • When the LR is too large, the learning will not converge although each step is performed in the right direction.
Changing the LR with Time • In order to tradeoff speed and accuracy we can reduce the LR with time. • Initially, when the LR is large, the learning process find the deep regions of the error landscape. • The decrease in the LR allows the network to eventually settle in one of the local minima. • If the LR goes down too fast, the learning will not be able to sample the relevant parameter space. To prevent this, the LR is usually taken to be proportional to 1 over the time step (recall the Harmonic series):
Life as Online Learning • During our life the plasticity of our brains changes. • Typically, we are more plastic as kids and over the years our plasticity goes down.
How to deal with more complex problems? • In most real-world problems, the functions to be learned involve non-linearities, and thus non-linear neurons have to be used. • In addition, many interesting problems require multilayered networks.
The Back-Propagation Algorithm (Board work)
Examples of Applications • Robotics (image processing, walking, navigation) • Prediction of economical processes • Character recognition • Medicine (diagnosis, prediction of outcome) • …
Digit Recognition http://yann.lecun.com/exdb/lenet/index.html
Digit Recognition (when the curve is parameterized) Inbal Pinto, Israeli Arts and Science Academy