
Last lecture summary



  1. Last lecture summary

  2. Multilayer perceptron • MLP, the most famous type of neural network • Layers: input layer, hidden layer, output layer

  3. Processing by one neuron • inputs • weights • bias • activation function • output

  4. Linear activation functions • linear • threshold (w∙x > 0 vs. w∙x ≤ 0)

  5. Nonlinear activation functions • logistic (sigmoid, unipolar) • tanh (bipolar)
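
The processing done by one neuron, with the four activation functions named on the slides above, can be sketched roughly as follows. The function names and the 1/0 output values of the threshold unit are assumptions made for illustration; the slides only give the conditions w∙x > 0 and w∙x ≤ 0.

```python
import math


def threshold(a):
    # Assumed outputs: 1 when the weighted sum is positive, 0 otherwise.
    return 1.0 if a > 0 else 0.0


def linear(a):
    # Identity: the neuron outputs its weighted sum unchanged.
    return a


def logistic(a):
    # Unipolar sigmoid, output in (0, 1).
    return 1.0 / (1.0 + math.exp(-a))


def tanh(a):
    # Bipolar sigmoid, output in (-1, 1).
    return math.tanh(a)


def neuron(inputs, weights, bias, activation):
    # Weighted sum of the inputs plus bias, passed through the activation.
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(a)
```

For example, `neuron([1.0, 2.0], [0.5, -0.25], 0.0, logistic)` has a weighted sum of zero, so the logistic unit sits at its midpoint.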

  6. Backpropagation training algorithm • MLP is trained by backpropagation. • forward pass • present a training sample to the neural network • calculate the error (MSE) in each output neuron • backward pass • first calculate the gradient for the hidden-to-output weights • then calculate the gradient for the input-to-hidden weights • the hidden-to-output gradient (grad_hidden-output) is needed to calculate the input-to-hidden gradient (grad_input-hidden) • update the weights in the network
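
One backpropagation step for a tiny network might look like the sketch below. It assumes logistic hidden neurons, a single linear output neuron, no bias terms, and the per-sample error 0.5·(y − target)²; none of these choices come from the slides, and all names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, W2):
    # W1[j][i]: weight from input i to hidden neuron j.
    # W2[j]: weight from hidden neuron j to the single output neuron.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hj for w, hj in zip(W2, h))  # linear output (assumption)
    return h, y


def backprop_step(x, target, W1, W2, lr=0.1):
    # Forward pass: present the sample and compute the output error.
    h, y = forward(x, W1, W2)
    err = y - target  # derivative of 0.5 * (y - target)**2 w.r.t. y

    # Backward pass, part 1: gradient for the hidden-to-output weights.
    grad_W2 = [err * hj for hj in h]

    # Backward pass, part 2: gradient for the input-to-hidden weights.
    # It reuses err and W2, i.e. information from the output layer.
    delta_h = [err * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    grad_W1 = [[d * xi for xi in x] for d in delta_h]

    # Update all weights with plain gradient descent.
    new_W1 = [[w - lr * g for w, g in zip(wr, gr)]
              for wr, gr in zip(W1, grad_W1)]
    new_W2 = [w - lr * g for w, g in zip(W2, grad_W2)]
    return new_W1, new_W2, grad_W1, grad_W2
```

The gradients returned alongside the updated weights make it easy to compare them against finite differences when checking the derivation.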

  7. Input signal propagates forward; error propagates backward.

  8. Momentum • Online learning vs. batch learning • Batch learning improves the stability by averaging. • Another averaging approach providing stability is using the momentum (μ). • μ (between 0 and 1) indicates the relative importance of the past weight change ∆w_(m-1) on the new weight increment ∆w_m.
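
A momentum update for a single weight could be sketched as below. The exact update rule is not given on the slide; the form ∆w_m = μ·∆w_(m-1) − (1 − μ)·lr·grad is an assumption chosen because it matches the description of μ as a relative importance. `lr`, `grad`, and the function name are hypothetical.

```python
def momentum_step(w, grad, prev_delta, lr=0.1, mu=0.9):
    # New increment: a blend of the previous weight change and the
    # current gradient step, weighted by mu and (1 - mu) respectively.
    delta = mu * prev_delta - (1.0 - mu) * lr * grad
    return w + delta, delta
```

With `mu=0` this reduces to plain gradient descent, and with `mu=1` the current gradient is ignored and the previous weight change is simply repeated.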

  9. Other improvements • Delta-Bar-Delta (Turboprop) • Each weight has its own learning rate β. • Second order methods • Hessian matrix (how fast does the rate of increase of the function change in a small neighborhood? → curvature) • QuickProp, Gauss-Newton, Levenberg-Marquardt • fewer epochs, but computationally expensive (Hessian inverse, storage)

  10. New stuff

  11. Bias-variance • Just a small reminder • bias (lack of fit, underfitting) – model does not fit the data well enough, is not flexible enough (too few parameters) • variance (overfitting) – model is too flexible (too many parameters), fits noise • bias-variance tradeoff – improving the generalization ability of the model (i.e. finding the correct amount of flexibility)

  12. Parameters in MLP: weights • If you use one more hidden neuron, the number of weights increases by how much? • # input neurons + # output neurons • If an MLP is used for a regression task, be careful! • To use an MLP in a statistically correct way, the number of degrees of freedom (i.e. weights) can’t exceed the number of data points. • Compare with the polynomial regression example from the 2nd lecture
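
The weight count for a single-hidden-layer MLP can be checked with a small sketch. The function name and the optional `with_bias` flag are assumptions; the slide's answer (# input neurons + # output neurons) corresponds to counting connection weights only.

```python
def count_weights(n_inputs, n_hidden, n_outputs, with_bias=False):
    # Connection weights: input-to-hidden plus hidden-to-output.
    weights = n_inputs * n_hidden + n_hidden * n_outputs
    if with_bias:
        # One bias per hidden neuron and per output neuron (assumption).
        weights += n_hidden + n_outputs
    return weights


def fits_data(n_inputs, n_hidden, n_outputs, n_data_points):
    # Rule of thumb from the slide: weights must not exceed data points.
    return count_weights(n_inputs, n_hidden, n_outputs) <= n_data_points
```

Going from 5 to 6 hidden neurons with 4 inputs and 2 outputs adds 4 + 2 = 6 weights, or 7 if the new neuron's bias is counted as well.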

  13. Improving generalization of MLP • Flexibility comes from hidden neurons. • Choose the # of hidden neurons so that neither underfitting nor overfitting occurs. • Three most common approaches: • exhaustive search • early stopping • regularization

  14. Exhaustive search • Increase the number of hidden units and monitor the performance on the validation data set. • [Figure; axis: number of neurons]

  15. Early stopping • A fixed and large number of neurons is used. • The network is trained while its performance on a validation set is tested at regular intervals. • The weights at the minimum of the validation error are taken as the correct weights. • [Figure; axis: epochs]
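
The early-stopping idea can be sketched on a recorded sequence of validation errors. The `patience` parameter (how many checks without improvement are tolerated before stopping) is an assumption, not part of the slide; in a real training loop the weights would be saved each time the best error improves.

```python
def early_stopping_epoch(val_errors, patience=3):
    # val_errors[i] is the validation error measured at check i.
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            # New minimum of the validation error: remember it.
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` checks: stop training.
            break
    return best_epoch, best_err
```

Because training stops once the patience runs out, a later and lower validation error is never seen; that is the price of not training to the end.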

  16. Weight decay • Idea: keep the growth of weights to a minimum in such a way that non-important weights are pulled toward zero • Only the important weights are allowed to grow, others are forced to decay • regularization

  17. This is achieved not by minimizing MSE, but by minimizing W = MSE + δ Σ_{j=1..m} w_j² • second term – regularization term • m – number of weights in the network • δ – regularization parameter • the larger the δ, the more important the regularization
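
A weight-decay criterion can be sketched as follows, assuming the common form MSE + δ·Σ w_j² over all m weights. This exact form is an assumption; the function names are hypothetical.

```python
def regularized_loss(errors, weights, delta):
    # First term: mean squared error over the given per-sample errors.
    mse = sum(e * e for e in errors) / len(errors)
    # Second term: regularization, delta times the sum of squared weights.
    penalty = delta * sum(w * w for w in weights)
    return mse + penalty


def decayed_gradient(mse_grad, w, delta):
    # The penalty adds 2 * delta * w to the gradient of each weight,
    # which pulls weights toward zero unless the MSE gradient resists.
    return mse_grad + 2.0 * delta * w
```

A weight whose MSE gradient is near zero is driven only by the `2 * delta * w` term, so it shrinks toward zero; that is the "decay" in the name.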

  18. Network pruning • Both early stopping and weight decay use all weights in the NN. They do not reduce the complexity of the model. • Network pruning – reduce complexity by keeping only essential weights/neurons. • Several pruning approaches, e.g. • optimal brain damage (OBD) • optimal brain surgeon (OBS) • optimal cell damage (OCD)

  19. Radial Basis Function Networks

  20. Radial Basis Function (RBF) Network • Becoming an increasingly popular neural network. • Is probably the main rival to the MLP. • Takes a completely different approach, viewing the design of a neural network as an approximation problem in high-dimensional space. • Uses radial functions as activation functions.

  21. Gaussian RBF • Typical radial function is the Gaussian RBF. • Response decreases with distance from a central point. • Parameters: • center c • width (radius r)
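
A Gaussian RBF with center c and radius r can be sketched as below. The normalization exp(−‖x − c‖² / (2r²)) is an assumption; it is chosen because it is consistent with the "2r² = 1" setting used in the XOR example later in the lecture.

```python
import math


def gaussian_rbf(x, center, r):
    # Squared Euclidean distance between the input and the center.
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    # Response is 1 at the center and decreases with distance.
    return math.exp(-dist_sq / (2.0 * r * r))
```

At a distance of exactly r from the center the response has dropped to exp(−0.5), roughly 0.61, and it keeps falling toward zero further out.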

  22. Local vs. global units • Local • they are localized (i.e., non-zero) only in a certain part of the space • Gaussian • Global • sigmoid, linear

  23. [Figure: MLP vs. RBF] Pavel Kordík, Data Mining lecture, FEL, ČVUT, 2009

  24. RBFN architecture • Each of the n components of the input vector x feeds forward to m basis functions, whose outputs h are linearly combined with weights w (i.e. the dot product h∙w) into the network output f(x). • [Figure: input layer x1 … xn (no weights), hidden layer of RBFs h1 … hm, weights W1 … Wm, output layer f(x); Pavel Kordík, Data Mining lecture, FEL, ČVUT, 2009]

  25. [Figure with two summation (Σ) units; Pavel Kordík, Data Mining lecture, FEL, ČVUT, 2009]

  26. The basic architecture of an RBF network is a 3-layer network. • The input layer is simply a fan-out layer and does no processing. • The hidden layer performs a non-linear mapping from the input space into a (usually) higher-dimensional space in which the patterns become linearly separable. • The output layer performs a simple weighted sum of the hidden-layer outputs (i.e. w∙h). • If the RBFN is used for regression, this output is fine. • However, if pattern classification is required, a hard-limiter or sigmoid function can be placed on the output neurons to give 0/1 output values.
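
The three layers described above can be sketched as a forward pass. The Gaussian form exp(−‖x − c‖² / (2r²)), the absence of an output bias, and the 0.5 cut-off of the hard limiter are all assumptions made for illustration.

```python
import math


def rbfn_output(x, centers, radii, weights):
    # Input layer: pure fan-out, x is passed to every hidden unit as is.
    # Hidden layer: one Gaussian RBF per (center, radius) pair.
    h = [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * r * r))
        for c, r in zip(centers, radii)
    ]
    # Output layer: weighted sum of the hidden outputs (regression output).
    return sum(w * hj for w, hj in zip(weights, h))


def rbfn_classify(x, centers, radii, weights, cutoff=0.5):
    # Hard limiter on the output for a 0/1 class label (cutoff is assumed).
    return 1 if rbfn_output(x, centers, radii, weights) > cutoff else 0
```

For regression the raw value of `rbfn_output` is used directly; `rbfn_classify` only adds the hard limiter on top.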

  27. Clustering • The unique feature of the RBF network is the process performed in the hidden layer. • The idea is that the patterns in the input space form clusters. • If the centres of these clusters are known, then the distance from the cluster centre can be measured.

  28. Furthermore, this distance measure is made non-linear, so that a pattern in an area close to a cluster centre gives a value close to 1. • Beyond this area, the value drops dramatically. • The notion is that this area is radially symmetrical around the cluster centre; hence the non-linear function is known as the radial basis function. • [Figure: non-linearly transformed distance vs. distance from the center of the cluster]

  29. RBFN for classification • [Figure: two summation (Σ) output units, one for Category 1 and one for Category 2]

  30. RBFN for regression

  31. XOR problem • [Figure: XOR problem; values 0 and 1]

  32. XOR problem • 2 inputs x1, x2, 2 hidden units, one output • The parameters of the hidden neurons are set as • center: c1 = <0,0>, c2 = <1,1> • radius: r is chosen such that 2r² = 1 • [Figure: inputs x1, x2 feed hidden units φ1, φ2 with outputs h1, h2]

  33. [Figure: the points 0,0; 0,1; 1,0; 1,1 shown in the input space and in the feature space] • When mapped into the feature space <h1, h2>, the two classes become linearly separable. So a linear classifier with h1(x) and h2(x) as inputs can be used to solve the XOR problem. • The linear classifier is represented by the output layer.
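
The XOR mapping can be worked through numerically. With c1 = <0,0>, c2 = <1,1> and 2r² = 1, and assuming the Gaussian form exp(−‖x − c‖² / (2r²)), each hidden output reduces to exp(−‖x − c‖²). The separating line h1 + h2 = 0.95 is one possible choice and is an assumption, not taken from the slides.

```python
import math

C1 = (0.0, 0.0)
C2 = (1.0, 1.0)


def hidden(x):
    # With 2 * r**2 = 1 the Gaussian reduces to exp(-||x - c||**2).
    h1 = math.exp(-((x[0] - C1[0]) ** 2 + (x[1] - C1[1]) ** 2))
    h2 = math.exp(-((x[0] - C2[0]) ** 2 + (x[1] - C2[1]) ** 2))
    return h1, h2


def xor_rbf(x, threshold=0.95):
    # Linear classifier in the feature space <h1, h2>:
    # points with h1 + h2 above the threshold belong to class 0.
    h1, h2 = hidden(x)
    return 0 if h1 + h2 > threshold else 1
```

In feature space, (0,0) and (1,1) map to points with h1 + h2 ≈ 1.135, while (0,1) and (1,0) both map to the same point with h1 + h2 ≈ 0.736, so a single straight line separates the two classes.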

  34. RBF Learning • Design decision • number of hidden neurons • max of neurons = number of input patterns • min of neurons = to be determined • more neurons – more complex, smaller tolerance • Parameters to be learnt • centers • radii • A hidden neuron is more sensitive to data points near its center. This sensitivity may be tuned by adjusting the radius. • smaller radius → fits training data better (overfitting) • larger radius → less sensitivity, less overfitting, network of smaller size, faster execution • weights between hidden and output layers

  35. Learning can be divided into two independent tasks: • Center and radii determination • Learning of output layer weights • Learning strategies for RBF parameters • Sample center position randomly from the training data • Self-organized selection of centers • Both layers are learnt using supervised learning

  36. Select centers at random • Choose centers randomly from the training set. • Radius r is calculated as • Weights are found by means of a numerical linear algebra approach. • Requires a large training set for a satisfactory level of performance.
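
The random-centers strategy can be sketched as below. The radius is passed in by the caller rather than computed, the Gaussian form exp(−‖x − c‖² / (2r²)) is assumed, and the "numerical linear algebra approach" is taken here to mean least squares via the normal equations, which is only one possible reading. All names are hypothetical.

```python
import math
import random


def design_matrix(X, centers, r):
    # H[t][j]: response of hidden unit j to training sample t.
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2.0 * r * r))
             for c in centers] for x in X]


def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square
    # system A w = b (assumes A is non-singular).
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, n):
            f = M[i][col] / M[col][col]
            for k in range(col, n + 1):
                M[i][k] -= f * M[col][k]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def fit_rbfn(X, y, n_centers, r, seed=0):
    # Centers are sampled at random (without replacement) from the data.
    rng = random.Random(seed)
    centers = rng.sample(X, n_centers)
    H = design_matrix(X, centers, r)
    # Least squares via the normal equations (H^T H) w = H^T y.
    HtH = [[sum(H[t][i] * H[t][j] for t in range(len(X)))
            for j in range(n_centers)] for i in range(n_centers)]
    Hty = [sum(H[t][i] * y[t] for t in range(len(X)))
           for i in range(n_centers)]
    return centers, solve(HtH, Hty)
```

When every training point is used as a center (the maximum number of hidden neurons), the fit interpolates the training targets exactly; with fewer centers it becomes a least-squares approximation.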

  37. Self-organized selection of centers • centers are selected using the k-means clustering algorithm • radii are usually found using k-NN • find the k nearest centers • The root-mean-square distance between the current cluster centre and its k (typically 2) nearest neighbours is calculated, and this is the value chosen for r. • The output layer is learnt using a gradient descent technique
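
The radius rule described above can be sketched as follows, assuming the centers have already been found (for example by k-means, which is not shown). The function name is hypothetical, and at least two centers are required.

```python
import math


def radii_from_neighbours(centers, k=2):
    radii = []
    for i, c in enumerate(centers):
        # Squared distances from this center to every other center.
        sq_dists = sorted(
            sum((a - b) ** 2 for a, b in zip(c, other))
            for j, other in enumerate(centers) if j != i
        )
        nearest = sq_dists[:k]
        # Root-mean-square distance to the k nearest neighbouring centers.
        radii.append(math.sqrt(sum(nearest) / len(nearest)))
    return radii
```

For the centers (0,0), (3,4), (6,8) and k = 2, the middle center has both neighbours at distance 5, so its radius is 5; the outer two get a larger radius because their second neighbour is at distance 10.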

  38. Supervised learning • Supervised learning of all parameters (centers, radii, weights) using gradient descent. • Mathematical formulas exist for updating all of these parameters. They are not shown here; I don’t want to scare you more than necessary.

  39. RBFN and MLP • An RBFN trains faster than an MLP. • Although the RBFN is quick to train, it is slower in retrieving than an MLP. • RBFNs are essentially well-established statistical techniques being presented as neural networks. Learning mechanisms in statistical neural networks are not biologically plausible. • An RBFN can give an “I don’t know” answer. • An RBFN constructs local approximations to a non-linear I/O mapping. An MLP constructs global approximations to a non-linear I/O mapping.
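
One way an RBFN might produce an “I don’t know” answer is sketched below: because the hidden units are local, an input far from every center activates none of them appreciably, and the network can abstain. The `min_activation` threshold and the use of `None` as the abstain value are assumptions for illustration, as is the Gaussian form exp(−‖x − c‖² / (2r²)).

```python
import math


def rbfn_predict_or_abstain(x, centers, r, weights, min_activation=0.1):
    # Hidden-layer responses of the local (Gaussian) units.
    h = [
        math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * r * r))
        for c in centers
    ]
    # No unit responds noticeably: x lies outside the region covered by
    # the training data, so the network answers "I don't know".
    if max(h) < min_activation:
        return None
    return sum(w * hj for w, hj in zip(weights, h))
```

An MLP with global units has no comparable signal, since a sigmoid unit keeps producing a confident output arbitrarily far from the training data.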
