CSM6120 Introduction to Intelligent Systems

yoshe

Presentation Transcript


  1. CSM6120 Introduction to Intelligent Systems: Neural Networks

  2. Sub-symbolic learning • When we use some sort of rule-based system, generally we understand the rules • We understand the conclusions it draws, because it can tell us • When a system learns from such rules, processes it in an understood way, and comes up with an understandable result, it is known as symbolic learning

  3. Sub-symbolic learning • Some systems of learning – which we are coming to – operate in quite a different way • Sub-symbolic learning is learning where we don't really understand or have control over the process by which it comes to its conclusion • Though we could, with considerable effort and vast amounts of time, find out • This includes neural networks, genetic algorithms, genetic programming and to some extent, statistical methods

  4. Artificial neural networks • ANNs for short • We use the word “Artificial” to distinguish them from biological neural networks • They come in various shapes and sizes • Inputs (variables), outputs (results) • May be one or more outputs

  5. Neural networks (applications) • Face recognition • Event recognition, tracking • Time series prediction • Process control • Optical character recognition • Handwriting recognition • Robotics, movement • Games: car control, etc • Etc…

  6. ANNs statistical? • We all think of ANNs as pure AI, right? • There is much published literature which claims that ANNs are really no more than non-linear regression models, and claims that they can be implemented using standard statistical software • But there is a difference: randomness

  7. Neurons • Original ANN research aimed at modelling networks of real neurons in the brain • 10^10 neurons in the brain – unrealistic target! • Brain is massively parallel • One job shared between many neurons • If one goes wrong, no big deal • Known as distributed processing • Fault-tolerant

  8. Patterns within data • ANNs learn to recognise patterns within sets of data • Fault tolerance helps – can still cope with unexpected input examples, placing emphasis only on what it considers important

  9. The neuron metaphor • Neurons • Accept information from multiple inputs • Transmit information to other neurons • Multiply inputs by weights along edges • Apply some function to the set of inputs at each node

  10. ANN topology

  11. Nodes (neurons, or perceptrons) • Each input node brings into the network the value of one independent variable • The hidden layer nodes do most of the work • The output node(s) give us our result

  12. Types of neuron • Linear neuron • Logistic neuron • Perceptron

  13. Activation functions • Transforms neuron’s input into output • Features of activation functions: • A squashing effect is required • Prevents accelerating growth of activation levels through the network • Simple and easy to calculate, differentiable...?
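
The three neuron types named on the previous slide differ only in the activation function applied to the weighted sum of inputs. The following is a minimal sketch of that idea, not the lecturer's code; the function names and the `>= threshold` convention for the perceptron are assumptions made for illustration.

```python
import math

def linear(z):
    """Linear neuron: the output is the weighted sum itself (no squashing)."""
    return z

def logistic(z):
    """Logistic neuron: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def step(z, threshold=0.0):
    """Perceptron: hard threshold, output is 1 or 0 (assumed >= convention)."""
    return 1 if z >= threshold else 0

# Compare the three activations on a few weighted sums.
for z in (-5.0, 0.0, 5.0):
    print(z, linear(z), round(logistic(z), 3), step(z))
```

Only the logistic function is both squashing and differentiable everywhere, which is why it matters later for gradient-based training.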

  14. Perceptrons • Formal neuron (McCulloch & Pitts, 1943) • Summation of inputs and threshold functions • Showed how any logical function could be computed with simple model neurons

  15. Perceptrons • Learning by weight adjustment • Output is 1 if the weighted sum of inputs is greater than a given threshold, otherwise output is 0 • An extra weight (bias) is used as a way of adjusting the threshold

  16. Alternative notation • The threshold (for these examples) is always 0 • If the sum of the weighted inputs >= 0 then P = 1, else P = 0 • [Figure: perceptron P with bias input X = 1 (weight w0) and inputs A (weight w1) and B (weight w2)]

  17. Example • If the inputs are binary (0 or 1) then the neuron units act like a conventional logic module • In this example, P = A or B • Equates to: if 2A + B – 0.5 >= 0 then output 1, else output 0 • [Figure: perceptron P with bias input X = 1 (weight -0.5) and inputs A (weight 2) and B (weight 1)]

  18. Example 2 • What is being computed here? • [Figure: perceptron P with bias input X = 1 (weight -2.5) and inputs A (weight 2) and B (weight 1)] • By changing only the bias, we can model another logical function: AND
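
The two examples above can be checked by enumerating the truth table. This is a small sketch using the weights from the slides (2 on A, 1 on B) and the two bias weights (-0.5 and -2.5); the function name `perceptron` is a hypothetical helper.

```python
def perceptron(inputs, weights, bias_weight):
    """Output 1 if bias_weight * 1 + sum(w * x) >= 0, otherwise 0."""
    total = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

WEIGHTS = (2, 1)   # weights on A and B, as on the slides
OR_BIAS = -0.5     # bias weight from Example 1
AND_BIAS = -2.5    # bias weight from Example 2

or_table = {}
and_table = {}
for a in (0, 1):
    for b in (0, 1):
        or_table[(a, b)] = perceptron((a, b), WEIGHTS, OR_BIAS)
        and_table[(a, b)] = perceptron((a, b), WEIGHTS, AND_BIAS)
        print(a, b, or_table[(a, b)], and_table[(a, b)])
```

Only the bias differs between the two units, which shifts the decision threshold without changing the orientation of the boundary.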

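  19. Example 3 • What is being computed here? • [Figure: network with inputs A, B and C, units P and Q, and two bias inputs X = 1; weights in order of appearance: -1.5, 2, 2, -2, -0.5, -3, 1, 2.5]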

  20. Standard backpropagation • The most common learning rule for ANNs • Made popular by Rumelhart and McClelland in 1986 • Connections between nodes are given random initial weights • We therefore get a value at the output node(s) which is what happens when these random weights are applied to the data at the input • Use the difference between the output and correct values to adjust weights

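  21. Training perceptrons • Truth table for AND (A, B → P): (0, 0) → 0; (0, 1) → 0; (1, 0) → 0; (1, 1) → 1 • [Figure: perceptron P with bias input X = -1 and inputs A and B; all three weights unknown (?)] • What are the weight values? • Initialize with random weights

  22. Training perceptrons • Truth table for AND (A, B → P): (0, 0) → 0; (0, 1) → 0; (1, 0) → 0; (1, 1) → 1 • [Figure: perceptron P with bias input -1 (weight 0.3) and inputs A (weight 0.5) and B (weight -0.4)]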

  23. Learning algorithm • Epoch: Presentation of the entire training set to the neural network • In the case of the AND function an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]) • Error: The amount by which the value output by the network differs from the target value (target minus output) • For example, if we required the network to output 0 and it output a 1, then Error = -1
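
As an illustration of one epoch and its errors, the sketch below presents the four AND patterns to a perceptron with the initial weights shown on the previous slide. It assumes the figure is read as a bias input fixed at -1 with weight 0.3, weight 0.5 on A and weight -0.4 on B, and it assumes the "sum >= 0" output convention; both are interpretations, not something the transcript states outright.

```python
# AND training set: one epoch presents all four patterns once.
TRAINING_SET = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

BIAS_INPUT = -1              # assumed fixed bias input
WEIGHTS = (0.3, 0.5, -0.4)   # assumed order: bias weight, weight on A, weight on B

def output(a, b, weights=WEIGHTS):
    """Perceptron output under the assumed 'weighted sum >= 0' convention."""
    w0, w1, w2 = weights
    total = w0 * BIAS_INPUT + w1 * a + w2 * b
    return 1 if total >= 0 else 0

def epoch_errors(training_set):
    """Return (target - output) for every pattern in one epoch."""
    return [target - output(a, b) for (a, b), target in training_set]

errors = epoch_errors(TRAINING_SET)
print(errors)  # [0, 0, -1, 1]
```

With these starting weights the network is wrong on two of the four patterns: it fires on [1,0] (error -1) and fails to fire on [1,1] (error +1).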

  24. Learning algorithm • Error • The weightings on the connections are then adjusted to try to reduce this error • Further iterations of weight adjustment proceed until we decide we’re finished • And that is a study in itself • Epochs • There may be many thousands of epochs in one training run, before a satisfactory model is achieved • But how many do we need? • The BIG question (or one of them!)

  25. Perceptron learning rule • Weights are updated: $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$ • $t$ is the target value • $o$ is the perceptron output • $\eta$ is a small constant (e.g. 0.12) called the learning rate • For each training example: • If the output is correct ($t = o$) the weights $w_i$ are not changed • If the output is incorrect ($t \neq o$) the weights $w_i$ are changed such that the output of the perceptron for the new weights is closer to $t$ • The algorithm converges to the correct classification • If the training data is linearly separable • And $\eta$ is sufficiently small
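
The update rule can be sketched as a short training loop on the AND data. The starting weights (0.3, 0.5, -0.4), the fixed bias input of -1, the learning rate 0.1 and the cap of 100 epochs are all assumptions chosen for illustration; the rule itself, weight change = learning rate × (t - o) × input, is the one on the slide.

```python
TRAINING_SET = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
BIAS_INPUT = -1  # assumed fixed bias input

def predict(weights, a, b):
    """Perceptron output: 1 if the weighted sum is >= 0, else 0."""
    total = sum(w * x for w, x in zip(weights, (BIAS_INPUT, a, b)))
    return 1 if total >= 0 else 0

def train_perceptron(weights, training_set, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until an epoch has no mistakes."""
    weights = list(weights)
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for (a, b), t in training_set:
            o = predict(weights, a, b)
            if o != t:
                mistakes += 1
                for i, x in enumerate((BIAS_INPUT, a, b)):
                    weights[i] += eta * (t - o) * x
        if mistakes == 0:          # whole epoch classified correctly
            return weights, epoch
    return weights, max_epochs

weights, epochs_used = train_perceptron([0.3, 0.5, -0.4], TRAINING_SET)
print(weights, epochs_used)
```

Because AND is linearly separable, the loop is expected to stop after a handful of epochs; on data that is not linearly separable it would simply run until `max_epochs`.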

  26. Search space • It’s often useful to think of ANN training as a marble rolling across a surface, which has valleys in it • Some valleys may be deeper than others – we want to find a deep one • When we find it, we stop training Local minimum Global minimum

  27. Gradient descent learning rule • Consider a linear unit without threshold and with continuous output $o$ (not just 0, 1 as with perceptrons) • $o = w_0 + w_1 x_1 + \dots + w_n x_n$ • We train the $w_i$ such that they minimise the squared error • $E[w_0, \dots, w_n] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $D$ is the set of training examples

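  28. Gradient descent • Gradient: $\nabla E[\mathbf{w}] = [\partial E / \partial w_0, \dots, \partial E / \partial w_n]$ • Training rule: $\Delta \mathbf{w} = -\eta \nabla E[\mathbf{w}]$, i.e. $\Delta w_i = -\eta \, \partial E / \partial w_i$ • [Figure: error surface showing one descent step from $(w_1, w_2)$ to $(w_1 + \Delta w_1, w_2 + \Delta w_2)$]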
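
For the linear unit, the partial derivatives in the training rule can be worked out explicitly. The derivation below is a standard one, written under the assumption that $x_{i,d}$ denotes input $i$ of training example $d$ and that $x_{0,d} = 1$ for the bias weight $w_0$; that indexing notation is introduced here and is not on the slides.

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
  &= \sum_{d \in D}(t_d - o_d)\,\frac{\partial}{\partial w_i}
     \Bigl(t_d - \sum_{j=0}^{n} w_j x_{j,d}\Bigr) \\
  &= -\sum_{d \in D}(t_d - o_d)\,x_{i,d} \\[4pt]
\Delta w_i &= -\eta\,\frac{\partial E}{\partial w_i}
  = \eta \sum_{d \in D}(t_d - o_d)\,x_{i,d}
\end{align*}
```

The final line has the same form as the perceptron rule, except that $o_d$ is the continuous linear output and the contributions are summed over all training examples.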

  29. Gradient descent (linear units) • Gradient-descent(training_examples, $\eta$) • Each training example is a pair of the form $\langle (x_1, \dots, x_n), t \rangle$, where $(x_1, \dots, x_n)$ is the vector of input values, $t$ is the target output value, and $\eta$ is the learning rate (e.g. 0.1)
      • Initialize each $w_i$ to some small random value
      • Until the termination condition is met:
          • Initialize each $\Delta w_i$ to zero
          • For each $\langle (x_1, \dots, x_n), t \rangle$ in training_examples:
              • Input the instance $(x_1, \dots, x_n)$ to the linear unit and compute the output $o$
              • For each linear unit weight $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$ (update deltas)
          • For each linear unit weight $w_i$: $w_i \leftarrow w_i + \Delta w_i$ (update weights)
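
The procedure above can be sketched in plain Python. This is an illustrative version under stated assumptions: the termination condition is a fixed number of epochs, the toy data follow t = 1 + 2x exactly, and the names `gradient_descent`, `EXAMPLES` and the seed are hypothetical.

```python
import random

def gradient_descent(training_examples, eta, epochs=2000, seed=0):
    """Batch gradient descent for a single linear unit (fixed epoch count assumed)."""
    rng = random.Random(seed)
    n_inputs = len(training_examples[0][0])
    # w[0] is the bias weight w0; its input is fixed at 1.
    w = [rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        delta = [0.0] * len(w)                 # initialise each delta to zero
        for x, t in training_examples:
            xs = (1.0, *x)
            o = sum(wi * xi for wi, xi in zip(w, xs))   # linear unit output
            for i, xi in enumerate(xs):
                delta[i] += eta * (t - o) * xi          # update deltas
        w = [wi + di for wi, di in zip(w, delta)]       # update weights
    return w

# Toy data generated from t = 1 + 2x (an assumed example, noise-free).
EXAMPLES = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = gradient_descent(EXAMPLES, eta=0.05)
print(w)  # expected to be very close to [1.0, 2.0]
```

The learning rate matters here: on this data a value much above roughly 0.1 makes the summed updates overshoot and diverge.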

  30. Incremental (stochastic) gradient descent • Two approaches to this: • Batch mode = gradient descent over the entire data $D$ • Incremental mode = gradient descent over individual training examples $d$ • Incremental gradient descent can approximate batch gradient descent arbitrarily closely if $\eta$ is small enough
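
The claim that the two modes agree for small learning rates can be checked numerically. The sketch below runs one epoch of each mode from the same starting weights and measures how far apart the results are; the toy data (t = 1 + 2x) and all function names are assumptions made for the illustration.

```python
EXAMPLES = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]

def output(w, x):
    """Linear unit: w[0] is the bias weight, w[1:] multiply the inputs."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def batch_epoch(w, examples, eta):
    """Accumulate all deltas with the weights held fixed, then update once."""
    delta = [0.0] * len(w)
    for x, t in examples:
        err = t - output(w, x)
        for i, xi in enumerate((1.0, *x)):
            delta[i] += eta * err * xi
    return [wi + di for wi, di in zip(w, delta)]

def incremental_epoch(w, examples, eta):
    """Update the weights immediately after each training example."""
    w = list(w)
    for x, t in examples:
        err = t - output(w, x)
        for i, xi in enumerate((1.0, *x)):
            w[i] += eta * err * xi
    return w

def gap(eta):
    """Largest weight difference between the two modes after one epoch."""
    start = [0.0, 0.0]
    b = batch_epoch(start, EXAMPLES, eta)
    s = incremental_epoch(start, EXAMPLES, eta)
    return max(abs(bi - si) for bi, si in zip(b, s))

gaps = [gap(eta) for eta in (0.1, 0.01, 0.001)]
print(gaps)
```

The gap shrinks much faster than the learning rate itself, since the two modes differ only in terms of second order in the learning rate.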

  31. Perceptron vs gradient descent rule • Perceptron learning rule guaranteed to succeed if • Training examples are linearly separable • Sufficiently small learning rate $\eta$ • Gradient descent learning for linear units • Guaranteed to converge to hypothesis with minimum squared error • Given sufficiently small learning rate $\eta$ • Even when training data contains noise

  32. Decision boundaries • In simple cases, divide feature space by drawing a hyperplane across it • Known as a decision boundary • Discriminant function: returns different values on opposite sides (straight line) • Problems which can be thus classified are linearly separable

  33. Linear separability • [Figure: points of classes A and B plotted on axes X1 and X2, with a decision boundary drawn between the classes]

  34. Decision surface of a perceptron • [Figure: two plots of + and - points on axes x1 and x2, one linearly separable and one non-linearly separable] • Perceptron is able to represent some useful functions • AND(x1, x2): choose weights w0 = -1.5, w1 = 1, w2 = 1 • But functions that are not linearly separable (e.g. XOR) are not representable – we need multilayer perceptrons here

  35. Multilayer networks • Cascade neurons together • The output from one layer is the input to the next • Each layer has its own sets of weights • Hidden layer(s)…

  36. Linear regression neural networks • What happens when we arrange linear neurons in a multilayer network?

  37. Linear regression neural networks • …nothing special happens • The composition (matrix product) of two linear transformations is itself a linear transformation
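
A small numeric check makes the point concrete: passing an input through two linear layers gives the same result as one layer whose weight matrix is the product of the two. The matrices below are arbitrary illustrative values and biases are omitted for brevity (an assumption; with biases the stack is still a single affine map).

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # layer 1: 2 inputs -> 3 linear neurons
W2 = [[1.0, -2.0, 0.5]]                      # layer 2: 3 inputs -> 1 linear neuron
x = [2.0, -1.0]

two_layer = matvec(W2, matvec(W1, x))        # feed x through both layers
collapsed = matvec(matmul(W2, W1), x)        # single equivalent layer
print(two_layer, collapsed)                  # [0.5] [0.5]
```

However many linear layers are stacked, the network can only represent what a single linear neuron layer could.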

  38. Neural networks • We want to introduce non-linearities to the network • Non-linearities allow a network to identify complex regions in space

  39. Linear separability • 1-layer cannot handle XOR • More layers can handle more complicated spaces – but require more parameters • Each node splits the feature space with a hyperplane • If the second layer is AND, a 2-layer network can represent any convex hull
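
As a sketch of how a second layer handles XOR, the network below uses two hidden perceptrons, each splitting the input space with one line, and an AND unit on top that keeps only the region between the two lines. The hidden-unit weights are hand-picked assumptions for illustration; the AND weights (w0 = -1.5, w1 = 1, w2 = 1) are the ones quoted on an earlier slide.

```python
def perceptron(inputs, weights, bias_weight):
    """Output 1 if bias_weight + sum(w * x) >= 0, otherwise 0."""
    total = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0

def xor_network(a, b):
    # Hidden layer: two hyperplanes (hand-picked weights, an assumption).
    h_or = perceptron((a, b), (1, 1), -0.5)      # 0 only when both inputs are 0
    h_nand = perceptron((a, b), (-1, -1), 1.5)   # 0 only when both inputs are 1
    # Second layer: AND of the two half-planes.
    return perceptron((h_or, h_nand), (1, 1), -1.5)

table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
print(table)
```

The band between the two hidden-unit lines is a convex region, which is exactly what an AND second layer can carve out.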

  40. [Figure: table comparing network structures (single-layer, two-layer, three-layer) on three criteria: the exclusive-OR problem, classes with meshed regions, and most general region shapes; each cell shows decision regions for classes A and B]

  41. Feed-forward networks • Predictions are fed forward through the network to classify

  47. Error backpropagation • Error backpropagation solves the gradient for each partial component separately • The target values for each layer come from the next layer • This feeds the errors back along the network

  48. Backpropagation (BP) algorithm • BP employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs • Two stage learning • Forward stage: calculate outputs given input pattern • Backward stage: update weights by calculating deltas
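
The two stages can be sketched for a tiny network with one hidden layer of sigmoid units. Everything specific here is an assumption for illustration: the XOR task, three hidden units, a learning rate of 0.5, 5000 epochs, per-example weight updates and the random seed. Whether this particular run reaches a near-zero error is not guaranteed, since gradient descent can settle in a local minimum.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward stage: return the hidden activations and the network output."""
    xs = [1.0] + list(x)                                  # bias input first
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xs))) for ws in w_hidden]
    out = sigmoid(sum(w * v for w, v in zip(w_out, [1.0] + hidden)))
    return hidden, out

def train(examples, n_hidden=3, eta=0.5, epochs=5000, seed=1):
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    history = []                                          # squared error per epoch
    for _ in range(epochs):
        sq_error = 0.0
        for x, t in examples:
            hidden, out = forward(x, w_hidden, w_out)
            sq_error += 0.5 * (t - out) ** 2
            # Backward stage: deltas for the output unit, then the hidden units.
            delta_out = (t - out) * out * (1.0 - out)
            delta_hidden = [h * (1.0 - h) * w_out[j + 1] * delta_out
                            for j, h in enumerate(hidden)]
            for j, v in enumerate([1.0] + hidden):
                w_out[j] += eta * delta_out * v
            xs = [1.0] + list(x)
            for j, ws in enumerate(w_hidden):
                for i, v in enumerate(xs):
                    ws[i] += eta * delta_hidden[j] * v
        history.append(sq_error)
    return w_hidden, w_out, history

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
w_hidden, w_out, history = train(XOR)
print(history[0], history[-1])
```

The hidden deltas are computed from the output delta before the output weights change, which is how the error is fed back along the network.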

  49. Termination conditions for BP • The weight update loop may be iterated thousands of times in a typical application • The choice of termination condition is important because • Too few iterations can fail to reduce error sufficiently • Too many iterations can lead to overfitting the training data (see later!) • Termination criteria • After a fixed number of iterations (epochs) • Once the training error falls below some threshold • Once the validation error meets some criterion
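
The three termination criteria can be combined into one check that a training loop calls after every epoch. This is a sketch only: the function name, the default limits and in particular the validation rule (stop when the validation error has not improved for `patience` epochs) are assumptions, since the slide does not say which validation criterion is meant.

```python
def should_stop(epoch, train_error, val_history,
                max_epochs=10000, train_threshold=0.01, patience=20):
    """Return a reason string if training should stop, otherwise None.

    epoch is zero-based; val_history holds one validation error per epoch so far.
    """
    # Criterion 1: a fixed number of iterations (epochs).
    if epoch + 1 >= max_epochs:
        return "max epochs reached"
    # Criterion 2: training error below some threshold.
    if train_error < train_threshold:
        return "training error below threshold"
    # Criterion 3 (assumed rule): no validation improvement in the last
    # `patience` epochs compared with the best value seen before them.
    if len(val_history) > patience:
        best_recent = min(val_history[-patience:])
        best_before = min(val_history[:-patience])
        if best_recent >= best_before:
            return "validation error stopped improving"
    return None

# Example: validation error bottomed out at 0.5, then stayed at 0.6.
val_errors = [1.0, 0.5] + [0.6] * 20
print(should_stop(21, 1.0, val_errors))
```

Checking validation error rather than training error is what guards against the overfitting risk mentioned above.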

  50. Why BP works in practice: a possible scenario • Weights are initialized to values near zero • Early gradient descent steps will represent a very smooth function (approximately linear). Why? • The sigmoid function is almost linear when the total input (weighted sum of inputs to a sigmoid unit) is near 0 • The weights gradually move close to the global minimum • As weights grow in the later stages of learning, they represent highly non-linear network functions • Gradient steps in this later stage move toward local minima in this region, which is acceptable
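
The near-linearity of the sigmoid around zero can be seen by comparing it with its tangent line at 0, which is 0.5 + z/4 (the sigmoid's value at 0 is 0.5 and its slope there is 0.25). The sketch below prints both for a few inputs; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tangent_at_zero(z):
    """First-order (linear) approximation of the sigmoid around z = 0."""
    return 0.5 + z / 4.0

for z in (0.1, 0.5, 1.0, 3.0):
    gap = abs(sigmoid(z) - tangent_at_zero(z))
    print(z, round(sigmoid(z), 4), round(tangent_at_zero(z), 4), round(gap, 4))
```

For small total input the two are almost indistinguishable, so a network with near-zero weights behaves like a linear model; the gap only becomes large once the weights, and hence the total inputs, have grown.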
