CSM6120 Introduction to Intelligent Systems: Neural Networks
Sub-symbolic learning • When we use some sort of rule-based system, generally we understand the rules • We understand the conclusions it draws, because it can tell us • When a system learns such rules, processes them in an understood way, and comes up with an understandable result, this is known as symbolic learning
Sub-symbolic learning • Some systems of learning – which we are coming to – operate in quite a different way • Sub-symbolic learning is learning where we don't really understand or have control over the process by which it comes to its conclusion • Though we could, with considerable effort and vast amounts of time, find out • This includes neural networks, genetic algorithms, genetic programming and to some extent, statistical methods
Artificial neural networks • ANNs for short • We use the word “Artificial” to distinguish them from biological neural networks • They come in various shapes and sizes • Inputs (variables), outputs (results) • May be one or more outputs
Neural networks (applications) • Face recognition • Event recognition, tracking • Time series prediction • Process control • Optical character recognition • Handwriting recognition • Robotics, movement • Games: car control, etc • Etc…
ANNs statistical? • We all think of ANNs as pure AI, right? • Much of the published literature claims that ANNs are really no more than non-linear regression models, which can be implemented using standard statistical software • But there is a difference: randomness
Neurons • Original ANN research aimed at modelling networks of real neurons in the brain • ~10^10 neurons in the brain – an unrealistic target! • Brain is massively parallel • One job shared between many neurons • If one goes wrong, no big deal • Known as distributed processing • Fault-tolerant
Patterns within data • ANNs learn to recognise patterns within sets of data • Fault tolerance helps – can still cope with unexpected input examples, placing emphasis only on what it considers important
The neuron metaphor • Neurons • Accept information from multiple inputs • Transmit information to other neurons • Multiply inputs by weights along edges • Apply some function to the set of inputs at each node
Nodes (neurons, or perceptrons) • Each input node brings into the network the value of one independent variable • The hidden layer nodes do most of the work • The output node(s) give us our result
Types of neuron • Linear neuron • Logistic neuron • Perceptron
Activation functions • Transforms neuron’s input into output • Features of activation functions: • A squashing effect is required • Prevents accelerating growth of activation levels through the network • Simple and easy to calculate, differentiable...?
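As a quick illustration (not from the original slides), the logistic sigmoid below shows the squashing effect: any real-valued input is mapped into (0, 1), and the derivative is simple to calculate from the output itself.

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative s(x) * (1 - s(x)) -- simple and easy to calculate."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Large activations are squashed towards 0 or 1, preventing
# accelerating growth of activation levels through the network
for x in (-10, -1, 0, 1, 10):
    print(x, round(sigmoid(x), 4))
```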
Perceptrons • Formal neuron (McCulloch & Pitts, 1943) • Summation of inputs and threshold functions • Showed how any logical function could be computed with simple model neurons
Perceptrons • Learning by weight adjustment • Output is 1 if the weighted sum of inputs is greater than a given threshold, otherwise output is 0 • An extra weight (bias) is used as a way of adjusting the threshold
Alternative notation • The threshold (for these examples) is always = 0 • If the sum of the weighted inputs >= 0 then P = 1, else P = 0 [Diagram: inputs A and B with weights w1 and w2, a fixed bias input X = 1 with weight w0, output P]
Example • If the inputs are binary (0 or 1) then the neuron units act like a conventional logic module • In this example, P = A or B • Equates to: if 2A + B - 0.5 >= 0 then output 1, else output 0 [Diagram: bias input X = 1 with weight -0.5, input A with weight 2, input B with weight 1, output P]
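A minimal Python sketch of this unit, using the weights from the slide (bias weight -0.5 on the fixed input X = 1, weight 2 on A, weight 1 on B); the function name is illustrative:

```python
def perceptron_or(a, b):
    """Threshold unit from the slide: output 1 when 2A + B - 0.5 >= 0."""
    weighted_sum = 2 * a + 1 * b - 0.5   # bias input X = 1 with weight -0.5
    return 1 if weighted_sum >= 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_or(a, b))   # reproduces the OR truth table
```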
Example 2 • What is being computed here? [Diagram: bias input X = 1 with weight -2.5, input A with weight 2, input B with weight 1, output P] • By only changing the bias, we can model another logical function: P = A AND B
Example 3 • What is being computed here? [Diagram: a two-layer network of threshold units with inputs A, B and C, an intermediate unit P and an output unit Q, with the weights shown on the slide]
Standard backpropagation • The commonest learning rule for ANNs • Made popular by Rumelhart and McClelland in 1986 • Connections between nodes given random initial weights • We therefore get a value at the output node(s) which is what happens when these random weights are applied to the data at the input • Use the difference between the output and correct values to adjust weights
Training perceptrons • What are the weight values? • Initialize with random weights [Diagram: bias input X = -1, inputs A and B, output P; all three weights unknown (?)]
For AND:
A B P
0 0 0
0 1 0
1 0 0
1 1 1
Training perceptrons [Diagram: bias input X = -1 with weight 0.3, input A with weight 0.5, input B with weight -0.4, output P]
For AND:
A B P
0 0 0
0 1 0
1 0 0
1 1 1
Learning algorithm • Epoch: Presentation of the entire training set to the neural network • In the case of the AND function an epoch consists of four sets of inputs being presented to the network (i.e. [0,0], [0,1], [1,0], [1,1]) • Error: The error value is the amount by which the value outputted by the network differs from the target value • For example, if we required the network to output 0 and it outputted a 1, then Error = -1
Learning algorithm • Error • The weightings on the connections are then adjusted to try to reduce this error • Further iterations of weight adjustment proceed until we decide we’re finished • And that is a study in itself • Epochs • There may be many thousands of epochs in one training run, before a satisfactory model is achieved • But how many do we need? • The BIG question (or one of them!)
Perceptron learning rule • Weights are updated: wi = wi + Δwi, where Δwi = η (t - o) xi • t is the target value • o is the perceptron output • η is a small constant (e.g. 0.12) called the learning rate • For each training example: • If the output is correct (t = o) the weights wi are not changed • If the output is incorrect (t ≠ o) the weights wi are changed such that the output of the perceptron for the new weights is closer to t • The algorithm converges to the correct classification • If the training data is linearly separable • And η is sufficiently small
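A runnable sketch of this rule, training the unit from the previous slides on AND; the bias input x0 = -1 and initial weights [0.3, 0.5, -0.4] follow the diagram above, while eta = 0.12 and the epoch cap are illustrative choices:

```python
# Perceptron learning rule: wi <- wi + eta * (t - o) * xi
training_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = [0.3, 0.5, -0.4]        # [w0 (on bias input), wA, wB] from the diagram
eta = 0.12                  # learning rate (illustrative)

for epoch in range(100):
    errors = 0
    for (a, b), t in training_data:
        o = 1 if -w[0] + w[1] * a + w[2] * b >= 0 else 0
        for i, x in enumerate((-1, a, b)):    # x0 = -1 is the bias input
            w[i] += eta * (t - o) * x         # no change when t == o
        errors += int(o != t)
    if errors == 0:                           # a perfect epoch: converged
        break

print(epoch, w)
```

Because AND is linearly separable, the loop is guaranteed to reach an error-free epoch.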
Search space • It’s often useful to think of ANN training as a marble rolling across a surface, which has valleys in it • Some valleys may be deeper than others – we want to find a deep one • When we find it, we stop training [Diagram: error surface with a local minimum and a deeper global minimum]
Gradient descent learning rule • Consider linear unit without threshold and continuous output o (not just 0,1 as with perceptrons) • o = w0 + w1 x1 + … + wn xn • We train the wi such that they minimise the squared error • E[w1,…,wn] = ½ Σd∈D (td - od)² where D is the set of training examples
Gradient descent • Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn] • Training rule: Δw = -η ∇E[w], i.e. Δwi = -η ∂E/∂wi [Diagram: the error surface over (w1, w2), with one descent step from (w1, w2) to (w1 + Δw1, w2 + Δw2)]
Gradient descent (linear units) Gradient-descent(training_examples, η) Each training example is a pair of the form <(x1,…,xn), t>, where (x1,…,xn) is the vector of input values, t is the target output value, and η is the learning rate (e.g. 0.1) • Initialize each wi to some small random value • Until the termination condition is met: • Initialize each Δwi to zero • For each <(x1,…,xn), t> in training_examples: • Input the instance (x1,…,xn) to the linear unit and compute the output o • For each linear unit weight wi: Δwi = Δwi + η (t - o) xi (update deltas) • For each linear unit weight wi: wi = wi + Δwi (update weights)
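The same algorithm as a runnable Python sketch, here fitting a linear unit to a toy dataset; the data, learning rate and epoch count are illustrative assumptions, not values from the slides:

```python
import random

def gradient_descent(training_examples, eta=0.1, epochs=1000):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn."""
    n = len(training_examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # small random init
    for _ in range(epochs):
        delta = [0.0] * (n + 1)               # initialize each delta_wi to zero
        for x, t in training_examples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            for i, xi in enumerate((1.0,) + tuple(x)):   # x0 = 1 for the bias
                delta[i] += eta * (t - o) * xi           # update deltas
        for i in range(n + 1):
            w[i] += delta[i]                             # update weights
    return w

# Toy data drawn from t = 2*x1 + 1 (illustrative)
data = [((x,), 2 * x + 1) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
print(gradient_descent(data, eta=0.05))   # weights approach [1, 2]
```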
Incremental stochastic gradient descent • Two approaches to this: • Batch mode = gradient descent over the entire data D • Incremental mode = gradient descent over individual training examples d • Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough
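In code, the only difference between the two modes is where the weight update happens; a small sketch on a toy linear unit (the data and eta are illustrative):

```python
# Same toy linear unit in both modes; data drawn from t = 2*x + 1
D = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
eta = 0.05

def output(w, x):
    return w[0] + w[1] * x[0]

# Batch mode: accumulate deltas over the entire data D, update once per epoch
w = [0.0, 0.0]
for _ in range(2000):
    delta = [0.0, 0.0]
    for x, t in D:
        err = t - output(w, x)
        delta[0] += eta * err
        delta[1] += eta * err * x[0]
    w = [w[0] + delta[0], w[1] + delta[1]]
print("batch:      ", [round(v, 3) for v in w])

# Incremental mode: update the weights after every individual example d
w = [0.0, 0.0]
for _ in range(2000):
    for x, t in D:
        err = t - output(w, x)
        w[0] += eta * err
        w[1] += eta * err * x[0]
print("incremental:", [round(v, 3) for v in w])
```

With this small eta both runs converge to roughly the same weights, near [1, 2].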
Perceptron vs gradient descent rule Perceptron learning rule guaranteed to succeed if • Training examples are linearly separable • Sufficiently small learning rate Gradient descent learning for linear units • Guaranteed to converge to hypothesis with minimum squared error • Given sufficiently small learning rate • Even when training data contains noise
Decision boundaries • In simple cases, divide feature space by drawing a hyperplane across it • Known as a decision boundary • Discriminant function: returns different values on opposite sides of the boundary (in the simplest case, a straight line) • Problems which can be classified in this way are linearly separable
Linear separability [Scatter plot: examples of classes A and B in the (X1, X2) plane, separated by a straight decision boundary]
Decision surface of a perceptron [Plots: a linearly separable arrangement of + and - points in the (x1, x2) plane, and a non-linearly separable one] • Perceptron is able to represent some useful functions • AND(x1, x2): choose weights w0 = -1.5, w1 = 1, w2 = 1 • But functions that are not linearly separable (e.g. XOR) are not representable – we need multilayer perceptrons here
Multilayer networks • Cascade neurons together • The output from one layer is the input to the next • Each layer has its own sets of weights • Hidden layer(s)…
Linear regression neural networks • What happens when we arrange linear neurons in a multilayer network?
Linear regression neural networks • …nothing special happens • The composition of two linear transformations is itself a linear transformation
Neural networks • We want to introduce non-linearities to the network • Non-linearities allow a network to identify complex regions in space
Linear separability • 1-layer cannot handle XOR • More layers can handle more complicated spaces – but require more parameters • Each node splits the feature space with a hyperplane • If the second layer is AND, a 2-layer network can represent any convex region (see the XOR sketch below)
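As a concrete illustration, here is one hand-picked set of weights (an assumption for illustration, not from the slides) for a two-layer threshold network that computes XOR: one hidden unit computes OR, the other NAND, and the output unit ANDs them:

```python
def step(x):
    """Threshold activation: 1 if the weighted sum is non-negative."""
    return 1 if x >= 0 else 0

def xor_net(a, b):
    """Hidden unit h1 computes OR, h2 computes NAND; output ANDs them."""
    h1 = step(a + b - 0.5)        # OR(a, b)
    h2 = step(-a - b + 1.5)       # NAND(a, b)
    return step(h1 + h2 - 1.5)    # AND(h1, h2) = XOR(a, b)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))   # reproduces the XOR truth table
```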
Separability [Classic figure: decision region capability by network structure, shown for the exclusive-OR problem, classes with meshed regions, and the most general region shapes – Single-Layer: regions bounded by a single hyperplane; Two-Layer: convex regions; Three-Layer: arbitrary region shapes]
Feed-forward networks • Predictions are fed forward through the network to classify
Error backpropagation • Error backpropagation computes the gradient one partial component (layer) at a time • The target values for each layer come from the next layer • This feeds the errors back along the network
Backpropagation (BP) algorithm • BP employs gradient descent to attempt to minimize the squared error between the network output values and the target values for these outputs • Two stage learning • Forward stage: calculate outputs given input pattern • Backward stage: update weights by calculating deltas
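A minimal sketch of these two stages for a tiny one-hidden-layer sigmoid network trained on XOR; the 2-2-1 architecture, learning rate and epoch count are illustrative assumptions, and an unlucky random initialisation can still land in a local minimum (rerun if the outputs do not separate):

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# 2 inputs -> 2 hidden sigmoid units -> 1 sigmoid output (weights incl. biases)
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]
eta = 0.5
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for _ in range(10000):
    for (x1, x2), t in data:
        # Forward stage: calculate outputs given the input pattern
        xs = (1.0, x1, x2)                   # leading 1.0 is the bias input
        h = [sigmoid(sum(w * x for w, x in zip(ws, xs))) for ws in w_hid]
        hs = (1.0, h[0], h[1])
        o = sigmoid(sum(w * x for w, x in zip(w_out, hs)))
        # Backward stage: update weights by calculating deltas
        d_out = o * (1 - o) * (t - o)        # output delta (sigmoid derivative)
        d_hid = [h[j] * (1 - h[j]) * w_out[j + 1] * d_out for j in range(2)]
        for j in range(3):
            w_out[j] += eta * d_out * hs[j]
        for j in range(2):
            for k in range(3):
                w_hid[j][k] += eta * d_hid[j] * xs[k]

# After training, outputs should be near 0, 1, 1, 0
for (x1, x2), t in data:
    xs = (1.0, x1, x2)
    h = [sigmoid(sum(w * x for w, x in zip(ws, xs))) for ws in w_hid]
    o = sigmoid(sum(w * x for w, x in zip(w_out, (1.0, h[0], h[1]))))
    print(x1, x2, round(o, 2))
```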
Termination conditions for BP • The weight update loop may be iterated thousands of times in a typical application • The choice of termination condition is important because • Too few iterations can fail to reduce error sufficiently • Too many iterations can lead to overfitting the training data (see later!) • Termination criteria • After a fixed number of iterations (epochs) • Once the training error falls below some threshold • Once the validation error meets some criterion
Why BP works in practice: a possible scenario • Weights are initialized to values near zero • Early gradient descent steps will represent a very smooth function (approximately linear). Why? • The sigmoid function is almost linear when the total input (weighted sum of inputs to a sigmoid unit) is near 0 • The weights gradually move close to the global minimum • As weights grow in the later stages of learning, they represent highly non-linear network functions • Gradient steps in this later stage move toward local minima in this region, which is acceptable