AI – Week 23 Sub-symbolic AI Multi-Layer Neural Networks

Neural Networks Lee McCluskey, room 3/10 Email lee@hud.ac.uk http://scom.hud.ac.uk/scomtlm/cha2555/

RECAP: Simple Model of an Artificial Neuron (McCulloch and Pitts 1943) • A set of synapses (i.e. connections) brings in activations (inputs) from other neurons. • A processing unit sums the inputs x weights, and then applies a transfer function using a “threshold value” to see if the neuron “fires”. • An output line transmits the result to other neurons (output can be binary or continuous). If the sum does not reach the threshold, output is 0.
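
The recap above can be sketched in a few lines of Python. This is only an illustration: the function name `neuron_fires` and the AND-unit weights and threshold are assumptions chosen for the example, not values from the lecture.

```python
def neuron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


# Hypothetical example: a two-input unit that behaves like logical AND.
AND_WEIGHTS = [1.0, 1.0]
AND_THRESHOLD = 1.5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, neuron_fires([x1, x2], AND_WEIGHTS, AND_THRESHOLD))
```

Here the output is binary; a continuous output would replace the final comparison with a smooth transfer function.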

Another Look at XOR • We showed in a previous lecture that the XOR truth table cannot be realised using a single-layer perceptron network, because it is not linearly separable. • Multi-layer networks / multi-layer perceptrons (MLNs) are able to deal with non-linearly separable problems. • We can use an MLN to classify the XOR data using two separating lines (and the step function).

Constructing the Network Consider the following feed forward, fully connected 2-2-1 network: [Figure: 2-2-1 network diagram] • In the hidden layer we are constructing the two required separating lines, i.e., I1 = -x1 - x2 + 1.5 and I2 = -x1 - x2 + 0.5. • In the output layer we are combining the information into a single output.

Evaluating the Network We can calculate the activations of the hidden layer for the network: I1 = -x1 - x2 + 1.5, I2 = -x1 - x2 + 0.5. [Table columns: Inputs | Total input to hidden layer | Output from hidden layer]
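
The evaluation can be reproduced with a short Python sketch. The hidden-layer lines I1 and I2 are taken from the slide, and the step function fires when its input exceeds 0. The output-layer weights (h1 - h2 - 0.5) are NOT given in the source; they are an assumption, chosen here as one of several settings that combine the two hidden outputs into XOR.

```python
def step(value):
    """Step activation: fire (1) only if the value exceeds 0."""
    return 1 if value > 0 else 0


def xor_network(x1, x2):
    """Evaluate the 2-2-1 network; returns (hidden inputs, hidden outputs, output)."""
    # Hidden layer: the two separating lines from the slide.
    i1 = -x1 - x2 + 1.5
    i2 = -x1 - x2 + 0.5
    h1, h2 = step(i1), step(i2)
    # Output layer: ASSUMED weights (not in the source). Fires when h1 = 1 and h2 = 0.
    out = step(h1 - h2 - 0.5)
    return (i1, i2), (h1, h2), out


print("x1 x2 |   I1    I2 | h1 h2 | out")
for x1 in (0, 1):
    for x2 in (0, 1):
        (i1, i2), (h1, h2), out = xor_network(x1, x2)
        print(f" {x1}  {x2} | {i1:4.1f}  {i2:4.1f} |  {h1}  {h2} |  {out}")
```

The point (x1, x2) lies between the two lines exactly when h1 fires and h2 does not, which is the XOR region.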

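Perceptrons To determine whether the jth output node should fire, we calculate the value Ij = Σi wij xi + bj, i.e. the weighted sum of its inputs plus a bias term (as in I1 and I2 above). If this value exceeds 0 the neuron will fire; otherwise it will not fire.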

Multi-layer Perceptrons (MLPs) • In general, MLPs use the sigmoid activation function: f(x) = 1/(1 + e^(-x)) • The sigmoid function is mathematically more “user friendly” than the step function, because it is smooth and can be differentiated everywhere. • Due to the asymptotic nature of the sigmoid function it is unrealistic to expect values of 0 and 1 to be realised exactly. It is usual to relax the output requirements to target values of 0.1 and 0.9. • By adopting the sigmoid function with a more complex architecture, the multi-layer perceptron is able to solve complicated problems.
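
A small Python sketch illustrates why the targets are relaxed: the sigmoid approaches 0 and 1 but only reaches them in the limit, whereas 0.1 and 0.9 correspond to modest net inputs. The helper name `input_for_target` is an assumption introduced for this example.

```python
import math


def sigmoid(x):
    """Standard sigmoid, 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def input_for_target(target):
    """Net input at which the sigmoid equals `target` (for 0 < target < 1)."""
    return math.log(target / (1.0 - target))


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"f({x:+.1f}) = {sigmoid(x):.6f}")

# The relaxed targets are reached at finite net inputs.
print("input giving 0.1:", input_for_target(0.1))
print("input giving 0.9:", input_for_target(0.9))
```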

Backpropagation learning • Pseudo code:
Assume all weights have been initialised randomly in [-1, 1]
REPEAT
    NOCHANGES = TRUE
    FOR each input pattern
        Perform a forward sweep to find the actual output
        Calculate network errors tj - oj
        IF any |tj - oj| > TOLERANCE THEN set NOCHANGES = FALSE
        DO BACKPROPAGATION to determine weight changes
        Update weights
UNTIL NOCHANGES
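
The control flow of the pseudo code can be sketched in Python. To keep the example short, a single sigmoid unit stands in for the whole network, so the "backpropagation" step reduces to the output-node rule. The OR training set, the learning rate, the tolerance of 0.2, the `max_epochs` safety cap and the fixed random seed are all assumptions made for this illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(patterns, tolerance=0.2, rate=0.5, max_epochs=10000, seed=0):
    """Train one sigmoid unit; returns (weights, epochs_used).

    `weights` holds one weight per input plus a final bias weight.
    """
    rng = random.Random(seed)
    n_inputs = len(patterns[0][0])
    # Weights initialised randomly in [-1, 1].
    weights = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]
    for epoch in range(1, max_epochs + 1):            # REPEAT
        no_changes = True
        for inputs, target in patterns:               # for each input pattern
            xb = list(inputs) + [1.0]                 # bias input fixed at 1
            output = sigmoid(sum(w * v for w, v in zip(weights, xb)))
            error = target - output                   # network error t - o
            if abs(error) > tolerance:
                no_changes = False
            delta = error * output * (1.0 - output)   # output-node delta
            weights = [w + rate * delta * v for w, v in zip(weights, xb)]
        if no_changes:                                # UNTIL NOCHANGES
            break
    return weights, epoch


# Hypothetical training set: logical OR with relaxed targets 0.1 / 0.9.
OR_PATTERNS = [((0, 0), 0.1), ((0, 1), 0.9), ((1, 0), 0.9), ((1, 1), 0.9)]

trained_weights, epochs_used = train(OR_PATTERNS)
print("epochs used:", epochs_used)
print("weights:", trained_weights)
```

The `max_epochs` cap is not part of the pseudo code; it is added so that the loop ends even if the tolerance is never met.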

The Backpropagation Algorithm • The change to make to a weight, called Δwij, is obtained by “gradient descent”. It is based on the “delta value” δj for an output node j, which represents the error at output j. • It is defined by • δj = “difference between output required and output observed” times “gradient of the threshold function” • = (tj - oj) * df/dx • where f is the threshold (sigmoid) function 1/(1 + e^(-x)) and oj = f(input) is the output at j. • Hence (after differentiating, df/dx = f(x) * (1 - f(x))) • δj = (tj - oj) * oj * (1 - oj) • for an output node j.
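
The differentiation step can be written out as follows, assuming the standard sigmoid 1/(1 + e^(-x)) as the threshold function:

```latex
\begin{align*}
f(x) &= \frac{1}{1 + e^{-x}} \\
\frac{df}{dx} &= \frac{e^{-x}}{(1 + e^{-x})^{2}}
  = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
  = f(x)\,\bigl(1 - f(x)\bigr) \\
\delta_j &= (t_j - o_j)\,\frac{df}{dx}
  = (t_j - o_j)\, o_j\, (1 - o_j), \qquad o_j = f(\text{input}_j)
\end{align*}
```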

The Backpropagation Algorithm • So for a weight wij into an output node j: (new weight) wij = (old weight) wij’ + “learning rate” * oi * δj • And similarly for a weight wij into a hidden node j: (new weight) wij = (old weight) wij’ + “learning rate” * oi * δj, where the delta for a hidden node is δj = oj * (1 - oj) * (sum over k of wjk * δk) and each k is a node that receives a link output from j.
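
One weight update for a single pattern can be sketched in Python as below. This is a minimal illustration, not the lecture's own code: the hidden-node delta includes the factor oj * (1 - oj) from the standard derivation, each weight row carries a trailing bias weight (an assumption about representation), and the starting weights in the demo are made up. The input and target values are borrowed from the example inputs elsewhere in these slides only for convenience.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Forward sweep. Each weight row ends with a bias weight (input fixed at 1)."""
    xb = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    outputs = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return hidden, outputs


def backprop_step(x, targets, w_hidden, w_out, rate):
    """Return updated (w_hidden, w_out) after one pattern; inputs are not modified."""
    hidden, outputs = forward(x, w_hidden, w_out)
    # Output deltas: (t - o) * o * (1 - o)
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(targets, outputs)]
    # Hidden deltas: o_j * (1 - o_j) * sum over k of w_jk * delta_k
    delta_hidden = [
        h * (1.0 - h) * sum(w_out[k][j] * delta_out[k] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    xb = list(x) + [1.0]
    hb = hidden + [1.0]
    # New weight = old weight + learning rate * (input to the link) * delta
    new_w_out = [[w + rate * d * v for w, v in zip(row, hb)]
                 for row, d in zip(w_out, delta_out)]
    new_w_hidden = [[w + rate * d * v for w, v in zip(row, xb)]
                    for row, d in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out


# Hypothetical 2-2-1 network: made-up starting weights, one training pattern.
demo_hidden = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.2]]
demo_out = [[0.5, -0.6, 0.05]]
demo_x, demo_t = [0.1, 0.5], [0.9]

print("output before:", forward(demo_x, demo_hidden, demo_out)[1])
demo_hidden2, demo_out2 = backprop_step(demo_x, demo_t, demo_hidden, demo_out, rate=0.5)
print("output after: ", forward(demo_x, demo_hidden2, demo_out2)[1])
```

Updating after every pattern, as here, matches the per-pattern loop in the pseudo code; accumulating the changes over all patterns before updating is a common alternative.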

Hidden Layers and Hidden Nodes • The question of how many hidden layers to use and how many nodes each layer should contain needs to be addressed. • Consider first an m-1-n network with m input nodes, n output nodes and just a single node in the hidden layer. This produces m+n weights (ignoring bias weights). It is useful to regard the weights as being degrees of freedom in the network. • Adding a second node to the hidden layer doubles the freedom in the network, producing 2(m+n) weights. • It is not difficult to see the effect that adding a single node has on the size of the problem to be solved: each extra hidden node adds another m+n weights. • An m-k-n MLP will produce k(m+n) = km + kn degrees of freedom in the network.
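
The count k(m+n) is easy to check with a small Python helper. The function name `weight_count` and the optional `include_bias` flag are assumptions added for illustration; the slide's formula corresponds to `include_bias=False`.

```python
def weight_count(n_inputs, n_hidden, n_outputs, include_bias=False):
    """Number of weights in a fully connected network with one hidden layer."""
    count = n_hidden * (n_inputs + n_outputs)   # k(m + n) = km + kn
    if include_bias:
        # Assumption: one extra bias weight per hidden node and per output node.
        count += n_hidden + n_outputs
    return count


for k in (1, 2, 4, 8):
    print(f"3-{k}-2 network: {weight_count(3, k, 2)} weights")
```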

Hidden Layers and Hidden Nodes • If we assume that training time is proportional to the number of weights in the network, then we can see a need to balance effectiveness (reasonable accuracy) with efficiency (reasonable training time). • A good “rule of thumb” is to start with and increase the number of nodes in the hidden layer if the network has trouble training; experience counts for a lot here. • It is only when we have tried everything else and failed (i.e., changing the number of hidden nodes, the activation function, the data scaling, etc.) that further hidden layers are added.

Conclusion: RL v ANN • Types of learning: • ANN - learning by example, supervised learning • RL - learning by observation, low level cognition • Characterisation of applications: • ANN - learning an approximation to a function where lots of training data are available. Particularly good in classification where there is noisy data, e.g. diagnosis or object recognition. • RL - learning low level reactive behaviour, such as in lower forms of animals; good for low level cognitive tasks. It has also been used for learning in high level tasks (e.g. games) where rewards are possible and reasoning with actions (moves) is too complex.

Conclusion: RL v ANN • Similarities: • - both are classed as "sub-symbolic" because they make heavy use of numbers and are rather opaque when functioning • - both are learning approaches requiring repeated trials • - both are inspired by natural learning • - both are resistant to noise and degrade gracefully with degraded inputs

Conclusion: RL v ANN • Differences: • ANNs have a fixed architecture of layers of neurons, with a simple firing mechanism, weights randomly assigned at the start, and a fixed set of inputs. • ANNs need supervised TRAINING, i.e. data classified a priori, in the form of values for the inputs and a correct output. • RL needs to perform trial and error interactions with the environment. • RL learns a mapping from a situation to an action by trial and error: it learns to perform actions which will maximise the sum of reinforcements, so it is more of a real time “hands off” approach than ANNs. It aims to learn policies by assigning blame and learning to avoid situations.

Summary of MLPs • Feed forward. • Fully connected. • Sigmoid activation function. • The requirement for outputs of exactly 0 and 1 is relaxed to targets of 0.1 and 0.9 to accommodate the asymptotic properties of the sigmoid function. • Backpropagation learning is used to train the network. • The number of hidden nodes (units) can be chosen using a “rule of thumb”. • Outputs are continuous rather than binary.

Example MLP • Inputs: x1 = 0.1, x2 = 0.5 • Required outputs: o1 = 0.1, o2 = 0.9 • [Figure: network diagram with nodes x1, x2, h1, o1, o2 and weight labels 0.5, 0.5, -0.3, 0.5, 0.4, -0.5, 1, 1, 0.3]
