Chapter 6: Artificial Neural Networks, Part 2 of 3 (Sections 6.4 – 6.6)
SCCS451 Artificial Intelligence, Week 12
Asst. Prof. Dr. Sukanya Pongsuparb, Dr. Srisupa Palakvangsa Na Ayudhya, Dr. Benjarath Pupacdi
Agenda • Multi-layer Neural Network • Hopfield Network
Multilayer Neural Networks • A multilayer perceptron is a feedforward neural network with one or more hidden layers. [Figure: single-layer vs. multilayer neural networks]
Roles of Layers • Input Layer • Accepts input signals from the outside world • Distributes the signals to neurons in the hidden layer • Usually does not do any computation • Output Layer (computational neurons) • Accepts output signals from the previous hidden layer • Outputs signals to the outside world • Knows the desired outputs • Hidden Layer (computational neurons) • Determines its own desired outputs
Hidden (Middle) Layers • Neurons in hidden layers are unobservable through the inputs and outputs of the network. • Their desired outputs are unknown (hidden) from the outside and are determined by the layer itself. • One hidden layer suffices for continuous functions • Two hidden layers suffice for discontinuous functions • Practical applications mostly use 3 layers • More layers are possible, but each additional layer increases the computing load exponentially
How do multilayer neural networks learn? • More than a hundred different learning algorithms are available for multilayer ANNs • The most popular method is back-propagation.
Back-propagation Algorithm • In a back-propagation neural network, the learning algorithm has 2 phases: • Forward propagation of inputs • Backward propagation of errors • The algorithm loops over the 2 phases until the errors obtained fall below a certain threshold. • Learning is done in a manner similar to the perceptron: • A set of training inputs is presented to the network. • The network computes the outputs. • The weights are adjusted to reduce the errors. • The activation function used is a sigmoid function.
Common Activation Functions
• Step and sign (hard limit) functions: often used for decision-making neurons in classification and pattern recognition.
• Sigmoid function: output is a real number in the [0, 1] range; popular in back-propagation networks.
• Linear function: often used for linear approximation.
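For reference, a minimal Python sketch of these four activation functions (the function names are illustrative, not from the slides):

import math

def step(x):          # hard limit: 1 if x >= 0, else 0
    return 1 if x >= 0 else 0

def sign(x):          # hard limit: +1 if x >= 0, else -1
    return 1 if x >= 0 else -1

def sigmoid(x):       # squashes any input into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):        # output equals the net weighted input
    return x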
How a neuron determines its output • Very similar to the perceptron:
1. Compute the net weighted input.
2. Pass the result to the activation function.
Example (inputs 2, 5, 1, 8 with weights 0.1, 0.2, 0.5, 0.3; let θ = 0.2):
X = 0.1(2) + 0.2(5) + 0.5(1) + 0.3(8) − 0.2 = 3.9
Y = 1 / (1 + e^−3.9) = 0.98
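A minimal Python sketch of this computation, using the inputs, weights, and threshold from the example above:

import math

inputs  = [2, 5, 1, 8]
weights = [0.1, 0.2, 0.5, 0.3]
theta   = 0.2

# Net weighted input minus the threshold
x = sum(w * i for w, i in zip(weights, inputs)) - theta   # = 3.9

# Sigmoid activation
y = 1.0 / (1.0 + math.exp(-x))                            # ~ 0.98
print(round(x, 2), round(y, 2))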
How the errors propagate backward • The errors are computed in a manner similar to the errors in the perceptron:
Error = (the output we want) − (the output we get)
Error at an output neuron k at iteration p: e_k(p) = y_d,k(p) − y_k(p)
Example: if the expected output is 1 and the actual output is 0.98, then e_k(p) = 1 − 0.98 = 0.02.
Back-Propagation Training Algorithm
Step 1: Initialization
Randomly set the weights and thresholds θ to numbers within a small range, e.g. (−2.4/F_i, +2.4/F_i), where F_i is the total number of inputs of neuron i. The weight initialization is done on a neuron-by-neuron basis.
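A minimal sketch of neuron-by-neuron initialization in Python; the (−2.4/F_i, +2.4/F_i) bound is the small range suggested above, and the helper name is illustrative:

import random

def init_neuron(num_inputs):
    # Small range proportional to 1 / (number of inputs of this neuron)
    bound = 2.4 / num_inputs
    weights = [random.uniform(-bound, bound) for _ in range(num_inputs)]
    threshold = random.uniform(-bound, bound)
    return weights, threshold

# Example: a hidden neuron with 2 inputs
w, theta = init_neuron(2)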
Back-Propagation Training Algorithm
Step 2: Activation
Propagate the input signals forward from the input layer to the output layer.
Example (same neuron as before, θ = 0.2):
X = 0.1(2) + 0.2(5) + 0.5(1) + 0.3(8) − 0.2 = 3.9
Y = 1 / (1 + e^−3.9) = 0.98
Back-Propagation Training Algorithm
Step 3: Weight Training
There are 2 types of weight training: • For the output layer neurons • For the hidden layer neurons
It is important to understand that the input signals propagate forward first, and then the errors propagate backward to help train the weights. In each iteration (p + 1), the weights are updated based on the weights from the previous iteration p. The signals keep flowing forward and backward until the errors are below some preset threshold value.
3.1 Weight Training (Output layer neurons)
At iteration p, output neuron k receives inputs y_j(p) from the hidden layer through weights w_j,k and produces output y_k(p). These formulas are used to perform the weight corrections:
e_k(p) = y_d,k(p) − y_k(p)
δ_k(p) = y_k(p) × (1 − y_k(p)) × e_k(p)   (δ = error gradient)
Δw_j,k(p) = α × y_j(p) × δ_k(p)
w_j,k(p + 1) = w_j,k(p) + Δw_j,k(p)
In these formulas, the desired output y_d,k(p) is predefined, the actual output y_k(p) and the hidden-layer outputs y_j(p) are known from the forward pass, and the learning rate α is known, so the weight correction Δw_j,k(p) can be computed directly (see the sketch below). We do this for each of the weights of the output layer neurons.
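A small Python helper for the output-layer update, written as a sketch of the formulas above (the argument names are illustrative):

def update_output_weight(w_jk, y_j, y_k, y_dk, alpha):
    e_k = y_dk - y_k                  # error at output neuron k
    delta_k = y_k * (1 - y_k) * e_k   # error gradient (sigmoid derivative times error)
    return w_jk + alpha * y_j * delta_k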
3.2 Weight Training (Hidden layer neurons)
At iteration p, hidden neuron j receives inputs x_i(p) through weights w_i,j and produces output y_j(p). Its error gradient sums the output-layer gradients propagated back through the weights. These formulas are used to perform the weight corrections:
δ_j(p) = y_j(p) × (1 − y_j(p)) × Σ_k [δ_k(p) × w_j,k(p)]
Δw_i,j(p) = α × x_i(p) × δ_j(p)
w_i,j(p + 1) = w_i,j(p) + Δw_i,j(p)
In these formulas, the inputs x_i(p) are known, the learning rate α is predefined, and the gradients δ_k(p) propagate back from the output layer, so the weight correction Δw_i,j(p) can be computed (see the sketch below). We do this for each of the weights of the hidden layer neurons.
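A matching sketch for the hidden-layer update; the lists of output-layer gradients and weights stand in for the summation over all k:

def update_hidden_weight(w_ij, x_i, y_j, deltas_k, w_jk_list, alpha):
    # Error propagated back from the output layer through the weights w_j,k
    back_error = sum(d_k * w_jk for d_k, w_jk in zip(deltas_k, w_jk_list))
    delta_j = y_j * (1 - y_j) * back_error
    return w_ij + alpha * x_i * delta_j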
[Figure: iteration p = 1 — forward activation followed by weight training of the output-layer and hidden-layer weights]
After the weights are trained in iteration p = 1, we go back to Step 2 (Activation) and compute the outputs with the new weights. If the errors obtained with the updated weights are still above the error threshold, we start weight training for p = 2; otherwise, we stop.
[Figure: iteration p = 2 — the same forward/backward cycle, now using the updated weights]
Example: 3-layer ANN for XOR
[Figure: the four XOR input points (0, 0), (0, 1), (1, 0), (1, 1) in the x1–x2 plane]
XOR is not a linearly separable function. A single-layer ANN such as the perceptron cannot deal with problems that are not linearly separable. We cope with such problems using multilayer neural networks.
Example: 3-layer ANN for XOR. Let α = 0.1.
[Figure: network with two (non-computing) input neurons, two hidden neurons (3, 4), and one output neuron (5), showing the initial weights and thresholds]
Example: 3-layer ANN for XOR
• Training set: x1 = x2 = 1 and y_d,5 = 0. Let α = 0.1.
Initial weights and thresholds: w1,3 = 0.5, w2,3 = 0.4, w1,4 = 0.9, w2,4 = 1.0, w3,5 = −1.2, w4,5 = 1.1, θ3 = 0.8, θ4 = −0.1, θ5 = 0.3.
Calculate y3 = sigmoid(0.5 + 0.4 − 0.8) = 0.5250
Calculate y4 = sigmoid(0.9 + 1.0 + 0.1) = 0.8808
y5 = sigmoid(−0.63 + 0.9689 − 0.3) = 0.5097
e = 0 − 0.5097 = −0.5097
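The forward pass above can be reproduced with a few lines of Python; this is a sketch using the same training pattern and initial weights:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and initial parameters from the example
x1, x2 = 1, 1
w13, w23, w14, w24 = 0.5, 0.4, 0.9, 1.0   # input -> hidden weights
w35, w45 = -1.2, 1.1                      # hidden -> output weights
theta3, theta4, theta5 = 0.8, -0.1, 0.3   # thresholds

# Hidden layer outputs
y3 = sigmoid(x1 * w13 + x2 * w23 - theta3)   # sigmoid(0.1) ~ 0.5250
y4 = sigmoid(x1 * w14 + x2 * w24 - theta4)   # sigmoid(2.0) ~ 0.8808

# Output layer
y5 = sigmoid(y3 * w35 + y4 * w45 - theta5)   # ~ 0.5097

# Error for desired output 0
e5 = 0 - y5                                  # ~ -0.5097
print(round(y3, 4), round(y4, 4), round(y5, 4), round(e5, 4))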
Example: 3-layer ANN for XOR (2)
• Back-propagation of error (p = 1, output layer). Let α = 0.1.
δ5 = y5 × (1 − y5) × e = 0.5097 × (1 − 0.5097) × (−0.5097) = −0.1274
Δw_j,k(p) = α × y_j(p) × δ_k(p)
Δw3,5(1) = 0.1 × 0.5250 × (−0.1274) = −0.0067
w_j,k(p + 1) = w_j,k(p) + Δw_j,k(p)
w3,5(2) = −1.2 − 0.0067 = −1.2067
Example: 3-layer ANN for XOR (3)
• Back-propagation of error (p = 1, output layer). δ5 = −0.1274.
Δw4,5(1) = 0.1 × 0.8808 × (−0.1274) = −0.0112
w4,5(2) = 1.1 − 0.0112 = 1.0888
Example: 3-layer ANN for XOR (4)
• Back-propagation of error (p = 1, output layer). δ5 = −0.1274.
Δθ_k(p) = α × y(p) × δ_k(p), where the threshold input is fixed at y(p) = −1
Δθ5(1) = 0.1 × (−1) × (−0.1274) = 0.0127
θ5(p + 1) = θ5(p) + Δθ5(p)
θ5(2) = 0.3 + 0.0127 = 0.3127
Example: 3-layer ANN for XOR (5)
• Back-propagation of error (p = 1, hidden layer).
δ_j(p) = y_j(p) × (1 − y_j(p)) × Σ_k [δ_k(p) × w_j,k(p)]
δ3(p) = 0.525 × (1 − 0.525) × (−0.1274 × (−1.2)) = 0.0381
Δw_i,j(p) = α × x_i(p) × δ_j(p)
Δw1,3(1) = 0.1 × 1 × 0.0381 = 0.0038
w_i,j(p + 1) = w_i,j(p) + Δw_i,j(p)
w1,3(2) = 0.5 + 0.0038 = 0.5038
Example: 3-layer ANN for XOR (6)
• Back-propagation of error (p = 1, hidden layer).
δ4(p) = 0.8808 × (1 − 0.8808) × (−0.1274 × 1.1) = −0.0147
Δw1,4(1) = 0.1 × 1 × (−0.0147) = −0.0015
w1,4(2) = 0.9 − 0.0015 = 0.8985
Example: 3-layer ANN for XOR (7)
• Back-propagation of error (p = 1, hidden layer). δ3(p) = 0.0381, δ4(p) = −0.0147.
Δw2,3(1) = 0.1 × 1 × 0.0381 = 0.0038
w2,3(2) = 0.4 + 0.0038 = 0.4038
Example: 3-layer ANN for XOR (8)
• Back-propagation of error (p = 1, hidden layer). δ3(p) = 0.0381, δ4(p) = −0.0147.
Δw2,4(1) = 0.1 × 1 × (−0.0147) = −0.0015
w2,4(2) = 1.0 − 0.0015 = 0.9985
Example: 3-layer ANN for XOR (9)
• Back-propagation of error (p = 1, hidden layer). δ3(p) = 0.0381.
Δθ3(1) = 0.1 × (−1) × 0.0381 = −0.0038
θ3(2) = 0.8 − 0.0038 = 0.7962
Example: 3-layer ANN for XOR (10)
• Back-propagation of error (p = 1, hidden layer). δ4(p) = −0.0147.
Δθ4(1) = 0.1 × (−1) × (−0.0147) = 0.0015
θ4(2) = −0.1 + 0.0015 = −0.0985
Example: 3-layer ANN for XOR (11). α = 0.1.
Now the 1st iteration (p = 1) is finished. The updated weights and thresholds are: w1,3 = 0.5038, w2,3 = 0.4038, w1,4 = 0.8985, w2,4 = 0.9985, w3,5 = −1.2067, w4,5 = 1.0888, θ3 = 0.7962, θ4 = −0.0985, θ5 = 0.3127.
The weight training process is repeated until the sum of squared errors is less than the threshold of 0.001.
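The whole backward pass for p = 1 can be reproduced from the forward-pass values; this Python sketch recomputes the updated weights and thresholds listed above:

alpha = 0.1
x1, x2 = 1, 1
y3, y4, y5 = 0.5250, 0.8808, 0.5097
e5 = 0 - y5

# Error gradients: output neuron first, then hidden neurons
d5 = y5 * (1 - y5) * e5          # ~ -0.1274
d3 = y3 * (1 - y3) * d5 * -1.2   # ~  0.0381  (uses the old w3,5)
d4 = y4 * (1 - y4) * d5 * 1.1    # ~ -0.0147  (uses the old w4,5)

# Updated weights and thresholds (thresholds use a fixed input of -1)
w35 = -1.2 + alpha * y3 * d5     # -1.2067
w45 =  1.1 + alpha * y4 * d5     #  1.0888
t5  =  0.3 + alpha * -1 * d5     #  0.3127
w13 =  0.5 + alpha * x1 * d3     #  0.5038
w23 =  0.4 + alpha * x2 * d3     #  0.4038
t3  =  0.8 + alpha * -1 * d3     #  0.7962
w14 =  0.9 + alpha * x1 * d4     #  0.8985
w24 =  1.0 + alpha * x2 * d4     #  0.9985
t4  = -0.1 + alpha * -1 * d4     # -0.0985
print([round(v, 4) for v in (w13, w23, w14, w24, w35, w45, t3, t4, t5)])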
Learning Curve for XOR
The learning curve shows how fast the ANN learns: 224 epochs (896 iterations) were required.
Final Results
After training, the weights and thresholds are approximately: w1,3 = 4.7, w2,3 = 4.8, w1,4 = 6.4, w2,4 = 6.4, w3,5 = −10.4, w4,5 = 9.8, θ3 = 7.3, θ4 = 2.8, θ5 = 4.6.
Training again with different initial values may give a different result. Any result works, so long as the sum of squared errors is below the preset error threshold.
Final Results • Different results are possible for different initial weights, but the result always satisfies the error criterion.
McCulloch-Pitts Model: XOR Operation • Activation function: sign function
Decision Boundary (a) Decision boundary constructed by hidden neuron 3; (b) Decision boundary constructed by hidden neuron 4; (c) Decision boundaries constructed by the complete three-layer network
Problems of Back-Propagation • Does not resemble the learning process of a biological neuron • Heavy computing load
Accelerated Learning in Multi-layer NN (1) • Represent the sigmoid function by a hyperbolic tangent:
Y^tanh = 2a / (1 + e^(−bX)) − a
where a and b are constants. Suitable values: a = 1.716 and b = 0.667.
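A sketch of this activation in Python, using the suggested constants:

import math

def tanh_activation(x, a=1.716, b=0.667):
    # Hyperbolic-tangent form of the sigmoid: output range (-a, +a)
    return (2.0 * a) / (1.0 + math.exp(-b * x)) - a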
Accelerated Learning in Multi-layer NN (2) • Include a momentum term in the delta rule:
Δw_j,k(p) = β × Δw_j,k(p − 1) + α × y_j(p) × δ_k(p)
where β is a positive number (0 ≤ β < 1) called the momentum constant. Typically, the momentum constant is set to 0.95. This equation is called the generalized delta rule.
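A sketch of the generalized delta rule in Python; the argument names are illustrative:

def delta_w_with_momentum(prev_delta_w, y_j, delta_k, alpha=0.1, beta=0.95):
    # Previous weight change is carried forward, scaled by the momentum constant
    return beta * prev_delta_w + alpha * y_j * delta_k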
Learning with Momentum • For the XOR example, learning is reduced from 224 to 126 epochs.
Accelerated Learning in Multi-layer NN (3) • Adaptive learning rate. Idea:
• A small learning rate gives a smooth learning curve.
• A large learning rate speeds up learning but may cause instability.
• Heuristic rule (a sketch follows below):
• Increase the learning rate when the change of the sum of squared errors keeps the same algebraic sign for several consecutive epochs.
• Decrease the learning rate when the sign alternates for several consecutive epochs.
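A sketch of this heuristic in Python; the window size and the growth/decay factors (1.05 and 0.7) are illustrative assumptions, not values from the slides:

def adapt_learning_rate(alpha, error_changes, window=4, up=1.05, down=0.7):
    # error_changes: per-epoch changes of the sum of squared errors
    recent = error_changes[-window:]
    signs = [c >= 0 for c in recent]
    if all(signs) or not any(signs):
        # Same algebraic sign for several consecutive epochs: speed up
        return alpha * up
    if all(signs[i] != signs[i + 1] for i in range(len(signs) - 1)):
        # Sign alternates: slow down
        return alpha * down
    return alpha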
The Hopfield Network • Neural networks were designed by analogy with the brain, which has associative memory. • For example, we can recognize a familiar face in an unfamiliar environment; our brain can recognize certain patterns even when some information about them differs from what we remember. • Multilayer ANNs are not intrinsically intelligent. • Recurrent neural networks (RNNs) are used to emulate human associative memory. • The Hopfield network is an RNN.
The Hopfield Network: Goal • To recognize a pattern even if some parts differ from what the network was trained to remember. • The Hopfield network is a single-layer network. • It is recurrent: the network outputs are calculated and then fed back to adjust the inputs, and the process continues until the outputs become constant. • Let's see how it works.