Backpropagation neural networks • BP training refers to a multilayer perceptron: a feedforward neural network with one or more hidden layers, trained by backpropagation. • Used to solve problems in many areas • As usual, the aim is • To train the net to achieve a balance between the ability to respond correctly to the input patterns used for training • And the ability to give reasonable (good) responses to input that is similar, but not identical, to that used in training (generalization)
Backpropagation neural networks • Training a network by BP involves 3 stages • Feedforward of the input training pattern • Calculation and backpropagation of the associated error • Adjustment of the weights • After training, application of the net involves only the feedforward phase
Architecture • The network consists of an input layer of source neurons, at least one middle or hidden layer of computational neurons, and an output layer of computational neurons. • The input signals are propagated in a forward direction on a layer-by-layer basis.
What does the middle layer hide? • A hidden layer “hides” its desired output. Neurons in the hidden layer cannot be observed through the input/output behaviour of the network. • There is no obvious way to know what the desired output of the hidden layer should be. • Commercial ANNs incorporate three and sometimes four layers, including one or two hidden layers. • Each layer can contain from 10 to 1000 neurons. • Experimental neural networks may have five or even six layers, including three or four hidden layers, and utilise millions of neurons.
Back-propagation neural network • Learning in a multilayer network proceeds in the same way as for a perceptron. • A training set of input patterns is presented to the network. • The network computes its output pattern, and if there is an error (a difference between the actual and desired output patterns), the weights are adjusted to reduce this error.
In a BP network, the learning algorithm has 3 phases. • First, a training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer. • If this pattern differs from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. • The weights are modified as the error is propagated.
BP-NN with 1 hidden layer (network diagram): input units X1, …, Xn; hidden units Z1, …, Zp; output units Y1, …, Ym; weights vij from input to hidden and wjk from hidden to output; each non-input layer also receives a bias signal fixed at 1.
Feedforward of the input training pattern • Each input unit (Xi) receives an input signal and broadcasts this signal to each of the hidden units Z1, …, Zp. • Each hidden unit computes its activation and sends its signal (zj) to each output unit. • Each output unit (Yk) computes its activation (yk) to form the response of the net for the given input pattern.
Training the input pattern • Each output unit compares its computed activation yk with its target value tk to determine the associated error for that pattern with that unit. • Based on this error, the factor δk (k = 1, …, m) is computed. • δk is used to distribute the error at output unit Yk back to all units in the previous layer (the hidden units connected to Yk). • It is also used to update the weights between the output layer and the hidden layer. • Similarly, the factor δj is computed for each hidden unit Zj.
Adjusting the weights • It is not necessary to propagate the error back to the input layer, but δj is used to update the weights between the hidden layer and the input layer. • After all of the δ factors have been determined, the weights for all layers are adjusted simultaneously. • The adjustment to the weight wjk (from hidden unit Zj to output unit Yk) is based on the factor δk and the activation zj of the hidden unit Zj. • The adjustment to the weight vij is based on the factor δj and the activation xi of the input unit.
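To make the δ computations concrete, here is a minimal NumPy sketch of the adjustments just described, for a single training pattern. All values (the pattern, the activations zj and yk, the targets tk, the weight shapes, and the learning rate α = 0.25) are hypothetical placeholders, and the binary sigmoid is assumed as the activation so that f'(y_in_k) = yk(1 - yk).

```python
import numpy as np

# Hypothetical example: n = 4 inputs, p = 3 hidden units, m = 2 output units.
rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.5, -0.5])          # input pattern x_i
z = np.array([0.2, 0.7, 0.4])                # hidden activations z_j (from feedforward)
y = np.array([0.6, 0.3])                     # output activations y_k (from feedforward)
t = np.array([1.0, 0.0])                     # target values t_k
v = rng.uniform(-0.5, 0.5, size=(4, 3))      # input-to-hidden weights v_ij
w = rng.uniform(-0.5, 0.5, size=(3, 2))      # hidden-to-output weights w_jk
alpha = 0.25                                  # learning rate (assumed)

# Factor delta_k at each output unit: error times the sigmoid derivative y_k(1 - y_k)
delta_k = (t - y) * y * (1 - y)

# Factor delta_j at each hidden unit: back-distributed error sum_k delta_k w_jk,
# times the sigmoid derivative z_j(1 - z_j)
delta_j = (w @ delta_k) * z * (1 - z)

# Simultaneous weight adjustments: Δw_jk = α δ_k z_j and Δv_ij = α δ_j x_i
w += alpha * np.outer(z, delta_k)
v += alpha * np.outer(x, delta_j)
```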
Notation • x: input training vector, x = (x1, …, xi, …, xn) • t: output target vector, t = (t1, …, tk, …, tm) • δk: portion of the error-correction weight adjustment for wjk that is due to an error at output unit Yk; also the information about the error at unit Yk that is propagated back to the hidden units that feed into Yk • δj: portion of the error-correction weight adjustment for vij that is due to the backpropagation of error information from the output layer to hidden unit Zj • α: learning rate • Xi: input unit i; for an input unit, the input signal and the output signal are the same, namely xi • v0j: bias on hidden unit j
Notation • Zj: hidden unit j • The net input to Zj is denoted z_inj: z_inj = v0j + Σi xi vij • The output signal (activation) of Zj is denoted zj: zj = f(z_inj) • w0k: bias on output unit k • Yk: output unit k • The net input to Yk is denoted y_ink: y_ink = w0k + Σj zj wjk • The output signal (activation) of Yk is denoted yk: yk = f(y_ink)
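The net-input formulas above translate directly into vectorised code. Below is a small sketch of the feedforward pass for one pattern; the shapes and random weights are placeholders, and f is taken to be the binary sigmoid.

```python
import numpy as np

def f(x):                                    # binary sigmoid activation
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
x = np.array([1.0, 0.0, 0.5, -0.5])          # input pattern (n = 4)
v0 = rng.uniform(-0.5, 0.5, 3)               # hidden biases v_0j (p = 3)
v = rng.uniform(-0.5, 0.5, (4, 3))           # weights v_ij
w0 = rng.uniform(-0.5, 0.5, 2)               # output biases w_0k (m = 2)
w = rng.uniform(-0.5, 0.5, (3, 2))           # weights w_jk

z_in = v0 + x @ v        # z_in_j = v_0j + sum_i x_i v_ij
z = f(z_in)              # z_j = f(z_in_j)
y_in = w0 + z @ w        # y_in_k = w_0k + sum_j z_j w_jk
y = f(y_in)              # y_k = f(y_in_k), the net's response to the pattern
```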
Activation Function • An activation function for a BP net should be • Continuous • Differentiable • Monotonically non-decreasing • For computational efficiency, its derivative should also be easy to compute
Activation Function • Binary sigmoid function, with range (0, 1): f1(x) = 1 / (1 + exp(-x)), with derivative f1'(x) = f1(x)[1 - f1(x)]
Activation Function • Bipolar sigmoid function, with range (-1, 1): f2(x) = 2 / (1 + exp(-x)) - 1, with derivative f2'(x) = ½ [1 + f2(x)] [1 - f2(x)] • Closely related to the hyperbolic tangent: tanh(x) = (e^x - e^-x) / (e^x + e^-x)
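A short sketch of these two activation functions and their derivatives (the function names are mine, not from the slides); the closing assertion checks the standard relation that tanh(x) equals the bipolar sigmoid evaluated at 2x.

```python
import numpy as np

def binary_sigmoid(x):
    """f1(x) = 1 / (1 + exp(-x)), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def binary_sigmoid_deriv(x):
    f = binary_sigmoid(x)
    return f * (1.0 - f)                     # f1'(x) = f1(x)[1 - f1(x)]

def bipolar_sigmoid(x):
    """f2(x) = 2 / (1 + exp(-x)) - 1, range (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def bipolar_sigmoid_deriv(x):
    f = bipolar_sigmoid(x)
    return 0.5 * (1.0 + f) * (1.0 - f)       # f2'(x) = ½[1 + f2(x)][1 - f2(x)]

# tanh is the bipolar sigmoid with the argument doubled: tanh(x) = f2(2x)
x = np.linspace(-3, 3, 7)
assert np.allclose(np.tanh(x), bipolar_sigmoid(2 * x))
```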
Training Algorithm • Refer to the text, page 294
Initial Weight • Random initialization • Nguyen-Widrow Initialization
Initial Weight • Random initialization • The choice of initial weights influences whether the net reaches a global (or only a local) minimum of the error and, if so, how quickly it converges. • The update of the weight between two units depends on both the derivative of the upper unit's activation function and the activation of the lower unit. • For this reason, it is important to avoid choices of initial weights that would make it likely that either the activations or the derivatives of activations are zero.
Initial Weight • Random initialization • The initial weights must not be too large, or the initial input signals to each hidden or output unit will likely fall in the region where the derivative of the sigmoid function has a very small value (the saturation region). • If the initial weights are too small, the net input to a hidden or output unit will be close to zero, which also causes slow learning.
Initial Weight • Random initialization • A common procedure is to initialize the weights (and biases) to random values between -0.5 and 0.5 (or between -1 and 1, or some other suitable interval). • The values may be positive or negative because the final weights after training may be of either sign.
Initial Weight • Nguyen-Widrow Initialization • Gives faster learning • Based on the response of the hidden neurons to a single input • Scale factor: β = 0.7 (p)^(1/n), i.e. 0.7 times the n-th root of p, where n is the number of input units and p is the number of hidden units
Initial Weight • Nguyen-Widrow Initialization • The analysis is based on the bipolar activation tanh(x) = (e^x - e^-x) / (e^x + e^-x) • For each hidden unit (j = 1, …, p): • Initialize its weight vector (from the input units): vij(old) = random number between -0.5 and 0.5 (or between -γ and γ) • Compute ||vj(old)|| = √(v1j(old)² + v2j(old)² + … + vnj(old)²) • Reinitialize the weights: vij = β vij(old) / ||vj(old)|| • Set the bias: v0j = random number between -β and β
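A minimal sketch of the procedure above, assuming NumPy; the function name and the use of a seeded generator are my own choices.

```python
import numpy as np

def nguyen_widrow_init(n, p, seed=None):
    """Nguyen-Widrow initialization of the input-to-hidden weights.

    n: number of input units, p: number of hidden units.
    Returns (v0, v): biases v_0j (length p) and weights v_ij (shape n x p).
    """
    rng = np.random.default_rng(seed)
    beta = 0.7 * p ** (1.0 / n)                # scale factor β = 0.7 p^(1/n)
    v = rng.uniform(-0.5, 0.5, size=(n, p))    # v_ij(old), random in (-0.5, 0.5)
    v = beta * v / np.linalg.norm(v, axis=0)   # v_ij = β v_ij(old) / ||v_j(old)||
    v0 = rng.uniform(-beta, beta, size=p)      # bias v_0j, random in (-β, β)
    return v0, v

v0, v = nguyen_widrow_init(n=4, p=3, seed=0)
```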
How long to train the net • To achieve a balance between correct responses to training patterns and good responses to new input patterns (a balance between memorization and generalization) • It is not necessary to continue training until the total squared error actually reaches a minimum • Hecht-Nielsen (1990) suggests using 2 disjoint sets of data during training • Training patterns • Training-testing patterns
How long to train the net • Hecht-Nielsen (1990) suggests using 2 disjoint sets of data during training • Training patterns • Training-testing patterns • Weight adjustments are based on the training patterns • But at intervals during training, the error is computed using the training-testing patterns • Training continues as long as this error decreases • When the error starts to increase, the net is starting to memorize the training patterns; at that point, terminate training (as sketched below)
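A Python sketch of this stopping rule. The callables train_one_epoch and test_error are hypothetical stand-ins for the weight-adjustment pass over the training patterns and the error computation on the training-testing patterns.

```python
def train_until_memorization(train_one_epoch, test_error,
                             check_interval=10, max_epochs=10_000):
    """Stop when the error on the training-testing patterns starts to rise."""
    best = float("inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()              # weight adjustments on the training patterns
        if epoch % check_interval == 0:
            err = test_error()         # error on the training-testing patterns
            if err > best:             # error rising: the net is memorizing
                return epoch
            best = err
    return max_epochs
```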
How many training pairs there should be • If there are enough training patterns, the net will be able to generalize as desired (classify unknown input patterns correctly) • "Enough" is determined by the condition W / P = e, or P = W / e, where P is the number of training patterns, W is the number of weights to be trained, and e is the accuracy of classification expected • E.g. if e = 0.1, a net with 80 weights will require 800 training patterns to be assured of classifying 90% of the testing patterns correctly, assuming the net was trained to classify 95% of the training patterns correctly
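A tiny check of this rule of thumb, using the slide's own numbers:

```python
W = 80        # number of weights to be trained
e = 0.1       # expected classification error on testing (10%)
P = W / e     # required number of training patterns
print(P)      # 800.0, matching the example above
```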
Data Representation • In many problems, input vectors and output vectors have components in the same range of values • In many NN applications the data may be given by • A continuous-valued variable, or • A set of ranges • E.g. food temperature can be represented by the actual value or by a range of 4 values (frozen, chilled, room temperature, or hot) • In general it is easier for a NN to learn a set of distinct responses than a continuous-valued response • But breaking continuous data into artificial categories can make it harder for the net to learn
Number of hidden layers • Theoretical results show that one hidden layer is sufficient for a BP net to approximate any continuous mapping from the input patterns to the output patterns to an arbitrary degree of accuracy • However, two hidden layers may make training easier in some situations
Application Procedure • After training, a BP net is applied by using only the feedforward phase of the training algorithm. Step 0. Initialize weights (taken from the training algorithm). Step 1. For each input vector, do Steps 2-4. Step 2. For i = 1, …, n: set the activation of input unit Xi to xi. Step 3. For j = 1, …, p: z_inj = v0j + Σi xi vij ; zj = f(z_inj). Step 4. For k = 1, …, m: y_ink = w0k + Σj zj wjk ; yk = f(y_ink).
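A compact NumPy sketch of Steps 0-4. The weights here are random placeholders standing in for the trained values of Step 0, and the binary sigmoid is assumed as f.

```python
import numpy as np

def binary_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_net(x, v0, v, w0, w, f=binary_sigmoid):
    """Feedforward-only application phase (Steps 2-4 above).

    x: input vector (length n); v0, v: hidden biases/weights; w0, w: output biases/weights.
    """
    z = f(v0 + x @ v)        # Step 3: z_j = f(v_0j + sum_i x_i v_ij)
    y = f(w0 + z @ w)        # Step 4: y_k = f(w_0k + sum_j z_j w_jk)
    return y

# Hypothetical usage, with random weights standing in for trained ones (Step 0):
rng = np.random.default_rng(0)
v0, v = rng.uniform(-0.5, 0.5, 3), rng.uniform(-0.5, 0.5, (4, 3))
w0, w = rng.uniform(-0.5, 0.5, 2), rng.uniform(-0.5, 0.5, (3, 2))
print(apply_net(np.array([1.0, 0.0, 0.5, -0.5]), v0, v, w0, w))
```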
Predicting Dissatisfied Credit Card Customers • The data for this analysis was collected from two sources. The first set comprised data from a survey of Capital One credit card customers. • The variables from the survey included topics such as customer purchasing habits, employment status, and non-Capital One credit card usage, among others. • The internal data was drawn from Capital One's consumer database and included data on each customer's balance, APR, credit limit, credit worthiness, cash advances, and other variables. • Each customer record also included a binary rating indicating satisfaction or dissatisfaction with their Capital One credit card. • The total dataset included 22,242 records and 25 variables.
Predicting Dissatisfied Credit Card Customers: Cleaning • Before any model could be built, the data had to be cleaned. • Missing values occurred throughout the data. • The variables with the most missing values were v11 (average daily balance), v4 (over limit in the past 30 days?), and v5 (balance on non-Capital One cards). • These variables were missing 210, 198, and 171 values respectively. Out of 22,242 cases, 7.87% had one or more missing values.
Relevant variables • V2: How much money did you spend on purchases in the last 30 days? • V3: How many times did you make purchases in the last 30 days? • V10: How many years have you had any credit card? From the internal data, we selected the following variables: • V11: The average daily balance. • V12: The current balance. • V13: The current credit limit. • V14: How many months the customer is past due. • V15: The annual percentage rate. • V16: Index of credit worthiness. • V17: The number of months with a Capital One credit card. • V18: Initial credit limit assigned when account was opened.
The back-propagation training algorithm • Step 1: Initialisation • Set all the weights and threshold levels of the network to random numbers uniformly distributed inside a small range: (-2.4 / Fi , +2.4 / Fi), where Fi is the total number of inputs of neuron i in the network. The weight initialisation is done on a neuron-by-neuron basis.
Step 2: Activation • Activate the back-propagation neural network by applying inputs x1(p), x2(p), …, xn(p) and desired outputs yd,1(p), yd,2(p), …, yd,n(p). • (a) Calculate the actual outputs of the neurons in the hidden layer: yj(p) = sigmoid[ Σi=1..n xi(p) · wij(p) - θj ], where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function.
Step 2: Activation (continued) • (b) Calculate the actual outputs of the neurons in the output layer: yk(p) = sigmoid[ Σj=1..m xjk(p) · wjk(p) - θk ], where m is the number of inputs of neuron k in the output layer.
Step 3: Weight training • Update the weights in the back-propagation network, propagating backward the errors associated with the output neurons. • (a) Calculate the error gradient for the neurons in the output layer: δk(p) = yk(p) · [1 - yk(p)] · ek(p), where ek(p) = yd,k(p) - yk(p). • Calculate the weight corrections: Δwjk(p) = α · yj(p) · δk(p). • Update the weights at the output neurons: wjk(p+1) = wjk(p) + Δwjk(p).
Step 3: Weight training (continued) • (b) Calculate the error gradient for the neurons in the hidden layer: δj(p) = yj(p) · [1 - yj(p)] · Σk=1..l δk(p) · wjk(p), where l is the number of neurons in the output layer. • Calculate the weight corrections: Δwij(p) = α · xi(p) · δj(p). • Update the weights at the hidden neurons: wij(p+1) = wij(p) + Δwij(p).
Step 4: Iteration Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied. • As an example, we may consider the three-layer back-propagation network. • Suppose that the network is required to perform logical operation Exclusive-OR. • Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net.
The effect of the threshold applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1. • The initial weights and threshold levels are set randomly as follows: • w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.
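A minimal Python sketch of this XOR example, following the Step 2-4 equations above and the initial weights just listed. The learning rate α = 0.1 and the sum-squared-error stopping criterion of 0.001 are assumptions, not stated on this slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training set: inputs (x1, x2) and desired output yd
patterns = [((1, 1), 0), ((0, 1), 1), ((1, 0), 1), ((0, 0), 0)]

# Initial weights and thresholds from the slide above
w13, w14, w23, w24 = 0.5, 0.9, 0.4, 1.0
w35, w45 = -1.2, 1.1
t3, t4, t5 = 0.8, -0.1, 0.3
alpha = 0.1                      # learning rate (assumed)

for epoch in range(10_000):
    sse = 0.0
    for (x1, x2), yd in patterns:
        # Step 2: activation (feedforward), threshold via fixed input of -1
        y3 = sigmoid(x1 * w13 + x2 * w23 - t3)
        y4 = sigmoid(x1 * w14 + x2 * w24 - t4)
        y5 = sigmoid(y3 * w35 + y4 * w45 - t5)
        # Step 3: weight training (error gradients and corrections)
        e = yd - y5
        sse += e ** 2
        d5 = y5 * (1 - y5) * e                 # output-layer gradient
        d3 = y3 * (1 - y3) * d5 * w35          # hidden-layer gradients
        d4 = y4 * (1 - y4) * d5 * w45
        w35 += alpha * y3 * d5; w45 += alpha * y4 * d5; t5 += alpha * (-1) * d5
        w13 += alpha * x1 * d3; w23 += alpha * x2 * d3; t3 += alpha * (-1) * d3
        w14 += alpha * x1 * d4; w24 += alpha * x2 * d4; t4 += alpha * (-1) * d4
    if sse < 0.001:              # Step 4: stop when the error criterion is met
        break

print(epoch, sse)                # the trained net should now reproduce XOR
```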