Learning and Perceptrons

Learning and Perceptrons CIS 479/579 Bruce R. Maxim UM-Dearborn

Momentum and Friction • When human players use a mouse to aim • Momentum turns the view farther than they expect for large angles (the "ballistic" mouse movement) • Friction slows down the turn for small angles • Adjustments are needed to avoid losing accuracy • AI players simply aim at an exact target and shoot • Perfect shooters may not be fun to play against, so turning errors can be introduced

Explicit Model • A mathematical function for computing the actual turning angle in terms of the desired angle and the previous output: β = 1.0 + 0.1 * noise(angle) and output(t) = (angle * β) * α + output(t - 1) * (1 - α) • α = scaling factor for blending the previous output with the angle request, in range [0.3, 0.5] • β = initialized to a random value in range [0.9, 1.1] • noise( ) returns a value in range [-1, 1]; could use cos(angle^2 * 0.217 + 342/angle)
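The explicit model can be sketched in a few lines of Python. This is only an illustration: the names `alpha` (the blending factor) and `beta` (the noise factor) are assumptions standing in for the two symbols in the formula, the guard for a zero angle is an addition to avoid dividing by zero, and `simulate` is a hypothetical helper that feeds each output back in as the previous output.

```python
import math

def noise(angle):
    """Deterministic pseudo-noise in [-1, 1], using the cosine form from the slide."""
    if angle == 0.0:
        return 0.0  # assumption: avoid dividing by zero for a zero-angle request
    return math.cos(angle * angle * 0.217 + 342.0 / angle)

def turn_output(angle, previous_output, alpha=0.4):
    """Blend the (noisy) requested angle with the previous output."""
    beta = 1.0 + 0.1 * noise(angle)          # noise factor in [0.9, 1.1]
    return (angle * beta) * alpha + previous_output * (1.0 - alpha)

def simulate(requests, alpha=0.4):
    """Hypothetical helper: apply the model to a sequence of requested angles."""
    outputs = []
    previous = 0.0
    for angle in requests:
        previous = turn_output(angle, previous, alpha)
        outputs.append(previous)
    return outputs
```

Calling `simulate([10.0, 10.0, 10.0])` shows how the same request produces a different turn each frame because the previous output is blended in.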

Linear Approximation • We can use a perceptron to approximate the function described earlier • Once the animat learns a faster approximation of the function, the explicit function can be removed from the AI code • Aiming errors just become a constraint on animat behavior

Methodology • Approximation computed by training the network iteratively • Desired output is computed for random inputs • By grouping results, the batch algorithm can be used to find values for the weights and bias • A small perceptron is applied twice (to get pitch and then yaw) rather than creating a larger one that does both • This reduces memory use at the expense of programming time
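A minimal sketch of this methodology, assuming a single linear neuron with two inputs (requested angle and previous output) and a noise-free version of the turning model as the function to approximate. The blending factor 0.4, the learning rate, the epoch count, and all function names are illustrative choices, not values from the slides.

```python
import random

def explicit_model(angle, previous_output, alpha=0.4):
    # Assumption: noise-free blending model, so the target is exactly linear.
    return angle * alpha + previous_output * (1.0 - alpha)

def train_batch(samples, epochs=1000, learning_rate=0.3):
    """Batch delta rule: accumulate the error over all samples, then update once."""
    weights = [0.0, 0.0]
    bias = 0.0
    count = len(samples)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for inputs, desired in samples:
            output = weights[0] * inputs[0] + weights[1] * inputs[1] + bias
            error = desired - output
            grad_w[0] += error * inputs[0]
            grad_w[1] += error * inputs[1]
            grad_b += error
        # One weight update per pass over the grouped results.
        weights[0] += learning_rate * grad_w[0] / count
        weights[1] += learning_rate * grad_w[1] / count
        bias += learning_rate * grad_b / count
    return weights, bias

# Desired outputs are computed for random inputs, as described above.
rng = random.Random(0)
samples = []
for _ in range(100):
    angle = rng.uniform(-1.0, 1.0)
    previous = rng.uniform(-1.0, 1.0)
    samples.append(((angle, previous), explicit_model(angle, previous)))

weights, bias = train_batch(samples)
```

Because the target here is linear, the learned weights approach the blending factors themselves. The same small perceptron could then be evaluated once for pitch and once for yaw.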

Accumulating Errors • Momentum and friction cause errors, or drift, that tend to accumulate after several turns • These errors allow the AI to perform more realistically • Ignoring the variations in aiming will make the AI too error-prone to challenge human players

Inverse Error • To compensate for aiming errors, we could define an inverse error function to help correct them • Not every function has a definable inverse, so the AI would be better served by a math-free method of approximating this type of function • Given enough trial-and-error opportunities through simulation, the AI should be able to predict the corrected angles needed

Learning - 1 • In effect the AI learns how to deal with aiming errors by receiving evaluative feedback • Using this feedback the AI can incrementally improve its task performance • The AI uses its sensors to detect the actual angles the body was turned since the last update • Unfortunately the AI learns to shoot where it should have shot last time

Learning - 2 • With enough trials the AI can learn to anticipate where to shoot (the NN weights provide a crude memory to work with) • Both the inputs and outputs will need to be scaled because the perceptron will have to deal with values that fall outside the unit range

Aimy • Perceptron is used to learn the corrected angles needed to prevent undershooting and overshooting • Gathers data from its sensors to determine how far its body turned for each requested angle • Incremental training is used to approximate the inverse function needed to prevent aiming errors
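The idea behind Aimy can be sketched as follows. Everything specific here is an assumption made for illustration: `body_turn` is a hypothetical stand-in for the game body (it undershoots every request by 20%), angles are in degrees and scaled by `MAX_ANGLE`, and the class name and learning rate are invented. The perceptron is trained incrementally on (observed turn, requested angle) pairs, so it approximates the inverse of the body's behavior.

```python
import random

MAX_ANGLE = 90.0  # assumption: degrees, used to scale values into [-1, 1]

def body_turn(requested):
    # Hypothetical body model: friction makes every turn fall 20% short.
    return 0.8 * requested

class InversePerceptron:
    """Single linear neuron mapping a desired turn to the angle to request."""

    def __init__(self, learning_rate=0.5):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, desired_turn):
        x = desired_turn / MAX_ANGLE                      # scale input
        return (self.weight * x + self.bias) * MAX_ANGLE  # unscale output

    def train(self, requested, observed):
        # Incremental (per-sample) delta rule on the scaled values.
        x = observed / MAX_ANGLE
        target = requested / MAX_ANGLE
        error = target - (self.weight * x + self.bias)
        self.weight += self.learning_rate * error * x
        self.bias += self.learning_rate * error

rng = random.Random(1)
aimy = InversePerceptron()
for _ in range(5000):
    requested = rng.uniform(-MAX_ANGLE, MAX_ANGLE)   # varied, random requests
    observed = body_turn(requested)                  # what the sensors report
    aimy.train(requested, observed)
```

After training, `aimy.predict(30.0)` returns the angle to request so that the body actually turns about 30 degrees. Sampling requests at random keeps the training set representative.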

Evaluation - 1 • Animat should have the opportunity to correct aiming while moving around • Perceptrons can learn more quickly when more training samples are presented • The animat can correct its aim in only two dimensions (pitch and yaw) • Only when pitch is near horizontal can the animat aim while it is moving

Evaluation - 2 • When looking fully up or fully down, no forward movement is possible, which prevents learning • To prevent this trap, the animat is only allowed to control yaw until satisfactory results are obtained • The worst that happens is the animat spinning around while learning

Evaluation - 3 • The way in which the yaw is chosen determines the angles available for learning • If the animat has full control over the yaw, it can decide what to learn and what to ignore (the effect may be for the NN to always predict the same turn to correct aiming errors) • This is a good reason for forcing the NN to examine a variety of randomly generated angles during training, which gives a more representative training set and better learning

Multilayer Perceptrons • Single layer perceptrons can only deal with linear problems • Single layer perceptrons can only approximate non-linear problems • Multilayer perceptrons (MLP) • Have extra middle layers known as “hidden” layers • The middle layers require more sophisticated activation functions than single layer perceptrons (e.g. linear activations would make an MLP behave like a single layer perceptron)

Topology • MLP topology is said to be feed-forward because there are no backward (recurrent) connections • There can be an arbitrary number of hidden layers in an MLP • Adding too many hidden layers increases the computational complexity of the network • One hidden layer is usually enough to allow the MLP to be a universal approximator capable of approximating any continuous function

Hidden Layers • In some cases, there may be many interdependencies among the input variables, and adding an extra hidden layer can be helpful • Adding hidden layers can sometimes reduce the total number of weights needed for a suitable approximation • An MLP with two hidden layers can approximate any non-continuous function

Hidden Neurons • Choosing the number of neurons in the hidden layer is an art that often depends on the AI designer’s intuition and experience • The neurons in the hidden layer are needed to represent the problem knowledge internally • As the number of dimensions grows, the complexity of the decision surface (path through hidden layer) increases • Basically the output is positive on one side of the surface and negative on the other side

Connections • Neurons can be fully connected to one another within and between layers • Neurons can also be sparsely connected and even skip layers (e.g. straight from input to output) • Most MLP are fully connected to simplify programming

Activation Function Properties • Differentiable (known and computable derivative) • Continuous (derivative defined everywhere) • Complexity (nonlinear for higher order tasks) • Monotonic (derivative positive) • Bounded (activation output and its derivative are finite) • Polarity (bipolar preferred to positive)

Activation Functions • Activation functions for the input and output layers are usually one of the following: • Step, Linear, Threshold logic, Sigmoid • Hidden layer activation functions might be one of the following: • Sigmoid: sig(x) = 1/(1 + e^(-x)) • Hyperbolic tangent • Bipolar Sigmoid: sigb(x) = 2/(1 + e^(-x)) - 1
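The functions named above might be written as follows. The step threshold at zero and the clamped piecewise-linear form of threshold logic are assumptions, since the slide gives formulas only for the two sigmoids.

```python
import math

def step(x):
    # Assumption: threshold at zero, output in {0, 1}.
    return 1.0 if x >= 0.0 else 0.0

def linear(x):
    return x

def threshold_logic(x):
    # Assumption: linear in the middle, clamped to [0, 1] at the ends.
    return max(0.0, min(1.0, x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    # Same shape as the sigmoid, rescaled to the range (-1, 1).
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def sigmoid_derivative(x):
    # The derivative can be computed from the sigmoid's own output.
    s = sigmoid(x)
    return s * (1.0 - s)
```

The bipolar sigmoid is the hyperbolic tangent with its input halved, so `bipolar_sigmoid(x)` and `math.tanh(x / 2)` agree up to rounding.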

Role of Hidden Layers • The use of a hidden layer implies that the information needed to compute the output must be filtered before passing it on to the next layer • Each layer of the MLP receives its input from the previous layer and passes its modified output on to the next layer

Feed-Forward Algorithm
current = input;   // process input layer
for layer = 1 to n
{
    for i = 1 to m   // compute output of each neuron
    {
        // multiply arrays and sum result
        s = NetSum(neuron[i].weights, current);
        output[i] = Activate(s);
    }
    // next layer uses this layer's output as input
    current = output;
}
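A runnable Python sketch of the same loop, assuming each neuron is stored as a (weights, bias) pair and that every layer uses the sigmoid. The explicit bias term and the example network values are illustrative additions, not taken from the slide.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def net_sum(weights, bias, inputs):
    # Multiply the two arrays element by element and sum the result.
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def feed_forward(layers, inputs, activate=sigmoid):
    current = list(inputs)
    for layer in layers:
        # Each layer uses the previous layer's output as its input.
        current = [activate(net_sum(weights, bias, current))
                   for weights, bias in layer]
    return current

# Hypothetical 2-2-1 network: each neuron is (weights, bias).
network = [
    [([1.0, -1.0], 0.0), ([0.5, 0.5], -1.0)],   # hidden layer, 2 neurons
    [([1.0, 1.0], 0.0)],                        # output layer, 1 neuron
]
result = feed_forward(network, [2.0, 2.0])
```

Passing `activate=lambda x: x` to every layer would collapse the network into a single linear map, which is why hidden layers need a nonlinear activation.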

Benefits of MLP • The importance of MLPs is not that they mimic animal brains (they do not) • MLPs have a thoroughly researched mathematical foundation and have been proven to work well in some applications • MLPs can be trained to do interesting things, and this training really just involves numeric optimization (minimizing output error)

Back Propagation - 1 • BP is the process of filtering error from the output layer back through the preceding layers • BP was developed in response to the fact that single layer perceptron algorithms do not train hidden layers • BP is the essence of most MLP learning algorithms

Back Propagation - 2 • Form of hill climbing known as “gradient ascent” hill climbing (equivalently, steepest descent on the output error) • several directions tried simultaneously • “steepest gradient” used to direct search • Training may require thousands of backpropagations • BP can get stuck or become unstable during training • BP can be done in stages

Back Propagation - 3 • BP can train a net to recognize several concepts simultaneously • Trained neural networks can be used to make predictions • Too many trainable weights relative to the number of training facts can lead to overfitting problems

Back Propagation Algorithm - 1 • Given: a set of input-output pairs • Task: compute weights for a 3-layer network that maps inputs to corresponding outputs • Algorithm: 1. Determine the number of neurons required 2. Initialize weights to random values 3. Set activation values for threshold units

Back Propagation Algorithm - 2 4. Choose an input-output pair and assign activation levels to input neurons 5. Propagate activations from input neurons to hidden layer neurons; for each neuron h_j = 1/(1 + e^(-Σ_i w1_ij x_i)) 6. Propagate activations from hidden layer neurons to output neurons; for each neuron o_j = 1/(1 + e^(-Σ_i w2_ij h_i))

Back Propagation Algorithm - 3 7. Compute error for output neurons by comparing the desired pattern to the actual output 8. Compute error for neurons in hidden layer 9. Adjust weights between hidden layer and output layer 10. Adjust weights between input layer and hidden layer 11. Go to step 4

Backprop - 1
// compute gradient in last layer neurons
for j = 1 to m
    delta[j] = deriv_activate(net_sum) * (desired[j] - output[j]);
for i = last - 1 to first   // process layers
    for j = 1 to m
    {
        total = 0;
        for k = 1 to n
            total += delta[k] * weights[j][k];
        delta[j] = deriv_activate(net_sum) * total;
    }

Backprop - 2
// steepest descent for error gradient for each weight
for j = 1 to m
    for i = 1 to n
        // adjust weights using error gradient
        weight[j][i] += learning_rate * delta[j] * output[i];
// The generalized delta rule is used to compute each weight wij
// learning_rate set by KE
// delta[j] is gradient of neuron j error
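The two backprop listings can be combined into a small runnable Python sketch. It assumes sigmoid activations in every layer, a bias stored as an extra weight fed by a constant 1.0, and a squared-error measure. The `Layer` class, the 2-2-1 network, the single training pair, and the learning rate are illustrative choices, not part of the original listings.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class Layer:
    def __init__(self, n_inputs, n_neurons, rng):
        # One weight row per neuron; the last entry of each row is the bias weight.
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                        for _ in range(n_neurons)]

    def forward(self, inputs):
        self.inputs = list(inputs) + [1.0]   # constant 1.0 feeds the bias weight
        self.outputs = [sigmoid(sum(w * x for w, x in zip(row, self.inputs)))
                        for row in self.weights]
        return self.outputs

def forward(layers, inputs):
    current = inputs
    for layer in layers:
        current = layer.forward(current)
    return current

def compute_deltas(layers, desired):
    """Return one list of deltas per layer, working from the last layer backwards."""
    last = layers[-1]
    # For the sigmoid, the derivative is output * (1 - output).
    deltas = [[o * (1.0 - o) * (d - o) for o, d in zip(last.outputs, desired)]]
    for index in range(len(layers) - 2, -1, -1):
        layer, following = layers[index], layers[index + 1]
        layer_deltas = []
        for j, o in enumerate(layer.outputs):
            # Sum the next layer's deltas, weighted by the connections from neuron j.
            total = sum(deltas[0][k] * following.weights[k][j]
                        for k in range(len(following.weights)))
            layer_deltas.append(o * (1.0 - o) * total)
        deltas.insert(0, layer_deltas)
    return deltas

def train_step(layers, inputs, desired, learning_rate=0.25):
    forward(layers, inputs)
    deltas = compute_deltas(layers, desired)
    for layer, layer_deltas in zip(layers, deltas):
        for j, row in enumerate(layer.weights):
            for i in range(len(row)):
                # Generalized delta rule: step along the error gradient.
                row[i] += learning_rate * layer_deltas[j] * layer.inputs[i]

def error(layers, inputs, desired):
    outputs = forward(layers, inputs)
    return 0.5 * sum((d - o) ** 2 for d, o in zip(desired, outputs))

rng = random.Random(42)
net = [Layer(2, 2, rng), Layer(2, 1, rng)]
sample_in, sample_out = [1.0, 0.0], [1.0]
error_before = error(net, sample_in, sample_out)
for _ in range(200):
    train_step(net, sample_in, sample_out)
error_after = error(net, sample_in, sample_out)
```

A real training run would cycle through many input-output pairs in random order rather than repeating one pair. The single pair is used here only to keep the example short.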

Quick Propagation • Batch technique • Exploits locally adaptive techniques to adjust step magnitude based on local parameters • Uses knowledge of higher-order derivatives (e.g. Newton’s method) • Allows for better prediction of the slope of the curve and the location of minima • Weights updated using a method similar to backprop

Quickprop - 1
// Requires two additional arrays for step and gradient;
// it remembers the last set of values
// New weight update replaces steepest descent
for j = 1 to m
    for i = 1 to n
    {
        // compute gradient and step
        new_gradient[j][i] = -delta[j] * input[i];
        new_step[j][i] = new_gradient[j][i] / (old_gradient[j][i] - new_gradient[j][i]) * old_step[j][i];

Quickprop - 2
        // adjust weight
        weight[j][i] += new_step[j][i];
        // store values for next iteration
        old_step[j][i] = new_step[j][i];
        old_gradient[j][i] = new_gradient[j][i];
    }
• Note: since this is a batch algorithm, the gradients for all training samples are added together
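To see what the quickprop step does, it helps to apply it to a single weight. The sketch below is a one-dimensional illustration, not a full network trainer. It bootstraps with one steepest-descent step (an assumption, since the rule needs a previous step and gradient), and it skips the update when the gradient has not changed, to avoid dividing by zero. The error curve and all names are hypothetical.

```python
def quickprop_step(gradient, old_gradient, old_step):
    """Step size from the current and previous error gradients of one weight."""
    denominator = old_gradient - gradient
    if denominator == 0.0:
        return 0.0  # assumption: no update when the slope did not change
    return gradient / denominator * old_step

def minimize(gradient_fn, weight, first_step_rate=0.1, iterations=5):
    # Bootstrap with one steepest-descent step to obtain a previous step.
    old_gradient = gradient_fn(weight)
    old_step = -first_step_rate * old_gradient
    weight += old_step
    for _ in range(iterations):
        gradient = gradient_fn(weight)
        if gradient == 0.0:
            break   # already at a stationary point
        step = quickprop_step(gradient, old_gradient, old_step)
        weight += step
        old_gradient, old_step = gradient, step
    return weight

def parabola_gradient(w):
    # Gradient of the hypothetical error curve E(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w_min = minimize(parabola_gradient, 0.0)
```

On a parabola the rule jumps to the minimum in a single quickprop step, because it fits a parabola through the last two gradients. On real error surfaces the fit is only approximate.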

Resilient Propagation • Weights updated only after all training samples have been seen • The step size is not determined by the gradient unlike steepest descent techniques • Equations are not too hard to implement

Rprop - 1
// New weight update replaces steepest descent
for j = 1 to m
    for i = 1 to n
    {
        // compute gradient and step
        new_gradient[j][i] = -delta[j] * input[i];
        // analyze change to get size of update
        if (new_gradient[j][i] * old_gradient[j][i] > 0)
            new_update[j][i] = nplus * old_update[j][i];
        else if (new_gradient[j][i] * old_gradient[j][i] < 0)
            new_update[j][i] = nminus * old_update[j][i];
        else
            new_update[j][i] = old_update[j][i];

Rprop - 2
        // determine step direction
        if (new_gradient[j][i] > 0)
            step[j][i] = -new_update[j][i];
        else if (new_gradient[j][i] < 0)
            step[j][i] = new_update[j][i];
        else
            step[j][i] = 0;
        // adjust weight and store values
        weight[j][i] += step[j][i];
        old_update[j][i] = new_update[j][i];
        old_gradient[j][i] = new_gradient[j][i];
    }
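The Rprop update can likewise be tried on a single weight. In this sketch the growth and shrink factors 1.2 and 0.5 are the values commonly used for Rprop, not values given on the slides, and the error curve, initial update size, and function names are assumptions for illustration.

```python
def rprop_step(weight, gradient, old_gradient, old_update,
               n_plus=1.2, n_minus=0.5):
    """Return the new weight and the new update size for one weight."""
    change = gradient * old_gradient
    if change > 0.0:
        update = n_plus * old_update     # same direction: speed up
    elif change < 0.0:
        update = n_minus * old_update    # sign flipped: we overshot, slow down
    else:
        update = old_update
    # Only the sign of the gradient chooses the direction of the step.
    if gradient > 0.0:
        step = -update
    elif gradient < 0.0:
        step = update
    else:
        step = 0.0
    return weight + step, update

def minimize(gradient_fn, weight, initial_update=0.1, iterations=100):
    old_gradient, update = 0.0, initial_update
    for _ in range(iterations):
        gradient = gradient_fn(weight)
        weight, update = rprop_step(weight, gradient, old_gradient, update)
        old_gradient = gradient
    return weight

# Hypothetical error curve E(w) = (w - 3)^2 with gradient 2 (w - 3).
w_min = minimize(lambda w: 2.0 * (w - 3.0), 0.0)
```

Because the gradient's magnitude is ignored, scaling the error curve by any positive constant would produce exactly the same sequence of weights. Practical implementations usually also clamp the update size between a minimum and a maximum.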

Building Neural Networks • Define the problem in terms of neurons • think in terms of layers • Represent information as neurons • operationalize neurons • select their data type • locate data for testing and training • Define the network • Train the network • Test the network

Structuring the Training Facts • Use randomly ordered facts • Use representative data • e.g. include people who survive surgery as well as people who do not • Neurons can’t be coded 1=horse#1, 2=horse#2, etc. • Networks like lots of inputs and outputs • Better to use two output neurons (one for buy and one for sell) than one coded 1=buy and 0=sell

Structuring the Training Facts • For historical data, use “rows” not “columns”
don’t use:
day1  day2  day3
3     4     5
do use:
day
3
4
5

Structuring the Training Facts • Neural networks like differences better than big numbers: use -50, not 350 vs 400 • For seasonal data use 1 column per month, with winter cases coded 1 for Dec, Jan, Feb, and 0 for other months • Think qualitatively, not quantitatively; use: restaurant visit on Monday in early Feb, not: restaurant visit on day 43
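The encoding advice on these slides might look like the following in practice. This is one reading of the guidelines: the function names are invented, and the month encoding assumes one 0/1 input per month.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def encode_months(active_months):
    """One input per month: 1.0 for each month that applies, 0.0 otherwise."""
    active = set(active_months)
    return [1.0 if name in active else 0.0 for name in MONTHS]

def differences(values):
    """Replace large absolute values with the change between consecutive ones."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

def encode_action(action):
    """Two output neurons (buy, sell) instead of a single 1/0 coded neuron."""
    return {"buy": [1.0, 0.0], "sell": [0.0, 1.0]}[action]

winter = encode_months(["Dec", "Jan", "Feb"])
price_change = differences([400, 350])
target = encode_action("sell")
```

Here a price that moves from 400 to 350 is presented to the network as a single change of -50 rather than as two large numbers.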

Generalization – 1 • Learning phase is responsible for optimizing the weights from the training examples • It would be good if the NN could also process new or unseen examples correctly (generalization) • An NN that is bound too tightly to the training examples is said to be overfitting • Overfitting is never a problem with single layer perceptrons

Generalization – 2 • For MLP number of hidden neurons affects complexity of decision surface • Need to find the trade-off between the number of hidden neurons and result quality • Incorrect or incomplete data interferes with generalization • Bad training examples are usually to blame for failure of MLP to learn concepts

Testing and Validation • Training sets – used to optimize the weights for a given set of parameters • Validation sets – used to check the quality of training, help to find best combination of parameters • Testing sets – check final quality of validated perceptrons (no test info is used to improve NN)
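A minimal sketch of splitting one pool of examples into the three sets. The 60/20/20 proportions, the fixed shuffle seed, and the function name are arbitrary assumptions for illustration.

```python
import random

def split_dataset(samples, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle once, then cut into training, validation, and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_fraction))
    n_validation = int(round(len(shuffled) * validation_fraction))
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    testing = shuffled[n_train + n_validation:]   # never used to tune the NN
    return training, validation, testing

training, validation, testing = split_dataset(range(100))
```

The weights are optimized on `training`, parameters such as the number of hidden neurons are compared on `validation`, and `testing` is consulted only once at the end.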

How can you tell things aren’t working out? • Your network refuses to learn 10-20% of the training facts • Things to try • Check definition file for data range errors • Check for bad (incorrect) facts • Some training facts may conflict with one another • The training tolerance level may be too strict for the data being used • Switch from absolute score to differences

Batch vs Incremental • Batch training is preferred over incremental training • Converges to an answer faster • Has greater accuracy • Incremental data can be gathered for batch processing if necessary • Incremental approaches are best suited for real-time, in-game learning (they require less memory)
