Back Propagation and Representation in PDP Networks. Psychology 209 February 6, 2013. Homework 4. Part 1 due Feb 13 Complete Exercises 5.1 and 5.2.
Back Propagation and Representation in PDP Networks. Psychology 209, February 6, 2013
Homework 4 • Part 1 due Feb 13 • Complete Exercises 5.1 and 5.2. • It may be helpful to carry out some explorations of parameters, as suggested in Exercise 5.3. This may help you achieve a solution in the last part of the homework, below. However, no write-up is required for this. • Part 2 due Feb 20 • Consult Chapter 8 of the PDP Book by Rumelhart, Hinton, and Williams (In readings directory for Feb 6). Consider the problems described there that were solved using back propagation, and choose one; or create a problem of your own to investigate with back propagation. • Carry out Exercise 5.4, creating your own network, template, pattern, and startup file (similar to bpxor.m), and answer question 5.4.1.
The Perceptron For input pattern p, teacher $t_p$ and output $o_p$, change the threshold and weights as follows: Note: including a bias = $-\theta$ in the net input and using a threshold of 0 is equivalent; the bias can then be treated as a weight from a unit that is always on.
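As a minimal sketch of the rule described above, assume the standard perceptron update: $\Delta w_j = \epsilon (t_p - o_p) i_{jp}$ for each weight and $\Delta\theta = -\epsilon (t_p - o_p)$ for the threshold, which is the same as adding $\epsilon (t_p - o_p)$ to the bias. The function names, the learning rate `epsilon`, and the OR training set are illustrative assumptions, not taken from the slides.

```python
def perceptron_output(weights, bias, inputs):
    """Threshold unit: output 1 if the net input (including bias) exceeds 0."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if net > 0 else 0


def train_perceptron(patterns, epsilon=0.5, max_epochs=100):
    """Apply the perceptron rule pattern by pattern until no errors remain."""
    n_inputs = len(patterns[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0  # bias = -theta, treated as a weight from an always-on unit
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in patterns:
            diff = target - perceptron_output(weights, bias, inputs)
            if diff != 0:
                errors += 1
                weights = [w + epsilon * diff * x
                           for w, x in zip(weights, inputs)]
                bias += epsilon * diff
        if errors == 0:
            break
    return weights, bias


# Hypothetical, linearly separable training set: logical OR.
OR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
```

Calling `train_perceptron(OR_PATTERNS)` returns weights and a bias that classify all four OR patterns. The same call on XOR patterns would stop only when `max_epochs` runs out, since no single threshold unit separates them.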
Gradient Descent Learning in the ‘LMS’ Associator Output is a linear function of inputs and weights: Find a learning rule to minimize the summed squared error: Consider the policy: This breaks down into the sum over patterns of terms of the form: Taking derivatives, we find:
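One way to write out the steps named above, as a sketch for a single linear output unit. The notation is an assumption: $i_{jp}$ is input j in pattern p, $w_j$ is the weight from input j, and $\epsilon$ is the learning rate.

```latex
\begin{align*}
o_p &= \sum_j w_j \, i_{jp}
  && \text{(linear output)} \\
E &= \sum_p E_p, \qquad E_p = (t_p - o_p)^2
  && \text{(summed squared error)} \\
\Delta w_j &= -\epsilon \frac{\partial E}{\partial w_j}
  = -\epsilon \sum_p \frac{\partial E_p}{\partial w_j}
  && \text{(gradient descent policy)} \\
\frac{\partial E_p}{\partial w_j} &= -2 \, (t_p - o_p) \, i_{jp}
  && \text{(derivative of one pattern's term)} \\
\Delta w_j &= 2 \epsilon \sum_p (t_p - o_p) \, i_{jp}
  && \text{(the delta rule)}
\end{align*}
```

The factor of 2 is usually absorbed into $\epsilon$, or removed by defining $E_p = \tfrac{1}{2}(t_p - o_p)^2$.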
What if we want to learn how to solve XOR? We need to figure out how to adjust the weights into the ‘hidden’ unit, following the principle of gradient descent:
We start with an even simpler problem: a chain of three units, 0 → 1 → 2, with weight w10 from unit 0 to unit 1 and weight w21 from unit 1 to unit 2. Assume units are linear, both weights = .5, and i = 1, t = 1. Weight changes should follow the gradient: We use the chain rule to calculate the derivative for each weight. First we unpack the chain, then we calculate the elements of it.
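The three-unit example can be worked through numerically. This is a sketch under one stated assumption: the error is taken to be $E = \tfrac{1}{2}(t - a_2)^2$, so that no factor of 2 appears in the derivatives. The function names are illustrative.

```python
def forward(w10, w21, i):
    """Linear chain: unit 0 holds the input, unit 1 and unit 2 scale it."""
    a0 = i
    a1 = w10 * a0
    a2 = w21 * a1
    return a0, a1, a2


def error(w10, w21, i, t):
    """Assumed error measure: E = 1/2 (t - a2)^2."""
    _, _, a2 = forward(w10, w21, i)
    return 0.5 * (t - a2) ** 2


def negative_gradients(w10, w21, i, t):
    """Unpack the chain rule for each weight.

    -dE/dw21 = (-dE/da2) * (da2/dw21)             = (t - a2) * a1
    -dE/dw10 = (-dE/da2) * (da2/da1) * (da1/dw10) = (t - a2) * w21 * a0
    """
    a0, a1, a2 = forward(w10, w21, i)
    neg_dE_da2 = t - a2
    return neg_dE_da2 * w21 * a0, neg_dE_da2 * a1


# Values from the slide: both weights .5, input 1, target 1.
neg_dw10, neg_dw21 = negative_gradients(0.5, 0.5, 1.0, 1.0)
```

With these values $a_1 = .5$ and $a_2 = .25$, so $t - a_2 = .75$ and both negative gradients come out to .375. Both weights should therefore increase.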
Including a non-linear activation function • Let • Then • So our chains from before become:
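Assuming the activation function intended here is the logistic function (the usual choice in PDP back propagation networks), and again taking $E = \tfrac{1}{2}(t - a_2)^2$ as an assumption, the definitions and the chains for the three-unit example can be sketched as:

```latex
\begin{align*}
a_i &= f(net_i) = \frac{1}{1 + e^{-net_i}} \\
f'(net_i) &= a_i \, (1 - a_i) \\[4pt]
-\frac{\partial E}{\partial w_{21}}
  &= -\frac{\partial E}{\partial a_2}
     \cdot \frac{\partial a_2}{\partial net_2}
     \cdot \frac{\partial net_2}{\partial w_{21}}
   = (t - a_2) \, f'(net_2) \, a_1 \\
-\frac{\partial E}{\partial w_{10}}
  &= -\frac{\partial E}{\partial a_2}
     \cdot \frac{\partial a_2}{\partial net_2}
     \cdot \frac{\partial net_2}{\partial a_1}
     \cdot \frac{\partial a_1}{\partial net_1}
     \cdot \frac{\partial net_1}{\partial w_{10}}
   = (t - a_2) \, f'(net_2) \, w_{21} \, f'(net_1) \, a_0
\end{align*}
```

Compared with the linear case, each unit the chain passes through contributes one extra factor $f'(net)$.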
Including the activation function in the chain rule and including more than one output unit leads to the formulation below, in which we use $\delta_i$ to represent $-\partial E/\partial net_i$. Calculating the $\delta$ term for output unit i: $\delta_i = (t_i - a_i) f'(net_i)$. And the $\delta$ term for hidden unit j: $\delta_j = f'(net_j) \sum_i \delta_i w_{ij}$. We can continue this back indefinitely: $\delta_s = f'(net_s) \sum_r \delta_r w_{rs}$. The weight change rule at every layer is: $\Delta w_{rs} = \epsilon \delta_r a_s$.
Back propagation algorithm • Propagate activation forward • Activation can only flow from lower-numbered units to higher-numbered units • Propagate “error” backward • Error flows from higher-numbered units back to lower-numbered units • Calculate ‘weight error derivative’ terms = $\delta_r a_s$ • One can change weights after processing a single pattern or accumulate weight error derivatives over a batch of patterns before changing the weights.
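The steps above can be sketched for a small 2-2-1 network. This is an illustration under stated assumptions, not the course software: units are logistic, the error is $\tfrac{1}{2}\sum (t - a)^2$, biases are treated as weights from an always-on unit, and the starting weights in `START` are arbitrary made-up values.

```python
import math


def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, params):
    """Propagate activation forward: inputs -> hidden units -> output unit."""
    W_h, b_h, w_o, b_o = params
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_h, b_h)]
    out = logistic(sum(w * h for w, h in zip(w_o, hidden)) + b_o)
    return hidden, out


def weight_error_derivatives(x, t, params):
    """Propagate error backward; return delta_r * a_s for every weight."""
    _, _, w_o, _ = params
    hidden, out = forward(x, params)
    delta_o = (t - out) * out * (1.0 - out)            # output delta
    delta_h = [h * (1.0 - h) * delta_o * w             # hidden deltas
               for h, w in zip(hidden, w_o)]
    wed_W = [[d * xi for xi in x] for d in delta_h]
    wed_w = [delta_o * h for h in hidden]
    # Bias terms: the sending activation is always 1, so the term is delta.
    return wed_W, delta_h, wed_w, delta_o


def total_error(patterns, params):
    return sum(0.5 * (t - forward(x, params)[1]) ** 2 for x, t in patterns)


def batch_epoch(patterns, params, epsilon):
    """Accumulate derivatives over all patterns, then change the weights."""
    W_h, b_h, w_o, b_o = params
    acc_W = [[0.0] * len(row) for row in W_h]
    acc_b = [0.0] * len(b_h)
    acc_w = [0.0] * len(w_o)
    acc_bo = 0.0
    for x, t in patterns:
        dW, db, dw, dbo = weight_error_derivatives(x, t, params)
        acc_W = [[a + g for a, g in zip(ra, rg)] for ra, rg in zip(acc_W, dW)]
        acc_b = [a + g for a, g in zip(acc_b, db)]
        acc_w = [a + g for a, g in zip(acc_w, dw)]
        acc_bo += dbo
    return ([[w + epsilon * g for w, g in zip(rw, rg)]
             for rw, rg in zip(W_h, acc_W)],
            [b + epsilon * g for b, g in zip(b_h, acc_b)],
            [w + epsilon * g for w, g in zip(w_o, acc_w)],
            b_o + epsilon * acc_bo)


XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
# Arbitrary illustrative starting point: (hidden weights, hidden biases,
# output weights, output bias).
START = ([[0.5, -0.4], [-0.3, 0.6]], [0.1, -0.2], [0.7, -0.5], 0.05)
```

Repeated calls such as `params = batch_epoch(XOR, params, 0.1)` implement the batch variant. Applying the update inside the pattern loop instead would give pattern-by-pattern learning.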
Variants/Embellishments to back propagation • Full “batch mode” (epoch-wise) learning rule with weight decay and momentum: $\Delta w_{rs} = \epsilon \sum_p \delta_{rp} a_{sp} - \omega w_{rs} + \alpha \Delta w_{rs}(prev)$ • Weights can alternatively be updated after each pattern or after every k patterns. • An alternative error measure has both conceptual and practical advantages: $CE_p = -\sum_i [t_{ip} \log(a_{ip}) + (1 - t_{ip}) \log(1 - a_{ip})]$ • If targets are actually probabilistic, minimizing $CE_p$ maximizes the probability of the observed target values. • This also eliminates the ‘pinned output unit’ problem.
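A small sketch of the two variants above, assuming logistic output units. With the squared-error measure the output delta carries the factor $f'(net) = a(1 - a)$, which vanishes when a unit is pinned near 0 or 1. With cross-entropy that factor cancels and the delta is simply $t - a$. The function and parameter names are illustrative.

```python
import math


def cross_entropy(targets, activations):
    """CE_p = -sum_i [t_i log(a_i) + (1 - t_i) log(1 - a_i)]."""
    return -sum(t * math.log(a) + (1 - t) * math.log(1 - a)
                for t, a in zip(targets, activations))


def delta_sse(t, a):
    """Output delta under squared error with a logistic unit."""
    return (t - a) * a * (1 - a)


def delta_ce(t, a):
    """Output delta under cross-entropy with a logistic unit."""
    return t - a


def batch_update(w, summed_wed, prev_dw, epsilon, decay, momentum):
    """dw = epsilon * sum_p(delta_rp * a_sp) - decay * w + momentum * prev_dw."""
    dw = epsilon * summed_wed - decay * w + momentum * prev_dw
    return w + dw, dw


# A 'pinned' output unit: target is 1 but the activation is stuck near 0.
pinned_sse = delta_sse(1, 0.001)
pinned_ce = delta_ce(1, 0.001)
```

For the pinned unit, the squared-error delta is just under 0.001 while the cross-entropy delta is 0.999, so only the latter produces a weight change large enough to move the unit off the wrong extreme.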
Why is back propagation important? • Provides a procedure that allows networks to learn weights that can solve any deterministic input-output problem. • Contrary to expectation, it does not get stuck in local minima except in cases where the network is exceptionally tightly constrained. • Allows networks to learn how to represent information as well as how to use it. • Raises questions about the nature of representations and of what must be specified in order to learn them.
Is Backprop biologically plausible? • Neurons do not send error signals backward across their weights through a chain of neurons, as far as anyone can tell • But we shouldn’t be too literal minded about the actual biological implementation of the learning rule. • Some neurons appear to use error signals, and there are ways to use differences between activation signals to carry error information