Topic 1 Neural Networks
OUTLINES • Neural Networks • Cerebellar Model Articulation Controller (CMAC) • Applications • References: C.L. Lin & H.W. Su, “Intelligent control theory in guidance and control system design: an overview,” Proc. Natl. Sci. Counc. ROC (A), pp. 15-30
1. Neural Networks • As you read these words you are using a complex biological neural network. You have a highly interconnected set of about 10^11 neurons to facilitate your reading, breathing, motion and thinking. • In an artificial neural network, the neurons are not biological. They are extremely simple abstractions of biological neurons, realized as elements in a program or perhaps as circuits made of silicon.
Biological Inspiration • The human brain consists of a large number (about 10^11) of highly interconnected elements (about 10^4 connections per element) called neurons. • The three principal components are the dendrites, the cell body and the axon. • The point of contact between neurons is called a synapse.
Biological Neurons • Dendrites: carry electrical signals into the cell body. • Cell body: sums and thresholds these incoming signals. • Axon: carries the signal from the cell body out to other neurons. • Synapse: the point of contact between an axon of one cell and a dendrite of another cell.
Neural Networks • Neural networks are a promising new generation of information processing systems that usually operate in parallel and demonstrate the ability to learn, recall, and generalize from training patterns or data. • Artificial neural networks are collections of mathematical models that emulate some of the observed properties of biological nervous systems and draw on the analogies of adaptive biological learning.
Basic Model ~ 1 • [Figure: a node with inputs s1, s2, …, sn and output y = f(s1, s2, …, sn)] • A neural network is composed of four pieces: nodes, connections between the nodes, nodal functions, and a learning rule for updating the information in the network.
Basic Model ~ 2 • Nodes: a number of nodes, each an elementary processor (EP), is required. • Connectivity: This can be represented by a matrix that shows the connections between the nodes. The number of nodes plus the connectivity define the topology of the network. In the human brain, each neuron is connected to about 10^4 other neurons. Artificial nets can range from totally connected to a topology where each node is just connected to its nearest neighbors.
Basic Model ~ 3 • Elementary processor functions: A node has inputs s1,…, sn and an output y, and the node generates the output y as a function of the inputs. • A learning rule: There are two types of learning: • Supervised learning: you have to teach the networks the “answers.” • Unsupervised learning: the network figures out the answers on its own. All the learning rules try to embed information by sampling the environment.
Perceptron Model • Suppose we have a two class problem. If we can separate these classes with a straight line (decision surface), then they are separable. • The question is, how can we find the best line, and what do we mean by “best.” • In n dimensions, we have a hyperplane separating the classes. These are all decision surfaces. • Another problem is that you may need more than one line to separate the classes.
Decision Surfaces • [Figure: two scatter plots of x’s and o’s — one showing linearly separable classes split by a single line, the other requiring a multi-line decision surface]
Single Layer Perceptron Model • [Figure: inputs x1, …, xn with weights w1, …, wn feeding a node with transfer function f] • xi: inputs to the node; y: output; wi: weights; θ: threshold value. • The output y can be expressed as y = f(Σi wi xi − θ). • The function f is called the nodal (transfer) function and is not the same in every application.
Nodal Function • Hard-limiter: outputs −1 or 1. • Threshold function: outputs 0 or 1. • Sigmoid function: a smooth, S-shaped curve rising from 0 to 1.
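The node computation and these nodal functions can be sketched in Python; the particular inputs, weights, and threshold below are illustrative values, not numbers from the slides:

```python
import math

def hard_limiter(s):
    # outputs -1 or 1
    return 1.0 if s >= 0 else -1.0

def threshold(s):
    # binary threshold: outputs 0 or 1
    return 1.0 if s >= 0 else 0.0

def sigmoid(s):
    # smooth S-shaped curve between 0 and 1
    return 1.0 / (1.0 + math.exp(-s))

def node_output(x, w, theta, f):
    # y = f(sum_i w_i * x_i - theta)
    s = sum(wi * xi for wi, xi in zip(w, x)) - theta
    return f(s)

print(node_output([1.0, 0.0], [0.5, 0.5], 0.25, hard_limiter))  # 1.0
```

Swapping `f` between the three transfer functions changes only how the weighted sum is squashed, not how the sum itself is formed.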
Single Layer Perceptron Model • Two-input case: the decision surface is the line w1x1 + w2x2 − θ = 0. • If we use the hard limiter, then we could say that if the output of the function is a 1, the input vector belongs to class A. If the output is a −1, the input vector belongs to class B. • XOR: the perceptron model cannot draw a single line to separate the two classes given by the exclusive-OR, a limitation (highlighted by Minsky and Papert in 1969) that caused the field of neural networks to lose credibility.
Exclusive-OR Problem • [Figure: the four XOR points — o at (0,1) and (1,0), x at (0,0) and (1,1) — with no single line separating the o’s from the x’s]
Two-layer Perceptron Model • [Figure: inputs x1, x2 feed hidden nodes y1, y2 through weights w11, w12, w21, w22; the hidden outputs feed output node z through weights w’1, w’2] • The outputs from the two hidden nodes are y1 = f(w11 x1 + w12 x2 − θ1) and y2 = f(w21 x1 + w22 x2 − θ2). • The network output is z = f(w’1 y1 + w’2 y2 − θ).
Exclusive-OR Problem • [Figure: input units x and y feed a hidden unit g (weights +1, +1; threshold 1.5) and an output unit f (weights +1, +1 from the inputs, −2 from g; threshold 0.5)]
Exclusive-OR Problem • g = sgn(1·x + 1·y − 1.5) • f = sgn(1·x + 1·y − 2g − 0.5) • input (0,0): g = 0, f = 0; input (0,1): g = 0, f = 1; input (1,0): g = 0, f = 1; input (1,1): g = 1, f = 0
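The truth table above can be verified directly; a minimal sketch, using a 0/1 hard limiter for sgn as the slide does:

```python
def step(s):
    # hard limiter with 0/1 outputs, matching the slide's convention
    return 1 if s > 0 else 0

def xor_net(x, y):
    g = step(1*x + 1*y - 1.5)           # hidden unit, threshold 1.5
    f = step(1*x + 1*y - 2*g - 0.5)     # output unit, threshold 0.5
    return f

for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x, y), xor_net(x, y))
```

The hidden unit g fires only on (1,1), and its −2 weight then suppresses the output unit — exactly the extra "fold" in the decision surface that a single-layer perceptron cannot produce.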
Multilayer Network • [Figure: input patterns feed internal representation units through weights wji (unit i to unit j), which feed output patterns through weights wkj (unit j to unit k)]
Weight Adjustment • Adjust weights by: wji(l+1) = wji(l) + Δwji, where wji(l) is the weight from unit i to unit j at time l (or the lth iteration) and Δwji is the weight adjustment. • The weight change may be computed by the delta rule: Δwji = η δj ii, where η is a trial-independent learning rate and δj is the error at unit j: δj = tj − oj, where tj is the desired output and oj is the actual output at output unit j. • Repeat iterations until convergence.
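The delta rule above can be sketched for a single linear unit on a toy two-pattern problem; the learning rate, dataset, and epoch count are illustrative choices:

```python
# Delta-rule training of one linear unit: w_ji <- w_ji + eta * delta_j * i_i
eta = 0.1
patterns = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]  # (inputs, target)
w = [0.0, 0.0]

for _ in range(50):                                  # repeat until convergence
    for inputs, target in patterns:
        o = sum(wi * ii for wi, ii in zip(w, inputs))   # linear output o_j
        delta = target - o                              # delta_j = t_j - o_j
        w = [wi + eta * delta * ii for wi, ii in zip(w, inputs)]

print(w)  # approaches [1.0, -1.0]
```

Each update nudges the weights in proportion to the remaining error, so the weight vector converges geometrically toward the values that reproduce the targets.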
Generalized Delta Rule • tpj: the target output for the jth component of the output pattern for pattern p. • opj: the jth element of the actual output pattern produced by the presentation of input pattern p. • ipi: the value of the ith element of the input pattern. • Δpwji: the change to be made to the weight from the ith to the jth unit following presentation of pattern p.
Delta Rule and Gradient Descent • Ep = ½ Σj (tpj − opj)²: the error on input/output pattern p; E = Σp Ep: the overall measure of the error. • We wish to show that the delta rule implements a gradient descent in E when the units are linear. We will proceed by simply showing that −∂Ep/∂wji = δpj ipi, which is proportional to Δpwji as prescribed by the delta rule.
Delta Rule & Gradient Descent • When there are no hidden units it is easy to compute the relevant derivative. For this purpose we use the chain rule to write the derivative as the product of two parts: the derivative of the error with respect to the output of the unit times the derivative of the output with respect to the weight: ∂Ep/∂wji = (∂Ep/∂opj)(∂opj/∂wji). • The first part tells how the error changes with the output of the jth unit and the second part tells how much changing wji changes that output.
Delta Rule & Gradient Descent (no hidden units) • The first part: ∂Ep/∂opj = −(tpj − opj) = −δpj, so the contribution of unit j to the error is simply proportional to δpj. • Since we have linear units, opj = Σi wji ipi, from which we conclude that ∂opj/∂wji = ipi. • Thus, we have −∂Ep/∂wji = δpj ipi.
Delta Rule and Gradient Descent • Combining this with the observation that ∂E/∂wji = Σp ∂Ep/∂wji should lead us to conclude that the net change in wji after one complete cycle of pattern presentations is proportional to this derivative, and hence that the delta rule implements a gradient descent in E. In fact, this is strictly true only if the values of the weights are not changed during this cycle.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • The standard delta rule essentially implements gradient descent in sum-squared error for linear activation functions. • Without hidden units, the error surface is shaped like a bowl with only one minimum, so gradient descent is guaranteed to find the best set of weights. • With hidden units, however, it is not so obvious how to compute the derivatives, and the error surface is not concave upwards, so there is the danger of getting stuck in a local minimum.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • The main theoretical contribution is to show that there is an efficient way of computing the derivatives. • The main empirical contribution is to show that the apparently fatal problem of local minima is irrelevant in a wide variety of learning tasks. • A semilinear activation function is one in which the output of a unit is a non-decreasing and differentiable function of the net total input, where oi = ii if unit i is an input unit.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • Thus, a semilinear activation function is one in which opj = fj(netpj), where netpj = Σi wji opi and fj is differentiable and non-decreasing. • To get the correct generalization of the delta rule, we must set Δpwji ∝ −∂Ep/∂wji, where E is the same sum-squared error function defined earlier.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • As in the standard delta rule, it is useful to see this derivative as resulting from the product of two parts: ∂Ep/∂wji = (∂Ep/∂netpj)(∂netpj/∂wji). • One part reflects the change in error as a function of the change in the net input to the unit, and one part represents the effect of changing a particular weight on the net input. • The second factor is ∂netpj/∂wji = opi.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • Define δpj = −∂Ep/∂netpj. • Thus, −∂Ep/∂wji = δpj opi. • This says that to implement gradient descent in E we should make our weight changes according to Δpwji = η δpj opi, just as in the standard delta rule. • The trick is to figure out what δpj should be for each unit uj in the network.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • Compute δpj = −∂Ep/∂netpj = −(∂Ep/∂opj)(∂opj/∂netpj). • The second factor: ∂opj/∂netpj = f′j(netpj), which is simply the derivative of the function fj for the jth unit, evaluated at the net input netpj to that unit. • To compute the first factor, we consider two cases.
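For the common logistic (sigmoid) choice of fj, the derivative takes the convenient form f′(net) = o(1 − o). A quick numerical check of this identity, sketched with an arbitrary input value:

```python
import math

def f(net):
    # logistic sigmoid, a standard semilinear activation function
    return 1.0 / (1.0 + math.exp(-net))

net = 0.3                                     # arbitrary test point
o = f(net)
analytic = o * (1.0 - o)                      # f'(net) = o(1 - o)
h = 1e-6
numeric = (f(net + h) - f(net - h)) / (2*h)   # central-difference estimate
print(analytic, numeric)
```

The two values agree to high precision, confirming that the derivative needed by the generalized delta rule can be computed from the unit's output alone.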
Delta Rule for Semilinear Activation Functions in Feedforward Networks • First, assume that unit uj is an output unit of the network. In this case, it follows from the definition of Ep that ∂Ep/∂opj = −(tpj − opj). • Thus, δpj = (tpj − opj) f′j(netpj) for any output unit uj.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • If uj is not an output unit we use the chain rule to write ∂Ep/∂opj = Σk (∂Ep/∂netpk)(∂netpk/∂opj) = −Σk δpk wkj. • Thus, δpj = f′j(netpj) Σk δpk wkj whenever uj is not an output unit.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • If uj is an output unit: δpj = (tpj − opj) f′j(netpj). • If uj is not an output unit: δpj = f′j(netpj) Σk δpk wkj. • The above two equations give a recursive procedure for computing the δ’s for all units in the network, which are then used to compute the weight changes Δpwji = η δpj opi.
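The recursive procedure can be sketched for a small sigmoid network. The 2-2-1 topology, weights, input, target, and learning rate below are all illustrative assumptions, not values from the slides:

```python
import math

def sig(net):
    return 1.0 / (1.0 + math.exp(-net))

# hypothetical 2-2-1 network with arbitrary initial weights
w_hidden = [[0.5, -0.4], [0.3, 0.8]]   # w[j][i]: input i -> hidden unit j
w_out = [0.7, -0.2]                    # hidden unit j -> single output
x = [1.0, 0.0]                         # one input pattern
t = 1.0                                # its target
eta = 0.5

# forward pass
net_h = [sum(w_hidden[j][i] * x[i] for i in range(2)) for j in range(2)]
o_h = [sig(n) for n in net_h]
net_o = sum(w_out[j] * o_h[j] for j in range(2))
o = sig(net_o)

# backward pass: delta for the output unit, then hidden deltas via the recursion
delta_o = (t - o) * o * (1.0 - o)                      # (t - o) * f'(net)
delta_h = [o_h[j] * (1.0 - o_h[j]) * delta_o * w_out[j]  # f'(net) * sum_k delta_k w_kj
           for j in range(2)]

# weight changes: delta_w_ji = eta * delta_j * o_i
w_out = [w_out[j] + eta * delta_o * o_h[j] for j in range(2)]
w_hidden = [[w_hidden[j][i] + eta * delta_h[j] * x[i] for i in range(2)]
            for j in range(2)]
print(o, delta_o)
```

A single such update moves the output toward the target, which is the gradient-descent property the derivation establishes.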
Delta Rule for Semilinear Activation Functions in Feedforward Networks • The application of the generalized delta rule, thus, involves two phases. • During the first phase the input is presented and propagated forward through the network to compute the output value opj for each unit. This output is then compared with the targets, resulting in an error signal δpj for each output unit.
Delta Rule for Semilinear Activation Functions in Feedforward Networks • The second phase involves a backward pass through the network (analogous to the initial forward pass) during which the error signal is passed to each unit in the network and the appropriate weight changes are made.
Ex: Function Approximation • [Figure: a 1-2-1 network trained so that its output approximates a target t for each input p, with error e = t − a]
Network Architecture • [Figure: the 1-2-1 network — input p, two hidden units, one output a]
Initial Values • [Figure: the network’s initial weight values and its initial response curve]
Forward Propagation • Initial input: a^0 = p. • Output of the 1st layer: a^1 = f^1(W^1 p + b^1). • Output of the 2nd layer: a^2 = f^2(W^2 a^1 + b^2). • Error: e = t − a^2.
Backpropagation • The second-layer sensitivity: s^2 = −2 F′^2(n^2) e. • The first-layer sensitivity: s^1 = F′^1(n^1) (W^2)^T s^2, where F′^m(n^m) is the diagonal matrix of transfer-function derivatives at layer m.
Weight Update • Steepest descent with learning rate α: W^m(k+1) = W^m(k) − α s^m (a^(m−1))^T, b^m(k+1) = b^m(k) − α s^m.
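One complete forward/backward/update cycle for a 1-2-1 network (sigmoid hidden layer, linear output) can be sketched as follows. The initial weights, target function, input point, and learning rate are illustrative assumptions — the slides' actual numbers are not reproduced in this text:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# 1-2-1 network: log-sigmoid hidden layer, linear output layer
W1 = [-0.27, -0.41]; b1 = [-0.48, -0.13]   # assumed initial values
W2 = [0.09, -0.17];  b2 = 0.48
alpha = 0.1                                 # learning rate

def g(p):                                   # example target function
    return 1.0 + math.sin(math.pi * p / 4.0)

p = 1.0
t = g(p)

# forward propagation: a1 = f1(W1*p + b1), a2 = f2(W2*a1 + b2)
n1 = [W1[i] * p + b1[i] for i in range(2)]
a1 = [logsig(n) for n in n1]
a2 = W2[0]*a1[0] + W2[1]*a1[1] + b2
e = t - a2

# backpropagate sensitivities: s2 = -2*f2'(n2)*e, with f2 linear (f2' = 1)
s2 = -2.0 * e
s1 = [a1[i] * (1.0 - a1[i]) * W2[i] * s2 for i in range(2)]

# steepest-descent weight update
W2 = [W2[i] - alpha * s2 * a1[i] for i in range(2)]
b2 = b2 - alpha * s2
W1 = [W1[i] - alpha * s1[i] * p for i in range(2)]
b1 = [b1[i] - alpha * s1[i] for i in range(2)]
print(e)
```

Re-running the forward pass after the update gives a smaller error on the same input, which is exactly what one steepest-descent step should achieve.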
Choice of Network Structure • Multilayer networks can be used to approximate almost any function, if we have enough neurons in the hidden layers. • We cannot say, in general, how many layers or how many neurons are necessary for adequate performance.
Illustrated Example 1 1-3-1 Network
Illustrated Example 2 • [Figure: approximations of the same function by 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks]
Convergence • [Figure: two training trajectories — one converging to the global minimum, one converging to a local minimum; the numbers next to each curve indicate the sequence of iterations]