Delve into the concepts of neural networks, focusing on perceptrons, training techniques, and their strengths and limitations. Explore how neurons function, the convergence theorem, and the backpropagation method. Discover the structure of perceptrons, learning processes, and implications for decision boundaries. Unravel the complexity of multi-layer perceptrons, their advantages, drawbacks, and the solution to the XOR problem using network topology. Learn about backpropagation, smooth functions, and neural net training goals. Gain insights into error reduction, gradient descent, and the computation of weights for optimal network performance. Improve your understanding of artificial intelligence and computational models with this comprehensive guide.
Learning with Perceptrons and Neural Networks Artificial Intelligence CMSC 25000 February 14, 2002
Agenda • Neural Networks: • Biological analogy • Perceptrons: Single layer networks • Perceptron training: Perceptron convergence theorem • Perceptron limitations • Neural Networks: Multilayer perceptrons • Neural net training: Backpropagation • Strengths & Limitations • Conclusions
Neurons: The Concept Dendrites Axon Nucleus Cell Body Neurons: Receive inputs from other neurons (via synapses) When input exceeds threshold, “fires” Sends output along axon to other neurons Brain: 10^11 neurons, 10^16 synapses
Artificial Neural Nets • Simulated Neuron: • Node connected to other nodes via links • Links = axon+synapse+link • Links associated with weight (like synapse) • Multiplied by output of node • Node combines input via activation function • E.g. sum of weighted inputs passed thru threshold • Simpler than real neuronal processes
Artificial Neural Net [Figure: inputs x, each multiplied by a weight w, feed a Sum node followed by a Threshold node]
Perceptrons • Single neuron-like element • Binary inputs • Binary outputs • Weighted sum of inputs > threshold • (Possibly logic box between inputs and weights)
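Perceptron Structure [Figure: inputs x0 = -1, x1, x2, x3, ..., xn feed the output y through weights w0, w1, w2, w3, ..., wn] • The constant input x0 = -1 with weight w0 compensates for the threshold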
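A minimal sketch of the structure above, assuming a strict "greater than zero" firing rule. The function name `perceptron_output` and the AND weights are illustrative choices, not taken from the slides.

```python
def perceptron_output(weights, inputs):
    """Return 1 if the weighted sum of the augmented input exceeds 0, else 0.

    weights = [w0, w1, ..., wn]; w0 plays the role of the threshold.
    inputs  = [x1, ..., xn]; the constant x0 = -1 is prepended here.
    """
    augmented = [-1] + list(inputs)
    total = sum(w * x for w, x in zip(weights, augmented))
    return 1 if total > 0 else 0


# Hypothetical weights: a threshold of 1.5 with unit input weights acts like AND.
and_weights = [1.5, 1.0, 1.0]
outputs = [perceptron_output(and_weights, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(outputs)
```

Folding the threshold into w0 means training only ever has to adjust one weight vector.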
Perceptron Convergence Procedure • Straightforward training procedure • Learns linearly separable functions • Repeat until the perceptron yields the correct output for all samples: • If the perceptron is correct, do nothing • If the perceptron is wrong: • If it incorrectly says “yes”, subtract the input vector from the weight vector • Otherwise, add the input vector to the weight vector
Perceptron Convergence Example • LOGICAL-OR: • Sample: x1 x2 x3 -> Desired Output • S1: 0 0 1 -> 0 • S2: 0 1 1 -> 1 • S3: 1 0 1 -> 1 • S4: 1 1 1 -> 1 • Initial: w = (0 0 0); after S2, w = w + s2 = (0 1 1) • Pass 2: S1: w = w - s1 = (0 1 0); S3: w = w + s3 = (1 1 1) • Pass 3: S1: w = w - s1 = (1 1 0)
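The trace above can be reproduced with a short sketch of the convergence procedure. It assumes the perceptron says "yes" when the weighted sum is strictly greater than 0. The function name `train_perceptron` and the `max_passes` cap are illustrative additions.

```python
def train_perceptron(samples, max_passes=100):
    """Perceptron convergence procedure on (input_vector, desired) pairs."""
    w = [0] * len(samples[0][0])
    for _ in range(max_passes):
        mistakes = 0
        for x, desired in samples:
            says_yes = sum(wi * xi for wi, xi in zip(w, x)) > 0
            if says_yes and desired == 0:
                # Incorrect "yes": subtract the input vector.
                w = [wi - xi for wi, xi in zip(w, x)]
                mistakes += 1
            elif not says_yes and desired == 1:
                # Incorrect "no": add the input vector.
                w = [wi + xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return w  # a full pass with no errors
    return None  # no separating weights found within max_passes


# LOGICAL-OR samples; the third input is always 1.
or_samples = [((0, 0, 1), 0), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
final_w = train_perceptron(or_samples)
print(final_w)  # [1, 1, 0]
```

The fourth pass makes no changes, which is how the loop detects convergence.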
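Perceptron Convergence Theorem • If there exists a unit vector v s.t. v.x >= δ > 0 for all positive examples x (negative examples negated), perceptron training will find a separating weight vector • Assume |x| <= 1 for all examples • After k updates, w = x1 + x2 + ... + xk, so v.w >= kδ • |w|^2 increases by at most 1 in each update: |w + x|^2 <= |w|^2 + 1 (since w.x <= 0 on a misclassified x), so |w|^2 <= k (k = # of misclassifications) • kδ/sqrt(k) <= v.w/|w| <= 1 => converges in k <= (1/δ)^2 steps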
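Perceptron Learning • Perceptrons learn linear decision boundaries • E.g. [Figure: + and 0 points in the x1-x2 plane separated by a straight line] • But not xor [Figure: + and 0 points at alternating corners of the x1-x2 plane] • XOR with inputs coded as -1/1: • x1 = -1, x2 = -1: w1x1 + w2x2 < 0 • x1 = 1, x2 = -1: w1x1 + w2x2 > 0 => implies w1 > 0 • x1 = 1, x2 = 1: w1x1 + w2x2 > 0 => but should be false • x1 = -1, x2 = 1: w1x1 + w2x2 > 0 => implies w2 > 0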
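A small brute-force check of the XOR argument, assuming the threshold-free form w1x1 + w2x2 > 0 used above. The weight grid and the contrasting "x1 is true" function are illustrative. The search is a sanity check on a finite grid, not a proof.

```python
from itertools import product


def separates(w1, w2, samples):
    """True if (w1*x1 + w2*x2 > 0) matches the label of every sample."""
    return all((w1 * x1 + w2 * x2 > 0) == label for (x1, x2), label in samples)


xor_samples = [((-1, -1), False), ((1, -1), True), ((1, 1), False), ((-1, 1), True)]
# A linearly separable function for contrast: "x1 is true".
x1_samples = [((-1, -1), False), ((1, -1), True), ((1, 1), True), ((-1, 1), False)]

grid = [i / 4 for i in range(-20, 21)]  # candidate weights from -5 to 5
xor_solutions = [(a, b) for a, b in product(grid, grid) if separates(a, b, xor_samples)]
x1_solutions = [(a, b) for a, b in product(grid, grid) if separates(a, b, x1_samples)]

print(len(xor_solutions), len(x1_solutions) > 0)
```

No grid point works for XOR because the samples (1, -1) and (-1, 1) require w1 - w2 > 0 and w2 - w1 > 0 at the same time.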
Neural Nets • Multi-layer perceptrons • Inputs: real-valued • Intermediate “hidden” nodes • Output(s): one (or more) discrete-valued [Figure: inputs X1, X2, X3, X4 feed two layers of hidden nodes, which feed outputs Y1, Y2]
Neural Nets • Pro: More general than perceptrons • Not restricted to linear discriminants • Multiple outputs: one classification each • Con: No simple, guaranteed training procedure • Use greedy, hill-climbing procedure to train • “Gradient descent”, “Backpropagation”
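Solving the XOR Problem • Network Topology: 2 hidden nodes, 1 output [Figure: inputs x1, x2 feed hidden nodes o1 (weights w11, w21, threshold w01) and o2 (weights w12, w22, threshold w02); o1 and o2 feed output y (weights w13, w23, threshold w03); each threshold weight is attached to a constant -1 input] • Desired behavior: • x1 x2 -> o1 o2 y • 0 0 -> 0 0 0 • 1 0 -> 0 1 1 • 0 1 -> 0 1 1 • 1 1 -> 1 1 0 • Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1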
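The listed weights can be checked against the desired-behavior table with a short sketch. It assumes that wij links input (or hidden node) i to node j, and that each threshold weight multiplies a constant -1 input. The function names are illustrative.

```python
def step(total):
    """Pure step threshold: fire only when the weighted sum exceeds 0."""
    return 1 if total > 0 else 0


def xor_net(x1, x2):
    # Weights from the slide.
    w11, w21, w01 = 1, 1, 1.5    # hidden node o1 (acts like AND)
    w12, w22, w02 = 1, 1, 0.5    # hidden node o2 (acts like OR)
    w13, w23, w03 = -1, 1, 0.5   # output node y
    o1 = step(w11 * x1 + w21 * x2 - w01)
    o2 = step(w12 * x1 + w22 * x2 - w02)
    y = step(w13 * o1 + w23 * o2 - w03)
    return o1, o2, y


table = {(x1, x2): xor_net(x1, x2) for x1 in (0, 1) for x2 in (0, 1)}
for inputs, outputs in sorted(table.items()):
    print(inputs, "->", outputs)
```

The output node fires when o2 (OR) is on and o1 (AND) is off, which is exactly XOR.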
Backpropagation • Greedy, Hill-climbing procedure • Weights are parameters to change • Original hill-climb changes one parameter/step • Slow • If smooth function, change all parameters/step • Gradient descent • Backpropagation: Computes current output, works backward to correct error
Producing a Smooth Function • Key problem: • Pure step threshold is discontinuous • Not differentiable • Solution: • Sigmoid (squashed ‘s’ function): Logistic fn
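A minimal comparison of the two activation functions, using the standard logistic function s(z) = 1 / (1 + e^(-z)). The sample points are arbitrary.

```python
import math


def step(z):
    """Pure step threshold: jumps from 0 to 1 at z = 0."""
    return 1.0 if z > 0 else 0.0


def sigmoid(z):
    """Logistic function: a smooth, differentiable version of the step."""
    return 1.0 / (1.0 + math.exp(-z))


zs = [-4, -1, 0, 1, 4]
step_values = [step(z) for z in zs]
sigmoid_values = [sigmoid(z) for z in zs]

for z, a, b in zip(zs, step_values, sigmoid_values):
    print(f"z={z:+d}  step={a:.0f}  sigmoid={b:.3f}")
```

The sigmoid stays strictly between 0 and 1 and changes gradually, so small weight changes produce small output changes.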
Neural Net Training • Goal: • Determine how to change weights to get correct output • Large change in weight to produce large reduction in error • Approach: • Compute actual output: o • Compare to desired output: d • Determine effect of each weight w on error = d-o • Adjust weights
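Neural Net Example [Figure: 2-2-1 network. Inputs x1, x2 and constant -1 inputs feed hidden nodes 1 and 2 (sums z1, z2; outputs y1, y2) through weights w11, w21, w01 and w12, w22, w02; the hidden outputs and a constant -1 input feed output node 3 (sum z3, output y3) through weights w13, w23, w03] • xi: ith sample input vector • w: weight vector • yi*: desired output for ith sample • Sum of squares error over training samples • Full expression of output in terms of input and weights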
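One way to write the two expressions named above, offered as a reconstruction under assumptions. Here s is the sigmoid, wij links input (or hidden node) i to node j, each w0j multiplies a constant -1 input, and the factor 1/2 is a common convention rather than something the slide states.

```latex
% Sum-of-squares error over the training samples
E(\mathbf{w}) = \frac{1}{2} \sum_i \bigl( y_i^{*} - y_3(\mathbf{x}_i, \mathbf{w}) \bigr)^2

% Output in terms of inputs and weights
y_3 = s\bigl( w_{13}\, s(w_{11} x_1 + w_{21} x_2 - w_{01})
            + w_{23}\, s(w_{12} x_1 + w_{22} x_2 - w_{02})
            - w_{03} \bigr)
```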
Gradient Descent • Error: Sum of squares error of inputs with current weights • Compute rate of change of error wrt each weight • Which weights have greatest effect on error? • Effectively, partial derivatives of error wrt weights • In turn, depend on other weights => chain rule
Gradient of Error [Figure: the same 2-2-1 network, with sums z1, z2, z3, outputs y1, y2, y3, and weights w11, w21, w01, w12, w22, w02, w13, w23, w03] • Note: Derivative of sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1)) • Source: MIT AI lecture notes, Lozano-Perez 2000
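A worked chain-rule sketch for a single sample. The assumptions: error E = (1/2)(y* - y3)^2, sums z1 = w11 x1 + w21 x2 - w01 and z3 = w13 y1 + w23 y2 - w03, outputs yj = s(zj), and wij linking node i to node j.

```latex
% Output-layer weight: one application of the chain rule
\frac{\partial E}{\partial w_{13}}
  = \frac{\partial E}{\partial y_3}\,\frac{\partial y_3}{\partial z_3}\,\frac{\partial z_3}{\partial w_{13}}
  = -(y^{*} - y_3)\; y_3 (1 - y_3)\; y_1

% Threshold weight of the output node (its input is the constant -1)
\frac{\partial E}{\partial w_{03}} = (y^{*} - y_3)\; y_3 (1 - y_3)

% Hidden-layer weight: the chain passes through w_{13}
\frac{\partial E}{\partial w_{11}}
  = -(y^{*} - y_3)\; y_3 (1 - y_3)\; w_{13}\; y_1 (1 - y_1)\; x_1
```

Each factor of the form y(1 - y) is the sigmoid derivative from the note above, evaluated at that node's output.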
From Effect to Update • Gradient computation: • How each weight contributes to performance • To train: • Need to determine how to CHANGE weight based on contribution to performance • Need to determine how MUCH change to make per iteration • Rate parameter ‘r’ • Large enough to learn quickly • Small enough to reach but not overshoot target values
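The effect of the rate parameter can be seen on a toy problem. This sketch assumes a one-weight error function E(w) = w^2, chosen only for illustration, with its minimum at w = 0. The rates are arbitrary.

```python
def descend(rate, steps=20, w=1.0):
    """Gradient descent on the toy error E(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - rate * 2 * w
    return w


slow = descend(rate=0.01)      # too small: still far from 0 after 20 steps
good = descend(rate=0.3)       # reaches the minimum quickly
overshoot = descend(rate=1.1)  # too large: each step jumps past 0 and grows

print(slow, good, overshoot)
```

Each step multiplies w by (1 - 2 * rate), so the iterate shrinks only when that factor has magnitude below 1.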
Backpropagation Procedure [Figure: nodes i, j, k in successive layers] • Pick rate parameter ‘r’ • Until performance is good enough, • Do forward computation to calculate output • Compute Beta in output node • Compute Beta in all other nodes • Compute change for all weights
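A sketch of the procedure for the 2-2-1 sigmoid network, assuming one common beta formulation for squared error: beta = d - o at the output node, beta_j = sum over k of w_jk o_k (1 - o_k) beta_k at other nodes, and weight change r * o_i * o_j (1 - o_j) * beta_j. The weight naming, initial weight range, rate, and epoch count are illustrative assumptions.

```python
import math
import random

NAMES = ["w11", "w21", "w01", "w12", "w22", "w02", "w13", "w23", "w03"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x1, x2):
    """Forward pass; every threshold weight sees a constant -1 input."""
    y1 = sigmoid(w["w11"] * x1 + w["w21"] * x2 - w["w01"])
    y2 = sigmoid(w["w12"] * x1 + w["w22"] * x2 - w["w02"])
    y3 = sigmoid(w["w13"] * y1 + w["w23"] * y2 - w["w03"])
    return y1, y2, y3


def weight_changes(w, x1, x2, d, r):
    """Beta for the output node, then the hidden nodes, then all weight changes."""
    y1, y2, y3 = forward(w, x1, x2)
    beta3 = d - y3                             # output node: desired minus actual
    beta1 = w["w13"] * y3 * (1 - y3) * beta3   # hidden nodes: beta propagated backward
    beta2 = w["w23"] * y3 * (1 - y3) * beta3
    g1 = y1 * (1 - y1) * beta1
    g2 = y2 * (1 - y2) * beta2
    g3 = y3 * (1 - y3) * beta3
    return {
        "w11": r * x1 * g1, "w21": r * x2 * g1, "w01": r * -1 * g1,
        "w12": r * x1 * g2, "w22": r * x2 * g2, "w02": r * -1 * g2,
        "w13": r * y1 * g3, "w23": r * y2 * g3, "w03": r * -1 * g3,
    }


def total_error(w, samples):
    """Half the sum of squared errors over the samples."""
    return 0.5 * sum((d - forward(w, x1, x2)[2]) ** 2 for (x1, x2), d in samples)


def train(samples, r=0.5, epochs=2000, seed=0):
    """Online training: apply the weight changes after every sample."""
    rng = random.Random(seed)
    w = {name: rng.uniform(-0.5, 0.5) for name in NAMES}
    for _ in range(epochs):
        for (x1, x2), d in samples:
            for name, change in weight_changes(w, x1, x2, d, r).items():
                w[name] += change
    return w


xor_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
weights = train(xor_samples)
print("error after training:", total_error(weights, xor_samples))
```

Each weight change uses only the outputs of the two nodes it connects plus the downstream beta. Training on XOR with two hidden nodes can stall in a local minimum, so the final error depends on the initial weights.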
Backpropagation Observations • Procedure is (relatively) efficient • All computations are local • Use inputs and outputs of current node • What is “good enough”? • Rarely reach target (0 or 1) outputs • Typically, train until within 0.1 of target
Neural Net Summary • Training: • Backpropagation procedure • Gradient descent strategy (usual problems) • Prediction: • Compute outputs based on input vector & weights • Pros: Very general, Fast prediction • Cons: Training can be VERY slow (1000’s of epochs), Overfitting
Training Strategies • Online training: • Update weights after each sample • Offline (batch training): • Compute error over all samples • Then update weights • Online training “noisy” • Sensitive to individual instances • However, may escape local minima
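The two update schedules differ only in when the weight is changed. This is a toy sketch assuming a single linear unit o = w * x with squared error, chosen for brevity rather than taken from the slides. The samples and rate are made up.

```python
def online_epoch(w, samples, r):
    """Online: update the weight after every sample."""
    for x, d in samples:
        w += r * (d - w * x) * x
    return w


def batch_epoch(w, samples, r):
    """Offline (batch): accumulate the change over all samples, then update once."""
    change = sum(r * (d - w * x) * x for x, d in samples)
    return w + change


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # desired output d = 2 * x
w_online = w_batch = 0.0
for _ in range(50):
    w_online = online_epoch(w_online, samples, r=0.05)
    w_batch = batch_epoch(w_batch, samples, r=0.05)

print(w_online, w_batch)
```

On this noise-free data both schedules settle on w = 2. With noisy samples the online path would wander more from step to step.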
Training Strategy • To avoid overfitting: • Split data into: training, validation, & test • Also, avoid excess weights (use fewer weights than training samples) • Initialize with small random weights • Small changes have noticeable effect • Use offline training • Train until the validation set error reaches its minimum • Evaluate on test set • No more weight changes
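Stopping at the validation minimum can be sketched as picking the epoch with the lowest validation error. The error curves below are made-up numbers for illustration only, and `best_epoch` is a hypothetical helper name.

```python
def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error (first one on ties)."""
    return min(range(len(validation_errors)), key=lambda i: validation_errors[i])


# Made-up curves: training error keeps falling while validation error turns up.
training_errors = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15, 0.10]
validation_errors = [0.95, 0.70, 0.52, 0.45, 0.47, 0.55, 0.66]

stop_at = best_epoch(validation_errors)
print(stop_at)  # 3
```

The weights saved at that epoch are the ones evaluated on the test set, with no further changes.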
Classification • Neural networks best for classification task • Single output -> Binary classifier • Multiple outputs -> Multiway classification • Applied successfully to learning pronunciation • Sigmoid pushes to binary classification • Not good for regression
Neural Net Conclusions • Simulation based on neurons in brain • Perceptrons (single neuron) • Guaranteed to find linear discriminant • IF one exists -> problem XOR • Neural nets (Multi-layer perceptrons) • Very general • Backpropagation training procedure • Gradient descent - local min, overfitting issues