Perceptron
Inner-product scalar • Perceptron • Perceptron learning rule • XOR problem • linearly separable patterns • Gradient descent • Stochastic approximation to gradient descent • LMS, Adaline
Inner-product • A measure of the projection of one vector onto another
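For reference, the standard definition of the inner product (the slide itself does not spell out the formula):

```latex
\mathbf{w}\cdot\mathbf{x}
  \;=\; \mathbf{w}^{\mathsf{T}}\mathbf{x}
  \;=\; \sum_{i=1}^{n} w_i x_i
  \;=\; \|\mathbf{w}\|\,\|\mathbf{x}\|\cos\theta ,
```

where θ is the angle between the two vectors; the projection of x onto w has length ‖x‖ cos θ.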
≈ 10^10 neurons • 10^4–10^5 connections per neuron
Perceptron • Linear threshold unit (LTU), the McCulloch-Pitts model of a neuron • [figure: inputs x1, x2, …, xn with weights w1, w2, …, wn, a fixed bias input x0 = 1 with weight w0, and output o]
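Written out, the LTU's output takes the standard form (restored here because the slide's equation was part of the lost figure):

```latex
o(x_1,\dots,x_n) =
  \begin{cases}
    \;\;1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n > 0\\
    -1    & \text{otherwise,}
  \end{cases}
\qquad\text{i.e. } o = \operatorname{sgn}(\mathbf{w}\cdot\mathbf{x}) \text{ with } x_0 = 1 .
```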
The goal of a perceptron is to correctly classify the set of patterns D = {x1, x2, …, xm} into one of the classes C1 and C2 • The output for class C1 is o = 1 and for C2 it is o = −1 • For n = 2 the decision boundary is a line in the (x1, x2) plane (see below)
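Concretely, for two inputs the boundary between the classes lies where the weighted sum is zero (a standard observation, filled in here because the slide's equation did not survive extraction):

```latex
w_0 + w_1 x_1 + w_2 x_2 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_1}{w_2}\,x_1 - \frac{w_0}{w_2} \qquad (w_2 \neq 0),
```

a straight line; patterns on one side are assigned to C1, those on the other to C2.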
Perceptron learning rule • Consider linearly separable problems • How do we find appropriate weights? • Check whether the output o for a pattern has the desired value d of its class (the update rule is sketched below) • η is called the learning rate • 0 < η ≤ 1
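The update itself, in the usual form of the perceptron rule with learning rate η, target d, and output o (the slide's own equation was lost, so this is the standard statement):

```latex
w_i \;\leftarrow\; w_i + \Delta w_i,
\qquad
\Delta w_i \;=\; \eta\,(d - o)\,x_i ,
```

so the weights change only when the pattern is misclassified (d ≠ o).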
In supervised learning the network has its output compared with known correct answers • Supervised learning = learning with a teacher • (d − o) plays the role of the error signal
Perceptron • The algorithm converges to the correct classification • if the training data is linearly separable • and η is sufficiently small • When assigning a value to η we must keep in mind two conflicting requirements • Averaging of past inputs to provide stable weight estimates, which requires a small η • Fast adaptation with respect to real changes in the underlying distribution of the process responsible for the generation of the input vector x, which requires a large η
Frank Rosenblatt • 1928-1971
Rosenblatt's bitter rival and professional nemesis was Marvin Minsky of MIT • Minsky despised Rosenblatt, hated the concept of the perceptron, and wrote several polemics against him • For years Minsky crusaded against Rosenblatt on a very nasty and personal level, including contacting every group who funded Rosenblatt's research to denounce him as a charlatan, hoping to ruin Rosenblatt professionally and to cut off all funding for his research in neural nets
XOR problem and Perceptron • By Minsky and Papert in the late 1960s (Perceptrons, 1969)
Gradient Descent • To understand, consider a simpler linear unit, whose output is just the weighted sum of its inputs (no threshold) • Let's learn the wi that minimize the squared error over D = {(x1,t1), (x2,t2), …, (xd,td), …, (xm,tm)} • (t for target)
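The two formulas the slide refers to, restored in their standard form (od denotes the linear unit's output on training example d):

```latex
o \;=\; w_0 + w_1 x_1 + \dots + w_n x_n ,
\qquad
E(\mathbf{w}) \;=\; \tfrac{1}{2}\sum_{d \in D} (t_d - o_d)^2 .
```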
We want to move the weight vector in the direction that decreases E • wi = wi + Δwi • w = w + Δw
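The step Δw is taken against the gradient of E; expanding the derivative gives the batch update used in the algorithm below (standard derivation, reconstructed here because the slide's equations were lost):

```latex
\Delta \mathbf{w} \;=\; -\,\eta\,\nabla E(\mathbf{w}),
\qquad
\Delta w_i \;=\; -\,\eta\,\frac{\partial E}{\partial w_i}
            \;=\; \eta \sum_{d \in D} (t_d - o_d)\,x_{id} ,
```

where x_id is the i-th component of training example d.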
Gradient Descent
Gradient-Descent(training_examples, η)
Each training example is a pair of the form ⟨(x1,…,xn), t⟩, where (x1,…,xn) is the vector of input values, t is the target output value, and η is the learning rate (e.g. 0.1)
• Initialize each wi to some small random value
• Until the termination condition is met, do
  • Initialize each Δwi to zero
  • For each ⟨(x1,…,xn), t⟩ in training_examples, do
    • Input the instance (x1,…,xn) to the linear unit and compute the output o
    • For each linear unit weight wi, do
      • Δwi ← Δwi + η (t − o) xi
  • For each linear unit weight wi, do
    • wi ← wi + Δwi
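A minimal Python sketch of the same procedure, assuming NumPy arrays for the data and a fixed number of epochs as the termination condition (names such as `gradient_descent` and `n_epochs` are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(X, t, eta=0.1, n_epochs=100):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend x0 = 1 for the bias weight w0
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(n_epochs):                       # termination condition: fixed epoch count
        delta_w = np.zeros_like(w)                  # initialize each Δwi to zero
        for x_d, t_d in zip(X, t):
            o_d = np.dot(w, x_d)                    # linear unit output for example d
            delta_w += eta * (t_d - o_d) * x_d      # accumulate Δwi ← Δwi + η (t − o) xi
        w += delta_w                                # wi ← wi + Δwi, one step per pass over D
    return w

# Usage: fit a noisy linear target; a small η keeps the summed batch step stable
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
t = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.01 * rng.normal(size=50)
print(gradient_descent(X, t, eta=0.01, n_epochs=200))   # ≈ [0.5, 2.0, -1.0]
```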
Stochastic approximation to gradient descent • The gradient descent training rule updates the weights after summing over all the training examples in D • Stochastic gradient descent approximates this by updating the weights incrementally • Calculate the error and update after each example • Known as the delta rule or LMS (least mean square) weight update • Also the Adaline rule, used for adaptive filters, Widrow and Hoff (1960)
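The incremental (per-example) update, for comparison with the batch rule above (standard form, not reproduced verbatim from the slide):

```latex
w_i \;\leftarrow\; w_i + \eta\,(t - o)\,x_i
\quad\text{applied immediately after each training example } \langle \mathbf{x}, t \rangle ,
```

i.e. one term of the batch sum, applied one example at a time.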
LMS • Updates an estimate of the weight vector • Not true steepest descent • No well-defined trajectory in weight space • Instead a random trajectory (stochastic gradient descent) • Converges only asymptotically toward the minimum error • Can approximate gradient descent arbitrarily closely if η is made small enough
Summary • The perceptron training rule is guaranteed to succeed if • the training examples are linearly separable • and the learning rate is sufficiently small • The linear unit training rule, using gradient descent or LMS, is guaranteed to converge to the hypothesis with minimum squared error • given a sufficiently small learning rate • even when the training data contains noise • even when the training data is not separable by H
XOR? Multi-Layer Networks • [figure: feedforward network with an input layer, a hidden layer, and an output layer]
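As a concrete sketch of why a hidden layer resolves the XOR problem, here is one hand-chosen set of threshold-unit weights (an illustration, not taken from the slides): the hidden units compute OR and NAND of the inputs, and the output unit ANDs them, which is exactly XOR.

```python
import numpy as np

def ltu(w, x):
    """Linear threshold unit with bias weight w[0] and implicit bias input x0 = 1."""
    return 1 if w[0] + np.dot(w[1:], x) > 0 else -1

# Hand-chosen weights, with inputs and outputs coded as -1 / +1:
w_or   = np.array([ 1.0,  1.0,  1.0])   # hidden unit 1: x1 OR x2
w_nand = np.array([ 1.0, -1.0, -1.0])   # hidden unit 2: NOT (x1 AND x2)
w_and  = np.array([-1.0,  1.0,  1.0])   # output unit:   h1 AND h2  ->  XOR(x1, x2)

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    h = np.array([ltu(w_or, x), ltu(w_nand, x)])
    print(x, "->", ltu(w_and, h))
# Expected: -1, 1, 1, -1  (XOR in the -1/+1 coding)
```

No single LTU can produce this mapping, but two layers of the same units can, which is the motivation for the multi-layer networks shown in the figure.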