Perceptron convergence theorem (1-level)
pp. 100-103, Christopher M. Bishop's text
Review of Perceptron Learning Rule
• Cycle through all the patterns in the training set
• Test each pattern in turn using the current set of weights
• If the pattern is correctly classified, do nothing
• Else:
• add the (scaled) pattern vector to the weight vector if the pattern belongs to C1 but was misclassified as C2
• subtract the (scaled) pattern vector from the weight vector if the pattern belongs to C2 but was misclassified as C1
• In symbols, for a misclassified pattern: $w_i^{\text{new}} = w_i^{\text{old}} + \eta\,\phi_i^n t^n$, where $n$ = pattern index, $i$ = weight index, and $t^n = +1$ for C1, $-1$ for C2 (see the sketch below)
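A minimal Python sketch of this rule, assuming ±1 targets, pattern vectors that already include a bias component, and a fixed learning rate; the function name and data layout are illustrative, not from Bishop's text:

```python
import numpy as np

def perceptron_train(phi, t, eta=1.0, max_epochs=100):
    """Cycle through the patterns, updating w only on misclassifications.

    phi : (N, d) array of pattern vectors (a bias component is assumed
          to be included as one of the d entries)
    t   : (N,) array of targets, +1 for class C1 and -1 for class C2
    """
    w = np.zeros(phi.shape[1])                # arbitrary start; here the zero vector
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(phi, t):
            if t_n * np.dot(w, phi_n) <= 0:   # misclassified (or on the boundary)
                w += eta * t_n * phi_n        # add for C1 patterns, subtract for C2
                errors += 1
        if errors == 0:                       # a full pass with no errors: done
            break
    return w
```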
Perceptron convergence theorem
• For any data set that is linearly separable, the learning rule is guaranteed to find a solution in a finite number of steps
• Proved by many authors: Rosenblatt (1962), Block, Nilsson, Minsky and Papert, Duda and Hart, etc.
Proof strategy
• By contradiction: suppose the algorithm keeps making weight updates forever
• On one hand we'll show that the weight vector can't grow too slowly
• On the other hand we'll show that it can't grow too fast either
• Putting these two results together implies that only a finite number of weight updates can occur
First half: the weight vector can't grow too slowly
• We are considering a data set that is linearly separable
• Thus there is at least one weight vector, $\hat{w}$ (w-hat), for which all training vectors are correctly classified, so that
$\hat{w}^{\mathrm{T}} \phi^n t^n > 0$ for all training patterns $n$
Lower Bound (cont'd)
• The learning process starts with some arbitrary weight vector; we assume it is $0$
• We also assume $\eta = 1$
• I haven't gotten around to understanding why these assumptions are OK yet
• But clearly, if we can prove convergence with these assumptions, then we've shown that there are values of the starting vector and $\eta$ that allow convergence, and that's good enough!
• The updating equation becomes $w_i^{\text{new}} = w_i^{\text{old}} + \phi_i^n t^n$, where $n$ = pattern index, $i$ = weight index, and $\phi^n$ is a misclassified pattern vector
Lower bound (cont'd)
• Now, after running the algorithm for a while, suppose each pattern vector $\phi^n$ has been presented and misclassified $\tau^n$ times
• Then, since we started from $w = 0$, the total weight vector at this time is
$w = \sum_n \tau^n \phi^n t^n$ (checked numerically below)
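A quick numerical check of this decomposition, building on the sketch above; the counting variant and the toy data are hypothetical illustrations, not from the text:

```python
import numpy as np

def perceptron_train_counting(phi, t, max_epochs=100):
    """Same rule with eta = 1 and w started at 0, but also record tau[n]:
    how many times pattern n was misclassified and used in an update."""
    w = np.zeros(phi.shape[1])
    tau = np.zeros(len(phi))
    for _ in range(max_epochs):
        errors = 0
        for n, (phi_n, t_n) in enumerate(zip(phi, t)):
            if t_n * np.dot(w, phi_n) <= 0:
                w += t_n * phi_n
                tau[n] += 1
                errors += 1
        if errors == 0:
            break
    return w, tau

# Toy separable data (hypothetical): the first column plays the role of the bias.
phi = np.array([[1.0,  2.0,  1.0],
                [1.0,  1.5,  2.0],
                [1.0, -1.0, -0.5],
                [1.0, -2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, tau = perceptron_train_counting(phi, t)
# The final weight vector equals the tau-weighted sum of the update vectors.
assert np.allclose(w, (tau * t) @ phi)
```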
Lower bound (cont'd)
• Take the scalar product of both sides with $\hat{w}$ to get
$\hat{w}^{\mathrm{T}} w = \sum_n \tau^n\, \hat{w}^{\mathrm{T}} \phi^n t^n \;\ge\; \tau \,\min_n \bigl(\hat{w}^{\mathrm{T}} \phi^n t^n\bigr)$, where $\tau = \sum_n \tau^n$ is the total number of weight updates
• The inequality results from replacing each update term by the one with the smallest dot product with $\hat{w}$
• A smallest one exists because there are only a finite number of training patterns, and it is positive because $\hat{w}$ classifies every pattern correctly
• Conclusion: the LHS is lower-bounded by a linear, increasing function of $\tau$!
Upper Bound
• We'll now show an upper bound on the size of the weight vector
• To do this we look at the weight vector dotted with itself rather than with $\hat{w}$
• We have
$\|w^{\text{new}}\|^2 = \|w^{\text{old}} + \phi^n t^n\|^2 = \|w^{\text{old}}\|^2 + 2\, t^n\, (w^{\text{old}})^{\mathrm{T}} \phi^n + \|\phi^n\|^2$
• The cross term $2\, t^n\, (w^{\text{old}})^{\mathrm{T}} \phi^n$ is $< 0$ because pattern $\phi^n$ was misclassified
• You might say that this is so "by design"
Upper bound (cont'd)
• Each update therefore increases $\|w\|^2$ by at most $\|\phi^n\|^2 \le \max_n \|\phi^n\|^2$ (the biggest pattern vector)
• Thus, starting from $w = 0$, after $\tau$ weight updates we have
$\|w\|^2 \le \tau \,\max_n \|\phi^n\|^2$ (spelled out below)
• Conclusion: the size of $w$ increases no faster than $\sqrt{\tau}$
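Spelling out the telescoping step that the slide compresses (a sketch, using the $\eta = 1$ and $w^{(0)} = 0$ assumptions from before):

$$
\|w^{(j+1)}\|^2 \;\le\; \|w^{(j)}\|^2 + \max_n \|\phi^n\|^2
\quad\Longrightarrow\quad
\|w^{(\tau)}\|^2 \;\le\; \|w^{(0)}\|^2 + \tau \max_n \|\phi^n\|^2 \;=\; \tau \max_n \|\phi^n\|^2 .
$$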
Bounds
• Upper bound conclusion: the size of $w$ is $O(\sqrt{\tau})$
• Lower bound conclusion (from before): the dot product $\hat{w}^{\mathrm{T}} w$ is $\Omega(\tau)$
• Since $\hat{w}^{\mathrm{T}} w \le \|\hat{w}\|\,\|w\|$ (Cauchy-Schwarz), these two bounds become incompatible as $\tau$ grows, so $\tau$ must be finite (made explicit below)
• Other texts have the details; the resulting bound on $\tau$ is not tight, but it exists at least
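One way to make the bound explicit, combining the two results above (a sketch going slightly beyond the slides; the constants come from the training set and the assumed separating vector $\hat{w}$):

$$
\tau \,\min_n\bigl(\hat{w}^{\mathrm{T}} \phi^n t^n\bigr)
\;\le\; \hat{w}^{\mathrm{T}} w
\;\le\; \|\hat{w}\|\,\|w\|
\;\le\; \|\hat{w}\|\,\sqrt{\tau}\,\max_n\|\phi^n\|
\quad\Longrightarrow\quad
\tau \;\le\; \frac{\|\hat{w}\|^2 \,\max_n \|\phi^n\|^2}{\bigl(\min_n \hat{w}^{\mathrm{T}} \phi^n t^n\bigr)^2}.
$$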