Perceptron convergence theorem (1-level) pp. 100-103, Christopher M. Bishop’s text
Review of Perceptron Learning Rule
• Cycle through all the patterns in the training set
• Test each pattern in turn using the current set of weights
• If the pattern is correctly classified, do nothing
• Else
• add the (scaled) pattern vector to the weight vector if a pattern from C1 is mislabeled as C2
• subtract the (scaled) pattern vector from the weight vector if a pattern from C2 is mislabeled as C1
• In components, the update for a misclassified pattern is w_i ← w_i + eta * phi_i^n * t^n, where n = pattern index, i = weight index, and t^n = +1 for C1, -1 for C2 (see the sketch below)
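As a concrete sketch of this rule (not from the slides; the function name, the epoch cap, and the default eta are my own choices), the loop might look like this in Python/NumPy:

```python
import numpy as np

def perceptron_train(phi, t, eta=1.0, max_epochs=100):
    """Perceptron learning rule.

    phi: (N, d) array of feature vectors (include a constant-1 column for the bias)
    t:   length-N array of targets, +1 for class C1 and -1 for class C2
    """
    w = np.zeros(phi.shape[1])              # start from the zero weight vector
    for _ in range(max_epochs):
        updated = False
        for phi_n, t_n in zip(phi, t):       # cycle through all training patterns
            if t_n * np.dot(w, phi_n) <= 0:  # pattern misclassified (or on the boundary)
                w += eta * t_n * phi_n       # add/subtract the scaled pattern vector
                updated = True
        if not updated:                      # a full pass with no errors: converged
            return w
    return w
```

For example, perceptron_train(np.array([[1.0, 2.0], [1.0, -1.0]]), np.array([1, -1])) returns a weight vector separating the two patterns. If the data are linearly separable, the theorem below guarantees the inner loop stops updating after finitely many corrections.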
Perceptron convergence theorem
• For any data set that is linearly separable, the learning rule is guaranteed to find a solution in a finite number of steps
• Proved by many authors: Rosenblatt (1962), Block, Nilsson, Minsky and Papert, Duda and Hart, etc.
Proof strategy
• By contradiction
• On one hand, we will show that the weight vector cannot grow too slowly
• On the other hand, we will show that it cannot grow too fast either
• Putting these two results together implies that the algorithm can take only a finite number of steps
First half: weight vector cannot grow too slowly
• We are considering a data set that is linearly separable
• Thus there is at least one weight vector, w-hat, for which all training vectors are correctly classified, so that w-hat^T phi^n t^n > 0 for all training patterns n
Lower bound (cont'd)
• The learning process starts from some arbitrary weight vector
• We assume that this initial vector is 0
• We also assume eta = 1
• It may not be obvious at first why these assumptions are harmless
• But clearly, if we can prove convergence with these assumptions, then we have shown that there are values of the starting vector and eta that allow convergence, and that is good enough!
• The updating equation becomes w^(tau+1) = w^(tau) + phi^n t^n, where phi^n is a misclassified pattern vector (n = pattern index)
Lower bound (cont'd)
• Now, after running the algorithm for a while, suppose each pattern vector phi^n has been presented and misclassified tau^n times
• Then the total weight vector at this point is w = sum over n of tau^n phi^n t^n
Lower bound (cont'd)
• Take the scalar product of both sides with w-hat to get w-hat^T w = sum_n tau^n (w-hat^T phi^n t^n) >= tau * min_n (w-hat^T phi^n t^n), where tau = sum_n tau^n is the total number of weight updates
• The inequality results from replacing each update term by the one with the smallest dot product with w-hat
• A smallest one exists because there are only a finite number of training patterns
• Conclusion: the left-hand side is lower-bounded by a linear function of tau
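Written out in LaTeX (a sketch consolidating the chain of steps above, using the slide's symbols), the lower-bound argument is:

```latex
% Lower bound: \hat{w}^T w grows at least linearly in the number of updates \tau
\begin{align*}
  w &= \sum_n \tau^n \phi^n t^n
      && \text{(each $\phi^n$ misclassified $\tau^n$ times, $\eta = 1$, $w^{(0)} = 0$)} \\
  \hat{w}^T w &= \sum_n \tau^n \, \hat{w}^T \phi^n t^n
      \;\ge\; \tau \, \min_n\!\left(\hat{w}^T \phi^n t^n\right),
      && \text{where } \tau = \sum_n \tau^n .
\end{align*}
```

The minimum on the right is strictly positive because w-hat classifies every pattern correctly, so the bound is genuinely linear in tau.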
Upper Bound
• We will now show an upper bound on the size of the weight vector
• To do this we look at the weight vector dotted with itself, rather than with w-hat
• After an update on a misclassified pattern we have ||w^(tau+1)||^2 = ||w^(tau)||^2 + 2 w^(tau)^T phi^n t^n + ||phi^n||^2
• The middle term is < 0 because pattern phi^n was misclassified
• You might say that this is so "by design"
Upper bound (cont'd)
• Each update therefore increases ||w||^2 by at most the squared norm of the biggest pattern vector, max_n ||phi^n||^2 (note that (t^n)^2 = 1)
• Thus, starting from w = 0, after tau weight updates we have ||w||^2 <= tau * max_n ||phi^n||^2
• Conclusion: the size of w increases no faster than sqrt(tau)
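In LaTeX (again a sketch in the slide's notation), the upper-bound chain reads:

```latex
% Upper bound: ||w||^2 grows at most linearly in \tau
\begin{align*}
  \|w^{(\tau+1)}\|^2 &= \|w^{(\tau)}\|^2 + 2\, w^{(\tau)T}\phi^n t^n + \|\phi^n\|^2
      && \text{(using $(t^n)^2 = 1$)} \\
                     &\le \|w^{(\tau)}\|^2 + \max_n \|\phi^n\|^2
      && \text{(middle term $<0$: $\phi^n$ was misclassified)} \\
  \Rightarrow\; \|w\|^2 &\le \tau \, \max_n \|\phi^n\|^2
      && \text{(after $\tau$ updates, starting from $w^{(0)} = 0$)}
\end{align*}
```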
Bounds
• Upper bound conclusion: the size of w is O(sqrt(tau))
• Lower bound conclusion (from before): the dot product of w-hat and w is Omega(tau)
• By the Cauchy-Schwarz inequality, w-hat^T w <= ||w-hat|| ||w||, so these two bounds become incompatible as tau grows, forcing tau to be finite
• Other texts give the details. The resulting bound on tau is not tight, but it exists
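Combining the two bounds via Cauchy-Schwarz (a sketch, in the same notation as above) gives an explicit, if loose, limit on the total number of updates:

```latex
% Linear lower bound vs. sqrt upper bound forces \tau to be finite
\begin{align*}
  \tau \, \min_n\!\left(\hat{w}^T \phi^n t^n\right)
    \;\le\; \hat{w}^T w
    \;\le\; \|\hat{w}\|\,\|w\|
    \;\le\; \|\hat{w}\| \sqrt{\tau \, \max_n \|\phi^n\|^2}
\end{align*}
\begin{align*}
  \Rightarrow\quad
  \tau \;\le\; \frac{\|\hat{w}\|^2 \, \max_n \|\phi^n\|^2}
                    {\left(\min_n \hat{w}^T \phi^n t^n\right)^2}
\end{align*}
```

Since the right-hand side is a fixed finite number for any separating w-hat, the algorithm can make only finitely many weight updates, which is the statement of the theorem.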