Advanced Topics in Information Knowledge Networks
Prediction and Learning 2: Perceptron and Kernels
Hiroki Arimura and Takuya Kida
Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
email: {arim,kida}@ist.hokudai.ac.jp
http://www-ikn.ist.hokudai.ac.jp/~arim
How to learn strings and graphs
• Learning problem
  • unknown function f: Graphs → {+1, -1}
  • Classify each input, e.g., DNA-like strings such as TCGCGAGGT or GCAGAGTAT, and chemical-compound graphs over atoms such as C, H, N, Fe, as +1 or -1.
[Figure: example strings and molecular graphs, each labeled +1 or -1]
Learning Strings and Graphs
• Linear learning machines (this week)
  • Classification by a hyperplane in the N-dimensional space RN
  • Efficient learning methods minimizing the regularized risk
• String and graph kernel methods (next week)
  • Substring and subgraph features
  • Efficient computation by dynamic programming (DP)
Prediction and Learning
• Training data
  • A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule
• Prediction
  • Predict the output y given a new input x
• Learning
  • Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.
An On-line Learning Framework
• Data
  • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
• Learning
  • The learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. This process is repeated.
• Goal
  • Find a good hypothesis h ∈ H that minimizes the number of mistakes in prediction. [Littlestone 1987]
Linear Learning Machines
• N-dimensional Euclidean space
  • The set of points x = (x1, ..., xN) ∈ RN
• Hyperplane
  • w = (w1, ..., wN) ∈ RN: a weight vector
  • b ∈ R: a bias
  • the hyperplane determined by (w, b): S = { x ∈ RN : 〈w, x〉 + b = 0 }
• Notation
  • 〈w, x〉 = w1x1 + ... + wNxN = ∑i wi xi
  • ||w||² = 〈w, w〉
Linear Learning Machines
• Linear threshold function f : RN → {+1, -1}
  • f(x) = sgn(w1x1 + ... + wNxN + b) = sgn(〈w, x〉 + b)
• The function f(x) is determined by the pair (w, b), i.e., it is a linear classifier
  • weight vector w = (w1, ..., wN) ∈ RN
  • bias b ∈ R
[Figure: a hyperplane 〈w, x〉 + b = 0 separating +1 points from -1 points, with the weight vector w normal to the hyperplane and the bias b < 0 shifting it away from the origin]
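To make the definition concrete, here is a minimal NumPy sketch of such a linear classifier (not from the original slides; the function name linear_classify and the example hyperplane are illustrative only):

```python
import numpy as np

def linear_classify(w, b, x):
    """Linear threshold function f(x) = sgn(<w, x> + b), with sgn(0) taken as +1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: the hyperplane x1 + x2 - 1 = 0 in R^2
w = np.array([1.0, 1.0])
b = -1.0
print(linear_classify(w, b, np.array([2.0, 2.0])))   # +1, above the hyperplane
print(linear_classify(w, b, np.array([0.0, 0.0])))   # -1, below the hyperplane
```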
Margin
• Sample
  • S = { (x1, y1), ..., (xm, ym) }
• Margin γ of a hyperplane (w, b) w.r.t. the sample S
  • γ = min(xi, yi)∈S yi(〈w, xi〉 + b) / ||w||
• Scale invariance
  • (w, b) and (cw, cb) define the same hyperplane (c > 0)
[Figure: a separating hyperplane with margin γ, the distance from the closest sample points to the hyperplane]
An Online Learning Framework
• Data
  • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
• Learning
  • The learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. This process is repeated.
• Goal
  • Find a good hypothesis h ∈ H that minimizes the number of mistakes in prediction.
Perceptron Learning Algorithm
• Perceptron
  • A linear classifier viewed as a model of a single neuron.
• Learning algorithm [Rosenblatt 1958]
  • The first iterative algorithm for learning linear classifiers
  • Online and mistake-driven
  • Guaranteed to converge in the linearly separable case. The speed of convergence is determined by a quantity called the margin [Novikoff 1962].
Perceptron Learning Algorithm
• Initialization
  • Start with the zero vector w := 0
• When a mistake occurs on (x, y)
  • Positive mistake (if y = +1)
    • the weight vector w is too weak
    • update by w := w + x/||x|| (add the normalized input)
  • Negative mistake (if y = -1)
    • the weight vector w is too strong
    • update by w := w - x/||x|| (subtract the normalized input)
• Update rule
  • If a mistake occurs, update w by w := w + y·x/||x||
Perceptron Learning Algorithm
• Algorithm:
  • Given m examples (x1, y1), ..., (xm, ym);
  • Initialize: w = 0 (= 0N);
  • Repeat the following:
    • Receive the next input x.
    • Predict: f(x) = sgn(〈w, x〉) ∈ {+1, -1}.
    • Receive the correct output y ∈ {+1, -1}.
    • If a mistake occurs (y·f(x) < 0), then update: w := w + y·x/||x||.
• Variation: w := w + η·y·x/||x||
  • η > 0: a learning parameter
• Assumption: b = 0
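A minimal Python/NumPy sketch of this algorithm (my illustration, not part of the slides), assuming b = 0 and using the normalized update w := w + η·y·x/||x|| described above:

```python
import numpy as np

def perceptron(examples, eta=1.0, epochs=1):
    """Online Perceptron with the normalized update w := w + eta * y * x / ||x||.

    examples: pairs (x, y) with x a NumPy vector and y in {+1, -1}.
    Assumes the bias b = 0, as on the slide; eta is the learning parameter.
    """
    w = np.zeros(len(examples[0][0]))
    mistakes = 0
    for _ in range(epochs):
        for x, y in examples:
            f_x = 1 if np.dot(w, x) >= 0 else -1       # predict sgn(<w, x>)
            if y * f_x < 0:                             # mistake
                w = w + eta * y * x / np.linalg.norm(x)
                mistakes += 1
    return w, mistakes

# Usage sketch on a toy linearly separable sample
S = [(np.array([1.0, 1.0]), +1), (np.array([-1.0, -2.0]), -1)]
w, m = perceptron(S, epochs=5)
```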
Perceptron Learning Algorithm
• Assumption (linearly separable case):
  • The unknown linear threshold function f*(x) = sgn(〈w*, x〉 + b) has margin γ w.r.t. the sample S.
• Theorem (Novikoff 1962):
  • The Perceptron learning algorithm makes at most (2R/γ)² mistakes, where R = max(x,y)∈S ||x|| is the size of the largest input vector.
• The mistake bound M of the algorithm is independent of the dimension N.
Proof of Theorem (Novikoff)
• When an update is made
  • A mistake has occurred: y·f(x) < 0.
  • Update: w' = w + y·x/||x||.
• Sketch
  • Upper bound on ||w||
  • Lower bound on 〈w, w*〉
  • Cauchy-Schwarz inequality: 〈w, w*〉 ≦ ||w||·||w*||
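Filling in the three steps (a sketch under the simplifying assumptions b = 0 and ||w*|| = 1, so the margin condition reads y·〈w*, x〉 ≥ γ for every (x, y) ∈ S; this simplified setting yields the bound (R/γ)², while the theorem's (2R/γ)² also covers the bias term):

```latex
\begin{align*}
&\text{Lower bound (each update gains at least } \gamma/R \text{ along } w^{*}):\\
&\qquad \langle w', w^{*}\rangle
   = \langle w, w^{*}\rangle + \frac{y\,\langle x, w^{*}\rangle}{\lVert x\rVert}
   \;\ge\; \langle w, w^{*}\rangle + \frac{\gamma}{R}
   \;\;\Rightarrow\;\; \langle w, w^{*}\rangle \ge \frac{M\gamma}{R}
   \text{ after } M \text{ mistakes.}\\
&\text{Upper bound (at a mistake } y\,\langle w, x\rangle \le 0):\\
&\qquad \lVert w'\rVert^{2}
   = \lVert w\rVert^{2} + \frac{2\,y\,\langle w, x\rangle}{\lVert x\rVert} + 1
   \;\le\; \lVert w\rVert^{2} + 1
   \;\;\Rightarrow\;\; \lVert w\rVert^{2} \le M.\\
&\text{Cauchy-Schwarz: }\;
   \frac{M\gamma}{R} \;\le\; \langle w, w^{*}\rangle \;\le\; \lVert w\rVert\,\lVert w^{*}\rVert \;\le\; \sqrt{M}
   \;\;\Rightarrow\;\; M \le \Bigl(\frac{R}{\gamma}\Bigr)^{2}.
\end{align*}
```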
Finding a separating hyperplane
• Consistent Hypothesis Finder
  • Find any hypothesis within the class C that separates the positive examples from the negative examples.
  • If the class C is not too complex, then any consistent hypothesis finder learns the class C.
• Exercise:
  • Show the following: Let S be a sample of size m. We can modify the Perceptron to find a hypothesis consistent with S in O(mnM) time, where M = (2R/γ)² is the mistake bound of the Perceptron.
[Figure: positive (+1) and negative (-1) examples separated by a hyperplane]
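One natural way to do this, sketched below under the same assumptions as before (b = 0, normalized updates; not a prescribed solution to the exercise), is to cycle the Perceptron over the sample until a full pass produces no mistakes; Novikoff's bound limits the total number of updates on a separable sample, so the loop terminates:

```python
import numpy as np

def find_consistent_hypothesis(sample):
    """Cycle the Perceptron over the sample until a full pass makes no mistakes.

    On a linearly separable sample this terminates: Novikoff's theorem bounds the
    total number of updates, so only finitely many passes can contain a mistake.
    """
    w = np.zeros(len(sample[0][0]))
    while True:
        mistakes = 0
        for x, y in sample:
            if y * np.dot(w, x) <= 0:                  # misclassified or on the boundary
                w = w + y * x / np.linalg.norm(x)      # Perceptron update
                mistakes += 1
        if mistakes == 0:
            return w
```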
Addition vs. Multiplication
• Littlestone, Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm, Machine Learning, 2(4): 285-318, 1988.
• Kivinen and Warmuth, Exponentiated gradient versus gradient descent for linear predictors, Information and Computation, 132(1): 1-63, 1997.
Addition vs. Multiplication
• Perceptron
  • Update: addition
• Weighted Majority & Winnow
  • Update: multiplication
• Different merits...
  • Presently, additive update algorithms are more popular (due to kernel techniques).
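For contrast, here is a sketch of a Winnow-style multiplicative update (my illustration, loosely following Littlestone's Winnow for Boolean inputs x ∈ {0,1}^N; the threshold N/2 and the update factor α = 2 are the usual textbook choices, not values taken from the slides):

```python
import numpy as np

def winnow(examples, alpha=2.0):
    """Winnow-style multiplicative update (sketch) for x in {0,1}^N, y in {+1,-1}.

    Predict +1 iff <w, x> >= theta; on a mistake, the weights of the active
    features (x_i = 1) are multiplied by alpha (promotion) or 1/alpha (demotion)
    instead of having x added to w.
    """
    n = len(examples[0][0])
    w = np.ones(n)                 # positive weights, all initialized to 1
    theta = n / 2.0                # the usual threshold N/2
    for x, y in examples:
        y_hat = 1 if np.dot(w, x) >= theta else -1
        if y != y_hat:
            w = w * np.power(alpha, y * x)   # multiply only where x_i = 1
    return w

# Usage sketch: a tiny toy sample over three Boolean attributes
S = [(np.array([1, 0, 1]), +1), (np.array([0, 1, 1]), -1)]
w = winnow(S)
```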
Extensions of Perceptron
• Kivinen, Smola, and Williamson, "Online learning with kernels", IEEE Transactions on Signal Processing, 52(8): 2165-2176, 2004.
Extensions of the Perceptron Algorithm
• What does the Perceptron algorithm do?
  • Risk function + gradient descent
• Perceptron's update rule
  • If a mistake occurs, then update w := w + y·x/||x||
  • Otherwise, do nothing: w := w
  • A mistake occurs iff y·f(x) < 0
• Risk function
  • Risk = Expected Error + Penalty for Complexity
Risk minimization
• Loss function lo(f(x), y) = lo(y·f(x))
• Expected risk
  • R[f] = E(x,y)[ lo(y·f(x)) ]
• Empirical risk
  • Remp[f] = (1/m) ∑i=1..m lo(yi·f(xi))
[Figure: the loss lo(z) as a function of z = y·f(x); a loss is incurred on errors (z < 0) and no loss for correct predictions (z > 0)]
Online Risk Minimization for the Perceptron
• Batch learning
  • Minimize the empirical risk by optimization methods
• Online learning (derivation of the Perceptron)
  • Sample S = { (xt, yt) } (the last example only) *1)
  • Minimization by classical gradient descent *2)
  • Same as the Perceptron's update rule
*1) minimization of the instantaneous risk on a single example
*2) η > 0: learning parameter
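Spelling this out (a sketch assuming the loss lo(z) = max(0, -z) on z = yt·〈w, xt〉/||xt||, which is one standard way to recover the Perceptron; the slides do not fix the loss function explicitly):

```latex
\begin{align*}
R_{\mathrm{inst}}[w] &= \mathrm{lo}\!\left(\frac{y_t\,\langle w, x_t\rangle}{\lVert x_t\rVert}\right),
  \qquad \mathrm{lo}(z) = \max(0, -z),\\
\nabla_{w}\, R_{\mathrm{inst}}[w] &=
  \begin{cases}
    -\,y_t\, x_t / \lVert x_t\rVert, & \text{if } y_t\,\langle w, x_t\rangle < 0 \quad(\text{mistake}),\\
    0, & \text{otherwise},
  \end{cases}\\
w_{t+1} &= w_t - \eta\, \nabla_{w}\, R_{\mathrm{inst}}[w_t]
  = \begin{cases}
      w_t + \eta\, y_t\, x_t / \lVert x_t\rVert, & \text{on a mistake},\\
      w_t, & \text{otherwise}.
    \end{cases}
\end{align*}
```

This is exactly the Perceptron update rule with learning parameter η.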
Regularized risk minimization
• Soft margin loss function
  • loρ(z) = max(0, ρ - z), where z = y·f(x)
  • addresses errors and noise in the data
  • margin parameter ρ
• Regularized empirical risk
  • Rreg[f] = Remp[f] + (λ/2)·||w||²
  • addresses the problem of overfitting
  • controls the complexity of the weight vector w
[Figure: the soft margin loss lo(z), which is zero for z ≥ ρ and grows linearly as z decreases below ρ]
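For completeness, one common online update for this regularized risk, in the style of the Kivinen-Smola-Williamson paper cited earlier (an assumption on my part, not stated on the slides; η is the learning rate and λ the regularization parameter):

```latex
% Stochastic-gradient sketch for the regularized risk with the soft margin loss:
% shrink w by the regularizer, and add the example on a margin violation.
\begin{align*}
w_{t+1} =
  \begin{cases}
    (1-\eta\lambda)\,w_t + \eta\, y_t\, x_t, & \text{if } y_t\,\langle w_t, x_t\rangle < \rho
      \quad(\text{margin violation}),\\[2pt]
    (1-\eta\lambda)\,w_t, & \text{otherwise}.
  \end{cases}
\end{align*}
```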
Introducing Kernels into the Perceptron
• How the Perceptron algorithm works:
  • mistake-driven
  • update rule for the weight vector
  • additive update
Perceptron Learning Algorithm
• Initialization
  • Start with the zero vector w := 0
• When a mistake occurs on (x, y)
  • Positive mistake (if y = +1)
    • the weight vector w is too weak
    • update by w := w + x/||x|| (add the normalized input)
  • Negative mistake (if y = -1)
    • the weight vector w is too strong
    • update by w := w - x/||x|| (subtract the normalized input)
• Update rule
  • If a mistake occurs, update w by w := w + y·x/||x||
Online algorithm with Kernels
• Weight vector built by the Perceptron algorithm
  • a weighted sum of the input vectors: w = ∑i αi·yi·xi/||xi||
• Coefficient αi
  • αi = 1 if a mistake occurs at xi
  • αi = 0 otherwise
• Prediction
  • done by the inner-product representation (or kernel computation): f(x) = sgn(〈w, x〉) = sgn( ∑i αi·yi·〈xi, x〉/||xi|| )
• Kernel function: K(x, x') = 〈φ(x), φ(x')〉 for a feature map φ, giving f(x) = sgn( ∑i αi·yi·K(xi, x)/||xi|| )
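A minimal Python sketch of this kernelized Perceptron (my illustration; the input normalization by ||xi|| used on the earlier slides is omitted for brevity, and kernel is any function returning 〈φ(x), φ(x')〉):

```python
import numpy as np

def kernel_perceptron(examples, kernel, epochs=1):
    """Kernelized Perceptron (sketch): store one coefficient alpha_i per example
    instead of the explicit weight vector w = sum_i alpha_i * y_i * x_i.

    kernel(x, x2) should return the inner product <phi(x), phi(x2)> for the
    chosen feature map phi (np.dot for the plain linear case).
    """
    alpha = np.zeros(len(examples))

    def predict(x):
        s = sum(a * y_i * kernel(x_i, x)
                for a, (x_i, y_i) in zip(alpha, examples) if a != 0)
        return 1 if s >= 0 else -1           # sgn, with sgn(0) taken as +1

    for _ in range(epochs):
        for i, (x_i, y_i) in enumerate(examples):
            if y_i * predict(x_i) < 0:        # mistake: remember this example
                alpha[i] += 1
    return alpha, predict

# Usage sketch: a linear kernel reproduces the ordinary Perceptron.
examples = [(np.array([1.0, 2.0]), +1), (np.array([-1.0, -1.0]), -1)]
alpha, f = kernel_perceptron(examples, kernel=np.dot)
```

With a single pass (epochs = 1) each αi ends up in {0, 1}, exactly as on the slide; passing np.dot as the kernel recovers the ordinary linear Perceptron.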
Summary
• What does the Perceptron algorithm do?
  • Risk function + gradient descent
  • Instantaneous risk minimization (on the last example)
• Extensions
  • Soft margin classification
  • Regularized risk minimization
  • Kernel trick
• Linear Learning Machine Family
  • Perceptron, Winnow, Weighted Majority
  • SVM, approximate maximal margin learners, ...