
Advanced Topics in Information Knowledge Networks, Prediction and Learning 1: Majority Vote Algorithm



  1. Advanced Topics in Information Knowledge Networks
  Prediction and Learning 1: Majority Vote Algorithm
  Hiroki Arimura and Takuya Kida
  Division of Computer Science, Graduate School of Information Science and Technology, Hokkaido University
  email: {arim,kida}@ist.hokudai.ac.jp
  http://www-ikn.ist.hokudai.ac.jp/ikn-tokuron/
  http://www-ikn.ist.hokudai.ac.jp/~arim

  2. Prediction and Learning
  • Training Data
    • A set of n pairs of observations (x1, y1), ..., (xn, yn) generated by some unknown rule.
  • Prediction
    • Predict the output y given a new input x.
  • Learning
    • Find a function y = h(x) for the prediction within a class of hypotheses H = {h0, h1, h2, ..., hi, ...}.
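
A small sketch of this setting, with made-up data and a made-up hypothesis class of one-dimensional threshold functions (everything here is illustrative, not from the slides): learning amounts to selecting the hypothesis in H with the fewest mistakes on the training data, and prediction applies it to a new input.

```python
# Minimal sketch of "learning = selecting a hypothesis from H" (illustrative data only).
# Each hypothesis h_t(x) = +1 if x >= t else -1, for a small set of thresholds t.

def make_hypothesis(t):
    return lambda x: +1 if x >= t else -1

# Hypothesis class H = {h0, h1, h2, ...} over thresholds 0, 1, ..., 9.
H = [make_hypothesis(t) for t in range(10)]

# Training data (x1, y1), ..., (xn, yn) generated by some unknown rule.
data = [(1, -1), (2, -1), (4, +1), (7, +1), (9, +1)]

# Learning: pick the hypothesis h in H with the fewest mistakes on the training data.
def mistakes(h, data):
    return sum(1 for x, y in data if h(x) != y)

best = min(H, key=lambda h: mistakes(h, data))

# Prediction: apply the learned hypothesis to a new input x.
print(best(5))
```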

  3. An On-line Learning Framework
  • Data
    • A sequence of pairs of observations (x1, y1), ..., (xn, yn), ... generated by some unknown rule.
  • Learning
    • A learning algorithm A receives the next input xn, predicts h(xn), receives the output yn, and incurs a mistake if yn ≠ h(xn). If a mistake occurs, A updates the current hypothesis h. Repeat this process.
  • Goal
    • Find a good hypothesis h ∈ H by minimizing the number of mistakes in prediction.
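
A minimal sketch of this protocol, assuming a generic `learner` object with `predict` and `update` methods (these names are my own placeholders, not from the slides):

```python
# Skeleton of the on-line learning protocol described above.
# `learner`, `stream`, and the method names are illustrative assumptions.

def run_online(learner, stream):
    """stream yields pairs (x, y); returns the total number of mistakes."""
    mistakes = 0
    for x, y in stream:
        y_hat = learner.predict(x)   # predict h(x) for the next input x
        if y_hat != y:               # a mistake is incurred if y != h(x)
            mistakes += 1
            learner.update(x, y)     # update the current hypothesis h
    return mistakes
```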

  4. Learning an Unknown Function
  • Strategy
    • Select a hypothesis h ∈ H for making the prediction y = h(x) from a given class of functions H = {h0, h1, h2, ..., hi, ...}.
  • Question
    • How can we select the best hypothesis h ∈ H, that is, the one that minimizes the number of mistakes during prediction?
    • Here we ignore computation time.

  5. Naive Algorithm (Sequential)
  • Algorithm:
    • Given: the hypothesis class H = {h1, ..., hN}.
    • Initialize: k = 1.
    • Repeat the following:
      • Receive the next input x.
      • Predict by h(x) = hk(x). Receive the correct output y.
      • If a mistake occurs then k = k + 1.  (Exhaustive search!)
  • Observation:
    • The naive algorithm makes at most N mistakes (assuming the target function belongs to H).
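
A direct sketch of the naive sequential algorithm; the hypothesis class `H` and the data stream are placeholders:

```python
# Naive (sequential) algorithm: try hypotheses h1, h2, ... one by one,
# moving to the next hypothesis whenever the current one makes a mistake.

def naive_sequential(H, stream):
    """H: list of hypotheses (callables x -> {+1,-1}); stream: pairs (x, y)."""
    k = 0                      # index of the current hypothesis (0-based here)
    mistakes = 0
    for x, y in stream:
        y_hat = H[k](x)        # predict with the current hypothesis h_k
        if y_hat != y:         # on a mistake, advance to the next hypothesis
            mistakes += 1
            k += 1             # in the consistent case, k never runs past the last hypothesis
    return mistakes
```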

  6. Halving Algorithm
  • Naive Algorithm
    • causes N mistakes in the worst case.
    • N is usually exponentially large in the size |h| of a hypothesis h ∈ H.
  • Basic Idea
    • We want to achieve an exponential speed-up!
    • Eliminate at least half of the hypotheses whenever a mistake happens.
    • The key is to carefully choose the prediction value h(x) by majority voting, so that one mistake implies that at least half of the hypotheses fail.

  7. Halving Algorithm [Barzdin and Freivalds 1972]
  • Algorithm:
    • Initialize the hypothesis class H = {h1, ..., hN}.
    • Repeat the following:
      • Receive the next input x.
      • Split H into A+1 = { h ∈ H : h(x) = +1 } and A-1 = { h ∈ H : h(x) = -1 }.
      • If |A+1| ≥ |A-1| then predict ŷ = +1; otherwise predict ŷ = -1.  (Majority voting)
      • Receive the correct output y.
      • If the prediction is wrong then remove all hypotheses that made the mistake: H = H − Aŷ.  (Eliminates at least half)
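
A minimal sketch of the Halving algorithm over a finite hypothesis class (hypotheses are callables returning ±1; `H` and `stream` are placeholders):

```python
# Halving algorithm: predict by majority vote over the surviving hypotheses,
# and on a mistake remove every hypothesis that predicted the wrong value.

def halving(H, stream):
    """H: list of hypotheses (callables x -> {+1,-1}); stream: pairs (x, y)."""
    active = list(H)
    mistakes = 0
    for x, y in stream:
        plus  = [h for h in active if h(x) == +1]
        minus = [h for h in active if h(x) == -1]
        y_hat = +1 if len(plus) >= len(minus) else -1   # majority vote
        if y_hat != y:
            mistakes += 1
            # the wrong side held at least half of the active hypotheses
            active = minus if y_hat == +1 else plus
    return mistakes
```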

  8. Halving Algorithm: Result
  • Assumption (Consistent case):
    • The unknown target function f belongs to the class H.
  • Theorem (Barzdins '72; Littlestone '87):
    • The Halving algorithm makes at most log N mistakes, where N is the number of hypotheses in H.
  • This gives a general strategy for designing efficient online learning algorithms.
  • The Halving algorithm is, however, not optimal [Littlestone '90].

  9. [Proof]
  • When receiving an input x, the predictions of the active hypotheses (experts) split A into A+1 and A-1, where Aval = { h ∈ A : h(x) = val } for each val ∈ {+1,-1}.
  • Since the prediction is made according to the larger set, if a mistake occurs then the larger half is removed from A.
  • Therefore, each mistake decreases the number of active experts in A by at least half.
  • It follows that |A| ≤ N⋅(1/2)^M after M mistakes.
  • Note that a perfect expert always belongs to the subset Aval of the correct value, which always predicts correctly; hence every perfect expert survives each update of A, and |A| ≥ 1 by the consistency assumption.
  • Combining 1 ≤ |A| ≤ N⋅(1/2)^M, we conclude that the Halving algorithm makes at most M = lg N mistakes. ■
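
The counting argument above can be summarized in one line (N is the number of hypotheses, M the number of mistakes; this is only a restatement of the slide's inequalities):

```latex
\[
1 \;\le\; |A| \;\le\; N \left(\tfrac{1}{2}\right)^{M}
\quad\Longrightarrow\quad
2^{M} \le N
\quad\Longrightarrow\quad
M \le \log_2 N .
\]
```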

  10. Majority Vote Algorithm
  • Naive & Halving Algorithms
    • Work only in the consistent case.
    • Often miss the correct answer in an inconsistent case.
  • Inconsistent case
    • The target function does not necessarily belong to the hypothesis class H.
    • None of the hypotheses can completely trace the target function.
  • Tentative Goal
    • Predict almost as well as the best expert.

  11. Majority Vote Algorithm [Littlestone & Warmuth 1989]
  • Majority Vote (Weighted Majority) algorithm:
    • Initialize: w = (1, ..., 1) ∈ R^N.
    • For t = 1, ..., m do:
      • Receive the next input x.
      • Predict by f(x) = sign( Σi wi hi(x) )  (weighted majority vote).
      • Receive the correct answer y ∈ {+1,-1}.
      • If a mistake occurs (y ≠ f(x)) then, for all hi ∈ H such that hi(x) = f(x), set wi = wi / 2.
        // halve the weights of the hypotheses that contributed to the wrong prediction
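
A minimal sketch of this weighted majority vote (hypotheses are callables returning ±1; `H` and `stream` are placeholders):

```python
# Weighted majority vote: keep a weight per hypothesis, predict by the sign of
# the weighted sum of votes, and halve the weights of the hypotheses that
# voted for a wrong prediction.

def weighted_majority(H, stream):
    """H: list of hypotheses (callables x -> {+1,-1}); stream: pairs (x, y)."""
    w = [1.0] * len(H)
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in H]
        score = sum(wi * v for wi, v in zip(w, votes))
        y_hat = +1 if score >= 0 else -1        # sign of the weighted vote
        if y_hat != y:
            mistakes += 1
            for i, v in enumerate(votes):
                if v == y_hat:                  # contributed to the wrong prediction
                    w[i] /= 2
    return mistakes
```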

  12. Majority Vote Algorithm: Result
  • Assumption (Inconsistent case):
    • The unknown target function f may not belong to H.
    • The best expert makes m mistakes with respect to the target function f.
  • Theorem (Littlestone & Warmuth):
    • The majority vote algorithm makes at most 2.41(m + log N) mistakes, where N is the number of hypotheses in H.
  • Thus the majority vote algorithm behaves almost as well as the unknown best expert.

  13. [Proof]
  • First, we focus on how the total weight W = Σi wi changes during learning.
  • Suppose that, up to some round t ≥ 1, the best expert has made m mistakes and the majority vote algorithm has made M mistakes. Initially, the total weight is W = N by construction.
  • Suppose that the majority vote algorithm makes a mistake on an input x with the current weight vector w.
  • Let I be the set of experts that contributed to the (wrong) prediction, and let WI = Σi∈I wi be the sum of the corresponding weights.

  14. Since the prediction follows the weighted majority, the contributing experts hold at least half of the total weight:
    WI ≥ W⋅(1/2)  (*1).
  • Since the weights of the wrong experts in I are halved, the total weight W' after the update satisfies
    W' = W − WI⋅(1/2) ≤ W − W⋅(1/2)⋅(1/2) = W − W⋅(1/4) = W⋅(3/4)   by (*1).
  • Thus Wt ≤ Wt-1⋅(3/4): whenever a mistake occurs, the total weight shrinks to at most 3/4 of its previous value. Hence, after M mistakes, the current total weight is upper-bounded by
    W ≤ N⋅(3/4)^M  (*2).
  • On the other hand, consider the weight of a best expert, say k. By assumption, the best expert k has made m mistakes.
  • Since its initial weight is wk = 1 and its weight is halved at most m times, its current weight satisfies
    wk ≥ (1/2)^m  (*3).

  15. Since k is one of the experts in E = {1, ..., N}, its weight is a part of W. Therefore, at any round, we have the inequality wk ≤ W  (*4).
  • Combining (*2), (*3), and (*4), we obtain
    (1/2)^m ≤ N⋅(3/4)^M.
  • Solving this inequality:
    (1/2)^m ≤ N⋅(3/4)^M  ⇒  (4/3)^M ≤ N⋅2^m  ⇒  M lg(4/3) ≤ m + lg N  ⇒  M ≤ (1/lg(4/3))(m + lg N),
  • so M ≤ 2.41⋅(m + lg N), since 1/lg(4/3) = 2.4094.... ■
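
For reference, the chain (*2), (*3), (*4) can be written in a single line; this is only a compact restatement of the argument above.

```latex
\[
\left(\tfrac{1}{2}\right)^{m} \;\le\; w_k \;\le\; W \;\le\; N\left(\tfrac{3}{4}\right)^{M}
\quad\Longrightarrow\quad
M \;\le\; \frac{m + \log_2 N}{\log_2(4/3)} \;\approx\; 2.41\,(m + \log_2 N).
\]
```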

  16. Conclusion
  • Learning functions from examples
    • Given a class H of (possibly exponentially many) hypotheses, the simplest strategy is to select the best hypothesis from H.
  • Sequential (naive) algorithm
    • Consistent case: O(|H|) mistakes.
  • Halving algorithm
    • Consistent case: O(log |H|) mistakes.
  • Majority Vote algorithm
    • Inconsistent case: O(m + log |H|) mistakes, where m is the number of mistakes of the best hypothesis.
  • Next
    • Linear learning machines (Perceptron and Winnow).
