
Computational Learning Theory

Computational Learning Theory: Introduction • The PAC Learning Framework • Finite Hypothesis Spaces • Examples of PAC Learnable Concepts • Infinite Hypothesis Spaces • Sample Complexity for Infinite Hypothesis Spaces • Mistake Bound Model of Learning.


Presentation Transcript


  1. Computational Learning Theory • Introduction • The PAC Learning Framework • Finite Hypothesis Spaces • Examples of PAC Learnable Concepts • Infinite Hypothesis Spaces • Mistake Bound Model of Learning

  2. Sample Complexity for Infinite Hypothesis Spaces For finite hypothesis spaces we derived a bound on the number of examples needed to guarantee PAC learning, based on the size of the hypothesis space |H|: m >= (1/ε)(ln |H| + ln (1/δ)). What happens if the hypothesis space is infinite? We now define another measure of the complexity of the hypothesis space, called the Vapnik-Chervonenkis dimension, VC(H).
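
As a hedged illustration (the function name and the example values are my own choices, not from the slides), the finite-|H| bound can be evaluated directly:

```python
import math

# Minimal sketch: evaluate the finite-|H| PAC sample bound
#   m >= (1/eps) * (ln|H| + ln(1/delta)).
def sample_bound_finite(h_size, eps, delta):
    """Smallest integer m satisfying the finite hypothesis space bound."""
    return math.ceil((1.0 / eps) * (math.log(h_size) + math.log(1.0 / delta)))

# Illustrative example: conjunctions of literals over 10 Boolean attributes,
# where |H| = 3^10 (each attribute appears positively, negatively, or not at all).
print(sample_bound_finite(3**10, eps=0.1, delta=0.05))  # -> 140
```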

  3. The Vapnik-Chervonenkis Dimension Can we tell how expressive a hypothesis space is? Consider a sample of three examples and all the possible ways a hypothesis can label it: [figure: the possible labelings of three examples] Total hypotheses needed = 2^3 = 8.

  4. The Vapnik-Chervonenkis Dimension If a set of examples can be partitioned into positives and negatives in all possible ways by hypotheses in H, we say that H shatters that set of examples. A set of s examples has 2^s possible labelings, so shattering it requires at least 2^s distinct hypotheses. Definition. The Vapnik-Chervonenkis dimension, VC(H), is the size of the largest finite subset of the input space X shattered by H (the VC dimension can be infinite).

  5. Example: Intervals in R Let the input space be the set of real numbers R, and let the hypothesis space H be the set of intervals over R. As an illustration, we measure the speeds of several racing cars and want to distinguish the speed of race car type A from that of race car type B. Suppose we have recorded two speeds x1 and x2 on the real line. Can H shatter these two examples?

  6. Example: Intervals in R We need 2^2 = 4 labelings of x1 and x2, and an interval can realize each of them (one containing neither point, one containing only x1, one containing only x2, and one containing both). So the VC dimension is at least two. What if we have three speeds x1 < x2 < x3? No interval can contain x1 and x3 while excluding x2, so no set of three points can be shattered and VC(H) = 2.
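
To make the shattering argument concrete, here is a small brute-force sketch (the endpoint grid and the names are illustrative assumptions): enumerate interval hypotheses and check which labelings of the points they realize.

```python
# Brute-force shattering check: a set of points is shattered iff every one
# of the 2^s labelings is realized by some hypothesis in the collection.
def shatters(hypotheses, points):
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Interval hypotheses [a, b] over a small illustrative grid of endpoints.
endpoints = [0, 1, 2, 3, 4]
intervals = [lambda x, a=a, b=b: a <= x <= b
             for a in endpoints for b in endpoints if a <= b]

print(shatters(intervals, [1, 3]))     # True: two points can be shattered
print(shatters(intervals, [1, 2, 3]))  # False: the labeling (+, -, +) is impossible
```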

  7. Example: Points in the Plane Now let the input space be the set of points in R^2, and let the hypothesis space H be the set of linear separators (lines). Take three points x1, x2, x3. Can H realize all 2^3 = 8 labelings of these three examples? Yes, as long as the points are not collinear, so VC(H) = 3. In general, the VC dimension of the space of hyperplanes in r dimensions is r + 1.
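
A line used as a classifier corresponds to a halfspace, and a quick sketch (the point coordinates and coefficient grid below are illustrative assumptions) confirms that three non-collinear points can be shattered:

```python
# Halfspaces w1*x + w2*y + b > 0 with small integer coefficients already
# realize all 8 labelings of three non-collinear points, illustrating that
# VC(linear separators in R^2) >= 3.
coeffs = range(-2, 3)
halfspaces = [lambda p, w1=w1, w2=w2, b=b: w1 * p[0] + w2 * p[1] + b > 0
              for w1 in coeffs for w2 in coeffs for b in coeffs]

points = [(0, 0), (1, 0), (0, 1)]  # non-collinear
realized = {tuple(h(p) for p in points) for h in halfspaces}
print(len(realized) == 2 ** len(points))  # True: all 8 labelings are realized
```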

  8. The Vapnik-Chervonenkis Dimension Properties of the VC dimension. The VC dimension is at most log2 |H|. Why? Suppose that VC(H) = d. Shattering d examples requires 2^d distinct hypotheses, so 2^d <= |H| and d = VC(H) <= log2 |H|. For example, with |H| = 8 we get VC(H) <= log2 8 = 3.

  9. Sample Complexity and the VC Dimension How many examples do we need to guarantee that a class of concepts C is PAC learnable? Using the VC dimension as the complexity measure, it can be shown that m >= (1/ε)(4 log2 (2/δ) + 8 VC(H) log2 (13/ε)) examples suffice. So we can use the VC dimension to prove that a class of concepts is PAC learnable even when |H| is infinite.
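
As a hedged sketch (the function name and example values are illustrative, not from the slides), this bound can be evaluated the same way as the finite-|H| bound:

```python
import math

# Minimal sketch: evaluate the VC-dimension sample bound
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps)).
def sample_bound_vc(vc_dim, eps, delta):
    return math.ceil((1.0 / eps) * (4 * math.log2(2.0 / delta)
                                    + 8 * vc_dim * math.log2(13.0 / eps)))

# Illustrative example: intervals in R have VC(H) = 2.
print(sample_bound_vc(2, eps=0.1, delta=0.05))  # -> 1337
```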

  10. Mistake Bound Model We try to answer the question: how many mistakes do we make before converging to a correct hypothesis? We receive a sequence of training examples e1, e2, e3, …, en. For each example we use our current hypothesis to predict its class, and then we are told the true class. How many mistakes do we make before finding the right hypothesis?

  11. Mistake Bound Model We try to answer the question: how many mistakes do we make before converging to a correct hypothesis? We receive a sequence of training examples e1, e2, e3, …, en. For each example we use our current hypothesis to predict its class, and then we are told the true class. How many mistakes do we make before finding the “right” hypothesis? (“Right” means a hypothesis identical to the true concept.)

  12. Example: Conjunctions of Literals • As an illustration, consider the problem of learning conjunctions of literals over n Boolean attributes. • Assume again the Find-S algorithm, which outputs the most specific hypothesis consistent with the training data. • Find-S: 1. Initialize h to the most specific hypothesis l1 ^ ~l1 ^ … ^ ln ^ ~ln 2. For each positive training example X, remove from h any literal not satisfied by X 3. Output hypothesis h
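
A minimal runnable sketch of this Find-S update (the representation and names are my own illustrative choices; a hypothesis is a set of literals such as "x1" and "~x2"):

```python
# Minimal sketch of Find-S for conjunctions of Boolean literals.
# Examples are dicts mapping attribute names to True/False.
def satisfied(literal, example):
    if literal.startswith("~"):
        return not example[literal[1:]]
    return example[literal]

def find_s(positive_examples, attributes):
    # 1. Start with the most specific hypothesis: l1 ^ ~l1 ^ ... ^ ln ^ ~ln
    h = {lit for a in attributes for lit in (a, "~" + a)}
    # 2. For each positive example, drop every literal it does not satisfy.
    for x in positive_examples:
        h = {lit for lit in h if satisfied(lit, x)}
    # 3. Output the hypothesis.
    return h

attrs = ["x1", "x2", "x3"]
positives = [{"x1": True, "x2": False, "x3": True},
             {"x1": True, "x2": False, "x3": False}]
print(find_s(positives, attrs))  # -> {'x1', '~x2'} (a set; order may vary)
```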

  13. Example: Conjunctions of Literals How many mistakes can we make, assuming the target concept is a conjunction of literals? The initial hypothesis contains all 2n literals, and Find-S makes mistakes only on positive examples. With the first mistake we eliminate exactly n literals, because for each attribute exactly one of l_i and ~l_i is satisfied by the example. With every subsequent mistake we eliminate at least one of the remaining n literals. Therefore, the maximum number of mistakes we can make is n + 1.

  14. Example: Halving Algorithm What is the Halving algorithm? Imagine the candidate elimination algorithm that keeps in the version space the set of all hypotheses consistent with the data seen so far. The prediction for every example is a majority vote over all hypotheses in the version space. For the two-class problem: Class(x) = + if more than half of the hypotheses vote for +; Class(x) = − otherwise.
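
A hedged sketch of the halving idea (the names and the tiny threshold hypothesis space below are illustrative assumptions): predict by majority vote, then eliminate every hypothesis that got the example wrong.

```python
# Minimal sketch of the Halving algorithm. Hypotheses are predicates
# (x -> bool); the version space holds all hypotheses consistent so far.
def halving_predict(version_space, x):
    votes = sum(1 for h in version_space if h(x))
    return votes > len(version_space) / 2   # '+' iff more than half vote '+'

def halving_update(version_space, x, true_label):
    return [h for h in version_space if h(x) == true_label]

# Illustrative example: threshold hypotheses on integers; the target is "x >= 5".
version_space = [lambda x, t=t: x >= t for t in range(10)]
mistakes = 0
for x, y in [(7, True), (2, False), (4, False), (5, True)]:
    if halving_predict(version_space, x) != y:
        mistakes += 1
    version_space = halving_update(version_space, x, y)
print(mistakes, len(version_space))  # -> 1 1 (one mistake; only "x >= 5" survives)
```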

  15. Example: Halving Algorithm How many mistakes (upper bound) can we make before converging to the right hypothesis? Every time we make a mistake, at least half of the hypotheses in the version space are wrong and get eliminated. So we make at most log2 |H| mistakes before converging to the right hypothesis (for example, |H| = 1024 gives at most 10 mistakes). [figure: version space split into hypotheses voting for + and hypotheses voting for −]

  16. Homework Example 7.5 in the textbook. Due October 15 before class.
