Probably Approximately Correct Learning (PAC) Leslie G. Valiant. A Theory of the Learnable. Comm. ACM 27(11) (1984) 1134–1142
Recall: Bayesian learning • Create a model based on some parameters • Assume some prior distribution on those parameters • Learning problem: • Adjust the model parameters so as to maximize the likelihood of the model given the data • Use Bayes' formula for this.
PAC Learning • Given a distribution D over observables X • Given a family of functions (concepts) F • For each x ∈ X and f ∈ F: f(x) provides the label for x • Given a family of hypotheses H, seek a hypothesis h such that Error(h) = Pr_{x~D}[f(x) ≠ h(x)] is minimal
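As an illustration of the error definition above (not from the original slides), a minimal Monte Carlo sketch in Python; the names estimate_error and sample_d and the toy threshold example are assumptions made here:

```python
import random

def estimate_error(f, h, sample_d, n_samples=10_000):
    """Estimate Error(h) = Pr_{x ~ D}[f(x) != h(x)] by Monte Carlo sampling.

    f, h      -- callables returning a 0/1 label for an example x
    sample_d  -- callable that draws one example x from the distribution D
    n_samples -- number of i.i.d. draws used for the estimate
    """
    disagreements = sum(f(x) != h(x) for x in (sample_d() for _ in range(n_samples)))
    return disagreements / n_samples

# Toy usage: D is uniform on [0, 1], the target is a threshold at 0.5,
# the hypothesis is a slightly shifted threshold at 0.55.
if __name__ == "__main__":
    f = lambda x: int(x >= 0.5)
    h = lambda x: int(x >= 0.55)
    print(estimate_error(f, h, lambda: random.random()))  # roughly 0.05
```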
PAC: New Concepts • Large family of distributions D • Large family of concepts F • Family of hypotheses H • Main questions: • Is there a hypothesis h ∈ H that can be learned? • How fast can it be learned? • What is the error that can be expected?
Estimation vs. approximation • Note: • The distribution D is fixed • There is no noise in the system (for now) • F is a space of binary functions (concepts) • This is thus an approximation problem, as the function value is given exactly for each x ∈ X • Estimation problem: f is not given exactly but must be estimated from noisy data
Example (PAC) • Concept: average body-size person • Inputs: for each person: • height • weight • Sample: labeled examples of persons • label + : average body-size • label - : not average body-size • Two-dimensional inputs
Example (PAC) • Assumption: target concept is a rectangle. • Goal: • Find a rectangle h that “approximates” the target • Hypothesis family H of rectangles • Formally: • With high probability • output a rectangle such that • its error is low.
Example (Modeling) • Assume: • Fixed distribution over persons. • Goal: • Low error with respect to THIS distribution!!! • What does the distribution look like? • Highly complex. • Each parameter is non-uniform. • Parameters are highly correlated.
Model-based approach • First try to model the distribution. • Given a model of the distribution: • find an optimal decision rule. • Bayesian Learning
PAC approach • Assume that the distribution is fixed. • Samples are drawn i.i.d.: • independent • identically distributed • Concentrate on the decision rule rather than the distribution.
PAC Learning • Task: learn a rectangle from examples. • Input: points (x, f(x)) with classification + or − • f classifies by a rectangle R • Goal: • Using the fewest examples • compute h • h is a good approximation of f
PAC Learning: Accuracy • Testing the accuracy of a hypothesis: • using the distribution D of examples. • Error = h Δ f (symmetric difference) • Pr[Error] = D(Error) = D(h Δ f) • We would like Pr[Error] to be controllable. • Given a parameter ε: • Find h such that Pr[Error] < ε.
PAC Learning: Hypothesis • Which Rectangle should we choose? • Similar to parametric modeling?
Setting up the Analysis • Choose the smallest consistent rectangle. • Need to show: • For any distribution D and target rectangle f • input parameters: ε and δ • Select m(ε,δ) examples • Let h be the smallest consistent rectangle. • Then with probability 1−δ (over the sample): D(f Δ h) < ε
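A minimal sketch of the "smallest consistent rectangle" learner described above; the sample format ((height, weight), label) and the helper names are assumptions, not part of the slides:

```python
def smallest_consistent_rectangle(samples):
    """Return the tightest axis-aligned rectangle containing all positive points.

    samples -- iterable of ((height, weight), label) pairs, label in {0, 1}
    Returns (x_min, x_max, y_min, y_max), or None if there are no positives.
    """
    positives = [p for p, label in samples if label == 1]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def rectangle_hypothesis(rect):
    """Turn a rectangle into a 0/1 classifier h."""
    if rect is None:
        return lambda p: 0
    x_lo, x_hi, y_lo, y_hi = rect
    return lambda p: int(x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi)
```

Since the learned rectangle is contained in the target, it can only err on positive points, which keeps the error region easy to bound.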
More general case (no rectangle) • A distribution: D (unknown) • Target function: ct from C • ct : X → {0,1} • Hypothesis: h from H • h : X → {0,1} • Error probability: • error(h) = Pr_D[h(x) ≠ ct(x)] • Oracle: EX(ct, D)
PAC Learning: Definition • C and H are concept classes over X. • C is PAC learnable by H if • there exists an algorithm A such that: • for any distribution D over X and any ct in C, • for every input ε and δ: • A outputs a hypothesis h in H, • while having access to EX(ct, D), • such that with probability 1−δ we have error(h) < ε • Complexities.
Finite Concept class • Assume C = H and finite. • h is ε-bad if error(h) > ε. • Algorithm: • Sample a set S of m(ε,δ) examples. • Find h in H which is consistent with S. • The algorithm fails if h is ε-bad.
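The algorithm on this slide is just "find any consistent hypothesis"; a brute-force sketch for a finite class follows (the function name and sample format are assumptions):

```python
def find_consistent(hypotheses, sample):
    """Return the first hypothesis consistent with every labeled example.

    hypotheses -- finite iterable of callables h: x -> {0, 1} (the class H = C)
    sample     -- list of (x, label) pairs drawn i.i.d. from EX(ct, D)
    """
    for h in hypotheses:
        if all(h(x) == label for x, label in sample):
            return h
    return None  # cannot happen in the realizable case ct in H
```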
PAC learning: formalization (1) • X is the set of all possible examples • D is the distribution from which the examples are drawn • H is the set of all possible hypotheses, c ∈ H • m is the number of training examples • error(h) = Pr[h(x) ≠ c(x) | x is drawn from X according to D] • h is approximately correct if error(h) ≤ ε
PAC learning: formalization (2) • To show: after m examples, with high probability, all consistent hypotheses are approximately correct. • All consistent hypotheses lie in an ε-ball around c. • (Figure: the hypothesis space H, with c at the center of the ε-ball and the bad region Hbad outside it.)
Complexity analysis (1) • The probability that a hypothesis hbad ∈ Hbad is consistent with the first m examples: error(hbad) > ε by definition. The probability that it agrees with a single example is thus at most (1−ε), and with m examples at most (1−ε)^m.
Complexity analysis (2) • For Hbad to contain a consistent hypothesis, at least one hypothesis in it must be consistent. Pr(Hbad contains a consistent hypothesis) ≤ |Hbad|(1−ε)^m ≤ |H|(1−ε)^m
Complexity analysis (3) • To reduce the probability of error below δ: |H|(1−ε)^m ≤ δ • This holds when at least m ≥ (1/ε)(ln(1/δ) + ln|H|) examples are seen. • This is the sample complexity
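A small helper that evaluates this sample-complexity bound, assuming natural logarithms as in the slide; the function name and the example parameters are illustrative only:

```python
import math

def sample_complexity(epsilon, delta, h_size):
    """Smallest integer m with m >= (1/epsilon) * (ln(1/delta) + ln|H|)."""
    return math.ceil((math.log(1.0 / delta) + math.log(h_size)) / epsilon)

# e.g. epsilon = 0.1, delta = 0.05, |H| = 3**20 (ORs of literals over 20 variables)
print(sample_complexity(0.1, 0.05, 3**20))
```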
Complexity analysis (4) • "At least m examples are necessary to build a consistent hypothesis h that is wrong with probability at most ε, with confidence 1−δ" • Since |H| = 2^(2^n) for the class of all boolean functions, the sample complexity grows exponentially with the number of attributes n • Conclusion: learning an arbitrary boolean function is no better in the worst case than table lookup!
PAC learning -- observations • "Hypothesis h(X) is consistent with m examples and has an error of at most ε with probability 1−δ" • This is a worst-case analysis. Note that the result is independent of the distribution D! • Growth rate analysis: • as ε → 0, m grows proportionally to 1/ε • as δ → 0, m grows logarithmically in 1/δ • as |H| grows, m grows logarithmically in |H|
PAC: comments • We only assumed that examples are i.i.d. • We have two independent parameters: • Accuracy ε • Confidence δ • No assumption about the likelihood of a hypothesis. • The hypothesis is tested on the same distribution as the sample.
PAC: non-feasible case • What happens if ct is not in H? • We need to redefine the goal. • Let h* in H minimize the error: b = error(h*) • Goal: find h in H such that error(h) ≤ error(h*) + ε = b + ε
Analysis* • For each h in H: • let obs-error(h) be the average error on the sample S. • Bound the probability of a large deviation: Pr[|obs-error(h) − error(h)| ≥ ε/2] • Chernoff bound: Pr < exp(−(ε/2)²m) • Over the entire class H: Pr < |H| exp(−(ε/2)²m) • Sample size: m > (4/ε²) ln(|H|/δ)
Correctness • Assume that for all h in H: • |obs-error(h) − error(h)| < ε/2 • In particular: • obs-error(h*) < error(h*) + ε/2 • error(h) − ε/2 < obs-error(h) • For the output h: • obs-error(h) ≤ obs-error(h*) • Conclusion: error(h) < error(h*) + ε
Sample size issue • Due to the use of the Chernoff bound: Pr[|obs-error(h) − error(h)| ≥ ε/2] < exp(−(ε/2)²m), and over the entire class H: Pr < |H| exp(−(ε/2)²m) • It follows that the sample size is m > (4/ε²) ln(|H|/δ), not (1/ε) ln(|H|/δ) as before
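To make the 1/ε versus 1/ε² gap concrete, a small sketch comparing the two bounds (the function names and example parameters are assumptions made here):

```python
import math

def realizable_m(epsilon, delta, h_size):
    """Consistent-hypothesis (noise-free) bound: m > (1/eps) ln(|H|/delta)."""
    return math.ceil(math.log(h_size / delta) / epsilon)

def agnostic_m(epsilon, delta, h_size):
    """Uniform-convergence (Chernoff) bound: m > (4/eps^2) ln(|H|/delta)."""
    return math.ceil(4.0 * math.log(h_size / delta) / epsilon ** 2)

# The 1/eps^2 dependence makes the non-feasible case markedly more expensive:
print(realizable_m(0.1, 0.05, 2**20), agnostic_m(0.1, 0.05, 2**20))
```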
Example: Learning OR of literals • Inputs: x1, … , xn • Literals: xi and its negation ¬xi • OR functions (disjunctions): • each variable may appear in the target disjunction positively, negatively, or not at all, thus the number of disjunctions is 3^n
ELIM: Algorithm for learning OR • Keep a list of all literals • For every example whose classification is 0: • erase all the literals that evaluate to 1 on it. • Example: c(00110) = 0 results in deleting ¬x1, ¬x2, x3, x4, ¬x5 • Correctness: • Our hypothesis h: an OR of our remaining set of literals. • Our set of literals always includes the target OR's literals. • Every time h predicts zero, we are correct. • Sample size: m > (1/ε) ln(3^n/δ) • (A sketch of ELIM follows below.)
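A minimal Python sketch of ELIM as described above; the literal encoding (i, True)/(i, False) and the example format are assumptions made for the illustration:

```python
def elim(examples, n):
    """ELIM: learn an OR (disjunction) of literals over x1..xn.

    examples -- list of (bits, label) pairs, bits a tuple of n values in {0, 1}
    Literals are encoded as (i, True) for x_i and (i, False) for its negation.
    Start from all 2n literals and, on every negatively labeled example,
    erase the literals that evaluate to 1 on it.
    """
    literals = {(i, pos) for i in range(n) for pos in (True, False)}
    for bits, label in examples:
        if label == 0:
            literals -= {(i, pos) for (i, pos) in literals
                         if (bits[i] == 1) == pos}
    return literals

def or_hypothesis(literals):
    """h(x) = 1 iff some surviving literal is satisfied."""
    return lambda bits: int(any((bits[i] == 1) == pos for i, pos in literals))

# Example from the slide: c(0,0,1,1,0) = 0 deletes the literals that are 1
# on that example, i.e. not-x1, not-x2, x3, x4, not-x5.
```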
Learning parity • Functions: e.g. x1 ⊕ x7 ⊕ x9 • Number of functions: 2^n • Algorithm: • Sample a set of examples • Solve the resulting system of linear equations over GF(2) (the examples form the coefficient matrix) • Sample size: m > (1/ε) ln(2^n/δ)
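A sketch of the parity learner: Gaussian elimination over GF(2) on the sampled examples. The function name, the encoding of examples, and the toy target are assumptions; this is an illustration, not the original course code:

```python
def learn_parity(examples, n):
    """Recover a parity function from labeled examples by Gaussian
    elimination over GF(2).

    examples -- list of (bits, label), bits a tuple of n values in {0, 1},
                label = XOR of the bits in some unknown subset S.
    Returns a 0/1 coefficient vector a with a . x = label (mod 2) for all
    examples, or None if the system is inconsistent.
    """
    rows = [list(bits) + [label] for bits, label in examples]  # augmented rows
    pivot_row, pivot_cols = 0, []
    for col in range(n):
        r = next((i for i in range(pivot_row, len(rows)) if rows[i][col] == 1), None)
        if r is None:
            continue
        rows[pivot_row], rows[r] = rows[r], rows[pivot_row]
        for i in range(len(rows)):
            if i != pivot_row and rows[i][col] == 1:
                rows[i] = [u ^ v for u, v in zip(rows[i], rows[pivot_row])]
        pivot_cols.append(col)
        pivot_row += 1
    # An all-zero coefficient row with label 1 means the data were inconsistent.
    if any(all(v == 0 for v in row[:n]) and row[n] == 1 for row in rows):
        return None
    a = [0] * n                      # free variables set to 0
    for i, col in enumerate(pivot_cols):
        a[col] = rows[i][n]
    return a

# Usage: the target x1 XOR x3 over n = 4 variables.
if __name__ == "__main__":
    import random
    target = [1, 0, 1, 0]
    samples = []
    for _ in range(20):
        x = tuple(random.randint(0, 1) for _ in range(4))
        samples.append((x, sum(t * b for t, b in zip(target, x)) % 2))
    print(learn_parity(samples, 4))
```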
Infinite Concept class • X = [0,1] and H = {cq | q ∈ [0,1]} • cq(x) = 0 iff x < q • Assume C = H: • Which cq should we choose in [min, max]?
Proof I • Define max = min{x | c(x)=1}, min = max{x | c(x)=0} • Show that: Pr[ D([min,max]) > ε ] < δ • Proof: by contradiction. • Suppose the probability that x ∈ [min,max] is at least ε. • The probability that we do not sample from [min,max] is then at most (1−ε)^m • This needs m > (1/ε) ln(1/δ) • But there is something wrong with this argument…
Proof II (correct): • Let max' be such that D([q, max']) = ε/2 • Let min' be such that D([min', q]) = ε/2 • Goal: show that with high probability • x+ ∈ [q, max'] and • x− ∈ [min', q] • In such a case any value in [x−, x+] is good. • Compute the sample size!
Proof II (correct): • Pr[none of x1, x2, …, xm falls in [min', q]] = (1 − ε/2)^m < exp(−mε/2) • Similarly for the other side, [q, max'] • We require 2 exp(−mε/2) < δ • Thus, m > (2/ε) ln(2/δ)
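A sketch of the resulting learner for the threshold class H = {cq}: output any threshold between the largest negative and the smallest positive sample (here the midpoint). The function name and sample format are assumptions:

```python
def learn_threshold(samples):
    """Learn a threshold concept c_q with c_q(x) = 0 iff x < q.

    samples -- list of (x, label) pairs with x in [0, 1].
    Any estimate between the largest negative and the smallest positive
    example is consistent; we return the midpoint of that interval.
    """
    negatives = [x for x, label in samples if label == 0]
    positives = [x for x, label in samples if label == 1]
    lo = max(negatives) if negatives else 0.0   # "min" in the slide's notation
    hi = min(positives) if positives else 1.0   # "max" in the slide's notation
    q_hat = (lo + hi) / 2.0
    return lambda x: int(x >= q_hat)
```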
Comments • The hypothesis space was very simple: H = {cq | q ∈ [0,1]} • There was no noise in the data or labels • So learning was trivial in some sense (an analytic solution exists)
Non-Feasible case: Label Noise • Suppose the sampled labels are noisy: • Algorithm: • Find the function h with the lowest observed error!
Analysis • Define: {z_i} as an ε/4-net (w.r.t. D) • For the optimal h* and our h there are • z_j : |error(h[z_j]) − error(h*)| < ε/4 • z_k : |error(h[z_k]) − error(h)| < ε/4 • Show that with high probability: • |obs-error(h[z_i]) − error(h[z_i])| < ε/4
Exercise (Submission Mar 29, 04) • 1. Assume there is Gaussian (0, σ) noise on the x_i. Apply the same analysis to compute the required sample size for PAC learning. Note: class labels are determined by the non-noisy observations.
General ε-net approach • Given a class H, define a class G • For every h in H • there exists a g in G such that • D(g Δ h) < ε/4 • Algorithm: Find the best h in H. • Compute the confidence and sample size.
Occam's Razor • W. of Occam (c. 1320): "Entities should not be multiplied unnecessarily" • Einstein: "Simplify a problem as much as possible, but no simpler" • Information-theoretic grounds?
Occam's Razor • Finding the shortest consistent hypothesis. • Definition: (α,β)-Occam algorithm • α > 0 and β < 1 • Input: a sample S of size m • Output: hypothesis h • for every (x,b) in S: h(x) = b (consistency) • size(h) < size^α(ct) · m^β • Efficiency.
Occam algorithm and compression • (Figure: party A holds the labeled sample S = {(x_i, b_i)}; party B holds only x_1, …, x_m and must recover the labels.)
Compression • Option 1: • A sends B the values b1, …, bm • m bits of information • Option 2: • A sends B the hypothesis h • Occam: for large enough m, size(h) < m • Option 3 (MDL): • A sends B a hypothesis h and "corrections" • complexity: size(h) + size(errors)
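A small sketch comparing the three options in bits; the convention of charging about log2(m) bits per correction index is an assumption used only for illustration:

```python
import math

def mdl_costs(m, size_h, n_errors):
    """Compare the three transmission options, in bits.

    m        -- number of examples
    size_h   -- encoding length of the hypothesis h, in bits
    n_errors -- number of examples h mislabels (for the MDL option)
    """
    raw_labels = m                                     # option 1: send all labels
    hypothesis_only = size_h                           # option 2: consistent h
    mdl = size_h + math.ceil(n_errors * math.log2(m))  # option 3: h plus corrections
    return raw_labels, hypothesis_only, mdl
```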
Occam Razor Theorem • A: an (α,β)-Occam algorithm for C using H • D: a distribution over the inputs X • ct in C the target function, n = size(ct) • Sample size: m = O( (1/ε) ln(1/δ) + (n^α/ε)^{1/(1−β)} ) • with probability 1−δ, A(S) = h has error(h) < ε
Occam Razor Theorem • Use the bound for a finite hypothesis class. • Effective hypothesis class size: 2^size(h) • size(h) < n^α m^β • Sample size: solve m ≥ (1/ε)(ln(1/δ) + n^α m^β ln 2) for m • Later, the VC dimension will replace 2^size(h)
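Because m appears on both sides, the bound is implicit in m; a small fixed-point iteration (an illustrative sketch, with assumed function name and loop parameters) finds an m that satisfies it:

```python
import math

def occam_sample_size(epsilon, delta, n, alpha, beta):
    """Approximate the smallest m with
    m >= (1/eps) * (ln(1/delta) + n**alpha * m**beta * ln 2),
    where 2**size(h) is the effective class size and size(h) < n**alpha * m**beta.
    The map m -> RHS(m) is increasing and flattens out (beta < 1), so iterating
    it from m = 1 converges to the fixed point."""
    m = 1.0
    for _ in range(200):
        m_next = (math.log(1.0 / delta)
                  + (n ** alpha) * (m ** beta) * math.log(2)) / epsilon
        if abs(m_next - m) < 0.5:
            break
        m = m_next
    return math.ceil(m_next)
```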
Exercise (Submission Mar 29, 04) • 2. For an (α,β)-Occam algorithm, given noisy data with noise ~ N(0, σ²), find the limitations on m. Hint: ε-net and Chernoff bound.
Learning OR with few attributes • Target function: an OR of k literals • Goal: learn in time: • polynomial in k and log n • ε and δ constant • ELIM makes "slow" progress: • it may disqualify as little as one literal per round • and may be left with O(n) literals