Università di Milano-Bicocca - Laurea Magistrale in Informatica
Corso di APPRENDIMENTO E APPROSSIMAZIONE - Prof. Giancarlo Mauri
Lezione 4 - Computational Learning Theory
Computational models of cognitive phenomena
• Computing capabilities: computability theory
• Reasoning/deduction: formal logic
• Learning/induction: ?
A theory of the learnable (Valiant '84)
• […] The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn […] Learning machines must have all three of the following properties:
• the machines can provably learn whole classes of concepts, and these classes can be characterized
• the classes of concepts are appropriate and nontrivial for general-purpose knowledge
• the computational process by which the machine builds the desired programs requires a "feasible" (i.e. polynomial) number of steps
A theory of the learnable • We seek general laws that constrain inductive learning, relating: • Probability of successful learning • Number of training examples • Complexity of hypothesis space • Accuracy to which target concept is approximated • Manner in which training examples are presented
Probably approximately correct learning
A formal computational model that aims to shed light on the limits of what can be learned by a machine, by analysing the computational cost of learning algorithms.
What we want to learn
• CONCEPT = recognizing algorithm
• LEARNING = computational description of recognizing algorithms, starting from: examples, incomplete specifications
• That is: to determine uniformly good approximations of an unknown function from its values at some sample points
• interpolation • pattern matching • concept learning
What’snew in p.a.c. learning? Accuracy of resultsand running timefor learning algorithmsare explicitly quantified and related A general problem:use of resources (time, space…) by computations COMPLEXITY THEORY Example Sorting: n·logn time (polynomial, feasible) Bool. satisfiability: 2ⁿ time (exponential, intractable)
Learning from examples
The learner receives from the domain examples of a concept and outputs a representation of that concept.
• CONCEPT: subset of the domain
• EXAMPLES: elements of the concept (positive examples)
• REPRESENTATION: domain → expressions
GOOD LEARNER? EFFICIENT LEARNER?
The P.A.C. model
• A domain X (e.g. {0,1}ⁿ, Rⁿ)
• A concept: a subset of X, f ⊆ X, or equivalently f: X → {0,1}
• A class of concepts F ⊆ 2^X
• A probability distribution P on X
Example 1: X ≡ a square, F ≡ the triangles contained in the square
The P.A.C. model
Example 2: X ≡ {0,1}ⁿ, F ≡ a family of boolean functions, e.g.
fᵣ(x₁, …, xₙ) = 1 if there are at least r ones in (x₁, …, xₙ), 0 otherwise
P a probability distribution on X: uniform or non-uniform
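As a concrete illustration of Example 2, here is a minimal Python sketch (names such as `f_r` and the samplers are ours, not part of the slides) of the threshold concept and of a uniform vs. a non-uniform distribution on X = {0,1}ⁿ:

```python
import random

def f_r(x, r):
    """Concept of Example 2: 1 iff the bit vector x contains at least r ones."""
    return 1 if sum(x) >= r else 0

def sample_uniform(n):
    """Uniform distribution P on X = {0,1}^n."""
    return [random.randint(0, 1) for _ in range(n)]

def sample_biased(n, p=0.8):
    """A non-uniform distribution: each bit is 1 independently with probability p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

# One labelled example under the uniform distribution, for n = 10 and r = 3
x = sample_uniform(10)
print(x, f_r(x, r=3))
```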
The P.A.C. model - The learning process
• Labeled sample: ((x₁, f(x₁)), (x₂, f(x₂)), …, (xₜ, f(xₜ)))
• Hypothesis: a function h consistent with the sample (i.e., h(xᵢ) = f(xᵢ) for all i)
• Error probability: Perr = P(h(x) ≠ f(x)), x ∈ X
The P.A.C. model
The teacher draws t examples from X according to the probability distribution P and labels them with the target concept f ∈ F; the learner receives the labelled sample ((x₁, f(x₁)), …, (xₜ, f(xₜ))) and runs an inference procedure A that outputs a hypothesis h (an implicit representation of a concept).
The learning algorithm A is good if the hypothesis h is "ALMOST ALWAYS" "CLOSE TO" the target concept f.
TheP.A.C.model “CLOSE TO” x random choice METRIC : given P f dp(f,h) = Perr = Px f(x)≠h(x) h Given an approximation parameter (0<≤1), h is an ε-approximation of f if dp(f,h)≤ “ALMOST ALWAYS” Confidence parameter (0 < ≤ 1) The “measure” of sequences of examples, randomly choosen according to P, such that h is an ε-approximation of f is at least 1-
Learning algorithm
Let F be a concept class and S the set of labelled samples drawn from concepts in F. A learning algorithm is a map A: S → F such that for all ε, δ with 0 < ε, δ < 1 and for all f ∈ F there exists m ∈ N such that for every labelled sample S with |S| ≥ m:
I) A(S) is consistent with S
II) P(Perr < ε) > 1-δ
The efficiency issue
Look for algorithms which use a "reasonable" amount of computational resources.
COMPUTATIONAL RESOURCES: sample size (statistical PAC learning), computation time (polynomial PAC learning).
DEF 1: a concept class F = ∪ₙ≥₁ Fₙ is statistically PAC learnable if there is a learning algorithm with sample size t = t(n, 1/ε, 1/δ) bounded by some polynomial function in n, 1/ε, 1/δ.
The efficiency issue
POLYNOMIAL PAC ⇒ STATISTICAL PAC
DEF 2: a concept class F = ∪ₙ≥₁ Fₙ is polynomially PAC learnable if there is a learning algorithm with running time bounded by some polynomial function in n, 1/ε, 1/δ.
Learning boolean functions
Bₙ = {f: {0,1}ⁿ → {0,1}}, the set of boolean functions in n variables; a class of concepts is a subset Fₙ ⊆ Bₙ.
Example 1: Fₙ = clauses with literals in {x₁, …, xₙ, x̄₁, …, x̄ₙ}
Example 2: Fₙ = linearly separable functions in n variables
REPRESENTATION: truth table (explicit) or boolean circuits (implicit); boolean circuits ↔ boolean functions.
Boolean functions and circuits • BASIC OPERATIONS • COMPOSITION [f(g1, … , gm)](x) = f(g1(x), … , gm(x)) in m variables in n variables CIRCUIT: Finite acyclic directed graph Output node Basic operations Input nodes Given an assignment {x1 … xn} {0, 1} to the input variables, the output node computes the corresponding value
Boolean functions and circuits
Given Fₙ ⊆ Bₙ, let Cₙ be a class of circuits which compute all and only the functions in Fₙ. Algorithm A to learn F by C:
• INPUT: (n, ε, δ)
• The learner computes t = t(n, 1/ε, 1/δ) (t = number of examples sufficient to learn with accuracy ε and confidence δ)
• The learner asks the teacher for a labelled t-sample
• The learner receives the t-sample S and computes C = Aₙ(S)
• OUTPUT: C (a representation of the hypothesis)
Note that the inference procedure Aₙ receives as input the integer n and a t-sample on {0,1}ⁿ, and outputs Aₙ(S) = A(n, S).
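The five steps can be summarised in a short Python skeleton; everything here is a sketch under assumed names (`sample_size`, `draw_example`, `target`, `inference` stand in for the teacher and for the inference procedure Aₙ; they are not a real API):

```python
def pac_learn(n, eps, delta, sample_size, draw_example, target, inference):
    """Generic PAC learning protocol (a sketch).
    sample_size(n, eps, delta) -> t, the number of examples t(n, 1/eps, 1/delta)
    draw_example()             -> one x in {0,1}^n drawn according to the unknown P
    target(x)                  -> the teacher's label f(x)
    inference(n, S)            -> A_n(S), a representation (circuit) of the hypothesis
    """
    t = sample_size(n, eps, delta)      # step 2: compute the sample size
    S = []                              # steps 3-4: ask for and receive a labelled t-sample
    for _ in range(t):
        x = draw_example()
        S.append((x, target(x)))
    return inference(n, S)              # step 5: output C = A_n(S)
```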
Boolean functions and circuits
An algorithm A is a learning algorithm with sample size t(n, 1/ε, 1/δ) for a concept class F = ∪ₙ Fₙ using the class of representations C = ∪ₙ Cₙ if, for all n ≥ 1, for all f ∈ Fₙ, for all 0 < ε, δ < 1 and for every probability distribution P over {0,1}ⁿ, the following holds: if the inference procedure Aₙ receives as input a t-sample, it outputs a representation c ∈ Cₙ of a function g that is probably approximately correct, that is, with probability at least 1-δ a t-sample is chosen such that the inferred function g satisfies P{x : f(x) ≠ g(x)} ≤ ε.
g is ε-good: g is an ε-approximation of f; g is ε-bad: g is not an ε-approximation of f.
NOTE: the model is distribution free.
Statistical P.A.C. learning
PROBLEM: estimate upper and lower bounds on the sample size t = t(n, 1/ε, 1/δ). Upper bounds will be given for consistent algorithms; lower bounds will be given for arbitrary algorithms.
DEF: an inference procedure Aₙ for the class Fₙ is consistent if, given the target function f ∈ Fₙ, for every t-sample S = ((x₁, b₁), …, (xₜ, bₜ)), Aₙ(S) is a representation of a function g "consistent" with S, i.e. g(x₁) = b₁, …, g(xₜ) = bₜ.
DEF: a learning algorithm A is consistent if its inference procedure is consistent.
A simple upper bound
THEOREM: t(n, 1/ε, 1/δ) ≤ ε⁻¹ (ln #Fₙ + ln(1/δ))
PROOF:
Prob{(x₁, …, xₜ) : ∃ g ε-bad with g(x₁) = f(x₁), …, g(xₜ) = f(xₜ)}
≤ Σ_{g ε-bad} Prob{g(x₁) = f(x₁), …, g(xₜ) = f(xₜ)}   [union bound: P(A∪B) ≤ P(A)+P(B)]
= Σ_{g ε-bad} Πᵢ₌₁..ₜ Prob{g(xᵢ) = f(xᵢ)}   [independent draws]
≤ Σ_{g ε-bad} (1-ε)ᵗ   [g is ε-bad]
≤ #Fₙ (1-ε)ᵗ ≤ #Fₙ e^(-εt)
Imposing #Fₙ e^(-εt) ≤ δ gives the bound. NOTE: #Fₙ must be finite.
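The bound translates directly into a sample-size calculation; the following sketch (function name ours) evaluates it, e.g. for the class of monomials over n variables, whose cardinality is at most 3ⁿ:

```python
import math

def sample_size_finite_class(card_F, eps, delta):
    """t >= (1/eps) * (ln #F + ln(1/delta)) examples suffice for a consistent learner."""
    return math.ceil((math.log(card_F) + math.log(1.0 / delta)) / eps)

# Monomials over n = 20 variables: each variable appears positive, negated, or not at all
print(sample_size_finite_class(3 ** 20, eps=0.1, delta=0.01))   # about 266 examples
```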
The Vapnik-Chervonenkis approach (1971)
Problem: uniform convergence of relative frequencies to their probabilities.
• X a domain, F ⊆ 2^X a class of concepts, S = (x₁, …, xₜ) a t-sample
• f ∼_S g iff f(xᵢ) = g(xᵢ) for all xᵢ in S (f and g are indistinguishable by S)
• Π_F(S) = #(F/∼_S), the index of F with respect to S
• m_F(t) = max{Π_F(S) : S is a t-sample}, the growth function
A general upper bound
THEOREM: Prob{(x₁, …, xₜ) : ∃ g ε-bad with g(x₁) = f(x₁), …, g(xₜ) = f(xₜ)} ≤ 2·m_F(2t)·e^(-εt/2)
FACTS:
• m_F(t) ≤ 2ᵗ
• m_F(t) ≤ #F (this condition immediately gives the simple upper bound)
• if m_F(t) = 2ᵗ then m_F(j) = 2ʲ for all j < t
Graph of the growth function
[Figure: m_F(t) equals 2ᵗ up to t = d, then bends and stays below #F.]
DEFINITION: d = VCdim(F) = max{t : m_F(t) = 2ᵗ}
FUNDAMENTAL PROPERTY: for t > d the growth function is bounded by a polynomial of degree d in t (Sauer's lemma: m_F(t) ≤ Σᵢ₌₀..d C(t, i))!
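The polynomial bound is Sauer's lemma; a tiny Python check (function name ours) shows how far below 2ᵗ the growth function stays once t exceeds d:

```python
from math import comb

def sauer_bound(t, d):
    """Sauer's lemma: if VCdim(F) = d then m_F(t) <= sum_{i=0..d} C(t, i),
    a polynomial of degree d in t."""
    return sum(comb(t, i) for i in range(d + 1))

print(sauer_bound(t=100, d=3))   # 166751, against 2**100 ~ 1.27e30 for an unrestricted class
```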
Upper and lower bounds
THEOREM: if dₙ = VCdim(Fₙ) then t(n, 1/ε, 1/δ) ≤ max( (4/ε) log₂(2/δ), (8dₙ/ε) log₂(13/ε) )
PROOF: impose 2·m_Fₙ(2t)·e^(-εt/2) ≤ δ.
A lower bound on t(n, 1/ε, 1/δ), i.e. the number of examples necessary for arbitrary algorithms:
THEOREM: for 0 < ε ≤ 1/8 and δ ≤ 1/100, t(n, 1/ε, 1/δ) ≥ max( ((1-ε)/ε) ln(1/δ), (dₙ-1)/(32ε) )
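Both bounds are easy to evaluate numerically; the sketch below (function names ours, base-2 logarithms in the upper bound as in the statement above) compares them for a class of VC dimension d. With d = 3 and ε = 0.01, δ = 0.001 they come out of the same order as the figures quoted in Example 1 below.

```python
import math

def vc_upper_bound(d, eps, delta):
    """Sample size sufficient for any consistent learner (upper-bound theorem above)."""
    return math.ceil(max(4 / eps * math.log2(2 / delta),
                         8 * d / eps * math.log2(13 / eps)))

def vc_lower_bound(d, eps, delta):
    """Number of examples necessary for any learning algorithm (lower-bound theorem)."""
    return math.ceil(max((1 - eps) / eps * math.log(1 / delta),
                         (d - 1) / (32 * eps)))

# d = 3, eps = 0.01, delta = 0.001: roughly 684 examples necessary, ~24800 sufficient
print(vc_lower_bound(3, 0.01, 0.001), vc_upper_bound(3, 0.01, 0.001))
```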
An equivalent definition of VCdim
Π_F(S) = #{ f⁻¹(1) ∩ {x₁, …, xₜ} : f ∈ F }, i.e. the cardinality of the set of subsets of S that can be obtained by intersecting S with concepts in F.
If Π_F(S) = 2^|S| we say that S is shattered by F.
The Vapnik-Chervonenkis dimension of F is the cardinality of the largest finite set of points S ⊆ X that is shattered by F.
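For small finite cases, shattering and the VC dimension can be checked by brute force; below is a minimal Python sketch (names ours) that enumerates the behaviours Π_F(S) on candidate point sets:

```python
from itertools import combinations

def is_shattered(S, concepts):
    """S is shattered by F iff every one of the 2^|S| behaviours on S is realised,
    i.e. Pi_F(S) = 2^|S|."""
    S = list(S)
    behaviours = {tuple(f(x) for x in S) for f in concepts}
    return len(behaviours) == 2 ** len(S)

def vc_dimension(points, concepts):
    """Largest cardinality of a subset of `points` shattered by the (finite) class."""
    d = 0
    for k in range(1, len(points) + 1):
        if any(is_shattered(subset, concepts) for subset in combinations(points, k)):
            d = k
    return d

# Threshold concepts f_a(x) = [x >= a] on the real line: only single points are shattered
thresholds = [lambda x, a=a: int(x >= a) for a in range(5)]
print(vc_dimension([1, 2, 3], thresholds))   # 1
```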
Example 1
Learn the family F of circles contained in the square. [Figure: for the chosen ε and δ, the general upper bound gives about 24,000 examples - sufficient! - while the lower bound shows that about 690 examples are necessary.]
Example 2
Learn the family Lₙ of linearly separable boolean functions in n variables: H_{w,θ}(x) = 1 if Σᵢ wᵢxᵢ ≥ θ, and 0 otherwise.
The simple upper bound, based on #Lₙ, grows faster than linearly with n; the upper bound using VCdim(Lₙ) = n+1 grows linearly with n!
Example 2
Consider the class L₂ of linearly separable functions in two variables. [Figure: four points in the plane.] No straight line can separate the green from the red points; the green point cannot be separated from the other three. Hence no set of four points can be shattered by L₂, and VCdim(L₂) = 3.
Classes of boolean formulas
• Monomials: x₁ ∧ x₂ ∧ … ∧ xₖ
• DNF: m₁ ∨ m₂ ∨ … ∨ mⱼ (the mⱼ are monomials)
• Clauses: x₁ ∨ x₂ ∨ … ∨ xₖ
• CNF: c₁ ∧ c₂ ∧ … ∧ cⱼ (the cⱼ are clauses)
• k-DNF: at most k literals in each monomial
• k-term-DNF: at most k monomials
• k-CNF: at most k literals in each clause
• k-clause-CNF: at most k clauses
• Monotone formulas: contain no negated literals
• μ-formulas: each variable appears at most once
The results
• Th. (Valiant): monomials are learnable from positive examples only, with a number of examples of the order of 2(n + log(1/δ))/ε (ε the tolerated error), by keeping a literal in the hypothesis only if it is satisfied in all positive examples (xᵢ if xᵢ = 1 in all examples, x̄ᵢ if xᵢ = 0 in all examples).
• N.B. Learnability is not monotone: A ⊆ B with B learnable does not imply that A is learnable.
• Th. Monomials are not learnable from negative examples only.
Positive results
1) k-CNF formulas are learnable from positive examples only
1b) k-DNF formulas are learnable from negative examples only
2) k-DNF ∪ k-CNF is learnable from positive and negative examples
3) the class of k-decision lists (k-DL) is learnable
Th. Every k-DNF (or k-CNF) formula can be represented by a small k-DL.
Negative results
1) μ-formulas are not learnable
2) Boolean threshold functions are not learnable
3) For k ≥ 2, k-term-DNF formulas are not learnable
Mistake bound model • So far: how many examples needed to learn ? • What about: how many mistakes before convergence ? • Let’s consider similar setting to PAC learning: • Instances drawn at random from X according to distribution D • Learner must classify each instance before receiving correct classification from teacher • Can we bound the number of mistakes learner makes before converging ?
Mistake bound model • Learner: • Receives a sequence of training examples x • Predicts the target value f(x) • Receives the correct target value from the trainer • Is evaluated by the total number of mistakes it makes before converging to the correct hypothesis • I.e.: • Learning takes place during the use of the system, not off-line • Ex.: prediction of fraudolent use of credit cards
Mistake bound for Find-S • Consider Find-S when H = conjunction of boolean literals FIND-S: • Initialize h to the most specific hypothesis in H: x1x1x2x2 … xnxn • For each positive training instance x • Remove from h any literal not satisfied by x • Output h
Mistake bound for Find-S • If C Hand training data noise free, Find-S converges to an exact hypothesis • How many errors to learncH(only positive examples can be misclassified)? • The first positive example will be misclassified, and n literals in the initial hypothesis will be eliminated • Each subsequent error eliminates at least one literal • #mistakes ≤ n+1 (worst case, for the “total” concept x c(x)=1)
Mistake bound for Halving
• A version space is maintained and refined (e.g., Candidate-Elimination)
• Prediction is based on a majority vote among the hypotheses in the current version space
• "Wrong" hypotheses are removed (even when x is classified correctly by the majority)
• How many mistakes before exactly learning c ∈ H (H finite)?
• A mistake occurs when the majority of hypotheses misclassifies x
• These hypotheses are removed
• So for each mistake, the version space is at least halved
• At most log₂(|H|) mistakes before exact learning (e.g., a single hypothesis remaining)
• Note: learning without mistakes is possible!
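A minimal Halving sketch (names ours; hypotheses are plain Python functions h: x → {0,1}) makes the "at least halved" argument concrete:

```python
def halving(hypotheses, stream):
    """Halving algorithm (a sketch): predict by majority vote over the version space,
    then remove every hypothesis that disagrees with the revealed label."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes > len(version_space) else 0   # majority vote
        if prediction != label:
            mistakes += 1            # on a mistake the version space is at least halved
        version_space = [h for h in version_space if h(x) == label]
    return version_space, mistakes   # mistakes <= log2(len(hypotheses))
```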
Optimal mistake bound • Question: what is the optimal mistake bound (i.e., lowest worst case bound over all possible learning algorithms A) for an arbitrary non empty concept class C, assuming H=C ? • Formally, for any learning algorithm A and any target concept c: • MA(c) = max #mistakes made by A to exactly learn c over all possible training sequences • MA(C) = maxcC MA(c) Note: Mfind-S(C) = n+1 MHalving(C) ≤ log2(|C|) • Opt(C) = minA MA(C) i.e., # of mistakes made for the hardest target concept in C, using the hardest training sequence, by the best algorithm
Optimal mistake bound • Theorem (Littlestone 1987) • VC(C) ≤ Opt(C) ≤ MHalving(C) ≤ log2(|C|) • There exist concept classes for which VC(C) = Opt(C) = MHalving(C) = log2(|C|) e.g., the power set 2X of X, for which it holds: VC(2X) = |X| = log2(|2X|) • There exist concept classes for which • VC(C) < Opt(C) < MHalving(C)
Weighted majority algorithm • Generalizes Halving • Makes predictions by taking a weighted vote among a pole of prediction algorithms • Learns by altering the weight associated with each prediction algorithm • It does not eliminate hypotheses (i.e., algorithms) inconsistent with some training examples, but just reduces its weight, so is able to accommodate inconsistent training data
Weighted majority algorithm • i wi := 1 • training example (x, c(x)) • q0 := q1 := 0 • prediction algorithm ai • If ai(x)=0 then q0 := q0 + wi • If ai(x)=1 then q1 := q1 + wi • if q1 > q0 then predict c(x)=1 if q1 < q0 then predict c(x)=0 if q1 > q0 then predict c(x)=0 or 1 at random • prediction algorithm ai do If ai(x)≠c(x) then wi := wi (0≤<1)
Weighted majority algorithm (WM) • Coincides with Halving for =0 • Theorem - D any sequence of training examples, A any set of n prediction algorithms, k min # of mistakes made by any ajA for D, =1/2. Then W-M makes at most 2.4(k+log2n) mistakes over D
Weighted majority algorithm (WM) • Proof • Since aj makes k mistakes (best in A) its final weight wj will be (1/2)k • The sum W of the weights associated with all n algorithms in A is initially n, and for each mistake made by WM is reduced to at most (3/4)W, because the “wrong” algorithms hold at least 1/2 of total weight, that will be reduced by a factor of 1/2. • The final total weight W is at most n(3/4)M, where M is the total number of mistakes made by WM over D.
Weighted majority algorithm (WM) • But the final weight wj cannot be greater than the final total weight W, hence: (1/2)k ≤ n(3/4)M from which M ≤ ≤ 2.4(k+log2n) • I.e., the number of mistakes made by WM will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the size of the pool (k+log2 n) -log2 (3/4)