T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General Conditions for Predictivity in Learning Theory Michael Pfeiffer pfeiffer@igi.tugraz.at 25.11.2004
Motivation • Supervised Learning • learn functional relationships from a finite set of labelled training examples • Generalization • How well does the learned function perform on unseen test examples? • Central question in supervised learning
What you will hear • New Idea: Stability implies predictivity • a learning algorithm is stable if small perturbations of the training set do not change the hypothesis much • Conditions for generalization on the learning map rather than on the hypothesis space • in contrast to VC-analysis
Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion
Some Definitions 1/2 • Training Data: S = {z1=(x1,y1), ..., zn=(xn, yn)} • Z = X × Y • Unknown Distribution μ over (x, y) • Hypothesis Space: H • Hypothesis fS ∈ H, fS: X → Y • Learning Algorithm: L: S ↦ fS • Regression: fS is real-valued / Classification: fS is binary • symmetric learning algorithm (ordering of training examples irrelevant)
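In symbols, following the paper's terminology (the formal signature of L and the symbol μ for the unknown distribution are my notation):

```latex
% Training set: n i.i.d. draws z_i = (x_i, y_i) from an unknown
% distribution \mu on Z = X \times Y
S = \{ z_1, \ldots, z_n \} \subset Z, \qquad z_i \sim \mu \quad \text{i.i.d.}

% A (symmetric) learning algorithm is a map from training sets to hypotheses
L : \bigcup_{n \geq 1} Z^n \rightarrow \mathcal{H}, \qquad
S \mapsto f_S, \qquad f_S : X \rightarrow Y
```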
Some Definitions 2/2 • Loss Function: V(f, z) • e.g. V(f, z) = (f(x) – y)² • Assume that V is bounded • Empirical Error (Training Error) • Expected Error (True Error)
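In the paper's notation, the two errors are:

```latex
% Empirical (training) error of f on the sample S
I_S[f] \;=\; \frac{1}{n} \sum_{i=1}^{n} V(f, z_i)

% Expected (true) error of f under the unknown distribution \mu
I[f] \;=\; \mathbb{E}_{z \sim \mu}\!\left[ V(f, z) \right]
```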
Generalization and Consistency • Convergence in Probability • Generalization • Performance on training examples must be a good indicator of performance on future examples • Consistency • Expected error converges to that of the most accurate hypothesis in H
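Written out, the two properties read:

```latex
% Generalization: the training error of f_S tracks its true error
\forall \varepsilon > 0: \quad
\lim_{n \rightarrow \infty} \Pr\left\{ \left| I_S[f_S] - I[f_S] \right| > \varepsilon \right\} = 0

% Consistency: the true error of f_S approaches the best achievable in H
\forall \varepsilon > 0: \quad
\lim_{n \rightarrow \infty} \Pr\left\{ I[f_S] - \inf_{f \in \mathcal{H}} I[f] > \varepsilon \right\} = 0
```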
Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion
Empirical Risk Minimization (ERM) • Focus of classical learning theory research • exact and almost ERM • Minimize training error over H: • take the best hypothesis on the training data • For ERM: Generalization ⇔ Consistency
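As formulas (exact and almost ERM):

```latex
% Exact empirical risk minimization: pick a minimizer of the training error
f_S \in \operatorname*{arg\,min}_{f \in \mathcal{H}} I_S[f]

% Almost ERM: within \varepsilon_n of the optimum, \varepsilon_n \rightarrow 0
I_S[f_S] \;\leq\; \inf_{f \in \mathcal{H}} I_S[f] + \varepsilon_n
```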
What algorithms are ERM? • All these belong to class of ERM algorithms • Least Squares Regression • Decision Trees • ANN Backpropagation (?) • ... • Are all learning algorithms ERM? • NO! • Support Vector Machines • k-Nearest Neighbour • Bagging, Boosting • Regularization • ...
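A minimal sketch of the first item: with squared loss and a linear hypothesis space, exact ERM is ordinary least squares and has a closed-form solution. Numpy only; the data here is synthetic and purely illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(0, 1, size=(n, 1))          # inputs x_i
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, n)   # noisy labels y_i

# Hypothesis space H: linear functions f(x) = w*x + b.
# ERM with V(f, z) = (f(x) - y)^2 means minimizing the training error
# I_S[f] = (1/n) * sum_i (f(x_i) - y_i)^2 over H -- ordinary least squares.
A = np.hstack([X, np.ones((n, 1))])         # design matrix [x, 1]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

f_S = lambda x: w * x + b
train_err = np.mean((f_S(X[:, 0]) - y) ** 2)
print(f"ERM hypothesis: f(x) = {w:.3f}*x + {b:.3f}, I_S[f_S] = {train_err:.4f}")
```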
Vapnik asked: What property must the hypothesis space H have to ensure good generalization of ERM?
Classical Results for ERM1 • Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class: • convergence of the empirical mean to the true expected value • uniform convergence in probability of the loss functions induced by H and V 1 e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997
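Spelled out (the standard form of the uGC definition):

```latex
% H is a uniform Glivenko-Cantelli class (w.r.t. the loss V) if the
% empirical errors converge to the expected errors uniformly over H,
% and uniformly over all distributions \mu:
\forall \varepsilon > 0: \quad
\lim_{n \rightarrow \infty} \, \sup_{\mu} \, \Pr\left\{
    \sup_{f \in \mathcal{H}} \left| I_S[f] - I[f] \right| > \varepsilon
\right\} = 0
```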
VC-Dimension • Binary functions f: X → {0, 1} • VC-dim(H) = size of the largest finite set in X that can be shattered by H • e.g. linear separation in 2D yields VC-dim = 3 • Theorem: Let H be a class of binary-valued hypotheses; then H is a uGC-class if and only if VC-dim(H) is finite1. 1 Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44, 1997
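A small sketch of the 2D example: three non-collinear points can be shattered by linear separators. The brute-force check below (the helper `linearly_separable` and the perceptron approach are my illustration, not part of the paper) finds a separating line for each of the 2³ labelings:

```python
import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear
aug = np.hstack([points, np.ones((3, 1))])                # append bias term

def linearly_separable(labels, max_iter=10_000):
    """Perceptron: converges iff the labeling is linearly separable."""
    w = np.zeros(3)
    for _ in range(max_iter):
        mistakes = 0
        for a, y in zip(aug, labels):
            if y * (w @ a) <= 0:      # misclassified (or on the boundary)
                w += y * a            # perceptron update
                mistakes += 1
        if mistakes == 0:
            return True
    return False

# All 2^3 = 8 labelings of the three points are realized by some line,
# so this set is shattered and VC-dim(linear separators in 2D) >= 3.
for labels in itertools.product([-1, 1], repeat=3):
    print(labels, "separable:", linearly_separable(np.array(labels)))
```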
Achievements of Classical Learning Theory • Complete characterization of necessary and sufficient conditions for generalization and consistency of ERM • Remaining questions: • What about non-ERM algorithms? • Can we establish criteria not only for the hypothesis space?
Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion
Poggio et al. asked: What property must the learning map L have for good generalization of general algorithms? Can a new theory subsume the classical results for ERM?
Stability • Small perturbations of the training set should not change the hypothesis much • especially deleting one training example • Si = S \ {zi} • How can this be mathematically defined? [Figure: the learning map takes the original training set S and the perturbed set Si to two hypotheses in the hypothesis space]
Uniform Stability1 • A learning algorithm L is uniformly stable if • after deleting one training sample the change in loss is small at all points z ∈ Z • Uniform stability implies generalization • Requirement is too strong • Most algorithms (e.g. ERM) are not uniformly stable 1 Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2002
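Bousquet and Elisseeff's condition, spelled out (βn denotes the uniform stability constant):

```latex
% L is uniformly stable if deleting ANY single training point changes
% the loss by at most \beta_n at EVERY point z -- a worst-case requirement:
\forall S \in Z^n, \ \forall i \in \{1, \ldots, n\}, \ \forall z \in Z: \quad
\left| V(f_S, z) - V(f_{S^i}, z) \right| \;\leq\; \beta_n,
\qquad \beta_n \rightarrow 0
```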
CVloo stability1 • Cross-Validation leave-one-out stability • considers only errors at removed training points • strictly weaker than uniform stability [Figure: remove zi from S, compare the error at xi] 1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
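Roughly in the memo's notation (details of the constants simplified): the loss is controlled only at the deleted point itself, and only with high probability:

```latex
% CV_loo stability: there exist sequences \beta^{(n)}_{CV} \rightarrow 0 and
% \delta^{(n)}_{CV} \rightarrow 0 such that for every i and every distribution:
\Pr_S\left\{ \left| V(f_{S^i}, z_i) - V(f_S, z_i) \right|
\leq \beta^{(n)}_{CV} \right\} \;\geq\; 1 - \delta^{(n)}_{CV}
```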
Equivalence for ERM1 • Theorem: For good loss functions the following statements are equivalent for ERM: • L is distribution-independent CVloo stable • ERM generalizes and is universally consistent • H is a uGC class • Question: Does CVloo stability ensure generalization for all learning algorithms? 1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
CVloo Counterexample1 • X uniform on [0, 1] • Y = {-1, +1} • Target f*(x) = 1 • Learning algorithm L: • No change at the removed training point → CVloo stable • The algorithm does not generalize at all! 1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
Additional Stability Criteria • Error (Eloo) stability • Empirical Error (EEloo) stability • Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM) • Not sufficient for generalization
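The two conditions above, written out roughly in the memo's notation (both again with sequences β, δ → 0; the precise form in the memo may differ in details):

```latex
% Error (E_loo) stability: deleting a point barely moves the TRUE error
\Pr_S\left\{ \left| I[f_{S^i}] - I[f_S] \right| \leq \beta^{(n)}_{E} \right\}
\;\geq\; 1 - \delta^{(n)}_{E}

% Empirical error (EE_loo) stability: the same for the TRAINING error
\Pr_S\left\{ \left| I_{S^i}[f_{S^i}] - I_S[f_S] \right| \leq \beta^{(n)}_{EE} \right\}
\;\geq\; 1 - \delta^{(n)}_{EE}
```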
CVEEEloo Stability • Learning Map L is CVEEEloo stable if it is • CVloo stable • and Eloo stable • and EEloo stable • Question: • Does this imply generalization for all L?
CVEEEloo implies Generalization1 • Theorem: If L is CVEEEloo stable and the loss function is bounded, then fS generalizes • Remarks: • Neither condition (CV, E, EE) by itself is sufficient • Eloo and EEloo stability together are not sufficient • For ERM, CVloo stability alone is necessary and sufficient for generalization and consistency 1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
Consistency • CVEEEloo stability in general does NOT guarantee consistency • Good generalization does NOT necessarily mean good prediction • but poor expected performance is then reflected in poor training performance
CVEEEloo stable algorithms • Support Vector Machines and Regularization • k-Nearest Neighbour (k increasing with n) • Bagging (number of regressors increasing with n) • More results to come (e.g. AdaBoost) • For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN) • For all these algorithms generalization is guaranteed by the theorems shown!
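A sketch of how one might probe CVloo stability empirically for k-NN regression. Pure numpy; the helpers `knn_predict` and `cv_loo_change`, the synthetic sine data, and the k ≈ √n schedule are my illustration of the slide's "k increasing with n" condition, not the memo's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_predict(x, X_train, y_train, k):
    """k-NN regression: average the labels of the k nearest training points."""
    idx = np.argsort(np.abs(X_train - x))[:k]
    return y_train[idx].mean()

def cv_loo_change(n, k):
    """Average |V(f_{S^i}, z_i) - V(f_S, z_i)| over the deleted points i."""
    X = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * X) + rng.normal(0, 0.1, n)
    changes = []
    for i in range(n):
        mask = np.arange(n) != i
        loss_S = (knn_predict(X[i], X, y, k) - y[i]) ** 2              # f_S
        loss_Si = (knn_predict(X[i], X[mask], y[mask], k) - y[i]) ** 2  # f_{S^i}
        changes.append(abs(loss_Si - loss_S))
    return np.mean(changes)

# With k growing with n, the loss change at the deleted point shrinks,
# consistent with the CV_loo stability claimed for k-NN on this slide.
for n in [50, 200, 800]:
    k = max(1, int(np.sqrt(n)))
    print(f"n={n:4d}, k={k:2d}: mean CV_loo change = {cv_loo_change(n, k):.5f}")
```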
Agenda • Introduction • Problem Definition • Classical Results • Stability Criteria • Conclusion
Implications • Classical "VC-style" conditions • Occam's Razor: prefer simple hypotheses • CVloo stability • Incremental change • online algorithms • Inverse problems: stability ↔ well-posedness • condition numbers characterize stability • Stability-based learning may have more direct connections with the brain's learning mechanisms • condition on the learning machinery
Language Learning • Goal: learn grammars from sentences • Hypothesis space: class of all learnable grammars • What is easier to characterize and gives more insight into real language learning? • the language learning algorithm • or the class of all learnable grammars? • Focus on algorithms → shift of focus to stability
Conclusion • Stability implies generalization • intuitive (CVloo) and technical (Eloo, EEloo) criteria • Theory subsumes classical ERM results • Generalization criteria also for non-ERM algorithms • Restrictions on learning map rather than hypothesis space • New approach for designing learning algorithms
Open Questions • Easier / other necessary and sufficient conditions for generalization • Conditions for general consistency • Tight bounds for sample complexity • Applications of the theory for new algorithms • Stability proofs for existing algorithms
Sources • T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature Vol. 428, pp. 419-422, 2004 • S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003 • T. Mitchell: Machine Learning, McGraw-Hill, 1997 • C. Tomasi: Past performance and future results, Nature Vol. 428, p. 378, 2004 • N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997