Machine Learning Algorithms in Computational Learning Theory TIAN HE JI GUAN WANG Shangxuan Xiangnan Kun Peiyong Hancheng 25th Jan 2013
Outline • Introduction • Probably Approximately Correct Framework (PAC) • PAC Framework • Weak PAC-Learnability • Error Reduction • Mistake Bound Model of Learning • Mistake Bound Model • Predicting from Expert Advice • The Weighted Majority Algorithm • Online Learning from Examples • The Winnow Algorithm • PAC versus Mistake Bound Model • Conclusion • Q & A
Machine Learning Machines cannot learn, but they can be trained.
Machine Learning • Definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." ---- Tom M. Mitchell • Algorithm types • Supervised learning: regression, classification on labeled data • Unsupervised learning: clustering, data mining • Reinforcement learning: learning to act better from observations
Machine Learning • Other Examples • Medical diagnosis • Handwritten character recognition • Customer segmentation (marketing) • Document segmentation (classifying news) • Spam filtering • Weather prediction and climate tracking • Gene prediction • Face recognition
Computational Learning Theory • Why learning works • Under what conditions is successful learning possible and impossible? • Under what conditions is a particular learning algorithm assured of learning successfully? • We need particular settings (models) • Probably approximately correct (PAC) • Mistake bound models
Probably Approximately Correct Framework (PAC) • PAC Learnability • Weak PAC-Learnability • Error Reduction • Occam’s Razor
PAC Learning • PAC Learning • Any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be wrong. • Stationarity: the future is like the past. • Concept: an efficiently computable function on a domain, e.g. f : {0,1}^n -> {0,1}. • A concept class is a collection of concepts.
PAC Learnability • Learnability • Requirements for ALG • ALG must, with arbitrarily high probability (1 - δ), output a hypothesis with arbitrarily low error ε. • ALG must do so efficiently, in time that grows at most polynomially in 1/δ and 1/ε.
PAC Learning for Decision Lists • A Decision List (DL) is a way of representing a certain class of functions over n-tuples. • Example: if x4 = 1 then f(x) = 0, else if x2 = 1 then f(x) = 1, else f(x) = 0. • An upper bound on the number of possible boolean decision lists on n variables is n! · 4^n = O(n^n).
PAC Learning for Decision Lists • Algorithm: a greedy approach (Rivest, 1987) 1. If the example set S is empty, halt. 2. Examine each term of length at most k until a term t is found such that all examples in S that make t true have the same label v. 3. Add (t, v) to the decision list and remove those examples from S. 4. Repeat steps 1-3. • Clearly, this runs in polynomial time. (A code sketch of the greedy rule follows below.)
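The greedy rule is easy to express in code. Below is a minimal Python sketch, not taken from the slides: it assumes examples are boolean tuples, represents a term as a conjunction of at most k literals, and uses illustrative names (`learn_decision_list`, `predict_dl`). Given labeled samples of the slide's example function, it returns some k-decision list consistent with them, or None if no term of length at most k separates the remaining examples.

```python
from itertools import combinations, product

def learn_decision_list(examples, labels, k=1):
    """Greedy k-decision-list learner in the spirit of the slides (Rivest, 1987).
    examples: list of boolean tuples; labels: list of 0/1 values.
    A term is a conjunction of at most k literals, stored as a tuple of
    (variable index, required value) pairs."""
    n = len(examples[0])
    S = list(zip(examples, labels))

    def satisfies(x, term):
        return all(x[i] == v for i, v in term)

    # Candidate terms of length 0..k; the empty term () is true for every example.
    terms = [()]
    for length in range(1, k + 1):
        for idxs in combinations(range(n), length):
            for vals in product([0, 1], repeat=length):
                terms.append(tuple(zip(idxs, vals)))

    rules = []
    while S:                                   # step 1: halt when S is empty
        for t in terms:                        # step 2: find a suitable term t
            covered = [(x, y) for x, y in S if satisfies(x, t)]
            if covered and len({y for _, y in covered}) == 1:
                rules.append((t, covered[0][1]))                   # step 3: add (t, v)
                S = [(x, y) for x, y in S if not satisfies(x, t)]  # and remove its examples
                break
        else:
            return None   # no suitable term: the sample is not consistent with any k-DL
    return rules

def predict_dl(rules, x):
    """Evaluate a learned decision list on a single example x."""
    for term, value in rules:
        if all(x[i] == v for i, v in term):
            return value
    return 0
```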
What does PAC do? • A supervised learning framework to classify data
How can we use PAC? • Use PAC as a general framework to guide us on efficient sampling for machine learning • Use PAC as a theoretical analyzer to distinguish hard problems from easy problems • Use PAC to evaluate the performance of some algorithms • Use PAC to solve some real problems
What are we going to cover? • Explore what PAC can learn • Apply PAC to real data with noise • Give a probabilistic analysis of the performance of PAC
PAC Learning for Decision Lists • Algorithm: a greedy approach
Analysis of Greedy Algorithm • The output • Performance Guarantee
PAC Learning for Decision Lists 1. For a given sample S, partition the set of all concepts that agree with the target f on S into a "bad" set (true error > ε) and a "good" set (true error ≤ ε); we want the probability of outputting a bad concept to be at most δ. 2. Consider any single bad hypothesis h: the probability that we pick an S of size m on which h is still consistent (so that h ends up in the bad set) is at most (1 - ε)^m. 3. By the union bound over all of C, the probability that some bad hypothesis survives is at most |C| (1 - ε)^m ≤ |C| e^{-εm}. 4. Putting it together: requiring |C| e^{-εm} ≤ δ gives m ≥ (1/ε)(ln |C| + ln(1/δ)), which is polynomial in n, 1/ε and 1/δ since ln |C| = O(n log n) for decision lists.
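To make step 4 concrete, here is a small sketch that computes a sufficient sample size m ≥ (1/ε)(ln |C| + ln(1/δ)), assuming the slide's counting bound |C| ≤ n! · 4^n for decision lists; the function name is ours.

```python
import math

def dl_sample_size(n, eps, delta):
    """Sufficient sample size m >= (1/eps) * (ln|C| + ln(1/delta)),
    using the slides' bound |C| <= n! * 4^n for decision lists on n variables."""
    log_C = math.lgamma(n + 1) + n * math.log(4)   # ln(n!) + n * ln 4
    return math.ceil((log_C + math.log(1 / delta)) / eps)

# Example: 10 variables, 10% error, 95% confidence -> a few hundred examples suffice.
print(dl_sample_size(n=10, eps=0.1, delta=0.05))
```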
The Limitation of PAC for DLs • What if the examples are like below?
Other Concept Classes • Decision trees: DTs of restricted size are not PAC-learnable, although those of arbitrary size are. • AND-formulas: PAC-learnable. • 3-CNF formulas: PAC-learnable. • 3-term DNF formulas: it turns out to be NP-hard, given S, to come up with a 3-term DNF formula that is consistent with S. Therefore this concept class is not PAC-learnable, but only for now, as we shall soon revisit this class with a modified definition of PAC-learning.
Weak PAC-Learnability Benefits: • To loosen the requirement for a highly accurate algorithm • To reduce the running time, since |S| can be smaller • To find a "good" concept using a simple algorithm A
Error Reduction by Boosting • The basic idea exploits the fact that a weak learner can do a little better than chance on every distribution; by combining hypotheses learned on different distributions over more iterations, we can reach a much lower error rate.
Error Reduction by Boosting • Detailed steps: 1. Some algorithm A produces a hypothesis with error probability at most p = 1/2 - γ (γ > 0). We would like to decrease this error probability to 1/2 - γ′ with γ′ > γ. 2. We invoke A three times, each time with a slightly different distribution, and get hypotheses h1, h2 and h3, respectively. 3. The final hypothesis is then h = Maj(h1, h2, h3).
Error Reduction by Boosting • Learn h1 from D1 with error at most p. • Modify D1 so that the total weight of the examples h1 classifies incorrectly is 1/2; this gives D2. Pick sample S2 from this distribution and use A to learn h2. • Modify D2 so that D3 is concentrated on the examples where h1 and h2 disagree; pick sample S3 from this distribution and use A to learn h3. (See the code sketch after this slide.)
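The three-distribution construction can be sketched in NumPy as follows. This is one plausible reading of the slides, not their exact procedure: `weak_learn` stands in for algorithm A, labels are assumed to be 0/1, and "total weight of incorrectly marked examples is 1/2" is implemented by rescaling the weights of the misclassified and correctly classified halves.

```python
import numpy as np

def boost_three(weak_learn, X, y):
    """Three-hypothesis error reduction (sketch). weak_learn(X, y, weights)
    must return a callable h with h(X) giving 0/1 predictions."""
    m = len(X)
    D1 = np.full(m, 1.0 / m)
    h1 = weak_learn(X, y, D1)

    # D2: rescale so that the examples h1 gets wrong carry total weight 1/2.
    wrong = (h1(X) != y)
    D2 = D1.copy()
    if wrong.any() and (~wrong).any():
        D2[wrong] *= 0.5 / D1[wrong].sum()
        D2[~wrong] *= 0.5 / D1[~wrong].sum()
    h2 = weak_learn(X, y, D2)

    # D3: concentrate all weight on the examples where h1 and h2 disagree.
    disagree = (h1(X) != h2(X))
    D3 = disagree.astype(float)
    D3 = D3 / D3.sum() if D3.sum() > 0 else D1
    h3 = weak_learn(X, y, D3)

    def h(Xq):
        # Final hypothesis: majority vote Maj(h1, h2, h3).
        votes = h1(Xq).astype(int) + h2(Xq).astype(int) + h3(Xq).astype(int)
        return (votes >= 2).astype(int)
    return h
```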
Error Reduction by Boosting • The total error probability of h is at most 3p^2 - 2p^3, which is less than p when p ∈ (0, 1/2). The proof of this bound is given in [1]. • Thus there exists γ′ > γ such that the error probability of our new hypothesis is at most 1/2 - γ′. [1] http://courses.csail.mit.edu/6.858/lecture-12.ps
Adaboost • Defines a classifier using an additive model: H(x) = sign( Σ_t α_t h_t(x) ), a weighted vote of the weak hypotheses h_t with coefficients α_t.
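As an illustration of that additive model, here is a minimal AdaBoost sketch. It assumes NumPy arrays, labels in {-1, +1}, and a pluggable `weak_learn(X, y, w)` that returns a callable hypothesis; the interface and the stopping rules are our own simplifications, not from the slides.

```python
import numpy as np

def adaboost(weak_learn, X, y, T=50):
    """Minimal AdaBoost sketch. Assumes y is a NumPy array with labels in {-1, +1}
    and weak_learn(X, y, w) returns a callable h with h(X) in {-1, +1}."""
    m = len(X)
    w = np.full(m, 1.0 / m)
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, w)
        pred = h(X)
        err = w[pred != y].sum()
        if err == 0:                      # perfect weak hypothesis: keep it and stop
            hypotheses, alphas = [h], [1.0]
            break
        if err >= 0.5:                    # no better than random guessing: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)    # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        hypotheses.append(h)
        alphas.append(alpha)

    def H(Xq):
        # Additive model: sign of the weighted vote of the weak hypotheses.
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, hypotheses)))
    return H
```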
Error Reduction by Boosting Fig. Error curves for boosting C4.5 on the letter dataset as reported by Schapire et al.[]. Training and test error curves are lower and upper curves respectively.
PAC learning conclusion • Strong PAC learning • Weak PAC learning • Error reduction and boosting
Mistake Bound Model of Learning • Mistake Bound Model • Predicting from Expert Advice • The Weighted Majority Algorithm • Online Learning from Examples • The Winnow Algorithm
Mistake Bound Model of Learning | Basic Settings • x – examples • c – the target concept, c ∈ C • x1, x2, …, xt – an input sequence • At the t-th stage: • The algorithm receives xt • The algorithm predicts a classification bt for xt • The algorithm receives the true classification c(xt) • A mistake occurs if c(xt) ≠ bt
Mistake Bound Model of Learning | Basic Settings • A hypothesis class C has an algorithm A with mistake bound M if: • for any concept c ∈ C, and • for any ordering of the examples, • the total number of mistakes ever made by A is bounded by M. (A driver loop for this online protocol is sketched below.)
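The protocol above can be written as a short driver loop. The learner interface below (`predict`/`update`) is an assumed naming, used only to make the stages of each trial explicit.

```python
def run_online(learner, stream, target):
    """Online protocol from the slides. At stage t the learner receives x_t,
    predicts b_t, and is then told the true classification c(x_t).
    Returns the total number of mistakes made on the stream."""
    mistakes = 0
    for x_t in stream:
        b_t = learner.predict(x_t)     # the algorithm predicts a classification for x_t
        c_x = target(x_t)              # the true classification is revealed
        if b_t != c_x:                 # a mistake occurs if c(x_t) != b_t
            mistakes += 1
        learner.update(x_t, c_x)       # the learner may adjust its state
    return mistakes
```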
Mistake Bound Model of Learning | Basic Settings • Predicting from Expert Advice • The Weighted Majority Algorithm • Online Learning from Examples • The Winnow Algorithm
Predicting from Expert Advice • The Weighted Majority Algorithm • Deterministic • Randomized
Predicting from Expert Advice | Basic Flow (Diagram: the experts' advice is combined into a prediction, which is then compared against the truth.) Assumption: predictions ∈ {0, 1}.
Predicting from Expert Advice | Trial (1) Receiving the predictions of the experts (2) Making its own prediction (3) Being told the correct answer
Predicting from Expert Advice | An Example • Task: predicting whether it will rain today. • Input: the advice of n experts, each ∈ {1 (yes), 0 (no)}. • Output: 1 or 0. • Goal: make the least number of mistakes.
The Weighted Majority Algorithm | Deterministic (Worked example table: the experts' 0/1 predictions, the truth, and the evolving expert weights over several trials; each expert that predicts wrongly has its weight halved, e.g. 1 → 0.50 → 0.25.)
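A sketch of the deterministic rule traced in that example: predict with the weighted vote of the experts, then multiply the weight of every expert that was wrong by β (the worked example uses β = 1/2). The function signature is illustrative.

```python
def weighted_majority(expert_predictions, truth, beta=0.5):
    """Deterministic Weighted Majority (sketch). expert_predictions is a list of
    per-expert 0/1 prediction lists; truth is the list of correct answers.
    Returns the final expert weights and the number of mistakes made."""
    n = len(expert_predictions)
    weights = [1.0] * n
    mistakes = 0
    for t, y in enumerate(truth):
        vote_1 = sum(w for w, e in zip(weights, expert_predictions) if e[t] == 1)
        vote_0 = sum(w for w, e in zip(weights, expert_predictions) if e[t] == 0)
        prediction = 1 if vote_1 >= vote_0 else 0   # weighted majority vote
        if prediction != y:
            mistakes += 1
        for i in range(n):                          # penalize the experts that were wrong
            if expert_predictions[i][t] != y:
                weights[i] *= beta
    return weights, mistakes
```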