Explore the problem of combining expert advice in machine learning theory. Learn about strategies like the Weighted Majority Algorithm and Randomized Weighted Majority for effective decision-making. Discover the applications of machine learning across various domains and the goals of Machine Learning Theory in developing and analyzing models. Dive into supervised classification, algorithm design, confidence bounds, and generalization guarantees.
Machine Learning Theory Plan for today: - problem of “combining expert advice” - course retrospective and open questions Maria Florina Balcan 04/29/10
Using “expert” advice Say we want to predict the stock market. • We solicit n “experts” for their advice. (Will the market go up or down?) • We then want to use their advice somehow to make our prediction. E.g., can we do nearly as well as the best expert in hindsight? [Here an “expert” is just someone with an opinion, not necessarily someone who knows anything.]
Simpler question • We have n “experts”. • One of these is perfect (never makes a mistake). We just don’t know which one. • Can we find a strategy that makes no more than $\lg n$ mistakes? • Answer: sure. Just take a majority vote over all experts that have been correct so far. • Each mistake cuts the number of remaining experts by at least a factor of 2. • Note: this means it is OK for n to be very large. This is the “halving algorithm”.
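The halving strategy on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `run_halving`, the 0/1 encoding of "down"/"up", and the tie-breaking rule (predict 1 on a tie) are all assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple


def run_halving(
    advice_rounds: Sequence[Sequence[int]], outcomes: Sequence[int]
) -> Tuple[int, Set[int]]:
    """Run the halving algorithm on binary advice.

    advice_rounds[t][i] is expert i's 0/1 prediction in round t and
    outcomes[t] is the true 0/1 outcome. Assumes at least one expert
    is never wrong, so the set of surviving experts never empties.
    Returns (number of mistakes we made, indices of surviving experts).
    """
    n = len(advice_rounds[0])
    alive = set(range(n))  # experts that have been correct so far
    mistakes = 0
    for advice, outcome in zip(advice_rounds, outcomes):
        # Majority vote over surviving experts (ties go to 1, an assumption).
        votes_up = sum(advice[i] for i in alive)
        guess = 1 if 2 * votes_up >= len(alive) else 0
        if guess != outcome:
            mistakes += 1
        # Cross off every expert that was wrong this round.
        alive = {i for i in alive if advice[i] == outcome}
    return mistakes, alive
```

Whenever the majority is wrong, at least half of the surviving experts are crossed off, which is where the $\lg n$ mistake bound comes from.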
Using “expert” advice If one expert is perfect, we can get $\le \lg n$ mistakes with the halving algorithm. But what if none is perfect? Can we do nearly as well as the best one in hindsight? • Strategy #1: • Iterated halving algorithm. Same as before, but once we've crossed off all the experts, restart from the beginning. • Makes at most $\lg(n)\cdot(\mathrm{OPT}+1)$ mistakes, where OPT is the number of mistakes of the best expert in hindsight. • Seems wasteful. Constantly forgetting what we've “learned”. Can we do better?
Weighted Majority Algorithm Intuition: Making a mistake doesn't completely disqualify an expert. So, instead of crossing off, just lower its weight. Weighted Majority Alg: • Start with all experts having weight 1. • Predict based on weighted majority vote. • Penalize mistakes by cutting weight in half.
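The three steps of the Weighted Majority algorithm might look like this in Python. It is a sketch under stated assumptions: predictions are encoded as 0/1, ties are broken toward 1, and the name `weighted_majority` and its `penalty` parameter are illustrative choices rather than anything defined in the slides.

```python
from typing import List, Sequence, Tuple


def weighted_majority(
    advice_rounds: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    penalty: float = 0.5,
) -> Tuple[int, List[float]]:
    """Deterministic Weighted Majority on binary advice.

    advice_rounds[t][i] is expert i's 0/1 prediction in round t.
    Returns (number of mistakes we made, final expert weights).
    """
    n = len(advice_rounds[0])
    weights = [1.0] * n  # start with all experts having weight 1
    mistakes = 0
    for advice, outcome in zip(advice_rounds, outcomes):
        # Predict with the weighted majority vote (ties go to 1, an assumption).
        weight_up = sum(w for w, a in zip(weights, advice) if a == 1)
        weight_down = sum(w for w, a in zip(weights, advice) if a == 0)
        guess = 1 if weight_up >= weight_down else 0
        if guess != outcome:
            mistakes += 1
        # Penalize every expert that was wrong by cutting its weight.
        weights = [
            w * penalty if a != outcome else w
            for w, a in zip(weights, advice)
        ]
    return mistakes, weights
```

Unlike halving, no expert is ever removed, so the algorithm keeps working when every expert makes some mistakes.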
Analysis: do nearly as well as best expert in hindsight • $M$ = # mistakes we've made so far. • $m$ = # mistakes best expert has made so far. • $W$ = total weight (starts at $n$). • After each mistake, $W$ drops by at least 25%. So, after $M$ mistakes, $W$ is at most $n(3/4)^M$. • Weight of best expert is $(1/2)^m$. So, $(1/2)^m \le n(3/4)^M$, which gives $M \le 2.4(m + \lg n)$: within a constant ratio of the best expert.
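The algebra behind this bound can be written out step by step. The derivation below is a sketch that uses only the quantities defined on this slide, with $\lg$ denoting the base-2 logarithm. The first line explains the 25% drop: when we err, at least half of the total weight sat on wrong experts, and that half is cut in two.

```latex
\begin{align*}
W_{\text{new}} &\le \tfrac{1}{2}W + \tfrac{1}{2}\cdot\tfrac{1}{2}W = \tfrac{3}{4}W
  && \text{(after each of our mistakes)} \\
(1/2)^m &\le W \le n\,(3/4)^M
  && \text{(best expert's weight is part of } W\text{)} \\
(4/3)^M &\le n\,2^m \\
M\,\lg(4/3) &\le m + \lg n \\
M &\le \frac{m + \lg n}{\lg(4/3)} \approx 2.41\,(m + \lg n)
\end{align*}
```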
Randomized Weighted Majority A bound of $2.4(m + \lg n)$ is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes. • Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then predict up with probability 0.7 and down with probability 0.3.) Idea: smooth out the worst case. • Also, generalize the penalty factor ½ to $1-\varepsilon$. Unlike most worst-case bounds, the numbers are pretty good.
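A minimal Python sketch of Randomized Weighted Majority follows. The function name, the 0/1 encoding, the default `eps=0.25`, and the seeded `random.Random` used for reproducibility are assumptions for illustration. Sampling one expert in proportion to its weight and following its advice is equivalent to predicting each outcome with probability equal to its share of the weight.

```python
import random
from typing import List, Optional, Sequence, Tuple


def randomized_weighted_majority(
    advice_rounds: Sequence[Sequence[int]],
    outcomes: Sequence[int],
    eps: float = 0.25,
    rng: Optional[random.Random] = None,
) -> Tuple[int, float, List[float]]:
    """Randomized Weighted Majority on binary advice.

    Returns (realized mistakes, expected mistakes = sum of F_t,
    final expert weights).
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    n = len(advice_rounds[0])
    weights = [1.0] * n
    mistakes = 0
    expected_mistakes = 0.0
    for advice, outcome in zip(advice_rounds, outcomes):
        total = sum(weights)
        wrong = sum(w for w, a in zip(weights, advice) if a != outcome)
        expected_mistakes += wrong / total  # F_t: weight fraction that errs
        # Follow one expert chosen with probability proportional to weight.
        chosen = rng.choices(range(n), weights=weights, k=1)[0]
        if advice[chosen] != outcome:
            mistakes += 1
        # Multiply the weight of each wrong expert by (1 - eps).
        weights = [
            w * (1.0 - eps) if a != outcome else w
            for w, a in zip(weights, advice)
        ]
    return mistakes, expected_mistakes, weights
```

Tracking `expected_mistakes` alongside the realized count makes it easy to compare a run against the expected-mistake bound, which is a statement about the sum of the fractions $F_t$.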
Analysis • Say at time $t$ we have fraction $F_t$ of the weight on experts that made a mistake. • So, we have probability $F_t$ of making a mistake, and we remove an $\varepsilon F_t$ fraction of the total weight. • $W_{\text{final}} = n(1-\varepsilon F_1)(1-\varepsilon F_2)\cdots$ • $\ln(W_{\text{final}}) = \ln(n) + \sum_t \ln(1-\varepsilon F_t) \le \ln(n) - \varepsilon \sum_t F_t$ (using $\ln(1-x) < -x$) $= \ln(n) - \varepsilon M$, where $M = \sum_t F_t = E[\text{\# mistakes}]$. • If the best expert makes $m$ mistakes, then $\ln(W_{\text{final}}) > \ln((1-\varepsilon)^m)$. • Now solve: $\ln(n) - \varepsilon M > m \ln(1-\varepsilon)$.
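Solving the last inequality for $M$ can be sketched as follows. The final step relies on the standard inequality $-\ln(1-\varepsilon) \le \varepsilon + \varepsilon^2$, which holds for $0 \le \varepsilon \le 1/2$; that range restriction is an assumption added here, not something stated on the slide.

```latex
\begin{align*}
\ln n - \varepsilon M &> m \ln(1-\varepsilon) \\
M &< \frac{m \ln\!\big(1/(1-\varepsilon)\big) + \ln n}{\varepsilon} \\
  &\le \frac{m(\varepsilon + \varepsilon^2) + \ln n}{\varepsilon}
   = (1+\varepsilon)\,m + \frac{\ln n}{\varepsilon}
  && (0 < \varepsilon \le 1/2)
\end{align*}
```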
Summarizing • At most $(1+\varepsilon)$ times worse than the best expert in hindsight, with an additive $\varepsilon^{-1}\log(n)$. • If we have a prior, we can replace the additive term with $\varepsilon^{-1}\log(1/p_i)$. [$\varepsilon^{-1} \times$ number of bits] • Often written in terms of additive loss. If running $T$ time steps, set $\varepsilon$ to get additive loss $(2T \log n)^{1/2}$.
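To see roughly where a $\sqrt{T \log n}$ additive loss comes from, one can tune $\varepsilon$ in the multiplicative bound. The calculation below is a crude sketch that assumes the bound $M \le (1+\varepsilon)m + \varepsilon^{-1}\ln n$ and $m \le T$. It gives the same order as the figure on the slide but a slightly larger constant; the slide's constant depends on a tighter analysis that is not reproduced here.

```latex
\begin{align*}
M - m &\le \varepsilon m + \frac{\ln n}{\varepsilon}
       \le \varepsilon T + \frac{\ln n}{\varepsilon} \\
\varepsilon = \sqrt{\frac{\ln n}{T}}
  \;&\Longrightarrow\;
  M - m \le 2\sqrt{T \ln n}
\end{align*}
```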
What can we use this for? • Can use to combine multiple algorithms to do nearly as well as the best in hindsight. • Can apply RWM in situations where experts are making choices that cannot be combined. • E.g., repeated game-playing. • E.g., online shortest path problem. [OK if losses are in [0,1]. Replace $F_t$ with $P_t \cdot L_t$ and penalize expert $i$ by $(1-\varepsilon)^{\mathrm{loss}(i)}$.] • Extensions: • “bandit” problem. • efficient algorithms for some cases with many experts. • Sleeping experts / “specialists” setting.
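The bracketed generalization to losses in [0,1] amounts to a small change in the per-round update. The sketch below shows one round; the function name `rwm_loss_round` and its signature are hypothetical, chosen only for this example.

```python
from typing import List, Sequence, Tuple


def rwm_loss_round(
    weights: Sequence[float], losses: Sequence[float], eps: float
) -> Tuple[float, List[float]]:
    """One round of RWM with real-valued losses in [0, 1].

    Returns (expected loss P_t . L_t this round, updated weights).
    """
    if any(l < 0.0 or l > 1.0 for l in losses):
        raise ValueError("losses must lie in [0, 1]")
    total = sum(weights)
    probs = [w / total for w in weights]  # P_t: distribution over experts
    expected_loss = sum(p * l for p, l in zip(probs, losses))  # P_t . L_t
    # Penalize expert i by the factor (1 - eps) ** loss(i).
    new_weights = [w * (1.0 - eps) ** l for w, l in zip(weights, losses)]
    return expected_loss, new_weights
```

With 0/1 losses this reduces to the mistake-based update: an expert with loss 1 is multiplied by $(1-\varepsilon)$ and an expert with loss 0 is left unchanged.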
Machine Learning Incredibly useful in many domains across computer science, engineering, and science. • Image Classification • Spam Detection • Document Categorization • Fraud Detection • Speech Recognition • Protein Classification • Computational Advertising • Branch Prediction • Etc
Goals of Machine Learning Theory Develop and analyze models to understand: • what kinds of tasks we can hope to learn, and from what kind of data • what types of guarantees might we hope to achieve • prove guarantees for practically successful algs (when will they succeed, how long will they take?); • develop new algs that provably meet desired criteria
Example: Supervised Classification Decide which emails are spam and which are important. [Figure: supervised classification of emails into “not spam” and “spam”.] Goal: use emails seen so far to produce a good prediction rule for future data.
Two Main Aspects of Supervised Learning • Algorithm Design. How to optimize? Automatically generate rules that do well on observed data. • Confidence Bounds, Generalization Guarantees, Sample Complexity. Confidence for rule effectiveness on future data. Well understood for passive supervised learning.
Other Protocols for Supervised Learning • Semi-Supervised Learning: using cheap unlabeled data in addition to labeled data. • Active Learning: the algorithm interactively asks for labels of informative examples. Theoretical understanding severely lacking until a couple of years ago. Lots of progress recently. We will cover some of these. • Learning with Membership Queries • Statistical Query Learning
Topics we covered • Basic models for supervised learning: PAC and SLT. • Simple algos and hardness results for supervised learning. • Standard Sample Complexity Results (VC dimension) • Weak-learning vs. Strong-learning • Classic, state of the art algorithms: AdaBoost and SVM.
Structure of the Class • Modern Sample Complexity Results • Rademacher Complexity • Margin analysis of Boosting and SVM • Incorporating Unlabeled Data in the Learning Process. • Incorporating Interaction in the Learning Process: • Active Learning • Learning with Membership Queries • Classification noise and the Statistical-Query model • Learning Real Valued Functions
Open Questions • In the classic PAC model • learning decision trees, DNF • learning functions with a few relevant variables (junta problem) • Active learning and SSL • right sample complexity quantities • interesting positive algorithmic results • Models and algorithms for exciting new paradigms • e.g., transfer learning, multi-agent learning, never-ending learning