A Black-Box approach to machine learning Yoav Freund
Why do we need learning? • Computers need functions that map highly variable data: • Speech recognition: Audio signal -> words • Image analysis: Video signal -> objects • Bio-Informatics: Micro-array Images -> gene function • Data Mining: Transaction logs -> customer classification • For accuracy, functions must be tuned to fit the data source. • For real-time processing, function computation has to be very fast.
The complexity/accuracy tradeoff
[Figure: error as a function of complexity, with the level of trivial performance marked.]
The speed/flexibility tradeoff
[Figure: speed vs. flexibility, ranging from Matlab code and Java code (most flexible) through machine code and digital hardware to analog hardware (fastest).]
Theory vs. Practice • Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in “all” situations. - I prove theorems. • Practitioner: I want a real-time algorithm that performs well on my problem. - I experiment. • My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. - I do both.
Plan of talk • The black-box approach • Boosting • Alternating decision trees • A commercial application • Boosting the margin • Confidence rated predictions • Online learning
The black-box approach • Statistical models are not generators, they are predictors. • A predictor is a function from observation X to action Z. • After the action is taken, outcome Y is observed, which implies loss L (a real-valued number). • Goal: find a predictor with small loss (in expectation, with high probability, cumulative…)
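As a minimal illustration of this black-box view (the types and helper name below are mine, not from the talk):

```python
from typing import Any, Callable, Iterable, Tuple

# A predictor maps an observation x to an action z.
Predictor = Callable[[Any], Any]

def average_loss(predictor: Predictor,
                 examples: Iterable[Tuple[Any, Any]],
                 loss: Callable[[Any, Any], float]) -> float:
    """Average loss of `predictor` over (observation, outcome) pairs.

    `loss(action, outcome)` returns a real number; smaller is better.
    """
    examples = list(examples)
    return sum(loss(predictor(x), y) for x, y in examples) / len(examples)
```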
Main software components
[Diagram: a predictor maps an input x to a prediction z; a learner maps a set of training examples to a predictor.]
We assume the predictor will be applied to examples similar to those on which it was trained.
Learning in a system
[Diagram: inside a larger system, sensor data from the target system becomes training examples for the learning system; the learned predictor chooses actions, and feedback from the target system closes the loop.]
Special case: Classification
• Observation X – arbitrary (measurable) space
• Outcome Y – finite set {1,…,K}
• Prediction Z – {1,…,K}
• Usually K=2 (binary classification)
Batch learning for binary classification
• Data distribution: $(x,y) \sim \mathcal{D}$, with $y \in \{-1,+1\}$
• Training set: $(x_1,y_1),\dots,(x_m,y_m)$ drawn i.i.d. from $\mathcal{D}$
• Training error: $\hat{\varepsilon}(h) = \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}[h(x_i)\neq y_i]$
• Generalization error: $\varepsilon(h) = \Pr_{(x,y)\sim\mathcal{D}}[h(x)\neq y]$
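A tiny Python illustration of these two error rates (the helper name is mine):

```python
import numpy as np

def error_rate(h, X, y):
    """Fraction of examples on which hypothesis h disagrees with the label.

    On the training set this is the training error; on a fresh sample from
    the same distribution it estimates the generalization error.
    """
    X, y = np.asarray(X), np.asarray(y)
    return float(np.mean(h(X) != y))
```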
Boosting Combining weak learners
A weighted training set
• Feature vectors $x_1,\dots,x_n$
• Binary labels $y_i \in \{-1,+1\}$
• Positive weights $w_i > 0$
A weak learner
[Diagram: a weighted training set is fed to the weak learner, which outputs a weak rule h mapping instances to predictions.]
The weak requirement: h must beat random guessing on the weighted set, i.e. its weighted training error (with weights normalized to sum to 1) satisfies $\sum_i w_i\,\mathbf{1}[h(x_i)\neq y_i] \le \tfrac{1}{2} - \gamma$ for some advantage $\gamma > 0$.
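The slides don't fix a particular weak learner; a decision stump over a single feature is one common choice. A minimal sketch, assuming numpy arrays and labels in {-1,+1} (not necessarily the weak learner used in the talk):

```python
import numpy as np

def stump_weak_learner(X, y, w):
    """Return a decision-stump rule h with small weighted error.

    X: (n, d) features, y: labels in {-1, +1}, w: positive example weights.
    The stump predicts s if X[:, j] > t and -s otherwise; we brute-force
    search over feature j, threshold t and sign s.
    """
    X, y = np.asarray(X), np.asarray(y)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, t, s)
    j, t, s = best
    return lambda X_new: np.where(np.asarray(X_new)[:, j] > t, s, -s)
```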
The boosting process
[Diagram: start from uniform weights $(x_1,y_1,1/n),\dots,(x_n,y_n,1/n)$; the weak learner returns $h_1$, the examples are reweighted, the weak learner returns $h_2$, and so on through $h_T$.]
Final rule: the sign of a weighted sum of $h_1,\dots,h_T$.
Main property of AdaBoost
If the advantages of the weak rules over random guessing are $\gamma_1,\gamma_2,\dots,\gamma_T$, then the training error of the final rule is at most
$$\prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big)$$
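The two previous slides describe the loop and its guarantee; here is a compact sketch of that loop with AdaBoost's usual weight update (the names are mine; the weak learner is passed in, e.g. the stump sketch above):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost: call `weak_learner(X, y, w)` on reweighted data T times.

    `weak_learner` must return a rule h with h(X) in {-1, +1}^n.
    Returns the final voted rule and the training-error bound
    prod_t sqrt(1 - 4 * gamma_t^2) from the slide above.
    """
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                 # start from uniform weights
    rules, alphas, bound = [], [], 1.0
    for _ in range(T):
        h = weak_learner(X, y, w)
        pred = h(X)
        err = np.sum(w * (pred != y))       # weighted training error of h
        gamma = 0.5 - err                   # advantage over random guessing
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w = w * np.exp(-alpha * y * pred)   # increase the weight of mistakes
        w /= w.sum()
        rules.append(h)
        alphas.append(alpha)
        bound *= np.sqrt(max(1.0 - 4.0 * gamma ** 2, 0.0))

    def final_rule(X_new):
        votes = sum(a * h(X_new) for a, h in zip(alphas, rules))
        return np.sign(votes)

    return final_rule, bound
```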
Boosting block diagram
[Diagram: the booster sends example weights to the weak learner and gets back a weak rule; accumulating these weak rules yields an accurate rule, so booster plus weak learner together form a strong learner.]
What is a good weak learner? The set of weak rules (features) should be: • flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label. • simple enough to allow efficient search for a rule with non-trivial weighted training error. • small enough to avoid overfitting. Calculation of the prediction from the observations should be very fast.
Alternating decision trees Freund, Mason 1997
Decision Trees
[Figure: a decision tree with root test X>3 and a second test Y>5, whose leaves predict −1 or +1; next to it, the (X,Y) plane is partitioned by the lines X=3 and Y=5 into the corresponding −1/+1 regions.]
A decision tree as a sum of weak rules
[Figure: the same tree rewritten as a sum of real-valued scores (−0.2, −0.1, +0.1, +0.2, −0.3, …): the scores along an instance's path are added and the prediction is the sign of the sum; the plane on the right shows the resulting score in each region.]
An alternating decision tree
[Figure: an alternating decision tree built from the decision nodes X>3, Y<1 and Y>5, each followed by prediction nodes holding real-valued scores (+0.7, +0.2, −0.1, +0.1, −0.3, …); an instance follows every branch whose condition it satisfies, all the scores it reaches are summed, and the prediction is the sign of the total. The plane on the right shows the total score in each region.]
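A sketch of how such a tree is evaluated: every prediction node an instance reaches contributes its score, and the final prediction is the sign of the total. The data structure and scores below are my own simplification of the idea, not the talk's implementation:

```python
import numpy as np

class PredictionNode:
    """A prediction node holds a score and optional splitter children."""
    def __init__(self, score, splitters=None):
        self.score = score
        self.splitters = splitters or []   # list of (condition, yes_node, no_node)

def adt_score(node, x):
    """Sum the scores of every prediction node that instance x reaches."""
    total = node.score
    for condition, yes_node, no_node in node.splitters:
        child = yes_node if condition(x) else no_node
        total += adt_score(child, x)
    return total

# Tiny tree shaped like the slide's example (scores are illustrative only).
root = PredictionNode(0.5, [
    (lambda x: x[0] > 3, PredictionNode(+0.2), PredictionNode(-0.1)),
    (lambda x: x[1] > 5, PredictionNode(-0.3), PredictionNode(+0.1)),
])
print(np.sign(adt_score(root, np.array([4.0, 2.0]))))   # prediction in {-1, +1}
```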
Example: Medical Diagnostics • Cleve dataset from UC Irvine database. • Heart disease diagnostics (+1=healthy,-1=sick) • 13 features from tests (real valued and discrete). • 303 instances.
AD-tree for heart-disease diagnostics
[Figure: the learned alternating decision tree; a total score > 0 means healthy, < 0 means sick.]
AT&T “buisosity” problem Freund, Mason, Rogers, Pregibon, Cortes 2000 • Distinguish business/residence customers from call detail information. (time of day, length of call …) • 230M telephone numbers, label unknown for ~30% • 260M calls / day • Required computer resources: • Huge:counting log entries to produce statistics -- use specialized I/O efficient sorting algorithms (Hancock). • Significant: Calculating the classification for ~70M customers. • Negligible:Learning (2 Hours on 10K training examples on an off-line computer).
Quantifiable results
[Figure: precision/recall curves — accuracy as a function of score.]
• At 94% accuracy, coverage increased from 44% to 56%.
• Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
AdaBoost’s resistance to overfitting Why statisticians find AdaBoost interesting.
A very curious phenomenon: boosting decision trees. Using <10,000 training examples, we fit >2,000,000 parameters.
Large margins
Thesis: large margins => reliable predictions.
Very similar to SVM.
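The margin here is presumably the usual normalized voting margin of the combined rule, as in the Schapire–Freund–Bartlett–Lee analysis:

$$\operatorname{margin}_f(x,y) \;=\; \frac{y\,\sum_{t}\alpha_t h_t(x)}{\sum_{t}|\alpha_t|} \;\in\; [-1,\,1],$$

so a positive margin means a correct prediction, and a margin near 1 means the weighted vote is nearly unanimous.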
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998)
H: set of binary functions with VC-dimension d. With probability $1-\delta$ over a training sample S of size m, every convex combination f of functions from H and every $\theta > 0$ satisfy
$$\Pr_{\mathcal{D}}\big[\operatorname{margin}_f(x,y)\le 0\big] \;\le\; \Pr_{S}\big[\operatorname{margin}_f(x,y)\le\theta\big] \;+\; \tilde{O}\!\left(\sqrt{\frac{d/\theta^2+\log(1/\delta)}{m}}\right)$$
No dependence on the number of combined functions!!!
Confidence rated predictions Agreement gives confidence
A motivating example
[Figure: a scatter of + and − training examples. Where the two classes are clearly separated the prediction is obvious, but query points marked “?” fall in regions labeled “Unsure”, where plausible classifiers disagree.]
The algorithm (Freund, Mansour, Schapire 2001)
• Parameters
• Hypothesis weight
• Empirical log ratio
• Prediction rule
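The structure suggested by the slide's labels is: weight each hypothesis by its empirical error, compare the total weight voting for each label through a log ratio, and abstain when that ratio is small. The sketch below follows that structure with constants of my own choosing; it is not the exact rule of Freund, Mansour & Schapire:

```python
import numpy as np

def confidence_rated_rule(hypotheses, X_train, y_train, eta, delta):
    """Build a confidence-rated predictor from a finite set of hypotheses.

    Each hypothesis h gets weight exp(-eta * training_error(h)); the
    prediction on x is based on the log ratio of the total weight voting
    +1 versus -1, and the rule abstains ("unsure") when |log ratio| <= delta.
    The exact weighting and thresholds in the paper may differ.
    """
    errs = np.array([np.mean(h(X_train) != y_train) for h in hypotheses])
    weights = np.exp(-eta * errs)

    def predict(x):
        votes = np.array([h(np.atleast_2d(x))[0] for h in hypotheses])
        w_plus = weights[votes == +1].sum() + 1e-12
        w_minus = weights[votes == -1].sum() + 1e-12
        log_ratio = (1.0 / eta) * np.log(w_plus / w_minus)
        if log_ratio > delta:
            return +1
        if log_ratio < -delta:
            return -1
        return 0   # unsure

    return predict
```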
Suggested tuning: suppose H is a finite set. [The slide gives a particular setting of the parameters and the bound it yields.]
Confidence Rating block diagram
[Diagram: training examples and a set of candidate rules are fed to a rater-combiner, which outputs a confidence-rated rule.]
Face Detection Viola & Jones 1999 • Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Using confidence to save time
The detector combines 6000 simple features using AdaBoost. In most boxes, only 8-9 features are calculated.
[Diagram: all boxes are tested on Feature 1, then Feature 2, and so on; boxes that fail are marked "definitely not a face" and dropped, the rest remain "might be a face".]
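A sketch of this early-exit idea: evaluate the boosted features one at a time and stop as soon as the running score falls below a rejection threshold. The threshold scheme below is illustrative, not the actual Viola–Jones cascade:

```python
def cascade_score(features, alphas, reject_thresholds, x):
    """Evaluate weighted features in order, exiting early on a confident 'no'.

    features: list of functions f_t(x) -> {-1, +1}
    alphas: per-feature weights from boosting
    reject_thresholds: if the partial score drops below reject_thresholds[t]
        after feature t, the box is declared "definitely not a face".
    Returns the final score, or None for an early rejection.
    """
    score = 0.0
    for f, alpha, thresh in zip(features, alphas, reject_thresholds):
        score += alpha * f(x)
        if score < thresh:
            return None          # definitely not a face; stop computing
    return score                 # might be a face
```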
Co-training (Blum and Mitchell 98)
[Diagram: a partially trained classifier based on raw B/W highway images and a partially trained classifier based on difference images each pass their confident predictions to the other as additional training labels.]
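A minimal sketch of the co-training loop in the diagram, assuming two sklearn-style classifiers over two views and a confidence threshold of my choosing:

```python
import numpy as np

def co_train(clf_a, clf_b, Xa_lab, Xb_lab, y_lab, Xa_unlab, Xb_unlab,
             rounds=10, threshold=0.9):
    """Co-training: two classifiers over two views teach each other.

    clf_a / clf_b: objects with fit(X, y) and predict_proba(X) (sklearn-style).
    Xa_*, Xb_*: the two views of the labeled and unlabeled data.
    Each round, every unlabeled example one classifier labels with confidence
    above `threshold` is added (with that label) to the other's training set.
    For simplicity, confident examples may be re-added in later rounds.
    """
    ya, yb = list(y_lab), list(y_lab)
    Xa, Xb = list(Xa_lab), list(Xb_lab)
    for _ in range(rounds):
        clf_a.fit(np.array(Xa), np.array(ya))
        clf_b.fit(np.array(Xb), np.array(yb))
        pa = clf_a.predict_proba(np.array(Xa_unlab))   # confidences from view A
        pb = clf_b.predict_proba(np.array(Xb_unlab))   # confidences from view B
        for i in range(len(Xa_unlab)):
            if pa[i].max() > threshold:                # A is confident -> teach B
                Xb.append(Xb_unlab[i]); yb.append(clf_a.classes_[pa[i].argmax()])
            if pb[i].max() > threshold:                # B is confident -> teach A
                Xa.append(Xa_unlab[i]); ya.append(clf_b.classes_[pb[i].argmax()])
    return clf_a, clf_b
```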
Co-Training Results (Levin, Freund, Viola 2002)
[Figure: performance of the raw-image detector and the difference-image detector, before and after co-training.]
Selective sampling
[Diagram: unlabeled data is run through a partially trained classifier; a sample of the unconfident examples is sent out to be labeled, and the resulting labeled examples are added to the training set.]
Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby
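A sketch of that loop, querying labels only for the examples the current classifier is least confident about (the sklearn-style model API, the labeling oracle, and the batch size are my own choices):

```python
import numpy as np

def selective_sampling(clf, X_lab, y_lab, X_pool, label_oracle,
                       rounds=5, batch=10):
    """Uncertainty-based selective sampling.

    clf: sklearn-style classifier with fit / predict_proba.
    label_oracle(x): returns the true label of x (e.g. a human annotator).
    Each round, the `batch` pool examples with the least confident
    predictions are labeled and moved into the training set.
    """
    X_lab, y_lab = list(X_lab), list(y_lab)
    pool = list(X_pool)
    for _ in range(rounds):
        clf.fit(np.array(X_lab), np.array(y_lab))
        proba = clf.predict_proba(np.array(pool))
        confidence = proba.max(axis=1)
        query = np.argsort(confidence)[:batch]        # least confident first
        for i in sorted(query, reverse=True):
            x = pool.pop(i)
            X_lab.append(x)
            y_lab.append(label_oracle(x))
    return clf
```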
Online learning Adapting to changes
Online learning
• So far, the only statistical assumption was that the data is generated IID. Can we get rid of that assumption?
• Yes, if we consider prediction as a repeated game.
• Suppose we have a set of experts; we believe one is good, but we don’t know which one.
• An expert is an algorithm that maps the past to a prediction.
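A minimal sketch of the classic multiplicative-weights ("Hedge" / weighted-majority) scheme for this experts setting; the talk's own algorithm and learning rate may differ:

```python
import numpy as np

def hedge(expert_predictions, outcomes, eta=0.5):
    """Multiplicative-weights prediction with expert advice.

    expert_predictions: (T, N) array -- expert i's prediction in {-1, +1} at time t.
    outcomes: (T,) array -- the true outcomes in {-1, +1}, revealed after predicting.
    At each step the learner predicts a weighted majority of the experts and
    then multiplies each expert's weight by exp(-eta * loss).
    Returns the learner's predictions and the final expert weights.
    """
    expert_predictions = np.asarray(expert_predictions)
    outcomes = np.asarray(outcomes)
    T, N = expert_predictions.shape
    w = np.ones(N)
    learner = np.empty(T)
    for t in range(T):
        p = expert_predictions[t]
        learner[t] = np.sign(np.dot(w, p))          # weighted-majority prediction
        losses = (p != outcomes[t]).astype(float)   # 0/1 loss of each expert
        w *= np.exp(-eta * losses)                  # down-weight mistaken experts
        w /= w.sum()
    return learner, w
```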