A Black-Box approach to machine learning Yoav Freund
Why do we need learning? • Computers need functions that map highly variable data: • Speech recognition: Audio signal -> words • Image analysis: Video signal -> objects • Bio-Informatics: Micro-array Images -> gene function • Data Mining: Transaction logs -> customer classification • For accuracy, functions must be tuned to fit the data source. • For real-time processing, function computation has to be very fast.
The complexity/accuracy tradeoff (figure: error vs. complexity, with the trivial-performance baseline marked)
The speed/flexibility tradeoff (figure: flexibility vs. speed, ranging from Matlab code and Java code down to machine code, digital hardware, and analog hardware)
Theory vs. Practice • Theoretician: I want a polynomial-time algorithm which is guaranteed to perform arbitrarily well in “all” situations. - I prove theorems. • Practitioner: I want a real-time algorithm that performs well on my problem. - I experiment. • My approach: I want combining algorithms whose performance and speed are guaranteed relative to the performance and speed of their components. - I do both.
Plan of talk • The black-box approach • Boosting • Alternating decision trees • A commercial application • Boosting the margin • Confidence rated predictions • Online learning
The black-box approach • Statistical models are not generators, they are predictors. • A predictor is a function from observation X to action Z. • After the action is taken, outcome Y is observed, which implies a loss L (a real-valued number). • Goal: find a predictor with small loss (in expectation, with high probability, cumulative…)
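A minimal sketch of this black-box view in Python; the helper name, the toy predictor, and the data below are hypothetical illustrations, not part of the talk:

```python
def expected_loss(predictor, loss, samples):
    """Average loss of a predictor (observation -> action) over (observation, outcome) pairs."""
    return sum(loss(predictor(x), y) for x, y in samples) / len(samples)

# Hypothetical usage: a threshold predictor, a 0/1 loss, and toy data.
predictor = lambda x: +1 if x > 0.5 else -1
zero_one_loss = lambda z, y: float(z != y)
samples = [(0.2, -1), (0.7, +1), (0.9, -1)]
print(expected_loss(predictor, zero_one_loss, samples))   # 1/3: one of three outcomes mispredicted
```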
Main software components (diagram): a learner takes training examples and produces a predictor; the predictor maps an observation x to an action z. We assume the predictor will be applied to examples similar to those on which it was trained.
Learning in a system (diagram): the learning system receives sensor data and feedback from the target system; from training examples it produces a predictor that maps sensor data to actions.
Special case: classification. Observation X - arbitrary (measurable) space. Outcome Y - finite set {1,…,K}. Prediction Z - {1,…,K}. Usually K=2 (binary classification).
Batch learning for binary classification • Data distribution: examples (x, y) are drawn i.i.d. from a distribution D over X × {−1,+1}. • Training set: (x1, y1), …, (xm, ym) drawn from D. • Generalization error of a rule h: the probability over (x, y) ~ D that h(x) ≠ y. • Training error: the fraction of training examples with h(xi) ≠ yi.
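A small, hedged sketch of the training-error definition above (the function name and array layout are assumptions):

```python
import numpy as np

def training_error(h, X, y):
    """Fraction of the training set on which the rule h disagrees with the label."""
    predictions = np.array([h(x) for x in X])
    return float(np.mean(predictions != np.asarray(y)))
```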
Boosting Combining weak learners
A weighted training set: feature vectors x1,…,xn; binary labels yi in {-1,+1}; positive weights w1,…,wn.
A weak learner: given a weighted training set, it returns a weak rule h mapping instances to predictions. The weak requirement: the weighted training error of h must be slightly better than random guessing, i.e. at most 1/2 - γ for some advantage γ > 0.
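A hedged sketch of one common choice of weak learner, an axis-aligned decision stump searched exhaustively over a weighted training set (the function name and data layout are assumptions, not code from the talk):

```python
import numpy as np

def best_stump(X, y, w):
    """Search all axis-aligned threshold rules ("decision stumps") and return the one
    with the smallest weighted training error.
    X: (n, d) feature matrix; y: labels in {-1, +1}; w: positive weights summing to 1."""
    n, d = X.shape
    best_rule, best_err = None, np.inf
    for j in range(d):                                   # each feature
        for thresh in np.unique(X[:, j]):                # each candidate threshold
            for sign in (+1, -1):                        # predicted sign above the threshold
                pred = np.where(X[:, j] > thresh, sign, -sign)
                err = float(np.sum(w[pred != np.asarray(y)]))   # weighted error of this stump
                if err < best_err:
                    best_err, best_rule = err, (j, thresh, sign)
    j, thresh, sign = best_rule
    h = lambda Xq, j=j, t=thresh, s=sign: np.where(np.asarray(Xq)[:, j] > t, s, -s)
    return h, best_err
```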
The boosting process: start from uniform weights (x1, y1, 1/n), …, (xn, yn, 1/n) and call the weak learner to get h1; reweight the examples to (x1, y1, w1), …, (xn, yn, wn) and call the weak learner again to get h2, then h3, h4, …, hT; the final rule combines h1, …, hT, as sketched below.
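A minimal sketch of that loop, assuming the standard AdaBoost reweighting rule; `weak_learner` is any routine with the interface of the stump search above, and all names are illustrative:

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Boosting loop: start from uniform weights 1/n, repeatedly call the weak learner,
    reweight the examples, and combine the weak rules into a weighted-majority rule."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # (x_1, y_1, 1/n), ..., (x_n, y_n, 1/n)
    rules, alphas = [], []
    for t in range(T):
        h, err = weak_learner(X, y, w)            # weak rule h_t and its weighted error
        err = float(np.clip(err, 1e-12, 1 - 1e-12))
        alpha = 0.5 * np.log((1 - err) / err)     # weight of h_t in the final rule
        w = w * np.exp(-alpha * y * h(X))         # up-weight examples h_t got wrong
        w = w / w.sum()
        rules.append(h)
        alphas.append(alpha)
    def final_rule(Xq):
        return np.sign(sum(a * h(Xq) for a, h in zip(alphas, rules)))
    return final_rule, rules, alphas
```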
Main property of AdaBoost: if the advantages of the weak rules over random guessing are γ1, γ2, …, γT, then the training error of the final rule is at most the product over t of sqrt(1 - 4γt²), which is at most exp(-2 Σt γt²).
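A quick numeric illustration of this bound with made-up numbers (100 rounds, each with advantage 0.1):

```python
import numpy as np

# Hypothetical numbers: T = 100 rounds, each weak rule with advantage gamma_t = 0.1
# (i.e. weighted error 0.4).
gammas = np.full(100, 0.1)
exact_bound = np.prod(np.sqrt(1 - 4 * gammas**2))     # ~0.130
loose_bound = np.exp(-2 * np.sum(gammas**2))          # exp(-2) ~ 0.135
print(exact_bound, loose_bound)
```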
Boosting block diagram: the booster sends example weights to the weak learner, the weak learner returns a weak rule, and repeating this turns the weak learner into a strong learner that outputs an accurate rule.
What is a good weak learner? The set of weak rules (features) should be: • flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label; • simple enough to allow efficient search for a rule with non-trivial weighted training error; • small enough to avoid overfitting. Calculation of the prediction from the observations should be very fast.
Alternating decision trees Freund, Mason 1997
Decision Trees (figure): a tree that first tests X>3 and then Y>5, splitting the (X, Y) plane into rectangular regions labeled +1 or -1.
A decision tree as a sum of weak rules (figure): the same tree rewritten so that each node contributes a real-valued score (-0.2, +0.1, +0.2, …) and the prediction is the sign of the sum of the scores along the path an example follows.
An alternating decision tree (figure): prediction nodes carrying real values (+0.2, -0.1, +0.7, …) alternate with decision nodes (X>3, Y>5, Y<1); an example adds up the values of every prediction node it reaches, and the predicted label is the sign of the total.
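A hedged sketch of how such a tree is evaluated: every reachable prediction node contributes its value, and the label is the sign of the total. The node structure and values below are illustrative, not the tree in the figure:

```python
def adt_score(x, nodes):
    """nodes: list of (precondition, condition, value_if_true, value_if_false)."""
    score = 0.0
    for precondition, condition, v_true, v_false in nodes:
        if precondition(x):                      # only reachable nodes contribute
            score += v_true if condition(x) else v_false
    return score

# Hypothetical two-node tree over features x = (X, Y):
nodes = [
    (lambda x: True,      lambda x: x[0] > 3, +0.4, -0.3),   # X > 3 ?
    (lambda x: x[0] > 3,  lambda x: x[1] > 5, +0.2, -0.1),   # Y > 5, only reached if X > 3
]
print(adt_score((4.0, 6.0), nodes))   # 0.4 + 0.2 = 0.6 -> predict +1
```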
Example: Medical Diagnostics • Cleve dataset from UC Irvine database. • Heart disease diagnostics (+1=healthy,-1=sick) • 13 features from tests (real valued and discrete). • 303 instances.
AD-tree for heart-disease diagnostics (figure): a total score > 0 predicts healthy, < 0 predicts sick.
AT&T “buisosity” problem Freund, Mason, Rogers, Pregibon, Cortes 2000 • Distinguish business/residence customers from call detail information (time of day, length of call, …). • 230M telephone numbers, label unknown for ~30%. • 260M calls / day. • Required computer resources: • Huge: counting log entries to produce statistics -- use specialized I/O-efficient sorting algorithms (Hancock). • Significant: calculating the classification for ~70M customers. • Negligible: learning (2 hours on 10K training examples on an off-line computer).
Quantifiable results (precision/recall figure: accuracy vs. score). • For accuracy 94%, increased coverage from 44% to 56%. • Saved AT&T $15M in the year 2000 in operations costs and missed opportunities.
Adaboost's resistance to overfitting: why statisticians find Adaboost interesting.
A very curious phenomenon (figure: boosting decision trees). Using <10,000 training examples we fit >2,000,000 parameters.
Large margins. The margin of an example (x, y) is its label times the normalized weighted vote of the weak rules, y · (Σt αt ht(x)) / (Σt αt). Thesis: large margins => reliable predictions. Very similar to SVM.
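A small sketch of computing these normalized margins from the output of a boosting run like the one sketched earlier; names and interfaces are assumptions:

```python
import numpy as np

def normalized_margins(X, y, rules, alphas):
    """Margin of each example: y * (sum_t alpha_t * h_t(x)) / (sum_t |alpha_t|).
    Values near +1 mean a large, unanimous correct vote; values <= 0 mean a mistake."""
    alphas = np.asarray(alphas, dtype=float)
    votes = sum(a * h(X) for a, h in zip(alphas, rules))
    return np.asarray(y) * votes / np.sum(np.abs(alphas))
```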
Theorem (Schapire, Freund, Bartlett & Lee, Annals of Statistics 1998). Let H be a set of binary functions with VC-dimension d. With high probability over a training set of size m, for every margin threshold θ > 0 the generalization error of the combined rule is at most the fraction of training examples with margin ≤ θ, plus a term of order sqrt(d / (m θ²)) (up to logarithmic factors). No dependence on the number of combined functions!!!
Confidence-rated predictions: agreement gives confidence.
A motivating example (figure): a scatter of + and - training examples with a few query points “?”; in the regions where the two classes are heavily mixed, the desired answer is “Unsure” rather than a forced prediction.
The algorithm (Freund, Mansour, Schapire 2001): defined by its parameters, a hypothesis weight, an empirical log ratio, and a prediction rule.
Suggested tuning Suppose H is a finite set. Yields:
Confidence rating block diagram: training examples and candidate rules feed a rater/combiner, which outputs a confidence-rated rule.
Face Detection Viola & Jones 1999 • Paul Viola and Mike Jones developed a face detector that can work in real time (15 frames per second).
Using confidence to save time (cascade figure): all boxes are checked against feature 1, then feature 2, and so on; boxes that are confidently “definitely not a face” are discarded immediately, and only the surviving “might be a face” boxes see further features. The detector combines 6000 simple features using Adaboost. In most boxes, only 8-9 features are calculated.
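A minimal sketch of this “reject early” idea (not the actual Viola-Jones implementation); the stage functions and thresholds are placeholders:

```python
def cascade_classify(box, stages):
    """stages: list of (score_function, rejection_threshold), ordered cheap -> expensive."""
    for score, threshold in stages:
        if score(box) < threshold:
            return "definitely not a face"    # stop early: no further features computed
    return "might be a face"                  # survived all stages
```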
Co-training (Blum and Mitchell 98) (diagram): raw B/W highway images and difference images feed two partially trained classifiers, one based on the B/W image and one based on the difference image; each classifier hands its confident predictions to the other as additional training labels.
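A hedged sketch of a generic co-training loop in the spirit of the diagram; the classifier interface (.predict, .confidence) and all parameter names are assumptions, not the system from the talk:

```python
def co_train(train_a, train_b, labeled, unlabeled, rounds=5, k=10):
    """train_a / train_b: functions that take a list of (example, label) pairs and return a
    classifier with .predict(x) and .confidence(x), each looking at a different "view" of the
    example (e.g. raw image vs. difference image). Each round, every classifier labels the k
    unlabeled examples it is most confident about, and those become shared training data."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for train in (train_a, train_b):
            clf = train(labeled)
            # move the k most-confident unlabeled examples into the labeled pool
            confident = sorted(unlabeled, key=clf.confidence, reverse=True)[:k]
            for x in confident:
                labeled.append((x, clf.predict(x)))
                unlabeled.remove(x)
    return train_a(labeled), train_b(labeled)
```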
Co-training results (Levin, Freund, Viola 2002) (figure): detection performance of the raw image detector and the difference image detector, before and after co-training.
Selective sampling (diagram): a partially trained classifier is applied to unlabeled data; a sample of the examples it is unconfident about is sent out for labels, and the resulting labeled examples are added to the training set. (Query-by-committee: Seung, Opper & Sompolinsky; Freund, Seung, Shamir & Tishby.)
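A hedged sketch of committee-based selective sampling: request labels for the unlabeled examples on which a committee of partially trained classifiers disagrees most. All names are illustrative:

```python
import numpy as np

def select_queries(unlabeled, committee, budget):
    """committee: list of classifiers, each mapping an example to a label in {-1, +1}.
    Returns the `budget` unlabeled examples with the highest committee disagreement."""
    def disagreement(x):
        votes = np.array([clf(x) for clf in committee])
        return 1.0 - abs(votes.mean())            # 0 = unanimous, 1 = evenly split
    return sorted(unlabeled, key=disagreement, reverse=True)[:budget]
```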
Online learning Adapting to changes
Online learning. So far, the only statistical assumption was that the data is generated IID. Can we get rid of that assumption? Yes, if we treat prediction as a repeated game. Suppose we have a set of experts; we believe one is good, but we don't know which one. An expert is an algorithm that maps the past to a prediction.
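A minimal sketch of one standard way to play this game, multiplicative weights over the experts (Hedge-style); the learning rate and data layout are assumptions, not the specific algorithm from the talk:

```python
import numpy as np

def hedge_weights(expert_losses, eta=0.5):
    """expert_losses[t, i]: loss of expert i at round t (assumed in [0, 1]).
    Maintains one weight per expert; experts that suffer loss shed weight exponentially,
    so a single good expert quickly dominates the combined prediction."""
    T, N = expert_losses.shape
    w = np.ones(N) / N
    history = []
    for t in range(T):
        history.append(w.copy())                  # weights used for round t
        w = w * np.exp(-eta * expert_losses[t])   # multiplicative update
        w = w / w.sum()
    return np.array(history)
```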