Learning Theory Put to Work Isabelle Guyon isabelle@clopinet.com
What is the process of Data Mining / Machine Learning? [Diagram: TRAINING DATA → Learning algorithm → Trained machine; Query → Trained machine → Answer.]
For which tasks ? • Classification (binary/categorical target) • Regression and time series prediction (continuous targets) • Clustering (targets unknown) • Rule discovery
For which applications ? [Chart: applications placed by number of training examples (10 to 10^6) vs. number of inputs (10 to 10^5): customer knowledge, market analysis, quality control, system diagnosis, OCR, HWR, machine vision, text categorization, bioinformatics.]
Banking / Telecom / Retail • Identify: • Prospective customers • Dissatisfied customers • Good customers • Bad payers • Obtain: • More effective advertising • Less credit risk • Less fraud • Decreased churn rate
Biomedical / Biometrics • Medicine: • Screening • Diagnosis and prognosis • Drug discovery • Security: • Face recognition • Signature / fingerprint / iris verification • DNA fingerprinting
Computer / Internet • Computer interfaces: • Troubleshooting wizards • Handwriting and speech • Brain waves • Internet • Hit ranking • Spam filtering • Text categorization • Text translation • Recommendation
From Statistics to Machine Learning… and back! • Old textbook statistics were descriptive: • Mean, variance • Confidence intervals • Statistical tests • Fit data, discover distributions (past data) • Machine learning (1960’s) is predictive : • Training / validation / test sets • Build robust predictive models (future data) • Learning theory (1990’s) : • Rigorous statistical framework for ML • Proper monitoring of fit vs. robustness
Some Learning Machines • Linear models • Polynomial models • Kernel methods • Neural networks • Decision trees
Conventions • Data matrix X = {xij}: m samples (customers / patients) × n attributes / features • Target vector y = {yj} • i-th sample: xi • Model parameters: weight vector w (or coefficients a)
Linear Models f(x) = Σj=1:n wj xj + b Linear discriminant (for classification): • F(x) = +1 if f(x) > 0 • F(x) = -1 if f(x) ≤ 0 LINEAR = WEIGHTED SUM
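The weighted sum and its sign can be sketched in a few lines (a minimal illustration assuming NumPy; the function names are ours, not from the slides):

```python
import numpy as np

def f(x, w, b):
    """Linear model: weighted sum of the inputs plus a bias."""
    return float(np.dot(w, x) + b)

def F(x, w, b):
    """Linear discriminant: class +1 if f(x) > 0, else class -1."""
    return 1 if f(x, w, b) > 0 else -1
```

For example, with w = (1, -1) and b = 0, the point (2, 1) falls on the positive side of the hyperplane and (0, 1) on the negative side.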
Non-linear models Linear models (artificial neurons) • f(x) = Σj=1:n wj xj + b Models non-linear in their inputs, but linear in their parameters • f(x) = Σj=1:N wj φj(x) + b (Perceptron) • f(x) = Σi=1:m αi k(xi, x) + b (Kernel method) Other non-linear models • Neural networks / multi-layer perceptrons • Decision trees
Linear Decision Boundary [Figure: hyperplane f(x) = 0 in the (x1, x2) plane, separating the half-space f(x) < 0 from f(x) > 0.]
NL Decision Boundary [Figure: non-linear boundary f(x) = 0 in the (x1, x2) plane, separating f(x) < 0 from f(x) > 0.]
Fit / Robustness Tradeoff [Figure: two decision boundaries in the (x1, x2) plane: one fitting the training data closely, one smoother and more robust.]
Performance Assessment

Compare F(x) = sign(f(x)) to the target y and tally the confusion (cost) matrix:

                       Predictions F(x)
                   Class -1      Class +1      Total
Truth: Class -1    tn            fp            neg = tn+fp
Truth: Class +1    fn            tp            pos = fn+tp
Total              rej = tn+fn   sel = fp+tp   m = tn+fp+fn+tp

• False alarm rate = fp/neg = type I error rate = 1 - specificity
• Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
• Precision = tp/sel
• Fraction selected = sel/m

Report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
• F measure = 2·precision·recall / (precision + recall)

Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
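The measures on this slide all derive from the four counts tp, fp, fn, tn; a small sketch computing them (the function and key names are illustrative):

```python
def assess(tp, fp, fn, tn):
    """Performance measures from confusion-matrix counts."""
    pos, neg = fn + tp, tn + fp           # actual positives / negatives
    sel = fp + tp                         # examples selected as positive
    m = tn + fp + fn + tp                 # total number of examples
    hit = tp / pos                        # sensitivity = recall
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / m,
        "hit_rate": hit,
        "false_alarm": fp / neg,          # 1 - specificity
        "BER": (fn / pos + fp / neg) / 2,
        "F": 2 * precision * hit / (precision + hit),
        "frac_selected": sel / m,
    }
```

With tp = 40, fp = 10, fn = 10, tn = 40, for instance, the error rate and BER are both 0.2 and the hit rate is 0.8.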
ROC Curve Patients diagnosed by putting a threshold on f(x); for a given threshold you get a point on the ROC curve. [Figure: Hit rate (sensitivity) vs. False alarm rate (1 - specificity); ideal ROC (AUC = 1), actual ROC, random ROC (AUC = 0.5); 0 ≤ AUC ≤ 1.]
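Sweeping the threshold over the scores f(x) and integrating the resulting curve can be sketched as follows (assuming NumPy; a plain trapezoidal rule stands in for a library AUC routine):

```python
import numpy as np

def roc_points(scores, y):
    """One (false-alarm, hit-rate) point per decision threshold on f(x)."""
    pos = np.sum(y == 1)
    neg = np.sum(y == -1)
    pts = [(0.0, 0.0)]
    for t in np.sort(np.unique(scores))[::-1]:   # thresholds, high to low
        pred = scores >= t
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == -1))
        pts.append((fp / neg, tp / pos))
    return pts

def auc(pts):
    """Area under the ROC curve via the trapezoidal rule."""
    xs, ys = zip(*pts)
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))
```

A perfect ranking (all positives scored above all negatives) gives AUC = 1; a fully inverted ranking gives AUC = 0; random ranking hovers around 0.5.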
Lift Curve Customers ranked according to f(x); selection of the top-ranking customers. [Figure: Hit rate (fraction of good customers selected) vs. fraction of customers selected; ideal lift, actual lift, random lift; Gini = 2·AUC - 1; 0 ≤ Gini ≤ 1.]
What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: • Classification: • Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi) • 1 - AUC (Gini index = 2·AUC - 1) • Regression: • Mean square error: (1/m) Σi=1:m (f(xi) - yi)²
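The two example risk functionals translate directly into code (a minimal sketch assuming NumPy; names are illustrative):

```python
import numpy as np

def error_rate(F_pred, y):
    """Classification risk: fraction of misclassified examples."""
    return float(np.mean(F_pred != y))

def mse(f_pred, y):
    """Regression risk: mean square error."""
    return float(np.mean((f_pred - y) ** 2))
```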
How to train? • Define a risk functional R[f(x,w)] • Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.) [Figure: risk R[f(x,w)] over the parameter space (w), with its minimum at w*.]
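As one concrete instance of the recipe, here is gradient descent on the mean-square-error risk of a linear model (a sketch only; learning rate and step count are arbitrary choices, not from the slides):

```python
import numpy as np

def train_linear(X, y, lr=0.1, steps=500):
    """Minimize R[f] = (1/m) sum_i (w.x_i + b - y_i)^2 by gradient descent."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        r = X @ w + b - y             # residuals f(x_i) - y_i
        w -= lr * (2 / m) * (X.T @ r)  # gradient of the risk w.r.t. w
        b -= lr * (2 / m) * r.sum()    # gradient of the risk w.r.t. b
    return w, b
```

On noise-free data generated by y = 2x + 1, the iteration recovers w ≈ 2 and b ≈ 1.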
Theoretical Foundations • Structural Risk Minimization • Regularization • Weight decay • Feature selection • Data compression Training powerful models, without overfitting
Ockham’s Razor • Principle proposed by William of Ockham in the fourteenth century: “Pluralitas non est ponenda sine necessitate”. • Of two theories providing similarly good predictions, prefer the simplest one. • Shave off unnecessary parameters of your models.
Risk Minimization • Learning problem: find the best function f(x; w) minimizing a risk functional R[f] = ∫ L(f(x; w), y) dP(x, y), where L is the loss function and P(x, y) the unknown data distribution. • Examples are given: (x1, y1), (x2, y2), … (xm, ym)
Approximations of R[f] • Empirical risk: Rtrain[f] = (1/m) Σi=1:m L(f(xi; w), yi) • 0/1 loss 1(F(xi) ≠ yi): Rtrain[f] = error rate • square loss (f(xi) - yi)²: Rtrain[f] = mean square error • Guaranteed risk: with high probability (1 - δ), R[f] ≤ Rgua[f], where Rgua[f] = Rtrain[f] + ε(δ, C)
Structural Risk Minimization (Vapnik, 1974) Nested subsets of models of increasing complexity/capacity: S1 ⊂ S2 ⊂ … ⊂ SN [Figure: training error Tr, capacity term ε(C), and guaranteed risk Gua = Tr + ε(C) as functions of model complexity/capacity C; the guaranteed risk is minimized at an intermediate complexity.]
SRM Example [Figure: nested subsets S1 ⊂ S2 ⊂ … ⊂ SN of increasing capacity.] • Rank with ||w||² = Σi wi²: Sk = { w | ||w||² < wk² }, w1 < w2 < … < wk • Minimization under constraint: min Rtrain[f] s.t. ||w||² < wk² • Lagrangian: Rreg[f, γ] = Rtrain[f] + γ ||w||²
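For a linear model with square loss, the Lagrangian above is ridge regression, which even has a closed-form minimizer; a sketch assuming NumPy (here Rtrain is taken as the sum of squared errors, which only rescales γ):

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize sum_i (w.x_i - y_i)^2 + gamma * ||w||^2 in closed form."""
    n = X.shape[1]
    # Normal equations of the regularized risk: (X'X + gamma I) w = X'y
    return np.linalg.solve(X.T @ X + gamma * np.eye(n), X.T @ y)
```

With γ = 0 this reduces to ordinary least squares; increasing γ shrinks ||w||, i.e. moves the solution into a smaller subset Sk.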
Multiple Structures • Shrinkage (weight decay, ridge regression, SVM): Sk = { w | ||w||² < wk }, w1 < w2 < … < wk; γ1 > γ2 > γ3 > … > γk (γ is the ridge) • Feature selection: Sk = { w | ||w||0 < sk }, s1 < s2 < … < sk (s is the number of features) • Data compression: k1 < k2 < … < kk (k may be the number of clusters)
Hyper-parameter Selection [Figure: data (X, y) split into training data (K folds) and test data; prospective study / “real” validation.] • Learning = adjusting: parameters (w vector) and hyper-parameters (γ, s, k). • Cross-validation with K folds: for various values of γ, s, k: - Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10th. - Test on the 1/K remaining examples, e.g. 1/10th. - Rotate examples and average test results (CV error). - Select γ, s, k to minimize the CV error. - Re-compute w on all training examples using the optimal γ, s, k.
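The K-fold loop on this slide can be sketched for the ridge hyper-parameter γ (assuming NumPy; the closed-form ridge fit stands in for any learning algorithm, and the names are illustrative):

```python
import numpy as np

def cross_validate(X, y, gammas, K=5):
    """Pick the gamma minimizing K-fold CV error, then re-fit on all data."""
    m = X.shape[0]
    folds = np.array_split(np.arange(m), K)

    def fit(Xt, yt, g):  # ridge regression in closed form
        return np.linalg.solve(Xt.T @ Xt + g * np.eye(Xt.shape[1]), Xt.T @ yt)

    cv_err = []
    for g in gammas:
        errs = []
        for k in range(K):
            val = folds[k]                            # held-out 1/K of the data
            trn = np.setdiff1d(np.arange(m), val)     # remaining (K-1)/K
            w = fit(X[trn], y[trn], g)
            errs.append(np.mean((X[val] @ w - y[val]) ** 2))
        cv_err.append(np.mean(errs))                  # rotate folds and average
    best = gammas[int(np.argmin(cv_err))]
    return best, fit(X, y, best)                      # re-fit with optimal gamma
```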
Summary • SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity. • Multiple structures may be used to control the model capacity, including: feature selection, data compression, ridge regression.
KXEN (simplified) architecture [Diagram: inputs (x, y) and hyper-parameters (k, s) feed a pipeline of data preparation and encoding, a class of models, and a learning algorithm; output: trained parameters (w, a).]
KXEN: SRM put to work Customers ranked according to f(x); selection of the top-ranking customers. [Figure: lift curves, fraction of good customers selected vs. fraction of customers selected: ideal lift, CV lift, training lift, test lift, random lift; Gini index G.]
Want to Learn More? • Statistical Learning Theory, V. Vapnik. Theoretical book; the reference on generalization, VC dimension, Structural Risk Minimization, SVMs. ISBN 0471030031. • Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html • The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/ • Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book