Lecture 1: Introduction to Machine Learning


Presentation Transcript


  1. Lecture 1: Introduction to Machine Learning Isabelle Guyon isabelle@clopinet.com

  2. What is Machine Learning? [Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; the trained machine takes a query and returns an answer.]

  3. What for? • Classification • Time series prediction • Regression • Clustering

  4. Applications [Figure: application domains placed by number of inputs (10 to 10^5) and number of training examples (10 to 10^5): Market Analysis, Ecology, OCR, HWR, Machine Vision, Text Categorization, System diagnosis, Bioinformatics]

  5. Banking / Telecom / Retail • Identify: prospective customers; dissatisfied customers; good customers; bad payers • Obtain: more effective advertising; less credit risk; less fraud; decreased churn rate

  6. Biomedical / Biometrics • Medicine: screening; diagnosis and prognosis; drug discovery • Security: face recognition; signature / fingerprint / iris verification; DNA fingerprinting

  7. Computer / Internet • Computer interfaces: troubleshooting wizards; handwriting and speech; brain waves • Internet: hit ranking; spam filtering; text categorization; text translation; recommendation

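  8. Conventions [Diagram: data matrix $X = \{x_{ij}\}$ with $m$ lines and $n$ columns; one line is a pattern $x_i$; target vector $y = \{y_j\}$; $\alpha$ and $w$ denote the weight vectors used in later slides]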

  9. Learning problem • Data matrix $X$: $m$ lines = patterns (data points, examples): samples, patients, documents, images, … • $n$ columns = features (attributes, input variables): genes, proteins, words, pixels, … • Unsupervised learning: is there structure in the data? • Supervised learning: predict an outcome $y$. [Example data: colon cancer, Alon et al 1999]

  10. Some Learning Machines • Linear models • Kernel methods • Neural networks • Decision trees

  11. Linear Models • $f(x) = w \cdot x + b = \sum_{j=1}^{n} w_j x_j + b$ • Linearity in the parameters, NOT in the input components: • $f(x) = w \cdot \Phi(x) + b = \sum_j w_j \phi_j(x) + b$ (Perceptron) • $f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b$ (Kernel method)
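
The three forms on slide 11 can be sketched in a few lines of Python. This is only an illustration: the quadratic basis `phi`, the default dot-product kernel, and all function names are assumptions made for the example, not part of the lecture.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def linear(x, w, b):
    """f(x) = w . x + b"""
    return dot(w, x) + b

def phi(x):
    """Hypothetical basis functions: the inputs followed by their squares."""
    return list(x) + [xi * xi for xi in x]

def perceptron_form(x, w, b):
    """f(x) = w . Phi(x) + b: linear in w, but not in the components of x."""
    return dot(w, phi(x)) + b

def kernel_form(x, X_train, alpha, b, k=dot):
    """f(x) = sum_i alpha_i k(x_i, x) + b, one weight per training pattern."""
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b
```

Note that `perceptron_form` has one weight per basis function, while `kernel_form` has one weight per training pattern.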

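  12. Artificial Neurons (McCulloch and Pitts, 1943) [Diagram: inputs $x_1, \dots, x_n$ weighted by $w_1, \dots, w_n$, plus a bias $b$ on a constant input 1, are summed ($\Sigma$) to give $f(x) = w \cdot x + b$. Biological labels: dendrites, synapses, cell potential, activation function, axon, activation of other neurons.]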

  13. Linear Decision Boundary [Figure: a linear boundary in 2-D (axes $x_1$, $x_2$) and a hyperplane in 3-D (axes $x_1$, $x_2$, $x_3$)]

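  14. Perceptron (Rosenblatt, 1957) [Diagram: inputs $x_1, \dots, x_n$ pass through basis functions $\phi_1(x), \dots, \phi_N(x)$, which are weighted by $w_1, \dots, w_N$ and summed ($\Sigma$) with a bias $b$ on a constant input 1 to give $f(x) = w \cdot \Phi(x) + b$]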

  15. Nonlinear (NL) Decision Boundary [Figure: a nonlinear boundary in 2-D (axes $x_1$, $x_2$) and in 3-D (axes $x_1$, $x_2$, $x_3$)]

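  16. Kernel Method (potential functions, Aizerman et al 1964) • $f(x) = \sum_i \alpha_i k(x_i, x) + b$ • $k(\cdot, \cdot)$ is a similarity measure or “kernel”. [Diagram: inputs $x_1, \dots, x_n$ feed kernel units $k(x_1, x), k(x_2, x), \dots, k(x_m, x)$, which are weighted by $\alpha_1, \dots, \alpha_m$ and summed ($\Sigma$) with a bias $b$ on a constant input 1]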

  17. What is a Kernel? A kernel is: • a similarity measure • a dot product in some feature space: $k(s, t) = \Phi(s) \cdot \Phi(t)$ But we do not need to know the $\Phi$ representation. Examples: • $k(s, t) = \exp(-\|s - t\|^2 / \sigma^2)$ (Gaussian kernel) • $k(s, t) = (s \cdot t)^q$ (Polynomial kernel)
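
The two example kernels on slide 17 might be written as follows. This is a minimal sketch; the function names and the default values `sigma=1.0` and `q=2` are assumptions chosen for illustration.

```python
import math

def gaussian_kernel(s, t, sigma=1.0):
    """k(s, t) = exp(-||s - t||^2 / sigma^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(s, t))
    return math.exp(-sq_dist / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    """k(s, t) = (s . t)^q"""
    return sum(a * b for a, b in zip(s, t)) ** q
```

Both behave as similarity measures: the Gaussian kernel equals 1 when `s == t` and decays toward 0 as the points move apart.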

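  18. Hebb’s Rule • $w_j \leftarrow w_j + y_i x_{ij}$ • Link to “Naïve Bayes” [Diagram: input $x_j$ reaches the summing unit ($\Sigma$) through a dendrite and a synapse of weight $w_j$; the output $y$ travels along the axon to activate another neuron]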
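
Hebb's rule from slide 18 could be sketched as below, assuming zero initial weights and a single pass over the patterns. Both are assumptions, since the slide only states the update itself, and `hebb_train` is a hypothetical name.

```python
def hebb_train(X, y):
    """Apply w_j <- w_j + y_i * x_ij once for every pattern i and feature j."""
    w = [0.0] * len(X[0])
    for xi, yi in zip(X, y):
        for j, xij in enumerate(xi):
            w[j] += yi * xij
    return w
```

After one pass the weights equal the sum over patterns of `y_i * x_i`.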

  19. Kernel “Trick” (for Hebb’s rule) • Hebb’s rule for the Perceptron: $w = \sum_i y_i \Phi(x_i)$, so $f(x) = w \cdot \Phi(x) = \sum_i y_i \Phi(x_i) \cdot \Phi(x)$ • Define a dot product: $k(x_i, x) = \Phi(x_i) \cdot \Phi(x)$, so $f(x) = \sum_i y_i k(x_i, x)$

  20. Kernel “Trick” (general) Dual forms: • $f(x) = \sum_i \alpha_i k(x_i, x)$ with $k(x_i, x) = \Phi(x_i) \cdot \Phi(x)$ • $f(x) = w \cdot \Phi(x)$ with $w = \sum_i \alpha_i \Phi(x_i)$
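
The equivalence of the two forms on slide 20 can be checked numerically. The sketch below assumes a degree-2 polynomial kernel on 2-D inputs, for which an explicit feature map is known; the feature map, the toy data, and all names are illustrative assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map whose dot product equals (s . t)^2 in 2-D."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def k(s, t):
    """Degree-2 polynomial kernel, computed without using phi."""
    return dot(s, t) ** 2

def f_dual(x, X_train, alpha):
    """f(x) = sum_i alpha_i k(x_i, x)"""
    return sum(a * k(xi, x) for a, xi in zip(alpha, X_train))

def primal_weights(X_train, alpha):
    """w = sum_i alpha_i Phi(x_i)"""
    w = [0.0, 0.0, 0.0]
    for a, xi in zip(alpha, X_train):
        w = [wj + a * pj for wj, pj in zip(w, phi(xi))]
    return w

def f_primal(x, w):
    """f(x) = w . Phi(x)"""
    return dot(w, phi(x))

# Made-up patterns and weights, only to compare the two forms.
X_train = [[1.0, 2.0], [0.0, 1.0], [-1.0, 1.0]]
alpha = [0.5, -1.0, 2.0]
x_query = [3.0, -1.0]
w = primal_weights(X_train, alpha)
```

`f_dual(x_query, X_train, alpha)` and `f_primal(x_query, w)` agree up to floating-point rounding, although the dual form never evaluates `phi`.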

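  21. Simple Kernel Methods • Primal form: $f(x) = w \cdot \Phi(x)$, with $w = \sum_i \alpha_i \Phi(x_i)$ • Dual form: $f(x) = \sum_i \alpha_i k(x_i, x)$, with $k(x_i, x) = \Phi(x_i) \cdot \Phi(x)$ • Perceptron algorithm (Rosenblatt 1958): $w \leftarrow w + y_i \Phi(x_i)$ if $y_i f(x_i) < 0$; dual: Potential Function algorithm (Aizerman et al 1964): $\alpha_i \leftarrow \alpha_i + y_i$ if $y_i f(x_i) < 0$ • Minover (optimum margin) (Krauth-Mézard 1987): $w \leftarrow w + y_i \Phi(x_i)$ for the pattern with minimum $y_i f(x_i)$; dual: Dual minover: $\alpha_i \leftarrow \alpha_i + y_i$ for the pattern with minimum $y_i f(x_i)$ (ancestor of SVM 1992, similar to kernel Adatron, 1998, and SMO, 1999) • LMS regression: $w \leftarrow w + \eta (y_i - f(x_i)) \Phi(x_i)$; dual: Dual LMS: $\alpha_i \leftarrow \alpha_i + \eta (y_i - f(x_i))$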
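
As an illustration of the Potential Function algorithm on slide 21, here is a small sketch run on a made-up XOR-style dataset with a degree-2 polynomial kernel. Two things are assumptions rather than part of the slide: the update test uses `<=` instead of `<`, so that the all-zero starting point triggers updates, and the stopping rule ends training after an epoch with no updates.

```python
def poly_kernel(s, t, q=2):
    """k(s, t) = (s . t)^q"""
    return sum(a * b for a, b in zip(s, t)) ** q

def f(x, X, alpha, k):
    """Dual form: f(x) = sum_i alpha_i k(x_i, x)"""
    return sum(a * k(xi, x) for a, xi in zip(alpha, X))

def potential_function_train(X, y, k, max_epochs=100):
    """alpha_i <- alpha_i + y_i whenever pattern i is not classified correctly."""
    alpha = [0.0] * len(X)
    for _ in range(max_epochs):
        updates = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            # "<=" rather than the slide's "<": with alpha = 0, f is 0 everywhere.
            if yi * f(xi, X, alpha, k) <= 0:
                alpha[i] += yi
                updates += 1
        if updates == 0:  # assumed stopping rule: a full pass with no mistakes
            break
    return alpha

# XOR-like toy problem: not linearly separable in the original inputs.
X = [[1, 1], [-1, -1], [1, -1], [-1, 1]]
y = [1, 1, -1, -1]
alpha = potential_function_train(X, y, poly_kernel)
```

On this toy data the loop stops after two passes, and every training pattern ends up with `y_i * f(x_i) > 0`, which a linear model on the raw inputs cannot achieve.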

  22. Multi-Layer Perceptron (back-propagation, Rumelhart et al, 1986) [Diagram: inputs $x_j$ feed summing units ($\Sigma$), the “hidden units” or internal “latent” variables, whose outputs feed a further summing unit]

  23. Chessboard Problem

  24. Tree Classifiers: CART (Breiman, 1984) or C4.5 (Quinlan, 1993) • At each step, choose the feature that “reduces entropy” most. • Work towards “node purity”. [Figure: all the data plotted against features $f_1$ and $f_2$, with splits labeled “Choose $f_2$” and “Choose $f_1$”]
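
The entropy criterion on slide 24 could be sketched as follows. This is not CART or C4.5 themselves, only the feature-selection step for a single split, assuming numeric features, a fixed hypothetical threshold of 0.5, and Shannon entropy in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, feature, threshold):
    """Weighted entropy of the two child nodes after splitting on one feature."""
    left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)

def best_feature(rows, labels, threshold=0.5):
    """Pick the feature whose split leaves the least entropy (purest nodes)."""
    return min(range(len(rows[0])),
               key=lambda j: split_entropy(rows, labels, j, threshold))

# Made-up data: feature 0 separates the classes, feature 1 does not.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = ["a", "a", "b", "b"]
```

Here `best_feature(rows, labels)` selects feature 0, because splitting on it yields two pure nodes.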

  25. Iris Data (Fisher, 1936) [Figure from Norbert Jankowski and Krzysztof Grabczewski: the classes setosa, versicolor, and virginica under four classifiers: linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM)]

  26. Fit / Robustness Tradeoff [Figure: two panels with axes $x_1$, $x_2$]

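  27. Performance evaluation [Figure: two panels with axes $x_1$, $x_2$, each showing the boundary $f(x) = 0$ between the regions $f(x) < 0$ and $f(x) > 0$]

  28. Performance evaluation [Figure: the same two panels with the threshold moved: boundary $f(x) = -1$ between the regions $f(x) < -1$ and $f(x) > -1$]

  29. Performance evaluation [Figure: the same two panels with the threshold moved: boundary $f(x) = 1$ between the regions $f(x) < 1$ and $f(x) > 1$]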

  30. ROC Curve • For a given threshold on $f(x)$, you get a point on the ROC curve. [Figure: positive class success rate (hit rate, sensitivity), 0 to 100%, plotted against 1 - negative class success rate (false alarm rate, 1 - specificity), 0 to 100%; curves shown: ideal ROC, actual ROC, random ROC]

  31. ROC Curve • For a given threshold on $f(x)$, you get a point on the ROC curve. • $0 \le \mathrm{AUC} \le 1$ [Figure: positive class success rate (hit rate, sensitivity), 0 to 100%, plotted against 1 - negative class success rate (false alarm rate, 1 - specificity), 0 to 100%; curves shown: ideal ROC (AUC = 1), actual ROC, random ROC (AUC = 0.5)]
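
The threshold sweep described on the ROC slides can be sketched as below. The sketch assumes labels coded as +1 / -1, at least one example of each class, and no tied scores; the trapezoidal area and the toy scores are illustrative choices, not taken from the lecture.

```python
def roc_points(scores, labels):
    """Lower the threshold on f(x) one example at a time.

    Returns (false alarm rate, hit rate) pairs, starting at (0, 0).
    """
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1  # one more positive example above the threshold
        else:
            fp += 1  # one more negative example above the threshold
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up classifier outputs f(x) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, -1, 1, -1, -1]
points = roc_points(scores, labels)
```

For these scores, 8 of the 9 positive/negative pairs are ranked correctly, and `auc(points)` gives the same fraction, 8/9.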

  32. What is a Risk Functional? A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task. Examples: • Classification: error rate $(1/m) \sum_{i=1}^{m} \mathbf{1}(F(x_i) \neq y_i)$; $1 - \mathrm{AUC}$ • Regression: mean square error $(1/m) \sum_{i=1}^{m} (f(x_i) - y_i)^2$
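
The two empirical risks on slide 32 are short enough to write out directly. In this sketch the function names are assumptions, and the predictions are passed in as lists rather than computed from a model.

```python
def error_rate(predicted, targets):
    """(1/m) * number of patterns whose predicted class differs from y_i."""
    return sum(1 for p, t in zip(predicted, targets) if p != t) / len(targets)

def mean_square_error(outputs, targets):
    """(1/m) * sum_i (f(x_i) - y_i)^2"""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)
```

For example, `error_rate([1, -1, 1, 1], [1, 1, 1, -1])` is 0.5, since two of the four predictions are wrong.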

  33. How to train? • Define a risk functional $R[f(x,w)]$ • Optimize it w.r.t. $w$ (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.) [Figure: $R[f(x,w)]$ plotted over the parameter space ($w$), with $w^*$ marked] (… to be continued in the next lecture)
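
One of the optimizers listed on slide 33, gradient descent, can be sketched on the mean square error of a one-dimensional linear model $f(x) = wx + b$. The learning rate, step count, zero initialization, and toy data are all assumptions made for this example.

```python
def risk(w, b, xs, ys):
    """Mean square error of f(x) = w * x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, b, xs, ys):
    """Partial derivatives of the risk with respect to w and b."""
    m = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / m
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / m
    return dw, db

def gradient_descent(xs, ys, lr=0.1, steps=1000):
    """Repeatedly step against the gradient, starting from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradient(w, b, xs, ys)
        w, b = w - lr * dw, b - lr * db
    return w, b

# Made-up noiseless data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_star, b_star = gradient_descent(xs, ys)
```

On this data the iterates approach `w = 2`, `b = 1`, where the risk is zero. A learning rate that is too large for the data would make the same loop diverge.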

  34. Summary • With linear threshold units (“neurons”) we can build: linear discriminant (including Naïve Bayes); kernel methods; neural networks; decision trees • The architectural hyper-parameters may include: the choice of basis functions $\phi$ (features); the kernel; the number of units • Learning means fitting: parameters (weights); hyper-parameters • Be aware of the fit vs. robustness tradeoff

  35. Want to Learn More? • Pattern Classification. R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html • The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/ • Linear Discriminants and Support Vector Machines. I. Guyon and D. Stork. In Smola et al, Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz • Feature Extraction: Foundations and Applications. I. Guyon et al, Eds. Book for practitioners with datasets of NIPS 2003 challenge, tutorials, best performing methods, Matlab code, teaching material. http://clopinet.com/fextract-book
