Machine Learning – Neural Networks, Support Vector Machines
Georg Dorffner
Section for Artificial Intelligence and Decision Support, CeMSIIS – Medical University of Vienna
Machine Learning – possible definitions • Computer programs that improve with experience (Mitchell 1997) (artificial intelligence) • To find non-trivial structures in data based on examples (pattern recognition, data mining) • To estimate a model from data which describes them (statistical data analysis)
Some prerequisites • Features • Describe the cases of a problem • Measurements, data • Learner (Version Space) • A class of models • Learning rule • An algorithm that finds the best model • Generalisation • Model is supposed to describe new data well
Example learner: Perceptron • Features: 2 numerical values (drawn as points in a plane) • Task: separate into 2 classes (white and black) • Learner (version space): straight line through the origin • Learning rule: • Take the normal vector of the line • Add the point vector of a falsely classified example • Turn the line such that the new vector becomes its normal vector • Repeat until everything is correctly classified • Generalisation: new points are correctly classified • Convergence is guaranteed if the problem is solvable (Rosenblatt 1962) (see the sketch below)
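A minimal sketch of this learning rule in Python (the array-based interface and function name are illustrative, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Rosenblatt-style perceptron through the origin.
    X: (n_samples, n_features) points; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # normal vector of the separating line
    for _ in range(max_epochs):
        errors = 0
        for x_i, t_i in zip(X, y):
            if t_i * np.dot(w, x_i) <= 0:    # falsely classified (or on the line)
                w += t_i * x_i               # add the point vector -> turns the line
                errors += 1
        if errors == 0:                      # everything correctly classified
            break
    return w
```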
Types of learning • supervised learning • Classes of all training samples known („labeled data“) • Find the relationship with input • Examples: medical diagnosis, forecasting • unsupervised learning • Classes not known („unlabeled data“) • Find inherent structure in data • Examples: segmentation, visualisation • Reinforcement Learning • Find relationships based on global feedback • Examples: robot arm control, learning games
Neural networks: The simple mathematical model • A unit (neuron) j combines its inputs x_i via weights w_i into a (net) input and produces an activation/output y_j = f(net input) • Propagation rule: • Weighted sum • Euclidean distance • Transfer function f: • Threshold function (McCulloch & Pitts) • Linear function • Sigmoid function
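A minimal sketch of one such unit in Python (names are illustrative):

```python
import numpy as np

def unit_output(x, w, transfer="sigmoid"):
    """One unit (neuron): net input = weighted sum of the inputs, then a transfer function f."""
    net = np.dot(w, x)                        # propagation rule: weighted sum
    if transfer == "threshold":               # McCulloch & Pitts
        return 1.0 if net > 0 else 0.0
    if transfer == "linear":
        return net
    return 1.0 / (1.0 + np.exp(-net))         # sigmoid
```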
Perceptron as neural network • Inputs are random „feature“ detectors • Binary codes • Perceptron learns classification • Learning rule = weight adaptation • Model of perception / object recognition • But can solve only linearly separable problems (image: neuron.eng.wayne.edu)
Multilayer perceptron (MLP) • 2 (or more) layers of connections (weights) • Architecture: input units → hidden units (typically sigmoid) → output units (typically linear)
Learning rule (weight adaptation): Backpropagation • Generalised delta rule, applied first to the output weights W_out (acting on the hidden activations y_hid, x_hid) and then to the hidden weights W_hid • The error is propagated back from the output layer • A „pseudo-error“ is computed for the hidden units (see the equations below)
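The generalised delta rule in its usual textbook form (a reconstruction; the slide's own symbols are not reproduced in the text):

```latex
\Delta w^{out}_{jk} = \eta\,\delta^{out}_j\,x^{hid}_k,
\qquad
\delta^{out}_j = f'\!\bigl(net^{out}_j\bigr)\,\bigl(t_j - y^{out}_j\bigr)
\\[4pt]
\Delta w^{hid}_{ki} = \eta\,\delta^{hid}_k\,x_i,
\qquad
\delta^{hid}_k = f'\!\bigl(net^{hid}_k\bigr)\sum_j w^{out}_{jk}\,\delta^{out}_j
\quad\text{(the „pseudo-error“ of hidden unit } k\text{)}
```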
Backpropagation as gradient descent • Define the (quadratic) error (for pattern l) • Minimize the error • Change the weights in the direction of the negative gradient • The chain rule (partial derivative of the error by each weight) leads to backpropagation
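In formulas (a reconstruction of the slide's missing equations in standard notation):

```latex
E_l = \tfrac{1}{2}\sum_j \bigl(t_{lj} - y_{lj}\bigr)^2,
\qquad
\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}},
\qquad
\frac{\partial E}{\partial w_{ij}}
  = \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial w_{ij}}
  \quad\text{(chain rule)}
```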
Limits of backpropagation • Gradient descent can get stuck in a local minimum (depends on the initial values) • It is not guaranteed that backpropagation finds an existing solution • Further problems: slow, can oscillate • Solutions: conjugate gradients, quasi-Newton methods
The power of NN: Arbitrary classifications • Each hidden unit separates space into 2 halves (perceptron) • Output units work like “AND” • Sigmoids: smooth transitions
Example • MLP with 5 hidden and 2 output units • Linear transfer function at the output • Quadratic error
MLP to produce probabilities • MLP can approximate the Bayes posterior • Activation function: Softmax • Prior probabilities: distribution in the training set
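A minimal sketch of the softmax activation (Python; illustrative only):

```python
import numpy as np

def softmax(a):
    """Turn output activations a into non-negative values that sum to 1,
    interpretable as posterior class probabilities."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())        # subtract the maximum for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, 0.1]))    # approx. [0.66, 0.24, 0.10]
```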
Regression • To model the data generator: estimate the joint distribution p(x, t) = p(t|x) p(x) • The conditional distribution p(t|x) has expected value f(x_i), the network output • Likelihood: the product of p(t_i|x_i) p(x_i) over all training cases
Gaussian noise • Likelihood: the targets are the network output plus Gaussian noise • Maximizing L is equivalent to minimizing E = −log L (constant terms, including p(x), can be dropped) • This corresponds to the quadratic error (see backpropagation)
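Written out (a standard derivation, reconstructing the slide's formulas): with targets t = f(x) plus Gaussian noise of variance σ²,

```latex
p(t \mid x) \propto \exp\!\left(-\frac{\bigl(t - f(x)\bigr)^2}{2\sigma^2}\right)
\quad\Rightarrow\quad
-\log L = \sum_i \frac{\bigl(t_i - f(x_i)\bigr)^2}{2\sigma^2} + \text{const.}
```

so minimizing −log L is the same as minimizing the sum-of-squares (quadratic) error.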
Gradient of the error function • Optimization is based on gradient information; the gradient factorizes into the contribution of the error function and the contribution of the network • Backpropagation (after Bishop 1995): efficient computation of the network's contribution to the gradient: O(W) instead of O(W²), see p. 146f • It is independent of the chosen error function
Gradient descent • Simplest method: change the weights in direct proportion to the (negative) gradient – the classical „backpropagation“ (as the term is used in the NN literature) • Slow; oscillations and even divergence are possible • End point after 100 steps: [-1.11, 1.25], approx. 2,900 flops
Line search • Goal: step all the way to the minimum along the chosen direction • Approximation by a parabola (through 3 points) • Repeat 2–3 times if necessary • End point after 100 steps: [0.78, 0.61], approx. 47,000 flops
Conjugate gradients • Problem with line search: the new gradient is orthogonal to the old one • Choose a search direction d_t+1 that preserves the minimization already achieved along the previous direction d_t (weights w_t → w_t+1) • A much more targeted procedure • Variant: scaled conjugate gradient • End point after 18 steps: [0.99, 0.99], approx. 11,200 flops
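The 2-D test function behind the step counts and flop figures above is not given in the text; as an illustration of the same qualitative behaviour, here is a sketch comparing a fixed-step gradient descent with SciPy's conjugate-gradient optimizer on the Rosenbrock function (an assumption, chosen because its minimum lies at [1, 1]):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.0, 1.0])               # illustrative starting point

# Plain gradient descent: small fixed step, many iterations, slow progress
x = x0.copy()
for _ in range(100):
    x -= 0.001 * rosen_der(x)
print("gradient descent after 100 steps:", x)

# Conjugate gradients: far fewer, better-directed steps towards [1, 1]
res = minimize(rosen, x0, jac=rosen_der, method="CG")
print("conjugate gradients:", res.x, "in", res.nit, "iterations")
```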
MLP as universal function approximator • E.g.: 1 input, 1 output, 5 hidden units • An MLP can approximate arbitrary functions (Hornik et al. 1990) • Through superposition of sigmoids: the bias moves them, the weights stretch and mirror them • Complexity by combining simple elements
Overfitting • If there are too few training data, the NN tries to model the noise (figure: 50 samples, 15 hidden units) • Overfitting: worse performance on new data (the quadratic error becomes larger)
Avoiding overfitting • As much data as possible(good coverage of distribution) • Model (network) as small as possible • More generally: regularisation (= limit the effective number of degrees of freedom): • Several training runs, average • Penalty for large networks, e.g.: • „Pruning“ (remove connections) • Early stopping
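A minimal sketch of early stopping, the last regularisation option above (Python; `train_one_epoch` and `validation_error` are hypothetical hooks standing in for whatever training code is used):

```python
def fit_with_early_stopping(model, train_one_epoch, validation_error,
                            max_epochs=1000, patience=20):
    """Stop training when the error on held-out validation data has not
    improved for `patience` epochs (hypothetical hook functions)."""
    best_error, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)             # one pass over the training data
        error = validation_error(model)    # error on independent validation data
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break                          # validation error stopped improving
    return model, best_error
```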
The important steps in practice Owing to their power and characteristics, neural networks require a sound and careful strategy: • Data inspection (visualisation) • Data preprocessing • Feature selection • Model selection (pick the best network size) • Comparison with simpler methods • Testing on independent data • Interpretation of results
Model selection • Strategy for the optimal choice of model complexity: • Start small (e.g. 1 or 2 hidden units) • n-fold cross-validation • Add hidden units one by one • Accept as long as there is a significant improvement (test) • No regularization necessary: overfitting is captured by cross-validation (averaging) • Too many hidden units → too large variance → no statistical significance • The same method can also be used for feature selection (“wrapper”)
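A sketch of this selection loop, using scikit-learn as an illustrative stand-in (library, classifier and parameter names are assumptions, not from the slides; the improvement threshold is a crude substitute for the significance test):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def select_hidden_units(X, y, max_hidden=20, cv=5, min_improvement=0.01):
    """Grow the hidden layer one unit at a time; keep the larger network
    only while cross-validated accuracy clearly improves."""
    best_score, best_h = -np.inf, 0
    for h in range(1, max_hidden + 1):
        net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000, random_state=0)
        score = cross_val_score(net, X, y, cv=cv).mean()   # n-fold cross-validation
        if score > best_score + min_improvement:
            best_score, best_h = score, h
        else:
            break                          # no clear improvement: keep the smaller net
    return best_h, best_score
```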
Support Vector Machines: Returning to the perceptron • Advantages of the (linear) perceptron: • A global solution is guaranteed (no local minima) • Easy to solve / optimize • Disadvantage: • Restricted to linearly separable problems • Idea: • Transform the data into a high-dimensional space such that the problem becomes linearly separable
Mathematical formulation of the perceptron learning rule • Perceptron (1 output): y = sign(⟨w, x⟩) • Targets t_i = +1/−1 • The data are described purely in terms of inner products (dot products) – the „dual form“
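The dual form in standard notation (a reconstruction; the slide's own formulas are not reproduced in the text): the learned weight vector is a sum of training points, so the decision function depends on the data only through inner products,

```latex
w = \sum_i \alpha_i\, t_i\, x_i
\quad\Rightarrow\quad
y(x) = \operatorname{sign}\bigl(\langle w, x\rangle\bigr)
     = \operatorname{sign}\Bigl(\sum_i \alpha_i\, t_i\, \langle x_i, x\rangle\Bigr)
```

where α_i counts how often example i triggered an update.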
Kernels • The goal is a transformation x_i → Φ(x_i) such that the problem becomes linearly separable (Φ can be high-dimensional) • Kernel: a function that can be expressed as an inner product of Φs: k(x, z) = ⟨Φ(x), Φ(z)⟩ • Φ does not have to be known explicitly
Example: polynomial kernel • In 2 dimensions: k(x, z) = ⟨x, z⟩² • The kernel is indeed an inner product of the vectors after the transformation („preprocessing“), as worked out below
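Worked out for the 2-dimensional case (standard derivation, reconstructing the slide's formulas):

```latex
k(x, z) = \langle x, z\rangle^2
        = (x_1 z_1 + x_2 z_2)^2
        = x_1^2 z_1^2 + 2\,x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
        = \bigl\langle \Phi(x), \Phi(z) \bigr\rangle,
\qquad
\Phi(x) = \bigl(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\bigr)
```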
The effect of the „kernel trick“ • Using the kernel, e.g.: 16×16 pixel images (256-dimensional vectors), 5th-degree polynomial: feature-space dimension ≈ 10^10 • Done explicitly, this would be the inner product of two 10^10-dimensional vectors • With the kernel, the calculation stays in the low-dimensional space: the inner product of two 256-dimensional vectors, raised to the power of 5
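In code, the low-dimensional computation amounts to a single dot product and a power (Python; sizes as in the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
x, z = rng.standard_normal(256), rng.standard_normal(256)  # e.g. two 16x16 pixel images

# 5th-degree polynomial kernel: one 256-dimensional dot product, then a power,
# instead of an explicit inner product in a ~10^10-dimensional feature space
k = np.dot(x, z) ** 5
```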
Large margin classifier • High-dimensional space: overfitting is easily possible • Solution: search for the decision boundary (hyperplane) w with the largest distance to the closest points • Optimization: minimize ‖w‖² (i.e. maximize the margin 2/‖w‖), under the constraint that every training point is correctly classified with at least the margin distance
Optimization of the large margin classifier • Quadratic optimization problem; the Lagrange-multiplier approach leads to the „dual“ form • Important: the data again appear only in terms of inner products • The kernel trick can therefore be used again
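The dual problem in its usual form (a reconstruction of the slide's formula):

```latex
\max_{\alpha}\;\sum_i \alpha_i
  - \tfrac{1}{2}\sum_{i,j} \alpha_i\,\alpha_j\, t_i t_j\,\langle x_i, x_j\rangle
\qquad\text{s.t.}\quad \alpha_i \ge 0,\;\; \sum_i \alpha_i t_i = 0
```

with the inner product ⟨x_i, x_j⟩ replaceable by a kernel k(x_i, x_j).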
Support vectors • Support vectors: the points at the margin (closest to the decision boundary) • They determine the solution; all other points could be omitted • (Figure: decision boundary computed via the kernel function and back-projected into input space, with the support vectors marked)
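A short usage sketch with scikit-learn as an illustrative implementation (library and parameter names are assumptions, not part of the slides):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Large-margin classifier combined with the kernel trick (RBF kernel)
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
print("support vectors shape:", clf.support_vectors_.shape)
```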
Summary • Neural networks are powerful machine learners for numerical features, initially inspired by neurophysiology • Nonlinearity through the interplay of simpler learners (perceptrons) • A statistical/probabilistic framework is most appropriate • Learning = maximum likelihood, minimizing an error function with an efficient gradient-based method (e.g. conjugate gradients) • Power comes with downsides (overfitting) → careful validation is necessary • Support vector machines are an interesting alternative that simplifies the learning problem through the „kernel trick“