CCEB Supervised Learning: Artificial Neural Networks and Support Vector Machines John H. Holmes, Ph.D. Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania School of Medicine
What’s on the agenda for today • Review of classification • Introduction to machine learning • Artificial neural networks • Support vector machines
The Classification Problem [Figure: points of two classes, A+ and A−, with a separating surface] Separating surface: find the surface that best separates the two classes.
To do classification or prediction, you need to have the right data • Pool of data • Training data • Testing data • A class attribute • Categories must be mutually exclusive • Predictor attributes
What is a class? • Defines or partitions a relation • May be dichotomous or polytomous • Is not continuous • Examples • Clinical status (Ill/Well, Dead/Alive) • Biological classification (varieties of genus, species, or order)
Mining class comparisons • Goal: to discover descriptions in the data that distinguish one class from another • These descriptions are concepts! • Data in the classes must be comparable • Same attributes • Same value-system for each attribute • Same dimensions
Some mechanistic details… • Training • Phase during which the system learns from cases of known class • Focus on generalization • Testing • Phase during which the trained system is evaluated on novel cases
A split-sample method [Figure: dataset partition] Randomly select cases from the dataset for the training set; the balance is the testing set.
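A minimal sketch of a split-sample partition in Python (the 70/30 split fraction and the fixed seed are illustrative assumptions, not values fixed by the slides):

```python
import random

def split_sample(dataset, train_fraction=0.7, seed=42):
    """Randomly partition cases into a training set and a testing set."""
    cases = list(dataset)
    random.Random(seed).shuffle(cases)       # shuffle a copy, reproducibly
    cut = int(len(cases) * train_fraction)
    return cases[:cut], cases[cut:]          # (training set, balance as testing set)
```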
Cross-validation [Figure: dataset divided into N folds] One fold (fold n) is the candidate fold for use in testing; the remaining folds are used for the training set. Repeat for all N folds (usually N=10).
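And a minimal sketch of N-fold cross-validation, using 10 folds as the slide suggests (the interleaved fold assignment is an illustrative choice):

```python
import random

def cross_validation_folds(dataset, n_folds=10, seed=42):
    """Yield (training set, testing set) pairs, one pair per fold."""
    cases = list(dataset)
    random.Random(seed).shuffle(cases)
    folds = [cases[i::n_folds] for i in range(n_folds)]   # N roughly equal folds
    for n in range(n_folds):
        testing = folds[n]                                 # candidate fold for testing
        training = [c for j, fold in enumerate(folds) if j != n for c in fold]
        yield training, testing
```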
Instance-based learning • Testing cases (unknown class) are compared to training cases, one at a time • The training case closest to the testing case is used to output a predicted class for the testing case • You need a distance function • Euclidean most common • Square root of the sum of squared differences between each attribute-value pair • Manhattan distance • Sums the absolute differences between each attribute-value pair, without squaring
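A minimal sketch of 1-nearest-neighbor prediction with both distance functions (representing a training case as an (attributes, class) pair is an assumption made for illustration):

```python
import math

def euclidean(a, b):
    # square root of the sum of squared attribute-value differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # sum of absolute attribute-value differences, without squaring
    return sum(abs(x - y) for x, y in zip(a, b))

def predict_1nn(training_cases, test_attributes, distance=euclidean):
    """Output the class of the training case closest to the testing case."""
    nearest = min(training_cases, key=lambda case: distance(case[0], test_attributes))
    return nearest[1]   # each training case is an (attributes, class) pair
```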
Kernel-based learning • Kernel • Similarity function that maps a non-linear problem to a linear classifier • The idea is that non-linear data can be classified by a linear classifier • Linear classifiers are based on the dot product between two vectors • If you substitute the dot product with a kernel function, a linear classifier can be transformed into a non-linear classifier!
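A minimal sketch of that substitution (the RBF kernel and its gamma value are illustrative choices, and the coefficient/bias form below is the generic kernel-classifier decision function, not a specific method from the slides):

```python
import math

def dot(a, b):
    # a linear classifier scores a case with the plain dot product
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    # swapping this similarity function in for the dot product turns
    # the linear classifier into a non-linear one
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision_value(support_vectors, coefficients, bias, case, kernel=dot):
    """Coefficient-weighted sum of (kernel) similarities, plus a bias term."""
    return sum(c * kernel(sv, case)
               for sv, c in zip(support_vectors, coefficients)) + bias
```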
Neural Networks • Set of connected input/output units • Three layers: Input, Hidden, Output • Connections have weights that indicate the strength of the link between their units (neurons) • Neural networks learn by adjusting the connection weights as a result of exposure to training cases • Methods: backpropagation, self-organization
A Simple Neural Network [Figure: four inputs, x1 (Bitten), x2 (Rabies present), x3 (Animal captured), and x4 (Animal vaccinated), feed a hidden layer, which feeds a single output: Treat Yes/No]
Characteristics of neural networks • Neurons are all-or-none devices • Firing depends on reaching some threshold • Networks rely on connections made between axons and dendrites • Synapses • Neurotransmitters • “Wiring”
Neuronal structure and function • Input • Always 0 or 1, until multiplied by a: • Weight • Determines a neuron’s effect on another in a connection • Inputs multiplied by their weights are processed by an: • Adder • Sums the weighted inputs from all connected neurons for processing through a: • Threshold function • Determines the output of the neuron based on the summed, weighted inputs
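A minimal sketch of that input → weight → adder → threshold chain for a single neuron (the additive form of the threshold term is an assumption that matches the worked perceptron example later in these slides):

```python
def neuron_output(inputs, weights, theta):
    """All-or-none output of one neuron."""
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))  # the adder
    return 1 if weighted_sum + theta > 0 else 0                 # threshold function
```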
So... • Neural nets are arithmetic constraint networks • Operation frames denote arithmetic constraints • Demon procedures propagate stimuli through the net
Perceptrons, contd. • Threshold logic unit or ADALINE • One neuron • Only binary (0/1) inputs are allowed • Logic boxes intercede between inputs and weights to interpret the environment • The neuron sums the weighted inputs and reports an output based on the threshold • The main task is to learn the weights
Thus... If I1=2, I2=1, w1=.5, w2=.3, and θ=−1, the output O of the perceptron is 1 because: (2)(.5)+(1)(.3)+(−1)=.3 (>0)
Weight and threshold adjustment in perceptrons • Adjustments are made only when an error occurs in the output • Weight adjustment: wi(t+1)=wi(t)+Δwi(t), where Δwi(t)=(D−O)Ii
Weight and threshold adjustment in perceptrons, contd. • Threshold adjustment: θ(t+1)=θ(t)+Δθ(t), where Δθ(t)=(D−O)
How the adjustment works... • If output O is correct, no change is made • On a false positive error • Each weight is adjusted by subtracting the corresponding value in the input pattern • The threshold is adjusted by subtracting 1 • On a false negative error • Each weight is adjusted by adding the corresponding value in the input pattern • The threshold is adjusted by adding 1
Thus... • If a false positive error was made on the example (I1=2, I2=1, w1=.5, w2=.3, and θ=−1), the weights and threshold would have been adjusted as: w1(t+1)=.5+(0−1)(2)=−1.5, w2(t+1)=.3+(0−1)(1)=−.7, θ(t+1)=−1+(0−1)=−2
Training a perceptron: The pseudocode (a runnable sketch follows below)
Do while any output incorrect
  For each training case x
    If output ox incorrect (dx − ox ≠ 0)
      If dx − ox = 1
        Add logic box output vector to weight vector
      Else
        Subtract logic box output vector from weight vector
    x = x + 1
EndDo
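A minimal runnable sketch of this loop, folding in the weight and threshold adjustments from the preceding slides (the logic boxes are assumed to pass inputs through unchanged, and the loop terminates only if the training cases are linearly separable):

```python
def train_perceptron(cases, weights, theta):
    """Perceptron learning rule; cases are (inputs, desired_output) pairs."""
    while True:
        errors = 0
        for inputs, desired in cases:
            net = sum(w * i for w, i in zip(weights, inputs)) + theta
            output = 1 if net > 0 else 0
            if desired - output == 1:      # false negative: add inputs, add 1 to theta
                weights = [w + i for w, i in zip(weights, inputs)]
                theta += 1
                errors += 1
            elif desired - output == -1:   # false positive: subtract inputs and 1
                weights = [w - i for w, i in zip(weights, inputs)]
                theta -= 1
                errors += 1
        if errors == 0:                    # every output correct: done
            return weights, theta
```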
Backpropagation • Most common implementation of neural nets • Two-stage process • feed-forward activation from input to output layer • propagation of errors in the output backward to the input layer • Change w in proportion to the effect on the error observed at the outputs • Error=d-o • Where d=known class value, o=output from ANN
Backpropagation requires hidden layers • Middle layers build an internal model of the way input patterns are related to the desired outputs • The knowledge representation is implicit in this model; it is the synaptic weights (connectivity) that are the representation
Hidden layers • As the number of hidden layers increases, the training error rate decreases • Due to increased flexibility in the network to fit the data
Calculating the output of a hidden unit • Logistic function: Oj = 1 / (1 + e^−netj), where netj = Σi wij Oi + θj • Output will be any real number between 0 and 1
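A one-function sketch of that calculation (notation follows the weight and threshold slides that come next):

```python
import math

def logistic_output(inputs, weights, theta):
    """Logistic output of a unit: a real number strictly between 0 and 1."""
    net = sum(w * o for w, o in zip(weights, inputs)) + theta
    return 1.0 / (1.0 + math.exp(-net))
```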
Weight and threshold adjustment in backpropagation • Training involves adjusting each weight in proportion to the product of a learning rate (lrate), an error derivative (errdrv), and the input I • Because there are multiple layers, the input to unit j may be the output of a unit in the previous hidden layer, Oi
Weight adjustment in backpropagation wij(t+1)=wij(t)+Δwij(t), where Δwij(t)=(lrate)(errdrv)j Oi
Threshold adjustment in backpropagation θj(t+1)=θj(t)+Δθj(t), where Δθj(t)=(lrate)(errdrv)j
Calculating the error derivatives • Units in the output layer: (errdrv)j = Oj(1−Oj)(Dj−Oj) • Units in a hidden layer: (errdrv)j = Oj(1−Oj) Σk (errdrv)k wjk • That is, sum the weighted error derivatives of all k units connected to unit j in the next higher layer
How backpropagation works: The pseudocode (a runnable sketch follows below)
Initialize weights
For each training case i
  Present input i
  Generate output oi
  Calculate error (di − oi)
  Do while output incorrect
    For each layer j
      Pass error back to each neuron n in layer j
      Modify weights in each neuron n in layer j
  EndDo
  i = i + 1 (or, more typically, i = Rnd(i), i.e., pick the next case at random)
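A minimal runnable sketch of one backpropagation step for a single hidden layer and one output unit, combining the logistic outputs, error derivatives, and adjustment rules above (the network shape and the default learning rate are illustrative assumptions):

```python
import math

def sigmoid(net):
    # logistic function: output strictly between 0 and 1
    return 1.0 / (1.0 + math.exp(-net))

def backprop_step(x, d, w_hid, th_hid, w_out, th_out, lrate=0.5):
    """One feed-forward pass and one backward update; returns new parameters."""
    # Stage 1: feed-forward activation from input layer to output layer
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + th)
         for ws, th in zip(w_hid, th_hid)]
    o = sigmoid(sum(w * hj for w, hj in zip(w_out, h)) + th_out)

    # Stage 2: propagate the output error backward
    errdrv_out = o * (1 - o) * (d - o)              # Oj(1-Oj)(Dj-Oj)
    errdrv_hid = [hj * (1 - hj) * errdrv_out * wj   # weighted errdrv from layer above
                  for hj, wj in zip(h, w_out)]

    # Adjust: delta_w = (lrate)(errdrv_j)(O_i), delta_theta = (lrate)(errdrv_j)
    w_out = [wj + lrate * errdrv_out * hj for wj, hj in zip(w_out, h)]
    th_out += lrate * errdrv_out
    w_hid = [[w + lrate * ej * xi for w, xi in zip(ws, x)]
             for ws, ej in zip(w_hid, errdrv_hid)]
    th_hid = [th + lrate * ej for th, ej in zip(th_hid, errdrv_hid)]
    return w_hid, th_hid, w_out, th_out
```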
Problems with backpropagation • Gradient descent (ascent) is hill-climbing, so it can get stuck in local minima (maxima) • Add a momentum term to the generalized delta rule • Allows sliding over local minima/maxima • Scaling to the problem domain • Increasing the number of hidden layers can cause degradation in performance
Problems with backpropagation, contd. • Biological implausibility • Some believe that reverse neural pathways do not exist simultaneously with feed-forward pathways • Relies more on distant neurons for information than on local neurons
Neural Networks: Summary • Advantages • Excellent performance on many databases • Good choice for predictive mining • Resistant to noisy data • Disadvantages • Require an a priori knowledge model • Require substantial parameterization • Training can require long periods of time • Learned knowledge is not easily represented or interpreted
Support Vector Machines • Blend linear models with instance-based learning • Basic principle • Select a number of critical boundary instances (support vectors) from each class • Build a linear discriminant function that separates the classes as widely as possible
But SVMs are more than just linear models! • Other, non-linear terms can be added • Decision boundaries not linearly constrained • Quadratic, cubic, polynomial boundaries now possible! • How? • Use non-linear functions to transform input • Thus, SVMs use linear models to implement non-linear class boundaries
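A minimal sketch using scikit-learn (an external library assumed here for illustration; the XOR-style toy data and the RBF kernel choice are illustrative, and a polynomial kernel would work similarly):

```python
from sklearn.svm import SVC

# XOR-like toy data: no straight line can separate the two classes
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

linear_svm = SVC(kernel="linear").fit(X, y)  # linear decision boundary
rbf_svm = SVC(kernel="rbf").fit(X, y)        # kernel substitution: non-linear boundary

print(linear_svm.predict(X))  # a purely linear boundary cannot get all four right
print(rbf_svm.predict(X))     # the kernel-based model can
```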
Linear Classifiers How would you classify these data?
Linear Classifiers, contd. How would you classify these data?
Linear Classifiers, contd. How would you classify these data?