Information Driven Healthcare: Data Visualization & Classification Lecture 5: Introduction to Neural Networks Centre for Doctoral Training in Healthcare Innovation Dr. Gari D. Clifford, University Lecturer & Associate Director, Centre for Doctoral Training in Healthcare Innovation, Institute of Biomedical Engineering, University of Oxford
Overview • What are Artificial Neural Networks (ANNs)? • How do you construct them? • Choosing architecture • Pruning • How do you train them? • Normalizing and removing outliers • Partitioning data (train / validate / test) • Balancing data • Overfitting, early stopping • Optimization • Algorithms • Cost functions • Local minima, momentum & simulated annealing • N-fold Cross Validation
What is an Artificial Neural Network? • An ANN is a mathematical model inspired by biological neural networks. [Figure: network diagram mapping a set of inputs to an output]
Action Potential Propagation • The cell membrane in the axon and soma contains voltage-gated ion channels, which allow the neuron to generate and propagate an electrical impulse (an action potential). • The conduction of nerve impulses is an example of an all-or-none response; if a neuron responds at all, then it must respond completely. • The amplitude of the activation potential, together with the activation function, determines whether the neuron fires and how often. Source: http://en.wikipedia.org/wiki/Image:Neurons_big1.jpg
Network of neurons • The outputs of each neuron are connected to the inputs of other neurons From "Texture of the Nervous System of Man and the Vertebrates" by Santiago Ramón y Cajal. The figure illustrates the diversity of neuronal morphologies in the auditory cortex.
How does a neuron (or network of neurons) learn? • Reinforcement – iteratively repeated inputs associated with an outcome • Update the level at which an activation leads to an output • The rate of update is related to the error: the difference between the expectation (desired result) and the network output • Example: a child learning to catch a ball
Artificial Neural Network (ANN): The Multi-Layer Perceptron (MLP) • Composed of many connected neurons • Three general layers: input (i), hidden (j) and output (k) • Signals are presented to each input ‘neuron’ or node • Each signal is multiplied by a learned weighting factor, wij (specific to each connection between each layer) • … and the weighted signals arriving at each hidden node are summed and passed through a global activation function, fa • This is repeated in the output layer to map the hidden node values to the output • (The bias input is fixed to +1; its weight, b, is learned and allows more flexibility in the class boundaries)
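The forward pass described on this slide can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `forward`, the 2-2-1 layout, the hand-picked weights, the tanh hidden activation and the linear output layer are all assumptions made for the example, not values from the lecture.

```python
import math


def forward(x, w_hidden, w_output, activation=math.tanh):
    """Forward pass through an MLP with one hidden layer.

    Each weight row holds one weight per incoming node plus a trailing
    bias weight, which multiplies a constant input of +1.
    """
    def layer(inputs, weights, fn):
        extended = list(inputs) + [1.0]  # append the +1 bias input
        return [fn(sum(w * v for w, v in zip(row, extended)))
                for row in weights]

    hidden = layer(x, w_hidden, activation)
    # Assumption: a linear output layer (identity activation).
    return layer(hidden, w_output, lambda a: a)


# Hypothetical 2-2-1 network with hand-picked weights.
w_hidden = [[0.5, -0.5, 0.0],   # hidden node 1: two input weights + bias
            [1.0, 1.0, -1.0]]   # hidden node 2
w_output = [[1.0, -1.0, 0.5]]   # output node: two hidden weights + bias
y = forward([1.0, 1.0], w_hidden, w_output)
```

Here the first hidden node receives a summed input of 0 and the second a summed input of 1, so the single output is 0.5 - tanh(1).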
Activation Function & learning rates • Activation function, fa: • A normalization function that limits the range of the output • Usually nonlinear • Sigmoid (logistic) • Softmax • Linear • For the traditional tanh activation function the scale parameter is 1, and the data are assumed to be heavy tailed
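As a rough illustration, the activation functions listed above could be written as follows. The function names and the `scale` argument of `scaled_tanh` are assumptions for this sketch; they do not come from Netlab or from the slides.

```python
import math


def logistic(a):
    """Sigmoid (logistic) activation, limits the output to (0, 1)."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)  # avoids overflow for large negative a
    return e / (1.0 + e)


def softmax(activations):
    """Softmax over a list of activations; the outputs sum to 1."""
    shift = max(activations)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def linear(a):
    """Linear activation: the output is not range-limited."""
    return a


def scaled_tanh(a, scale=1.0):
    """tanh activation with an assumed scale on its argument (default 1)."""
    return math.tanh(scale * a)
```

The logistic and tanh functions squash a single activation, whereas softmax normalizes a whole layer so that the outputs can be read as class probabilities.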
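Iterative learning • 1. Calculate the output – pass the inputs forward through the weights and the activation function • 2. Measure the error between the output and the target (MSE cost function) • 3. Update the weights to reduce the error • 4. Repeat from (1) until the error converges to a limit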
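Gradient descent to find W • Given a cost function, $E$, we update each element of W ($w_{ij}$) at each step, $\tau$: $w_{ij}^{(\tau+1)} = w_{ij}^{(\tau)} - \eta \, \partial E / \partial w_{ij}$ • … and recalculate the cost function • $\eta$ is the learning rate (0.01 - 0.1); it sets the step size, so a larger value speeds up convergence (but too large a value can overshoot the minimum) • Sometimes a momentum term ($0.5 < \alpha < 0.99$) is also used in the iterative weight updates.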
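A minimal sketch of gradient descent with a momentum term is given below. The toy quadratic cost, the function names and the specific values of `learning_rate`, `momentum` and `steps` are assumptions chosen for illustration (they fall inside the ranges quoted on the slide); a real MLP would obtain the gradient by back-propagation instead.

```python
def gradient_descent(grad, w0, learning_rate=0.1, momentum=0.9, steps=500):
    """Iteratively move the weights down the gradient of the cost.

    The momentum term re-uses a fraction of the previous update, which
    smooths the trajectory across the error surface.
    """
    w = list(w0)
    velocity = [0.0] * len(w)
    for _ in range(steps):
        g = grad(w)
        velocity = [momentum * v - learning_rate * gi
                    for v, gi in zip(velocity, g)]
        w = [wi + vi for wi, vi in zip(w, velocity)]
    return w


def toy_grad(w):
    """Gradient of the toy cost E(w) = (w[0] - 3)^2 + 2 * (w[1] + 1)^2."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


w_min = gradient_descent(toy_grad, [0.0, 0.0])
```

For this toy cost the weights settle close to the known minimum at (3, -1).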
Gradient descent • Find the weights, W, that minimize the cost function • And now … a movie break [Animation: rbk1-steep.gif]
Gradient Descent • At each step, move down the steepest gradient of the error space • Add some noise to jump out of local minima – simulated annealing
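The idea of adding noise to escape local minima can be sketched as gradient descent with a random perturbation whose size (the "temperature") decays over time. This is a simplified illustration in the spirit of simulated annealing, not a full Metropolis-style annealer; the one-dimensional cost, the cooling schedule and every parameter value are assumptions made for the example.

```python
import math
import random


def noisy_descent(cost, grad, w0, learning_rate=0.01, temperature=1.0,
                  cooling=0.99, steps=2000, seed=0):
    """Gradient descent plus Gaussian noise that shrinks as we 'cool'."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    w = w0
    best_w, best_cost = w, cost(w)
    for _ in range(steps):
        w = w - learning_rate * grad(w) + rng.gauss(0.0, temperature)
        temperature *= cooling  # reduce the noise level each step
        c = cost(w)
        if c < best_cost:       # remember the best point visited so far
            best_w, best_cost = w, c
    return best_w, best_cost


def bumpy_cost(w):
    """Toy cost with several local minima."""
    return 0.1 * w * w + math.sin(3.0 * w)


def bumpy_grad(w):
    return 0.2 * w + 3.0 * math.cos(3.0 * w)


start = 4.0
best_w, best_cost = noisy_descent(bumpy_cost, bumpy_grad, start)
```

Because the best point seen so far is tracked, the returned cost is never worse than the cost at the starting point, whatever the noise does.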
Cost function & metric • Usually Euclidean mean square error, also: • Mahalanobis / cityblock • Cosine • Hamming (binary) • Mean log square error • Minimum free energy • Maximum likelihood • Minimum mutual information Choice depends on how you define independence between outputs, or underlying data source distributions
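A few of the simpler metrics in this list can be written directly, as in the sketch below. The function names are illustrative assumptions, and only the Euclidean mean square error, cityblock, cosine and Hamming measures are shown.

```python
import math


def mean_square_error(y, t):
    """Euclidean mean square error between output y and target t."""
    return sum((a - b) ** 2 for a, b in zip(y, t)) / len(y)


def cityblock(y, t):
    """Cityblock (L1) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(y, t))


def cosine_distance(y, t):
    """One minus the cosine of the angle between y and t.

    Undefined (division by zero) if either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(y, t))
    norm_y = math.sqrt(sum(a * a for a in y))
    norm_t = math.sqrt(sum(b * b for b in t))
    return 1.0 - dot / (norm_y * norm_t)


def hamming(y, t):
    """Hamming distance: number of positions where binary vectors differ."""
    return sum(1 for a, b in zip(y, t) if a != b)
```

For example, `hamming([1, 0, 1], [1, 1, 0])` counts two differing positions, while `cosine_distance` ignores vector length and compares direction only.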
Learning Paradigms • Supervised learning • Each output node, y, represents the probability of belonging to a given class; weights are iteratively adjusted so that the inputs, x, of labelled patterns map to the correct class. • Unsupervised learning • The value of each output node, y, is compared to each of the other output nodes, and weights are iteratively adjusted to minimize some cost function (e.g. an auto-associative MLP)
Auto-associative MLP • The AAMLP has as many outputs as inputs • The target vector, t, is the input vector!
Variable types and input/output encoding • Categorical: discrete, e.g. red, green, blue. Represent as [1 0 0], [0 1 0] and [0 0 1] … not 1, 2 & 3, as the latter representation implies ordering/distance • Continuous: apply directly to the input nodes. If units of variables differ, apply normalization (zero mean, unit variance). (Not generally needed for an MLP unless data >> +/- 10.0; see Tarassenko p84. For RBFs you must normalize.) • Ordinal: continuous discretized variables (e.g. weight rounded to each kilogram). Treat as continuous. • Periodic: e.g. day of the week. Transform to an angle to retain periodicity.
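The three encodings above might look like this in practice. The function names are hypothetical, and representing the angle of a periodic variable by its cosine and sine is an assumption made for this sketch (the slide only says to transform to an angle).

```python
import math


def one_hot(value, categories):
    """Categorical: one node per category, e.g. green -> [0, 1, 0]."""
    return [1 if value == c else 0 for c in categories]


def standardize(values):
    """Continuous: rescale to zero mean and unit variance.

    Assumes the values are not all identical (non-zero variance).
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    return [(v - mean) / std for v in values]


def periodic(index, period):
    """Periodic: map index 0..period-1 onto the unit circle."""
    angle = 2.0 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


colour = one_hot("green", ["red", "green", "blue"])
z = standardize([1.0, 2.0, 3.0])
sunday, monday, saturday = periodic(0, 7), periodic(1, 7), periodic(6, 7)
```

With the circular encoding, day 6 is as close to day 0 as day 1 is, which a plain 0 to 6 integer code would not preserve.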
Other data preparation • Balance classes – equal numbers of each, or you will learn the prior probability of a given class. (If you don’t you can compute the true posterior probabilities from the network output – see Tarassenko, appendix B) • Normalize different parameter data into same units (zero mean, unit variance) • Divide into train, validate and test sets. You cannot report your classifier or prediction accuracy on your training data – see later • Initialize weights with small random values of order +/-0.01 • Present data to ANN in random order
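These preparation steps can be sketched with the standard library alone. Balancing by undersampling to the rarest class, the 60/20/20 split and all function names are assumptions for illustration; the slide does not prescribe a particular balancing method or split ratio.

```python
import random


def balance(samples, labels, rng):
    """Undersample so every class has as many samples as the rarest one,
    then return the (sample, label) pairs in random order."""
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    n = min(len(group) for group in by_class.values())
    pairs = [(s, l) for l, group in by_class.items()
             for s in rng.sample(group, n)]
    rng.shuffle(pairs)  # present data to the network in random order
    return pairs


def split(pairs, train_frac=0.6, validate_frac=0.2):
    """Divide into train, validate and test sets (test gets the remainder)."""
    n_train = round(train_frac * len(pairs))
    n_val = round(validate_frac * len(pairs))
    return (pairs[:n_train],
            pairs[n_train:n_train + n_val],
            pairs[n_train + n_val:])


def init_weights(n_rows, n_cols, rng, scale=0.01):
    """Small random initial weights of order +/- scale."""
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
            for _ in range(n_rows)]


rng = random.Random(0)
samples = list(range(30))
labels = [0] * 20 + [1] * 10          # hypothetical unbalanced labels
pairs = balance(samples, labels, rng)
train, validate, test = split(pairs)
weights = init_weights(3, 4, rng)
```

Accuracy would then be reported on `test` only, with `validate` reserved for decisions such as when to stop training.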
Initialization in Netlab: netopt.m & mlp.m • Parameters you can adjust: • nin = []; % Number of input nodes • nhidden = []; % Number of hidden nodes • nout = []; % Number of output nodes • alpha = []; % Weight decay or momentum • ncycles = []; % Number of training cycles before stopping • activfn = {'linear','logistic','softmax'}; % Choice of activation function on output layer • optType = {'quasinew','conjgrad','scg'}; % Choice of algorithm for optimization
Choosing architecture: I-J-K • I: # of input units/neurons = number of observation parameters • J: # neurons in hidden layer = minimum dimensionality of the space which allows you to separate the classes of data • K: # neurons in output layer = # classes you wish to discover/learn, e.g. the I-J-K architecture for the Iris dataset would be 4-J-3, where J could be larger or smaller than I=4, depending on the separability of the classes
Choosing #nodes in hidden layer • If data is not separable in the original I dimensions, then J>I • If data can be dimensionally reduced (J<I) then use PCA to estimate # hidden nodes. [Figure: PCA eigenvalue spectrum of all training vectors] The noise floor is reached at the 2nd or 6th eigenvector, so the hidden layer can have 6 neurons (plus the bias)
Choosing a cost function • Sigmoid: non-Gaussian sources • Softmax: non-Gaussian sources • Linear: Gaussian sources (=PCA) • tanh with a scaled argument: non-Gaussian sources for a scale parameter > 0. For the traditional tanh activation function the scale parameter is 1, and the data are assumed to be heavy tailed. Varying the scale parameter reflects our changing beliefs in the underlying source distributions. In the limit of a large scale parameter, the nonlinearity becomes a step function and the underlying source becomes biexponential. As the scale parameter tends to zero, the underlying sources approach a more Gaussian distribution
Then seed and train … • Choose a learning rate (~0.1-0.2) • Randomly seed weights, • … and minimize the cost function
Optimization routines • All of the algorithms involve a line search given by the following equation: x(k+1) = x(k) – a H(k) ∇f(x(k)), where ∇f(x(k)) is the gradient of the cost function at x(k) • Conjugate Gradient - ‘conjgrad’: for plain gradient (steepest-descent) search the update matrix H(k) is I, the unit matrix • For Newton's method H(k) is the inverse of the Hessian matrix, H^-1, and a=1 • For quasi-Newton ‘quasinew’ methods H(k) is a series of matrices beginning with the unit matrix, I, and ending with the inverse of the Hessian matrix, H^-1 • Scaled Conjugate Gradient ‘scg’ – similar to the Levenberg-Marquardt method (see lsqnonlin.m)
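To make the role of the update matrix concrete, the sketch below applies one step of the form "new x = old x minus a times H times the gradient" to a small quadratic cost, once with H set to the unit matrix (a plain gradient step) and once with H set to the inverse Hessian (a Newton step). The 2x2 matrix `A`, the vector `b` and the helper names are assumptions for the example; this is not how Netlab implements its optimizers.

```python
def step(x, grad, H, a):
    """One update: x_new = x - a * H * gradient(x)."""
    g = grad(x)
    return [xi - a * sum(hij * gj for hij, gj in zip(row, g))
            for xi, row in zip(x, H)]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


# Toy cost f(x) = 0.5 * x^T A x - b^T x; its Hessian is the constant A.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def grad(x):
    return [sum(aij * xj for aij, xj in zip(row, x)) - bi
            for row, bi in zip(A, b)]


identity = [[1.0, 0.0], [0.0, 1.0]]
x_gradient = step([0.0, 0.0], grad, identity, a=0.1)       # small downhill step
x_newton = step([0.0, 0.0], grad, inverse_2x2(A), a=1.0)   # jumps to the minimum
```

For a quadratic cost the Newton step lands on the minimum (here x = [1/11, 7/11]) in a single update, whereas the gradient step only moves a short way downhill; the price is having to form and invert the Hessian.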
The Hessian • Ludwig Otto Hesse (22 April 1811 – 4 August 1874) - German mathematician, after whom the Hessian matrix, H, was named • It is the square matrix of second-order partial derivatives of a function • So it describes the local curvature of a function of many variables • The term "Hessian" is sometimes used for the determinant of H(f) rather than the matrix itself
Choosing an optimization routine • Advantage of conjugate gradient methods: they use relatively little memory for large-scale problems and require no numerical linear algebra, so each step is relatively quick • Disadvantage of conjugate gradient methods: they typically converge much more slowly than Newton or quasi-Newton methods • In the end it depends on your search space – if steps are large, SCG is often better
Break … • … part 2 of neural networks after the break.
Appendix – some nuances • An MLP can be made to perform PCA if the cost function is MSE and the activation function is linear. You can demonstrate this mathematically. • A latent variable interpretation of the activation functions • More on the bias weights
MLP performs PCA ( f=const ) • Consider the error function for p hidden nodes and a linear activation function: • The matrix form is:
MLP=PCA ( f=const ) Using equations (90) through (95)
Side note: The bias weights • Two different kinds of parameters can be adjusted during the training of an ANN: the weights and a parameter in the activation function. This is impractical, and it is preferable to adjust only one kind of parameter. • To cope with this problem a bias neuron is introduced. The bias neuron lies in one layer, is connected to all the neurons in the next layer but to none in the previous layer, and always emits a value of +1. • Since the bias neuron emits 1, the weights connected to the bias neuron, b, are added directly to the combined weighted sum of the other inputs, just like the parameter in the activation function.
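The equivalence described here can be checked numerically: a neuron with a separate offset parameter gives the same summed activation as a neuron whose input vector is extended with a constant +1 and whose weight vector is extended with that offset. The function names and numbers below are illustrative assumptions.

```python
def activation_with_offset(x, w, offset):
    """Weighted sum plus a separate adjustable parameter."""
    return sum(wi * xi for wi, xi in zip(w, x)) + offset


def activation_with_bias_neuron(x, w_extended):
    """Weighted sum where the last weight multiplies a bias input of +1."""
    extended = list(x) + [1.0]  # the bias neuron always emits +1
    return sum(wi * xi for wi, xi in zip(w_extended, extended))


x = [0.5, -2.0]
w = [1.0, 0.25]
offset = 0.75

a_offset = activation_with_offset(x, w, offset)
a_bias = activation_with_bias_neuron(x, w + [offset])
```

Both calls return the same value, so training only has to adjust one kind of parameter: weights.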
Radial Basis Functions • Radial basis functions (RBFs) are the natural generalization of coarse coding to continuous-valued features • Rather than each feature being either 0 or 1, it can be anything in the interval [0,1], reflecting various degrees to which the feature is present • A typical RBF feature, i, has a Gaussian (bell-shaped) response, $\phi_i(s) = \exp\left(-\frac{\|s - c_i\|^2}{2\sigma_i^2}\right)$, dependent only on the distance between the state, s, and the feature's prototypical or center state, $c_i$, and relative to the feature's width, $\sigma_i$
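A Gaussian RBF feature of this kind takes only a few lines. The sketch assumes the common form in which the squared distance to the center is divided by twice the squared width; the function name and example values are hypothetical.

```python
import math


def rbf_feature(state, center, width):
    """Gaussian response: 1 at the center, falling towards 0 with distance.

    Depends only on the distance between state and center, measured
    relative to the feature's width.
    """
    sq_dist = sum((s - c) ** 2 for s, c in zip(state, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


center = [1.0, 2.0]
at_center = rbf_feature([1.0, 2.0], center, width=0.5)
nearby = rbf_feature([1.2, 2.0], center, width=0.5)
far_away = rbf_feature([3.0, 4.0], center, width=0.5)
```

The response is exactly 1 at the center and decreases smoothly with distance, so `at_center > nearby > far_away`, and every value lies in the interval [0, 1].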
RBFs - why / why not • An RBF network is simply a linear function approximator using RBFs for its features. • Learning is exactly the same as for other linear function approximators – gradient descent • The primary advantage of RBFs over binary features is that they produce approximate functions that vary smoothly and are differentiable • Also, some learning methods for RBF networks change the centers and widths of the features as well. Such nonlinear methods may be able to fit the target function much more precisely • The downside to RBF networks, and to nonlinear RBF networks especially, is greater computational complexity and, often, more manual tuning before learning is robust and efficient • MLPs are more complex still, but often give better performance
Acknowledgements • Overfitting, Cross-validation and bootstrapping slides adapted from notes by Andrew W. Moore, School of Computer Science Carnegie Mellon University: www.cs.cmu.edu/~awm including “Cross-validation for detecting and preventing overfitting” - http://www.autonlab.org/tutorials/index.html