Understand the key differences between conventional and deep networks, why deep networks are more efficient, training challenges, solution approaches, activation functions, and the role of DBNs in training deep neural networks.
Conventional and deep networks
• What is the difference between a conventional and a deep network?
• The structural difference is that deep networks have more hidden layers (5-10 instead of 1-2, and nowadays even 100-150)
• Sounds simple – so why did it take so long?
  • The training of deep networks requires new algorithms
    • The first one: DBN pre-training, 2006
  • The advantages of deep learning show up only for huge amounts of data
    • These were not available in the '80s
  • Training deep networks is slow
    • This is now solved by GPUs
• All three factors – new algorithms, access to a lot of data, and the invention of GPUs – contributed to the current success of deep learning
Why are deep networks more efficient?
• We saw a proof earlier that we can solve any task with a network of 2 hidden layers
• However, this is only true if we have infinitely many neurons, an infinite amount of training data, and a training algorithm that guarantees a global optimum
• With a fixed and finite number of neurons, it is more efficient to arrange them into several smaller layers rather than 1-2 „wide” layers
• This allows the network to process the data hierarchically
• In image recognition tasks the higher layers indeed learn more and more abstract notions
  • Pixel → edge → mouth, nose → face → human, …
Training deep neural networks
• Training deep networks is more difficult than training “shallow” networks
• Backpropagation propagates the error from the output to the hidden layers
• The more layers we go back, the larger the chance that the gradient “vanishes”, so the deeper layers will not learn (see the numerical sketch below)
• Solution approaches for training deep networks
  • Pre-training with unlabelled data (DBN pre-training using the contrastive divergence (CD) cost function)
    • This was the first solution; it is mathematically involved and slow
  • Build and train the network by adding layer after layer
    • Much simpler, requires only the backpropagation algorithm
  • Use newer types of activation functions
    • The simplest solution; we will look at this first
• Training very deep networks requires further tricks (batch normalization, highway networks, etc.)
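To make the vanishing-gradient claim concrete, here is a small numerical sketch (my own illustration, not part of the original slides): a random stack of sigmoid layers is evaluated, and the error is propagated back layer by layer. The layer count, width and weight scale are arbitrary assumptions chosen only for the demonstration.

```python
# Sketch: how the gradient can vanish when backpropagated through many sigmoid layers.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_layers, width = 10, 50
weights = [rng.normal(scale=0.5, size=(width, width)) for _ in range(n_layers)]

# Forward pass: remember the activations, because the sigmoid derivative a*(1-a) needs them
a = rng.normal(size=width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: start from a unit error at the output and propagate it down
grad = np.ones(width)
for W, a in zip(reversed(weights), reversed(activations)):
    grad = W.T @ (grad * a * (1.0 - a))
    print(f"gradient norm: {np.linalg.norm(grad):.2e}")
# The printed norms typically shrink by orders of magnitude: the lowest layers barely learn.
```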
Modifying the activation function
• The sigmoid activation function has been used for 30 years…
• The tanh activation function is equivalent to it (the sigmoid returns values within [0,1], tanh returns values in [-1,1])
• Problem: the two ends are very „flat” → the derivative is practically 0 → the gradient may easily vanish
• To avoid this, the rectifier activation function was proposed first, and since then many new activation functions have been recommended
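A quick numerical check of the „flat ends” problem (my own sketch), using the standard derivative formulas of the sigmoid and tanh functions; the sample points are arbitrary.

```python
# The derivatives of sigmoid and tanh become tiny away from zero.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([0.0, 2.0, 5.0, 10.0])
sig_deriv = sigmoid(xs) * (1.0 - sigmoid(xs))   # sigma'(x) = sigma(x) * (1 - sigma(x))
tanh_deriv = 1.0 - np.tanh(xs) ** 2             # tanh'(x) = 1 - tanh(x)^2

for x, ds, dt in zip(xs, sig_deriv, tanh_deriv):
    print(f"x={x:5.1f}   sigmoid'={ds:.5f}   tanh'={dt:.5f}")
# At x=10 both derivatives are essentially 0, so almost no gradient flows back through such a neuron.
```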
The rectifier activation function
• In comparison with the tanh activation function, the rectifier does not saturate on the positive side
• Formally: relu(x) = max(0, x)
• For positive input the derivative is always 1, so it never vanishes
• For negative input the output and the derivative are both 0
• The nonlinearity is very important! (compare it with a linear activation)
• It works fine in spite of the 0 derivative for negative input, but improved versions have been proposed for which the derivative is never 0
• Networks built out of „rectified linear” (ReLU) neurons are currently the de facto standard in deep learning
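A minimal sketch of the rectifier and of one improved variant whose derivative is never exactly zero. The slide does not name a specific variant, so the „leaky” ReLU and its leak factor 0.01 are my own assumptions for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # For negative inputs the output is alpha*x, so the derivative is alpha instead of 0
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))         # zeros for negative inputs, identity for positive ones
print(leaky_relu(x))   # small negative slope instead of a hard zero
```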
Even newer activation functions
• See also: https://en.wikipedia.org/wiki/Activation_function
• Linear, sigmoid, tanh, ReLU, ELU, SELU, Softplus…
• A few examples are sketched below
• These sometimes give slightly better results than the ReLU function, but none of them has resulted in a general breakthrough
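For reference, a sketch of a few of the listed functions using their standard textbook definitions (ELU's alpha=1.0 and the SELU constants shown are the commonly published values, used here as assumptions rather than values from the lecture).

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    return np.log1p(np.exp(x))          # a smooth approximation of ReLU

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
print(elu(x))
print(softplus(x))
print(selu(x))
```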
The restricted Boltzmann machine (RBM)
• Very similar to a pair of network layers, but works with binary values
• Training: contrastive divergence (CD) – a CD-1 sketch is shown below
  - It is an unsupervised method (there are no class labels)
  - Seeks to reconstruct the input from the hidden representation
  - It can be interpreted as an approximation of the Maximum Likelihood cost function
  - Iterative, similar to backpropagation
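The following is a rough sketch of a single CD-1 update for a tiny RBM with binary units, meant only to illustrate the “reconstruct the input from the hidden representation” idea; the layer sizes, learning rate and random data are assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 3, 0.1

W = rng.normal(scale=0.1, size=(n_hidden, n_visible))
b = np.zeros(n_visible)   # visible bias
c = np.zeros(n_hidden)    # hidden bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0 = rng.integers(0, 2, size=n_visible).astype(float)   # one binary training vector

# Positive phase: hidden probabilities given the data
h0_prob = sigmoid(W @ v0 + c)
h0 = (rng.random(n_hidden) < h0_prob).astype(float)     # sample hidden states

# Negative phase: reconstruct the visible layer, then the hidden layer again
v1_prob = sigmoid(W.T @ h0 + b)
v1 = (rng.random(n_visible) < v1_prob).astype(float)
h1_prob = sigmoid(W @ v1 + c)

# CD-1 update: difference of the "data" and "reconstruction" correlations
W += lr * (np.outer(h0_prob, v0) - np.outer(h1_prob, v1))
b += lr * (v0 - v1)
c += lr * (h0_prob - h1_prob)
```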
Deep Belief Network
• Deep Belief Network: restricted Boltzmann machines stacked on each other
  - training: CD algorithm, adding layer after layer
Conversion into a deep neural network
• DBNs were proposed for the initialization („pre-training”) of deep networks
• After training, the DBN can be converted into a deep network (see the sketch below)
  • The RBMs are turned into conventional sigmoid layers (with the same weights)
  • A softmax output layer is added on top
  • So we can continue with supervised backpropagation training
• In the early papers DBN pre-training was used to solve the problem of the difficulty of backpropagation training
  • DBN pre-training resulted in a good starting point for backpropagation
• The current view is that DBN pre-training is no longer necessary
  • We usually train on much more training data
  • The new activation functions also help a lot
  • We have new weight initialization methods and other training tricks (e.g. batch normalization)
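A small sketch of this conversion step, assuming the DBN has already produced a stack of weight matrices: each RBM becomes a sigmoid layer with the same weights, and a randomly initialized softmax layer is added on top. All sizes and the placeholder “pre-trained” weights here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

layer_sizes = [20, 15, 10]   # visible -> hidden1 -> hidden2 (from the stacked RBMs)
n_classes = 3

# Pretend these came from DBN pre-training; here they are just random placeholders
rbm_weights = [rng.normal(scale=0.1, size=(m, n))
               for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
rbm_biases = [np.zeros(m) for m in layer_sizes[1:]]

# The softmax layer is added on top and initialized randomly; it is trained by backprop
W_out = rng.normal(scale=0.1, size=(n_classes, layer_sizes[-1]))
b_out = np.zeros(n_classes)

def forward(x):
    a = x
    for W, b in zip(rbm_weights, rbm_biases):   # each RBM becomes a sigmoid layer
        a = sigmoid(W @ a + b)
    return softmax(W_out @ a + b_out)           # class posteriors

print(forward(rng.normal(size=layer_sizes[0])))
```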
Cases when DBNs are still useful
• Training DBNs using the CD criterion is unsupervised training
  • DBN pre-training may still be useful when we have a lot of unlabelled data and only a few labelled examples
• The connection between two RBM layers is symmetric
  • We can easily reconstruct the input from a hidden representation
  • We can easily visualize what hidden representation the network has learned
Convolutional neural networks
• Classic architecture: the „fully connected” net
  • Between two layers all neurons are connected to all neurons
  • The order of the inputs plays no role
  • If we permute the inputs randomly (but using the same permutation for each vector!), then the network will attain a very similar accuracy
• For many classification tasks the order of the inputs indeed plays no role
  • E.g. our very first example: (fever, joint_pain, cough) → influenza
  • Obviously, this is just as learnable as (joint_pain, cough, fever) → influenza
• But there are tasks where the order (topology) of the features is important
  • The best example is image recognition
  • In this case it is worth using a special network structure → the convolutional network
Motivation #1
• There are tasks where the order (relation, topology) of the features is important
• Image recognition: we won’t see the same picture if we mix up the pixels
  • But a fully connected network will achieve the same performance in both cases
  • The arrangement of the pixels contains vital information for our brain, but a fully connected network is not able to exploit it
• The pixels form the image together
  • The semantic connection of nearby pixels is typically stronger (they form objects together)
Motivation #2
• A picture typically has a hierarchical structure
  • From simpler, local building blocks to larger, more complex notions
• The early shape recognition experiments using ANNs considered such complex tasks to be hopeless
  • They tried only simpler tasks like character recognition
• Now we can recognize complex “real” images
  • Convolutional networks greatly contributed to this success
Motivation #3
• Certain research results show that the human brain also processes images hierarchically
  • Neurons in the visual cortex fire when seeing certain simple graphical primitives, like edges having a certain direction
• Consider the „Thatcher illusion”: for the upside-down image our brain concludes that the mouth, nose and eyes seem to be correct and in place (they form a face together), but it does not notice the error in the fine details
Convolutional neural networks
• Based on the above motivations, the convolutional neural network
  • Processes the input hierarchically
    • As a consequence, it will necessarily be deep
  • The input of each neuron is local, so each neuron focuses on a small block of the image
    • But going upwards we cover larger and larger areas
  • Neurons in the higher layers cover larger and larger parts of the image, but with decreasing resolution (fine details count less)
    • The network will be less sensitive to the exact location of objects (this is where the convolution operation helps)
• The main application area of convolutional networks is image recognition, but they may be applied in any other area where the input has a hierarchical structure (for example, they are also used in speech recognition)
Building blocks of convolutional networks
• The typical architecture of a convolutional neural network
  • https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
• Convolutional neurons (also known as filters)
• Pooling operation
  • These two are repeated several times
• On top, the network usually has some fully connected layers and a softmax output layer
• The network can be trained with the usual backpropagation algorithm (adjusted to the convolution and pooling steps)
The convolution operation
• Convolutional neurons are very similar to standard neurons. The main differences are:
  • Locality: they process only a small part of the input image
  • Convolution: the same neuron is evaluated at many different locations of the image (as these evaluations use the same weights, this property is also known as “weight sharing”)
• The operation performed by the neurons is very similar to the filters known from image processing, which is why these neurons are sometimes called filters
  • But in this case we also apply a nonlinear activation function on the filter output
  • And the parameters (weights) are tuned automatically, not specified manually
The convolution operation
• An example of what a convolutional neuron does
  • Input image, filter (the weight matrix of the neuron), and the result (shown on the slide; a code sketch follows below)
• The evaluation can be performed at each position, so the result is a matrix with the same size as the input
  • But the processed range may also be restricted
• Stride: the step size of the filter. E.g. stride 2 means that we skip every second position
• Zero padding: fitting the filter on the edge pixels might require “padding” the image with zeros – this is a design decision
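A minimal sketch of the convolution operation described above, with configurable stride and zero padding; the 8×8 random image and the simple vertical-edge filter are assumptions made for the example.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=1):
    """Slide one filter over a zero-padded image (the usual 'convolution' in CNNs)."""
    padded = np.pad(image, pad)
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = padded[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # the same weights are reused at every position
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))
edge_filter = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)   # a simple vertical-edge detector

feature_map = np.maximum(0.0, conv2d(image, edge_filter, stride=1, pad=1))  # ReLU on top
print(feature_map.shape)   # (8, 8): same size as the input thanks to the padding
```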
The convolution operation
• We expect the convolutional filters to learn abstract notions (e.g. to be able to detect an edge or a nose)
• For this we usually need more neurons, so we train a set of neurons for the same task in parallel. In this example we use 3 neurons to process each position, resulting in 3 output matrices. We call this the “depth” of the output
• We train further layers on the output of the given layer, so we interpret the layer as a feature extractor for the subsequent layer (hopefully extracting more and more abstract features as we go up…)
  • This is why it is called a „feature map” in the image
The pooling operation
• As we go up, we don’t want to preserve the fine details
  • E.g. if the „nose detector” found a nose, then we don’t want to keep its precise position and shape in the higher layers
• This is what the pooling step does (see the sketch below):
  • In a local neighborhood it pools the output values
  • For example, by taking their maximum
• Advantages:
  • Shift invariance within the pooling region (no matter where we found the nose, the maximum will be the same)
  • We gradually decrease the output size (downsampling) – reducing the number of parameters is always useful
  • Going up, we cover larger and larger areas with filters of the same size → hierarchic processing
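A short sketch of 2×2 max pooling on a small feature map (my own example, not from the slides): each non-overlapping 2×2 block is replaced by its maximum.

```python
import numpy as np

def max_pool(feature_map, size=2):
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size            # drop a ragged border if necessary
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 0, 1],
               [0, 1, 5, 6],
               [2, 0, 7, 1]], dtype=float)

print(max_pool(fm))
# [[4. 2.]
#  [2. 7.]]
# Shifting a large value a little within its 2x2 block would not change the output.
```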
Hierarchic processing
• We stack a lot of convolution+pooling steps on each other
• We expect the higher layers to extract more and more abstract features
• This is more or less true (example from a face recognition network shown on the slide)
• The final classification is performed by fully connected layers
Summary
• The advantages and drawbacks of convolutional networks
• Advantages of convolution: local processing of input blocks using the same weights – fewer parameters, shift-invariance
  • Drawback: slower computation than with fully connected nets
• Advantages of pooling: parameter reduction, shift-invariance, hierarchic processing from local to larger context
  • Drawback: fine details are lost, although they would be useful in certain cases
• Advantages of hierarchical processing: complex pictures have a hierarchic structure, so it makes sense to process them hierarchically
  • Drawback: the network will inevitably be deep, and training deep networks is problematic
Examples
• Complex image recognition task: several objects in the same image
• ImageNet database: 1.2 million high-resolution images, 1000 target labels
• The network is allowed 5 guesses; it is considered correct if the correct answer is among them
• Before convolutional networks the smallest error was 26%
More examples
• First convolutional network (2012): 16% error; now: <5%
Modelling time series with recurrent neural networks
• So far we have assumed that the examples that follow each other are independent
  • This is true for most recognition tasks, e.g. image recognition
• But there are tasks where the order of the examples carries vital information
  • This typically occurs when we model time series
  • E.g.: speech recognition, language processing, handwriting recognition, video analysis, stock exchange rates,…
• The question is how to modify the network so that it takes the neighbors of an input vector into consideration
  • Feed-forward network on several vectors: time-delay neural network
  • Recurrent neural network
  • Recurrent neural network with memory: long short-term memory network
Using neighboring input vectors
• The network that processes one input vector looks like this (the line corresponds to a full connection)
• We can easily modify it to process more than one input vector
  • No modification is required in the network structure
• Drawback 1: the input size greatly increases
• Drawback 2: the input context is larger, but still finite
Time-Delay Neural Network
• Processes several neighboring input vectors
• But (at the lower level) we perform the same processing on each vector
  • The results are combined only at a higher level
  • The 5 blocks of the hidden layer use the same weights → „weight sharing” (see the sketch below)
• Advantage: we can increase the input size without increasing the number of processing neurons
  • If the size of the hidden layer is relatively small, then the input size of the output layer increases only slowly
• Of course, both the lower and upper processing parts may consist of several layers
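A small forward-pass sketch of this weight-sharing idea: one shared weight matrix is applied to each of the 5 neighboring input vectors, and only the upper layer sees all of them. All sizes and the random weights are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
frame_dim, hidden_dim, n_frames, n_classes = 13, 8, 5, 4

W_shared = rng.normal(scale=0.1, size=(hidden_dim, frame_dim))   # one set of weights...
b_shared = np.zeros(hidden_dim)
W_top = rng.normal(scale=0.1, size=(n_classes, n_frames * hidden_dim))
b_top = np.zeros(n_classes)

frames = rng.normal(size=(n_frames, frame_dim))   # 5 neighboring input vectors

# ...reused on every frame ("weight sharing"), much like a 1-D convolution
hidden = np.tanh(frames @ W_shared.T + b_shared)   # shape (5, hidden_dim)
logits = W_top @ hidden.reshape(-1) + b_top        # the results are combined at the higher level
print(logits)
```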
Time-Delay Neural Network 2
• The TDNN is a feedforward network
  • Backpropagation can be used as before
  • Backpropagation through the „weight sharing” is the only complication
• Evaluation („forward pass”):
  • Similar to a fully connected net, but the same weights are used at many positions
• Training („backward pass”):
  • The error values obtained along the different paths belong to the same weights, so they must be summed before the update
• The TDNN is a close relative of convolutional neural networks
Recurrent neural networks (RNN)
• We introduce real recurrent connections
• So the input consists of not only the actual input vector, but also the previous output
  • We usually feed back the hidden layer rather than the output layer
• The network now sees its previous hidden states, so it is like “adding memory” to the network (see the sketch below)
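A minimal sketch of such a recurrent (Elman-style) layer, with assumed sizes and random weights: the hidden state computed at time t-1 is fed back as an extra input at time t.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 4, 6, 10

W_in = rng.normal(scale=0.3, size=(hidden_dim, input_dim))    # input -> hidden
W_rec = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))  # hidden(t-1) -> hidden(t)
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))   # a toy input sequence
h = np.zeros(hidden_dim)                     # h_{t-1} must be initialized somehow

for t, x in enumerate(xs):
    h = np.tanh(W_in @ x + W_rec @ h + b)    # h_t depends on x_t and on h_{t-1}
    print(t, np.round(h, 2))
```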
Backpropagation through time
• How can we evaluate an RNN?
  • From left to right, from vector to vector
  • We cannot skip vectors
  • In the first step h_{t-1} must be initialized somehow
• How can we train an RNN?
  • Theoretically, in each step we need the previous h_{t-1} → infinite recursion
  • In practice, however, any training data set is finite
    • We may also cut it into chunks artificially…
  • This way, the network can be „unfolded” in time
    • And can be trained using backpropagation
Backpropagation through time
• What are the difficulties of „backpropagation through time”?
• The “copies” at different positions have the same weights!
  • The errors must be collected, just as we saw in the case of „weight sharing”
• The training involves very long paths along time
  • Theoretically, the actual output is influenced by all previous inputs
  • In practice the gradients along the long paths may vanish or blow up
  • The training of RNNs has instability problems
  • RNNs usually fail to learn very long-term dependencies
• These are the same problems as with the training of very deep networks
Long short-term memory (LSTM) nets
• We want to allow the neurons to learn which previous inputs are important and which may be forgotten
  • This would also solve the problem of learning long-term relations
• We introduce an inner state that acts as a memory and is less sensitive to backpropagation
  • The information flows through several paths
• A „gate” controls when the memory is deleted („forget gate”)
  • And what to store from the actual input („input gate”)
  • And what contributes to the output („output gate”)
• This new model is the LSTM neuron; we replace our previous recurrent neurons with these
RNN vs. LSTM neuron
• RNN (with tanh activation): a single transformation per step, h_t = tanh(W·[h_{t-1}, x_t] + b)
• LSTM: the cell additionally maintains an inner state C_t and three gates (forget, input, output), as detailed on the next slides
How the gates operate
• The gating weights multiply each component of the input element-wise
• The weights are kept between 0 and 1 using the sigmoid function
• The optimal weights are learned
• Example with 0 and 1 gating weights (a small sketch is given below)
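A tiny illustration (my own example) of the gating operation with extreme 0/1 gating weights: components gated by 0 are blocked, components gated by 1 pass through unchanged.

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5, 3.0])
gate = np.array([1.0, 0.0, 1.0, 0.0])   # extreme 0/1 gating weights for clarity

print(gate * x)   # the 2nd and 4th components are "forgotten", the others pass through
```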
LSTM neuron
• Forget gate: which components should be discarded from the memory C
  • f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
• Input gate: which input components should be stored in C
  • i_t = σ(W_i·[h_{t-1}, x_t] + b_i), with candidate values C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
LSTM neuron
• The cell state („memory”) is updated using the forget gate and the input gate:
  • C_t = f_t ∘ C_{t-1} + i_t ∘ C̃_t (element-wise products)
• The new value of the hidden state h is calculated from the output gate o_t = σ(W_o·[h_{t-1}, x_t] + b_o) as:
  • h_t = o_t ∘ tanh(C_t) (a code sketch of one full LSTM step follows below)
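Putting the gate and state equations together, here is a compact sketch of one LSTM step (assumed sizes and random weights; real implementations usually fuse the four weight matrices into one, but they are kept separate here for readability).

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix and bias per gate, each acting on the concatenation [h_{t-1}, x_t]
def make_params():
    return rng.normal(scale=0.3, size=(hidden_dim, hidden_dim + input_dim)), np.zeros(hidden_dim)

(W_f, b_f), (W_i, b_i), (W_c, b_c), (W_o, b_o) = (make_params() for _ in range(4))

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)            # forget gate
    i = sigmoid(W_i @ z + b_i)            # input gate
    C_tilde = np.tanh(W_c @ z + b_c)      # candidate memory content
    C = f * C_prev + i * C_tilde          # update the cell state
    o = sigmoid(W_o @ z + b_o)            # output gate
    h = o * np.tanh(C)                    # new hidden state
    return h, C

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):   # a toy sequence of 6 input vectors
    h, C = lstm_step(x_t, h, C)
print(np.round(h, 3))
```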
LSTM network
• We can construct a network from LSTM cells in just the same way as from standard recurrent neurons
• Of course, the training is slower and more complicated
• But for most tasks it gives better results than a standard RNN
• Variants of LSTMs:
  • There are variants with more paths
    • LSTM with „peephole” connections
  • Some people try to simplify LSTMs
    • GRU – gated recurrent unit
  • This is one of the most active current research topics
  • https://www.slideshare.net/LarryGuo2/chapter-10-170505-l
Bidirectional recurrent networks
• We process the input in both directions
• There is a hidden layer for both directions (they are independent)
• The output layer combines the two hidden layers
• Advantage: takes both the earlier and the later context into consideration
• Disadvantage: cannot operate in real time
Deep recurrent networks
• Of course, we can stack many recurrent layers on each other
  • These can even be bidirectional
• But this is less usual (or done with fewer layers) than with feed-forward layers
  • Recurrent layers have a larger representational power
  • And their training is much slower and more complicated