Deep neural networks (DNNs)
Conventional and deep networks • What is the difference between a conventional and a deep network? • The structural difference is that deep networks have more hidden layers (5-10 instead of 1-2, and nowadays even 100-150) • Sounds simple, so why did it take so long? • Training deep networks requires new algorithms • The first one: DBN pre-training, 2006 • The advantages of deep learning show up only with huge amounts of data • These were not available in the 1980s • Training deep networks is slow • GPUs now make it practical • All three factors (new algorithms, access to a lot of data, and the availability of GPUs) contributed to the current success of deep learning
Why are deep networks more efficient? • We saw a proof earlier that a network with 2 hidden layers can solve any task • However, this holds only if we have infinitely many neurons, an infinite amount of training data, and a training algorithm that guarantees the global optimum • With a fixed, finite number of neurons, it is more efficient to arrange them into several smaller layers rather than 1-2 “wide” layers • This allows the network to process the data hierarchically • In image recognition tasks the higher layers indeed learn more and more abstract notions • Pixel → edge → mouth, nose → face → human, …
Training deep neural networks • Training deep networks is more difficult than training “shallow” networks • Backpropagation propagates the error from the output back to the hidden layers • The more layers we go back, the larger the chance that the gradient “vanishes”, so the layers far from the output will not learn • Solution approaches for training deep networks • Pre-training with unlabelled data (DBN pre-training using the contrastive divergence (CD) cost function) • This was the first solution; it is mathematically involved and slow • Build and train the network by adding layer after layer • Much simpler, requires only the backpropagation algorithm • We can use newer types of activation functions • The simplest solution, we will look at this first • Training very deep networks requires further tricks (batch normalization, highway networks, etc.)
Modifying the activation function • The sigmoid activation function has been used for 30 years… • The tanh activation function is essentially equivalent to it: tanh is a scaled and shifted sigmoid (sigmoid returns values in [0, 1], tanh returns values in [-1, 1]) • Problem: the two ends are very “flat” → the derivative is practically 0 → the gradient may easily vanish • To avoid this, the rectifier activation function was proposed first, and since then many new activation functions have been recommended
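The “flat ends” problem can be made concrete with a small sketch. It assumes a hypothetical chain of 10 sigmoid neurons with all weights equal to 1, so the gradient along the path is just the product of the activation derivatives; the function names are illustrative and do not come from the slides.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid; its maximum is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # Derivative of tanh; its maximum is 1, reached at x = 0.
    return 1.0 - math.tanh(x) ** 2

def chained_gradient(grad_fn, pre_activations):
    """Product of activation derivatives along one path (weights assumed to be 1)."""
    g = 1.0
    for z in pre_activations:
        g *= grad_fn(z)
    return g

# Hypothetical pre-activation values at 10 consecutive layers.
best_case = chained_gradient(sigmoid_grad, [0.0] * 10)   # equals 0.25 ** 10
saturated = chained_gradient(sigmoid_grad, [5.0] * 10)   # inputs on the flat end
print(best_case, saturated)
```

Even in the best case the factor is 0.25 per layer, and it becomes far smaller once the neurons operate on the flat ends of the curve.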
The rectifier activation function • In comparison with the tanh activation function: • Formally: f(x) = max(0, x) • For positive input the derivative is always 1, so it never vanishes • For negative input the output and the derivative are both 0 • The nonlinearity is very important! (compare it with a linear activation) • It works fine in spite of the 0 derivative for negative input, but improved versions have been proposed for which the derivative is never 0 • Networks built out of “rectified linear” (ReLU) neurons are currently the de facto standard in deep learning
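A minimal sketch of the rectifier and its derivative. The “leaky” variant is included as one well-known example of an improved version whose derivative is never 0; the slides do not name a specific variant, and the slope 0.01 is an assumed illustrative value.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # 1 for positive input, 0 for negative input.
    # At exactly 0 the derivative is undefined; 0 is used here by convention.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):
    # Assumed variant: a small nonzero slope for negative input.
    return x if x > 0 else slope * x

def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope

# Along a path of 50 active (positive-input) ReLU neurons the
# product of the derivatives stays exactly 1 (weights assumed to be 1).
path_gradient = 1.0
for _ in range(50):
    path_gradient *= relu_grad(2.0)
print(path_gradient)
```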
Even newer activation functions • See also: https://en.wikipedia.org/wiki/Activation_function • Linear, sigmoid, tanh, ReLU, ELU, SELU, Softplus… • Examples: • These sometimes give slightly better results than the ReLU function, but none of them has resulted in a general breakthrough
The restricted Boltzmann machine (RBM) • Very similar to a pair of network layers, but works with binary values • Training: contrastive divergence (CD) • It is an unsupervised method (there are no class labels) • Seeks to reconstruct the input from the hidden representation • It can be interpreted as an approximation of the Maximum Likelihood cost function • Iterative, similar to backpropagation
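The reconstruction idea behind CD training can be sketched as follows. This is a simplified single-step variant (often called CD-1) on one binary vector, written under the usual textbook assumptions; the layer sizes, learning rate, toy data and all function names are illustrative, not taken from the slides.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    # Draw binary states from Bernoulli probabilities.
    return [1 if rng.random() < p else 0 for p in probs]

def hidden_probs(v, W, b_h):
    # P(h_j = 1 | v) = sigmoid(b_h[j] + sum_i v[i] * W[i][j])
    return [sigmoid(b_h[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b_h))]

def visible_probs(h, W, b_v):
    # P(v_i = 1 | h): the same (symmetric) weights are used in the other direction.
    return [sigmoid(b_v[i] + sum(h[j] * W[i][j] for j in range(len(h))))
            for i in range(len(b_v))]

def cd1_update(v0, W, b_v, b_h, lr, rng):
    """One CD-1 step on a binary vector v0; updates W, b_v, b_h in place."""
    ph0 = hidden_probs(v0, W, b_h)
    h0 = sample(ph0, rng)
    pv1 = visible_probs(h0, W, b_v)      # reconstruction of the input
    v1 = sample(pv1, rng)
    ph1 = hidden_probs(v1, W, b_h)
    for i in range(len(v0)):
        for j in range(len(ph0)):
            # data-driven statistics minus reconstruction-driven statistics
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        b_v[i] += lr * (v0[i] - v1[i])
    for j in range(len(ph0)):
        b_h[j] += lr * (ph0[j] - ph1[j])
    return pv1

# Toy usage: 4 visible units, 2 hidden units, two binary patterns, no labels.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
W = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
b_v, b_h = [0.0] * 4, [0.0] * 2
for epoch in range(100):
    for v in data:
        cd1_update(v, W, b_v, b_h, 0.1, rng)
```

Note that no class labels appear anywhere in the update, which is what makes the method unsupervised.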
Deep Belief Network • Deep Belief Network (DBN): restricted Boltzmann machines stacked on each other • Training: the CD algorithm, adding layer after layer
Conversion into a deep neural network • DBNs were proposed for the initialization (“pre-training”) of deep networks • After training, the DBN can be converted into a deep network • RBMs are turned into conventional sigmoid neurons (with the same weights) • A softmax output layer is added on top • So we can continue with supervised backpropagation training • In the early papers DBN pre-training was used to overcome the difficulty of backpropagation training • DBN pre-training resulted in a good starting point for backpropagation • The current view is that DBN pre-training is no longer necessary • We usually train on much more training data • The new activation functions also help a lot • We have new weight initialization methods and other training tricks (e.g. batch normalization)
Cases when DBNs are still useful • Training DBNs using the CD criterion is unsupervised training • DBN pre-training may still be useful when we have a lot of unlabelled data and only a few labelled examples • The connection between two RBM layers is symmetric • We can easily reconstruct the input from a hidden representation • We can easily visualize what hidden representation the network has learned
Convolutional neural networks • Classic architecture: “fully connected” net • Between two layers all neurons are connected to all neurons • The order of the inputs plays no role • If we permute the inputs randomly (but using the same permutation for each vector!), then the network will attain a very similar accuracy • For many classification tasks the order of the inputs indeed plays no role • E.g. our very first example: (fever, joint_pain, cough) → influenza • Obviously, this is just as learnable as (joint_pain, cough, fever) → influenza • But there are tasks where the order (topology) of the features is important • The best example is image recognition • In this case it is worth using a special network structure: the convolutional network
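Why the input order plays no role for a fully connected layer can be checked directly: for any permutation of the inputs there is a layer (with the weight columns permuted in the same way) that computes exactly the same outputs, so the permuted task is equally learnable. The weights and the input vector below are made-up illustrative values.

```python
def dense(x, W, b):
    # One fully connected layer without activation: each output sees every input.
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

x = [1.0, 0.0, 1.0]                 # (fever, joint_pain, cough)
W = [[0.5, -1.0, 2.0],              # hypothetical weights: 2 neurons, 3 inputs
     [1.5, 0.3, -0.7]]
b = [0.1, -0.2]

perm = [1, 2, 0]                    # new order: (joint_pain, cough, fever)
x_perm = [x[p] for p in perm]
W_perm = [[row[p] for p in perm] for row in W]   # same permutation on the weights

out_original = dense(x, W, b)
out_permuted = dense(x_perm, W_perm, b)
print(out_original, out_permuted)
```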
Motivation #1 • There are tasks where the order (relation, topology) of the features is important • Image recognition: we won’t see the same picture if we mix the pixels • But a fully connected network will achieve the same performance in both cases • The arrangement of the pixels contains vital information for our brain, but a fully connected network is not able to exploit it • The pixels form the image together • The semantic connection of nearby pixels is typically stronger (they form objects together)
Motivation #2 • A picture typically has a hierarchical structure • From simpler, local building blocks to larger, more complex notions • The early shape recognition experiments using ANNs considered such complex tasks to be hopeless • They tried only simpler tasks like character recognition • Now we can recognize complex “real” images • Convolutional networks greatly contributed to this success
Motivation #3 • Certain research results show that the human brain also processes images hierarchically • Neurons in the visual cortex fire when seeing certain simple graphical primitives, like edges with a certain direction • Consider the “Thatcher illusion”: for the upside-down image our brain concludes that the mouth, nose and eyes seem to be correct and in place (they form a face together), but does not notice the error in the fine details
Convolutional neural networks • Based on the above motivations, the convolutional neural network • Processes the input hierarchically • As a consequence, it will necessarily be deep • The input of each neuron is local, so they focus on a small block of the image • But going upwards we will cover larger and larger areas • Neurons in the higher layers cover larger and larger parts of the image, but with a decreasing resolution (fine details count less) • The network will be less sensitive to the exact location of objects (this is where the convolution operation will help) • The main application area of convolutional networks is image recognition, but they may be applied in any other area where the input has a hierarchical structure (for example, they are also used in speech recognition)
Building blocks of convolutional networks • The typical architecture of a convolutional neural network • https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/ • Convolutional neurons (also known as filters) • Pooling operation • These two are repeated several times • On top, the network usually has some fully connected layers and a softmax output layer • The network can be trained with the usual backpropagation algorithm (adjusted to the convolution and pooling steps)
The convolution operation • The convolutional neurons are very similar to standard neurons. The main differences are: • Locality: They process only small parts of the input image • Convolution: the same neuron is evaluated at many different locations of the image (as these evaluations use the same weights, this property is also known as “weight sharing”) • The operation performed by the neurons is very similar to the filters known from image processing, this is why these neurons are sometimes called filters • But in this case we also apply a nonlinear activation function on the filter output • And the parameters (weights) will be tuned automatically, not specified manually
The convolution operation • An example of what a convolutional neuron does • Input image, filter (weight matrix of the neuron), and the result: • The evaluation can be performed at each position, so the result is a matrix with the same size as the input • But the processed range may also be restricted • Stride: the step size of the filter. E.g. stride 2 means that we skip every second position • Zero padding: fitting the filter on the edge pixels may require “padding” the image (with zeros); whether to do so is a design decision
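A dependency-free sketch of the filter evaluation with stride and zero padding. The 5×5 binary image and the 3×3 filter are illustrative values; as is common in deep learning code, the filter is applied without flipping (strictly speaking a cross-correlation), and the nonlinear activation that a convolutional neuron would apply to the result is left out.

```python
def conv2d(image, kernel, stride=1, pad=0):
    """Evaluate one filter at every allowed position of a 2-D image (lists of lists)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    # Zero padding: surround the image with `pad` rows/columns of zeros.
    padded = [[0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    out = []
    for r in range(0, h + 2 * pad - kh + 1, stride):       # stride = step size
        row = []
        for c in range(0, w + 2 * pad - kw + 1, stride):
            # The same weights are used at every position ("weight sharing").
            row.append(sum(padded[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

image = [[1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 0],
         [0, 1, 1, 0, 0]]
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

no_padding = conv2d(image, kernel)              # 3x3 result
same_size = conv2d(image, kernel, pad=1)        # 5x5 result, same size as the input
strided = conv2d(image, kernel, stride=2)       # 2x2 result
```

With a 3×3 filter, one row and column of zero padding on each side is what keeps the output the same size as the input; without padding the output shrinks.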
The convolution operation • We expect the convolutional filters to learn abstract notions (e.g. to be able to detect an edge or a nose) • For this we usually need more neurons, so we train a set of neurons for the same task in parallel. In this example we use 3 neurons to process a position, resulting in 3 output matrices. We call this the “depth” of the output • We train further layers on the output of the given layer, so we interpret the layer as a feature extractor for the subsequent layer (hopefully extracting more and more abstract features as we go up…) • This is why it is called a “feature map” in the image
The pooling operation • As we go up we don’t want to preserve the fine details • E.g. if the “nose detector” found a nose, then we don’t want to keep its precise position and shape in the higher layers • This is what the pooling step does: • In a local neighborhood it pools the output values • For example, by taking their maximum • Advantages: • Shift invariance within the pooling region (no matter where we found the nose, the maximum will be the same) • We gradually decrease the output size (downsampling), which reduces the number of parameters in the subsequent layers • Going up, we cover larger and larger areas with filters of the same size: hierarchic processing
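Max pooling can be sketched in a few lines. The 2×2 region with stride 2 is an assumed (though common) setting, and the feature maps are made-up values; the second part illustrates the shift invariance within a pooling region.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the maximum of each size x size neighbourhood."""
    h, w = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, w - size + 1, stride)]
            for r in range(0, h - size + 1, stride)]

fm = [[1, 1, 2, 4],
      [5, 6, 7, 8],
      [3, 2, 1, 0],
      [1, 2, 3, 4]]
pooled = max_pool(fm)               # 4x4 input becomes 2x2 (downsampling)

# A strong "nose detector" response at two different positions
# inside the same pooling region gives the same pooled output.
nose_a = [[9, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
nose_b = [[0, 0, 0, 0], [0, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
same = max_pool(nose_a) == max_pool(nose_b)
```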
Hierarchic processing • We stack a lot of convolution+pooling steps on each other • We expect the higher layers to extract more and more abstract features • This is more or less true (example from a face recognition network): • The final classification is performed by fully connected layers
Summary • The advantages and drawbacks of convolutional networks • Advantages of convolution: local processing of input blocks using the same weights, which means fewer parameters and shift invariance • Drawback: slower computation than with fully connected nets • Advantages of pooling: parameter reduction, shift invariance, hierarchic processing from local to larger context • Drawback: fine details are lost, although they would be useful in certain cases • Advantages of hierarchical processing: complex pictures have a hierarchic structure, so it makes sense to process them hierarchically • Drawback: the network will inevitably be deep, and training deep networks is problematic
Examples • Complex image recognition task: multiple objects in the same image • ImageNet database: 1.2 million high-resolution images, 1000 target labels • The network can have 5 guesses; the answer is considered correct if the correct label is among them • Before convolutional networks the smallest error was 26%
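The “5 guesses” evaluation rule corresponds to what is usually called the top-5 error. A minimal sketch follows; the scores and labels are made-up values for a toy problem with 6 classes instead of 1000.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose correct label is not among the k highest-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        guesses = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in guesses:
            errors += 1
    return errors / len(labels)

# Two examples with identical class scores; only class 3 falls outside the 5 best guesses.
scores = [[0.10, 0.20, 0.30, 0.05, 0.25, 0.12],
          [0.10, 0.20, 0.30, 0.05, 0.25, 0.12]]
labels = [3, 0]
error = top_k_error(scores, labels)       # one of the two examples is wrong
top1 = top_k_error(scores, labels, k=1)   # with a single guess both are wrong
```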
More examples • First convolutional network on this task (2012): 16% error • Now: <5%
Modelling time series with recurrent neural networks • So far we assumed that consecutive examples are independent of each other; this is true for most recognition tasks • E.g. image recognition • But there are tasks where the order of the examples carries vital information • This typically occurs when we model time series • E.g. speech recognition, language processing, handwriting recognition, video analysis, stock exchange rates, … • The question is how to modify the network so that it takes the neighbors of an input vector into consideration • Feed-forward network on several vectors: time-delay neural network • Recurrent neural network • Recurrent neural network with memory: long short-term memory network
Using neighboring input vectors • The network that processes one input vector looks like this (the line corresponds to full connection): • We can easily modify this to process more than one input vector • No modification is required in the network structure • Drawback 1: the input size greatly increases • Drawback 2: the input context is larger, but still finite
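Feeding neighboring vectors to an unchanged network amounts to concatenating each vector with its neighbors before it enters the input layer. In the sketch below the context width (one neighbor on each side) and the way the sequence edges are handled (repeating the first or last vector) are assumptions made for illustration.

```python
def stack_context(frames, left=1, right=1):
    """Concatenate every frame with `left` previous and `right` following frames."""
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            # Assumed edge handling: repeat the first/last frame at the borders.
            window.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(window)
    return stacked

frames = [[0, 0], [1, 1], [2, 2], [3, 3]]      # 4 input vectors of size 2
stacked = stack_context(frames)                 # 4 input vectors of size 6
```

The input size grows linearly with the context width (drawback 1), and anything outside the window remains invisible to the network (drawback 2).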
Time-Delay Neural Network • Processes several neighboring input vectors • But (at the lower level) we perform the same processing on each vector • The results are combined only at a higher level • The 5 blocks of the hidden layer use the same weights (“weight sharing”) • Advantage: we can increase the input size without increasing the number of processing neurons • If the size of the hidden layer is relatively small, then the input size of the output layer increases only slowly • Of course, both the lower and upper processing parts may consist of several layers
Time-Delay Neural Network 2 • The TDNN is a feedforward network • Backpropagation can be used as before • Backpropagation through the “weight sharing” is the only complication • Evaluation (“forward pass”): • Similar to a fully connected net, but the same weights are used at many positions • Training (“backward pass”): • The error values obtained along the different paths belong to the same weights, so they must be summed before the update • The TDNN is a close relative of convolutional neural networks
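The summing of error values over the shared weights can be sketched on a deliberately tiny TDNN. It assumes one tanh hidden neuron per input position (all positions sharing the weight vector `w`) and a linear output with weights `u`; all names and sizes are illustrative.

```python
import math

def tdnn_forward(frames, w, u):
    # Forward pass: the same weights w are applied to every input vector.
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for x in frames]
    y = sum(uk * hk for uk, hk in zip(u, hidden))   # combination at the higher level
    return y, hidden

def shared_weight_gradient(frames, w, u):
    """Gradient of the output y with respect to the shared weights w."""
    _, hidden = tdnn_forward(frames, w, u)
    grad = [0.0] * len(w)
    for x, uk, hk in zip(frames, u, hidden):
        local = uk * (1.0 - hk ** 2)        # dy / d(pre-activation) at this position
        for i, xi in enumerate(x):
            grad[i] += local * xi           # contributions of all positions are summed
    return grad

frames = [[0.1 * k, -0.2, 0.3] for k in range(5)]   # 5 neighboring input vectors
w = [0.5, -0.3, 0.8]
u = [1.0, -1.0, 0.5, 2.0, 0.3]
grad = shared_weight_gradient(frames, w, u)
```

Comparing `grad` with a finite-difference estimate of the derivative is a simple way to confirm that summing over the positions is the right rule.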
Recurrent neural networks (RNN) • We introduce real recurrent connections • So the input consists not only of the current input, but also of the previous output • We usually feed back the hidden layer rather than the output layer • The network now sees its previous hidden states, so it is like “adding memory” to the network
Backpropagation through time • How can we evaluate an RNN? • From left to right, from vector to vector • We cannot skip vectors • In the first step h_{t-1} must be initialized somehow • How can we train an RNN? • Theoretically, in each step we need the previous h_{t-1} → infinite recursion • In practice, however, any training data set is finite • We may also cut it into chunks artificially… • This way, the network can be “unfolded” in time • And can be trained using backpropagation
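The left-to-right evaluation can be sketched as a loop over the input vectors. The sketch assumes a single recurrent hidden layer with tanh activation and initializes the first h_{t-1} to zeros, which is one common choice; the weight names are illustrative.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    # New hidden state from the current input and the previous hidden state.
    return [math.tanh(sum(W_xh[j][i] * x[i] for i in range(len(x)))
                      + sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
                      + b[j])
            for j in range(len(b))]

def rnn_forward(xs, W_xh, W_hh, b):
    h = [0.0] * len(b)          # assumed initialization of h_{t-1} in the first step
    states = []
    for x in xs:                # vectors are processed in order, none can be skipped
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states               # one hidden state per time step ("unfolded" network)

xs = [[1.0], [1.0], [1.0]]      # the same input at every time step
with_memory = rnn_forward(xs, [[0.5]], [[0.8]], [0.0])
without_memory = rnn_forward(xs, [[0.5]], [[0.0]], [0.0])
```

With a nonzero recurrent weight the hidden state differs from step to step even though the input is constant, which is the “memory” effect of the feedback connection.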
Backpropagation through time • What are the difficulties of “backpropagation through time”? • The “copies” at different positions have the same weights! • The errors must be collected, just as we saw in the case of “weight sharing” • The training involves very long paths along time • Theoretically, the current output is influenced by all previous inputs • In practice the gradients along the long paths may vanish or blow up • The training of RNNs has instability problems • RNNs usually fail to learn very long-term dependencies • These are the same problems as with the training of very deep networks
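Vanishing and exploding gradients along long paths can be illustrated with a strongly simplified model: a single linear recurrent neuron h_t = w · h_{t-1} + x_t (an assumption made only for this sketch), for which the derivative of h_T with respect to h_0 is simply w multiplied by itself T times.

```python
def gradient_through_time(w_hh, steps):
    """d h_T / d h_0 for the linear scalar recurrence h_t = w_hh * h_{t-1} + x_t."""
    g = 1.0
    for _ in range(steps):
        g *= w_hh            # one factor per time step on the path
    return g

vanishing = gradient_through_time(0.9, 100)   # shrinks towards 0
exploding = gradient_through_time(1.1, 100)   # grows without bound
stable = gradient_through_time(1.0, 100)      # stays at 1 only in this special case
print(vanishing, exploding, stable)
```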
Long short-term memory (LSTM) nets • We want to allow the neurons to learn which previous inputs are important and which may be forgotten • This would also solve the problem of learning long-term relations • We introduce an inner state that will act as a memory, and it will be less sensitive to backpropagation • The information will flow through several paths • A “gate” will control when the memory is deleted (“forget gate”) • And what to store from the current input (“input gate”) • And what will contribute to the output (“output gate”) • This new model is the LSTM neuron; we will replace our previous recurrent neurons with these
RNN vs. LSTM neuron • RNN (with tanh activation): • LSTM:
How the gates operate • The gating weights multiply each component of the input • The weights are kept between 0 and 1 using the sigmoid function • The optimal weights are learned • Example with 0 and 1 gating weights:
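A minimal sketch of component-wise gating. The gate pre-activations below are made-up values, chosen so that the sigmoid pushes the gating weights very close to 1, very close to 0, and exactly 0.5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apply_gate(values, gate_pre_activations):
    # Each component is multiplied by its own gating weight in (0, 1).
    return [sigmoid(g) * v for g, v in zip(gate_pre_activations, values)]

values = [3.0, -2.0, 5.0]
gated = apply_gate(values, [50.0, -50.0, 0.0])   # pass, block, pass half
print(gated)
```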
LSTM neuron • Forget gate: which components should be discarded from memory C • Input gate: which input components should be stored in C
LSTM neuron • The cell state („memory”) is updated using the forget gate and the input gate: • The new value of the hidden state h is calculated as:
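The formulas of this slide were lost in extraction, so the sketch below follows the commonly used textbook formulation of an LSTM step without peephole connections; it may differ in detail from the figure that was on the slide. The parameter names (`W_f`, `b_f`, …) are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, z):
    return [sum(w * zi for w, zi in zip(row, z)) + bj for row, bj in zip(W, b)]

def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM layer; p is a dict of weight matrices and biases."""
    z = h_prev + x                                             # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)]  # candidate memory content
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # output gate
    # Memory update: keep part of the old state, add part of the candidate.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed memory.
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

# Toy parameters: 1 hidden unit, 1 input, all weights and biases zero.
zeros = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zeros.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h, c = lstm_step([1.0], [0.0], [2.0], zeros)

# A large forget-gate bias and a very negative input-gate bias keep the memory unchanged.
keep = dict(zeros, b_f=[50.0], b_i=[-50.0])
h_keep, c_keep = lstm_step([1.0], [0.0], [2.0], keep)
```

Because the memory is updated additively and scaled only by the forget gate, a gate value close to 1 lets information (and gradients) pass through many time steps almost unchanged.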
LSTM network • We can construct a network from LSTM cells in just the same way as from standard recurrent neurons • Of course, the training is slower and more complicated • But for most tasks it gives better results than a standard RNN • Variants of LSTMs: • There are variants with more paths • LSTM with “peephole” connections • Some people try to simplify LSTMs • GRU: gated recurrent unit • This is one of the most active current research topics • https://www.slideshare.net/LarryGuo2/chapter-10-170505-l
Bidirectional recurrent networks • We process the input in both directions • There is a hidden layer for both directions (they are independent) • The output layer combines the two hidden layers • Advantage: takes both the earlier and the later context into consideration • Disadvantage: cannot operate in real time
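A minimal sketch of the bidirectional idea, assuming scalar inputs and a single tanh recurrent neuron per direction to keep the code short. The two directions have independent weights, and the output layer (omitted here) would receive the pair of states for each position; all names and values are illustrative.

```python
import math

def run_rnn(xs, w_x, w_h):
    # Simple scalar recurrent neuron, evaluated in the order given.
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def bidirectional(xs, fwd_weights, bwd_weights):
    forward = run_rnn(xs, *fwd_weights)                 # sees the earlier context
    backward = run_rnn(xs[::-1], *bwd_weights)[::-1]    # sees the later context
    return list(zip(forward, backward))                 # one pair per position

a = bidirectional([1.0, 0.0, 0.0], (0.5, 0.8), (0.5, 0.8))
b = bidirectional([1.0, 0.0, 2.0], (0.5, 0.8), (0.5, 0.8))
```

Changing only the last input changes the backward state at the first position but not the forward one. This is also why the whole sequence must be available before the output can be computed, so the network cannot operate in real time.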
Deep recurrent networks • Of course, we can put many recurrent layers on each other • These can even be bidirectional • But this is less usual than with feed-forward layers (or it is done with fewer layers) • Recurrent layers have a larger representational power • And their training is much slower and more complicated