310 likes | 333 Views
Dive deep into regularization methods for neural networks outlined in Goodfellow, Bengio, Courville's "Deep Learning". Learn about parameter norm penalties, L1 and L2 regularization, dataset augmentation, early stopping, dropout, parameter sharing, and more to optimize your models and prevent overfitting. Understand how regularization impacts network flexibility, variance, and bias. Explore concepts like autoencoders, sparse representation, dropout techniques, and the benefits of regularization. Enhance your neural network skills with practical insights and strategies.
E N D
Neural networks (3) Regularization Autoencoder Recurrent neural network Goodfellow, Bengio, Courville, “Deep Learning” Charte, “A practical tutorial on autoencoders for nonlinear feature fusion” Le, “A tutorial on Deep Learning”
More on regularization Neural network models are extremely versatile – potential for overfitting Regularization aims at reducing the EPE, not the training error. Too flexible High variance Low bias Too rigid Low variance High bias
More on regularization Parameter Norm Penalties The regularized loss function: L2 Regularization: The gradient becomes: The update step becomes:
More on regularization L1 Regularization The loss function: The gradient: Dataset Augmentation Generate new (x, y) pairs by transforming the x inputs in the training set. Ex: In object recognition, translating the training images a few pixels in each direction rotating the image or scaling the image
More on regularization Early Stopping treat the number of training steps as another hyperparameter requires a validation set can be combined with other regularization strategies
More on regularization Parameter Sharing CNN is an example Grows large network without dramatically increasing the number of unique model parameters without requiring a corresponding increase in training data. Sparse Representation L1 penalization induces a sparse parametrization Representational sparsity means a representation where many of the elements of the representation are zero Achieved by: L1 penalty on the elements of the representation hard constraint on the activation values
More on regularization sparse parametrization Representational sparsity h is a function of x that represents the information present in , but does so with a sparse vector.
More on regularization Dropout Bagging (bootstrap aggregating) reduces generalization error by combining several models - train several different models separately, then have all of the models vote on the output for test examples
More on regularization Dropout Neural networks have many solution points. random initialization random selection of minibatches differences in hyperparameters different outcomes of non-deterministic implementations different members of the ensemble make partially independent errors However, training multiple models is impractical when each model is a large neural network.
More on regularization Dropout provides an inexpensive approximation to training and evaluating a bagged ensemble of exponentially many neural networks. Use a minibatch-based learning algorithm Each time, randomly sample a different binary mask vector μto apply to all of the input and hidden units in the network. The probability of sampling a mask value of one (causing a unit to be included) is a hyperparameter Run forward propagation, back-propagation, and the learning update as usual.
More on regularization Dropout training minimizes EμJ(θ, μ). The expectation contains exponentially many terms: 2number of non-output nodes The unbiased estimate is obtained by randomly sampling values of μ. Most of the exponentially large number of models are not explicitly trained. A tiny fraction of the possible sub-networks are each trained for a single step, and the parameter sharing causes the remaining sub-networks to arrive at good settings of the parameters. “weight scaling inference rule”: approximate theensembleoutputpensemble by evaluating p(y | x) in one model: the model with all units, but with the weights going out of unitimultiplied by the probability of including unit i.
Autoencoder Seeks data structure within X Serves as a good starting point to fit CNN Map data {x(1), x(2),…, x(m)} to {z(1), z(2),…, z(m)} of lower dimension. Recover X from Z. In the linear case: Example: data compression
Autoencoder This is a linear autoencoder. The objective function: Can be minimized by stochastic gradient descent.
Autoencoder Nonlinear autoencoder– finds nonlinear structures in the data. Multiple layers can extract highly nonlinear structure.
Autoencoder Undercomplete, if the encoding layer has a lower dimensionality than the input. Overcomplete, if the encoding layer has the same or more units than the input (restrictions needed to avoid copying the data)
Autoencoder Purposes: nonlinear dimensionality reduction feature learning de-noising generative modeling Sparse autoencoder: A sparsity penalty on the code layer often to learn features for another task such as classification
Autoencoder We can think of feedforward networks trained by supervised learning as performing a kind of representation learning. - the last layer is a (generalized) linear classifier - previous layers learn data presentation for the classifier - hidden layers take on nonlinear properties that make the classification task easier. Unsupervised learning tries to capture the shape of the input distribution. – Can sometimes be useful for another task supervised learning with the same input domain Greedy layer-wise unsupervised pre-training can be used.
Autoencoder Autoencoder for the pre-training of a classification task:
Autoencoder Denoisingautoencoder (DAE): receives a corrupted data point as input, and predict the original, uncorrupted data point as its output. Minimize
Recurrent neural networks(RNNs) recurrent neural networks: operating on a sequence that contains vectors x(t) with the time step index t ranging from 1 to τ. Challenges: A feedforward network that processes sentences of fixed length won’t work on extracting the time, as weights are associated to location. “I went to Nepal in 2009” “In 2009, I went to Nepal.” τ can be different for different inputs. Goal: Share statistical strength across different sequence lengths and across different positions.
Recurrent neural networks Ex: statistical language modeling - predict the next word given previous words Predicting stock price based on the existing data:
Computational Graph a dynamical system driven by an external signal x(t),
Recurrent neural networks Three sets of parameters: input to hidden weights (W), hidden to hidden weights (U) hidden to label weight (V) Minimize a cost function, e.g. (y − f (x))2 to obtain the appropriate weights. Can use back propagation again – ”Back propagation through time (BPTT)”
Recurrent neural networks Assume the output is discrete. For each time step from t = 1 to t = τ, The input sequence and output sequence are of same length The total loss:
Recurrent neural networks gradient computation involves A forward propagation pass moving left to right through the unrolled graph A backward propagation pass moving right to left through the graph (back-propagation through time, or BPTT)
Deep learning • Major areas of application • Speech Recognition and Signal Processing • Object Recognition • Natural Language Processing • …… • The technique is not applicable to all areas. In some challenging areas where data is limited: • Training data size (subjects) is still too small compared to the number of variables • Neural network could be applied when human selection of variables is done first. • Existing knowledge, in the form of existing networks, are already explicitly used, instead of being learned from data. They are hard to beat with a limited amount of data. IEEE Trans Pattern Anal Mach Intell. 2013 Aug;35(8):1798-828