CV 192: Introduction to Deep Learning: Part 2
Oren Freifeld, Ron Shapira Weber
Computer Science, Ben-Gurion University
Contents • Computational graph • Chain rule • Backpropagation [Figure from previous slide taken from https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html]
Gradient-Based Learning [https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]
Gradient-Based Learning • In general: • We want to update our parameters $\theta$ (in deep learning, usually the weights) so that our loss function, $L(\theta)$, reaches a minimum. • Compute $\nabla_\theta L$, the gradient of the loss function. • Update the weights ($\theta$) in the direction opposite to the gradient ($-\nabla_\theta L$) with a pre-defined step size = learning rate ($\eta$). • Iterate over the data until the desired loss is achieved.
The Gradient • When the problem is convex: • Want: $\theta^* = \arg\min_\theta L(\theta)$ (where $L$ is the loss function) • In $\mathbb{R}$: set the derivative to zero. • In $\mathbb{R}^n$: set the gradient (vector of partial derivatives) to zero: $\nabla_\theta L = \left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_n}\right)^T = 0$
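As a concrete illustration of the convex case (a standard example, not from the original slides), consider linear least squares, where setting the gradient to zero gives a closed-form solution:

$$L(\theta) = \lVert X\theta - y \rVert_2^2 \quad\Rightarrow\quad \nabla_\theta L = 2X^T(X\theta - y) = 0 \quad\Rightarrow\quad \theta^* = (X^T X)^{-1} X^T y,$$

assuming $X^T X$ is invertible.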
Gradient Descent • The gradient of a function, $\nabla_\theta L(\theta)$, points in the direction that maximizes the function. Therefore, our interest is to "descend" the gradient by moving in the direction opposite to it. • When updating our parameters, we need to define the magnitude of the change. This is called the learning rate ($\eta$). Our update rule is thus: • At iteration $t$: $\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$ • The gradient is calculated by taking the expectation over the entire dataset: $\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell(x_i, y_i; \theta)$ • This is also called "batch gradient descent".
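A minimal NumPy sketch of the batch update rule above, using a toy least-squares loss (the data, function names, and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def batch_gradient_descent(grad_fn, theta0, lr=0.1, n_iters=100):
    """Repeatedly step opposite the full-batch gradient."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta -= lr * grad_fn(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return theta

# Toy example: least-squares loss L(theta) = ||X theta - y||^2 / N
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
grad = lambda th: 2.0 / len(X) * X.T @ (X @ th - y)   # full-batch gradient
print(batch_gradient_descent(grad, np.zeros(3)))       # approaches true_theta
```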
Learning Rate [figures source: http://cs231n.github.io]
Stochastic Gradient Descent • Going over the entire dataset might be very expensive (recall that ImageNet contains 1.2 million images). • Instead, separate the dataset into small batches and compute the gradient on them. • Mini-batches must be selected randomly and independently. • When |batch-size| = 1, SGD is also called "on-line learning".
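A hedged sketch of mini-batch SGD, reusing the toy least-squares setup from the previous sketch (shuffling each epoch so that batches are sampled randomly and independently; all names are illustrative):

```python
import numpy as np

def minibatch_sgd(X, y, theta0, lr=0.05, batch_size=16, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    n = len(X)
    for _ in range(n_epochs):
        perm = rng.permutation(n)                 # random, independent mini-batches
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)  # gradient on this batch only
            theta -= lr * grad
    return theta
```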
SGD - Justification • Unbiased estimate of the gradient: the expectation of the mini-batch gradient equals the full-batch gradient. • Standard error of the mean: $\sigma/\sqrt{n}$, where $n$ is the batch size. • For instance: estimating the average gradient over 100 vs. 10,000 examples. • The latter requires 100 times more computation than the former, but reduces the standard error of the mean by only a factor of 10. • Reduces the generalization error: small batches offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. • [Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
SGD - disadvantages • Training with small batch size might require a small learning rate due to the high variance in the estimate of the gradient. • In addition, might require learning rate decay due to oscillations near the minimum. [figure https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]
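One common way to implement the learning-rate decay mentioned above is a simple step schedule; this is an illustrative sketch, not the schedule used in the course:

```python
def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return lr0 * (drop ** (epoch // epochs_per_drop))

# e.g. lr0 = 0.1 -> 0.1 for epochs 0-9, 0.05 for epochs 10-19, 0.025 for 20-29, ...
```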
SGD • [Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Computing the gradient • Computing the gradient numerically with finite differences: • Only an approximation. • Cost is linear in the number of parameters. • Not scalable to modern deep neural networks (~millions of parameters). • Requires a forward pass for every parameter to calculate its finite difference. • Computing the gradient analytically with calculus: • Direct formula for the gradient. • Fast to compute. • Need to know the derivative of every function in our model w.r.t. the parameters.
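A hedged sketch of the centered finite-difference approximation, typically used only as a gradient check against the analytic gradient (function and variable names are illustrative):

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    """Centered finite differences over a 1-D parameter vector:
    two forward passes per perturbed parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        plus = loss_fn(theta)
        theta[i] -= 2 * eps
        minus = loss_fn(theta)
        theta[i] += eps                      # restore the original value
        grad[i] = (plus - minus) / (2 * eps)
    return grad
```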
Example: ReLU gradient • We compute the activation function element-wise: $\text{ReLU}(x) = \max(0, x)$. • Its derivative is 1 where the input is positive and 0 otherwise (at exactly 0 it is undefined; in practice 0 is used). • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
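A minimal sketch of the element-wise ReLU forward and backward passes (taking 0 as the subgradient at x = 0 is an assumption commonly made in practice):

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x)          # element-wise max(0, x)

def relu_backward(dout, x):
    return dout * (x > 0)            # pass the gradient only where the input was positive
```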
How do we compute the gradient for every parameter? • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Computational graphs • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Computational graphs • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Worked example on a computational graph • Want: the gradient of the output with respect to each input. • Solution: apply the chain rule backwards through the graph, node by node. [example from: http://cs231n.stanford.edu]
Patterns in backward flow • Addition node (+): distributes the gradient equally to all of its inputs. • Max node (max): "routes" the gradient to the input that had the larger value. • Multiplication node (*): multiplies the upstream gradient by the value of the other (switched) input. [example from: http://cs231n.github.io/optimization-2/]
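A toy sketch of these local rules on the tiny graph f = (x + y) * max(z, w); the numbers and variable names are illustrative:

```python
# Forward pass for f = (x + y) * max(z, w)
x, y, z, w = 3.0, -1.0, 2.0, 5.0
a = x + y            # addition node
b = max(z, w)        # max node
f = a * b            # multiplication node

# Backward pass (upstream gradient df/df = 1)
df = 1.0
da = df * b          # multiply node: upstream gradient times the *other* input
db = df * a
dx = da              # add node: distributes the gradient equally
dy = da
dz = db * (z > w)    # max node: routes the gradient to the larger input
dw = db * (w >= z)
print(dx, dy, dz, dw)   # 5.0 5.0 0.0 2.0
```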
How do we compute the gradient for every parameter? • Our simple 2-layer neural network model from the previous lecture: $\hat{y} = W_2 h + b_2$, • where $h = \sigma(W_1 x + b_1)$ and $\sigma$ is the activation function (e.g., ReLU). • If we choose a quadratic loss: $L = \tfrac{1}{2}\lVert \hat{y} - y \rVert^2$ • Want: $\frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial b_1}, \frac{\partial L}{\partial W_2}, \frac{\partial L}{\partial b_2}$ • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
How do we compute the gradient for every parameter? • Let's solve for the gradients of the weights layer by layer, applying the chain rule backwards through the network (see the sketch below). • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
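A hedged sketch of backpropagation for the 2-layer network and quadratic loss above, assuming a ReLU activation (the shapes and variable names are illustrative, not the course's reference implementation):

```python
import numpy as np

def two_layer_backprop(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)            # ReLU hidden layer
    y_hat = W2 @ h + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass (chain rule, layer by layer)
    d_yhat = y_hat - y               # dL/d y_hat for the quadratic loss
    dW2 = np.outer(d_yhat, h)        # dL/dW2 = (y_hat - y) h^T
    db2 = d_yhat
    dh = W2.T @ d_yhat               # propagate into the hidden layer
    dz1 = dh * (z1 > 0)              # ReLU routes the gradient where z1 > 0
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)
```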
Back-Propagation • [Figure: a numeric worked example of back-propagation on a computational graph; the original slides annotate each node with its forward-pass value and backward-pass gradient.]
Use intermediate variables • Break a complicated expression into intermediate variables and apply the chain rule through each of them in turn. [Slide from http://cs231n.stanford.edu]
In practice, use vector/matrix operations • Let $y = Wx$, with $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. • Want $\frac{\partial y}{\partial x}$: we need to calculate the Jacobian of $y$ w.r.t. $x$. [http://cs231n.stanford.edu/vecDerivs.pdf]
In practice, use vector/matrix operations Let’s look at a single example: [http://cs231n.stanford.edu/vecDerivs.pdf]
In practice, use vector/matrix operations • We’ve shown that: • So we can write the Jacobian as: • Which is itself. Therefor: [http://cs231n.stanford.edu/vecDerivs.pdf]
Vanishing / exploding gradient • Optimization becomes tricky when the computational graph becomes extremely deep. • Suppose we need to repeatedly multiply by a weight matrix $W$, so that after $t$ steps we have multiplied by $W^t$. • Suppose that $W$ has an eigendecomposition $W = V \operatorname{diag}(\lambda) V^{-1}$; then $W^t = V \operatorname{diag}(\lambda)^t V^{-1}$. • Any eigenvalue $\lambda_i$ that is not near an absolute value of 1 will either explode (if $|\lambda_i| > 1$) or vanish (if $|\lambda_i| < 1$). • [Example from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
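A small numerical illustration of this effect (the matrix below is made up for the demo):

```python
import numpy as np

W = np.array([[1.2, 0.0],
              [0.0, 0.8]])                  # eigenvalues 1.2 and 0.8
v = np.array([1.0, 1.0])
for t in (1, 10, 50):
    print(t, np.linalg.matrix_power(W, t) @ v)
# t=10: [~6.19, ~0.107] -> the 1.2-direction explodes, the 0.8-direction vanishes
```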
Vanishing / exploding gradient (activation function) • [Figure from the original slide.]
Vanishing gradient problem • Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$ • Its derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, where $\sigma'(x) \le 0.25$ for all $x$, so gradients shrink as they are multiplied through many sigmoid layers.
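A quick sketch of why repeated sigmoid derivatives shrink the gradient (the derivative never exceeds 0.25):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

print(dsigmoid(0.0))          # 0.25, the maximum possible value
print(0.25 ** 10)             # ~9.5e-07: ten sigmoid layers crush the gradient
```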
Activation Functions • From Stanford cs231n
Universal Approximation Theorem • Given any continuous function $f$ (on a compact subset of $\mathbb{R}^n$) and some $\varepsilon > 0$, there exists a neural network $g$ with one hidden layer such that $|f(x) - g(x)| < \varepsilon$ for all $x$. • The theorem is not constructive: it does not provide a way to find the right parameters (or even their number). • When using a simple feedforward neural network, usually 2-3 hidden layers suffice (in contrast with ConvNets, some of which are 120+ layers deep).
Python Tutorial • http://cs231n.github.io/neural-networks-case-study/
Summary • Gradient-based learning: • Gradient descent algorithm: • Batch • Mini-batch • Stochastic • Computational graph • Chain rule • Vanishing / exploding gradient • Back-propagation