CV 192: Introduction to Deep Learning: Part 2
Oren Freifeld, Ron Shapira Weber
Computer Science, Ben-Gurion University
Contents • Computational graph • Chain rule • Backpropagation [Figure from previous slide taken from https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html]
Gradient-Based Learning [https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]
Gradient-Based Learning • In general: • We want to update our parameters $\theta$ (in deep learning, usually the weights) so that our loss function, $L(\theta)$, reaches a minimum. • Compute $\nabla_\theta L$, the gradient of the loss function. • Update the weights ($\theta$) in the direction opposite to the gradient ($-\nabla_\theta L$) with a pre-defined step size = learning rate ($\eta$). • Iterate over the data until the desired loss is achieved.
The Gradient • When the problem is convex: • Want: $\theta^* = \arg\min_\theta L(\theta)$ (where $L$ is the loss function) • In $\mathbb{R}$: set the derivative to zero. • In $\mathbb{R}^n$: set the gradient (vector of partial derivatives) to zero: $\nabla_\theta L = \left(\frac{\partial L}{\partial \theta_1}, \ldots, \frac{\partial L}{\partial \theta_n}\right)^T = 0$
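As a concrete illustration of the convex case (a standard example, not from the original slides), consider linear least squares, where setting the gradient to zero gives a closed-form solution:

$$L(\theta) = \lVert X\theta - y \rVert_2^2 \quad\Rightarrow\quad \nabla_\theta L = 2X^T(X\theta - y) = 0 \quad\Rightarrow\quad \theta^* = (X^T X)^{-1} X^T y,$$

assuming $X^T X$ is invertible.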
Gradient Descent • The gradient of a function, $\nabla_\theta L(\theta)$, points in the direction that maximizes the function. Therefore, our interest is to "descend" the gradient by moving in the direction opposite to it. • When updating our parameters, we need to define the magnitude of the change. This is called the learning rate ($\eta$). Our update rule is thus: • At iteration $t$: $\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)$ • The gradient is calculated by taking the expectation over the entire dataset: $\nabla_\theta L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta \ell(x_i, y_i; \theta)$ • This is also called "batch gradient descent".
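A minimal NumPy sketch of the batch update rule above, using a toy least-squares loss (the data, function names, and hyperparameters are illustrative, not from the slides):

```python
import numpy as np

def batch_gradient_descent(grad_fn, theta0, lr=0.1, n_iters=100):
    """Repeatedly step opposite the full-batch gradient."""
    theta = theta0.copy()
    for _ in range(n_iters):
        theta -= lr * grad_fn(theta)   # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return theta

# Toy example: least-squares loss L(theta) = ||X theta - y||^2 / N
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
grad = lambda th: 2.0 / len(X) * X.T @ (X @ th - y)   # full-batch gradient
print(batch_gradient_descent(grad, np.zeros(3)))       # approaches true_theta
```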
Learning Rate [figures source: http://cs231n.github.io]
Stochastic Gradient Descent • Going over the entire dataset might be very expensive (recall that ImageNet contains 1.2 million images). • Instead, separate the dataset into small batches and compute the gradient on them. • Mini-batches must be selected randomly and independently. • When |batch-size| = 1, SGD is also called "on-line learning".
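A hedged sketch of mini-batch SGD, reusing the toy least-squares setup from the previous sketch (shuffling each epoch so that batches are sampled randomly and independently; all names are illustrative):

```python
import numpy as np

def minibatch_sgd(X, y, theta0, lr=0.05, batch_size=16, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    n = len(X)
    for _ in range(n_epochs):
        perm = rng.permutation(n)                 # random, independent mini-batches
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)  # gradient on this batch only
            theta -= lr * grad
    return theta
```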
SGD - Justification • Unbiased estimate of the gradient: the expectation of the mini-batch gradient equals the full-batch gradient. • Standard error of the mean: $\sigma/\sqrt{n}$, where $n$ is the batch size. • For instance: estimating the average gradient over 100 vs. 10,000 examples. • The latter requires 100 times more computation than the former, but reduces the standard error of the mean by only a factor of 10. • Reduces the generalization error: small batches offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. • [Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
SGD - disadvantages • Training with small batch size might require a small learning rate due to the high variance in the estimate of the gradient. • In addition, might require learning rate decay due to oscillations near the minimum. [figure https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]
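One common way to implement the learning-rate decay mentioned above is a simple step schedule; this is an illustrative sketch, not the schedule used in the course:

```python
def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return lr0 * (drop ** (epoch // epochs_per_drop))

# e.g. lr0 = 0.1 -> 0.1 for epochs 0-9, 0.05 for epochs 10-19, 0.025 for 20-29, ...
```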
SGD • [Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Computing the gradient • Computing the gradient numerically with finite differences: • Only an approximation. • Cost is linear in the number of parameters. • Not scalable to modern deep neural networks (~millions of parameters). • Requires a forward pass for every parameter to calculate its finite difference. • Computing the gradient analytically with calculus: • Direct formula for the gradient. • Fast to compute. • Need to know the derivative of every function in our model w.r.t. the parameters.
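A hedged sketch of the centered finite-difference approximation, typically used only as a gradient check against the analytic gradient (function and variable names are illustrative):

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-5):
    """Centered finite differences over a 1-D parameter vector:
    two forward passes per perturbed parameter."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta[i] += eps
        plus = loss_fn(theta)
        theta[i] -= 2 * eps
        minus = loss_fn(theta)
        theta[i] += eps                      # restore the original value
        grad[i] = (plus - minus) / (2 * eps)
    return grad
```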
Example: ReLU gradient • We compute the activation function element-wise: $\text{ReLU}(x) = \max(0, x)$. • Its derivative is 1 where the input is positive and 0 otherwise (at exactly 0 it is undefined; in practice 0 is used). • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
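A minimal sketch of the element-wise ReLU forward and backward passes (taking 0 as the subgradient at x = 0 is an assumption commonly made in practice):

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x)          # element-wise max(0, x)

def relu_backward(dout, x):
    return dout * (x > 0)            # pass the gradient only where the input was positive
```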
How do we compute the gradient for every parameter? • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Computational graphs • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Computational graphs • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
Worked example on a computational graph • Want: the gradient of the output with respect to each input. • Solution: apply the chain rule backwards through the graph, node by node. [example from: http://cs231n.stanford.edu]
Patterns in backward flow • Addition node (+): distributes the gradient equally to all of its inputs. • Max node (max): "routes" the gradient to the input that had the larger value. • Multiplication node (*): multiplies the upstream gradient by the value of the other (switched) input. [example from: http://cs231n.github.io/optimization-2/]
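A toy sketch of these local rules on the tiny graph f = (x + y) * max(z, w); the numbers and variable names are illustrative:

```python
# Forward pass for f = (x + y) * max(z, w)
x, y, z, w = 3.0, -1.0, 2.0, 5.0
a = x + y            # addition node
b = max(z, w)        # max node
f = a * b            # multiplication node

# Backward pass (upstream gradient df/df = 1)
df = 1.0
da = df * b          # multiply node: upstream gradient times the *other* input
db = df * a
dx = da              # add node: distributes the gradient equally
dy = da
dz = db * (z > w)    # max node: routes the gradient to the larger input
dw = db * (w >= z)
print(dx, dy, dz, dw)   # 5.0 5.0 0.0 2.0
```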
How do we compute the gradient for every parameter? • Our simple 2-layer neural network model from the previous lecture: $\hat{y} = W_2 h + b_2$, • where $h = \sigma(W_1 x + b_1)$ and $\sigma$ is the activation function (e.g., ReLU). • If we choose a quadratic loss: $L = \tfrac{1}{2}\lVert \hat{y} - y \rVert^2$ • Want: $\frac{\partial L}{\partial W_1}, \frac{\partial L}{\partial b_1}, \frac{\partial L}{\partial W_2}, \frac{\partial L}{\partial b_2}$ • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
How do we compute the gradient for every parameter? • Let's solve for the gradients of the weights layer by layer, applying the chain rule backwards through the network (see the sketch below). • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
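A hedged sketch of backpropagation for the 2-layer network and quadratic loss above, assuming a ReLU activation (the shapes and variable names are illustrative, not the course's reference implementation):

```python
import numpy as np

def two_layer_backprop(x, y, W1, b1, W2, b2):
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0, z1)            # ReLU hidden layer
    y_hat = W2 @ h + b2
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass (chain rule, layer by layer)
    d_yhat = y_hat - y               # dL/d y_hat for the quadratic loss
    dW2 = np.outer(d_yhat, h)        # dL/dW2 = (y_hat - y) h^T
    db2 = d_yhat
    dh = W2.T @ d_yhat               # propagate into the hidden layer
    dz1 = dh * (z1 > 0)              # ReLU routes the gradient where z1 > 0
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)
```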
Back-Propagation • [Figure: a numeric worked example of back-propagation on a computational graph; the original slides annotate each node with its forward-pass value and backward-pass gradient.]
Use intermediate variables • Break a complicated expression into intermediate variables and apply the chain rule through each of them in turn. [Slide from http://cs231n.stanford.edu]
In practice, use vector/matrix operations • Let $y = Wx$, with $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. • Want $\frac{\partial y}{\partial x}$: we need to calculate the Jacobian of $y$ w.r.t. $x$. [http://cs231n.stanford.edu/vecDerivs.pdf]
In practice, use vector/matrix operations Let’s look at a single example: [http://cs231n.stanford.edu/vecDerivs.pdf]
In practice, use vector/matrix operations • We’ve shown that: • So we can write the Jacobian as: • Which is itself. Therefor: [http://cs231n.stanford.edu/vecDerivs.pdf]
Vanishing / exploding gradient • Optimization becomes tricky when the computational graph becomes extremely deep. • Suppose we need to repeatedly multiply by a weight matrix $W$, so that after $t$ steps we have multiplied by $W^t$. • Suppose that $W$ has an eigendecomposition $W = V \operatorname{diag}(\lambda) V^{-1}$; then $W^t = V \operatorname{diag}(\lambda)^t V^{-1}$. • Any eigenvalue $\lambda_i$ that is not near an absolute value of 1 will either explode (if $|\lambda_i| > 1$) or vanish (if $|\lambda_i| < 1$). • [Example from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
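A small numerical illustration of this effect (the matrix below is made up for the demo):

```python
import numpy as np

W = np.array([[1.2, 0.0],
              [0.0, 0.8]])                  # eigenvalues 1.2 and 0.8
v = np.array([1.0, 1.0])
for t in (1, 10, 50):
    print(t, np.linalg.matrix_power(W, t) @ v)
# t=10: [~6.19, ~0.107] -> the 1.2-direction explodes, the 0.8-direction vanishes
```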
Vanishing / exploding gradient (activation function) • [Figure from the original slide.]
Vanishing gradient problem • Sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$ • Its derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, where $\sigma'(x) \le 0.25$ for all $x$, so gradients shrink as they are multiplied through many sigmoid layers.
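A quick sketch of why repeated sigmoid derivatives shrink the gradient (the derivative never exceeds 0.25):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))

print(dsigmoid(0.0))          # 0.25, the maximum possible value
print(0.25 ** 10)             # ~9.5e-07: ten sigmoid layers crush the gradient
```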
Activation Functions • From Stanford cs231n
Universal Approximation Theorem • Given any continuous function $f$ (on a compact subset of $\mathbb{R}^n$) and some $\varepsilon > 0$, there exists a neural network $g$ with one hidden layer such that $|f(x) - g(x)| < \varepsilon$ for all $x$. • The theorem is not constructive: it does not provide a way to find the right parameters (or even their number). • When using a simple feedforward neural network, usually 2-3 hidden layers suffice (in contrast with ConvNets, some of which are 120+ layers deep).
Python Tutorial • http://cs231n.github.io/neural-networks-case-study/
Summary • Gradient-based learning: • Gradient descent algorithm: • Batch • Mini-batch • Stochastic • Computational graph • Chain rule • Vanishing / exploding gradient • Back-propagation