

  1. CV 192: Introduction to Deep Learning: Part 2. Oren Freifeld, Ron Shapira Weber. Computer Science, Ben-Gurion University

  2. Contents • Computational graph • Chain rule • Backpropagation [Figure from previous slide taken from https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html]

  3. Gradient-Based Learning [https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]

  4. Gradient-Based Learning • In general: • We want to update our parameters w (in deep learning, usually the weights) so that our loss function L(w) reaches a minimum. • Compute ∇_w L(w), the gradient of the loss function. • Update the weights in the direction opposite to the gradient, using a pre-defined step size, the learning rate η: w ← w − η ∇_w L(w). • Iterate over the data until the desired loss is achieved.
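To make this loop concrete, here is a minimal NumPy sketch of batch gradient descent on a made-up least-squares loss; the data, the learning rate, and the iteration count are illustrative choices, not part of the slides.

```python
import numpy as np

# Toy least-squares loss L(w) = (1 / 2N) * ||X w - y||^2, with made-up data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w = np.zeros(3)    # initial weights
eta = 0.1          # learning rate (step size)

for t in range(500):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of the loss w.r.t. w
    w -= eta * grad                     # step opposite to the gradient

print(w)  # approaches [1.0, -2.0, 0.5]
```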

  5. The Gradient • When the problem is convex: • Want: argmin_w L(w) (where L is the loss function). • In ℝ: set the derivative dL/dw to zero. • In ℝⁿ: set the gradient (the vector of partial derivatives) to zero: ∇_w L(w) = 0.

  6. Gradient Descent • The gradient of a function points in the direction that maximizes the function. Therefore, we "descend" the gradient by moving in the opposite direction. • When updating our parameters, we need to define the magnitude of the change; this is called the learning rate η. Our update rule at iteration t is thus: w_{t+1} = w_t − η ∇_w L(w_t). • The gradient is calculated by taking the expectation over the entire dataset: ∇_w L(w) = (1/N) Σ_{i=1}^{N} ∇_w ℓ(x_i, y_i; w). • This is also called "batch gradient descent".

  7. Learning Rate [figures source: http://cs231n.github.io]

  8. Stochastic Gradient Descent • Going over the entire dataset might be very expensive (recall that ImageNet contains 1.2 million images). • Instead, separate the dataset into small batches and compute the gradient on them (see the sketch below). • Mini-batches must be selected randomly and independently. • When |batch-size| = 1, SGD is also called "on-line learning".
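A minimal mini-batch SGD sketch, again on a made-up least-squares problem; the batch size, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))            # made-up dataset
y = X @ np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
eta, batch_size = 0.05, 32                  # illustrative hyperparameters

for epoch in range(5):
    perm = rng.permutation(len(X))          # random, independent mini-batches
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / len(idx)   # gradient estimate on the mini-batch
        w -= eta * grad                          # SGD update

print(w)  # close to [1.0, -2.0, 0.5]
```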

  9. SGD - Justification • The mini-batch gradient is an unbiased estimate of the full gradient. • Standard error of the mean: σ/√m for a batch of m examples. • For instance: estimating the average gradient over 100 vs. 10,000 examples. • The latter requires 100 times more computation than the former, but reduces the standard error of the mean by only a factor of 10. • Reduces the generalization error: small batches offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. • [Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]

  10. SGD - disadvantages • Training with a small batch size might require a small learning rate due to the high variance in the gradient estimate. • In addition, it might require learning-rate decay due to oscillations near the minimum. [figure https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3]

  11. SGD • [Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]

  12. Computing the gradient • Computing the gradient numerically with finite differences: ∂L/∂w_i ≈ (L(w + h e_i) − L(w)) / h for a small h. • Only an approximation. • Cost is linear in the number of parameters. • Not scalable to modern deep neural networks (~millions of parameters): a feedforward pass is needed for every parameter to compute its finite difference. • Computing the gradient analytically with calculus: • A direct formula for the gradient. • Fast to compute. • Need to know the derivative of every function in our model w.r.t. the parameters. • (See the gradient-check sketch below.)
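A sketch of the finite-difference approach and why it does not scale: two loss evaluations per parameter (central differences here). The toy quadratic loss is made up for the example.

```python
import numpy as np

def numerical_gradient(f, w, h=1e-5):
    """Central finite differences: two loss evaluations per parameter (not scalable)."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        grad[i] = (f(w + e) - f(w - e)) / (2 * h)
    return grad

# Compare against the analytic gradient of a toy quadratic loss L(w) = 0.5 w^T A w.
A = np.diag([1.0, 2.0, 3.0])
f = lambda w: 0.5 * w @ A @ w
w = np.array([1.0, -1.0, 2.0])
print(numerical_gradient(f, w))  # numerical estimate
print(A @ w)                     # analytic gradient A w (A is symmetric)
```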

  13. Example: sigmoid function derivative • σ(x) = 1 / (1 + e^{−x}), so dσ/dx = σ(x)(1 − σ(x)).
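A small sketch of the identity above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # sigma'(x) = sigma(x) * (1 - sigma(x))

x = np.linspace(-5.0, 5.0, 5)
print(sigmoid_grad(x))                # peaks at 0.25 for x = 0
```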

  14. Example: ReLU gradient • g(z) = max(0, z); the derivative is 1 where z > 0 and 0 otherwise. We compute the activation function (and its gradient) element-wise. • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
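And the corresponding element-wise ReLU sketch:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)    # 1 where z > 0, else 0, applied element-wise

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))   # [0.  0.  0.  1.  1. ]
```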

  15. How do we compute the gradient for every parameter? • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]

  16. Computational graphs • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]

  17. Computational graphs • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]

  18. Want: the gradient of the output w.r.t. each input of a simple computational graph. Solution: run a forward pass, then apply the chain rule backward through the graph, node by node. [example from: http://cs231n.stanford.edu]

  19. Want: the gradient of a more complex expression w.r.t. its inputs. Solution: the same backward, node-by-node application of the chain rule. [example from: http://cs231n.stanford.edu]

  20. Patterns in backward flow • Addition node (+): distributes the gradient equally to all of its inputs. • Max node (max): "routes" the gradient to the input that attained the maximum. • Multiplication node (*): multiplies the upstream gradient by the other (switched) input. (A numeric sketch follows below.) [example from: http://cs231n.github.io/optimization-2/]
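A numeric sketch of the three patterns on a made-up expression f = (x + y) * max(z, w); all the input values are arbitrary.

```python
# Forward pass of f = (x + y) * max(z, w), with made-up input values.
x, y, z, w = 3.0, -4.0, 2.0, -1.0
q = x + y                  # add gate
m = max(z, w)              # max gate
f = q * m                  # multiply gate

# Backward pass (starting from df/df = 1), following the three patterns above.
df_dq = m                  # multiply gate: gradient times the switched input
df_dm = q
df_dx = df_dq * 1.0        # add gate: distributes the gradient to both inputs
df_dy = df_dq * 1.0
df_dz = df_dm * (1.0 if z >= w else 0.0)   # max gate: routes to the larger input
df_dw = df_dm * (1.0 if w > z else 0.0)
print(df_dx, df_dy, df_dz, df_dw)          # 2.0 2.0 -1.0 0.0
```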

  21. How do we compute the gradient for every parameter? • Our simple 2-layer neural network model from the previous lecture: ŷ = W2 h, where h = σ(W1 x) (biases omitted for brevity). • If we choose a quadratic loss: L = ½ ‖ŷ − y‖². • Want: ∂L/∂W1 and ∂L/∂W2. • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]

  22. How do we compute the gradient for every parameter? • Let's solve for the gradient w.r.t. the output-layer weights W2 first, then propagate backward to W1 using the chain rule (see the sketch below). • [Figure from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
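A sketch of the full backward pass for the assumed 2-layer model ŷ = W2 σ(W1 x) with quadratic loss; the architecture, shapes, and sigmoid hidden activation are illustrative assumptions, not necessarily the exact model from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)        # one made-up training example
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

# Forward pass.
a = W1 @ x                           # pre-activation of the hidden layer
h = sigmoid(a)                       # hidden activations
y_hat = W2 @ h                       # network output
L = 0.5 * np.sum((y_hat - y) ** 2)   # quadratic loss

# Backward pass (chain rule, layer by layer).
dL_dyhat = y_hat - y                     # dL / d y_hat
dL_dW2 = np.outer(dL_dyhat, h)           # gradient w.r.t. second-layer weights
dL_dh = W2.T @ dL_dyhat                  # back through the second layer
dL_da = dL_dh * h * (1.0 - h)            # back through the sigmoid
dL_dW1 = np.outer(dL_da, x)              # gradient w.r.t. first-layer weights
```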

  23. Back-Propagation [figure: forward pass on a small computational graph, with the computed value written at each node]

  24. Back-Propagation [figure: the same graph with the backward pass; the gradient flowing into each node is written next to its value]

  25. Use intermediate variables • Break the expression into simple intermediate variables (nodes), each with an easy local derivative, and combine them with the chain rule. [Slide from http://cs231n.stanford.edu]

  26. [Slide from http://cs231n.stanford.edu]

  27. In practice, use vector/matrix operations • Let y = Wx. • Want ∂y/∂x: we need to calculate the Jacobian of y with respect to x. [http://cs231n.stanford.edu/vecDerivs.pdf]

  28. In practice, use vector/matrix operations • Let's look at a single entry: y_i = Σ_j W_{ij} x_j, so ∂y_i/∂x_j = W_{ij}. [http://cs231n.stanford.edu/vecDerivs.pdf]

  29. In practice, use vector/matrix operations • We've shown that ∂y_i/∂x_j = W_{ij}. • So we can write the Jacobian as the matrix whose (i, j) entry is W_{ij}, which is W itself. Therefore: ∂y/∂x = W. [http://cs231n.stanford.edu/vecDerivs.pdf]
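A quick numerical check that the Jacobian of y = Wx is W itself; the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
x = rng.normal(size=3)

# Numerical Jacobian of y = W x: column j is (y(x + h e_j) - y(x - h e_j)) / (2h).
h = 1e-6
J = np.column_stack([(W @ (x + h * e) - W @ (x - h * e)) / (2 * h)
                     for e in np.eye(3)])
print(np.allclose(J, W))   # True: the Jacobian dy/dx equals W
```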

  30. Vanishing / exploding gradient • Optimization becomes tricky when the computational graph becomes extremely deep. • Suppose we need to repeatedly multiply by a weight matrix W. • Suppose that W has an eigendecomposition W = V diag(λ) V⁻¹, so that Wᵗ = V diag(λ)ᵗ V⁻¹. • Any eigenvalue whose absolute value is not near 1 will either explode (if |λ_i| > 1) or vanish (if |λ_i| < 1) as the power t grows. • [Example from: Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). ]
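A sketch of the effect, using a diagonal toy matrix so the eigenvalues are explicit; the eigenvalues 0.9 and 1.1 and the depth of 100 are made up for illustration.

```python
import numpy as np

# Repeatedly multiplying by W scales each eigen-direction by its eigenvalue.
for lam in (0.9, 1.1):                  # |lambda| < 1 vanishes, |lambda| > 1 explodes
    W = np.diag([lam, lam])             # toy weight matrix with eigenvalues lam
    v = np.ones(2)
    for _ in range(100):                # 100 repeated multiplications ("layers")
        v = W @ v
    print(lam, np.linalg.norm(v))       # ~0.9**100 ≈ 2.7e-5 vs ~1.1**100 ≈ 1.4e4
```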

  31. Vanishing / exploding gradient (activation function) [figure: a deep chain of layers whose activation derivatives multiply along the backward pass]

  32. Vanishing gradient problem • Sigmoid function: σ(x) = 1 / (1 + e^{−x}), where σ'(x) = σ(x)(1 − σ(x)) ≤ 1/4. • Backpropagating through many sigmoid layers multiplies many such small factors, so the gradient shrinks toward zero (see the sketch below).
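A sketch of how the product of sigmoid-derivative factors shrinks with depth; evaluating at x = 0 gives the largest possible factor (0.25), so this is the best case.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# At x = 0 the sigmoid derivative attains its maximum value, 0.25; even then,
# multiplying one such factor per layer shrinks the gradient exponentially.
grad_product = 1.0
for layer in range(10):
    s = sigmoid(0.0)
    grad_product *= s * (1.0 - s)    # one derivative factor per layer
print(grad_product)                  # 0.25**10 ≈ 9.5e-7
```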

  33. Activation Functions • From Stanford cs231n

  34. Universal Approximation Theorem • Given any continuous function f (on a compact domain) and some ε > 0, there exists a neural network with one hidden layer computing a function F such that |F(x) − f(x)| < ε for every x in the domain. • The theorem is not constructive: it does not provide a way to find the right parameters (or even their number). • When using a simple feedforward neural network, usually 2-3 hidden layers suffice (in contrast with ConvNets, some of which are 120+ layers deep).

  35. Python Tutorial • http://cs231n.github.io/neural-networks-case-study/

  36. Summary • Gradient-based learning: • Gradient descent algorithm: • Batch • Mini-batch • Stochastic • Computational graph • Chain rule • Vanishing / exploding gradient • Back-propagation
