Explore various gradient descent variants and other optimization algorithms used with backpropagation, along with strategies for weight initialization. Learn about batch gradient descent, stochastic gradient descent, RMSprop, Adam, and more. Discover the importance of choosing the right optimizer and techniques for parallelizing and distributing SGD.
More on Back Propagation: 1) Optimization Algorithms and Weight Initialization; 2) Toward Biologically Plausible Implementations. Psychology 209, January 24, 2019
Optimization Algorithms
• Gradient descent variants
  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
• Challenges
• Gradient descent optimization algorithms
  • Momentum
  • Nesterov accelerated gradient
  • Adagrad
  • Adadelta
  • RMSprop
  • Adam
  • AdaMax
  • Nadam
  • AMSGrad
• Visualization of algorithms
• Which optimizer to choose?
• Parallelizing and distributing SGD
  • Hogwild!
  • Downpour SGD
  • Delay-tolerant algorithms for SGD
  • TensorFlow
  • Elastic Averaging SGD
• Additional strategies for optimizing SGD
  • Shuffling and curriculum learning
  • Batch normalization
  • Early stopping
  • Gradient noise
Some of the Algorithms
(g_t denotes the vector of dL/dw at step t; θ the vector of weights + biases; η the learning rate)
• Batch Gradient Descent: θ ← θ − η g_t
• Momentum: v_t = γ v_{t−1} + η g_t; θ ← θ − v_t
• AdaGrad: θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) g_{t,i}, where G_{t,ii} is the sum of squares of g_{t′,i} up to time t
• RMSprop: E[g²]_t = 0.9 E[g²]_{t−1} + 0.1 g_t², where g_t² is the vector of squares of the g_{i,t}; θ_{t+1} = θ_t − (η / √(E[g²]_t + ε)) g_t, and √(E[g²]_t + ε) corresponds to RMS[g]_t
• AdaDelta: Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) g_t, where RMS[Δθ]_{t−1} is the RMS of the previous changes in the parameters
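As a concrete illustration of these update rules, here is a minimal NumPy sketch of single steps of plain gradient descent, momentum, and RMSprop; the function and variable names (theta, grad, lr, and so on) are illustrative choices, not taken from the slides.

import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Plain (batch or stochastic) gradient descent: move against the gradient
    return theta - lr * grad

def momentum_step(theta, grad, v, lr=0.01, gamma=0.9):
    # Momentum: accumulate a velocity vector and step along it
    v = gamma * v + lr * grad
    return theta - v, v

def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    # RMSprop: scale the step by a running RMS of recent gradients
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    return theta - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq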
Adam
• Two running averages:
  m_t = β₁ m_{t−1} + (1 − β₁) g_t (momentum: running average of the gradient)
  v_t = β₂ v_{t−1} + (1 − β₂) g_t² (running average of the squared gradient)
  with bias-corrected estimates m̂_t = m_t / (1 − β₁ᵗ) and v̂_t = v_t / (1 − β₂ᵗ)
• Update depends on momentum normalized by variance:
  θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε)
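A minimal sketch of the Adam update, under the same illustrative naming conventions as the sketch above (m and v are the two running averages; t counts update steps starting at 1):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient (momentum) and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum normalized by the square root of the second-moment estimate
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v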
Neural Network Activation Functions
• All receiving units use a 'net input', here called net_i, given by net_i = Σ_j w_ij a_j + b_i
• This is then used as the basis of the unit's activation a_i using an 'activation function,' usually one of the following:
  • Linear: a_i = net_i
  • Logistic: a_i = 1 / (1 + e^(−net_i))
  • Tanh: a_i = (e^(net_i) − e^(−net_i)) / (e^(net_i) + e^(−net_i))
  • Relu: a_i = max(0, net_i), i.e., 0 if net_i < 0 else net_i
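The same functions in NumPy form, as a quick sketch (W is the weight matrix, a the vector of sending-unit activations, b the bias vector; the names are illustrative):

import numpy as np

def net_input(W, a, b):
    # net_i = sum_j w_ij * a_j + b_i
    return W @ a + b

def linear(net):
    return net

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

def tanh(net):
    return np.tanh(net)

def relu(net):
    # 0 where net < 0, net otherwise
    return np.maximum(0.0, net)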
Weight Initialization
• Too small: learning is too slow
• Too big: you jam units into the non-linear range and lose the gradient signal
• Just right?
• One approach: consider the fan-in to each receiving unit (the number of incoming weights, n_in)
• For example, 'He initialization' draws each weight from a zero-mean Gaussian with variance 2/n_in: w ~ N(0, 2/n_in) (see the sketch below)
• Alternatives include normalized initialization; see the appendix slide for details if interested.
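A minimal sketch of He initialization for one weight matrix, assuming a fan-in of n_in and ReLU units; the function name and signature are illustrative:

import numpy as np

def he_init(n_in, n_out, rng=None):
    # Zero-mean Gaussian with variance 2 / fan_in (He et al., 2015)
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))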
Recirculation Algorithm
• Assuming symmetric connections between the two layers (figure: connections between units indexed j and k)
Generalized Recirculation
• Present the input, feed activation forward, compute the output, let it feed back, and let the hidden state settle to a fixed point. Then clamp both the input and output units into the desired state, and settle again.*
• (Figure labels: t_k, h_j, y_j, s_i)
*Equations neglect the component of the net input at the hidden layer that comes from the input layer.
Random feedback weights can deliver useful teaching signals
Lillicrap TP, Cownden D, Tweed DB, Akerman CJ: Random synaptic feedback weights support error backpropagation for deep learning. Nat Commun 2016, 7:13276. http://dx.doi.org/10.1038/ncomms13276
‘Feedback Alignment’ Equals or Beats Backprop on MNIST in 3- and 4-Layer Nets
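To make the idea concrete, here is a minimal sketch of feedback alignment for a single hidden layer: the backward pass uses a fixed random matrix B in place of the transpose of the forward weights. This is a reconstruction of the idea rather than code from the paper, and all names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 784, 100, 10, 0.01

W1 = rng.normal(0.0, 0.1, (n_hid, n_in))   # forward weights, input -> hidden
W2 = rng.normal(0.0, 0.1, (n_out, n_hid))  # forward weights, hidden -> output
B  = rng.normal(0.0, 0.1, (n_hid, n_out))  # fixed random feedback weights (never updated)

def train_step(x, target):
    global W1, W2
    # Forward pass: logistic hidden units, linear output units
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))
    y = W2 @ h
    e = y - target
    # Backprop would use W2.T @ e here; feedback alignment uses the fixed random B instead
    delta_h = (B @ e) * h * (1.0 - h)
    W2 -= lr * np.outer(e, h)
    W1 -= lr * np.outer(delta_h, x)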
Can we make it work using Hebbian Learning? Link to article in Science Direct
Normalized Initialization
• Start with a more uniform distribution: W ~ U[−1/√n, 1/√n], where n is the fan-in of the receiving layer
• This often works OK
• But it leads to degenerate activations at initialization when a tanh function is used.
• ‘Normalized initialization’ solves the problem (figure) and produces improvement with the tanh nonlinearity: W ~ U[−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1})], where n_j and n_{j+1} are the layer’s fan-in and fan-out
• Now we know even better ways, but we’ll consider these later
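A matching sketch of normalized initialization for one weight matrix; the function name and signature are illustrative:

import numpy as np

def normalized_init(n_in, n_out, rng=None):
    # Uniform on [-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))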