Explore various gradient descent variants and other optimization algorithms used with backpropagation, along with strategies for weight initialization. Learn about batch gradient descent, stochastic gradient descent, RMSprop, Adam, and more. Discover the importance of choosing the right optimizer and techniques for parallelizing and distributing SGD.
More on Back Propagation: 1) Optimization Algorithms and Weight Initialization; 2) Toward Biologically Plausible Implementations. Psychology 209, January 24, 2019
Optimization Algorithms
• Gradient descent variants
  • Batch gradient descent
  • Stochastic gradient descent
  • Mini-batch gradient descent
• Challenges
• Gradient descent optimization algorithms
  • Momentum
  • Nesterov accelerated gradient
  • Adagrad
  • Adadelta
  • RMSprop
  • Adam
  • AdaMax
  • Nadam
  • AMSGrad
• Visualization of algorithms
• Which optimizer to choose?
• Parallelizing and distributing SGD
  • Hogwild!
  • Downpour SGD
  • Delay-tolerant algorithms for SGD
  • TensorFlow
  • Elastic Averaging SGD
• Additional strategies for optimizing SGD
  • Shuffling and curriculum learning
  • Batch normalization
  • Early stopping
  • Gradient noise
Some of the Algorithms
(g_t denotes the vector of dL/dw at step t; θ the vector of weights + biases; η the learning rate)
• Batch Gradient Descent: θ ← θ − η g_t
• Momentum: v_t = γ v_{t−1} + η g_t; θ ← θ − v_t
• AdaGrad: θ_{t+1,i} = θ_{t,i} − (η / √(G_{t,ii} + ε)) g_{t,i}, where G_{t,ii} is the sum of squares of g_{t′,i} up to time t
• RMSprop: E[g²]_t = 0.9 E[g²]_{t−1} + 0.1 g_t², where g_t² is the vector of squares of the g_{i,t}; θ_{t+1} = θ_t − (η / √(E[g²]_t + ε)) g_t, and √(E[g²]_t + ε) corresponds to RMS[g]_t
• AdaDelta: Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) g_t, where RMS[Δθ]_{t−1} is the RMS of the previous changes in the parameters
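As a concrete illustration of these update rules, here is a minimal NumPy sketch of single steps of plain gradient descent, momentum, and RMSprop; the function and variable names (theta, grad, lr, and so on) are illustrative choices, not taken from the slides.

import numpy as np

def sgd_step(theta, grad, lr=0.01):
    # Plain (batch or stochastic) gradient descent: move against the gradient
    return theta - lr * grad

def momentum_step(theta, grad, v, lr=0.01, gamma=0.9):
    # Momentum: accumulate a velocity vector and step along it
    v = gamma * v + lr * grad
    return theta - v, v

def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    # RMSprop: scale the step by a running RMS of recent gradients
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    return theta - lr * grad / (np.sqrt(avg_sq) + eps), avg_sq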
Adam
• Two running averages:
  m_t = β₁ m_{t−1} + (1 − β₁) g_t (momentum: running average of the gradient)
  v_t = β₂ v_{t−1} + (1 − β₂) g_t² (running average of the squared gradient)
  with bias-corrected estimates m̂_t = m_t / (1 − β₁ᵗ) and v̂_t = v_t / (1 − β₂ᵗ)
• Update depends on momentum normalized by variance:
  θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε)
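A minimal sketch of the Adam update, under the same illustrative naming conventions as the sketch above (m and v are the two running averages; t counts update steps starting at 1):

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Running averages of the gradient (momentum) and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum normalized by the square root of the second-moment estimate
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v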
Neural Network Activation Functions
• All receiving units use a 'net input', here called net_i, given by net_i = Σ_j w_ij a_j + b_i
• This is then used as the basis of the unit's activation a_i using an 'activation function,' usually one of the following:
  • Linear: a_i = net_i
  • Logistic: a_i = 1 / (1 + e^(−net_i))
  • Tanh: a_i = (e^(net_i) − e^(−net_i)) / (e^(net_i) + e^(−net_i))
  • Relu: a_i = max(0, net_i), i.e., 0 if net_i < 0 else net_i
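The same functions in NumPy form, as a quick sketch (W is the weight matrix, a the vector of sending-unit activations, b the bias vector; the names are illustrative):

import numpy as np

def net_input(W, a, b):
    # net_i = sum_j w_ij * a_j + b_i
    return W @ a + b

def linear(net):
    return net

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

def tanh(net):
    return np.tanh(net)

def relu(net):
    # 0 where net < 0, net otherwise
    return np.maximum(0.0, net)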
Weight Initialization
• Too small: learning is too slow
• Too big: you jam units into the non-linear range and lose the gradient signal
• Just right?
• One approach: consider the fan-in to each receiving unit (the number of incoming weights, n_in)
• For example, 'He initialization' draws each weight from a zero-mean Gaussian with variance 2/n_in: w ~ N(0, 2/n_in) (see the sketch below)
• Alternatives include normalized initialization; see the appendix slide for details if interested.
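A minimal sketch of He initialization for one weight matrix, assuming a fan-in of n_in and ReLU units; the function name and signature are illustrative:

import numpy as np

def he_init(n_in, n_out, rng=None):
    # Zero-mean Gaussian with variance 2 / fan_in (He et al., 2015)
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))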
Recirculation Algorithm
• Assuming symmetric connections between the two layers (figure: connections between units indexed j and k)
Generalized Recirculation
• Present the input, feed activation forward, compute the output, let it feed back, and let the hidden state settle to a fixed point. Then clamp both the input and output units into the desired state, and settle again.*
• (Figure labels: t_k, h_j, y_j, s_i)
*Equations neglect the component of the net input at the hidden layer that comes from the input layer.
Random feedback weights can deliver useful teaching signals
Lillicrap TP, Cownden D, Tweed DB, Akerman CJ: Random synaptic feedback weights support error backpropagation for deep learning. Nat Commun 2016, 7:13276. http://dx.doi.org/10.1038/ncomms13276
‘Feedback Alignment’ Equals or Beats Backprop on MNIST in 3- and 4-Layer Nets
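To make the idea concrete, here is a minimal sketch of feedback alignment for a single hidden layer: the backward pass uses a fixed random matrix B in place of the transpose of the forward weights. This is a reconstruction of the idea rather than code from the paper, and all names and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 784, 100, 10, 0.01

W1 = rng.normal(0.0, 0.1, (n_hid, n_in))   # forward weights, input -> hidden
W2 = rng.normal(0.0, 0.1, (n_out, n_hid))  # forward weights, hidden -> output
B  = rng.normal(0.0, 0.1, (n_hid, n_out))  # fixed random feedback weights (never updated)

def train_step(x, target):
    global W1, W2
    # Forward pass: logistic hidden units, linear output units
    h = 1.0 / (1.0 + np.exp(-(W1 @ x)))
    y = W2 @ h
    e = y - target
    # Backprop would use W2.T @ e here; feedback alignment uses the fixed random B instead
    delta_h = (B @ e) * h * (1.0 - h)
    W2 -= lr * np.outer(e, h)
    W1 -= lr * np.outer(delta_h, x)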
Can we make it work using Hebbian Learning? Link to article in Science Direct
Normalized Initialization
• Start with a more uniform distribution: W ~ U[−1/√n, 1/√n], where n is the fan-in of the receiving layer
• This often works OK
• But it leads to degenerate activations at initialization when a tanh function is used.
• ‘Normalized initialization’ solves the problem (figure) and produces improvement with the tanh nonlinearity: W ~ U[−√6/√(n_j + n_{j+1}), +√6/√(n_j + n_{j+1})], where n_j and n_{j+1} are the layer’s fan-in and fan-out
• Now we know even better ways, but we’ll consider these later
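A matching sketch of normalized initialization for one weight matrix; the function name and signature are illustrative:

import numpy as np

def normalized_init(n_in, n_out, rng=None):
    # Uniform on [-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))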