Glide Algorithm with Tunneling: A Fast, Reliably Convergent Algorithm for Neural Network Training
Vitit Kantabutra & Batsukh Tsendjav, Computer Science Program, College of Engineering, Idaho State University, Pocatello, ID 83209
Elena Zheleva, Dept. of CS/EE, The University of Vermont, Burlington, VT 05405
New Algorithm for Neural Network Training
• Convergence of training algorithms is one of the most important issues in the NN field today
• We solve the problem for some well-known difficult-to-train networks:
  • Parity-4 – 100% fast convergence
  • 2-Spiral – same
  • Character recognition – same
Our Glide Algorithms
• Our first “Glide Algorithm” was a simple modification of gradient descent
• When the gradient is small, go a constant distance instead of a distance equal to a constant times the gradient
• The idea was that flat regions are seemingly “safe,” enabling us to go a relatively long distance (“glide”) without missing the solution
• Originally we even thought of going a longer distance when the gradient is smaller!
• We simply didn’t believe in the conventional wisdom of going a longer distance on steep slopes
Hairpin Observation – Problem with the Original Glide Algorithm
• Our original Glide Algorithm did converge significantly faster than plain gradient descent
• But it didn’t converge as reliably as plain gradient descent!
• What seemed to be wrong?
• We weren’t right that flat regions are always safe!!
• We experimented by running plain gradient descent and observing its flat-region behavior
• Flat regions are indeed often safe
• But sometimes gradient descent makes a sharp “hairpin” turn!!
• This sometimes derailed our first Glide Algorithm
Second Glide Algorithm: “Glide Algorithm with Tunneling”
• In flat regions, we still try to go far
• But we check the error at the tentative destination
• Don’t go as far if the error increases too much
• Can easily afford the time
• But even if the error increases a little, go anyway to “stir things up”
• Also has a mechanism for battling zigzagging
  • Direction of motion is the average of 2 or 4 gradient descent moves
  • Seems better than momentum
• Also has “tunneling”
  • Means a very local linear search, but fancier
(a code sketch of one glide step appears below)
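Below is a minimal C++ sketch of one glide step with the destination check (C++ being the language the implementation is later said to be written in). The toy error surface, the numerical gradient, the flatness threshold, the glide length, the step-halving backoff, and the error-increase tolerance are all illustrative assumptions; only the general idea comes from the slides: a long, fixed-length move in flat regions, shortened if the error at the tentative destination rises too much, with small increases accepted to “stir things up.”

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Toy error surface with a long, nearly flat valley, standing in for the network error.
double error(const Vec& w) {
    return 0.001 * w[0] * w[0] + std::pow(w[1] - std::sin(w[0]), 2.0);
}

// Central-difference gradient (a real implementation would use backpropagation).
Vec gradient(const Vec& w) {
    const double h = 1e-6;
    Vec g(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        Vec wp = w, wm = w;
        wp[i] += h;
        wm[i] -= h;
        g[i] = (error(wp) - error(wm)) / (2.0 * h);
    }
    return g;
}

double norm(const Vec& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);
}

// One glide step: ordinary gradient descent on steep slopes, but a long,
// fixed-length move in flat regions, checked at the tentative destination.
void glideStep(Vec& w, double eta, double glideLen, double flatThresh, double tolerance) {
    Vec g = gradient(w);
    double gnorm = norm(g);
    if (gnorm < 1e-12) return;                        // already at a stationary point

    double stepLen = (gnorm < flatThresh) ? glideLen  // flat region: glide a constant distance
                                          : eta * gnorm;
    double e0 = error(w);
    while (stepLen > 1e-8) {
        Vec trial = w;
        for (size_t i = 0; i < w.size(); ++i)
            trial[i] -= stepLen * g[i] / gnorm;       // move along the unit descent direction
        if (error(trial) - e0 <= tolerance) {         // small increases are accepted,
            w = trial;                                // to "stir things up"
            return;
        }
        stepLen *= 0.5;                               // error rose too much: don't go so far
    }
}

int main() {
    Vec w = {5.0, 0.0};
    for (int i = 0; i < 200; ++i)
        glideStep(w, 0.1, 0.5, 0.05, 1e-3);
    std::printf("final error: %g\n", error(w));
    return 0;
}
```

The halving loop is just one simple way to “not go so far” when the tentative destination looks bad; the slides do not specify the exact backoff rule.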
Reducing the Zigzagging Problem
• Direction of the next move is usually determined by averaging 2 or 4 (or 6, 8, etc.) gradient descent moves
[Figure: gradient descent zigzagging despite momentum]
(a sketch of the averaged-direction move appears below)
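A rough illustration of the averaging idea on a narrow quadratic valley, the classic case where plain gradient descent zigzags. Interpreting “averaging 2 or 4 gradient descent moves” as look-ahead sub-steps whose displacements are averaged is an assumption, as are the error function and step sizes; they are chosen only to make the example runnable.

```cpp
#include <array>
#include <cstdio>

using Vec = std::array<double, 2>;

// Narrow quadratic valley: steep across (y), shallow along (x).
double error(const Vec& w) { return w[0] * w[0] + 50.0 * w[1] * w[1]; }
Vec gradient(const Vec& w) { return {2.0 * w[0], 100.0 * w[1]}; }

// One averaged move: simulate k plain gradient descent sub-steps from w,
// then move from the original point along the average of those k steps.
void averagedStep(Vec& w, double eta, int k) {
    Vec probe = w, avg = {0.0, 0.0};
    for (int i = 0; i < k; ++i) {
        Vec g = gradient(probe);
        Vec step = {-eta * g[0], -eta * g[1]};
        probe[0] += step[0];
        probe[1] += step[1];
        avg[0] += step[0] / k;
        avg[1] += step[1] / k;
    }
    w[0] += avg[0];   // averaging largely cancels the alternating cross-valley
    w[1] += avg[1];   // components, leaving mostly the along-valley motion
}

int main() {
    Vec w = {10.0, 1.0};
    for (int i = 0; i < 100; ++i)
        averagedStep(w, 0.018, 4);
    std::printf("error after 100 averaged moves: %g\n", error(w));
    return 0;
}
```

Unlike momentum, which mixes in the previous actual move, this direction is built entirely from fresh look-ahead gradient information at the current point.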
Importance of Tunneling
• Serves to set the weights at the “bottom of the gutter”
[Figure: error vs. distance along the search direction, showing the “gutter” minimum]
(a sketch of such a local search appears below)
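The slides describe tunneling only as a very local but fancier linear search, so the stand-in below is a plain sampled line search along a given direction that keeps the lowest-error point found. The error function, search radius, and sample count are illustrative assumptions, not the authors’ actual procedure.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;

// Toy error with a curved "gutter" (valley floor at w1 = 0.1 * w0^2).
double error(const Vec& w) {
    return std::pow(w[1] - 0.1 * w[0] * w[0], 2.0) + 0.01 * w[0] * w[0];
}

// Sample the error at a few points along +/- dir around w and keep the best,
// i.e., place the weights near the bottom of the local gutter.
void tunnel(Vec& w, const Vec& dir, double radius, int samples) {
    Vec best = w;
    double bestErr = error(w);
    for (int i = -samples; i <= samples; ++i) {
        double t = radius * i / samples;
        Vec trial = w;
        for (size_t j = 0; j < w.size(); ++j) trial[j] += t * dir[j];
        double e = error(trial);
        if (e < bestErr) { bestErr = e; best = trial; }
    }
    w = best;
}

int main() {
    Vec w = {2.0, 1.0};
    Vec dir = {0.0, 1.0};                     // search across the gutter
    tunnel(w, dir, 1.0, 20);
    std::printf("after tunneling: w = (%g, %g), error = %g\n", w[0], w[1], error(w));
    return 0;
}
```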
A Few Experimental Results
Problem: Parity-4 with 4 hidden neurons
[Plot: y = running time (sec) until convergence vs. x = run number; even runs start from the previous run’s weights, odd runs start from random weights; the gradient descent odd runs are marked “didn’t converge”]
Two-Spiral Problem
• Very hard problem
• Glide algorithm
  • combined with gradient descent for quicker initial error reduction
  • number of epochs required for convergence varies widely
  • average: 30,453 epochs
• Gradient descent
  • often did not converge
Tuning Insensitivity of the Glide-Tunnel Algorithm!!
[Plot: results with random parameters, shown separately for odd runs and even runs]
Glide Algorithm Tested on a Character Recognition Problem
• The network was built to recognize the digits 0 through 9
• The algorithm was implemented in C++
• In the test runs, the Glide Algorithm outperformed the regular gradient descent method
Small Neural Network
• The network was 48-24-10
• Bipolar inputs
• Trained on 200 training patterns
  • 20 samples for each digit
• Trained and tested on printed characters
• After training, the recognition rate on test patterns was 70% on average
  • Not enough training patterns
Network Structure
• 6×8 pixel resolution
• 48 bipolar inputs (1/-1)
• Hidden layer
  • 24 neurons
  • tanh(x) activation
• Output layer
  • 10 neurons
  • tanh(x) activation
(a forward-pass sketch of this topology appears below)
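A forward-pass sketch of the described topology. Only the layer sizes (48-24-10), the bipolar inputs, the tanh activation, and the steepness parameter lambda come from the slides; the weight initialization, bias handling, and test input are assumptions added to make the sketch self-contained.

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// One fully connected layer with tanh(lambda * net) activation.
Vec layer(const Mat& W, const Vec& bias, const Vec& x, double lambda) {
    Vec y(W.size());
    for (size_t i = 0; i < W.size(); ++i) {
        double s = bias[i];
        for (size_t j = 0; j < x.size(); ++j) s += W[i][j] * x[j];
        y[i] = std::tanh(lambda * s);        // lambda is the steepness parameter
    }
    return y;
}

Mat randomMatrix(size_t rows, size_t cols, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(-0.1, 0.1);
    Mat W(rows, Vec(cols));
    for (auto& row : W)
        for (auto& v : row) v = u(rng);
    return W;
}

int main() {
    std::mt19937 rng(42);
    const double lambda = 1.0;               // steepness value used in the experiments
    Mat W1 = randomMatrix(24, 48, rng);      // input (6x8 bipolar pixels) -> hidden
    Mat W2 = randomMatrix(10, 24, rng);      // hidden -> output (digits 0..9)
    Vec b1(24, 0.0), b2(10, 0.0);

    Vec pixels(48, -1.0);                    // a blank 6x8 bipolar image as a test input
    Vec hidden = layer(W1, b1, pixels, lambda);
    Vec out = layer(W2, b2, hidden, lambda);
    for (int d = 0; d < 10; ++d)
        std::printf("digit %d score: %+.3f\n", d, out[d]);
    return 0;
}
```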
Experimental Results
• 60 official runs of the Glide Algorithm
• All but 4 runs converged in under 5000 epochs
• Average run time was 47 sec
• Parameters used
  • eta = 0.005 (learning rate)
  • lambda = 1 (steepness parameter)
Experimental Results
• 20 runs of the regular gradient descent algorithm
• None of the runs had converged after 20,000 epochs
• Average run time was 3.7 min
• Higher-order methods exist, but they are
  • not stable
  • not very efficient when the error surface is flat
Conclusion
• The new Glide Algorithm has been shown to perform very well in flat regions
• With tunneling, the algorithm is very stable, converging on all test runs for the different test problems
• It converges more reliably than gradient descent and, presumably, than second-order methods
• Some individual steps are computationally expensive, but they are worth the CPU time because the overall performance is far superior to regular gradient descent