The Streamlined Glide Algorithm and the LM-Glide Algorithm: Towards Reliable Convergence in the Training of Neural Networks. Vitit Kantabutra, Idaho State University, Pocatello, Idaho, U.S.A.
Neural Networks are Still Useful • Multi-layer perceptrons are still very popular in many fields • Classification • General function approximation
Neural Network Research Has Declined Despite the Unsolved Training Convergence Problem • In 2000, organizers of the NIPS conference pointed out that “neural networks” in the title was negatively correlated with acceptance • “SVM,” “Bayesian networks,” and “variational methods” were positively correlated with acceptance
Non-Convergence Problem • Still very prevalent • Leads to frustration and even compromised results
Second-Order Algorithms • Much of the research still done on neural networks concerns second-order algorithms • But second-order algorithms don’t help with large networks • Computational complexity problem • The flat-region problem slows down all gradient-based algorithms • Neither first-order nor higher-order conventional algorithms perform well • Zigzagging is another problem
Attempted Solutions for the Flat-Region Problem • Changing the formula for computing the output layer’s delta (error signal) (Solla, Fahlman) • Helps, but doesn’t eliminate the problem; we used Fahlman’s formula in some previous experiments to speed up convergence • Another approach is by Wilamowski and Torvik
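To make the delta-formula idea concrete, here is a minimal sketch, assuming a logistic output unit, of the kind of change commonly attributed to Fahlman as “flat-spot elimination.” The 0.1 offset and the function name are illustrative choices, not taken from this presentation or the cited papers.

```python
def output_delta(target, y, flat_spot_offset=0.1):
    """Error signal (delta) for a logistic output unit.

    Plain backprop uses (target - y) * y * (1 - y), which vanishes when the
    unit saturates near 0 or 1.  Adding a small constant to the derivative
    term keeps the delta from dying in flat regions (Fahlman-style flat-spot
    elimination; the 0.1 value is the conventional choice, assumed here
    rather than quoted from this talk).
    """
    return (target - y) * (y * (1.0 - y) + flat_spot_offset)
```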
Our First Attempt: Glide Algorithm • Kantabutra & Zheleva ’02 • Idea: flat regions are ‘safe’ • Why not go fast in flat regions? • Usually works, but sometimes the error rises sharply • Key: gradient descent often would have made sudden hairpin turns to safety where our algorithm glided too far into high-MSE territory • The weight trajectory hits a sigmoidal wall
Our Second Attempt: Glide Algorithm with Tunneling • Kantabutra, Tsendjav and Zheleva ’03 • Glides more carefully • Checks the error before making the move permanent • Adds a ‘tunneling’ move • Performs a local line search to find the bottom of the “half-pipe” • Works: 100% convergence, fast and reliable (low standard deviation in convergence time) • But it is complicated and could be cleaner
Illustrating the Importance of Tunneling • Mean-square error as a function of distance forms a “half-pipe”-shaped curve; in areas of turbulence we want to be at the bottom of the half-pipe
A Few Experimental Results (from the 2003 paper) • Problem: Parity-4 with 4 hidden neurons • [Plot of CPU time: x = run number, y = running time (sec) until convergence; “didn’t converge” marks gradient descent (G.D.) on odd runs] • Odd runs: random starting weights; even runs: starting with the previous run’s weights
Two-Spiral Problem (2003) • A very hard problem • Glide algorithm: combined with gradient descent for quicker initial error reduction; the number of epochs required for convergence varies widely (average: 30,453 epochs) • Gradient descent: often did not converge
Our Third Attempt: Simplified Glide Algorithm and LM-Glide • Still includes tunneling; we just removed the word from the name • Simpler but seemingly still effective
Glide Move: details (see the sketch below) • Take two small gradient-descent moves just for calculation purposes (w0 -> w1 -> w2) • Let w0 -> w2 be our direction of weight motion • A far glide is a long-distance glide (e.g., 0.2) • A near glide is a short-distance glide (e.g., 0.1) • Some tuning is required, but it is not difficult compared with regular gradient descent • Tuning can still be significant, even though the algorithm is less finicky than gradient descent • New: a self-tuning version for heart arrhythmia classification (UCI database)
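A minimal sketch of a single glide move, assuming a helper `grad(w)` that returns the MSE gradient at weight vector `w` and a small probe step size `eta`. The function name, the probe step size, and the defaults are illustrative assumptions; the two probe steps and the fixed-length move along w0 -> w2 follow the description above.

```python
import numpy as np

def glide_move(w0, grad, eta=0.01, glide_len=0.2):
    """One glide move: probe with two small gradient-descent steps, then
    travel a fixed distance along the resulting direction w0 -> w2.
    glide_len ~ 0.2 is a "far" glide, ~ 0.1 a "near" glide."""
    w1 = w0 - eta * grad(w0)      # first probe step (for calculation only)
    w2 = w1 - eta * grad(w1)      # second probe step
    direction = w2 - w0
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return w0                 # gradient vanished; no move
    return w0 + glide_len * (direction / norm)
```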
Downscaling or Shrinking Move • Multiply every weight by a factor such as 0.95 • May be needed every few dozen glides to keep the weights from growing out of control
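The downscaling move itself is a one-liner. In the sketch below, the 0.95 default comes from the slide, while the interval at which it is applied is a tuning choice the slide leaves open.

```python
def downscale(w, factor=0.95):
    """Shrink all weights by a fixed factor to keep them from growing
    without bound; applied roughly every few dozen glides."""
    return factor * w
```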
Tunneling Move: details (see the sketch below) • Used when mse(w2) > mse(w0) • Perform a local line search in the direction of the negative gradient from w0 or w1 to find the lowest-error point of the half-pipe • If mse(w1) <= mse(w0), search from w1; otherwise search from w0 • Favor w1 because we want some weight movement if possible
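A rough sketch of the tunneling move, assuming helpers `mse(w)` and `grad(w)`. The choice of starting point follows the rule above, but the step-halving line search itself is an illustrative stand-in, since the slides do not spell out the exact search.

```python
import numpy as np

def tunnel(w0, w1, mse, grad, step=0.5, shrink=0.5, n_steps=20):
    """Tunneling move: line-search along the negative gradient to find a
    low point of the "half-pipe".  Start from w1 when it is at least as
    good as w0 (we prefer some weight movement), else from w0."""
    start = w1 if mse(w1) <= mse(w0) else w0
    d = -grad(start)
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return start                   # nowhere to go
    d = d / norm
    best_w, best_e = start, mse(start)
    for _ in range(n_steps):
        cand = best_w + step * d
        e = mse(cand)
        if e < best_e:
            best_w, best_e = cand, e   # move accepted; keep the step size
        else:
            step *= shrink             # overshot the bottom; shrink the step
    return best_w
```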