The Streamlined Glide Algorithm and the LM-Glide Algorithm: Towards Reliable Convergence in the Training of Neural Networks. Vitit Kantabutra, Idaho State University, Pocatello, Idaho, U.S.A.
Neural Networks are Still Useful • Multi-layer perceptrons are still very popular in many fields • Classification • General function approximation
Neural Network Research Has Declined Despite the Unsolved Training Convergence Problem • In 2000, organizers of the NIPS conference pointed out that “neural networks” in the title was negatively correlated with acceptance • “SVM,” “Bayesian networks,” and “variational methods” were positively correlated with acceptance
Non-Convergence Problem • Still very prevalent • Leads to frustration and even compromised results
Second-Order Algorithms • Much of the research still done on neural networks concerns second-order algorithms • But second-order algorithms don’t help with large networks • Computational complexity problem • The flat-region problem slows down all gradient-based algorithms • Neither first-order nor higher-order conventional algorithms perform well • Zigzagging is another problem
Attempted Solutions for the Flat-Region Problem • Changing the formula for computing the output layer’s delta (error signal) (Solla, Fahlman) • Helps, but doesn’t eliminate the problem; we used Fahlman’s formula in some previous experiments to speed up convergence • Another approach is by Wilamowski and Torvik
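To make the delta-formula idea concrete, here is a minimal sketch, assuming a logistic output unit, of the kind of change commonly attributed to Fahlman as “flat-spot elimination.” The 0.1 offset and the function name are illustrative choices, not taken from this presentation or the cited papers.

```python
def output_delta(target, y, flat_spot_offset=0.1):
    """Error signal (delta) for a logistic output unit.

    Plain backprop uses (target - y) * y * (1 - y), which vanishes when the
    unit saturates near 0 or 1.  Adding a small constant to the derivative
    term keeps the delta from dying in flat regions (Fahlman-style flat-spot
    elimination; the 0.1 value is the conventional choice, assumed here
    rather than quoted from this talk).
    """
    return (target - y) * (y * (1.0 - y) + flat_spot_offset)
```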
Our First Attempt: Glide Algorithm • Kantabutra & Zheleva ’02 • Idea: flat regions are ‘safe’ • Why not go fast in flat regions? • Usually works, but sometimes the error rises sharply • Key: gradient descent often would have made sudden hairpin turns to safety where our algorithm glided too far into high-MSE territory • The weight trajectory hits a sigmoidal wall
Our Second Attempt: Glide Algorithm with Tunneling • Kantabutra, Tsendjav and Zheleva ’03 • Glides more carefully • Checks the error before making the move permanent • Adds a ‘tunneling’ move • Performs a local line search to find the bottom of the “half-pipe” • Works: 100% convergence, fast and reliable (low standard deviation in convergence time) • But it is complicated and could be cleaner
Illustrating the Importance of Tunneling • Mean-square error as a function of distance forms a “half-pipe”-shaped curve; in areas of turbulence we want to be at the bottom of the half-pipe
A Few Experimental Results (from the 2003 paper) • Problem: Parity-4 with 4 hidden neurons • [Plot of CPU time: x = run number, y = running time (sec) until convergence; “didn’t converge” marks gradient descent (G.D.) on odd runs] • Odd runs: random starting weights; even runs: starting with the previous run’s weights
Two-Spiral Problem (2003) • A very hard problem • Glide algorithm: combined with gradient descent for quicker initial error reduction; the number of epochs required for convergence varies widely (average: 30,453 epochs) • Gradient descent: often did not converge
Our Third Attempt: Simplified Glide Algorithm and LM-Glide • Still includes tunneling; we just removed the word from the name • Simpler but seemingly still effective
Glide Move: details (see the sketch below) • Take two small gradient-descent moves just for calculation purposes (w0 -> w1 -> w2) • Let w0 -> w2 be our direction of weight motion • A far glide is a long-distance glide (e.g., 0.2) • A near glide is a short-distance glide (e.g., 0.1) • Some tuning is required, but it is not difficult compared with regular gradient descent • Tuning can still be significant, even though the algorithm is less finicky than gradient descent • New: a self-tuning version for heart arrhythmia classification (UCI database)
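A minimal sketch of a single glide move, assuming a helper `grad(w)` that returns the MSE gradient at weight vector `w` and a small probe step size `eta`. The function name, the probe step size, and the defaults are illustrative assumptions; the two probe steps and the fixed-length move along w0 -> w2 follow the description above.

```python
import numpy as np

def glide_move(w0, grad, eta=0.01, glide_len=0.2):
    """One glide move: probe with two small gradient-descent steps, then
    travel a fixed distance along the resulting direction w0 -> w2.
    glide_len ~ 0.2 is a "far" glide, ~ 0.1 a "near" glide."""
    w1 = w0 - eta * grad(w0)      # first probe step (for calculation only)
    w2 = w1 - eta * grad(w1)      # second probe step
    direction = w2 - w0
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return w0                 # gradient vanished; no move
    return w0 + glide_len * (direction / norm)
```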
Downscaling or Shrinking Move • Multiply every weight by a factor such as 0.95 • May be needed every few dozen glides to keep the weights from growing out of control
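The downscaling move itself is a one-liner. In the sketch below, the 0.95 default comes from the slide, while the interval at which it is applied is a tuning choice the slide leaves open.

```python
def downscale(w, factor=0.95):
    """Shrink all weights by a fixed factor to keep them from growing
    without bound; applied roughly every few dozen glides."""
    return factor * w
```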
Tunneling Move: details (see the sketch below) • Used when mse(w2) > mse(w0) • Perform a local line search in the direction of the negative gradient from w0 or w1 to find the lowest-error point of the half-pipe • If mse(w1) <= mse(w0), search from w1; otherwise search from w0 • Favor w1 because we want some weight movement if possible
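A rough sketch of the tunneling move, assuming helpers `mse(w)` and `grad(w)`. The choice of starting point follows the rule above, but the step-halving line search itself is an illustrative stand-in, since the slides do not spell out the exact search.

```python
import numpy as np

def tunnel(w0, w1, mse, grad, step=0.5, shrink=0.5, n_steps=20):
    """Tunneling move: line-search along the negative gradient to find a
    low point of the "half-pipe".  Start from w1 when it is at least as
    good as w0 (we prefer some weight movement), else from w0."""
    start = w1 if mse(w1) <= mse(w0) else w0
    d = -grad(start)
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return start                   # nowhere to go
    d = d / norm
    best_w, best_e = start, mse(start)
    for _ in range(n_steps):
        cand = best_w + step * d
        e = mse(cand)
        if e < best_e:
            best_w, best_e = cand, e   # move accepted; keep the step size
        else:
            step *= shrink             # overshot the bottom; shrink the step
    return best_w
```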