This presentation covers modeling the human brain with Neural Networks and Projection Pursuit Regression. Topics include linear features, non-linear functions, ridge functions, training methods, issues in training, and examples.
Neural Networks. The Elements of Statistical Learning, Chapter 11. Presented by Nick Rizzolo
Modeling the Human Brain • Input builds up on receptors (dendrites) • The cell has an input threshold • When the threshold is breached, an activation is fired down the axon
“Magical” Secrets Revealed • Linear features are derived from inputs • Target concept(s) are non-linear functions of features
Outline • Projection Pursuit Regression • Neural Networks proper • Fitting Neural Networks • Issues in Training • Examples
Projection Pursuit Regression • Generalization of 2-layer regression NN • Universal approximator • Good for prediction • Not good for deriving interpretable models of data
Projection Pursuit Regression • [Figure: PPR as a network: inputs are projected onto unit vectors, passed through ridge functions, and summed to form the output]
PPR: Derived Features • The model is f(X) = Σ_m g_m(ω_mᵀX) • The dot product V_m = ω_mᵀX is the projection of X onto the unit vector ω_m • The ridge function g_m(ω_mᵀX) varies only in the direction ω_m
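To make the model concrete, here is a minimal NumPy sketch (not from the chapter) of evaluating a fitted PPR model; omegas and ridge_fns are assumed to come from training:

```python
import numpy as np

def ppr_predict(X, omegas, ridge_fns):
    """Evaluate f(X) = sum_m g_m(omega_m^T X) for a fitted PPR model.

    X         : (n, p) array of inputs
    omegas    : list of M unit vectors, each of shape (p,)
    ridge_fns : list of M callables, the smoothed ridge functions g_m
    """
    f = np.zeros(X.shape[0])
    for omega, g in zip(omegas, ridge_fns):
        v = X @ omega   # derived feature: projection of each input onto omega
        f += g(v)       # the ridge function varies only along omega
    return f
```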
PPR: Training • Minimize the squared error Σ_i [y_i − Σ_m g_m(ω_mᵀx_i)]² • Consider the single-term case, M = 1 • Given ω, we derive features v_i = ωᵀx_i and smooth the scatterplot of (v_i, y_i) to estimate g • Given g, we minimize over ω with Newton's method • Iterate those two steps to convergence
PPR: Newton's Method • Use derivatives to iteratively improve the estimate of ω: g(ωᵀx_i) ≈ g(ω_oldᵀx_i) + g′(ω_oldᵀx_i)(ω − ω_old)ᵀx_i • This reduces to a weighted least squares regression of the targets z_i = ω_oldᵀx_i + (y_i − g(ω_oldᵀx_i)) / g′(ω_oldᵀx_i) on the inputs x_i, with weights g′(ω_oldᵀx_i)², to hit the target and produce the updated ω
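A minimal NumPy sketch of that weighted least squares update; g and g_prime stand for the current smoothed ridge function and its derivative (names are illustrative):

```python
import numpy as np

def newton_step_omega(X, y, omega_old, g, g_prime):
    """One Gauss-Newton update of the projection direction omega via
    weighted least squares (a sketch; no guard against g'(v) = 0)."""
    v = X @ omega_old
    gp = g_prime(v)
    w = gp ** 2                      # regression weights g'(v_i)^2
    z = v + (y - g(v)) / gp          # working targets to "hit"
    WX = X * w[:, None]
    omega = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return omega / np.linalg.norm(omega)   # keep omega a unit vector
```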
PPR: Implementation Details (a training sketch follows this slide) • Suggested smoothing methods: local regression and smoothing splines • The g_m's can be readjusted with backfitting • The ω_m's are usually not readjusted • (ω_m, g_m) pairs are added in a forward stage-wise manner
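Putting the two steps together, a sketch of the single-term (M = 1) alternating fit, using a SciPy smoothing spline as the smoother and the newton_step_omega helper from the previous sketch:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def fit_ppr_single_term(X, y, n_iter=10):
    """Alternate (1) smoothing to estimate g and (2) a Gauss-Newton
    update of omega (a sketch; assumes distinct projected values)."""
    p = X.shape[1]
    omega = np.ones(p) / np.sqrt(p)          # initial unit vector
    for _ in range(n_iter):
        v = X @ omega
        order = np.argsort(v)                # the spline wants sorted x
        spline = UnivariateSpline(v[order], y[order])
        omega = newton_step_omega(X, y, omega, spline, spline.derivative())
    return omega, spline
```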
Outline • Projection Pursuit Regression • Neural Networks proper • Fitting Neural Networks • Issues in Training • Examples
NNs: Sigmoid and Softmax • Transforming activations to probabilities • Sigmoid: σ(v) = 1 / (1 + e^(−v)) • Softmax: g_k(T) = e^(T_k) / Σ_ℓ e^(T_ℓ) • Just like the multilogit model
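In code, the two transformations look like this (a minimal NumPy sketch; the max-shift in softmax is a standard numerical-stability trick, not from the chapter):

```python
import numpy as np

def sigmoid(v):
    """sigma(v) = 1 / (1 + e^(-v)): squashes an activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def softmax(t):
    """g_k(T) = e^(T_k) / sum_l e^(T_l): turns K activations into
    probabilities summing to 1 (shifted by max(t) for stability)."""
    e = np.exp(t - np.max(t))
    return e / e.sum()
```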
NNs: Training • We need an error function to minimize • Regression: sum squared error • Classification: cross-entropy • Generic approach: Gradient Descent (a.k.a. back propagation) • Error functions are differentiable • Forward pass to evaluate activations, backward pass to update weights
NNs: Back Propagation • Gradient descent update rules (learning rate γ_r): β_km ← β_km − γ_r Σ_i δ_ki z_mi and α_mℓ ← α_mℓ − γ_r Σ_i s_mi x_iℓ • Back propagation equations: s_mi = σ′(α_mᵀx_i) Σ_k β_km δ_ki, so the hidden-layer errors s_mi are computed from the output errors δ_ki
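A minimal NumPy sketch of one batch update for a single-hidden-layer regression network with squared error and sigmoid hidden units (shapes and names are illustrative, not from the chapter):

```python
import numpy as np

def backprop_step(X, Y, alpha, beta, lr):
    """One batch gradient-descent step for a one-hidden-layer net.

    X : (n, p) inputs      alpha : (m, p) input-to-hidden weights
    Y : (n, k) targets     beta  : (k, m) hidden-to-output weights
    Squared-error loss; sigmoid hidden units; linear outputs.
    """
    a = X @ alpha.T                      # hidden activations (n, m)
    z = 1.0 / (1.0 + np.exp(-a))         # hidden outputs, sigma(a)
    f = z @ beta.T                       # network outputs (n, k)

    delta = -2.0 * (Y - f)               # output errors delta_ki (n, k)
    s = (delta @ beta) * z * (1.0 - z)   # hidden errors via back propagation

    beta -= lr * delta.T @ z             # beta_km  -= lr * sum_i delta_ki z_mi
    alpha -= lr * s.T @ X                # alpha_ml -= lr * sum_i s_mi x_il
    return alpha, beta
```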
NNs: Back Propagation Details • Those were regression equations; the classification equations are similar • Updates can be batch or online • Online learning rates can be decreased during training to ensure convergence • Usually want to start with small weights • Sometimes impractically slow
Outline • Projection Pursuit Regression • Neural Networks proper • Fitting Neural Networks • Issues in Training • Examples
Issues in Training: Overfitting • Problem: we might reach the global minimum of the training error R(θ), which is likely an overfit solution • Proposed solutions: • Stop training early, watching performance on a validation set • Weight decay: penalize large weights by minimizing R(θ) + λJ(θ), where J(θ) is the sum of squared weights
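In code, weight decay simply adds a term to each gradient; a small helper sketch, where lam stands for the penalty parameter λ:

```python
def decayed_grad(grad, weights, lam):
    """Gradient of R(theta) + lam * J(theta): the ridge penalty
    J = sum of squared weights adds 2 * lam * weights."""
    return grad + 2.0 * lam * weights
```

In the earlier backprop sketch this would wrap the raw gradients, e.g. beta -= lr * decayed_grad(delta.T @ z, beta, lam), and similarly for alpha.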
A Closer Look at Weight Decay • [Figure: decision boundaries of networks trained without and with weight decay] • The less complicated hypothesis has a lower error rate
Outline • Projection Pursuit Regression • Neural Networks proper • Fitting Neural Networks • Issues in Training • Examples
Example #1: Synthetic Data • More hidden nodes lead to overfitting • Multiple initial weight settings should be tried • The radial function is learned poorly
Example #1: Synthetic Data (continued) • Two parameters to tune: weight decay and the number of hidden units • Suggested training strategy: fix either parameter at a value where the model is least constrained, and cross-validate the other
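As a sketch of that strategy with scikit-learn's MLPRegressor (its alpha parameter plays the role of the weight-decay penalty): pick a generous number of hidden units, then cross-validate the decay.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Least constrained in one direction: plenty of hidden units.
net = MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs", max_iter=2000)

# Cross-validate the other parameter: the weight-decay penalty alpha.
search = GridSearchCV(net, param_grid={"alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0]}, cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: your data
```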
Example #2: ZIP Code Data • Handwritten digit data studied by Yann LeCun • NNs can be structurally tailored to suit the data • Weight sharing: multiple units in a given layer are constrained to use the same weights
Example #2: 5 Networks • Net 1: No hidden layer • Net 2: One hidden layer • Net 3: 2 hidden layers • Local connectivity • Net 4: 2 hidden layers • Local connectivity • 1 layer weight sharing • Net 5: 2 hidden layers • Local connectivity • 2 layer weight sharing
Example #2: Results • Net 5 does best • Weight sharing means a small number of features can be identified anywhere in the image
Conclusions • Neural Networks are a very general approach to both regression and classification • They are an effective learning tool when: • The signal-to-noise ratio is high • Prediction is the goal • An interpretable description of the problem's solution is not required • Targets are naturally distinguished by direction as opposed to distance