Neural Networks I CMPUT 466/551 Nilanjan Ray
Outline • Projection Pursuit Regression • Neural Network • Background • Vanilla Neural Networks • Back-propagation • Examples
Projection Pursuit Regression • Additive model with non-linear ridge functions g_m: f(X) = Σ_{m=1..M} g_m(w_m^T X) • The feature vector X is projected onto direction vectors w_m, which we have to find from the training data • A precursor to neural networks
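A minimal sketch of this additive model in Python; the particular directions w_m and ridge functions g_m below are illustrative placeholders, not values from the lecture:

```python
import numpy as np

def ppr_predict(X, directions, ridge_functions):
    """PPR prediction f(X) = sum_m g_m(w_m^T X).
    X: (n, p) data; directions: list of (p,) unit vectors w_m;
    ridge_functions: list of 1-D callables g_m."""
    f = np.zeros(X.shape[0])
    for w, g in zip(directions, ridge_functions):
        f += g(X @ w)          # each term depends on X only through w^T X
    return f

# Example with M = 2 terms on 2-D inputs (placeholder choices)
X = np.random.randn(100, 2)
dirs = [np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2.0)]
gs = [np.tanh, lambda v: v**2]
y_hat = ppr_predict(X, dirs, gs)
```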
Fitting a PPR Model • Minimize the squared-error loss function Σ_i [ y_i − Σ_{m=1..M} g_m(w_m^T x_i) ]^2 • Proceed in forward stages: M = 1, 2, … etc. • At each stage, estimate g given w (say, by fitting a spline function) • Estimate w given g (details provided in the next slide) • The value of M is decided by cross-validation
Fitting a PPR Model… • At stage m, given g, compute w by a Gauss-Newton search: linearize g around the current estimate w_old, g(w^T x_i) ≈ g(w_old^T x_i) + g'(w_old^T x_i) (w − w_old)^T x_i, which turns the criterion into Σ_i g'(w_old^T x_i)^2 [ w_old^T x_i + (y_i − g(w_old^T x_i)) / g'(w_old^T x_i) − w^T x_i ]^2 (here y_i is the residual left after stage m−1) • So this is a weighted least-squares regression of the adjusted responses w_old^T x_i + (y_i − g(w_old^T x_i)) / g'(w_old^T x_i) on x_i, with weights g'(w_old^T x_i)^2
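A sketch of this single weighted-least-squares step, assuming the current ridge function g and its derivative g_prime are available as callables (the function name and the eps guard are mine, not from the lecture):

```python
import numpy as np

def ppr_direction_update(X, y, w_old, g, g_prime, eps=1e-8):
    """One Gauss-Newton step for the direction w, given g and g_prime."""
    v = X @ w_old                       # current projections w_old^T x_i
    r = y - g(v)                        # residuals of the current ridge fit
    d = g_prime(v)
    weights = d**2                      # weights  g'(w_old^T x_i)^2
    z = v + r / (d + eps)               # adjusted responses (eps avoids /0)
    # Weighted least squares: minimize sum_i weights_i * (z_i - w^T x_i)^2
    Xw = X * np.sqrt(weights)[:, None]
    zw = z * np.sqrt(weights)
    w_new, *_ = np.linalg.lstsq(Xw, zw, rcond=None)
    return w_new / np.linalg.norm(w_new)   # keep w a unit vector
```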
Vanilla Neural Network [Figure: a single-hidden-layer feed-forward network with an input layer, a hidden layer, and an output layer]
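A minimal forward pass for such a network, assuming sigmoid hidden units and linear output units; the weight shapes and bias handling are one possible convention, not taken from the slide:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def forward(X, alpha, beta):
    """X: (n, p) inputs; alpha: (p+1, M) input-to-hidden weights (with bias);
    beta: (M+1, K) hidden-to-output weights (with bias)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])     # add bias column
    Z = sigmoid(Xb @ alpha)                           # hidden-layer activations
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    T = Zb @ beta                                     # output-layer values f_k(X)
    return Z, T
```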
The Sigmoid Function [Plot: σ(10v) and σ(0.5v)] • σ(sv) = 1/(1 + exp(−sv)) • A smooth (regularized) threshold function • s controls the activation rate • Large s: hard, step-like activation; small s: the unit operates in its nearly linear regime (close to the identity function around 0)
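A quick numerical illustration of the activation-rate parameter s (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(v, s=1.0):
    """sigma(s*v) = 1 / (1 + exp(-s*v)); s controls the activation rate."""
    return 1.0 / (1.0 + np.exp(-s * v))

v = np.linspace(-3, 3, 7)
print(sigmoid(v, s=10))   # large s: close to a hard 0/1 threshold at v = 0
print(sigmoid(v, s=0.5))  # small s: nearly linear in v around the origin
```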
Multilayer Feed-Forward NN Example architectures: http://www.teco.uni-karlsruhe.de/~albrecht/neuro/html/node18.html
NN: Universal Approximator • A NN with one hidden layer can approximate arbitrarily well any continuous functional mapping from one finite-dimensional space to another, provided the number of hidden units is sufficiently large. • The proof is based on the Fourier expansion of a function (see Bishop).
NN: Kolmogorov's Theorem • Any continuous mapping f(x) of d input variables can be expressed by a neural network with two hidden layers of nodes: the first layer contains d(2d+1) nodes and the second layer contains (2d+1) nodes. • So why bother about topology at all? This 'universal' architecture is impractical because the functions represented by the hidden units are non-smooth and unsuitable for learning. (See Bishop for more.)
The XOR Problem and NN [Figure: a two-layer network that computes XOR] Activation functions are hard thresholds at 0
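One hand-wired two-layer network that computes XOR with hard thresholds at 0; the specific weights below are an illustrative choice, not necessarily the ones drawn on the slide:

```python
def step(v):                         # hard threshold at 0
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)         # fires when at least one input is 1 (OR)
    h2 = step(x1 + x2 - 1.5)         # fires only when both inputs are 1 (AND)
    return step(h1 - h2 - 0.5)       # OR and not AND  ->  XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))   # outputs 0, 1, 1, 0
```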
Fitting Neural Networks • Parameters to learn from training data • Cost functions • Sum-of-squared errors for regression • Cross-entropy errors for classification
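The two cost functions written out as a short sketch; one-hot targets and predicted class probabilities (e.g. softmax outputs) are assumed for the classification case:

```python
import numpy as np

def sum_of_squared_errors(y, f):
    """Regression cost: sum_i (y_i - f(x_i))^2."""
    return np.sum((y - f) ** 2)

def cross_entropy(y_onehot, p, eps=1e-12):
    """Classification cost: -sum_i sum_k y_ik * log p_ik."""
    return -np.sum(y_onehot * np.log(p + eps))
```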
Back-propagation: Implementation • Step 1: Initialize the parameters (weights) of the NN • Iterate • Forward pass: compute f_k(X) for the current parameter values, starting at the input layer and moving all the way up to the output layer • Backward pass: start at the output layer and compute the output-layer errors δ_ki; go down one layer at a time, computing the hidden-layer errors s_mi, all the way down to the input layer • Update the weights by the gradient-descent rule
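A compact sketch of one back-propagation pass for a single-hidden-layer network with sigmoid hidden units, linear outputs, and squared-error loss; dropping the biases and picking a fixed learning rate are simplifications of mine:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_epoch(X, Y, alpha, beta, lr=0.1):
    """One gradient-descent pass.
    alpha: (p, M) input->hidden weights, beta: (M, K) hidden->output weights."""
    # Forward pass: compute f_k(X) layer by layer
    A = X @ alpha                              # hidden pre-activations
    Z = sigmoid(A)                             # hidden activations
    F = Z @ beta                               # network outputs f_k(x_i)

    # Backward pass: output-layer errors delta, then hidden-layer errors s
    delta = -2.0 * (Y - F)                     # dR/dF for squared error
    s = (Z * (1.0 - Z)) * (delta @ beta.T)     # sigma'(A) * sum_k beta_mk delta_ki

    # Gradient-descent weight update
    beta -= lr * (Z.T @ delta) / X.shape[0]
    alpha -= lr * (X.T @ s) / X.shape[0]
    return alpha, beta
```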
Issues in Training Neural Networks • Initial values of parameters • Back-propagation finds a local minimum • Overfitting • Neural networks have too many parameters • Early stopping and regularization • Scaling of the inputs • Inputs are typically scaled to have zero mean and unit standard deviation • Number of hidden units and layers • Better to have too many than too few • With 'traditional' back-propagation, a deep NN gets stuck in local minima and does not learn well
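The input-scaling point as a short sketch (the training data here is synthetic, just to show the transformation):

```python
import numpy as np

# Standardize inputs to zero mean and unit standard deviation (per feature),
# using statistics computed on the training set only.
X_train = np.random.randn(200, 16) * 5.0 + 3.0     # illustrative data
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_scaled = (X_train - mu) / sd
# Apply the same mu and sd to any validation/test inputs.
```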
Avoiding Overfitting • Weight decay: add a penalty λ Σ w^2 (summed over all weights α and β) to the cost function, with λ tuned e.g. by cross-validation • Weight elimination penalty: λ Σ w^2 / (1 + w^2), which shrinks smaller weights more strongly
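The two penalties as code, with a hypothetical regularization strength lam; the gradient note at the end applies to the weight-decay case:

```python
import numpy as np

lam = 1e-3   # regularization strength (tune by cross-validation)

def weight_decay_penalty(alpha, beta):
    """lambda * (sum of squared weights), added to the error function."""
    return lam * (np.sum(alpha**2) + np.sum(beta**2))

def weight_elimination_penalty(alpha, beta):
    """lambda * sum w^2 / (1 + w^2): shrinks smaller weights more strongly."""
    return lam * (np.sum(alpha**2 / (1 + alpha**2)) +
                  np.sum(beta**2 / (1 + beta**2)))

# With weight decay, each gradient step picks up an extra 2*lam*w term:
#   w <- w - lr * (dR/dw + 2 * lam * w)
```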
Architectures and Parameters • Net-1: no hidden layer, equivalent to multinomial logistic regression • Net-2: one hidden layer, 12 hidden units, fully connected • Net-3: two hidden layers, locally connected • Net-4: two hidden layers, locally connected, with weight sharing • Net-5: two hidden layers, locally connected, two levels of weight sharing • Local connectivity with weight sharing is the idea behind convolutional neural networks
More on Architectures and Results • Net-1: #Links = #Weights = 16*16*10 + 10 = 2570 • Net-2: #Links = #Weights = 16*16*12 + 12 + 12*10 + 10 = 3214 • Net-3: #Links = #Weights = 8*8*3*3 + 8*8 + 4*4*5*5 + 4*4 + 10*4*4 + 10 = 1226 • Net-4: #Links = 2*8*8*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10 = 2266; #Weights = 2*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10 = 1132 • Net-5: #Links = 2*8*8*3*3 + 2*8*8 + 4*4*4*5*5*2 + 4*4*4 + 4*4*4*10 + 10 = 5194; #Weights = 2*3*3 + 2*8*8 + 4*5*5*2 + 4*4*4 + 4*4*4*10 + 10 = 1060
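The Net-4 and Net-5 counts above, checked by direct arithmetic:

```python
# Reproducing the link/weight counts quoted on this slide for Net-4 and Net-5.
net4_links   = 2*8*8*3*3 + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10
net4_weights = 2*3*3     + 2*8*8 + 4*4*5*5*2 + 4*4 + 10*4*4 + 10
net5_links   = 2*8*8*3*3 + 2*8*8 + 4*4*4*5*5*2 + 4*4*4 + 4*4*4*10 + 10
net5_weights = 2*3*3     + 2*8*8 + 4*5*5*2     + 4*4*4 + 4*4*4*10 + 10
print(net4_links, net4_weights)   # 2266 1132
print(net5_links, net5_weights)   # 5194 1060
```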
Some References • C.M. Bishop, Neural Networks for Pattern Recognition, Oxford Univ. Press, 1996. (For a good understanding) • S. Haykin, Neural Networks and Learning Machines, Prentice Hall, 2009. (For very basic reading, lots of examples, etc.) • Prominent researchers: • Yann LeCun (http://yann.lecun.com/) • G.E. Hinton (http://www.cs.toronto.edu/~hinton/) • Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html)