CS 9633 Machine Learning: Neural Networks. Adapted from notes by Tom Mitchell (http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html). Computer Science Department, CS 9633 Machine Learning
Neural networks • Practical method for learning • Real-valued functions • Discrete-valued functions • Vector-valued functions • Robust in the presence of noise • Loosely based on biological model of learning
Backpropagation Neural Networks • Assume fixed structure of network • Directed graph (usually acyclic) • Learning consists of choosing weights for edges
Characteristics of Backpropagation Problems • Instances represented by many attribute-value pairs • Target functions • Discrete-valued • Real-valued • Vector-valued • Instances may contain errors • Long training times are acceptable • Fast evaluation of function may be required • Not important that people understand the learned function
Perceptrons • Basic unit of many neural networks • Basic operation • Input: vector of real values • Calculates a linear combination of inputs • Output • 1 if result is greater than some threshold • -1 otherwise
A perceptron (figure): inputs x0 = 1, x1, x2, x3, ..., xn with weights w0, w1, w2, w3, ..., wn feed a summation processor (Σ), whose result is passed to a threshold processor.
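Notation • Perceptron function: $o(x_1, \ldots, x_n) = 1$ if $w_0 + w_1 x_1 + \cdots + w_n x_n > 0$, and $-1$ otherwise • Vector form of perceptron function: $o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$, with $x_0 = 1$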
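The perceptron function can be sketched in a few lines of Python. This is a minimal illustration, assuming the convention from the perceptron figure that a constant input x0 = 1 is paired with w0; the name `perceptron_output` is hypothetical and not from the slides.

```python
def perceptron_output(weights, inputs):
    """Return 1 if w0 + w1*x1 + ... + wn*xn > 0, otherwise -1.

    weights holds n + 1 values (weights[0] is w0); inputs holds n values.
    """
    # Prepend the constant input x0 = 1 so that w0 plays the role of the
    # (negated) threshold.
    x = [1.0] + list(inputs)
    total = sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else -1
```

For example, with weights `[-0.5, 1.0, 1.0]` the unit outputs 1 for the input `[1, 0]` (sum 0.5) and -1 for `[0, 0]` (sum -0.5).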
Learning a perceptron • Learning consists of choosing values for the n + 1 weights $w_0, \ldots, w_n$ • Space H of candidate hypotheses is the set of all real-valued weight vectors: $H = \{\vec{w} \mid \vec{w} \in \mathbb{R}^{n+1}\}$
Representational Power of Perceptrons • A perceptron represents a hyperplane decision surface in the n-dimensional space of instances. • Outputs 1 for instances on one side of the hyperplane and -1 for instances on the other side • Equation for decision hyperplane: $\vec{w} \cdot \vec{x} = 0$ • Sets of instances that can be separated by a hyperplane are said to be linearly separable
Linearly Separable Pattern Classification
Non-Linearly Separable Pattern Classification
The Kiss of Death • 1969: Marvin Minsky and Seymour Papert proved that the perceptron had computational limits. Statement: “The perceptron has many features which attract attention: its linearity, its intriguing learning theorem...there is no reason to believe that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile”
Boolean functions • A perceptron can be used to represent the following Boolean functions • AND • OR • Any m-of-n function • NOT • NAND (NOT AND) • NOR (NOT OR) • Every Boolean function can be represented by a network of interconnected units based on these primitives • Two levels of units are enough
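To make the Boolean primitives concrete, the sketch below picks one possible set of weights for two-input AND, OR, NAND, and NOR. It assumes true is encoded as +1 and false as -1; the specific weight values and the helper names are illustrative choices, not taken from the slides.

```python
from itertools import product

def perceptron_output(weights, inputs):
    # x0 = 1 is prepended so that weights[0] acts as w0.
    x = [1.0] + list(inputs)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1

# Hypothetical weight vectors [w0, w1, w2] for inputs encoded as +1 / -1.
AND_W = [-0.8, 0.5, 0.5]
OR_W = [0.3, 0.5, 0.5]
NAND_W = [0.8, -0.5, -0.5]   # negation of the AND weights
NOR_W = [-0.3, -0.5, -0.5]   # negation of the OR weights

def truth_table(weights):
    """Map every (x1, x2) in {-1, +1}^2 to the perceptron's output."""
    return {bits: perceptron_output(weights, bits)
            for bits in product([-1, 1], repeat=2)}
```

Negating every weight of a unit flips its output whenever the weighted sum is non-zero, which is why NAND and NOR fall out of AND and OR here.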
Revival • 1982: John Hopfield responsible for revival • 1987: First IEEE conference on neural networks. Over 2000 attended. • And the rest is history!
Perceptron Training • Initialize weight vector with random weights • Apply the perceptron to each training example • Modify the perceptron weights whenever an example is misclassified, using the perceptron training rule • Repeat until all training examples are classified correctly
Characteristics of Perceptron Training Rule • Guaranteed to converge within a finite number of applications of the rule to a weight vector that correctly classifies all training examples if: • Training examples are linearly separable • The learning rate is sufficiently small
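A minimal sketch of perceptron training follows. It assumes the standard form of the perceptron training rule from Mitchell's text, w_i ← w_i + η (t − o) x_i, which the slides name but do not spell out. The function name, the default learning rate, the epoch cap, and the initialization range are all hypothetical.

```python
import random

def train_perceptron(examples, eta=0.1, max_epochs=100, seed=0):
    """examples is a list of (inputs, target) pairs with target in {-1, +1}."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    # Small random initial weights; w[0] is w0, paired with x0 = 1.
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            x = [1.0] + list(inputs)
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != target:
                mistakes += 1
                # Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i
                w = [wi + eta * (target - o) * xi for wi, xi in zip(w, x)]
        if mistakes == 0:
            # A full pass with no misclassification: training is done.
            break
    return w
```

On linearly separable data such as two-input AND (with +1/-1 encoding) the loop stops after a few passes; on data that is not linearly separable it simply runs until `max_epochs`.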
Gradient Descent and the Delta Rule • Designed to converge toward the best-fit approximation of the target concept if the instances are not linearly separable. • Searches the hypothesis space for possible weight vectors to find the weights that best fit the training data • Serves as a basis for backpropagation neural networks
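Training task • Task of training a linear unit without a threshold: $o(\vec{x}) = \vec{w} \cdot \vec{x}$ • Training error (minimization task): $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where D is the set of training examples, $t_d$ is the target output for example d, and $o_d$ is the output of the linear unit for example d • E is a function of $\vec{w}$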
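The training error can be computed directly. The sketch below assumes the usual squared-error measure for an unthresholded linear unit, E(w) = ½ Σ (t_d − o_d)², with a constant input x0 = 1; the function names are hypothetical.

```python
def linear_output(weights, inputs):
    """Unthresholded linear unit: o = w0 + w1*x1 + ... + wn*xn."""
    x = [1.0] + list(inputs)
    return sum(w * xi for w, xi in zip(weights, x))

def training_error(weights, examples):
    """Half the sum of squared differences between target t and output o."""
    return 0.5 * sum((t - linear_output(weights, x)) ** 2 for x, t in examples)
```

With the two examples `((1.0,), 3.0)` and `((2.0,), 5.0)`, the weights `[1.0, 2.0]` fit both targets exactly and give zero error, while `[0.0, 0.0]` gives 0.5 × (9 + 25) = 17.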
Hypothesis Space
Derivation of the Gradient Descent Learning Rule • Derivation is on pages 91-92 of the text • The gradient of the error with respect to the weights gives the direction of steepest ascent. Its negative is the direction of steepest descent. • The gradient gives a very nice, intuitive learning rule.
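The key steps of the derivation can be written out as follows. This is a sketch assuming the squared-error measure $E(\vec{w}) = \frac{1}{2}\sum_{d \in D}(t_d - o_d)^2$ for a linear unit $o_d = \vec{w} \cdot \vec{x}_d$, where $x_{id}$ denotes input component $i$ of training example $d$ and $\eta$ is the learning rate.

```latex
\begin{aligned}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d \in D}(t_d - o_d)^2 \\
  &= \sum_{d \in D}(t_d - o_d)\,
     \frac{\partial}{\partial w_i}\bigl(t_d - \vec{w}\cdot\vec{x}_d\bigr) \\
  &= \sum_{d \in D}(t_d - o_d)\,(-x_{id}) \\[4pt]
\Delta w_i
  &= -\eta\,\frac{\partial E}{\partial w_i}
   = \eta \sum_{d \in D}(t_d - o_d)\,x_{id}
\end{aligned}
```

Each weight therefore moves in proportion to the error (t − o) times the corresponding input, summed over the training examples.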
Gradient-Descent(training_examples, η)
  Initialize each w_i to some small random value
  Until the termination condition is met, Do
    Initialize each Δw_i to zero
    For each <x, t> in training_examples, Do
      Input the instance x to the unit and compute the output o
      For each linear unit weight w_i, Do
        Δw_i ← Δw_i + η (t − o) x_i
    For each linear unit weight w_i, Do
      w_i ← w_i + Δw_i
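The procedure above can be sketched in Python for a single linear unit. The termination condition is assumed here to be a fixed number of passes over the data, and the default learning rate, epoch count, and initialization range are hypothetical values chosen for the small example in the usage note.

```python
import random

def gradient_descent(examples, eta=0.05, epochs=500, seed=0):
    """Batch gradient descent for an unthresholded linear unit.

    examples is a list of (inputs, target) pairs; returns [w0, w1, ..., wn].
    """
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        # Accumulate the weight changes over ALL examples before applying them.
        delta = [0.0] * (n + 1)
        for inputs, t in examples:
            x = [1.0] + list(inputs)            # x0 = 1 pairs with w0
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]
    return w
```

On the noise-free examples `[((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]`, generated from t = 1 + 2x, the returned weights approach `[1.0, 2.0]`. A learning rate that is too large for the data makes the updates diverge, so `eta` has to be chosen with the input scale in mind.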
Gradient Descent • Useful for very large or infinite hypothesis spaces • Can be applied if • The hypothesis space contains continuously parameterized hypotheses (e.g. weights) • The error can be differentiated with respect to the hypothesis parameters
Practical Difficulties with Gradient Descent • Converging to a local minimum can sometimes be quite slow • If there are multiple local minima, there is no guarantee the procedure will find the global minimum
Stochastic Gradient Descent • Also called incremental gradient descent • Tries to address practical problems with gradient descent • In gradient descent, the error is computed for all of the training examples and the weights are updated after all training examples have been presented • Stochastic gradient descent updates the weights incrementally based on the error with each example
Stochastic-Gradient-Descent(training_examples, η)
  Initialize each w_i to some small random value
  Until the termination condition is met, Do
    For each <x, t> in training_examples, Do
      Input the instance x to the unit and compute the output o
      For each linear unit weight w_i, Do
        w_i ← w_i + η (t − o) x_i
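A matching Python sketch of the stochastic version is shown below. The only intended difference from batch gradient descent is that the weights change immediately after each example instead of once per pass. As before, the fixed epoch count stands in for the termination condition, and the defaults are hypothetical.

```python
import random

def stochastic_gradient_descent(examples, eta=0.05, epochs=500, seed=0):
    """Incremental (stochastic) gradient descent for a linear unit."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):
        for inputs, t in examples:
            x = [1.0] + list(inputs)            # x0 = 1 pairs with w0
            o = sum(wi * xi for wi, xi in zip(w, x))
            # Update right away, using the error on this single example.
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w
```

Examples are visited in their given order here; shuffling them on each pass is a common variation that this sketch leaves out.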
Standard versus Stochastic Gradient Descent
Comparison of Learning Rules
Multilayer Networks and Backpropagation (figure): input layer units I0, I1, I2, I3; hidden layer units H0, H1, H2; output layer units O1, O2.
Multilayer design • Need a unit whose • Output is a non-linear function of its inputs • Output is a differentiable function of its inputs • Choice • Use a unit that, like a perceptron, computes a linear combination of its inputs • Then apply a threshold to the result that is smoothed and differentiable
Sigmoid Threshold Unit (figure): inputs x0 = 1, x1, x2, x3, ..., xn with weights w0, w1, w2, w3, ..., wn feed a summation processor (Σ), whose result is passed to a threshold processor.
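A sigmoid unit can be sketched as below, assuming the logistic function σ(y) = 1 / (1 + e^(−y)) as the smoothed threshold. Its derivative can be written in terms of the output itself, σ'(y) = σ(y)(1 − σ(y)), which is what makes the unit convenient for gradient descent. The function names are hypothetical.

```python
import math

def sigmoid(y):
    """Logistic function: a smooth, differentiable stand-in for a threshold."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit_output(weights, inputs):
    """Linear combination of the inputs (with x0 = 1) passed through sigmoid."""
    x = [1.0] + list(inputs)
    net = sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(net)

def sigmoid_derivative_from_output(o):
    """Derivative of the sigmoid, expressed through its output o."""
    return o * (1.0 - o)
```

The output lies strictly between 0 and 1, so targets for such a unit are usually encoded in that range rather than as -1 and +1.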
BACKPROPAGATION(training_examples, η, n_in, n_out, n_hidden)
  Create a feed-forward network with n_in input units, n_hidden hidden units, and n_out output units.
  Initialize each weight to some small random value
  Until the termination condition is met, Do
    For each <x, t> in training_examples, Do
      Propagate the input forward through the network:
        1. Input the instance x to the network and compute the output o_u of every unit u in the network.
      Propagate the errors backward through the network:
        2. For each network output unit k, calculate its error term δ_k: δ_k ← o_k (1 − o_k)(t_k − o_k)
        3. For each hidden unit h, calculate its error term δ_h: δ_h ← o_h (1 − o_h) Σ_k w_kh δ_k, summing over the output units k
        4. Update each network weight w_ji: w_ji ← w_ji + Δw_ji, where Δw_ji = η δ_j x_ji
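The four steps can be sketched in plain Python for a network with one hidden layer. This assumes sigmoid units throughout and the squared-error criterion, for which the error terms take the standard form δ_k = o_k(1 − o_k)(t_k − o_k) for output units and δ_h = o_h(1 − o_h) Σ_k w_kh δ_k for hidden units. The fixed epoch count, the initialization range, and the helper names are hypothetical.

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def forward(w_hidden, w_out, inputs):
    """Step 1: return (x, h, o) with the bias entries x[0] = h[0] = 1."""
    x = [1.0] + list(inputs)
    h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = [sigmoid(sum(w * hi for w, hi in zip(ws, h))) for ws in w_out]
    return x, h, o

def backpropagation(examples, eta, n_in, n_out, n_hidden, epochs=5000, seed=0):
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
             for _ in range(n_out)]
    for _ in range(epochs):
        for inputs, targets in examples:
            x, h, o = forward(w_hidden, w_out, inputs)
            # Step 2: error term of each output unit k.
            delta_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, targets)]
            # Step 3: error term of each hidden unit j (h[j + 1] is its output),
            # computed with the output weights as they were before the update.
            delta_hid = [h[j + 1] * (1 - h[j + 1])
                         * sum(w_out[k][j + 1] * delta_out[k] for k in range(n_out))
                         for j in range(n_hidden)]
            # Step 4: w_ji <- w_ji + eta * delta_j * x_ji for every weight.
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_out[k][j] += eta * delta_out[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += eta * delta_hid[j] * x[i]
    return w_hidden, w_out
```

As a usage example, training on the four OR examples `((0, 0), (0,))`, `((0, 1), (1,))`, `((1, 0), (1,))`, `((1, 1), (1,))` with `eta=0.5`, `n_in=2`, `n_out=1`, `n_hidden=2` should yield outputs that round to the targets. Harder targets such as XOR may need more hidden units, more passes, or a different random seed.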
Termination Conditions • Fixed number of iterations • Error on training examples falls below threshold • Error on validation set meets some criterion
Adding Momentum • A variation on backpropagation • Makes the weight update on one iteration dependent on the update on the previous iteration • Keeps movement going in the “right” direction. • Can sometimes solve problems with local minima and enable faster convergence
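One common way to write the momentum update is Δw(n) = η δ x + α Δw(n − 1), where 0 ≤ α < 1 is the momentum constant. The sketch below assumes that form; `gradient_step` stands for the plain update η δ x already computed for each weight, and the function name and default α are hypothetical.

```python
def momentum_step(weights, gradient_step, previous_update, alpha=0.9):
    """Apply one weight update that carries over part of the previous update.

    gradient_step[i] is the plain update for weight i (eta * delta * x);
    previous_update[i] is the update applied to weight i on the last iteration.
    Returns the new weights and the update that was applied.
    """
    update = [g + alpha * p for g, p in zip(gradient_step, previous_update)]
    new_weights = [w + u for w, u in zip(weights, update)]
    return new_weights, update
```

If the plain update keeps pointing the same way, successive updates grow (0.1, then 0.19, then 0.271 for a constant step of 0.1 and α = 0.9), which is the sense in which momentum keeps the search moving through flat regions.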
General Acyclic Network Structure (figure): input layer units I1, I2, I3; hidden layer units H1, H2, H3; output layer units O1, O2.
Derivation of Backpropagation Rule • See section 4.5.3 in the text
Convergence and Local Minima • Error surface may contain many local minima • Algorithm is only guaranteed to converge toward some local minimum in E • In practice, it is a very effective function approximation method. • Problem with local minima is often not encountered • Local minimum with respect to one weight is often counter-balanced by other weights • Initially, with weights near 0, the function represented is nearly linear in its inputs
Methods for Avoiding Local Minima • Add a momentum term • Use stochastic gradient descent • Train multiple networks • Select best • Use committee machine
Representational Power of Feed Forward NNs • Boolean functions • Any Boolean function can be represented with a 2-layer neural network. • Scheme for arbitrary Boolean function • For each possible input vector, create a distinct hidden unit and set its weights so it activates iff this specific vector is input • OR all of these together
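The scheme can be sketched with threshold units. The sketch assumes a +1/-1 encoding and, as a simplification, creates hidden units only for the input vectors on which the target function is true, since those are the only ones the output OR needs. The weight choices and function names are illustrative, not taken from the slides.

```python
from itertools import product

def threshold(total):
    return 1 if total > 0 else -1

def build_two_layer_network(n, target):
    """target maps a tuple of n values in {-1, +1} to True or False."""
    true_vectors = [v for v in product([-1, 1], repeat=n) if target(v)]
    # Hidden unit for vector v: the sum v.x - (n - 0.5) is positive only
    # when x == v (it is 0.5 there and at most -1.5 for any other x).
    hidden = [[-(n - 0.5)] + list(v) for v in true_vectors]
    # Output unit: OR over the m hidden outputs, each in {-1, +1}.
    m = len(hidden)
    output = [m - 0.5] + [1.0] * m
    return hidden, output

def evaluate(network, inputs):
    hidden, output = network
    x = [1.0] + list(inputs)
    h = [1.0] + [threshold(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden]
    return threshold(sum(w * hi for w, hi in zip(output, h)))
```

For XOR on two inputs this builds two hidden units, one for (-1, +1) and one for (+1, -1). The number of hidden units can grow exponentially with the number of inputs, which is the price of the construction's generality.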
Representational Power of Feed Forward NNs • Continuous Functions • Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers of units • Sigmoid units at hidden layer • Unthresholded linear units at output layer • Number of hidden units depends on the function to be approximated
Representational Power of Feed Forward NNs • Arbitrary Functions • Any function can be approximated to arbitrary accuracy by a network with 3 layers of units • Sigmoid units at the two hidden layers • Unthresholded linear units at the output layer • Number of units needed at each layer is not known in general
Hypothesis Search Space and Inductive Bias • Every set of network weights is a different hypothesis • Hypothesis space is continuous • Continuous space and E differentiable with respect to weights gives useful organization of search space by gradient descent • Inductive bias is • Defined by interaction of gradient descent search and weight space • Roughly characterized as smooth interpolation between data points
Hidden Layer Representations • Backprop can learn useful intermediate representations at the hidden layer • Defines new hidden layer features that are not explicit in the input representation, but capture relevant properties of the input instances
Generalization, Overfitting, and Stopping Criterion • Using error on test examples as the stopping criterion is a bad idea • Backprop is prone to overfitting • Why does overfitting occur in later iterations, but not earlier?
Avoiding overfitting • Weight decay • Decrease weights by small factor during each iteration • Stay away from complex surfaces • Validation Data • Train with training set • Get error with validation set • Keep best weights so far on validation data • Cross-validation to determine best number of iterations
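Both ideas can be sketched independently of any particular network. In the sketch below, `train_step` and `validation_error` are hypothetical callables supplied by the caller (one training iteration, and the error on the validation set), and `decay_weights` shrinks every weight by a small fixed factor. None of these names come from the slides.

```python
import copy

def decay_weights(weights, factor=0.001):
    """Weight decay: shrink every weight by a small factor each iteration."""
    return [w * (1.0 - factor) for w in weights]

def train_keep_best(weights, train_step, validation_error, max_iterations=1000):
    """Train for max_iterations, remembering the weights with the lowest
    validation error seen so far.

    Returns (best_weights, best_error, best_iteration).
    """
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    best_iteration = 0
    for iteration in range(1, max_iterations + 1):
        weights = train_step(weights)
        error = validation_error(weights)
        if error < best_error:
            # Copy, so later training steps cannot alter the saved weights.
            best_weights = copy.deepcopy(weights)
            best_error = error
            best_iteration = iteration
    return best_weights, best_error, best_iteration
```

The returned `best_iteration` is the kind of quantity that cross-validation would average over folds to pick the number of training iterations.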