Neural Networks CSE 4309 – Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Perceptrons • A perceptron is a function that maps D-dimensional vectors to real numbers. • For notational convenience, we add a zero-th dimension to every input vector, that is always equal to 1. • $x_0$ is called the bias input. It is always equal to 1. • $w_0$ is called the bias weight. It is optimized during training.
Perceptrons • A perceptron computes its output in two steps: • First step: $a = \sum_{d=0}^{D} w_d x_d = \boldsymbol{w}^T \boldsymbol{x}$. • Second step: $z = h(a)$. • In a single formula: $z = h(\boldsymbol{w}^T \boldsymbol{x})$.
Perceptrons • A perceptron computes its output in two steps: • First step: $a = \boldsymbol{w}^T \boldsymbol{x}$. • Second step: $z = h(a)$. • $h$ is called an activation function. • For example, $h$ could be the sigmoidal function: $h(a) = \frac{1}{1 + e^{-a}}$.
Perceptrons • We have seen perceptrons before, we just did not call them perceptrons. • For example, logistic regression produces a classifier function $y(\boldsymbol{x}) = \sigma(\boldsymbol{w}^T \boldsymbol{x})$. • If we set the activation function $h$ to the sigmoidal function $\sigma$, then $y(\boldsymbol{x})$ is a perceptron.
Perceptrons and Neurons • Perceptrons are inspired by neurons. • Neurons are the cells forming the nervous system and the brain. • Neurons somehow sum up their inputs, and if the sum exceeds a threshold, they "fire". • Since brains are "intelligent", computer scientists have been hoping that perceptron-based systems can be used to model intelligence.
Activation Functions • A perceptron produces output $z = h(\boldsymbol{w}^T \boldsymbol{x})$. • One choice for the activation function $h$: the step function, $h(a) = 1$ if $a \geq 0$, and $h(a) = 0$ otherwise. • The step function is useful for providing some intuitive examples. • It is not useful for actual real-world systems. • Reason: it is not differentiable, so it does not allow optimization via gradient descent.
Activation Functions • A perceptron produces output $z = h(\boldsymbol{w}^T \boldsymbol{x})$. • Another choice for the activation function $h$: the sigmoidal function, $h(a) = \frac{1}{1 + e^{-a}}$. • The sigmoidal is often used in real-world systems. • It is a differentiable function, so it allows the use of gradient descent.
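To make the two-step computation concrete, here is a minimal Python sketch (not part of the original slides) of the step and sigmoidal activation functions and of a perceptron's output; the names step, sigmoid, and perceptron_output are illustrative choices.

```python
import numpy as np

def step(a):
    """Step activation: 1 if the weighted sum is nonnegative, else 0."""
    return 1.0 if a >= 0 else 0.0

def sigmoid(a):
    """Sigmoidal activation: differentiable, output strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-a))

def perceptron_output(w, x, h=sigmoid):
    """Two-step perceptron computation: a = w^T x, then z = h(a).
    Both w and x are (D+1)-dimensional, with x[0] = 1 as the bias input."""
    a = np.dot(w, x)   # first step: weighted sum
    return h(a)        # second step: activation
```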
Example: The AND Perceptron • Suppose we use the step function for activation. • Suppose boolean value false is represented as number 0. • Suppose boolean value true is represented as number 1. • Then, the perceptron below computes the boolean AND function (example weights are given in the sketch after the verification slides): Output: false AND false = false; false AND true = false; true AND false = false; true AND true = true.
Example: The AND Perceptron • Verification: If $x_1 = 0$ and $x_2 = 0$: • The weighted sum $w_0 + w_1 x_1 + w_2 x_2 = w_0$ is negative, so the step function outputs 0. • Corresponds to case false AND false = false.
Example: The AND Perceptron • Verification: If $x_1 = 0$ and $x_2 = 1$: • The weighted sum $w_0 + w_2$ is negative, so the step function outputs 0. • Corresponds to case false AND true = false.
Example: The AND Perceptron • Verification: If $x_1 = 1$ and $x_2 = 0$: • The weighted sum $w_0 + w_1$ is negative, so the step function outputs 0. • Corresponds to case true AND false = false.
Example: The AND Perceptron • Verification: If $x_1 = 1$ and $x_2 = 1$: • The weighted sum $w_0 + w_1 + w_2$ is nonnegative, so the step function outputs 1. • Corresponds to case true AND true = true.
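The slide figure with the actual AND weights is not reproduced here; one common choice that works with the step function is a bias weight of -1.5 and input weights of 1 and 1, as in this sketch (reusing perceptron_output and step from the earlier sketch):

```python
w_and = np.array([-1.5, 1.0, 1.0])   # example weights: bias, w1, w2

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([1.0, x1, x2])  # bias input x0 = 1
        z = perceptron_output(w_and, x, h=step)
        print(f"{x1} AND {x2} -> {z}")   # prints 0, 0, 0, 1
```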
Example: The OR Perceptron • Suppose we use the step function for activation. • Suppose boolean value false is represented as number 0. • Suppose boolean value true is represented as number 1. • Then, the perceptron below computes the boolean OR function (example weights are given in the sketch following the NOT perceptron verification): Output: false OR false = false; false OR true = true; true OR false = true; true OR true = true.
Example: The OR Perceptron • Verification: If $x_1 = 0$ and $x_2 = 0$: • The weighted sum $w_0$ is negative, so the step function outputs 0. • Corresponds to case false OR false = false.
Example: The OR Perceptron • Verification: If $x_1 = 0$ and $x_2 = 1$: • The weighted sum $w_0 + w_2$ is nonnegative, so the step function outputs 1. • Corresponds to case false OR true = true.
Example: The OR Perceptron • Verification: If $x_1 = 1$ and $x_2 = 0$: • The weighted sum $w_0 + w_1$ is nonnegative, so the step function outputs 1. • Corresponds to case true OR false = true.
Example: The OR Perceptron • Verification: If $x_1 = 1$ and $x_2 = 1$: • The weighted sum $w_0 + w_1 + w_2$ is nonnegative, so the step function outputs 1. • Corresponds to case true OR true = true.
Example: The NOT Perceptron • Suppose we use the step function for activation. • Suppose boolean value false is represented as number 0. • Suppose boolean value true is represented as number 1. • Then, the perceptron below computes the boolean NOT function (example weights are given in the sketch after the verification slides below): Output: NOT(false) = true; NOT(true) = false.
Example: The NOT Perceptron • Verification: If $x_1 = 0$: • The weighted sum $w_0$ is nonnegative, so the step function outputs 1. • Corresponds to case NOT(false) = true.
Example: The NOT Perceptron • Verification: If $x_1 = 1$: • The weighted sum $w_0 + w_1$ is negative, so the step function outputs 0. • Corresponds to case NOT(true) = false.
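The slide figures with the OR and NOT weights are likewise not reproduced; the weights below are just one workable choice, again reusing the helpers from the earlier sketch.

```python
w_or  = np.array([-0.5, 1.0, 1.0])   # example OR weights: bias, w1, w2
w_not = np.array([ 0.5, -1.0])       # example NOT weights: bias, w1

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([1.0, x1, x2])
        print(f"{x1} OR {x2} -> {perceptron_output(w_or, x, h=step)}")   # 0, 1, 1, 1

for x1 in (0, 1):
    x = np.array([1.0, x1])
    print(f"NOT {x1} -> {perceptron_output(w_not, x, h=step)}")          # 1, 0
```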
The XOR Function • As before, we represent false with 0 and true with 1. • The figure shows the four input points of the XOR function: green corresponds to output value true, red corresponds to output value false. • The two classes (true and false) are not linearly separable. • Therefore, no perceptron can compute the XOR function. Output: false XOR false = false; false XOR true = true; true XOR false = true; true XOR true = false.
Our First Neural Network: XOR • A neural network is built using perceptrons as building blocks. • The inputs to some perceptrons are outputs of other perceptrons. • Here is an example neural network computing the XOR function (units 3 and 4 are hidden units; unit 5 is the output unit).
Our First Neural Network: XOR • To simplify the picture, we do not show the bias input anymore. • We just show the bias weights. • Besides the bias input, there are two inputs: $x_1$, $x_2$.
Our First Neural Network: XOR • The XOR network shows how individual perceptrons can be combined to perform more complicated functions. • One hidden unit is an OR unit, the other hidden unit is an AND unit, and the output unit computes A AND (NOT B), where A is the output of the OR unit and B is the output of the AND unit.
Computing the Output: An Example • Suppose that $x_1 = 0$ and $x_2 = 1$ (corresponding to false XOR true). • For the OR unit: • The dot product of its weights with the input $(1, 0, 1)$ is nonnegative. • The activation function (assuming a step function) outputs 1.
Computing the Output: An Example • Suppose that $x_1 = 0$ and $x_2 = 1$ (corresponding to false XOR true). • For the AND unit: • The dot product of its weights with the input $(1, 0, 1)$ is negative. • The activation function (assuming a step function) outputs 0.
Computing the Output: An Example • Suppose that $x_1 = 0$ and $x_2 = 1$ (corresponding to false XOR true). • For the output unit (computing the A AND (NOT B) function): • One input is the output of the OR unit, which is 1. • The other input is the output of the AND unit, which equals 0.
Computing the Output: An Example • Suppose that $x_1 = 0$ and $x_2 = 1$ (corresponding to false XOR true). • For the output unit (computing the A AND (NOT B) function): • The dot product of its weights with the input $(1, 1, 0)$ is nonnegative. • The activation function (assuming a step function) outputs 1.
Verifying the XOR Network • We can follow the same process to compute the output of this network for the other three cases. • Here we consider the case where $x_1 = 0$ and $x_2 = 0$ (corresponding to false XOR false). • The output is 0, as it should be.
Verifying the XOR Network • We can follow the same process to compute the output of this network for the other three cases. • Here we consider the case where $x_1 = 1$ and $x_2 = 0$ (corresponding to true XOR false). • The output is 1, as it should be.
Verifying the XOR Network • We can follow the same process to compute the output of this network for the other three cases. • Here we consider the case where $x_1 = 1$ and $x_2 = 1$ (corresponding to true XOR true). • The output is 0, as it should be.
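The exact weights of the XOR network appear only in the slide figures; the sketch below wires an OR unit, an AND unit, and an A AND (NOT B) output unit with example weights and checks all four cases (reusing perceptron_output and step from the earlier sketch).

```python
# Hidden layer: an OR unit and an AND unit (example weights, as above).
w_or  = np.array([-0.5, 1.0, 1.0])
w_and = np.array([-1.5, 1.0, 1.0])
# Output unit computes A AND (NOT B), where A = OR output, B = AND output.
# Example weights: it fires only when A = 1 and B = 0.
w_out = np.array([-0.5, 1.0, -1.0])

def xor_network(x1, x2):
    x = np.array([1.0, x1, x2])
    a = perceptron_output(w_or,  x, h=step)       # hidden unit: OR
    b = perceptron_output(w_and, x, h=step)       # hidden unit: AND
    y = np.array([1.0, a, b])                     # bias input for the output unit
    return perceptron_output(w_out, y, h=step)    # output unit: A AND (NOT B)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(f"{x1} XOR {x2} -> {xor_network(x1, x2)}")   # 0, 1, 1, 0
```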
Neural Networks • This neural network example consists of six units: • Three input units (including the bias input, which is not shown). • Three perceptrons. • Yes, in the notation we will be using, inputs count as units.
Neural Networks • Weights are denoted as $w_{ij}$. • Weight $w_{ij}$ belongs to the edge that connects the output of unit $i$ with an input of unit $j$. • Units 0, 1, and 2 are the input units in this example.
Neural Network Layers • Oftentimes, neural networks are organized into layers. • The input layer is the initial layer of input units (units 0, 1, 2 in our example). • The output layer is at the end (unit 5 in our example). • Zero, one or more hidden layers can be between the input and output layers.
Neural Network Layers • There is only one hidden layer in our example, containing units 3 and 4. • Each hidden layer's inputs are outputs from the previous layer. • Each hidden layer's outputs are inputs to the next layer. • The first hidden layer's inputs come from the input layer. • The last hidden layer's outputs are inputs to the output layer.
Feedforward Networks • Feedforward networks are networks where there are no directed loops. • If there are no loops, the output of a neuron cannot (directly or indirectly) influence its input. • While there are varieties of neural networks that are not feedforward or layered, our main focus will be layered feedforward networks.
Computing the Output • Notation: L is the number of layers. • Layer 1 is the input layer, layer L is the output layer. • Given values for the input units, the output is computed as follows: • For $l = 2$ to $L$: • Compute the outputs of layer $l$, given the outputs of layer $l-1$.
Computing the Output • To compute the outputs of layer $l$ (where $2 \leq l \leq L$), we simply need to compute the output of each perceptron belonging to layer $l$. • For each such perceptron, its inputs are coming from outputs of perceptrons at layer $l-1$. • Remember, we compute layer outputs in increasing order of $l$.
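A minimal sketch of this layer-by-layer computation for a fully connected, layered feedforward network; representing the network as a list of per-layer weight matrices is an illustrative choice, not notation from the slides (reuses numpy and sigmoid from the earlier sketch).

```python
def forward(layer_weights, x, h=sigmoid):
    """Compute the network output layer by layer.
    layer_weights[0] is the weight matrix of layer 2, layer_weights[1] of layer 3,
    and so on; each row holds a unit's bias weight followed by its weights on the
    previous layer's outputs. x is the input vector without the bias."""
    z = np.asarray(x, dtype=float)                # outputs of the input layer
    for W in layer_weights:                       # layers 2, 3, ..., L in order
        z_with_bias = np.concatenate(([1.0], z))  # prepend the bias input
        z = h(W @ z_with_bias)                    # outputs of the current layer
    return z

# Example: the XOR network above, with the step function applied elementwise.
weights = [np.array([[-0.5, 1.0, 1.0],      # hidden layer: OR unit, AND unit
                     [-1.5, 1.0, 1.0]]),
           np.array([[-0.5, 1.0, -1.0]])]   # output layer: A AND (NOT B)
print(forward(weights, [0, 1], h=lambda a: (a >= 0).astype(float)))  # [1.]
```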
What Neural Networks Can Compute • An individual perceptron is a linear classifier. • The weights of the perceptron define a linear boundary between two classes. • Layered feedforward neural networks with one hidden layer can approximate any continuous function to arbitrary accuracy. • Layered feedforward neural networks with two hidden layers can approximate any mathematical function. • This has been known for decades, and is one reason scientists have been optimistic about the potential of neural networks to model intelligent systems. • Another reason is the analogy between neural networks and biological brains, which have been a standard of intelligence we are still trying to achieve. • There is only one catch: how do we find the right weights?
Training a Neural Network • In linear regression, for the sum-of-squares error, we could find the best weights using a closed-form formula. • In logistic regression, for the cross-entropy error, we could find the best weights using an iterative method. • In neural networks, we cannot find the best weights (unless we have an astronomical amount of luck). • We only have optimization methods that find local minima of the error function. • Still, in recent years such methods have produced spectacular results in real-world applications.
Notation for Training Set • We define $\boldsymbol{w}$ to be the vector of all weights in the neural network. • We have a set $\boldsymbol{x}_1, \dots, \boldsymbol{x}_N$ of N training examples. • Each $\boldsymbol{x}_n$ is a (D+1)-dimensional column vector. • Dimension 0 is the bias input, always set to 1. • We also have a set $\boldsymbol{t}_1, \dots, \boldsymbol{t}_N$ of N target outputs. • $\boldsymbol{t}_n$ is the target output for training example $\boldsymbol{x}_n$. • Each $\boldsymbol{t}_n$ is a K-dimensional column vector. • Note: K typically is not equal to D.
Perceptron Learning • Before we discuss how to train an entire neural network, we start with a single perceptron. • Remember: given input $\boldsymbol{x}$, a perceptron computes its output using this formula: $z = h(\boldsymbol{w}^T \boldsymbol{x})$. • We use sum-of-squares as our error function. • $E_n(\boldsymbol{w})$ is the contribution of training example $\boldsymbol{x}_n$: $E_n(\boldsymbol{w}) = \frac{1}{2} \left( h(\boldsymbol{w}^T \boldsymbol{x}_n) - t_n \right)^2$. • The overall error is defined as: $E(\boldsymbol{w}) = \sum_{n=1}^{N} E_n(\boldsymbol{w})$. • Important: a single perceptron has a single output. • Therefore, for perceptrons (but NOT for neural networks in general), we assume that the target output $t_n$ is one-dimensional.
Perceptron Learning • Suppose that a perceptron is using the step function as its activation function $h$. • Can we apply gradient descent in that case? • No, because $h(\boldsymbol{w}^T \boldsymbol{x})$ is not differentiable. • Small changes of $\boldsymbol{w}$ usually lead to no change at all in $h(\boldsymbol{w}^T \boldsymbol{x})$. • The only exception is when the change in $\boldsymbol{w}$ causes $\boldsymbol{w}^T \boldsymbol{x}$ to switch sign (from positive to negative, or from negative to positive).
Perceptron Learning • A better option is setting $h$ to the sigmoid function: $h(a) = \frac{1}{1 + e^{-a}}$. • Then, measured just on a single training object $\boldsymbol{x}_n$, the error is defined as: $E_n(\boldsymbol{w}) = \frac{1}{2} \left( h(\boldsymbol{w}^T \boldsymbol{x}_n) - t_n \right)^2$. • Note: here we use the sum-of-squares error, and not the cross-entropy error that we used for logistic regression. • Also note: if our neural network is a single perceptron, then the target output $t_n$ is one-dimensional.
Computing the Gradient • In this form, $E_n(\boldsymbol{w})$ is differentiable. • If we do the calculations, the gradient turns out to be: $\nabla E_n(\boldsymbol{w}) = \left( z_n - t_n \right) z_n \left( 1 - z_n \right) \boldsymbol{x}_n$, where $z_n = h(\boldsymbol{w}^T \boldsymbol{x}_n)$. • Note that $\nabla E_n(\boldsymbol{w})$ is a (D+1)-dimensional vector: it is a scalar, $(z_n - t_n)\, z_n (1 - z_n)$, multiplied by the vector $\boldsymbol{x}_n$.
Weight Update • So, we update the weight vector as follows: $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \left( z_n - t_n \right) z_n \left( 1 - z_n \right) \boldsymbol{x}_n$. • As before, $\eta$ is the learning rate parameter. • It is a positive real number that should be chosen carefully, so as not to be too big or too small. • In terms of individual weights $w_d$, the update rule is: $w_d \leftarrow w_d - \eta \left( z_n - t_n \right) z_n \left( 1 - z_n \right) x_{n,d}$.
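A sketch of this update rule for a single training example, assuming the sigmoid perceptron and the sum-of-squares error; the function name perceptron_update and the use of numpy arrays are illustrative (reuses sigmoid from the earlier sketch).

```python
def perceptron_update(w, x_n, t_n, eta):
    """One gradient-descent step on E_n(w) = 0.5 * (z_n - t_n)^2.
    w and x_n are (D+1)-dimensional numpy arrays; eta is the learning rate."""
    z_n = sigmoid(np.dot(w, x_n))                        # perceptron output
    gradient = (z_n - t_n) * z_n * (1.0 - z_n) * x_n     # (D+1)-dimensional vector
    return w - eta * gradient
```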
Perceptron Learning - Summary • Input: Training inputs $\boldsymbol{x}_1, \dots, \boldsymbol{x}_N$, target outputs $t_1, \dots, t_N$. • Step 1: Extend each $\boldsymbol{x}_n$ to a (D+1)-dimensional vector, by adding 1 (the bias input) as the value for dimension 0. • Step 2: Initialize weights $w_d$ to small random numbers. For example, set each $w_d$ between -0.1 and 0.1. • Step 3: For n = 1 to N: • Compute $z_n = h(\boldsymbol{w}^T \boldsymbol{x}_n)$. • For d = 0 to D: $w_d \leftarrow w_d - \eta \left( z_n - t_n \right) z_n \left( 1 - z_n \right) x_{n,d}$. • Step 4: If some stopping criterion has been met, exit. • Step 5: Else, go to step 3.
Stopping Criterion • At step 4 of the perceptron learning algorithm, we need to decide whether to stop or not. • One thing we can do is: • Compute the cumulative squared error $E(\boldsymbol{w})$ of the perceptron at that point: $E(\boldsymbol{w}) = \sum_{n=1}^{N} \frac{1}{2} \left( z_n - t_n \right)^2$. • Compare the current value of $E(\boldsymbol{w})$ with the value of $E(\boldsymbol{w})$ computed at the previous iteration. • If the difference is too small (e.g., smaller than 0.00001), we stop.
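Putting the steps together, a sketch of the full perceptron learning loop with the error-difference stopping criterion described above; the learning rate, the threshold value, and the fixed iteration cap are illustrative choices (reuses sigmoid and perceptron_update from the earlier sketches).

```python
def train_perceptron(X, t, eta=0.1, tol=1e-5, max_iterations=10000):
    """X: N x (D+1) array of training inputs, with X[:, 0] = 1 (bias inputs).
    t: length-N array of target outputs. Returns the learned weight vector."""
    N, D_plus_1 = X.shape
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.1, 0.1, size=D_plus_1)   # step 2: small random weights
    previous_error = np.inf
    for _ in range(max_iterations):
        for n in range(N):                      # step 3: one pass over the data
            w = perceptron_update(w, X[n], t[n], eta)
        z = sigmoid(X @ w)
        error = 0.5 * np.sum((t - z) ** 2)      # cumulative squared error E(w)
        if abs(previous_error - error) < tol:   # step 4: stopping criterion
            break
        previous_error = error
    return w
```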
Using Perceptrons for Multiclass Problems • “Multiclass” means that we have more than two classes. • A perceptron outputs a number between 0 and 1. • This is sufficient only for binary classification problems. • For more than two classes, there are many different options. • We will follow a general approach called one-versus-all classification.
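As a preview of the general idea (the details follow in later slides), a minimal sketch of one-versus-all classification: train one perceptron per class, with target 1 for that class and 0 for all others, and classify a new input by the perceptron with the highest output; the function names are illustrative (reuses sigmoid and train_perceptron from the earlier sketches).

```python
def train_one_versus_all(X, labels, num_classes):
    """Train one perceptron per class; labels is an array of ints in 0..num_classes-1."""
    return [train_perceptron(X, (labels == c).astype(float))
            for c in range(num_classes)]

def classify(weights_per_class, x):
    """Assign x (a (D+1)-dimensional vector with bias input 1) to the class
    whose perceptron gives the highest output."""
    outputs = [sigmoid(np.dot(w, x)) for w in weights_per_class]
    return int(np.argmax(outputs))
```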