540 likes | 1k Views
Part IV Perceptron Learning. NASA Space Program. Outline. Linear separability Linear classifiers and linear decision boundaries The perceptron classifier Perceptron learning algorithhm basic principles pseudocode. Decision Boundaries. What is a Classifier?
E N D
Part IVPerceptron Learning NASA Space Program
Outline • Linear separability • Linear classifiers and linear decision boundaries • The perceptron classifier • Perceptron learning algorithhm • basic principles • pseudocode
Decision Boundaries • What is a Classifier? • A classifier is a mapping from feature space (a d-dimensional vector) to the class labels {1, 2, … m} • Thus, a classifier partitions the feature space into m decision regions • The line or curve separating the 2 classes is the decision boundary • in more than 2 dimensions this is a surface (e.g., a hyperplane) • Linear Classifiers • a linear classifier is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane) • it is one of the simplest classifiers we can imagine • “separate the two classes using a straight line in feature space” • in 2 dimensions the decision boundary is a straight line
Convex Hull of a Set of Points • Convex Hull of a set of Q points: • Intuitively • think of each point in Q as a nail sticking out from a 2d board • the convex hull = the shape formed by a tight rubber band that surrounds all the nails • Formally: “the convex hull is the smallest convex polygon P for which each point in Q is either on the boundary of P or in its interior” • (p.898, Cormen, Leiserson, and Rivest, Introduction to Algorithms) • can be found (for n points) in time n log n
Convex Hull of a Set of Points • Relation to Class Overlap • define convex hulls of data points D1 and D2 as P1 and P2 • if P1 and P2 intersect, then we have overlap • If P1 and P2 intersect then D1 and D2 are not linearly seperable
Convex Hull Example Feature 2 Feature 1
Convex Hull Example Convex Hull P1 Feature 2 Feature 1
Data from 2 Classes: Linearly Separable? Feature 2 x x x x x x x Feature 1
Data from 2 Classes: Linearly Separable? Convex Hull P1 Feature 2 x x x x x x x Feature 1
Data from 2 Classes: Linearly Separable? Convex Hull P1 Convex Hull P2 Feature 2 x x x x x x The 2 Hulls intersect => data from each class are not linearly separable x Feature 1
Different data => linearly separable Convex Hull P1 Convex Hull P2 Feature 2 x x x x x x x Feature 1
Some Theory Let N be the number of data points Let d be the dimension of the data points Let F(N, d) = the fraction of dichotomies of N points in d dimensions that are linearly separable Then, it can be shown that: = 1 if N < d+2 F(N, d) = (1 / 2 N-1 )Sdi=0 (N-1)! / [ (N-1-i)! i! ] if N > d
Fraction of Points that are Linearly Separable F(N,d) d = infinity 1 d = 2 0.5 d = 1 0 N/(d+1) 1 2 3
A Linear Classifier in 2 Dimensions Let Feature 1 be called X Let Feature 2 be called Y A linear classifier is a linear function of X and Y, i.e., it computes f(X,Y) = aX + bY + c Here a, b, and c are the “weights” of the classifier Define the output of the linear classifier to be T(f) = -1, if f <= 0 T(f) = +1, if f > 0 if f(X,Y) <= 0, the classifier produces a “-1” (Decision Region 1) if f(X,Y) > 0, the classifier produces a “+1” (Decision Region 2)
Decision Boundaries for a 2d Linear Classifier Depending on whether f(X,Y) is > or < 0, the features (X,Y) get classified into class 1 or class 2 Thus, f(X,Y) = 0 must define the decision boundary between class 1 and 2, i.e., f(X,Y) = aX + bY + c = 0 Thus, defining a, b, and c automatically locates the decision boundary in X,Y space In summary: - a classifier defines a decision boundaries between classes - for a linear classifier, this boundary is a line or a plane - the equation of the plane is defined by the parameters of the classifier
The Perceptron Classifier (for 2 features) X w1 w2 f = w1 X + w2 Y + w3 {-1, +1} Y T(f) w3 Threshold Function Output = class decision Weighted Sum of the inputs 1
Perceptrons • Perceptron = a linear classifier • The w’s are the weights (denoted as a, b,c, earlier) • real-valued constants (can be positive or negative) • Define an additional constant input “1” (allows an intercept in decision boundary)
Perceptrons • A perceptron calculates 2 quantities: • 1. A weighted sum of the input features • 2. This sum is then thresholded by the T function • A simple artificial model of human neurons • weights = “synapses” • threshold = “neuron firing”
Notation • Inputs: • x0, x1, x2, …………, xd-1, xd • x0 = 1 (a constant input) • x1, x2, …………, xd-1, xd are the values of the d features • x = ( x0, x1, x2, …………, xd-1, xd ) • Weights: • w0, w1, w2, …………, wd-1, wd • we have d+1 weights • one for each feature, + one for the constant • w = ( w0, w1, w2, …………, wd-1, wd )
Perceptron Operation • Equations of operation: • = 1 (if w0x0 + w1x1 +… wd xd > 0) • o(x1, x2,…, xd-1, xd) • = -1 (otherwise) • Note that: • w = (w0,….. wd), the “weight vector” (row vector, 1 x d+1) • and x = (x0,…… xd), the “feature vector” (row vector, 1 x d+1) • => w0x0 + w1x1 +… wd xd = w . x’ • and w . x’ is the “vector inner product” (w*x’ in MATLAB)
Perceptron Decision Boundary • Equations of operation (in vector form): • = 1 (if w . x’ > 0) • o(x1, x2,…, xd-1, xd) • = -1 (otherwise) • The perceptron represents a hyperplane decision surface in d-dimensional space, e.g., a line in 2d, a plane in 3d, etc. • The equation of the hyperplane is w . x’ = 0 • This is the equation for points in x-space that are on the boundary
Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 => 0.1 + 1. x1 - 1. x2 = 0 => x1 - x2 = 0 => x1 = x2 x2 x1
Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 => 0.1 + 1. x1 - 1. x2 = 0 => x1 - x2 = 0 => x1 = x2 This is the equation for the decision boundary x2 x1
Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ < 0 => x1 - x2 < 0 => x1 < x2 (this is the equation for decision region -1) w . x’ = 0 x2 x1
Example of Perceptron Decision Boundary w = (w0, w1,w2) = (0, 1, -1) w . x’ = 0 x2 w . X’ < 0 w . x’ > 0 => x1 - x2 > 0 => x1 > x2 (this is the equation for decision region +1) x1
Representational Power of Perceptrons • What mappings can a perceptron represent perfectly? • A perceptron is a linear classifier • thus it can represent any mapping that is linearly separable • some Boolean functions like AND (on left) • but not Boolean functions like XOR (on right) 0 x 0 x 0 0 x 0
Learning the Classifier Parameters • Where do the parameters (weights) of the classifier come from? • If we know a lot about the problem, we could “design” them • Typically we don’t know ahead of time what the values should be
Learning the Classifier Parameters • Learning from Training Data: • training data = labeled feature vectors • i.e., a set of N feature vectors each with a class label • we can use the training data to try to find good parameters • “good” parameters are ones which provide low error on the training data • Statement of the Learning Problem: • given a classifier, and some training data, find the values for the classifier’s parameters which maximize accuracy
Learning the Weights from Data An Example of a Training Data Set Example x1 x2 …. xd True Class Label (“target” t) x(1)3.4 -1.2 ….. 7.1 1 x(2)4.1 -3.1 ….. 4.6 -1 x(3)5.7 -1.0 ….. 6.2 -1 x(4)2.2 4.1 ….. 5.0 1 x(n)1.2 4.3 ….. 6.1 1
Learning as a Search Problem • The objective function: • the accuracy of the classifier (for a given set of weights w and labeled data) • Problem: • maximize this objective function
Learning as a Search Problem • Equivalent to an optimization or search problem • i.e., think of the vector (w1, w2, w3) • this defines a 3-dimensional “parameter space” • we want to find the value of (w1, w2, w3) which maximizes the objective function • So we could use hill-climbing, systematic search, etc., to search this parameter space • many learning algorithms = hill-climbing with random restarts
Classification Accuracy • Say we have N feature vectors • Say we know the true class label for each feature vector • We can measure how accurate a classifier is by how many feature vectors it classifies correctly
Classification Accuracy • Procedure: • accuracy = 0 • For each of the N feature vectors: • calculate the output of the classifier for this vector • if the class label agrees with the true label • ++ accuracy • continue • Percentage Accuracy = (accuracy/N) * 100%
Training Data and Learning • We have N examples • an example consists of a feature vector and a class label (target) • the ith feature vector is denoted x(i) • Learning on the Training Data • Let o(i) = o( x(i) ) be the output of a perceptron for the ith feature vector x(i) • Let t(i) be the target value • goal is to find perceptron weights such that • o(i) is close to t(i) for as many examples as possible • i.e., the output matches the desired target as often as possible TrainingAccuracy(w) = 1/N S I( o(i), t(i) ) where I( o(i), t(i) ) = 1 if o(i) = t(i), and 0 otherwise
Mean-Squared Error as a Measure of Accuracy • Consider training an “unthresholded” perceptron, • consider the sum without the threshold, f(x) = w . x’ • this is a real number • saw we want this real number to match the target as closely as possible, for each input x • Error for ith example = ei (w) = ( f(i) - t(i) )2 • Mean-squared error, E(w) = (1/N) Si ( f(i) - t(i) )2= (1/N) Si (w .x’(i) - t(i) )2 • We try to choose the weights which minimize E • note that f(i) is a function of the weights w • f(i) = w . x’(i) • thus, E is also a function the weights, i.e., E(w)
Gradient Descent Learning Rule • Weight update rule: • wj <- wj + h ( t(i) - f(i) ) xj(i) • t(i) and f(i) are the target and weighted sum (respectively) for the ith example • wj is the jth input weight • xj(i) is the jth input feature value, for the ith example • h is called the learning rate: a small positive number, 0 < h < 1 • An example of how this works: • Say wj and xj(i) are both positive: • say t(i) > f(i) => we increase the value of the weight • say t(i) < f(i) => we decrease the value of the weight • h controls how quickly we increase or decrease the weight
Perceptron Training Algorithm • Algorithm outline • initialize the weights (e.g., randomly) • loop through all N examples (this is 1 iteration) • for each example calculate the output • calculate the difference between the output and the target • update each of the d+1 weights using the gradient update rule • after all N examples are gone through • check if the overall error (MSE) has decreased significantly since the previous iteration • if not, then perform another iteration through all N examples • if so, then
Pseudo-code for Perceptron Training • Inputs: d features, N targets (class labels), learning rate h • Outputs: a set of learned weights • Initialize each wj (e.g., to a small random value) • While (termination condition not satisfied) • for i = 1: N • for j=0: d • deltaw = h ( t(i) - f(i) ) xj(i) • wj = wj + deltaw • end • calculate termination condition • end
Comments on Incremental Learning • will converge to a minimum of E under certain assumptions • if the learning rate is small enough • converges even if the data are not linearly separable • not guaranteed to minimize classification accuracy (but usually will in practice) • typical learning rates: from 0.01 to 0.00001 • can depend on the particular data set • can be adapted (adjusted) automatically during learning • termination conditions: • e.g., when either Accuracy or Mean Squared Error hardly change from one iteration to the next (one iteration of the while loop) • e.g., change in mean square error < 0.0001
Summary of Perceptron Learning Algorithms • “Error Correcting” Perceptron Training rule: • “error correction”, works with thresholded output • guaranteed to converge to zero error if examples are linearly separable • otherwise no guarantee of convergence • “Gradient Descent” Perceptron Training • Incremental version (this is the version for Assignment 3): • updates the weights after visiting each example: cycles through the examples • one iteration is defined as one pass through all N examples (N weight updates) • Batch version (we are not using this) • updates the weights after going through all the examples
MATLAB Demo of Perceptron Learning • sampledata1: linearly separable • sampledata2: not linearly separable • Illustration of • different learning rates • different initialization methods • different convergence criteria • Visualization of • decision boundaries • MSE and Accuracy as a function of iteration number
Assignment function [outputs] = perceptron(weights,data) % function [outputs] = perceptron(weights,data) % % brief description of what the function does % ...... % Your Name, ICS 175A, date % % Inputs % weights: 1 x (d+1) row vector of weights % data: N x (d+1) matrix of training data % % Outputs % outputs: N x 1 vector of perceptron outputs -------- Your code goes here -------
Assignment function [cerror, mse] = perceptron_error(weights,data,targets) % function [cerror, mse] = perceptron_error(weights,data,targets) % % brief description of what the function does % ...... % Your Name, ICS 175A, date % % Inputs % weights: 1 x (d+1) row vector of weights % data: N x (d+1) matrix of training data % targets: N x 1 vector of target values (+1 or -1) % % Outputs % cerror: the percentage of examples misclassified (between 0 and 100) % mse: the mean-squared error (sum of squared errors divided by N) -------- Your code goes here -------