Feed-Forward Artificial Neural Networks
AMIA 2003, Machine Learning Tutorial
Constantin F. Aliferis & Ioannis Tsamardinos
Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University
Binary Classification Example
• Example: classifying breast cancer as malignant or benign from mammograms
• Predictor 1: lump thickness
• Predictor 2: single epithelial cell size
(Figure: training examples plotted with value of predictor 1 on the horizontal axis and value of predictor 2 on the vertical axis)
Possible Decision Areas
(Figures: several possible partitions of the predictor space into class regions, one class drawn as green triangles and the other as red circles; axes are value of predictor 1 and value of predictor 2)
Binary Classification Example
• The simplest non-trivial decision function is the straight line (in general, a hyperplane)
• One decision surface
• The decision surface partitions the space into two subspaces
(Figure: a straight line separating the two classes; axes are value of predictor 1 and value of predictor 2)
Specifying a Line
• Line equation: w2x2 + w1x1 + w0 = 0
• Classifier:
• If w2x2 + w1x1 + w0 ≥ 0, output one class (the green triangles)
• Else, output the other class (the red circles)
(Figure: the line w2x2 + w1x1 + w0 = 0 in the (x1, x2) plane; points with w2x2 + w1x1 + w0 > 0 fall on one side, points with w2x2 + w1x1 + w0 < 0 on the other)
Classifying with Linear Surfaces
• Use 1 for one class (green triangles)
• Use -1 for the other class (red circles)
• The classifier becomes sgn(w2x2 + w1x1 + w0) = sgn(w2x2 + w1x1 + w0x0), with x0 = 1 always
(Figure: the same line w2x2 + w1x1 + w0 = 0, with the two half-planes labeled by the sign of w2x2 + w1x1 + w0)
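A minimal sketch of this classifier in Python; the weight and input values below are made up purely for illustration and are not from the slides:

```python
def classify(x1, x2, w0, w1, w2):
    """Linear classifier: +1 on one side of the line w2*x2 + w1*x1 + w0 = 0, -1 on the other."""
    s = w2 * x2 + w1 * x1 + w0
    return 1 if s >= 0 else -1

# Illustrative weights and two example points
print(classify(x1=2.0, x2=3.0, w0=-4.0, w1=1.0, w2=1.0))  # 1  (point on the positive side)
print(classify(x1=0.5, x2=0.5, w0=-4.0, w1=1.0, w2=1.0))  # -1 (point on the negative side)
```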
The Perceptron
(Figures: a single perceptron unit, shown step by step. The inputs x0, x1, x2, x3, x4 at the bottom are the attributes of the patient to classify, with x0 = 1 always; each input is multiplied by the weight on its connection (the example uses the weights 3, 0, -2, 4, and 2); the weighted inputs are summed; the sum is passed through the sign function (here sgn(3) = 1); and the value at the top is the output: the classification of the patient as malignant or benign. The example input shown is x0 = 1, x1 = 0, x2 = 4, x3 = 3.1, x4 = 2.)
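The same computation in vector form, with the bias folded in as the constant input x0 = 1; the weights and input are again illustrative placeholders rather than the exact values on the slides:

```python
import numpy as np

def perceptron_output(w, x):
    """Perceptron forward pass: sgn(w . x), with the bias folded in as x[0] = 1."""
    s = np.dot(w, x)
    return 1 if s >= 0 else -1

# Illustrative weights w0..w4 and one input vector (x0 is always 1)
w = np.array([2.0, 4.0, -2.0, 0.0, 3.0])   # w0, w1, w2, w3, w4
x = np.array([1.0, 0.0, 4.0, 3.1, 2.0])    # x0=1, x1, x2, x3, x4
print(perceptron_output(w, x))              # prints 1 or -1 depending on the weighted sum
```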
Learning with Perceptrons
• Hypothesis space: the set of all weight vectors w (one weight per input, plus the bias weight w0)
• Inductive bias: prefer hypotheses that do not misclassify any of the training instances (or that minimize an error function)
• Search method: perceptron training rule, gradient descent, etc.
• Remember: the problem is to find "good" weights
Training Perceptrons
• Start with random weights
• Update them in an intelligent way to improve them
• Intuitively (here the perceptron outputs 1 but the true output is -1):
• Decrease the weights that increase the sum
• Increase the weights that decrease the sum
• Repeat for all training instances until convergence
(Figure: the example perceptron from the previous slides computes sgn(3) = 1, while the true output for this instance is -1)
Perceptron Training Rule
• wi ← wi + η (t − o) xi, where η is the learning rate
• t: output the current example should have
• o: output of the perceptron
• xi: value of predictor variable i
• t = o: no change (for correctly classified examples)
Perceptron Training Rule
• t = -1, o = 1: Δwi = −2ηxi, so the weighted sum will decrease
• t = 1, o = -1: Δwi = 2ηxi, so the weighted sum will increase
Perceptron Training Rule
• In vector form: w ← w + η (t − o) x
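A sketch of one application of the training rule in Python; the learning rate, weights, and example vector are illustrative assumptions:

```python
import numpy as np

def perceptron_update(w, x, t, eta=0.1):
    """One perceptron training step: w <- w + eta * (t - o) * x."""
    o = 1 if np.dot(w, x) >= 0 else -1   # current output of the perceptron
    return w + eta * (t - o) * x          # no change when t == o

# Illustrative example: the perceptron outputs 1 but the target is -1,
# so every weight with a positive input value is decreased.
w = np.array([2.0, 4.0, -2.0])            # w0, w1, w2
x = np.array([1.0, 0.5, 1.5])             # x0=1, x1, x2
print(perceptron_update(w, x, t=-1))
```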
Example of Perceptron Training: The OR Function
(Figure: the four training examples in the (x1, x2) plane; (0, 0) has class -1, while (0, 1), (1, 0), and (1, 1) have class 1)
Example of Perceptron Training: The OR Function
• Initial random weights define the line x1 = 0.5
(Figure: the vertical line x1 = 0.5; the classifier outputs -1 to its left and 1 to its right)
Example of Perceptron Training: The OR Function
• With the line x1 = 0.5, the only misclassified example is x1 = 0, x2 = 1
(Figure: the OR examples and the line x1 = 0.5, with the misclassified point highlighted)
Example of Perceptron Training: The OR Function
• Applying the training rule to the misclassified example x1 = 0, x2 = 1 changes the weights and therefore the decision line
• The new line crosses the axes at x1 = -0.5 (for x2 = 0) and x2 = -0.5 (for x1 = 0)
• So, the new line is x1 + x2 + 0.5 = 0
(Figure: the old line x1 = 0.5 and the new line through (-0.5, 0) and (0, -0.5))
Example of Perceptron Training: The OR Function
• Next iteration: the example x1 = 0, x2 = 0 (class -1) is now misclassified, so the weights are updated again
(Figure: the line through (-0.5, 0) and (0, -0.5), with the misclassified point highlighted)
Example of Perceptron Training: The OR Function
• Perfect classification
• No change occurs on the next pass, so training stops
(Figure: the final line separates (0, 0), the only -1 example, from the other three examples)
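A self-contained sketch of the whole procedure on the OR data in Python. The learning rate is an illustrative assumption, and the initial weights are one choice that reproduces the slide's starting line x1 = 0.5, so the intermediate steps may not match the slides exactly:

```python
import numpy as np

# OR training data: columns are x0 = 1 (bias input), x1, x2; targets use the 1 / -1 encoding
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([-1, 1, 1, 1])

w = np.array([-0.5, 1.0, 0.0])   # illustrative initial weights; they define the line x1 = 0.5
eta = 0.5                         # illustrative learning rate

for epoch in range(20):
    errors = 0
    for x, t in zip(X, T):
        o = 1 if np.dot(w, x) >= 0 else -1
        if o != t:
            w = w + eta * (t - o) * x   # perceptron training rule
            errors += 1
    if errors == 0:                     # converged: every example is classified correctly
        break

print("weights:", w, "epochs:", epoch + 1)
```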
Analysis of the Perceptron Training Rule
• The algorithm will always converge within a finite number of iterations if the data are linearly separable
• Otherwise, it may oscillate (no convergence)
Training by Gradient Descent
• Similar, but:
• Always converges
• Generalizes to training networks of perceptrons (neural networks)
• Idea:
• Define an error function
• Search for weights that minimize the error, i.e., find weights that zero the error gradient
Setting Up the Gradient Descent
• Mean squared error: E(w) = ½ Σd (td − od)², where the sum runs over the training examples d
• Minima exist where the gradient is zero: ∂E(w)/∂wi = 0 for every weight wi
The Sign Function is not Differentiable
(Figure: the OR example with the decision line x1 = 0.5; the classifier's output jumps abruptly from -1 to 1 across the line, so the error is not differentiable in the weights)
Use Differentiable Transfer Functions
• Replace the sign function sgn(a)
• with the sigmoid σ(a) = 1 / (1 + e^(−a))
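For a single sigmoid unit o_d = σ(w · x_d), the gradient of the mean squared error can be written out explicitly; this step is standard (it uses σ'(a) = σ(a)(1 − σ(a))) but is not spelled out on the slides:

```latex
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_d (t_d - o_d)^2
  = -\sum_d (t_d - o_d)\,\frac{\partial o_d}{\partial w_i}
  = -\sum_d (t_d - o_d)\, o_d\,(1 - o_d)\, x_{i,d}
```

Gradient descent then moves each weight a small step against this gradient: Δwi = −η ∂E/∂wi.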
Updating the Weights with Gradient Descent
• Each weight update goes through all training instances
• Each weight update is therefore more expensive, but more accurate
• Always converges to a local minimum, regardless of the data
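A sketch of batch gradient descent for a single sigmoid unit in Python, using the gradient above; the training data (OR with 0/1 targets to match the sigmoid's range), learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative training data: rows are [x0=1, x1, x2]; targets are in [0, 1]
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 1.0])   # OR, with 0/1 targets for the sigmoid's range

w = np.zeros(3)
eta = 0.5

for _ in range(2000):
    O = sigmoid(X @ w)                       # outputs for all training instances
    grad = -X.T @ ((T - O) * O * (1 - O))    # dE/dw for E = 0.5 * sum((T - O)**2)
    w -= eta * grad                          # one batch update uses all instances

print("weights:", w)
print("outputs:", np.round(sigmoid(X @ w), 2))
```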
Multiclass Problems
• E.g., 4 classes
• Two possible solutions:
• Use one perceptron (output unit) and interpret the result as follows:
• Output 0–0.25: class 1
• Output 0.25–0.5: class 2
• Output 0.5–0.75: class 3
• Output 0.75–1: class 4
• Use four output units and a binary encoding of the output (1-of-M encoding)
1-of-M Encoding
• Assign to the class with the largest output
(Figure: a network with inputs x0 … x4 and four output units, one output per class: class 1, class 2, class 3, class 4)
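A small sketch of the 1-of-M decision rule in Python; the output values are illustrative:

```python
import numpy as np

def predict_class(outputs):
    """1-of-M encoding: one output unit per class; pick the class with the largest output."""
    return int(np.argmax(outputs)) + 1   # classes numbered 1..M

# Illustrative outputs of the four output units for one instance
print(predict_class([0.1, 0.7, 0.15, 0.05]))   # 2
```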
Feed-Forward Neural Networks
(Figure: a layered network with an input layer x0 … x4, hidden layer 1, hidden layer 2, and an output layer; connections run only forward, from each layer to the next)
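A sketch of the forward pass through such a network in Python, with sigmoid units throughout; the layer sizes and random weights are illustrative assumptions, not an architecture taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weight_matrices):
    """Propagate an input through the layers; each layer sees the previous layer's outputs plus a bias of 1."""
    activation = x
    for W in weight_matrices:
        activation = np.append(1.0, activation)   # x0 = 1 plays the role of the bias input
        activation = sigmoid(W @ activation)
    return activation

# Illustrative architecture: 4 inputs -> hidden layer of 3 -> hidden layer of 3 -> 1 output
layer_shapes = [(3, 5), (3, 4), (1, 4)]            # (units, inputs incl. bias) per layer
weights = [rng.normal(size=shape) for shape in layer_shapes]
print(forward(np.array([0.0, 4.0, 3.1, 2.0]), weights))
```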
Increased Expressiveness Example: Exclusive OR
• No line (no set of three weights w0, w1, w2) can separate the training examples (learn the true function)
(Figure: the XOR training data in the (x1, x2) plane; (0, 0) and (1, 1) have class -1, while (0, 1) and (1, 0) have class 1, and a single unit sgn(w2x2 + w1x1 + w0) cannot separate them)
Increased Expressiveness Example
(Figure: a network for the same problem with one hidden layer: the inputs x0, x1, x2 connect to two hidden units H1 and H2 with weights wi,j, and the hidden units connect to the output unit O with weights w'1,1 and w'2,1)
Increased Expressiveness Example
(Figures: the hidden-layer network with concrete weights (thresholds of ±0.5 on the hidden and output units), evaluated on each of the four XOR training examples in turn. For the inputs (0, 0) and (1, 1) the output unit O computes -1, and for the inputs (0, 1) and (1, 0) it computes 1, so the network classifies all four examples correctly, which no single linear unit can do.)
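A sketch in Python of one hidden-layer network that computes XOR with sgn units. These particular weights are an illustrative choice (H1 behaves like OR, H2 like AND), not necessarily the ones drawn on the slides:

```python
def sgn(a):
    return 1 if a >= 0 else -1

def xor_net(x1, x2):
    """Two sgn hidden units feeding one sgn output unit."""
    h1 = sgn(x1 + x2 - 0.5)        # fires (+1) when at least one input is 1 (OR-like)
    h2 = sgn(x1 + x2 - 1.5)        # fires (+1) only when both inputs are 1 (AND-like)
    return sgn(h1 - h2 - 0.5)      # +1 exactly when h1 fires but h2 does not

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))   # -1, 1, 1, -1
```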
From the Viewpoint of the Output Layer
(Figure: the four training examples T1 … T4 shown twice: in the original (x1, x2) space, where they are not linearly separable, and mapped by the hidden layer into the (H1, H2) space, where a single line can separate them)
From the Viewpoint of the Output Layer
• Each hidden layer maps the data to a new instance space
• Each hidden node is a new, constructed feature
• The original problem may become separable (or easier) in the new space
(Figure: the same mapping of T1 … T4 from the (x1, x2) space into the (H1, H2) space)
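A small sketch of this viewpoint in Python, reusing the illustrative XOR weights from the sketch above: it prints where the hidden layer sends each training example, making it visible that the images are linearly separable even though the originals are not:

```python
def sgn(a):
    return 1 if a >= 0 else -1

def hidden_features(x1, x2):
    """The hidden layer as a feature constructor: maps (x1, x2) to (H1, H2)."""
    return sgn(x1 + x2 - 0.5), sgn(x1 + x2 - 1.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", hidden_features(x1, x2))
# (0,0) and (1,1) map to (-1,-1) and (1,1); (0,1) and (1,0) both map to (1,-1),
# so a single line in (H1, H2) space separates the two classes.
```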