CISC 4631 Data Mining

CISC 4631Data Mining Lecture 11: Neural Networks

Biological Motivation • Can we simulate the human learning process?  Two schools • modeling biological learning process • obtain highly effective algorithms, independent of whether these algorithms mirror biological processes (this course) • Biological learning system (brain) • complex network of neurons • ANN are loosely motivated by biological neural systems. However, many features of ANNs are inconsistent with biological systems

Neural Speed Constraints • Neurons have a “switching time” on the order of a few milliseconds, compared to nanoseconds for current computing hardware. • However, neural systems can perform complex cognitive tasks (vision, speech understanding) in tenths of a second. • Only time for performing 100 serial steps in this time frame, compared to orders of magnitude more for current computers. • Must be exploiting “massive parallelism.” • Human brain has about 1011 neurons with an average of 104 connections each.

Artificial Neural Networks (ANN) • ANN • network of simple units • real-valued inputs & outputs • Many neuron-like threshold switching units • Many weighted interconnections among units • Highly parallel, distributed process • Emphasis on tuning weights automatically

Neural Network Learning • Learning approach based on modeling adaptation in biological neural systems. • Perceptron: Initial algorithm for learning simple neural networks (single layer) developed in the 1950’s. • Backpropagation: More complex algorithm for learning multi-layer neural networks developed in the 1980’s.

Real Neurons

How Does our Brain Work? • A neuron is connected to other neurons via its input and output links • Each incoming neuron has an activation value and each connection has a weight associated with it • The neuron sums the incoming weighted values and this value is input to an activation function • The output of the activation function is the output from the neuron

Neural Communication • Electrical potential across cell membrane exhibits spikes called action potentials. • Spike originates in cell body, travels down axon, and causes synaptic terminals to release neurotransmitters. • Chemical diffuses across synapse to dendrites of other neurons. • Neurotransmitters can be excititory or inhibitory. • If net input of neurotransmitters to a neuron from other neurons is excititory and exceeds some threshold, it fires an action potential.

Real Neural Learning • To model the brain we need to model a neuron • Each neuron performs a simple computation • It receives signals from its input links and it uses these values to compute the activation level (or output) for the neuron. • This value is passed to other neurons via its output links.

Prototypical ANN • Units interconnected in layers • directed, acyclic graph (DAG) • Network structure is fixed • learning = weight adjustment • backpropagation algorithm

Appropriate Problems • Instances: vectors of attributes • discrete or real values • Target function • discrete, real, vector • ANNs can handle classification & regression • Noisy data • Long training times acceptable • Fast evaluation • No need to be readable • It is almost impossible to interpret neural networks except for the simplest target functions

Perceptrons • The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier • Introduced in the late 50s • Perceptron convergence theorem (Rosenblatt 1962): • Perceptron will learn to classify any linearly separable set of inputs. • Perceptron is a network: • single-layer • feed-forward: data only travels in one direction XOR function (no linear separation)

ALVINN drives 70 mph on highways See Alvinn video Alvinn Video

1 w12 w16 w15 w14 w13 2 3 4 5 6 Artificial Neuron Model • Model network as a graphwith cells as nodes and synaptic connections as weighted edges from node i to node j, wji • Model net input to cell as • Cell output is: oj 1 (Tjis threshold for unit j) 0 Tj netj

Perceptron: Artificial Neuron Model Model network as a graphwith cells as nodes and synaptic connections as weighted edges from node i to node j, wji The input value received of a neuron is calculated by summing the weighted input values from its input links threshold threshold function Vector notation:

Different Threshold Functions We should learn the weight w1,…, wn

Examples(step activation function) w0 – t

Neural Computation • McCollough and Pitts (1943) showed how such model neurons could compute logical functions and be used to construct finite-state machines • Can be used to simulate logic gates: • AND: Let all wjibe Tj/n, where n is the number of inputs. • OR: Let all wjibe Tj • NOT: Let threshold be 0, single input with a negative weight. • Can build arbitrary logic circuits, sequential machines, and computers with such gates

Perceptron Training • Assume supervised training examples giving the desired output for a unit given a set of known input activations. • Goal: learn the weight vector (synaptic weights) that causes the perceptron to produce the correct +/- 1 values • Perceptron uses iterative update algorithm to learn a correct set of weights • Perceptron training rule • Delta rule • Both algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions

Perceptron Training Rule • Update weights by: where η is the learning rate • a small value (e.g., 0.1) • sometimes is made to decay as the number of weight-tuning operations increases t – target output for the current training example o – linear unit output for the current training example

Perceptron Training Rule • Equivalent to rules: • If output is correct do nothing. • If output is high, lower weights on active inputs • If output is low, increase weights on active inputs • Can prove it will converge • if training data is linearly separable • and ηissufficiently small

Perceptron Learning Algorithm • Iteratively update weights until convergence. • Each execution of the outer loop is typically called an epoch. Initialize weights to random values Until outputs of all training examples are correct For each training pair, E, do: Compute current output oj for E given its inputs Compare current output to target value, tj , for E Update synaptic weights and threshold using learning rule

Delta Rule • Works reasonably with data that is not linearly separable • Minimizes error • Gradient descent method • basis of Backpropagation method • basis for methods working in multidimensional continuous spaces • Discussion of this rule is beyond the scope of this course

Perceptron as a Linear Separator • Since perceptron uses linear threshold function it searches for a linear separator that discriminates the classes o3 ?? Or hyperplane in n-dimensional space o2

Concept Perceptron Cannot Learn • Cannot learn exclusive-or (XOR), or parity function in general o3 + 1 – ?? – + 0 o2 1

General Structure of an ANN Perceptrons have no hidden layers Multilayer perceptrons may have many

Learning Power of an ANN • Perceptron is guaranteed to converge if data is linearly separable • It will learn a hyperplane that separates the classes • The XOR function on page 250 TSK is not linearly separable • A mulitlayer ANN has no such guarantee of convergence but can learn functions that are not linearly separable • An ANN with a hidder layer can learn the XOR function by constructing two hyperplanes (see page 253)

Multilayer Network Example The decision surface is highly nonlinear

Sigmoid Threshold Unit • Sigmoid is a unit whose output is a nonlinear function of its inputs, but whose output is also a differentiable function of its inputs • We can derive gradient descent rules to train • Sigmoid unit • Multilayer networks of sigmoid units  backpropagation

output hidden input Multi-Layer Networks • Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult • A typical multi-layer network consists of an input, hidden and output layer, each fully connected to the next, with activation feeding forward. • The weights determine the function computed. Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer activation

Logistic function Inputs Output Age 34 .5 0.6 .4 S Gender 1 “Probability of beingAlive” .8 4 Stage Independent variables Coefficients Prediction

Neural Network Model S S S Inputs Output .6 Age 34 .4 .2 0.6 .5 .1 Gender 2 .2 .3 .8 “Probability of beingAlive” .7 4 .2 Stage Dependent variable Prediction Independent variables Weights HiddenLayer Weights

Getting an answer from a NN S Inputs Output .6 Age 34 .5 0.6 .1 Gender 2 .8 “Probability of beingAlive” .7 4 Stage Dependent variable Prediction Independent variables Weights HiddenLayer Weights

Getting an answer from a NN S Inputs Output Age 34 .5 .2 0.6 Gender 2 .3 “Probability of beingAlive” .8 4 .2 Stage Dependent variable Prediction Independent variables Weights HiddenLayer Weights

Getting an answer from a NN S Inputs Output .6 Age 34 .5 .2 0.6 .1 Gender 1 .3 “Probability of beingAlive” .7 .8 4 .2 Stage Dependent variable Prediction Independent variables Weights HiddenLayer Weights

Comments on Training Algorithm • Not guaranteed to converge to zero training error, may converge to local optima or oscillate indefinitely. • However, in practice, does converge to low error for many large networks on real data. • Many epochs (thousands) may be required, hours or days of training for large networks. • To avoid local-minima problems, run several trials starting with different random weights (random restarts). • Take results of trial with lowest training set error. • Build a committee of results from multiple trials (possibly weighting votes by training set accuracy).

Hidden Unit Representations • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space: key features of ANNs • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors, etc.. • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

Expressive Power of ANNs • Boolean functions: Any boolean function can be represented by a two-layer network with sufficient hidden units. • Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network. • Arbitrary function: Any function can be approximated to arbitrary accuracy by a three-layer network.

Sample Learned XOR Network 3.11 O 6.96 7.38 5.24 B 2.03 A 3.58 3.6 5.57 5.74 X Y Hidden Unit A represents: (X  Y) Hidden Unit B represents: (X  Y) Output O represents: A  B = (X  Y)  (X  Y) = X  Y

Expressive Power of ANNs • Universal Function Approximator: • Given enough hidden units, can approximate any continuous function f • Why not use millions of hidden units? • Efficiency (neural network training is slow) • Overfitting

Combating Overfitting in Neural Nets • Many techniques • Two popular ones: • Early Stopping • Use “a lot” of hidden units • Just don’t over-train • Cross-validation • Choose the “right” number of hidden units

Determining the Best Number of Hidden Units • Too few hidden units prevents the network from adequately fitting the data • Too many hidden units can result in over-fitting • Use internal cross-validation to empirically determine an optimal number of hidden units error on test data on training data 0 # hidden units

Over-Training Prevention • Running too many epochs can result in over-fitting. • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error. • To avoid losing training data for validation: • Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy. • Train final network on complete training set for this many epochs. error on test data on training data 0 # training epochs

Summary of Neural Networks When are Neural Networks useful? Instances represented by attribute-value pairs Particularly when attributes are real valued The target function is Discrete-valued Real-valued Vector-valued Training examples may contain errors Fast evaluation times are necessary When not? Fast training times are necessary Understandability of the function is required

Successful Applications • Text to Speech (NetTalk) • Fraud detection • Financial Applications • HNC (eventually bought by Fair Isaac) • Chemical Plant Control • Pavillion Technologies • Automated Vehicles • Game Playing • Neurogammon • Handwriting recognition

Issues in Neural Nets • Learning the proper network architecture: • Grow network until able to fit data • Cascade Correlation • Upstart • Shrink large network until unable to fit data • Optimal Brain Damage • Recurrent networks that use feedback and can learn finite state machines with “backpropagation through time.”

CISC 4631 Data Mining