Artificial Neural Networks

Overview • Computational units and architectures • Learning in perceptrons • Learning in Multilayer feed-forward nets

Neural Nets • Composed of basic units and weighted links between them • The basic units (or nodes) are an idealization of neurons • Responsible for basic computations • The pattern of connections of the units determines the network architecture

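Computation at Units • Compute a 0-1 or a graded function of the weighted sum of the inputs: output $= g\left(\sum_j w_j x_j\right)$, where the $x_j$ are the inputs and the $w_j$ are the link weights • $g$ is the activation function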

Common Activation Functions • Step function: g(x) = 1 if x >= t (t is a threshold); g(x) = 0 if x < t • Sign function: g(x) = 1 if x >= t (t is a threshold); g(x) = -1 if x < t • Sigmoid function: g(x) = 1/(1+exp(-x))
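As a minimal sketch, the three activation functions and the computation at a single unit might be written in Python as follows. The function names and the default threshold of 0 are illustrative assumptions, not part of the slides.

```python
import math

def step(x, t=0.0):
    """Step function: 1 if x >= t, else 0 (t is the threshold)."""
    return 1 if x >= t else 0

def sign(x, t=0.0):
    """Sign function: 1 if x >= t, else -1."""
    return 1 if x >= t else -1

def sigmoid(x):
    """Sigmoid function: a graded output strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g):
    """Apply the activation function g to the weighted sum of the inputs."""
    return g(sum(w * x for w, x in zip(weights, inputs)))
```

For example, `unit_output([1.0, 1.0], [0.2, 0.2], sigmoid)` applies the sigmoid to the weighted sum 0.4.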

Can Implement Boolean Functions • A unit can implement And, Or, and Not • Need to map True and False to numbers, e.g. True = 1.0, False = 0.0 • (Exercise) Use a step function and show how to implement various simple Boolean functions • By combining units, we can get any Boolean function of n variables • Logical circuits are obtained as a special case
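One possible set of step-function units is sketched below, with True = 1 and False = 0. The particular weights and thresholds are just one choice among many that work, and the helper `unit` is a hypothetical name.

```python
def step(x, t):
    """Step function with threshold t."""
    return 1 if x >= t else 0

def unit(weights, t):
    """Build a unit that thresholds the weighted sum of its inputs at t."""
    def f(*inputs):
        return step(sum(w * x for w, x in zip(weights, inputs)), t)
    return f

AND = unit([1.0, 1.0], t=1.5)   # fires only when both inputs are 1
OR = unit([1.0, 1.0], t=0.5)    # fires when at least one input is 1
NOT = unit([-1.0], t=-0.5)      # fires only when the input is 0

def NAND(a, b):
    """Units can be combined: feed the output of AND into NOT."""
    return NOT(AND(a, b))
```

Since NAND alone is enough to build any logical circuit, this composition hints at why combined units can express any Boolean function.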

Network Structures • Recurrent (cycles exist), more powerful as they can implement state, but harder to analyze. Examples: • Hopfield network, symmetric connections, interesting properties, useful for implementing associative memory • Boltzmann machines: more general, with applications in constraint satisfaction and combinatorial optimization

Network Structures • Feedforward (no cycles): less powerful, but easier to understand • Input units • Hidden layers • Output units • Perceptron: no hidden layer, so each output basically corresponds to a single unit computing a linear threshold function (ltf) • Ltf: defined by weights $w_1, \dots, w_n$ and a threshold $t$; its value is 1 iff $\sum_{i=1}^{n} w_i x_i \ge t$, and 0 otherwise

Perceptron Capabilities • Quite expressive: many, but not all, Boolean functions can be expressed. Examples: • conjunctions and disjunctions • more generally, can represent functions that are true if and only if at least k of the inputs are true: set every weight to 1 and the threshold to k (k = n gives the conjunction of all n inputs, k = 1 the disjunction) • Can’t represent XOR
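The "at least k of n" family can be sketched as a single threshold unit with all weights equal to 1 and threshold k. The name `at_least_k` is a hypothetical helper for illustration.

```python
def at_least_k(k, n):
    """Return a threshold unit over n inputs that outputs 1 iff at least k inputs are 1."""
    weights = [1.0] * n
    def f(inputs):
        return 1 if sum(w * x for w, x in zip(weights, inputs)) >= k else 0
    return f

majority_of_3 = at_least_k(2, 3)   # true when 2 or 3 inputs are true
and_of_3 = at_least_k(3, 3)        # conjunction: all inputs must be true
or_of_3 = at_least_k(1, 3)         # disjunction: one true input is enough
```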

Representable Functions • Perceptrons have a monotonicity property: if a link has positive weight, activation can only increase as the corresponding input value increases (irrespective of other input values) • Can’t represent functions where input interactions can cancel one another’s effect (e.g. XOR)

Representable Functions • Can represent only linearly separable functions • Geometrically: only if there is a line (plane) separating the positives from the negatives • The good news: such functions are PAC learnable and learning algorithms exist

Linearly Separable • [Figure: positive (+) and negative (-) examples that a single line separates]

Not Linearly Separable • [Figure: arrangements of positive (+) and negative (-) examples that no single line separates]
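A brute-force search can illustrate (though not prove) the difference between the two cases. The sketch below tries every two-input linear threshold function whose weights and threshold lie on a small grid; the grid range and the helper names are arbitrary choices for illustration.

```python
from itertools import product

def ltf(w1, w2, t, x1, x2):
    """Linear threshold function: 1 iff the weighted sum reaches the threshold t."""
    return 1 if w1 * x1 + w2 * x2 >= t else 0

def representable(target, grid):
    """True if some (w1, w2, t) taken from the grid matches target on all four inputs."""
    for w1, w2, t in product(grid, repeat=3):
        if all(ltf(w1, w2, t, x1, x2) == target(x1, x2)
               for x1, x2 in product((0, 1), repeat=2)):
            return True
    return False

grid = [i / 2 for i in range(-8, 9)]          # -4.0, -3.5, ..., 4.0

and_found = representable(lambda a, b: a & b, grid)   # AND is linearly separable
xor_found = representable(lambda a, b: a ^ b, grid)   # XOR is not
```

The search finds a setting for AND but none for XOR, in line with the geometric picture: no line puts (0,1) and (1,0) on one side and (0,0) and (1,1) on the other.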

The Perceptron Learning Algorithm • Example of current-best-hypothesis (CBH) search (so incremental, etc.): • Begin with a hypothesis (a perceptron) • Repeat over all examples several times • Adjust weights as examples are seen • Until all examples correctly classified or a stopping criterion reached

Method for Adjusting Weights • One weight update possibility: • If the classification is correct, don’t change the weights • Otherwise: • If false negative, add the input: $\mathbf{w} \leftarrow \mathbf{w} + \mathbf{x}$ • If false positive, subtract the input: $\mathbf{w} \leftarrow \mathbf{w} - \mathbf{x}$ • Intuition: For instance, if the example is positive, strengthen/increase the weights corresponding to the positive attributes of the example

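Properties of the Algorithm • In general, also apply a learning rate $\eta$ (see book): $\mathbf{w} \leftarrow \mathbf{w} + \eta\,\mathbf{x}$ for a false negative, $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\mathbf{x}$ for a false positive • The adjustment is in the direction of minimizing error on the example • If the learning rate is appropriate and the examples are linearly separable, the algorithm converges to a linear separator after a finite number of iterations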
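The learning loop and the weight adjustment can be sketched together as below. This is one possible rendering, not necessarily the book's: it assumes 0/1 labels, starts from all-zero weights, and folds the threshold into a bias term (equivalent to a weight on a constant input). The names `predict` and `train_perceptron` are hypothetical.

```python
def predict(weights, bias, x):
    """Output 1 iff the weighted sum plus bias is non-negative."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0

def train_perceptron(examples, n_inputs, rate=0.1, max_epochs=100):
    """examples: list of (input tuple, 0/1 label). Returns (weights, bias)."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            # err is +1 for a false negative, -1 for a false positive, 0 if correct
            err = y - predict(weights, bias, x)
            if err != 0:
                mistakes += 1
                # add (or subtract) the input, scaled by the learning rate
                weights = [w + rate * err * xi for w, xi in zip(weights, x)]
                bias += rate * err
        if mistakes == 0:      # every example classified correctly: stop
            break
    return weights, bias

or_examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_perceptron(or_examples, n_inputs=2, rate=1.0)
```

On linearly separable data such as OR the loop stops once a full pass makes no mistakes; on data such as XOR it would simply run until `max_epochs`.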

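Another Algorithm (least-sum-squares algorithm) • Define and minimize an error function • $S$ is the set of examples, $f$ is the ideal function, $h_{\mathbf{w}}$ is the linear function corresponding to the current perceptron • Error of the perceptron (over all examples): $E(\mathbf{w}) = \frac{1}{2} \sum_{\mathbf{x} \in S} \bigl(f(\mathbf{x}) - h_{\mathbf{w}}(\mathbf{x})\bigr)^2$ (the factor $\frac{1}{2}$ is a convention that simplifies the derivative) • Note: $h_{\mathbf{w}}(\mathbf{x}) = \sum_i w_i x_i$ is the weighted sum itself, before any thresholding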

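Derivative of Error • Gradient (derivative) of E: $\nabla E = \left(\frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n}\right)$ • Take the steepest descent direction: $w_i \leftarrow w_i - \eta\, \frac{\partial E}{\partial w_i}$ • $\frac{\partial E}{\partial w_i}$ is the gradient along $w_i$, $\eta$ is the learning rate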

Gradient Descent • The algorithm: pick an initial random hypothesis (perceptron), then repeatedly compute the error and modify the perceptron (take a step along the reverse of the gradient) • [Figure: error surface E, with the gradient direction $\nabla E$ and the descent direction $-\nabla E$]

Gradient Calculation

Derivation (cont.)
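A sketch of the gradient calculation for the least-sum-squares error, under assumed notation: $S$ is the set of examples, $f$ the ideal function, $h_{\mathbf{w}}(\mathbf{x}) = \sum_i w_i x_i$ the current linear function, and $\eta$ the learning rate. The factor $\frac{1}{2}$ in the error is a common convention and is an assumption here.

```latex
\begin{align*}
E(\mathbf{w}) &= \tfrac{1}{2} \sum_{\mathbf{x} \in S}
    \bigl(f(\mathbf{x}) - h_{\mathbf{w}}(\mathbf{x})\bigr)^2,
    \qquad h_{\mathbf{w}}(\mathbf{x}) = \sum_i w_i x_i \\
\frac{\partial E}{\partial w_i}
  &= \sum_{\mathbf{x} \in S} \bigl(f(\mathbf{x}) - h_{\mathbf{w}}(\mathbf{x})\bigr)\,
     \frac{\partial}{\partial w_i}\bigl(f(\mathbf{x}) - h_{\mathbf{w}}(\mathbf{x})\bigr) \\
  &= -\sum_{\mathbf{x} \in S} \bigl(f(\mathbf{x}) - h_{\mathbf{w}}(\mathbf{x})\bigr)\, x_i \\
w_i &\leftarrow w_i - \eta\, \frac{\partial E}{\partial w_i}
   = w_i + \eta \sum_{\mathbf{x} \in S}
     \bigl(f(\mathbf{x}) - h_{\mathbf{w}}(\mathbf{x})\bigr)\, x_i
\end{align*}
```

Each weight therefore moves in proportion to the error on each example times the corresponding input, summed over all examples.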

Properties of the algorithm • The error function is quadratic in the weights, so it has no local minima other than the global minimum • The algorithm is a gradient descent method toward the global minimum and, with a suitably small learning rate, converges asymptotically • Even if the examples are not linearly separable, it can find a good (minimum error) linear classifier • Incremental?
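A minimal batch gradient descent sketch for the sum-of-squares error follows. Several details are assumptions made for illustration: the error carries a factor of 1/2, the weights start at zero rather than at random values (to keep the run reproducible), and a constant first input of 1 plays the role of the threshold. The data set is XOR, which is not linearly separable.

```python
def linear(weights, x):
    """The linear function of the current perceptron (no thresholding)."""
    return sum(w * xi for w, xi in zip(weights, x))

def squared_error(weights, examples):
    """E(w) = 1/2 * sum over examples of (target - linear output)^2."""
    return 0.5 * sum((y - linear(weights, x)) ** 2 for x, y in examples)

def gradient(weights, examples):
    """Partial derivative of E with respect to each weight."""
    return [-sum((y - linear(weights, x)) * x[i] for x, y in examples)
            for i in range(len(weights))]

def gradient_descent(examples, n_weights, rate=0.1, steps=500):
    """Repeatedly step against the gradient, starting from zero weights."""
    weights = [0.0] * n_weights
    for _ in range(steps):
        grad = gradient(weights, examples)
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights

# XOR with a constant first input of 1 acting as the bias
xor_examples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
w = gradient_descent(xor_examples, n_weights=3)
final_error = squared_error(w, xor_examples)
```

Even though no linear separator exists for XOR, the procedure still settles on the least-squares linear fit, which here is the constant function 0.5.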

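A Third Method • Formulate the problem as a linear feasibility or linear optimization problem • Example: find weights $w_1, \dots, w_n$ and a threshold $t$ such that $\sum_i w_i x_i \ge t$ for every positive example and $\sum_i w_i x_i < t$ for every negative example • Can be solved in polynomial time (output none if no solution exists, or otherwise output a solution)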

Multilayer Feed-Forward Networks • Multiple perceptrons, layered • Example: a two-layer network with 3 inputs, one output, and one hidden layer (two hidden units) • [Figure: network diagram with the input layer, hidden layer, and output layer labeled]

Power/Expressiveness • Can represent interactions among inputs (unlike perceptrons) • Two-layer networks can represent any Boolean function, and continuous functions (within a tolerance), as long as the number of hidden units is sufficient and appropriate activation functions are used • Learning algorithms exist, but with weaker guarantees than perceptron learning algorithms
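As a small illustration of input interactions, XOR can be computed by a two-layer network of step units with hand-picked weights. The weights and thresholds below are one choice among many, and `xor_net` is a hypothetical name.

```python
def step(x, t):
    """Step function with threshold t."""
    return 1 if x >= t else 0

def xor_net(x1, x2):
    """Two hidden step units (OR and AND) feeding one output step unit."""
    h_or = step(1.0 * x1 + 1.0 * x2, 0.5)    # hidden unit 1: x1 OR x2
    h_and = step(1.0 * x1 + 1.0 * x2, 1.5)   # hidden unit 2: x1 AND x2
    # output fires when OR is on and AND is off
    return step(1.0 * h_or - 1.0 * h_and, 0.5)
```

The negative weight from the AND unit cancels the OR unit's contribution when both inputs are on, which is exactly the interaction a single perceptron cannot express.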

Back-Propagation • Similar to the perceptron learning algorithm and gradient descent for perceptrons • Problem to overcome: how to adjust internal links (how to distribute the “blame” or the error) • Assumption: internal units use differentiable, nonlinear activation functions • Sigmoid functions are convenient

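Back-Propagation (cont.) • Start with a hypothesis (network with random weights) • Repeat until a stopping criterion is met • For each example, compute the network output and, for each unit $i$, its error term $\delta_i$ • Update each weight $w_{ij}$ (weight of the link going from node $i$ to node $j$): $w_{ij} \leftarrow w_{ij} + \eta\, \delta_j\, o_i$, where $o_i$ is the output of unit $i$ and $\eta$ is the learning rate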

The Error Term
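A common form of the error term, given here as a sketch under stated assumptions: every unit uses the sigmoid activation, the error is the sum of squared errors, $o_j$ is the output of unit $j$, $t_j$ is the target for output unit $j$, and $w_{jk}$ is the weight of the link from unit $j$ to unit $k$.

```latex
\begin{align*}
\delta_j &= o_j\,(1 - o_j)\,(t_j - o_j)
  && \text{if } j \text{ is an output unit} \\
\delta_j &= o_j\,(1 - o_j) \sum_{k} w_{jk}\, \delta_k
  && \text{if } j \text{ is a hidden unit, with } k \text{ ranging over the units fed by } j
\end{align*}
```

The factor $o_j(1 - o_j)$ is the derivative of the sigmoid at unit $j$; the sum for hidden units is how the "blame" at later units is passed backward along the links.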

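Derivation • Write the error for a single training example; as before, use the sum of squared errors (as it’s convenient for differentiation, etc.): $E = \frac{1}{2} \sum_i (t_i - o_i)^2$, where $t_i$ is the target and $o_i$ the actual value of output $i$ • Differentiate (with respect to each weight…) • For example, for the weight $w_{ji}$ connecting node $j$ to output $i$ we get $\frac{\partial E}{\partial w_{ji}} = -(t_i - o_i)\, g'(in_i)\, o_j$, where $in_i$ is the weighted input sum of output $i$ and $g$ is its activation function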
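The whole procedure can be sketched for a small network with one hidden layer and a single output. This is an illustrative sketch rather than a reference implementation: it assumes sigmoid units throughout, a squared-error loss with a factor of 1/2, a constant bias input of 1.0 at each layer, and per-example (incremental) updates. All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Return (hidden outputs, network output); a constant 1.0 is appended as bias input."""
    xb = list(x) + [1.0]
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hb)))
    return hidden, out

def backprop_step(w_hidden, w_out, x, target, rate):
    """Update the weights in place using one training example."""
    hidden, out = forward(w_hidden, w_out, x)
    xb = list(x) + [1.0]
    hb = hidden + [1.0]
    # error term of the output unit
    delta_out = out * (1.0 - out) * (target - out)
    # error terms of the hidden units, computed before the output weights change
    delta_hidden = [h * (1.0 - h) * w_out[j] * delta_out for j, h in enumerate(hidden)]
    for j in range(len(w_out)):
        w_out[j] += rate * delta_out * hb[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += rate * delta_hidden[j] * xb[i]

def total_error(w_hidden, w_out, examples):
    """Sum of squared errors (with factor 1/2) over all examples."""
    return 0.5 * sum((t - forward(w_hidden, w_out, x)[1]) ** 2 for x, t in examples)

def train(examples, n_in, n_hidden, rate=0.5, epochs=2000, seed=0):
    """Start from small random weights and sweep over the examples repeatedly."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    for _ in range(epochs):
        for x, t in examples:
            backprop_step(w_hidden, w_out, x, t, rate)
    return w_hidden, w_out
```

A call such as `train([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)], n_in=2, n_hidden=2)` runs the loop on XOR; whether it reaches a good solution depends on the random start, since gradient descent may stop in a local minimum.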

Properties • Converges to a minimum, but it could be a local minimum • Could be slow to converge (Note: Training a three node net is NP-Complete!) • Must watch for over-fitting just as in decision trees (use validation sets, etc.) • Network structure? Often two layers suffice; start with relatively few hidden units

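Properties (cont.) • Many variations of basic back-propagation exist, e.g. use momentum: $\Delta w_{ij}(n) = \eta\, \delta_j\, o_i + \alpha\, \Delta w_{ij}(n-1)$, where $\Delta w_{ij}(n)$ is the nth update amount and $\alpha$ is a constant • Reduce the learning rate $\eta$ with time (applies to perceptrons as well)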
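Both variations can be sketched in a few lines. The momentum update adds a fraction of the previous update to the current one; the decay schedule shown is only one assumed possibility, since no particular schedule is specified. The names `momentum_update`, `decayed_rate`, `alpha`, and `tau` are hypothetical.

```python
def momentum_update(weight, gradient_step, previous_update, alpha=0.9):
    """Return (new weight, update), where update = gradient_step + alpha * previous_update.

    gradient_step is the plain back-propagation step (learning rate times
    error term times input); alpha is the momentum constant.
    """
    update = gradient_step + alpha * previous_update
    return weight + update, update

def decayed_rate(initial_rate, n, tau=100.0):
    """Assumed schedule: the learning rate shrinks as the update count n grows."""
    return initial_rate / (1.0 + n / tau)
```

With momentum, consecutive steps in the same direction accumulate, which can speed up progress along flat stretches of the error surface.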

NN properties • Can handle domains with • continuous and discrete attributes • Many attributes • noisy data • Could be slow at training but fast at evaluation time • Human understanding of what the network does could be limited
