Outline • Biological Motivation • Perceptron • Gradient Descent • Least Mean Square Error • Multi-layer networks • Sigmoid node • Backpropagation
Biological Neural Systems
• Neuron switching time: ~10^-3 secs
• Number of neurons in the human brain: ~10^10
• Connections (synapses) per neuron: ~10^4–10^5
• Face recognition: ~0.1 secs
• High degree of parallel computation
• Distributed representations
Artificial Neural Networks • Many simple neuron-like threshold units • Many weighted interconnections • Multiple outputs • Highly parallel and distributed processing • Learning by tuning the connection weights
Perceptron: Linear Threshold Unit
• Inputs x1, …, xn with weights w1, …, wn; bias input x0 = 1 with weight w0
• Net input: z = Σ_{i=0}^{n} wi xi
• Output: o(x) = 1 if Σ_{i=0}^{n} wi xi > 0, −1 otherwise
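The threshold unit above can be sketched in a few lines of Python (a minimal illustration; the AND weights in the comment are a hand-picked example, not from the slides):

```python
def perceptron_output(weights, x):
    """Linear threshold unit: weights[0] is the bias weight (for x0 = 1).

    Returns 1 if the weighted sum exceeds 0, otherwise -1.
    """
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if z > 0 else -1

# Hand-picked weights [w0, w1, w2] = [-1.5, 1, 1] implement logical AND
# on {0,1} inputs: the sum exceeds the 1.5 threshold only when both are 1.
```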
Decision Surface of a Perceptron
• A perceptron can separate only linearly separable sets of examples (left figure: + and − points split by a line in the (x1, x2) plane)
• XOR (right figure) is not linearly separable, so a single perceptron cannot represent it
• Theorem: the VC-dimension of a perceptron over n inputs is n + 1
Perceptron Learning Rule
• S: sample; xi: input vector; t = c(x): target value; o: perceptron output
• η: learning rate (a small positive constant); assume η = 1
• wi ← wi + Δwi, where Δwi = η (t − o) xi
Perceptron Algo.
• Correct output (t = o)
• Weights are unchanged
• Incorrect output (t ≠ o)
• Change weights!
• False negative (t = 1 and o = −1)
• Add x to w
• False positive (t = −1 and o = 1)
• Subtract x from w
Perceptron Learning Rule: Example (η = 1)
• Initial weights: w = [0.25, −0.1, 0.5]; decision boundary: x2 = 0.2 x1 − 0.5
• (x,t) = ([−1,−1], 1): o = sgn(0.25 + 0.1 − 0.5) = −1 ≠ t ⇒ add x to w
• (x,t) = ([2,1], −1): o = sgn(0.45 − 0.6 + 0.3) = 1 ≠ t ⇒ subtract x from w
• (x,t) = ([1,1], 1): o = sgn(0.25 − 0.7 + 0.1) = −1 ≠ t ⇒ add x to w
• Intermediate weight vectors: [0.2, −0.2, −0.2], [−0.2, −0.4, −0.2], [0.2, 0.2, 0.2]
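The learning rule can be sketched as a training loop (plain Python; the AND training set is an assumed linearly separable example, not from the slides):

```python
def perceptron_output(w, x):
    """Threshold unit; w[0] is the bias weight for x0 = 1."""
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if z > 0 else -1

def train_perceptron(samples, epochs=10, eta=1.0):
    """Perceptron learning rule: on a mistake, w_i <- w_i + eta*(t - o)*x_i."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in samples:
            o = perceptron_output(w, x)
            if o != t:  # weights change only when the output is wrong
                w[0] += eta * (t - o)
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
    return w

# logical AND with targets in {-1, +1}: linearly separable, so the
# perceptron convergence theorem guarantees a separating weight vector
and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_samples)
```

After training, every example in `and_samples` is classified correctly.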
Perceptron Algorithm: Analysis
• Theorem: the number of mistakes of the Perceptron Algorithm is bounded
• Proof setup:
• Make all examples positive: change <xi, bi> to <bi xi, +1>
• Margin of a hyperplane w*: γ = min_i (w* · xi) / ||w*||
Perceptron Algorithm: Analysis II
• Let mi be the number of mistakes on xi
• M = Σi mi (total number of mistakes)
• From the algorithm (starting at w = 0): w = Σi mi xi
• Let w* be a separating hyperplane
Perceptron Algorithm: Analysis III
• Change in weights on a mistake on xi: ||w + xi||² = ||w||² + 2 w·xi + ||xi||² ≤ ||w||² + ||xi||²
• since w errs on xi, we have w·xi ≤ 0
• Total weight after M mistakes: ||w||² ≤ M R², where R = max_i ||xi||
Perceptron Algorithm: Analysis IV
• Consider the angle between w and w*: cos(w, w*) = (w · w*) / (||w|| ||w*||) ≤ 1
• Each mistake on xi adds xi, so w · w* = Σi mi (xi · w*) ≥ M γ ||w*||
• Putting it all together: M γ ||w*|| ≤ w · w* ≤ ||w|| ||w*|| ≤ √M R ||w*||, hence M ≤ (R/γ)²
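The mistake bound M ≤ (R/γ)² can be checked empirically. The sketch below (assumed setup: a hypothetical separator `w_star` and random points kept only if they have positive margin under it) runs the origin-centred perceptron on all-positive examples and verifies the bound; the bound holds for the margin of *any* separating hyperplane, so `w_star` need not be the best one:

```python
import random

def margin_and_radius(points, w_star):
    """gamma = min_i (w* . x_i)/||w*||, R = max_i ||x_i||."""
    norm = sum(c * c for c in w_star) ** 0.5
    gamma = min(sum(w * x for w, x in zip(w_star, p)) / norm for p in points)
    R = max(sum(x * x for x in p) ** 0.5 for p in points)
    return gamma, R

def perceptron_mistakes(points):
    """Perceptron on all-positive examples: add x on every mistake (w.x <= 0)."""
    w = [0.0] * len(points[0])
    mistakes = 0
    changed = True
    while changed:  # terminates because the data is separable
        changed = False
        for p in points:
            if sum(wi * xi for wi, xi in zip(w, p)) <= 0:  # mistake
                w = [wi + xi for wi, xi in zip(w, p)]
                mistakes += 1
                changed = True
    return mistakes

random.seed(0)
w_star = (1.0, 2.0)  # hypothetical separating hyperplane
pts = []
while len(pts) < 50:  # keep only points with a comfortable positive margin
    p = (random.uniform(-1, 1), random.uniform(-1, 1))
    if p[0] * w_star[0] + p[1] * w_star[1] > 0.3:
        pts.append(p)

gamma, R = margin_and_radius(pts, w_star)
M = perceptron_mistakes(pts)
assert M <= (R / gamma) ** 2  # the Analysis IV bound
```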
Gradient Descent Learning Rule
• Consider a linear unit without threshold and with continuous output o (not just −1, 1)
• o = w0 + w1 x1 + … + wn xn
• Train the wi's so that they minimize the squared error
• E[w0,…,wn] = ½ Σ_{d∈S} (td − od)², where S is the set of training examples
Gradient Descent
• Training set: S = {<(1,1),1>, <(−1,−1),1>, <(1,−1),−1>, <(−1,1),−1>}
• Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]
• Update: Δw = −η ∇E[w], i.e. Δwi = −η ∂E/∂wi
• ∂E/∂wi = ∂/∂wi ½ Σd (td − od)² = ∂/∂wi ½ Σd (td − Σi wi xi)² = Σd (td − od)(−xi)
Gradient Descent
Gradient-Descent(S: training_examples, η)
• Initialize each wi (e.g. to zero)
• Until TERMINATION Do
• Initialize each Δwi to zero
• For each <x,t> in S Do
• Compute o = <x,w>
• For each weight wi Do
• Δwi ← Δwi + η (t − o) xi
• For each weight wi Do
• wi ← wi + Δwi
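The pseudocode above can be sketched directly in Python (the target rule t = 1 + 2·x1 − x2 used to build the data is a hypothetical choice for illustration, not from the slides):

```python
def gradient_descent(samples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w0 + w1*x1 + ... + wn*xn,
    minimizing E[w] = 1/2 * sum_d (t_d - o_d)^2 over the whole training set."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        delta = [0.0] * len(w)  # accumulate updates over all examples
        for x, t in samples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            delta[0] += eta * (t - o)
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]  # apply once per pass
    return w

# targets generated by the hypothetical linear rule t = 1 + 2*x1 - x2,
# so gradient descent should drive w toward [1, 2, -1]
data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2)]
w = gradient_descent(data)
```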
Incremental (Stochastic) Gradient Descent
• Batch mode: gradient descent over the entire data S
• Δw = −η ∇E_S[w], where E_S[w] = ½ Σ_{d∈S} (td − od)²
• Incremental mode: gradient descent over individual training examples d
• Δw = −η ∇E_d[w], where E_d[w] = ½ (td − od)²
• Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough
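For contrast with the batch pseudocode, an incremental-mode sketch of the same linear unit: the only difference is that weights change after *each* example, using the single-example error E_d[w] = ½(t_d − o_d)². The data again comes from the hypothetical rule t = 1 + 2·x1 − x2:

```python
def incremental_gradient_descent(samples, eta=0.05, epochs=500):
    """Incremental (stochastic) mode: w is updated after every example,
    following the gradient of E_d[w] = 1/2 * (t_d - o_d)^2 alone."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for x, t in samples:
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w[0] += eta * (t - o)  # update immediately, not once per pass
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w

data = [((0, 0), 1), ((1, 0), 3), ((0, 1), 0), ((1, 1), 2)]
w = incremental_gradient_descent(data)
```

Because the target here is exactly linear (zero residual at the optimum), the incremental updates also settle on w ≈ [1, 2, −1], illustrating the approximation claim above.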
Comparison of the Perceptron and Gradient Descent Rules
The perceptron learning rule is guaranteed to succeed if
• the training examples are linearly separable
• no guarantee otherwise
A linear unit using gradient descent
• converges to the hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contains noise
• even when the training data is not linearly separable
Multi-Layer Networks
• Units are arranged in layers: input layer → hidden layer(s) → output layer
Sigmoid Unit
• Inputs x1, …, xn with weights w1, …, wn; bias input x0 = 1 with weight w0
• Net input: z = Σ_{i=0}^{n} wi xi
• Output: o = σ(z) = 1/(1 + e^−z)
• σ(z) = 1/(1 + e^−z) is the sigmoid function
Sigmoid Function
• σ(z) = 1/(1 + e^−z)
• dσ(z)/dz = σ(z) (1 − σ(z))
• Gradient descent rule for one sigmoid unit:
• ∂E/∂wi = −Σd (td − od) od (1 − od) xi
• Multilayer networks of sigmoid units: backpropagation
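The derivative identity above is what makes the sigmoid convenient for gradient descent; a short numeric check against a central finite difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # the identity from the slide: d(sigma)/dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

# compare against a central finite difference at a few points
for z in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_deriv(z)) < 1e-8
```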
Backpropagation: overview • Make threshold units differentiable • Use sigmoid functions • Given a sample compute: • The error • The Gradient • Use the chain rule to compute the Gradient
Backpropagation Motivation
• Consider the squared error over all outputs:
• E_S[w] = ½ Σ_{d∈S} Σ_{k∈outputs} (td,k − od,k)²
• Gradient: ∇E_S[w]
• Update: Δw = −η ∇E_S[w]
• How do we compute the gradient?
Backpropagation: Algorithm
• Forward phase:
• Given input x, compute the output of each unit
• Backward phase:
• For each output unit k compute its error term:
• δk = ok (1 − ok) (tk − ok)
Backpropagation: Algorithm (cont.)
• Backward phase (cont.)
• For each hidden unit h compute:
• δh = oh (1 − oh) Σ_{k∈outputs} wh,k δk
• Update weights:
• wi,j ← wi,j + Δwi,j, where Δwi,j = η δj xi,j
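The two phases can be sketched for a small 2-2-1 sigmoid network (a minimal sketch with assumed architecture and hyperparameters; trained here on XOR data, asserting only that the squared error decreases, not that XOR is learned perfectly):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_xor_net(epochs=5000, eta=0.5, seed=1):
    """2-2-1 sigmoid network trained with backpropagation (incremental updates).
    W_h[j] holds hidden unit j's weights (bias first); W_o the output weights."""
    rng = random.Random(seed)
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
    W_o = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def forward(x):  # forward phase: outputs of every unit
        h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in W_h]
        o = sigmoid(W_o[0] + W_o[1] * h[0] + W_o[2] * h[1])
        return h, o

    def total_error():
        return sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in data)

    start = total_error()
    for _ in range(epochs):
        for x, t in data:
            h, o = forward(x)
            delta_o = o * (1 - o) * (t - o)                      # output error term
            delta_h = [h[j] * (1 - h[j]) * W_o[j + 1] * delta_o  # hidden error terms
                       for j in range(2)]
            W_o[0] += eta * delta_o                              # weight updates
            for j in range(2):
                W_o[j + 1] += eta * delta_o * h[j]
                W_h[j][0] += eta * delta_h[j]
                W_h[j][1] += eta * delta_h[j] * x[0]
                W_h[j][2] += eta * delta_h[j] * x[1]
    return start, total_error()

start_err, final_err = train_xor_net()
assert final_err < start_err  # training reduced the squared error
```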
Backpropagation: Summary
• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Finds a local, not necessarily global, error minimum
• in practice it often works well
• may require multiple invocations with different initial weights
• A variation is to include a momentum term: Δwi,j(n) = η δj xi,j + α Δwi,j(n − 1)
• Minimizes error over the training examples
• Training is fairly slow, yet prediction is fast
Expressive Capabilities of ANN
Boolean functions
• Every boolean function can be represented by a network with a single hidden layer
• But this might require a number of hidden units exponential in the number of inputs
Continuous functions
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
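As a tiny illustration of the boolean-function claim, XOR (which a single perceptron cannot represent, per the decision-surface slide) is computed by a one-hidden-layer network of threshold units; the weights below are one hand-picked choice:

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """One hidden layer of threshold units computing XOR:
    h1 = OR(x1, x2), h2 = NAND(x1, x2), output = AND(h1, h2)."""
    h1 = step(x1 + x2 - 0.5)    # OR: fires if at least one input is 1
    h2 = step(-x1 - x2 + 1.5)   # NAND: fires unless both inputs are 1
    return step(h1 + h2 - 1.5)  # AND of the two hidden units

assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```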
VC-dim of ANN • A more general bound. • Concept class F(C,G): • G : Directed acyclic graph • C: concept class, d=VC-dim(C) • n: input nodes • s : inner nodes (of degree r) Theorem: VC-dim(F(C,G)) < 2ds log (es)
Proof:
• Bound |F(C,G)(m)|, the number of behaviors on m points
• Find the smallest m s.t. |F(C,G)(m)| < 2^m
• Let S = {x1, …, xm}
• For each fixed G we define a matrix U
• U[i,j] = ci(xj), where ci is the specific concept at the i-th inner node
• U describes the computations of G on S
• T_{F(C,G)} = number of different matrices U
Proof (continued)
• Clearly |F(C,G)(m)| ≤ T_{F(C,G)}
• Let G' be G without the root
• |F(C,G)(m)| ≤ T_{F(C,G)} ≤ T_{F(C,G')} · |C(m)|
• Inductively, |F(C,G)(m)| ≤ |C(m)|^s
• Recall the VC (Sauer) bound: |C(m)| ≤ (em/d)^d
• Combined bound: |F(C,G)(m)| ≤ (em/d)^{ds}
Proof (cont.)
• Solve for: (em/d)^{ds} ≤ 2^m
• Holds for m ≥ 2ds log(es) (log base 2)
• QED
• Back to ANN:
• VC-dim(C) = n + 1 for linear threshold units
• VC-dim(ANN) ≤ 2(n+1) s log(es)
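The final inequality can be sanity-checked numerically for sample values of d and s (working with logarithms to avoid huge numbers; the base-2 log is an assumption consistent with the bound above):

```python
import math

def bound_holds(d, s):
    """Check (em/d)^(ds) <= 2^m at m = ceil(2*d*s*log2(e*s))."""
    m = math.ceil(2 * d * s * math.log2(math.e * s))
    lhs = d * s * math.log2(math.e * m / d)  # log2 of (em/d)^(ds)
    return lhs <= m                          # i.e. (em/d)^(ds) <= 2^m

assert bound_holds(3, 5)
assert bound_holds(10, 20)
```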