ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 14: Artificial Neural Networks – Introduction, Feedforward Neural Networks
October 25, 2010
Dr. Itamar Arel
College of Engineering, Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2010
Final projects - logistics • Projects can be done individually or in pairs • Students are encouraged to propose a topic • Please email me your top three choices for a project along with a date for your presentation • Presentation dates: • Nov. 22, 24 and 26 • Format: 20 min presentation • ~10 min for background and motivation • ~10 for description of your work and conclusions • Written report due: Friday, Dec. 3 • Format similar to project report
Final projects - topics • Tetris player using RL (and NN) • Curiosity-based TD learning* • States vs. Rewards in RL • Human reinforcement learning • Reinforcement Learning of Local Shape in the Game of Go • Where do rewards come from? • Efficient Skill Learning using Abstraction Selection • AIBO Playing on a PC using RL* • RL for Visual Attention* • AIBO learning to walk within a maze* • Study of value function definitions for TD learning*
Outline • Introduction • Brain vs. Computers • The Perceptron • Multilayer Perceptrons (MLP) • Feedforward Neural-Networks and Backpropagation
Pigeons as art experts (Watanabe et al. 1995) • Experiment: • Pigeon was placed in a closed box • Present paintings of two different artists (e.g. Chagall / Van Gogh) • Reward for pecking when presented a particular artist (e.g. Van Gogh) • Pigeons were able to discriminate between Van Gogh and Chagall with 95% accuracy (when presented with pictures they had been trained on)
Interesting results • Discrimination was still 85% successful for previously unseen paintings by the same artists • Conclusions from the experiment: • Pigeons do not simply memorise the pictures • They can extract and recognise patterns (e.g. artistic ‘style’) • They generalise from the already seen to make predictions • This is what neural networks (biological and artificial) are good at (unlike conventional computers) • Provided further justification for the use of ANNs “Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination” – attributed to Albert Einstein
The “Von Neumann” architecture vs. Neural Networks
Von Neumann • Memory for programs and data • CPU for math and logic • Control unit to steer program flow • Follows rules • Solution can/must be formally specified • Cannot generalize • Not error tolerant
Neural Net • Learns from data • Rules on data are not visible • Able to generalize • Copes well with noise
Biological Neuron • Input builds up on the receptors (dendrites) • The cell has an input threshold • When the cell’s threshold is breached, an activation is fired down the axon • Synapses (i.e., the weights) sit at the interfaces just before the dendrites (inputs)
Connectionism • Connectionist techniques (a.k.a. neural networks) are inspired by the strong interconnectedness of the human brain. • Neural networks are loosely modeled after the biological processes involved in cognition: 1. Information processing involves many simple processing elements called neurons. 2. Signals are transmitted between neurons using connecting links. 3. Each link has a weight that modulates (or controls) the strength of its signal. 4. Each neuron applies an activation function to the input that it receives from other neurons. This function determines its output. • Links with positive weights are called excitatory links. • Links with negative weights are called inhibitory links.
Some definitions • A Neural Network is an interconnected assembly of simple processing elements, units or nodes. The long-term memory of the network is stored in the inter-unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns. • Biologically inspired learning mechanism
Brain vs. Computer • Performance tends to degrade gracefully under partial damage • In contrast, most programs and engineered systems are brittle: if you remove some arbitrary parts, the whole will very likely cease to function • The brain performs massively parallel computations extremely efficiently. For example, complex visual perception occurs within less than 100 ms, that is, about 10 processing steps!
Dimensions of Neural Networks • Various types of neurons • Various network architectures • Various learning algorithms • Various applications • We’ll focus mainly on supervised learning based networks • The architecture of a neural network is closely linked to the learning algorithm used to train it
ANNs – The basics • ANNs incorporate the two fundamental components of biological neural nets: • Neurons – computational nodes • Synapses – weights or memory storage devices
The Artificial Neuron
[Diagram: input signals x1, x2, …, xm are scaled by synaptic weights w1, w2, …, wm and combined by a summing function, together with a bias b, to form the local field v; an activation function then produces the output y]
Bias as an extra input
[Diagram: the same neuron, with the bias replaced by a fixed extra input x0 = +1 weighted by w0]
• Bias is an external parameter of the neuron. It can be modeled by adding an extra (fixed-valued) input.
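The two diagrams describe the same computation. As a rough illustration (not from the slides; the tanh activation and the example numbers are arbitrary choices), the following sketch shows both forms producing identical outputs:

```python
import numpy as np

def artificial_neuron(x, w, b, activation=np.tanh):
    """Single artificial neuron: weighted sum (local field) -> activation."""
    v = np.dot(w, x) + b          # local field v = sum_i w_i * x_i + b
    return activation(v)          # output y = phi(v)

def neuron_with_bias_as_input(x, w, w0, activation=np.tanh):
    """Same neuron, but the bias is modeled as a fixed extra input x0 = +1."""
    x_ext = np.concatenate(([1.0], x))   # prepend the fixed input x0 = +1
    w_ext = np.concatenate(([w0], w))    # w0 plays the role of the bias b
    return activation(np.dot(w_ext, x_ext))

x = np.array([0.5, -1.0])
w = np.array([0.8, 0.3])
print(artificial_neuron(x, w, b=0.2))            # both calls print the same value
print(neuron_with_bias_as_input(x, w, w0=0.2))
```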
Face recognition example • 90% accuracy at learning head pose and recognizing 1-of-20 faces
The XOR problem
A single-layer (linear) neural network cannot solve the XOR problem.

Input   Output
0 0     0
0 1     1
1 0     1
1 1     0

To see why this is true, we can try to express the problem as a linear equation: aX + bY = Z
a·0 + b·0 = 0
a·0 + b·1 = 1  ->  b = 1
a·1 + b·0 = 1  ->  a = 1
a·1 + b·1 = 0  ->  a = -b
The last constraint contradicts a = b = 1, so no choice of a and b satisfies all four rows.
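To make the contradiction concrete, a small sketch (not part of the slides) can fit the best possible single linear unit to the truth table and show that it never reproduces all four outputs:

```python
import numpy as np

# XOR truth table: inputs (X, Y) and target Z
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Z = np.array([0, 1, 1, 0], dtype=float)

# Best linear fit a*X + b*Y = Z in the least-squares sense
coeffs, residual, *_ = np.linalg.lstsq(X, Z, rcond=None)
print("best (a, b):", coeffs)       # roughly a = b = 1/3
print("predictions:", X @ coeffs)   # [0, 1/3, 1/3, 2/3] -- never matches all four targets
```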
The XOR problem (cont.)
But by adding a third input bit (set to 1 only when both original bits are 1), the problem can be resolved.

Input   Output
0 0 0   0
0 1 0   1
1 0 0   1
1 1 1   0

Once again, we express the problem as a linear equation: aX + bY + cZ = W
a·0 + b·0 + c·0 = 0
a·0 + b·1 + c·0 = 1  ->  b = 1
a·1 + b·0 + c·0 = 1  ->  a = 1
a·1 + b·1 + c·1 = 0  ->  a + b + c = 0  ->  1 + 1 + c = 0  ->  c = -2
So the equation X + Y - 2Z = W will solve the problem.
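A quick check (illustrative only) that the equation above reproduces XOR once the third input bit is supplied:

```python
# Verify that W = X + Y - 2*Z reproduces XOR, where Z = X AND Y is the extra bit
for x_bit in (0, 1):
    for y_bit in (0, 1):
        z_bit = x_bit & y_bit              # third input bit from the table above
        w_out = x_bit + y_bit - 2 * z_bit  # the linear equation from the slide
        print(x_bit, y_bit, z_bit, "->", w_out)   # outputs 0, 1, 1, 0 as required
```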
A Multilayer Network for the XOR function
[Diagram: a multilayer network of threshold units computing XOR, with the units’ thresholds labeled]
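Since the diagram does not transfer to text, here is one possible set of weights and thresholds (hand-picked for illustration, not necessarily the values in the figure) for a two-layer network of threshold units that computes XOR:

```python
import numpy as np

def step(v):
    """Threshold (Heaviside) activation: 1 if the local field is positive, else 0."""
    return (v > 0).astype(float)

# Hidden unit 1 acts as OR, hidden unit 2 as AND; the output unit
# fires when OR is active but AND is not, which is exactly XOR.
W_hidden = np.array([[1.0, 1.0],    # OR unit
                     [1.0, 1.0]])   # AND unit
b_hidden = np.array([-0.5, -1.5])   # thresholds 0.5 and 1.5
W_out = np.array([1.0, -1.0])       # excitatory link from OR, inhibitory link from AND
b_out = -0.5                        # output threshold 0.5

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = step(W_hidden @ np.array(x, dtype=float) + b_hidden)
    y = step(W_out @ h + b_out)
    print(x, "->", int(y))          # prints 0, 1, 1, 0
```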
Hidden Units • Hidden units are a layer of nodes that are situated between the input nodes and the output nodes • Hidden units allow a network to learn non-linear functions • The hidden units allow the net to represent combinations of the input features • Given too many hidden units, however, a net will simply memorize the input patterns • Given too few hidden units, the network may not be able to represent all of the necessary generalizations
Backpropagation Networks • Backpropagation networks are among the most popular and widely used neural networks because they are relatively simple and powerful • Backpropagation was one of the first general techniques developed to train multilayer networks, which do not have many of the inherent limitations of the earlier, single-layer neural nets criticized by Minsky and Papert. • Backpropagation networks use a gradient descent method to minimize the total squared error of the output. • A backpropagation net is a multilayer, feedforward network that is trained by backpropagating the errors using the generalized delta rule.
The idea behind (error) backpropagation learning
Feedforward training of input patterns • Each input node receives a signal, which is broadcast to all of the hidden units • Each hidden unit computes its activation, which is broadcast to all of the output nodes
Backpropagation of errors • Each output node compares its activation with the desired output • Based on this difference, the error is propagated back to all previous nodes
Adjustment of weights • The weights of all links are computed simultaneously based on the errors that were propagated backwards
[Diagram: Multilayer Perceptron (MLP)]
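A compact sketch of this three-phase loop, assuming a logistic (sigmoid) activation, a single hidden layer, and the XOR data from earlier (the layer sizes, learning rate, and epoch count are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # input patterns
D = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs (XOR)

W1 = rng.normal(scale=0.5, size=(2, 2)); b1 = np.zeros(2)     # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(2, 1)); b2 = np.zeros(1)     # hidden -> output weights
eta = 0.5                                                     # learning rate

for epoch in range(10000):
    # 1) feedforward training of input patterns
    H = sigmoid(X @ W1 + b1)                      # hidden activations
    Y = sigmoid(H @ W2 + b2)                      # network outputs
    # 2) backpropagation of errors (generalized delta rule)
    delta_out = (Y - D) * Y * (1 - Y)             # output-layer error signal
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # error propagated back to hidden layer
    # 3) simultaneous adjustment of all weights by gradient descent
    W2 -= eta * (H.T @ delta_out)
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * (X.T @ delta_hid)
    b1 -= eta * delta_hid.sum(axis=0)

# Typically close to [0, 1, 1, 0]; a net this small can occasionally stall in a local minimum
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))
```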
Activation functions • Transform the neuron’s input into its output • Features of activation functions: • A squashing effect is required • Prevents accelerating growth of activation levels through the network • Simple and easy to calculate
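For example, the logistic sigmoid has all of these properties: it squashes any local field into (0, 1), and its derivative can be written directly in terms of its own output, which keeps backpropagation cheap (a sketch, not from the slides):

```python
import numpy as np

def sigmoid(v):
    """Logistic squashing function: maps any local field into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_deriv(y):
    """Derivative expressed in terms of the output y itself: y * (1 - y)."""
    return y * (1.0 - y)

v = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(v))                  # large |v| is squashed towards 0 or 1
print(sigmoid_deriv(sigmoid(v)))   # the derivative vanishes where the unit saturates
```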
Backpropagation Learning • We want to train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x, d) • Vector x represents a pattern of input to the network, and the vector d the corresponding target (desired output) • BP is a gradient-descent based scheme … • The overall gradient with respect to the entire training set is just the sum of the gradients for each pattern • We will therefore describe how to compute the gradient for just a single training pattern • We will number the units, and denote the weight from unit j to unit i by w_ij
BP – Forward Pass at Layer 3 • The last layer produces the network’s output • We can now derive an error (the difference between the output and the target)
BP – Back-propagation of error – output layer • We have an error with respect to the target (z) • This error signal will be propagated back towards the input layer (layer 1) • Each neuron will forward error information to the neurons feeding it from the previous layer
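Putting the last two slides together for a single training pattern, here is a sketch of the forward pass through layers 1–3, the output error, and the error signal passed back toward the input layer (the layer sizes and sigmoid activation are illustrative assumptions; the weight notation follows w_ij from earlier):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
x = np.array([0.2, 0.7, 0.1])      # single input pattern (layer 1)
d = np.array([1.0])                # target (desired output)

W2 = rng.normal(size=(4, 3))       # weights w_ij from layer 1 units j to layer 2 units i
W3 = rng.normal(size=(1, 4))       # weights from layer 2 units to the output layer (layer 3)

# Forward pass: the last layer produces the network's output
y2 = sigmoid(W2 @ x)
y3 = sigmoid(W3 @ y2)
error = y3 - d                                  # difference between output and target

# Back-propagation of error: output-layer error signal, then pass it to the layer feeding it
delta3 = error * y3 * (1 - y3)                  # output layer (layer 3)
delta2 = (W3.T @ delta3) * y2 * (1 - y2)        # layer 2 receives error info from layer 3

# Gradient for this single pattern (the full gradient is the sum over all patterns)
grad_W3 = np.outer(delta3, y2)
grad_W2 = np.outer(delta2, x)
print(grad_W3.shape, grad_W2.shape)             # (1, 4) and (4, 3), matching W3 and W2
```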