Learning: A modern review of anticipatory systems in brains and machines Thomas Trappenberg
Universal learning machines
Eduardo Renato Caianiello (1921-1993)
1961: Outline of a Theory of Thought-Processes and Thinking Machines
• Neuronic & mnemonic equation
• Reverberation
• Oscillations
• Reward learning
But: NOT STOCHASTIC (only small noise in the weights)
Stochastic networks: the Boltzmann machine (Hinton & Sejnowski 1983)
Multilayer perceptron (MLP)
Universal approximator (learner), but:
• Overfitting
• Meaningful input
• Unstructured learning
• Only deterministic (just use the chain rule)
Linear large-margin classifiers: support vector machines (SVM)
MLP: minimize training error (here a threshold perceptron)
SVM: minimize generalization error (empirical risk)
Linear-in-parameter learning
• Linear hypothesis
• Non-linear hypothesis, linear in the parameters
• SVM in dual form → kernel function
• Liquid/echo state machines
• Extreme learning machines
Thanks to Doug Tweet (UoT) for pointing out LIP.
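A minimal sketch of the linear-in-parameters idea, using made-up data and an assumed Gaussian feature map: the hypothesis is non-linear in the input but linear in the weights, so it can be fit in closed form, and the same predictions can be written in dual (kernel) form.

```python
import numpy as np

# Illustrative sketch (not from the slides): a model that is non-linear in the
# input x but linear in the parameters w, fit by regularized least squares.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(50)

# Non-linear feature map phi(x); the hypothesis y_hat = phi(x) @ w is linear in w.
def phi(x, centers, width=0.5):
    return np.exp(-(x - centers.T) ** 2 / (2 * width ** 2))

centers = np.linspace(-3, 3, 10).reshape(-1, 1)
Phi = phi(x, centers)
lam = 1e-3

# Primal solution: w = (Phi^T Phi + lam I)^-1 Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# Dual (kernel) form: predictions depend on x only through k(x, x') = phi(x)·phi(x')
K = Phi @ Phi.T
alpha = np.linalg.solve(K + lam * np.eye(K.shape[0]), y)
y_dual = K @ alpha          # identical predictions to Phi @ w (representer theorem)
```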
Goal of learning: make predictions! (learning vs. memory)
Sources of fluctuations:
• Fundamental stochasticity / irreducible indeterminacy
• Epistemological limitations
→ Probabilistic framework
Goal of learning: the plant equation for the robot, e.g. the distance traveled when both motors are running at power 50.
Hypothesis: the hard problem is how to come up with a useful hypothesis.
Learning: choose the parameters that make the training data most likely (maximum likelihood estimation).
Assume independence of the training examples and consider the likelihood as a function of the parameters (log likelihood).
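As a hedged illustration of the maximum likelihood recipe above (synthetic data, a Gaussian model assumed for concreteness): independence turns the likelihood into a product, the log turns it into a sum, and the estimate maximizes that sum over the parameters.

```python
import numpy as np

# Sketch of maximum likelihood estimation (assumptions: i.i.d. Gaussian samples,
# illustrative data only). The log likelihood turns the product over independent
# examples into a sum, which is maximized over the parameters (mu, sigma).
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=0.5, size=1000)

def log_likelihood(mu, sigma, x):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

# For a Gaussian the maximum of the log likelihood has a familiar closed form:
mu_ml = data.mean()
sigma_ml = data.std()   # ML estimate divides by N, as np.std does by default
print(mu_ml, sigma_ml, log_likelihood(mu_ml, sigma_ml, data))
```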
How about building more elaborate multivariate models and arguing with causal (graphical) models (Judea Pearl)? The factorized graphical model needs only 10 parameters instead of the 31 of the full joint distribution, and the parameters of the conditional probability tables (CPTs) are usually learned from data.
Hidden Markov model (HMM) for localization
• Integrating sensor information becomes trivial
• Breakdown of point estimates in global localization (particle filters)
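A small hedged sketch of why sensor integration becomes trivial in the HMM view: localization is just the forward (Bayes-filter) recursion, alternating a motion prediction with a sensor update. The corridor, transition matrix, and sensor model below are invented for illustration.

```python
import numpy as np

# Hypothetical toy example: Bayes-filter (HMM forward) localization on a 1-D
# corridor with 5 cells. T[i, j] is the motion model P(x_t = j | x_{t-1} = i),
# and sensor_model gives P(z = "door" | x). All numbers are made up.
T = np.array([[0.1, 0.9, 0.0, 0.0, 0.0],
              [0.0, 0.1, 0.9, 0.0, 0.0],
              [0.0, 0.0, 0.1, 0.9, 0.0],
              [0.0, 0.0, 0.0, 0.1, 0.9],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
sensor_model = np.array([0.9, 0.1, 0.1, 0.9, 0.1])

belief = np.ones(5) / 5                       # uniform prior over cells

def hmm_step(belief, observed_door):
    predicted = T.T @ belief                  # prediction (motion update)
    likelihood = sensor_model if observed_door else 1 - sensor_model
    posterior = likelihood * predicted        # sensor update (Bayes rule)
    return posterior / posterior.sum()

belief = hmm_step(belief, observed_door=True)
print(belief)
```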
Synaptic plasticity
Gradient descent rule for the LMS loss function … with a linear hypothesis:
• Perceptron learning rule
• Hebb rule
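A minimal sketch (invented data) contrasting the two learning rules named above: the LMS/delta rule that follows from gradient descent on the squared error with a linear hypothesis, and the plain Hebb rule driven by input-output correlation.

```python
import numpy as np

# Delta rule vs. Hebb rule on a toy linear regression problem (synthetic data).
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_lms = np.zeros(3)
w_hebb = np.zeros(3)
eta = 0.01

for x_i, y_i in zip(X, y):
    y_hat = w_lms @ x_i
    w_lms += eta * (y_i - y_hat) * x_i   # delta rule: error-driven update
    w_hebb += eta * y_i * x_i            # Hebb rule: correlation of pre- and post-activity

print(w_lms)    # approaches w_true
print(w_hebb)   # grows with the input/output correlation; no error correction
```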
The Organization of Behavior (1949): Donald O. Hebb (1904-1985); see also Sigmund Freud, law of association by simultaneity, 1888.
Data from G.Q. Bi and M.M. Poo, J Neurosci 18 (1998); D. Standage, S. Jalil and T. Trappenberg, Biological Cybernetics 96 (2007)
Population argument of 'weight dependence'
Is Bi and Poo's weight-dependent STDP data an experimental artifact?
- Three sets of assumptions (B, C, D)
- Their data may reflect population effects
… with Dominic Standage (Queen's University)
Horace Barlow: "Possible principles underlying the transformations of sensory messages" (1961)
"… reduction of redundancy is an important principle guiding the organization of sensory messages …"
Sparseness & overcompleteness
The Ratio Club
PCA minimizing reconstruction error and sparsity
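A brief sketch of the "minimizing reconstruction error" reading of PCA (synthetic data): project onto the top-k principal directions and measure how well the code reconstructs the input. The sparsity penalty of sparse-coding variants is not included here.

```python
import numpy as np

# PCA as minimum-reconstruction-error coding (illustrative data only).
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 10))
X = X - X.mean(axis=0)

# Principal directions from the SVD of the centered data.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 3
W = Vt[:k]                       # top-k principal directions
codes = X @ W.T                  # low-dimensional codes
X_hat = codes @ W                # reconstruction from k components
error = np.mean((X - X_hat) ** 2)
print(error)                     # shrinks as k grows; only discarded variance is lost
```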
Deep belief networks: the stacked restricted Boltzmann machine (Geoffrey E. Hinton)
Sparse convolutional RBM … with Paul Hollensen & Warren Connors
Sonar images (truncated cone; side-scan sonar; synthetic aperture sonar) and scRBM reconstructions
scRBM/SVM mine sensitivity: .983 ± .024, specificity: .954 ± .012
SIFT/SVM mine sensitivity: .970 ± .025, specificity: .944 ± .008
Sparse and topographic RBM (rtRBM) … with Paul Hollensen
Map Initialized Perceptron (MIP) … with Pitoyo Hartono
Free-energy-based supervised learning: TD learning generalized to Boltzmann machines (Sallans & Hinton 2004)
Paul Hollensen: the sparse, topographic RBM successfully learns to drive the e-puck and avoid obstacles, given training data (proximity sensors, motor speeds)
2. Reinforcement learning
[Grid-world example with a reward of -0.1 for every non-terminal state; from Russell and Norvig]
Markov decision process (MDP): if we know all of these factors, the problem is said to be fully observable, and we can simply sit down and contemplate the problem before moving.
Two important quantities: the policy and the value function.
Goal: maximize the total expected payoff (optimal control).
Calculate the value function (dynamic programming); deterministic policies are used to simplify notation.
Bellman equation for policy π. Solution: analytic or incremental.
Richard Bellman (1920-1984)
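The equation itself did not survive the slide export; for reference, the standard textbook form of the Bellman equation for a fixed deterministic policy π (notation may differ from the original slide) is:

```latex
V^{\pi}(s) = r\bigl(s, \pi(s)\bigr) + \gamma \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)\, V^{\pi}(s')
```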
Policy iteration: choose a policy, calculate the corresponding value function, then choose a better policy based on this value function.
Value iteration: for each state, evaluate all possible actions (Bellman equation for the optimal policy).
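A hedged sketch of value iteration on a small tabular MDP; the transition tensor and rewards are random placeholders, not a model from the talk.

```python
import numpy as np

# Value iteration on a random toy MDP: P[a, s, s'] are transitions, R[s, a] rewards.
rng = np.random.default_rng(4)
n_states, n_actions, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] sums to 1
R = rng.standard_normal((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum('asn,n->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)     # greedy policy from the converged values
print(V, policy)
```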
Solution, but:
• Environment not known a priori
• Observability of states
• Curse of dimensionality
→ Online (TD) learning, POMDPs, model-based RL
What if the environment is not completely known? Online value function estimation (TD learning): use a Monte Carlo method with bootstrapping. The temporal difference is the expected payoff after taking the step (the actual reward plus the discounted expected payoff of the next state) minus the expected payoff before taking the step. This leads to the exploration-exploitation dilemma.
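A minimal sketch of tabular TD(0) on an assumed random-walk task (the environment is invented for illustration): the value estimate is nudged toward the actual reward plus the discounted estimate of the next state.

```python
import numpy as np

# TD(0) value estimation on a 5-state random walk with terminal states at both
# ends and reward 1 for exiting on the right.
rng = np.random.default_rng(5)
n_states, alpha, gamma = 5, 0.1, 1.0
V = np.zeros(n_states + 2)             # include the two terminal states

for episode in range(500):
    s = 3                              # start in the middle state
    while s not in (0, n_states + 1):
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states + 1 else 0.0
        # TD error: reward plus discounted estimate of the next state,
        # minus the current estimate ("expected payoff before the step").
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print(V[1:-1])                         # approaches 1/6, 2/6, ..., 5/6 for this walk
```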
Online optimal control: exploitation versus exploration
• On-policy TD learning: Sarsa
• Off-policy TD learning: Q-learning
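A short hedged sketch contrasting the two update rules; Q is assumed to be a table indexed by state and action, and ε-greedy action selection stands in for the exploration-exploitation trade-off.

```python
import numpy as np

# Sarsa (on-policy) vs. Q-learning (off-policy) updates on a tabular Q[state, action].
def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))      # explore
    return int(Q[s].argmax())                     # exploit

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Target uses the action actually taken next (on-policy).
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Target uses the greedy action, independent of the behaviour policy (off-policy).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```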
Model-based RL: TD(λ). Instead of the tabular methods mainly discussed before, use a function approximator with parameters θ and gradient descent, with an exponential eligibility trace e that weights the updates with λ at each step (Sutton 1988). Free-energy-based reinforcement learning (Sallans & Hinton 2004) … Paul Hollensen
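One step of linear TD(λ) as a hedged sketch of the idea just described: V(s) = φ(s)·θ, an accumulating eligibility trace e decays with γλ, and every weight is updated in proportion to its trace. The feature vectors and environment interface are assumed to come from elsewhere.

```python
import numpy as np

# Single TD(lambda) step with a linear function approximator (illustrative only).
def td_lambda_step(theta, e, phi_s, phi_next, r, alpha=0.05, gamma=0.95, lam=0.9):
    delta = r + gamma * phi_next @ theta - phi_s @ theta   # TD error
    e = gamma * lam * e + phi_s                            # decay trace, add current features
    theta = theta + alpha * delta * e                      # every weight updated via its trace
    return theta, e
```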
Basal Ganglia … work with Patrick Connor
Our questions • How do humans learn values that guide behaviour? (human behaviour) • How is this implemented in the brain? (anatomy and physiology) • How can we apply this knowledge? (medical interventions and robotics)
Classical conditioning: Ivan Pavlov (1849-1936), Nobel Prize 1904. Rescorla-Wagner model (1972).
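A small sketch of the Rescorla-Wagner update; the parameterization here (learning rates α and β, reward level λ) is the common textbook form rather than anything taken from the slide. The example also illustrates the blocking effect the model is known to explain.

```python
# Rescorla-Wagner (1972): shared prediction error updates each present cue's strength.
def rescorla_wagner_update(V, present, lambda_r, alpha=0.1, beta=1.0):
    prediction = sum(V[s] for s in present)      # summed prediction of all present cues
    error = lambda_r - prediction                # prediction error
    for s in present:
        V[s] += alpha * beta * error             # shared error drives each cue's update
    return V

# Stimulus 'A' paired with reward quickly gains value; adding 'B' later produces
# blocking because 'A' already predicts the reward.
V = {'A': 0.0, 'B': 0.0}
for _ in range(50):
    V = rescorla_wagner_update(V, present=['A'], lambda_r=1.0)
for _ in range(50):
    V = rescorla_wagner_update(V, present=['A', 'B'], lambda_r=1.0)
print(V)   # V['A'] near 1, V['B'] near 0 (blocked)
```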
Reward signals in the brain (Wolfram Schultz)
[Dopamine recordings under three conditions: stimulus A with no reward, stimulus B, and stimulus A with reward]
Disorders with effects on the dopamine system: Parkinson's disease, Tourette's syndrome, ADHD, drug addiction, schizophrenia (Maia & Frank 2011)
Adding biological qualities to the model
[Diagram: input → Rescorla-Wagner model (Rescorla and Wagner, 1972) → striatum; dopamine and reward prediction error (Schultz, 1998)]