The statistical physics of learning - revisited
Michael Biehl
Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence
University of Groningen / NL
www.cs.rug.nl/~biehl
machine learning theory?
Computational Learning Theory: performance bounds & guarantees, independent of
- specific task
- statistical properties of data
- details of the training ...
Statistical Physics of Learning: typical properties & phenomena for models of specific
- systems / network architectures
- statistics of data and noise
- training algorithms / cost functions ...
A Neural Networks timeline
mathematical analogies with the theory of disordered magnetic materials: statistical physics
- of network dynamics (neurons)
- of learning processes (weights)
[timeline: Widrow & Hoff's Adaline, Minsky & Papert's Perceptrons, SOM, LVQ, SVM; figure: www.ibm.com/developerworks/library/cc-cognitive-neural-networks-deep-dive/]
news from the stone age of neural networks
Statistical Physics of Neural Networks: two ground-breaking papers
Dynamics, attractor neural networks:
John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8):2554-2558 (1982)
Training, feed-forward networks:
Elizabeth Gardner (1957-1988). The space of interactions in neural network models. J. Phys. A 21:257-270 (1988)
overview
From stochastic optimization (Monte Carlo, Langevin dynamics, ...) to thermal equilibrium: temperature, free energy, entropy, ...
(... and back): formal application to optimization
Machine learning: typical properties of large learning systems
- training: stochastic optimization of (many) weights, guided by a data-dependent cost function
- models: student/teacher scenarios, randomized data ("frozen disorder")
- analysis: order parameters, disorder average, replica trick, annealed approximation, high temperature limit
Examples: perceptron classifier, "Ising" perceptron, layered networks
Outlook
stochastic optimization
objective/cost/energy function H(w) with many degrees of freedom w, discrete (e.g. w_j = ±1) or continuous (e.g. w ∈ R^N)
Metropolis algorithm:
• suggest a (small) random change of the configuration, e.g. a "single spin flip" w_j → -w_j for a random j
• compute ΔH = H(w_new) - H(w_old)
• accept the change
  - always if ΔH ≤ 0
  - with probability exp(-β ΔH) if ΔH > 0
Langevin dynamics:
• continuous temporal change, "noisy gradient descent":  dw/dt = -∇_w H(w) + η(t)
• with delta-correlated white noise η(t) (spatial + temporal independence)
the (formal) temperature T = 1/β controls the noise level, i.e. the random deviation from gradient descent, and the acceptance rate for "uphill" moves
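As an illustration only, here is a minimal Python sketch of the Metropolis rule described above; the toy cost function (counting disagreements with a fixed reference configuration), the system size and the value of beta are placeholders chosen for this sketch, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_step(w, H, beta):
    """One single-spin-flip Metropolis update of a binary configuration w."""
    j = rng.integers(len(w))            # pick a random component
    w_new = w.copy()
    w_new[j] *= -1                      # propose a single spin flip
    dH = H(w_new) - H(w)
    # accept always if dH <= 0, otherwise with probability exp(-beta*dH)
    if dH <= 0 or rng.random() < np.exp(-beta * dH):
        return w_new
    return w

# toy cost function: number of disagreements with a fixed reference configuration
reference = rng.choice([-1, 1], size=50)
H = lambda w: np.sum(w != reference)

w = rng.choice([-1, 1], size=50)
for _ in range(5000):
    w = metropolis_step(w, H, beta=2.0)
print("remaining disagreements:", H(w))
```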
thermal equilibrium
Markov chain (Metropolis) and continuous dynamics (Langevin) lead to the same stationary density of configurations, the Gibbs-Boltzmann density of states:
  P(w) = exp[-β H(w)] / Z,  with normalization Z = ∫ dμ(w) exp[-β H(w)]
Z: "Zustandssumme" (sum over states), partition function
• physics: thermal equilibrium of a physical system at temperature T = 1/β
• optimization: formal equilibrium situation, control parameter T
note: additional constraints can be imposed on the weights, for instance a normalization such as w·w = N
thermal averages and entropy
the role of Z: thermal averages ⟨...⟩_T in equilibrium, e.g. ⟨H⟩_T, can be expressed as derivatives of ln Z:
  ⟨H⟩_T = -∂ ln Z / ∂β
re-write Z as an integral over all possible energies:
  Z = ∫ dE g(E) exp(-βE),  where g(E) ~ volume of states with energy E
(microcanonical) entropy per degree of freedom: s = (1/N) ln g(E)
assume extensive energy, proportional to the system size N, E = N e:
  Z = ∫ de exp( N [ s(e) - β e ] )
Darwin-Fowler, aka saddle point integration
the integrand exp( N [ s(e) - β e ] ) is a function with a sharp maximum in e; consider the thermodynamic limit N → ∞:
ln Z / N is given by the maximum of s(e) - β e, i.e. by the minimum of the free energy (per degree of freedom)
  f(e) = e - s(e) / β
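Written out, the saddle point (Laplace) argument reads as follows, with e = E/N and s(e) as defined on the previous slide:

\[
Z \;=\; \int\! de\; e^{\,N\,[\,s(e)-\beta e\,]}
\;\xrightarrow{\;N\to\infty\;}\;
e^{\,N\,[\,s(e^{*})-\beta e^{*}\,]},
\qquad
e^{*} = \mathrm{argmin}_{e}\, f(e),
\quad
f(e) = e - \frac{s(e)}{\beta},
\]

so that \( -\ln Z/(\beta N) \to f(e^{*}) = \min_e f(e) \) in the thermodynamic limit.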
free energy and temperature
in large systems (thermodynamic limit), ln Z is dominated by the states with minimal free energy f = e - s(e)/β
T = 1/β is the temperature at which ⟨H⟩_T = N e_o
T controls the competition between
- smaller energies
- larger number of available states
limit β → ∞ (T → 0): singles out the lowest energy (ground state);
  Metropolis: only down-hill moves, Langevin: true gradient descent
limit β → 0 (T → ∞): all states occur with equal probability, independent of energy;
  Metropolis: accept all random changes, Langevin: the noise term suppresses the gradient
assumption: ergodicity (all states can be reached in the dynamics)
statistical physics & optimization
theory of stochastic optimization by means of statistical physics:
• development of algorithms (e.g. Simulated Annealing)
• analysis of problem properties, even in absence of practical algorithms (number of ground states, minima, ...)
• applicable in many different contexts, universality
machine learning
special case machine learning: choice of adaptive parameters, e.g. all weights in a neural network, prototype components in LVQ, centers in an RBF network, ...
cost function: defined w.r.t. a sum over examples, feature vectors x^μ and target labels σ^μ (if supervised), with a cost or error measure ε(...) per example, e.g. the number of misclassifications
training:
• consider the weights as the outcome of a stochastic optimization process
• formal (thermal) equilibrium given by the Gibbs-Boltzmann density
• ⟨...⟩_T: thermal average over the training process for a particular data set
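Spelled out in formulas; the symbols P (number of examples), W (all adaptive parameters), σ_W (output of the learning system) and Θ (Heaviside step function) are notation introduced here for concreteness:

\[
H(W) \;=\; \sum_{\mu=1}^{P} \varepsilon\!\left(W;\, \mathbf{x}^{\mu}, \sigma^{\mu}\right),
\qquad \text{e.g.}\quad
\varepsilon \;=\; \Theta\!\left(-\,\sigma^{\mu}\,\sigma_{W}(\mathbf{x}^{\mu})\right),
\]

where the latter choice simply counts the misclassified examples.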
quenched average over training data
• note: the energy/cost function is defined for one particular data set
• typical properties by an additional average over randomized data
• the simplest assumption: i.i.d. input vectors x^μ with i.i.d. components
• training labels given by a target function, for instance provided by a teacher network
student / teacher scenarios:
• control the complexity of the target rule and the learning system
• analyse training by (stochastic) optimization
• typical properties on average over randomized data sets: derivatives of the quenched free energy ~ ⟨ln Z⟩_ID yield averages of the form ⟨⟨...⟩_T⟩_ID
average over training data
"replica trick": express ⟨ln Z⟩_ID through ⟨Z^n⟩_ID, i.e. n non-interacting "copies" of the system (replicas)
the quenched average introduces effective interactions between the replicas ...
saddle point integration for ⟨Z^n⟩_ID; the quenched free energy requires analytic continuation to n → 0
mathematical subtleties: replica symmetry breaking, order parameter functions, ...
Marc Mézard, Giorgio Parisi, Miguel Virasoro. Spin Glass Theory and Beyond. World Scientific (1987)
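The identity behind the replica trick, as used above:

\[
\langle \ln Z \rangle_{ID} \;=\; \lim_{n\to 0}\;\frac{\langle Z^{n}\rangle_{ID}-1}{n}
\;=\; \lim_{n\to 0}\;\frac{\ln \langle Z^{n}\rangle_{ID}}{n},
\]

evaluated for integer n (n independent replicas sharing the same data set) and then continued analytically to n → 0.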
annealed approximation and high-T limit
annealed approximation: ⟨ln Z⟩_ID ≈ ln ⟨Z⟩_ID
becomes exact (=) in the high-temperature limit β → 0 (the replicas decouple); the disorder average can be performed in the exponent for β ≈ 0
• independent single examples: the average over the data set factorizes
• extensive number of examples: P = α N (proportional to the number of weights)
• saddle point integration: ⟨ln Z⟩_ID / N is dominated by the minimum of the free energy, in which the generalization error plays the role of the energy (it coincides with the training error in this limit)
"learn almost nothing ..." (high T) "... from infinitely many examples" (α → ∞) with finite α̃ = β α
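For reference, the relation between the annealed and the quenched average, and the scaling assumed in the high-temperature limit (α = P/N as above):

\[
\langle \ln Z \rangle_{ID} \;\le\; \ln \langle Z \rangle_{ID}
\quad\text{(Jensen's inequality, ln concave)}
\;\;\Longrightarrow\;\;
f_{\mathrm{quenched}} \;\ge\; f_{\mathrm{annealed}},
\]
\[
\text{high-}T\ \text{limit:}\qquad
\beta \to 0,\quad \alpha = P/N \to \infty,\quad
\tilde{\alpha} = \beta\,\alpha\ \ \text{finite}.
\]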
example: perceptron training
• student: σ_S(x) = sign(w · x), adaptive weights w ∈ R^N
• teacher: σ_T(x) = sign(B · x), fixed weights B ∈ R^N
• training data: x^μ, μ = 1 ... P, with independent components of zero mean and unit variance
• Central Limit Theorem (CLT): for large N, the local fields w·x/√N and B·x/√N are normally distributed, with correlations given by the order parameter R = w·B / (|w| |B|)
• R fully specifies the generalization error:  ε_g = arccos(R) / π
example: perceptron training
• or, more intuitively: for i.i.d. isotropic data the geometry is simple; the generalization error is the probability that a random input falls between the decision planes of student and teacher, i.e. it is given by the angle between w and B, which is determined by the order parameter R:  ε_g = arccos(R) / π
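A quick numerical check of ε_g = arccos(R)/π in Python; the dimension N, the target overlap and the sample size are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
N, R_target = 100, 0.8

B = rng.normal(size=N)
B /= np.linalg.norm(B)                       # teacher direction
v = rng.normal(size=N)
v -= (v @ B) * B                             # component orthogonal to B
v /= np.linalg.norm(v)
w = R_target * B + np.sqrt(1 - R_target**2) * v   # student with overlap R_target

R = w @ B / (np.linalg.norm(w) * np.linalg.norm(B))

# empirical disagreement rate on i.i.d. isotropic Gaussian inputs
X = rng.normal(size=(50_000, N))
eps_empirical = np.mean(np.sign(X @ w) != np.sign(X @ B))

print(f"R = {R:.3f}")
print(f"empirical eps_g : {eps_empirical:.4f}")
print(f"arccos(R)/pi    : {np.arccos(R) / np.pi:.4f}")
```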
example: perceptron training
• entropy: all student weight vectors with order parameter R form a hypersphere with radius ~ √(1-R²) and volume ~ (1-R²)^{N/2}, hence
  s(R) = (1/2) ln(1-R²)  (+ irrelevant constants)
- or: exponential representation of the δ-functions + saddle point integration ...
note: the result carries over to a more general overlap matrix C (many students and teachers)
• high-T free energy:
  β f(R) = α̃ ε_g(R) - s(R),  with the re-scaled number of examples α̃ = β P/N
example: perceptron training
• "physical state": the (arg-)minimum of β f(R) with respect to R
• typical learning curves R(α̃) and ε_g(α̃)
• perfect generalization (R → 1, ε_g → 0) is achieved only asymptotically, for α̃ → ∞
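A minimal sketch of the resulting high-temperature learning curve, combining ε_g(R) = arccos(R)/π and s(R) = ½ ln(1-R²) from the preceding slides; the grid-based minimization and the chosen α̃ values are arbitrary:

```python
import numpy as np

# minimize beta*f(R) = alpha_tilde * eps_g(R) - s(R) over the order parameter R
eps_g = lambda R: np.arccos(R) / np.pi
s     = lambda R: 0.5 * np.log(1.0 - R**2)

R_grid = np.linspace(-0.9999, 0.9999, 20001)   # crude grid minimization

for alpha_tilde in (0.5, 1.0, 2.0, 5.0, 10.0):
    f = alpha_tilde * eps_g(R_grid) - s(R_grid)
    R_star = R_grid[np.argmin(f)]
    print(f"alpha_tilde = {alpha_tilde:5.1f}   "
          f"R = {R_star:.3f}   eps_g = {eps_g(R_star):.3f}")
```

As expected from the slide, R approaches 1 and ε_g decays towards 0 only as α̃ grows large.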
perceptron learning curve
• a very simple model:
  - linearly separable rule (teacher)
  - i.i.d. isotropic random data
  - high temperature stochastic training
  with perfect generalization only for α̃ → ∞
typical learning curve: on average over random linearly separable data sets of a given size
• Modifications/extensions:
  - noisy data, unlearnable rules
  - low-T results (annealed, replica ...)
  - unsupervised learning
  - structured input data (clusters)
  - large margin perceptron and SVM
  - variational optimization of the energy function (i.e. of the training algorithm)
  - binary weights ("Ising perceptron")
example: Ising perceptron
• student: binary weights w_j ∈ {-1, +1}
• teacher: binary weights B_j ∈ {-1, +1}
• generalization error unchanged: ε_g = arccos(R)/π with R = w·B / N
• entropy: from the probability for alignment/misalignment of the components, i.e. the entropy of mixing N(1+R)/2 aligned and N(1-R)/2 misaligned components:
  s(R) = - [(1+R)/2] ln[(1+R)/2] - [(1-R)/2] ln[(1-R)/2]
example: Ising perceptron
• competing minima in β f(R) = α̃ ε_g(R) - s(R):
  - for small α̃: the poor-generalization minimum at R < 1 is the global one
  - for intermediate α̃: co-existing phases of poor/perfect generalization; the lower minimum is stable, the higher minimum is meta-stable
  - for large α̃: only one minimum remains (R = 1)
• "first order phase transition" to perfect generalization: the "system freezes" in the R = 1 state
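A sketch of the competing minima, combining the expressions from the two preceding slides (the α̃ values and grid are arbitrary; the R = 1 state always has β f = 0, so the sign of β f at the poor-generalization minimum decides which phase is globally stable):

```python
import numpy as np

# beta*f(R) = alpha_tilde * arccos(R)/pi - s(R), with the entropy of mixing
# s(R) = -[(1+R)/2] ln[(1+R)/2] - [(1-R)/2] ln[(1-R)/2] of the binary components
def s_ising(R):
    p, q = (1 + R) / 2, (1 - R) / 2
    return -(p * np.log(p) + q * np.log(q))

def beta_f(R, alpha_tilde):
    return alpha_tilde * np.arccos(R) / np.pi - s_ising(R)

R_grid = np.linspace(1e-4, 1 - 1e-9, 20001)

def interior_minimum(f):
    """Index of the first interior local minimum on the grid, if any."""
    idx = np.where((f[1:-1] < f[:-2]) & (f[1:-1] < f[2:]))[0]
    return idx[0] + 1 if idx.size else None

for alpha_tilde in (1.0, 1.5, 2.0, 3.0):
    f = beta_f(R_grid, alpha_tilde)
    i = interior_minimum(f)
    if i is None:
        print(f"alpha_tilde = {alpha_tilde}: only the R -> 1 minimum remains")
    else:
        label = "global" if f[i] < 0 else "metastable"
        print(f"alpha_tilde = {alpha_tilde}: poor-generalization minimum at "
              f"R = {R_grid[i]:.3f}, beta*f = {f[i]:+.3f} "
              f"({label}; the R = 1 branch has beta*f = 0)")
```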
first order phase transition
[free energy branches: global vs. local (meta-stable) minimum, equal f at the transition; Monte Carlo results (no prior knowledge) display the corresponding finite size effects]
the results carry over (qualitatively) to low (zero) temperature training, e.g. the nature of the phase transitions etc.
soft committee machine
teacher: N input units, M hidden units; adaptive student: N input units, K hidden units
student output: sum of the hidden unit activations, without adaptive hidden-to-output weights
training: minimization of the data-dependent cost function H(w)
macroscopic properties of the student network are given by
• order parameters: student-teacher and student-student overlaps R_im = w_i·B_m / N, Q_ik = w_i·w_k / N
• model parameters: teacher-teacher overlaps T_mn = B_m·B_n / N
soft committee machine
exploit the thermodynamic limit N → ∞: by the CLT the hidden unit fields of student and teacher become normally distributed with zero means and a covariance matrix given by the order parameters; the generalization error becomes a function of R, Q and T only (+ constant)
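With one common convention (student fields u_i = w_i·x/√N, teacher fields v_n = B_n·x/√N, i.i.d. input components of zero mean and unit variance; normalizations differ between papers), the CLT yields jointly Gaussian fields with zero means and covariances

\[
\langle u_i u_k \rangle = \frac{\mathbf{w}_i\cdot\mathbf{w}_k}{N} = Q_{ik},
\qquad
\langle u_i v_n \rangle = \frac{\mathbf{w}_i\cdot\mathbf{B}_n}{N} = R_{in},
\qquad
\langle v_n v_m \rangle = \frac{\mathbf{B}_n\cdot\mathbf{B}_m}{N} = T_{nm}.
\]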
soft committee machine
K = M = 2: hidden unit specialization sets in at a symmetry breaking phase transition (2nd order)
K = M > 2: 1st order phase transition with metastable states
[learning curves shown for K = 2 and K = 5 (e.g.)]
soft committee machine
• initial training phase: unspecialized hidden unit weights; all student units represent the "mean teacher"
• transition to specialization makes perfect agreement between adaptive student and teacher possible
• equivalent permutations: the hidden unit permutation symmetry has to be broken
• successful training requires a critical number of examples
many hidden units
large hidden layer: the unspecialized state remains meta-stable up to very large training set sizes
perfect generalization without prior knowledge impossible with order O(NK) examples?
what's next?
network architecture and design
• activation functions (ReLU etc.)
• deep networks
• tree-like architectures as models of convolution & pooling
dynamics of network training
• on-line training by stochastic gradient descent
• mathematical description in terms of ODEs
• learning rates, momentum etc.
other topics
• regularization, e.g. drop-out, weight decay etc.
• concept drift: time-dependent statistics of data and target
... a lot more & new ideas to come