
The statistical physics of learning - revisited

Michael Biehl, Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen / NL. www.cs.rug.nl/~biehl


Presentation Transcript


  1. The statistical physics of learning - revisited. Michael Biehl, Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen / NL, www.cs.rug.nl/~biehl

  2. machine learning theory ?
     Computational Learning Theory: performance bounds & guarantees, independent of
       - specific task
       - statistical properties of data
       - details of the training ...
     Statistical Physics of Learning: typical properties & phenomena for models of specific
       - systems/network architectures
       - statistics of data and noise
       - training algorithms / cost functions ...

  3. A Neural Networks timeline
     [figure: timeline with the entries Widrow & Hoff: Adaline; SOM; LVQ; Minsky & Papert: Perceptrons; SVM. Source: www.ibm.com/developerworks/library/cc-cognitive-neural-networks-deep-dive/]
     math. analogies with the theory of disordered magnetic materials
     statistical physics
       - of network dynamics (neurons)
       - of learning processes (weights)

  4. news from the stone age of neural networks
     Statistical Physics of Neural Networks: two ground-breaking papers
     Dynamics, attractor neural networks: John Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS 79(8):2554-2558 (1982)
     Training, feed-forward networks: Elizabeth Gardner (1957-1988). The space of interactions in neural network models. J. Phys. A 21:257-270 (1988)

  5. overview
     From stochastic optimization (Monte Carlo, Langevin dynamics, ...) to thermal equilibrium (temperature, free energy, entropy, ...) (... and back): formal application to optimization
     Machine learning: typical properties of large learning systems
       - training: stochastic optimization of (many) weights, guided by a data-dependent cost function
       - models: student/teacher scenarios, randomized data ("frozen disorder")
       - analysis: order parameters, disorder average, replica trick, annealed approximation, high-temperature limit
     Examples: perceptron classifier, "Ising" perceptron, layered networks
     Outlook

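  6. stochastic optimization
     objective/cost/energy function $H(\mathbf{w})$ with many degrees of freedom $\mathbf{w} = (w_1, \dots, w_N)$: discrete, e.g. $w_j \in \{-1,+1\}$, or continuous, e.g. $\mathbf{w} \in \mathbb{R}^N$
     Metropolis algorithm (discrete case)
       • suggest a (small) change $\mathbf{w} \to \mathbf{w}'$, e.g. a "single spin flip" $w_j \to -w_j$ for a random $j$
       • compute $\Delta H = H(\mathbf{w}') - H(\mathbf{w})$
       • acceptance of the change: always if $\Delta H \le 0$, with probability $e^{-\beta \Delta H}$ if $\Delta H > 0$
       • the temperature $T = 1/\beta$ controls the acceptance rate for "uphill" moves
     Langevin dynamics (continuous case)
       • continuous temporal change, "noisy gradient descent": $\partial \mathbf{w} / \partial t = -\nabla_{\mathbf{w}} H(\mathbf{w}) + \boldsymbol{\eta}(t)$
       • with delta-correlated white noise (spatial + temporal independence): $\langle \eta_j(t)\, \eta_k(s) \rangle = 2T\, \delta_{jk}\, \delta(t-s)$
       • $T$ controls the noise level, i.e. the random deviation from the gradient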
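
The two update rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the talk: the toy energy `toy_energy` (number of -1 components), the quadratic energy behind `quadratic_grad`, the Euler step size `dt` and all function names are hypothetical choices made for the example.

```python
import math
import random

def metropolis(energy, w, beta, steps, rng):
    """Single-spin-flip Metropolis dynamics for w in {-1,+1}^N."""
    w = list(w)
    e = energy(w)
    for _ in range(steps):
        j = rng.randrange(len(w))      # pick a random component j
        w[j] = -w[j]                   # suggest the change w_j -> -w_j
        delta = energy(w) - e          # energy change of the suggested move
        if delta <= 0 or rng.random() < math.exp(-beta * delta):
            e += delta                 # accept the move
        else:
            w[j] = -w[j]               # reject: undo the flip
    return w, e

def langevin(grad, w, temperature, dt, steps, rng):
    """Euler discretization of dw/dt = -grad H(w) + white noise of strength 2T."""
    w = list(w)
    scale = math.sqrt(2.0 * temperature * dt)
    for _ in range(steps):
        g = grad(w)
        w = [wj - dt * gj + scale * rng.gauss(0.0, 1.0) for wj, gj in zip(w, g)]
    return w

def toy_energy(w):
    # Hypothetical energy: number of -1 components (ground state: all +1).
    return sum(1 for wj in w if wj < 0)

def quadratic_grad(w):
    # Gradient of the hypothetical energy H(w) = |w|^2 / 2.
    return list(w)

rng = random.Random(0)
w0 = [rng.choice((-1, 1)) for _ in range(50)]
w_cold, e_cold = metropolis(toy_energy, w0, beta=20.0, steps=5000, rng=rng)
w_hot, e_hot = metropolis(toy_energy, w0, beta=0.0, steps=5000, rng=rng)
w_gd = langevin(quadratic_grad, [1.0] * 5, temperature=0.0, dt=0.01, steps=2000, rng=rng)
```

At large `beta` the Metropolis run behaves like a pure descent, at `beta = 0` every suggested flip is accepted, and at `temperature = 0` the Langevin step reduces to plain gradient descent.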

  7. thermal equilibrium
     Markov chain (Metropolis) and continuous dynamics (Langevin) lead to the same stationary density of configurations: $P(\mathbf{w}) = e^{-\beta H(\mathbf{w})} / Z$
     normalization: "Zustandssumme", partition function $Z = \sum_{\mathbf{w}} e^{-\beta H(\mathbf{w})}$ (an integral for continuous $\mathbf{w}$)
       • Gibbs-Boltzmann density of states
       • physics: thermal equilibrium of a physical system at temperature T
       • optimization: formal equilibrium situation, control parameter T
     note: additional constraints can be imposed on the weights, for instance normalization

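  8. thermal averages and entropy
     the role of Z: thermal averages $\langle \dots \rangle_T$ in equilibrium, e.g. $\langle H \rangle_T = \sum_{\mathbf{w}} H(\mathbf{w})\, e^{-\beta H(\mathbf{w})} / Z = -\partial \ln Z / \partial \beta$, can be expressed as derivatives of ln Z
     re-write Z as an integral over all possible energies: $Z = \int dE\, \Omega(E)\, e^{-\beta E}$, with $\Omega(E)$ ~ vol. of states with energy E
     assume extensive energy, proportional to system size N: $E = N e$
     (microcanonical) entropy per degree of freedom: $s(e) = \frac{1}{N} \ln \Omega(N e)$, so that $Z \propto \int de\, e^{-N [\beta e - s(e)]}$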
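
The statement that thermal averages are derivatives of ln Z can be checked numerically on a system small enough to enumerate. The sketch below assumes a hypothetical toy energy (the number of -1 components of a binary vector) and hypothetical function names; it compares the directly computed mean energy with a finite-difference estimate of $-\partial \ln Z / \partial \beta$.

```python
import itertools
import math

def all_energies(n):
    # Hypothetical toy energy: number of -1 components of w in {-1,+1}^n.
    return [sum(1 for wj in w if wj < 0)
            for w in itertools.product((-1, 1), repeat=n)]

def ln_z(beta, energies):
    """Logarithm of the partition function, by exhaustive enumeration."""
    return math.log(sum(math.exp(-beta * e) for e in energies))

def mean_energy(beta, energies):
    """Thermal average <H>_T under the Gibbs-Boltzmann density."""
    weights = [math.exp(-beta * e) for e in energies]
    return sum(e * p for e, p in zip(energies, weights)) / sum(weights)

energies = all_energies(8)
beta, h = 0.7, 1e-5
direct = mean_energy(beta, energies)
# Central finite difference for -d ln Z / d beta.
from_ln_z = -(ln_z(beta + h, energies) - ln_z(beta - h, energies)) / (2 * h)
```

For this toy energy the components are independent, so $Z = (1 + e^{-\beta})^N$ and both numbers should agree with $N / (1 + e^{\beta})$.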

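  9. Darwin-Fowler, aka saddle point integration
     function $g(x)$ with maximum in $x_o$; consider the thermodynamic limit $N \to \infty$: $\frac{1}{N} \ln \int dx\, e^{N g(x)} \to g(x_o)$
     applied to Z: $-\frac{1}{\beta N} \ln Z$ is given by the minimum of the free energy $f = e - s(e)/\beta$, attained at $e = e_o$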
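
A small numerical check may make the saddle point argument more concrete. The sketch assumes a hypothetical toy system of N independent binary components whose energy is the number of -1 entries, so that $Z = \sum_E \binom{N}{E} e^{-\beta E} = (1 + e^{-\beta})^N$ is known exactly; the function names and the grid resolution are illustrative choices. It compares the exact $-\frac{1}{N}\ln Z$ with the minimum of $\beta e - s(e)$ and with the contribution of the single largest term of the sum.

```python
import math

def entropy(e):
    """Entropy per component for a fraction e of -1 components (binary mixing)."""
    if e <= 0.0 or e >= 1.0:
        return 0.0
    return -e * math.log(e) - (1.0 - e) * math.log(1.0 - e)

def beta_f_saddle(beta, grid=10001):
    """Minimum over e of [beta*e - s(e)] on a regular grid in [0, 1]."""
    return min(beta * (k / (grid - 1)) - entropy(k / (grid - 1))
               for k in range(grid))

def beta_f_largest_term(beta, n):
    """-(1/N) ln of the largest term in Z = sum_E binom(N, E) exp(-beta*E)."""
    def log_term(count):
        log_binom = (math.lgamma(n + 1) - math.lgamma(count + 1)
                     - math.lgamma(n - count + 1))
        return log_binom - beta * count
    return -max(log_term(k) for k in range(n + 1)) / n

beta = 1.0
exact = -math.log(1.0 + math.exp(-beta))   # -(1/N) ln Z for this toy system
saddle = beta_f_saddle(beta)
gap_small = beta_f_largest_term(beta, 100) - exact
gap_large = beta_f_largest_term(beta, 10000) - exact
```

The gap between the largest-term estimate and the exact value shrinks as N grows, which is the sense in which the sum is dominated by the states at the minimum of the free energy.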

  10. free energy and temperature
     in large systems (thermodynamic limit) ln Z is dominated by the states with minimal free energy
     $T = 1/\beta$ is the temperature at which $\langle H \rangle_T = N e_o$
     T controls the competition between
       - smaller energies
       - larger number of available states
     $T \to 0$ singles out the lowest energy (ground state); Metropolis: only down-hill moves, Langevin: true gradient descent
     $T \to \infty$: all states occur with equal probability, independent of energy; Metropolis: accept all random changes, Langevin: noise term suppresses gradient
     assumption: ergodicity (all states can be reached in the dynamics)

  11. statistical physics & optimization
     theory of stochastic optimization by means of statistical physics
       • development of algorithms (e.g. Simulated Annealing)
       • analysis of problem properties, even in absence of practical algorithms (number of ground states, minima, ...)
       • applicable in many different contexts, universality

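  12. machine learning
     special case machine learning: choice of adaptive parameters $\mathbf{w}$, e.g. all weights in a neural network, prototype components in LVQ, centers in an RBF network, ...
     cost function: $H(\mathbf{w}) = \sum_{\mu} \epsilon(\mathbf{w}; \mathbf{x}^\mu, \sigma^\mu)$, defined w.r.t. a given data set $\mathbb{D}$ as a sum over examples, i.e. feature vectors $\mathbf{x}^\mu$ and target labels $\sigma^\mu$ (if supervised)
     costs or error measure $\epsilon(\dots)$ per example, e.g. number of misclassifications
       • training: consider weights as the outcome of a stochastic optimization process
       • formal (thermal) equilibrium given by $P(\mathbf{w}) = e^{-\beta H(\mathbf{w})} / Z$
       • $\langle \dots \rangle_T$: thermal average over the training process for a particular data set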
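
As a concrete instance of such a cost function, the sketch below counts misclassifications of a linear threshold classifier on a fixed data set. Everything specific here is an assumption for illustration: the Gaussian inputs, the sizes `n` and `p`, the random `teacher` vector that provides the labels, and the function names.

```python
import random

def sign(z):
    return 1 if z >= 0 else -1

def training_energy(w, data):
    """H(w): sum over examples of the per-example error (1 per misclassification)."""
    return sum(1 for x, label in data
               if sign(sum(wj * xj for wj, xj in zip(w, x))) != label)

rng = random.Random(1)
n, p = 20, 100                                   # hypothetical sizes
teacher = [rng.gauss(0.0, 1.0) for _ in range(n)]
inputs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
data = [(x, sign(sum(tj * xj for tj, xj in zip(teacher, x)))) for x in inputs]

h_teacher = training_energy(teacher, data)       # zero by construction
h_flipped = training_energy([-tj for tj in teacher], data)
```

The energy is defined for this one data set only: the weight vector that generated the labels has zero cost, and its negative misclassifies every example.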

  13. quenched average over training data
       • note: the energy/cost function is defined for one particular data set
       • typical properties by additional average over randomized data
       • the simplest assumption: i.i.d. input vectors with i.i.d. components
       • training labels given by a target function, for instance provided by a teacher network
       • student / teacher scenarios: control the complexity of target rule and learning system, analyse training by (stochastic) optimization
       • typical properties on average over the randomized data set: derivatives of the quenched free energy ~ $\langle \ln Z \rangle_{\mathbb{D}}$ yield averages of the form $\langle \langle \dots \rangle_T \rangle_{\mathbb{D}}$

  14. average over training data: "replica trick"
     $Z^n$ for integer $n$: n non-interacting "copies" of the system (replicas)
     the quenched average introduces effective interactions between replicas ...
     saddle point integration for $\langle Z^n \rangle_{\mathbb{D}}$; the quenched free energy requires analytic continuation to $n \to 0$
     mathematical subtleties, replica symmetry-breaking, order parameter functions, ...
     Marc Mézard, Giorgio Parisi, Miguel Virasoro: Spin Glass Theory and Beyond, World Scientific (1987)
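
For reference, the identity usually meant by the "replica trick" can be written as follows. It follows from $Z^n = e^{n \ln Z} = 1 + n \ln Z + O(n^2)$; the notation ($\mathbb{D}$ for the data set, replica index $a$) is an assumption matched to the rest of this transcript, and the sum over $\mathbf{w}^a$ becomes an integral for continuous weights.

```latex
\langle \ln Z \rangle_{\mathbb{D}}
  = \lim_{n \to 0} \frac{\langle Z^n \rangle_{\mathbb{D}} - 1}{n}
  = \lim_{n \to 0} \frac{\partial}{\partial n} \ln \langle Z^n \rangle_{\mathbb{D}},
\qquad
Z^n = \prod_{a=1}^{n} \sum_{\mathbf{w}^a} e^{-\beta H(\mathbf{w}^a)}
\quad (n \in \mathbb{N}).
```

The average $\langle Z^n \rangle_{\mathbb{D}}$ is computed for integer $n$, where the product form applies, and the result is then continued to $n \to 0$.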

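  15. annealed approximation and high-T limit
     annealed approximation: $\langle \ln Z \rangle_{\mathbb{D}} \approx \ln \langle Z \rangle_{\mathbb{D}}$; becomes exact (=) in the high-temperature limit (replicas decouple)
     average in the exponent for $\beta \approx 0$
       • independent single examples
       • extensive number of examples (prop. to number of weights)
       • saddle point integration: $\langle \ln Z \rangle_{\mathbb{D}} / N$ is dominated by the minimum of the free energy, in which the generalization error plays the role of the energy (i.e. of the training error)
     "learn almost nothing ..." (high T) "... from infinitely many examples", with a finite re-scaled number of examples $\tilde{\alpha}$ (number of examples per weight, multiplied by $\beta$)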
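
The steps named on this slide can be spelled out as a short derivation. This is a sketch under assumptions, not the slide's own formulas: $P$ denotes the number of examples, $d\mu(\mathbf{w})$ the integration measure over weights (including any normalization constraint), $\epsilon_g(\mathbf{w}) = \langle \epsilon(\mathbf{w}; \mathbf{x}) \rangle_{\mathbf{x}}$ the generalization error, and $s$ the entropy per weight at fixed $\epsilon_g$.

```latex
\langle Z \rangle_{\mathbb{D}}
  = \int d\mu(\mathbf{w}) \,
    \Big\langle \prod_{\mu=1}^{P} e^{-\beta\, \epsilon(\mathbf{w}; \mathbf{x}^\mu)} \Big\rangle_{\mathbb{D}}
  = \int d\mu(\mathbf{w}) \,
    \Big[ \big\langle e^{-\beta\, \epsilon(\mathbf{w}; \mathbf{x})} \big\rangle_{\mathbf{x}} \Big]^{P}
  \approx \int d\mu(\mathbf{w}) \, e^{-\beta P\, \epsilon_g(\mathbf{w})}
  \qquad (\beta \approx 0),

-\frac{1}{N} \ln \langle Z \rangle_{\mathbb{D}}
  \;\longrightarrow\; \min \big[ \tilde{\alpha}\, \epsilon_g - s \big],
  \qquad \tilde{\alpha} = \beta P / N .
```

The second equality uses the independence of the examples, the approximation uses $\langle e^{-\beta \epsilon} \rangle \approx 1 - \beta \langle \epsilon \rangle \approx e^{-\beta \langle \epsilon \rangle}$ for small $\beta$, and the last line is the saddle point integration with $\tilde{\alpha}$ kept finite.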

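  16. example: perceptron training
       • student: $S(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{R}^N$
       • teacher: $S^*(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^* \cdot \mathbf{x})$
       • training data: inputs $\mathbf{x}^\mu$ with independent components of zero mean and unit variance, labels $S^*(\mathbf{x}^\mu)$
       • Central Limit Theorem (CLT), for large N: the projections of an input on $\mathbf{w}$ and $\mathbf{w}^*$ are normally distributed with zero mean; for normalized weights their covariance is given by the overlap $R = \mathbf{w} \cdot \mathbf{w}^* / (|\mathbf{w}|\, |\mathbf{w}^*|)$, which fully specifies the generalization error $\epsilon_g$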

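  17. example: perceptron training
       • or, more intuitively ... i.i.d. isotropic data, geometry: student and teacher disagree on inputs that fall between their two decision hyperplanes, hence $\epsilon_g = \frac{1}{\pi} \arccos R$ with the order parameter $R$ (cosine of the angle between $\mathbf{w}$ and $\mathbf{w}^*$)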
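
The geometric argument can be checked with a small Monte Carlo experiment. The sketch assumes Gaussian inputs (one convenient isotropic choice), unit-length weight vectors built to have a prescribed overlap, and hypothetical names and sample sizes; it estimates how often student and teacher disagree and compares this with $\arccos(R)/\pi$.

```python
import math
import random

def disagreement_rate(overlap, n, samples, rng):
    """Fraction of random inputs on which student and teacher labels differ."""
    # Unit vectors with cosine `overlap` between them.
    teacher = [1.0] + [0.0] * (n - 1)
    student = [overlap, math.sqrt(1.0 - overlap ** 2)] + [0.0] * (n - 2)
    errors = 0
    for _ in range(samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]   # isotropic input
        s = sum(wj * xj for wj, xj in zip(student, x))
        t = sum(wj * xj for wj, xj in zip(teacher, x))
        if (s >= 0) != (t >= 0):
            errors += 1
    return errors / samples

rng = random.Random(2)
overlap = 0.5
estimate = disagreement_rate(overlap, n=10, samples=20000, rng=rng)
prediction = math.acos(overlap) / math.pi
```

For `overlap = 0.5` the angle is $\pi/3$, so the predicted disagreement rate is 1/3, and the estimate should scatter around that value with the usual sampling error.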

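  18. example: perceptron training
       • entropy:
         - all weights with order parameter R: hypersphere with radius ~ $\sqrt{1-R^2}$, volume ~ $(1-R^2)^{N/2}$, hence $s = \frac{1}{2} \ln(1-R^2)$ (+ irrelevant constants)
         - or: exp. representation of the δ-functions + saddle point integration ...
         note: result carries over to more general C (many students and teachers)
       • high-T free energy: $\beta f = \tilde{\alpha}\, \frac{1}{\pi} \arccos R - \frac{1}{2} \ln(1-R^2)$
       • re-scaled number of examples: $\tilde{\alpha}$ (number of examples per weight, multiplied by $\beta$)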

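  19. example: perceptron training
       • "physical state": (arg-)minimum of the free energy with respect to $R$
       • typical learning curves: $R$ and $\epsilon_g$ as functions of the re-scaled number of examples [figure]
       • perfect generalization ($R \to 1$) is achieved in the limit of an infinite re-scaled number of examples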
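
The learning curve can be reproduced numerically. The sketch below assumes the high-temperature free energy $\beta f(R) = \tilde{\alpha} \arccos(R)/\pi - \frac{1}{2}\ln(1-R^2)$ for the spherical perceptron; the closed form in `r_closed_form` is obtained here by setting the derivative to zero and is not quoted from the slides. Function names and the grid size are illustrative.

```python
import math

def beta_f(r, alpha_t):
    """Assumed high-T free energy: alpha_t * arccos(r)/pi - (1/2) ln(1 - r^2)."""
    return alpha_t * math.acos(r) / math.pi - 0.5 * math.log(1.0 - r * r)

def r_numeric(alpha_t, grid=100000):
    """Arg-min of beta_f over a regular grid of r in [0, 1)."""
    return min((k / grid for k in range(grid)), key=lambda r: beta_f(r, alpha_t))

def r_closed_form(alpha_t):
    # Stationarity condition r / sqrt(1 - r^2) = alpha_t / pi, solved for r.
    return alpha_t / math.sqrt(math.pi ** 2 + alpha_t ** 2)

def eps_g(r):
    """Generalization error of the perceptron at overlap r."""
    return math.acos(r) / math.pi

# (alpha_t, R, eps_g) along the learning curve.
curve = [(a, r_closed_form(a), eps_g(r_closed_form(a)))
         for a in (0.5, 1, 2, 5, 10, 50)]
```

Under these assumptions the generalization error decreases monotonically and behaves like $1/\tilde{\alpha}$ for large $\tilde{\alpha}$, so perfect generalization is approached only asymptotically.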

  20. perceptron learning curve
       • a very simple model:
         - linearly separable rule (teacher)
         - i.i.d. isotropic random data
         - high temperature stochastic training
         with perfect generalization in the limit of an infinite re-scaled number of examples
       • Modifications/extensions:
         - noisy data, unlearnable rules
         - low-T results (annealed, replica ...)
         - unsupervised learning
         - structured input data (clusters)
         - large margin perceptron and SVM
         - variational optimization of energy function (i.e. training algorithm)
         - binary weights ("Ising Perceptron")
     [figure: typical learning curve, on average over random linearly separable data sets of a given size]

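  21. example: Ising perceptron
       • student: $S(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x})$ with binary weights $w_j \in \{-1,+1\}$
       • teacher: $S^*(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^* \cdot \mathbf{x})$ with $w^*_j \in \{-1,+1\}$
       • generalization error unchanged: $\epsilon_g = \frac{1}{\pi} \arccos R$
       • entropy: probability for alignment/misalignment $(1 \pm R)/2$; entropy of mixing $N(1+R)/2$ aligned and $N(1-R)/2$ misaligned components: $s = -\frac{1+R}{2} \ln \frac{1+R}{2} - \frac{1-R}{2} \ln \frac{1-R}{2}$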

  22. example: Ising perceptron
       • competing minima in the free energy as a function of $R$
       • up to a limiting re-scaled number of examples: co-existing phases of poor/perfect generalization; the lower minimum is stable, the higher minimum is meta-stable
       • beyond that limit: only one minimum (R=1)
     "first order phase transition" to perfect generalization; the "system freezes" in R=1
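
The competition between minima can be explored with a simple grid scan. The sketch assumes the high-temperature free energy $\beta f(R) = \tilde{\alpha} \arccos(R)/\pi - s(R)$ with the entropy of mixing for $s(R)$; the three values of $\tilde{\alpha}$, the grid size and the function names are hypothetical choices for illustration.

```python
import math

def mixing_entropy(r):
    """Entropy per weight of N(1+r)/2 aligned and N(1-r)/2 misaligned components."""
    total = 0.0
    for p in ((1.0 + r) / 2.0, (1.0 - r) / 2.0):
        if p > 0.0:
            total -= p * math.log(p)
    return total

def beta_f(r, alpha_t):
    """Assumed high-T free energy of the Ising perceptron."""
    return alpha_t * math.acos(r) / math.pi - mixing_entropy(r)

def local_minima(alpha_t, grid=2000):
    """Grid points in [0, 1] lying below all their neighbours, as (r, beta_f) pairs."""
    rs = [k / grid for k in range(grid + 1)]
    fs = [beta_f(r, alpha_t) for r in rs]
    minima = []
    for k, f in enumerate(fs):
        left_ok = k == 0 or f < fs[k - 1]
        right_ok = k == grid or f < fs[k + 1]
        if left_ok and right_ok:
            minima.append((rs[k], f))
    return minima

few = local_minima(1.0)     # hypothetical small re-scaled number of examples
more = local_minima(2.0)    # hypothetical intermediate value
many = local_minima(3.0)    # hypothetical large value
```

In this sketch the scan finds two minima for the first two settings, with the poorly generalizing state $R < 1$ lower at $\tilde{\alpha} = 1$ and the state $R = 1$ lower at $\tilde{\alpha} = 2$, while only $R = 1$ remains at $\tilde{\alpha} = 3$. For this particular free energy the two minima exchange stability at roughly $\tilde{\alpha} \approx 1.69$ and the interior minimum disappears at roughly $\tilde{\alpha} \approx 2.08$.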

  23. first order phase transition
     [figure: free energy with competing (global) and (local) minima, exchanging roles at equal f; Monte Carlo results (no prior knowledge), finite size effects]
     results carry over (qualitatively) to low (zero) temperature training: e.g. nature of phase transitions etc.

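  24. soft committee machine
     [figure: teacher and adaptive student networks with N input units and (K) student / (M) teacher hidden units]
     student: $\sigma(\mathbf{x}) = \sum_{k=1}^{K} g(\mathbf{w}_k \cdot \mathbf{x} / \sqrt{N})$, teacher: $\tau(\mathbf{x}) = \sum_{m=1}^{M} g(\mathbf{w}^*_m \cdot \mathbf{x} / \sqrt{N})$
     training: minimization of the quadratic deviation $\frac{1}{2} \sum_\mu [\sigma(\mathbf{x}^\mu) - \tau(\mathbf{x}^\mu)]^2$
     macroscopic properties of the student network:
       order parameters: $R_{km} = \mathbf{w}_k \cdot \mathbf{w}^*_m / N$, $Q_{kl} = \mathbf{w}_k \cdot \mathbf{w}_l / N$
       model parameters: $T_{mn} = \mathbf{w}^*_m \cdot \mathbf{w}^*_n / N$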
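
The macroscopic description can be illustrated by computing the overlap matrices for random weight vectors. The sketch is built on assumptions: the activation `g` (an error function, chosen here because the transcript does not preserve the one used), hidden-to-output weights fixed to 1, the sizes `n`, `k`, `m`, and all names are hypothetical.

```python
import math
import random

def g(z):
    # Assumed sigmoidal activation; the transcript does not specify one.
    return math.erf(z / math.sqrt(2.0))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def scm_output(weights, x):
    """Soft committee machine: sum of hidden-unit activations."""
    n = len(x)
    return sum(g(dot(w, x) / math.sqrt(n)) for w in weights)

def overlaps(a_vectors, b_vectors):
    """Matrix of normalized overlaps a_i . b_j / N."""
    n = len(a_vectors[0])
    return [[dot(a, b) / n for b in b_vectors] for a in a_vectors]

rng = random.Random(3)
n, k, m = 200, 3, 2                      # hypothetical sizes: N inputs, K and M hidden units
student = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
teacher = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

R = overlaps(student, teacher)           # K x M student-teacher overlaps
Q = overlaps(student, student)           # K x K student-student overlaps
T = overlaps(teacher, teacher)           # M x M teacher-teacher overlaps

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
deviation = 0.5 * (scm_output(student, x) - scm_output(teacher, x)) ** 2
```

The point of the order parameters is the reduction in dimension: the K·N student weights enter the theory only through the K·M + K(K+1)/2 independent entries of `R` and `Q`.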

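  25. soft committee machine
     exploit thermodynamic limit, CLT for the hidden unit projections $\mathbf{w}_k \cdot \mathbf{x} / \sqrt{N}$ and $\mathbf{w}^*_m \cdot \mathbf{x} / \sqrt{N}$: normally distributed with zero means and covariance matrix $C$ composed of the order parameters
     entropy: $s = \frac{1}{2} \ln \det C$ (+ constant)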

  26. soft committee machine
     K=M=2: symmetry breaking phase transition (2nd order)
     K=M>2: 1st order phase transition with metastable states
     hidden unit specialization [figure panels: K=2, K=5 (e.g.)]

  27. soft committee machine
     [figure: adaptive student and teacher networks]
       • initial training phase: unspecialized hidden unit weights: all student units represent the "mean teacher"
       • transition to specialization makes perfect agreement possible

  28. soft committee machine
     [figure: adaptive student and teacher networks, equivalent permutations]
       • initial training phase: unspecialized hidden unit weights: all student units represent the "mean teacher"
       • transition to specialization makes perfect agreement possible
       • hidden unit permutation symmetry has to be broken
       • successful training requires a critical number of examples
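
The permutation symmetry and the "mean teacher" state can be made explicit in a few lines. The sketch assumes an error-function activation and hidden-to-output weights fixed to 1 (neither is preserved in the transcript), and all sizes and names are hypothetical.

```python
import itertools
import math
import random

def g(z):
    # Assumed sigmoidal activation; the transcript does not specify one.
    return math.erf(z / math.sqrt(2.0))

def scm_output(weights, x):
    """Soft committee machine: sum of hidden-unit activations."""
    n = len(x)
    return sum(g(sum(wj * xj for wj, xj in zip(w, x)) / math.sqrt(n))
               for w in weights)

rng = random.Random(4)
n, k = 50, 3                                     # hypothetical sizes
teacher = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Every relabelling of the hidden units realizes the same input-output function.
reference = scm_output(teacher, x)
permuted = [scm_output([teacher[i] for i in perm], x)
            for perm in itertools.permutations(range(k))]

# Unspecialized student: every hidden unit carries the same "mean teacher" vector.
mean_teacher = [sum(column) / k for column in zip(*teacher)]
unspecialized = [list(mean_teacher) for _ in range(k)]
student_teacher_overlaps = [[sum(a * b for a, b in zip(w, t)) / n for t in teacher]
                            for w in unspecialized]
```

In the unspecialized state all rows of the student-teacher overlap matrix coincide, so no student unit is tied to a particular teacher unit; specialization means picking one of the K! equivalent assignments.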

  29. many hidden units
     large hidden layer: unspecialized state remains meta-stable up to ...
     perfect generalization without prior knowledge impossible with order O(NK) examples ?

  30. what's next ?
     network architecture and design
       • activation functions (ReLU etc.)
       • deep networks
       • tree-like architectures as models of convolution & pooling
     dynamics of network training
       • online training by stochastic g.d.
       • math. description in terms of ODE
       • learning rates, momentum etc.
     other topics
       • regularization, e.g. drop-out, weight decay etc.
       • concept drift: time-dependent statistics of data and target
     ... a lot more & new ideas to come
