Efficient Training in high-dimensional weight space
Christoph Bunzmann, Robert Urbanczik, Michael Biehl
Theoretische Physik und Astrophysik / Computational Physics, Julius-Maximilians-Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany
http://theorie.physik.uni-wuerzburg.de/~biehl
Wiskunde & Informatica, Intelligent Systems, Rijksuniversiteit Groningen, Postbus 800, NL-9718 DD Groningen, The Netherlands
biehl@cs.rug.nl, www.cs.rug.nl/~biehl
Efficient training in high-dimensional weight space
· Learning from examples, a model situation: layered neural networks, student-teacher scenario
· The dynamics of on-line learning: on-line gradient descent; delayed learning, plateau states
· Efficient training of multilayer networks, learning by Principal Component Analysis: idea, analysis, results
· Summary, outlook: selected further topics, prospective projects
Learning from examples
Supervised learning: the choice of adjustable parameters in adaptive information processing systems, based on example data, e.g. input/output pairs in classification tasks, time series prediction, regression problems.
• parameterizes a hypothesis, e.g. for an unknown classification or regression task
• is guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data
• results in generalization ability, e.g. the successful classification of novel data
Theory of learning processes
· general results, e.g. performance bounds: independent of the specific task, the statistical properties of the data, the details of the training procedure, ...
· typical properties of model scenarios, e.g. learning curves: defined by the network architecture, the statistics of data and noise, and the learning algorithm; aims at the understanding/prediction of relevant phenomena and at algorithm design; trade-off: general validity vs. applicability
· description of specific applications, e.g. hand-written digit recognition: a given real-world problem, a particular training scheme, a special set of example data, ...
A two-layered network: the soft committee machine (SCM)
Input data ξ ∈ R^N, adaptive weight vectors w_j ∈ R^N for the K hidden units, fixed hidden-to-output weights.
Input/output relation: σ(ξ) = Σ_{j=1..K} g(w_j · ξ) with a sigmoidal hidden activation, e.g. g(x) = erf(a x).
The SCM with adaptive thresholds is a universal approximator.
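A minimal sketch of this input/output relation in code, assuming the form given above (unit hidden-to-output weights, activation g(x) = erf(a x)); the gain a and all names are illustrative choices:

import numpy as np
from scipy.special import erf

def scm_output(W, xi, a=1.0):
    # W: (K, N) adaptive weight vectors, xi: (N,) input; fixed hidden-to-output weights = 1
    fields = W @ xi                  # hidden unit fields w_j . xi
    return np.sum(erf(a * fields))   # sigmoidal activation, summed over hidden units

# toy usage: K = 3 hidden units, N = 100 input dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 100)) / np.sqrt(100)
print(scm_output(W, rng.normal(size=100)))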
Student-teacher scenario
The teacher network with M hidden units provides the (best) parameterization of the unknown rule; the adaptive student network with K hidden units is trained on examples of the teacher's input/output behaviour.
Relevant cases with interesting effects:
· K < M: unlearnable rule
· K > M: over-sophisticated student
· K = M: ideal situation, perfectly matching complexity
Examples for the unknown function or rule: input/output pairs {ξ^μ, τ(ξ^μ)}, here with reliable (noise-free) teacher outputs.
Training is based on the performance with respect to the example data, e.g. a cost function summing the student's error over all P examples.
Evaluation after training: the generalization error ε_g, the expected error for a novel input with respect to the density of inputs / a set of test inputs.
Statistical Physics approach
• consider large systems in the thermodynamic limit N → ∞ (K, M « N): N is the dimension of the input data and sets the number of adjustable parameters (K·N)
• perform averages over the stochastic training process and over the randomized example data (quenched disorder); (technically) simplest case: reliable teacher outputs, isotropic input density with independent components of zero mean / unit variance
• description in terms of macroscopic quantities, e.g. the overlap parameters R_jm = w_j · B_m and Q_ij = w_i · w_j as student/teacher similarity measures
• evaluate typical properties, e.g. the learning curve
next: the generalization error ε_g
The generalization error
The hidden unit fields x_j = w_j · ξ and y_m = B_m · ξ are sums of many random numbers; by the Central Limit Theorem they become correlated Gaussians for large N, with first and second moments given by the overlap parameters.
Averages over the K·N microscopic weights thus reduce to integrals over the ½(K²+K) + K·M macroscopic quantities {Q_ij, R_jm}; in particular, ε_g is a function of the overlaps only.
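As an illustration, ε_g can also be estimated by Monte Carlo sampling over the isotropic input density; a minimal sketch assuming the quadratic error measure ε_g = ½ <(σ(ξ) - τ(ξ))²>, not the closed-form overlap expression used in the analytical treatment:

import numpy as np
from scipy.special import erf

def scm(W, X, a=1.0):
    # X: (n_samples, N) inputs; returns one SCM output per row
    return erf(a * (X @ W.T)).sum(axis=1)

def generalization_error(W, B, n_test=20000, a=1.0, seed=1):
    # W: student weights (K, N), B: teacher weights (M, N)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_test, W.shape[1]))   # isotropic inputs, zero mean / unit variance
    return 0.5 * np.mean((scm(W, X, a) - scm(B, X, a)) ** 2)

# For large N this estimate depends, to leading order, only on the overlaps
# Q_ij = w_i . w_j, R_jm = w_j . B_m and T_mn = B_m . B_n.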
Dynamics of on-line gradient descent
Presentation of single examples: in each step μ a novel, random example is presented and the weights after presentation of μ examples are updated by a gradient step on the error for that example only; μ = number of examples, discrete learning time.
Practical advantages: · no explicit storage of all examples required · little computational effort per example
Mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples, yielding coupled ODEs for {R_jm, Q_ij} in the continuous time α = P/(KN).
From the gradient step one obtains recursions for the overlaps, e.g. for R_jm and Q_ij, in terms of the projections x_j = w_j · ξ and y_m = B_m · ξ. For large N:
• the average over the latest example becomes a Gaussian average over these projections
• the mean recursions turn into coupled ODEs in the continuous time α = P/(KN), i.e. the number of examples per weight (training time)
Their solution yields the learning curve ε_g(α).
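A simulation sketch of the on-line gradient descent itself (the ODE description is its N → ∞ average); the quadratic single-example error, the η/N step-size scaling and g(x) = erf(x/√2) are standard but assumed choices here:

import numpy as np
from scipy.special import erf

def online_gd(N=500, K=2, eta=1.5, alpha_max=50.0, seed=0):
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(K, N)); B /= np.linalg.norm(B, axis=1, keepdims=True)  # teacher, T_mn ~ delta_mn
    W = rng.normal(size=(K, N)) / np.sqrt(N)      # random init: overlaps of order 1/sqrt(N)
    g  = lambda x: erf(x / np.sqrt(2.0))
    dg = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-x ** 2 / 2.0)
    for mu in range(int(alpha_max * K * N)):
        xi = rng.normal(size=N)                   # novel random example
        x, y = W @ xi, B @ xi                     # student / teacher fields
        delta = g(y).sum() - g(x).sum()           # teacher output minus student output
        W += (eta / N) * delta * dg(x)[:, None] * xi[None, :]   # on-line gradient step
    return W @ B.T, W @ W.T                       # overlaps R and Q after P = alpha_max*K*N examples

R, Q = online_gd()
print(R, Q, sep="\n")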
Learning curve (example: K = M = 2, learning rate η = 1.5, R_ij(0) ≈ 0): ε_g vs. α = P/(KN) shows a fast initial decrease, then a quasi-stationary plateau state with unspecialized student weights (all overlaps approximately equal) which dominates the learning process, before perfect generalization is approached. [Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769]
Evolution of the overlap parameters (example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0): during the plateau R_11, R_22, R_12, R_21 as well as Q_11, Q_22 and Q_12 = Q_21 remain grouped together, reflecting the permutation symmetry of the branches in the student network; specialization sets in only once this symmetry is broken.
Monte Carlo simulations confirm self-averaging in the limit N → ∞: for a quantity such as Q_jm, both the deviation of its mean from the theoretical prediction and its standard deviation across runs shrink with increasing N (corrections of order 1/N).
Plateau length
Assume a randomized initialization of the weight vectors: the number of examples needed for successful learning diverges! Hidden unit specialization requires a priori knowledge in the form of initial macroscopic overlaps; the plateau length is infinite exactly if all order parameters are self-averaging.
Is this a property of the learning scenario, a necessary phase of training, or an artifact of the training prescription?
S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.), Backpropagation: Theory, Architectures, and Applications
Training by Principal Component Analysis
Problem: delayed specialization in the (K·N)-dimensional weight space.
Idea:
A) identification (approximation) of the subspace spanned by the teacher vectors
B) actual training within this low-dimensional space
Example: soft committee teacher (K = M), isotropic input density. A modified correlation matrix of the data has an eigenvalue spectrum that splits into three groups: 1 eigenvector, (K-1) eigenvectors, and (N-K) bulk eigenvectors.
A) empirical estimate of the modified correlation matrix from a limited data set:
· determine the eigenvector of the 1 largest eigenvalue
· determine the eigenvectors of the (K-1) smallest eigenvalues
B) specialization in the K-dimensional space spanned by these eigenvectors:
· representation of the student weights by K² « K·N coefficients
· optimization of these coefficients w.r.t. the training error E (# of examples P = α·N·K » K²)
Note: the required memory ~ N² does not increase with P.
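Step A) can be sketched in code; the specific modified correlation matrix of the reference is not reproduced here, so the output weighting h(τ) = τ² is an illustrative placeholder and all names are hypothetical:

import numpy as np

def teacher_subspace(X, tau, K, h=lambda t: t ** 2):
    # X: (P, N) inputs, tau: (P,) teacher outputs
    P, N = X.shape
    C = (X * h(tau)[:, None]).T @ X / P          # C = (1/P) sum_mu h(tau_mu) xi_mu xi_mu^T, an N x N matrix
    evals, evecs = np.linalg.eigh(C)             # eigenvalues in ascending order
    basis = np.column_stack([evecs[:, -1]] + [evecs[:, i] for i in range(K - 1)])
    return evals, basis                          # (N, K) approximate basis of the teacher subspace

# Step B (not shown): expand the K student weight vectors in this basis
# (K^2 coefficients) and optimize the coefficients w.r.t. the training error.
# Note that C has fixed size N x N, so the memory demand does not grow with P.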
Typical properties, given a random set of P = α·N·K examples:
A) the typical overlap of the extracted eigenvectors with the teacher weights measures the success of the teacher-space identification
B) given this overlap, determine the optimal ε_g achievable by a linear combination of the extracted eigenvectors
Tools: formal partition sum, quenched free energy, replica trick, saddle point integration in the limit N → ∞.
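Schematically, the standard steps of such a quenched average read as follows (generic notation with training error E(J), inverse temperature β and student parameters J; the details of the specific calculation in the reference may differ):

Z = \int dJ \, e^{-\beta E(J)}, \qquad
f = -\lim_{N \to \infty} \frac{1}{\beta N} \langle \ln Z \rangle_{\{\xi^\mu, \tau^\mu\}}, \qquad
\langle \ln Z \rangle = \lim_{n \to 0} \frac{\langle Z^n \rangle - 1}{n},

where ⟨Z^n⟩ is computed for integer n and evaluated by saddle point integration in the limit N → ∞.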
Results for P = α·K·N examples (K = 3; Statistical Physics theory and Monte Carlo simulations, N = 400 and N = 1600):
A) success of the teacher-space identification, B) optimal ε_g achievable by a linear combination of the extracted eigenvectors.
A transition from unspecialized to specialized students occurs at a critical training time α_c which is independent of N: α_c(K=2) = 4.49, α_c(K=3) = 8.70; large-K theory: α_c(K) ~ 2.94 K.
Specialization is achieved without a priori knowledge.
[Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)]
Potential application: model selection
Spectrum of the matrix C_P for a teacher with M = 7 hidden units: the K-1 = 6 smallest eigenvalues split off from the bulk. The algorithm requires no prior knowledge of M; the PCA spectrum hints at the required model complexity.
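A hypothetical way to turn this observation into an automatic estimate is to count the eigenvalues that split off below the bulk of the spectrum; the gap heuristic below is illustrative and not the criterion of the original work:

import numpy as np

def estimate_hidden_units(evals, rel_gap=0.05):
    # evals: spectrum of the (modified) correlation matrix C_P
    evals = np.sort(evals)
    bulk = np.median(evals)                          # proxy for the bulk eigenvalue
    n_small = int(np.sum(evals < (1.0 - rel_gap) * bulk))
    return n_small + 1                               # K-1 split-off eigenvalues suggest K = M hidden units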
Summary
· model situation, supervised learning: the soft committee machine, student-teacher scenario, randomized training data
· statistical physics inspired approach: large systems, thermal (training) and disorder (data) averages, typical macroscopic properties
· dynamics of on-line gradient descent: delayed learning due to necessary symmetry-breaking specialization processes
· efficient training: the PCA-based learning algorithm reduces the dimensionality of the problem; specialization without a priori knowledge
Further topics
· perceptron training (single layer): optimal stability classification, dynamics of learning
· unsupervised learning: principal component analysis; competitive learning, clustered data
· non-trivial statistics of data: learning from noisy data, time-dependent rules
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· specialization processes: discontinuous learning curves, delayed learning, plateau states
· algorithm design: variational method, optimal algorithms, construction algorithm
Selected prospective projects
· algorithm design: variational optimization, e.g. an alternative correlation matrix
· application-relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines
· unsupervised learning: density estimation, feature detection, clustering, (Learning) Vector Quantization, compression, self-organizing maps
· model selection: estimate the complexity of a rule or a mixture density