Tutorial: Mathematical Aspects of Neural Networks
Barbara Hammer, University of Osnabrück, Thomas Villmann, University of Leipzig, Germany
Mathematical aspects of neural networks

Relevance of math for NNs
• math is used to
  • develop and present algorithms (linear algebra, analysis, statistics, optimization, control theory, statistical physics, differential geometry, ...)
  • investigate applicability, evaluate algorithms (statistics, ...)
  • investigate theoretical properties (Algebra, Borsuk theorem, Christoffel symbols, Differential topology, Entropy, Functional analysis, ...)

Relevance of math in NN-history
• 1943: McCulloch/Pitts
• 1958: Rosenblatt
• 1969: Minsky/Papert
• 1985: Hopfield-networks for TSP
• Backprop
• SVM
... mathematical theory and mathematical questions established for most classical models

Classical models
• Feed-forward networks
• Recurrent networks
• Self-organizing maps
Contributions to this session:
• Cottrell, Letremy: Analyzing qualitative variables using the Kohonen algorithm
• Claussen, Villmann: Magnification control in winner relaxing neural gas
• Archambeau, Lee, Verleysen: On convergence problems of the EM algorithm for finite Gaussian mixtures
• Schiller, Steil: On the weight dynamics of recurrent learning
• Jain, Wysotzki: A neural graph algorithm based on local invariants
• Jain, Wysotzki: An associative memory for the automorphism group of structures
• Jianyu Li, Siwei Luo, Yingjian Qi: Approximation of functions by adaptively growing radial basis function neural networks

Feed-forward networks
perceptron, MLP, RBF-networks, SVM ... for classification and function approximation
• 1: Universal approximation ability (architecture) – Does there exist an appropriate architecture for every function which is to be approximated?
• 2: Complexity of training (optimization) – What do good error minimization algorithms look like and what is their complexity?
• 3: Learnability (test) – Can generalization to previously unseen examples be guaranteed?

1: Universal approximation (feed-forward networks)
• Perceptron solves only linearly separable problems: [1969] Minsky/Papert
• MLPs constitute universal approximators: [1989] Hornik/Stinchcombe/White, [1993] Leshno et al.
• RBFs constitute universal approximators: [1990] Girosi/Poggio, [1993] Park/Sandberg
• SVMs constitute universal approximators: [2001] Steinwart, [2003] Hammer/Gersmann

1: Universal approximation (feed-forward networks)
number of neurons, rates of convergence, ... ?
• [1992] Sontag: n neurons are sufficient to interpolate n points for output dimension 1
• [1992] Jones, [1993] Barron: convergence of order 1/n for appropriate functions
• [1995] Girosi, [1997] Gurvits/Koiran, [1997] Kurkova/Kainen/Kreinovich, [1998] Kurkova/Savicky/Hlavackova, [2002] Lavretsky, ...
• ... this session: Jianyu Li, Siwei Luo, Yingjian Qi: determine number of neurons during training

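The interpolation statement can be made concrete in the simplest possible case. The following is a minimal sketch for one-dimensional inputs with hard-threshold hidden units; it is an illustration only, not Sontag's construction, and the names `build_step_network` and `evaluate` are hypothetical. It passes exactly through n points using n - 1 hidden threshold neurons plus an output bias.

```python
def build_step_network(points):
    """Return (bias, units) for a one-hidden-layer threshold network
    that interpolates the given (x, y) pairs (x values must be distinct)."""
    pts = sorted(points)
    bias = pts[0][1]  # the output bias reproduces the leftmost target
    units = []
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        threshold = (x0 + x1) / 2.0          # unit switches on between neighbours
        units.append((threshold, y1 - y0))   # output weight = jump in the target
    return bias, units


def evaluate(network, x):
    """Network output: bias plus the output weights of all active units."""
    bias, units = network
    return bias + sum(weight for threshold, weight in units if x >= threshold)


data = [(0.0, 1.0), (1.0, -2.0), (2.5, 0.5), (4.0, 3.0)]
net = build_step_network(data)
```

Evaluating `evaluate(net, x)` at each training input reproduces the corresponding target; between the points the output is piecewise constant, so this says nothing about approximation quality away from the data.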
2: Complexity of training (feed-forward networks)
• Perceptron training is polynomial (Karmarkar algorithm) ... investigation of the perceptron algorithm
• SVM training is polynomial (quadratic optimization of the dual problem) ... properties of online solutions, decomposition schemes
• MLP training is NP-hard – [1988] Blum/Rivest, [1990] Judd ... design of alternative learning algorithms, investigation of more realistic scenarios

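For readers who have not seen it, the perceptron algorithm mentioned above can be sketched in a few lines. This is the classical mistake-driven update rule, not the linear-programming route (Karmarkar) behind the polynomial bound; the function name, the epoch limit and the two toy data sets are assumptions made for illustration.

```python
def train_perceptron(samples, max_epochs=100):
    """Classical perceptron rule for labels in {-1, +1}.
    Returns (weights, bias, converged)."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass without error: data is separated
            return w, b, True
    return w, b, False


AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
XOR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
```

On `AND_DATA` the loop stops with `converged == True`; on `XOR_DATA` it exhausts `max_epochs`, because no separating hyperplane exists.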
2: Complexity of training
MLP training is NP-hard
• [1988] Blum/Rivest: 3-node network, [1990] Judd: network which encodes SAT (the loading problem)
• [1995] Pinter, [1998] Hammer: more than one layer and two neurons, neurons related to training size, varying number of hidden neurons
• [1996] Sima, [1997] Jones, [1997] Vu, [1998] Hammer: sigmoidal settings
• [1995] Hoeffgen/Simon/VanHorn, [2002] Bartlett/Ben-David, [2002] Sima, [2003] DasGupta/Hammer: approximate optimization is NP-hard in several settings, even for one neuron
• [2000] Ben-David/Simon: training a neuron optimum with large margin is polynomial
• ... any other idea?
Callouts on the slide: too small; too specific; activation function should be sigmoidal; approximate settings

3: Learnability (feed-forward networks)
• statistical physics ... measures the mean effects of online algorithms (Opper, ...)
• identification in the limit ... a regularity can be learned exactly in the limit (Gold, ...)
• PAC learnability ... at least one good learning-algorithm exists, thereby: valid generalization with high probability and guaranteed bounds ([1984] Valiant)
• UCEM property ... the empirical error converges uniformly to the real error with high probability, guaranteed bounds can be derived ([1971] Vapnik/Chervonenkis)

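Written out, the UCEM (uniform convergence of empirical means) property reads as follows. The notation is an assumption and does not come from the slides: $\mathcal{F}$ is the function class, $E(f)$ the real error of $f$, and $\hat{E}_m(f)$ its empirical error on $m$ i.i.d. examples.

```latex
\forall \epsilon > 0:\quad
\lim_{m \to \infty}
P\Bigl( \sup_{f \in \mathcal{F}} \bigl| \hat{E}_m(f) - E(f) \bigr| > \epsilon \Bigr) = 0
```

The supremum inside the probability is what makes the convergence uniform: the bound has to hold for all functions of the class simultaneously, not just for one fixed function.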
3: Learnability (feed-forward networks)
Statistical learning theory (in three lines ...):
• PAC ↔ finite covering number
• UCEM ↔ appropriate empirical covering
• distribution independent learnable ↔ finite VC dimension
Results:
• [1989] Baum/Haussler: link NNs to VC-theory and estimate VC-dim of perceptron-networks
• [1994] Maass: lower bound for perceptron networks
• [1992] Sontag: several ugly examples
• [1993] Macintyre/Sontag: VC of sigmoidal networks is finite
• [1995] Karpinski/Macintyre: estimate of VC of sigmoidal networks
• [1997] Koiran/Sontag: lower bound of VC of sigmoidal networks
• [1992] Haussler, [...] Bartlett et al., [2002] Schmitt, ...
• VC dim of SVM scales with the margin
• luckiness framework for structural risk minimization
ESANN’03

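To make the notion of VC dimension tangible, the sketch below enumerates every labeling that a perceptron with a single real input, x -> sign(w*x + b), can produce on a finite point set and checks whether the set is shattered. The function names are hypothetical; the check is exact for this tiny class because only the position of the threshold relative to the points and the orientation matter.

```python
def threshold_labelings(points):
    """All labelings of `points` (distinct reals) realizable by a
    one-input perceptron, i.e. by a threshold with either orientation."""
    xs = sorted(points)
    # one cut left of all points, one between each neighbouring pair, one right of all
    cuts = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    labelings = set()
    for t in cuts:
        for orientation in (1, -1):
            labelings.add(tuple(orientation if x > t else -orientation for x in points))
    return labelings


def is_shattered(points):
    """True if every one of the 2^n labelings is realizable."""
    return len(threshold_labelings(points)) == 2 ** len(points)
```

For example, `is_shattered([0.0, 1.0])` holds while `is_shattered([0.0, 1.0, 2.0])` does not: the alternating labeling (+, -, +) is out of reach, consistent with a VC dimension of 2 for this class.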
Feed-forward networks
formalized / positive results / ongoing work
• Universal approximation ability
• Complexity of training
• Learnability

Recurrent networks
NARX/TDNN, Elman, Hopfield; local/global, discrete/co
... tasks: sequence prediction, sequence transduction, sequence generation, associative memory, optimization, binding and grouping, computation
1: Approximation ability and capacity
• on a finite time horizon
• as associative memory
• as operators on functions
• as computation device
• as dynamic systems
2: Complexity of training/design of training algorithms
• error optimization – Hebbian learning – energy function – stability constraints
• complexity – numeric – dynamic properties – potential
3: Learnability
• ...
ESANN’02

2: Training recurrent networks
• Long term dependencies: [1994] Bengio/Simard/Frasconi
• design of learning algorithm: RTRL, BPTT, LSTM, EKF, EM-approaches, recirculation, ...
• [2000] Atiya/Parlos: unification and one new approach
• ... this session: Schiller/Steil: mathematical investigation of this algorithm

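The long-term dependency problem can be seen already in a one-neuron toy model. The sketch below assumes the scalar recurrence h_{t+1} = tanh(w * h_t), which is an illustrative choice and not a model from the cited papers, and accumulates the derivative of the final state with respect to the initial state by the chain rule.

```python
import math


def gradient_through_time(w, h0, steps):
    """d h_T / d h_0 for the scalar recurrence h_{t+1} = tanh(w * h_t)."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule: tanh'(z) = 1 - tanh(z)^2
    return grad
```

Every factor has absolute value at most |w|, so for |w| < 1 the gradient shrinks at least geometrically with the number of steps: with `w = 0.5` it is already below 1e-14 after 50 steps, which leaves a gradient-based learner almost no signal about early inputs.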
2: Training recurrent networks
Stability
• guarantees for global or local stability via linear matrix inequalities: [1997,2000] Suykens/Vandewalle, [1999-2002] Steil et al., [2002] Liao/Chen/Sanchez, ...
• local/global stability and convergence rates for fully connected RNNs: [2001] Wersing/Beyn/Ritter, [2002] Chen/Lu/Amari, [2002] Chen, [2002] Peng/Qia/Xu, ...
• ... this session: Schiller/Steil

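As a point of comparison for the stability results, here is a much cruder sufficient condition than the LMI-based criteria cited above: for the input-free network x -> tanh(W x), a maximum absolute row sum of W below 1 makes the update a contraction in the max-norm (tanh is 1-Lipschitz), so all trajectories converge to the same fixed point. The weight matrix and function names are hypothetical.

```python
import math


def step(W, x):
    """One update x -> tanh(W x) of a fully connected network without inputs."""
    return [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in W]


def max_row_sum(W):
    """Matrix norm induced by the max-norm on vectors."""
    return max(sum(abs(wij) for wij in row) for row in W)


def run(W, x, steps):
    """Iterate the network dynamics for a fixed number of steps."""
    for _ in range(steps):
        x = step(W, x)
    return x


W = [[0.5, -0.3], [0.2, 0.4]]  # hypothetical weights, max row sum 0.8 < 1
```

Starting `run(W, ...)` from two different initial states and comparing the results after a few hundred steps shows both trajectories collapsing onto the origin. The condition is far from necessary, which is one reason the finer LMI criteria are of interest.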
2: Training recurrent networks
Training based on stable states
• [1997] Lee: storing sequences in Hopfield-type networks
• [2002] Welling/Hinton: mean-field Boltzmann machines
• [2002] Weng/Steil: training CLM
• [2001] Li/Lee: invariant matching
• [2002] Dang/Xu, [2002] Talavan/Yanez: TSP
• [2002] DiBlas/Jagota/Hughuy: graph coloring
• ... this session: Schiller/Steil; Jain/Wysotzki: graph isomorphisms; Jain/Wysotzki: automorphism group of structures

Recurrent networks
formalized / positive results / ongoing work
• Universal approximation ability
• Complexity of training
• Learnability

Self-organizing maps
faithful data representation: ICA/PCA, VQ, SOM, NG, statistical approaches
• 1: Training algorithms, convergence (training) – How can reasonable training algorithms be designed? What is the objective, cost function? Convergence of the algorithm to desired states?
• 2: Topology preservation (data mining) – Does the topology of the representation match the underlying data topology?
• 3: Distribution representation – Is important information preserved? What is the magnification?
ESANN

1: Training algorithms (self-organizing maps)
• iterative updates, e.g. Kushner/Clark or Ljung
• batch updates, e.g. Geoffrey/Hinton
• VQ: ok
• NG: [1993,1994] Martinetz et al.: ok
• SOM: [1992] Erwin et al.: no; [1999] Heskes: but almost
• Convergence of SOM: yes in dim one/two, otherwise difficulties (Cottrell, Der, Erwin, Flanagan, Fort, Herrmann, Lin, Pages, Ritter, Sadeghi, Obermayer, ...)
• ... this session: Archambeau/Lee/Verleysen: investigation of convergence problems of the EM algorithm

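An iterative (online) update of the kind discussed above can be sketched as follows for a one-dimensional chain of neurons with scalar weights. The Gaussian neighbourhood, the parameter values and the function name `som_step` are assumptions chosen for illustration, not a specific variant from the cited work.

```python
import math


def som_step(weights, x, learning_rate=0.5, sigma=1.0):
    """One online SOM update on a 1-D chain of scalar weights.
    Returns (new_weights, winner_index)."""
    # winner: neuron whose weight is closest to the input
    winner = min(range(len(weights)), key=lambda i: abs(weights[i] - x))
    new_weights = []
    for i, w in enumerate(weights):
        # neighbourhood is measured on the lattice (index distance), not in data space
        h = math.exp(-((i - winner) ** 2) / (2.0 * sigma ** 2))
        new_weights.append(w + learning_rate * h * (x - w))
    return new_weights, winner
```

Calling `som_step([0.0, 0.5, 1.0], 0.6)` selects the middle neuron as winner and pulls every weight towards the input, the winner most strongly. With a very small `sigma` the neighbours are left almost untouched and the rule approaches plain online vector quantization.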
2: Topology preservation (self-organizing maps)
VQ, NG, SOM
• [1992] Bauer/Pawelzik: Topographic product
• [1997] Villmann et al.: Topographic function
• [1992] Ritter et al., [1993] Heskes, [1994] Der/Herrmann, [1996] Bauer et al., [1999] Der/Herrmann/Villmann: mathematical investigation of mismatching states
• [1997] Bauer/Villmann, [1999] Ritter: alternative or adaptive lattices

3: Distribution preservation (self-organizing maps)
• Thomas will fill this area ...
• ... this session: Claussen/Villmann: magnification control for winner relaxing neural gas

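As background for the term magnification: it is usually expressed as a power law between the density of the weight vectors and the density of the data. The symbols below are assumptions, not taken from the slides: $P$ is the data density, $\rho$ the density of weight vectors, $\alpha$ the magnification exponent and $d$ the data dimension. The two values quoted are the standard ones for the one-dimensional SOM and for neural gas.

```latex
\rho(w) \propto P(w)^{\alpha},
\qquad
\alpha_{\text{SOM},\,d=1} = \tfrac{2}{3},
\qquad
\alpha_{\text{NG}} = \tfrac{d}{d+2}
```

An exponent of 1 would mean that the weights reproduce the data density exactly; magnification control aims at changing $\alpha$ by modifying the learning rule.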
4: Recent developments (self-organizing maps)
extension of SOM to general domains:
• [2001] Kaski/Sinkkonen, [2002] Hammer/Strickert/Villmann: SOM/NG with adaptive metric
• [2002] Hagenbuchner/Sperduti/Tsoi, [2002] Voegtlin, [2003] Hammer/Micheli/Sperduti: SOM for sequences and structures
• [2001] Kohonen: SOM for discrete objects
• ... this session: Cottrell/Letremy: SOM for contingency analysis

Self-organizing maps
formalized
• Convergence
• Topology preservation
• Distribution preservation

Finale
Theorem: You need at least two neurons to follow this talk.
Proof: By contradiction. Assume you had only one neuron. Then you couldn’t understand the following...
If neither Thomas nor Barbara drinks beer, the idea for the special session will be good.
If Thomas drinks beer and Barbara does not drink beer, the idea for the session will not be good.
If Barbara drinks beer and Thomas does not, the idea for the special session will not be good.
If both Barbara and Thomas drink beer, the idea for the special session will be good.
...because it includes XOR, not solvable with one neuron.
• you couldn’t follow the last proof in this talk.
• since the last proof is deeply connected to the other 66 slides of the talk, you couldn’t follow the talk.

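The step "XOR is not solvable with one neuron" has a two-line justification. A threshold neuron that fires iff w1*x1 + w2*x2 + b > 0 would need b <= 0, w1 + b > 0, w2 + b > 0 and w1 + w2 + b <= 0; adding the two middle inequalities gives w1 + w2 + 2b > 0, hence w1 + w2 + b > -b >= 0, a contradiction. The sketch below illustrates this with a brute-force search over a small integer weight grid (an illustration, not a proof) and a hand-wired network with two hidden neurons that does compute XOR. All names and the grid range are assumptions.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR = [0, 1, 1, 0]


def neuron(w1, w2, b, x):
    """Single threshold unit: fires iff w1*x1 + w2*x2 + b > 0."""
    return 1 if w1 * x[0] + w2 * x[1] + b > 0 else 0


def single_neuron_solutions(target, grid=range(-5, 6)):
    """All (w1, w2, b) on the integer grid that reproduce `target`."""
    return [(w1, w2, b) for w1, w2, b in product(grid, repeat=3)
            if [neuron(w1, w2, b, x) for x in INPUTS] == target]


def two_layer_xor(x):
    """Hand-wired network: output fires iff OR is active and AND is not."""
    h_or = neuron(1, 1, 0, x)     # fires if at least one input is 1
    h_and = neuron(1, 1, -1, x)   # fires only if both inputs are 1
    return neuron(1, -1, 0, (h_or, h_and))
```

`single_neuron_solutions(XOR)` comes back empty, as it does for the negated table `[1, 0, 0, 1]`, which is the pattern of the beer example; `two_layer_xor` reproduces `XOR` on all four inputs.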
Finale
• need math to
  • develop and present algorithms
  • investigate applicability, evaluate algorithms
  • investigate theoretical properties
• but
  • often limited to simple questions (... how many neurons are sufficient?)
  • possibly not applicable (... who drank beer?)
  • does not fit in all details (... we don’t have 66 slides.)

1: Approximation ability (recurrent networks)
• Partially recurrent networks constitute universal approximators: [1992] Sontag, [1993] Funahashi/Nakamura, [2002] Back/Chen
• They show rich dynamic behavior: [1991] Wang, [2002] Tino et al., [...] Pasemann, Haschke, ...
• They include automata, Turing machines, non-uniform circuits: Omlin, Giles, Carrasco, Forcada, Siegelmann, Sontag, Kilian, ...
• Hopfield networks can minimize polynomials, number of stable patterns can be estimated, various extensions

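The Hopfield statements are easiest to picture with a tiny associative memory. The sketch below assumes the textbook setting: bipolar states, Hebbian outer-product weights with zero diagonal, asynchronous sign updates and the quadratic energy E(s) = -1/2 * sum_ij W_ij s_i s_j. Function names and the example pattern are hypothetical.

```python
def hebbian_weights(patterns):
    """Hebbian outer-product rule with zero self-connections."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns)
             for j in range(n)] for i in range(n)]


def energy(W, s):
    """Quadratic Hopfield energy E(s) = -1/2 * sum_ij W_ij s_i s_j."""
    n = len(s)
    return -0.5 * sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))


def recall(W, s, sweeps=5):
    """Asynchronous sign updates, one neuron at a time."""
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(W[i][j] * s[j] for j in range(len(s)))
            if field != 0:  # leave the state unchanged on a tie
                s[i] = 1 if field > 0 else -1
    return s


pattern = [1, -1, 1, 1, -1, -1, 1, -1]
W = hebbian_weights([pattern])
noisy = list(pattern)
noisy[2] = -noisy[2]  # corrupt one bit
```

`recall(W, noisy)` returns the stored pattern, and the energy drops from -14 for the corrupted state to -28 for the stored one: the dynamics act as a descent on the quadratic energy, which is the mechanism behind using such networks for optimization.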
3: Learnability (recurrent networks)
• VC-dim of RNNs
  • [2003] Hammer/Tino: finite for small weights
  • [1997] Koiran/Sontag: in the general setting, VC depends on the maximum length of input sequences
• covering number or entropy number?
  • [1999, 2001] Hammer: one can achieve distribution dependent or posterior bounds (which might be very bad ...)
• non i.i.d. data?
  • [1993] Nobel/Dembo: finite VC dim and finite mixing coefficients are sufficient
  • [2001] Vidyasagar: ... is working on nice alternatives and generalizations