Dynamical analysis of LVQ type learning rules
Michael Biehl, Anarta Ghosh (Rijksuniversiteit Groningen, Mathematics and Computing Science; http://www.cs.rug.nl/~biehl, m.biehl@rug.nl)
Barbara Hammer (Clausthal University of Technology, Institute of Computing Science)
Learning Vector Quantization (LVQ)
- identification of prototype vectors from labelled example data
- parameterization of distance-based classification schemes
classification: assignment of a vector to the class of the closest prototype w
aim: generalization ability, i.e. classification of novel data after learning from examples
example: basic LVQ scheme [Kohonen], "LVQ 1":
• initialize prototype vectors for the different classes
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner closer towards the data (same class) or away from the data (different class)
often: heuristically motivated variations of competitive learning (a minimal code sketch follows below)
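As a concrete illustration of the scheme above, a minimal sketch of nearest-prototype classification and a single LVQ1 step. This is illustrative Python, not the authors' implementation; the names `prototypes`, `labels`, `eta` are assumptions.

```python
import numpy as np

def classify(prototypes, labels, xi):
    """Assign xi to the class of the closest prototype."""
    dists = np.sum((prototypes - xi) ** 2, axis=1)
    return labels[np.argmin(dists)]

def lvq1_step(prototypes, labels, xi, y, eta):
    """Basic LVQ1: identify the winner, move it towards the example
    if the classes agree, away from it otherwise."""
    dists = np.sum((prototypes - xi) ** 2, axis=1)
    winner = np.argmin(dists)
    sign = 1.0 if labels[winner] == y else -1.0   # same class: +, different: -
    prototypes[winner] += eta * sign * (xi - prototypes[winner])
```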
LVQ algorithms ...
• frequently applied in a variety of practical problems
• plausible, intuitive, flexible
• fast, easy to implement
but: limited theoretical understanding of
- dynamics and convergence properties
- achievable generalization ability
often based on heuristic arguments or cost functions with unclear relation to generalization
here: analysis of LVQ algorithms w.r.t.
- dynamics of the learning process
- performance, i.e. generalization ability
- typical properties in a model situation
Model situation: two clusters of N-dimensional data
random vectors ξ ∈ ℝ^N generated according to a mixture of two Gaussians, with prior weights p+, p- of the classes, p+ + p- = 1
orthonormal center vectors: B+, B- ∈ ℝ^N, (B±)² = 1, B+ · B- = 0
cluster centers at ℓ B+ and ℓ B-, i.e. separation ∝ ℓ
independent components with variance v+ or v- within the respective cluster
(a sampling sketch follows below)
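The following sketch samples from the two-cluster model as read from this slide: centers ℓB+ and ℓB-, orthonormal B±, prior weights p±, independent components with cluster-dependent variance v±. Function and variable names are illustrative, not from the original.

```python
import numpy as np

def sample_mixture(n, N, ell, p_plus, v_plus, v_minus, seed=0):
    """Draw n examples xi in R^N from the two-Gaussian mixture."""
    rng = np.random.default_rng(seed)
    # orthonormal center vectors: here simply the first two basis vectors
    B = {+1: np.eye(N)[0], -1: np.eye(N)[1]}
    v = {+1: v_plus, -1: v_minus}
    X = np.empty((n, N))
    y = np.empty(n, dtype=int)
    for mu in range(n):
        sigma = +1 if rng.random() < p_plus else -1       # class label
        X[mu] = ell * B[sigma] + np.sqrt(v[sigma]) * rng.standard_normal(N)
        y[mu] = sigma
    return X, y
```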
Dynamics of on-line training
sequence of new, independent random examples ξ^μ, μ = 1, 2, ..., drawn according to the mixture density
update of the two prototype vectors w+, w- of the form
w_s^μ = w_s^{μ-1} + (η/N) f_s[...] (ξ^μ - w_s^{μ-1})
η: learning rate, step size; the factor f_s encodes competition, the direction of the update etc., i.e. the change of the prototype towards or away from the current data
example: LVQ1, original formulation [Kohonen], a Winner-Takes-All (WTA) algorithm (see the training-loop sketch below)
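Putting the pieces together, a sketch of the on-line LVQ1 dynamics in this two-prototype model: one fresh example per step, step size η/N, winner-takes-all competition. Again a hedged illustration, not the original code.

```python
import numpy as np

def lvq1_online(n_steps, N, ell, p_plus, v_plus, v_minus, eta, seed=1):
    rng = np.random.default_rng(seed)
    B = {+1: np.eye(N)[0], -1: np.eye(N)[1]}
    v = {+1: v_plus, -1: v_minus}
    w = {+1: np.zeros(N), -1: np.zeros(N)}        # initialization w_s(0) = 0
    for mu in range(n_steps):
        sigma = +1 if rng.random() < p_plus else -1       # class label
        xi = ell * B[sigma] + np.sqrt(v[sigma]) * rng.standard_normal(N)
        # Winner-Takes-All: only the closer prototype is updated
        s = +1 if np.sum((xi - w[+1]) ** 2) <= np.sum((xi - w[-1]) ** 2) else -1
        psi = 1.0 if s == sigma else -1.0         # attract (same class) / repel
        w[s] += (eta / N) * psi * (xi - w[s])
    return w
```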
Mathematical analysis of the learning dynamics
1. description in terms of a few characteristic quantities (here: ℝ^{2N} → ℝ^7):
the lengths and relative position of the prototypes, and their projections into the (B+, B-)-plane
in the corresponding recursions, the random vector ξ^μ enters only through its length and its projections onto the prototypes and center vectors (written out below)
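Written out explicitly (reconstructed from the figure labels R_{Sσ}, Q_{++}, Q_{+-}, Q_{--}; the h_s, b_σ notation is an assumption consistent with this type of analysis):

```latex
R_{s\sigma} = \mathbf{w}_s \cdot \mathbf{B}_\sigma, \qquad
Q_{st} = \mathbf{w}_s \cdot \mathbf{w}_t, \qquad s,t,\sigma \in \{+,-\},
```

four projections R_{sσ} and three independent overlaps Q_{st} (Q is symmetric), seven quantities in total; ξ^μ enters only through (ξ^μ)² and

```latex
h_s = \mathbf{w}_s^{\mu-1} \cdot \boldsymbol{\xi}^\mu, \qquad
b_\sigma = \mathbf{B}_\sigma \cdot \boldsymbol{\xi}^\mu .
```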
2. average over the current example
for a random vector ξ^μ drawn according to the mixture density, the projections h_s, b_σ (together with the average length ⟨(ξ^μ)²⟩) are correlated Gaussian random quantities in the thermodynamic limit N → ∞, completely specified in terms of first and second moments; hence the averaged recursions close in the characteristic quantities
3. self-averaging property
the characteristic quantities depend on the random sequence of example data, but their variance vanishes with N (here: ∝ N^{-1});
the learning dynamics is completely described in terms of averages
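For an example drawn from cluster σ, the first and second moments follow directly from the model definition above (a reconstruction under the stated assumptions, not copied from the slide):

```latex
\langle b_\tau \rangle_\sigma = \ell\,\delta_{\tau\sigma}, \qquad
\langle h_s \rangle_\sigma = \ell\,R_{s\sigma},
\qquad
\mathrm{Cov}_\sigma(h_s, h_t) = v_\sigma\,Q_{st}, \quad
\mathrm{Cov}_\sigma(h_s, b_\tau) = v_\sigma\,R_{s\tau}, \quad
\mathrm{Cov}_\sigma(b_\tau, b_\rho) = v_\sigma\,\delta_{\tau\rho}.
```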
4. continuous learning time
α = μ/N: # of examples, i.e. # of learning steps, per degree of freedom
the stochastic recursions become deterministic ODEs; integration yields the evolution of the projections
5. learning curve
generalization error ε_g(α) after training with αN examples: the probability for misclassification of a novel example
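The learning curve can be made explicit: since the h_± are Gaussian, the class-conditional misclassification probabilities reduce to error functions of the order parameters. Working this out from the model above (a derivation sketch, consistent with but not copied from the slide):

```latex
\varepsilon_g = \sum_{\sigma=\pm 1} p_\sigma\,
\Phi\!\left(
\frac{Q_{\sigma\sigma} - Q_{-\sigma-\sigma}
      - 2\,\ell\,\left(R_{\sigma\sigma} - R_{-\sigma\,\sigma}\right)}
     {2\sqrt{v_\sigma\,\left(Q_{++} - 2Q_{+-} + Q_{--}\right)}}
\right),
```

where Φ is the cumulative distribution function of the standard normal: a class-σ example is misclassified iff (ξ − w_σ)² > (ξ − w_{−σ})², i.e. iff the Gaussian quantity h_{−σ} − h_σ exceeds (Q_{−σ−σ} − Q_{σσ})/2.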
LVQ1: The winner takes it all
only the winner w_s is updated according to the class label
theory and simulation (N = 100): p+ = 0.8, v+ = 4, v- = 9, ℓ = 2.0, η = 1.0, averaged over 100 independent runs; initialization w_s(0) ≈ 0
[Figure: characteristic quantities R_{Sσ}, Q_{++}, Q_{+-}, Q_{--} vs. α; trajectories of w+, w- in the (B+, B-)-plane, (•) = 20, 40, ..., 140; dotted: optimal decision boundary; solid: asymptotic position]
Learning curve (p+ = 0.2, ℓ = 1.0, v+ = v- = 1.0)
• suboptimal, non-monotonic behavior of ε_g(α) for small η
• stationary state: ε_g(α → ∞) grows linearly with η
• well-defined asymptotics: η → 0, α → ∞ with (ηα) → ∞
achievable generalization error:
[Figure: ε_g(α) for η = 2.0, 1.0, 0.2; ε_g vs. p+ for v+ = v- = 1.0 and for v+ = 0.25, v- = 0.81; dotted: best linear boundary; solid: LVQ1]
"LVQ 2.1" [Kohonen]
here: update both the correct and the wrong winner (a code sketch follows below)
problem: instability of the algorithm due to the repulsion of wrong prototypes;
trivial classification for α → ∞: ε_g = min {p+, p-}
theory and simulation (N = 100): p+ = 0.8, ℓ = 1, v+ = v- = 1, η = 0.5, averages over 100 independent runs
[Figure: projected trajectories, R_{S-} vs. R_{S+}]
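A sketch of the LVQ 2.1-type step in the two-prototype model (illustrative names; with only two prototypes, they are always the closest correct and the closest wrong one): the correct prototype is attracted, the wrong one repelled, which is exactly the repulsion driving the instability.

```python
import numpy as np

def lvq21_step(w, xi, sigma, eta, N):
    """w: dict {+1: w_plus, -1: w_minus}; sigma: class label of xi."""
    w[sigma] += (eta / N) * (xi - w[sigma])      # attract the correct prototype
    w[-sigma] -= (eta / N) * (xi - w[-sigma])    # repel the wrong prototype
    return w
```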
suggested strategy: selection of data in a window close to the current decision boundary; this slows down the repulsion, but the system remains unstable
Early stopping: end the training process at minimal ε_g (idealized; see the sketch below)
• pronounced minimum in ε_g(α)
• depends on initialization and cluster geometry
• the lowest minimum is assumed for η → 0
[Figure: ε_g(α) for η = 2.0, 1.0, 0.5; ε_g vs. p+ for v+ = 0.25, v- = 0.81; solid: LVQ1; dashed: early stopping]
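The idealized early-stopping rule amounts to tracking ε_g during training and keeping the best configuration. A generic sketch; `step_fn` and `error_fn` are hypothetical callables standing in for the update rule and the model's ε_g, which the idealized rule assumes observable.

```python
import copy

def train_with_early_stopping(w0, step_fn, error_fn, n_steps):
    """Return the configuration with the lowest error seen during training."""
    w = copy.deepcopy(w0)
    best_err, best_w = error_fn(w), copy.deepcopy(w)
    for _ in range(n_steps):
        w = step_fn(w)
        err = error_fn(w)
        if err < best_err:               # idealized: eps_g assumed observable
            best_err, best_w = err, copy.deepcopy(w)
    return best_w, best_err
```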
"Learning From Mistakes (LFM)"
LVQ2.1 update only if the current classification is wrong; crisp limit of Soft Robust LVQ [Seo and Obermayer, 2003] (a code sketch follows below)
learning curves: p+ = 0.8, ℓ = 3.0, v+ = 4.0, v- = 9.0; η = 2.0, 1.0, 0.5; η-independent asymptotic ε_g
projected trajectory: p+ = 0.8, ℓ = 1.2, v+ = v- = 1.0
[Figure: ε_g(α) for the three learning rates; trajectory in the (B+, B-)-plane with R_{S-}, R_{S+}, ℓB-, ℓB+]
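Learning From Mistakes in the same sketch notation: the LVQ2.1-type update is applied only when the nearest-prototype classification of the current example is wrong (again an illustration under the two-prototype model, not the authors' code).

```python
import numpy as np

def lfm_step(w, xi, sigma, eta, N):
    """Apply the LVQ2.1-type update only on misclassified examples."""
    d = {s: np.sum((xi - w[s]) ** 2) for s in (+1, -1)}
    predicted = +1 if d[+1] <= d[-1] else -1
    if predicted != sigma:                       # update only on errors
        w[sigma] += (eta / N) * (xi - w[sigma])
        w[-sigma] -= (eta / N) * (xi - w[-sigma])
    return w
```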
Comparison: achievable generalization ability
[Figure: ε_g vs. p+ for equal cluster variances (v+ = v- = 1.0) and unequal variances (v+ = 0.25, v- = 0.81); dotted: best linear boundary; solid: LVQ1; dashed: LVQ2.1 (early stopping); dash-dotted: LFM]
Summary
• prototype-based learning: Vector Quantization and Learning Vector Quantization
• a model scenario: two clusters, two prototypes
• dynamics of on-line training
• comparison of algorithms:
- LVQ 1: close to optimal asymptotic generalization
- LVQ 2.1: instability, trivial (stationary) classification
- LVQ 2.1 + stopping: potentially very good performance
- LFM: far from optimal generalization behavior
• work in progress, outlook:
- multi-class, multi-prototype problems
- optimized procedures: learning rate schedules
- variational approach / Bayes optimal on-line
Perspectives
• Self-Organizing Maps (SOM): (many) N-dim. prototypes form a (low) d-dimensional grid; representation of data in a topology preserving map
• Neural Gas: neighborhood preserving, distance-based training
• Generalized Relevance LVQ [e.g. Hammer & Villmann]: adaptive metrics, e.g. a weighted distance measure whose relevance factors are adapted during training
• applications
investigation and comparison of given algorithms:
- repulsive/attractive fixed points of the dynamics
- asymptotic behavior for α → ∞
- dependence on learning rate, separation, initialization
- ...
optimization and development of new prescriptions:
- time-dependent learning rate η(α)
- variational optimization w.r.t. the modulation function f_s[...], e.g. maximize the decrease of the generalization error, -dε_g/dα
- ...
LVQ1: The winner takes it all
only the winner w_s is updated according to the class label
self-averaging property (mean and variances): the fluctuations of the characteristic quantities vanish with increasing N
theory and simulation (N = 100): p+ = 0.8, v+ = 4, v- = 9, ℓ = 2.0, η = 1.0, averaged over 100 independent runs; initialization w_s(0) = 0
[Figure: R_{++}(α = 10) vs. 1/N; characteristic quantities Q_{++}, Q_{+-}, Q_{--}, R_{Sσ} vs. α]
high-dimensional data (formally: N → ∞)
ξ^μ ∈ ℝ^N, N = 200, ℓ = 1, p+ = 0.4, v+ = 0.44, v- = 0.44 (● 240) (○ 160)
projections on two independent random directions w_{1,2}: x_1 = w_1 · ξ^μ, x_2 = w_2 · ξ^μ
projections into the plane of the center vectors B+, B-: y_+ = B_+ · ξ^μ, y_- = B_- · ξ^μ