3) Vector Quantization (VQ) and Learning Vector Quantization (LVQ)

References
• M. Biehl, A. Freking, G. Reents: Dynamics of on-line competitive learning. Europhysics Letters 38 (1997) 73-78
• M. Biehl, A. Ghosh, B. Hammer: Dynamics and generalization ability of LVQ algorithms. J. Machine Learning Research 8 (2007) 323-360
and references in the latter

Vector Quantization (VQ)
aim: representation of large amounts of data by (few) prototype vectors
example: identification and grouping in clusters of similar data
assignment of a feature vector to the closest prototype w (similarity or distance measure, e.g. Euclidean distance)

unsupervised competitive learning
• initialize K prototype vectors
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner even closer towards the example
intuitively clear, plausible procedure
- places prototypes in areas with high density of data
- identifies the most relevant combinations of features
- (stochastic) on-line gradient descent with respect to the cost function ...
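
The four steps of competitive learning can be sketched in a few lines of NumPy. This is a minimal illustration, not the presenters' code: the function name `online_vq`, the initialization on randomly chosen examples, the learning rate `eta`, and the toy two-cluster data are all assumptions made for the example.

```python
import numpy as np

def online_vq(data, n_prototypes, eta=0.05, epochs=5, seed=0):
    """Winner-takes-all vector quantization with Euclidean distance (sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize prototypes on randomly chosen examples.
    idx = rng.choice(len(data), size=n_prototypes, replace=False)
    w = data[idx].astype(float).copy()
    for _ in range(epochs):
        # Present the examples one at a time, in random order.
        for xi in data[rng.permutation(len(data))]:
            d = np.sum((w - xi) ** 2, axis=1)   # squared distances to all prototypes
            j = int(np.argmin(d))               # index of the winner
            w[j] += eta * (xi - w[j])           # move the winner towards the example
    return w

# Toy data (hypothetical): two well separated 2-d clusters.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(-3.0, 0.5, size=(200, 2)),
                  rng.normal(+3.0, 0.5, size=(200, 2))])
prototypes = online_vq(data, n_prototypes=2)
```

With well separated clusters the two prototypes should end up near the two cluster centers; with overlapping data the outcome depends on the number of prototypes and on `eta`.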

quantization error
here: Euclidean distance; w_j is the winner!
[Figure: prototypes and data]
aim: faithful representation (in general: ≠ clustering)
Result depends on
- the number of prototype vectors
- the distance measure / metric used
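
For reference, the quantization error for squared Euclidean distance is commonly written as below. The formula on the slide itself was not preserved, so this is the standard textbook form rather than a transcription; the number of examples P, the index function j(μ), and the prefactor 1/2 are notational assumptions.

```latex
H_{VQ} \;=\; \sum_{\mu=1}^{P} \frac{1}{2}
  \left( \boldsymbol{\xi}^{\mu} - \mathbf{w}_{j(\mu)} \right)^{2},
\qquad
j(\mu) \;=\; \arg\min_{k} \left( \boldsymbol{\xi}^{\mu} - \mathbf{w}_{k} \right)^{2}

% A stochastic gradient step on the single-example term moves only the winner:
\mathbf{w}_{j(\mu)} \;\leftarrow\; \mathbf{w}_{j(\mu)}
  + \eta \left( \boldsymbol{\xi}^{\mu} - \mathbf{w}_{j(\mu)} \right)
```

The second line is what makes the winner-takes-all update a stochastic on-line gradient descent on this cost.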

Learning Vector Quantization
∙ identification of prototype vectors from labelled example data
∙ distance based classification (e.g. Euclidean, Manhattan, …)
classification: assignment of a vector to the class of the closest prototype w
[Figure: N-dim. feature space, piecewise linear decision boundaries]
aim: generalization ability, i.e. classification of novel data after learning from examples
basic, heuristic LVQ scheme: LVQ1 [Kohonen]
• initialize prototype vectors for different classes
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner
  - closer towards the data (same class)
  - away from the data (different class)
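
A single LVQ1 step and the nearest-prototype classification can be sketched as follows. The function names `lvq1_step` and `classify`, the in-place update, and the toy numbers are illustrative assumptions; labels are encoded as +1 / -1.

```python
import numpy as np

def lvq1_step(w, labels, xi, sigma, eta):
    """One LVQ1 update (sketch): only the closest prototype is moved, in place."""
    d = np.sum((w - xi) ** 2, axis=1)           # squared Euclidean distances
    j = int(np.argmin(d))                       # the winner
    sign = 1.0 if labels[j] == sigma else -1.0  # attract if same class, else repel
    w[j] += sign * eta * (xi - w[j])
    return j

def classify(w, labels, xi):
    """Assign xi to the class of the closest prototype."""
    d = np.sum((w - xi) ** 2, axis=1)
    return labels[int(np.argmin(d))]

# Toy example (hypothetical numbers): the winner carries the wrong label,
# so it is pushed away from the example.
w = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = np.array([+1, -1])
winner = lvq1_step(w, labels, xi=np.array([0.2, 0.0]), sigma=-1, eta=0.5)
predicted = classify(w, labels, np.array([0.9, 0.0]))
```

Here the first prototype wins but has the wrong class, so it moves from (0, 0) to (-0.1, 0) while the second prototype is left unchanged.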

LVQ algorithms ...
• frequently applied in a variety of practical problems
• plausible, intuitive, flexible
  - fast, easy to implement
• limited theoretical understanding of
  - dynamics and convergence properties
  - achievable generalization ability
often based on heuristic arguments or cost functions with unclear relation to generalization
here: analysis of LVQ algorithms w.r.t.
- dynamics of the learning process
- performance, i.e. generalization ability
- typical properties in a model situation

Model situation: two clusters of N-dimensional data
random vectors ξ ∈ ℝ^N according to a mixture of two Gaussians:
independent components with mean ℓ B_σ and variance v_σ (σ = ±)
orthonormal center vectors: B+, B- ∈ ℝ^N, (B±)² = 1, B+·B- = 0
prior weights of classes p+, p-: p+ + p- = 1
cluster distance ∝ ℓ
[Figure: two clusters with weights (p+), (p-) centered at ℓ B+ and ℓ B-]
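
Data from this model can be generated along the following lines. The sketch assumes isotropic Gaussian clusters centered at ℓ B+ and ℓ B- with variances v+ and v-, and picks the first two unit vectors as the orthonormal centers; the function name `sample_two_clusters` and this choice of B± are assumptions for illustration.

```python
import numpy as np

def sample_two_clusters(n_samples, N, ell, p_plus, v_plus, v_minus, seed=0):
    """Draw labelled examples from a mixture of two isotropic Gaussians (sketch)."""
    rng = np.random.default_rng(seed)
    # Assumption: use the first two unit vectors as orthonormal centers B+, B-.
    B_plus = np.zeros(N)
    B_plus[0] = 1.0
    B_minus = np.zeros(N)
    B_minus[1] = 1.0
    # Class label sigma = +1 with prior weight p+, else -1.
    sigma = np.where(rng.random(n_samples) < p_plus, 1, -1)
    centers = np.where(sigma[:, None] == 1, ell * B_plus, ell * B_minus)
    std = np.sqrt(np.where(sigma == 1, v_plus, v_minus))
    xi = centers + std[:, None] * rng.standard_normal((n_samples, N))
    return xi, sigma, B_plus, B_minus

# Parameter values taken from the example slide: N=200, ℓ=1, p+=0.4, v+=1.44, v-=0.64
xi, sigma, B_plus, B_minus = sample_two_clusters(400, 200, 1.0, 0.4, 1.44, 0.64)
```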

high-dimensional data (formally: N→∞)
ξ^μ ∈ ℝ^N, N=200, ℓ=1, p+=0.4, v+=1.44, v-=0.64
[Figure: projections into the plane of center vectors B+, B-, with axes y+ = B+·ξ^μ and y- = B-·ξ^μ]
[Figure: projections on two independent random directions w1,2, with axes x1 = w1·ξ^μ and x2 = w2·ξ^μ]
symbols: (● 240) (○ 160)
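
The contrast between the two kinds of projection can be checked numerically. The sketch below regenerates data with the slide's parameters (assuming isotropic Gaussian clusters centered at ℓ B±, and unit vectors as B±), then compares the class separation seen along B+ with the separation along a random direction; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, ell, n = 200, 1.0, 400
B_plus, B_minus = np.eye(N)[0], np.eye(N)[1]   # assumption: unit vectors as centers

# Mixture data with p+ = 0.4, v+ = 1.44, v- = 0.64
sigma = np.where(rng.random(n) < 0.4, 1, -1)
centers = np.where(sigma[:, None] == 1, ell * B_plus, ell * B_minus)
std = np.sqrt(np.where(sigma == 1, 1.44, 0.64))
xi = centers + std[:, None] * rng.standard_normal((n, N))

# Projections into the plane of the center vectors
y_plus, y_minus = xi @ B_plus, xi @ B_minus

# Projections on two independent random directions of roughly unit length
w1, w2 = rng.standard_normal((2, N)) / np.sqrt(N)
x1, x2 = xi @ w1, xi @ w2

# Difference of class-conditional means along B+ and along a random direction
gap_centers = y_plus[sigma == 1].mean() - y_plus[sigma == -1].mean()
gap_random = x1[sigma == 1].mean() - x1[sigma == -1].mean()
```

Along B+ the class means differ by about ℓ, whereas along a random direction the difference is only of order ℓ/√N plus sampling noise, which is why the cluster structure is hard to see in random projections of high-dimensional data.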

Dynamics of on-line training
sequence of new, independent random examples drawn according to the model density
update of two prototype vectors w+, w-; the update rule is annotated with
- learning rate, step size
- competition, direction of update etc.
- change of prototype towards or away from the current data
example: LVQ1, original formulation [Kohonen]
Winner-Takes-All (WTA) algorithm
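
The update formula itself did not survive in this transcript. In the notation of the cited JMLR paper (Biehl, Ghosh, Hammer 2007) the generic on-line update and its LVQ1 special case read roughly as follows; the symbols f_S, d_S and the Heaviside function Θ are taken from that paper and should be treated as assumptions here.

```latex
\mathbf{w}_S^{\mu} \;=\; \mathbf{w}_S^{\mu-1}
  + \frac{\eta}{N}\,
    f_S\!\left[ d_+^{\mu},\, d_-^{\mu},\, \sigma^{\mu} \right]
    \left( \boldsymbol{\xi}^{\mu} - \mathbf{w}_S^{\mu-1} \right),
\qquad
d_S^{\mu} \;=\; \left( \boldsymbol{\xi}^{\mu} - \mathbf{w}_S^{\mu-1} \right)^{2},
\quad S = \pm 1

% LVQ1 (winner takes all): only the closer prototype moves,
% towards the example if S = \sigma^{\mu}, away from it otherwise
f_S \;=\; \Theta\!\left( d_{-S}^{\mu} - d_S^{\mu} \right) S\, \sigma^{\mu}
```

Here η is the learning rate, f_S encodes the competition and the direction of the update, and the last factor is the change of the prototype towards or away from the current example.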

Mathematical analysis of the learning dynamics
1. description in terms of a few characteristic quantities (here: ℝ^{2N} → ℝ^7)
- length and relative position of prototypes
- projections into the (B+, B-)-plane
2. average over the current example
- random vector according to the model density, avg. length
- correlated Gaussian random quantities in the thermodynamic limit N→∞
- completely specified in terms of first and second moments (w/o indices μ)
algorithm → recursions
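
The seven characteristic quantities can be computed directly from the prototypes. The sketch assumes the definitions used in the cited JMLR paper, R_Sσ = w_S · B_σ and Q_ST = w_S · w_T, which match the labels R++, Q+- etc. in the figures; the function name and the toy vectors are illustrative.

```python
import numpy as np

def order_parameters(w_plus, w_minus, B_plus, B_minus):
    """Reduce two N-dim. prototypes (2N numbers) to 7 characteristic quantities.

    Assumed definitions: R[s, t] = w_s . B_t (projections into the (B+, B-)-plane)
    and Q[s, t] = w_s . w_t (lengths and relative position of the prototypes).
    Index 0 stands for '+', index 1 for '-'.
    """
    W = np.stack([w_plus, w_minus])
    B = np.stack([B_plus, B_minus])
    R = W @ B.T        # four projections R++, R+-, R-+, R--
    Q = W @ W.T        # symmetric: three independent entries Q++, Q+-, Q--
    return R, Q

# Toy example in N = 5 dimensions (hypothetical numbers)
B_plus, B_minus = np.eye(5)[0], np.eye(5)[1]
w_plus = np.array([1.0, 2.0, 0.0, 0.0, 0.0])
w_minus = np.array([0.0, 1.0, 3.0, 0.0, 0.0])
R, Q = order_parameters(w_plus, w_minus, B_plus, B_minus)
```

Four entries of R plus three independent entries of the symmetric Q give the seven numbers referred to by ℝ^{2N} → ℝ^7.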

3. self-averaging property of characteristic quantities
- depend on the random sequence of example data
- their fluctuations vanish with N→∞
learning dynamics is completely described in terms of averages
averaged recursions, closed in the characteristic quantities
computer simulations (LVQ1)
• mean results approach theoretical prediction
• variance vanishes as N→∞
[Figure: mean and variance of R++ (α=10) plotted against 1/N]

4. continuous learning time
α = (# of examples) / N, i.e. # of learning steps per degree of freedom
stochastic recursions → deterministic ODE
integration yields evolution of projections
5. learning curve
generalization error εg(α) after training with αN examples: probability for misclassification of a novel example
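
The generalization error of a given pair of prototypes can be estimated by Monte Carlo sampling of novel examples. This is a numerical stand-in for the analytic expression used in the theory, under the assumption of isotropic Gaussian clusters centered at ℓ B±; the function name, `n_test`, and the example parameters are illustrative.

```python
import numpy as np

def generalization_error(w_plus, w_minus, B_plus, B_minus, ell,
                         p_plus, v_plus, v_minus, n_test=20000, seed=0):
    """Monte Carlo estimate of the misclassification probability (sketch)."""
    rng = np.random.default_rng(seed)
    N = len(B_plus)
    # Draw novel labelled examples from the assumed two-cluster model.
    sigma = np.where(rng.random(n_test) < p_plus, 1, -1)
    centers = np.where(sigma[:, None] == 1, ell * B_plus, ell * B_minus)
    std = np.sqrt(np.where(sigma == 1, v_plus, v_minus))
    xi = centers + std[:, None] * rng.standard_normal((n_test, N))
    # Nearest-prototype classification with squared Euclidean distance.
    d_plus = np.sum((xi - w_plus) ** 2, axis=1)
    d_minus = np.sum((xi - w_minus) ** 2, axis=1)
    predicted = np.where(d_plus < d_minus, 1, -1)
    return float(np.mean(predicted != sigma))

# Hypothetical check: prototypes placed exactly on the cluster centers
N = 50
B_plus, B_minus = np.eye(N)[0], np.eye(N)[1]
eps = generalization_error(2.0 * B_plus, 2.0 * B_minus, B_plus, B_minus,
                           ell=2.0, p_plus=0.5, v_plus=1.0, v_minus=1.0)
eps_swapped = generalization_error(2.0 * B_minus, 2.0 * B_plus, B_plus, B_minus,
                                   ell=2.0, p_plus=0.5, v_plus=1.0, v_minus=1.0)
```

Evaluating this estimate after every αN training steps would trace out a learning curve εg(α); swapping the two prototypes, as in `eps_swapped`, turns a good classifier into one that is wrong most of the time.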

LVQ1: The winner takes it all
only the winner w_s is updated, according to the class label
initialization w_s(0) ≈ 0
theory and simulation (N=100): p+=0.8, v+=4, v-=9, ℓ=2.0, η=1.0, averaged over 100 indep. runs
[Figure: characteristic quantities Q++, Q+-, Q-- and R_Sσ vs. α]
[Figure: self-averaging property, mean and variance of R++ (α=10) vs. 1/N]
[Figure: trajectories of w+ and w- in the (B+, B-)-plane, with R_S+ and R_S- as axes and ℓ B+, ℓ B- marked; (•) mark α = 20, 40, ..., 140; dotted line: optimal decision boundary; solid line: asymptotic position]
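
A single simulation run of LVQ1 in the model can be sketched as below. It uses the parameters quoted on the slide, assumes isotropic Gaussian clusters centered at ℓ B±, assumes the step size is scaled as η/N (as in the cited JMLR paper), and records the projections R_Sσ = w_S · B_σ after every step. The function name `simulate_lvq1` and the default `alpha_max` are illustrative choices.

```python
import numpy as np

def simulate_lvq1(N=100, alpha_max=20.0, eta=1.0, ell=2.0,
                  p_plus=0.8, v_plus=4.0, v_minus=9.0, seed=0):
    """One on-line LVQ1 run with two prototypes w+ and w- (sketch)."""
    rng = np.random.default_rng(seed)
    B = np.eye(N)[:2]                  # assumption: B[0] = B+, B[1] = B-
    labels = np.array([1, -1])         # class of prototype / cluster 0 and 1
    v = np.array([v_plus, v_minus])
    w = np.zeros((2, N))               # w[0] = w+, w[1] = w-, initialized at 0
    steps = int(alpha_max * N)         # alpha = (# of examples) / N
    R_history = np.empty((steps, 2, 2))
    for mu in range(steps):
        k = 0 if rng.random() < p_plus else 1              # cluster of the example
        xi = ell * B[k] + np.sqrt(v[k]) * rng.standard_normal(N)
        d = np.sum((w - xi) ** 2, axis=1)
        s = int(np.argmin(d))                              # the winner
        sign = 1.0 if labels[s] == labels[k] else -1.0     # attract or repel
        w[s] += sign * (eta / N) * (xi - w[s])
        R_history[mu] = w @ B.T                            # R[S, sigma] = w_S . B_sigma
    return w, R_history

w, R_history = simulate_lvq1()
```

Averaging `R_history` over many seeds, and repeating for several values of N, is one way to reproduce the kind of comparison between theory and simulation that the figures show.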

Learning curve
[Figure: learning curves εg for η = 2.0, 1.0, 0.2; p+ = 0.2, ℓ = 1.0, v+ = v- = 1.0]
• suboptimal, non-monotonic behavior for small η
- stationary state: εg(α→∞) grows linearly with η
- well-defined asymptotics: η→0, α→∞, (ηα)→∞
achievable generalization error:
[Figure: εg vs. p+, two panels: v+ = v- = 1.0 and v+ = 0.25, v- = 0.81; dotted line: best linear boundary, solid line: LVQ1]

LVQ 2.1 [Kohonen]
here: update correct and wrong winner
theory and simulation (N=100): p+=0.8, ℓ=1, v+=v-=1, η=0.5, averages over 100 independent runs
[Figure: trajectories in the plane of the projections R_S+, R_S-]
problem: instability of the algorithm due to repulsion of wrong prototypes
trivial classification for α→∞: εg = min { p+, p- }
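
The LVQ2.1 update, without any window rule, can be sketched as follows. The function name `lvq21_step` and the toy numbers are assumptions; the code expects at least one prototype of the correct class and one of a wrong class.

```python
import numpy as np

def lvq21_step(w, labels, xi, sigma, eta):
    """One LVQ2.1 update without window (sketch): the closest correct prototype
    is attracted, the closest wrong prototype is repelled, both in place."""
    d = np.sum((w - xi) ** 2, axis=1)
    same = labels == sigma
    j = int(np.argmin(np.where(same, d, np.inf)))   # closest correct prototype
    k = int(np.argmin(np.where(same, np.inf, d)))   # closest wrong prototype
    w[j] += eta * (xi - w[j])
    w[k] -= eta * (xi - w[k])
    return j, k

# Toy example (hypothetical numbers)
w = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = np.array([+1, -1])
xi = np.array([0.5, 1.0])
before = xi - w[1]
j, k = lvq21_step(w, labels, xi, sigma=+1, eta=0.1)
after = xi - w[1]
```

Each repulsion stretches the vector from the wrong prototype to the example by a factor (1 + η), so `after` equals `1.1 * before` here. Since the repelled prototype is updated on every example, this is the mechanism behind the instability named on the slide.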

suggested strategy: selection of data in a window close to the current decision boundary
slows down the repulsion, system remains unstable
[Figure: εg for η = 2.0, 1.0, 0.5]
Early stopping: end training process at minimal εg (idealized)
• pronounced minimum in εg(α)
• depends on initialization and cluster geometry
• here: lowest minimum value reached for η→0
[Figure: εg vs. p+ for v+ = 0.25, v- = 0.81; lines: LVQ1, early stopping]

Learning From Mistakes (LFM)
LVQ2.1 update only if the current classification is wrong
crisp limit version of Soft Robust LVQ [Seo and Obermayer, 2003]
[Figure: projected trajectory in the plane of R_S+, R_S-, with ℓ B+ and ℓ B- marked; p+=0.8, ℓ=1.2, v+=v-=1.0]
Learning curves:
[Figure: εg for η = 2.0, 1.0, 0.5; p+=0.8, ℓ=3.0, v+=4.0, v-=9.0]
η-independent asymptotic εg
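
Learning From Mistakes differs from LVQ2.1 only in the condition for updating, which the following sketch makes explicit. The function name `lfm_step`, the boolean return value, and the toy numbers are illustrative assumptions.

```python
import numpy as np

def lfm_step(w, labels, xi, sigma, eta):
    """One LFM update (sketch): apply the LVQ2.1 update only when the example
    is currently misclassified. Returns True if the prototypes were changed."""
    d = np.sum((w - xi) ** 2, axis=1)
    if labels[int(np.argmin(d))] == sigma:
        return False                                 # correct classification: no update
    same = labels == sigma
    j = int(np.argmin(np.where(same, d, np.inf)))    # closest correct prototype
    k = int(np.argmin(np.where(same, np.inf, d)))    # closest wrong prototype
    w[j] += eta * (xi - w[j])
    w[k] -= eta * (xi - w[k])
    return True

# Toy example (hypothetical numbers)
w = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = np.array([+1, -1])
changed_first = lfm_step(w, labels, np.array([0.1, 0.0]), sigma=+1, eta=0.5)
w_after_first = w.copy()
changed_second = lfm_step(w, labels, np.array([0.9, 0.0]), sigma=+1, eta=0.5)
```

The first example is classified correctly and leaves the prototypes untouched; the second is misclassified and triggers the update of both prototypes.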

Comparison: achievable generalization ability
[Figure: εg vs. p+, two panels: equal cluster variances (v+ = v- = 1.0) and unequal variances (v+ = 0.25, v- = 0.81)]
legend: best linear boundary (dotted), LVQ1 (solid), LVQ2.1 with early stopping (dashed), LFM (dash-dotted), trivial classification (solid)

Vector Quantization
class membership is unknown or identical for all data
competitive learning: only the winner w_s is updated
numerical integration for w_s(0) ≈ 0 (p+=0.2, ℓ=1.0, η=1.2)
[Figure: R++, R+-, R-+, R-- vs. α, for α from 0 to 300]
system is invariant under exchange of the prototypes
weakly repulsive fixed points
[Figure: εg vs. α for VQ, LVQ+ and LVQ1]

asymptotics (α→∞, η→0, ηα→∞)
[Figure: εg vs. p+; annotation: p+ ≈ 0, p- ≈ 1]
- low quantization error
- high gen. error εg
interpretations:
• VQ, unsupervised learning, unlabelled data
• LVQ, two prototypes of the same class, identical labels
• LVQ, different classes, but labels are not used in training

Summary
• a model scenario of LVQ training
  - two clusters, two prototypes
  - dynamics of online training
• comparison of algorithms (within the model):
  - LVQ 1: original formulation of LVQ, with close to optimal asymptotic generalization
  - LVQ 2.1: intuitive extension creates instability, trivial (stationary) classification
  - ... + stopping: potentially good performance; practical difficulties, depends on initialization
  - LFM: crisp limit of Soft Robust LVQ, stable behavior, far from optimal generalization
  - VQ: description of in-class competition

Outlook
• multi-class, multi-prototype problems
• optimized procedures: learning rate schedules, variational approach / Bayes optimal on-line
• adaptive metrics, e.g. distance measure training: Generalized Relevance LVQ [e.g. Hammer & Villmann]
• Self-Organizing Maps (SOM): neighborhood preserving SOM, Neural Gas (distance rank based)
• applications