Dynamics of Learning VQ and Neural Gas
Aree Witoelar, Michael Biehl
Mathematics and Computing Science, University of Groningen, Netherlands
in collaboration with Barbara Hammer (Clausthal) and Anarta Ghosh (Groningen)
Dagstuhl Seminar, 25.03.2007
Outline
• Vector Quantization (VQ)
• Analysis of VQ Dynamics
• Learning Vector Quantization (LVQ)
• Summary
Vector Quantization
• objective: representation of (many) data with (few) prototype vectors
• assign each data point ξμ to the nearest prototype vector wj (by a distance measure, e.g. Euclidean)
• find the optimal set W of prototypes with the lowest quantization error, i.e. the average distance of the data to the nearest prototype (a minimal sketch follows below)
• this groups the data into clusters, e.g. for classification
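A minimal NumPy sketch of the two ingredients above, nearest-prototype assignment and the quantization error; the function names and the mean-squared-distance convention are illustrative choices, not the talk's notation:

```python
import numpy as np

def assign(data, prototypes):
    """Index of the winning (nearest) prototype for every data point."""
    # pairwise squared Euclidean distances, shape (P, K)
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def quantization_error(data, prototypes):
    """Average squared distance of each data point to its nearest prototype."""
    d2 = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```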
Example: Winner Takes All (WTA)
• initialize K prototype vectors
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner even closer towards the example (see the sketch below)
• prototypes end up in areas with a high density of data
• this is stochastic gradient descent with respect to a cost function
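One online WTA step, sketched in the same NumPy style; eta stands in for the learning rate η:

```python
def wta_update(w, xi, eta):
    """One WTA step: only the winner moves, towards the example xi."""
    d2 = ((w - xi) ** 2).sum(axis=1)   # squared distances of all K prototypes
    s = d2.argmin()                    # the winner
    w[s] += eta * (xi - w[s])          # move the winner towards the data point
    return w
```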
Problems
• Winner Takes All: sensitive to initialization
• “winner takes most”: update all prototypes according to their “rank”, e.g. Neural Gas; less sensitive to initialization?
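A sketch of the rank-based “winner takes most” update; the exponential rank weighting exp(−rank/λ) is the standard Neural Gas choice, and eta, lam are illustrative names:

```python
import numpy as np

def neural_gas_update(w, xi, eta, lam):
    """One Neural Gas step: every prototype moves, weighted by its distance rank."""
    d2 = ((w - xi) ** 2).sum(axis=1)
    ranks = d2.argsort().argsort()     # rank 0 = winner, 1 = runner-up, ...
    f = np.exp(-ranks / lam)           # update strength decays with rank
    w += eta * f[:, None] * (xi - w)
    return w
```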
(L)VQ algorithms
• intuitive
• fast, powerful algorithms
• flexible
• but: limited theoretical background w.r.t. convergence speed, robustness to initial conditions, etc.

Analysis of VQ Dynamics
• exact mathematical description in very high dimensions
• study of typical learning behavior
Model: two Gaussian clusters of high-dimensional data
• random vectors ξ ∈ ℝN generated according to classes σ = {+1, −1}
• prior probabilities p+, p− with p+ + p− = 1
• cluster centers B+, B− ∈ ℝN, separation ℓ
• variances υ+, υ−
• the clusters are separable only in the projection onto the (B+, B−) plane, not in other projections: only 2 of the N dimensions are informative
• a simple model, but not trivial (a sampling sketch follows below)
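A sketch of how data from this model can be drawn, assuming the cluster means ℓ·Bσ and isotropic variances υσ stated above; function and argument names are mine:

```python
import numpy as np

def sample_model(P, B_plus, B_minus, p_plus, ell, v_plus, v_minus, rng):
    """Draw P examples: xi = ell * B_sigma + noise, noise ~ N(0, v_sigma I)."""
    N = B_plus.shape[0]
    sigma = rng.choice([+1, -1], size=P, p=[p_plus, 1.0 - p_plus])
    centers = np.where(sigma[:, None] == 1, ell * B_plus, ell * B_minus)
    var = np.where(sigma == 1, v_plus, v_minus)
    xi = centers + rng.normal(size=(P, N)) * np.sqrt(var)[:, None]
    return xi, sigma
```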
Online learning
• sequence of independent random data ξ1, ξ2, …
• update of the prototype vectors ws ∈ ℝN after each example: move a prototype towards the current data point
• the learning rate η sets the step size
• the modulation function fs[…] describes the algorithm used: it sets the strength and direction of the update (e.g. only the “winner” moves) and may depend on the prototype class and the data class
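The update equation itself was a slide graphic; reconstructed here following the notation of the accompanying JMLR paper (the 1/N scaling of the learning rate is what makes the high-dimensional analysis tractable; the arguments of fs vary from algorithm to algorithm):

```latex
w_s^{\mu} \;=\; w_s^{\mu-1} \;+\; \Delta w_s^{\mu},
\qquad
\Delta w_s^{\mu} \;=\; \frac{\eta}{N}\, f_s[\ldots]\,\bigl(\xi^{\mu} - w_s^{\mu-1}\bigr)
```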
1. Define a few characteristic quantities of the system: the projections of the prototypes onto the cluster centers, and the lengths and overlaps of the prototypes.
2. Derive recursion relations of these quantities for new input data; the random vector ξμ enters only through its projections.
3. Calculate the average recursions.
In the thermodynamic limit N → ∞ …
• average over examples: the projections become correlated Gaussian quantities, completely specified in terms of their first and second moments
• the characteristic quantities self-average w.r.t. the random sequence of data (fluctuations vanish)
• define a continuous learning time t from the discrete example index μ (μ: discrete 1, 2, …, P; t: continuous)
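Collecting the definitions that appeared as formulas on the slides, reconstructed in the notation of the JMLR paper (order parameters, projections of a new example, and the rescaled learning time):

```latex
R_{s\sigma} = w_s \cdot B_\sigma, \qquad
Q_{st} = w_s \cdot w_t, \qquad
h_s^{\mu} = w_s^{\mu-1} \cdot \xi^{\mu}, \qquad
b_\sigma^{\mu} = B_\sigma \cdot \xi^{\mu}, \qquad
t = \mu / N
```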
4. Derive ordinary differential equations for the characteristic quantities.
5. Solve for Rsσ(t), Qst(t):
• dynamics and asymptotic behavior (t → ∞)
• quantization/generalization error
• sensitivity to initial conditions, learning rates, structure of the data
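Step 5 is carried out numerically on the following slides; a generic Euler integrator is all that is needed once the right-hand sides are known. The actual right-hand sides are the closed-form Gaussian averages derived in the paper and are not reproduced here, so rhs below is a placeholder:

```python
import numpy as np

def integrate_order_parameters(rhs, y0, t_max, dt=0.01):
    """Euler integration of dy/dt = rhs(y, t), where y stacks all R_s,sigma
    and Q_s,t. rhs must return dy/dt as an array of the same shape as y."""
    ts = np.arange(0.0, t_max, dt)
    ys = np.empty((len(ts), len(y0)))
    y = np.asarray(y0, dtype=float)
    for i, t in enumerate(ts):
        ys[i] = y
        y = y + dt * np.asarray(rhs(y, t))
    return ts, ys
```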
Results: VQ with 2 prototypes
• numerical integration of the ODEs for the characteristic quantities R1±, R2±, Q11, Q12, Q22 and the quantization error E(W)
• parameters: ws(0) ≈ 0, p+ = 0.6, ℓ = 1.0, υ+ = 1.5, υ− = 1.0, learning rate η = 0.01
[plots: characteristic quantities and quantization error E(W) as functions of t]
2 vs. 3 prototypes
• projections of the prototypes onto the (B+, B−) plane at t = 50
• with p+ > p−, two of the three prototypes move to the stronger cluster
[plots: prototype projections (RS+, RS−) for 2 and for 3 prototypes, with cluster centers B+, B− at separation ℓ]
Neural Gas: a “winner takes most” algorithm, 3 prototypes
• the update strength decreases exponentially with the rank
• the rank range λ(t) is large initially and is decreased over time (here λi = 2, λf = 10⁻²); for λ(t) → 0 the update becomes identical to WTA (see the schedule sketch below)
[plots: projections RS+, RS− from t = 0 to t = 50, and quantization error E(W) vs. t]
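The slide only fixes the endpoints λi and λf; the exponential interpolation between them is the schedule commonly used for Neural Gas and is an assumption here:

```python
def rank_range(t, t_max, lam_i=2.0, lam_f=1e-2):
    """Anneal the Neural Gas rank range lambda(t) from lam_i down to lam_f."""
    return lam_i * (lam_f / lam_i) ** (t / t_max)
```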
Sensitivity to initialization
• WTA: (eventually) reaches the minimum of E(W), but depends on the initialization; the learning time can become very large on a “plateau” where ∇HVQ ≈ 0
• Neural Gas: more robust w.r.t. initialization
[plots: prototype projections (RS+, RS−) for WTA and Neural Gas from t = 0 to t = 50, and E(W) vs. t]
Learning Vector Quantization (LVQ)
• objective: classification of data using prototype vectors
• assign labeled data {ξ, σ}, ξ ∈ ℝN, to the nearest prototype vector (by a distance measure, e.g. Euclidean)
• find the optimal set W of labeled prototypes with the lowest generalization error, i.e. the probability that new data are misclassified by the nearest prototype
LVQ1
• update the winner towards the data point if their classes agree, away from it if they disagree (see the sketch below)
• no cost function related to the generalization error
• two prototypes: c = {+1, −1}; with three prototypes, which class should the 3rd prototype get, c = {+1, +1, −1} or c = {+1, −1, −1}?
[plots: prototype trajectories in the (RS+, RS−) plane for the two- and three-prototype configurations]
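The LVQ1 step described above, in the same sketch style; c holds the prototype labels and sigma is the class of the current example:

```python
def lvq1_update(w, c, xi, sigma, eta):
    """One LVQ1 step: the winner is attracted if its label matches the
    data label, repelled otherwise."""
    d2 = ((w - xi) ** 2).sum(axis=1)
    s = d2.argmin()                              # the winner
    direction = 1.0 if c[s] == sigma else -1.0   # attract or repel
    w[s] += eta * direction * (xi - w[s])
    return w
```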
Generalization error
• εg: the probability that a new data point is misclassified, broken down by class
[plot: εg vs. t for p+ = 0.6, p− = 0.4, υ+ = 1.5, υ− = 1.0]
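Written out per class (a standard decomposition; the slide's formula itself was a graphic):

```latex
\varepsilon_g \;=\; \sum_{\sigma=\pm 1} p_\sigma \,
P\left(\text{misclassification} \mid \text{class } \sigma\right)
```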
Optimal decision boundary
• the optimal boundary is the (hyper)surface where the weighted class densities are equal (written out below)
• equal variances (υ+ = υ−): a linear decision boundary, already realizable with K = 2
• unequal variances (υ+ > υ−): the optimal boundary is curved; more prototypes (e.g. K = 3) give a better approximation to the optimal decision boundary
[sketch: clusters around B+ and B− at separation ℓ, with p+ > p−, and the resulting boundary]
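For this model the condition reads (Bayes-optimal classification of two isotropic Gaussians; shown graphically on the slide):

```latex
p_+ \, P(\xi \mid +1) \;=\; p_- \, P(\xi \mid -1),
\qquad
P(\xi \mid \sigma) = \mathcal{N}\!\left(\ell B_\sigma,\; \upsilon_\sigma \mathbf{1}\right)
```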
Asymptotic εg for υ+ > υ− (υ+ = 0.81, υ− = 0.25)
• c = {+1, −1, −1}: the optimal εg with K = 3 equals that of K = 2, but LVQ1 with K = 3 is worse
• c = {+1, +1, −1}: the optimal εg with K = 3 is better, and LVQ1 with K = 3 is better too
• more prototypes are not always better for LVQ1
• best: place more prototypes on the class with the larger variance
[plots: εg(t → ∞) vs. p+ for both label configurations]
Summary
• dynamics of (Learning) Vector Quantization for high-dimensional data
• Neural Gas: more robust w.r.t. initialization than WTA
• LVQ1: more prototypes are not always better

Outlook
• study different algorithms, e.g. LVQ+/−, LFM, RSLVQ
• more complex models: multi-prototype, multi-class problems

Reference
M. Biehl, A. Ghosh, and B. Hammer. Dynamics and Generalization Ability of LVQ Algorithms. Journal of Machine Learning Research 8: 323-360 (2007). http://jmlr.csail.mit.edu/papers/v8/biehl07a.html
Central Limit Theorem
• let x1, x2, …, xN be independent random numbers drawn from an arbitrary probability distribution with a well-defined mean and finite variance
• the distribution of the average of the xj approaches a normal distribution as N becomes large
[example plots: a non-normal distribution p(xj), and the distribution of the average for N = 1, 2, 5, 50]
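A quick numerical illustration; the exponential distribution stands in for the slide's unspecified non-normal example:

```python
import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 5, 50):
    # averages of N draws from a decidedly non-normal distribution
    avgs = rng.exponential(size=(100_000, N)).mean(axis=1)
    print(f"N={N:3d}  mean={avgs.mean():.3f}  std={avgs.std():.3f}")
# a histogram of avgs looks more and more Gaussian as N grows
```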
Self-averaging
• fluctuations of the characteristic quantities decrease with a larger number of degrees of freedom N
• as N → ∞, the fluctuations vanish (the variance becomes zero)
[plot: Monte Carlo simulations over 100 independent runs for several values of N]
“LVQ+/−”
• update both winners: the closest prototype of the correct class, ds = min{dk} with cs = σμ, is attracted; the closest prototype of a wrong class, dt = min{dk} with ct ≠ σμ, is repelled (sketch below)
• for p+ ≫ p−: strong repulsion of the weaker class by the stronger class; the dynamics are strongly divergent!
• to overcome the divergence: e.g. early stopping at εg(t) = εg,min (difficult in practice)
[plots: εg(t) diverging without early stopping, and the stopping point εg,min]
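A sketch of the LVQ+/− step, assuming both label classes are represented among the prototypes:

```python
import numpy as np

def lvq_pm_update(w, c, xi, sigma, eta):
    """One 'LVQ+/-' step: attract the closest correct-label prototype,
    repel the closest wrong-label prototype."""
    d2 = ((w - xi) ** 2).sum(axis=1)
    correct = np.flatnonzero(c == sigma)
    wrong = np.flatnonzero(c != sigma)
    s = correct[d2[correct].argmin()]   # d_s = min over correct prototypes
    t = wrong[d2[wrong].argmin()]       # d_t = min over wrong prototypes
    w[s] += eta * (xi - w[s])           # attraction
    w[t] -= eta * (xi - w[t])           # repulsion: the source of divergence
    return w
```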
Comparison of LVQ1 and LVQ+/−, c = {+1, +1, −1}
• two settings: equal variances (υ+ = υ− = 1.0) and unequal variances (υ+ = 0.81, υ− = 0.25)
• depending on the setting, LVQ1 outperforms LVQ+/− with early stopping, or LVQ+/− with early stopping outperforms LVQ1 in a certain p+ interval
• LVQ+/− performance depends on the initial conditions
[plots: asymptotic εg vs. p+ for both variance settings]