Clustering Change Detection Using Normalized Maximum Likelihood Coding WITMSE 2012, Amsterdam, the Netherlands Presented at KDD 2012 on Aug. 13
Contents • Problem Setting • Significance • Proposed Algorithm: Sequential Dynamic Model Selection with NML (normalized maximum likelihood) coding • How to compute the NML code-length for Gaussian mixtures • Experimental Results • Marketing Applications • Conclusion
Problem Setting (1/2) • Clustering change detection: tracking changes of clustering structures in a sequential setting to detect novelty in data • Ex. market analysis: the structure of customer groups changes over time • Goal: detect changes of the number of clusters as well as of the cluster assignments (Figure: cluster structure evolving along a time axis, with change points marked)
Problem Setting (2/2) Examples of clustering structure changes (Figure: example cluster evolutions with clusters labeled A–F) • Existing customers change their patterns • New customers emerge to form a new group • There exist various types of clustering structures
Related works Clustering change detection issue • Evolutionary clustering [Chakrabarti et al., 2006] • Hypothesis testing approach [Song and Wang, 2005] • Kalman filter approach [Krempl et al., 2011] • GraphScope [Sun et al., 2007] • Variational Bayes approach [Sato, 2001]
Significance • A novel clustering change detection algorithm. Key ideas: sequential dynamic model selection (sequential DMS); the NML (normalized maximum likelihood) code-length as the criterion — the first formulae for NML for Gaussian mixture models • Empirical demonstration of its superiority over existing methods, shown using artificial data sets • Demonstration of its validity in market analysis, shown using real beer consumption data sets
Proposed Alg. – background of DMS – Dynamic Model Selection (DMS) [Yamanishi and Maruyama, 2007]: an extension of the MDL (Minimum Description Length) principle [Rissanen, 1978] to model "sequence" selection • Batch DMS criterion: minimize the total code-length, i.e. the code-length of the model sequence plus the code-length of the data sequence given the models, with respect to the model sequence:
$$\min_{M_1,\dots,M_T}\Bigl\{\, L(M_1,\dots,M_T) \;+\; \sum_{t=1}^{T} L(x_t \mid M_t) \,\Bigr\}$$
Proposed Alg. – Sequential DMS – Sequential dynamic model selection (SDMS) alg.: a sequential variant of the DMS criterion [Yamanishi and Maruyama, 2007] • At each time t, given the data, sequentially select the number of clusters K_t and the cluster assignment Z_t for clustering by minimizing the code-length for data clustering (via NML, normalized maximum likelihood, coding) plus the code-length for the transition of the clustering structure:
$$\min_{K_t,\,Z_t}\bigl\{\, L_{\mathrm{NML}}(x_t, Z_t \,;\, K_t) \;+\; L(K_t \mid K_{t-1}) \,\bigr\}$$
Proposed Alg. – model transition – Consider three patterns of clustering changes, and run the EM alg. with initial values chosen as follows (see the sketch below): • Case 1: # of clusters does not change. Initial parameter values remain the same • Case 2: # of clusters decreases (e.g., merging). Assign the data in a removed cluster to the other clusters randomly • Case 3: # of clusters increases (e.g., splitting). Assign data to the new cluster randomly
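As a concrete illustration, here is a minimal runnable sketch of one SDMS step covering the three candidate transitions. It is not the paper's implementation: sklearn's BIC stands in for the (R)NML code-length (exactly the criterion the paper improves upon), the warm-start initializations of Cases 1–3 are approximated by random EM restarts, and all function names are ours.

```python
# Minimal sketch of one sequential-DMS step. Placeholders: gm.bic() stands
# in for the paper's (R)NML code-length, and warm starts are approximated
# by random restarts (n_init). Units of the two terms are mixed here; a
# faithful implementation would express both in the same code-length units.
import numpy as np
from sklearn.mixture import GaussianMixture

def transition_code_length(k_new, k_prev, counts):
    """KT-estimate code-length (bits) for the transition K_{t-1} -> K_t,
    assuming K may only move to a neighbor, i.e. m = 3 outcomes."""
    n_k = counts.get((k_prev, k_new), 0)
    total = sum(v for (kp, _), v in counts.items() if kp == k_prev)
    p_kt = (n_k + 0.5) / (total + 1.5)  # KT estimator with m/2 = 1.5
    return -np.log2(p_kt)

def sdms_step(x_t, k_prev, counts):
    """Select K_t minimizing data code-length + transition code-length
    over the three candidate transitions (Cases 1-3 of the slide).
    x_t is an (n_samples, n_features) array of the current batch."""
    best = None
    for k in (k_prev, k_prev - 1, k_prev + 1):  # Case 1, Case 2, Case 3
        if k < 1:
            continue
        gm = GaussianMixture(n_components=k, n_init=3).fit(x_t)
        total_len = gm.bic(x_t) + transition_code_length(k, k_prev, counts)
        if best is None or total_len < best[0]:
            best = (total_len, k, gm)
    _, k_t, model = best
    counts[(k_prev, k_t)] = counts.get((k_prev, k_t), 0) + 1
    return k_t, model
```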
Proposed Alg. – code-length for transition – • Model transition probability distribution: suppose K transits to its neighbors only, i.e. $K_t \in \{K_{t-1}-1,\; K_{t-1},\; K_{t-1}+1\}$ • Employ the Krichevsky–Trofimov (KT) estimate [Krichevsky and Trofimov, 1981]; the code-length of the model transition is
$$L(K_t \mid K^{t-1}) = -\log \hat{P}_{\mathrm{KT}}(K_t \mid K^{t-1}), \qquad \hat{P}_{\mathrm{KT}}(k) = \frac{n_k + 1/2}{\sum_j n_j + m/2},$$
where $n_k$ is the number of past transitions to $k$ and $m$ is the number of admissible transitions.
Criteria – NML code-length – • Model: Gaussian mixture model • NML (normalized maximum likelihood) code-length: the shortest code-length in the sense of the minimax criterion [Shtarkov, 1987]:
$$L_{\mathrm{NML}}(x^n) = -\log p\bigl(x^n ; \hat{\theta}(x^n)\bigr) + \log \mathcal{C}_n, \qquad \mathcal{C}_n = \int p\bigl(y^n ; \hat{\theta}(y^n)\bigr)\, dy^n \quad \text{(normalization term)}$$
For Continuous Data • Normalization term: in the continuous case the data range over the whole domain, which causes two problems: • NML for a Gaussian distribution: the normalization term diverges • NML for a mixture distribution: the normalization term is computationally intractable; this comes from combinatorial difficulties (the sum over all cluster assignments)
For Continuous Data (Example) • For the one-dimensional Gaussian distribution (σ² is given), the normalization term is
$$\mathcal{C}_n = \int p\bigl(y^n ; \hat{\mu}(y^n)\bigr)\, dy^n, \qquad \hat{\mu}(y^n) = \frac{1}{n}\sum_{i=1}^{n} y_i,$$
which diverges (see the derivation sketch below).
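A short derivation sketch (our reconstruction of the standard argument) of why this integral diverges:

```latex
% Change variables from y^n to (\bar{y}, residuals). Since \hat\mu(y^n)=\bar{y},
% the maximized likelihood depends on y^n only through the residuals
% y_i - \bar{y}, so the integral over the residuals is a finite constant
% c_n(\sigma), and
\mathcal{C}_n \;=\; \int p\bigl(y^n;\hat\mu(y^n)\bigr)\,dy^n
 \;=\; \int_{-\infty}^{\infty} c_n(\sigma)\, d\bar{y} \;=\; \infty .
% The divergence comes from \hat\mu ranging over all of \mathbb{R};
% restricting the range of the data is what removes it.
```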
Approximate computation (1/2) • Rewrite the normalization integral in terms of the sufficient statistics (the MLEs $\hat{\mu}$, $\hat{\Sigma}$), whose sampling distributions are available in closed form — g1: Gaussian distribution (for $\hat{\mu}$), g2: Wishart distribution (for $\hat{\Sigma}$)
Criteria – NML for GMM – Efficiently computing an approximate variant of the NML code-length for a GMM [Hirai and Yamanishi, 2011] • Restrict the range of data so that the MLE lies in a bounded range specified by hyper-parameters • The normalization term then does not diverge, but it still depends strongly on those hyper-parameters
NML • Under this restriction the normalization term can be calculated in closed form (n: number of data points, m: dimension of the data); see [Hirai and Yamanishi, 2011] for the explicit formula, which depends on the range hyper-parameters
Criteria – RNML code-length – Modify NML to develop re-normalized maximum likelihood (RNML) coding [Rissanen, Roos, and Myllymäki, 2010] [Hirai and Yamanishi, 2012] • Re-normalize around the MLE of the parameters by restricting the range of the data • The result is much less dependent on the hyper-parameters
RNML code-length • Theorem [Hirai and Yamanishi, 2012]: the RNML code-length for a GMM can be calculated in closed form (see the paper for the explicit formula) • Problem: straightforwardly computing the normalization term is prohibitively expensive
Criteria – efficient computing of RNML – • Straightforward computation of the RNML normalization term requires time exponential in n (the sum ranges over all K^n cluster assignments) ⇒ but we can compute it efficiently • Theorem [Kontkanen and Myllymäki, 07]: the multinomial normalization term satisfies the recurrence
$$\mathcal{C}(K, n) = \mathcal{C}(K-1, n) + \frac{n}{K-2}\,\mathcal{C}(K-2, n)$$
• Theorem [Hirai and Yamanishi, 2012]: the normalization term for the GMM satisfies an analogous recursive formula, which makes the normalization term for "mixture" models computable in time polynomial in n and K (see the sketch below for the multinomial building block)
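To make the recursion concrete, here is a small runnable sketch of the multinomial normalization term via the Kontkanen–Myllymäki recurrence. This is the multinomial building block only, not the full GMM normalizer, and the function name is ours.

```python
# Multinomial NML normalization term C(K, n) via the Kontkanen-Myllymaki
# recurrence, computed iteratively in O(n + K) arithmetic operations.
from math import comb, log2

def multinomial_nml_normalizer(K: int, n: int) -> float:
    """C(K, n) with base cases C(1, n) = 1 and
    C(2, n) = sum_k C(n, k) (k/n)^k ((n-k)/n)^(n-k), and recurrence
    C(K, n) = C(K-1, n) + n/(K-2) * C(K-2, n) for K >= 3."""
    if K == 1:
        return 1.0
    c_prev = 1.0  # C(1, n)
    c_curr = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C(2, n); 0.0 ** 0 == 1.0 in Python
    for j in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n / (j - 2) * c_prev
    return c_curr

# Parametric complexity (in bits) of a 3-cluster assignment of 100 points:
print(log2(multinomial_nml_normalizer(3, 100)))
```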
Experimental Results – data generation – • Generate artificial data sets according to a GMM with …
Experimental Results – comparison criteria – Employ three comparison metrics (see the sketch below): • AR (accuracy rate): average rate of correctly estimating the true number of clusters over all time steps • IR (identification rate): probability of correctly identifying the change points and the changes themselves • FAR (false alarm rate): ratio of the number of false alarms to all detected change points
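A hedged sketch of how we read these three metrics; the paper's exact definitions (e.g., the tolerance window for matching change points) may differ, and all names are ours.

```python
# Sketch of the three evaluation metrics as read from the slide.
import numpy as np

def accuracy_rate(k_true, k_est):
    """AR: fraction of time steps at which the estimated number of
    clusters equals the true one."""
    return float(np.mean(np.asarray(k_true) == np.asarray(k_est)))

def identification_rate(true_cps, detected_cps, tol=0):
    """IR: fraction of true change points matched by some detected
    change point within a tolerance window."""
    hits = sum(any(abs(t - d) <= tol for d in detected_cps) for t in true_cps)
    return hits / len(true_cps) if true_cps else 1.0

def false_alarm_rate(true_cps, detected_cps, tol=0):
    """FAR: fraction of detected change points matching no true one."""
    fa = sum(all(abs(d - t) > tol for t in true_cps) for d in detected_cps)
    return fa / len(detected_cps) if detected_cps else 0.0
```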
Experimental Results – artificial data – Our alg. with NML was able to detect the true change points and identify the true # of clusters with higher probability than AIC and BIC. AIC: Akaike's information criterion [Akaike, 1974]; BIC: Bayesian information criterion [Schwarz, 1978] (Plot: average number of clusters over time)
Comparison w.r.t. KL-divergence • Evaluated change detection accuracy by varying the Kullback–Leibler divergence (KLD) between the distributions before and after the change points • The larger the KLD between the GMMs before and after a change point, the more accurately the change was detected in terms of IR (identification rate)
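For reference (a standard formula, not stated on the slide): between two m-dimensional Gaussians the KLD is available in closed form, whereas between GMMs it has no closed form and is typically approximated, e.g., by Monte Carlo sampling:

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\mu_0,\Sigma_0)\,\middle\|\,\mathcal{N}(\mu_1,\Sigma_1)\right)
= \frac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_1^{-1}\Sigma_0\right)
+ (\mu_1-\mu_0)^{\top}\Sigma_1^{-1}(\mu_1-\mu_0)
- m + \ln\frac{\det\Sigma_1}{\det\Sigma_0}\right]
```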
Experimental Results – vs SW Alg. – The sequential DMS with RNML significantly outperformed the SW alg. • SW algorithm: tests the hypothesis of whether clusters are identical or not, then performs splitting, merging, etc. [Song and Wang, 2005] • Data: size/time = 512
Experimental Results – market analysis – Clustering customers to detect changes in their group structure • Data: 14 kinds of beer, 3185 users, 78 days (data set provided by MACROMILL, Inc.) • Our alg. detected clustering changes that corresponded to year-end demand
Many customers changed their purchase patterns toward Beer-A and Third-Beer at the year's end • The cluster change at the change point: 1/1–1/2
Conclusion • Proposed the sequential DMS algorithm to address the clustering change detection issue • Key ideas: sequential dynamic model selection based on the MDL principle; the use of the NML code-length as the criterion, and its efficient computation • The algorithm detects cluster changes significantly more accurately than AIC/BIC-based methods and an existing statistical-test-based method on artificial data • Tracking changes of group structures leads to understanding changes of market structures
Why NML? It gives the shortest code-length in the sense of Shtarkov's minimax criterion [Shtarkov, 1987]: for a given model class $\{p(\cdot\,;\theta)\}$, consider
$$\min_{Q}\, \max_{x^n}\, \log \frac{p\bigl(x^n ; \hat{\theta}(x^n)\bigr)}{Q(x^n)} \qquad (\hat{\theta}: \text{maximum likelihood estimator}).$$
The minimum is attained by Q = the NML distribution, $Q_{\mathrm{NML}}(x^n) = p\bigl(x^n;\hat{\theta}(x^n)\bigr)/\mathcal{C}_n$.
Restrict the range of data Restricting the range of the data changes Shtarkov's minimax criterion [Shtarkov, 1987] itself: the max is taken only over data sequences in the restricted range $\mathcal{Y}$:
$$\min_{Q}\, \max_{x^n \in \mathcal{Y}}\, \log \frac{p\bigl(x^n ; \hat{\theta}(x^n)\bigr)}{Q(x^n)}$$
Comparison with non-parametric Bayes • Sequential dynamic model selection works better than non-parametric Bayes (infinite HMM, etc.) [Sakurai and Yamanishi, "Comparison of Dynamic Model Selection with Infinite HMM for Statistical Model Change Detection", to appear in ITW 2012]