Model-based Clustering in R

Tianxi Dong

Theoretical model

Heuristic methods
• How many clusters do we need?
• How can we compare performance between methods?
• How should outliers be handled in heuristic methods?
Solution?

Model-based Method
• Assume that the data come from a mixture of different probability models;
• Assign each of the N items to the distribution it most likely belongs to;
• Evaluate clustering performance with a likelihood-based criterion such as BIC.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf

Model-based Clustering
• The density of a mixture of G distributions is defined as the weighted average of the component densities, with the mixing proportions as weights.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
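The formula on this slide did not survive in the transcript. A standard way to write the weighted average is shown here, under the assumption that the components are Gaussian (as in Mclust) and using the symbols of the next slide: p_k for the mixing proportions, m_k for the means and \Sigma_k for the covariance matrices.

```latex
f(x) = \sum_{k=1}^{G} p_k \, \phi(x \mid m_k, \Sigma_k),
\qquad p_k \ge 0, \qquad \sum_{k=1}^{G} p_k = 1
```

Here \phi denotes the multivariate normal density; a different component family would change only that term.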

Model-based Clustering
• Find the values of the parameters by maximizing the likelihood (usually the log of the likelihood) of the observations:
  max log f(x1, …, xN) over the means m1, …, mG, the covariance matrices Σ1, …, ΣG and the mixing proportions p1, …, pG
• N is the number of observations and G is the number of mixture components.
• This is a hard nonlinear optimization problem; in practice it is solved iteratively with the Expectation-Maximization (EM) algorithm.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
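The EM idea can be sketched in a few lines of base R. This is a minimal illustration for a two-component, one-dimensional Gaussian mixture on simulated data; the data, the starting values and the stopping tolerance are all assumptions made for the example, and this is not how Mclust is implemented internally.

```r
# EM for a two-component univariate Gaussian mixture (illustrative sketch).
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))  # simulated data

# Starting values (arbitrary choices for this example)
p <- c(0.5, 0.5)          # mixing proportions p1, p2
m <- c(min(x), max(x))    # means m1, m2
s <- c(sd(x), sd(x))      # standard deviations

# Log-likelihood of the mixture for the current parameters
loglik <- function(x, p, m, s) {
  sum(log(p[1] * dnorm(x, m[1], s[1]) + p[2] * dnorm(x, m[2], s[2])))
}

ll_trace <- loglik(x, p, m, s)
for (iter in 1:200) {
  # E-step: posterior probability that each point belongs to each component
  d <- cbind(p[1] * dnorm(x, m[1], s[1]), p[2] * dnorm(x, m[2], s[2]))
  z <- d / rowSums(d)

  # M-step: weighted updates of proportions, means and standard deviations
  nk <- colSums(z)
  p <- nk / length(x)
  m <- colSums(z * x) / nk
  s <- sqrt(colSums(z * outer(x, m, "-")^2) / nk)

  # Stop when the log-likelihood no longer improves
  ll_trace <- c(ll_trace, loglik(x, p, m, s))
  if (diff(tail(ll_trace, 2)) < 1e-8) break
}

# Assign each point to its most probable component
cluster <- apply(z, 1, which.max)
round(rbind(proportion = p, mean = m, sd = s), 2)
```

Each iteration cannot decrease the log-likelihood, which is why `ll_trace` is a convenient convergence check.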

Covariance Structure
• This covariance structure allows for a variety of constraints.
• Exercise: for model VVV with G = 5 clusters and P = 8 variables, how many parameters are there?
• The best covariance structure is decided based on BIC.
Source: http://ms.mcmaster.ca/canty/seminars/paulmcnicholas.pdf
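The parameter count asked for on this slide can be worked out with a small helper. The function name `n_params_vvv` is hypothetical; it assumes the fully unconstrained VVV model, where every cluster has its own mean vector and its own full covariance matrix.

```r
# Number of free parameters in a VVV Gaussian mixture
# (hypothetical helper written for this example).
n_params_vvv <- function(G, P) {
  mixing <- G - 1                  # proportions sum to 1, so one is redundant
  means  <- G * P                  # one P-dimensional mean per cluster
  covs   <- G * P * (P + 1) / 2    # one symmetric P x P matrix per cluster
  c(mixing = mixing, means = means, covariances = covs,
    total = mixing + means + covs)
}

n_params_vvv(G = 5, P = 8)
# mixing = 4, means = 40, covariances = 180, total = 224
```

If the mclust package is installed, `nMclustParams("VVV", d = 8, G = 5)` should report the same total. More constrained covariance structures need fewer parameters, which is what BIC rewards.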

Overfitting
• Occurs when a model is excessively complex (has too many parameters)
• Leads to poor predictive performance
• Figure: training error is shown in blue, validation error in red
Source: http://en.wikipedia.org/wiki/Overfitting

Recall: BIC
• BIC = 2 loglikM(x, θ) − (# params)M log(N) (higher is better)
• BIC = −2 loglikM(x, θ) + (# params)M log(N) (lower is better)
where
• loglikM(x, θ) is the maximized log-likelihood for the model M and the data;
• (# params)M is the number of independent parameters to be estimated in model M;
• N is the number of observations in the data.
The first form is the one used in Mclust.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
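The two sign conventions can be checked numerically. The sketch below fits a single normal distribution to simulated data (an assumption made purely for illustration) and computes both forms; base R's `BIC()` follows the second, lower-is-better form.

```r
# Both BIC conventions for a single normal distribution fitted by maximum likelihood.
set.seed(42)
x <- rnorm(200, mean = 3, sd = 2)   # simulated data
n <- length(x)

mu_hat    <- mean(x)
sigma_hat <- sqrt(mean((x - mu_hat)^2))   # ML estimate (divides by n, not n - 1)
loglik    <- sum(dnorm(x, mu_hat, sigma_hat, log = TRUE))
k         <- 2                            # parameters: mean and variance

bic_higher_better <-  2 * loglik - k * log(n)   # convention used in Mclust
bic_lower_better  <- -2 * loglik + k * log(n)   # convention used by stats::BIC()

# Cross-check against base R: an intercept-only linear model is the same fit
fit <- lm(x ~ 1)
c(mclust_style = bic_higher_better,
  base_r_style = bic_lower_better,
  stats_BIC    = BIC(fit))
```

The two values differ only in sign, so when comparing BIC values across packages it matters which convention each one reports.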

R Procedure – MCLUST

MCLUST Package
• MCLUST is probably the best-known model-based clustering technique in the literature.
• http://cran.r-project.org/web/packages/mclust/index.html

MCLUST Package Syntax
Mclust(data, G=NULL, modelNames=NULL, prior=NULL, warn=FALSE, ...)
http://cran.r-project.org/web/packages/mclust/mclust.pdf

Parameters to define Mclust
• G: an integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G=1:9.
• modelNames: a vector of character strings indicating the models to be fitted in the EM phase of clustering.
• prior: the default assumes no prior; this argument allows a conjugate prior on the means and variances to be specified.
http://cran.r-project.org/web/packages/mclust/mclust.pdf
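A minimal sketch of how these three arguments might be set is shown below. It uses R's built-in iris measurements rather than the presentation's dataset, and the particular choices of `G` and `modelNames` are assumptions made for the example.

```r
# Illustrative Mclust calls on the built-in iris data (not the salary dataset).
library(mclust)

X <- iris[, 1:4]   # four numeric measurements

# Restrict the search to 1-5 clusters and three covariance structures
fit <- Mclust(X, G = 1:5, modelNames = c("EII", "VVI", "VVV"))
summary(fit)

fit$G          # number of clusters chosen by BIC
fit$modelName  # covariance structure chosen by BIC

# Same search with the package's default conjugate prior on means and variances
fit_prior <- Mclust(X, G = 1:5, modelNames = c("EII", "VVI", "VVV"),
                    prior = priorControl())
fit_prior$modelName
```

Leaving `G` and `modelNames` at their defaults lets Mclust search all available models for 1 to 9 clusters, which is slower but avoids ruling out a structure in advance.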

Real Example

Dataset
• Same as Hal's dataset
• GP_PER: gross profit %
• ROA: return on assets
• ROE: return on equity
• SK_RET: % stock return
• B_TO_M: book to market
• NL_ASSETS: log of assets, for size control
• NL_SALARY: log of CEO salary
• NL_SALE: log of sales, for size control
Source: http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/

Model-based Clustering – Mclust
• This procedure cannot handle missing data natively.
• If there are missing values, drop the incomplete rows first:
• datam <- na.omit(data)   # data is the data frame that contains missing values
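As a small illustration of what `na.omit` does, the sketch below uses a made-up four-row data frame (the values are invented for the example and are not taken from the presentation's dataset).

```r
# Dropping incomplete rows before clustering (toy data, values are made up).
data_raw <- data.frame(
  ROA = c(0.05, NA,   0.12, 0.08),
  ROE = c(0.10, 0.15, NA,   0.20)
)

datam <- na.omit(data_raw)   # keeps only rows with no missing values
nrow(data_raw)               # 4 rows before
nrow(datam)                  # 2 rows after
```

Because whole rows are removed, it is worth checking how many observations remain before fitting the model.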

Model-based Clustering – Mclust

Posterior Probability

Classification using Model-based Clustering
• Discriminant analyses
• Test significance of a set of discriminant functions
• Categories are known before classification

Classification

BIC Plot

BIC Table

Means for each cluster

Conclusion

R Code
library(mclust)
data <- read.csv("compsetrex.csv")
# select the eight clustering variables
salary <- data[, c(1, 2, 3, 4, 5, 7, 8, 10)]
salaryMclust <- Mclust(salary)
mysummary <- summary(salaryMclust)
# classification of each observation
mysummary$classification
# BIC matrix and plot (compute the BIC object before summarizing it)
salaryMclustBIC <- mclustBIC(salary)
salaryMclustBIC
plot(salaryMclustBIC)
BICSummary <- summary(salaryMclustBIC, data = salary)
BICSummary
# posterior probabilities
salaryMclust$z
# cluster means
salaryMclust$parameters$mean
