Model-based Clustering in R Tianxi Dong
Heuristic methods
• How many clusters do we need?
• How do we compare performance between methods?
• How do we deal with outliers?
Solution?
Model-based Method
• Assume that the data come from a mixture of different probability models;
• Assign each of the N items to the distribution it most likely belongs to;
• Evaluate clustering performance with a statistical criterion such as BIC.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
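Model-based Clustering
• We define the density of a mixture of G distributions as the weighted average
  $f(x) = \sum_{k=1}^{G} p_k \, f_k(x \mid m_k, \Sigma_k)$, with $p_k \ge 0$ and $\sum_{k=1}^{G} p_k = 1$
• $p_k$ is the mixing weight of component k; $m_k$ and $\Sigma_k$ are its mean and covariance
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf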
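The weighted average can be made concrete with a small base-R sketch. It assumes univariate Gaussian components, and the weights, means and standard deviations are made-up illustration values; `mixture_density` is a hypothetical helper, not part of any package.

```r
# Density of a univariate Gaussian mixture: f(x) = sum_k p_k * N(x | m_k, s_k^2).
# All parameter values below are invented for illustration.
mixture_density <- function(x, p, m, s) {
  stopifnot(length(p) == length(m), length(m) == length(s),
            abs(sum(p) - 1) < 1e-8)
  # One column per component, each scaled by its mixing weight
  comp <- sapply(seq_along(p),
                 function(k) p[k] * dnorm(x, mean = m[k], sd = s[k]))
  rowSums(matrix(comp, nrow = length(x)))
}

p <- c(0.3, 0.7)   # mixing weights, must sum to 1
m <- c(-2, 1)      # component means
s <- c(0.5, 1.5)   # component standard deviations

x <- seq(-5, 6, by = 0.5)
f <- mixture_density(x, p, m, s)
```

Because the weights sum to 1 and each component is a density, the mixture itself integrates to 1.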
Model-based Clustering
• Find the parameter values by maximizing the likelihood (usually the log-likelihood) of the observations. Assuming independent observations:
  $\max \log f(x_1, \dots, x_N) = \max \sum_{i=1}^{N} \log f(x_i)$ over $m_1, \dots, m_G$, $\Sigma_1, \dots, \Sigma_G$ and $p_1, \dots, p_G$
• N is the number of observations
• This is a nonlinear optimization problem with no closed-form solution; it is usually solved with the Expectation-Maximization (EM) algorithm
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
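To show what the EM algorithm does with this likelihood, here is a minimal base-R sketch for a univariate Gaussian mixture. The simulated data, the starting values and the helper name `em_gmm_1d` are assumptions for illustration; this is not how Mclust is implemented internally.

```r
# EM for a univariate Gaussian mixture (illustrative sketch on simulated data).
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))

em_gmm_1d <- function(x, G = 2, max_iter = 200, tol = 1e-8) {
  n <- length(x)
  # Crude starting values: spread the means over the quantiles of x
  m <- as.numeric(quantile(x, probs = (seq_len(G) - 0.5) / G))
  s <- rep(sd(x), G)
  p <- rep(1 / G, G)
  loglik <- numeric(0)
  for (iter in seq_len(max_iter)) {
    # E-step: z[i, k] = posterior probability that x[i] came from component k
    dens <- sapply(seq_len(G), function(k) p[k] * dnorm(x, m[k], s[k]))
    loglik <- c(loglik, sum(log(rowSums(dens))))
    z <- dens / rowSums(dens)
    # M-step: weighted updates of weights, means and standard deviations
    nk <- colSums(z)
    p <- nk / n
    m <- colSums(z * x) / nk
    s <- sqrt(colSums(z * outer(x, m, "-")^2) / nk)
    if (iter > 1 && abs(loglik[iter] - loglik[iter - 1]) < tol) break
  }
  list(p = p, m = m, s = s, z = z, loglik = loglik,
       classification = max.col(z, ties.method = "first"))
}

fit <- em_gmm_1d(x)
fit$m        # estimated component means
fit$loglik   # log-likelihood per iteration; it should never decrease
```

Each item is then assigned to the component with the largest posterior probability, which is what `classification` holds.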
Covariance Structure
• The component covariance matrices can be constrained in a variety of ways.
• Example: model VVV with G = 5 components and P = 8 variables. Number of parameters = ?
• The best covariance structure is chosen by BIC.
Source: http://ms.mcmaster.ca/canty/seminars/paulmcnicholas.pdf
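The parameter count for this example can be worked out by hand. The sketch below assumes that VVV means a fully unconstrained covariance matrix for every component, and that G is the number of components and P the number of variables; `n_params_vvv` is a hypothetical helper.

```r
# Free parameters of a G-component, P-dimensional Gaussian mixture with
# unconstrained covariance matrices (the assumed meaning of VVV).
n_params_vvv <- function(G, P) {
  weights <- G - 1                # mixing proportions sum to 1
  means   <- G * P                # one P-vector per component
  covs    <- G * P * (P + 1) / 2  # one symmetric P x P matrix per component
  weights + means + covs
}

n_params_vvv(G = 5, P = 8)  # 4 + 40 + 180 = 224
```

The covariance term grows quadratically in P, which is why constrained structures are attractive when there are many variables. The mclust package also provides `nMclustParams()` for this kind of count.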
Overfitting
• Occurs when a model is excessively complex, e.g. has too many parameters
• An overfitted model has poor predictive performance
• [Figure: training error shown in blue, validation error in red]
Source: http://en.wikipedia.org/wiki/Overfitting
Recall: BIC
• $\mathrm{BIC} = 2 \, \mathrm{loglik}_M(x, \theta) - (\#\,\mathrm{params})_M \log(N)$ (higher is better)
• $\mathrm{BIC} = -2 \, \mathrm{loglik}_M(x, \theta) + (\#\,\mathrm{params})_M \log(N)$ (lower is better)
• $\mathrm{loglik}_M(x, \theta)$: the maximized log-likelihood for the model M and the data
• $(\#\,\mathrm{params})_M$: the number of independent parameters to be estimated in model M
• N: the number of observations in the data
• The first form is the one used in Mclust.
Source: http://cran.r-project.org/web/packages/pdfCluster/vignettes/pdfCluster_vignette.pdf
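The two sign conventions are easy to mix up, so a short base-R sketch may help. The helper names `bic_mclust` and `bic_classic` are hypothetical; the linear model on the built-in `cars` data is only there to supply a log-likelihood and a parameter count.

```r
# The two BIC conventions as plain functions
bic_mclust  <- function(loglik, n_params, n)  2 * loglik - n_params * log(n)  # higher is better
bic_classic <- function(loglik, n_params, n) -2 * loglik + n_params * log(n)  # lower is better

# Any fitted model with a log-likelihood will do; here a simple linear model
fit <- lm(dist ~ speed, data = cars)
ll  <- as.numeric(logLik(fit))   # maximized log-likelihood
k   <- attr(logLik(fit), "df")   # number of estimated parameters
n   <- nobs(fit)                 # number of observations

bic_classic(ll, k, n)   # matches stats::BIC(fit)
bic_mclust(ll, k, n)    # same magnitude, opposite sign
```

The two forms differ only in sign, so they always rank models identically.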
MCLUST Package
• MCLUST is probably the best-known model-based clustering software in the literature.
• http://cran.r-project.org/web/packages/mclust/index.html
MCLUST Package: Syntax
Mclust(data, G = NULL, modelNames = NULL, prior = NULL, warn = FALSE, ...)
http://cran.r-project.org/web/packages/mclust/mclust.pdf
Parameters to define Mclust
• G
  An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G = 1:9.
• modelNames
  A vector of character strings indicating the models to be fitted in the EM phase of clustering.
• prior
  The default is no prior. A conjugate prior on the means and variances can be specified instead.
http://cran.r-project.org/web/packages/mclust/mclust.pdf
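A call that sets G and modelNames explicitly might look like the sketch below. It uses the built-in `iris` measurements as stand-in data (an assumption; the deck's own dataset is not available here) and skips the fit if the mclust package is not installed.

```r
fit <- NULL
if (requireNamespace("mclust", quietly = TRUE)) {
  suppressPackageStartupMessages(library(mclust))
  X <- iris[, 1:4]   # stand-in data: four numeric columns
  # Try 1 to 5 clusters, restricted to two covariance structures
  fit <- Mclust(X, G = 1:5, modelNames = c("EEE", "VVV"))
  print(fit$G)          # number of clusters selected by BIC
  print(fit$modelName)  # covariance structure selected by BIC
  print(head(fit$z))    # posterior membership probabilities
}
```

Restricting G and modelNames shrinks the set of candidate models that BIC compares, which also shortens the run time.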
Dataset
• Same as Hal's dataset
• GP_PER: gross profit %
• ROA: return on assets
• ROE: return on equity
• SK_RET: % stock return
• B_TO_M: book to market
• NL_ASSETS: log of assets, for size control
• NL_SALARY: log of CEO salary
• NL_SALE: log of sales, for size control
Source: http://www.r-bloggers.com/r-tutorial-series-exploratory-factor-analysis/
Model-based Clustering: Mclust
• This procedure cannot handle missing data natively.
• If there are missing values, drop the incomplete rows first, where data is the data frame with missing values:
  datam <- na.omit(data)
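A small base-R sketch of the row-dropping step. The column names are borrowed from the Dataset slide and the values are made up.

```r
# Toy data frame with missing values (values invented for illustration)
df <- data.frame(ROA = c(0.10, NA, 0.07, 0.12),
                 ROE = c(0.20, 0.15, NA, 0.25))

complete <- na.omit(df)      # keep only rows without any NA
nrow(df) - nrow(complete)    # number of rows dropped
```

Note that `na.omit` removes a whole row if any single variable is missing, so it is worth checking how many observations remain before clustering.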
Classification using Model-based Clustering
• Discriminant analysis
• Tests the significance of a set of discriminant functions
• Categories are known before classification
R Code
library(mclust)
data <- read.csv("compsetrex.csv")
# keep the eight variables used for clustering
salary <- data[, c(1, 2, 3, 4, 5, 7, 8, 10)]
salaryMclust <- Mclust(salary)
mysummary <- summary(salaryMclust)
# classification: cluster label for each observation
mysummary$classification
# BIC matrix and plot (mclustBIC must run before its summary)
salaryMclustBIC <- mclustBIC(salary)
salaryMclustBIC
plot(salaryMclustBIC)
BICSummary <- summary(salaryMclustBIC, data = salary)
BICSummary
# posterior probabilities
salaryMclust$z
# mean matrix: one column of means per cluster
salaryMclust$parameters$mean